Corpex - Corpora Explorer
Corpex lets you swiftly browse through all the words of Wikipedia. It displays the context of characters and words within the text of different language versions of Wikipedia.
First, choose the desired language version of Wikipedia. Then type the first letters of a word, and Corpex shows the first results. At the moment, Corpex can handle only lowercase letters.
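The lookup by first letters can be illustrated with a minimal sketch. It assumes a sorted in-memory word list and a binary search for the prefix; the function name `complete`, the `limit` parameter, and the sample vocabulary are hypothetical and not taken from Corpex itself, which may well rank results differently (for example by frequency).

```python
from bisect import bisect_left

def complete(sorted_words, prefix, limit=5):
    """Return up to `limit` words from a sorted list that start with `prefix`."""
    prefix = prefix.lower()  # only lowercase input is handled, as on the page
    start = bisect_left(sorted_words, prefix)  # first candidate position
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
        if len(matches) == limit:
            break
    return matches

# Hypothetical vocabulary, for illustration only.
vocabulary = sorted(["corpora", "corpus", "core", "explorer", "wikipedia"])
suggestions = complete(vocabulary, "corp")
print(suggestions)
```

Because the list is sorted, all words sharing a prefix sit next to each other, so the scan can stop at the first word that no longer matches.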
Corpex is also available as a RESTful web service; simply call
When you start typing, the system shows you three statistics in six graphs. These are from left to right in the first row
In all graphs, three dots (...) mean "other word/letter", and the dollar sign ($) means "end of the word/end of sentence".
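How the two markers might arise can be shown with a small sketch of one plausible statistic: the distribution of the character that follows a typed prefix. This is an assumption about the kind of counting involved, not the actual Corpex computation; `next_char_distribution`, `top_k`, and the sample words are hypothetical.

```python
from collections import Counter

def next_char_distribution(words, prefix, top_k=3):
    """Count which character follows `prefix` in each word starting with it."""
    counts = Counter()
    for word in words:
        word = word.lower()
        if not word.startswith(prefix):
            continue
        # "$" marks that the word ends right after the prefix.
        nxt = word[len(prefix)] if len(word) > len(prefix) else "$"
        counts[nxt] += 1
    top = counts.most_common(top_k)
    rest = sum(counts.values()) - sum(n for _, n in top)
    result = dict(top)
    if rest:
        # "..." collects every continuation outside the top_k.
        result["..."] = rest
    return result

# Hypothetical sample, for illustration only.
sample = ["the", "the", "the", "then", "then", "there", "they"]
dist = next_char_distribution(sample, "the", top_k=2)
print(dist)
```

In this sample the prefix "the" is most often followed by the end of the word, then by "n", and the remaining continuations are folded into the "..." bucket.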
In its current version, Corpex is particularly interesting for researchers who analyze corpora or do similar work. In the future, the results could become more practically useful if changes over time are included in the analysis, for example to investigate whether a language version has recently leaned toward a certain political orientation. This can be done by tracking the frequency of characteristic words, phrases, or expressions over time.
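Tracking a word's frequency over time could look like the following sketch. It assumes text snapshots keyed by a period label and a simple lowercase tokenizer; the function name, the snapshot texts, and the chosen term are all hypothetical, and nothing here reflects an existing Corpex feature.

```python
import re
from collections import Counter

def relative_frequency_by_period(snapshots, term):
    """Map each period label to the share of tokens equal to `term`."""
    result = {}
    for period, text in snapshots.items():
        tokens = re.findall(r"[a-z]+", text.lower())  # crude lowercase tokenizer
        counts = Counter(tokens)
        result[period] = counts[term] / len(tokens) if tokens else 0.0
    return result

# Hypothetical snapshots, for illustration only.
snapshots = {
    "2009": "the reform was debated and the reform passed",
    "2010": "the debate moved on to other topics",
}
trend = relative_frequency_by_period(snapshots, "reform")
print(trend)
```

Relative rather than absolute frequency is used so that periods with different amounts of text remain comparable.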
Corpex is still very much under development. The extracted data is still very noisy, and we are working on better extraction and filtering approaches. The source code is open source, and all the data is freely available. Feedback, and especially suggestions for cooperation, is welcome.
Corpex as well as the extracted n-grams are currently available in the following languages:
RENDER is funded by
All data used on this page originate from the Wikipedia
project of the Wikimedia Foundation.
The source code of these tools is, unless stated otherwise, licensed under the GNU General Public License v3.