RENDER Toolkit for Knowledge Diversity in Wikipedia

 

Corpex - Corpora Explorer

Corpex lets you swiftly browse through all the words of Wikipedia. It displays the context of characters and words within the text of different language versions of Wikipedia.

Settings

First, choose the desired language version of Wikipedia. Then type the first letters of a word, and Corpex shows the first results. At the moment, Corpex can handle only lowercase letters.

Corpex is also available as a RESTful web service: simply call http://km.aifb.kit.edu/sites/corpex/corpex.php?lang=XX&q=Y with XX being the Wikipedia language code (see below) and Y being the starting letter sequence. You will get back a JSON result with the same data that you see on the page.
The bigram statistics are available through the same URL, http://km.aifb.kit.edu/sites/corpex/corpex.php?lang=XX&q=Y, with Y being two "+"-separated words representing the bigram in question, e.g. "star+wars".
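
As a minimal sketch of calling the web service from Python, the snippet below builds the request URL described above and, optionally, fetches the JSON result. The function names build_corpex_url and fetch_corpex are hypothetical helpers introduced only for illustration; the structure of the returned JSON is not documented here, so the sketch does not assume any particular fields, and the service may not be reachable at all times.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the text above.
CORPEX_URL = "http://km.aifb.kit.edu/sites/corpex/corpex.php"


def build_corpex_url(lang, query):
    """Build a request URL for a language code and a letter sequence or bigram.

    Spaces in `query` are encoded as "+", so "star wars" becomes "star+wars".
    """
    params = urllib.parse.urlencode({"lang": lang, "q": query})
    return f"{CORPEX_URL}?{params}"


def fetch_corpex(lang, query, timeout=10):
    """Fetch and decode the JSON result (hypothetical helper, needs network access)."""
    with urllib.request.urlopen(build_corpex_url(lang, query), timeout=timeout) as resp:
        return json.load(resp)


# Only build the URLs here; call fetch_corpex() to actually contact the service.
print(build_corpex_url("en", "sta"))
print(build_corpex_url("en", "star wars"))
```

Building the URL separately from fetching makes it easy to inspect the request before sending it.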

Results

When you start typing, the system shows you three statistics in six graphs. The first two statistics appear from left to right in the first row, the third in the second row:

  1. the ten most frequent words that start with the typed sequence of letters (as a bar chart and a pie chart),
  2. the most frequent letters following the already typed sequence of letters (again, as a bar chart and a pie chart), and
  3. the most frequent second words of all two-word terms that start with the typed sequence as the first word (both charts in the second row).

In all charts, three dots (...) mean "other word/letter", and the dollar sign ($) means "end of the word/end of sentence".
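
To make the three statistics concrete, here is a rough sketch that computes them from toy n-gram counts. It is not the Corpex implementation: the function names, the toy data, and the reading that the third statistic matches the typed sequence against the whole first word are all assumptions made for illustration. Entries outside the top k are folded into "..." and a word ending right after the prefix is counted as "$", mirroring the chart conventions above.

```python
from collections import Counter

END = "$"      # end of word / end of sentence, as in the charts
OTHER = "..."  # bucket for everything outside the top entries


def top_with_other(counts, k):
    """Keep the k most frequent entries and fold the rest into OTHER."""
    ranked = Counter(counts).most_common()
    top = ranked[:k]
    rest = sum(n for _, n in ranked[k:])
    if rest:
        top.append((OTHER, rest))
    return top


def word_stats(unigrams, prefix, k=10):
    """Statistic 1: most frequent words starting with the prefix."""
    matches = {w: n for w, n in unigrams.items() if w.startswith(prefix)}
    return top_with_other(matches, k)


def next_letter_stats(unigrams, prefix, k=10):
    """Statistic 2: most frequent letters following the prefix."""
    counts = Counter()
    for word, n in unigrams.items():
        if word.startswith(prefix):
            nxt = word[len(prefix)] if len(word) > len(prefix) else END
            counts[nxt] += n
    return top_with_other(counts, k)


def second_word_stats(bigrams, first, k=10):
    """Statistic 3: most frequent second words after the given first word."""
    counts = Counter()
    for (w1, w2), n in bigrams.items():
        if w1 == first:
            counts[w2] += n
    return top_with_other(counts, k)


# Invented toy counts, not real Wikipedia data.
unigrams = {"star": 50, "start": 30, "stars": 20, "stone": 10, "wars": 40}
bigrams = {("star", "wars"): 25, ("star", "trek"): 15,
           ("star", END): 5, ("stone", "age"): 7}

print(word_stats(unigrams, "sta", k=2))
print(next_letter_stats(unigrams, "star"))
print(second_word_stats(bigrams, "star"))
```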

Uses

In its current version, Corpex is particularly interesting for researchers who analyze corpora or do similar work. In the future, the results will become more practically useful once the temporal dimension can be included in the observations, for example to investigate whether a language version has recently leaned toward a certain political orientation. That can be done by tracking the frequency of characteristic words, phrases, or expressions over time.
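
A temporal analysis of this kind could, for instance, compare the relative frequency of a term across several snapshots of one language version. The sketch below is purely illustrative: Corpex does not currently provide per-snapshot counts, so the snapshot labels, the numbers, and the function name relative_frequency_by_snapshot are all hypothetical.

```python
def relative_frequency_by_snapshot(term_counts, total_counts):
    """Return {snapshot: term count / total tokens}, ordered by snapshot label."""
    series = {}
    for snapshot in sorted(term_counts):
        total = total_counts[snapshot]
        # Guard against empty snapshots instead of dividing by zero.
        series[snapshot] = term_counts[snapshot] / total if total else 0.0
    return series


# Invented toy numbers: occurrences of one term and total tokens per snapshot.
term_counts = {"2009": 5, "2010": 12, "2011": 30}
total_counts = {"2009": 1000, "2010": 2000, "2011": 3000}

series = relative_frequency_by_snapshot(term_counts, total_counts)
for snapshot, share in series.items():
    print(snapshot, f"{share:.4f}")
```

Normalizing by the total token count matters because the language versions grow over time, so raw counts alone would rise even if usage stayed constant.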

Corpex is still very much under development. The extracted data is still very noisy, and we are working on better extraction and filtering approaches. The source code is fully open source, and all the data is freely available. Feedback, and especially suggestions for cooperation, is welcome.

Supported languages

Corpex as well as the extracted n-grams are currently available in the following languages:

Code    Language        1-grams                 2-grams
bg      Bulgarian       bg_1-grams.csv.zip      bg_2-grams.csv.zip
bs      Bosnian         bs_1-grams.csv.zip      bs_2-grams.csv.zip
cs      Czech           cs_1-grams.csv.zip      cs_2-grams.csv.zip
de      German          de_1-grams.csv.zip      de_2-grams.csv.zip
en      English         en_1-grams.csv.zip      en_2-grams.csv.zip
es      Spanish         es_1-grams.csv.zip      es_2-grams.csv.zip
fr      French          fr_1-grams.csv.zip      fr_2-grams.csv.zip
hr      Croatian        hr_1-grams.csv.zip      hr_2-grams.csv.zip
hu      Hungarian       hu_1-grams.csv.zip      hu_2-grams.csv.zip
it      Italian         it_1-grams.csv.zip      it_2-grams.csv.zip
ro      Romanian        ro_1-grams.csv.zip      ro_2-grams.csv.zip
sh      Serbo-Croatian  sh_1-grams.csv.zip      sh_2-grams.csv.zip
sq      Albanian        sq_1-grams.csv.zip      sq_2-grams.csv.zip
sr      Serbian         sr_1-grams.csv.zip      sr_2-grams.csv.zip
sv      Swedish         sv_1-grams.csv.zip      sv_2-grams.csv.zip
simple  Simple English  simple_1-grams.csv.zip  simple_2-grams.csv.zip
brown   Brown Corpus
More languages are being prepared.
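
The downloadable n-gram archives can be read with a few lines of Python. The column layout of the CSV files is not described on this page, so the sketch below rests on a loudly stated assumption: each row holds the token or tokens followed by an integer count in the last column. The function name load_ngrams is hypothetical, and the parsing should be adjusted after inspecting a real file.

```python
import csv
import io
import zipfile


def load_ngrams(zip_source, encoding="utf-8"):
    """Read n-gram counts from the first member of a zip archive.

    ASSUMPTION (not confirmed by the page): every data row is
    token[, token ...], count. Rows that do not fit are skipped.
    `zip_source` may be a file path or a binary file-like object.
    """
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            for row in csv.reader(text):
                if len(row) < 2 or not row[-1].strip().isdecimal():
                    continue  # header line or malformed row
                counts[tuple(row[:-1])] = int(row[-1])
    return counts


# Usage with a downloaded file, e.g.: counts = load_ngrams("en_2-grams.csv.zip")
```

Keys are tuples of tokens, so the same loader covers both the 1-gram and the 2-gram files under the stated assumption.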

 

All data used on this page originate from the Wikipedia project of the Wikimedia Foundation.
Unless otherwise stated, all data on this page is licensed under the Creative Commons Attribution-Share-Alike License 3.0.

The source code of these tools is, unless stated otherwise, licensed under the GNU General Public License v3.

 