Text corpora
The Text Corpus of the Institute of the Estonian Language
The material for the text corpus (10.4 million word forms; about 80% of the texts come from newspapers) was collected unsystematically, which is why the corpus is not representative. The corpus is also untagged, so it is suited mainly to lexical searches.
The Historical Concordance of Estonian Bible Translations
The aim of the database is to provide a survey of the evolution of religious Estonian in the 17th and early 18th centuries. The database contains translated texts and their glossaries. The completed parts can be searched a) by author or text, b) by passage, or c) by modernised keyword.
The Valency Corpus consists of orthographic passages (paragraphs) from the daily newspaper Postimees, whose emotional tone (positive, negative, ambiguous, neutral) has been identified by readers. The tone was identified using the method of dominant opinion (Pennebaker et al. 1997). The corpus is intended mainly for training statistical models, but it can also be used for other purposes. Queries can be made by rubric (“Opinion”, “Estonia”, “Culture”, “Sports”, “Abroad”, “Criminal”) as well as by emotional tone (positive, negative, ambiguous, neutral).
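The two query dimensions described above (rubric and emotional tone) can be illustrated with a minimal Python sketch. It is not the corpus's actual interface: the record layout (`Passage` with `rubric`, `tone` and `text` fields), the `query` function and the placeholder sample data are all assumptions made for illustration. Only the rubric and tone names come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Rubric and tone names as listed in the corpus description.
RUBRICS = {"Opinion", "Estonia", "Culture", "Sports", "Abroad", "Criminal"}
TONES = {"positive", "negative", "ambiguous", "neutral"}


@dataclass(frozen=True)
class Passage:
    """Hypothetical record layout for one labelled passage."""
    rubric: str
    tone: str
    text: str


def query(passages: Iterable[Passage],
          rubric: Optional[str] = None,
          tone: Optional[str] = None) -> List[Passage]:
    """Return passages matching the given rubric and/or tone.

    A filter left as None is not applied.
    """
    if rubric is not None and rubric not in RUBRICS:
        raise ValueError(f"unknown rubric: {rubric!r}")
    if tone is not None and tone not in TONES:
        raise ValueError(f"unknown tone: {tone!r}")
    return [
        p for p in passages
        if (rubric is None or p.rubric == rubric)
        and (tone is None or p.tone == tone)
    ]


# Placeholder data, not taken from the corpus.
SAMPLE = [
    Passage("Sports", "positive", "placeholder text A"),
    Passage("Sports", "negative", "placeholder text B"),
    Passage("Culture", "neutral", "placeholder text C"),
]

if __name__ == "__main__":
    for passage in query(SAMPLE, rubric="Sports", tone="positive"):
        print(passage.text)
```

A subset selected this way, for example all negative passages in one rubric, could then serve as training material for a statistical model.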
The Archive of Estonian Dialects and Finno-Ugric Languages (EMSUKA) of the Institute of the Estonian Language
This is the world's largest collection of Estonian dialect usage. It contains sound recordings as well as written records of Estonian dialects, Finno-Ugric languages and expatriate Estonian.
The Corpus of Speech Synthesis of the Institute of the Estonian Language
The corpus contains sound recordings of read texts used for the creation of voice models for Estonian text-to-speech synthesis.
The Conceptual File of Estonian Lexis of the Institute of the Estonian Language
The idea comes from Andrus Saareste. Collection started in the 1920s and continued until the mid-1930s. Unlike dialect collections, which are typically divided geographically, this file divides the vocabulary by conceptual affinity. Material has been collected from the following domains: Marriage, Time, Gardening, Love life, Buildings, Haymaking, Weather, Humans, Fishing, Cattle breeding, Body, Clothing, Handicrafts, Traffic, Flax works, Animals, Mineral resources, Landscapes, Seabirds, Maritime affairs, Apiculture, Forestry, Measurement, Playing, Sign, Magic, Dishes, Woodwork, Agriculture, Family, Vehicles, Firmament, Volition, Plants, Health, Nourishment, Fire, Emotional life, Cognition, Work, Religion, Waterbodies, Water vehicles, Vodka, Wool, Colour, and Justice and Society.
Corpus of Estonian Coursebook Content 2017 (2017)
Estonian Coursebook Corpus 2017 (2017)
- Corpus of Written Estonian (1890-1990) (10 sub-corpora)
http://www.cl.ut.ee/korpused/baaskorpus
- Estonian Reference Corpus
http://www.keeletehnoloogia.ee/projektid/koondkorpus
- The Balanced Corpus of Estonian
http://www.cl.ut.ee/korpused/grammatikakorpus
- Mixed Corpus: New media
http://www.cl.ut.ee/korpused/segakorpus/uusmeedia
- Morphologically disambiguated corpus
http://www.cl.ut.ee/korpused/morfkorpus
- Corpus with Disambiguated Word Senses
http://www.cl.ut.ee/korpused/semkorpus
- Estonian Learner Corpus - Parallel Corpus
http://www.keeletehnoloogia.ee/projektid/veebipohine-keeleope/vead.zip
- Estonian Learner Corpus
http://www.murre.ut.ee/flee-korpused/#10-6ppija
- Corpus of Native Estonian Learners
http://www.murre.ut.ee/flee-korpused/#11-kooli
- Estonian Dialogue Corpus EDiC
http://math.ut.ee/~koit/Dialoog/EDiC.html
- Corpus of Old Written Estonian (VAKK)
http://www.murre.ut.ee/vakkur/Korpused/korpused.htm
- Shallow syntactically disambiguated corpus
http://math.ut.ee/~kaili/Korpus/pindmine
- Estonian-English parallel corpus
http://www.cl.ut.ee/korpused/paralleel
- Estonian Treebank
http://www.ut.ee/~kaili/Korpus/puud
- Estonian Interlanguage Corpus
http://evkk.tlu.ee/