Text corpora

The Text Corpus of the Institute of the Estonian Language

The material for the text corpus (10.4 million word forms; ca. 80% of the texts come from newspapers) was collected unsystematically, which is why the corpus is not representative. The corpus is also untagged, so it is suited mainly to lexical search.

The Historical Concordance of Estonian Bible Translations

The aim of the database is to provide a survey of the evolution of religious Estonian in the 17th and early 18th centuries. The database contains translated texts and their glossaries, and (for the completed parts) can be searched a) by author or text, b) by passage, or c) by modernised keyword.

Valency corpus

The Valency Corpus consists of orthographic passages from the daily newspaper Postimees whose emotional tone (positive, negative, ambiguous, neutral) has been identified by readers. The tone was assigned using the method of dominant opinion (Pennebaker et al. 1997). The corpus is intended mainly for training statistical models, but it can also be used for other purposes. Queries can be made by rubric (“Opinion”, “Estonia”, “Culture”, “Sports”, “Abroad”, “Criminal”) as well as by emotional tone.
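As a rough illustration of the two query dimensions described above, the sketch below filters passages by rubric and by emotional tone. It assumes a hypothetical in-memory record layout (`Passage` with `rubric`, `tone` and `text` fields) and placeholder texts; the corpus's actual file format and query interface are not described here and may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Rubrics and tones as listed in the corpus description.
RUBRICS = {"Opinion", "Estonia", "Culture", "Sports", "Abroad", "Criminal"}
TONES = {"positive", "negative", "ambiguous", "neutral"}


@dataclass(frozen=True)
class Passage:
    """Hypothetical record layout; not the corpus's real schema."""
    rubric: str
    tone: str
    text: str


def query(passages: Iterable[Passage],
          rubric: Optional[str] = None,
          tone: Optional[str] = None) -> List[Passage]:
    """Return passages matching the given rubric and/or tone.

    A filter left as None is not applied.
    """
    if rubric is not None and rubric not in RUBRICS:
        raise ValueError(f"unknown rubric: {rubric!r}")
    if tone is not None and tone not in TONES:
        raise ValueError(f"unknown tone: {tone!r}")
    return [
        p for p in passages
        if (rubric is None or p.rubric == rubric)
        and (tone is None or p.tone == tone)
    ]


# Placeholder data for illustration only; not taken from the corpus.
SAMPLE = [
    Passage("Sports", "positive", "placeholder passage 1"),
    Passage("Culture", "negative", "placeholder passage 2"),
    Passage("Sports", "neutral", "placeholder passage 3"),
]

sports = query(SAMPLE, rubric="Sports")
neutral_sports = query(SAMPLE, rubric="Sports", tone="neutral")
```

Combining both filters narrows the result to passages that satisfy both conditions, which is the kind of subset one might select before training a tone classifier on a single rubric.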

The Archive of Estonian Dialects and Finno-Ugric Languages (EMSUKA) of the Institute of the Estonian Language

This is the world's biggest collection of Estonian dialect usage. It contains sound recordings as well as written records of Estonian dialects, Finno-Ugric languages and expatriate Estonian.

The Corpus of Speech Synthesis of the Institute of the Estonian Language

The corpus contains sound recordings of read texts used for the creation of voice models for Estonian text-to-speech synthesis.

The Conceptual File of Estonian Lexis of the Institute of the Estonian Language

The idea comes from Andrus Saareste. Collection started in the 1920s and went on until the mid-1930s. Unlike the geographic division typical of dialect collections, here the vocabulary is divided by conceptual affinity. Material has been collected from the following domains: Marriage, Time, Gardening, Love life, Buildings, Haymaking, Weather, Humans, Fishing, Cattle breeding, Body, Clothing, Handicrafts, Traffic, Flax works, Animals, Mineral resources, Landscapes, Seabirds, Maritime affairs, Apiculture, Forestry, Measurement, Playing, Sign, Magic, Dishes, Woodwork, Agriculture, Family, Vehicles, Firmament, Volition, Plants, Health, Nourishment, Fire, Emotional life, Cognition, Work, Religion, Waterbodies, Water vehicles, Vodka, Wool, Colour, and Justice and Society.

Corpus of Estonian Coursebook Content 2017 (2017)

Estonian Coursebook Corpus 2017 (2017)

Corpus of Written Estonian (1890-1990) (10 sub-corpora)
http://www.cl.ut.ee/korpused/baaskorpus

Estonian Reference Corpus
http://www.keeletehnoloogia.ee/projektid/koondkorpus

The Balanced Corpus of Estonian
http://www.cl.ut.ee/korpused/grammatikakorpus

Mixed Corpus: New media
http://www.cl.ut.ee/korpused/segakorpus/uusmeedia

Morphologically disambiguated corpus
http://www.cl.ut.ee/korpused/morfkorpus

Corpus with Disambiguated Word Senses
http://www.cl.ut.ee/korpused/semkorpus

Estonian Learner Corpus - Parallel Corpus
http://www.keeletehnoloogia.ee/projektid/veebipohine-keeleope/vead.zip

Estonian Learner Corpus
http://www.murre.ut.ee/flee-korpused/#10-6ppija

Corpus of Native Estonian Learners
http://www.murre.ut.ee/flee-korpused/#11-kooli

Estonian Dialogue Corpus EDiC
http://math.ut.ee/~koit/Dialoog/EDiC.html

Corpus of Old Written Estonian (VAKK)
http://www.murre.ut.ee/vakkur/Korpused/korpused.htm

Shallow syntactically disambiguated corpus
http://math.ut.ee/~kaili/Korpus/pindmine

Estonian-English parallel corpus
http://www.cl.ut.ee/korpused/paralleel

Estonian-Latvian Parallel Corpus of building product texts

Estonian Treebank
http://www.ut.ee/~kaili/Korpus/puud

Estonian Interlanguage Corpus
http://evkk.tlu.ee/
