Text corpora

The Text Corpus of the Institute of the Estonian Language (resource in META-SHARE)

The material for the text corpus was collected unsystematically (10.4 million word forms; ca 80% of the texts come from newspapers), which is why the corpus is not representative. The corpus is also untagged, so it is suited mainly for lexical search.

The Historical Concordance of Estonian Bible Translations (resource in META-SHARE)

The aim of the database is to provide a survey of the evolution of religious Estonian in the 17th and early 18th centuries. The database contains translated texts and their glossaries, and enables searches (on the completed parts) a) by author or text, b) by passage, c) by modernised keyword.

Valency Corpus (resource in META-SHARE)

The Valency Corpus consists of orthographic passages from the daily newspaper Postimees whose emotional tone (positive, negative, ambiguous, neutral) has been identified by readers. The identification was done using the method of dominant opinion (Pennebaker et al. 1997). The corpus is mainly intended for training statistical models, but it can also be used for other purposes. Queries can be made by rubric (“Opinion”, “Estonia”, “Culture”, “Sports”, “Abroad”, “Criminal”) as well as by emotional tone.
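As an illustration of how such labels and queries might be handled, here is a minimal Python sketch. It assumes that "dominant opinion" means the tone chosen by the largest number of readers; the record layout (`rubric`, `labels`) and the function names `dominant_tone` and `query` are hypothetical and not part of the corpus interface. How the corpus itself treats ties is not described here, so the sketch simply returns `None` in that case.

```python
from collections import Counter

# Tone categories named in the corpus description.
TONES = ("positive", "negative", "ambiguous", "neutral")

def dominant_tone(reader_labels):
    """Return the tone chosen by most readers, or None on a tie or empty input.

    Assumption: 'dominant opinion' is treated here as a plain plurality vote.
    """
    counts = Counter(label for label in reader_labels if label in TONES)
    if not counts:
        return None
    ranked = counts.most_common(2)
    # A tie for first place means there is no single dominant label.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def query(passages, rubric=None, tone=None):
    """Filter hypothetical passage records by rubric and/or dominant tone."""
    return [
        p for p in passages
        if (rubric is None or p["rubric"] == rubric)
        and (tone is None or dominant_tone(p["labels"]) == tone)
    ]
```

With records shaped like `{"rubric": "Sports", "labels": ["positive", "positive", "neutral"]}`, a call such as `query(passages, rubric="Sports", tone="positive")` would return the matching passages.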

The Archive of Estonian Dialects and Finno-Ugric Languages (EMSUKA) of the Institute of the Estonian Language (resource in META-SHARE)

This is the world's largest collection of Estonian dialect usage. It contains sound recordings as well as written records of Estonian dialects, Finno-Ugric languages and expatriate Estonian.

The Corpus of Speech Synthesis of the Institute of the Estonian Language (resource in META-SHARE)

The corpus contains sound recordings of read texts used for the creation of voice models for Estonian text-to-speech synthesis.

The Conceptual File of Estonian Lexis of the Institute of the Estonian Language (resource in META-SHARE)


The idea comes from Andrus Saareste. Collection started in the 1920s and continued until the mid-1930s. Unlike the geographic division typical of dialect collections, here the vocabulary is divided by conceptual affinity. Material has been collected from the following domains: Marriage, Time, Gardening, Love life, Buildings, Haymaking, Weather, Humans, Fishing, Cattle breeding, Body, Clothing, Handicrafts, Traffic, Flax works, Animals, Mineral resources, Landscapes, Seabirds, Maritime affairs, Apiculture, Forestry, Measurement, Playing, Sign, Magic, Dishes, Woodwork, Agriculture, Family, Vehicles, Firmament, Volition, Plants, Health, Nourishment, Fire, Emotional life, Cognition, Work, Religion, Waterbodies, Water vehicles, Vodka, Wool, Colour, and Justice and Society.

Corpus of Estonian Coursebook Content 2017 (2017) (resource in META-SHARE)

Estonian Coursebook Corpus 2017 (2017) (resource in META-SHARE)


  • Estonian-Latvian Parallel Corpus of building product texts (resource in META-SHARE)