Find Estonian resources in the META-SHARE repository

A language resource is linguistic data in computer-readable format that is used for linguistic research or for developing language technology. The data need not have been originally intended for linguistic research - anything that includes human language in text, speech, or even gestures can become a language resource.
This term also covers software for managing or processing other language resources.

For a better overview, the resources available through CELR have been organized into the following five groups. Please note that some of these resources may not yet have a full English manual page.

  • Text corpora - a corpus is structured text that is suitable for automatic analysis. CELR has general corpora (e.g. Estonian Reference Corpus) and domain-specific corpora (e.g. texts from chat rooms), as well as monolingual corpora and parallel corpora, which include translations of the same text in multiple languages. In addition to the source text, a corpus can have additional layers of annotation (morphology, syntax, word senses, named entities etc.).
  • Speech databases - speech corpora with recordings of speech and their transcriptions; databases necessary for speech synthesis; voices for speech synthesis etc.
  • Lexical resources - dictionaries, terminological resources, semantic resources, wordnets, frequency lists etc.
  • Text processing tools - software for managing and processing texts: speller, morphological analyzer, noun phrase annotator, environment for managing dictionaries etc.
  • Speech processing tools - software for managing and processing speech data: text-to-speech synthesis, speech recognition software and applications etc.