Corpus of Spoken Estonian of the University of Tartu
The research group of spoken Estonian was founded in 1997. The Laboratory of Spoken and Computer Mediated Communication of the University of Tartu was established in 2011. People: Tiit Hennoste, Andriela Rääbis, Olga Gerassimenko, Kirsi Laanesoo, Andra Rumm; earlier also Airi Jansons, Riina Kasterpalu (Vellerind), Liina Lindström, Krista Mihkels (Strandson), Piret Toomet.
The corpus of spoken Estonian has been collected since 1997.
The corpus is transcribed using the transcription conventions of conversation analysis (CA). Each tape is provided with a header that lists a total of 23 situational factors that have been found to affect language use in the analysis of various languages. For each tape, as many of these factors as possible are recorded.
The corpus is planned as an open corpus, i.e. no limits have been set. Our intention is to collect various types of spoken language: both everyday and institutional conversation, monologues and dialogues, face-to-face and telephone interaction, and media texts. As of January 2019, the corpus consists of 3761 audio and 166 video recordings (703 hours, 3927 conversations altogether) and 2337 transcribed texts (2 206 810 words according to Microsoft Word statistics).
The recordings are divided as follows:
1345 face-to-face conversations
1924 phone conversations
456 radio and TV broadcasts
7 Skype conversations
195 non-defined.
On the institutionality scale, the conversations are divided as follows:
824 everyday conversations
2796 institutional conversations
84 other conversations
221 non-defined.
The corpus is a data bank in Word format and plain-text (txt) format (ISO-8859-1). To access the corpus, a contract with the research group of Spoken Estonian is required (contact Andriela Rääbis).
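Since the plain-text files are stated to be ISO-8859-1 encoded, they need to be decoded with that encoding rather than UTF-8. The following is a minimal sketch in Python of how such a file might be read; the function names and the file name are hypothetical, and the whitespace-based word count is only a rough approximation that will not necessarily match the Microsoft Word statistics quoted above.

```python
from pathlib import Path


def read_transcript(path):
    """Read a plain-text transcript, decoding it as ISO-8859-1.

    The encoding comes from the corpus description; the file layout
    itself is not assumed here.
    """
    return Path(path).read_text(encoding="iso-8859-1")


def count_words(text):
    """Count whitespace-separated tokens (a rough approximation only)."""
    return len(text.split())


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    sample = Path("sample_transcript.txt")
    sample.write_bytes("tere päevast".encode("iso-8859-1"))
    text = read_transcript(sample)
    print(text, count_words(text))
    sample.unlink()
```

Decoding explicitly avoids the garbled characters (for example in ä, ö, ü, õ) that appear when an ISO-8859-1 file is opened as UTF-8.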
We use conversation analysis and interactional linguistics as primary research methods.
Information Retrieval from the Corpus of Spoken Estonian (demo)