Building a Spoken Corpus of Slovene

 

Final Report on a Marie Curie Training Site host PhD fellowship at the Bergen Advanced Training Site in Multilingual Tools (BATMULT), University of Bergen, September 15 – December 15, 2004

 

Jana Zemljarič Miklavčič, University of Ljubljana, Slovenia

 

 

January 6, 2005

 

Project description

 

My PhD research aims to lay the theoretical foundation for building a spoken corpus of Slovene, which is planned to complement the 100-million-word FIDA corpus (http://www.fida.net/slo/index.html) as its spoken component. The research poses two main scientific challenges: the first is to develop a set of criteria for collecting and selecting the spoken material to be included in a representative and balanced spoken corpus, and the second is to outline recommendations for the transcription and annotation of spoken texts. As for the collection criteria, a preliminary proposal for the selection of materials, based on a combination of the demographic and contextual methods, had already been worked out, and the actual collection of spoken texts had also begun before I came to Batmult.

 

The main aim of my three-month stay at Batmult as a Marie Curie host PhD student was to compile a pilot spoken corpus of Slovene based on digital recordings, available in searchable form, with transcriptions linked to sound files. The purpose of the pilot corpus was to redefine the criteria for the collection, selection and documentation of spoken materials, to develop and test transcription and mark-up conventions, and finally to show some possibilities for using the corpus in language description and language analysis. The Batmult training site at the Department of Culture, Language and Information Technology (AKSIS) at the University of Bergen offered me a unique opportunity to study and access different spoken language corpora on the Corpus Work Bench and to compare the work done on spoken language in Norway and elsewhere. I was also trained in using different transcription tools (Praat and Transcriber) and tools for synchronizing transcriptions with sound clips.

 

 

Concrete Achievements and Results

 

The design of a pilot corpus

 

Slovene is spoken by 2 million speakers in the Republic of Slovenia and in the Slovene minorities in Italy, Austria and Hungary, and partly also in diasporas around the world. A corpus of adult native speakers of Slovene should gain representativeness through a combination of demographic and contextual sampling. For demographic sampling I proposed four categories: sex, age, region and education. For contextual sampling I proposed the structure of the text (monologue, dialogue), the setting (public, private), the relation between speakers and hearers (formal, informal), the medium (face to face, telephone, radio, TV) and the genre of the text. All these criteria were taken into consideration while building the pilot corpus.

 

The pilot corpus consists of 7 digital recordings with a total length of 89 minutes. All texts were recorded in 2004. The specifications of the recordings are shown in the following table:

 

 

ID     Duration (min)   No. of speakers   Place of recording   Surreptitious   Genre
R01          2.17       2                 University           No              interview
R02         54.50       6                 Studio               No              round table
R03          3.58       2                 Home                 No              interview
R04          7.31       5                 Office               No              spont. convers.
R05          3.23       5                 Skate-park           No              interview
R06         11.54       3                 Workplace            No              spont. convers.
R07          5.12       2                 Home                 Yes             spont. convers.
Σ=7         89.00

 

 

 

 

 

Table 1: Documentation of the pilot corpus recordings
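The total in Table 1 can be checked with a few lines of Python. This is a minimal sketch that assumes the durations are written as minutes.seconds (so that 54.50 means 54 min 50 s); the report does not state the notation explicitly, and the function names are chosen only for this illustration.

```python
# Durations from Table 1, read as minutes.seconds (an assumption about the notation).
DURATIONS = {
    "R01": "2.17",
    "R02": "54.50",
    "R03": "3.58",
    "R04": "7.31",
    "R05": "3.23",
    "R06": "11.54",
    "R07": "5.12",
}


def to_seconds(value: str) -> int:
    """Convert a 'minutes.seconds' string such as '54.50' to whole seconds."""
    minutes, seconds = value.split(".")
    return int(minutes) * 60 + int(seconds)


def total_seconds(durations: dict) -> int:
    """Sum all recording lengths in seconds."""
    return sum(to_seconds(value) for value in durations.values())


if __name__ == "__main__":
    total = total_seconds(DURATIONS)
    print(f"{len(DURATIONS)} recordings, {total // 60} min {total % 60} s")
```

Under this reading the seven recordings add up to 89 minutes and a few seconds, which agrees with the rounded total of 89.00 given in the table.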

 

 

All information about the speakers was collected on speaker identity lists. The data are presented in the following table:

 

 

ID    Sex   Year of Birth   Age      Education      Region
G01   F     1963            41       University     Central
G02   M     1965            39       University     Central
G03   F     1966            38       University     Central
G04   F     1967            37       University     Central
G05   F     1968            36       University     Central
G06   F     1968            36       University     Central
G07   M     1970(?)         34(?)    University     Other
G08   M     1933(?)         71(?)    University     Central
G09   F     1979            25       University     South-east
G10   F     1967            37       High school    North-west
G11   M     1987(?)         17(?)    Primary sch.   Central
G12   M     1987(?)         17(?)    Primary sch.   Central
G13   M     1987(?)         17(?)    Primary sch.   Central
G14   F     1976            28       University     South-east
G15   F     1979            25       University     Central
G16   M     1978            26       High school    Central
G17   M     1978            26       High school    Central
G18   F     ?               ?        ?              ?
G19   F     1969            35       University     North-west
G20   M     1948            56       High school    North-west

 

Table 2: Pilot corpus speakers' documentation

 

 

The sample of 20 speakers is representative with respect to the sex of the speakers but not with respect to the other demographic criteria. The actual spoken corpus should consist of texts taken representatively from 5 areas corresponding to the 5 dialect groups of Slovene. Furthermore, there should be 3 age classes and 3 educational classes. The rather opportunistic nature of the pilot corpus should be taken into consideration when analyzing it.
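The imbalance can be made visible by tallying the speaker data of Table 2. The sketch below copies the sex, education and region columns from the table; the variable and function names are assumptions made for this illustration, and "?" stands for an unknown value.

```python
from collections import Counter

# (sex, education, region) for speakers G01..G20, copied from Table 2.
SPEAKERS = [
    ("F", "University", "Central"),       # G01
    ("M", "University", "Central"),       # G02
    ("F", "University", "Central"),       # G03
    ("F", "University", "Central"),       # G04
    ("F", "University", "Central"),       # G05
    ("F", "University", "Central"),       # G06
    ("M", "University", "Other"),         # G07
    ("M", "University", "Central"),       # G08
    ("F", "University", "South-east"),    # G09
    ("F", "High school", "North-west"),   # G10
    ("M", "Primary sch.", "Central"),     # G11
    ("M", "Primary sch.", "Central"),     # G12
    ("M", "Primary sch.", "Central"),     # G13
    ("F", "University", "South-east"),    # G14
    ("F", "University", "Central"),       # G15
    ("M", "High school", "Central"),      # G16
    ("M", "High school", "Central"),      # G17
    ("F", "?", "?"),                      # G18
    ("F", "University", "North-west"),    # G19
    ("M", "High school", "North-west"),   # G20
]


def distribution(column: int) -> dict:
    """Return the percentage share of each value in one column of SPEAKERS."""
    counts = Counter(row[column] for row in SPEAKERS)
    total = len(SPEAKERS)
    return {value: 100 * n / total for value, n in counts.items()}


if __name__ == "__main__":
    for name, column in (("sex", 0), ("education", 1), ("region", 2)):
        print(name, distribution(column))
```

The tally gives 11 female and 9 male speakers, whereas 13 of the 20 speakers come from the Central region and 12 have a university education.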

 

The pilot corpus is better designed with respect to the contextual criteria: different structure types, settings, speaker relations, genres and media are represented among the texts. However, telephone conversations and some other text genres will necessarily have to be added to the planned spoken corpus. The final design of the pilot corpus according to the contextual criteria is presented in the following table:

 

 

Contextual criteria                       Proportion
Dialogue (or multilogue) vs. Monologue    94 % : 6 %
Private vs. Public                        19.5 % : 80.5 %
Informal vs. Formal                       35.5 % : 64.5 %
Media vs. Face to face                    31 % : 69 %
Surreptitious vs. Nonsurreptitious        5.6 % : 94.4 %

 

Table 3: Texts according to selected contextual criteria

 

 

Transcribing

 

At Batmult I learned about existing transcription software and tested three programs: Praat, Transcriber and WinPitch. Based on their characteristics I decided to use the first two for the actual transcription work. Transcriber is a tool for segmenting, labeling and transcribing speech. I found it more user-friendly than Praat; however, it does not allow transcribing overlapping speech of more than two speakers.

 

 

Picture 1: Transcriber working platform,

transcription of a pilot spoken corpus of Slovene

 

 

Praat, on the other hand, is less suitable for transcribing and works very slowly with longer recordings (more than 30 minutes), but it allows transcribing overlapping speech of more than two speakers, which is often the case in spontaneous speech.

 

 

Picture 2: Praat platform, transcribing for a pilot spoken corpus of Slovene

 

 

Both programs enable automatic synchronization of transcriptions and sound clips. In the WordPad format of the transcriptions (made in either Transcriber or Praat), speakers' utterances are clearly marked with time codes, as shown in the following example:

 

 

 

 

 

Picture 3: Transcription, done in Praat program, in WordPad format

 

 

Transcription standard

 

During the actual transcription work I had to decide on the principles for transcribing spoken Slovene. I followed the TEI and EAGLES recommendations on transcribing and annotating spoken texts. As is common when creating spoken corpora, I decided on an individual form of modified orthographic transcription. The basic unit of speech is the utterance, delimited by a short pause or a speaker turn. No punctuation is used in the transcription, and capital letters are used for proper names only.

 

The adopted transcription standard is presented in the following scheme:

 

Tag                        Meaning
<pavza>                    short pause (app. 1 sec)
<pavza>(5)                 pause (5 sec)
<ime>                      personal name
<priimek>                  family name
<priimek><f>               family name, a form for women
<neraz>                    unintelligible
<neraz> (5)                unintelligible (5 sec)
<?> text </?>              uncertain transcription
<lz>                       false start, truncated word
<repet>                    repetition
<nst>word</nst>            non-standard word or form
<okr>word</okr>            acronym or abbreviation
[text]                     overlapping speech
<singing>text</singing>    paralinguistic markers
<shift=vpr>text</>         part of the text, recognised as a clear question
<shift=poud>text</>        emphasised, stressed
<tj: norv>text</tj>        a word or a text spoken in foreign language
<nv>laughing</nv>          nonverbal events
(description)              non communicative background sound
<??> text</??>             speaker unknown or uncertain

 

Table 4: Transcription standard used in Pilot Spoken corpus of Slovene
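To illustrate how transcriptions marked up with the tags of Table 4 might be processed, the following minimal sketch counts the tags in an utterance and recovers the plain word sequence. It is not the conversion actually used for the corpus: the regular expressions, the function names and the sample utterance are assumptions made for this illustration, and the content of tags such as <nv> is simply kept as a token.

```python
import re
from collections import Counter

# Matches any angle-bracket tag, e.g. <pavza>, <nst>, </nst>, <shift=vpr>, </>, <?>, </?>.
TAG = re.compile(r"<(/?)([^<>]*)>")
# Matches parenthesised material: lengths such as (5) and (description) of background sound.
PAREN = re.compile(r"\([^()]*\)")


def tag_counts(utterance: str) -> Counter:
    """Count opening tags by name; closing tags such as </nst> or </> are ignored."""
    counts = Counter()
    for closing, name in TAG.findall(utterance):
        if not closing and name.strip():
            counts[name.strip()] += 1
    return counts


def plain_words(utterance: str) -> list:
    """Drop tags and parenthesised notes, keep overlapping speech without its brackets."""
    text = TAG.sub(" ", utterance)
    text = PAREN.sub(" ", text)
    text = text.replace("[", " ").replace("]", " ")
    return text.split()


# Invented utterance for illustration only; it is not taken from the pilot corpus.
SAMPLE = "ja <pavza> to je <nst>ful</nst> [dobro] <neraz> (5) <nv>smeh</nv>"

if __name__ == "__main__":
    print(tag_counts(SAMPLE))
    print(plain_words(SAMPLE))
```

A real conversion would also have to decide how to treat unclosed tags and the time codes produced by Transcriber or Praat, which this sketch leaves out.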

 

 

Converting transcriptions into a searchable corpus

 

The conversion of the transcriptions, linked to sound files, into a searchable corpus was carried out by Knut Hofland at Aksis. The pilot corpus of Slovene became part of the Corpus Work Bench used at the University of Bergen; the corpus is available at http://torvald.hit.uib.no/talem/jana/s9.html:

 

 

 

Picture 4: Aksis Corpus Bench, Pilot Spoken Corpus of Slovene (Corpus Jana)

 

 

Corpus Analysis

 

Building the corpus naturally involved a lot of transcription and annotation work. For the 89 minutes of recordings I spent about 100 hours on the actual transcription. Additional time was spent on the many revisions made while deciding on the transcription standard. The size of the corpus is about 15,000 tokens: words, prosodic tags (<pause>) and non-linguistic tags (<nv>laughing</nv>). The first version of the pilot corpus, derived from 3 recordings, was put on the Aksis corpus bench in mid-November; however, the necessary revisions were made almost every day from then until my final day at Batmult. For this reason, an accurate analysis of the pilot corpus will follow in my further study. Nevertheless, some examples of the use of the corpus can be shown even at this stage of the work.

 

 

 

Picture 5: Pilot Spoken Corpus of Slovene, concordance of "slovenščina"

 

Picture 5 shows the concordance of the word "slovenščina" (Slovene language). The whole utterance is linked to the actual sound file and labelled with speaker and recording identification (G, R). The three special Slovene characters (č, ž, š), which presented a problem at one stage of the conversion, are already displayed properly in this extract.
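The idea of a concordance over time-aligned utterances can be sketched as follows. This is only an illustration and not the Corpus Work Bench implementation: the Utterance record, its field names and the two sample utterances are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    recording: str  # recording ID as in Table 1, e.g. "R02"
    speaker: str    # speaker ID as in Table 2, e.g. "G05"
    start: float    # offset into the sound file in seconds (hypothetical field)
    end: float      # end offset in seconds (hypothetical field)
    text: str       # transcribed utterance


def concordance(utterances: list, keyword: str) -> list:
    """Return every utterance that contains the keyword as a whole token."""
    key = keyword.lower()
    return [u for u in utterances if key in u.text.lower().split()]


# Invented sample data for illustration only; not taken from the pilot corpus.
SAMPLE = [
    Utterance("R01", "G01", 12.4, 15.0, "slovenščina je zanimiva"),
    Utterance("R01", "G02", 15.0, 16.2, "mhm ja"),
]

if __name__ == "__main__":
    for hit in concordance(SAMPLE, "slovenščina"):
        print(hit.recording, hit.speaker, f"{hit.start}-{hit.end}", hit.text)
```

Keeping the start and end offsets with each utterance is what allows a hit to be played back from the corresponding sound file.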

 

 

 

 

 

 

 

Picture 6: Pilot Spoken Corpus of Slovene, concordance of "mhm"

 

The discourse marker "mhm" has, as expected, a very high absolute frequency (105) compared to its absolute frequency (156) in the FIDA corpus, which is ten thousand times larger. With the pilot corpus we could challenge the explanation of the meaning of the word "mhm" in the standard Slovene dictionary, where it is explained as a word of hesitation or of restrained agreement. We cannot find even one example to support that explanation among the 105 occurrences of "mhm" in the pilot corpus; on the other hand, some highly represented meanings should be added to the dictionary's explanation.
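The comparison becomes clearer when both counts are normalized to a common corpus size. The sketch below uses the approximate sizes given in this report (about 15,000 tokens for the pilot corpus, 100 million words for FIDA), so the results are rough estimates; the function name is an assumption made for this illustration.

```python
PILOT_SIZE = 15_000        # approximate size of the pilot corpus in tokens
FIDA_SIZE = 100_000_000    # size of the FIDA corpus in words


def per_million(count: int, corpus_size: int) -> float:
    """Normalize an absolute frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


if __name__ == "__main__":
    pilot = per_million(105, PILOT_SIZE)
    fida = per_million(156, FIDA_SIZE)
    print(f"pilot corpus: {pilot:.2f} per million")
    print(f"FIDA:         {fida:.2f} per million")
    print(f"ratio:        {pilot / fida:.0f}")
```

On these figures "mhm" occurs about 7,000 times per million tokens in the pilot corpus against fewer than 2 per million in FIDA, a difference of more than three orders of magnitude.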

 

 

Picture 7: Pilot Spoken Corpus of Slovene, concordance of tag "<nst>"

 

 

"Non-standard word" is an annotation mark that is difficult to define by empirical criteria; its definition certainly needs further consideration. However, the pilot corpus shows the set of words that, at least according to my language intuition, somehow deviate from the standard language. Among them we find many vulgar words, words from slang and dialects, and words of foreign origin.

 

 

Frequency list

 


       1     498   35.422 je
      2     425   30.230 ne
      3     358   25.464 ə
      4     313   22.263 pa
      5     297   21.125 in
      6     284   20.201 se
      7     270   19.205 da
      8     268   19.063 to
      9     265   18.849 ja
     10     264   18.778 v
     11     186   13.230 na
     12     143   10.171 tudi
     13     130    9.247 za
     14     115    8.180 ki
     15     106    7.540 so
     16     105    7.469 tako
     17     105    7.469 mhm
     18      98    6.971 kaj
     19      88    6.259 a
     20      86    6.117 še
     21      84    5.975 če
     22      78    5.548 zdaj
     23      77    5.477 smeh
     24      77    5.477 sem
     25      74    5.264 əm
     26      74    5.264 kot
     27      68    4.837 vem
     28      68    4.837 samo
     29      68    4.837 kar
     30      67    4.766 ti
     31      67    4.766 potem
     32      66    4.695 bo
     33      65    4.623 s
     34      64    4.552 ampak
     35      63    4.481 no
     36      63    4.481 lahko
     37      61    4.339 ali
     38      60    4.268 z
     39      59    4.197 že
     40      58    4.125 saj

 

 

Picture 8: Frequency list of a Pilot Spoken Corpus of Slovene 

 

The frequency list shows the 40 most frequently used words in the pilot spoken corpus, with their absolute and relative (per 1,000 words) frequencies. The most frequent word is "je", the 3rd person singular form of the verb "to be" (is). The second is the negation word "ne", meaning "no", which can also be used as a discourse marker with no negative connotation; the third most frequently used word is a hesitation sound produced with the mouth half open (ə). Among the 40 most used words in the pilot spoken corpus we find mostly grammatical words, discourse markers and filled pauses. All words need further study to define their (contextual) meanings and discourse functions.
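A list of this kind, with rank, absolute frequency, relative frequency per 1,000 tokens and word, can be produced along the following lines. This is a minimal sketch and not the tool that generated Picture 8; the function names are assumptions, and the second function only works backwards from a printed row to the token total that the row implies.

```python
from collections import Counter


def frequency_list(tokens: list, top: int = 40) -> list:
    """Return (rank, absolute frequency, frequency per 1,000 tokens, word) rows."""
    counts = Counter(tokens)
    total = len(tokens)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, count, 1000 * count / total, word))
    return rows


def implied_total(count: int, per_thousand: float) -> int:
    """Estimate the corpus size implied by one row of the frequency list."""
    return round(1000 * count / per_thousand)


if __name__ == "__main__":
    for rank, count, rel, word in frequency_list("je ne je pa je ne".split(), top=3):
        print(f"{rank:6d} {count:7d} {rel:8.3f} {word}")
    # First row of the list above: "je" occurs 498 times, 35.422 per 1,000 tokens.
    print(implied_total(498, 35.422))
```

Applied to the first row of the list, the estimate comes to a little over 14,000 tokens, which is in line with the corpus size of about 15,000 tokens reported earlier.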

 

 

 

Parallel Activities

 

Meetings

 

At the beginning of my stay at Aksis I was introduced to experienced transcribers who demonstrated the transcription tools: Reidunn Hernes (Norsk Institute) for Praat (September 21) and Margrete Dyvik for Transcriber (October 6).

 

Since I was very interested in learner corpora, Gisle Andersen arranged for me to meet Kari Tenfjord (October 7), leader of the ASK project (a Norwegian learner corpus). She explained the design of the corpus and demonstrated its use to me. Later on, Paul Maurer, who works as a programmer for the ASK project, used some of my transcribed material (speech of a non-native speaker of Slovene) for a short demonstration of SLASK – Slovene ASK:

                                   

 

Picture 9: Demo version of SLASK –

Slovene learners corpus (Type of the mistake: R – redundant word)

 

On October 13 Reidunn Andersen invited me to a meeting with a delegation from the Latvian examination centre. The topic of the meeting, national language exams, was connected to my work at the University of Ljubljana, and I found the exchange of ideas, experiences and views during the discussion extremely productive.

 

At the end of my stay at Batmult I met Kari Tenfjord again (December 15), this time regarding an internet course for teachers of Norwegian as a second or foreign language. I found the educational system she introduced very efficient and productive, and I will certainly try to present it at the University of Ljubljana.

 

Seminar and lecture attendance

 

I participated in two internal Aksis seminars: Knut Hofland (October 14) presented the IMS Corpus Workbench and Paul Maurer (October 28) presented the Oslo-Bergen tagger and Norsk språkbank. On November 11 I attended the lecture and seminar given by Knut Hofland at the University of Bergen, and on December 14 the presentation of the Kurdish version of Lexin – Dictionaries for minority language immigrants.

 

Attending the seminars and lectures, although they were given in Norwegian, provided most interesting insights into the newest research being carried out by scholars at Aksis. I wish I could have attended more lectures and seminars at the University of Bergen; sometimes I only found out about them when it was already too late.

 

 

Lecture          

 

At the end of my study period at BATMULT I gave two presentations of my project. The first was at the Linguistics seminar at the Department of Linguistic Studies at the University of Bergen (December 3), and the second was given internally at Aksis (December 10), http://www.aksis.uib.no.

The feedback on the lectures turned out to be very constructive and highlighted both the strengths and the weaknesses of the pilot spoken corpus of Slovene. Some revisions of the transcription and annotation scheme might be required; for example, the tag for non-standard words should have a clearer definition, possibly separating words taken from foreign languages from phonetically modified Slovene words (dialects, slang). The question of covering the different dialect groups also arose at the lecture: there is a lack of speakers from different regions in the pilot corpus, but the proposed demographic sampling of material for the actual corpus should diminish this problem. Another suggestion concerns the structure of some tags; for example, the tag <repet> should have an opening and a closing tag, showing where the repeated word or utterance begins and where it ends. Besides that, I was given many good suggestions for further work by the university professors who attended the lectures, Koenraad de Smedt, Gjert Kristoffersen, Helge Dyvik and Øivin Andersen, and by other scholars present at the lectures.

 

 

Acknowledgements

 

During my stay at Aksis at the University of Bergen the goals of my original project submitted to BATMULT were achieved. The BATMULT project is an integral component of my PhD thesis, and I benefited a lot from the training in computational linguistic tools. The pilot corpus supports the theoretical framework of design and annotation principles for a spoken corpus of Slovene that I intend to introduce in my PhD thesis. The project itself is a significant advance for my PhD studies and, in my view, represents a considerable innovation for the Slovene research area.

Working and discussing with Knut Hofland on an almost daily basis, the facilities available at BATMULT, and attending seminars and meetings with different experts also enabled me to develop further knowledge and understanding of computational linguistics.

 

During my transcription work I was often uncertain about the transcription of some spoken words or phrases. I frequently communicated on the subject with my PhD supervisor at the University of Ljubljana, Professor Marko Stabej, and with Professor Breda Pogorelec. The e-mail discussions with them were very helpful for the further development of the transcription principles.

 

The Aksis institute is the most pleasant working environment I can imagine. During my stay there I also experienced friendly and supportive relations among co-workers. I would therefore like to thank my supervisors for this unique study opportunity: Professor Koenraad de Smedt, scientific coordinator for Batmult, and Professor Gjert Kristoffersen, research director of Aksis, and also Gisle Andersen, Batmult administrator.