Building a Spoken Corpus of Slovene
Final Report on a Marie Curie
Training Site host PhD fellowship at the Bergen Advanced Training Site in
Multilingual Tools (BATMULT), University of Bergen, September 15 – December 15,
2004
Jana Zemljarič Miklavčič,
Project description
My PhD research aims to lay the theoretical
foundation for building a spoken corpus of Slovene, which is planned to
complement the 100-million-word FIDA corpus (http://www.fida.net/slo/index.html)
as its spoken component. The research poses two main scientific challenges:
the first is to develop a set of criteria for the collection and
selection of spoken material to be included in a representative and balanced spoken corpus, and the
second is to outline recommendations for the transcription and annotation of spoken texts. As to the criteria for collection, a
preliminary proposal for the selection of materials, based on a combination
of the demographic and contextual methods, had already been worked out. The
actual collection of spoken texts had also begun before I came to Batmult.
The main aim of my three-month stay at Batmult as a Marie Curie Host PhD student was to compile a pilot
spoken corpus of Slovene based on digital recordings, available in searchable
form, with transcriptions linked to sound files. The purpose of the pilot corpus
was to redefine the
criteria for the collection, selection and documentation of spoken materials, to
develop and test transcription and mark-up conventions, and finally
to show some possibilities for using the corpus for language description and
language analysis. Batmult training site at The Department
of Culture, Language and Information Technology (AKSIS) at the
Concrete Achievements and Results
The design of a pilot corpus
Slovene language is spoken by 2 million
speakers in
The pilot corpus consists of 7 digital
recordings with a total length of 89 minutes. All texts were recorded in 2004.
The specifications of the recordings are shown in the following table:
ID | Duration (min) | No. of speakers | Place of recording | Surreptitious | Genre |
R01 | 2.17 | 2 | University | No | interview |
R02 | 54.50 | 6 | Studio | No | round table |
R03 | 3.58 | 2 | Home | No | interview |
R04 | 7.31 | 5 | Office | No | spont. convers. |
R05 | 3.23 | 5 | Skate-park | No | interview |
R06 | 11.54 | 3 | Workplace | No | spont. convers. |
R07 | 5.12 | 2 | Home | Yes | spont. convers. |
Σ = 7 | 89.00 | | | | |
Table 1: Documentation of the pilot corpus recordings
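The durations in Table 1 appear to be written as minutes.seconds rather than as decimal minutes. This is an assumption, but under that reading the seven values add up to the stated total of 89 minutes. A minimal Python sketch of the check follows; the function names are illustrative only.

```python
# Durations copied from Table 1, read as "minutes.seconds" (an assumption).
DURATIONS = {
    "R01": "2.17", "R02": "54.50", "R03": "3.58", "R04": "7.31",
    "R05": "3.23", "R06": "11.54", "R07": "5.12",
}


def to_seconds(mm_ss):
    """Convert a 'minutes.seconds' string such as '54.50' to seconds."""
    minutes, seconds = mm_ss.split(".")
    return int(minutes) * 60 + int(seconds)


def total_duration(durations):
    """Return the summed duration as a (minutes, seconds) pair."""
    total = sum(to_seconds(value) for value in durations.values())
    return divmod(total, 60)


minutes, seconds = total_duration(DURATIONS)
print(f"{len(DURATIONS)} recordings, {minutes} min {seconds} s")
```

Under this reading the total comes to 89 minutes and 5 seconds, which agrees with the rounded figure in the last row of the table.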
All
information about the speakers was collected on speaker identity lists. The
data are presented in the following table:
ID | Sex | Year of Birth | Age | Education | Region |
G01 | F | 1963 | 41 | University | Central |
G02 | M | 1965 | 39 | University | Central |
G03 | F | 1966 | 38 | University | Central |
G04 | F | 1967 | 37 | University | Central |
G05 | F | 1968 | 36 | University | Central |
G06 | F | 1968 | 36 | University | Central |
G07 | M | 1970(?) | 34(?) | University | Other |
G08 | M | 1933(?) | 71(?) | University | Central |
G09 | F | 1979 | 25 | University | South-east |
G10 | F | 1967 | 37 | High school | North-west |
G11 | M | 1987(?) | 17(?) | Primary sch. | Central |
G12 | M | 1987(?) | 17(?) | Primary sch. | Central |
G13 | M | 1987(?) | 17(?) | Primary sch. | Central |
G14 | F | 1976 | 28 | University | South-east |
G15 | F | 1979 | 25 | University | Central |
G16 | M | 1978 | 26 | High school | Central |
G17 | M | 1978 | 26 | High school | Central |
G18 | F | ? | ? | ? | ? |
G19 | F | 1969 | 35 | University | North-west |
G20 | M | 1948 | 56 | High school | North-west |
Table 2: Documentation of the pilot corpus speakers
The sample of 20 speakers is representative
with respect to the sex of the speakers but not with respect to other demographic criteria.
The actual spoken corpus should consist of texts sampled representatively from 5
areas that represent the 5 dialectal groups of Slovene. Furthermore, there
should be 3 age classes and 3 educational classes. The rather opportunistic
nature of the pilot corpus should be taken into consideration when analyzing it.
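The balance of the sample can be checked by tallying the speakers in Table 2 by each demographic attribute. The following Python sketch does this for sex, education and region; the data structure and function names are illustrative, and age classes are left out because their boundaries are not defined in this report.

```python
from collections import Counter
from typing import NamedTuple


class Speaker(NamedTuple):
    id: str
    sex: str
    education: str
    region: str


# Values copied from Table 2; "?" marks information that was not recorded.
SPEAKERS = [
    Speaker("G01", "F", "University", "Central"),
    Speaker("G02", "M", "University", "Central"),
    Speaker("G03", "F", "University", "Central"),
    Speaker("G04", "F", "University", "Central"),
    Speaker("G05", "F", "University", "Central"),
    Speaker("G06", "F", "University", "Central"),
    Speaker("G07", "M", "University", "Other"),
    Speaker("G08", "M", "University", "Central"),
    Speaker("G09", "F", "University", "South-east"),
    Speaker("G10", "F", "High school", "North-west"),
    Speaker("G11", "M", "Primary sch.", "Central"),
    Speaker("G12", "M", "Primary sch.", "Central"),
    Speaker("G13", "M", "Primary sch.", "Central"),
    Speaker("G14", "F", "University", "South-east"),
    Speaker("G15", "F", "University", "Central"),
    Speaker("G16", "M", "High school", "Central"),
    Speaker("G17", "M", "High school", "Central"),
    Speaker("G18", "F", "?", "?"),
    Speaker("G19", "F", "University", "North-west"),
    Speaker("G20", "M", "High school", "North-west"),
]


def tally(speakers, field):
    """Count speakers by one attribute, e.g. 'sex' or 'region'."""
    return Counter(getattr(speaker, field) for speaker in speakers)


for field in ("sex", "education", "region"):
    print(field, dict(tally(SPEAKERS, field)))
```

The tally gives 11 female and 9 male speakers, while 13 of the 20 speakers come from the Central region and 12 have a university education, which illustrates why the sample is balanced only with respect to sex.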
The pilot corpus is better designed with respect to
contextual criteria: different structure types, settings, speaker positions,
genres and media are represented among the texts. However, telephone conversations
and some other text genres will necessarily have to be added to the planned spoken
corpus. The final design of the pilot corpus according to contextual criteria is
presented in the following table:
Contextual criteria | Proportion |
Dialogue (or multilogue) vs. Monologue | 94 % : 6 % |
Private vs. Public | 19.5 % : 80.5 % |
Informal vs. Formal | 35.5 % : 64.5 % |
Media vs. Face to face | 31 % : 69 % |
Surreptitious vs. Nonsurreptitious | 5.6 % : 94.4 % |
Table 3: Texts according to selected contextual criteria
Transcribing
I learned about existing transcription
software at Batmult and tested three programs: Praat, Transcriber and WinPitch.
Based on their characteristics, I decided to use the first two
for the actual transcription work. Transcriber is a tool for segmenting,
labeling and transcribing speech. I found it more user-friendly than Praat; however, it does not allow transcribing overlapping
speech of more than two speakers.
Picture 1: Transcriber working platform, transcription of a pilot spoken corpus of Slovene
Praat, on the
other hand, is less convenient for transcribing and works very slowly with longer
recordings (more than 30 minutes), but it allows transcribing overlapping speech
of more than two speakers, which is often the case in spontaneous speech.
Picture 2: Praat platform, transcribing for a pilot spoken corpus of Slovene
Both programs automatically
synchronize transcriptions with sound clips. In the WordPad format of the
transcriptions (made in either Transcriber or Praat),
speakers’ utterances are clearly marked with time codes, as shown in the
following example:
Picture 3: Transcription, done in Praat program, in WordPad format
Transcription standard
During the actual transcription work I had to
decide on the principles for transcribing spoken Slovene.
I followed the TEI and EAGLES recommendations on transcribing and
annotating spoken texts. As is common when creating spoken corpora,
I opted for my own form of modified orthographic transcription. The basic unit of speech is the utterance,
delimited by a short pause or a change of speaker. No punctuation is used in
the transcription, and capital
letters are used for proper names only.
The adopted
transcription standard is presented in the following scheme:
Tag | Meaning |
<pavza> | short pause (app. 1 sec) |
<pavza>(5) | pause (5 sec) |
<ime> | personal name |
<priimek> | family name |
<priimek><f> | family name, a form for women |
<neraz> | unintelligible |
<neraz> (5) | unintelligible (5 sec) |
<?> text </?> | uncertain transcription |
<lz> | false start, truncated word |
<repet> | repetition |
<nst>word</nst> | non-standard word or form |
<okr>word</okr> | acronym or abbreviation |
[text] | overlapping speech |
<singing>text</singing> | paralinguistic markers |
<shift=vpr>text</> | part of the text, recognised as a clear question |
<shift=poud>text</> | emphasised, stressed |
<tj: norv>text</tj> | a word or a text spoken in a foreign language |
<nv>laughing</nv> | nonverbal events |
(description) | non-communicative background sound |
<??> text</??> | speaker unknown or uncertain |
Table 4: Transcription standard used in the Pilot Spoken Corpus of Slovene
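To illustrate how a transcription following Table 4 could be processed, the Python sketch below separates angle-bracket tags from words in an utterance. It is only a rough illustration and not the procedure used for the pilot corpus: the regular expression, the function name and the sample utterance are invented for this example, and square brackets and parenthesised descriptions are not handled specially.

```python
import re

# A tag is anything between angle brackets, e.g. <pavza>, <nst>, </nst>.
# A number in parentheses directly after a tag is read as a duration in
# seconds, as in "<pavza>(5)" from Table 4. Everything else is a word.
TOKEN = re.compile(r"(<[^<>]+>)(?:\s*\((\d+)\))?|([^\s<>]+)")


def tokenize(utterance):
    """Split an utterance into ('tag' | 'word', text, seconds) tuples."""
    tokens = []
    for match in TOKEN.finditer(utterance):
        tag, seconds, word = match.groups()
        if tag is not None:
            tokens.append(("tag", tag, int(seconds) if seconds else None))
        else:
            tokens.append(("word", word, None))
    return tokens


# Invented sample utterance, used only to exercise the function.
sample = "ja <pavza>(5) to je <nst>ful</nst> dobro"
for kind, text, seconds in tokenize(sample):
    print(kind, text, seconds)
```

Keeping tags and words apart in this way makes it possible to count words without the mark-up, or to count the tags themselves.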
Converting transcriptions into a searchable corpus
The conversion of transcriptions, linked to sound files, into a searchable corpus, has been made by
Picture 4: Aksis
Corpus Bench, Pilot Spoken Corpus of Slovene (Corpus Jana)
Corpus Analysis
Building the corpus naturally involved a great deal of
transcription and annotation work. The 89 minutes of recordings took me about
100 hours of actual transcription work. Additional time was spent on the
many revisions made while deciding on the transcription standard. The size of the
corpus is about 15,000 tokens: words, prosodic tags (<pause>) and
non-linguistic tags (<nv>laughing</nv>). The
first version of the pilot corpus, derived from 3 recordings, was put on the Aksis corpus bench in mid-November;
however, necessary revisions were made almost every day from then
until my final day at Batmult. For this reason, a
thorough analysis of the pilot corpus will follow in my further study. However,
some examples of the use of the corpus can be shown even at this stage of the work.
Picture 5: Pilot Spoken Corpus of Slovene, concordance
of "slovenščina"
Picture 5 shows the concordance of the word
"slovenščina" (Slovene language).
Each utterance is linked to the actual sound file and labelled
with speaker and recording identifiers (G, R). The three special Slovene characters
(č, ž, š), which presented a problem at one stage of the conversion, are already
displayed correctly in this extract.
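The idea of an utterance-level concordance, where each hit carries its recording and speaker identifiers, can be sketched in a few lines of Python. This is not how the Aksis corpus bench works internally; the data structure, the function name and the sample utterances are invented for illustration.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    recording: str  # recording ID, e.g. "R01"
    speaker: str    # speaker ID, e.g. "G01"
    text: str       # transcribed utterance


def concordance(utterances, word):
    """Return the utterances that contain `word` as a whole token."""
    target = word.casefold()
    return [
        utterance
        for utterance in utterances
        if target in (token.casefold() for token in utterance.text.split())
    ]


# Invented sample utterances, used only to exercise the function.
SAMPLE = [
    Utterance("R01", "G01", "slovenščina je moj materni jezik"),
    Utterance("R01", "G02", "mhm"),
    Utterance("R03", "G09", "to je slovenščina"),
]

for hit in concordance(SAMPLE, "slovenščina"):
    print(hit.recording, hit.speaker, hit.text)
```

Python strings are Unicode, so the characters č, ž and š need no special treatment in a sketch like this.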
Picture 6: Pilot Spoken Corpus of Slovene,
concordance of "mhm"
The discourse marker "mhm"
has, as expected, a very high absolute frequency (105) compared to its absolute frequency in the roughly ten thousand times bigger Fida corpus (156). The pilot corpus
allows us to question the explanation of the meaning of "mhm" in the standard Slovene dictionary, where it is explained
as a word of hesitation or of restrained agreement. Not a single one
of the 105 occurrences of "mhm"
in the pilot corpus supports that explanation, whereas some highly represented meanings should be added
to the explanation in the dictionary.
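Absolute frequencies from corpora of very different sizes are easier to compare once they are normalised. The Python sketch below converts the two counts of "mhm" to frequencies per million tokens, using the approximate corpus sizes given in this report (about 15,000 tokens for the pilot corpus and 100 million words for FIDA). The function name is illustrative, and the results are only as precise as those rounded sizes.

```python
def per_million(count, corpus_size):
    """Normalise an absolute frequency to occurrences per million tokens."""
    return count / corpus_size * 1_000_000


PILOT_SIZE = 15_000       # approximate size of the pilot corpus in tokens
FIDA_SIZE = 100_000_000   # approximate size of the FIDA corpus in words

pilot = per_million(105, PILOT_SIZE)
fida = per_million(156, FIDA_SIZE)

print(f"pilot corpus: {pilot:.0f} per million")
print(f"FIDA:         {fida:.2f} per million")
print(f"ratio:        {pilot / fida:.0f}")
```

With these figures "mhm" occurs about 7,000 times per million tokens in the pilot corpus against about 1.56 per million in FIDA, a difference of more than three orders of magnitude.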
Picture 7: Pilot Spoken Corpus of Slovene,
concordance of tag "<nst>"
"Non-standard word" is an annotation
mark that is difficult to define by empirical criteria; its
definition certainly needs further consideration. However, the pilot corpus
shows the set of words that, at least according to my language intuition, deviate somehow
from the standard language. Among them we find many vulgar words, words
from slang and dialects, and words of foreign origin.
Frequency list
1 498 35.422 je
2 425 30.230 ne
3 358 25.464 ə
4 313 22.263 pa
5 297 21.125 in
6 284 20.201 se
7 270 19.205 da
8 268 19.063 to
9 265 18.849 ja
10 264 18.778 v
11 186 13.230 na
12 143 10.171 tudi
13 130 9.247 za
14 115 8.180 ki
15 106 7.540 so
16 105 7.469 tako
17 105 7.469 mhm
18 98 6.971 kaj
19 88 6.259 a
20 86 6.117 še
21 84 5.975 če
22 78 5.548 zdaj
23 77 5.477 smeh
24 77 5.477 sem
25 74 5.264 əm
26 74 5.264 kot
27 68 4.837 vem
28 68 4.837 samo
29 68 4.837 kar
30 67 4.766 ti
31 67 4.766 potem
32 66 4.695 bo
33 65 4.623 s
34 64 4.552 ampak
35 63 4.481 no
36 63 4.481 lahko
37 61 4.339 ali
38 60 4.268 z
39 59 4.197 že
40 58 4.125 saj
Picture 8: Frequency list of a Pilot Spoken
Corpus of Slovene
The frequency list shows the 40 most frequently
used words in the pilot spoken corpus, with their absolute and relative (per 1,000 words)
frequencies. The most frequent word is "je",
the 3rd person singular form of the verb "to be" ("is"). The second is the negation word "ne",
meaning "no" or "not", which can also be used as a discourse marker with no negative
connotation; the third most frequently used word is a hesitation sound produced with
the mouth half open (ə). Among the 40 most used words in the pilot spoken corpus we
find mostly grammatical words, discourse markers and filled pauses. All of these
words need further study to define their (contextual) meanings and discourse
functions.
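A frequency list with the same four columns as Picture 8 (rank, absolute frequency, frequency per 1,000 tokens, word) can be produced with a few lines of Python. This is a sketch only: the function name and the toy token list are invented, and it does not reproduce the actual tool used to generate the list above.

```python
from collections import Counter


def frequency_list(tokens, top=40):
    """Return (rank, absolute, per_1000, word) rows for the most frequent tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (rank, count, round(count / total * 1000, 3), word)
        for rank, (word, count) in enumerate(counts.most_common(top), start=1)
    ]


# Toy token list, used only to exercise the function.
tokens = "je ne je pa je ne".split()
for rank, absolute, relative, word in frequency_list(tokens):
    print(rank, absolute, relative, word)
```

The relative frequencies in the published list are consistent with a total of 14,059 tokens (for example, 498 / 14,059 × 1,000 ≈ 35.422), although this total is inferred from the figures and is stated in the report only as "about 15,000".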
Parallel Activities
Meetings
At the beginning of my stay at Aksis I was introduced to experienced
transcribers for demonstrations of the transcription tools: Reidunn
Hernes (Norsk Institute) for
Praat (September 21) and Margrete
Dyvik for Transcriber (October
6).
Since I was very
interested in learners corpora,
Picture 9: Demo version of SLASK –
Slovene learners
corpus (Type of the mistake: R – redundant word)
On October 13 Reidunn Andersen invited me to a meeting with a
delegation from the Latvian examination centre. The topic of the meeting – national
language exams – was connected to my work at
At the end of my stay
at Batmult I met Kari Tenfjord
again (December 15), this time concerning an internet course for teachers
of Norwegian as a second or foreign language. I found the educational
system introduced there very efficient and productive and I will certainly try to present it at
Seminar and lecture attendance
I participated in two internal Aksis seminars:
Attending the seminars and the lecture,
although they were held in Norwegian, provided very interesting insights into the
newest research being carried out by scholars at Aksis.
I wish I could attend more lectures and seminars on
Lecture
At the end of my study period at BATMULT I gave
two presentations of my project. The first one was at the Linguistics seminar at
the Department of Linguistic Studies at the
The feedback on the lectures turned out to be
very constructive and highlighted both the strengths and the weaknesses of the pilot
spoken corpus of Slovene. Some revisions of the
transcription and annotation scheme might be required; for example, the tag for non-standard words
should have a clearer definition, possibly separating words taken from foreign
languages from phonetically modified Slovene words (dialects, slang). The
question of covering the different dialectal groups also arose at the lecture;
the pilot corpus lacks speakers from different regions, but the proposed
demographic sampling of material for the actual corpus should reduce this problem.
Another suggestion concerns the structure of some tags; for example, the tag
<repeat> should have an opening and a closing tag, showing where the
repeated word or utterance begins and where it ends. Besides that, I was also given
many good suggestions for further work by the university professors who attended
the lectures, Koenraad de Smedt, Gjert Kristoffersen, Helge Dyvik and Øivin Andersen, and by other scholars present at the lectures.
Acknowledgements
During my stay at Aksis at the
Work and discussions with
During my transcription work I was often
uncertain about the transcription of certain spoken words or phrases. I frequently
discussed the subject with my PhD supervisor at Ljubljana University, Professor
Marko Stabej, and with Professor Breda
Pogorelec. The e-mail discussions with them were very
helpful for the further development of the transcription principles.
The Aksis institute is the most pleasant working
environment I can imagine. During my stay there I also experienced friendly and
supportive relations among co-workers. I would therefore like to express my thanks for this
unique study opportunity to my supervisors, Professor Koenraad de Smedt, scientific coordinator for Batmult, Professor Gjert
Kristoffersen, research director of Aksis, and also to