Re: [Corpora-List] Query about corpora of spoken English

From: Briony Williams (b.williams@bangor.ac.uk)
Date: Mon Dec 12 2005 - 22:14:19 MET

Next message: Hal Daume III: "[Corpora-List] CFP: Computationally Hard Problems in Speech and Language Processing"

Previous message: Linguistic Data Consortium: "[Corpora-List] News from the LDC"
In reply to: Rayson, Paul: "RE: [Corpora-List] Query about corpora of spoken English"

Rayson, Paul wrote:
> Hi,
>
> I've been told by Anne Wichmann and Gerry Knowles that the latest
> version of MARSEC is held by Daniel Hirst in Aix-en-Provence:
>
> http://aune.lpl.univ-aix.fr/~hirst/home.html

This is the Aix-MARSEC project: a more direct link is

http://www.lpl.univ-aix.fr/~EPGA/en_marsec_com.html

The Aix-MARSEC project takes the work of MARSEC a great deal further.

1) The original SEC was not time-aligned in any way with the speech data: it
consisted of transcripts only, at various linguistic levels.

2) The MARSEC project time-aligned the speech data with a word-level
transcription and also a transcription at the level of the tone group.

3) The Aix-MARSEC project time-aligns the speech data at several linguistic
levels, namely: the phoneme, the syllable, sub-syllabic constituents, the
rhythmic unit, the stress foot, the word, major and minor intonation units,
and the MOMEL/INTSINT intonational coding.

The annotations are available online from the Aix-MARSEC project, but the
recordings are available from them on CD-ROM only.

-----------------------------------------------------------------------
Quoting from:
http://www.lpl.univ-aix.fr/~EPGA/marsec_com/Auran_Bouzon_Hirst_SP2004.pdf

"For compatibility and processing reasons, the 332-minute long
audio component is available under the form of 408
16 kHz .wav format files.

"The annotation component currently comprises the 9
different levels mentioned earlier: phonemes, syllables,
subsyllabic constituents, words, stress feet, rhythm units, minor
and major intonation units, INTSINT coding and the
corresponding values of the targets in Hz. Each level is
represented by a separate tier in Praat TextGrids (as illustrated
in figure 1). Two supplementary levels, based on the syntactic
annotation of the corpus using the CLAWS system and a
Property Grammar system developed in the Laboratoire Parole
et Langage in Aix-en-Provence are to be integrated soon, thus
allowing not only future analyses taking into account the
grammatical tagging and parsing of the data, but also the direct
comparison of automatic syntactic annotation systems.

"The Aix-MARSEC tools consist of a set of reference files
(grapheme-phoneme conversion dictionaries) and (multiplatform)
Praat and Perl scripts."
------------------------------------------------------------------------
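
To give a rough idea of what this kind of multi-level time alignment makes
possible, here is a minimal Python sketch. It does not read real Praat
TextGrid files and is not part of the Aix-MARSEC tools: the tier names, labels
and times are invented for illustration, and the helpers label_at and
units_within are hypothetical. It only shows the underlying idea, namely that
each level is a list of labelled time intervals, so units on one tier can be
related to units on another through their times.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Interval:
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str    # annotation label for this stretch of speech


# Hypothetical hand-made data: two tiers covering the same 0.8 s of speech.
# Labels and times are invented, not taken from the corpus.
tiers: Dict[str, List[Interval]] = {
    "words": [
        Interval(0.00, 0.30, "good"),
        Interval(0.30, 0.80, "morning"),
    ],
    "syllables": [
        Interval(0.00, 0.30, "gUd"),
        Interval(0.30, 0.55, "mO:"),
        Interval(0.55, 0.80, "nIN"),
    ],
}


def label_at(tier: List[Interval], t: float) -> Optional[str]:
    """Return the label of the interval containing time t, or None."""
    for iv in tier:
        if iv.start <= t < iv.end:
            return iv.label
    return None


def units_within(tier: List[Interval], start: float, end: float) -> List[str]:
    """Return labels of all intervals lying entirely inside [start, end]."""
    return [iv.label for iv in tier if iv.start >= start and iv.end <= end]


if __name__ == "__main__":
    # Which word is being spoken 0.4 s into the recording?
    print(label_at(tiers["words"], 0.4))
    # Which syllables make up the second word?
    word = tiers["words"][1]
    print(units_within(tiers["syllables"], word.start, word.end))
```

With transcripts alone (as in the original SEC) a query like "which syllables
fall inside this word" has to be answered by counting symbols; once every
level carries time stamps it becomes a simple interval comparison.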

As one of the two prosodic transcribers of the original IBM/Lancaster SEC
project, I am delighted that this work has evolved into such a rich resource,
which will be of immense benefit to those studying the phonetics and
structure of spoken UK English.

Best regards

Briony Williams


This archive was generated by hypermail 2b29 : Mon Dec 12 2005 - 22:27:02 MET