Re: [Corpora-List] Query about corpora of spoken English

From: Briony Williams (b.williams@bangor.ac.uk)
Date: Mon Dec 12 2005 - 22:14:19 MET

    Rayson, Paul wrote:
    > Hi,
    >
    > I've been told by Anne Wichmann and Gerry Knowles that the latest
    > version of MARSEC is held by Daniel Hirst in Aix-en-Provence:
    >
    > http://aune.lpl.univ-aix.fr/~hirst/home.html

    This is the Aix-MARSEC project: a more direct link is

    http://www.lpl.univ-aix.fr/~EPGA/en_marsec_com.html

    The Aix-MARSEC project takes the work of MARSEC a great deal further.

    1) The original SEC was not time-aligned in any way with the speech data: it
    consisted of transcripts only, at various linguistic levels.

    2) The MARSEC project time-aligned the speech data with a word-level
    transcription and also a transcription at the level of the tone group.

    3) The Aix-MARSEC project time-aligns the speech data at several linguistic
    levels, namely: the phoneme, the syllable, sub-syllabic constituents, the
    rhythmic unit, the stress foot, the word, major and minor intonation units,
    and the MOMEL/INTSINT intonational coding.
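
    To illustrate what alignment at several levels makes possible, here is a
    minimal Python sketch. It is not the Aix-MARSEC file format or tooling: the
    Interval class, the tier names, the labels and all the timings are invented
    for illustration. It only assumes that each level is a sorted list of
    non-overlapping, labelled time intervals, so that a unit on one level can
    be looked up on another by time.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One labelled stretch of speech on a single annotation level."""
    start: float  # seconds
    end: float    # seconds
    label: str


def interval_at(tier, t):
    """Return the interval of a sorted, non-overlapping tier that contains
    time t (start inclusive, end exclusive), or None if there is none."""
    starts = [iv.start for iv in tier]
    i = bisect_right(starts, t) - 1
    if i >= 0 and tier[i].start <= t < tier[i].end:
        return tier[i]
    return None


def containing(tiers, unit, level):
    """Find the interval on `level` that contains the midpoint of `unit`."""
    midpoint = (unit.start + unit.end) / 2
    return interval_at(tiers[level], midpoint)


# Invented toy data for the phrase "good morning"; labels are SAMPA-like
# placeholders and the timings are made up, not taken from the corpus.
tiers = {
    "word": [
        Interval(0.00, 0.42, "good"),
        Interval(0.42, 0.95, "morning"),
    ],
    "syllable": [
        Interval(0.00, 0.42, "gUd"),
        Interval(0.42, 0.70, "mO:"),
        Interval(0.70, 0.95, "nIN"),
    ],
    "phoneme": [
        Interval(0.00, 0.10, "g"),
        Interval(0.10, 0.30, "U"),
        Interval(0.30, 0.42, "d"),
        Interval(0.42, 0.50, "m"),
        Interval(0.50, 0.70, "O:"),
        Interval(0.70, 0.78, "n"),
        Interval(0.78, 0.88, "I"),
        Interval(0.88, 0.95, "N"),
    ],
}

# Which syllable and word does the phoneme "I" belong to?
vowel = tiers["phoneme"][6]
print(containing(tiers, vowel, "syllable").label)
print(containing(tiers, vowel, "word").label)
```

    With transcripts alone, as in the original SEC, this kind of cross-level
    lookup by time would not be possible.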

    The annotations are available online from the Aix-MARSEC project, but the
    recordings are available from them on CD-ROM only.

    -----------------------------------------------------------------------
    Quoting from:
    http://www.lpl.univ-aix.fr/~EPGA/marsec_com/Auran_Bouzon_Hirst_SP2004.pdf

    "For compatibility and processing reasons, the 332-minute long
    audio component is available under the form of 408
    16 kHz .wav format files.

    "The annotation component currently comprises the 9
    different levels mentioned earlier: phonemes, syllables,
    subsyllabic constituents, words, stress feet, rhythm units, minor
    and major intonation units, INTSINT coding and the
    corresponding values of the targets in Hz. Each level is
    represented by a separate tier in Praat TextGrids (as illustrated
    in figure 1). Two supplementary levels, based on the syntactic
    annotation of the corpus using the CLAWS system and a
    Property Grammar system developed in the Laboratoire Parole
    et Langage in Aix-en-Provence are to be integrated soon, thus
    allowing not only future analyses taking into account the
    grammatical tagging and parsing of the data, but also the direct
    comparison of automatic syntactic annotation systems.

    "The Aix-MARSEC tools consist of a set of reference files
    (grapheme-phoneme conversion dictionaries) and (multiplatform)
    Praat and Perl scripts."
    ------------------------------------------------------------------------
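
    For anyone who obtains the recordings, a quick sanity check against the
    figures quoted above (408 files, 16 kHz, 332 minutes) could look roughly
    like the Python sketch below. It uses only the standard-library wave
    module. The directory name in the usage comment is hypothetical, and the
    sketch assumes the files are plain PCM .wav files sitting in one directory,
    which the quoted paper does not state.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sampling rate stated in the quoted paper


def wav_duration_seconds(path):
    """Return the duration of one .wav file in seconds, raising ValueError
    if its sampling rate is not the expected 16 kHz."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        if rate != EXPECTED_RATE_HZ:
            raise ValueError(f"{path}: sampling rate is {rate} Hz")
        return wf.getnframes() / rate


def corpus_summary(directory):
    """Return (number of .wav files, total duration in minutes)."""
    files = sorted(Path(directory).glob("*.wav"))
    total_seconds = sum(wav_duration_seconds(p) for p in files)
    return len(files), total_seconds / 60.0


# Usage (hypothetical directory name):
#   n_files, minutes = corpus_summary("aix_marsec_audio")
#   print(n_files, round(minutes))
```

    Going by the quoted figures, the files would average about 49 seconds
    each (332 x 60 / 408 = 48.8).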

    As one of the two prosodic transcribers of the original IBM/Lancaster SEC
    project, I am delighted that this work has evolved into such a rich resource,
    which will be of immense benefit to those studying the phonetics and
    structure of spoken UK English.

    Best regards

    Briony Williams


