Re: [Corpora-List] Query about corpora of spoken English

From: Briony Williams (b.williams@bangor.ac.uk)
Date: Fri Dec 02 2005 - 16:55:03 MET


    R.M.Salkie@bton.ac.uk wrote:
    > My colleague Nicolas Ballier (nicolas.ballier@lli.univ-paris13.fr)
    > has asked me to post the following two queries. Please reply directly
    > to him.

    It may be useful to others to have the replies in a public forum like this
    one - so here is a quick reply to the CORPORA list.

    > 1. Is there a web page which lists currently available corpora of
    > spoken English (eg MARSEC MAchine REadable Spoken ENglish Corpus), stating
    > whether the sound files are available?

    You could try the catalogue pages of:

    a) Linguistic Data Consortium (LDC), subset "speech":
    http://www.ldc.upenn.edu/Catalog/byType.jsp#speech

    b) Evaluations and Language Resources Distribution Agency (ELDA):
    http://www.elda.org/rubrique6.html

    c) International Computer Archive of Modern and Medieval English (ICAME):
    http://nora.hd.uib.no/whatis.html

    d) The MARSEC corpus:
    http://www.rdg.ac.uk/AcaDepts/ll/speechlab/marsec/

    > 2. Is there software available to align texts and sound files: for
    > example, software that enables the user to listen to any part of the
    > document by clicking on a word in the text?

    First, the sound file needs to be aligned with the linguistic annotation.
    Some popular applications currently used for doing this manually are listed
    below (other applications exist for automatic segmentation of speech files).
    With any of these, you can click on an individual word and listen to it once
    a word-level segmentation has been carried out.

    a) Praat (has a very flexible scripting language):
    http://www.fon.hum.uva.nl/praat/

    b) Emu (segment-level and also higher linguistic levels, plus hierarchical
    structure: has some scripting capability for automatic building of trees):
    http://emu.sourceforge.net/

    c) Transcriber ("It provides a user-friendly graphical user interface for
    segmenting long duration speech recordings, transcribing them, and labeling
    speech turns, topic changes and acoustic conditions. It is more specifically
    designed for the annotation of broadcast news recordings, for creating
    corpora used in the development of automatic broadcast news transcription
    systems, but its features might be found useful in other areas of speech
    research.")
    http://trans.sourceforge.net/en/presentation.php

    d) MATE workbench ("a program designed to aid in the display, editing and
    querying of annotated speech corpora")
    http://www.cogsci.ed.ac.uk/~dmck/MateCode/

    These are by no means the only tools available (I have omitted xlabel, as it
    is no longer supported).
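
    To illustrate why the word-level segmentation has to come first, here is a
    minimal Python sketch of the lookup that "click on a word and listen"
    relies on. It is not taken from any of the tools above: the SEGMENTS list,
    its time stamps, and the helper names make_silent_wav and word_frames are
    all hypothetical, and a silent in-memory WAV stands in for a real
    recording. It uses only the standard-library "wave" module.

```python
import io
import wave

# Hypothetical word-level segmentation: (word, start_seconds, end_seconds).
# In practice these times would come from a manual or automatic alignment.
SEGMENTS = [
    ("hello", 0.00, 0.40),
    ("world", 0.40, 1.00),
]


def make_silent_wav(duration_s, rate=16000):
    """Build an in-memory mono 16-bit WAV of silence (stand-in recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(duration_s * rate))
    buf.seek(0)
    return buf


def word_frames(wav_file, segments, index):
    """Return (word, raw audio bytes) for the word at the given index."""
    word, start, end = segments[index]
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        first = int(round(start * rate))          # first frame of the word
        count = int(round(end * rate)) - first    # number of frames to read
        w.setpos(first)
        return word, w.readframes(count)


# "Clicking" on the second word selects its slice of the recording.
word, audio = word_frames(make_silent_wav(1.0), SEGMENTS, 1)
print(word, len(audio))
```

    The returned bytes could then be handed to whatever audio playback
    facility is available. The point is only that each word needs a start and
    end time before it can be selected and played on its own.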

    Best regards

    Briony Williams


