Re: [Corpora-List] Corpus Benevolence

From: Steven Bird (sb@csse.unimelb.edu.au)
Date: Sun Feb 11 2007 - 08:16:38 MET

    On 2/10/07, Adam Kilgarriff <adam@lexmasterclass.com> wrote:
    > - how do you describe a corpus?

    One minimalist answer to this question is "Use OLAC Metadata", because
    it provides uniform descriptors that help with resource discovery.

    OLAC, the Open Language Archives Community, is an international
    partnership of institutions and individuals who are creating a
    worldwide virtual library of language resources by: (i) developing
    consensus on best current practice for the digital archiving of
    language resources, and (ii) developing a network of interoperating
    repositories and services for housing and accessing such resources.
    http://www.language-archives.org/

    OLAC extends Dublin Core Metadata by providing vocabularies for
    describing language resources, including language identification,
    linguistic data type, discourse type, and linguistic subject.
    http://www.language-archives.org/REC/olac-extensions.html
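
    To make the idea concrete, here is a rough sketch of how such a
    record might be assembled in Python. It is not taken from the OLAC
    documentation: the OLAC namespace URI, the "olac:code" attribute,
    the xsi:type names, and all of the sample values are assumptions
    for illustration, and the helper names are hypothetical. Check the
    extensions page above before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Dublin Core and XML Schema instance namespaces (standard URIs).
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
# ASSUMPTION: OLAC namespace URI; verify against the OLAC specification.
OLAC = "http://www.language-archives.org/OLAC/1.1/"

for prefix, uri in (("dc", DC), ("xsi", XSI), ("olac", OLAC)):
    ET.register_namespace(prefix, uri)


def coded(parent, dc_element, olac_type, code):
    """Add a Dublin Core element refined by an OLAC vocabulary (hypothetical helper)."""
    el = ET.SubElement(parent, f"{{{DC}}}{dc_element}")
    el.set(f"{{{XSI}}}type", f"olac:{olac_type}")  # assumed extension name
    el.set(f"{{{OLAC}}}code", code)                # assumed attribute name
    return el


def describe_corpus(title, language, linguistic_type, discourse_type, linguistic_field):
    """Build one record covering the four vocabularies named in the text."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    coded(record, "subject", "language", language)
    coded(record, "type", "linguistic-type", linguistic_type)
    coded(record, "type", "discourse-type", discourse_type)
    coded(record, "subject", "linguistic-field", linguistic_field)
    return record


# Sample values are invented for illustration only.
record = describe_corpus(
    "Example Narrative Corpus", "eng", "primary_text", "narrative", "syntax"
)
print(ET.tostring(record, encoding="unicode"))
```

    The point of the design is that every descriptor is an ordinary
    Dublin Core element plus a controlled code, so a tool that only
    understands Dublin Core can still read the record.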

    Many repositories of language resources categorize their holdings
    using OLAC Metadata, including LDC, SIL, LINGUIST List, the Rosetta
    Project, and TalkBank, among others.
    http://www.language-archives.org/archives.php4

    Once corpora are categorized in this way, they can be searched. OLAC
    has a federated search service that permits all repositories to be
    searched simultaneously. (Part of the inspiration for this was all
    the queries for obscure resources that have appeared on this list.)
    http://www.language-archives.org/tools/search/
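
    As a toy illustration of why uniform descriptors make this
    possible, the sketch below runs one query over the records of
    several repositories at once. It does not model how the real OLAC
    search service is built: the Record fields, the archive names, and
    the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A simplified metadata record (hypothetical fields)."""
    archive: str
    title: str
    language: str
    linguistic_type: str


def federated_search(repositories, **criteria):
    """Return every record, from any repository, matching all criteria."""
    hits = []
    for records in repositories.values():
        for record in records:
            if all(getattr(record, key) == value for key, value in criteria.items()):
                hits.append(record)
    return hits


# Invented sample holdings for two made-up archives.
repositories = {
    "archive_a": [
        Record("archive_a", "Spoken Dialogues", "eng", "primary_text"),
        Record("archive_a", "Learner Dictionary", "eng", "lexicon"),
    ],
    "archive_b": [
        Record("archive_b", "Folk Tales", "fra", "primary_text"),
        Record("archive_b", "News Texts", "eng", "primary_text"),
    ],
}

results = federated_search(repositories, language="eng", linguistic_type="primary_text")
for record in results:
    print(record.archive, record.title)
```

    Because both archives use the same field names and codes, the
    query needs no per-archive translation step.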

    A paper that synthesizes all this appeared in Literary and
    Linguistic Computing:
    Simons, Gary and Steven Bird (2003). The Open Language Archives
    Community: An infrastructure for distributed archiving of language
    resources. Literary and Linguistic Computing 18: 117-128.
    http://arxiv.org/abs/cs.CL/0306040

    -Steven Bird


