Corpora: Large corpora with virtually unlimited annotation based on n-grams

From: Mark Davies (mdavies@ilstu.edu)
Date: Wed Oct 24 2001 - 18:48:20 MET DST


    I'd be interested in references or pointers to any large corpora that have the following characteristics:

    1) Fairly large -- at least 50 million words

    2) Publicly available / obtainable, hopefully even via the web

    3) [And most important:] The organization of the corpus is more or less as follows--

    The corpus itself is only marginally annotated, if at all.  However, there are databases containing a list of all distinct n-grams (at least 1-, 2-, and 3-grams), which can be queried, and whose output can then be used to search the actual corpus itself.  Most importantly, these databases of n-grams are linked to other databases that contain info on POS, lemma, synonyms, etc.  This joining of databases means that searches can be made not just on the n-grams, but on POS and lemma as well, providing searches like (for Spanish):

    *.pn_obj    querer.*    *.v_inf
    [a clitic followed by any form of "querer" (to want) followed by an infinitive]

    !mandar.*    *    *.v_subj_se
    [all of the forms of any synonym of "mandar" (to order)  followed two words later by a past subjunctive]
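
    As a rough illustration of how the first search above could be answered by joining an n-gram table to a word-level table, here is a minimal sketch.  It is not taken from any existing corpus: the use of SQLite, the table names (lexicon, trigrams), the column names, and the toy rows are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: one table of distinct 3-grams with frequencies,
# and one word-level table holding lemma and POS for each word form.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon  (word TEXT, lemma TEXT, pos TEXT);
CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER);
""")

# Toy data, invented for illustration only.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("lo", "lo", "pn_obj"),
    ("me", "me", "pn_obj"),
    ("quiero", "querer", "v_pres"),
    ("quería", "querer", "v_imperf"),
    ("hacer", "hacer", "v_inf"),
    ("ver", "ver", "v_inf"),
    ("casa", "casa", "n"),
    ("la", "la", "det"),
])
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", [
    ("lo", "quiero", "hacer", 12),
    ("me", "quería", "ver", 5),
    ("la", "casa", "ver", 1),
    ("lo", "quiero", "casa", 2),
])

def clitic_querer_infinitive(conn):
    """Approximate the search  *.pn_obj  querer.*  *.v_inf :
    join each slot of the 3-gram to the lexicon and filter on POS / lemma."""
    return conn.execute("""
        SELECT t.w1, t.w2, t.w3, t.freq
        FROM trigrams t
        JOIN lexicon a ON a.word = t.w1
        JOIN lexicon b ON b.word = t.w2
        JOIN lexicon c ON c.word = t.w3
        WHERE a.pos = 'pn_obj'
          AND b.lemma = 'querer'
          AND c.pos = 'v_inf'
        ORDER BY t.freq DESC
    """).fetchall()

hits = clitic_querer_infinitive(conn)
```

    The matching 3-grams returned this way would then be used as literal strings to look up the actual passages in the unannotated corpus text.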

    Since the lists of n-grams are merely linked to other databases containing POS, lemma, synonyms, etc., the number of annotation levels is essentially unlimited.  It's just a function of however many separate databases a person wants to create and link to the main n-grams database.  This would even allow users of the corpus to create their own "custom lists" of words, which could be stored in a certain database, and then used as part of the syntax for subsequent searches.
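
    The "custom lists" idea can be sketched in the same way: a user-defined list is just one more table that gets joined to the n-grams.  Again, this is only an illustration under assumed names (bigrams, custom_lists, save_list, the list name "colores") and invented data, using SQLite because it is convenient, not because the post mentions it.

```python
import sqlite3

# Hypothetical schema: distinct 2-grams plus a table of user-defined word lists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigrams      (w1 TEXT, w2 TEXT, freq INTEGER);
CREATE TABLE custom_lists (list_name TEXT, word TEXT);
""")

# Toy data, invented for illustration only.
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", [
    ("camisa", "roja", 7),
    ("camisa", "nueva", 4),
    ("casa", "verde", 3),
    ("casa", "grande", 9),
])

def save_list(conn, name, words):
    """Store a user's word list under a name so later searches can refer to it."""
    conn.executemany(
        "INSERT INTO custom_lists VALUES (?, ?)",
        [(name, w) for w in words],
    )

def bigrams_with_list_in_second_slot(conn, name):
    """Return 2-grams whose second word belongs to the named custom list."""
    return conn.execute("""
        SELECT b.w1, b.w2, b.freq
        FROM bigrams b
        JOIN custom_lists c ON c.word = b.w2
        WHERE c.list_name = ?
        ORDER BY b.freq DESC
    """, (name,)).fetchall()

save_list(conn, "colores", ["roja", "verde", "azul"])
color_hits = bigrams_with_list_in_second_slot(conn, "colores")
```

    Adding a new level of annotation in this design means adding a table and a join, with no change to the corpus text or to the n-gram table itself.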

    In addition, since the databases are fairly static, they can contain frequency information that can be included as part of the search, e.g. all of the 2-grams whose second element is a synonym of a given word, and which appear more than three times in a given segment of the corpus.
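
    The frequency example in the previous paragraph could look something like the following sketch.  The per-segment frequency table, the synonyms table, the segment labels ("1800s", "1900s"), and all of the rows are assumptions invented for the example, not a description of an actual corpus.

```python
import sqlite3

# Hypothetical schema: 2-gram frequencies stored per corpus segment,
# plus a synonym table mapping a headword to each of its synonyms.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigram_freq (w1 TEXT, w2 TEXT, segment TEXT, freq INTEGER);
CREATE TABLE synonyms    (headword TEXT, synonym TEXT);
""")

# Toy data, invented for illustration only.
conn.executemany("INSERT INTO bigram_freq VALUES (?, ?, ?, ?)", [
    ("a", "ordenar", "1800s", 5),
    ("a", "ordenar", "1900s", 2),
    ("para", "mandar", "1800s", 4),
    ("sin", "exigir", "1800s", 3),
    ("de", "comer", "1800s", 10),
])
conn.executemany("INSERT INTO synonyms VALUES (?, ?)", [
    ("mandar", "mandar"),   # the headword is listed as its own synonym here
    ("mandar", "ordenar"),
    ("mandar", "exigir"),
])

def frequent_synonym_bigrams(conn, headword, segment, min_freq):
    """2-grams whose second word is a synonym of `headword` and which occur
    more than `min_freq` times in the given segment."""
    return conn.execute("""
        SELECT f.w1, f.w2, f.freq
        FROM bigram_freq f
        JOIN synonyms s ON s.synonym = f.w2
        WHERE s.headword = ?
          AND f.segment = ?
          AND f.freq > ?
        ORDER BY f.freq DESC
    """, (headword, segment, min_freq)).fetchall()

frequent = frequent_synonym_bigrams(conn, "mandar", "1800s", 3)
```

    Because the frequencies are precomputed and stored with the n-grams, the threshold is applied as an ordinary filter rather than by counting matches in the corpus at query time.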

    My reason for asking is two-fold.  First, I'm working on a corpus similar to this for Spanish, and would like to look at other corpora that have taken the same approach.  Second, I was talking to a colleague last week, and his impression is that corpora such as these are quite common, and that they've been around since the mid-1980s.  Since I work primarily in Spanish, however, I'm less familiar with the underlying structure of corpora in English and other languages, so I'm not so sure that corpora such as these are in fact all that common.  Most of the large publicly-available corpora that I'm familiar with have (I believe) an organization in which most of the annotation is in the corpus itself, rather than in separate databases (based on n-grams) whose output is then linked to the corpus itself.

    At any rate, I'd appreciate any references that you might have, and will post a summary if there's interest.

    Thanks,

    Mark Davies


    ====================================================
    Mark Davies, Associate Professor, Spanish Linguistics
    4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    309-438-7975 (voice) / 309-438-8083 (fax)
         http://mdavies.for.ilstu.edu
    ** Historical and dialectal Spanish and Portuguese syntax **
    ** Corpus design and use / Web-database scripting /  Distance education **
    =====================================================


