[Corpora-List] POS tagging via relational databases

From: Mark Davies (Mark_Davies@byu.edu)
Date: Wed Sep 24 2003 - 21:17:55 MET DST

    Is anyone aware of projects in which relational databases have been used
    to do POS tagging? Rather than passing through a linear text token by
    token, it would all be done via adjacent rows in the database, using
    subqueries or JOINs. For example, you would have a table with N rows,
    where N is the number of words in the corpus. Each row would have
    the following structure (lemma would probably be here as well):

            ID      word    pos
            -----   -----   -----
            . . .
            516     the     AT0
            517     play    NN1
            518     by      PRP
            519     Ibsen   NP0
            . . .
            1450    wants   VVZ
            1451    to      TO0
            1452    play    VVI
            . . .

    To disambiguate words like <play, strike, hit> as nouns after a
    determiner, the query would look something like:

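            -- SQL Server syntax: update the second row of each adjacent pair
            UPDATE t2
            SET t2.pos = 'NN1'
            FROM tagger AS t1
            INNER JOIN tagger AS t2
                    ON t2.ID = t1.ID + 1    -- t2 is the token right after t1
            WHERE t2.word = 'play' AND t1.pos = 'AT0'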
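
For anyone who wants to try the idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It is not the query above verbatim: it expresses the same "previous row" condition as a correlated subquery, which SQLite accepts in all versions. The sample rows, the '?' placeholder for an unresolved tag, and the function name are assumptions made for illustration.

```python
import sqlite3

def tag_play_after_article(rows):
    """Retag <play> as NN1 wherever the immediately preceding row is AT0."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
    con.executemany("INSERT INTO tagger VALUES (?, ?, ?)", rows)
    # Correlated subquery: 'prev' is the row whose ID is one less than the
    # row being updated, i.e. the preceding token in the corpus.
    con.execute("""
        UPDATE tagger
        SET pos = 'NN1'
        WHERE word = 'play'
          AND EXISTS (SELECT 1 FROM tagger AS prev
                      WHERE prev.ID = tagger.ID - 1
                        AND prev.pos = 'AT0')
    """)
    result = con.execute("SELECT ID, word, pos FROM tagger ORDER BY ID").fetchall()
    con.close()
    return result

# Illustrative rows; '?' is a hypothetical marker for "not yet disambiguated".
sample = [
    (516, "the", "AT0"),
    (517, "play", "?"),
    (1450, "wants", "VVZ"),
    (1451, "to", "TO0"),
    (1452, "play", "?"),
]

for row in tag_play_after_article(sample):
    print(row)
```

Only row 517 should change here, since row 1452 follows <to> rather than an article.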

    Of course, rather than dealing with specific word forms (e.g. <play>
    above), you could use a subquery to apply it to hundreds or thousands
    of items from another table (e.g. the lexicon). Likewise, you could
    apply it to all words that have a particular POS, as in the following,
    where all doubly-tagged <NN1-VVZ> go to <NN1> after <AT0>:

            -- Same adjacent-row join, keyed on the ambiguity tag instead of the word
            UPDATE t2
            SET t2.pos = 'NN1'
            FROM tagger AS t1
            INNER JOIN tagger AS t2
                    ON t2.ID = t1.ID + 1
            WHERE t2.pos = 'NN1-VVZ' AND t1.pos = 'AT0'
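
Both generalizations can be sketched in the same way. The example below is only an illustration under stated assumptions: it uses SQLite through Python's sqlite3 module rather than SQL Server, the one-column lexicon table of noun/verb-ambiguous forms is hypothetical, and '?' again stands in for an unresolved tag. Rule A picks its targets with a subquery on the lexicon; rule B picks them by the <NN1-VVZ> ambiguity tag.

```python
import sqlite3

def disambiguate_after_article(corpus_rows, noun_verb_forms):
    """Apply two 'noun after article' rules to an in-memory corpus table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
        -- Hypothetical lexicon: one row per noun/verb-ambiguous word form.
        CREATE TABLE lexicon (word TEXT PRIMARY KEY);
    """)
    con.executemany("INSERT INTO tagger VALUES (?, ?, ?)", corpus_rows)
    con.executemany("INSERT INTO lexicon VALUES (?)",
                    [(w,) for w in noun_verb_forms])
    # Rule A: any word listed in the lexicon becomes NN1 after an article.
    con.execute("""
        UPDATE tagger SET pos = 'NN1'
        WHERE word IN (SELECT word FROM lexicon)
          AND EXISTS (SELECT 1 FROM tagger AS prev
                      WHERE prev.ID = tagger.ID - 1 AND prev.pos = 'AT0')
    """)
    # Rule B: any token carrying the NN1-VVZ ambiguity tag becomes NN1
    # after an article, whatever the word form is.
    con.execute("""
        UPDATE tagger SET pos = 'NN1'
        WHERE pos = 'NN1-VVZ'
          AND EXISTS (SELECT 1 FROM tagger AS prev
                      WHERE prev.ID = tagger.ID - 1 AND prev.pos = 'AT0')
    """)
    rows = con.execute("SELECT ID, word, pos FROM tagger ORDER BY ID").fetchall()
    con.close()
    return rows

corpus = [
    (1, "the", "AT0"), (2, "strike", "?"),
    (3, "a", "AT0"),   (4, "hit", "?"),
    (5, "to", "TO0"),  (6, "play", "?"),
    (7, "the", "AT0"), (8, "walk", "NN1-VVZ"),
]
tagged = disambiguate_after_article(corpus, ["play", "strike", "hit"])
```

In this toy corpus <strike> and <hit> are resolved by rule A, <walk> by rule B, and <play> is left alone because it follows <to>.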

    Anyway, assuming a robust relational database (e.g. SQL Server or
    Oracle), it should be possible to tag a decent-sized corpus (e.g. one
    million words) in less than an hour -- perhaps just a few minutes -- by
    doing the following:

    1) inserting POS and lemma information from the lexicon into the corpus
    (via simple UPDATE and JOIN commands), and then
    2) disambiguating, by applying hundreds of rules (like those described
    above) to the tagged corpus

    You could also:

    3) use morphological rules to disambiguate forms. For example, if
    <roller-blading> is not found in the lexicon, you would guess its tag
    from the <-ING> suffix. More powerfully, you could tag forms that are
    not in the lexicon by using subqueries. For example, assuming that
    <mopeds> is not in the lexicon, you could run a subquery to look for
    the base form <moped>, and if it is found as an <NN1>, then you assign
    <NN2> to <mopeds>. Again, this query could be run on many words in the
    corpus all at once -- via a simple UPDATE command.
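
A rough sketch of both guesses, again in SQLite via Python's sqlite3 module. Several things here are assumptions rather than part of the proposal: unknown words are represented by a NULL pos, the tag assigned to <-ING> forms is VVG (the message does not name one), and the "base form" is found by naively stripping a final <s>, which would need real morphology for forms like <churches>.

```python
import sqlite3

def guess_unknown_tags(corpus_rows, lexicon_rows):
    """Guess tags for tokens whose pos is NULL (not found in the lexicon)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
        CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT);
    """)
    con.executemany("INSERT INTO tagger VALUES (?, ?, ?)", corpus_rows)
    con.executemany("INSERT INTO lexicon VALUES (?, ?)", lexicon_rows)
    # Suffix rule: unknown forms ending in -ing get VVG (assumed tag).
    con.execute("""
        UPDATE tagger SET pos = 'VVG'
        WHERE pos IS NULL AND word LIKE '%ing'
    """)
    # Base-form rule: an unknown form ending in -s gets NN2 if the form
    # minus its final letter is listed in the lexicon as NN1.
    con.execute("""
        UPDATE tagger SET pos = 'NN2'
        WHERE pos IS NULL
          AND word LIKE '%s'
          AND EXISTS (SELECT 1 FROM lexicon AS l
                      WHERE l.word = substr(tagger.word, 1, length(tagger.word) - 1)
                        AND l.pos = 'NN1')
    """)
    rows = con.execute("SELECT ID, word, pos FROM tagger ORDER BY ID").fetchall()
    con.close()
    return rows

unknowns = [(1, "roller-blading", None), (2, "mopeds", None), (3, "xyzzys", None)]
guessed = guess_unknown_tags(unknowns, [("moped", "NN1")])
```

Each UPDATE touches every matching row in one statement, so the same two queries would cover all unknown forms in the corpus; <xyzzys> stays untagged because no base form <xyzzy> is in the toy lexicon.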

    In essence, then, the approach to tagging is kind of like a Brill
    tagger, but with all of the disambiguation done within the relational
    database itself.
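
To make the overall flow concrete, here is a small end-to-end sketch of steps 1 and 2: fill in pos and lemma from a lexicon, then run a list of contextual rules, each as one UPDATE. It is a toy under stated assumptions, not a tested tagger: it uses SQLite via Python's sqlite3 module, the rule format (previous tag, current tag, new tag) is invented for the example, and the 'NN1-VVB' ambiguity tag and the second rule are hypothetical. It says nothing about speed on a million-word corpus.

```python
import sqlite3

def tag_corpus(words, lexicon, rules):
    """Lexicon lookup followed by ordered, rule-based disambiguation."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
        CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
    """)
    # One row per token; IDs are consecutive so ID - 1 is the previous token.
    con.executemany("INSERT INTO tagger (ID, word) VALUES (?, ?)",
                    enumerate(words, start=1))
    con.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", lexicon)
    # Step 1: copy lemma and (possibly ambiguous) pos from the lexicon.
    con.execute("""
        UPDATE tagger
        SET pos   = (SELECT l.pos   FROM lexicon AS l WHERE l.word = tagger.word),
            lemma = (SELECT l.lemma FROM lexicon AS l WHERE l.word = tagger.word)
    """)
    # Step 2: apply each rule in order, one UPDATE per rule.
    for prev_pos, old_pos, new_pos in rules:
        con.execute("""
            UPDATE tagger SET pos = ?
            WHERE pos = ?
              AND EXISTS (SELECT 1 FROM tagger AS prev
                          WHERE prev.ID = tagger.ID - 1 AND prev.pos = ?)
        """, (new_pos, old_pos, prev_pos))
    rows = con.execute("SELECT word, lemma, pos FROM tagger ORDER BY ID").fetchall()
    con.close()
    return rows

toy_lexicon = [
    ("the", "the", "AT0"),
    ("play", "play", "NN1-VVB"),   # hypothetical ambiguity tag
    ("wants", "want", "VVZ"),
    ("to", "to", "TO0"),
]
toy_rules = [
    ("AT0", "NN1-VVB", "NN1"),     # noun reading after an article
    ("TO0", "NN1-VVB", "VVI"),     # hypothetical: verb reading after <to>
]
tagged_toy = tag_corpus(["the", "play", "wants", "to", "play"], toy_lexicon, toy_rules)
```

As with a Brill tagger, the rules run in a fixed order and later rules see the output of earlier ones; the difference is that each rule is applied to the whole table in a single statement.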

    Anyway, has anyone seen such an approach? I'd be happy to share a
    summary of your comments, if there is sufficient response.

    Thanks in advance,

    Mark Davies

    =================================================
    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906
    http://davies-linguistics.byu.edu

    ** Corpus design and use // Web-database scripting **
    ** Historical linguistics // Functional-typological grammar **
    ** Spanish and Portuguese historical and dialectal syntax **
    =================================================


