[Corpora-List] POS tagging via relational databases

From: Mark Davies (Mark_Davies@byu.edu)
Date: Wed Sep 24 2003 - 21:17:55 MET DST

Next message: Zhang Le: "Re: [Corpora-List] POS tagging via relational databases"

Previous message: William Fletcher: "Re: [Corpora-List] spanglish corpus"
Next in thread: Zhang Le: "Re: [Corpora-List] POS tagging via relational databases"
Reply: Zhang Le: "Re: [Corpora-List] POS tagging via relational databases"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Is anyone aware of projects in which relational databases have been used
to do POS tagging? Rather than passing through a linear text token by
token, it would all be done via adjacent rows in the database, using
subqueries or JOINs. For example, you would have a table with N rows,
where N is the number of word tokens in the corpus. Each row would have
the following structure (lemma would probably be here as well):

        ID word pos
        ----- ----- -----
        . . .
        516 the AT0
        517 play NN1
        518 by PREP
        519 Ibsen NP0
        . . .
        1450 wants VVZ
        1451 to PRP
        1452 play VVI
        . . .

To disambiguate words like <play, strike, hit> to NOUN after a DET, the
query would look something like:

        -- T-SQL (SQL Server): retag 'play' as NN1 when the previous row is an article
        UPDATE t2
        SET t2.pos = 'NN1'
        FROM tagger AS t1
        JOIN tagger AS t2 ON t2.ID = t1.ID + 1
        WHERE t2.word = 'play' AND t1.pos = 'AT0'
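
For anyone who wants to try this rule without SQL Server, here is a minimal sketch using Python's standard-library sqlite3 module. The table name and columns follow the example above, and the rows are copied from the sample table. Two things are assumptions made for illustration: <play> at row 517 starts out with a hypothetical ambiguity tag 'NN1-VVB', and the rule is written as a correlated EXISTS subquery rather than the UPDATE ... FROM form, since the latter is not available in every SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO tagger (ID, word, pos) VALUES (?, ?, ?)",
    [
        (516, "the", "AT0"),
        (517, "play", "NN1-VVB"),  # assumed: still ambiguous before the rule runs
        (518, "by", "PREP"),
        (519, "Ibsen", "NP0"),
        (1450, "wants", "VVZ"),
        (1451, "to", "PRP"),
        (1452, "play", "VVI"),
    ],
)

# Rule: 'play' becomes NN1 when the row immediately before it is tagged AT0.
cur = conn.execute(
    """
    UPDATE tagger
    SET pos = 'NN1'
    WHERE word = 'play'
      AND EXISTS (
            SELECT 1 FROM tagger AS prev
            WHERE prev.ID = tagger.ID - 1
              AND prev.pos = 'AT0'
          )
    """
)
changed = cur.rowcount  # number of rows the rule retagged
tags = dict(conn.execute("SELECT ID, pos FROM tagger WHERE word = 'play'"))
print(changed, tags)
```

Only row 517 is retagged; row 1452 is left alone because the row before it is not an article.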

Of course, rather than dealing with specific word forms (e.g. <play>
above), you could use a sub-query to apply it to hundreds or thousands
of items from another table (e.g. the lexicon). Likewise, you could
apply it to all words that have a particular POS, as in the following,
where all doubly-tagged <NN1-VVZ> go to <NN1> after <AT0>:

        -- T-SQL (SQL Server): resolve every NN1-VVZ tag to NN1 when the previous row is an article
        UPDATE t2
        SET t2.pos = 'NN1'
        FROM tagger AS t1
        JOIN tagger AS t2 ON t2.ID = t1.ID + 1
        WHERE t2.pos = 'NN1-VVZ' AND t1.pos = 'AT0'
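
The lexicon-driven variant mentioned above (a sub-query in place of a specific word form) might look like the following sketch, again in Python with sqlite3. The `lexicon` table layout, with one row per possible reading of a word, is hypothetical, as are the sample rows and the 'VVB' tag; the point is only that a single UPDATE covers every word the lexicon lists as both noun and verb.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    -- Hypothetical lexicon layout: one row per (word, possible tag).
    CREATE TABLE lexicon (word TEXT, pos TEXT);
    INSERT INTO lexicon VALUES ('play', 'NN1'), ('play', 'VVB'),
                               ('strike', 'NN1'), ('strike', 'VVB'),
                               ('hit', 'NN1'), ('hit', 'VVB'),
                               ('the', 'AT0'), ('go', 'VVB');
    INSERT INTO tagger VALUES (1, 'the', 'AT0'), (2, 'strike', NULL),
                              (3, 'the', 'AT0'), (4, 'hit', NULL),
                              (5, 'go', NULL);
""")

# One statement retags every noun/verb-ambiguous word that follows an article.
conn.execute("""
    UPDATE tagger
    SET pos = 'NN1'
    WHERE word IN (SELECT word FROM lexicon WHERE pos = 'NN1')
      AND word IN (SELECT word FROM lexicon WHERE pos = 'VVB')
      AND EXISTS (SELECT 1 FROM tagger AS prev
                  WHERE prev.ID = tagger.ID - 1 AND prev.pos = 'AT0')
""")
result = conn.execute("SELECT word, pos FROM tagger ORDER BY ID").fetchall()
print(result)
```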

Anyway, assuming a robust relational database (e.g. SQL Server or
Oracle), it should be possible to tag a decent-sized corpus (e.g. one
million words) in less than an hour -- perhaps just a few minutes -- by
doing the following:

1) inserting POS and lemma information from the lexicon into the corpus
(via simple UPDATE and JOIN commands) and then
2) disambiguating, by applying hundreds of rules (like those described
above) to the tagged corpus
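
Step 1 could be sketched as follows, using Python's sqlite3 module so it runs anywhere. The `lexicon` layout (one row per word form, carrying a lemma and a possibly ambiguous tag such as 'NN1-VVB') is an assumption for illustration, as is the `lemma` column on `tagger`. Forms missing from the lexicon are simply left untagged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
    -- Hypothetical lexicon: one row per word form.
    CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
    INSERT INTO lexicon VALUES ('the', 'the', 'AT0'),
                               ('play', 'play', 'NN1-VVB'),
                               ('wants', 'want', 'VVZ');
    INSERT INTO tagger (ID, word) VALUES (1, 'the'), (2, 'play'), (3, 'mopeds');
""")

# Copy tag and lemma from the lexicon into every corpus row with a matching form.
conn.execute("""
    UPDATE tagger
    SET pos   = (SELECT pos   FROM lexicon WHERE lexicon.word = tagger.word),
        lemma = (SELECT lemma FROM lexicon WHERE lexicon.word = tagger.word)
    WHERE EXISTS (SELECT 1 FROM lexicon WHERE lexicon.word = tagger.word)
""")
rows = conn.execute("SELECT word, lemma, pos FROM tagger ORDER BY ID").fetchall()
print(rows)
```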

You could also:

3) use morphological rules to guess tags for unknown forms. For example, if
<roller-blading> is not found in the lexicon, you would guess its tag
from the <-ING> suffix. More powerfully, you could tag forms that are
not in the lexicon by using subqueries. For example, assuming that
<mopeds> is not in the lexicon, you could run a sub-query to look for
the base form <moped>, and if it is found as an <NN1>, then you assign
<NN2> to <mopeds>. Again, this query could be run on many words in the
corpus all at one time, via a simple UPDATE command.
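
Both morphological guesses can be tried out with the sketch below (Python with sqlite3). The plural rule strips a final <-s> and looks the base form up in the lexicon. The suffix rule assigns 'VVG' to untagged <-ing> forms; that particular tag is an assumption, since the text above only says the tag would be guessed from the suffix. The `lexicon` layout and sample words other than <mopeds> and <roller-blading> are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    CREATE TABLE lexicon (word TEXT, pos TEXT);  -- hypothetical layout
    INSERT INTO lexicon VALUES ('moped', 'NN1'), ('walk', 'VVB');
    INSERT INTO tagger (ID, word) VALUES (1, 'mopeds'), (2, 'roller-blading'),
                                         (3, 'walks'), (4, 'xyzzy');
""")

# Plural rule: an untagged form in -s whose base form is an NN1 becomes NN2.
conn.execute("""
    UPDATE tagger
    SET pos = 'NN2'
    WHERE pos IS NULL
      AND word LIKE '%s'
      AND EXISTS (SELECT 1 FROM lexicon
                  WHERE lexicon.word = substr(tagger.word, 1, length(tagger.word) - 1)
                    AND lexicon.pos = 'NN1')
""")

# Suffix rule: any still-untagged form in -ing gets the assumed tag VVG.
conn.execute("""
    UPDATE tagger
    SET pos = 'VVG'
    WHERE pos IS NULL AND word LIKE '%ing'
""")
guessed = conn.execute("SELECT word, pos FROM tagger ORDER BY ID").fetchall()
print(guessed)
```

<walks> stays untagged here because its base form is listed only as a verb, and <xyzzy> because no rule matches it.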

In essence, then, the approach to tagging resembles a Brill tagger,
which also retags words by applying contextual rules in sequence, but
with all of the disambiguation done within the relational database
itself.
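
One way to hold "hundreds of rules" is to store them as rows and apply them in order, as in this sketch (Python with sqlite3). The `rules` table and its columns are hypothetical, and only one rule shape is covered: "tag X becomes Y when the previous tag is Z". The tags for <the> and <to> are copied from the sample table above, and 'NN1-VVB' is an assumed ambiguity tag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    -- Hypothetical rule store: retag from_pos as to_pos after prev_pos.
    CREATE TABLE rules (rule_id INTEGER PRIMARY KEY,
                        from_pos TEXT, prev_pos TEXT, to_pos TEXT);
    INSERT INTO tagger VALUES (1, 'the', 'AT0'), (2, 'play', 'NN1-VVB'),
                              (3, 'to', 'PRP'), (4, 'play', 'NN1-VVB');
    INSERT INTO rules VALUES (1, 'NN1-VVB', 'AT0', 'NN1'),
                             (2, 'NN1-VVB', 'PRP', 'VVI');
""")

# Read the rules first, then apply them one at a time in rule_id order.
rules = conn.execute(
    "SELECT from_pos, prev_pos, to_pos FROM rules ORDER BY rule_id"
).fetchall()

applied = []  # rows changed by each rule, in application order
for from_pos, prev_pos, to_pos in rules:
    cur = conn.execute(
        """
        UPDATE tagger
        SET pos = ?
        WHERE pos = ?
          AND EXISTS (SELECT 1 FROM tagger AS prev
                      WHERE prev.ID = tagger.ID - 1 AND prev.pos = ?)
        """,
        (to_pos, from_pos, prev_pos),
    )
    applied.append(cur.rowcount)

final = conn.execute("SELECT word, pos FROM tagger ORDER BY ID").fetchall()
print(applied, final)
```

Because each rule sees the output of the ones before it, the order of rows in `rules` matters, just as rule order does in a Brill-style tagger.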

Anyway, has anyone seen such an approach? I'd be happy to share a
summary of your comments, if there is sufficient response.

Thanks in advance,

Mark Davies

=================================================
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu

** Corpus design and use // Web-database scripting **
** Historical linguistics // Functional-typological grammar **
** Spanish and Portuguese historical and dialectal syntax **
=================================================


This archive was generated by hypermail 2b29 : Wed Sep 24 2003 - 21:25:06 MET DST