[Corpora-List] Cost of POS tagging, again

From: Kevin B. Cohen (kevin.cohen@gmail.com)
Date: Wed Dec 27 2006 - 00:48:02 MET


    Hi, Marc et al.,

    Christopher's points are well made. A few other things to think
    about:

    1) You seem to be envisioning doing ex nihilo manual POS annotation.
    However, that will probably be neither practical nor desirable; rather,
    you're likely to want to do the initial annotation automatically, and then
    manually curate the tagger's output.
    2) You actually may not want to directly curate the POS tagging at all.
    Rather, if you're going to do further processing--say, syntactic
    parsing--you might want to curate the POS tags as part or byproduct of the
    downstream curation.
    3) Even if you do want to directly curate the POS tagging, you will probably
    find some efficiencies to be gained from automatic means. For example, you
    are more likely to need to correct a bunch of adjective/past participle
    distinctions (I'm assuming here that your data is English) than you are to
    need to correct a bunch of mis-tagged commas (although I have certainly seen
    lots of mis-POS-tagged commas, too!). Scripting can help you out here.
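
    To make point 3 concrete, here is a minimal sketch of the kind of script
    I have in mind. It assumes Penn Treebank tags (JJ for adjective, VBN for
    past participle) and tagger output already loaded as lists of (word, tag)
    pairs; the function names and the "-ed" heuristic are just illustrative
    assumptions, not anything tied to a particular tagger. It pulls out the
    tokens most likely to need a human look, so the curator's time goes to
    those first.

```python
# Assumption: Penn Treebank tagset. JJ = adjective, VBN = past participle.
CONFUSABLE_TAGS = {"JJ", "VBN"}


def flag_participle_candidates(tagged_sentences):
    """Return (sentence_idx, token_idx, word, tag) for every token tagged
    JJ or VBN that ends in "-ed": the classic adjective/participle confusion."""
    flagged = []
    for s_idx, sentence in enumerate(tagged_sentences):
        for t_idx, (word, tag) in enumerate(sentence):
            if tag in CONFUSABLE_TAGS and word.lower().endswith("ed"):
                flagged.append((s_idx, t_idx, word, tag))
    return flagged


def flag_mistagged_commas(tagged_sentences):
    """Return (sentence_idx, token_idx, word, tag) for every comma whose tag
    is not ',' (the Penn Treebank tag for a comma)."""
    flagged = []
    for s_idx, sentence in enumerate(tagged_sentences):
        for t_idx, (word, tag) in enumerate(sentence):
            if word == "," and tag != ",":
                flagged.append((s_idx, t_idx, word, tag))
    return flagged


if __name__ == "__main__":
    # Hypothetical tagger output for one sentence, with a mis-tagged comma.
    sample = [[("The", "DT"), ("tired", "JJ"), ("dog", "NN"),
               ("was", "VBD"), ("walked", "VBN"), (",", "NN")]]
    for hit in flag_participle_candidates(sample) + flag_mistagged_commas(sample):
        print(hit)
```

    The mis-tagged commas can be corrected automatically once found, whereas
    the JJ/VBN candidates still need a human decision; the script only
    narrows down where to look.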

    Finally, Christopher is right to suggest hourly, rather than
    per-token, budgeting.

    Hope this is helpful,

    Kevin

    -- 
    K. B. Cohen
    Biomedical Text Mining Group Lead
    Center for Computational Pharmacology
    303-916-2417 (cell) 303-377-9194 (home)
    http://compbio.uchsc.edu/Hunter_lab/Cohen
    


