RE: [Corpora-List] lemma list wanted

From: D.W.Hardcastle (D.W.Hardcastle@open.ac.uk)
Date: Sat Feb 24 2007 - 00:23:43 MET

    Sorry - I have lost the original thread, but I recall that someone
    wanted lemma and inflection tables.
    I also need to lemmatise and reinflect dictionary words for my PhD
    project, so I have a lemmatiser that is based on CUVPlus
    (http://ota.ahds.ac.uk/texts/2469.html).

    If you are interested:
    I have put a zip file on my website (http://mcs.open.ac.uk/dh5368/). It
    contains a list of inflection-lemma mappings, a list of lemma-inflection
    mappings, and a file called singles.txt, which contains forms in the
    lexicon that could not be reduced.

    The data was extracted from the CUVPlus lexicon by running a lemmatising
    algorithm to reduce every entry in the lexicon and checking the
    resulting proposed lemmas against the lexicon.
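
A minimal sketch of that corroboration step, in Python. The toy lexicon,
the suffix rules and the name build_mappings are all hypothetical and chosen
only for illustration; they are not the actual lemmatiser or the CUVPlus data.

```python
# Toy lexicon of (word form, tag) pairs. Illustrative entries only,
# not taken from CUVPlus.
LEXICON = {
    ("walk", "VV0"), ("walks", "VVZ"), ("walked", "VVD"),
    ("cat", "NN1"), ("cats", "NN2"),
    ("quick", "JJ"), ("quickly", "RR"),
    ("scissors", "NN2"), ("only", "RR"),
}

# Hypothetical reduction rules: inflected tag -> (base tag, suffix rewrites).
RULES = {
    "NN2": ("NN1", [("ies", "y"), ("es", ""), ("s", "")]),
    "VVZ": ("VV0", [("ies", "y"), ("es", ""), ("s", "")]),
    "VVD": ("VV0", [("ied", "y"), ("ed", ""), ("d", "")]),
    "RR": ("JJ", [("ily", "y"), ("ly", "")]),
}


def build_mappings(lexicon, rules):
    """Return (lemmas, singles).

    lemmas maps each (form, tag) to a (lemma, base_tag) that is itself
    present in the lexicon. singles lists forms whose tag suggests they
    are reducible but for which no proposed lemma was found.
    """
    lemmas = {}
    singles = []
    for form, tag in sorted(lexicon):
        if tag not in rules:
            # Treated as already in base form: maps to itself.
            lemmas[(form, tag)] = (form, tag)
            continue
        base_tag, suffix_rules = rules[tag]
        for suffix, replacement in suffix_rules:
            if form.endswith(suffix):
                candidate = form[: -len(suffix)] + replacement
                # Keep the proposal only if the lexicon corroborates it.
                if (candidate, base_tag) in lexicon:
                    lemmas[(form, tag)] = (candidate, base_tag)
                    break
        else:
            # No proposed lemma was found in the lexicon.
            singles.append((form, tag))
    return lemmas, singles


lemmas, singles = build_mappings(LEXICON, RULES)
print(lemmas[("walked", "VVD")])
print(singles)
```

In this toy run "scissors" and "only" end up as singles, because the
proposed bases "scissor" and "on" are not in the lexicon with the expected
base tag.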

    The file lemmas.txt contains inflection-lemma mappings that were
    corroborated by the lexicon and inflect.txt contains the inverse
    mappings. These files include words that are already in base form.
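
As a rough illustration of how the two tables relate, the inverse mapping
can be derived from the inflection-lemma mapping by grouping on the lemma.
The in-memory dictionary below is an assumption made for the example; it
says nothing about the actual layout of lemmas.txt or inflect.txt.

```python
from collections import defaultdict

# Hypothetical inflection -> lemma mapping, keyed by (form, tag).
# Base forms map to themselves, as in the description above.
SAMPLE_LEMMAS = {
    ("walk", "VV0"): ("walk", "VV0"),
    ("walks", "VVZ"): ("walk", "VV0"),
    ("walked", "VVD"): ("walk", "VV0"),
    ("cat", "NN1"): ("cat", "NN1"),
    ("cats", "NN2"): ("cat", "NN1"),
}


def invert(lemmas):
    """Group every (form, tag) under its lemma to get lemma -> inflections."""
    inflections = defaultdict(list)
    for entry, lemma in sorted(lemmas.items()):
        inflections[lemma].append(entry)
    return dict(inflections)


inflect = invert(SAMPLE_LEMMAS)
print(inflect[("walk", "VV0")])
```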

    The singles.txt file contains word forms that, judging by the tag,
    should be reducible but for which no proposed lemma could be found in
    the lexicon. Most are adverbs that have no adjective base form, and many
    are non-count plural forms. There are also some (BNC) tagging errors,
    misspellings and rare word forms. I have included the BNC frequency for
    each entry from the lexicon, as most of the noise is of low frequency.

    Please note that, because every mapping was checked against the lexicon,
    words not covered by the CUVPlus lexicon do not appear in the mappings.

    All the entries in the files are tagged using the C7 tagset.

    The data is a work in progress, but I believe it is fairly clean.
    If you decide to use the mapping tables, please cite my PhD thesis, which
    is at Birkbeck College, University of London, and is due for submission
    later this year.

    Thank you,

    Dave

    -- 
    David Hardcastle
    Research Programmer, Natural Language Generation Group
    Faculty of Mathematics and Computing, room 121, North Spur
    The Open University, Walton Hall, Milton Keynes, MK7 6AA
    +44 (0) 1908 659947
    


