Re: Corpora: lemma vs lexeme

Alex Chengyu Fang (alex@phonetics.UCL.ac.uk)
Thu, 04 Nov 1999 14:35:05 +0000

I developed a tagger and lemmatiser
(http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm) and found
the following criteria particularly helpful:

Lemmatisation is the removal of inflections so that word forms are grouped
together according to their corresponding lemmas, e.g., works, working and
worked -> work. This process does not change the word class.

Proper lemmatisation needs POS information to, for instance, reduce
INTERESTED to INTEREST if it is contextually used as a verb and leave it
untouched when it is used as an adjective.
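
As a rough illustration of the two points above, here is a minimal Python
sketch. It is not the code of my lemmatiser: the suffix rules, the function
name lemmatise and the Penn-style tag labels (VBZ, VBN, JJ) are assumptions
made up for the example.

```python
# Toy POS-aware lemmatiser. The suffix list and the tag labels are
# illustrative assumptions, not the rules of any real tagger.
VERB_SUFFIXES = ["ing", "ed", "s"]

def lemmatise(form, tag):
    """Strip a verbal inflection only when the tag marks the form as a verb."""
    word = form.lower()
    if tag.startswith("V"):
        for suffix in VERB_SUFFIXES:
            # Keep at least three characters so short words are not mangled.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
    # Non-verbs (e.g. adjectives) are left untouched.
    return word

print(lemmatise("works", "VBZ"))        # work
print(lemmatise("INTERESTED", "VBN"))   # interest
print(lemmatise("INTERESTED", "JJ"))    # interested
```

Plain suffix stripping like this fails on forms such as "making" or
"stopped", so a real system would need a lexicon or spelling rules as well.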

Sometimes sense disambiguation and lexical subcategorisation also have to be
available in order to tag and then lemmatise correctly, for example, to decide
whether LAY is a base-form verb or the past tense of LIE.
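
The LAY case can be sketched as a lookup keyed on both the word form and its
tag, assuming the tagger has already resolved the ambiguity. The table, the
function name lemma_for and the tag labels (VB, VBD, VBN) are hypothetical
and only serve the example.

```python
# Hypothetical lookup table: the same form maps to different lemmas
# depending on the tag assigned by the tagger.
IRREGULAR_LEMMAS = {
    ("lay", "VB"): "lay",    # base form: "lay the table"
    ("lay", "VBD"): "lie",   # past tense of LIE: "she lay down"
    ("laid", "VBD"): "lay",
    ("lain", "VBN"): "lie",
}

def lemma_for(form, tag):
    """Return the lemma for a tagged form, defaulting to the lowercased form."""
    word = form.lower()
    return IRREGULAR_LEMMAS.get((word, tag), word)

print(lemma_for("LAY", "VB"))    # lay
print(lemma_for("LAY", "VBD"))   # lie
```

The lookup itself is trivial; the hard part is choosing the right tag, which
is where sense and subcategorisation information comes in.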

Lexematisation is the extension of lemmatisation to deal with derivatives
so that they can be grouped together under the same lexeme, e.g., computer
-> compute. This process typically changes the word class.
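
To make the distinction concrete, here is a small Python sketch in which
lexematisation is a second step applied on top of lemmatisation. Both tables
and all function names are invented for the example; a real system would
derive them from a lexicon and from POS-tagged input.

```python
from collections import defaultdict

# Hypothetical toy tables, for illustration only.
INFLECTION_BASE = {"computers": "computer", "computes": "compute",
                   "computed": "compute"}
DERIVATION_BASE = {"computer": "compute", "computation": "compute"}

def lemmatise(form):
    """Remove inflection only: computers -> computer."""
    word = form.lower()
    return INFLECTION_BASE.get(word, word)

def lexematise(form):
    """Lemmatise, then map a derivative to its base: computers -> compute."""
    lemma = lemmatise(form)
    return DERIVATION_BASE.get(lemma, lemma)

def group_by_lexeme(forms):
    """Group word forms under their shared lexeme."""
    groups = defaultdict(list)
    for form in forms:
        groups[lexematise(form)].append(form)
    return dict(groups)

print(lemmatise("computers"))    # computer
print(lexematise("computers"))   # compute
print(group_by_lexeme(["computers", "computed", "work"]))
```

Here "computers" keeps its noun status under lemmatisation (computer) but
ends up under the verb compute after lexematisation.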

-------------------------------------------------
Alex Chengyu Fang
Senior Research Fellow
Department of Phonetics and Linguistics
University College London
Wolfson House, 4 Stephenson Way, London NW1 2HE, UK
Tel: 44 (0)171 504 5026
Fax: 44 (0)171 383 0752
WWW: http://www.phon.ucl.ac.uk/home/alex/home.htm
-------------------------------------------------