Re: [Corpora-List] Incidence of MWEs

From: Ken Litkowski (ken@clres.com)
Date: Tue Mar 14 2006 - 20:09:46 MET

Next message: Gaël Dias: "Re: [Corpora-List] Incidence of MWEs"

Previous message: David Ahn: "[Corpora-List] Call for Participation: EACL 2006 Workshop on Multi-dimensional Markup in NLP/5th Workshop on NLP and XML"
In reply to: David Brooks: "[Corpora-List] Incidence of MWEs"
Next in thread: Erin McKean: "[Corpora-List] CALL FOR PAPERS: Dictionary Society of North America"
Next in thread: Amsler, Robert: "RE: [Corpora-List] Incidence of MWEs"
Reply: Erin McKean: "[Corpora-List] CALL FOR PAPERS: Dictionary Society of North America"

This discussion has the earmarks of the theoretical, speculative, and
the hyperstatistical. I think a practical and newly available method
can be used.

Lexicographers, most notably OUP, have compiled their lists of what
they think constitute MWEs. In the electronic XML version of the Oxford
Dictionary of English (ODE), there is an NLP element that incorporates a
quite thorough list of all the variants, including placeholders in
phrases like "give [someone] a hard time". James McCracken, in a fit of
genius, has created an Online ODE and has produced a rudimentary
disambiguation of all content (non-boring) words in the dictionary. To do this, James
first created an index of all the variants, "squeezing" the phrases
together (e.g., "byandlarge"). As the first step in disambiguating, he
searches for longest phrases, starting from 5 words and continuing down
to 2 words, under the assumption that a phrasal reading is preferred to
a compositional reading. Upon walking through the Perl script that does
this (in about half an hour for the entire dictionary), my first reaction
after "wow" was to wonder what proportion of the definitions consist of
these phrases. I haven't counted this yet, but it is simple, requiring just
a couple of modifications in the script to make the necessary counts. The
Perl script also is written in such a way that the same subroutines can
be applied to free text. This is all available to interested
researchers who would like to investigate these issues. (And also, it's
important to say that Adam Kilgarriff was a guiding spirit to James'
initial forays.)
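
To make the lookup described above concrete, here is a minimal sketch in Python. It is not James' Perl script, which is not reproduced in this message. The function names (squeeze, build_index, find_mwes, mwe_token_proportion) and the sample phrase list are invented for illustration. The sketch assumes a greedy left-to-right scan that tries the longest window first at each position, and it ignores placeholder variants such as "give [someone] a hard time".

```python
def squeeze(phrase):
    """Join the words of a phrase into one lowercase key,
    e.g. 'by and large' -> 'byandlarge'."""
    return "".join(phrase.lower().split())


def build_index(phrases):
    """Build a set of squeezed keys from a list of known phrases."""
    return {squeeze(p) for p in phrases}


def find_mwes(tokens, index, max_len=5, min_len=2):
    """Scan tokens left to right; at each position try windows of
    max_len words down to min_len words and keep the first (longest)
    window whose squeezed form is in the index.

    Returns a list of (start_position, length_in_tokens) pairs.
    """
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, min_len - 1, -1):
            window = tokens[i:i + n]
            if len(window) == n and squeeze(" ".join(window)) in index:
                matches.append((i, n))
                i += n  # the phrasal reading wins; skip past the phrase
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return matches


def mwe_token_proportion(tokens, index):
    """Fraction of tokens that fall inside a matched phrase."""
    if not tokens:
        return 0.0
    covered = sum(n for _, n in find_mwes(tokens, index))
    return covered / len(tokens)


# Hypothetical phrase list and sentence, for illustration only.
index = build_index(["by and large", "hard time"])
tokens = "by and large the plan worked".split()
matches = find_mwes(tokens, index)
proportion = mwe_token_proportion(tokens, index)
```

The counting step is the kind of small modification mentioned above: once the matches are recorded, the proportion is a sum and a division. One caveat with squeezed keys is that different word sequences can collapse to the same string, so a real index would probably need to store the word boundaries as well.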

Based strictly on a casual perusal of the resulting Online ODE, I would
say that a 2% figure is much more likely than a 30% (or even 70%) figure.

Ken

David Brooks wrote:

> Dear Corpora-folk,
>
> I was wondering if anyone has estimated the incidence of multi-word
> expressions in language. I know that empirical estimates are tied to
> particular corpora, but does anyone have an account of MWEs for
> particular corpora, so that "ball-park" figures of the proportion of
> MWEs can be estimated?
>
> Better yet, can anyone give me a good reference for the incidence of MWEs?
>
> Regards,
> David

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com


This archive was generated by hypermail 2b29 : Tue Mar 14 2006 - 21:01:37 MET