RE: Corpora: MWUs and frequency

ralf.steinberger@jrc.it
07 Oct 1998 15:40:07 +0200


Jean Hudson wrote:

I'd be interested to hear what Przemek intends to use frequency lists for
and, indeed, what others have to say about the significance of frequency.

We are interested in frequency lists of multi-word units (MWUs) as a resource for the automatic indexing of texts. It is sometimes useful to consider MWUs such as 'liquid crystal display' and 'hot dog' instead of the individual words 'display', 'hot', 'liquid', 'dog' and 'crystal'. For this purpose, MWUs such as 'the project' and 'experience of' are obviously irrelevant, whereas 'British Council' and 'environmental education' are potentially good candidates. To get a list that consists mainly of good candidates, it is important to disallow a rather large set of stop words at either end of the expression. Stop words should still be allowed inside the expression, so that MWUs such as 'table of contents' and 'Member of Parliament' are not excluded.
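
To make the filtering rule concrete, here is a minimal Python sketch of it. It is only an illustration under simple assumptions: whitespace tokenisation, lower-cased input, and a deliberately tiny stop word list. The names candidate_mwus and STOP_WORDS are hypothetical and do not refer to WordSmith Tools or any other existing tool.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much larger.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "for", "is", "on"}


def candidate_mwus(tokens, min_len=2, max_len=3, stop_words=STOP_WORDS):
    """Count n-grams whose first and last tokens are not stop words.

    Stop words inside the n-gram are allowed, so 'table of contents'
    survives while 'the table' and 'table of' are discarded.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tokens[i:i + n]
            # Reject the n-gram if a stop word sits at either end.
            if ngram[0] in stop_words or ngram[-1] in stop_words:
                continue
            counts[" ".join(ngram)] += 1
    return counts


# Assumed toy input, lower-cased and split on whitespace.
text = ("the table of contents lists the liquid crystal display "
        "and the table of contents")
counts = candidate_mwus(text.lower().split())

# Print the surviving candidates, most frequent first.
for mwu, freq in counts.most_common():
    print(freq, mwu)
```

Note that a purely frequency-based filter like this still lets through sequences such as 'contents lists', so the resulting list contains candidates only, not confirmed MWUs.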

As far as I know, WordSmith Tools does not currently provide a facility to strip stop words from the ends of MWUs.

I'd be grateful for any pointers to frequency lists of such 'meaningful' MWUs.

Ralf

Ralf Steinberger (ralf.steinberger@jrc.it)
European Commission, Joint Research Center
ISIS - Advanced Techniques for Information Analysis (http://www.jrc.org/isis/atia/)
T.P. 361, I - 21020 Ispra (VA)