Re: Perplexity results using BNC

Tony Rose (tgr@hplb.hpl.hp.com)
Wed, 3 Jul 1996 11:25:24 +0100

> From: Miles Osborne <mosborne@csd.abdn.ac.uk>
> Date: Tue, 2 Jul 1996 18:09:46 +0100 (BST)
> Resent-Date: Tue, 2 Jul 1996 19:05:45 +0200
> Resent-From: corpora-request@lists.uib.no
>
> Hello. Has anyone done any work on building language models
> (eg. ngrams) from the British National Corpus?

Yes, I've spent most of the last 6 months working in precisely this area.

> In particular,
> I'm interested in the perplexities of the resulting models. From
> what I gather, perplexity varies according to genre, and so results
> cannot necessarily be compared with those for models constructed on
> non-BNC material.
>

Absolutely. There are in fact several relevant parameters: domain, genre (not
always the same as domain), vocabulary size, n-gram count cutoffs, the amount
and source of the test data, and so on.
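
To make the dependence on these parameters concrete, here is a minimal sketch
of a bigram perplexity calculation. It is only an illustration, not the setup
used for the BNC experiments: the function names (train_bigram, perplexity),
the "<unk>" token for out-of-vocabulary words, and the use of add-one
smoothing are all assumptions chosen for brevity.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def train_bigram(tokens, vocab_size, cutoff=0):
    """Build a bigram model from a token list.

    vocab_size: keep only this many of the most frequent word types.
    cutoff: discard bigrams seen `cutoff` times or fewer.
    """
    word_counts = Counter(tokens)
    vocab = {w for w, _ in word_counts.most_common(vocab_size)} | {UNK}
    mapped = [w if w in vocab else UNK for w in tokens]

    bigrams = Counter(zip(mapped, mapped[1:]))
    bigrams = Counter({bg: c for bg, c in bigrams.items() if c > cutoff})

    # Context totals are taken from the surviving bigrams so that the
    # smoothed probabilities for each context still sum to one.
    contexts = Counter()
    for (prev, _), c in bigrams.items():
        contexts[prev] += c
    return vocab, contexts, bigrams


def perplexity(model, test_tokens):
    """Perplexity of the add-one smoothed bigram model on test_tokens."""
    vocab, contexts, bigrams = model
    mapped = [w if w in vocab else UNK for w in test_tokens]
    if len(mapped) < 2:
        raise ValueError("need at least two test tokens")

    v = len(vocab)
    log_prob = 0.0
    for prev, word in zip(mapped, mapped[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + v)
        log_prob += math.log2(p)

    n_predictions = len(mapped) - 1
    return 2 ** (-log_prob / n_predictions)


if __name__ == "__main__":
    train = "the cat sat on the mat the cat sat".split()
    model = train_bigram(train, vocab_size=10)
    print(perplexity(model, "the cat sat on the mat".split()))
    print(perplexity(model, "mat the on sat cat the".split()))
```

Changing vocab_size, cutoff, or the test text changes the figure that comes
out, which is why perplexities are only comparable when all of these are
held fixed.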

The results will be published soon. In the meantime, for a more general
discussion of the genre issue, you could try looking at some of Doug Biber's
work (e.g. the paper(s) in the special issue of Computational Linguistics on
Large Text Corpora, 1994, I think) or Adam Kilgarriff's (e.g. the 1996 AISB
Workshop on Language Engineering, Eds. Evett & Rose).

Cheers,
Tony Rose
=======================================================================
| Dr. T.G. Rose                      Interaction Technology Dept.     |
| email : tgr@hplb.hpl.hp.com        Personal Systems Lab,            |
| WWW   : http://www-uk.hpl.hp.com/  Hewlett-Packard Laboratories,    |
| phone : 0117 9228488               Bristol BS12 6QZ                 |
| fax   : 0117 9228920               England                          |
=======================================================================