Re: Corpora: corpus equilibrium?

John Aitchison (jaitchison@acm.org)
Fri, 10 Oct 1997 18:37:51 +0000

td wrote:
> the problem that you described (given a random variable, estimate
> something) is really a subset of the real problem.
>
> the real problem is given a random variable and a number of different
> experiments you can do at various specified costs, estimate
> something within a fixed budget. stratified sampling is a sort of
> subset of this framework in which the cost of each sample is equal.

Well, I am not at all surprised that I have got hold of the wrong end
of the stick and that there is a much larger can of worms here.

But stratification does NOT assume equal costs, and there is a ton of
work in the sampling literature on optimum sample design with
differential costs.
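
To make the differential-cost point concrete, here is a minimal sketch of
the textbook cost-optimal allocation for a stratified sample mean, in which
the sample size in stratum h is proportional to W_h * S_h / sqrt(c_h).
The function name and all the numbers are my own illustrative assumptions,
not figures from any real corpus, and the finite population correction is
ignored.

```python
from math import sqrt

def optimum_allocation(weights, std_devs, unit_costs, budget):
    """Real-valued stratum sample sizes that minimise the variance of the
    stratified mean for a fixed variable budget (before rounding).

    weights    -- stratum weights W_h (population shares, summing to 1)
    std_devs   -- within-stratum standard deviations S_h
    unit_costs -- cost c_h of one sampling unit in each stratum
    budget     -- total amount available to spend on sampling units
    """
    strata = list(zip(weights, std_devs, unit_costs))
    # Scale factor chosen so that sum(c_h * n_h) equals the budget exactly.
    scale = budget / sum(w * s * sqrt(c) for w, s, c in strata)
    # n_h is proportional to W_h * S_h / sqrt(c_h).
    return [scale * w * s / sqrt(c) for w, s, c in strata]

# Hypothetical strata: usenet, newswire, spoken (all values made up).
sizes = optimum_allocation(
    weights=[0.5, 0.3, 0.2],
    std_devs=[1.0, 1.0, 1.0],
    unit_costs=[1.0, 4.0, 16.0],
    budget=1000.0,
)
```

With equal unit costs and equal standard deviations this reduces to
proportional allocation; expensive strata get fewer units, but only by a
factor of the square root of their cost.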


> to make this concrete, suppose that you can get a certain amount of
> usenet news text per day, and that you can get a certain amount of
> newswire text per dollar (pound, ecu...) and that you can get a
> certain much smaller amount of spoken language per dollar and that you
> can digitize some amount of out of copyright literary text per dollar.
>
> now further suppose that you have $100,000 and 100 days to build a
> corpus. how do you allocate your resources to optimally estimate some
> quantity (i.e. the frequency of the word "bank")? and how do you
> allocate your resources to optimally estimate 10^7 parameters (a
> speech recognition language model)?

This problem is too "diffuse" for me, but generally speaking costs
and constraints ARE formally taken into account in the sampling
literature. I don't think you will find them as explicitly
addressed in the experimental design field, but the newer optimal
design approaches could certainly be adapted to yield an 'optimum'
design within a cost constraint. (Have a look at Nam Nguyen's or
Dennis Lim's work on near orthogonal designs, supersaturated designs
and so on: once the orthogonality requirement is relaxed, lots of
designs become possible.)

But I agree: optimum design for woolly/fuzzy objectives is difficult.

> and finally, given multiple competing goals with specified value
> (political value, mostly), how do you come out smelling like a rose?

If this is seen as a sampling or experimental design problem, it
is most certainly addressed in the literature.

John Aitchison <jaitchison@acm.org>
Data Sciences Pty Ltd
Sydney, AUSTRALIA.