Re: Corpora: Stop-list etc.

Ken Church (kwc@research.att.com)
Tue, 21 Oct 1997 13:32:45 -0400 (EDT)

While there is much truth to what Ted is saying, one can argue that
there may be more to the story than just statistical considerations.

It is interesting to compare and contrast Information Retrieval and
Author Identification. Both fields use basically the same methods,
except for the weighting strategy. Content words are good
discriminators for Information Retrieval whereas stylistic words are
good discriminators for Author Identification. I can see how the
standard statistical considerations would discover much of this
weighting strategy, especially for high frequency words, but I don't
see how standard statistical considerations would capture the relevant
distinctions for low frequency words. I've argued elsewhere that the
weighting scheme needs at least 2 variables (term frequency + ???) in
order to capture the 4 possibilities:

           |  STYLISTIC    CONTENT
-----------+---------------------------
HIGH FREQ  |  the          government
LOW FREQ   |  whereas      aardvark

One variable (e.g., term frequency) can only make a two-way
distinction (e.g., 'the' vs. 'aardvark'), which isn't enough. There
are lots of different ways to think about the second variable:
burstiness, variance over documents, IDF, semantic content, etc. My
hunch is that these are all basically equally good, but I can't defend
this hunch right now.
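
To make the two-variable idea concrete, here is a minimal sketch in Python. It takes IDF as the second variable, which is only one of the candidates listed above. The function names, the toy corpus, and both cutoffs are hypothetical and chosen by hand so that the four example words land in the four cells; nothing here is a tuned or recommended setting.

```python
import math
from collections import Counter

def term_stats(docs):
    """Return {term: (collection frequency, idf)} for tokenized documents."""
    n_docs = len(docs)
    cf = Counter()  # total occurrences across the whole collection
    df = Counter()  # number of documents that contain the term
    for doc in docs:
        cf.update(doc)
        df.update(set(doc))
    # log(N / df) is one common form of IDF; assumed here for illustration.
    return {t: (cf[t], math.log(n_docs / df[t])) for t in cf}

def quadrant(freq, idf, freq_cut, idf_cut):
    """Place a term in the 2x2 table using two hand-picked cutoffs."""
    level = "HIGH FREQ" if freq >= freq_cut else "LOW FREQ"
    kind = "CONTENT" if idf >= idf_cut else "STYLISTIC"
    return level, kind

# Contrived toy corpus, built so the four example words separate cleanly.
docs = [
    "the government said the government would fund the government plan "
    "and the government budget".split(),
    "whereas the committee met the deadline".split(),
    "the aardvark slept whereas the owl hunted".split(),
    "the report was late whereas the memo was early".split(),
]

stats = term_stats(docs)
for word in ["the", "government", "whereas", "aardvark"]:
    freq, idf = stats[word]
    print(word, freq, round(idf, 3), quadrant(freq, idf, freq_cut=4, idf_cut=1.0))
```

With frequency alone, 'government' falls on the same side as 'the' and 'whereas' on the same side as 'aardvark'; the second variable is what separates each pair.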

At any rate, I think Adam's question is really quite deep and deserves
a lot of thought.

Ken Church

Date: Mon, 20 Oct 1997 10:12:25 -0700
Reply-To: "Ted E. Dunning" <ted@aptex.com>
From: "Ted E. Dunning" <ted@aptex.com>
To: Adam.Kilgarriff@itri.brighton.ac.uk
CC: einat@cogsci.ed.ac.uk, corpora@hd.uib.no, korin@cstr.ed.ac.uk
In-reply-to: <199710201059.LAA00435@cabral.itri.brighton.ac.uk> (Adam.Kilgarriff@itri.brighton.ac.uk)
Subject: Re: Corpora: Stop-list etc.
Reply-to: tdunning@aptex.com

actually, adam is missing a very important fact about IR systems which
does give a principled reason for using stop lists.

in virtually all of the leading retrieval systems which support ranked
retrieval (there are some oddballs in this mix, but only a few), the
weight assigned to a retrieval term is inversely proportional to the
frequency of the term. any term which appears in every document is
given zero or near zero weight.

given this fact, it is an obvious economy to not store the information
about the occurrence of these words. this is very similar to other
sparse matrix techniques which avoid storing information about zero
elements. since most IR systems are at their hearts simply very large
matrix transpose and multiply systems, it is hardly surprising that
sparse matrix implementation techniques are used as much as possible.
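
a minimal sketch of the economy described above, in Python. it assumes a log(N/df) weight, which is one common choice and not necessarily what any particular system uses; build_index, score, and the min_idf cutoff are hypothetical names made up for this illustration. terms whose weight is (near) zero are simply never stored, just as a sparse matrix omits its zero entries.

```python
import math
from collections import defaultdict

def build_index(docs, min_idf=1e-9):
    """Build {term: (idf, {doc_id: tf})}, dropping terms with ~zero weight."""
    n_docs = len(docs)
    postings = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for term in doc:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    index = {}
    for term, plist in postings.items():
        idf = math.log(n_docs / len(plist))  # 0 when the term is in every document
        if idf > min_idf:  # skip storing terms that cannot affect the ranking
            index[term] = (idf, plist)
    return index

def score(index, query, n_docs):
    """Sum tf * idf over the query terms that are present in the index."""
    scores = [0.0] * n_docs
    for term in query:
        if term in index:
            idf, plist = index[term]
            for doc_id, tf in plist.items():
                scores[doc_id] += tf * idf
    return scores

docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
index = build_index(docs)
print(sorted(index))  # 'the' occurs in every document, so it is not stored
print(score(index, ["the", "cat"], len(docs)))
```

in this toy collection the query "the cat" scores exactly as "cat" alone would, so leaving 'the' out of the index loses nothing and saves the largest postings list.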

>>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:

ak> Einat Amitay wrote:

>> I'm not looking for the "right" list of words, but for the
>> reason behind using stop-lists at all. Is there an article
>> about the "why's & why-not's"?

ak> ... So the obvious
ak> hack is to exclude them.

ak> I don't think there is any theoretical justification for stop
ak> lists. The implicit assumption in much IR is that content can
ak> be assessed in isolation from form. ...