Re: Corpora: Re: Unsupervised learning

Ted E. Dunning (ted@aptex.com)
Fri, 12 Mar 1999 12:28:10 -0800 (PST)

Next message: Tony Berber Sardinha: "Re: Corpora: Seeking a machine readable version of the Francis and Kucera word frequency list"
Previous message: Marie-Paule Péry-Woodley: "Corpora: Atelier TALN/Workshop TALN : corpus et TAL. 2nd CFP"

It should also be noted that the reduced-dimensional vector
approaches, such as the one used in Matchplus (reported as the HNC
system in Tipster I and several TRECs), LSI (see Bellcore's TREC
papers), and some work at Xerox PARC (see Hinrich Schuetze's papers),
are all unsupervised learning systems. These systems can be applied very
effectively in text categorization applications. In an internal
experiment, I compared InRoute (UMASS routing system) with Convectis
(our Matchplus based categorization system) and Luduan (another
prototype of ours) on a categorization task based on TREC documents
and judgements. Convectis and InRoute performed at essentially the
same level on this task.

The task in my study was to use all of the AP1988 documents and
judgements as training for several of the TREC queries and all of the
AP1989 documents for testing. This simulates a newswire
classification task more accurately than the standard TREC routing
tasks and thus is more applicable to our products here at Aptex/HNC.
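
A minimal sketch of that kind of year-based split, for illustration
only. The document IDs, the topic number, the record layout, and the
helper names split_by_year and labels_for_topic are all made up here;
this is not the code used in the study.

```python
def split_by_year(docs, train_year=1988, test_year=1989):
    """Partition documents into training and test sets by year."""
    train = [d for d in docs if d["year"] == train_year]
    test = [d for d in docs if d["year"] == test_year]
    return train, test


def labels_for_topic(docs, judgements, topic):
    """Return (doc_id, is_relevant) pairs for one topic.

    Documents with no recorded judgement are treated as not relevant,
    which is an assumption of this sketch.
    """
    relevant = judgements.get(topic, set())
    return [(d["id"], d["id"] in relevant) for d in docs]


# Toy data: hypothetical IDs and a hypothetical topic number.
docs = [
    {"id": "AP88-0001", "year": 1988, "text": "airline merger approved"},
    {"id": "AP88-0002", "year": 1988, "text": "local election results"},
    {"id": "AP89-0001", "year": 1989, "text": "weather report"},
    {"id": "AP89-0002", "year": 1989, "text": "airline buyout talks"},
]
judgements = {51: {"AP88-0001", "AP89-0002"}}

train, test = split_by_year(docs)
train_labels = labels_for_topic(train, judgements, 51)
test_labels = labels_for_topic(test, judgements, 51)
```

Any categorizer would then be fit on train_labels only and scored on
test_labels, so that no test-year document influences training.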

>>>>> "am" == Andrew McCallum <mccallum@sandbox.jprc.com> writes:

>>>>> "jmgh" == Jose Maria Gomez Hidalgo <jmgomez@dinar.esi.uem.es> writes:

jmgh> I would like to know about attempts to build classifiers
jmgh> through unsupervised learning, or to integrate other
jmgh> information sources in a supervised learning-based
jmgh> classifier. The only one I am aware of is the one by Yang and
jmgh> Chute [1].

am> Integrating supervised and unsupervised learning has been a
am> focus of mine and several others at CMU and elsewhere.

...
