Corpora: Re: Unsupervised learning

Andrew McCallum (mccallum@sandbox.jprc.com)
Wed, 10 Mar 1999 12:15:17 -0500

From: Jose Maria Gomez Hidalgo <jmgomez@dinar.esi.uem.es>
Date: Wed, 10 Mar 1999 18:01:48 +0100

> I would like to know about attempts to build classifiers through
> unsupervised learning, or to integrate other information sources in
> a supervised learning-based classifier. The only one I am aware of
> is the one by Yang and Chute [1].

Integrating supervised and unsupervised learning has been a focus of
my work and that of several others at CMU and elsewhere. There was a
NIPS workshop on the subject: "Integrating Supervised and Unsupervised
Learning" (http://www.cs.cmu.edu/~mccallum/supunsup).

Here are some examples of combining supervised and unsupervised
learning for text classification:

"Learning to Classify Text from Labeled and Unlabeled Documents"
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. AAAI-98
http://www.cs.cmu.edu/~mccallum/papers/emcat-aaai98.ps.gz

A longer version of the above, to appear in the Machine Learning Journal:
http://www.cs.cmu.edu/~knigam/papers/emcat-mlj99.ps
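
The general idea behind these two papers, naive Bayes trained with EM
over labeled and unlabeled documents, can be sketched roughly as
follows. This is an illustrative sketch only, not the authors' code:
the function names (train_nb, posterior, em_nb), the add-one smoothing,
the fixed iteration count, and the toy data are all assumptions made
for the example.

```python
import math

def train_nb(docs, resp, n_classes, vocab_size):
    """M-step: fit multinomial naive Bayes from per-document class weights.

    docs: list of documents, each a list of integer word ids.
    resp: class weights per document (one-hot for labeled documents,
          fractional posteriors for unlabeled ones).
    """
    prior = [1.0] * n_classes                          # add-one smoothing (assumed)
    counts = [[1.0] * vocab_size for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            prior[c] += r[c]
            for w in doc:
                counts[c][w] += r[c]
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_word = []
    for row in counts:
        s = sum(row)
        log_word.append([math.log(x / s) for x in row])
    return log_prior, log_word

def posterior(doc, log_prior, log_word):
    """E-step for one document: P(class | doc) under the current model."""
    scores = [lp + sum(lw[w] for w in doc)
              for lp, lw in zip(log_prior, log_word)]
    m = max(scores)                                    # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, labels, unlabeled, n_classes, vocab_size, iters=10):
    """Start from the labeled data, then alternate E- and M-steps."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab_size)
    for _ in range(iters):
        soft = [posterior(d, *model) for d in unlabeled]   # E-step
        model = train_nb(labeled + unlabeled, hard + soft,
                         n_classes, vocab_size)            # M-step
    return model

# Toy data (hypothetical): words 0 and 1 co-occur, as do words 2 and 3,
# but only words 0 and 2 ever appear in a labeled document.
labeled, labels = [[0], [2]], [0, 1]
unlabeled = [[0, 1], [0, 1], [2, 3], [2, 3]]
model = em_nb(labeled, labels, unlabeled, n_classes=2, vocab_size=4)
print(posterior([1], *model))
```

In the toy data, word 1 never occurs in a labeled document, so a
classifier trained on the labeled documents alone cannot tell the two
classes apart for it. The unlabeled documents supply the co-occurrence
of word 1 with word 0, which EM uses to pull word 1 toward class 0.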

"Employing EM in Pool-Based Active Learning for Text Classification"
Andrew McCallum and Kamal Nigam.
Proc. of International Conference on Machine Learning (ICML-98)
http://www.cs.cmu.edu/~mccallum/papers/emactive-icml98.ps.gz

Shrinkage can also be seen as unsupervised learning, in that it uses
EM to "cluster" words into different ancestors in the hierarchy. Here
is a paper on using shrinkage in a hierarchy of classes to improve
document classification:

"Improving Text Classification by Shrinkage in a Hierarchy of Classes"
Andrew McCallum, Ronald Rosenfeld, Tom Mitchell and Andrew Ng.
ICML-98.
http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz
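
To make the "clustering words into ancestors" view concrete, here is a
rough sketch of shrinkage as a mixture over the nodes on the path from
a leaf class to the root, with the mixture weights fit by EM. It is an
illustration under stated assumptions, not the paper's implementation:
the names (shrinkage_weights, shrunk_estimate), the use of a separate
list of held-out words, and the toy numbers are all hypothetical.

```python
def shrinkage_weights(path_estimates, held_out_words, iters=50):
    """Fit mixture weights over the leaf-to-root path with EM.

    path_estimates: one word-probability list per node on the path
                    (leaf first, then ancestors; here a uniform
                    distribution is appended as the last entry).
    held_out_words: word ids from the leaf class that were not used
                    to build the estimates (an assumption of this sketch).
    """
    k = len(path_estimates)
    lam = [1.0 / k] * k
    for _ in range(iters):
        beta = [0.0] * k
        for w in held_out_words:
            # E-step: how likely is it that each node "generated" word w?
            mix = [lam[i] * path_estimates[i][w] for i in range(k)]
            z = sum(mix)
            if z == 0.0:
                continue
            for i in range(k):
                beta[i] += mix[i] / z
        total = sum(beta)
        if total == 0.0:
            break
        # M-step: new weights are the normalized expected counts.
        lam = [b / total for b in beta]
    return lam

def shrunk_estimate(path_estimates, lam, w):
    """Interpolated probability of word w for the leaf class."""
    return sum(l * est[w] for l, est in zip(lam, path_estimates))

# Toy example (hypothetical numbers): a sparse leaf estimate, a smoother
# parent estimate, and a uniform distribution over a 3-word vocabulary.
leaf = [1.0, 0.0, 0.0]
parent = [0.4, 0.4, 0.2]
uniform = [1.0 / 3] * 3
path = [leaf, parent, uniform]
lam = shrinkage_weights(path, held_out_words=[0, 1, 0, 2])
print(lam, shrunk_estimate(path, lam, 1))
```

The E-step softly assigns each held-out word to the node that best
explains it, which is the sense in which the words are "clustered"
into ancestors. A leaf with sparse training data ends up leaning more
heavily on its ancestors' estimates.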

We also use unsupervised learning and unlabeled data to classify
research papers into the 70-leaf topic hierarchy in Cora, a search
engine over computer science research papers
(www.cora.justresearch.com). Here is a paper describing Cora:

"Building Domain-Specific Search Engines with Machine Learning Techniques"
Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore.
AAAI-99 Spring Symposium.
http://www.cs.cmu.edu/~mccallum/papers/cora-aaaiss99.ps.gz