Re: [Corpora-List] Finding representative terms

From: Jing-Shin Chang (jshin@csie.ncnu.edu.tw)
Date: Mon Dec 26 2005 - 21:13:36 MET

    Hi, Delip,

    If I understand it correctly, you are using IDF without
    weighting terms with term frequencies (TF)!!

    This will surely result in poor performance, since terms
    that occur in the same number of domains/documents receive identical
    scores and cannot be discriminated from each other. A correct ranking
    over a large number of terms is then not possible.

    Including the multiplicative factor TF (the term frequency in a domain,
    Nij), on the other hand, rewards a term in the domains/documents where it
    is frequent and penalizes it in the domains where it is used less often.
    Using both TF and IDF should improve the performance significantly,
    judging from my early experience with using an IDF-like measure alone.
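
    To make the tie-breaking effect concrete, here is a minimal sketch
    on a made-up toy corpus. The documents, the names idf and tf_idf, and
    the plain log(N / DF) form of IDF are all assumptions chosen for
    illustration, not details taken from the experiments mentioned above.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): one token list per domain/document.
docs = {
    "finance": "bank loan bank interest bank river".split(),
    "nature": "river mountain river bank forest".split(),
    "sports": "team goal team match forest".split(),
}

n_docs = len(docs)
# Term frequency of each term in each domain (N_ij).
tf = {name: Counter(tokens) for name, tokens in docs.items()}
# Document frequency: the number of domains in which each term occurs.
df = Counter(term for counts in tf.values() for term in counts)


def idf(term):
    """Plain inverse document frequency, log(N / DF)."""
    return math.log(n_docs / df[term])


def tf_idf(term, doc):
    """Term frequency in one domain multiplied by IDF."""
    return tf[doc][term] * idf(term)


# "bank" and "river" both occur in two domains, so IDF alone ties them.
print(idf("bank") == idf("river"))
# Multiplying by TF separates them, in a different order per domain.
print(tf_idf("bank", "finance"), tf_idf("river", "finance"))
print(tf_idf("bank", "nature"), tf_idf("river", "nature"))
```

    In this toy data IDF gives "bank" and "river" the same score, while
    TF * IDF ranks "bank" higher in "finance" and "river" higher in "nature".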

    This TF factor also partially resolves your single-domain/document problem.
    Frequent terms are kings if you have only one document (or, in general,
    if the DFs are the same).

    Also, I have tried a refined version of DF (more precisely, a revision of
    log(DF)), called cross-domain entropy (CDE) or inter-domain entropy (IDE)
    (NEITHER relative entropy NOR cross entropy!!), which is then used
    to estimate the expected number of domains/documents as E[DF] = 2**CDE.
    (The term 'expectation' is used here in a not-so-rigorous way.)

    The CDE measure considers the probability of a term
    in a domain/document to decide whether one should increment the DF
    by one (or only by a fraction) when the term appears in that
    domain/document.

    Roughly speaking, if it is a frequent term in a domain/document,
    DF tends to be incremented by one; otherwise, only a fractional
    count is added to DF.
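
    One possible reading of this, sketched below, treats CDE as the
    Shannon entropy (in bits) of a term's frequency distribution over
    domains. The normalization P_j = count_j / sum(counts) and the function
    names cross_domain_entropy and expected_df are assumptions made for
    illustration; the exact definition is in the paper, not in this message.

```python
import math


def cross_domain_entropy(counts):
    """Entropy, in bits, of one term's distribution over domains.

    counts[j] is the term's frequency in domain j. Normalizing by the
    term's total count is an assumed choice for this sketch.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("the term must occur at least once")
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def expected_df(counts):
    """Effective number of domains for the term, E[DF] = 2 ** CDE."""
    return 2 ** cross_domain_entropy(counts)


# Evenly spread over three domains: each domain counts fully.
print(expected_df([5, 5, 5]))
# Concentrated in one domain: the two rare domains add only a fraction,
# although the plain DF would still be 3.
print(expected_df([98, 1, 1]))
```

    Under this reading, a term spread evenly over k domains gets
    E[DF] = k, the same as the plain DF, while a term concentrated in
    one domain gets a value only slightly above 1.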

    This refinement (TF * Inverse E[DF]) consistently gave
    some improvement over the TF-IDF term weighting method
    in my experiments (on domain-specific word extraction and
    document classification). I would like to see whether the refinement
    consistently performs better than TF-IDF in other tasks too.
    So you are welcome to refer to this work:

    Jing-Shin Chang, "Domain Specific Word Extraction from
        Hierarchical Web Documents: A First Step Toward Building
        Lexicon Trees from Web Corpora," Proceedings of the Fourth
        SIGHAN Workshop on Chinese Language Processing, IJCNLP-05
        (International Joint Conference on Natural Language Processing),
        pp. 64-71, Jeju Island, Korea, October 14-15, 2005.

    http://nlp.csie.ncnu.edu.tw/~shin/doc/SIGHAN.2005/DSW.SIGHAN.2005.Camera.Ready.Jing_Shin_Chang+B.pdf
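
    As a rough sketch of how the TF * Inverse E[DF] weighting mentioned
    above could differ from TF-IDF, the code below uses log2(N / E[DF])
    as the inverse factor, by analogy with IDF = log2(N / DF). That log
    form, the entropy normalization, and the function names are all
    assumptions for illustration; see the paper for the actual definition.

```python
import math


def expected_df(counts):
    """E[DF] = 2 ** entropy of the term's distribution over domains."""
    total = sum(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 2 ** entropy


def tf_idf(counts, domain):
    """Baseline: TF in the domain times log2(N / DF)."""
    n_domains = len(counts)
    df = sum(1 for c in counts if c > 0)
    return counts[domain] * math.log2(n_domains / df)


def tf_inverse_edf(counts, domain):
    """TF in the domain times log2(N / E[DF]); the log form is assumed."""
    n_domains = len(counts)
    return counts[domain] * math.log2(n_domains / expected_df(counts))


# Made-up counts of one term across three domains.
skewed = [98, 1, 1]    # concentrated in domain 0
uniform = [5, 5, 5]    # evenly spread

# The term occurs in all three domains, so plain IDF is zero in both cases.
print(tf_idf(skewed, 0), tf_idf(uniform, 0))
# The entropy-based factor still rewards the concentrated term.
print(tf_inverse_edf(skewed, 0), tf_inverse_edf(uniform, 0))
```

    With these made-up counts, TF-IDF cannot tell the two terms apart,
    because both occur in every domain, whereas the entropy-based factor
    gives the concentrated term a clearly positive weight in its main
    domain and keeps the evenly spread term near zero.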

    As a final comment, when referring to "representative terms",
    it might be more precise to say "domain-specific representative terms"
    or "representative terms in a specific domain", since a term might be
    ambiguous in sense and may not be representative in all domains.

    For instance, "bank" may be specific/representative in the "finance" domain
    (for its high term frequency in that domain), but it may not be as
    representative as other terms (like "mountain", "river") when describing
    natural scenes (for its relatively lower term frequency).

    - Jing-Shin Chang -^^-
     
    > From owner-corpora@lists.uib.no Tue Dec 27 00:55:01 2005
    > Date: Tue, 27 Dec 2005 00:20:07 +0800 (CST)
    > From: Delip Rao <deliprao@yahoo.com>
    > Subject: [Corpora-List] Finding representative terms
    >
    > Hi,
    >
    > Is there any work that tries to find the most
    > important/representative words from a document? I have
    > tried using IDF but results were very poor. Also IDF
    > does not make sense if we have a single document and
    > want to get the most important term(s) out of it.
    >
    > Thanks!
    > Delip


