Re: [Corpora-List] Web search by document size

From: Gregor Erbach (gor@acm.org)
Date: Fri Mar 11 2005 - 15:40:05 MET

    Hi,
    AllTheWeb used to have an option for restricting the search
    according to document size, but this option appears to be
    no longer available.

    I assume most search engines use some variant of TF*IDF
    weighting for ranking search results: term frequency TF
    (how many times a term appears in a document) multiplied by
    the inverse of document frequency DF (the number of documents
    in which a term appears), in combination with hyperlink analysis.
    Short documents in which infrequent search terms appear will
    therefore rank highly, but so will longer documents in which
    the search terms appear many times, which is not what you want.
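
To illustrate the ranking effect described above, here is a minimal sketch in Python. It assumes raw term counts for TF and log(N/DF) for the inverse document frequency, which is only one common variant; real search engines use their own, more elaborate weighting and add link analysis. The function name `tf_idf_score` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum TF * IDF over the query terms for one document.

    Assumed variant: TF is the raw count of the term in the document,
    IDF is log(N / DF) where N is the number of documents in the corpus.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        df = sum(1 for doc in corpus if term in doc)
        if tf == 0 or df == 0:
            continue  # term contributes nothing to this document
        score += tf * math.log(n_docs / df)
    return score


# Toy corpus (hypothetical): each document is a list of tokens.
short_doc = ["corpus", "linguistics"]
long_doc = ["corpus"] * 5 + ["linguistics"] * 5 + ["filler"] * 200
corpus = [short_doc, long_doc, ["weather", "report"], ["train", "timetable"]]

query = ["corpus", "linguistics"]
for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, len(doc), round(tf_idf_score(query, doc, corpus), 3))
```

With this unnormalised variant the long document outscores the short one simply because the query terms occur more often in it, so ranking by score does not surface the smallest matching document.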

    Your best bet is probably to download all search results
    (which should not be too many if your list of words is
    long enough), and then sort them by document length.
    You can use the Google Web API (http://www.google.com/apis/)
    for this. It allows up to 1000 searches per day.
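
The filtering and sorting step could look roughly like the Python sketch below. It deliberately leaves out the search and download part and assumes the page texts are already available in a dictionary mapping URL to text. The names `rank_by_length` and `pages` and the example.org URLs are hypothetical, and length is measured in characters, which is only one possible definition of document size.

```python
import re


def tokens(text):
    """Return the set of lowercased word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def rank_by_length(pages, required_words):
    """Keep pages containing every required word, shortest first.

    pages: dict mapping URL -> already downloaded page text (assumed input).
    Returns a list of (url, length_in_characters) pairs.
    """
    required = {word.lower() for word in required_words}
    matching = [
        (url, len(text))
        for url, text in pages.items()
        if required <= tokens(text)  # all required words are present
    ]
    return sorted(matching, key=lambda pair: pair[1])


# Hypothetical downloaded results.
pages = {
    "http://example.org/a": "corpus linguistics " * 50,
    "http://example.org/b": "A short note on corpus linguistics.",
    "http://example.org/c": "A page about corpus design only.",
}

for url, length in rank_by_length(pages, ["corpus", "linguistics"]):
    print(length, url)
```

The first entry of the returned list is then the smallest downloaded document that contains all of the words.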

    regards,

       Gregor Erbach

    Brett Reynolds wrote:
    > I'd like to be able to search the web for the smallest document
    > containing all of a certain list of words. Is anyone aware of a search
    > engine that will allow this kind of query?
    >
    > -----------------------
    > Brett Reynolds
    > English Language Centre
    > Humber Institute of Technology and Advanced Learning
    > Toronto, Ontario, Canada
    > brett.reynolds@humber.ca
    >
    >

    --
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Dr. Gregor Erbach                     http://purl.org/net/gregor/
    DFKI GmbH, Language Technology Lab    http://www.dfki.de/
    Tel. +49 (681) 302-5354               mailto:erbach@dfki.de
    


