Re: [Corpora-List] Web pages corpus

From: Chris Jordan (cjordan@cs.dal.ca)
Date: Mon Mar 06 2006 - 15:49:10 MET

    Hello,

    I am also interested in a standard Web corpus. The reason I do not want
    to compile my own is that it would then be difficult to compare the
    results of my experiments with other reported results. I am also hoping
    for an annotated corpus that includes other valuable information, such
    as genre, topic, and abstract, which has to be added by assessors or
    subject matter experts.

    Imen, depending on your project / experiment, I would carefully consider
    what you are attempting to show. Creating a corpus is an option; however,
    it may make your experiment a one-off and lower the value of your
    results. Furthermore, since you are doing document summarization, if you
    use your own corpus, you will be limited to performing a user evaluation
    to assess its capabilities. Generally these types of evaluations are
    beyond the scope of a Master's.

    Chris

    Jakob Halskov wrote:

    >Dear Imen,
    >
    >It is very easy to compile a web corpus on your own using one of the freely available web search APIs. See for example:
    >
    >http://developer.yahoo.net/search/index.html
    >
    >or
    >
    >http://www.google.com/apis/
    >
    >Best regards,
    >
    >Jakob Halskov
    >--
    >PhD student
    >Dept. of Computational Linguistics
    >Copenhagen Business School
    >www.id.cbs.dk
    >
    >----- Original Message -----
    >From: "ismi.touati" <ismi.touati@laposte.net>
    >Date: Monday, March 6, 2006 12:29 pm
    >Subject: [Corpora-List] Web pages corpus
    >
    >
    >
    >>Dear all,
    >>
    >>I'm working on automatic summarization of web pages, and I'm looking
    >>for a corpus of web pages (HTML documents) with their abstracts to
    >>evaluate my system.
    >>
    >>Does anyone know if such a corpus exists?
    >>
    >>Thanks in advance for the help.
    >>Imen.
    >>
    >>***********************************
    >>Imen Touati
    >>Master's student at the Faculty of Economic Science and Management of
    >>Sfax, Tunisia.
    >>LARIS laboratory
    >>Address: LARIS, FSEGS, BP 1088, 3018 Sfax, Tunisia
    >>



    This archive was generated by hypermail 2b29 : Mon Mar 06 2006 - 16:09:48 MET