[Corpora-List] XML Wikipedia Collections for IR/ML Research

From: Ludovic DENOYER (ludovic.denoyer@lip6.fr)
Date: Fri Apr 07 2006 - 19:39:11 MET DST


    Wikipedia XML Corpus for research

    Ludovic DENOYER

    LIP6 - University of Paris 6

    http://www-connex.lip6.fr/~denoyer/wikipediaXML

    Technical report (currently Draft):
    http://www-connex.lip6.fr/~denoyer/homepage/publications/TECHREP2006.pdf

    =============

    This is an announcement of the release of a set of large XML document
    collections.
    These collections may be of interest to the Information Retrieval
    community and to the Machine Learning community.
    They have been developed as a joint project between the
    DELOS and PASCAL Networks of Excellence.

    ===========

    We propose a large set of XML collections based on Wikipedia. These
    collections can be used for a wide variety of XML IR/Machine Learning
    tasks such as ad-hoc retrieval, categorization, clustering, or structure
    mapping. These corpora are used, for example, for the INEX 2006
    competition (http://inex.is.informatik.uni-duisburg.de/2006) and for the
    XML Document Mining Challenge (http://xmlmining.lip6.fr).

    Brief description of the collections:

    - 8 different languages: English, German, French, Dutch, Spanish,
    Chinese, Arabic, Japanese

    - 660,000 documents for the English collection

    - All documents are organized in a hierarchy of categories

    - Some collections have been built for the comparison of
    categorization/clustering algorithms

    - Multimedia Collection (more than 300,000 pictures)

    - Entity Collection

    Other collections (Cross-Language, NLP Collection) will be provided soon.

    More information on the web site:
    http://www-connex.lip6.fr/~denoyer/wikipediaXML

    Best regards,

    Ludovic DENOYER

    Assistant Professor

    http://www-connex.lip6.fr/~denoyer


