Re: [Corpora-List] Looking for mid-range new corpora

From: Paul McNamee (paul.mcnamee@jhuapl.edu)
Date: Tue Aug 30 2005 - 01:13:25 MET DST

    Chris,

    I've used the CACM collection (approx. 3 MB) for an IR class I teach.
    It's small, but there are queries and relevance judgments available, so
    I can run a mini-TREC evaluation with the class and provide
    students with some data to experiment with. Because it's small,
    none of the students have problems working with it. It even fits
    in an editor, and they can 'debug' their proto-IR engines by
    using the editor's built-in search.
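
    To make the mini-TREC idea concrete, here is a rough sketch of how
    a class might score a ranked run against relevance judgments using
    mean average precision. The function names, the dictionary layout,
    and the toy document IDs are all assumptions made for illustration;
    they are not the CACM file format or the trec_eval tool, and a real
    exercise would first need to parse the collection's query and
    judgment files.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query.

    ranked_doc_ids: list of document IDs in ranked order (assumed to
                    contain no duplicates).
    relevant_ids:   set of document IDs judged relevant for the query.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all judged queries.

    runs:  dict mapping query ID -> ranked list of document IDs.
    qrels: dict mapping query ID -> set of relevant document IDs.
    """
    scores = [average_precision(runs.get(qid, []), rel)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, three documents.
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
toy_runs = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}

if __name__ == "__main__":
    print(mean_average_precision(toy_runs, toy_qrels))
```

    With a collection this small, students can check a score like this
    by hand for a query or two before trusting it on the full query set.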

    The Reuters-21578 collection is about 22 MB, which is about the
    next stop before the TREC disks. It has labels for text classification,
    but no ad hoc queries that I am aware of.

    You could in principle use a subset of the TREC data; for example, some
    researchers report experiments using the AP or WSJ subsets. This would
    decrease the size, but you might have an issue in using the data for
    classroom instruction. I don't think the TREC data agreements permit
    classroom use, but I suppose you could request permission for it.

    Helpful links:
       Download site at Glasgow with several legacy IR test sets:
         http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/

       A recent web page for an IR course I taught:
         http://apl.jhu.edu/~paulmac/ir.html

    I don't have any suggestions for newer collections. You could use
    publicly available corpora (e.g., texts from Project Gutenberg or
    Wikipedia), but you'd have to come up with your own relevance assessments.

    Best regards,

    - Paul

    Paul McNamee
    Research and Technology Development Center
    Johns Hopkins University Applied Physics Lab
    11100 Johns Hopkins Road
    Laurel MD 20723-6099 USA
    Voice: +1 443 778 3816
    Fax: +1 443 778 6904
    Email: paul.mcnamee@jhuapl.edu

    On Mon, 29 Aug 2005, Chris Jordan wrote:

    > Hey all,
    >
    > I am looking for a mid-range corpus that is relatively new for a higher
    > level undergrad course in Information Retrieval. I don't want to use the TREC
    > sets as they are giant, though I don't want something that is insignificant
    > either. Having qrels and some publications on it is a bonus too. Thanks,
    >
    > --
    > Chris Jordan
    > Dalhousie Computer Science PhD Candidate
    > Dalhousie Student Union Graduate Senate Representative


