Re: [Corpora-List] Brown Corpus

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Wed Jun 15 2005 - 10:41:49 MET DST

    steven,
    I think the original design plan for Brown was to collect 500 text
    samples, each of 2000 words (or up to end of sentence including the
    2000th word). For some text categories, e.g. newspapers (categories A, B, C),
    the texts found were generally shorter than 2000 words, so several
    newspaper articles were combined into a single 2000-word "text".
    BUT most later 2000-word samples are from a single source.
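
    The sampling rule described above (stop at the end of the sentence that
    contains the 2000th word) can be sketched roughly as follows. This is
    only an illustration, not the procedure the Brown compilers actually
    used: the function name take_sample and the representation of a text as
    a list of sentences, each a list of word tokens, are assumptions made
    for the example.

```python
def take_sample(sentences, target=2000):
    """Return the leading sentences of a text, up to and including the
    sentence that contains the target-th word.

    `sentences` is assumed to be a list of sentences, each a list of
    word tokens (a hypothetical representation chosen for this sketch).
    """
    sample = []
    count = 0
    for sent in sentences:
        sample.append(sent)
        count += len(sent)
        if count >= target:
            # The target-th word falls inside this sentence, so stop here.
            break
    return sample


# Several short newspaper articles can be chained into one "text"
# before sampling, as described for categories A, B and C.
articles = [
    [["word"] * 7] * 100,   # article 1: 700 words
    [["word"] * 7] * 100,   # article 2: 700 words
    [["word"] * 7] * 100,   # article 3: 700 words
]
combined = [sent for article in articles for sent in article]
sample = take_sample(combined)
```

    A text shorter than the target is returned whole, which is why shorter
    sources would need to be combined before sampling.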

    Other corpora have followed this design principle of a standard
    sample-size of about 2000 words (LOB, FLOB, FROWN,
    ICE: International Corpus of English, CCA: Corpus of Contemporary
    Arabic, ...), though not all have (e.g. BNC, ANC). I don't suppose it
    matters for most applications whether you combine small files into a big
    file to simplify storage/processing, as long as there is a record
    somewhere of the original sources (either in a Handbook, or in XML
    header markup).
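
    As a rough sketch of combining small files while keeping a record of
    the original sources: the code below assumes Brown-style file names
    such as ca01 or cb26, where the second letter is the category, and it
    works on an in-memory mapping of file name to text rather than on real
    files. The function name merge_by_category and the manifest layout are
    hypothetical, and this is not how any particular toolkit does it.

```python
def merge_by_category(files):
    """Merge per-sample texts into one list of lines per category.

    `files` maps a Brown-style name (e.g. "ca01") to its text.
    Returns (merged, manifest), where manifest records, for each
    category, which line range came from which original file.
    """
    merged = {}    # category letter -> list of lines
    manifest = {}  # category letter -> [(source name, first line, last line)]
    for name in sorted(files):
        category = name[1]  # assumption: "ca01" -> "a"
        lines = files[name].splitlines()
        out = merged.setdefault(category, [])
        first = len(out) + 1          # 1-based line number in the merged file
        out.extend(lines)
        manifest.setdefault(category, []).append((name, first, len(out)))
    return merged, manifest


# Toy input standing in for the real corpus files.
files = {
    "ca01": "line one\nline two",
    "ca02": "line three",
    "cb01": "editorial line",
}
merged, manifest = merge_by_category(files)
```

    The manifest could equally be written into a handbook or an XML header;
    the point is only that each merged line can be traced back to its source.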

    eric atwell, Leeds University

    On Tue, 14 Jun 2005, Steven Bird wrote:

    > Note that this version of the Brown Corpus contains 500 files, each
    > consisting of around 200 lines of text on average. Perhaps these were
    > as big as they could handle back in 1961. I think it would make matters
    > simpler if the file structure was rationalized now, so that, e.g.:
    >
    > Brown Corpus file names
    > Existing -> Proposed
    > ca01 .. ca44 -> a
    > cb01 .. cb26 -> b
    > etc
    >
    > (NB this is how things are being restructured in NLTK-Lite, a new,
    > streamlined version of NLTK that will be released later this month.)
    >
    > -Steven Bird
    >
    >
    > On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
    >> By one of those uncanny coincidences, I am planning to include an
    >> XMLified version of the Brown corpus on the next edition of the BNC Baby
    >> corpus sampler. The version I have is derived from the GPLd version
    >> distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
    >> and includes POS tagging; there is also a version which has been
    >> enhanced to include Wordnet semantic tagging but I am not clear as to
    >> the rights in that.
    >>
    >> Lou Burnard
    >>
    >>
    >> Xiao, Zhonghua wrote:
    >>> The plain text version of Brown is available here:
    >>> http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
    >>>
    >>> Richard
    >>> ________________________________
    >>>
    >>> From: owner-corpora@lists.uib.no on behalf of Jörg Schuster
    >>> Sent: Tue 14/06/2005 14:39
    >>> To: CORPORA@hd.uib.no
    >>> Subject: [Corpora-List] Brown Corpus
    >>>
    >>>
    >>>
    >>> Hello,
    >>>
    >>> where can the Brown Corpus be downloaded or purchased?
    >>>
    >>> Jörg Schuster

    -- 
    Eric Atwell, Senior Lecturer, Language research group, School of Computing, 
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric
    



    This archive was generated by hypermail 2b29 : Wed Jun 15 2005 - 10:53:18 MET DST