Re: [Corpora-List] Brown Corpus

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Wed Jun 15 2005 - 10:41:49 MET DST

Next message: Rayson, Paul: "RE: [Corpora-List] Lexicon with semantic features needed"

Previous message: Steven Bird: "Re: [Corpora-List] Brown Corpus"
In reply to: Steven Bird: "Re: [Corpora-List] Brown Corpus"
Next in thread: Lou Burnard: "Re: [Corpora-List] Brown Corpus"
Next in thread: Jörg Schuster: "[Corpora-List] Re: Brown Corpus"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Steven,
I think the original design plan for Brown was to collect 500 text
samples, each of 2000 words (or up to the end of the sentence including
the 2000th word). For some text categories, e.g. newspapers (categories
A, B, C), the texts found were generally shorter than 2000 words, so
several newspaper articles were combined into a single 2000-word "text".
But most later 2000-word samples are each from a single source.
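
To make the cut-off rule concrete, here is a rough Python sketch of "2000 words, or up to the end of the sentence including the 2000th word". The function name cut_sample and the naive sentence splitter are my own assumptions for illustration; I am not claiming this is how the Brown compilers actually did it.

```python
import re


def cut_sample(text, target_words=2000):
    """Return the prefix of `text` ending with the sentence that
    contains the `target_words`-th word (hypothetical helper)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Real texts (abbreviations, quotations) need a better splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    count = 0
    for sentence in sentences:
        kept.append(sentence)
        count += len(sentence.split())  # words = whitespace-separated tokens
        if count >= target_words:
            break  # this sentence contains the target-th word, so stop here
    return " ".join(kept)


demo = "One two three. Four five six. Seven eight nine."
print(cut_sample(demo, target_words=4))
```

With a target of 4 words the demo keeps the first two sentences, since the 4th word falls inside the second one. A text shorter than the target comes back whole, which is the newspaper case where several articles have to be combined.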

Other corpora have followed this design principle of a standard
sample size of about 2000 words (LOB, FLOB, FROWN,
ICE: International Corpus of English, CCA: Corpus of Contemporary
Arabic, ...), though not all have (e.g. BNC, ANC). I don't suppose for
most applications it matters whether you combine small files into a big
file to simplify storage/processing, as long as there is a record
somewhere of the original sources (either in a Handbook, or in XML
header markup).
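
As a sketch of what I mean by keeping a record, something like the following Python would merge the small files per category and note where each original sample sits in the big file. The name merge_by_category and the manifest layout are assumptions made up for this example (this is not NLTK code); the only thing taken from the thread is the file naming, i.e. 'c' + category letter + two-digit number, as in ca01.

```python
def merge_by_category(samples):
    """Merge samples keyed by names like 'ca01' into one text per category.

    Returns (merged, manifest): merged maps a category letter to the
    combined text; manifest maps it to (original name, start, end) word
    offsets, so the original sources can still be recovered.
    """
    merged = {}
    manifest = {}
    for name in sorted(samples):
        category = name[1]  # assumption: names look like 'ca01', 'cb26', ...
        words = samples[name].split()  # note: collapses original line breaks
        bucket = merged.setdefault(category, [])
        start = len(bucket)
        bucket.extend(words)
        manifest.setdefault(category, []).append((name, start, len(bucket)))
    return {cat: " ".join(words) for cat, words in merged.items()}, manifest


samples = {
    "ca01": "The jury said",
    "ca02": "It was fine",
    "cb01": "An editorial",
}
merged, manifest = merge_by_category(samples)
print(merged["a"])
print(manifest["a"])
```

The manifest could equally well be written out as XML header markup or printed in a Handbook; the point is only that the offsets let you get back from the big file to the original samples.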

eric atwell, Leeds University

On Tue, 14 Jun 2005, Steven Bird wrote:

> Note that this version of the Brown Corpus contains 500 files, each
> consisting of around 200 lines of text on average. Perhaps these were
> as big as they could handle back in 1961. I think it would make matters
> simpler if the file structure was rationalized now, so that, e.g.:
>
> Brown Corpus file names
> Existing -> Proposed
> ca01 .. ca44 -> a
> cb01 .. cb26 -> b
> etc
>
> (NB this is how things are being restructured in NLTK-Lite, a new,
> streamlined version of NLTK that will be released later this month.)
>
> -Steven Bird
>
>
> On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
>> By one of those uncanny coincidences, I am planning to include an
>> XMLified version of the Brown corpus on the next edition of the BNC Baby
>> corpus sampler. The version I have is derived from the GPLd version
>> distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
>> and includes POS tagging; there is also a version which has been
>> enhanced to include WordNet semantic tagging but I am not clear as to
>> the rights in that.
>>
>> Lou Burnard
>>
>>
>> Xiao, Zhonghua wrote:
>>> The plain text version of Brown is available here:
>>> http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
>>>
>>> Richard
>>> ________________________________
>>>
>>> From: owner-corpora@lists.uib.no on behalf of Jörg Schuster
>>> Sent: Tue 14/06/2005 14:39
>>> To: CORPORA@hd.uib.no
>>> Subject: [Corpora-List] Brown Corpus
>>>
>>>
>>>
>>> Hello,
>>>
>>> where can the Brown Corpus be downloaded or purchased?
>>>
>>> Jörg Schuster

-- 
Eric Atwell, Senior Lecturer, Language research group, School of Computing, 
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric


This archive was generated by hypermail 2b29 : Wed Jun 15 2005 - 10:53:18 MET DST