RE: [Corpora-List] Brown Corpus

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Fri Jun 17 2005 - 14:10:03 MET DST

Next message: Jean Veronis: "Re: [Corpora-List] Brown Corpus"

Previous message: Lou Burnard: "Re: [Corpora-List] Brown Corpus"
In reply to: Lou Burnard: "Re: [Corpora-List] Brown Corpus"
Next in thread: Jean Veronis: "Re: [Corpora-List] Brown Corpus"
Next in thread: Jörg Schuster: "[Corpora-List] Re: Brown Corpus"
Reply: Jean Veronis: "Re: [Corpora-List] Brown Corpus"
Reply: Steven Bird: "RE: [Corpora-List] Brown Corpus"
Reply: Jean Veronis: "Re: [Corpora-List] Brown Corpus"
Reply: Martin Wynne: "Re: [Corpora-List] Brown Corpus"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

All,

Like Lou, I think the original structure of the Brown, with *same-size*
samples, has a lot to commend it.

Where samples are all the same length, you can talk about the mean and
standard deviation of a phenomenon (e.g., the frequency of "the") across the
samples, and it becomes easy to use the t-test to establish whether the
phenomenon is systematically more common in one text type than another.
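
As a rough illustration of that calculation, here is a minimal Python sketch.
The counts are made up, the function name welch_t is hypothetical, and the
choice of Welch's unequal-variance form of the t statistic is an assumption
(the point above is only that some t-test becomes applicable).

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of per-sample counts."""
    # Sample variances (n - 1 denominator), each scaled by its group size.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Invented counts of "the" in five equal-length samples per text type.
news = [130, 142, 128, 137, 133]
fiction = [101, 96, 110, 104, 99]

print("news:    mean", mean(news), "sd", round(stdev(news), 2))
print("fiction: mean", mean(fiction), "sd", round(stdev(fiction), 2))
print("Welch t:", round(welch_t(news, fiction), 2))
```

Because every sample has the same length, the raw counts are directly
comparable and need no per-sample normalisation before the test.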

If the samples are all different lengths, it is not easy: you can't simply
use the mean and standard deviation (or standard tests like the t-test), and
the fancy maths needed instead is likely to make the findings impenetrable
and unconvincing.

Many studies on LOB, Brown and relations have benefited from the fixed
sample length.

I come across all too many papers that argue, roughly, "hey look, word X is N
times as common in corpus A as in corpus B, now let's investigate why"
- and I'm left wondering whether N is enough of a difference to be salient,
given the usually unexplored level of within-text-type variation.

This argument leads me to propose the *pseudodocument*, a fixed-length run
of text of a given text type, truncated at (e.g.) the 10,000th word. By
treating text in this way, we can use the mean, SD, and t-test to work out
whether the variation between one text type and another is significant.
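
One possible reading of this, sketched in Python: pool the tokens of a text
type and cut them into consecutive fixed-length runs, discarding the short
tail. The function names and the decision to drop the remainder are
assumptions made for illustration, not part of the proposal itself.

```python
def pseudodocuments(tokens, size=10_000):
    """Cut a token list into consecutive runs of exactly `size` tokens.

    Any leftover tokens shorter than `size` are dropped, so every
    pseudodocument has the same length.
    """
    full = len(tokens) // size
    return [tokens[i * size:(i + 1) * size] for i in range(full)]


def counts_per_pseudodocument(tokens, word, size=10_000):
    """Count occurrences of `word` in each pseudodocument."""
    return [doc.count(word) for doc in pseudodocuments(tokens, size)]


# Toy example with a tiny size so the behaviour is easy to see.
toy = "the cat sat on the mat and the dog sat on the rug today".split()
print(counts_per_pseudodocument(toy, "the", size=5))
```

The resulting per-pseudodocument counts for two text types can then be
compared by mean, SD, and t-test just as for the fixed-size Brown samples.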

You read it first on Corpora!

Adam

-----Original Message-----
From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
Behalf Of Lou Burnard
Sent: 14 June 2005 21:37
To: sb@ldc.upenn.edu
Cc: CORPORA
Subject: Re: [Corpora-List] Brown Corpus

Well, every generation has every right to reinvent the work of their
predecessors, but such a "rationalization" seems to me to play somewhat
fast and loose with the design chosen by the original compilers of the
Brown Corpus... it was intended to comprise 500 equally-sized samples,
selected from 15 pre-defined categories. By lumping together all the
samples from the same category you will wind up with differently sized
samples -- category J ("learned") has 80 texts, c. 160,000 words; while
category R ("humor") has only 9, i.e. 18,000 words. That may not
matter, of course, for many applications, but it seems a shame to lose
that feature of the original design. Or will you preserve the original
text boundaries with some additional markup?
If so, you might like to consider that many of the original 2000-word
samples have some internal structure too: even within 2000 words, there
are samples taken from originally discontinuous sections of the
original. Which most versions seem to disregard.

antiquariously,

Lou

Steven Bird wrote:

>Note that this version of the Brown Corpus contains 500 files, each
>consisting of around 200 lines of text on average. Perhaps these were
>as big as they could handle back in 1961. I think it would make matters
>simpler if the file structure was rationalized now, so that, e.g.:
>
>Brown Corpus file names
>Existing -> Proposed
>ca01 .. ca44 -> a
>cb01 .. cb26 -> b
>etc
>
>(NB this is how things are being restructured in NLTK-Lite, a new,
>streamlined version of NLTK that will be released later this month.)
>
>-Steven Bird
>
>
>On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
>
>
>>By one of those uncanny coincidences, I am planning to include an
>>XMLified version of the Brown corpus on the next edition of the BNC Baby
>>corpus sampler. The version I have is derived from the GPLd version
>>distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
>>and includes POS tagging; there is also a version which has been
>>enhanced to include Wordnet semantic tagging but I am not clear as to
>>the rights in that.
>>
>>Lou Burnard
>>
>>
>>Xiao, Zhonghua wrote:
>>
>>
>>>The plain text version of Brown is available here:
>>>http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
>>>
>>>Richard
>>>________________________________
>>>
>>>From: owner-corpora@lists.uib.no on behalf of Jörg Schuster
>>>Sent: Tue 14/06/2005 14:39
>>>To: CORPORA@hd.uib.no
>>>Subject: [Corpora-List] Brown Corpus
>>>
>>>
>>>
>>>Hello,
>>>
>>>where can the Brown Corpus be downloaded or purchased?
>>>
>>>Jörg Schuster

This archive was generated by hypermail 2b29 : Fri Jun 17 2005 - 14:16:14 MET DST