Re: [Corpora-List] Brown Corpus

From: Martin Wynne (martin.wynne@computing-services.oxford.ac.uk)
Date: Tue Jun 21 2005 - 11:45:17 MET DST

    Dear Adam and everyone,

    I don't think it is a good idea to take same-size samples from
    documents for inclusion in a corpus, as Adam suggested last week
    (copied below).

    Such an approach compromises the integrity of the texts in the corpus.
    This means that:
    (a) you risk losing interesting phenomena that occur at the end of texts
    (or in whichever bit you chop off), and being biased in favour of things
    occurring in the middle of texts;
    (b) the text at one end of the sample (or both) is orphaned from its
    cotext, thus affecting collocation measures;
    (c) in order to interpret something you find in a corpus, it is often
    necessary to read quite a lot, and maybe all, of the text. Unless you
    have easy access to the full original text, the relevant information may
    not be available (e.g. a satirical piece of journalism may only clearly
    reveal itself as satire, possibly entirely reversing the meaning of the
    text, in one sentence);
    (d) texts shorter than the fixed length get excluded, or, if they are
    allowed, they are stuck together, so you get whole texts in some text
    types (e.g. newspapers, emails) and only fragments in others (e.g.
    novels), which adversely affects comparability for the reasons given in
    (a) and (b) above.

    It is inconvenient that the texts which are sampled for corpora are
    not all the same length, but pretending that they are is not the
    answer. Corpora should be sampled to represent the population, not on
    the basis of convenience. If you need samples of the same length for
    statistical tests, then this can be done by the analysis software,
    which can count the same number of words from each text. (You may
    object that the maximum sample length is then the length of the
    shortest text, but that's the data we have to deal with. Ignoring
    data because it doesn't fit the method is not a solution.)
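    A minimal sketch of what that analysis-side truncation could look
    like, in Python (not from the original post; the directory names, the
    word counted, and plain whitespace tokenisation are all illustrative
    assumptions):

        # Compare the relative frequency of a word across two text types by
        # letting the analysis truncate every text to the length of the
        # shortest one, instead of truncating the corpus itself.
        import glob

        def tokens(path):
            """All whitespace-separated tokens of one whole text, lower-cased."""
            with open(path, encoding="utf-8") as f:
                return f.read().lower().split()

        def per_text_frequency(paths, word):
            """Frequency of `word` in the first n tokens of each text,
            where n is the length of the shortest text in the sample."""
            texts = [tokens(p) for p in paths]
            n = min(len(t) for t in texts)  # the shortest text sets the sample size
            return [t[:n].count(word) / n for t in texts]

        if __name__ == "__main__":
            news = per_text_frequency(glob.glob("corpus/news/*.txt"), "the")
            novels = per_text_frequency(glob.glob("corpus/novels/*.txt"), "the")
            print("news:  ", news)
            print("novels:", novels)

    The corpus itself stays intact on disk; the truncation happens only
    inside the analysis run, so nothing is lost for other uses of the
    texts.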

    As far as I understand it, this has been done in the past primarily
    because only small numbers of words could be dealt with (Brown) and to
    obtain copyright clearance (BNC). It may also have been done to avoid
    having corpora biased by inclusion of very long texts from one source.
    In making corpora for linguistic analysis, we may need to continue to
    sample texts for practical and legal reasons, and for reasons of
    representativeness, but I don't think we should do it for statistical
    convenience.

    Martin

    -- 
    Martin Wynne
    Head of the Oxford Text Archive and
    AHDS Literature, Languages and Linguistics
    

    Oxford University Computing Services
    13 Banbury Road
    Oxford, UK, OX2 6NN
    Tel: +44 1865 283299
    Fax: +44 1865 273275
    martin.wynne@oucs.ox.ac.uk

    Adam Kilgarriff wrote:

    > All,
    >
    > Like Lou, I think the original structure of the Brown, with *same-size*
    > samples, has a lot to commend it.
    >
    > Where samples are all the same length, you can talk about the mean and
    > standard deviation of a phenomenon (eg, the frequency of "the") across the
    > samples and it becomes easy to use the t-test to establish whether the
    > phenomenon is systematically more common in one text type than another.
    >
    > If all samples are different lengths, it is not easy: you can't use mean and
    > standard deviation (and standard tests like the t-test) and fancy maths are
    > likely to make the findings impenetrable and unconvincing.
    >
    > Many studies on LOB, Brown and relations have benefited from the fixed
    > sample length.
    >
    > I come across all too many papers that argue, roughly, "hey look, word X is N
    > times as common in corpus A as against corpus B, now let's investigate why"
    > - and I'm left wondering whether N is enough of a difference to be salient,
    > given the usually unexplored level of within-text-type variation.
    >
    > This argument leads me to propose the *pseudodocument*, a fixed-length run
    > of text of a given text type, truncated at (eg) the 10,000th word. By
    > treating text in this way, we can use mean, SD, and t-test to work out if
    > the level of variation between one text type and another is significant.
    >
    > You read it first on Corpora!
    >
    > Adam
    >
    > -----Original Message-----
    > From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    > Behalf Of Lou Burnard
    > Sent: 14 June 2005 21:37
    > To: sb@ldc.upenn.edu
    > Cc: CORPORA
    > Subject: Re: [Corpora-List] Brown Corpus
    >
    > Well, every generation has every right to reinvent the work of their
    > predecessors, but such a "rationalization" seems to me to play somewhat
    > fast and loose with the design chosen by the original compilers of the
    > Brown Corpus... it was intended to comprise 500 equally-sized samples,
    > selected from 15 pre-defined categories. By lumping together all the
    > samples from the same category you will wind up with differently sized
    > samples -- category J ("learned") has 80 texts, c. 160,000 words; while
    > category R ("humor") has only 9, i.e. 18,000 words. That may not
    > matter, of course, for many applications, but it seems a shame to lose
    > that feature of the original design. Or will you preserve the original
    > text boundaries with some additional markup?
    > If so, you might like to consider that many of the original 2000-word
    > samples have some internal structure too: even within 2000 words, there
    > are samples taken from originally discontinuous sections of the
    > original. Which most versions seem to disregard.
    >
    > antiquariously,
    >
    > Lou
    >
    > Steven Bird wrote:
    >
    >> Note that this version of the Brown Corpus contains 500 files, each
    >> consisting of around 200 lines of text on average. Perhaps these were
    >> as big as they could handle back in 1961. I think it would make matters
    >> simpler if the file structure was rationalized now, so that, e.g.:
    >>
    >> Brown Corpus file names
    >> Existing -> Proposed
    >> ca01 .. ca44 -> a
    >> cb01 .. cb26 -> b
    >> etc
    >>
    >> (NB this is how things are being restructured in NLTK-Lite, a new,
    >> streamlined version of NLTK that will be released later this month.)
    >>
    >> -Steven Bird
    >>
    >> On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
    >>
    >>> By one of those uncanny coincidences, I am planning to include an
    >>> XMLified version of the Brown corpus on the next edition of the BNC Baby
    >>> corpus sampler. The version I have is derived from the GPLd version
    >>> distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
    >>> and includes POS tagging; there is also a version which has been
    >>> enhanced to include WordNet semantic tagging but I am not clear as to
    >>> the rights in that.
    >>>
    >>> Lou Burnard
    >>>
    >>> Xiao, Zhonghua wrote:
    >>>
    >>>> The plain text version of Brown is available here:
    >>>> http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
    >>>>
    >>>> Richard
    >>>> ________________________________
    >>>>
    >>>> From: owner-corpora@lists.uib.no on behalf of Jörg Schuster
    >>>> Sent: Tue 14/06/2005 14:39
    >>>> To: CORPORA@hd.uib.no
    >>>> Subject: [Corpora-List] Brown Corpus
    >>>>
    >>>> Hello,
    >>>>
    >>>> where can the Brown Corpus be downloaded or purchased?
    >>>>
    >>>> Jörg Schuster
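    The "pseudodocument" proposal quoted above is easy to prototype. A
    rough sketch in Python (my illustration, not code from the thread; the
    paths, the word, the run length and the use of SciPy's Welch t-test
    are all assumptions):

        # Cut the running text of each text type into fixed-length runs
        # ("pseudodocuments"), compute a word's relative frequency per run,
        # and t-test the two sets of per-run frequencies against each other.
        import glob
        from scipy.stats import ttest_ind

        def word_freq_per_run(paths, word, run_length=10_000):
            """Relative frequency of `word` in each fixed-length run of text."""
            words = []
            for p in sorted(paths):
                with open(p, encoding="utf-8") as f:
                    words.extend(f.read().lower().split())
            runs = [words[i:i + run_length] for i in range(0, len(words), run_length)]
            runs = [r for r in runs if len(r) == run_length]  # drop the short tail
            return [r.count(word) / run_length for r in runs]

        if __name__ == "__main__":
            a = word_freq_per_run(glob.glob("corpus/news/*.txt"), "the")
            b = word_freq_per_run(glob.glob("corpus/novels/*.txt"), "the")
            result = ttest_ind(a, b, equal_var=False)  # Welch's two-sample t-test
            print("t =", round(result.statistic, 2), "p =", round(result.pvalue, 4))

    Passing equal_var=False gives Welch's variant of the test, which does
    not assume that the two text types have equal variance.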



    This archive was generated by hypermail 2b29 : Tue Jun 21 2005 - 12:11:41 MET DST