[Corpora-List] Web corpora vs. Gigaword

From: Serge Sharoff (S.Sharoff@leeds.ac.uk)
Date: Thu Jun 02 2005 - 14:12:53 MET DST


    > > But then again, why not go simply to UPenn and purchase some
    > > license for English Gigaword plus some additional tens of millions
    > > words corpora from LDC?
    >
    > For example because I'm also interested in 1 billion words of Italian,
    > German and Japanese? Or because I think that the web can give us a more
    > varied picture of a language than a newswire corpus? But more in general

    Apart from the issue of their cost (LDC corpora are prohibitively expensive) and availability for particular languages, the language of newswire corpora is quite different from the language used in the BNC and Internet corpora. I compared the frequency lists from several newswire corpora (Reuters and Gigaword, in particular) against corpora treated as representative (such as the BNC) and corpora compiled from the Internet. Interestingly, Internet and BNC-like corpora share similar features, while newswire corpora stand apart: they report past events (frequently financial: 56% in Reuters) in more or less formal language, so they use fewer first- and second-person pronouns, question words, modals, etc. (These findings are reported in a paper currently under review; contact me if you'd like to see the draft.) At least for the purposes of lexicographic research, it is much better to use corpora compiled from the Internet, unless you are interested specifically in the language of newswires.
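For readers who want to try a frequency-list comparison of this kind themselves, here is a minimal sketch in Python. It is not the method from the paper under review, which is not described in this message. It assumes the log-likelihood (G2) statistic as the comparison measure, and the function names and toy token lists are invented for illustration only.

```python
import math
from collections import Counter


def per_million(count: int, total: int) -> float:
    """Relative frequency of a word, normalised to one million tokens."""
    return 1_000_000 * count / total if total else 0.0


def log_likelihood(a: int, b: int, total_a: int, total_b: int) -> float:
    """G2 score for a word seen a times in corpus A and b times in corpus B.

    Assumption: G2 is used here as one common keyness measure; the
    original comparison may have used something else.
    """
    combined = total_a + total_b
    expected_a = total_a * (a + b) / combined
    expected_b = total_b * (a + b) / combined
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score


def compare(freq_a: Counter, freq_b: Counter, words):
    """Return (word, per-million in A, per-million in B, G2), highest G2 first."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    rows = []
    for w in words:
        a, b = freq_a[w], freq_b[w]
        rows.append((w,
                     per_million(a, total_a),
                     per_million(b, total_b),
                     log_likelihood(a, b, total_a, total_b)))
    return sorted(rows, key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    # Toy data, invented for illustration; real frequency lists would
    # come from tokenised corpora such as a web crawl and a newswire set.
    web = Counter("i think you can ask what we should do".split())
    news = Counter("the company said shares fell after the report".split())
    for row in compare(web, news, ["i", "you", "the"]):
        print(row)
```

With real frequency lists, the words to inspect would be the categories mentioned above (first- and second-person pronouns, question words, modals); a high G2 together with a lower per-million rate in the newswire column would point in the direction described.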

    Serge

    --
    Dr. Serge Sharoff
    Centre for Translation Studies
    School of Modern Languages and Cultures
    University of Leeds
    Leeds, LS2 9JT
    

    tel: +44(0)113 343 7287 fax: +44(0)113 343 3287


