[Corpora-List] estimates of written/spoken input: summary

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Fri Dec 09 2005 - 10:18:44 MET


    Dear all,

    Two weeks ago I asked if somebody knew of work reporting estimates of how
    many words/sentences/etc. (adult) speakers of a language hear/write.

    I paste below the responses I got.

    Thanks a lot to all who responded!

    Regards,

    Marco

    ******************************************
    Reinhard Rapp
    ******************************************

    Dear Marco,

    I am also interested in the answer to your question. Some discussion
    can be found in a Psychological Review paper by Landauer & Dumais
    (1997) which is on the web at

    http://lsa.colorado.edu/papers/plato/plato.annote.html

    This is a citation from the most relevant part, which is footnote 6:

    ----------- start citation ------------

    From his log-normal model of word frequency distribution and the
    observations in Carroll et al. (1971), Carroll estimated a total
    vocabulary of 609,000 words in the
    universe of text to which students through high school might be exposed.
    Dahl (1979), whose distribution function agrees with a different but
    smaller sample of Howes (1966), found 17,871 word types in 1,058,888 tokens
    of spoken American English, compared to 50,406 in the comparable sized
    adult sample of Kucera & Francis (1967). By Carroll's (1971) model, Dahl's
    data imply a total of roughly 150,000 word types in spoken English, thus
    approximately one-fourth the total, less to the extent that there are
    spoken words that do not appear in print. Moreover, the ratio of spoken to
    printed words to which a particular individual is exposed must be even more
    lopsided because local, ethnic and family usage undoubtedly restrict the
    variety of vocabulary more than published works intended for the general
    school-aged readership.
    If we assume that our seventh-grader has met a total of 50 million word
    tokens of spoken English (140 minutes a day at 100 words per minute for 10
    years) then the expected number of occasions on which she would have
    heard a spoken word of mean frequency would be about 370. Carroll's
    estimate for the total vocabulary of seventh grade texts is 280,000, and we
    estimate below that the typical student would have read about 3.8 million
    words of print. Thus, the mean number of times she would have seen a
    printed word to which she might be exposed is only about 14. The rest of
    the frequency distributions for heard and seen words, while not
    proportional, would, at every point, show that spoken words have already
    had much greater opportunity to be learned than printed words, so will
    profit much less from an additional occurrence.

    ----------- end citation ------------
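
The arithmetic in the quoted footnote can be rechecked with a few lines of Python. This is only a rough sketch: every figure is taken from the footnote itself, the 365-day year is an assumption, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the footnote.
# All inputs come from the quoted text; the function names are illustrative.

DAYS_PER_YEAR = 365  # assumption: no allowance for leap years


def tokens_heard(minutes_per_day, words_per_minute, years):
    """Total spoken word tokens heard over the given number of years."""
    return minutes_per_day * words_per_minute * DAYS_PER_YEAR * years


def mean_exposures(total_tokens, total_types):
    """Average number of times each word type is met."""
    return total_tokens / total_types


# 140 minutes a day at 100 words per minute for 10 years
spoken_tokens = tokens_heard(140, 100, 10)
print(spoken_tokens)  # 51100000, i.e. roughly 50 million

# Spoken: tokens heard divided by the 150,000 spoken word types
print(round(mean_exposures(spoken_tokens, 150_000)))  # 341

# Print: 3.8 million tokens read divided by 280,000 word types
print(round(mean_exposures(3_800_000, 280_000)))  # 14
```

The print figure matches the footnote's "about 14". The plain division for speech gives roughly 341 rather than the footnote's 370; the footnote does not say which type count it divided by, so that gap is left unexplained here.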

    ...

    With kind regards,

    Reinhard

    ******************************************
    Paula Newman
    ******************************************

    Marco,
    That's an interesting question. A little googling suggested that a lower
    bound might come from data on the average number of hours of TV watching
    per adult (multiplied by average words per minute on TV broadcasts).
    Paula

    ******************************************
    Paul Bennett
    ******************************************

    Geoffrey Pullum and Barbara Scholz (in The Linguistic Review 19, 2002, p. 44) cite
    evidence that by the age of three a child in a professional household might
    have heard 30 million word tokens (but far fewer for children in other social
    classes). I know this relates to children rather than adults, but presumably
    the amount of language heard does not differ much by age.

    Their source is B. Hart and T. Risley: Meaningful Differences in the Everyday
    Experiences of Young Children (Paul H Brookes, 1995). I haven't read this, but
    I guess this would be a place to look for more information.

    Paul Bennett

    ******************************************
    Ilana Bromberg
    ******************************************

    Marco,

    There is some information regarding how much school-age children (up
    through HS I think) read in the following article. It's possible that some
    of the sources they cite may have more information about adults.

    Landauer, Thomas K and Dumais, Susan T. 1997. A Solution to Plato's
    Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction,
    and Representation of Knowledge. Psychological Review, 104:2, 211-240.

    Good luck,
    Ilana

    -- 
    Marco Baroni
    SSLMIT, University of Bologna
    http://sslmit.unibo.it/~baroni
    


