Re: [Corpora-List] labels of COLT files in BNC spoken

From: Sebastian Hoffmann (sebhoff@es.unizh.ch)
Date: Thu Nov 13 2003 - 14:54:50 MET

    At 1:00 PM +0000 11/13/03, Eric Atwell wrote:
    >Lou,
    >thanks for this expert clarification.
    >Demo chatbots trained with a variety of BNC files are now on my web-page
    >http://www.comp.leeds.ac.uk/eric/ and we can add more ....
    >
    >- I have a follow-up question: can you suggest any specific BNC spoken
    > files which illustrate particularly "interesting" / idiosyncratic
    > language use? For example, the BNC file with the most swearing? :)
    > We want to identify a selection of "unusual" files, to train
    > a collection of noticeably different chatbots.
    >
    >thanks
    >
    >Eric
    >

    Eric,
    Here's some output from BNCweb which will probably help you with your
    search for the text with the most swearing - however, I didn't spend
    much time compiling the list of "bad words"... ;-)

    Your query "<stext>#((fuck|fucks|fucking|fucked|shit|arsehole|
    bastard|cunt|dickhead|bitch|prick))" returned 4032 matches in 144
    different texts

    It was most frequently found in the following files (only texts with
    at least three occurrences are considered)

      Name of Text | Number of words | Number of hits | Freq. pmw
      KE5          |           5,121 |             92 |  17965.24
      KDA          |          75,783 |          1,098 |  14488.74
      KP9          |           6,963 |             71 |  10196.75
      KD9          |          13,908 |            124 |   8915.73
      KE1          |          21,001 |            180 |   8571.02
      KR2          |           8,090 |             69 |   8529.05
      KPH          |          12,070 |             75 |   6213.75
      KPT          |           7,553 |             41 |   5428.31
      KDN          |          46,326 |            251 |   5418.12
      KP4          |          34,712 |            182 |   5243.14
      KCU          |          53,859 |            279 |   5180.19
      KPP          |           8,112 |             42 |   5177.51
      KNV          |           7,853 |             37 |   4711.58
      KP7          |           1,938 |              9 |   4643.96
      KSU          |           2,388 |             11 |   4606.37
      KB4          |             902 |              4 |   4434.59
      KR1          |           5,453 |             24 |   4401.25
      KP0          |           7,869 |             34 |   4320.75
      KSP          |           1,543 |              6 |   3888.53
      KPG          |          45,229 |            145 |   3205.91
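
    The "Freq. pmw" column appears to be the hit count normalised to
    occurrences per million words (hits / words * 1,000,000). Below is a
    minimal sketch of that calculation, using three rows copied from the
    table above. The function name freq_pmw is made up for illustration
    and is not part of BNCweb.

```python
def freq_pmw(hits: int, words: int) -> float:
    """Return the frequency of hits per million words."""
    return hits / words * 1_000_000

# (text name, number of words, number of hits), copied from the table above
rows = [
    ("KE5", 5121, 92),
    ("KDA", 75783, 1098),
    ("KP9", 6963, 71),
]

for name, words, hits in rows:
    # Two decimal places, matching the precision shown in the listing
    print(f"{name} {freq_pmw(hits, words):.2f}")
```

    Normalising this way makes texts of very different lengths comparable,
    which is why the short text KE5 ranks above KDA despite far fewer hits.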

    Your query was least frequently found in the following files (only
    texts with at least one occurrence are considered)

      Name of Text | Number of words | Number of hits | Freq. pmw
      KRT          |         158,430 |              1 |      6.31
      KCT          |         104,104 |              1 |      9.61
      KBW          |         123,017 |              2 |     16.26
      KDM          |         115,661 |              2 |     17.29
      KBH          |          51,340 |              1 |     19.48
      KC2          |          47,809 |              1 |     20.92
      KS7          |          43,335 |              1 |     23.08
      KBB          |          81,085 |              2 |     24.67
      KDV          |          29,392 |              1 |     34.02
      KCS          |          25,055 |              1 |     39.91
      FUK          |          20,220 |              1 |     49.46
      KR0          |          20,183 |              1 |     49.55
      JYN          |          19,468 |              1 |     51.37
      KB2          |          37,597 |              2 |     53.20
      KBF          |         111,948 |              6 |     53.60
      KP1          |          70,999 |              4 |     56.34
      KDJ          |          17,227 |              1 |     58.05
      K6W          |          17,142 |              1 |     58.34
      FUL          |          16,591 |              1 |     60.27
      HMA          |          16,298 |              1 |     61.36

    As the following list shows, more than 50% of all instances are
    accounted for by "fucking":

    There are 11 types and 4032 tokens in your sorted query result
    No. | Lexical item | No. of occurrences | Percent
    1   | fucking      |               2162 |  53.62%
    2   | shit         |                701 |  17.39%
    3   | fuck         |                579 |  14.36%
    4   | bastard      |                198 |   4.91%
    5   | bitch        |                138 |   3.42%
    6   | cunt         |                 95 |   2.36%
    7   | fucked       |                 63 |   1.56%
    8   | prick        |                 34 |   0.84%
    9   | arsehole     |                 29 |   0.72%
    10  | dickhead     |                 23 |   0.57%
    11  | fucks        |                 10 |   0.25%
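
    The percentages in this list can be reproduced from the raw counts:
    each item's share is its count divided by the total number of tokens.
    A minimal sketch follows, with the counts copied from the list above.
    The helper name shares is hypothetical and only illustrates the
    arithmetic; it is not how BNCweb itself computes the figures.

```python
# Occurrence counts per lexical item, copied from the list above
counts = {
    "fucking": 2162, "shit": 701, "fuck": 579, "bastard": 198,
    "bitch": 138, "cunt": 95, "fucked": 63, "prick": 34,
    "arsehole": 29, "dickhead": 23, "fucks": 10,
}

def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each item's share of the total token count, in percent."""
    total = sum(counts.values())
    return {word: 100 * n / total for word, n in counts.items()}

total_tokens = sum(counts.values())  # should equal the reported 4032 tokens
pct = shares(counts)

# Print items from most to least frequent, with the share to two decimals
for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:<10} {n:>5} {pct[word]:6.2f}%")
```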

    If you'd like me to compile similar information for different lists
    of lexical items, just let me know.

    Best,
    Sebastian

    -- 
    

    Sebastian Hoffmann
    Englisches Seminar der Univ. Zürich
    Plattenstrasse 47
    CH-8032 Zürich
    Tel: +41-1-634 3551
    Fax: +41-1-634 4908


