FW: [Corpora-List] The genre of the Web

From: Mark Davies (Mark_Davies@byu.edu)
Date: Sun Sep 18 2005 - 22:47:29 MET DST

    As I mentioned in my original post, we all know that there is a bit of every register on the Web -- SPOKEN (transcripts of interviews, etc.), FICTION (repositories of literature), lots of NEWSPAPERS, ACADEMIC-oriented materials, etc. So there is no question about that, of course -- the Web has a bit of everything.
     
    The original question, though, was which genres/registers (of the BNC, for example) would have frequency data that would correspond *most closely* to reliable frequency data from the web -- i.e. for the Web *as a whole*?
     
    In some very, very preliminary work that I've done, it appears that the frequency data from the web is *most* in line with the frequency data from either the newspaper or academic registers of the BNC, rather than spoken or fiction. Again, not to say that there isn't a bit of everything, but it is *most similar* to the registers just mentioned.
     
    Part of the reason that I asked the question in the first place has to do with pedagogical concerns. Suppose that my students obtain frequency data from the Web as well as frequency data from a spoken corpus. My guess is that they will find a fair number of features (lexical, grammatical, etc.) that are relatively more common in the spoken corpus than on the Web, and vice versa. My guess, though (based on very preliminary data), is that there would be less of a mismatch with newspaper- or academic-based corpora.
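
    To make the kind of comparison described above concrete, here is a minimal sketch in Python. It is only an illustration, not the method behind the preliminary results: the function names, the mean absolute log-ratio measure, and the `floor` value are all assumptions, and the numbers in the demo at the bottom are invented, not real BNC or Web figures.

```python
import math


def per_million(counts, corpus_size):
    """Convert raw counts to frequencies per million words."""
    return {w: c * 1_000_000 / corpus_size for w, c in counts.items()}


def mean_abs_log_ratio(freq_a, freq_b, floor=0.01):
    """Average |log2(a / b)| over all items in either list.

    0.0 means the two frequency lists are identical; larger values mean
    a bigger mismatch. `floor` (an arbitrary assumption) stands in for
    items that are missing from one list, to avoid dividing by zero.
    """
    items = set(freq_a) | set(freq_b)
    total = 0.0
    for w in items:
        a = max(freq_a.get(w, 0.0), floor)
        b = max(freq_b.get(w, 0.0), floor)
        total += abs(math.log2(a / b))
    return total / len(items)


def closest_register(web_freq, register_freqs):
    """Return the register whose frequencies mismatch the Web's least."""
    scores = {
        name: mean_abs_log_ratio(web_freq, freq)
        for name, freq in register_freqs.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented per-million figures, for illustration only.
    web = {"however": 400.0, "yeah": 20.0}
    registers = {
        "spoken": {"however": 50.0, "yeah": 800.0},
        "academic": {"however": 800.0, "yeah": 10.0},
    }
    best, scores = closest_register(web, registers)
    print(best, scores)
```

    A log ratio is used here so that an item that is twice as frequent and one that is half as frequent count as the same size of mismatch; a rank correlation over the two frequency lists would be another reasonable choice.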
     
    From what I've gathered talking to others over the past year, the issue of what register(s) make up the Web is an ongoing and important question for some researchers. I'd be interested in hearing from those people.
     
    Best,
     
    Mark Davies
     
    =================================================
    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906
    http://davies-linguistics.byu.edu

    ** Corpus design and use // Linguistic databases **
    ** Historical linguistics // Language variation **
    ** English, Spanish, and Portuguese **
    =================================================
     

    ________________________________

    From: owner-corpora@lists.uib.no on behalf of John F. Sowa
    Sent: Sun 9/18/2005 12:28 PM
    To: Mark P. Line
    Cc: corpora@uib.no
    Subject: Re: [Corpora-List] The genre of the Web

    I agree with Mark Line on that point:

    > I don't think of the Web as a genre at all.

    On the other hand, it's not clear that the web
    is a medium.

    > It's a very flexible medium, in fact, because
    > it seems to carry all genres effectively.

    In that regard, it's more like a very dynamic
    library. But it is also as interactive as
    telephones or video games (which it carries
    as well).

    And I certainly don't agree with Mark Davies on
    that point:

    > most would probably agree that the web is more
    > like NEWSPAPER and ACADEMIC

    That's probably what most people on Corpora list
    would say. But the people who make the most money
    from the web are the gambling casinos and the
    porno peddlers.

    John Sowa


