Corpora: Re: Lines of English on Internet

From: juliewg@nac.net
Date: Mon Dec 03 2001 - 17:42:39 MET


    The question is a complex one. You need to take several things into
    account.

    1. Not all Internet content exists as HTML (and the HTML portion alone
    would be difficult to count). Pages of online full-text books, articles,
    and white papers still exist in the less common corners of the
    Internet, such as Gopher and WAIS, and on various university FTP
    servers. Are you attempting to count these as well?

    2. There are probably millions of lines of text associated with Usenet
    newsgroups. Are you trying to account for these as well? There are at
    least 30,000 different newsgroups, many of which contain thousands of
    lines in their threads.

    3. You also need to attempt to calculate some sort of sliding
    percentage, so that you can account for the constant growth of the
    Internet. It looks like you have already tried to apply probability
    theory to this. Does the amount of written content increase by 1% a
    month? 5%? There doesn't seem to be any agreed-upon number to use to
    calculate growth rate.
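
    To show how much the choice of rate matters, here is a minimal sketch
    of compound monthly growth. The function names are hypothetical, and
    the 1% and 5% rates are only the illustrative values from point 3, not
    measured figures.

```python
import math


def growth_factor(monthly_rate: float, months: int) -> float:
    """Factor by which a quantity grows after `months` of compounding."""
    return (1.0 + monthly_rate) ** months


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed for the quantity to double at a fixed monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


# Illustrative rates only (assumed, taken from the question in point 3).
for rate in (0.01, 0.05):
    yearly = growth_factor(rate, 12)
    doubling = doubling_time_months(rate)
    print(f"{rate:.0%} per month: x{yearly:.2f} per year, "
          f"doubles in {doubling:.1f} months")
```

    At 1% a month the total grows by roughly 13% in a year; at 5% a month
    it grows by roughly 80%. Any count therefore goes out of date at a
    very different pace depending on which rate is assumed.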

    I am sure I am forgetting to factor in a lot of other variables, as
    well. To be honest, I don't even know where to begin the data
    collection, and I would have great concern that after I had found a way
    to gather the data, it would already be antiquated.

    Regards,
    Julie Wang-Gempp

    Hristo Tanev wrote:

    > Dear Corpora List Members,
    > Every week I see on this list many interesting
    > questions and discussions. I think our email list is
    > something very useful and interesting to read!
    >
    > I want to put here a question, which answer I couldn't
    > find in Internet or in the literature I have.
    > The question is: approximately how many pages in
    > English exist in Internet?
    >
    > A friend of mine told me something about the total
    > number of pages in Internet (1 milliard). However I
    > couldn't find some source, referring to this question.
    >
    > I tried to calculate the number of pages, using search
    > engine and a formula from the probabilistic theory.
    >
    > The results I obtained were about 50-80 millions of
    > pages in English.
    >
    > I don't know if these figures are wrong, but they seem
    > to me too low. Does someone of you know approximately
    > how many pages exist in Internet in English language?
    > Thank you in advance!
    >
    > Hristo Tanev
    > ITC,Irst


