Re: [Corpora-List] ANC, FROWN, Fuzzy Logic

From: John F. Sowa (sowa@bestweb.net)
Date: Tue Jul 25 2006 - 18:30:43 MET DST

    Rob,

    As you know, I have a great deal of sympathy for the idea
    of using corpora in various ways in language analysis.

    There is also growing evidence that the number of rules
    needed to parse a corpus does not seem to converge.
    Like the vocabulary of any language, whose distribution
    has a very long tail, the distribution of grammar rules
    (or any other kind of language description) also has
    a very long tail.
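
    A minimal sketch of the long-tail point, using vocabulary rather
    than grammar rules. It assumes a synthetic Zipf-like distribution
    (weight 1/rank) over a hypothetical 50,000-type vocabulary; no
    real corpus is involved, and the function name is illustrative.

```python
import random
from itertools import accumulate


def distinct_types_seen(vocab_size, checkpoints, seed=0):
    """Sample tokens from a 1/rank distribution and report how many
    distinct types have appeared by each checkpoint (token count)."""
    rng = random.Random(seed)
    # Cumulative weights for a Zipf-like distribution: weight of rank r is 1/r.
    cum_weights = list(accumulate(1.0 / rank for rank in range(1, vocab_size + 1)))
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights, k=max(checkpoints))

    targets = set(checkpoints)
    seen = set()
    counts = {}
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position in targets:
            counts[position] = len(seen)
    return counts


# Synthetic assumption: 50,000 types, samples of 1,000 / 10,000 / 100,000 tokens.
growth = distinct_types_seen(50_000, [1_000, 10_000, 100_000])
for sample_size, n_types in sorted(growth.items()):
    print(sample_size, n_types)
```

    Under these assumptions the count of distinct types keeps rising
    as the sample grows, which is the non-convergence described above.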

    > Otherwise put, that experimental observations are the
    > most compact representations for many systems.

    But that does not imply that natural language data cannot
    be compressed. The fact that the curve of the number of
    rules (or whatever kind of description you prefer) falls
    off very rapidly near the beginning means that language data
    can be compressed, even though perfect compression may be
    impossible.
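
    The steep head of the curve can be illustrated with the same kind
    of idealized distribution. The sketch below assumes weights of
    1/rank over a hypothetical 50,000-item inventory and asks what
    share of all occurrences the most frequent items account for.

```python
def head_coverage(inventory_size: int, head_size: int) -> float:
    """Fraction of total probability mass carried by the head_size most
    frequent items when item weights are proportional to 1/rank."""
    weights = [1.0 / rank for rank in range(1, inventory_size + 1)]
    return sum(weights[:head_size]) / sum(weights)


# Assumed sizes: the top 1,000 items out of 50,000 (2% of the inventory).
coverage = head_coverage(50_000, 1_000)
print(round(coverage, 2))  # about 0.66 under the 1/rank assumption
```

    In this idealized case 2% of the items cover roughly two thirds
    of the occurrences, while the remaining third is spread over the
    long tail that a compact description can never fully capture.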

    As soon as you admit that corpus data can be compressed,
    Chaitin's arguments imply that some algorithm for doing the
    compression must exist. The goal of linguistics is to find
    a more humanly readable characterization of that algorithm
    than the bit pattern of a computer program.
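
    As a rough illustration only: a general-purpose compressor is one
    such algorithm, though hardly a humanly readable one. The sketch
    below uses Python's zlib on a made-up toy "corpus" of repeated
    sentences (an assumption, not real language data) and compares it
    with random bytes of the same length, which have no regularities
    to exploit.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical toy corpus: three sentence patterns repeated many times.
sentences = ["the cat sat on the mat", "the dog sat on the rug", "a cat saw the dog"]
toy_corpus = " ".join(sentences * 200).encode("utf-8")

# Random bytes of equal length serve as an incompressible baseline.
random_bytes = os.urandom(len(toy_corpus))

text_ratio = compression_ratio(toy_corpus)
noise_ratio = compression_ratio(random_bytes)
print(text_ratio, noise_ratio)
```

    The toy text is far more repetitive than real language, so its
    ratio overstates the effect; the contrast with the random baseline
    is the point.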

    > ... people need to accept that for some (Chaitin/Kolmogorov
    > tell us most) systems the experimental facts are the most
    > compact representation.

    But there is abundant evidence that NL data can be compressed.
    The fact that a two-year-old child can learn any natural language
    very rapidly implies that the corpus is highly compressible and
    that a relatively small sampling is adequate to make good
    predictions about the whole. The predictions are not 100%
    reliable, however, because adults are constantly learning
    (and inventing) new words and new grammatical constructions.

    I certainly admit that any set of rules (or other concise
    characterization of NL data) must be supplemented with data
    from a corpus. I will also admit that for any corpus of
    any given size, new data will have to be added from time
    to time. However, the fact that people can successfully
    use language, starting in early childhood, implies that
    it's possible to start with a corpus that is much, much
    smaller than the totality and add more data as needed.

    John


