[Corpora-List] Re: ANC, FROWN, Fuzzy Logic

From: FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE (jfidel@siu.buap.mx)
Date: Wed Jul 26 2006 - 11:53:04 MET DST


    Hi, all,

    Before I start, I should clarify that I have never worked on compression, so
    maybe I'm missing something obvious to those who do work on it. Still, I
    can't buy Rob's claim that

    > To explain every idiosyncrasy of a given individual's productions it
    > seems likely you would need that individual's entire corpus, but for
    > understanding you would only need overlap.

    Getting a 'complete' corpus for any individual would be theoretically
    impossible, since no one produces all of their knowledge about language. Ie,
    there *does* exist passive knowledge (as well as implicit knowledge, not the
    same), as we know from all those 'silent period' kids (most of them) who
    understand well before they start speaking (often not speaking at all until
    well into their second year of life); from the well-known fact that our
    'passive' vocabulary is much larger than the vocabulary we use; etc.

    On the other hand, it takes rather little exposure to a language to begin to
    make noticeable strides in acquiring it, as anyone who has ever learned a
    second language 'in situ' as an adult knows very well. You don't ever get it
    all, but you sure can get significant 'overlap', as Rob would say. And an
    hour or so of a demonstration of this to an impressed MIT freshman by
    Kenneth Pike is what made me into a linguist (that and a serendipitous but
    super course from Morris Halle a couple of years later).

    Earlier, Rob says:

    > To say "perfect compression may be impossible" is to concede the point I
    > wish to make.
    > ...
    > I see no evidence NL data can be compressed "completely". On the
    > contrary, the evidence indicates to me that any compression of NL data
    > must be "incomplete" (and each incomplete compression involves a loss
    > of information which can only be prevented by retaining the whole
    > corpus anyway.)

    Well, for starters, though it's a trivial example, we have a good example of
    a perfectly compressible code: (original) ASCII, a seven-bit code for which
    we 'waste' a full 8 bits per 'letter'. Packing out the unused bit shrinks a
    text to 87.5% of its former size in bytes (and general-purpose lossless
    compressors routinely do far better), and we get 100% correct results back.
    Now that ain't exactly NL, but I draw the conclusions that: 1) compression
    doesn't necessarily have to be lossy, or at least not too bad (and remember
    eg that any audio signal you can make out, however noisy, can be 'made out'
    [ie, greatly cleaned up, though with great loss--esp. of noise] by the
    computer using cepstra); 2)
    good rules (or their equivalent in whatever theory you're partial to) make
    all this possible. No linguist, however poor, would deny the importance of
    having good generalizations about a particular language, corpus, etc. And no
    decent linguist, however good, would (or certainly: should) deny that their
    analysis of a particular language, corpus, etc. could be bettered. That's
    what science is all about, after all.
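
    To make that concrete, here is a minimal sketch (my own illustration, in
    Python with the standard zlib module; nothing from Rob's or anyone else's
    setup is assumed) of a lossless round trip: the decompressed bytes come back
    bit-for-bit identical to the original.

        import zlib

        # Some redundant 'text' stored the usual way: 8 bits per ASCII letter.
        original = ("Colorless green ideas sleep furiously. " * 1000).encode("ascii")

        compressed = zlib.compress(original, 9)   # generic lossless compression
        restored = zlib.decompress(compressed)    # recover every byte

        assert restored == original               # 100% correct results back
        print(len(compressed) / len(original))    # ratio well below 1 on redundant text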

    I don't think we need 'complete compression' to have perfectly useful
    results. After all, even humans make mistakes (!): in understanding, in
    production, etc. And even linguists, in assigning POS tags, for example,
    don't do much better (in agreement among themselves) than the best empirical
    computer programs, which get up to about 99% 'correct'. Of course, I would
    maintain that the 1% not covered here would vary wildly between linguists
    and computers (the latter making mistakes heavily among the least frequent
    words and uses, for example, which would be much less problematical for
    humans, in general). This does not indicate to me that computer processing
    is impossible, but rather that we just need better, more 'human-like'
    algorithms. (Not that it's trivial to discover them, of course.) Now, 99%
    correct, on the face of it, sounds great, until you reflect that in a corpus
    of 100 megawords, say, (nowadays, a smallish or at best medium-sized corpus)
    that implies a million words incorrectly classified (and, I would maintain,
    precisely some of the linguistically most interesting cases; although, I
    must admit, for many practical purposes this probably *would* be great or at
    least useful).
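
    (Just to spell out the arithmetic in that last point, here is a
    back-of-the-envelope sketch in Python; the corpus size and accuracy figures
    are only the illustrative ones from the paragraph above.)

        # How many tokens does an N% 'correct' tagger still get wrong
        # on a 100-megaword corpus?
        corpus_size = 100_000_000
        for accuracy in (0.97, 0.99, 0.999):
            errors = corpus_size * (1 - accuracy)
            print(f"{accuracy:.1%} correct -> {errors:,.0f} mistagged tokens")
        # e.g. 99.0% correct -> 1,000,000 mistagged tokens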

    One final point about compression of NL data. No practicing linguist can
    have failed to notice that *none* of the rules (again: or the
    equivalent--from here on, I will just talk about 'rules', with this
    parenthesis understood) which they have come up with is without exception. I
    can't think of a single rule I have ever come into contact with that doesn't
    have some exceptions (eg, linguists marvel over the description of a
    language [I forget which one] whose *only* irregular verb is 'to be'--but
    there *is* still one). Virtually all languages with any conjugation at all
    have at least a few deponent verbs. Etc.

    As Halle long ago remarked, however, exceptions may prove (ie, test) the
    rule, but they don't invalidate it. They may either indicate the necessity
    of a reformulation of the rule (remember Verner's Law, still famous after
    some 130 years as a 'correction' to Grimm's Law [actually, this should
    probably be: Grimms' Law], one of the most famous rules in linguistics), or
    they may be *real exceptions*, which all practicing linguists know really
    *do* exist. Our aim as analysts of language is to throw out the bathwater
    (the detritus of Verner's Law, eg) while keeping the baby (the rule: Grimm's
    Law), while still permitting true exceptions (eg, here, some onomatopoetic
    words, but also a few 'normal' words). Now, the description in the previous
    sentence is actually in the *best* of circumstances (eg, right after the
    'completion' of Grimm's Law). Later borrowings, analogic creations, etc. can
    further screw up the system, and in some cases (eg, English fricative
    voicing) demolish or radically restructure parts of the system. But ya gotta
    keep the baby!

    In a different vein, socioLINGUISTICS (in the sense where rules spread
    geographically and/or socially and/or partially [with respect to features,
    eg]; along with markedness) has allowed linguists to nuance the possible
    implementations of rules. Eg, with respect to the partiality of Grimm's Law,
    this lets us understand the so-called Rhenish Fan.

    At one point, Rob says:

    > Knowing this means we can find the abstraction (grammar) relevant to
    > any purpose we choose: make a given parsing decision, agree on the
    > significance of a word in a given context. Not knowing this means
    > we constantly swim around trying to find a single abstraction to fit
    > every purpose, and fail (we end up with "fuzzy" categories.)

    Well, I guess I have to admit that nearly all linguistic categories are
    fuzzy. That is decidedly *not*, however, a research strategy. The *only*
    reasonable (ie, scientific, I would say) research strategy is to always
    assume that any hypothesized categories are strict (yes or no) and see what
    that produces as results. If those results are unacceptable or
    contradictory, we should still, IMHO, carry on to the bitter end before
    backing up, because some of the further consequences of unacceptable
    conclusions may be enlightening in future research. Of course, since
    False(1) implies False(2), we have no permanent results yet, but still they
    may be useful in the future. And now you can see why I have never won the
    Nobel Prize (aside from the fact that I'm a linguist). To get back to the
    point, having discovered cases which apparently may fit in either of the
    hypothesized categories, there are still several options before accepting
    fuzzy categories, however conceptually appealing these latter may seem to
    be. For one, we may have missed a category (eg, if Adjectives sometimes
    behave as Verbs and sometimes as Nouns, it may indicate that these latter
    two categories are 'fuzzy'; or it may indicate that we need a further
    category Adjective; or it may indicate that we need a breakdown into some
    sort of Distinctive Features: Verb = [+verb, -noun]; Noun =
    [-verb, +noun]; Adjective = [+verb, +noun]. This last possibility, however,
    itself automatically produces further corollaries, eg that there should
    exist another category [-verb, -noun]. Now in turn, this could be, say,
    Adverb; or it could imply a hierarchically superior category [+/-Major
    Class], with [+Major Class] subdivided by the [of course, I am assuming
    binary features, a whole nother discussion] features [+/-verb] and
    [+/-noun], and [-Major Class] covering everything else: so-called function
    words, prepositions, markers, interjections {yes, Virginia, this *is* a real
    *linguistic* category!}, clitics, etc.) (bet you thought I'd forgotten that
    closing parenthesis).
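
    (For concreteness, here is a toy sketch of that kind of feature
    decomposition in Python; the category and feature names are just my own
    illustrative choices, not any standard inventory.)

        # Major-class categories as bundles of binary features, along the lines of
        # Verb = [+verb, -noun], Noun = [-verb, +noun], Adjective = [+verb, +noun].
        CATEGORIES = {
            "Verb":      {"major": True,  "verb": True,  "noun": False},
            "Noun":      {"major": True,  "verb": False, "noun": True},
            "Adjective": {"major": True,  "verb": True,  "noun": True},
            "Adverb":    {"major": True,  "verb": False, "noun": False},  # one way to fill [-verb, -noun]
            "Particle":  {"major": False},  # [-Major Class]: function words, clitics, interjections, ...
        }

        def agree_on(cat1, cat2, feature):
            """Do two categories share a value for a feature (when both specify it)?"""
            v1, v2 = CATEGORIES[cat1].get(feature), CATEGORIES[cat2].get(feature)
            return v1 is not None and v1 == v2

        # Adjectives pattern with Verbs on [verb] and with Nouns on [noun]:
        print(agree_on("Adjective", "Verb", "verb"))   # True
        print(agree_on("Adjective", "Noun", "noun"))   # True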

    Anyway, every hypothesis leads to further hypotheses, back to corrections,
    on to further hypotheses, etc. Likewise, this is a cooperative enterprise.
    That's why we rejoice when our hypotheses get shot down, either by ourselves
    (best case scenario, obviously, and, after all, what it's our obligation to
    try to do)
    or by others (thanks, guys). In the latter case, at least we know that
    someone is reading our work.

    OK. That's my story and I'm sticking to it (that's what happens when you let
    Old Dogs into the list!).

    Jim

    James L. Fidelholtz
    Posgrado en Ciencias del Lenguaje, ICSyH
    Benemérita Universidad Autónoma de Puebla MÉXICO

    Rob Freeman wrote:

    ...


