[Corpora-List] Encoding of apostrophes and ... CLEANEVAL ADVANCE WARNING

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Wed Jul 05 2006 - 18:12:56 MET DST

                ===============================================
                         ADVANCE WARNING: CLEANEVAL
                ===============================================

    We agree fully with John Sowa's and others' comments about the difficulty
    of getting to a good "plain text" from an arbitrary web page.

    Moreover, we believe solving this problem is critical to progress in NLP.
    On this list, we don't need to rehearse the argument that more data gives
    better results. The obvious place to go for 'more data' is usually the web.
    But if the web text is dirty, everything suffers. So, we need to get good
    at cleaning web text.

    Under the auspices of SIGWAC (ACL SIG on Web as Corpus) we are planning a
    shared task / competitive evaluation on text cleaning - CLEANEVAL.

    We are planning to work on English and Chinese. If others are interested
    in contributing, particularly by organising a task for some other language,
    and/or find these questions provocative:

         1. what tools do you need to convert a terabyte of data for
            language X into a BNC?
         2. how do we know when we have succeeded?

    then do join the brand new CLEANEVAL mailing list:

    http://devel.sslmit.unibo.it/mailman/listinfo/sigwac

            CLEANEVAL Co-Ordinators
                    Marco Baroni
                    Adam Kilgarriff
                    Serge Sharoff

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of John F. Sowa
    Sent: 04 July 2006 21:13
    To: Corpora List
    Subject: Re: [Corpora-List] Encoding of apostrophes and quotes

    All this variability in how people use apostrophes and
    punctuation of any kind proves one very important point:
    no matter how systematic, expressive, and logical any
    system of encoding or tagging may be, people are going
    to do whatever they damn well please.

    Anybody who has ever tried to parse ordinary NL prose --
    even supposedly well-edited prose -- knows that punctuation
    is highly unreliable. It's useful to consider it, but
    only as one among many possibly contradictory sources of
    information about the structure of a text.

    Tagging a text correctly (according to some set of rules)
    is harder than punctuating it correctly. If people aren't
    very good at punctuation, I seriously doubt that they'll
    be any better at tagging.

    John Sowa


