Re: [Corpora-List] Auto-generation and how to spot it

From: Sravana Reddy (sravana.reddy@gmail.com)
Date: Tue Nov 14 2006 - 00:04:46 MET

    Believe it or not, that spam was _not_ artificially generated! At least at
    the sentence level. All the individual sentences are from
    http://kfba.net/Forums/. The only randomness there is the selection and
    order of the sentences.

    That aside, your question is very interesting. I would guess that an
    artificially generated text has greater entropy than a human-generated
    sample. So, perhaps you could train a reasonable-order Markov model on some
    specialized corpus (sports discussion, in this case), and measure the
    redundancy of the test sample against that.
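
    A minimal sketch of that idea, assuming a word-level bigram (first-order
    Markov) model with add-one smoothing. The order, the smoothing and the
    names tokenize, train_bigram and cross_entropy are illustrative choices,
    not something fixed by the suggestion above. It reports cross-entropy in
    bits per word, so a higher score means the sample is less predictable
    (less redundant) with respect to the training corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real setup would handle punctuation.
    return text.lower().split()


def train_bigram(tokens):
    # Count how often each word is followed by each other word.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return context_counts, bigram_counts, vocab


def cross_entropy(tokens, model):
    # Average negative log2 probability per bigram transition in the sample.
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot for unseen words
    total, n = 0.0, 0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams get nonzero mass.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        total -= math.log2(p)
        n += 1
    return total / n if n else float("nan")
```

    Train on the specialized corpus, then compare the score of the suspect
    text with scores of known human-written samples from the same domain.
    Note that a model of this kind only sees word-to-word transitions, so
    text built by shuffling whole authentic sentences would look unusual
    only at the sentence boundaries.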

    Sravana

    On 11/13/06, Lou Burnard <lou.burnard@computing-services.oxford.ac.uk>
    wrote:
    >
    > "My eyes tell me that there are fabulous talents in every decade,
    > including this one. You have to remember where these young guys were
    > picked. You know things are different when there's a press seat
    > assigned to someone representing lebronjames. Like many sports, you are
    > going to have writers who are too close to the teams they cover and
    > writers who aren't."
    >
    >
    > This is the start of a spam which I (and presumably several thousand
    > other people) just received. My suspicion is that the text has been
    > automatically generated from a reasonably large corpus of authentic
    > email material (in this case, presumably, from some collection of sports
    > writing). The interesting question for this list is: how do I know it's
    > artificially generated? I'm guessing that the lack of coherence has
    > something to do with it, but what are the factors which indicate that?
    > And how much text would you need to scan before determining that there
    > was no natural coherence amongst its components?
    >
    > It's a question that several spam filter makers would probably pay good
    > money for an answer to.
    >
    >
    >


