[Corpora-List] Auto-generation and how to spot it

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Mon Nov 13 2006 - 13:06:52 MET

    "My eyes tell me that there are fabulous talents in every decade,
    including this one. You have to remember where these young guys were
    picked. You know things are different when there's a press seat
    assigned to someone representing lebronjames. Like many sports, you are
    going to have writers who are too close to the teams they cover and
    writers who aren't."

    This is the start of a spam message which I (and presumably several thousand
    other people) just received. My suspicion is that the text has been
    automatically generated from a reasonably large corpus of authentic
    email material (in this case, presumably, from some collection of sports
    writing). The interesting question for this list is: how do I know it's
    artificially generated? I'm guessing that the lack of coherence has
    something to do with it, but what are the factors which indicate that?
    And how much text would you need to scan before determining that there
    was no natural coherence amongst its components?
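
    One crude way to make "lack of coherence" measurable, offered only as
    a sketch and not as a tested spam detector: score the overlap in
    content words between each pair of adjacent sentences. Everything named
    here (the stopword list, the naive sentence splitter, the choice of
    Jaccard overlap) is an assumption made for illustration. Genuine prose
    can also score low on this measure, so it indicates at most one factor.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "is", "are",
             "you", "me", "my", "this", "there", "who", "they", "for", "when",
             "have", "were", "these", "where", "one", "too", "like"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence):
    """Set of lowercased alphabetic tokens, minus stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjacent_overlap(text):
    """Overlap score for each pair of neighbouring sentences."""
    sents = [content_words(s) for s in split_sentences(text)]
    return [jaccard(x, y) for x, y in zip(sents, sents[1:])]


def mean_overlap(text):
    """Average adjacent-sentence overlap; 0.0 for fewer than two sentences."""
    scores = adjacent_overlap(text)
    return sum(scores) / len(scores) if scores else 0.0
```

    Calling mean_overlap() on the quoted passage and on a comparable
    stretch of genuine sports writing would give two numbers to compare.
    How long a text must be before such a score is stable is exactly the
    open question.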

    It's a question that several spam filter makers would probably pay good
    money for an answer to.
