Re: [Corpora-List] Auto-generation and how to spot it

From: Yorick Wilks (yorick@dcs.shef.ac.uk)
Date: Mon Nov 13 2006 - 13:30:58 MET


    I've had more than one student do a project on spam generated from
    parts of coherent text, i.e. the job was to detect incoherence (usually
    a topic/vocabulary shift across sentence boundaries that is larger
    than in control text). Such methods usually work well at the 95% level
    and could easily be put in filters if anyone wanted to.
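
    A minimal sketch of this kind of check, not the students' actual
    code: it measures content-word overlap between adjacent sentences and
    flags a text whose average overlap falls well below that of control
    text. The stopword list, the Jaccard measure, the function names and
    the 0.5 ratio are all assumptions chosen for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; a real filter would use a fuller one).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "with",
    "is", "are", "was", "were", "be", "it", "that", "this", "you", "my",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def content_words(sentence):
    """Lower-cased word set of a sentence with stopwords removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_adjacent_overlap(text):
    """Average vocabulary overlap across sentence boundaries.

    Returns None when the text has fewer than two sentences.
    """
    sets = [content_words(s) for s in split_sentences(text)]
    if len(sets) < 2:
        return None
    pairs = list(zip(sets, sets[1:]))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def looks_incoherent(text, control_texts, ratio=0.5):
    """Flag text whose overlap is below `ratio` times the control average.

    `ratio` is a hypothetical tuning parameter, not a published value.
    """
    control = [s for s in map(mean_adjacent_overlap, control_texts) if s is not None]
    if not control:
        raise ValueError("need at least one control text with two or more sentences")
    baseline = sum(control) / len(control)
    score = mean_adjacent_overlap(text)
    if score is None:
        return False  # too short to judge
    return score < ratio * baseline
```

    In practice the control texts would be a sample of genuine mail or
    prose of the same genre, and the threshold would be tuned on labelled
    data rather than fixed by hand.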
    Yorick Wilks

    On 13 Nov 2006, at 12:06, Lou Burnard wrote:

    > "My eyes tell me that there are fabulous talents in every decade,
    > including this one. You have to remember where these young guys were
    > picked. You know things are different when there's a press seat
    > assigned to someone representing lebronjames. Like many sports, you
    > are going to have writers who are too close to the teams they cover
    > and writers who aren't."
    >
    >
    > This is the start of a spam which I (and presumably several thousand
    > other people) just received. My suspicion is that the text has been
    > automatically generated from a reasonably large corpus of authentic
    > email material (in this case, presumably, from some collection of
    > sports writing). The interesting question for this list is: how do I
    > know it's artificially generated? I'm guessing that the lack of
    > coherence has something to do with it, but what are the factors which
    > indicate that? And how much text would you need to scan before
    > determining that there was no natural coherence amongst its
    > components?
    >
    > It's a question that several spam filter makers would probably pay
    > good money for an answer to.
    >
    >


