Re: [Corpora-List] Auto-generation and how to spot it

From: Diana Maynard (d.maynard@dcs.shef.ac.uk)
Date: Mon Nov 13 2006 - 13:31:58 MET


    In general I've noticed that the subject header bears no relation at
    all to the email content, which could be a useful indicator. Of course,
    genuine emails often suffer from this problem too, when people reply
    to messages and gradually change tack without changing the subject
    header. In that case, though, you generally get some pasting of the
    message to which they're replying. I've never yet seen that in a spam
    mail; I assume this is because the content of the spam is pasted from a
    web corpus rather than an email corpus.
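
As a rough illustration of the two cues above (subject/body mismatch, and the absence of quoted reply text), here is a minimal Python sketch. The function names, the tiny stopword list and the 0.2 threshold are all assumptions made up for this example, not anything a real spam filter is known to use, and plain word overlap is only a crude stand-in for "relation to the content".

```python
import re

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "it", "re", "fwd", "how", "this", "that", "with"}

def content_words(text):
    """Return the set of lowercased alphabetic tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def subject_overlap(subject, body):
    """Fraction of the subject's content words that also occur in the body."""
    subj = content_words(subject)
    if not subj:
        return 0.0
    return len(subj & content_words(body)) / len(subj)

def has_quoted_reply(body):
    """True if any line starts with '>', the usual reply-quoting prefix."""
    return any(line.lstrip().startswith(">") for line in body.splitlines())

def looks_suspicious(subject, body, threshold=0.2):
    """Flag mail whose subject shares little with the body and quotes nothing.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    return subject_overlap(subject, body) < threshold and not has_quoted_reply(body)
```

A genuine reply that has drifted off topic would still tend to pass, because the quoted lines alone stop it from being flagged; a message would need both a mismatched subject and no quoted material to be marked.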
    Diana

    Lou Burnard wrote:
    > "My eyes tell me that there are fabulous talents in every decade,
    > including this one. You have to remember where these young guys were
    > picked. You know things are different when there's a press seat
    > assigned to someone representing lebronjames. Like many sports, you
    > are going to have writers who are too close to the teams they cover
    > and writers who aren't."
    >
    >
    > This is the start of a spam which I (and presumably several thousand
    > other people) just received. My suspicion is that the text has been
    > automatically generated from a reasonably large corpus of authentic
    > email material (in this case, presumably, from some collection of
    > sports writing). The interesting question for this list is: how do I
    > know it's artificially generated? I'm guessing that the lack of
    > coherence has something to do with it, but what are the factors which
    > indicate that? And how much text would you need to scan before
    > determining that there was no natural coherence amongst its components?
    >
    > It's a question that several spam filter makers would probably pay
    > good money for an answer to.
    >
    >


