Re: [Corpora-List] ANC Bigrams and Trigrams

From: Nicolas Hernandez (nicolas.hernandez@gmail.com)
Date: Mon Feb 14 2005 - 14:16:37 MET


    On Fri, 11 Feb 2005 14:42:18 -0500, Nancy Ide <ide@cs.vassar.edu> wrote:
    > We are generating bigram and trigram data from the ANC First Release,
    > which will very soon be available on the (new and improved) ANC
    > website. We have a question for those who might be interested in this
    > kind of data: is it useful to generate the data for word pairs/triples
    > that span sentence (or even paragraph) boundaries? Is there any
    > advantage if we provide two sets of the bigram and trigram data, one
    > that spans such boundaries and one that doesn't?

    Dear Nancy,

    Personally I have used n-grams to extract "meta-discourse expressions"
    (basically frequent n-grams occurring in a corpus of a specific
    genre). I was interested in punctuation marks, because they could give
    me some contextual indications that could be used to select these
    expressions.
    For example:
    "in this section" could have a different discourse interpretation at
    the start (". In this section") and at the end of a sentence ("in this
    section .") (depending on text genre).

    In my opinion, having such n-grams (ones that keep the punctuation at
    sentence boundaries) makes the statistical measures more accurate.
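
    A minimal Python sketch of this idea follows. It is illustrative only:
    the regular-expression tokenizer, the function names (tokenize,
    ngrams, count_ngrams) and the sample text are assumptions made up for
    the example, not part of any ANC tool. It counts n-grams once with
    punctuation kept as tokens and once with punctuation dropped, so the
    sentence-initial and sentence-final uses of "in this section" can be
    told apart in the first case and are merged in the second.

```python
import re
from collections import Counter

# Punctuation marks kept as separate tokens (an assumed, simplified set).
PUNCT = {".", "!", "?", ";", ":", ","}


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[.!?;:,]", text.lower())


def ngrams(tokens, n):
    """Yield every contiguous n-token tuple from the token list."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, n, keep_punct=True):
    """Count n-grams over the whole text, crossing sentence boundaries.

    With keep_punct=True the boundary punctuation is part of the n-grams;
    with keep_punct=False it is removed before the n-grams are built.
    """
    tokens = tokenize(text)
    if not keep_punct:
        tokens = [t for t in tokens if t not in PUNCT]
    return Counter(ngrams(tokens, n))


sample = (
    "In this section we describe the corpus. "
    "The results are given in this section. "
    "In this section we also discuss errors."
)

with_punct = count_ngrams(sample, 4, keep_punct=True)
without_punct = count_ngrams(sample, 3, keep_punct=False)

# Sentence-initial and sentence-final uses are distinct 4-grams.
print(with_punct[(".", "in", "this", "section")])
print(with_punct[("in", "this", "section", ".")])
# Without punctuation, all uses collapse into one trigram.
print(without_punct[("in", "this", "section")])
```

    Note that the very first sentence has no preceding "." token, so a
    real implementation would probably add an explicit start-of-text
    marker; that is left out here to keep the sketch short.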

    /Nicolas

    >
    > Thanks,
    > Nancy Ide
    >
    > =======================================================
    >
    > Nancy Ide
    >
    > Professor of Computer Science
    > Vassar College
    > Poughkeepsie, NY 12604-0520 USA
    > Tel: +1 845 437-5988 Fax: +1 845 437-7498
    > ide@cs.vassar.edu
    >
    > Chercheur Associe
    > Equipe Langue et Dialogue, LORIA/CNRS
    > Campus Scientifique - BP 239
    > 54506 Vandoeuvre-les-Nancy FRANCE
    > Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    > ide@loria.fr
    >
    > =======================================================
    >
    >

    -- 
    Nicolas Hernandez
    LIR - LIMSI
    BP 133, 91403 Orsay Cedex
    tel. 01 69 85 80 03, fax 01 69 85 80 88
    IIE - CNAM
    tel. 01 69 36 73 48
    


