Corpora: New Parser

From: Gojol (gojol@sunu.rnc.ro)
Date: Wed Dec 06 2000 - 14:51:00 MET


       Dear Colleagues,

       Those interested in a new parser (based on an original
    philosophy), briefly introduced below, are invited to
    contact me personally (gojol@sunu.rnc.ro). Any suggestions,
    comparisons with existing parsers, etc. will be welcome.
    Thank you,
                        Vlad V. Gojol

    ............................................................

       After learning from a 46,000-word POS-tagged corpus and
    a 32,000-word parsed (treebank) corpus, the parser processes
    a 2,000-word text (not included in either corpus) in 18
    seconds (tagging excluded) on a 200 MHz machine, with 4%
    incomplete trees. Even for these declared failures, it
    provides well-formed trees sufficient for a subsequent
    translator. The extracted grammar has approximately 12,000
    rules. The Negra corpus of German is used.
       In a second test, the parser learns from a 17,000-word
    parsed corpus and from the same 46,000-word POS-tagged one.
    The 2,000-word test text is included in the first (but
    excluded from the second), which guarantees that the grammar
    is complete relative to it (i.e. contains all the rules
    necessary for its correct parsing). This text is processed
    in 4 seconds with no incomplete trees; the extracted grammar
    has approximately 7,000 rules. Parsing is 2-3 times slower
    on the English corpus Susanne. The system is language
    independent, with wide character support.
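
    For readers unfamiliar with grammar extraction from a
    treebank, the general idea can be sketched as below. This
    is only an illustration and not the parser announced here:
    the bracketed tree format, the function names
    (parse_bracketed, count_rules, extract_grammar) and the
    relative-frequency weighting are all assumptions made for
    the example. POS tags over words are skipped, mirroring the
    "tagging excluded" setting.

```python
from collections import Counter, defaultdict

def parse_bracketed(text):
    """Parse one bracketed tree, e.g. "(S (NP (NN dogs)) (VP (VBP bark)))",
    into nested (label, children) tuples; leaf words stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def count_rules(tree, counts):
    """Count one rule per phrasal node: parent label -> child labels."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:  # pre-terminals (a tag directly over a word) are skipped
        counts[(label, tuple(c[0] for c in subtrees))] += 1
        for c in subtrees:
            count_rules(c, counts)

def extract_grammar(treebank):
    """Return {(lhs, rhs): relative frequency of rhs among expansions of lhs}."""
    counts = Counter()
    for line in treebank:
        count_rules(parse_bracketed(line), counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy two-sentence treebank (hypothetical data, for illustration only).
treebank = [
    "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))",
    "(S (NP (NN dogs)) (VP (VBP bark)))",
]
grammar = extract_grammar(treebank)
```

    On this toy input the rule S -> NP VP is the only expansion
    of S, while NP has two expansions seen once each, so each
    receives half of the weight for NP.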
       The parser can accept a set of rules intended to refine
    the statistical grammar deduced from the corpus. Moreover,
    it can take a context-free grammar as its only input (in
    which case it ceases to be a statistical parser), but in
    this operating mode it requires much time and memory (during
    learning, not during parsing as such) if the grammar is
    oversized. The statistical grammar is refined not by simply
    adding the proposed rules, but by modifying the corpus, so
    as to exploit all the real contexts possible for them.
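
    One possible reading of "refining by modifying the corpus"
    is sketched below. It is an interpretation for illustration
    only, not the method actually implemented: a proposed rule
    is applied to every place in the treebank where its
    right-hand side occurs as adjacent sister nodes, and the
    rule counts are then taken again from the modified trees.
    The tree representation, the function names (apply_rule,
    count_rules) and the example rule NBAR -> JJ NN are all
    hypothetical.

```python
from collections import Counter

# Trees are (label, [children]) for phrases and (tag, word) for pre-terminals.

def apply_rule(tree, lhs, rhs):
    """Return a copy of `tree` in which every adjacent run of sister nodes
    labelled `rhs` is grouped under a new node labelled `lhs`."""
    label, children = tree
    if isinstance(children, str):  # pre-terminal: nothing to rewrite
        return tree
    kids = [apply_rule(c, lhs, rhs) for c in children]
    out, i, n = [], 0, len(rhs)
    while i < len(kids):
        window = tuple(k[0] for k in kids[i:i + n])
        # Skip nodes that already are exactly lhs -> rhs, to avoid re-wrapping.
        if window == rhs and not (label == lhs and len(kids) == n):
            out.append((lhs, kids[i:i + n]))
            i += n
        else:
            out.append(kids[i])
            i += 1
    return (label, out)

def count_rules(tree, counts):
    """Count one rule per phrasal node: parent label -> child labels."""
    label, children = tree
    if isinstance(children, str):
        return
    counts[(label, tuple(c[0] for c in children))] += 1
    for c in children:
        count_rules(c, counts)

# Toy tree (hypothetical data) with a flat noun phrase.
tree = ("S", [
    ("NP", [("DT", "the"), ("JJ", "old"), ("NN", "dog")]),
    ("VP", [("VBZ", "barks")]),
])

before = Counter()
count_rules(tree, before)

refined = apply_rule(tree, "NBAR", ("JJ", "NN"))
after = Counter()
count_rules(refined, after)
```

    In this sketch the flat rule NP -> DT JJ NN disappears from
    the counts and is replaced by NP -> DT NBAR and
    NBAR -> JJ NN, so the proposed rule is weighted by the
    contexts in which it really occurs rather than by an
    arbitrary value.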
       


