Corpora: learning regular expressions: responses

From: Chapman, Wendy (chapman@cbmi.upmc.edu)
Date: Tue Dec 12 2000 - 15:27:39 MET

    Dear Corpora members,

    Thank you for the responses to my question about regular expression
    learning, which I posted a few weeks ago. I have included all the
    responses I received on the subject.

    Wendy Webber Chapman

    ____________________________________________________________________________

    Stephen Soderland's system WHISK learns regular expressions for
    information extraction. It's implemented in Perl. He's published an
    article on it in Machine Learning and had a paper at KDD. You could get
    more information from either of those sources.

    Mary Elaine Califf

    _________________________________________________________________________
      You can download a Unix version of Brill's original POS tagger, which
    uses the same TBL algorithm that he modified in his paper at EMNLP. I was
    very excited by this most recent version of the TBL algorithm, so I can
    understand your interest in it.
     Try:

    http://www.cs.jhu.edu/~brill/code.html
     or if you prefer ftp:

    ftp://ftp.cs.jhu.edu/pub/brill/Programs/
      There is a port of the Brill tagger to Windows, done by some French
    folks, but it never worked well for me, and it doesn't come with source
    code. The original, on the other hand, is open-source software, written
    in C, and it depends pretty heavily on the Unix OS for memory management.
     Most folks using this and similar algorithms are looking for high
    precision and recall rates for POS tagging or parsing, and are rarely
    eager to take on ambiguities, except insofar as a parser will
    give good rankings of ambiguous parses.
      Good luck!

    -Mike

       vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >< Michael O'Connell ><
    >< http://ucsu.colorado.edu/~oconnelm ><
    >< University of Colorado - Boulder ><
    >< CB 295 Boulder, CO 80309 ><
    >< Hellems 285 303.492.1623 ><
       vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    ____________________________________________________________________________

    Regular expression pattern learning has been fairly well-mined by the
    'formal' machine learning community. You might want to look, e.g., at
    the (old!) paper by Sam Pilato and me in the J. Machine Learning, 1985,
    where we used a method developed by Dana Angluin at Yale - it essentially
    does what's called k-tail merger of the finite-state equivalence classes.
    Alas, our very old Lisp implementation is no longer around, though the
    paper has pseudo-code that should suffice. You might want to track down Sam
    Pilato.
    This is a restrictive variant of an approach that was, to the best of my
    knowledge, employed by (even older!) work in the 60s by Solomonoff and
    many others to learn reg-exps. Angluin has some nice formal results on
    the difference in computational complexity betw. learning reg.
    expressions vs. fsa's, etc.
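
    [Editorial illustration, not part of the original message.] The k-tail
    merging idea mentioned above can be sketched roughly as follows. This is
    a generic k-tails state-merging sketch, not the algorithm from the paper
    cited in this message. It assumes only positive example strings are
    available, and the function names k_tails and accepts are hypothetical.
    Two prefix-tree states are merged when the sets of accepted suffixes of
    length at most k that follow them are identical.

```python
from collections import defaultdict


def k_tails(samples, k):
    """Build a (possibly nondeterministic) automaton by k-tails merging.

    Assumption: `samples` holds positive examples only.
    Returns (start_state, transitions, accepting_states).
    """
    samples = set(samples)
    # States of the prefix tree: every prefix of every sample, including "".
    prefixes = {s[:i] for s in samples for i in range(len(s) + 1)}

    def tail(prefix):
        # The k-tail: suffixes of length <= k that complete `prefix`
        # into one of the samples.
        return frozenset(
            s[len(prefix):]
            for s in samples
            if s.startswith(prefix) and len(s) - len(prefix) <= k
        )

    # Prefixes with identical k-tails share one merged state id.
    ids = {}
    state_of = {}
    for p in sorted(prefixes):
        state_of[p] = ids.setdefault(tail(p), len(ids))

    # Each prefix-tree edge p[:-1] --p[-1]--> p becomes an edge between
    # the merged states of its endpoints.
    transitions = defaultdict(set)
    for p in prefixes:
        if p:
            transitions[(state_of[p[:-1]], p[-1])].add(state_of[p])

    accepting = {state_of[s] for s in samples}
    return state_of[""], dict(transitions), accepting


def accepts(automaton, string):
    """Simulate the merged automaton as an NFA on `string`."""
    start, transitions, accepting = automaton
    current = {start}
    for ch in string:
        current = {t for s in current for t in transitions.get((s, ch), ())}
    return bool(current & accepting)


machine = k_tails(["ab", "abab", "ababab"], k=1)
print(accepts(machine, "abababab"))  # a string not among the samples
print(accepts(machine, "aba"))
```

    With these three samples and k=1, the merged machine has a loop, so it
    generalizes from the examples to any non-empty repetition of "ab".
    Larger k merges fewer states and generalizes more conservatively.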
    Hope this is of some help,
    Best regards, Prof. Bob Berwick
    Professor Robert C. Berwick
    [berwick@ai.mit.edu]

    ____________________________________________________________________________

    If I understand what you need, maybe we have something useful for you.
    We have an algorithm (LocalMaxs) that extracts multi-word units from
    text of any language. For example: Human Rights, Universal Declaration
    of Human Rights, as soon as possible, plus au moins, raining cats and
    dogs, Yasser Arafat, Issac Rabin, etc.

    Joaquim Ferreira da Silva
    jfs@di.fct.unl.pt


