Re: [Corpora-List] Query on the use of Google for corpus research

From: Philip Resnik (resnik@umiacs.umd.edu)
Date: Thu Jun 02 2005 - 14:06:04 MET DST

    > Your tools sound really interesting, and in part similar to what we are
    > developing/adapting. Is anything (besides GATES, of course) publicly
    > available?

    Marco and Nancy, we are soon (within a month or two) going to be doing
    an open source release of the codebase for the Linguist's Search
    Engine (LSE, http://lse.umiacs.umd.edu). Although the LSE does not
    currently do some of the Web page processing you're describing, other
    aspects of its architecture might be useful to you or others.

    The LSE currently piggybacks on Altavista, rather than doing its own
    crawling. Its facility for building custom collections currently
    includes the retrieval of pages, extraction of text from HTML,
    sentence breaking, tokenization, POS tagging, parsing, and indexing
    sentences by their syntactic structure. The architecture is highly
    modular and it's easy to add new annotation modules and to configure
    dependencies between modules (e.g., adding a parser that requires
    POS-tagged input). The LSE is designed so that the processing of
    collected pages takes place in parallel on as many machines as you'd
    like. Annotation processes run concurrently as pages are added to the
    collection -- i.e. you are processing the pages, including indexing
    and making material searchable, while crawling is still taking place,
    and you can distribute multiple copies of the annotation processes on
    a computing cluster.
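
    To make the idea of configurable dependencies between annotation
    modules concrete, here is a minimal Python sketch. It is not the LSE
    code; the module names and the DEPENDENCIES table are hypothetical,
    chosen only to mirror the stages listed above. It assumes Python 3.9
    or later for the standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical module names (not taken from the LSE codebase).
# Each module maps to the set of modules whose output it requires.
DEPENDENCIES = {
    "html_to_text": set(),
    "sentence_break": {"html_to_text"},
    "tokenize": {"sentence_break"},
    "pos_tag": {"tokenize"},
    "parse": {"pos_tag"},   # the parser requires POS-tagged input
    "index": {"parse"},
}


def run_order(dependencies):
    """Return an order in which every module runs after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def add_module(dependencies, name, requires):
    """Return a copy of the dependency table with one extra module registered."""
    extended = {mod: set(deps) for mod, deps in dependencies.items()}
    extended[name] = set(requires)
    return extended


order = run_order(DEPENDENCIES)
print(order)
```

    Under these assumptions, adding a new annotator amounts to one extra
    table entry, e.g. add_module(DEPENDENCIES, "chunk", {"pos_tag"}); the
    ordering then guarantees it runs only after POS tagging has finished.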

    It is very simple to modify the LSE code to draw from other Web
    sources (certainly anything available via a CPAN WWW::Search module),
    and although we do not identify tables, headers and footers, etc., I'm
    sure the architecture would be flexible enough to add that sort of
    functionality as part of the document-level processing before sentence
    identification and sentence-level processing take place.
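
    As an illustration of what such a document-level step might look
    like, here is a small Python sketch that drops lines recurring across
    most pages of a collection (a crude stand-in for header and footer
    detection) before any sentence-level processing. This is an assumed
    heuristic, not something the LSE does; the function name and the
    min_fraction parameter are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.5):
    """Drop lines that occur on more than min_fraction of the pages.

    pages is a list of plain-text page bodies. Lines shared by many pages
    are treated as likely headers or footers. Blank lines are also dropped.
    """
    # Count, for each distinct non-blank line, how many pages contain it.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = min_fraction * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and counts[line.strip()] <= threshold
        ]
        cleaned.append("\n".join(kept))
    return cleaned


sample_pages = [
    "Site Nav\nFirst body.\n(c) 2005",
    "Site Nav\nSecond body.\n(c) 2005",
    "Site Nav\nThird body.\n(c) 2005",
]
cleaned_pages = strip_repeated_lines(sample_pages)
print(cleaned_pages)
```

    A step like this would sit between text extraction and sentence
    breaking, so that later modules never see the repeated material.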

    The resulting database and index are, of course, well suited to the
    kinds of lexically and syntactically driven searching the LSE was
    designed to support, including a very linguist-friendly user
    interface. But I expect the software could easily be adapted for
    other purposes, and we're hoping the open source release will make it
    easy for people to develop their own variations of linguistic search.

    Best,

      Philip

    P.S. The LSE will appear in the demo session at the upcoming ACL
    conference. Perhaps we'll get a chance to talk in person there!

      ----------------------------------------------------------------
      Philip Resnik, Associate Professor
      Department of Linguistics and Institute for Advanced Computer Studies

      1401 Marie Mount Hall UMIACS phone: (301) 405-6760
      University of Maryland Linguistics phone: (301) 405-8903
      College Park, MD 20742 USA Fax: (301) 314-2644 / (301) 405-7104
      http://umiacs.umd.edu/~resnik E-mail: resnik@umiacs.umd.edu


