Re: [Corpora-List] Query on the use of Google for corpus research

From: Philip Resnik (resnik@umiacs.umd.edu)
Date: Thu Jun 02 2005 - 14:06:04 MET DST

Next message: Marco Baroni: "[Corpora-List] near duplicate detection"

Previous message: Linda Bawcom: "Re: [Corpora-List] Query on the use of Google for corpus research"
In reply to: Marco Baroni: "Re: [Corpora-List] Query on the use of Google for corpus research"

> Your tools sound really interesting, and in part similar to what we are
> developing/adapting. Is anything (besides GATES, of course) publicly
> available?

Marco and Nancy, we are soon (within a month or two) going to be doing
an open source release of the codebase for the Linguist's Search
Engine (LSE, http://lse.umiacs.umd.edu). Although the LSE does not
currently do some of the Web page processing you're describing, other
aspects of its architecture might be useful to you or others.

The LSE currently piggybacks on AltaVista, rather than doing its own
crawling. Its facility for building custom collections currently
includes the retrieval of pages, extraction of text from HTML,
sentence breaking, tokenization, POS tagging, parsing, and indexing
sentences by their syntactic structure. The architecture is highly
modular and it's easy to add new annotation modules and to configure
dependencies between modules (e.g., adding a parser that requires
POS-tagged input). The LSE is designed so that the processing of
collected pages takes place in parallel on as many machines as you'd
like. Annotation processes run concurrently as pages are added to the
collection -- i.e. you are processing the pages, including indexing
and making material searchable, while crawling is still taking place,
and you can distribute multiple copies of the annotation processes on
a computing cluster.
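
As a rough illustration of the module-dependency idea above (not LSE
code): a minimal Python sketch in which every name (tokenize, pos_tag,
parse, DEPENDS_ON, annotate_all) is hypothetical and the "tagger" and
"parser" are toy stand-ins. It assumes modules declare what they
depend on, derives a run order from those declarations, and annotates
several pages concurrently with a thread pool standing in for a
cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical annotation modules: each adds one layer to the document dict.
def tokenize(doc):
    doc["tokens"] = doc["text"].split()

def pos_tag(doc):
    # Toy tagger: capitalized tokens get "NNP", everything else "X".
    doc["pos"] = ["NNP" if t[:1].isupper() else "X" for t in doc["tokens"]]

def parse(doc):
    # Toy "parser": pairs tokens with tags, so it needs POS-tagged input.
    doc["parse"] = list(zip(doc["tokens"], doc["pos"]))

MODULES = {"tokenize": tokenize, "pos_tag": pos_tag, "parse": parse}

# Module name -> set of modules that must run before it.
DEPENDS_ON = {"tokenize": set(), "pos_tag": {"tokenize"}, "parse": {"pos_tag"}}

def run_order(depends_on):
    """Return module names ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())

def annotate(text):
    """Run every module on one page, in dependency order."""
    doc = {"text": text}
    for name in run_order(DEPENDS_ON):
        MODULES[name](doc)
    return doc

def annotate_all(texts, workers=4):
    """Annotate pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, texts))
```

Under this scheme, adding a new module would mean registering one
function and one DEPENDS_ON entry; the run order is recomputed from
the declarations rather than hard-coded.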

It is very simple to modify the LSE code to draw from other Web
sources (certainly anything available via a CPAN WWW::Search module),
and although we do not identify tables, headers and footers, etc., I'm
sure the architecture would be flexible enough to add that sort of
functionality as part of the document-level processing before sentence
identification and sentence-level processing take place.
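
To make the two extension points above concrete, here is a minimal
Python sketch (again not LSE code). It assumes a page source is
anything with a search() method, and that document-level filters run
on the whole page text before sentence identification. StaticSource,
drop_short_lines, split_sentences and collect are hypothetical names,
and the short-line filter is only a crude stand-in for real header and
footer detection.

```python
import re
from typing import Callable, Iterable, List, Protocol

class PageSource(Protocol):
    """Anything that can turn a query into page texts."""
    def search(self, query: str) -> Iterable[str]: ...

class StaticSource:
    """Stand-in source returning canned page texts instead of querying the Web."""
    def __init__(self, pages: List[str]):
        self.pages = pages

    def search(self, query: str) -> Iterable[str]:
        return [p for p in self.pages if query.lower() in p.lower()]

def drop_short_lines(text: str, min_words: int = 4) -> str:
    """Hypothetical document-level filter: drop lines too short to be
    running prose (menus, footers and the like)."""
    kept = [ln for ln in text.splitlines() if len(ln.split()) >= min_words]
    return "\n".join(kept)

def split_sentences(text: str) -> List[str]:
    """Naive sentence breaker: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def collect(source: PageSource, query: str,
            doc_filters: List[Callable[[str], str]]) -> List[str]:
    """Fetch pages, apply document-level filters, then break into sentences."""
    sentences: List[str] = []
    for page in source.search(query):
        for doc_filter in doc_filters:
            page = doc_filter(page)
        sentences.extend(split_sentences(page))
    return sentences
```

Swapping in a different Web source would then mean writing another
class with a search() method, and table or header/footer handling
would be one more entry in doc_filters.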

The resulting database and index are, of course, well suited to the
kinds of lexically and syntactically driven searching the LSE was
designed to support, including a very linguist-friendly user
interface. But I expect the software could easily be adapted for
other purposes, and we're hoping the open source release will make it
easy for people to develop their own variations of linguistic search.
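
For readers wondering what indexing sentences by syntactic structure
might look like in miniature, here is a toy Python sketch. It is not
the LSE's index format: it assumes parses are nested tuples and simply
maps each local tree (a parent label with its child labels or word) to
the sentences containing it. productions and build_index are
hypothetical names.

```python
from collections import defaultdict

# A tree is a nested tuple (label, child, child, ...); leaves are plain strings.

def productions(tree):
    """Yield (parent_label, child_labels) for every internal node of a tree."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def build_index(parsed):
    """Map each production to the set of sentence ids whose parse contains it."""
    index = defaultdict(set)
    for sent_id, tree in parsed.items():
        for prod in productions(tree):
            index[prod].add(sent_id)
    return index
```

Because preterminal nodes produce entries such as ("NNP", ("Marco",)),
the same toy index answers both lexical and structural lookups.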

Best,

Philip

P.S. The LSE will appear in the demo session at the upcoming ACL
conference. Perhaps we'll get a chance to talk in person there!

  ----------------------------------------------------------------
  Philip Resnik, Associate Professor
  Department of Linguistics and Institute for Advanced Computer Studies

  1401 Marie Mount Hall UMIACS phone: (301) 405-6760
  University of Maryland Linguistics phone: (301) 405-8903
  College Park, MD 20742 USA Fax: (301) 314-2644 / (301) 405-7104
  http://umiacs.umd.edu/~resnik E-mail: resnik@umiacs.umd.edu

