Re: [Corpora-List] Query on the use of Google for corpus research

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Jun 01 2005 - 17:36:49 MET DST


    Your tools sound really interesting, and in part similar to what we are
    developing/adapting. Is anything (besides GATES, of course) publicly
    available?

    > (PDF, Word, etc.) and strips out the text, does its best to identify
    > titles, tables, etc. and mark them as such

    So, is this where you identify the parts of a page that are probably not
    worth keeping, or that should at least be marked as something other than
    natural connected text? (E.g., header and footer material that is repeated
    on many pages from the same site?) Delimiting these seems to be one of the
    most annoying problems we are encountering right now...
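
To make the repeated header/footer problem concrete, here is a minimal sketch of one possible heuristic, not the method used by either party in this thread: treat any line that recurs on a large share of pages from the same site as probable boilerplate and label it rather than delete it. The function names `find_repeated_lines` and `mark_boilerplate` and the `min_share` threshold are hypothetical, and the sketch assumes each page has already been reduced to plain text with one layout block per line.

```python
from collections import Counter


def find_repeated_lines(pages, min_share=0.5):
    """Return the lines that occur on at least min_share of the pages.

    `pages` is a list of plain-text pages from one site (an assumption of
    this sketch); `min_share` is a hypothetical tuning parameter.
    """
    counts = Counter()
    for text in pages:
        # Count each distinct non-blank line once per page.
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(lines)
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def mark_boilerplate(text, repeated):
    """Label each non-blank line of a page as 'boilerplate' or 'text'."""
    marked = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        label = "boilerplate" if stripped in repeated else "text"
        marked.append((label, stripped))
    return marked


if __name__ == "__main__":
    site_pages = [
        "Home | About\nCorpus linguistics is fun.\n(c) Example Site",
        "Home | About\nAnother page of text.\n(c) Example Site",
        "Home | About\nYet more text.\n(c) Example Site",
    ]
    repeated = find_repeated_lines(site_pages)
    for label, line in mark_boilerplate(site_pages[0], repeated):
        print(label, line, sep="\t")
```

Exact line matching is the weak point of such a sketch: navigation bars that differ by a date or a highlighted link would slip through, so a real pipeline would probably need fuzzier comparison.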

    Regards,

    Marco


