[Corpora-List] WEB AS CORPUS: Workshop/Tutorial, 14th July 05, Birmingham UK

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Thu Jun 09 2005 - 10:51:33 MET DST

                            ********************************
                                     WEB AS CORPUS
                            Pre-conference workshop/tutorial
                                 Corpus Linguistics 2005
                                     14th July 2005
                                 Birmingham University, UK
                            *********************************

                  http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

                                        Co-chairs:
                      Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff

    Motivation:

    The World Wide Web is a mine of language data of unprecedented richness
    and ease of access (Kilgarriff and Grefenstette, 2003). A growing body of
    studies has shown that simple algorithms using Web-based evidence are
    successful at many linguistic tasks, often outperforming sophisticated
    methods based on smaller but more controlled data sources (e.g., Turney
    2001).

    However, many fundamental issues about the viability and exploitation of
    the web as a linguistic corpus must still be explored, or are just
    starting to be tackled. These issues range from word frequency
    distributions on the web to efficient handling of massive data sets, to
    the legal standing of web indexing.

    Thus, we believe that research on the web as corpus is currently at a
    very exciting stage: increasing evidence points to the enormous potential
    of the Internet as a source of linguistic data, but we are still far
    removed from anything like a working, fully-fledged tool for linguists and
    language technologists to use the web as a corpus.

    Contents:

    This full-day workshop and tutorial will provide an introduction to the
    issues involved in using the web as a corpus. The emphasis will be
    practical and participatory, with presentations of programs addressing
    particular issues, and opportunities for all participants to describe their
    experiences of working with the web as a source of linguistic data. We
    shall also aim to establish what the main challenges ahead are for this
    young community, and how it should work collectively to address them.

    * General overview of web-as-corpus work
    * Building large/general and small/special-purpose web corpora
    * Web crawling for linguistic purposes
    * (Near-)duplicate detection, boilerplate removal, language identification
    * Linguistic annotation
    * Working with non-Latin-1 languages
    * Indexing and retrieval from large document collections
    * Prospective interfaces

    Provisional program:

    9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of
      the workshop, overview of program
    10:00-10:45 Tom Emerson (Basis Technology) - Large crawls of the web for
      linguistic purposes
    10:45-11:15 coffee break
    11:15-12:00 Marco Baroni (University of Bologna) and Serge Sharoff
      (University of Leeds) - Creating specialized and general corpora using
      automated search engine queries
    12:00-13:00 Small groups arranged around the participants' research
      purposes

    13:00-14:30 lunch break

    14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing
      web-derived text (or: Working with very messy data)
    15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff
      (Lexicography MasterClass) - Indexing and interfaces
    16:00-16:30 coffee break
    16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) -
      Representing genre-specific websites
    17:00-17:30 Small groups on "what are critical next steps for
      Web-as-Corpus activity?"
    17:30-18:10 Plenary: where next?

    Registration:

    Registration and accommodation are managed by the main conference
    organizers. Please visit:

    http://www.corpus.bham.ac.uk/conference


