[Corpora-List] NewsExplorer: multilingual news analysis with cross-lingual news links

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Wed Sep 13 2006 - 17:04:29 MET DST

    Please excuse multiple postings.

     

    URL: http://press.jrc.it/NewsExplorer/

    LANGUAGES: Arabic, Dutch, English, Estonian, Farsi, French, German,
               Italian, Portuguese, Russian, Slovene, Spanish, Swedish.
    COUNTRIES: Austria, Belgium, France, Germany, Italy, Netherlands,
               Spain, United Kingdom, United States.
    VOLUME:    Approx. 15,000 news articles analysed every day.
               News on approx. 500,000 distinct names.
    WEB USAGE: Currently approximately 300,000 hits per day.

     

     

    NewsExplorer is a publicly accessible, fully automatic news aggregation and
    analysis system that makes use of various text analysis and visualisation
    tools. NewsExplorer allows users to navigate the news across languages and
    over time, to access articles via named persons and organisations, and to
    get an overview of developments via visual time lines. NewsExplorer, which
    was developed entirely at the European Commission's Joint Research Centre
    (JRC) in Ispra (Italy), is currently available in 13 languages and also
    distinguishes country-specific news. Apart from the seamless integration of
    various information extraction tools, its major novel features are its high
    degree of multilinguality and its ability to cross language borders.

     

    NewsExplorer is fully automatic and will thus make mistakes. The news
    analysis is bottom-up and without any political or other preconceptions.
    The following text analysis tools are part of NewsExplorer:

     

    - Document clustering.
    - Geo-coding, including disambiguation of homographic place names.
    - Name recognition (persons and - to some extent - organisations).
    - Approximate matching and automatic merging of name variants,
      monolingually and across languages
      (e.g. http://press.jrc.it/NewsExplorer/entities/en/23.html).
    - Daily calculation of weighted relations between persons,
      based on their co-occurrence in millions of news articles.
    - Identification of quotes by and about people.
    - Automatic linking of names to the Wikipedia encyclopaedia.
    - Detection of major new topics every day, week and month.
    - Tracking of ongoing topics over time ('stories').
    - Linking of news on the same subject across languages.
    - Various visualisation tools:
      - Location of news in the world.
      - Biggest daily news clusters per language over time (time line).
      - Development of individual stories over time.
      - Relations between persons and organisations.
    - More to come ...
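
    To illustrate the idea behind the "weighted relations between persons"
    item in the list above, here is a minimal Python sketch. It is not
    NewsExplorer's actual method: the weighting scheme used by the system is
    not described in this announcement, so the sketch simply assumes that the
    weight of a relation is the number of articles mentioning both names. The
    function names and the toy input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(articles):
    """Count, for every unordered pair of names, the articles mentioning both.

    ASSUMPTION: raw article counts stand in for the (undocumented) weights.
    """
    pair_counts = Counter()
    for names in articles:
        # Deduplicate and sort so each pair is counted once per article
        # and always appears in the same order.
        for a, b in combinations(sorted(set(names)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def top_relations(pair_counts, name, limit=5):
    """Return the names most often co-mentioned with `name`, strongest first."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == name:
            related[b] = count
        elif b == name:
            related[a] = count
    return related.most_common(limit)


# Hypothetical toy input: one list of recognised names per article.
articles = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
    ["Person B", "Person C"],
    ["Person A", "Person A", "Person B"],
]
weights = cooccurrence_weights(articles)
```

    Calling top_relations(weights, "Person A") on this toy input ranks
    "Person B" ahead of "Person C". A real system would presumably normalise
    for how frequently each name occurs overall, which this sketch does not
    attempt.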

     

    An overview of the system is given in the following article (for more
    detailed publications on individual tools and applications, see
    http://langtech.jrc.it/):

     

       Steinberger Ralf, Bruno Pouliquen, Camelia Ignat.
       Navigating multilingual news collections
             using automatically extracted information.
       Journal of Computing and Information Technology
             CIT 13, 2005, 4, 257-264.
       Available at: http://cit.zesoi.fer.hr/browseIssue.php?issue=23

     

    NewsExplorer receives its news articles from the JRC's Europe Media Monitor
    (publicly available on the NewsBrief page http://press.jrc.it/), which
    continually crawls about 1,000 news sites in 30 different languages.
    NewsBrief detects breaking news, roughly classifies all articles, and sends
    out email summaries.

     

    NewsExplorer and NewsBrief have been developed as a service to the European
    Commission and other EU institutions, as well as for the wider public.

     

     

    "Helping to unify Europe - One language at a time."

     

     

     

    European Commission - Joint Research Centre (JRC, http://www.jrc.it/)

    IPSC - SeS - Language Technology

    21020 Ispra (VA), Italy

    URL: http://langtech.jrc.it/


