Re: [Corpora-List] starting a machine translation project

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Thu Sep 14 2006 - 00:01:14 MET DST

    zhang min wrote:
    > Does anyone know where we can get English-to-Indonesian bilingual corpus?

    Joseph Cathcart asked that question on this list in 2001 (I don't think
    he got any responses, but you might ask him), and Jelita Asian was
    looking for generic corpora in 2004 (not necessarily parallel).

    When Bill Poser and I were working at the LDC, we (actually, I think it
    was Bill) looked for parallel text in Indonesian. Bill noted that there
    was lots of news, mostly monolingual, but that one might be able to
    build a bilingual English-Bahasa Indonesian corpus by extracting
    parallel articles from the following site:
       Tempo Interactive (Indonesian) http://www.tempo.co.id/
       Tempo Interactive (English) http://www.tempointeractive.com/index,uk.asp

    When I tried it just now, the first site redirected me to the second
    (http://www.tempointeractive.com/). In any case, it is still possible
    to switch between English and Indonesian (as well as Japanese and
    Mandarin; see the menu on the left of their web page). Whether you
    can find parallel articles depends on how they produce text in the two
    languages (and access to the English archives apparently now requires
    registration). When we looked into this kind of thing for Hindi, we
    found to our surprise that most bilingual news sites had little or no
    parallel text. Maybe it's cheaper there to employ separate reporters
    for the two languages than to employ translators. That, or the market's
    very different for news in Hindi and in English.

    Three years ago, when Bill looked, he was able to find at least one
    parallel article at the above site. Since he doesn't speak Indonesian
    (at least I _think_ he doesn't, although it wouldn't surprise me to hear
    that he was learning it!), I presume it was fairly easy to find. But
    when I tried an archive search just now, using proper nouns found in
    either an English or an Indonesian article, I couldn't come up with any
    parallel text. Maybe your luck will be better...
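
    The proper-noun search described above could be automated roughly as
    follows. This is only a minimal sketch under stated assumptions, not
    anything that was used at the LDC: the function names, the Jaccard
    overlap score, and the 0.5 threshold are all hypothetical, and it
    assumes the articles have already been downloaded as plain text. It
    treats capitalized words that are not sentence-initial, plus numbers,
    as "anchors" that tend to survive translation unchanged.

```python
import re


def anchor_tokens(text):
    """Collect numbers and capitalized words that are not sentence-initial."""
    anchors = set()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+|\d+", sentence)
        for i, word in enumerate(words):
            if word.isdigit() or (i > 0 and word[0].isupper()):
                anchors.add(word.lower())
    return anchors


def overlap(a, b):
    """Jaccard overlap between two anchor sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_articles(english, indonesian, threshold=0.5):
    """For each English article, return its best Indonesian match
    as (english_id, indonesian_id, score) if the score reaches threshold.

    Both arguments are dicts mapping an article id to its plain text.
    The threshold value is an arbitrary assumption, not a tuned setting.
    """
    id_anchors = {k: anchor_tokens(v) for k, v in indonesian.items()}
    pairs = []
    for en_id, en_text in english.items():
        en_anchors = anchor_tokens(en_text)
        best_score, best_id = max(
            ((overlap(en_anchors, a), k) for k, a in id_anchors.items()),
            default=(0.0, None),
        )
        if best_id is not None and best_score >= threshold:
            pairs.append((en_id, best_id, best_score))
    return pairs
```

    Candidate pairs found this way would still need checking by someone
    who reads both languages, since two separately written reports of the
    same event will also share names and dates.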

    -- 
    	Mike Maxwell
    	maxwell@ldc.upenn.edu
    


