Re: [Corpora-List] Arabic language under Linux

From: Andy Roberts (andyr@comp.leeds.ac.uk)
Date: Sun May 29 2005 - 12:43:02 MET DST

    This is not an operating system issue. You read an Arabic file in much
    the same way as any other file. The main difference is that you will
    need to specify a character encoding.
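
    As a rough sketch of what "specifying a character encoding" looks like
    in practice, here is a Python 3 example. The helper name read_text, the
    sample text and the choice of UTF-8 are illustrative assumptions, not
    anything taken from your tokeniser:

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding its bytes with the given encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Demonstration on a throwaway file holding French and Arabic text.
sample = "Bonjour le monde / مرحبا بالعالم\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    text = read_text(path)  # decoded into an ordinary Python string
```

    If the encoding passed in does not match the one the file was saved
    with, Python will either raise a UnicodeDecodeError or silently return
    garbled characters, depending on the pair of encodings involved.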

    In terms of adapting your current tokeniser, it's difficult to advise
    what to do because it depends on which programming language you've used.
    I've always found Java to be the best for multilingual support,
    including Arabic. I've also written an Arabic transliterator in Python,
    which wasn't too difficult. All programming languages will let you
    specify an encoding, but it's easier in some than others.

    If you are unsure about encodings, I found this article to be
    particularly good:
    http://www.joelonsoftware.com/articles/Unicode.html

    If you have a bilingual file, with Arabic and French, then I'd recommend
    using the same encoding throughout the file. Unicode is ideal: UTF-8
    should be adequate, although UTF-16 will certainly be fine (that is,
    make sure you save your files in a Unicode encoding such as UTF-16
    *before* trying to tokenise them).
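
    One possible way to do that conversion step is sketched below in
    Python 3. It assumes the original file was saved in Windows-1256
    (Python codec name "cp1256"), which is only a guess at a legacy
    encoding that covers both Arabic and accented French letters; the
    function name convert_encoding and the file names are hypothetical:

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-16"):
    """Decode src_path with src_encoding and rewrite it using dst_encoding."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


# Demonstration: a small bilingual file in the assumed legacy encoding.
sample = "café / قهوة\n"
with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    unicode_path = os.path.join(tmp, "unicode.txt")
    with open(legacy_path, "wb") as handle:
        handle.write(sample.encode("cp1256"))

    convert_encoding(legacy_path, unicode_path, "cp1256")

    with open(unicode_path, "rb") as handle:
        raw = handle.read()  # bytes on disk, starting with a byte order mark
    with open(unicode_path, "r", encoding="utf-16", newline="") as handle:
        round_trip = handle.read()
```

    Passing dst_encoding="utf-8" instead would produce a UTF-8 file; either
    way the tokeniser then only has to deal with a single known encoding.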

    Andy

    On Sun, 29 May 2005, nouha.chaaben wrote:

    >
    >
    > Dear all,
    >
    > I have a tokenizer for French documents under Linux; I want to adapt it to Arabic documents.
    > Does anyone know how to handle Arabic and how to read a bilingual file under Linux?
    >
    > Thanks
    >
    > Nouha
    > ******************************
    > Nouha Chaâben
    > PhD Student at Faculty
    > of Economic Sciences
    > and management of Sfax, Tunisia
    >
    > Email : nouha.chaaben@laposte.net



