Corpora: information

Jochen Leidner
Sat, 11 Oct 1997 04:31:23 +0200 (METDST)

>>>>> "Rashmi Prasad" == Rashmi Prasad <> writes:
Rashmi Prasad> Hi, I would very much appreciate information on
Rashmi Prasad> software to transcribe Sanskrit text into English,
Rashmi Prasad> for corpus-related work. Specifically, I am looking
Rashmi Prasad> for some software to do optical character
Rashmi Prasad> recognition of Sanskrit (an Indo-Aryan language)
Rashmi Prasad> text, and perform a transcription from the Sanskrit
Rashmi Prasad> orthography into the Roman script (English).

A transliteration into the Roman script is not really advisable
if you plan to construct a corpus, since it is an artificial
representation. Rather, I recommend you represent your Sanskrit texts
using the UNICODE 2.0 standard.
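
To illustrate why the original script is the better storage format: a
Roman rendering can be derived from Devanagari text whenever it is
needed. The following is only a minimal sketch in Python. The mapping
table is hypothetical and deliberately tiny (it covers just the
characters of one example word), the function name romanize is my own,
and the Roman output follows the IAST convention as an assumption; a
real transliterator needs the full character inventory.

```python
# Hypothetical, deliberately tiny tables: only the characters of the example word.
CONSONANTS = {"\u0938": "s", "\u0915": "k", "\u0924": "t", "\u092e": "m"}
VOWEL_SIGNS = {"\u0943": "\u1e5b"}   # vowel sign vocalic r -> r with dot below
OTHER_SIGNS = {"\u0902": "\u1e43"}   # anusvara -> m with dot below
VIRAMA = "\u094d"                    # suppresses the inherent vowel "a"


def romanize(text):
    """Derive a Roman rendering from Devanagari code points (sketch only)."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries an inherent "a" unless a virama or a
            # dependent vowel sign follows it.
            if nxt != VIRAMA and nxt not in VOWEL_SIGNS:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch in OTHER_SIGNS:
            out.append(OTHER_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # pass anything unknown through unchanged
    return "".join(out)


# The word "Sanskrit" in Devanagari, written as explicit code points.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"
print(romanize(word))
```

The corpus itself keeps the Devanagari code points; the Roman form is
just one view that can be regenerated, or replaced by a different
romanization scheme, at any time.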

If you are constructing a spoken corpus, UNICODE 2.0 has character codes
for the International Phonetic Alphabet (IPA). If you are constructing a
written corpus, you can use the Devanagari script directly:

"The Devanagari script is used for writing classical Sanskrit
and its modern historical derivative, Hindi. [...]"
-- UNICODE Standard 2.0, pp. 6-33 to 6-48
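
As a rough illustration of what this looks like in practice, here is a
short Python sketch that lists the code points of a Devanagari word and
of one IPA symbol. It assumes a present-day Python 3 with its standard
unicodedata module; the helper name describe and the choice of UTF-8 as
the storage encoding are my own assumptions, not something the Standard
prescribes for a corpus.

```python
import unicodedata

# The word "Sanskrit" in Devanagari, written as explicit code points
# so the example does not depend on the encoding of this message.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"

# One IPA symbol (the "esh"), from the IPA Extensions block.
ipa = "\u0283"


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


for code_point, name in describe(word + ipa):
    print(code_point, name)

# Devanagari occupies the block U+0900..U+097F.
print(all(0x0900 <= ord(c) <= 0x097F for c in word))

# Assumption: UTF-8 as the on-disk encoding, three bytes per character here.
print(len(word.encode("utf-8")))
```

Each Devanagari letter, vowel sign, and virama is a separate code point,
so the text can be searched and sorted without first going through a
Roman transliteration.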

After nearly half a century of computing, it is a pity that many
computers are still not even able to represent and display some of the
most important scripts of the world, but this is now about to change.

To get an example page of UNICODE-encoded Sanskrit, together with some
hints on how to proceed, point your Web browser at

You may also want to read the Standard (published by Addison-Wesley
Developers Press, 1996).



Jochen Leidner