Computing for non-European languages:
A perspective from linguistics and Southeast-Asian languages.

Victoria Rosén

Why study non-European languages in Europe?

The study of non-European languages at European universities serves different purposes. One of them is to acquire a practical working knowledge of the language for personal or business communication. Another is to aid the study of the literature, culture, history, etc. of a foreign society. Yet another important purpose, which is often overlooked, is to provide a wider knowledge contributing to the study of general linguistics as well as other branches of linguistics such as sociolinguistics, psycholinguistics or computational linguistics.

At the Department of Linguistics and Comparative Literature at the University of Bergen, several non-Indo-European languages are taught. One of these, Japanese, is taught in cooperation with the Norwegian School of Economics and Business Administration. It is taught from beginning courses up to the grunnfag level (two years). Other non-Indo-European languages are taught as a part of the linguistics curriculum. All general linguistics students are required to study a non-Indo-European language as part of the first-year course in linguistics. Languages the department has offered include Cantonese, Finnish, Hungarian, Japanese, New Zealand Maori and Vietnamese.

There is also some teaching of non-Indo-European languages in the Department of Scandinavian Languages and Literature. Vietnamese and Cantonese are among the languages studied here, although the perspective is slightly different. The students are teachers or prospective teachers of Norwegian as a second language, and they study immigrant languages of particular importance (not only non-Indo-European languages, but also Indo-European languages such as Serbo-Croatian) in order to gain insight into the kinds of problems that speakers of these languages may be expected to have in learning Norwegian.

Advanced computational linguistics tools for studying non-European languages

Most work within computational linguistics has been based on English and closely related Indo-European languages. This unfortunately results in analyses that do not do justice to the structures of non-Indo-European languages. In his paper Maximizing the (re)usability of language data, Arvi Hurskainen points out that the techniques developed do not necessarily work well for languages with extensive morphological variation. But it is equally true that these techniques do not necessarily work well for languages without morphological variation. Many languages of East and Southeast Asia have little or no morphology. Languages that do not have inflectional morphology must use grammatical morphemes and phrase structure configuration to express what other languages express through inflectional morphology. The study of these languages can be greatly enriched by the use of computational linguistics tools like syntactic workbenches. At the same time, such study may further the development of linguistic theory and computational linguistics in general, in the sense that these may become more language independent.

The Section for Linguistic Studies, University of Bergen, has access to a number of syntactic workbenches. These are advanced computational tools for the development of grammars. The Xerox LFG Grammar Writer's Workbench (http://www.parc.xerox.com/istl/groups/nltt/pargram/dev-env.html) is a grammar development environment written in Medley Lisp (Kaplan and Maxwell 1996). This tool provides an implementation of the Lexical-Functional Grammar syntactic formalism, originally presented in Kaplan and Bresnan (1982). The implementation also includes more recent features of the theory, such as functional uncertainty and multiple projections. The LFG workbench has been used to develop advanced grammars not only for English and other major European languages, but also for Vietnamese (cf. Rosén 1997). Specific constructions covered by these grammars include the topic-comment construction and sentences with empty pronouns in Vietnamese, and topicalization structures in English. The syntactic analysis involves both c-structures (phrase structure trees) and f-structures (feature matrices with grammatical information presented in a manner independent of phrase structure representation). In addition, there are semantic and discourse analyses, both in the form of feature matrices. Although these grammars have extremely limited coverage, they can be valuable from a number of perspectives, including theoretical and pedagogical ones.

The grammar of languages such as Vietnamese and Chinese has been treated in the West from two basic perspectives: structuralist and generative. The structuralist approach usually stresses how different these languages are from English, while the generative approach tends to make these languages appear to have the same structure as English (if not at surface structure, then at least at some deeper level). In a theory such as LFG, it is possible to write grammars for widely varying languages based on a common set of linguistic principles. The level of functional structure permits different languages to express the same grammatical distinction in different ways. For instance, one language may code object status through accusative case, another may code it through phrase structure position, and a third may code it through the use of an adjacent object particle. Students with such a workbench at their disposal may use the somewhat eclectic language descriptions in thorough but outdated structuralist grammars to implement a generative grammar, and simultaneously learn more about generative grammar and about the grammar of the language itself.
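The point about functional structure can be illustrated with a small sketch. It is not LFG notation and has nothing to do with the workbench itself. The three toy "languages", the suffixes -NOM and -ACC, the particle OP and all function names are hypothetical, and the f-structure is reduced to a nested dictionary with a bare PRED value. Under these assumptions, three different coding strategies yield one and the same functional structure:

```python
def by_case(tokens):
    """Toy language A: hypothetical case suffixes mark subject and object."""
    f = {}
    for tok in tokens:
        if tok.endswith("-ACC"):
            f["OBJ"] = {"PRED": tok[:-4]}
        elif tok.endswith("-NOM"):
            f["SUBJ"] = {"PRED": tok[:-4]}
        else:
            f["PRED"] = tok  # the unmarked token is taken to be the verb
    return f


def by_position(tokens):
    """Toy language B: fixed subject-verb-object order."""
    subj, verb, obj = tokens
    return {"SUBJ": {"PRED": subj}, "PRED": verb, "OBJ": {"PRED": obj}}


def by_particle(tokens):
    """Toy language C: verb-final; a hypothetical particle 'OP' follows the object."""
    f = {"PRED": tokens[-1]}
    rest = tokens[:-1]
    for i, tok in enumerate(rest):
        if tok == "OP":
            continue
        is_obj = i + 1 < len(rest) and rest[i + 1] == "OP"
        f["OBJ" if is_obj else "SUBJ"] = {"PRED": tok}
    return f


a = by_case(["cat-ACC", "dog-NOM", "see"])
b = by_position(["dog", "see", "cat"])
c = by_particle(["dog", "cat", "OP", "see"])
print(a == b == c)
```

The surface strings differ in word order and marking, but the comparison holds because each function records only the grammatical functions, not how they were coded.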

The LFG Grammar Writer's Workbench is already being used for the teaching of syntax to beginning linguistics students at the Section for Linguistic Studies. They use it for writing syntactic rules for Norwegian, but in principle it could also be used in their study of their non-Indo-European language. Although the study of such a language is a popular part of the linguistics program, students have often complained that there is too little connection between the study of this language and the linguistic theories they learn about. Actually writing syntactic rules for various grammatical constructions in the non-Indo-European language, and using the workbench to test whether their rules correctly analyze these constructions, would provide students with a much more concrete link between these two parts of the study program.
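The cycle of writing rules and testing them against sentences can be sketched in a few lines. This only illustrates the idea and bears no relation to the workbench's actual rule format or interface. The rule set, the lexicon and the function names are hypothetical, and the rules are plain context-free rules without f-structure annotations:

```python
# Hypothetical toy grammar: each category maps to its possible expansions.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"N": {"dogs", "cats"}, "V": {"see", "sleep"}}


def parse(symbol, tokens, start):
    """Yield every position at which `symbol` can end when it begins at `start`."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield start + 1
        return
    for rhs in RULES[symbol]:
        ends = [start]
        for child in rhs:
            # Extend every partial analysis by one more daughter category.
            ends = [e2 for e in ends for e2 in parse(child, tokens, e)]
        yield from ends


def accepts(sentence):
    """True if the rules analyze the whole sentence as an S."""
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)


for s in ["dogs see cats", "dogs sleep", "see dogs"]:
    print(s, accepts(s))
```

A student who changes a rule can rerun the test sentences at once and see which constructions the revised grammar accepts or rejects. The simple top-down strategy used here would not terminate on left-recursive rules.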

Fonts

At a lower level of processing, the limited range of fonts, most of which serve only western European languages, presents a practical barrier to text processing in non-European languages. Among languages that use the Latin alphabet, Vietnamese is undoubtedly one of the most complicated to write on a computer, because of its extensive use of diacritics. Vietnamese uses diacritics both to designate special consonants and vowels and to designate tone. Some characters therefore carry two diacritics. These diacritics must be placed in a particular configuration with respect to each other, but more than one configuration may be possible for a given combination. For example, the tonal diacritic may in some cases be placed either to the right or to the left of another diacritic.
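The stacking of two diacritics on one letter can be made concrete with a short sketch. It assumes that the text is encoded in Unicode, which is an assumption of this example and not something discussed above. The function name describe is hypothetical. The sketch uses Python's standard unicodedata module to split a Vietnamese letter into its base letter and its combining marks:

```python
import unicodedata


def describe(char):
    """Return the base letter and the Unicode names of its combining marks."""
    decomposed = unicodedata.normalize("NFD", char)  # canonical decomposition
    base = decomposed[0]
    marks = [unicodedata.name(c) for c in decomposed[1:]]
    return base, marks


for ch in ["ế", "ờ", "ậ"]:
    base, marks = describe(ch)
    print(ch, base, marks)
```

Each of the three letters decomposes into a base vowel plus two marks, one for vowel quality and one for tone. The decomposition says nothing about where the marks are drawn relative to each other. That placement is left to the font, which is where the difficulty described above arises.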

Although it is not possible to present one single standard for how Vietnamese is to be written, it is essential that the diacritics are represented accurately. Since each Vietnamese syllable is normally written as a separate orthographical word regardless of whether it is morphologically a single word or not, an enormous amount of homophony would result if the diacritics were left out. For a native speaker, the text may still be comprehensible, since the appropriate word can be inferred from the context in which it occurs. For a language learner, however, it is essential to learn the correct vowel or consonant and the correct tone. Fonts for non-European languages nevertheless still present a serious problem on the web. Web services for Vietnamese do not at present succeed in presenting the language legibly (for an example, go to http://catalogla.bnf.fr:8090/html/i-frames.htm, choose sujet under index, and enter viêtnamien under Entrer les critères; the result does not look good).
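The loss of distinctions that results from leaving out the diacritics can be demonstrated with a small sketch. It again assumes Unicode-encoded text, which is an assumption of this example. The function names and the sample syllables are chosen for illustration only. The letter đ has no canonical decomposition in Unicode, so it is mapped by hand:

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove all combining marks; map đ/Đ to d/D by hand."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")


def collapse(words):
    """Group words by the form they take once diacritics are removed."""
    groups = defaultdict(list)
    for w in words:
        groups[strip_diacritics(w)].append(w)
    return dict(groups)


syllables = ["ma", "má", "mà", "mả", "mã", "mạ", "đa", "da"]
for plain, originals in collapse(syllables).items():
    print(plain, originals)
```

Six syllables that differ only in tone fall together under the single written form ma, and the consonant contrast between đ and d is lost as well. A native speaker can often recover the intended word from context, but a learner cannot.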

Open and distance learning materials

The study of non-European languages is in need of good teaching materials. For a number of reasons, a focus on materials for open and distance learning is desirable. The most prominent reason is perhaps that the field of non-European languages consists of many specialist niches, so that both students and teachers are relatively scarce and scattered across Europe. A new distance learning course in Norwegian as a second language started in January 1999 at the University of Bergen (http://studier.uib.no/prisme/index.nsf). In the context of this course, an introductory textbook on Vietnamese has been written (Rosén in press). While the book is a regular printed volume, it is meant to be used in conjunction with TV programmes and teacher-student interaction via the Internet.

References

Hurskainen, Arvi, 1998, Maximizing the (re)usability of language data, http://www.hd.uib.no/AcoHum/nel/paper-hurskainen.html.

Rosén, Victoria, 1997, Topics and Empty Pronouns in Vietnamese, doctoral dissertation, University of Bergen.