Description of the area

A. Areas addressed
B. Cross-linking with other EU projects
C. Purpose, objectives and expected outcomes
D. Project approach

A. Areas addressed

1. Arabic region

The University of Bergen, which coordinates the ACO*HUM TNP, has been researching (1) problems of optical character recognition (OCR) of cursive scripts, (2) establishment and tagging of text corpora in Arabic and Turkish, and (3) information retrieval techniques from the end user's perspective. A group led by Prof. Dr. Joseph Norment Bell, who is editor of the Journal of Arabic and Islamic Studies, has been involved in the following three projects:

1. OCR

This project first received support from the Faculty of Arts at the University of Bergen in 1994. By the end of that year I had produced, in cooperation with Dr. Petr Zemanek of Charles University, Prague, a detailed comparison of a new Arabic OCR program, al-Qari' al-Ali (originally developed under the auspices of the Oriental Institute of the Russian Academy of Sciences, St. Petersburg Branch), with one of two other available programs. Dr. Jan Hoogland of the University of Nijmegen did a similar comparison of al-Qari' al-Ali with the other competing program. Both our group and Dr. Hoogland then prepared comprehensive critiques of the new program and presented the results at a conference in Cambridge in April of this year. At present, it is fair to say that Bergen, Nijmegen, Prague, and St. Petersburg (State University) have been the pioneering European centres in user evaluation of Arabic OCR tools for scholarly purposes.

2. Text corpora

No one in our local team had experience with the establishment and tagging of text corpora before the project began in 1994, but our colleague from Prague, Dr. Zemanek, has had considerable experience in connection with the Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS), the Ugaritic and South Arabian texts stored at Charles University, and the Czech Biblia Sacra project (see attachment 1). Moreover, we benefit in Bergen from the experience of the Wittgenstein Archive, the French-Norwegian parallel text corpus (French Section), the Bergen Corpus of Teenage Language and the Norwegian-American Immigrant Letters project (English Department), the Norwegian Term Bank, and various corpus-related activities coordinated by the Norwegian Centre for Computing in the Humanities.

3. Information retrieval

In information retrieval we have neither the competence nor the ambition to develop totally new searching tools, but we do believe that dialogue between humanists as informed end users and information specialists is essential in influencing the direction of future software development and in the fine tuning of information retrieval systems currently in existence or under development.

2. Sub-Saharan Africa

Within the Sub-Saharan African area, preliminary work exists in:

research and data collecting;
teaching innovations and coordination.

The Helsinki University Language Corpus Server maintains the Swahili Text Archives, which presently contain the following kinds of materials: (1) about 40 books of prose, both fiction and scientific texts (about 2 million words); (2) texts from various newspapers (about 2 million words); (3) two monolingual dictionaries, Kamusi ya Kiswahili Sanifu and Kamusi ya Maana na Matumizi; (4) a translation of the Bible (Standard Version) by the United Bible Societies; (5) a translation of the Quran; (6) three versions of the Pate Chronicle; (7) transcriptions of discussions in the Bunge, the Parliament of Tanzania (1996); (8) transcriptions of oral discussions and folklore from various Swahili-speaking areas of Tanzania (a joint project of the University of Dar-es-Salaam and the University of Helsinki, about 100 hours); (9) comparative word lists of 610 words from various Swahili-speaking areas of Tanzania. The archives continue to accumulate material: for example, the weekly newspaper Rai and the daily newspaper Majira have been included since they began to appear on the Internet.

Researchers can be granted access to the archives over the Internet on application (Ilkka.Westman@ling.helsinki.fi or Arvi.Hurskainen@ling.helsinki.fi). Information retrieval tools based on exact string search and on regular expressions are available. Particularly serious researchers can also be granted access to the language-sensitive information retrieval system. This system first analyzes the text morphologically, performs disambiguation operations, and, if needed, also carries out syntactic mapping. The search is then directed at the result of the analysis instead of the unanalyzed text.
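The difference between the three kinds of search can be illustrated with a minimal sketch. It is not the Helsinki system: the sample sentences, the function names, and the small hand-made table mapping word forms to lemmas are all hypothetical stand-ins for a real morphological analyzer, and disambiguation and syntactic mapping are left out entirely.

```python
import re

# Hypothetical sample text: three short Swahili sentences.
TEXT = [
    "Mwalimu anasoma kitabu",      # "The teacher is reading a book"
    "Wanafunzi walisoma vitabu",   # "The students read books"
    "Mtoto anapenda kusoma",       # "The child likes to read"
]

# Hypothetical stand-in for a morphological analyzer: a tiny lookup
# table from inflected word forms to lemmas.
LEMMAS = {
    "anasoma": "soma",
    "walisoma": "soma",
    "kusoma": "soma",
    "kitabu": "kitabu",
    "vitabu": "kitabu",
}


def string_search(lines, needle):
    """Return indices of lines containing the exact character string."""
    return [i for i, line in enumerate(lines) if needle in line]


def regex_search(lines, pattern):
    """Return indices of lines matching a regular expression."""
    rx = re.compile(pattern)
    return [i for i, line in enumerate(lines) if rx.search(line)]


def analyze(line):
    """Map each token to its lemma; unknown tokens are kept as they are."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in line.split()]


def lemma_search(lines, lemma):
    """Search the analyzed text (lemmas) instead of the raw text."""
    return [i for i, line in enumerate(lines) if lemma in analyze(line)]


if __name__ == "__main__":
    print(string_search(TEXT, "kitabu"))
    print(regex_search(TEXT, r"\b[kv]itabu\b"))
    print(lemma_search(TEXT, "kitabu"))
```

In this toy example the exact string search for "kitabu" finds only the first sentence, whereas the search on the analyzed text also finds the plural form "vitabu" in the second sentence without the user having to write a pattern for it.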

At the Dept. of African and Arabic Studies (IUO, Napoli) a small Swahili corpus is available. Part of the teaching also involves working on Swahili texts with specially designed and adapted software. The GDRE group 1172 of the French CNRS (Univ. Nice-Sophia Antipolis) has a data bank on the Sahelo-Saharan lexicon, called SAHELIA, accessible to a limited group through specially designed database software called MARIAMA.

Teaching activities resulted in the CAMEEL (Computer Applications to Modern Extra-European Languages) project. CAMEEL is an initiative of the Departments of African Languages and Linguistics in several European universities that constituted an ICP in the erstwhile ERASMUS network. The project combines highly innovative teaching methods and contents with optimal European dimension features. It is maximally flexible in terms of adaptability to different local conditions. Structurally, the modules can be used independently or as integrated parts of curricula. It is open to modification and thematic and didactic expansion. It is transnational in so far as study and practical experience abroad are required. It relies heavily on modern techniques of electronic communication across participating universities. Consequently students do not have to rely on the local facilities alone.

3. Oriental languages

Scholars at some institutes (IUO, Univ. Barcelona) have shown interest in becoming actively involved in the project. Preliminary contacts are under way between scholars at IUO and Barcelona to begin investigations and planning in this area.

B. Cross-linking with other EU projects

In 1986 the ICP 1140/09 started its activities with a small group of 4 universities, namely the I.U.O. (which acted as the central coordinator), RUL, INALCO and ULB (Bruxelles). From 1987, membership increased and the application was renewed every year, receiving approval for TM (teacher mobility) and SM (student mobility) and occasionally IP.

The 1st WOCAAL (Workshop on Computer Application on African Linguistics), organised by an ICP coordinated by INALCO, took place in Brussels, at the ULB, in March 1992 and proved very stimulating. In 1993/94 the application for an IP (Intensive Program) was accepted. The 2nd WOCAAL took place in Helsinki from the 15th to the 26th of August 1994 and proved very successful with regard both to the number of participants (7 teachers and 20 students) and to the range of subjects (phonetics, text treatment, lexicography, etc.).

During the various meetings connected with the Erasmus Program, and on other occasions such as conferences and seminars, the need was felt to introduce a systematic treatment of questions related to computer application to African linguistics into the national curricula. Consequently, the ICP-1140/09 applied in 1994 for the organization of a Common Development Programme and was awarded a grant of 4,200 ECU. Subsequently, members present at the annual ICP meeting of Sept. '95 at San Nicola (Italy) designed a project under the name CAMEEL (Computer Applications for Modern Extra-European Languages), and formed a Steering Committee for the project. The goal of the project is to create and develop a multidisciplinary and transnational curriculum to train students in computer applications to non-European languages.

C. Purpose, objectives and expected outcomes

1. Rationale and background of the project

Increasing cultural and economic contacts between EU and non-EU countries require efficient and fast communication across languages. Modern methods for language acquisition, text editing and information retrieval, which already exist for EU languages, cannot easily be adapted or recreated for non-EU languages. A new professional competence is therefore needed: there is an urgent need for language engineers with expertise in advanced computing adapted to non-European linguistic structures, from character sets to grammars.

Since many non-European languages are primarily used in countries which have not sufficiently mastered new technologies, it cannot be expected that these countries will develop suitable language technologies on their own. European institutions therefore have to take the lead in developing the necessary competence within their own linguistics and foreign language departments. These departments are already familiar with advanced computing and are beginning to adapt existing computing techniques for dealing with various languages. This very fragmented approach, though productive and creative, is somewhat amateur, which is reflected in some limitations of the products: for example, most commercially available language software can be used only on a PC platform and will not work on many of the platforms used in publishing and information retrieval.

From the cultural viewpoint, Europe has strong traditions in the study of world-wide cultures to which it has been linked in its history. As more and more material becomes accessible electronically, scholars in the humanities are being confronted with an exponential expansion in the quantity of written sources available and a dramatic improvement in the accessibility and searchability of this material. A dialogue between humanists as informed end users and information specialists is essential for determining the direction of future software development.

It is thus clear that a field like computational linguistics for non-European languages is needed. Up to now, however, a forum which allows a scholar to discuss the educational implications in a broad European context does not exist.

2. Aims and objectives

The aims of this subproject are the same as for ACO*HUM, with the following specific objectives:

provide a Europe-wide discussion area for those involved or interested in non-EU language issues (learning with links to research and computer industry);
create groups of European partners (universities, software houses, third users) with various levels of involvement and for various language areas that can plan concerted actions in a domain which lacks opportunities of coordination and which is characterized by:

very long learning/training periods;
the necessity for 'vertical' agreement in text treatment (in order to build widely usable resources);
difficulties in identifying acceptable standards for such different languages;

indicate and promote links within the non-EU area group and with other area groups of the TNP in order to foster cooperation and coordination;
agree on a 'preference list' of targeted non-EU language groups to be considered at various stages, based on:

- consideration of external needs (EU policies, non-research real user needs, etc.);
- inventory of existing sources and products;
- state of the art in the various languages/language groups.

D. Project approach

1. Main pedagogical and didactic approaches and concepts

Owing to the wide diversity of the languages involved, focal points have been identified according to mixed criteria (available competence, geographical location, linguistic characteristics, availability of resources, possible applications). The main focus is on the following language groups:

Sub-Saharan Africa (common problems/situations: lack of big corpora, lack of non-literary texts, lack of adequate tonal treatment; use of Latin alphabet, with minor adaptation; good and extended applications of modern theoretical studies; cultural heritage present in Europe);
Arabic area (lack of exhaustive studies on local/dialectal variants of classical Arabic, non-Latin alphabet, great homogeneity in basic culture, strong cultural links with Europe);
Oriental languages, including Chinese and Korean and the Indonesian area (technologically already somewhat advanced, commercially important, cultural heritage visible in Europe).

The project's methodology intends to build on already existing EU initiatives in the field of extra-EU languages (see the CAMEEL group) and enlarge participation from related groups/centres (Univ. Barcelona, Univ. Bergen, Univ. Nijmegen, etc.), while putting a more specific focus on the following:

modern and recent technology of information processing;
cross-disciplinary approaches;
non-European cultures, languages and multimedia;
practical and professional applications in multicultural communities;
international and interregional trade and cooperation.

2. New information and communication technologies

In the existing MA programs in the field of African and Oriental languages, a need is felt for more specific training in computer applications to these languages. In order to develop methods and techniques for using, evaluating and creating tools for these languages, basic training should be provided at the undergraduate level and new training at the postgraduate level. The philosophy is to provide options within the frame of existing MA and BA (Hons) programs, as well as to introduce the training as an independent new curriculum. Individual universities may decide to integrate these options as obligatory parts of their programs. A proposal for study modules has been elaborated by Prof. Arvi Hurskainen of the Univ. of Helsinki and is now under discussion.

Updated May 5, 1998
acohum@uib.no