Extension proposal: Advanced computing for non-European languages (NEL)

Proposal for a subproject extending the ACO*HUM TNP. This proposal should therefore be read against the background of the original ACO*HUM application and the renewal application.

The extension is motivated by the desirability of increasing the coverage of the TNP to other areas within the humanities. The proposed area has links with the following areas already covered:

  • Computational linguistics and language engineering
  • Textual scholarship and edition philology

Although an integration of non-European languages in the above areas is conceivable, the group involved in non-European languages has already demonstrated a high degree of internal organisation and wishes to be active as a subproject, keeping close links with the existing areas.

    A. Background information on project partners

    A.1 Participating organizations

    CILLSAC-IUO (Istituto Universitario Orientale) EDU.4 Napoli IT
    Univ. London SOAS (School of Oriental and African Studies) EDU.4 London UK
    RUL (Dpt. of African Lang. and Ling.) EDU.4 Leiden NL
    Helsinki Yliopisto (Dpt. of Asian and African Studies) EDU.4 Helsinki FI
    Univ. Leipzig (Inst. f. Afrikanistik) EDU.4 Leipzig DE
    Univ. Köln (Inst. f. Afrikanistik) EDU.4 Köln DE
    Univ. Hamburg (Sem. f. Afrikanische Spr. und Kult.) EDU.4 Hamburg DE
    Univ. de Nice-Sophia Antipolis (Dpt. des Sciences du Langage) EDU.4 Nice FR
    Univ. Lumière - Lyon 2 (Dpt. des Sciences du Langage) EDU.4 Lyon FR
    Univ. Oslo (Institutt for lingvistiske fag) EDU.4 Oslo NO
    NTNU (Lingvistisk institutt) EDU.4 Trondheim NO
    RUG (Faculty of Art and Philosophy) EDU.4 Gent BE
    Univ. Zürich (Seminar f. Allgemeine Sprachw.) EDU.4 Zürich CH
    Univ. Wien (Institut f. Afrikanistik) EDU.4 Wien AT
    INALCO (Inst. nat. des langues et civil. orientales) EDU.4 Paris FR

    All institutions in the above table are universities or schools of higher education offering courses and curricula in non-European languages.
     

    CNR (Istituto di Linguistica Computazionale) RES Pisa IT
    Olivetti Ricerca IND Pozzuoli IT

    CNR is active in research and education and will act as promoter of links with research. Olivetti is active in educational computer systems and language technology, among other fields.

    CILLSAC-IUO will be an A-partner and will invest 30 full-time days per year in the TNP; INALCO, SOAS, RUL, Helsinki Yliopisto, Univ. Leipzig, and Univ. Nice will be B-partners, each contributing a minimum of 10 full-time days per year to the project; the others are C-partners and will each contribute 5 full-time days per year.

    It is somewhat difficult to demonstrate recognized competence in a field with as many unexplored aspects as this subproject may involve, but both the coordinator and the partners in this proposed extension have considerable experience and expertise, which can be summarized as follows.

    ARABIC REGION

    The University of Bergen, which coordinates the ACO*HUM TNP, has been researching (1) problems of optical character recognition (OCR) of cursive scripts, (2) establishment and tagging of text corpora in Arabic and Turkish, and (3) information retrieval techniques from the end user's perspective. A group led by Prof. Dr. Joseph Norment Bell, who is editor of the Journal of Arabic and Islamic Studies, has been involved in the following three projects:

    1. OCR

    This project first received support from the Faculty of Arts at the University of Bergen in 1994. By the end of the year I had produced, in cooperation with Dr. Petr Zemanek of Charles University, Prague, a detailed comparison of a new Arabic OCR program, al-Qari' al-Ali (originally developed under the auspices of the Oriental Institute of the Russian Academy of Sciences, St. Petersburg Branch), with one of two other available programs. Dr. Jan Hoogland of the University of Nijmegen did a similar comparison of al-Qari' al-Ali with the other competing program. Both our group and Dr. Hoogland then did comprehensive critiques of the new program and presented the results at a conference in Cambridge in April of this year. At the present time, it is fair to say that Bergen, Nijmegen, Prague, and St. Petersburg (State University) have been the pioneering European institutions in terms of user evaluation of Arabic OCR tools for scholarly purposes.

    2. Text corpora

    Regarding the establishment and tagging of text corpora, while no one in our local team had experience with this before the project began in 1994, our colleague from Prague, Dr. Zemanek, has had considerable experience in connection with the Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS), the Ugaritic and South Arabian texts stored at Charles University, and the Czech Biblia Sacra project (see attachment 1). Moreover, we benefit in Bergen from the experience of the Wittgenstein Archive, the French-Norwegian parallel text corpus (French Section), the Bergen Corpus of Teenage Language and the Norwegian-American Immigrant Letters project (English Department), the Norwegian Term Bank, and various corpus-related activities coordinated by the Norwegian Centre for Computing in the Humanities.

    3. Information retrieval

    In information retrieval we have neither the competence nor the ambition to develop totally new searching tools, but we do believe that dialogue between humanists as informed end users and information specialists is essential in influencing the direction of future software development and in the fine tuning of information retrieval systems currently in existence or under development.

    SUBSAHARAN AFRICA

    Within the Subsaharan African area, the following preliminary work exists.

    The Helsinki University Language Corpus Server maintains the Swahili Text Archives, which presently contain the following kinds of material: (1) about 40 books of prose, both fiction and scientific texts (about 2 million words); (2) texts from various newspapers (about 2 million words); (3) two monolingual dictionaries, Kamusi ya Kiswahili Sanifu and Kamusi ya Maana na Matumizi; (4) a translation of the Bible (Standard Version) by the United Bible Societies; (5) a translation of the Quran; (6) three versions of the Pate Chronicle; (7) transcriptions of discussions in Bunge, the Parliament of Tanzania (1996); (8) transcriptions of oral discussions and folklore from various Swahili-speaking areas of Tanzania (a joint project of the University of Dar-es-Salaam and the University of Helsinki, about 100 hours); (9) comparative word lists of 610 words from various Swahili-speaking areas of Tanzania. The archives continue to accumulate material; for example, the weekly newspaper Rai and the daily newspaper Majira have been included since they began to appear on the Internet.

    Access to the archives through the Internet can be granted to researchers on application (Ilkka.Westman@ling.helsinki.fi or Arvi.Hurskainen@ling.helsinki.fi). Information retrieval tools based on literal string search and on regular expressions are available. For particularly serious researchers, access to the language-sensitive information retrieval system can be granted. This system first analyzes the text morphologically, performs disambiguation operations, and also carries out syntactic mapping, if needed. The search is then directed at the result of the analysis instead of the unanalyzed text.
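
    The difference between the two basic kinds of retrieval tool mentioned above can be illustrated with a minimal sketch in Python. The three sample sentences, the function names and the search pattern are assumptions made for illustration only; they are not taken from the archives and do not describe the Helsinki tools themselves.

```python
import re

# Hypothetical three-line sample standing in for a corpus file;
# these sentences are illustrative and are not taken from the archives.
SAMPLE_LINES = [
    "Nilisoma kitabu kizuri.",
    "Watoto wanasoma vitabu.",
    "Mwalimu anafundisha shuleni.",
]


def literal_search(lines, needle):
    """Return (line_number, line) pairs whose text contains the exact string."""
    return [(n, line) for n, line in enumerate(lines, start=1) if needle in line]


def regex_search(lines, pattern):
    """Return (line_number, line) pairs matched by a regular expression."""
    compiled = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if compiled.search(line)]


# Exact string: finds only the singular form "kitabu".
literal_hits = literal_search(SAMPLE_LINES, "kitabu")

# Regular expression: finds singular "kitabu" and plural "vitabu" in one query.
regex_hits = regex_search(SAMPLE_LINES, r"\b[kv]itabu\b")
```

    A pattern of this kind still operates on surface strings only, so every variant form has to be anticipated by the person writing the query; a search directed at the output of morphological analysis does not have that limitation.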

    At the Dpt. of African and Arabic Studies (IUO, Napoli) a small Swahili corpus is available. Teaching there also includes work on Swahili text with specially designed and adapted software. The GDRE group 1172 of the French CNRS (Univ. Nice-Sophia Antipolis) has a data bank on the Sahelo-Saharan lexicon, called SAHELIA, accessible to a limited group through a specially designed database program called MARIAMA.

    Teaching activities resulted in the CAMEEL (Computer Applications to Modern Extra-European Languages) project. CAMEEL is an initiative of the Departments of African Languages and Linguistics in several European universities that constituted an ICP in the erstwhile ERASMUS network. The project combines highly innovative teaching methods and contents with optimal European dimension features. It is maximally flexible in terms of adaptability to different local conditions. Structurally, the modules can be used independently or as integrated parts of curricula. It is open to modification and thematic and didactic expansion. It is transnational in so far as study and practical experience abroad are required. It relies heavily on modern techniques of electronic communication across participating universities. Consequently students do not have to rely on the local facilities alone.

    ORIENTAL LANGUAGES

    Scholars at some institutes (IUO, Univ. Barcelona) have expressed interest in becoming actively involved in the project. Preliminary contacts between scholars at IUO and Barcelona are under way to begin investigations and planning in this area.

    A.2 Other associated organizations

    None.

    A.3 Involvement in other EU projects

    In 1986 the ICP 1140/09 started its activities with a small group of 4 universities, namely the I.U.O. (which acted as the central coordinator), RUL, INALCO and ULB (Bruxelles). From 1987, membership increased and the application was renewed every year, receiving approval for TM (teacher mobility) and SM (student mobility) and occasionally IP.

    The 1st WOCAAL (Workshop on Computer Application on African Linguistics), organised by an ICP coordinated by INALCO, took place at the ULB in Brussels in March 1992 and proved very stimulating. In 1993/94 the application for an IP (Intensive Program) was accepted. The 2nd WOCAAL took place in Helsinki from the 15th to the 26th of August 1994 and proved very successful both with regard to the number of participants (7 teachers and 20 students) and to the range of subjects (phonetics, text treatment, lexicography, etc.).

    During the various meetings connected to the work with the Erasmus Program, and on other occasions such as conferences, seminars, etc., the need was felt to introduce a systematic treatment of questions related to computer application to African linguistics into the national curricula. Consequently, the ICP-1140/09 applied in 1994 for the organization of a Common Development Programme and was awarded a grant of 4.200 ECU. Subsequently, members present at the annual ICP meeting of Sept. '95 at San Nicola (Italy) designed a project under the name CAMEEL (Computer Applications for Modern Extra-European Languages), and formed a Steering Committee for the project. The goal of the project is to create and develop a multidisciplinary and transnational curriculum to train students in computer applications to non-European languages.

    B. Purpose, objectives and expected outcomes

    B.1 Rationale and background of the project

    Increasing cultural and economic contacts between EU and non-EU countries require efficient and fast communication across languages. Modern methods for language acquisition, text editing and information retrieval, which already exist for EU languages, cannot easily be converted or created for non-EU languages. A new professional competence is therefore needed. There is an urgent need for language engineers with expertise in advanced computing adapted to non-European linguistic structures, from character sets to grammars.

    Since many non-European languages are primarily used in countries which have not sufficiently mastered new technologies, it cannot be expected that these countries develop suitable language technologies on their own. European institutions therefore have to take the lead in developing the necessary competence within their own linguistics and foreign language departments. These departments have already been familiarized with advanced computing and are beginning to adapt existing computing techniques for dealing with various languages. This very fragmented approach, though productive and creative, is somewhat amateur, and this is reflected in some limitations of the products: for example, most commercially available language software can be used only on a PC platform and will not work on many platforms used in publishing and information retrieval.

    From the cultural viewpoint, Europe has strong traditions in the study of world-wide cultures to which it has been linked in its history. As more and more material becomes accessible electronically, scholars in the humanities are being confronted with an exponential expansion in the quantity of written sources available and a dramatic improvement in the accessibility and searchability of this material. A dialogue between humanists as informed end users and information specialists is essential for determining the direction of future software development.

    It is thus clear that a field like computational linguistics for non-European languages is needed. Up to now, however, a forum which allows a scholar to discuss the educational implications in a broad European context does not exist.

    B.2 Aims and objectives

    The aims of this subproject are the same as for ACO*HUM, with the following specific objectives:

    1. provide a Europe-wide discussion area for those involved or interested in non-EU language issues (learning with links to research and computer industry);
    2. create groups of European partners (universities, software houses, third users) with various levels of involvements and for various language areas that can plan concerted actions in a domain which lacks opportunities of coordination and which is characterized by:
    3. indicate and promote links within the non-EU area group and with other area groups of the TNP in order to foster cooperation and coordination;
    4. agree on a 'preference list' of targeted non-EU language groups to be considered at various stages, based on:

    B.3-5

    See ACO*HUM.

    C. Project approach

    C.1 Main pedagogical and didactic approaches and concepts

    Due to the wide diversity of languages involved, focal points have been identified according to mixed criteria (available competence, geographical location, linguistic characteristics, availability of resources, possible applications). As far as language groups are concerned, the main focus is on the following:

    The project's methodology intends to build on already existing EU initiatives in the field of extra-EU languages (see the CAMEEL group) and enlarge participation from related groups/centres (Univ. Barcelona, Univ. Bergen, Univ. Nijmegen, etc.), while putting a more specific focus on the following:

    1. modern and recent technology of information processing;
    2. cross-disciplinary approaches;
    3. non-European cultures, languages and multimedia;
    4. practical and professional applications in multicultural communities;
    5. international and interregional trade and cooperation.

    C.2 New information and communication technologies

    In the existing MA programs in the field of African and Oriental languages, a need is felt for more specific training in computer applications to these languages. In order to develop methods and techniques for using, evaluating and creating tools for these languages, basic training should be provided at the undergraduate level and new, more advanced training at the postgraduate level. The philosophy is to provide options within the frame of existing MA and BA (Hons) programs, as well as to introduce the training as an independent new curriculum. Individual universities may decide to integrate these options as obligatory parts of their programs.

    The aim is to enhance the multilingual and multicultural competence of students of the humanities, social sciences, and computer science concerning the acquisition, processing, and transfer of knowledge. It is envisaged that these goals could be achieved through the introduction of two levels in the curriculum structure of member universities:

    1. on an intermediate level of education (undergraduate programme / minor subject);
    2. on an advanced level of education (European Masters).

    Undergraduate Curriculum

    LEVEL 1

    Module 1): revision of computer skills and the fundamentals of information science;

    Module 2): introduction to basic concepts of linguistics (and text management);

    Module 3): introduction to issues in linguistic engineering and communication sciences (e.g.: translation, terminology, multimedia means of communication, language standardization, writing systems, etc.);

    LEVEL 2

    Module 1): general survey of linguistic tasks for computer applications;

    Module 2): introduction to the available tools and the philosophy behind them, consistent with the expectations of the curricula: phonetic analysis, font scripts, parsers, data bases, communication tools (data banks);

    Module 3): evaluation of the tools with respect to the linguistic tasks (learn how to put a valid question based on one's own needs);

    Module 4): language specific applications of selected tools.

    Masters Curriculum

    The Masters Curriculum involves an in-depth revision of the contents of levels 1 and 2 in addition to an extensive confrontation with language specific problems: tones, phonetics, parsing, sorting, etc.: explicit formulation of these problems, practical training in methods and strategies of problem solving.

    An integral part of the postgraduate curriculum is that the students should have study and/or work experience abroad. This means that a student should take at least one module at a university other than his/her home university. For the work experience, the student is expected to acquire practical experience in industrial companies or governmental and non-governmental organizations. The student may decide whether to do this in his or her home country or abroad.

    Dictionary making, text searching and text analysis, text management, discourse analysis and phonetic analysis are some useful domains for testing and applying the acquired skills.

    C.3 ODL

    See ACO*HUM.

    D. Project organisation and workplan

    D.1 Architecture and work plan of the subproject

    1st year (1997-1998) 


    1. Inventory of existing software being used in non-European linguistics for the initial target languages
    2. Evaluate, select and suggest those programmes that could be profitably used for the purposes of the expanded subnet to be established.
    3. Indicate course program minimum requirements.
    4. Identify minimum standard requirements for corpora composition (minimum size, composition of corpora, general identification text and marking system).
    5. Indicate parameters and indicators of minimal acquired competence.
    6. October 1997: area committee meeting, together with the ACO*HUM policy symposium.
    7. May 1998: representation of the area at the ACO*HUM conference on The future of humanities education.
    8. September 1998: report with pre-final recommendations. 

    2nd year (1998-1999) 


    1. Finalize the content of modules and units in reference to credit and certification.
    2. Planning of intensive programmes for training teachers in the use and evaluation of didactic materials.
    3. Extension to other languages not initially covered. 
    4. October 1998: area committee meeting.
    5. June 1999: Creation of an E.M. in M.I. (European Masters in Multilingual Engineering).

    D.2 Coordinator

    As formal coordinator of the ACO*HUM project, the University of Bergen will appoint a local Area Coordinator, Prof. Dr. Joseph Norment Bell, who will be the liaison to the project office. Internal management of the subproject will be delegated to CILLSAC-IUO and INALCO-CRIM, who has already appointed a part-time manager for preparing the planning, creation, collation, implementation, etc. of the project by conducting preliminary investigations on major centres. A partial report on these investigations is expected before the end of May 1997.

    D.3 Partners' contribution

    Participants are organized in A, B and C groups (cf. ACO*HUM).

    D.4 Expertise

    (See also A.1) Participants of the A and B types have already been active in organizing and participating in intensive programmes on computer applications to modern non-European languages. CRIM (Paris) and ILC (Pisa) are already involved in research projects in computational linguistics with financial support from EU programmes.

    D.5 Working languages

    English, French, Italian.

    D.6 Dissemination

    The subproject will to a large extent use the WWW-based communication and dissemination channels which are set up by the ACO*HUM project. Reports, minutes of meetings and proceedings of conferences and workshops will be made available both on paper and through the ACO*HUM website.

    D.7 Monitoring and evaluation

    A set of teaching modules to be applied to African languages will be developed in the third phase for testing and evaluation purposes. A set of 'search and retrieval' formulas applied to various available corpora will be compared with similar operations on similar corpora of EU languages in order to identify the 'critical mass' of corpora.

    Devices for evaluation and improvement of the didactic tools will be indicated, using effective monitoring systems, at all stages of the project. Modules will first be developed for certain major languages/language groups (Sub-Saharan Africa, Arabic) and will serve as a pilot project, which will then be extended to cover other non-European languages.

    E. Other issues

    See ACO*HUM.

    F. Financial aspects

    The ACO*HUM budget will be extended with the following amounts:

    1. Staffing

    1.1 Academic staff time

    Area Liaison Officer (Univ. Bergen)       10 days x 150 ECU =        1500
    A-partners, teaching staff                1 x 30 days x 150 ECU =    4500
    B-partners, teaching staff                6 x 10 days x 150 ECU =    9000
    C-partners, teaching staff                10 x 5 days x 150 ECU =    7500
    Total                                                               22500

    1.2 Project management and administrative staff time

    Area manager (CILLSAC-IUO / INALCO-CRIM)  20 days x 150 ECU =        3000

    Total staffing                                                      25500

    2. Other project costs

    Travel, accommodation, subsistence

    Area Committee meeting                                        5 x 900 =    4500
    Attendance at the May conference (one attendant per partner)  17 x 900 =  15300
    Total                                                                     19800

    Total other project costs                                                 19800

    The requested additional grant from SOCRATES is 7500 ECU, corresponding to the travel expenses for one meeting and the administrative staff costs. The partners will carry their own teaching staff costs. Additional funds will be sought for participation at the May conference.
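
    The figures in this section can be cross-checked with a short calculation. The following is a minimal sketch in Python; the variable names are illustrative and not part of the proposal, and the number of partners per category is taken from the multipliers given in the budget lines above.

```python
DAY_RATE_ECU = 150   # daily staff rate used throughout the budget
TRIP_COST_ECU = 900  # cost per person per meeting or conference

# Academic staff time: (number of partners, days per partner)
academic = {
    "Area Liaison Officer": (1, 10),
    "A-partners": (1, 30),
    "B-partners": (6, 10),
    "C-partners": (10, 5),
}
academic_total = sum(n * days * DAY_RATE_ECU for n, days in academic.values())

# Project management and administrative staff time (area manager, 20 days)
admin_total = 20 * DAY_RATE_ECU

staffing_total = academic_total + admin_total

# Other project costs: travel, accommodation, subsistence
committee_meeting = 5 * TRIP_COST_ECU
may_conference = 17 * TRIP_COST_ECU
other_costs_total = committee_meeting + may_conference

# Requested grant: travel for one meeting plus administrative staff costs
requested_grant = committee_meeting + admin_total

print(academic_total, staffing_total, other_costs_total, requested_grant)
```

    The result for the requested grant is the sum of the Area Committee meeting (4500 ECU) and the area manager's time (3000 ECU), which agrees with the 7500 ECU stated above.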