Maximizing the (re)usability of language data

Workshop 3 on September 28, 1998, in conjunction with the ACO*HUM Conference, Bergen

Arvi Hurskainen
University of Helsinki
Arvi.Hurskainen@ling.helsinki.fi


Introduction

On May 28-30, 1998, the First International Conference on Language Resources and Evaluation was held in Granada, Spain. In addition to the local supporters, a wide range of organizations working in computational linguistics cooperated, such as ACH, ACL, ALLC, COCOSDA, EAFT, EAGLES, EDR, ELSNET, ESCA, EURALEX, FRANCIL, LDC, PAROLE, and TELRI. The conference was preceded by a number of workshops, three of which discussed issues relevant to the present workshop here in Bergen. This shows that various issues concerning electronic language resources need to be discussed from the viewpoints of different interest groups.

In Granada, the following workshops relevant to our theme were arranged:

  • Adapting Lexical and Corpus Resources to Sub-languages and Applications
  • Distributing and Accessing Linguistic Resources
  • Minimizing the Effort for Language Resource Acquisition
  • Language Resources for European Minority Languages
    In the workshop Language Resources for European Minority Languages, it was found that the situation with regard to language resources is fragmented and disorganized. Those languages do not have even basic descriptions, let alone extensive language resources such as corpora or electronic dictionaries. It was also found that the field of language resources shows a snowball effect: big languages, such as English, French and German, have an abundance of resources. It was pointed out, moreover, that even such big languages as Chinese, Korean and Arabic are badly neglected with regard to language resources. Several notable researchers in the Far East study applications - in English - instead of the big local languages.

    Because the situation of these smaller languages is in several respects poor, it is important to develop language resources for them.

    It is not an entirely unfortunate thing that minor languages have been lagging behind in the development of language resources. Big mistakes and bad investments have been avoided, and it is now possible to apply tested methods to those languages.

    Having worked with African languages for more than 30 years, and with the computational analysis of those languages for 13 years, I have the impression that mainstream development within computational linguistics does not suit the analysis of those languages. For example, the development and use of parallel corpora for a number of applications does not seem to fit languages with extensive morphological variation and quite different phrase structure and syntactic structure, although a lot of work in this field has been done on Germanic and Romance languages, for example.

    Roger (Bill) Mann, one of the leading pioneers of computational linguistics, worked for a couple of years in Nairobi on the idea that useful translation aids for Bantu languages could be built by first describing one language properly, and then applying that configuration to other languages and dialects by replacing surface morphemes with morphemes of the other language. It is not known to me how well he succeeded. My belief is that even in the case of related languages such 'translation' does not work unless it is based on a detailed analysis of the source and target languages.

    With genetically unrelated and grammatically different languages the problems are even bigger. For example, it is probably not very useful simply to align sentences or parts of sentences in English and Swahili, and thus obtain examples of phrases in both languages to be used later in applications such as MT. The text should first be analysed morphologically, syntactically, and preferably also semantically, and further processing can then be built safely on those results. All kinds of heuristic guessing and reliance on probabilities should be avoided as far as possible.

    I want to emphasize this point, because we should not only be concerned about the availability of language resources. We should also know what the resources should be available for. For example, the University of Lancaster has been working for a long time on producing bilingual and multilingual corpora. The need for such corpora arises from the approach adopted for developing linguistic tools. If parallel or multilingual corpora are needed for training language processing tools so that they perform their tasks better, then such resources are necessary. With this approach, automatic translation between the present official languages of the EU alone would need on the order of 100 parallel corpora, and there is as yet no guarantee that these corpora would solve the problems anyway. One can only imagine the proportions of the corpus-construction work needed if even just the most important world languages were added to the list.
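    To give a rough sense of the scale (a back-of-the-envelope sketch of my own; the exact figure depends on whether one counts one corpus per language pair or one per translation direction, with eleven official EU languages assumed as at the time of writing):

        # Back-of-the-envelope count of the parallel corpora needed for direct
        # pairwise translation among n languages. The value n = 11 is an
        # illustrative assumption (the official EU languages in 1998).
        n = 11
        unordered_pairs = n * (n - 1) // 2   # one corpus per language pair
        ordered_pairs = n * (n - 1)          # one corpus per translation direction
        print(unordered_pairs, ordered_pairs)   # 55 110

    Either way, the number of corpora grows quadratically with the number of languages, which is the heart of the problem described above.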

    Although this workshop should concentrate on the availability and reusability of language resources, we should be aware that not all available resources are what we really want. Freedom to develop information management tools with alternative approaches should of course be supported, but it should not make us cling to approaches which in the long run will prove untenable.

    Because I do not have any personal, intellectual or economic allegiances in the field of computational linguistics, it is entirely safe for me to present my own view of computational language manipulation, relating to a wide variety of applications, including MT. The approach is briefly the following:

    Develop a language analysis system so that it is capable of transforming, or translating, running text in normal orthography into a metalanguage, without manual interference. It is not a bad idea for this metalanguage to be based on English, because it is the most widely known language. The system would not translate into the surface form of English. Rather, the output would be a composition of morphological tags, information on phrase structure and sentence structure, and semantic information in the form of 'glosses' written in English. The output would contain all the information needed as a basis for translating it into any of the target languages.

    There should be unanimity on the precise format of the metalanguage, what tags should be used, and what kind of structure it should have. If this is worked on properly, it should be possible to perform MT tasks between any two languages through this metalanguage.
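    As a purely illustrative sketch (my own assumption of what such a record might look like, not a specification of any agreed format), the analysis of the Swahili sentence 'anasoma kitabu' ('he/she is reading a book') might be rendered in the metalanguage roughly as follows:

        # Hypothetical metalanguage record for the Swahili sentence
        # "anasoma kitabu" ("he/she is reading a book"). The tag names and
        # the structure are illustrative assumptions, not a fixed standard.
        sentence = [
            {
                "wordform": "anasoma",
                "lemma": "soma",
                "pos": "V",
                "morph": ["SUBJ-3SG-CL1", "PRESENT"],
                "syntax": "MAIN-VERB",
                "gloss": "read",
            },
            {
                "wordform": "kitabu",
                "lemma": "kitabu",
                "pos": "N",
                "morph": ["CL7", "SG"],
                "syntax": "OBJECT",
                "gloss": "book",
            },
        ]

    A generator for any target language would then work from records of this kind, rather than from the Swahili surface forms.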

    There are tremendous advantages in this approach.

    1. Proper descriptions of a language, starting from morphology and continuing to disambiguation, syntactic and semantic analysis, are being produced for a multitude of languages in any case, irrespective of whether they are related to MT or not. Even a good spelling checker for an inflecting language requires this, not to speak of more sophisticated language analysis tools. So the work towards translation into the metalanguage is in line with a number of other, less sophisticated, applications.

    2. The construction of corpora, which otherwise involves extensive manual work, is reduced to test corpora between the source language and the metalanguage only (see the sketch after this list).

    3. It should be emphasized that for every individual language one needs to worry only about its proper description and its interface with the metalanguage; translation problems with other languages are not involved.

    4. Because the work is based on a proper linguistic theory and description, the error-prone play with probabilities and guessing is practically cut out.
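    To quantify the second advantage (a sketch under the same illustrative assumption of eleven languages as above; the counts refer to corpora, not to the effort each one requires):

        # Comparison of direct pairwise translation with translation through a
        # metalanguage, for n languages (n = 11 is an illustrative assumption).
        n = 11
        direct_pairs = n * (n - 1)   # one corpus per translation direction
        via_metalanguage = n         # one test corpus per language, against the metalanguage
        print(direct_pairs, via_metalanguage)   # 110 11

    The pairwise figure grows quadratically with the number of languages, while the metalanguage approach grows only linearly.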

    This should not be interpreted as meaning that there is no room for alternative approaches. Obviously there will always be different approaches and tastes in doing the work. For many of us, the working environment simply dictates what is possible and what is not in that particular environment. Someone working only on a Macintosh certainly works differently from someone who has been working in a Unix/Linux environment. Nevertheless, the constraints caused by working environments are not the major ones, and new environments incorporate features that have been found useful in other environments.

    Various categories of users

    Computerized information data banks, text archives and corpora facilitate information management and retrieval through retrieval tools and other text manipulation programs. The processing of such data can be designed using a number of different architectures. In fact, one of the major problems in developing computerized information banks is the incompatibility of different systems. At least in linguistics, it has been common practice to design a corpus with the needs of linguistic research in mind on the one hand, and the software to be used in information retrieval on the other. The corpus text is provided with a large number of tags, the purpose of which is to serve further processing with retrieval tools.

    Such information banks are suitable for linguistic research, but their applicability to the needs of other disciplines is very limited. Although the texts themselves might contain valuable information also for non-linguists, the use of such material for non-linguistic purposes is cumbersome, if not totally impractical.

    Is there a need for, and a possibility of creating, information banks that would serve researchers of different disciplines? From the viewpoint of research economy there is no doubt about the need. The production of primary data is often the most costly component of a research project, and it is not in the general interest that such data are used only by those participating in the particular project. Whether the creation of such information banks is possible depends on two major factors. The first is the willingness of research organizations to make their primary data available to other researchers, and the second concerns the technical implementation of such information banks. In this paper I shall concentrate on the latter issue, the question of the technical feasibility of multi-purpose information banks.

    The idea I have in mind is that the primary data should be kept, as far as possible, free of all kinds of codes, irrespective of whether they are motivated by the retrieval tools or by the research field concerned (linguistic, anthropological, folkloristic, comparative literature, historical, sociological, etc.). A piece of text contains the same information whether it is coded or not. Codes do not add any information; they merely help in fulfilling certain tasks in the research process. Let us call this original text a 'master text'.

    Various researchers look at the very same text with different eyes. They look for different data, according to their orientation and research interests. Anthropological field research, for example, can produce a large variety of texts, which are highly interesting for a number of research fields, not only for anthropologists themselves. When a team of anthropologists has finished its field work, it processes the data (personal notes, observations, tape-recorded material containing a variety of information from different fields of life, etc.) into computer form. This is the 'master text' of that team. How are the team members going to find, accurately and each time, all the information they need from that text? What are the specific needs of anthropologists in information retrieval? This is a discipline-specific question, and means have to be found for solving it. If a linguist looks at the same text, the questions are certainly very different.

    Two approaches to information retrieval

    1. Direct string search

    In this approach, each researcher devises search keys for retrieving information and uses text-independent means for collecting and arranging the results. This is the simplest method and would not require any further preparation. The use of the possibilities offered by regular expression syntax would greatly enhance the power of this method. For most research tasks, however, the power of this method would not be sufficient. Therefore I propose another method, which I shall outline below.
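    Before that, as a minimal sketch of the direct string-search approach (the file name and the search pattern are hypothetical, chosen only for illustration), one might collect all word forms containing the string 'soma' from an untagged Swahili text:

        # Minimal sketch of direct string search on an untagged plain-text corpus.
        # The file name and the pattern are illustrative assumptions only.
        import re
        from collections import Counter

        pattern = re.compile(r"\b\w*soma\w*\b")   # word forms containing 'soma'

        counts = Counter()
        with open("swahili_text.txt", encoding="utf-8") as f:
            for line in f:
                counts.update(pattern.findall(line.lower()))

        for form, freq in counts.most_common():
            print(form, freq)

    The weakness is plain: the pattern also catches unrelated words that happen to contain the same string, and it misses morphophonologically altered forms, which is exactly why a language-aware method is needed.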

    2. The use of language-specific tools in information management and retrieval (IMR)

    This approach is based on the assumption that the text, in the format in which it is normally available, is the starting point, the basis on which the automatic manipulation is built. No manual pre-coding should be required for the system to be able to operate. In other words, no demands should be placed on the text, except that it be written correctly and be free of (major) typing errors.

    While the text itself should be free of codes (part-of-speech, syntactic, semantic, etc.), the text manipulation tools should be able to make explicit all the information implicitly encoded in the text. Such implicit information includes part-of-speech information, morphological analysis, lemmatization, syntactic information including dependency structures, context-specific semantic information, etc. Other types of information, such as etymological information on lemmas, can also be managed within the same system.

    In other words, the text itself should be in its original form, but at the same time the analysis system should be able to retrieve even the most esoteric bit of information from the text. In this system, the texts could be kept on a general level, and also the analysis system itself could be general in the sense that all IMR needs are taken care of by the same system. There would be no specific environment for the morphological analysis, another one for disambiguation and syntactic analysis (or mapping), yet another one for lemmatization, a fourth one for semantic information, a fifth one for concordancing or other types of string retrieval, etc. All this is taken care of within the same system.

    What I have said above might sound over-optimistic, a kind of 'pia desideria' with no relevance in practice. Of course, everybody would like to use non-coded texts and do all the tasks with one program, or an integrated set of programs. Although it sounds naive, it is not. In fact, it has been implemented to a quite sophisticated level for at least one language, Swahili, which has rich inflectional and derivational morphology.

    Phases of text analysis

    1. Raw text

    2. Pre-processing (normalization of text)

    3. Verticalization

    4. Morphological analysis including:

  • identification of morphemes with tags
  • lemmatization
  • part-of-speech
  • type of verb (argument structure)
  • one or more semantic 'glosses'
  • etymology (loanwords), etc.

    5. Heuristic tagging of unanalysed words

    6. Disambiguation (on the basis of syntactic and semantic information)

    7. Adding syntactic tags (syntactic mapping)

    8. Building dependency trees of syntactic constituents

    9. Transforming the language-specific syntactic structure to match the syntactic structure of the metalanguage by means of transformation rules.

    Result: Analysed text in metalanguage

    It should be noted that applications differ greatly as to the level of analysis required. For simple information retrieval, no prior text manipulation is needed; the source text is good enough for the purpose. In testing morphological analysers, one needs to proceed to phase 4, but for doing this one may already in phase 3 use different programs for verticalizing the text (all words in the original order, all words in alphabetical order, only real words without diacritics and punctuation marks, only word-form tokens, etc.). For accurate information retrieval based on morphology, the process needs to be carried out up to phase 6. When one needs to study syntax, one has to go through the phases up to phase 7 or 8. Phase 9 is involved when the aim is to translate the language into the metalanguage discussed above.
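    To make the division into phases more concrete, the following is a schematic skeleton of such a pipeline (a simplified sketch of my own; the functions are placeholder stubs, not the actual analysers referred to elsewhere in this paper):

        # Schematic skeleton of the phased analysis pipeline described above.
        # Every analyser is a placeholder stub; a real system would plug in
        # morphological analysis, disambiguation, syntactic mapping, etc.

        def normalize(raw_text):
            """Phase 2: pre-processing, e.g. unifying whitespace."""
            return " ".join(raw_text.split())

        def verticalize(text):
            """Phase 3: one token per line, in the original order."""
            return [tok.strip(".,;:!?\"'") for tok in text.split()]

        def analyse_morphology(token):
            """Phase 4: placeholder for morphological analysis and glossing."""
            return {"wordform": token, "lemma": None, "pos": None,
                    "morph": [], "gloss": None}

        def pipeline(raw_text, up_to_phase=4):
            text = normalize(raw_text)                          # phase 2
            tokens = verticalize(text)                          # phase 3
            if up_to_phase < 4:
                return tokens
            analyses = [analyse_morphology(t) for t in tokens]  # phase 4
            # Phases 5-9 (heuristic tagging, disambiguation, syntactic
            # mapping, dependency trees, transformation to the metalanguage)
            # would follow here, each consuming the previous phase's output.
            return analyses

        # Different applications stop at different phases, as noted above:
        print(pipeline("Anasoma kitabu.", up_to_phase=3))

    The point of the skeleton is only to show that one integrated system can serve all levels, with each application stopping at the phase it needs.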

    The result of this sequence of operations is an analysed text with a maximal amount of information (morphological, syntactic, semantic, etymological, glosses in another language, etc.). Language-specific information retrieval is carried out on the basis of the 'tags' thus inserted into the text.

    When the approach to IMR is as described above, what we actually need is texts of different kinds, written in the normal way and available in computer form. The analysis system should be capable of producing different versions of that text for specific needs.

    I consider it bad policy if the system requires some specific format of the source text; the system should be capable of producing that format itself. And if a system does require a special format, hand-coded or semi-automatically produced, it should be easy to return the text to its normal format. SGML, HTML, and TEI are examples of formats that can easily be converted back to normal text. There is also an increasing number of systems that recognize the meaning of the codes, treating them as codes and not as part of the text.
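    As a minimal sketch of such a conversion (a crude regular-expression approach that assumes well-formed markup; a real converter would use a proper SGML/XML parser and also resolve character entities), angle-bracket codes can be stripped to recover the plain text:

        # Crude sketch: strip SGML/HTML/TEI-style codes to recover plain text.
        # Assumes well-formed markup; a real converter would use a proper
        # parser and also resolve character entities.
        import re

        def strip_tags(marked_up):
            text = re.sub(r"<[^>]+>", "", marked_up)    # remove <...> codes
            return re.sub(r"\s+", " ", text).strip()    # tidy the whitespace

        sample = "<p>Watu <hi rend='italic'>wengi</hi> wanasoma.</p>"
        print(strip_tags(sample))   # -> Watu wengi wanasoma.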

    Language resources available on African languages

    Having worked with African languages, I have some knowledge of the state of the art in this field. Compared with the language data of the big world languages, work on computerized language resources for African languages is still in its initial phases. There are no really big projects in any part of Africa for creating computerized language data. Swahili, the most widespread of the African languages, is clearly leading in the field. The situation is, however, changing rapidly. Most printing work is now done with systems that make use of electronic text. Therefore, the question is not primarily whether the resources exist, but how to gain access to them, and how they should be made available to the research community.

    The following list of language data available on African languages is by no means exhaustive. However, it shows where the emphasis is.

  • Dictionaries in plain ASCII in various languages and various places, e.g. Swahili, Hausa, Oshindonga, Kinyakyusa, Kikae.
  • Dictionaries in a database structure (The Kamusi Project: Internet Living Swahili Dictionary, free access through Internet)
  • Computer Archives of Swahili (Helsinki, the biggest, many types of text, access by contract through: vanhala@ling.helsinki.fi)
  • Archives of Popular Swahili (Amsterdam, free access, http://www.pscw.uva.nl/lpca/)
  • Archives of Swahili (Naples, developing)
  • Archives of Swahili and Akan languages in Zürich
  • Word-lists in various languages and dialects, part of them in the archives of Helsinki
  • Transliterations of oral data (Helsinki and Dar-es-Salaam, Amsterdam)
  • Translations of the Bible and the Qur'an in various languages
  • Newspapers on the Internet (Rai and Majira in Swahili)
  • Daily news on the Internet (Deutsche Welle in Swahili and Hausa, Voice of America)

    Most of these data are in plain text format with a minimal level of tagging.

    TEI is used for structuring texts and for making the inclusion or exclusion of certain sections of text easy and accurate.

    Problems of access to data

    Because the creation of data resources large enough to be of real use is labour-intensive and expensive, the creators of such resources have been unwilling to share them with others. Especially if the resources contain material that cannot be sold because of copyright restrictions, there is a real problem. If the copyright holders create such data, it is very unlikely that they do so for any reason other than making money. It is also doubtful whether copyright holding is a good basis for creating language data. Copyright restrictions have an effect in two ways. Because they prevent or restrict the selling of the language data, the compilers of the data try to get around these restrictions by taking only small samples from texts here and there, thus making the use of these data impossible, for example, for the study of literature. Or the compilers refrain from trying to sell the data, in which case the burden of financing the work falls on the compilers alone. If this is the case, there is not much motivation for sharing the data with others.

    Because language resources are created primarily for research purposes, copyright holders should be made aware that allowing their material to become part of the resources would benefit them in several ways. By releasing their material for testing language processing tools, they would ensure that the tools are tuned to be capable of processing the kind of material they produce. As a consequence, they themselves would then benefit by getting appropriate tools for their own use.

    Experience has shown, at least in creating resources for Swahili, that publishing houses have in general been willing to let their material be used for research purposes in an environment where access to the resources is controlled and granted only through a signed agreement. In my experience, a non-commercial language resource server, with access limited to researchers only, is a better solution than freely available resources, because it facilitates the inclusion of materials under copyright restrictions. The problem, not yet solved, is how to get the users of such good-quality data to compensate for their access to it. One solution used is to ask them to do their share in accumulating data in the data bank. This is done, however, on a bona fide basis without a formal agreement, and only in very few cases has it produced useful results. Nevertheless, this controlled-access arrangement has so far proved to be the best way of making useful data globally available through the Internet.