4.1 The changing field of computational linguistics and human language technologies

4.1.1 Introduction

Computational Linguistics (CL) is concerned with computer methods in the scientific study of language.  CL is relatively young as a scientific field but nevertheless it was one of the earliest humanities disciplines where advanced computer methods were applied.  Attempts to translate documents automatically were already made in the 1950s and 1960s.  Although these first attempts fell short of expectations, the potential of computational methods for the study of language was soon discovered and triggered a revolution in linguistic methodology.

Through the efforts of Chomsky (1957) and others, linguistics went through a process of formalization founded in mathematics and logic.  Formal approaches to linguistics have strongly enabled the use of computers in language processing but have also had a lasting effect on general (theoretical) linguistics.  From the 1970s on, CL has become recognized as a field of its own with a wide range of ongoing advanced research.  Strong interdisciplinary ties were established with areas outside of linguistics, including computer science, artificial intelligence, and cognitive psychology.  This has led to the establishment of a number of interdisciplinary research programmes with CL as an important component, especially in the USA.  On the other hand, this has sometimes weakened the perception of CL as a humanities discipline.

The interdisciplinary nature of CL requires a brief reflection on its relation with general linguistics.  Today, linguists who would not label themselves as computational linguists increasingly use specialized computer tools that support lexicology, grammar writing and other forms of linguistic scholarship.  However, while general linguists may make appropriate use of such tools, they do not necessarily fully understand the underlying mechanisms, let alone design and develop new tools.  In contrast, a distinguishing characteristic of the computational linguist is the knowledge and skill to understand, design and develop CL tools, methods and techniques.  CL approaches to language are often referred to as Natural Language Processing (NLP).

Increasingly, however, advanced applications make computer approaches to language visible in society, and it is here that engineering aspects come into the picture.  Examples of language processing applications include machine translation, dialogue systems, information access, proofreading, and computer access for people with disabilities.  Engineering approaches to language are often designated as Language Engineering (LE) or Human Language Technologies (HLT).

Traditionally, CL has been mainly concerned with written text, whether dialogue or discourse (Allen, 1995; Gazdar and Mellish, 1989; Roche and Schabes, 1997).  CL differs from speech processing in that it is concerned with discrete, symbolic representations of language rather than continuous speech signals.  In recent years, however, it has been recognized that the CL and speech processing communities need a common ground for cooperation.  Without speech, CL is missing an important modality of linguistic communication.  Conversely, speech processing without the deeper levels of representation fails to relate spoken language to its meaning.

In the 1990s, the commercial potential of HLT and of speech processing began to be fully appreciated.  By the end of that decade, at least one company, Flanders-based Lernout & Hauspie, had built up 'language factories' employing over a thousand computational linguists and language specialists worldwide.  Such developments have not been entirely beneficial to educational institutions offering CL studies, since these have been depleted of scarce competent personnel, who moved to jobs in industry owing to a lack of incentives at universities.  In some countries where educational authorities have failed over the years to attract a sufficient level of academic competence, Norway being one example, the effect is felt especially hard.  Eventually industry too will realize that this situation is unsustainable in the long run.

In the past, CL has been strongly concerned with the automatic linguistic analysis of sentences (parsing).  Research in this important problem domain required strong interdisciplinary links between linguistics, formal methods (mathematics and logic) and computer science.  On the one hand, automatic analysis of language requires a language to be defined in terms of a formal grammar.  On the other hand, it also requires efficient parsing algorithms for searching through the maze of possible combinations of words and phrases that make up sentences.  When the problem is approached in this way, a sentence can be mapped into a representation of its structure.  This representation, in turn, can be mapped into a representation of its meaning and its communicative intention.
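
To make the combination of a formal grammar and a parsing algorithm concrete, the following is a minimal sketch of a chart parser in the CYK style.  It is an illustration only, not a method attributed to any of the works cited here: the toy grammar, the lexicon and the function name cyk_parse are all hypothetical, and the grammar is assumed to be in Chomsky normal form (every rule rewrites a category as exactly two categories or as a single word).

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Keys are pairs of daughter categories, values are possible mother categories.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}

# Toy lexicon mapping words to their categories (also hypothetical).
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "sees": {"V"},
}


def cyk_parse(words):
    """Return one bracketed structure for `words`, or None if there is no parse."""
    n = len(words)
    # chart[i][j] maps a category to one tree that spans words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill in the single-word spans from the lexicon.
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, ()):
            chart[i][i + 1][cat] = (cat, word)

    # Combine shorter spans into longer ones, shortest first.
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                left, right = chart[start][mid], chart[mid][end]
                for (lcat, ltree), (rcat, rtree) in product(left.items(), right.items()):
                    for parent in BINARY_RULES.get((lcat, rcat), ()):
                        # Keep the first tree found for each category.
                        chart[start][end].setdefault(parent, (parent, ltree, rtree))

    return chart[0][n].get("S")


if __name__ == "__main__":
    print(cyk_parse("the dog sees the cat".split()))
```

The chart stores every category that can span each substring, so the search through possible combinations of words and phrases is done once per span rather than once per candidate analysis.  The sketch keeps only one tree per category and span; a fuller treatment would have to represent ambiguity and map the resulting structure to a meaning representation.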

There has been much focus on representing the lexicon, syntax, semantics and pragmatics of natural language in formalisms such as Lexical-Functional Grammar (LFG) and Head-driven Phrase Structure Grammar (HPSG).  Theories of dialogue and discourse, as well as efficient methods for generating text from meaning representations, have also been researched.  In recent years, it has been recognized that many problems cannot be approached from formal theory alone, but also require empirical research, often through the study of very large corpora from which statistical information is extracted (Charniak, 1993; Krenn & Samuelsson, 1997; Young & Bloothooft, 1997).
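
As a small illustration of what extracting statistical information from a corpus can mean, the sketch below estimates bigram probabilities by relative frequency.  This is only one simple technique among many, chosen here for brevity; the function name, the sentence boundary markers and the three-sentence corpus are assumptions made for the example, and whitespace tokenization stands in for real tokenization.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(w2 | w1) by relative frequency over a list of sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one serves as the history of a bigram.
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


if __name__ == "__main__":
    corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
    probs = bigram_probabilities(corpus)
    for pair in [("the", "dog"), ("the", "cat"), ("dog", "sleeps")]:
        print(pair, round(probs[pair], 3))
```

Estimates of this kind become useful only with very large corpora, and unseen word pairs receive no probability at all in this sketch; practical systems apply some form of smoothing.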

Some research in CL is conducted in publicly or privately funded research centres, but much is located in universities, where staff are engaged in teaching alongside their research.  Most academics accept the principle that teaching and research are mutually reinforcing, hence in most universities where there are CL researchers, the subject is taught to students.

4.1.2 Language and information technology: where are we now?

During the 1990s, CL saw a rapid increase in technological development at both research and industrial levels.  Of the several factors that gave a strong impetus to CL applications, we name only the most prominent.  On the one hand, the extremely fast growth of the Internet caused an explosion in computer-based communication, most of which involves written natural language.  The sheer volume of communication has created an acute need for intelligent search, information extraction and filtering of natural language documents.  On the other hand, the omnipresent computer itself is becoming invisible, as web browsers and other computer applications are being embedded into mobile phones, refrigerators, cars, and other everyday things.  The use of embedded computing in everyday situations makes natural communication all the more desirable.  Already today, some mobile phones incorporate language and speech processing by such means as speech-controlled dialing and predictive typing of SMS messages with the use of an internal dictionary.
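
Dictionary-based predictive typing on a numeric keypad can be sketched as follows.  This is a simplified illustration, not a description of any particular phone's implementation: the letter layout is the standard telephone keypad, but the word list, the frequency figures and the function names are hypothetical, and words are assumed to consist of the letters a to z only.

```python
from collections import defaultdict

# Standard telephone keypad layout.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def build_index(word_frequencies):
    """Map each key sequence to its dictionary words, most frequent first."""
    index = defaultdict(list)
    for word, freq in word_frequencies.items():
        keys = "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())
        index[keys].append((freq, word))
    return {
        keys: [word for _, word in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for keys, entries in index.items()
    }


def candidates(index, keys):
    """Return the words that match a sequence of key presses."""
    return index.get(keys, [])


if __name__ == "__main__":
    # Hypothetical internal dictionary with made-up frequency counts.
    dictionary = {"good": 80, "home": 50, "hello": 40, "gone": 30, "hood": 5}
    index = build_index(dictionary)
    print(candidates(index, "4663"))
    print(candidates(index, "43556"))
```

Because several words share one key sequence, the dictionary is what resolves the ambiguity: the user presses each key once and the phone proposes the most likely word first.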

Therefore, there has recently been a focus on developing CL methods for solving real-world application problems in the areas of dialogue systems (Bernsen et al., 1998; Dalsgaard et al., 1995, 1999; Gibbon et al., 1997) and information retrieval and extraction (TREC-7, 1998; MUC-7, 1998).  There has been a lesser focus on machine translation or machine-aided translation (see Cole et al., 1995).  There is a growing amount of work on developing integrated natural language and speech processing systems (Bloothooft et al., 1997, 1998, 1999; Green et al., 1997; Jurafsky & Martin, 1999; McTear & Kouroupetroglou, 1998; Young & Bloothooft, 1997).  Over the past few years the speech community has had much success with developing working spoken dialogue systems for limited application domains such as banking, travel information, weather information, call centre routing, and so on.  For example, Lucent Technologies' Bell Laboratories claims their call centre routing system for banking performs better than humans at routing phone calls.

Yet another community, i.e. information retrieval and message understanding, is in need of smarter methods for obtaining information from texts.  The USA has established national message understanding conferences for competing systems (e.g. MUC-7, 1998) to parallel the text retrieval conferences already held in information retrieval (TREC-7, 1998).  The recent upsurge of work in Intelligent MultiMedia or MultiModal systems integrating graphics, image processing, haptic and other modalities also incorporates CL methods as part of dialogue interfaces (Brøndsted et al., 1998; Dalsgaard et al., 1999; Maybury, 1993; Maybury and Wahlster, 1998; Mc Kevitt, 1995-96, 1998).

These developments suggest that the processing of natural language and speech is a prominent instance of an area where the humanities, science and engineering are converging (Bloothooft, 1998; De Smedt and Apollon, 1998; McTear and Kouroupetroglou, 1998).  This convergence becomes all the more apparent when one recognizes that language is an important aspect of multimedia and multimodal systems that also incorporate graphics, vision and other modalities and that are applicable to visual art, music, dance, film and other creative expressions (Maybury, 1993; Mc Kevitt, 1995-96, 1998).  The Internet helps to force the merging of the humanities, sciences and engineering in terms of representing and accessing information in multiple modalities, including at least text and voice in multiple languages, sounds and music, images and videos.  This is a major application area of Intelligent MultiMedia (see Maybury, 1997).

Mobile computing and communications devices are becoming more prevalent, and computers are becoming ubiquitous and even invisible.  Computing and telecommunications technologies have converged rapidly in the past few years (IEEE Spectrum, 1996).  This convergence will soon enable users to interact with perceptual speech and image data at remote sites, with the data being integrated and processed at some central source and the results possibly being relayed back to the user.  The increase in bandwidth for wired and wireless networks and the proliferation of hand-held devices and computers (Bruegge & Bennington, 1996; Rudnicky et al., 1996; Smailagic and Siewiorek, 1996) bring this possibility even closer.  Applications of mobile media are numerous, including data fusion during emergencies, remote maintenance, remote medical assistance, distance teaching and web browsing.  One can imagine mobile offices where one can transfer money between bank accounts or order goods and tickets even while travelling by car.  The possibility of controlling robots through mobile communications is gaining momentum (Uhlin & Johansson, 1996) and there are also applications within virtual reality (IEEE Spectrum, 1997).  All of these applications are crucially dependent on communication, where language plays a prominent role.

4.1.3 The European dimension in research and technology

The European Union has recognized that HLT has a distinct dimension in its society.  The numerous cultures and languages of Europe have a significant impact on society, more so than in the USA or Japan.  It is therefore not surprising that the EU, through its past and present RTD frameworks, has allocated significant resources to HLT.  The rationale for the MLIS programme (MultiLingual Information Society) is that European citizens must be able to participate in and benefit from the global information society.  Human communication is at the heart of the information society, and in a richly multilingual area such as the EU, full participation requires multilingual facilities for creating, exchanging and accessing information across language borders, throughout Europe and beyond.  Through MLIS and other programmes, the efforts of private companies and public organisations in the Member States are supplemented with support on a European scale in pursuit of the programmes' aims.

Among the special needs in a multilingual society, there is not only machine translation and interpreting (e.g. Verbmobil) but also software localisation, language identification, multilingual generation of news, reports, manuals, etc.  People working with multilingual systems have discovered that this is not a simple matter of translating from one language to another, but that concepts and metaphors which are not the same across languages need special attention.  Telephone answering systems would also have to take into account pragmatic conventions such as the fact that in Germany one answers the phone with one's surname whereas in the British Isles one usually says Hello.  The fact that half eight means 8:30 in the British Isles, but is often understood as 7:30 in many continental countries, is a frequent cause of confusion and missed meetings.  Hence, cultural and pragmatic conventions suggest that the development of European multilingual systems should not be left solely to engineers but needs significant input from the humanities.

In the Fifth Framework, HLT falls under the Information Society Technologies programme.  It is expected to focus on advanced human language technologies enabling cost-effective interchanges across language and culture, natural interfaces to digital services and more intuitive assimilation and use of multimedia content.  Work would address written and spoken language technologies and their use in key sectors such as corporate and commercial publishing, education and training, cultural heritage, global business and electronic commerce, public services and utilities, and special needs groups.  Work would also develop electronic language resources (e.g. dictionaries or terminologies) in standard and re-usable formats.  Research and Development priorities include adding multilinguality to systems at all stages of the information cycle, including content generation and maintenance in multiple languages, localisation of software and content, automated translation and interpretation, and computer-assisted language training; enhancing the natural interactivity and usability of systems where multimodal dialogues, understanding of messages and communicative acts, unconstrained language input-output and keyboard-less operation can greatly improve applications; enabling active assimilation and use of digital content, where work would apply language-processing models, tools and techniques for deep information analysis and metadata generation, knowledge extraction, classification and summarisation of the meaning embodied in the content, including intelligent language-based assistants.

4.1.4 Consequences for employment and education

Many companies that integrate linguistic information processing in their activities (e.g. Microsoft, Toshiba, NEC, NTT, Nokia, Ericsson, Philips, DaimlerChrysler) are investing in spoken dialogue capabilities and other language applications.  In addition, dedicated developers of language and speech applications are becoming more and more prominent, with Lernout & Hauspie (L&H) currently the leader.  With the convergence of computing and communications, it is clear that language-enabled systems have a significant role.  The companies investing in the technology offer good employment opportunities for CL graduates.  However, these opportunities are not without problems.  Companies face an acute shortage of people well educated and trained in CL.  On the one hand, too few students are attracted to CL programmes, owing to a lack of awareness.  On the other hand, the field is moving so fast that CL courses and programmes are lagging behind, partly because of understaffing and suboptimal contacts between academia and industry.

The process of CL reaching HLT user communities is the object of the Europe-wide survey EUROMAP, which is funded under the Language Engineering sector within the Telematics programme of the EU.  The survey, which started in 1996, is pulling together data on LE activities in Europe as well as actual user and market requirements.  Based on this analysis, recommendations are developed on how to link LE capabilities with marketplace opportunities.  A view of employment opportunities in CL/NLP and speech systems could be developed from an analysis of job openings.

On the education front, the challenges are formidable and will likely not be addressed adequately by small-scale solutions.  A recent survey in The Netherlands and Flanders showed that despite the explosive growth of the language and speech industries in this region, education in this field has hardly made significant advances in recent years (Bouma & Schuurman, 1998).  It is therefore not surprising that many see the need for increased public support for CL education and for coordinated action at an international level to increase the quantity and quality of CL courses and programmes.

A number of projects and other activities in education are worth mentioning and will be discussed in more detail in the remainder of this chapter.  Previous work has been done by ERASMUS inter-university cooperation programme (ICP) projects aimed at promoting mobility and curriculum development.  Notably, there was an ICP on Natural Language Processing, active from 1993 to 1996 (Way, 1998), and one in Logic, Language and Information, coordinated by the European Association for Logic, Language and Information, also known as FoLLI, which also organized summer schools in this area.

The present chapter was written under the auspices and with the support of ACO*HUM, the SOCRATES thematic network project for Advanced Computing in the Humanities, which started in 1996 and is expected to conclude in 2000.  Within this project, which networked more than 100 European universities, a working group in Computational Linguistics and Language Engineering reflected on the challenges for CL education.  The group organized several meetings and conducted a survey which will be discussed below.  The working group kept close links to another SOCRATES thematic network project, the one in Speech Communication Sciences (Bloothooft et al., 1997-1999; Bloothooft, 1997, 1998, 1999), with 80 partner institutions.

The main goal of this networking has been to analyse the present status of education in relevant fields in Europe, to make proposals on existing curricula and to formulate recommendations for future initiatives and implementation.  Among the results of this intensive networking is a proposal for a curriculum for a European degree in Natural Language Processing which is further specified in a SOCRATES Curriculum Development Action from 1997 to 2000.  Also, the networks reflected on the potential of computer assisted learning and distance learning techniques for language and speech through pilot projects co-sponsored by ELSNET.  The findings, proposals and recommendations of the ACO*HUM group will be discussed in detail in the remainder of this chapter.