Through the efforts of Chomsky (1957) and others, linguistics went through a process of formalization founded in mathematics and logic. Formal approaches to linguistics have strongly enabled the use of computers in language processing but have also had a lasting effect on general (theoretical) linguistics. From the 1970s on, CL has become recognized as a field of its own with a wide range of ongoing advanced research. Strong interdisciplinary ties were established with areas outside of linguistics, including computer science, artificial intelligence, and cognitive psychology. This has led to the establishment of a number of interdisciplinary research programmes with CL as an important component, especially in the USA. On the other hand, this has sometimes weakened the perception of CL as a humanities discipline.
The interdisciplinary nature of CL requires a brief reflection on its relation to general linguistics. Today, linguists who would not label themselves as computational linguists are increasingly using special computer tools which support lexicology, grammar writing and other forms of linguistic scholarship. However, while general linguists may make appropriate use of such tools, they do not necessarily fully understand their underlying mechanisms, let alone design and develop new tools. In contrast, a distinguishing characteristic of the computational linguist is the knowledge and skills to understand, design and develop CL tools, methods and techniques. CL approaches to language are often referred to as Natural Language Processing (NLP).
Increasingly, however, advanced applications make computer approaches to language visible in society, and it is here that engineering aspects come into the picture. Language processing applications include, for instance, machine translation, dialogue systems, information access, proofreading, and computer access for people with disabilities. Engineering approaches to language are often designated as Language Engineering (LE) or Human Language Technologies (HLT).
Traditionally, CL has been mainly concerned with written text, whether dialogue or discourse (Allen, 1995; Gazdar and Mellish, 1989; Roche and Schabes, 1997). CL differs from speech processing in that it is concerned with discrete, symbolic representations of language rather than continuous speech signals. In recent years, however, it has been recognized that the CL and speech processing communities need a common ground for cooperation. Without speech, CL is missing an important modality of linguistic communication. Conversely, speech processing without the deeper levels of representation fails to relate spoken language to its meaning.
In the 1990s, the commercial potential of HLT as well as speech processing began to be fully appreciated. By the end of that decade, at least one company, Flanders-based Lernout & Hauspie, had built up 'language factories' employing over a thousand computational linguists and language specialists worldwide. Such developments have not been entirely beneficial to educational institutions offering CL studies, which have been depleted of scarce competent personnel who moved to jobs in industry for lack of incentives from universities. In some countries where educational authorities had failed to attract a sufficient level of academic competency over the years, for instance Norway, this is being felt especially keenly. Eventually, industry too will realize that this situation is unsustainable in the long run.
In the past, CL has been strongly concerned with the automatic linguistic analysis of sentences (parsing). Research in this important problem domain required strong interdisciplinary links between linguistics, formal methods (mathematics and logic) and computer science. On the one hand, automatic analysis of language requires a language to be defined in terms of a formal grammar. On the other hand, it also requires efficient parsing algorithms for searching through the maze of possible combinations of words and phrases that make up sentences. When the problem is approached in this way, a sentence can be mapped into a representation of its structure. This representation, in turn, can be mapped into a representation of its meaning and its communicative intention.
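The pairing of a formal grammar with a parsing algorithm can be illustrated with a minimal sketch. It assumes a tiny hand-written grammar in Chomsky normal form and uses the well-known CYK chart algorithm; the rule set, the lexicon and the name cyk_parse are hypothetical choices made for this illustration only and are not taken from any system discussed in this chapter.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Each binary rule maps a pair of child categories to possible parents.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}

# Toy lexicon mapping words to their lexical categories (also hypothetical).
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "sees": {"V"},
}


def cyk_parse(words):
    """Return one tree rooted in S that spans all of `words`, or None."""
    n = len(words)
    if n == 0:
        return None
    # chart[i][j] maps a category to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill in the single-word spans from the lexicon.
    for i, word in enumerate(words):
        for category in LEXICON.get(word, ()):
            chart[i][i + 1][category] = (category, word)

    # Combine adjacent spans bottom-up, from short spans to long ones.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                left_cell = chart[start][mid].items()
                right_cell = chart[mid][end].items()
                for (lcat, ltree), (rcat, rtree) in product(left_cell, right_cell):
                    for parent in BINARY_RULES.get((lcat, rcat), ()):
                        # Keep only the first analysis found for each category.
                        chart[start][end].setdefault(parent, (parent, ltree, rtree))

    return chart[0][n].get("S")


if __name__ == "__main__":
    print(cyk_parse("the dog sees the cat".split()))
    print(cyk_parse("dog the sees".split()))
```

The returned nested tuple is a representation of sentence structure in the sense described above; a string that the toy grammar does not license yields None. A realistic grammar would be far larger and ambiguous, in which case the chart would need to keep all analyses rather than the first one.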
There has been much focus on representing the lexicon, syntax, semantics and pragmatics of natural language in formalisms such as Lexical-Functional Grammar (LFG) and Head-driven Phrase Structure Grammar (HPSG). Theories of dialogue and discourse, and efficient methods for generating text from meaning representations, have also been researched. In recent years, it has been recognized that many problems cannot be approached from formal theory alone, but also require empirical research, often through the study of very large corpora from which statistical information is extracted (Charniak, 1993; Krenn & Samuelsson, 1997; Young & Bloothooft, 1997).
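As a minimal illustration of what extracting statistical information from a corpus can mean, the following sketch estimates bigram probabilities by relative frequency. The three-sentence corpus, the boundary markers and the function name bigram_probabilities are assumptions made for this example; real studies work with very large corpora and normally apply smoothing, which is omitted here.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(word | previous word) by relative frequency.

    `sentences` is an iterable of whitespace-tokenised strings. The markers
    <s> and </s> are added here to represent sentence boundaries.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one serves as a context.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (previous, word): count / context_counts[previous]
        for (previous, word), count in bigram_counts.items()
    }


if __name__ == "__main__":
    # Toy corpus (hypothetical); far too small for any real conclusion.
    corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
    probabilities = bigram_probabilities(corpus)
    for pair in [("the", "dog"), ("the", "cat"), ("dog", "barks")]:
        print(pair, round(probabilities[pair], 3))
```

Estimates of this kind are unreliable for word pairs that are rare or absent in the corpus, which is one reason why the empirical approaches cited above depend on very large collections of text.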
Some research in CL is conducted in publicly or privately funded research centres, but much is located in universities, where staff are engaged in teaching alongside their research. Most academics accept the principle that teaching and research are mutually reinforcing, hence in most universities where there are CL researchers, the subject is taught to students.
Therefore, there has recently been a focus on developing CL methods for solving real-world application problems in the areas of dialogue systems (Bernsen et al., 1998; Dalsgaard et al., 1995, 1999; Gibbon et al., 1997) and information retrieval and extraction (TREC-7, 1998; MUC-7, 1998). There has been a lesser focus on machine translation or machine-aided translation (see Cole et al., 1995). There is a growing amount of work on developing integrated natural language and speech processing systems (Bloothooft et al., 1997, 1998, 1999; Green et al., 1997; Jurafsky & Martin, 1999; McTear & Kouroupetroglou, 1998; Young & Bloothooft, 1997). Over the past few years the speech community has had much success with developing working spoken dialogue systems for limited application domains such as banking, travel information, weather information, call centre routing, and so on. For example, Lucent Technologies' Bell Laboratories claims their call centre routing system for banking performs better than humans at routing phone calls.
Yet another community, i.e. information retrieval and message understanding, is in need of smarter methods for obtaining information from texts. The USA has established national message understanding conferences for competing systems (e.g. MUC-7, 1998) to parallel the text retrieval conferences already held in information retrieval (TREC-7, 1998). The recent upsurge of work in Intelligent MultiMedia or MultiModal systems integrating graphics, image processing, haptic and other modalities also incorporates CL methods as part of dialogue interfaces (Brøndsted et al., 1998; Dalsgaard et al., 1999; Maybury, 1993; Maybury and Wahlster, 1998; Mc Kevitt, 1995-96, 1998).
These developments suggest that the processing of natural language and speech is a prime example of an area where the humanities, science and engineering are converging (Bloothooft, 1998; De Smedt and Apollon, 1998; McTear and Kouroupetroglou, 1998). When one recognizes that language is an important aspect of multimedia and multimodal systems that also incorporate graphics, vision and other modalities, applicable to visual art, music, dance, film and other creative expressions, this convergence becomes all the more apparent (Maybury, 1993; Mc Kevitt, 1995-96, 1998). The Internet contributes to forcing the merging of the humanities, sciences and engineering in terms of representing and accessing information in multiple modalities including at least text and voice in multiple languages, sounds and music, images and videos. This is a major application area of Intelligent MultiMedia (see Maybury, 1997).
Mobile computing and communications devices are becoming more prevalent and computers are ubiquitous and even invisible. There has been rapid convergence of computing and telecommunications technologies in the past few years (IEEE Spectrum, 1996). These technologies will soon enable users to interact with perceptual speech and image data at remote sites, with the data being integrated and processed at some central source and the results possibly being relayed back to the user. The increase in bandwidth for wired and wireless networks and the proliferation of hand-held devices and computers (Bruegge & Bennington, 1996; Rudnicky et al., 1996; Smailagic and Siewiorek, 1996) bring this possibility even closer. Applications of mobile media are numerous, including data fusion during emergencies, remote maintenance, remote medical assistance, distance teaching and internet web browsing. One can imagine mobile offices where one can transfer money between bank accounts or order goods and tickets even while driving. The possibility of controlling robots through mobile communications is gaining momentum (Uhlin & Johansson, 1996) and there are also applications within virtual reality (IEEE Spectrum, 1997). All of these applications are crucially dependent on communication, where language plays a prominent role.
In the Fifth Framework, HLT falls under the Information Society Technologies programme. It is expected to focus on advanced human language technologies enabling cost-effective interchanges across language and culture, natural interfaces to digital services and more intuitive assimilation and use of multimedia content. Work would address written and spoken language technologies and their use in key sectors such as corporate and commercial publishing, education and training, cultural heritage, global business and electronic commerce, public services and utilities, and special needs groups. Work would also develop electronic language resources (e.g. dictionaries or terminologies) in standard and re-usable formats. Research and Development priorities include adding multilinguality to systems at all stages of the information cycle, including content generation and maintenance in multiple languages, localisation of software and content, automated translation and interpretation, and computer-assisted language training; enhancing the natural interactivity and usability of systems where multimodal dialogues, understanding of messages and communicative acts, unconstrained language input-output and keyboard-less operation can greatly improve applications; enabling active assimilation and use of digital content, where work would apply language-processing models, tools and techniques for deep information analysis and metadata generation, knowledge extraction, classification and summarisation of the meaning embodied in the content, including intelligent language-based assistants.
The process of CL reaching HLT user communities is the object of the Europe-wide survey EUROMAP, which is funded under the Language Engineering sector within the Telematics programme of the EU. The survey, which started in 1996, is pulling together data on LE activities in Europe as well as actual user and market requirements. Based on this analysis, recommendations are developed on how to link LE capabilities with marketplace opportunities. A view of employment opportunities in CL/NLP and speech systems could be developed from an analysis of job openings.
On the education front, the challenges are formidable and will likely not be addressed adequately by small-scale solutions. A recent survey in The Netherlands and Flanders showed that despite the explosive growth of the language and speech industries in this region, education in this field has hardly made significant advances in recent years (Bouma & Schuurman, 1998). It is therefore not surprising that many see the need for increased public support for CL education and for coordinated action at an international level to increase the quantity and quality of CL courses and programmes.
A number of projects and other activities in education are worth mentioning and will be discussed in more detail in the remainder of this chapter. Previous work has been done by ERASMUS inter-university cooperation programme (ICP) projects aimed at promoting mobility and curriculum development. Notably, there was an ICP on Natural Language Processing, active from 1993 to 1996 (Way, 1998), and one in Logic, Language and Information, coordinated by the European Association for Logic, Language and Information, also known as FoLLI, which also organized summer schools in this area.
The present chapter was written under the auspices and with the support of ACO*HUM, the SOCRATES thematic network project for Advanced Computing in the Humanities, which started in 1996 and is expected to conclude in 2000. Within this project, which networked more than 100 European universities, a working group in Computational Linguistics and Language Engineering reflected on the challenges for CL education. The group organized several meetings and conducted a survey which will be discussed below. The working group kept close links to another SOCRATES thematic network project, that in Speech Communication Sciences (Bloothooft et al., 1997-1999; Bloothooft, 1997; 1998; 1999) with 80 partner institutions.
The main goal of this networking has been to analyse the present status of education in relevant fields in Europe, to make proposals on existing curricula and to formulate recommendations for future initiatives and implementation. Among the results of this intensive networking is a proposal for a curriculum for a European degree in Natural Language Processing which is further specified in a SOCRATES Curriculum Development Action from 1997 to 2000. Also, the networks reflected on the potential of computer assisted learning and distance learning techniques for language and speech through pilot projects co-sponsored by ELSNET. The findings, proposals and recommendations of the ACO*HUM group will be discussed in detail in the remainder of this chapter.