The core topics in CL, for most centres, are parsing algorithms (programs for analyzing sentences), formal grammars, representation of lexical knowledge, formal semantics, state-of-the-art grammar formalisms, mathematics and logic, and pragmatics techniques. These are all areas in which learning how to apply the techniques is inseparable from learning about them. The usual manner in which these techniques are learned is via the use of computational mechanisms which implement the formalisms. Thus, techniques of formal grammars are explored through the writing of grammar rules which are subsequently fed into a parser that attempts to analyze sentences with the help of the given grammar.
Learning how to assign structures to sentences of a natural language is a skill that should be expected of general linguists as well as computational linguists. It is of course a part of what CL students need to learn, and probably a major part. But computational linguists also need to understand how the algorithms for processing linguistic data actually work. Few of the tools available, including the LFG workbench, support this aspect of learning about linguistic analysis or generation. Typically, the processor operates as a black box between reading the user's input and presenting the results of analysis. It does not show the process of analysis, something that can be of particular learning value to a student.
Besides using special computer tools and understanding the algorithms, CL students often learn how to program language processing tools. The survey questionnaire did not ask whether programming was taught as part of a CL programme of study, but which computer languages were employed. Prolog has 40 adherents among the respondents, more than twice as many as its nearest rival, but since that is Java, only a few years old, the picture may change dramatically in the near future. From the total number of responses, it is clear that programming languages provide the vehicle for much of the practical learning of computational linguistics techniques. The 102 responses show that on average 1.5 different programming languages are used per CL site. The use of 'ready-made' CL tools was nearly as widespread, but with the exception of WordNet, PC-KIMMO and LFG-Workbench, no specific tool was used at more than 2 sites, although more than 50 distinct named and more unnamed packages are in use in teaching CL.
Multimedia can be used with any field of learning no matter how unstructured. The stimulus material can be any text, pictures, sound, video, animation, etc., and the role of the computer is primarily that of filing and presenting these resources. Such material takes a long time to prepare, and is usable only once (apart from repetition and reinforcement) per student. Consequently, the investment is considerable for fields with small student numbers, like CL. Nevertheless, for those fields where the data is essentially visual or aural, the effort is often worthwhile. The speech communication sciences are one field in which recorded and generated sound, as well as visual representations of analysed sounds, are the basic source materials for students to work with. The computer provides far more flexible ways of handling such materials than conventional recording devices like tape and video recorders. It is therefore not surprising that the SOCRATES thematic network project on Speech Communication Sciences has installed a group working on computer assisted learning (CAL) in this area. The group has conducted detailed studies on CAL packages available for the teaching of speech and CL/NLP (see Bowerman et al., 1999; Huckvale et al., 1997, 1998; Inventory 1999) and compiled a set of evaluation criteria which can be used as guidelines of best practice. The group has also organized a workshop on methods and tools for speech science education (see Hazan and Holland, 1999) and an education arena for Eurospeech-99, which included a job fair and a presentation of courseware on CD-ROM and other media.
Computational linguistics educators have also produced Web-mediated course materials covering a variety of aspects of their discipline. Respondents to the survey reported over 30 URLs to Web courses. In addition, several other courses and tools have been announced in De Smedt and Apollon (1998) and Rosner (1999). We found no instances of courses which simply use the Web as a storage and distribution medium for textual notes. All use at least cross references via hyperlinks and other uses of the medium's interactive potential. We distinguish the following types of courses with examples.
Among the available CL tools usable in education are several systems for the automatic analysis (parsing) or generation of sentences. Examples include the LFG workbench by Xerox (see figure 4.1) and the Grammar Laboratories for the Macintosh by Linguistic Systems. These tools not only perform sentence analysis but also present the sentence structures visually in graphical ways which linguists are used to, including hierarchical tree structures and feature-value matrices. The educational benefits are considerable compared with the more usual pedagogy of e.g. definite clause grammar, which needlessly brings a programming language into the picture, but not in a way that enables students to develop programming skills. The visualisation of the output of analysis is already a significant benefit compared with textual output.
Figure 4.1 Screen shot of the LFG Workbench showing grammar rule window, lexicon window, and two windows with different representations of the sentence structure: a tree structure (top right) and a feature structure (bottom right).
If, in addition to using a parsing tool, CL students must learn to understand how such a tool works, techniques like visual stepping or animation can be useful. For example, a visual representation of the step-by-step growing search space of a top-down parser referring to a left-recursive rule can help the student understand the limitations of that particular parsing mechanism. The PAIL laboratories developed at IDSIA included a range of different parsing algorithms which could be explored in this way. A very simple tool that helps the student understand the mechanics of deriving a phrase structure analysis is described in Black, Hill and Kassaei (1999). In this tool, the student enters a string, and then manually selects production rules one at a time to construct a derivation (see figure 4.2). There is no real parser behind this tool, just string substitution, and the whole tool works in the client browser without any need to communicate with a server. It can surely be further developed to be combined with better visual display, and also evolve into a mechanism for step-running specific parsing algorithms.
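The growth of a top-down search space under a left-recursive rule can be made concrete with a small sketch. The toy grammar and the function names below are assumptions made for illustration only and are not taken from PAIL or any other tool mentioned here. The sketch expands the leftmost nonterminal breadth-first, one level at a time, under a depth bound; a naive depth-first parser that tried the rule NP -> NP PP first would never return.

```python
# Toy grammar (hypothetical); the rule NP -> NP PP is left-recursive.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["NP", "PP"], ["det", "n"]],
    "VP": [["v"]],
    "PP": [["p", "NP"]],
}


def expand_leftmost(form):
    """Return every sentential form obtained by rewriting the leftmost nonterminal."""
    for i, symbol in enumerate(form):
        if symbol in GRAMMAR:
            return [form[:i] + tuple(rhs) + form[i + 1:] for rhs in GRAMMAR[symbol]]
    return []  # only terminals are left, nothing to expand


def search_levels(start="S", max_depth=4):
    """Breadth-first top-down expansion, returned as one list of forms per level."""
    levels = [[(start,)]]
    for _ in range(max_depth):
        next_level = []
        for form in levels[-1]:
            next_level.extend(expand_leftmost(form))
        levels.append(next_level)
    return levels


if __name__ == "__main__":
    for depth, forms in enumerate(search_levels()):
        print(depth, [" ".join(form) for form in forms])
```

At every level from the second onwards the output contains a form that begins with NP and carries one more PP than at the level before. That branch never reaches a terminal symbol, so it can never be checked against the input, which is the limitation a visual stepping tool would let the student observe.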
Figure 4.2 A parser derivation tool, showing the steps in the analysis of a sentence, each step taken under user control.
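The idea of building a derivation by plain substitution, with the user choosing every step, can be sketched as follows. This is not the code of the tool by Black, Hill and Kassaei (1999); the rule set, the function names, and the choice to apply rules as reductions (right-hand side replaced by left-hand side) are all assumptions made for the example, and the original tool may apply rules in the other direction.

```python
# Hypothetical rule set: (left-hand side, right-hand side) pairs.
RULES = [
    ("S", ["NP", "VP"]),   # 0
    ("NP", ["det", "n"]),  # 1
    ("VP", ["v"]),         # 2
    ("det", ["the"]),      # 3
    ("n", ["dog"]),        # 4
    ("v", ["barked"]),     # 5
]


def apply_rule(form, rule_index, position):
    """Replace the rule's right-hand side found at `position` by its left-hand side."""
    lhs, rhs = RULES[rule_index]
    if form[position:position + len(rhs)] != rhs:
        raise ValueError(
            f"rule {lhs} -> {' '.join(rhs)} does not match at position {position}"
        )
    return form[:position] + [lhs] + form[position + len(rhs):]


def derive(words, choices):
    """Apply the user's (rule_index, position) choices in order; return all steps."""
    steps = [list(words)]
    for rule_index, position in choices:
        steps.append(apply_rule(steps[-1], rule_index, position))
    return steps


if __name__ == "__main__":
    user_choices = [(3, 0), (4, 1), (5, 2), (1, 0), (2, 1), (0, 0)]
    for step in derive("the dog barked".split(), user_choices):
        print(" ".join(step))
```

No parsing strategy is involved: the program only checks that the chosen rule matches at the chosen position, so a wrong choice is rejected immediately and the student remains in control of the order of steps.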
Even though tools such as these are valuable pedagogical instruments for basic concepts of CL, they may not address other important goals in present-day CL education. Bouma (1999) at Groningen University points out that many CL tools are aimed at the construction of toy systems, which are too far from reality. Rather, he proposes to offer a far more realistic learning context in which students are stimulated to make systems which account for actual linguistic data. This can be achieved by giving students easy access to large-scale resources such as big corpora and full-scale electronic dictionaries. Since students nowadays have access to hardware with sufficient computer power and data can be distributed over the net, there are no insurmountable obstacles. Also, since it is possible to provide them with high-level tools, attention is not absorbed by low-level programming techniques. Instead, the students' minds are freed to deal with the actual linguistic data. Eventually, Bouma hopes, this approach will enable students to deal with problems at a realistic level of complexity and prepare them better for actual challenges in human language technologies. Some of the concepts, projects and tools used at Groningen can be seen online.
Large language engineering platforms are becoming more prevalent, such as GATE at Sheffield, England, in which complementary and alternative components can be integrated to produce modular text processing systems, and the Oregon Graduate Institute (OGI) spoken dialogue platform at Oregon, USA. With large and powerful platforms, students and developers do not need to develop basic tools for sub-tasks like parsing from scratch. Students will more and more be given a thorough grounding in how to use such platforms to develop comprehensive applications such as spoken dialogue systems incorporating CL as part of more project-based education.
To obtain many of the "ready-made" NL tools that may be usable in courses as well as for research purposes, it may be worthwhile to consult one of the lists or registries of resources. The main ones are:
Standard HTML lacks one important component, which is the ability to
present diagrams generated dynamically, except as text, as figure 4.3 shows.
Input: The New York Stock Exchange is located in the Manhattan business district.
Output: [Figure 4.3: text-only link diagram drawn with +, - and | characters (link labels J, CL, D, DP, AN, MP, S, V, EV) above the token line "///// the New York Stock Exchange is located.v in the Manhattan business.n district.n"; the column alignment of the diagram is not recoverable.]
The ability to present linguistic analysis diagrams is an important feature of linguistic courseware - not to be underestimated as Bouma (1999) points out. There are two main ways that line drawings can be presented in present-day browsers - by writing them to a graphics file and loading this through an image tag, or by running a program, e.g. an applet or a plug-in in the browser. The former is not difficult in principle, but is unattractive in practice, for several reasons: the diagram as a picture consumes a much higher bandwidth than its contents as a graph; to make the representation interactive, it would be necessary to generate an image map alongside the picture, etc.
The alternative is to run a program that can present line drawings within the browser. The Java programming language supports a range of graphical tools, the most primitive of which is the Canvas on which lines can be drawn and mouse events reacted to. This is used in the tool by Black, Hill and Kassaei (1999) shown in figure 4.4.
Figure 4.4 Web parser showing linguistic diagrams output in a Java applet's canvas.
An interesting tool by Calder (1998) that concentrates on the presentation and editing of linguistic diagrams is Thistle. In Thistle, as well as viewing diagrams, the student can create and modify diagrams by direct manipulation of the nodes and labels in them, as shown in figure 4.5. Thistle diagrams are presented using consistent fonts and styles.
Figure 4.5 Screen shot of the Thistle tree editor.
Of the demonstrable systems that CL instructors are using, the majority seem to be research-based systems that have been given an HTML-based front end, nearly always using server-side CGI scripts to conduct the processing. There are many variants on this theme as ways of providing CL processing behind an educational tool. One of the arguments against HTML for the user interface is the lack of graphics (line drawing) support. Another is the relative inefficiency of this method. Not only are CGI scripts often written in interpreted languages like Perl, but they involve initiating a new process for each interaction, and involve the transmission of whole pages of data to present results.
A popular alternative is to transfer the processing to the client (the browser). Scripting languages like JavaScript are favoured by Gibbon and Carson-Berndsen (1999) because they are relatively simple languages and the student can inspect the source code. On the other hand, applets written in Java have a much more powerful and expressive language to draw on, greater standardisation, graphics support, greater run-time efficiency due to compilation and the ability to hide the source code from the student. Like JavaScript programs, applets can run autonomously once downloaded.
However, the main reason that so many CGI-based demonstrations are available is that server-side processing (in any language you like) makes it easier to re-use existing programs rather than writing them from scratch. Applets can also access external resources, such as databases or "legacy" programs not written in Java, and so provide the user interface of a more complex and heterogeneous system. A system like this, where processing is distributed between the applet and other programs residing on the server, is referred to as a client-server system. Since the definition of an applet is more or less synonymous with that of a computer virus, i.e. a program transmitted over the net to run in the recipient's computer, browsers in which applets can be run enforce certain restrictions, like no access to the local file system, socket communication restricted to the host the applet was served from, etc. The program by Black, Hill and Kassaei (1999) depicted in figure 4.4 works within these restrictions. The applet is downloaded from a Web server, embedded within an HTML file that contains supporting tutorial and user documentation. It opens a two-way socket communication with a parser which must already be running on the same machine as the Web server. The parser is written in Lisp, and needed relatively little modification to work in co-operation with the user interface. Because an applet cannot load or save files on the local computer, an additional applet is provided in which the student can edit grammars stored on the server.
Where a client-server system runs in a local area network, i.e. an Intranet, socket communication is adequate, but for Internet operation, this is not satisfactory, since organizations running firewalls restrict communication across the firewall to the HTTP protocol. Recent developments in Java, although seemingly complex, will eventually make it easier to deploy complex client-server systems with attractive user interfaces. Java servlets can replace the function of CGI scripts more cleanly and much more efficiently. Java XML parsers will make it easier to handle linguistically encoded information that is transmitted between programs, and new user interface component libraries are designed to work with representations of analysed texts.
Given the relative popularity of Java as a programming language for student use in CL (20 out of 68 responses in the survey) at such an early stage in its development, it cannot be long before the possibilities of this framework are fully appreciated, and we see a mushrooming of Java based NLP applications including courseware.
Distance learning is also seen in conventional universities as a significant proportion of their academic activities in the 21st century. There are a variety of reasons for this, some of them more laudable than others. For example, universities see distance learning as a way to increase or defend their market share, by removing geographical and temporal boundaries.
For a specialist discipline like CL, distance learning appears to offer a solution to the numbers problem: it is rare to find more than half a dozen CL specialists in one university, and therefore rare to find the staff to cover a broad curriculum in the discipline. Hence inter-institution collaboration in the provision of distance learning modules is a possible way to make study units available where the local expertise is lacking. The course on statistical NLP by Joakim Nivre (part of the ELSNET LE Training Showcase, see below) is a pertinent example. Looking at the programmes of recent conferences, it is clear that the total of only four courses on statistical methods listed in answer to the questionnaire reflects a mismatch between what is currently offered to students and the CL research agenda. Nivre's course may well answer a widespread need that is going unmet not because it is unrecognised but because of the startup costs of initiating new modules outside the teacher's core specialism.
Computer-mediated learning in general, like more traditional correspondence courses, removes the temporal and spatial requirements on study, by making learning material available to students irrespective of time and place. Nevertheless, most distance learning institutions and experiments find it helpful to place temporal if not spatial restrictions on the students. If students are to be accredited by distance learning institutions, there is more need than in conventional universities for students to be examined in invigilated time-constrained examination conditions, because the opportunities for plagiarism and cheating are obvious.
There is also a social dimension to learning, in which solidarity with fellow students is an important motivating and sustaining factor. Distance learning institutions often try to encourage this by both formal tutorials and informal self-help arrangements. One particularly interesting experiment in distance learning in our discipline was conducted by Dekker (1998), who arranged a virtual classroom using an email list server at set times each week for a group of students of logic around the globe. These experiences suggest that computer-mediated learning in general needs to be seen as a socio-technical system, for use in a particular institutional context.
The following projects have been executed:
The pilot on Statistical Natural Language Processing is meant to provide the basic material for a distance learning course, although some local supervision or tutoring will normally be required. The content includes basic statistics, applied statistics (meaning Markov models and information theory), and NLP, covering language modelling, tagging and parsing, disambiguation, translation and alignment. This content is partly based on Brigitte Krenn and Christer Samuelsson's The Linguist's Guide to Statistics, which in fact is the main text presented. In addition to this and other texts which are available online, the course presents a set of exercises with solutions for each topic on the course, a set of projects with all tools and data provided, slides for each topic, and pointers to the literature. Finally, there are hyperlinks to practical tools and resources on the Web. Because the course adopts a downloadable and printable book for most of the expository material, it does not fully exploit the Web medium; the format of the presentation is kept simple. In the spring of 1998, the course was given as a distance learning course sponsored by Computational Linguistics in Flanders (CLIF), with Dr Walter Daelemans as coordinator and Dr Joakim Nivre as main lecturer. It must be noted that this course meets an urgent need for teaching materials in a new subfield of CL.
The pilot on NLP courseware for Web operation had different objectives. First, the project wanted to explore ways in which linguistic structures and processes can be visualized in a browser. Second, the project attempted to tie these visualizations to pedagogical goals and used them in an introductory Web course on CL. A first result consists of a client-server parser interface by Black, Hill and Kassaei (1999), depicted in figure 4.4. In this tool, a Java client sends a sentence for analysis to a parser on a remote server and subsequently receives a specification of a structure, on the basis of which it draws trees in the user's Web browser. This is not so much intended as a tool to be used by itself, but as a component to be embedded in a course. The client can also be used to display trees from sources other than a parser. A second result consists of the parser derivation tool, shown in figure 4.2, that lets one experiment with step-by-step application of rules. This tool, too, is not intended as a standalone module, but as a teaching aid to be used in an appropriate course. Finally, demonstrations of these tools are accessible from an introductory Web course on CL.
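The client side of such an arrangement, which turns a received structure specification into a drawing, can be sketched as follows. The text does not describe the format actually exchanged between the parser and the Java client, so the labelled-bracket notation used here is an assumption, as are the function names; the sketch renders the tree as indented text rather than drawing on a graphical canvas, and does not guard against malformed input.

```python
def parse_brackets(spec):
    """Parse a labelled bracketing such as "(S (NP the dog) (VP barked))".

    Returns nested (label, children) tuples; words are plain strings.
    """
    tokens = spec.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf, i.e. a word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # skip the closing bracket
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected material after the end of the tree")
    return tree


def render(tree, indent=0):
    """Return the tree as a list of text lines, one node per line, indented by depth."""
    pad = "  " * indent
    if isinstance(tree, str):
        return [pad + tree]
    label, children = tree
    lines = [pad + label]
    for child in children:
        lines.extend(render(child, indent + 1))
    return lines


if __name__ == "__main__":
    received = "(S (NP (det the) (n dog)) (VP (v barked)))"
    print("\n".join(render(parse_brackets(received))))
```

Because the display code depends only on the structure specification, the same client can show trees from any source that produces the notation, which matches the observation that the client is not tied to a parser.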
The pilot on Information Retrieval (mono- and multilingual) with Natural Language Processing techniques is meant as a distance learning course, usable standalone for self study or as an adjunct to conventional courses. The course covers linguistic techniques in IR focusing on morphology, tagging, and multilinguality. The course uses the Web medium well by allowing the student to actively use several computing tools via the Web. In fact, the main pedagogical effort in the course is to provide or assemble on-line Internet resources that permit not just reading, but also practical experimentation with the issues considered in the course. Such facilities include software at the site offering the course (stemmers, morphological analyzers, part of speech taggers for different languages, multilingual lexical databases and cross-language mapping of queries) as well as external resources such as search engines, machine translation systems, etc. Experimentation is complemented with short introductions for every main topic, hyperlinks to reading materials available on the Web, and self tests.
Considerable effort has gone into all three pilots, clearly on a scale well above the 5 K Euro contribution per course from ELSNET. The results of the projects are freely available for non-commercial use.
Computer-directed programmed learning allows for individually paced student work through exercises and differentiation of stimulus and response material according to learning success. However, programmed learning materials are not dependent on recent developments in computer based multimedia, as this form of guided interaction can be built into programmed texts, adventure books, and even self-assessment tax returns.
While CAL has enjoyed widespread use in adjacent fields such as language teaching and learning, it was relatively less popular in Computational Linguistics prior to the advent of the Web. The latter has stimulated much recent courseware development, such as that described above as Web courseware. As in any field, there is much value in supporting the "push" of course materials with exercises and tests, which once developed can be repeatedly administered to each new cohort. However, the greater the degree of formalism that is achievable in a field, the easier it is to generate some of the course material from a model of the subject expertise built into the software. That is, the subject knowledge should not be simply encoded as multimedia, but should be represented in the system on the knowledge level (Newell, 1982; Clancey, 1988). In scientific fields, this enables courseware to be developed as a simulation or a model of the domain in question. A theory of the domain is built into the program, and the student can explore the consequences of the theory to provide a more direct learning experience than that mediated by mere description, no matter how well illustrated.
In computational linguistics certain aspects of the discipline are extremely amenable to this kind of treatment, since our subject matter extends not just to formalised theories but to meta-theories of grammar. The learning goals we set for students often include the development of theories of grammar for particular fragments of natural language. This has the benefit that once the machinery is in place, it is up to the student to provide the descriptive content and the teacher does not need to provide reams of text and pictures to generate content. In the related discipline of logic, one of the more successful computer-based learning packages is Barwise and Etchemendy's Tarski's World, in which the student can develop descriptions and theories of a simple 3-D blocks world, which the student can view. The program helps teach not just the syntax of first order logic but also semantics, by determining whether the student's theory is consistent with the displayed world.
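The kind of check described for the blocks world, testing whether a student's sentences hold in the displayed world, can be sketched in a few lines. This is not the implementation of Tarski's World: the world, the predicate names, and the encoding of sentences as Python functions are all assumptions made for illustration.

```python
# A hypothetical blocks world: each block has a name, a shape and a size.
WORLD = [
    {"name": "a", "shape": "cube", "size": "large"},
    {"name": "b", "shape": "tet", "size": "small"},
    {"name": "c", "shape": "cube", "size": "small"},
]


def cube(x):
    return x["shape"] == "cube"


def tet(x):
    return x["shape"] == "tet"


def small(x):
    return x["size"] == "small"


def forall(condition, world=WORLD):
    """Universal quantifier: the condition holds of every block in the world."""
    return all(condition(x) for x in world)


def exists(condition, world=WORLD):
    """Existential quantifier: the condition holds of at least one block."""
    return any(condition(x) for x in world)


# The student's "theory": named sentences paired with their encodings.
THEORY = {
    "every tetrahedron is small": lambda: forall(lambda x: (not tet(x)) or small(x)),
    "every cube is small": lambda: forall(lambda x: (not cube(x)) or small(x)),
    "some cube is small": lambda: exists(lambda x: cube(x) and small(x)),
}

if __name__ == "__main__":
    for sentence, evaluate in THEORY.items():
        print(sentence, "->", evaluate())
```

The second sentence fails in this world because block a is a large cube; feedback of this kind is what ties the syntax of a sentence to its meaning for the student.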
Programs that parse natural language strings according to a linguistic theory - a grammar and lexicon - written by the student also provide for the same kind of learning experience and immediate reinforcement in the domain of syntax. The Linguistic Instruments tools for the Macintosh by Linguistic Instruments, the LFG Workbench by Xerox, the programs by Bouma (1999) and by Black, Hill and Kassaei (1999) referred to above can all be deployed to support the learning of linguistic formalisms in this way. Apart from the LFG workbench, all these recently mentioned systems are deliberately designed to be simple in use although they can support moderately realistic linguistic descriptions using unification based formalisms. Their authors have been particularly keen to support the needs of beginning students, and to dissociate the understanding of linguistic description from computer programming languages, so that the student can develop a cleaner mental model more readily.
We expect the current trend towards using the Web and information
processing techniques to strengthen in all kinds of education, including
CL. As the demands for quality and pedagogical relevance increase,
it will be recognized that many fields need special tools. But it
is also important to give attention to the context in which CAL tools for
CL are used. The ability to work in groups is important for CL education,
especially since CL increasingly finds its home alongside other communities,
such as speech, where much interdisciplinary interaction and expertise is involved.
Courses should stress the theoretical as well as the practical, project
work, and the ability to work in groups.