The traditional humanities scholar works alone. His or her research proceeds faster during the vacations, when there is no teaching. Projects can take as long as 20 years, and funds are normally needed only for travel to visit libraries and museums. There is not much use for equipment, since much of the work relies on intuition rather than scientific experimentation. It can be argued that this method of working has not changed much for centuries, but the introduction of information technology to scholarship in the humanities is beginning to change these modes of working, though not without problems.
A similar point can also be made about electronic mail. It facilitates the research process in that it enables rapid correspondence, the circulation of information to groups of people, and collaborative writing. Bulletin boards and electronic discussion groups have flourished recently, and it is not difficult to spend all day reading them. The bulletin board approach, where the user has to log on to look for information, is better for information which will not change for some time, but not so good for quick questions and answers. It requires the information to be well organized and frequently updated, as well as an easy way of printing or mailing information to recipients. The discussion-group approach, such as BITNET `lists', or digests such as HUMANIST, ANSAX-L, IOUDAIOS, PHILOS-L etc, which use electronic mail to distribute to all members of the group, is better for quick questions, the rapid dissemination of information, and the discussion of items.
Wordprocessing and electronic mail, the information technology applications shared by all disciplines, facilitate activities which are ancillary to scholarship. Here I want to concentrate on the use of information technology as a tool in the humanities, both for research and, to a lesser extent, in teaching. This imposes an intellectual rigour which can be alien to the humanities scholar's way of working. However, it forces the scholar to think out more clearly the objectives of the research and how to achieve them. It also forces the scholar to make decisions which, in more traditional ways of working, would be put off until much nearer the end of the project. If they are the right decisions, the project will proceed reasonably well. If they are the wrong ones, it can take a long time to put them right.
Concordances and text retrieval applications can be used to advantage for stylistic analyses, critical editions, and lexical studies, but they look at text only at the level of the graphic word and satisfy the needs of researchers in only a very simple way. There is a need for more sophisticated software to incorporate lemmatization (putting words under their dictionary headings), parsing, and semantic analysis in order to provide the kind of text retrieval which users need.
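What retrieval at the level of the graphic word amounts to can be sketched in a few lines of code. The function below is a toy keyword-in-context routine of my own devising, not any particular concordance package:

```python
import re

def kwic(text, keyword, width=30):
    """Build a simple keyword-in-context concordance.

    Matching is at the level of the graphic word only: 'love',
    'loves' and 'loved' are treated as unrelated forms, which is
    exactly the limitation that lemmatization is meant to address.
    """
    lines = []
    for m in re.finditer(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
        start, end = m.start(), m.end()
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        lines.append(f"{left}[{m.group()}]{right}")
    return lines

sample = "It was the best of times, it was the worst of times."
for line in kwic(sample, "times"):
    print(line)
```

Because the matching is purely graphic, a search for `time' would miss `times' entirely; that gap is what the lemmatization, parsing, and semantic analysis mentioned above are intended to fill.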
Current research in computational linguistics is concentrating on these problems, and systems are being developed which use a morphological analyser, a parser, and a machine-readable dictionary as tools in the retrieval process. These dictionaries or lexical databases are often derived from a printed dictionary, which is then restructured to reflect the semantic relationships between words: hyponyms, synonyms, etc. Printed dictionaries tend to concentrate on unusual usages of words, so the lexical database is often augmented with information about the more common usages derived from language corpora. Most work in this area is concentrating on the requirements of the so-called `language industries' applications for modern prose text, e.g. language understanding systems, intelligent wordprocessors, and language teaching systems. Literary and other scholarly texts are more difficult to handle because of historical language, unusual usage, variant spellings, metaphors, and possibly deliberate ambiguity, but experiments are already being conducted to tackle them in this way.
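The effect of adding even a crude lexical database to the retrieval process can be shown with a deliberately tiny sketch. The form-to-headword table here is invented for illustration; a real system would derive it from a machine-readable dictionary and corpora:

```python
# A toy lexical lookup: map inflected forms to dictionary headwords.
# These mappings are illustrative only, not drawn from any real
# machine-readable dictionary.
LEMMAS = {
    "loves": "love", "loved": "love", "loving": "love",
    "went": "go", "goes": "go", "gone": "go",
}

def lemmatized_search(words, headword):
    """Return every token whose headword matches, not just exact forms."""
    return [w for w in words if LEMMAS.get(w.lower(), w.lower()) == headword]

tokens = "She loved him and he loves her still".split()
print(lemmatized_search(tokens, "love"))  # finds both 'loved' and 'loves'
```

A concordance built on such a lookup would gather all the inflected forms of a word under one dictionary heading, which is what the scholar usually wants.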
The introduction of structured databases based on the network, hierarchic, or relational models, following the increased usage of disk-based storage, permitted more flexible modelling of the data, but required the scholar to spend considerable time working out an entity model for the data, often using terms and concepts with which he or she was not familiar. Most structured database software, e.g. DBASE, INFORMIX, INGRES, ORACLE, is now based on the relational model, in which it is not particularly easy to model humanities information. Time must be spent on designing the database so that the links between the tables are built up efficiently and effectively. It is really necessary to know the relationships between the elements in the data in order to design the database effectively, yet most humanities data is put into a computer precisely in order to determine these relationships. Furthermore, most standard database software is not designed for handling large amounts of variable-length material, for the alphabetical sorting of texts in a variety of languages, for dealing with non-standard dates, for currency which is not decimal in form, or for different spellings of the same name. All these types of material are found in the humanities and need to be catered for.
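As an illustration of the design effort involved, here is a sketch of one relational treatment of the variant-spellings problem, written against SQLite from Python. The tables, names, and spellings are invented for this example and do not come from any particular project:

```python
import sqlite3

# One relational treatment of variant name spellings: a canonical
# entry plus a separate table of recorded variants. The schema and
# data here are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        canonical_name TEXT NOT NULL
    );
    CREATE TABLE name_variant (
        variant TEXT NOT NULL,
        person_id INTEGER NOT NULL REFERENCES person(person_id)
    );
""")
con.execute("INSERT INTO person VALUES (1, 'Shakespeare, William')")
con.executemany("INSERT INTO name_variant VALUES (?, 1)",
                [("Shakespere",), ("Shaksper",), ("Shakespeare",)])

# A search by any recorded spelling resolves to the canonical entry.
row = con.execute("""
    SELECT p.canonical_name FROM person p
    JOIN name_variant v ON v.person_id = p.person_id
    WHERE v.variant = ?
""", ("Shaksper",)).fetchone()
print(row[0])  # Shakespeare, William
```

Even this tiny case requires an extra table and an explicit join; deciding on such structures for a whole body of source material, before the relationships are fully understood, is precisely the difficulty described above.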
It is also true that, particularly in a smaller project such as that done for a PhD, the design of a structured database can take up too much time, leaving not enough time for a thorough analysis and interpretation of the results. Designing the database forces the researcher to look more closely at the source material and is really the essence of the research. However, it is not a publication or a thesis, and time spent on it earns no academic credit.
Packaged electronic publications now on the market consist either of the data only, or of the data with software. Where only the data is provided, e.g. the Thesaurus Linguae Graecae, the user has to write his or her own programs. This gives complete flexibility, but is of course harder for the beginner. When the data is packaged with software, as is more common, it is easier for the beginner to use, but it presupposes all the questions which the user might want to ask and can thus soon become inflexible. I would argue that many of the packaged products on the market do not really satisfy the academic needs of scholars in the humanities, and that there is a real need for considerably more scholarly input to the design of many packaged products.
However, in many ways more care needs to be taken in designing a hypertext system, simply because it is so flexible. Hypertext is often used synonymously with Hypercard on the Macintosh computer, which is the most widely used hypertext program simply because it comes with the Macintosh and runs on a basic configuration. I have read or heard many papers on the use of Hypercard in the humanities, but in only a few of them have I heard any detailed, realistic assessment of its capabilities. It seems to me that it is good for showing things, but not for manipulating them. Many types of work in the humanities need manipulation, such as searching or sorting. Hypercard is limited in the data models which it can handle easily and can thus give a false impression of the capabilities of hypertext for modelling data. Nor does it easily record the user's navigation path, with the result that the user `gets lost in hyperspace'. However, as the Perseus project based at Harvard shows, Hypercard can be used with great success when a well-designed system is built on top of it.
At the other end of the spectrum, Intermedia provides a vast range of hypertext facilities - and consequently needs a very powerful machine to run. In Britain a number of humanities hypertexts have been set up using Guide, which was originally developed at the University of Kent. Guide runs on PCs and Macintoshes and seems to provide an effective intermediate solution. Southampton University has been conducting a very interesting experiment in the use of multimedia in history using software developed in the computer science department. Their most recent presentation on the Mountbatten papers includes sound as well as photographs and, of course, text.
At Oxford we are planning to conduct a detailed evaluation of the use of hypertext and multimedia for research and teaching in the humanities, concentrating on the users' perspective and looking at a range of practical and intellectual issues.
At present we are looking at two areas in literary research as examples of hypertext. In literary criticism, traditional scholars have become increasingly critical of quantitative studies based on concordances. They argue that there is little empirical basis for these studies and that they can often lead to attempts to solve `non-problems', or at least cannot be related to traditional criticism. A project at Oxford is modelling the Victorian multiplot novel as a hypertext, using Dickens' Little Dorrit as an example, the theory being that the narrative procedures of the Victorian novelist are very similar to the structures of a hypertext. Studying the novel as a hypertext advances the theoretical understanding of how multi-sequential narrative works and helps to unravel the proliferating strands of the novel.
A second area which we are looking at is the hypertext electronic critical edition. Traditional critical editions provide only one version of the text, with footnotes and commentary. An electronic edition in the form of a hypertext can provide multiple versions of the text, image representations of manuscripts, commentary, etc., plus retrieval and browsing tools to aid the study of the text.
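The data model underlying both projects is simple to state, even if designing a good system around it is not: nodes of text or image joined by typed links. A minimal sketch, with node names and sample content invented for illustration:

```python
from collections import defaultdict

# A minimal hypertext data model: named nodes of text joined by
# typed links. Node names and contents are invented examples.
nodes = {
    "ms_A": "Text of the passage as it stands in manuscript A ...",
    "ms_B": "The same passage with the variant reading of B ...",
    "note_1": "Editorial commentary on the variant ...",
}
links = defaultdict(list)

def link(src, dst, kind):
    """Record a typed, directed link between two nodes."""
    links[src].append((dst, kind))

link("ms_A", "ms_B", "variant")
link("ms_A", "note_1", "commentary")

# Browsing: from any node the reader can follow typed links outward.
for dst, kind in links["ms_A"]:
    print(f"ms_A --{kind}--> {dst}")
```

The design questions raised earlier all concern what such a sketch leaves out: how links are authored and displayed, and how the reader's path through them is recorded.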
These projects are experimental at present. It will be interesting to see how far they can go towards achieving their objectives in a way which satisfies the academic requirements of the discipline.
Collaboration can be at more than one level: between several people in one organization, or between several organizations. The more people there are involved in a project, the greater the need for solid ground rules on which it will operate.
One major area of collaboration in the humanities is the construction of very large databases. Here standardization of the data is very important: for use in the current project; so that the data is re-usable for future requirements, some of which may not yet be determined; and for merging with data produced elsewhere as part of another project. The definition of data standards can also benefit the smaller individual project, as time will not have to be spent on this.
The project began with a planning conference in November 1987, at which all present agreed that the existing situation, with many different encoding schemes in use, was chaotic, and that there was a need for one multi-purpose and extensible scheme which could be applied to all texts. The TEI now has groups of people working in different areas to look at the characteristics of texts in those areas. For example, the text representation group is looking at the physical characteristics of text, character sets (the TEI provides a mechanism for user-defined sets), the logical structure of texts, critical apparatus, hypertext, and language corpora. The text analysis and interpretation group is looking at morphology, syntax, phonology, and other forms of linguistic analysis (with a mechanism for inserting different analyses which is not dependent on any linguistic theory), as well as literary analysis and the interpretation of historical documents. A further group on text documentation is working on methods for documenting machine-readable data files, to include not only the bibliographic information for the source material but also what is encoded in the file.
The Text Encoding Initiative is using the Standard Generalized Markup Language (SGML), itself an international standard, which provides a syntactic framework for markup codes (tags) within a machine-readable text. SGML is based on the principle of `descriptive' rather than `prescriptive' markup, so that a text is encoded in such a way that it can be reused for many purposes. SGML encodes the logical structure of the text, and each application program uses the tags for a specific purpose.
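The principle of descriptive markup can be illustrated with a small sketch. The tags below are simplified inventions, not the actual TEI tag set, and the extraction function is a toy; the point is that the same descriptively tagged text serves two quite different programs:

```python
import re

# A fragment with simplified descriptive tags (invented for
# illustration; the real TEI tag set is far richer). The markup
# records what each span of text *is*, not how it should look.
text = ("<poem><title>Ozymandias</title>"
        "<line>I met a traveller from an antique land</line>"
        "<line>Who said: Two vast and trunkless legs of stone</line></poem>")

def contents(tag, doc):
    """Pull out the content of every element with the given tag."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", doc)

# Application 1: a retrieval program indexes the lines of verse.
lines = contents("line", text)
print(len(lines), "lines indexed")

# Application 2: a formatting program styles the title its own way.
print(contents("title", text)[0].upper())
```

A prescriptive scheme, by contrast, would have encoded only one of these uses, e.g. `print this word in capitals', and the other application would have to be retro-fitted.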
The TEI has now completed three years of its projected four-year development work. The first draft version of the TEI guidelines was made available in summer 1990, and the final version will be available in summer 1992. There is still some way to go in the provision of adequate software for handling SGML-encoded text on the machines which humanities scholars normally use (PCs and Macintoshes), but the acceptance of the TEI's guidelines by the academic community will be largely due to the fact that SGML comes nearer than any other encoding scheme to solving the intellectual problems posed by complex scholarly texts. Since it provides a means of describing the text, SGML also allows the scholar to defer until later the definition of relationships between elements in the data, and also the selection of any particular application program. An SGML-encoded text can be converted to a hypertext, a database, or any other format, but it is much less easy to move from another format to SGML.
There is also a marked reluctance on the part of scholars to ask for copyright permission to enter a text, either because they believe they will not get permission, or because they do not want to get involved in legal issues, or because they have heard from another source that permission may not be granted. The result of this is often to enter an edition which is out of copyright and which may not have scholarly authority, thus giving rise to inferior work. I believe that text analysis computing has now reached the stage where it must establish proper guidelines on how to handle electronic texts including cataloguing and copyright issues.
Susan Hockey has been active in humanities computing for 22 years, 16 of which have been at Oxford University with responsibility for the Oxford Concordance Program (OCP), typesetting and general humanities computing facilities. Her most recent position at Oxford was Project Head of the Office for Humanities Communication and Director of the Computers in Teaching Initiative (CTI) Centre for Textual Studies. She was elected to a Fellowship at St Cross College, Oxford in 1979. She is the author of two books and over 25 articles on humanities computing. She became Chairman of the Association for Literary and Linguistic Computing (ALLC) in 1984 and is currently also Chair of the Steering Committee of the Text Encoding Initiative. In summer 1991 she was appointed Director of the new US National Center for Machine-Readable Texts in the Humanities, which is sponsored by Rutgers and Princeton Universities, and takes up her position there on 7 October 1991.
This article is a reprint of a presentation given at the NORDINFO Symposium on Cultural Heritage and Humanities Research in the Light of New Technology, Copenhagen, 7-9 June 1991.