INFORMATION TECHNOLOGY AND THE HUMANITIES: THE USER'S PERSPECTIVE.



Susan Hockey

 

1. THE NATURE OF HUMANITIES RESEARCH

Humanities research is mostly concerned with interpreting source material and with challenging those interpretations. The source material may be primary sources, e.g. literature (prose, poetry or drama), works of art such as paintings, sculptures, museum artefacts or musical works, and historical documents such as charters, correspondence, political papers, inscriptions, coins and papyri. It may be secondary sources such as published interpretations and analyses, or bibliographies and reference tools. Output from the research process takes the form of articles and monographs, which themselves form input to further research or teaching.

The traditional humanities scholar works alone. His or her research proceeds faster in the vacation, when there is no teaching. Projects can take as long as 20 years, and funds are normally needed only for travel to visit libraries and museums. There is not much use for equipment, since much of the work relies on intuition rather than scientific experimentation. It can be argued that this method of working has not changed much for centuries, but the introduction of information technology into humanities scholarship is beginning to change these modes of working, though not without problems.

 

2. THE INTRODUCTION OF INFORMATION TECHNOLOGY

The earliest computer applications in the humanities were built on software for databases and for text retrieval, which were used for various types of analysis of source material. These applications date back to the late 1950's and, although there were substantial improvements in working conditions, the basic applications remained the same for two decades. It was not until the arrival of wordprocessing in the early 1980's that computing became much more widespread. Wordprocessing is a good way of getting people interested in computing, but it is not specific to the humanities and it is not appropriate to discuss it much further here.

A similar point can also be made about electronic mail. It facilitates the research process in that it enables rapid correspondence, the circulation of information to groups of people, and collaborative writing. Bulletin boards and electronic discussion groups have flourished recently and it is not difficult to spend all day reading them. The bulletin board approach, where the user has to log on to look for information, is better for information which will not change for some time, but not so good for quick questions and answers. It requires the information to be well organized, an easy way of printing or mailing information to recipients, and frequent updating. The discussion group approach, such as BITNET `lists' or digests such as HUMANIST, ANSAX-L, IOUDAIOS, PHILOS-L etc., which use electronic mail to distribute to all members of the group, is better for quick questions, the rapid dissemination of information, and the discussion of items.

Wordprocessing and electronic mail, the information technology applications shared by all disciplines, facilitate things which are ancillary to scholarship. Here I want to concentrate on the use of information technology as a tool in the humanities, both for research and, to a lesser extent, in teaching. This imposes an intellectual rigour which can be alien to the humanities scholar's way of working. However, it forces the scholar to think out more clearly the objectives of the research and how to achieve them. It also forces the scholar to make decisions which in more traditional ways of working would be put off until much nearer the end of the project. If they are the right decisions, the project will proceed reasonably well. If they are the wrong ones, it can take a long time to put them right.

 

3. TEXT-BASED APPLICATIONS

In the area of text-based subjects, traditional applications were based on concordances and text retrieval. In this case the basic source material is the text itself and fewer decisions are required on what to put into the computer. Decisions may need to be made on choosing an edition, sampling the text and encoding the text, but the basic material is fairly clear at the start. The method of storing the text initially is also fairly clear. It must be input as a sequential text file.

Concordances and text retrieval applications can be used to advantage for stylistic analyses, critical editions, and lexical studies, but they look at text only at the level of the graphic word and satisfy the needs of researchers in only a very simple way. There is a need for more sophisticated software to incorporate lemmatization (putting words under their dictionary headings), parsing, and semantic analysis in order to provide the kind of text retrieval which users need.
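
To make concrete what analysis `at the level of the graphic word' means, the following minimal sketch produces a keyword-in-context concordance. It is written in Python, a modern scripting language used here purely for illustration, and the sample sentence and keyword are invented. Note that a search for `love' misses `loved' and `loves', which is exactly the limitation that lemmatization is intended to remove.

    # Minimal keyword-in-context (KWIC) concordance sketch.
    # Sample text and keyword are invented for illustration.

    def kwic(text, keyword, width=30):
        """Print each occurrence of the keyword with its surrounding context."""
        words = text.split()
        for i, word in enumerate(words):
            # Matching is at the level of the graphic word only: 'loved' and
            # 'loves' are different graphic words and are therefore not found.
            if word.lower().strip('.,;:!?') == keyword.lower():
                left = ' '.join(words[max(0, i - 5):i])
                right = ' '.join(words[i + 1:i + 6])
                print(f'{left:>{width}} | {word} | {right}')

    sample = ("She loved the book. He loves books too. "
              "Love of books is the love of learning.")
    kwic(sample, 'love')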

Current research in computational linguistics is concentrating on these problems, and systems are being developed which use a morphological analyser, a parser and a machine-readable dictionary as tools in the retrieval process. These dictionaries or lexical databases are often derived from a printed dictionary which is then restructured to reflect the semantic relationships between words: hyponyms, synonyms etc. Printed dictionaries tend to concentrate on unusual usages of words, so the lexical database is often augmented with information about the more common usages derived from language corpora. Most work in this area is concentrating on the requirements of the so-called `language industries' applications of modern prose text, e.g. language understanding systems, intelligent wordprocessors, language teaching systems. Literary and other scholarly texts are more difficult to handle because of historical language, unusual usage, variant spellings, metaphors and possibly deliberate ambiguity, but experiments are already being conducted to tackle them in this way.
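
As a sketch of how such lexical resources change retrieval, the fragment below substitutes a tiny hand-made lexicon for a real morphological analyser or machine-readable dictionary (again illustrative Python; the word list is invented). Searching by dictionary heading now finds `loved' and `loves' under the lemma `love'.

    # Retrieval by lemma rather than by graphic word.  The small LEXICON
    # dictionary stands in for a real morphological analyser or
    # machine-readable dictionary.

    LEXICON = {                      # word form -> dictionary heading (lemma)
        'love': 'love', 'loves': 'love', 'loved': 'love', 'loving': 'love',
        'book': 'book', 'books': 'book',
    }

    def lemma_search(text, lemma):
        """Return the position and form of every word whose lemma matches."""
        hits = []
        for position, form in enumerate(text.lower().split()):
            form = form.strip('.,;:!?')
            if LEXICON.get(form) == lemma:
                hits.append((position, form))
        return hits

    sample = "She loved the book. He loves books too."
    print(lemma_search(sample, 'love'))   # [(1, 'loved'), (5, 'loves')]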

 

4. DATABASE APPLICATIONS

Traditional applications in history, archaeology, art history and related subjects have been in databases and statistical analyses. In many of these applications it is less clear what the source material is and how to identify and organize it within the computer. The development of computing in these subjects was hampered by the inadequacy of early database software, which began with the `flat file' model of a single table or matrix, a restriction imposed by tape-only sequential storage. Many historians and archaeologists felt that the computer was not for them because they could not model their data effectively within a single-table structure. Handling missing information was also difficult, as was the large amount of variable-length data which had to be fitted into data models allowing only fixed-length fields.

The introduction of structured databases based on the network, hierarchical, or relational models, following the increased use of disk-based storage, permitted more flexible modelling of the data, but required the scholar to spend considerable time working out an entity model for the data, often using terms and concepts with which he or she was not familiar. Most structured database software, e.g. DBASE, INFORMIX, INGRES, ORACLE, is now based on the relational model, in which it is not particularly easy to model humanities information. Time must be spent on designing the database so that the links between the tables are built up efficiently and effectively. It is really necessary to know the relationships between the elements in the data in order to design the database effectively, yet most humanities data is put into a computer precisely in order to determine these relationships. Furthermore, most standard database software is not designed for handling large amounts of variable-length material, for alphabetical sorting of texts in a variety of languages, for dealing with non-standard dates, for currency which is not decimal in form, or for different spellings of the same name. All these types of material are found in the humanities and need to be catered for.
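
The kind of entity model which has to be worked out in advance can be illustrated with a small sketch. The example below uses the sqlite3 module built into Python (chosen purely for illustration); the tables, column names and sample record are invented, and the free-text date column shows the sort of compromise needed for non-standard dates which relational software does not handle natively.

    # Sketch of a relational design for historical documents: persons and
    # documents in separate tables, linked by a key decided at design time.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE person   (id INTEGER PRIMARY KEY,
                               name TEXT);            -- spellings may vary
        CREATE TABLE document (id INTEGER PRIMARY KEY,
                               author_id INTEGER REFERENCES person(id),
                               date_text TEXT,        -- 'c. 1641', 'before 1300'
                               body TEXT);            -- variable-length transcription
    """)
    conn.execute("INSERT INTO person VALUES (1, 'Thos. Smythe')")
    conn.execute("INSERT INTO document VALUES (1, 1, 'c. 1641', 'My deare freind ...')")

    # The join only works because the author/document relationship was
    # decided when the database was designed.
    for row in conn.execute("""SELECT person.name, document.date_text
                               FROM document JOIN person
                               ON document.author_id = person.id"""):
        print(row)                     # ('Thos. Smythe', 'c. 1641')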

It is also true that, particularly in a smaller project such as that done for a PhD, the design of a structured database can take up too much time, leaving not enough time for a thorough analysis and interpretation of the results. Designing the database forces the researcher to look more closely at the source material and is really the essence of the research. However, it is not a publication or a thesis, and time spent on it does not count for any credit.

 

5. PACKAGED PRODUCTS

Most humanities scholars begin applying computers to their research by using standard packaged software. This provides results very quickly once the data has been loaded, but I would question whether these are always answers to the right questions. Often the data has been made to fit the software, rather than the software being chosen to fit the data. Browsing a text data file can often give the scholar ideas for more specific questions, but he or she may then find that it is not possible to formulate the exact question within the constraints of the software. Those programs which do allow a natural language interface to the retrieval process usually spend more time analysing the request than doing the search.

Packaged electronic publications which are now on the market consist either of the data only, or of the data with software. In the case where only the data is provided, e.g. the Thesaurus Linguae Graecae, the user has to write his or her own programs. This gives complete flexibility, but is of course harder for the beginner. When data is packaged with software, as is more usual, it is easier for the beginner to use, but it presupposes all the questions which the user might want to ask and can thus soon become inflexible. I would argue that many of the packaged products on the market do not really satisfy the academic needs of scholars in the humanities and that there is a real need for considerably more scholarly input to the design of many packaged products.

 

6. HYPERTEXT

We have seen that traditional computing techniques such as text retrieval and databases can pose problems in modelling and analysing humanities data in the way which scholars want. What about the newer techniques, and in particular hypertext and multimedia, about which so much has been said and written recently? The linking of isolated but related data into an associative web of information could have considerable potential for modelling humanities data, but is it just `hype'? Its popularity in the humanities is possibly due to the fact that it seems to allow the storage and manipulation of data without the absolute rigour imposed by database packages. It enables both primary and secondary material to be presented together, accompanied by images and sound, and is thus seen to provide a better way of modelling humanities information.

However, in many ways more care needs to be taken in designing a hypertext system, simply because it is so flexible. Hypertext is often treated as synonymous with Hypercard on the Macintosh computer, which is the most widely used hypertext program simply because it comes with the Macintosh and runs on a basic configuration. I have read or heard many papers on the topic of using Hypercard in the humanities, but in only a few of them have I heard any detailed, realistic assessment of its capabilities. It seems to me that it is good for showing things, but not for manipulating them. Many types of work in the humanities need manipulation, such as searching or sorting. It is limited in the data models which it can handle easily and can thus give a false impression of the capabilities of hypertext for modelling data. Nor does it easily record the user's navigation path, with the result that the user `gets lost in hyperspace'. However, as the Perseus project based at Harvard shows, Hypercard can be used with great success when a well designed system is built on top of it.

At the other end of the spectrum, Intermedia provides a vast range of hypertext facilities and consequently needs a very powerful machine to run. In Britain a number of humanities hypertexts have been set up using Guide, which was originally developed at the University of Kent. Guide runs on PCs and Macintoshes and seems to provide an effective intermediate solution. Southampton University has been conducting a very interesting experiment in the use of multimedia in history, using software developed in the computer science department. Their most recent presentation, on the Mountbatten papers, includes sound as well as photographs and, of course, text.

At Oxford we are planning to conduct a detailed evaluation of the use of hypertext and multimedia for research and teaching in the humanities, concentrating on the users' perspective. We will look at issues such as:

  1. What happens when the user (researcher or student) reaches the end of the links, that is, when he or she has navigated all the material which is provided and then wants to find out more? How readily do they then move to more traditional methods of enquiry, i.e. printed books and journals?
  2. A hypertext system often contains one person's interpretation of a literary text, or other data. Other interpretations may be different. Can a hypertext interpretation be easily compared with one or more published in more traditional forms? Or does the medium influence the users' acceptance of the interpretation?
  3. Hypertext really only provides a one-way flow of information. It can be argued that books do this too, but because of the availability of so much interactive software - even simple drill and practice programs - there is some expectation that the computer should always provide interaction. In a hypertext system, how easily can the user interact and add his or her own interpretation to the data? As a one-way flow of information it is more dynamic than a book, but does it really provide what people have come to expect from a computer?
  4. Screen design is really crucial for hypertext systems which mix text and graphics. The standard Macintosh has a small screen; it provides a lot of easy-to-use facilities for changing fonts etc., which can tempt the beginner into what has become known as `fontitis': using too many fonts to design something which is too difficult to read or follow through. Larger systems like Intermedia allow many windows superimposed on each other, and again the screen can very rapidly become cluttered and difficult to interpret without careful design. What are the ideal characteristics of screen design, bearing in mind the type of material which is being presented?
  5. Most importantly, what are the capabilities of hypertext systems for modelling the complexities of humanities data? On the face of it, hypertext does seem to provide a way forward for modelling humanities data, since it does not impose such a rigid structure on the data. It makes the software fit the data and creates an interactive environment for exploration of the data. But can it really provide all the navigation paths which scholars need without confusing them at the same time?

At present we are looking at two areas in literary research as examples of hypertext. In literary criticism, traditional scholars have become increasingly critical of quantitative studies based on concordances. They argue that there is little empirical basis for these studies and that they can often lead to attempts to solve `non-problems', or at least that they cannot be related to traditional criticism. A project at Oxford is modelling the Victorian multiplot novel as a hypertext, using Dickens' Little Dorrit as an example, the theory being that the narrative procedures of the Victorian novelist are very similar to the structures of a hypertext. Studying the novel as a hypertext advances the theoretical understanding of how multi-sequential narrative works and helps to unravel the proliferating strands of the novel.

A second area which we are looking at is the hypertext electronic critical edition. Traditional critical editions provide only one version of the text, with footnotes and commentary. An electronic edition as a hypertext can provide multiple versions of the text, image representations of manuscripts, commentary etc., plus retrieval and browsing tools to aid the study of the text.

These projects are experimental at present. It will be interesting to see how far they can go towards achieving their objectives in a way which satisfies the academic requirements of the discipline.

 

7. COLLABORATIVE PROJECTS

The introduction of information technology has affected humanities research in another way: it has made possible much larger, collaborative projects. These require activities which can be new to the individual humanities scholar: management and organization, writing grant proposals, handling the budget, applying for more money once the project has started, sharing work between a number of people, writing reports, and keeping up publicity whilst developing the project. I have heard some of these management activities described as not being `scholarship', the implication being that they are inferior to research. There does seem to be a fear that embarking on a large project will mean no more `research'.

Collaboration can be at more than one level: between several people in one organization, or between several organizations. The more people are involved in a project, the greater the need for solid ground rules on which it will operate.

One major area of collaboration in the humanities is the construction of very large databases. Here standardization of the data is very important: for use in the current project, so that the data is re-usable for future requirements, some of which may not yet be determined, and for merging with data produced elsewhere as part of another project. The definition of data standards can also benefit the smaller individual project, as time will not then have to be spent on this.

 

8. THE TEXT ENCODING INITIATIVE AND SGML

One major international collaborative project with which I am involved is the Text Encoding Initiative (TEI), sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL) and the Association for Literary and Linguistic Computing (ALLC), which has funding from the National Endowment for the Humanities, the Commission of the European Communities and the Mellon Foundation. This project is addressing the question of standardization of data formats and in particular is preparing guidelines for the preparation and interchange of texts in machine-readable form, both for scholarly texts in the humanities and for the so-called `language industries' applications.

The project began with a planning conference in November 1987 at which all present agreed that the existing situation of many different encoding schemes in use was chaos and that there was a need for one multi-purpose and extensible scheme which could be applied to all texts. The TEI now has groups of people working in different areas to look at characteristics of texts in those areas. For example, the text representation group is looking at physical characteristics of text, character sets (the TEI provides a mechanism for user-defined sets), the logical structure of texts, critical apparatus, hypertext, and language corpora. The text analysis and interpretation group is looking at morphology, syntax, phonology and other forms of linguistic analysis (with a mechanism to insert different analyses which are not dependent on any linguistic theory), as well as literary analysis and the interpretation of historical documents. A further group on text documentation is working on methods for documentation of machine-readable data files, to include not only the bibliographic information for the source material but also what is encoded in the file.

The Text Encoding Initiative is using the Standard Generalized Markup Language (SGML), itself an international standard, which provides a syntactic framework for markup codes (tags) within a machine-readable text. SGML is based on the principle of `descriptive' rather than `prescriptive' markup, so that the text is encoded in such a way that it can be reused for many purposes. SGML encodes the logical structure of the text, and each application program uses the tags for a specific purpose.
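
The principle of descriptive markup can be shown with a short sketch. The encoded fragment and tag names below are chosen for illustration only and are not taken from the TEI guidelines; Python's standard html.parser module is used simply because it copes with SGML-style tags. The same encoded text could equally be fed to a concordance program, a typesetter or a hypertext builder, each using the tags for its own purpose.

    # Descriptive markup: the encoding records what each piece of text is,
    # and each application decides what to do with it.

    from html.parser import HTMLParser

    ENCODED = """<poem><title>Song</title>
    <line>Go and catch a falling star,</line>
    <line>Get with child a mandrake root,</line></poem>"""

    class LineExtractor(HTMLParser):
        """One possible application: pull out the verse lines only."""
        def __init__(self):
            super().__init__()
            self.in_line = False
            self.lines = []
        def handle_starttag(self, tag, attrs):
            self.in_line = (tag == 'line')
        def handle_endtag(self, tag):
            self.in_line = False
        def handle_data(self, data):
            if self.in_line:
                self.lines.append(data)

    extractor = LineExtractor()
    extractor.feed(ENCODED)
    print(extractor.lines)   # another program could use the same tags
                             # to typeset the poem or build hypertext links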

The TEI has now completed three years of its projected four-year development work. The first draft version of the TEI guidelines was made available in summer 1990 and the final version will be available in summer 1992. There is still some way to go in the provision of adequate software for handling SGML-encoded text on the machines which humanities scholars normally use (PCs and Macintoshes), but the acceptance of the TEI's guidelines by the academic community will be largely due to the fact that SGML comes nearer than any other encoding scheme to solving the intellectual problems posed by complex scholarly texts. Since it provides a means of describing the text, SGML also allows the scholar to defer until later the definition of relationships between elements in the data and the selection of any particular application program. An SGML-encoded text can be converted to a hypertext, a database or any other format, but it is much less easy to move from another format to SGML.

 

9. COPYRIGHT ISSUES AND MACHINE-READABLE DATA

Most scholars embarking on a computer-based project in literature will come up against copyright issues concerning machine-readable texts, particularly now that publishers are entering the market with electronic texts. Up till now the distribution of electronic texts has operated in a rather `ad hoc', anarchic fashion. Various organizations like the Oxford Text Archive have set up repositories of machine-readable texts, the aim being to ensure that the labour of preparing a machine-readable text is not duplicated. The copyright status of many texts in machine-readable form is unclear. Publishers need to protect their interests, yet many scholars feel that software and data on diskette are expensive to buy, but very cheap and easy to copy. The problem is compounded by the fact that the law is different in different countries, yet it is so easy to transfer texts all over the world via networks.

There is also a marked reluctance on the part of scholars to ask for copyright permission to enter a text, either because they believe they will not get permission, or because they do not want to get involved in legal issues, or because they have heard from another source that permission may not be granted. The result of this is often to enter an edition which is out of copyright and which may not have scholarly authority, thus giving rise to inferior work. I believe that text analysis computing has now reached the stage where it must establish proper guidelines on how to handle electronic texts including cataloguing and copyright issues.

 

10. ACADEMIC ACCEPTABILITY

The use of technology also has implications for career prospects. But how does one measure electronic publications against traditional ones? Is the compilation of a database a publication? I think it is fair to say even now that computing is marginal in the humanities. I would like to see it become much more central and that means that it must be acceptable academically. The value of computer-based work must be recognized and it must be shown to have real relevance for the furtherance of scholarship in its discipline area. We are much further ahead now with the development of computing techniques. I would like to find ways of ensuring that these techniques are used for high quality academic research and that they continue to be enhanced. This implies much more input from users in defining what it is that they want to do, which in turn implies more collaboration between the developers of software and scholars in the humanities. In this way academic standards in the use of information technology in the humanities will continue to be raised.

 

Susan Hockey has been active in humanities computing for 22 years, 16 of which have been at Oxford University with responsibility for the Oxford Concordance Program (OCP), typesetting and general humanities computing facilities. Her most recent position at Oxford was Project Head of the Office for Humanities Communication and Director of the Computers in Teaching Initiative (CTI) Centre for Textual Studies. She was elected to a Fellowship at St Cross College, Oxford in 1979. She is the author of two books and over 25 articles on humanities computing. She became Chairman of the Association for Literary and Linguistic Computing (ALLC) in 1984 and is currently also Chair of the Steering Committee of the Text Encoding Initiative. In summer 1991 she was appointed Director of the new US National Center for Machine-Readable Texts in the Humanities, which is sponsored by Rutgers and Princeton Universities, and takes up her position there on 7 October 1991.

This article is a reprint of a presentation given at the NORDINFO Symposium on Cultural Heritage and Humanities Research in the Light of New Technology, Copenhagen, 7-9 June 1991.

 

