Examples of False Anglicisms: Retrieval Problems in a Large Corpus of Written Italian

Final Report on a Marie Curie Training Site (MCTS) host fellowship at the Bergen Advanced Training Site in Multilingual Tools (BATMULT), August - October 2003

Cristiano Furiassi (University of Torino)


I. Introduction

The average Italian speaker does not seem to be aware that many English-looking and/or English-sounding words are not English at all. Instead, they are autonomous coinages usually referred to as 'false anglicisms'.

To date, the phenomenon of false anglicisms has not received adequate treatment. An accurate typology and an exhaustive classification of false anglicisms are therefore needed both for language teaching purposes, especially at advanced levels, and for the compilation of a dictionary of false anglicisms in Italian.

My PhD thesis (A Learner's Dictionary of False Anglicisms in Italian), which is being carried out at the Università degli Studi di Torino under the supervision of Professor Maria Teresa Prat Zagrebelsky and Associate Professor Virginia Pulcini, will greatly benefit from the computational background provided by the Marie Curie Training Site (MCTS) host fellowship at the Bergen Advanced Training Site in Multilingual Tools (BATMULT).

I.a. Initial Aims

Initially, I wanted to study false anglicisms in order to set out a more detailed typology considering the linguistic processes involved in their coinage. This procedure was also meant to address the problematic aspect of retrieval caused by orthographic conventions and morphological variety.

The material on false anglicisms was to be extracted from corpus evidence, starting from a previously compiled, dictionary-based list of 72 false anglicisms.[1] The corpus I mainly intended to work on was the one compiled from La Repubblica, one of the major Italian newspapers.[2]

I also needed to analyze the written components of some English corpora such as the ICAME (International Computer Archive of Modern and Medieval English), the BNC (The British National Corpus), and the BoE (The Bank of English), including their tagged versions, in order to provide full coverage of both British and American English. This would make it possible to check whether the list of false anglicisms used as a starting point for my project had to be extended or shortened in the light of corpus data.

In addition, one of my initial aims was to use Italian-English parallel corpora, or perhaps to build a corpus of my own based on Ulisse (www.ulisse.alitalia.it), the Italian-English on-line version of Alitalia's official on-board magazine. This would have allowed me to check how Italian false anglicisms are translated into English and thus to provide precise translation equivalents.

I.b. Adjusting Initial Aims

The original aims were partially modified. In practice, the idea of building a corpus based on Ulisse was abandoned because of problems in downloading the English translations of the articles included in Ulisse. Both the Italian and the English versions were available under the same URL, and this hampered the independent selection of the English text. In addition, even if such a corpus could have been built, it would not have been large or rich enough to include a considerable number of false anglicisms. However, the work done with Ulisse allowed me to trace a few new false anglicisms. Plain text was gathered from Ulisse with the help of the text-retrieval software HTTrack, and once gathered the texts were made searchable through WordSmith Tools.

II. Project description

The project I finally developed at the Aksis center (Avdeling for kultur, språk og informasjonsteknologi), formerly HIT (Senter for humanistisk informasjonsteknologi), of the University of Bergen can be roughly divided into two distinct though connected parts. The first consists of a detailed study of the structure of already available false anglicisms taken from the start-up list described above. The second part, more challenging and innovative, was to find a way to retrieve new false anglicisms by exploiting computational tools and techniques.

II.a. Description and Study of False Anglicisms

On the one hand, the start-up list of false anglicisms which constitutes the basis of the first part of the BATMULT project was completed before I came to Bergen, drawing on dictionary-based research previously carried out at the Università degli Studi di Torino. The lexicographic approach consisted in sketching a start-up list of possible false anglicisms in Italian through glossaries and collections of neologisms and foreign words. Subsequently, several electronic editions of Italian monolingual dictionaries were scanned in order to find other possible false anglicisms. Then, the start-up list was checked against the entries given in several electronic editions of British and American English monolingual dictionaries. Finally, some electronic editions of Italian-English bilingual dictionaries were studied to find proper translation equivalents of Italian false anglicisms.

On the other hand, the corpus-based methodology developed at BATMULT made use of English corpora to check whether false anglicisms, though not included in English monolingual dictionaries, could actually be found in large databases of written English (e.g. BNC, BoE, and ICAME). Then, Italian corpora (e.g. La Repubblica) were studied in order to provide concrete examples of false anglicisms in Italian.

Since the procedure applied to the description and study of false anglicisms in Italian was based on a previous set of false anglicisms, the analysis aimed at verifying whether the items originally considered false anglicisms according to dictionaries really were so, or whether they could be found in any English corpus. In order to do so, all the previously found false anglicisms had to be searched for and then analyzed in each of the English corpora considered.

Besides English corpora, further web resources were used to deepen the level of analysis. The resources used were WebCONC (Web-based Concordances), WebCorp, and The Word Spy. Both WebCONC and WebCorp are systems employed to extract and customize concordances of words found on web sites. WebCorp is a suite of tools which allows access to the Web as a corpus. WebCONC is a single tool used to generate KWIC (Key Word in Context) concordances based on web pages. The aim of The Word Spy (www.wordspy.com) is to document how new English words and phrases that have appeared in newspapers, magazines, books, press releases, and Web sites may be defined and used.
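As a rough illustration of what a KWIC concordance is, the following Python sketch lists each occurrence of a keyword with a fixed window of context on either side. It is not the code behind WebCONC or WebCorp; the function names, the window width, and the sample sentence are assumptions made for this example only.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each hit of keyword."""
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def format_kwic(hits, width=30):
    """Right-align the left context so the keyword falls in a fixed column."""
    return [f"{left:>{width}} | {kw} | {right}" for left, kw, right in hits]

# Invented sample text, for illustration only.
sample = "Ogni mattina faccio footing nel parco. Il footing fa bene alla salute."
for line in format_kwic(kwic(sample, "footing")):
    print(line)
```

Aligning the keyword in one column is what makes it practical to inspect each candidate in context, which is the manual step the analysis relies on.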

By exploiting all these corpora and tools, the initial list of false anglicisms was reduced from 72 to about 60 items. Some items originally thought to be false anglicisms because they do not appear in English dictionaries were nevertheless found in English corpora. Once this happened, there was no valid reason to consider them false anglicisms any longer, so they were removed from the list.

II.b. Implementing a Computational Technique to Find New False Anglicisms

This part of my project could not have been achieved without the help of Knut Hofland, who kindly contributed his computational and programming skills. Texts were gathered over a two-month span (August 15th-October 15th) from three major Italian newspapers: La Stampa (www.lastampa.it), La Repubblica (www.repubblica.it), and Il Corriere della Sera (www.corriere.it). The method used to retrieve new false anglicisms in Italian relied on enhanced Unix scripts (developed personally by Knut Hofland) combined with the w3mir software, which was used to extract plain text from the newspapers considered. The system updates automatically every day, and a list of all the newly gathered words will be scanned manually at the end of the collection period in order to look for possible new false anglicisms.
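The daily update can be pictured as keeping a running set of word forms already seen and reporting only the forms that are new on a given day. The Python sketch below is an assumption about the general idea, not a reconstruction of the Unix scripts actually used; the tokenizer, the function name, and the sample headlines are hypothetical.

```python
import re

# Letters only (Unicode-aware), allowing internal hyphens; digits are skipped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def new_words(day_text, seen):
    """Return the word forms in day_text not yet in seen, and record them."""
    types = set(TOKEN_RE.findall(day_text))
    fresh = types - seen
    seen |= fresh  # update the running vocabulary in place
    return sorted(fresh)

# Invented sample input, for illustration only.
seen = set()
day1 = new_words("Il governo annuncia", seen)
day2 = new_words("Il governo e il weekend", seen)
print(day1)
print(day2)
```

Case is deliberately preserved here, since capitalization is one of the cues that later filtering can rely on.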

The size of the texts collected amounts to about 10,000,000 tokens (about 4.2 million from La Repubblica, about 3.8 million from La Stampa, and about 2.0 million from Il Corriere della Sera). The software used to search the corpus is based on the IMS CWB (Corpus Workbench) and the corpus is also available for single-headline search.

Since the main aim of the project was to retrieve possible new false anglicisms, word lists obtained from the POS-tagged versions of the LOB (Lancaster-Oslo/Bergen Corpus) and the BROWN (The Brown Corpus) were intersected with the corpus. This intersection showed all the English-looking items in the corpus. The outcome was then compared against a lemmatized Italian word list (made available by Marco Baroni at the Università degli Studi di Bologna) in order to eliminate the Italian-looking words from the word list resulting from the corpus.
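In set terms, this step is an intersection followed by a difference. The following Python sketch shows the idea under stated assumptions: the three inputs are plain lists of word forms, the comparison is case-insensitive, and the original casing of the corpus forms is preserved. The function name and the toy data are hypothetical and are not taken from the project.

```python
def english_looking_candidates(corpus_tokens, english_words, italian_lemmas):
    """Corpus word forms found in the English list but not in the Italian one."""
    english = {w.lower() for w in english_words}
    italian = {w.lower() for w in italian_lemmas}
    candidates = {
        tok for tok in set(corpus_tokens)
        if tok.lower() in english and tok.lower() not in italian
    }
    return sorted(candidates)

# Invented toy data, for illustration only.
corpus_tokens = ["il", "footing", "come", "London", "weekend", "casa"]
english_words = ["footing", "come", "london", "weekend", "the"]
italian_lemmas = ["il", "come", "casa"]

print(english_looking_candidates(corpus_tokens, english_words, italian_lemmas))
```

In this toy run the form "come" is dropped because it appears in both lists. With a lemmatized Italian list, inflected homographs may still slip through, which is consistent with the need for manual checking described below.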

This procedure generated a list of about 7,880 words. Since this provisional list included a lot of noise, i.e. English-Italian homographs, proper nouns, abbreviations, acronyms, and quotations, it underwent further automatic filtering. Proper nouns were eliminated by removing capitalized words; quotations were excluded both by eliminating strings of three or more graphic words and by removing words in quotation marks. Abbreviations were eliminated by removing words of three or fewer orthographic characters.
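The automatic filters described above can be sketched as a single predicate applied to each candidate. This Python example is a minimal approximation under assumptions: the set of quotation-mark characters, the function names, and the sample items are hypothetical, and the thresholds simply follow the wording of the text (three or more words, three or fewer characters).

```python
QUOTE_CHARS = "\"'“”«»‘’"

def is_noise(item):
    """Apply the heuristic filters: quotations, proper nouns, abbreviations."""
    item = item.strip()
    if not item:
        return True
    if len(item.split()) >= 3:          # long strings are treated as quotations
        return True
    if item[0] in QUOTE_CHARS or item[-1] in QUOTE_CHARS:
        return True                     # items in quotation marks
    if item[0].isupper():               # capitalized items as proper nouns
        return True
    if len(item) <= 3:                  # very short items as abbreviations
        return True
    return False

def filter_candidates(items):
    """Keep only the items that pass every filter, preserving their order."""
    return [item for item in items if not is_noise(item)]

# Invented sample items, for illustration only.
items = ["footing", "London", "USA", "to be or not", '"fiction"', "box", "baby parking"]
print(filter_candidates(items))
```

These heuristics trade recall for precision: a genuine candidate that happens to be capitalized or very short is discarded along with the noise, and homographs are not addressed at all.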

The combination of automatic procedures and manual supervision (homographs necessarily had to be analyzed manually in the context in which they appear) produced a final list of … items that is very likely to contain new false anglicisms. Although checking it is rather time-consuming (each word has to be examined in the context in which it appears), the list seems promising, and a few new false anglicisms have already been found through a preliminary search. This procedure increased the list of false anglicisms from about 60 to about 80 items.

III. Concrete Achievements and Upcoming Results

During my stay at the Aksis center of the University of Bergen (August 4th 2003 - October 31st 2003) the goals of the project I originally submitted to BATMULT were achieved.

The results obtained constitute a fundamental contribution to the completion of a dictionary of false anglicisms, which I intend to finish by the end of my PhD. The BATMULT project is an integral component of my PhD thesis as it supports the theoretical framework used to study false anglicisms. My project benefited from hands-on training in computational linguistic tools. The facilities available at BATMULT, the attendance at specific courses, and the help of a supervisor also enabled me to develop further understanding and knowledge of both new and already available computational linguistic tools and corpora.

Computational techniques have proved very useful in saving a great amount of time in building a corpus, in retrieving certain items, and in arriving at a provisional list of false anglicisms in Italian. Despite these advantages, the computational techniques employed do not seem sufficient to deal thoroughly with the complex and manifold issue of false anglicisms. In fact, manual scanning of the resulting list of items is needed to establish which of the entries are false anglicisms.

III.a. Newspaper Corpus

The first tangible achievement is a small (about 10 million tokens) but up-to-date corpus of Italian newspaper language. This will be available for future work and could be exploited not only to find false anglicisms but also to search for anglicisms and neologisms in general.

III.b. A Wordlist with Possible New False Anglicisms

The newspaper corpus yielded a list of items to which the search for false anglicisms can be restricted. Although some automatic filters were added in order to eliminate unwanted noise from the final word list, only further manual scanning of the list will lead to the identification of new false anglicisms. Therefore, time-consuming manual scanning will have to follow the automatic processing.

III.c. A List of False Anglicisms

New false anglicisms have been found, and previous ones (that were first thought to be false anglicisms but then proved to be authentic English loanwords) were eliminated. All the items considered false anglicisms have therefore been double-checked, from both a lexicographic and a corpus-linguistic perspective, resulting in a noise-free list of about 80 items.

III.d. Possible XML-implementation of a Dictionary of False Anglicisms in Italian

The paper edition of the dictionary of false anglicisms, once completed, may be encoded in an XML database so that the electronic version can be searched independently of alphabetical order. Suggestions on how to implement a dictionary electronically were provided by Sindre Sørensen.
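To make the idea of a search that does not depend on alphabetical order more concrete, the Python sketch below encodes one entry as XML and retrieves headwords through a field other than the headword itself. The element names (dictionary, entry, headword, pos, equivalent) and the sample entry are assumptions for illustration; they do not represent the schema planned for the dictionary.

```python
import xml.etree.ElementTree as ET

def make_entry(headword, pos, english_equivalent):
    """Build one hypothetical <entry> element."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pos").text = pos
    ET.SubElement(entry, "equivalent", {"lang": "en"}).text = english_equivalent
    return entry

def find_by_equivalent(dictionary, english_equivalent):
    """Return the headwords whose English equivalent matches the query."""
    return [
        entry.findtext("headword")
        for entry in dictionary.findall("entry")
        if entry.findtext("equivalent") == english_equivalent
    ]

# Illustrative entry only; not taken from the project's list.
dictionary = ET.Element("dictionary")
dictionary.append(make_entry("footing", "noun", "jogging"))

print(ET.tostring(dictionary, encoding="unicode"))
print(find_by_equivalent(dictionary, "jogging"))
```

Because every field is a separate element, the same data can be queried by part of speech, by English equivalent, or by any other field added later.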

III.e. Computational Tools

Since all the above-mentioned corpora required appropriate software in order to be analyzed, during the period I spent at the Aksis center I also developed a deeper understanding of computational linguistic tools such as WordSmith Tools (especially the concord, wordlist, and compare wordlists features), SARA98 (especially the word query and phrase query features), and HTTrack.

IV. Parallel Activities

The resources at BATMULT and the facilities provided by Aksis also allowed me to take part in different related activities during my stay.

IV.a. Lexicon and WordNet

I attended a course held by Gunn Lyse entitled Lexicon and WordNet. Attending this course inspired me to deepen the analysis of false anglicisms from a semantic perspective. A particular type of false anglicism may be defined as a semantic shift, and it would be interesting to sub-categorize semantic shifts into smaller types. Semantic shifts can indeed comprise metaphors, metonymies and/or polysemic expansions. Further work may be devoted to studying the spread of false anglicisms across semantic/lexical fields. Furthermore, the way lexical semantics is implemented in databases such as WordNet may be useful for setting the dictionary of false anglicisms into an XML database format.

IV.b. Lecture

During my stay at BATMULT I was asked to give a presentation of my work, entitled False Anglicisms in Italian: An Overview and Some Findings, at a seminar series organized by Professor Helge Dyvik and Professor Koenraad de Smedt at the Department of Linguistics and Comparative Literature of the University of Bergen (September 12th 2003).

The feedback to the lecture turned out to be very constructive and highlighted both strengths and weaknesses of a future dictionary of false anglicisms in Italian. Major revisions might be required for the organization of each entry, i.e. the microstructure. Quantitative findings and the improvement of a corpus-based definition of false anglicisms in Italian might also be needed.

IV.c. Seminar Attendance

Aksis also organized a two-day seminar (October 23rd-24th) on the occasion of Professor Christian Fluhr's visit to Bergen. Attending the seminar provided interesting insights into the latest research being carried out by several scholars including Helge Dyvik, Knut Hofland, Sindre Sørensen, Paul Meurer, Koenraad de Smedt, Claus Huitfeldt, and Michael Sperberg-McQueen.


[1] Furiassi, Cristiano. 2003. 'False Anglicisms in Italian Monolingual Dictionaries: A Case Study of Some Electronic Editions.' International Journal of Lexicography. Vol. 16 No. 2, 121-142.

[2] The corpus La Repubblica, created by Guy Aston and Lorenzo Piccioni at the Scuola Superiore di Lingue Moderne per Interpreti e Traduttori (SSLMIT) of the Università degli Studi di Bologna, Forlì branch, has been available since the beginning of 2002. It is a pioneering project intended to create a large corpus of written language taken from the Italian newspaper La Repubblica. The texts, collected on 16 CD-ROMs and containing all the articles that appeared in La Repubblica from 1985 to 2000, have been converted into a single database (also available for single-year search) of about 350 million words, searchable with SARA98. The project has now been completed for the first six years (1985-1991). The author is very grateful to Professor Guy Aston for giving his permission to access and use the corpus.