Corpora: Anonymisation - Summary

Frances Rock (ROCKFE.ENG.ARTS.Bham@hhs.bham.ac.uk)
Tue, 30 Mar 1999 17:50:06 +0000

Dear all

Warm thanks to all the people who replied to my query on anonymisation
a couple of weeks ago. 22 people replied to my query so I was really
staggered by the interest shown in this issue.

I have posted a brief summary of replies below. I have divided this
summary into broad headings and only included short excerpts from, or
bullet-pointed summaries of, the mails I received. As you may imagine
I have a lot more information so do get in touch if you would like
further information about any of the points below.

I will collate the work I have been doing on this into a publication
in due course. Several people expressed an interest in receiving a
copy of any work I produce on anonymisation so I'll post a note to
the list when that is available.

Thank you all again for your help!

Regards

Frances Rock

Contributors (I hope I haven't missed anyone?)
Amanda Schiffrin mandy@nis.sdu.dk
Andrew Wilson andrew.wilson@phil.tu-chemnitz.de
Brian Ulicny bulicny@lhs.com
Christopher Brewster brewster@upatras.gr
Christopher Tribble ctribble@lanka.ccom.lk
David Lee d.lee@lancaster.ac.uk
Geoffrey Sampson geoffs@cogs.susx.ac.uk
Gerald McMenamin gerald_mcmenamin@csufresno.edu
Gill Philip philip@cilta.unibo.it
Gunter Lorenz Gunter.Lorenz@Phil.Uni-augsburg.de
James L. Fidelholtz jfidel@siu.buap.mx
John W. Du Bois dubois@humanitas.ucsb.edu
Kristine Hasund Kristine.Hasund@hia.no
Laura Gavioli gavioli@sslmit.unibo.it
Mari Broman Olsen molsen@umiacs.umd.edu
Patrick Juola juola@quine.mathcs.duq.edu
Ramesh Krishmanurthy ramesh@clg.bham.ac.uk
Richard Todd r.todd@dcs.shef.ac.uk
Peter Hamer pgh@nortelnetworks.com
Susan Blackwell S.A.Blackwell@bham.ac.uk

WHAT IS ANONYMISATION?

The deliberate changing of, or concealment of, the name (and hence,
identity) of someone or something. Gill Philip

· It can be seen as a security issue
· In which case we can ask what should anonymisation accomplish or
"what's your threat model?" Patrick Juola

IS ANONYMISATION A NON-ISSUE? IS IT WORTH WORRYING ABOUT?

Compare it to the worry certain people have as to whether their email
is being read by someone.

Similarly in a reasonable corpus i.e. over 100 m words, the bits that
are personal or sensitive are so 'rare' that it is rather ridiculous
to be concerned.

Corpora are usually used to extract data which is disassociated from
its context, citation lines. Text can only be sensitive IN CONTEXT
otherwise it is meaningless. Christopher Brewster

Those of us who are academics would like to think that this is a
non-issue, but if libellous stuff finds its way inadvertently or
otherwise into the corpus, anyone (ie, a publishing house) who
publishes the data in any way would then be liable. James L.
Fidelholtz

... there is no need for 'anonymisation'. The only exception to this may
be a very specialised corpus (e.g. love letters) where the
correspondents are known individuals to the community using the
corpus. But it is easier to use data from individuals whom you do not
know personally than to take the trouble to anonymise the text. [How
do we know we won't know someone?] Christopher Brewster

This list will almost certainly be incomplete, but... Above all, in
situations where patient, client, or commercial confidentiality is
paramount: medical consultations, opinion focus groups, non-public
meetings, consultations with other professionals, and so on.
Written texts include personal records (e.g. medical records) and
private correspondence, etc. Andrew Wilson

[Anonymisation] is necessary when naming could harm or cause harm to
the person involved, or to those near to them.

It is also necessary when naming might result in (legal) problems.
Gill Philip

Your query brings up important issues. Presenting data anonymously is
a challenge, and is necessary sometimes. It is certainly not a
non-issue. Gerald McMenamin

Anonymisation is certainly not a non-issue ... We do voice recognition
products for a number of medical products.

In order to do speech recognition, we make statistical models of the
sub-language, and this involves processing as large a corpus as
possible of text in that medical speciality. However, hospitals will,
quite properly, not part with corpora of medical texts unless they can
be assured that the data has been suitably anonymised. If they
release data with personal information, they expose themselves to
legal troubles. Brian Ulicny

I certainly don't think this is a non-issue. Rather, it seems to be a
key one (at least in Italy) at the moment. Following a European law
protecting individual privacy (which has recently become part of the
Italian legislation), there is a group of scholars in Italy preparing
a document to ask legal authorities what has to be done in order to
protect privacy in the case of audio and video-recording collected for
(socio)linguistic research purposes. Laura Gavioli

I think anonymisation is a very serious issue, so I think what you're
doing is important. John W. Du Bois

I don't think I can contribute much positive to your enquiry, but I
should be very interested in due course to hear what comes of it.
Geoffrey Sampson

In any event, as I am concerned with all matters relating to corpus
design myself, I would be quite interested in whatever it is you are
working on. Gunter Lorenz

The Law

... other countries have weird laws: eg, in the state of Puebla,
Mexico, you can successfully be sued for libel even if what you said
is provably true! James L. Fidelholtz

www.ipc.on.ca/Web_site.ups/Intro/Frames.htm Describes the Canadian
experience. Brian Ulicny

INFORMED CONSENT

Whenever names and/or contact details are published informed consent
(obtained either directly or via any instructing authority, such as
the police) is sought _prior_ to any such documentation.

No problems have been encountered with doing this so far.

All required parties are informed of any use the data may have that
is, perhaps, wider than that of immediate concern. In short, informed
consent seems to reduce future problems with 'permission', etc.

A multi-modal approach to informing subjects/informants.

Just because something may seem 'as clear as day' to the analyst, it
doesn't mean that the same method of communication will be so readily
digested by all others (i.e., those you hope will give consent).

Two or three ways to share information prior to consent:
· written message - covers the immediate use of any study
· (optionally) followed by playing a (speech) recording that
recapitulates the same message
· face-to-face conversation with the participant - further opportunity
for questions. Richard Todd

HOW CAN ANONYMISATION BEST BE ACHIEVED?

· The nature of underlying data structure - systematisation
· 'Foreseeability' of needing information not currently required -
oneself or another
· Data-mining strategies that already exist may be used
Richard Todd

An intelligent anonymisation tool would:
· Facilitate collaborative research
· Businesses
· Outside professionals (doctors, lawyers, etc.)
· Anonymisation program(s) could be run on their own machines and the
researchers would only take away anonymised data

90% of the time, the nature of data means:
· Can be anonymised by the typist as s/he transcribes
· Problems of third-party access to the unanonymised data (i.e.
outside typists listening to the recordings) remain
· This also applies to hand-written data (e.g. medical case notes)
Andrew Wilson

WHAT EXACTLY SHOULD BE ANONYMISED?

Cobuild - Ramesh Krishmanurthy
· Personal letters and non-public-domain spoken data
· Personal names and addresses - Dear <FX> for female or Dear <MX>
for male

The PIXI Corpus - Laura Gavioli
· Shops changed internal organisation (and address, in one case)
· No repercussions of any kind

London-Lund - Geoffrey Sampson
· Potential to identify individual persons was probably the deciding
factor
· Tundraland Richmond Cinema, British Schools
· Blanket rule vs human discretion

WHAT KINDS OF INFORMATION NEED TO BE PRESERVED IN ORDER TO AID
EFFECTIVE ANALYSIS?

The answer to this depends on the questions that the linguist is
investigating. Gill Philip

WHAT CAN BE USED TO REPLACE ITEMS THAT HAVE BEEN ANONYMISED?

Role Codes - Andrew Wilson
· Participants - D = doctor, P = patient OR a random initial
· Could be "indexed" for multiple participants
· Distinct from text words - SGML tags

Italian corpus (Mondadori/CNR-Pisa) - Gill Philip
· Names - XXX or XYX or similar formula.
· Dates - symbols such as @#%

Corpus of learner essays - Gunter Lorenz
· Names - abbreviated personal names
· Abbreviations may occasionally:
· coincide with established acronyms (EU, GB, UN etc)
· coincide with two-letter function words (OR, IT, HE etc)
· Dots may skew sentence length

Cobuild - Ramesh Krishmanurthy
· Anonymised at transcription stage
· Names - <F01> = first female speaker, <M01> = first male speaker
· <MOX> and <FOX> for speakers who we could not actually identify for
sure
· Anonymised internal reference to persons similarly
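A minimal sketch of how speaker codes of this kind could be allocated automatically during transcription. This is not Cobuild's own tooling: the function name `make_speaker_coder` and the way unidentified speakers are signalled (passing `None` as the name) are assumptions made for illustration.

```python
def make_speaker_coder():
    """Return a function that maps (name, sex) to an anonymous speaker code."""
    counters = {"F": 0, "M": 0}   # how many speakers of each sex seen so far
    codes = {}                    # real name -> allocated code

    def code_for(name, sex):
        # Speakers who cannot be identified for sure share one code per sex,
        # written here as in the summary above (<FOX>, <MOX>).
        if name is None:
            return f"<{sex}OX>"
        # A known speaker keeps the same code every time they appear.
        if name not in codes:
            counters[sex] += 1
            codes[name] = f"<{sex}{counters[sex]:02d}>"
        return codes[name]

    return code_for


code_for = make_speaker_coder()
first = code_for("Jane", "F")     # first female speaker
second = code_for("Tom", "M")     # first male speaker
unknown = code_for(None, "M")     # unidentified male speaker
```

Because the name-to-code table is kept in one place, references to the same person elsewhere in the transcript can be replaced with the same code.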

The PIXI corpora - Laura Gavioli
· Place names
· Main intonational pattern was maintained
· e.g. Birmingham = Nottingham
· Used "really existing" names
· e.g. "Nottingham", rather than "Mattigham"
· Less diversion of attention

David Lee
Proper names may form a fair proportion of the text so style of
anonymisation may affect:
· overall counts of nouns - Is first name + surname 1 or 2 nouns?
· type-token ratios
· time/duration of utterances
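To make the point about counts concrete, here is a small sketch comparing type-token ratios for one invented sentence under three replacement styles. The sentence, the placeholder strings (`NAME`, `NAME1`, `NAME2`) and the whitespace tokenisation are all assumptions for illustration only.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    return len(set(tokens)) / len(tokens)


# Invented example sentence containing two full names, one of them repeated.
original = "Anna Smith met Ben Jones and Anna Smith left".split()

# Style A: every name word becomes the same placeholder.
per_word = "NAME NAME met NAME NAME and NAME NAME left".split()

# Style B: each full name becomes one shared placeholder.
per_name = "NAME met NAME and NAME left".split()

# Style C: each full name becomes one placeholder indexed per person.
indexed = "NAME1 met NAME2 and NAME1 left".split()

ratios = {
    "original": type_token_ratio(original),
    "per_word": type_token_ratio(per_word),
    "per_name": type_token_ratio(per_name),
    "indexed": type_token_ratio(indexed),
}
```

Style A keeps the token count of the original but collapses the types, while styles B and C also change the number of tokens, so the choice of style needs to be recorded alongside any statistics drawn from the corpus.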

METHOD - HOW CAN ANONYMISATION BE MOST EFFECTIVELY CARRIED OUT?

AVIATOR project - Susan Blackwell
· Automated deletion - Proper names
· List - sorted in alphabetical order
· Software could easily be enhanced or modified
· Not 100% accurate

Writing Difficult Texts, PhD Corpus - Christopher Tribble
· Focusing on proper nouns
· To preserve the commercial confidentiality
· Two alternatives:

Alternative 1 - Slow!
1. create a wordlist for the full text
2. identify (manually) proper nouns
3. allocate a unique alpha-numerical identifier to each distinct
proper noun
4. systematically replace each unique proper noun string with its
identifier
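The four steps of Alternative 1 can be sketched roughly as follows. This is an illustration only, not the procedure actually used for the PhD corpus: the function names, the `PN0001`-style identifier format and the sample sentence are assumptions, and step 2 remains a manual decision represented here by a hand-written list.

```python
import re
from collections import Counter


def build_wordlist(text):
    """Step 1: frequency list of every distinct word form in the text."""
    return Counter(re.findall(r"[A-Za-z][A-Za-z'-]*", text))


def allocate_identifiers(proper_nouns, prefix="PN"):
    """Step 3: give each distinct proper noun its own identifier."""
    distinct = sorted(set(proper_nouns))
    return {noun: f"{prefix}{i:04d}" for i, noun in enumerate(distinct, start=1)}


def replace_proper_nouns(text, mapping):
    """Step 4: replace whole-word occurrences, trying longer strings first."""
    if not mapping:
        return text
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in alternatives) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


text = "Acme wrote to Bolton. Bolton replied to Acme."
wordlist = build_wordlist(text)

# Step 2: the analyst inspects the wordlist and picks out the proper nouns.
chosen = ["Acme", "Bolton"]

mapping = allocate_identifiers(chosen)
anonymised = replace_proper_nouns(text, mapping)
```

Keeping `mapping` in a separate, protected file would let the holder of the original data reverse the substitution if needed, while the distributed corpus contains only the identifiers.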

Alternative 2 - Quick
1. Use CLAWS output
2. Replace any string which is followed by the code [NP*] with a
string of characters such as PROPERNOUN
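Alternative 2 might look something like the sketch below. It assumes the tagged text is laid out as `word_TAG` tokens separated by spaces and that proper nouns carry tags beginning with NP; the actual layout of your CLAWS output may differ, so the pattern would need adjusting. The function name and sample sentence are made up for illustration.

```python
import re

# Assumed layout: each token is "word_TAG"; tags starting with NP mark
# proper nouns. The tag itself is kept so later analysis can still use it.
TAGGED_PROPER_NOUN = re.compile(r"\S+_(NP\w*)")


def mask_proper_nouns(tagged_text, placeholder="PROPERNOUN"):
    """Replace the word part of every NP*-tagged token with a placeholder."""
    return TAGGED_PROPER_NOUN.sub(
        lambda m: f"{placeholder}_{m.group(1)}", tagged_text
    )


sample = "Smith_NP1 visited_VVD London_NP1 on_II Monday_NPD1 ._."
masked = mask_proper_nouns(sample)
```

This is quick because no manual pass is needed, but it inherits any tagging errors: a proper noun the tagger missed stays in the text, and a common noun mis-tagged as NP is masked.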

THERE ARE DEGREES OF RISK DEPENDING ON THE WIDER CONTEXT

I certainly wouldn't anonymise anything in the public domain already
(ie, published stuff). James L. Fidelholtz

British National Corpus - Geoffrey Sampson
· Conversations come with information about when and where they were
recorded
· Place is highly relevant for studying dialects, for instance
· Small places
· Places with specific, identifiable people
Spanish corpus - James L. Fidelholtz
Working on proper nouns - Statistics tricky
An alternative to anonymisation - Cautioning users about delicate
extracts

AUDIO AND VIDEO RECORDING

Santa Barbara Corpus of Spoken American English - John W. Du Bois
· Distribute recording - anonymity?
· Bleeped names
· Voice quality remains
· It is very important for future corpora to include audio and/or
video

Telecommunications - Peter Hamer
Anonymise traffic records and/or traces before analysing/publishing
them, using automated anonymisation. In logs of phone calls, you certainly
wouldn't want to retain the actual phone numbers. There might be some
conflict of interests in how much of the network topology you retain.
For example, you might treat the exchange and the customer part of the
numbers differently. Obviously [?] larger sets of data introduce their
own degree of anonymity. If the data came from a small community the
called exchange might be a powerful clue to the identity of the
caller.
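One way to treat the exchange and customer parts of a number differently is sketched below, using a keyed hash so that the same number always maps to the same pseudonym without being recoverable by anyone who lacks the key. This is an assumption about how it could be done, not a description of any telecoms operator's practice; the function names, the output format, the key and the sample numbers are all hypothetical.

```python
import hashlib
import hmac


def pseudonym(value, key, length=6):
    """Keyed hash of a digit string, truncated to a short hex pseudonym."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def anonymise_number(number, exchange_len, key, keep_exchange=False):
    """Pseudonymise a phone number, handling the exchange part separately.

    With keep_exchange=True the exchange digits survive, preserving network
    topology at the cost of weaker anonymity in small communities.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    exchange = digits[:exchange_len]
    # Numbers on the same exchange share an exchange pseudonym, so call
    # patterns between exchanges can still be studied.
    exchange_part = exchange if keep_exchange else "X" + pseudonym(exchange, key)
    # The customer pseudonym is derived from the full number, so identical
    # customer digits on different exchanges do not collide by construction.
    return f"{exchange_part}-C{pseudonym(digits, key)}"


key = b"example-secret-key"   # hypothetical; keep the real key out of the data
a = anonymise_number("0121 496 0123", 7, key)
b = anonymise_number("0121 496 0456", 7, key)
topology_kept = anonymise_number("0121 496 0123", 7, key, keep_exchange=True)
```

Whether to set `keep_exchange` is exactly the conflict of interests described above: retaining the exchange helps traffic analysis but may point to the caller when the community is small.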

WHAT IS ANONYMISATION - IT'S NOT AS STRAIGHTFORWARD AS IT SEEMS!

De-contextualisation [as a method of anonymisation] has to be
considered carefully ... because it can be as damaging as
contextualisation in many cases. Gill Philip

Cobuild - Ramesh Krishmanurthy
· Conversational data - One or two people recognising selves - major
problems ensuing
· Newspaper data - Car accident

All this means that I find it really hard to envisage what a
satisfactory detailed policy on anonymisation would look like. I
would be very interested to see anything you eventually write up in
response to the consultations you are carrying out! Geoffrey Sampson

__________________________________________________
Frances Rock
Postgraduate Student
Department of English
The University of Birmingham
Edgbaston Birmingham B15 2TT

0121 257 3519
f.e.rock@bham.ac.uk
__________________________________________________