Corpora: anonymization

Geoffrey Sampson (geoffs@cogs.susx.ac.uk)
Wed, 31 Mar 1999 09:42:53 +0100

Dear Frances,

Thank you for the summary you sent of the responses you received to your
inquiry about anonymization. I was quite surprised at the extent to which some
of your respondents felt it was not a large problem. But this undoubtedly
depends on whether one is working with published written material, or
with "private" language, particularly speech. If material has already
been published, one can assume that decisions were made then about what
should and should not be said. But in my experience of working with
transcriptions of spontaneous conversations, there is a _high_ incidence
of things that need to be "blanked out" somehow, even if the speakers
have signed consent forms. Sometimes the damaging remarks about third
parties are complete within a line or two, contrary to your respondent
who suggested that corpus extracts out of context will never be harmful.

Since you circulated your original inquiry, I have had occasion to draft
a document about the anonymization policy I have applied in the part
of my CHRISTINE Corpus which derives from the spoken part of the British
National Corpus. (The BNC itself anonymizes some items, but not enough.)
I thought you might still be interested in reading this; I am also circulating
it to a couple of relevant discussion lists in case it is of wider
interest.

Best regards,

Geoffrey Sampson

School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, GB

e-mail geoffs@cogs.susx.ac.uk
tel. +44 1273 678525
fax +44 1273 671320
Web site http://www.grs.u-net.com

............................................................................

Anonymization

The BNC compilers promised anonymity to the speakers represented in
BNC/speech. The CHRISTINE Corpus extends this BNC policy in certain
respects.

The anonymity policy was implemented in BNC by removing surnames of speakers,
and a few other proper names, replacing them with an SGML entity which
in CHRISTINE appears as "<name>". However, this procedure is arguably
not adequate.

The headers to the BNC/speech files do specify speakers' Christian names
(forenames); and of course they also specify the dates and places of
the recordings. The places specified are sometimes small villages.
The date and place specifications represent significant scientific data,
and must be preserved. But, particularly when the speakers' Christian
names are moderately or very unusual, it seems likely that someone
familiar with the locale in question would often be able to identify
groups of friends from their Christian names.

True, an outsider would hardly be able to identify individuals without
their surnames. But anonymity vis-à-vis outsiders is not the
only kind of anonymity that matters. Surely it is equally important
to protect, say, a group of youngsters who have been recorded chatting
freely among themselves from embarrassment through being recognized
by their own teachers or relatives. One may feel that the likelihood
of such an "insider" encountering the CHRISTINE Corpus is fairly low.
But the decisive point is that some of the speakers themselves understood
that the corpus compilers were offering them this level of anonymity.
For instance, T06.00524 shows the speaker explaining the system to
her companion by saying "they don't give them a name, they just
say ... sixteen-year-old girl, fifteen-year-old girl with a friend".
It is not for us to breach this expectation of literal anonymity.

Furthermore, it is not only the speakers themselves who should be
protected. For instance, the two speakers just referred to happen to
comment that one of their schoolmates, identified by Christian name,
behaves like a whore. This person is entitled to anonymity as much
as the speakers, and arguably more: she signed no release form for
the corpus compilers. When well-known public figures or institutions
are mentioned, the BNC compilers seem to have felt that there was no
need to anonymize the references at all. Clearly, if someone announces
that he has just bought the latest album by a named pop singer, there
is no point in concealing the singer's name. But it depends what is
said. One of the CHRISTINE texts contains a series of quite damaging
remarks about the management of a secondary school, named in the
BNC file. In another case, speakers comment adversely on the sexual
morality of a named American actress. Even American actresses, surely,
are entitled to have their honour guarded by corpus linguists.

Consequently, the CHRISTINE Corpus has taken the BNC anonymization policy
further, in the following ways.

Where a BNC file gives the name of an institution, or the surname of
a third-party individual (it never gives surnames for participants in
the dialogues), in a context where it seems possible that the
identification could cause embarrassment, CHRISTINE replaces the name
with the "<name>" entity.

Christian names of speakers are in all cases replaced by other Christian
names, both in identifying the utterers of speech-turns, and in the
transcription of words uttered. Each speaker represented in the
CHRISTINE Corpus is assigned a name and a three-digit code, e.g.
"Scott125". Each of the speaker's turns is headed by this name/number code;
and other participants in the dialogue are shown addressing him as
"Scott" -- but "Scott" is not the individual's real name. The three-digit
codes are unique across the CHRISTINE Corpus. The names are sometimes
shared by different speakers, as their real names are.
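
The labelling scheme just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
procedure actually used to build CHRISTINE: the function name, the
starting number, and the sample speaker codes and replacement names are
all hypothetical.

```python
from itertools import count

def assign_corpus_labels(speakers, start=1):
    """Map each BNC speaker code to a label of the form name + three digits.

    `speakers` is a list of (bnc_code, replacement_name) pairs. The
    replacement names are assumed to be chosen by hand and may repeat;
    the numbers never repeat, so every label is unique.
    """
    numbers = count(start)
    labels = {}
    for bnc_code, replacement_name in speakers:
        if bnc_code in labels:
            continue  # a speaker keeps a single label throughout
        n = next(numbers)
        if n > 999:
            raise ValueError("ran out of three-digit codes")
        labels[bnc_code] = f"{replacement_name}{n:03d}"
    return labels

# Hypothetical input: the first and third speakers share a replacement name.
labels = assign_corpus_labels(
    [("PS546", "Scott"), ("PS547", "Laura"), ("PS548", "Scott")],
    start=125,
)
print(labels["PS546"], labels["PS548"])  # Scott125 Scott127
```

Keeping the number separate from the name is what lets two speakers
share a name without their turns being confused.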

(An alternative would have been to attribute the speaker turns to the
BNC five-byte speaker codes, e.g. "PS546". But this gives the corpus
user no easy way to link the individuals who contribute particular
turns to their names used vocatively by other dialogue participants.
It is far easier to grasp what is going on in a dialogue, if one has
naturalistic names to hook the spoken interactions onto; the fact that
they are not the actual names of the speakers is scientifically irrelevant.)

Some Christian names of individuals not participating in a dialogue, but
who are talked about in it, are also changed, if the comments made about
them seem potentially embarrassing, or if the name might involve a
special risk of rendering the speakers identifiable.

The "noms de corpus" are chosen to be metrically equivalent to the
real names, and also as far as possible to be socially equivalent.
Obviously, male names are replaced by male names and female by female.
But, in addition, when a name seems to be associated with a particular
age-group, social class, and/or region, it is replaced by a name which
feels similar in those respects. When (say) a two-syllable formal name
alternates with a one-syllable abbreviation, the replacement name
is chosen to preserve the same pattern, and formal name and abbreviation
of the replacement name are inserted wherever formal and abbreviated
versions of the real name occur, respectively, in the original file.
If two participants in a dialogue share the same Christian name, their
"noms de corpus" are also the same (sometimes, the logic of the
dialogue depends on this kind of ambiguity of names).
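
As a rough illustration of the substitution step, the following Python
sketch replaces a formal name and its abbreviation with the formal and
abbreviated versions of a replacement name. It assumes the pairing of
names has already been made by hand; the function name and the example
names are hypothetical, and nothing here is taken from the tools actually
used for CHRISTINE.

```python
import re

def replace_names(text, name_map):
    """Replace each real name in `text` by its paired replacement name.

    Only whole words are matched, and the text is rewritten in a single
    pass, so a name that has just been inserted is never replaced again.
    """
    alternatives = "|".join(re.escape(name) for name in name_map)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: name_map[match.group(1)], text)

# Hypothetical pairing: two-syllable formal name, one-syllable short form.
name_map = {"Philip": "Stephen", "Phil": "Steve"}

print(replace_names("Phil said Philip's coming", name_map))
# Steve said Stephen's coming
```

Because the formal and abbreviated forms are separate entries in the
mapping, each occurrence keeps the register it had in the original file.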

Two kinds of turn in the original BNC files are not attributed to
speakers with identified Christian names. In many cases, the transcriber
could not decide which speaker produced a particular utterance,
and assigned the turn to an "empty" speaker code, usually "PS000".
(Sometimes, where it is clear that different speakers are involved but
neither is identifiable, PS000 and PS001 are used; however, a series
of turns all attributed to "PS000" sometimes appears in fact to have
been uttered by more than one speaker.) These turns are attributed
in CHRISTINE to speakers "unid0", "unid1" (for PS000, PS001 respectively).

In other cases, the BNC file assigns a "normal" speaker code which is
identified by the header as referring to a particular individual with
specified characteristics, but no name is included. In those cases,
CHRISTINE invents a "nom de corpus" which seems appropriate in terms
of the speaker's sex, age, etc. (Occasionally, if neither sex nor
real name is given, CHRISTINE uses the cover name "Anon".)
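
The treatment of these two kinds of turn might be summarized by the
following Python sketch. It is a simplification under assumptions: the
function and parameter names are hypothetical, it returns only the name
part of a speaker label, and it applies the "Anon" fallback whenever
neither a hand-chosen name nor the speaker's sex is available, whereas
the text above says only that this happens occasionally.

```python
# Empty BNC speaker codes and the labels used for them in CHRISTINE.
EMPTY_CODES = {"PS000": "unid0", "PS001": "unid1"}

def name_for_unnamed_speaker(bnc_code, sex=None, chosen_name=None):
    """Return the name used for a speaker whose real name is not given.

    `chosen_name` stands for a replacement name picked by hand to suit
    the speaker's sex, age, etc.; this sketch does not pick one itself.
    """
    if bnc_code in EMPTY_CODES:
        return EMPTY_CODES[bnc_code]  # turn could not be attributed
    if chosen_name is not None:
        return chosen_name
    if sex is None:
        return "Anon"  # nothing known on which to base a name
    raise ValueError(f"a replacement name must be chosen for {bnc_code}")

print(name_for_unnamed_speaker("PS000"))                    # unid0
print(name_for_unnamed_speaker("PS546", sex="m",
                               chosen_name="Scott"))        # Scott
print(name_for_unnamed_speaker("PS546"))                    # Anon
```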

It must be admitted that these procedures cannot offer a watertight
guarantee against speaker identification. Someone who was determined
to penetrate behind the veil of anonymity provided by CHRISTINE would
only have to link its files to the corresponding passages in the
original BNC files to discover the names we have concealed. There
is nothing we can do about that. But our policy greatly reduces the
chance of an accidental betrayal of informants' confidence. If any
of their identities should ever be revealed, it will not be the
fault of the CHRISTINE Corpus.