Corpora: Legal Language - Summary

Frances Rock
Fri, 4 Dec 1998 17:10:07 GMT

Dear all

Many thanks to all the people who kindly responded to my query about
the existence of corpora made up of legal texts.

I had lots of very useful replies which are summarised below.

All good wishes


CONTRIBUTORS (Hopefully I've not missed anyone off!):
Oliver Mason
Brian Ulicny
Su'Ad Awab
Raman Chandrasekar
"Ted E. Dunning"
Bob Krovetz
Gerry Nelson
Geoffrey Sampson
Lou Burnard
David Graff
Judy Delin
Ralf Steinberger

The LA Times is often good for posting such information.
Here are links to documents in the Microsoft anti-trust case:

A large number of documents relating to the OJ Simpson trial can be
found at the Court TV site:
---------------------------------
The Danish-English-French corpus in
Contract Law, better known as the AARHUS corpus, contains 1 million
words in each language. It is not a parallel corpus.

The corpus is subdivided into 6 types of texts:
i) statutes, rules and regulations
ii) travaux préparatoires (reports from law reform committees)
iii) judgements/decisions
iv) contracts
v) legal textbooks
vi) articles in law journals

There is also a database called Legal Data Consortium (LDC)
made available by the US Dept. of Justice. It contains, among others,
categories such as Case Law, Federal Regulation, International
Agreement, Statutory Law. Email the following address to order a copy: or
LDC can also be accessed at URL:
LDC is actually the Linguistic Data Consortium (at the
University of Pennsylvania) supported in part by an NSF grant.


Announcing a NEW CORPUS from the LDC

JURIS (Justice Department Retrieval and Inquiry System) Text Corpus

The text data contained on this two-CD-ROM set
represent a release of the JURIS (Justice Department
Retrieval and Inquiry System) data collection that
has been made available to the Linguistic Data
Consortium (LDC) by the U.S. Department of Justice.
The time span of the text ranges from the 1700's to
the early 1990's.

There are 1664 individual text files in the corpus,
1011 on the first CD-ROM, and 653 on the second. The
original archive consisted of 219 files ranging
between less than 1 MB and nearly 70 MB in size. In
order to make the data more accessible for research
use, we chose to divide the larger files into pieces,
such that the average file size was about 2 MB when
uncompressed (the largest uncompressed file size is
about 4.5 MB). Divisions of the files were done at
document boundaries, so all files contain whole
documents.

There are a total of 694,667 document units in the
corpus, and these can be categorized to some extent
with regard to their content. The following is a
partial list of categories and their descriptions
drawn from JURIS documentation contained in the
corpus. The terminology and organization of
categories are those used in the JURIS documentation:


Published Comptroller General Decisions; Unpublished
Comptroller General Decisions; Opinions of the
Attorney General; Office of Legal Counsel (US Dept.
of Justice Board of Contract Appeals; ADP Protest
Report (Summary of ADP Procurement Protests before
the GSBCA); Federal Labor Relations Authority Case
Decisions; FLRA Administrative Law Judge Decisions;
Federal Service Impasses Decisions; Decisions and
Reports on Rulings of the Assistant Sec. of Labor
for Labor Management Relations; Federal Labor
Relations Council Rulings on Requests of the Asst.
Sec. of Labor for Labor Management Relations; HUD
Administrative Law Decisions; Merit System Protection
Board Decisions; Decisions under Immigration and
Nationality Laws; Environmental Protection Agency
General Counsel Opinions; Equal Opportunity
Commission Decisions; Equal Employment Opportunity
Commission Policy Statements; US Office of Government
Ethics Decisions; HHS Department Appeals Board


Office of the Solicitor General; Civil Division;
Civil Division Trial; Environmental and Natural
Resources Division; Tax Division Criminal Appellate;
US Attorney's Offices; US Trustees' Offices.


U.S. Supreme Court; Federal Reporter, 2nd Series;
Court of Appeals Unpublished Decisions; Federal
Supplement; Federal Rules Decisions; Atlantic 2nd
Reporter (DC only); Bankruptcy Reporter; Courts of
Military Review; Military Justice Reporter; Court of


FOIA Update Newsletter; DOJ Guide to the FOIA Case
List Publications.


Code of Federal Regulations; Unified Agenda of
Federal Regulations; Defense Acquisition Regulations.


United States Treaties and Other International
Agreements; Department of Defense Unpublished
International Agreements.


Opinions of the Solicitor (Dept. of Interior);
Ratified Treaties; Unratified Treaties; Presidential
Proclamations; Executive Orders and Other Orders
Pertaining to Indians.


Decisions Under Immigration and Nationality Law;
Title 8 - Code of Federal Regulations; Immigration
Reform and Control Act of 1988, Legislative History;
Equal Access to Justice Act, Legislative History.


Public Laws; United States Code; Executive Orders;
Anti-Drug Abuse Act of 1988; Section-by-section
analysis of anti-drug abuse act of 1988; Criminal
Division Handbook on CCCA; The Organic Laws of the
United States.


US Tax Court Decisions; US Board of Tax Appeals
Decisions; Tax Division's Summons Enforcement
Decisions; Tax Division's Tax Protester Case List;
Tax Division's Criminal Tax Manual; Tax Division's
Criminal Tax Indictment/Information Forms; Tax
Division's Standardized Criminal Tax Jury
Instructions; Tax Division's Criminal Section
Newsletter; Tax Court Memorandum Decisions; IRS
Cumulative Bulletin; Tax International Acts; IRS News
Releases; IRS General Counsel Memoranda; IRS Actions
on Decisions; IRS Technical Memoranda.


United States Attorney's Manual; United States
Trustees' Manual; Federal Personnel Manual; Federal
Acquisition Regulations; Federal Acquisition
Circulars; Federal Travel Regulation; Federal
Information Resources Management Regulation; Federal
Property Management Regulations; Principles of
Federal Appropriations Law; Justice Department
Acquisition Regulation; Justice Property Management


Civil Division Monographs; Civil Division Torts
Branch Handbook on damages under FTCA; Criminal
Division Monographs; Criminal Division Forms;
Criminal Division Guidelines for Drafting
Indictments; Criminal Division Narcotics; Forfeiture,
Prosecution Manual; Criminal Division Directory of
Services; Asset Forfeiture Manuals; Obscenity
Enforcement Reporter; Environmental and Natural
Resources Division Monographs; US Sentencing
Commission's Guidelines Manual; Sentencing Guidelines

The text files are all formatted using a set of SGML
tags to mark document boundaries, and to mark major
structural features within documents. As with file
organization, the markup is derived from the document
structures as provided by the Justice Department.

Institutions that have membership in the LDC during
the 1998 Membership Year will be able to receive this
corpus in the same manner as all other text and
speech corpora published by the LDC. Nonmembers may
purchase JURIS for $1500.

If you would like to order a copy of this corpus,
please email your request to the LDC. If you need additional
information before placing your order, or would like
to inquire about membership in the LDC, please send
email or call (215) 898-0464.
----------------------------------
There is a big chunk of the
Federal Register in TREC. This is text which consists largely of
legal documents.

There are a number of technical standards which are available.

There is also a rather small ILO corpus which isn't really legal

The US Supreme Court decisions are now all going on-line.
---------------------------------
The ICE-GB corpus contains 40,000+
words of court proceedings from the Royal Courts of Justice in London.
These are divided into cross-examinations and legal presentations
(summations, judgements, etc). The corpus is fully parsed and is now
available. For more info, see
---------------------------------
There is some material in the
London-Lund Corpus of Spoken English which consists of transcriptions
of court proceedings (cross-examination, and a judge's "summing-up")
-- texts S.11.1, S.12.3, S.12.4 (I haven't checked whether this list
is exhaustive).
---------------------------------
The PELCRA corpus being built at the
University of Lodz includes a large quantity of transcribed police
interviews. They are all in Polish, of course.
---------------------------------
What I do have is a tape of the UK
International Client Interviewing Competition, which is apparently
held annually (and internationally) by law schools. An actor is
briefed on a complaint or issue about which s/he is approaching the
solicitor. Teams of two solicitors then interview the client about the
issue. I don't have direct access to the source of this but Kingston
University Law School was involved in it, and produced the video of
the 1991 competition. I just looked at the law school web site there
and they have something called a `negotiation competition', which may
be something else that they tape and distribute for training purposes.
So these aren't real situations, but realistic and usable data.

Mike Stubbs' book Text and Corpus Analysis looks at language in a
judge's summing-up, which you might find of interest if you haven't
already seen it.

If you find any sources of data, I'd be very grateful to hear of them.
---------------------------------
The European Commission publishes
large amounts of text on the internet, including many legal texts.
I don't know whether they are available on CD-ROM (they may be, for
payment), but in any case you can download the texts yourself from the

Try the URL as a starting point
for an explanation of what sort of texts there are. If you follow the
links 'OJ on the internet' and 'Eur-Lex', you can choose the language you want.

Frances Rock
Room 5 Flat 123 Block 21
The Tennis Court
Edgbaston Park Road
Birmingham B15 2RB