MANUAL OF INFORMATION

TO ACCOMPANY

THE AUSTRALIAN CORPUS OF ENGLISH (ACE)

MACQUARIE UNIVERSITY

BY

PAM PETERS

WITH THE ASSISTANCE OF

ADAM SMITH

MANUAL TO ACCOMPANY THE AUSTRALIAN CORPUS OF ENGLISH (ACE)

Introduction



The Australian Corpus of English (ACE) was compiled in the Department of Linguistics at Macquarie University, NSW, Australia, from 1986 on. It was supported by a small grant in 1988-9 from the Australian Research Grants Council, and by a series of grants from Macquarie University. Other support came from the National Languages and Literacy Institute of Australia and the University of New South Wales. The project was conceived by Pam Peters, Peter Collins and David Blair, and was carried through with the help of a number of research assistants, notably Alison Moore, Elizabeth Green, Robert Jenkins, Catherine Martin, Diana Grace, Heather Middleton, Wendy Young and Adam Smith. Computational help and advice were provided by Harry Purvis and Steve Cassidy, and the project enjoyed continuous infrastructure support from Macquarie's Speech, Hearing and Language Research Centre.

Contents


Introduction
Rationale
Sampling procedures : overview
Subjects and genres included within categories
Sample size and Coding
Appendix 1 : Corpus versions
Appendix 2 : Published papers associated with the Corpus
Text extracts

 

Rationale



ACE was the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. Interest in the differentiation between Australian, British and American English meant that a corpus modeled on the Brown and LOB corpora would provide ready comparisons. It would also serve as a strategic sample of current Australian English, and as a reference corpus for comparisons with more specialised, homogeneous corpora in Australia.

ACE matches the Brown and LOB corpora in most aspects of its structure and constituency, so that direct interdialectal comparisons can be made on a comparable range of printed genres. (The few small points of difference are outlined below, pp. 3 to 8.) Yet the desire to create an up-to-date corpus of Australian English prompted the decision not to match Brown and LOB chronologically, i.e. with data drawn from publications of the early 1960s. Instead, ACE consists of material from 1986. A time difference is therefore inherent in any regional intercomparisons with Brown and LOB, though that may itself be of considerable interest in showing the direction of influence in the latter part of this century. The twenty-five year difference in fact allowed us to match rather more categories of publishing than would have been possible had we attempted to create a retrospective corpus of Australian publications of the 60s (as LOB did). Independent southern hemisphere publishing has increased steadily since World War II, yet even in 1986 the range of locally published novels was limited and insufficient for the quota required by the Brown/LOB model. The quota was topped up with a higher proportion of extracts from short stories than was used in the model corpora. (See below, Table 4, p.5.)

Sampling procedures : overview



The prime objective in compiling ACE was to match the balance of genres represented in Brown and LOB, and to create a more or less equivalent set of 2000-word samples in each category. This provided quantitative targets in each of the fifteen categories of Brown and LOB, and the numbers of samples in ACE categories A to J are closely matched with them, as shown below in Table 1. The fiction categories in ACE are slightly different in their constituency, for reasons explained below, p.8, but the total number of fiction samples remains the same.

Table 1. Makeup of the three corpora

Category                                    ACE   Brown   LOB
A  Press: reportage                          44      44    44
B  Press: editorial                          27      27    27
C  Press: reviews                            17      17    17
D  Religion                                  17      17    17
E  Skills, trades, and hobbies               38      36    38
F  Popular lore                              44      48    44
G  Belles lettres, biography, essays         77      75    77
H  Miscellaneous (government documents,      30      30    30
   foundation reports, industry reports,
   college catalogue, industry house organ)
J  Learned and scientific writings           80      80    80
K  General fiction                           29      29    29
L  Mystery and detective fiction             15      24    24
M  Science fiction                            7       6     6
N  Adventure and western fiction (bush)       8      29    29
P  Romance and love story                    15      29    29
R  Humor                                     15       9     9
S  Historical fiction                        22       -     -
W  Women’s fiction                           15       -     -
Total                                       500     500   500

 

Within each corpus category, the sampling procedures were mostly strategic rather than random, because of the felt need to match subgenres and subject areas where possible. In some categories, e.g. fiction, the corpus requirements were such that we sampled almost every Australian monograph published in that year, and so the representation in ACE is almost total. Where there was a choice, as with the selection of monographs in some nonfiction categories, we gave preference to those which were held in multiple libraries in several states, and therefore probably had more readers and more impact. Among the serials, both popular and scholarly, the selection was usually dictated by subject, to ensure a spread of interests and disciplines like the broad range captured by our predecessors.

 

 

Table 2. Sampling of Australian newspapers for categories A, B, C

(* indicates tabloid format, but not necessarily low-brow journalism.)

Newspaper                       Circulation 1986    A    B    C
National
  The Australian                         134,000    1    1    1
  Australian Financial Review*            66,000    1    1    -
  National Times                          86,000    1    1    -
  Weekly Times*                           46,000    1    1    -
New South Wales
  Daily Mirror*                          296,000    3    1    1
  Daily Telegraph                        265,000    2    1    1
  The Sun*                               258,000    2    1    1
  Sydney Morning Herald                  255,000    2    1    1
  Sun-Herald*                            650,000    3    1    1
A.C.T.
  Canberra Times                          45,000    1    1    -
Victoria
  The Age                                233,000    2    1    1
  The Herald                             237,000    2    1    1
  Sun News-Pictorial*                    549,000    5    1    1
  Sunday Press*                          140,000    1    1    -
Queensland
  Courier-Mail                           217,000    2    1    1
  Daily Sun*                             133,000    1    1    1
  Telegraph*                             119,000    1    1    1
  Sunday Sun*                            375,000    2    1    1
South Australia
  Adelaide Advertiser                    211,000    2    1    1
  The News*                              159,000    1    1    1
  Sunday Mail*                           254,000    1    1    -
West Australia
  The West Australian*                   238,000    2    1    1
  Daily News*                             98,000    1    1    -
  Sunday Times*                          251,000    1    1    1
Tasmania
  The Mercury                             55,000    1    1    -
  Sunday Tasmanian                        40,000    1    1    -
Northern Territory
  Northern Territory News                 18,000    1    1    -
Total Number of Samples                            44   27   17

 

We also targeted both Sunday and weekly papers, but the predominance of Sunday papers in Australia means that ACE is closer to LOB in this respect, as shown in Table 3.

 

Table 3. Sampling of reportage, editorial matter and reviews from daily, weekly and Sunday newspapers in the three corpora

                          ACE   Brown   LOB
A Press: Reportage
  Daily                    33      33    33
  Weekly                    2      11     4
  Sunday                    9       -     7
  Total                    44      44    44

B Press: Editorial
  Daily                    19      19    19
  Weekly                    2       8     3
  Sunday                    6       -     5
  Total                    27      27    27

C Press: Reviews
  Daily                    14      14     8
  Weekly                    -       3     4
  Sunday                    3       -     5
  Total                    17      17    17

The overall balance of samples from books/monographs to articles/short stories is shown in Table 4. Further details on sampling are discussed with the individual categories below.

Table 4. Monographs v. articles/short stories

                          ACE   Brown   LOB
D: Religion
  Books                     7       7     9
  Periodicals               7       6     7
  Tracts                    3       4     1
  Total                    17      17    17

E: Skills, Trades and Hobbies
  Books                     -       2     5
  Periodicals              38      34    33
  Total                    38      36    38

F: Popular Lore
  Books                    18      23    16
  Periodicals              26      25    28
  Total                    44      48    44

G: Belles Lettres etc.
  Books                    38      38    41
  Periodicals              39      37    36
  Total                    77      75    77

H: Miscellaneous
  Gov. Documents           25      24    24
  Foundation Reports        -       2     2
  Industry Reports          2       2     2
  Univ. catalogue           1       1     1
  Ind. House Organ          2       1     1
  Total                    30      30    30

J: Learned
  Monographs               47      41    35
  Articles                 33      39    45
  Total                    80      80    80

K: General Fiction
  Novels                    9      20    20
  Short stories            20       9     9
  Total                    29      29    29

L: Mystery/Detective
  Novels                   10      20    21
  Short stories             5       4     3
  Total                    15      24    24

M: Science Fiction
  Monographs                2       3     3
  Short stories             5       3     3
  Total                     7       6     6

N: Adventure/Western (Bush)
  Monographs                4      15    15
  Short stories             4      14    14
  Total                     8      29    29

P: Romance/Love
  Monographs                6      14    16
  Short stories             9      15    13
  Total                    15      29    29

R: Humor
  Monographs               10       3     3
  Short stories             5       6     6
  Total                    15       9     9

S: Historical Fiction
  Monographs               15       -     -
  Short stories             7       -     -
  Total                    22       -     -

W: Women’s Fiction
  Monographs                8       -     -
  Short stories             7       -     -
  Total                    15       -     -

Subjects and genres included within categories



 

 

Table 5. Types of reporting represented in the three corpora

                          ACE   Brown   LOB
A Press: Reportage
  Political                14      14    13
  Sports                    7       7     7
  Society                   -       3     3
  Spot News                 7       9    10
  Financial                 7       4     4
  Cultural                  -       7     7
  Living                    9       -     -

 

 

Table 6. Subjects represented in Categories E and F

E Skills, trades and hobbies                  ACE   LOB
  Homecraft, handyman                           7     5
  Hobbies                                       6     5
  Music, dance                                  3     3
  Pets                                          1     1
  Sport                                         4     4
  Food, wine                                    2     2
  Travel                                        2     2
  Miscellaneous                                 1     4
  Trade, professional journals                  9     9
  Farming                                       3     3

F Popular lore
  Popular politics, psychology, sociology      15    22
  Popular education                             3     -
  Personal development                          4     -
  Popular history                               8     8
  Popular health, medicine                      3     3
  Culture                                       4     4
  Miscellaneous                                 7     7

 

 

Table 7. Genres included in Category G

G Belles lettres, biography, essays           ACE   LOB
  Biography, memoirs                           35    35
  Literary essays and criticism                 6     6
  Arts                                          9     9
  General essays                               27    27

 

 

 

Table 8. Academic disciplines of Category J

                                  ACE   Brown   LOB
J Learned
  Natural Sciences                 12      12    12
  Medicine                          5       5     5
  Mathematics                       4       4     4
  Soc. Sciences                    14      14    14
  Pol. Science, Law, Education     15      15    15
  Humanities                       18      18    18
  Technology and Engineering       12      12    12

 

Sample size and Coding

Each sample is notionally 2000 words, the counts being done via Word for Windows 6, with the coding excluded. The samples contain a minimum of 2000 words, though most are a little more than that in order to complete the final sentence. A few, especially from the disciplines of mathematics and science, have a larger buffer because of the high proportion of formulae in them, which tended to fragment the discourse.

Texts are in ASCII format, with each category and each sample prefaced by coded identification. The samples carry details of the sources from which they were obtained and individual headings or titles. Within the texts there is a limited amount of markup in SGML-style codes, both for certain discrete elements such as bylines or formulae, and for certain nonalphabetic symbols. Mathematical and scientific symbols beyond those of the Greek alphabet were covered by a generic annotation (&symbol;). The markup <note></note> was used for a variety of extra-corpus material, both editorial comment and components of the text itself which stood outside the ongoing discourse, such as extended quotations, graphs or tables.

Format/Comment Coding

<section></section> at the start and end of each category

<title></title> around the title of each category

<sample></sample> around each sample

<subsample></subsample> around any subsample

<id></id> around the sample number

<source></source> around the name of the source from which the sample was taken

<h></h> around the heading or title of each sample/subsample

<bl></bl> around any bylines

<list></list> around extended lists

<note></note> to enclose any additional comments or text not to be included within the wordcount

<misc></misc> around unpunctuated or irregularly punctuated sections

* replacement of typographic or spelling errors in original, e.g. assessment*assesement

+ replaces hyphen at line-break, e.g. proces+sors

&formula; replacing any complex formula

&symbol; replacing any symbol not listed below individually
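
The coding conventions above suggest how a word count that leaves out the coding might be approximated by a corpus user. The following is only a minimal sketch in Python, not the procedure used by the compilers (the counts were done in Word for Windows 6). The function name count_words, the decision to leave out the <id> and <source> fields, the treatment of every whitespace-separated string as one word, and the sample text itself are all assumptions made for illustration.

```python
import re

# <note>...</note> encloses material that is not part of the word count.
NOTE_RE = re.compile(r"<note>.*?</note>", re.DOTALL)
# Assumption: the sample number and source name identify the sample
# and are not counted as running text.
FIELD_RE = re.compile(r"<(id|source)>.*?</\1>", re.DOTALL)
# Any remaining SGML-style tag, e.g. <h>, </bl>, <misc>.
TAG_RE = re.compile(r"</?[a-z]+>")


def count_words(sample_text: str) -> int:
    """Count whitespace-separated words in one sample, ignoring coding."""
    text = NOTE_RE.sub(" ", sample_text)   # drop notes and their content
    text = FIELD_RE.sub(" ", text)         # drop id and source fields
    text = TAG_RE.sub(" ", text)           # drop tags, keep enclosed text
    return len(text.split())


# Hypothetical sample, invented for this sketch (not taken from ACE).
sample = (
    "<sample><id>A01</id><source>Hypothetical Daily, 1986</source>\n"
    "<h>Example heading</h>\n"
    "The council met on Tuesday <note>table omitted</note> to discuss proces+sors.\n"
    "</sample>"
)

print(count_words(sample))  # 10
```

In this sketch a word rejoined with + (proces+sors) counts once, and an entity such as &formula; also counts as one word. Whether headings and bylines were included in the original counts is not stated, so the sketch simply keeps them.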

 

Symbol Coding

&amp;         &       &epsilon;     ε
&pound;       £       &theta;       θ
&bullet;      •       &eta;         η
&deg;         °       &zeta;        ζ
&reg;         ®       &lambda;      λ
&para;        ¶       &caplambda;   Λ
&ohm;         Ω       &mu;          μ
&alpha;       α       &rho;         ρ
&beta;        β       &sigma;       σ
&gamma;       γ       &upsilon;     υ
&capgamma;    Γ       &psi;         ψ
&delta;       δ       &capomega;    Ω
&capdelta;    Δ
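
For users who want to display the texts with the original characters restored, the symbol codes can be mapped back with a simple lookup. The sketch below is an illustration only: the names ENTITY_TO_CHAR and decode_entities are hypothetical, and the Greek-letter codes are read as the Greek letters that their names denote, which is an assumption about the intended characters. The generic codes &formula; and &symbol; are left untouched, since the material they replace cannot be recovered.

```python
import re

# Lookup from code name to character. The Greek entries assume that each
# code stands for the Greek letter it is named after.
ENTITY_TO_CHAR = {
    "amp": "&", "pound": "£", "bullet": "•", "deg": "°",
    "reg": "®", "para": "¶", "ohm": "Ω",
    "alpha": "α", "beta": "β", "gamma": "γ", "capgamma": "Γ",
    "delta": "δ", "capdelta": "Δ", "epsilon": "ε", "theta": "θ",
    "eta": "η", "zeta": "ζ", "lambda": "λ", "caplambda": "Λ",
    "mu": "μ", "rho": "ρ", "sigma": "σ", "upsilon": "υ",
    "psi": "ψ", "capomega": "Ω",
}

ENTITY_RE = re.compile(r"&([a-z]+);")


def decode_entities(text: str) -> str:
    """Replace known symbol codes; leave unknown or generic codes as they are."""
    def replace(match: re.Match) -> str:
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(replace, text)


print(decode_entities("&pound;5 &amp; 10&deg; &symbol;"))  # £5 & 10° &symbol;
```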

 

APPENDIX I



Corpus versions

ACE exists in two versions:

ACE I This is the full version, containing all 500 samples, available for interrogation via CD ROM or Internet connection

ACE II This reduced version includes 75% of ACE I, that is 375 samples available for unrestricted use. (The remaining 25% could not be copyright-cleared for use throughout the world.) The samples excluded are listed below:

  E01 E03 E04 E05 E06 E08 E09 E10 E11
  E14 E20 F06 F09 F10 F13 F14 F17 F21    
  F22 F23 F25 F26 F28 F29 F44 G01 G02
  G03 G04 G05 G07 G08 G09 G10 G14 G17
  G19 G21 G26 G30 G41 G44 G47 G65 G66
  G69 J01 J02 J07 J10 J12 J13 J15 J22
  J25 J26 J29 J30 J39 J40 J44 J49 J52
  J54 J55 J56 J60 J62 J63 J65 J72 J78
  K02 K03 K04 K07 K09 K11 K12 K13 K14
  K15 K16 K18 K19 K21 K22 K24 K26 L02
  L03 L04 L06 L09 M05 N05 N08 P02 P04
  P05 P12 R04 R06 R07 R12 R15 S01 S02
  S05 S06 S08 S10 S12 S13 S14 S15 S19
  S21 W01 W03 W04 W06 W09 W10 W11  
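
The relation between the two versions can be checked with a little arithmetic: removing the samples listed above from the full set of 500 should leave 375. The sketch below, in Python, is an illustration under stated assumptions: sample identifiers are taken to be the category letter followed by a two-digit running number (as the excluded list suggests), the per-category counts are those given for ACE in Table 1, and the names SAMPLES_PER_CATEGORY, EXCLUDED, ace1_ids and ace2_ids are hypothetical.

```python
# Number of ACE samples per category, as given in Table 1.
SAMPLES_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38, "F": 44, "G": 77,
    "H": 30, "J": 80, "K": 29, "L": 15, "M": 7, "N": 8, "P": 15,
    "R": 15, "S": 22, "W": 15,
}

# Samples excluded from ACE II, copied from the list above.
EXCLUDED = set("""
E01 E03 E04 E05 E06 E08 E09 E10 E11
E14 E20 F06 F09 F10 F13 F14 F17 F21
F22 F23 F25 F26 F28 F29 F44 G01 G02
G03 G04 G05 G07 G08 G09 G10 G14 G17
G19 G21 G26 G30 G41 G44 G47 G65 G66
G69 J01 J02 J07 J10 J12 J13 J15 J22
J25 J26 J29 J30 J39 J40 J44 J49 J52
J54 J55 J56 J60 J62 J63 J65 J72 J78
K02 K03 K04 K07 K09 K11 K12 K13 K14
K15 K16 K18 K19 K21 K22 K24 K26 L02
L03 L04 L06 L09 M05 N05 N08 P02 P04
P05 P12 R04 R06 R07 R12 R15 S01 S02
S05 S06 S08 S10 S12 S13 S14 S15 S19
S21 W01 W03 W04 W06 W09 W10 W11
""".split())

# Assumption: identifiers run from 01 up to the category total.
ace1_ids = [
    f"{category}{number:02d}"
    for category, count in SAMPLES_PER_CATEGORY.items()
    for number in range(1, count + 1)
]

# ACE II is ACE I minus the samples that could not be copyright-cleared.
ace2_ids = [sample_id for sample_id in ace1_ids if sample_id not in EXCLUDED]

print(len(ace1_ids), len(EXCLUDED), len(ace2_ids))  # 500 125 375
```

Under these assumptions the 125 excluded samples amount to 25% of ACE I, which agrees with the figures given for ACE II.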

 

APPENDIX II



Published Papers Associated With The ACE Corpus

1. Peters, P. Towards a corpus of Australian English. ICAME JOURNAL No.11 (1987), 27-38. (ICAME = International Computer Archive of Modern English).

2. Collins, P. and Peters, P. The Australian corpus project in Corpus linguistics, hard and soft, ed. M Kyto et al. Amsterdam: Rodopi (1988), 103-120.

3. Collins, P. Computer corpora in English language research: a critical survey. AUSTRALIAN REVIEW OF APPLIED LINGUISTICS 10 i (1987), 1-19.

4. Peters, P., Collins, P., Blair, D. and Brierley, A. The Australian corpus project, findings on some functional variants in the Australian press. AUSTRALIAN REVIEW OF APPLIED LINGUISTICS 11 i (1988), 22-33.

5. Collins, P. The semantics of some modals in contemporary Australian English. AUSTRALIAN JOURNAL OF LINGUISTICS 8 (1988), 233-258.

6. Peters, P. and Fee, M. New configurations: the balance of British and American English features in Australian and Canadian English. AUSTRALIAN JOURNAL OF LINGUISTICS 9 (1989), 135-147.

7. Peters, P. The Australian corpus project: word punctuation in newspapers, in Frontiers of style: proceedings of Style Councils 87 and 88, ed. P.H. Peters. Sydney: Dictionary Research Centre, Macquarie University (1990) 72-79.

8. Peters, P., Purvis, H., Martin, C. and Jenkins, R. Word frequencies from the Macquarie corpus: the newspaper files. WORKING PAPERS OF THE SPEECH, HEARING AND LANGUAGE RESEARCH CENTRE, MACQUARIE UNIVERSITY (1990) 13-92.

9. Green, E. and Peters, P. The Australian corpus project and Australian English. ICAME JOURNAL no.15 (1991) 37-53.

10. Collins, P. The modals of obligation & necessity in Australian English, in English Corpus Linguistics, edd. Aijmer and Altenberg. London: Longman (1991) 145-165.

11. Peters, P. American & British English in Australian Usage, in Style on the move: proceedings of Style Council 92, ed. P.H. Peters. Sydney: Dictionary Research Centre, Macquarie University (1993) 20-27.

12. Peters, P. Corpus evidence on some points of usage, in J. Aarts et al. edd. English language corpora: design, analysis and exploitation Amsterdam: Rodopi (1993) pp. 247-256

13. Peters, P. American and British influence in Australian verb morphology, in U. Fries et al. edd. Creating and Using English Language Corpora Amsterdam: Rodopi (1994) pp. 149-158

14. Collins, P. Get-passives in English. World Englishes 15:1 (March 1996) pp. 43-56

15. Peters, P. Comparative insights into comparison. World Englishes 15:1 (March 1996) pp. 57-68

16. Peters, P. and Delbridge, A. Fowler’s Legacy in E. Schneider ed. Englishes Around The World vol. 2 Amsterdam, John Benjamins (1997) pp. 301-318

 

Text extracts



M-N  W