Corpora: Tools needed to process British National Corpus

From: Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Date: Thu Sep 14 2000 - 11:35:10 MET DST

Next message: Dirk Ludtke: "Re: Corpora: Question about a Brown Corpus tag"

Previous message: Harry Bunt: "Corpora: 3 Ph.D. positions in Tilburg"
In reply to: Kai Noponen: "Corpora: Tools needed to process British National Corpus"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I attach a minimalist Perl program that does the job. Alternatively, you
can find lists already generated on my website.

Adam

Kai Noponen wrote:
> I need a tool that can make a frequency list out of the BNC. It must
> utilize the part-of-speech tags in order to separate the different cases.
> It also should read SGML.

-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                                
Senior Research Fellow                         tel: (44) 1273 642919     
Information Technology Research Institute           (44) 1273 642900 
University of Brighton                         fax: (44) 1273 642908
Lewes Road                        
Brighton BN2 4GJ         email:      Adam.Kilgarriff@itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
==============cut here===================
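$/ = "<w ";   # read the input in chunks, each ending at the next "<w " tag opener
while (<>) {
    # a chunk should start "POS>word ..."; skip any that does not (e.g. a file header),
    # otherwise $1 and $2 would still hold the previous match
    next unless /^([^<>]+)>([^<]+)/;
    $pos  = $1;
    $word = lc $2;
# all words normalised to lower case --delete 'lc' if you want to retain capitalisation
    $word =~ s/\n/ /g;    # /g: a word may span more than one line break
    $word =~ s/ +$//;
    $word =~ s/ +/_/g;    # /g: join every item of a multiword unit, not just the first two
# multiword 'words' will have _ between items ("in_order_to") instead of spaces
    $count{$word." ".$pos}++;
}
for (keys %count) { print "$_ $count{$_}\n" }
# words which, for some reason, weren't marked up with the SGML w tag will be missed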
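
The final loop above prints the list in arbitrary order, since Perl hashes are unordered. If a ranked list is wanted, the following is a minimal sketch of sorting by descending frequency. It assumes the same "word POS" => count layout as %count above; the sample entries and the helper name by_freq are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Illustrative counts only; in the script above %count is filled from the corpus.
my %count = (
    "the AT0" => 3,
    "and CJC" => 1,
    "run VVI" => 2,
    "run NN1" => 2,
);

# Hypothetical helper: keys ordered by count, highest first;
# ties are broken alphabetically so the output order is repeatable.
sub by_freq {
    my ($c) = @_;
    return sort { $c->{$b} <=> $c->{$a} or $a cmp $b } keys %$c;
}

my @lines = map { "$_ $count{$_}" } by_freq(\%count);
print "$_\n" for @lines;
```

Piping the original script's output through an external sort on the third field would give a similar result without changing the script.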


This archive was generated by hypermail 2b29 : Thu Sep 14 2000 - 12:29:21 MET DST