Corpora: Tools needed to process British National Corpus

From: Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Date: Thu Sep 14 2000 - 11:35:10 MET DST

    I attach a minimalist Perl prog that does the job. Or you can find
    lists already generated on my website.

               Adam

    Kai Noponen wrote
    > I need a tool that can make a frequency list out of the BNC. It must
    > utilize the part-of-speech tags in order to separate the different cases.
    > It also should read SGML.

    -- 
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff                                
    Senior Research Fellow                         tel: (44) 1273 642919     
    Information Technology Research Institute           (44) 1273 642900 
    University of Brighton                         fax: (44) 1273 642908
    Lewes Road                        
    Brighton BN2 4GJ         email:      Adam.Kilgarriff@itri.bton.ac.uk
    UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    

    ==============cut here===================

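    $/ = "<w ";    # read the input in chunks, each ending at the start of the next w tag
    while (<>) {
        # skip chunks that do not look like "POS>word"; without this check
        # $1 and $2 would keep the values from the previous match
        next unless /^([^>]+)>([^<]+)/;
        $pos  = $1;
        $word = lc $2;      # all words normalised to lower case; delete 'lc' if you want to retain capitalisation
        $word =~ s/\n/ /g;  # line breaks inside a word become spaces
        $word =~ s/ +$//;   # drop trailing spaces
        $word =~ s/ /_/g;   # multiword 'words' will have _ between items ("in_order_to") instead of spaces
        $count{$word." ".$pos}++;
    }
    for (keys %count) { print "$_ $count{$_}\n" }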

    # words which, for some reason, weren't marked up with SGML w tag will be missed
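
The script above prints its counts in whatever order Perl's hash happens to return them. As a rough sketch of how the same counting could be wrapped in a subroutine and printed most-frequent-first, something like the following might serve. The subroutine name count_word_pos and the sample sentence are invented for illustration and are not part of the original post; the sample only imitates the "<w POS>word" markup the script expects.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count "word POS"
# pairs in a string of BNC-style SGML such as "<w AT0>the <w NN1>cat ".
sub count_word_pos {
    my ($sgml) = @_;
    my %count;
    # Split at the start of each w tag. The first chunk is whatever
    # precedes the first tag and holds no tagged word, so drop it.
    my @chunks = split /<w /, $sgml;
    shift @chunks;
    for my $chunk (@chunks) {
        # Each chunk should begin "POS>word"; skip anything else.
        next unless $chunk =~ /^([^>]+)>([^<]+)/;
        my ($pos, $word) = ($1, lc $2);
        $word =~ s/\n/ /g;   # line breaks inside a word become spaces
        $word =~ s/ +$//;    # drop trailing spaces
        $word =~ s/ /_/g;    # join multiword items with underscores
        $count{"$word $pos"}++;
    }
    return \%count;
}

# Invented sample text, for illustration only.
my $sample = "<s n=1><w AT0>The <w NN1>cat <w VVD>saw <w AT0>the <w NN1>dog<c PUN>.";
my $count  = count_word_pos($sample);

# Most frequent first; ties are broken alphabetically so the order is stable.
for my $key (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$key $count->{$key}\n";
}
```

On the sample sentence this prints "the AT0 2" first, followed by "cat NN1 1", "dog NN1 1" and "saw VVD 1". Slurping a whole file into one string is only reasonable for small inputs; for the full corpus the record-at-a-time loop in the original script is the more economical choice.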


