Really grateful for any comments / suggestions
----------------------------------------------
PROBLEM
I've been tinkering with the raw text files on the BNC CDROM and have found
some anomalies which confuse me.
These are:
1. The compressed files for A.tgz, B.tgz, and C.tgz unpack from the CDROM
to create a single text file - respectively A, B & C. All the other .TGZ
files unpack to create a .TAR file. Each of these can in turn be unpacked
to create a large number of individual corpus files. A useful arrangement
if you want to work with subsets of the BNC - which is my intention.
2. The large A, B & C files are text files. They appear to contain the
original data files but in a concatenated form. There also appears to be a
certain amount of "noise" in these files:
a) "loose" tags are included which also seem to be associated with the loss
of some tags - as in the example below where the tag for "should" has gone
missing:
<w NP0>EVELYN <w NP0>McEWEN <w NP0>Divisional <w NN1>Director<c PUN>,
<w NN2>Services
</p>
</div1>
</text>
</bncDoc>
should <w VBI>be
<w VVN>ensured <w PRP>for <w AJC>older <w NN2>workers<c PUN>;
These appear close to the end of texts at what seem to be arbitrary intervals
throughout the file - without any corresponding <text> starting tag.
b) there is a block of unprintable characters at the beginning of each text
- eg:
</B/B0/B02
440 15530 15000 1621613 5725401633 5212
Any idea what the problem / solution might be? I'm able to split the files
back into constituent texts using WordSmith, but then lose the file names -
it's all a bit confusing.
----------------------------------------------
LOU'S REPLY
On Thursday, December 31, 1998 11:49 PM, Lou Burnard
[SMTP:lou.burnard@computing-services.oxford.ac.uk] wrote:
> Hi Chris
>
> sorry not to have replied to your query earlier: it got swamped by
> xmas xcesses.
>
> the short answer is: upgrade your version of winzip. or unpack the cds
> with something that knows how to deal with a GNU tar file properly.
>
> I don't know why A B and C are different from the others (probably
> because they were done first) but, clearly what you are getting is a
> TAR archive instead of the proper file structure. Later versions of
> Winzip (mine is 6) recognize this file format correctly, and will
> unpack it into the correct file system.
>
> stand by for exciting announcements about the bnc sampler (wot? that
> old thing?)
>
> best wishes to you for 99
>
> Lou
----------------------------------------------
As I say above - the Winzip version doesn't seem to be the problem ...
Enlightenment greatly welcomed!
bestest
Chris Tribble
-- 
Sri Lanka 21 Wijerama Mawatha, Colombo 7 TEL +94 75 332 309
UK 122, Queen Alexandra Mansions, Judd Street London WC1 H 9DQ TEL +44 171 833 4271
UK Mailing c/o FCO (Colombo) The British Council: Sri Lanka King Charles Street, London SW1A 2AH
E-mail ctribble@serendib.ccom.lk
Home Page http://ourworld.compuserve.com/homepages/Christopher_Tribble