Corpora: Unpacking BNC with WinZip

Christopher Tribble (ctribble@lanka.ccom.lk)
Fri, 1 Jan 1999 11:29:46 +0530

Dear All - I've put this request to Lou, and appended his reply - I'd be
really grateful if anyone has any ideas. (BTW - I'm using Winzip 32 (6.3)
so I don't think Lou's thoughts that it's a Winzip problem hold)

Really grateful for any comments / suggestions

----------------------------------------------
PROBLEM

I've been tinkering with the raw text files on the BNC CDROM and find some
anomalies which confuse me.

These are:

1. The compressed files for A.tgz, B.tgz, and C.tgz unpack from the CDROM
to create a single text file - respectively A, B & C. All the other .TGZ
files unpack to create a .TAR file. Each of these can in turn be unpacked
to create a large number of individual corpus files. A useful arrangement
if you want to work with subsets of the BNC - which is my intention.

2. The large A, B, & C files are text files. They appear to contain the
original data files but in a concatenated form. There also appears to be a
certain amount of "noise" in these files:

a) "loose" tags are included which also seem to be associated with the loss
of some tags - as in the example below where the tag for "should" has gone
missing:

<w NP0>EVELYN <w NP0>McEWEN <w NP0>Divisional <w NN1>Director<c PUN>,
<w NN2>Services
</p>
</div1>
</text>
</bncDoc>
should <w VBI>be
<w VVN>ensured <w PRP>for <w AJC>older <w NN2>workers<c PUN>;
These appear close to the end of texts at what seem to be arbitrary intervals
throughout the file - without any corresponding <text> starting tag.

b) there is a block of unprintable characters at the beginning of each text
- eg:

</B/B0/B02

440 15530 15000 1621613 5725401633 5212
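
One possible reading of the sample above, offered as an assumption rather
than anything confirmed in this thread: a path followed by runs of digits
matches the layout of a tar header block, which stores the member name and
then mode, uid, gid, size, mtime and checksum as octal text at the start of
a 512-byte record. The Python sketch below decodes those fields so the idea
can be checked against the real files. parse_tar_header and first_header
are hypothetical helper names; the offsets are those of the standard tar
header.

```python
BLOCK_SIZE = 512  # tar stores headers and data in 512-byte records

# (field name, start offset, end offset) within the header block
_OCTAL_FIELDS = [
    ("mode", 100, 108),
    ("uid", 108, 116),
    ("gid", 116, 124),
    ("size", 124, 136),
    ("mtime", 136, 148),
    ("chksum", 148, 156),
]


def parse_tar_header(block):
    """Decode the leading fields of one 512-byte tar header block."""
    if len(block) < BLOCK_SIZE:
        raise ValueError("a tar header block is 512 bytes long")
    # The member name is NUL-padded text in the first 100 bytes.
    name = block[0:100].split(b"\0", 1)[0].decode("ascii", "replace")
    fields = {"name": name}
    for key, start, end in _OCTAL_FIELDS:
        # Numeric fields are octal digits padded with spaces and NULs.
        digits = block[start:end].strip(b" \0").decode("ascii")
        fields[key] = int(digits, 8) if digits else 0
    return fields


def first_header(path):
    """Read and decode the first 512 bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return parse_tar_header(handle.read(BLOCK_SIZE))
```

If first_header() on one of the large unpacked files returns a sensible
name and a size that looks right for a single corpus text, that would
suggest the file is still a tar archive rather than plain concatenated
text. A ValueError from int() would suggest this reading is wrong.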

Any idea what the problem / solution might be? I'm able to split the files
back into constituent texts using WordSmith, but then lose the file names -
it's all a bit confusing.
----------------------------------------------
LOU'S REPLY
On Thursday, December 31, 1998 11:49 PM, Lou Burnard
[SMTP:lou.burnard@computing-services.oxford.ac.uk] wrote:
> Hi Chris
>
> sorry not to have replied to your query earlier: it got swamped by
> xmas xcesses.
>
> the short answer is: upgrade your version of winzip. or unpack the cds
> with something that knows how to deal with a GNU tar file properly.
>
> I don't know why A B and C are different from the others (probably
> because they were done first) but, clearly what you are getting is a
> TAR archive instead of the proper file structure. Later versions of
> Winzip (mine is 6) recognize this file format correctly, and will
> unpack it into the correct file system.
>
> stand by for exciting announcements about the bnc sampler (wot? that
> old thing?)
>
> best wishes to you for 99
>
> Lou
----------------------------------------------
As I say above - the Winzip version doesn't seem to be the problem ...
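
Independently of the Winzip question, Lou's other suggestion (unpack with
something that understands tar) can be tried with a short script. The
sketch below uses Python's standard tarfile module and assumes the large
A, B and C files really are tar archives, gzipped or not, which nothing in
this thread confirms. unpack_bnc_archive and the paths in the usage note
are illustrative names only. Each member is written out under its own name,
so the file names are not lost as they are when the text is split by hand.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_bnc_archive(archive, dest):
    """Write each regular file in a tar archive under `dest`, keeping names.

    Mode "r:*" lets tarfile detect the compression itself, so the same
    call handles a .tgz file and an already-gunzipped tar file.
    Returns the list of member names that were written.
    """
    dest_root = Path(dest).resolve()
    written = []
    with tarfile.open(archive, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            target = (dest_root / member.name).resolve()
            if dest_root not in target.parents:
                continue  # ignore names that would land outside dest
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(member.name)
    return written
```

For example, unpack_bnc_archive("A", "bnc_texts") would be the call to try
on the single large file that A.tgz unpacks to. If tarfile raises
tarfile.ReadError, the file is not a tar archive after all and the
assumption above is wrong.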

Enlightenment greatly welcomed!

bestest

Chris Tribble

--
Sri Lanka	21 Wijerama Mawatha, Colombo 7
		TEL  +94 75 332 309
UK		122, Queen Alexandra Mansions, Judd Street
		London WC1 H 9DQ
		TEL +44 171 833 4271
UK Mailing	c/o FCO (Colombo)
		The British Council: Sri Lanka
		King Charles Street, London SW1A 2AH
E-mail		ctribble@serendib.ccom.lk
Home Page	http://ourworld.compuserve.com/homepages/Christopher_Tribble