RE: Corpora: Unpacking BNC with WinZip

Burnard Towers (lou.burnard@computing-services.oxford.ac.uk)
Fri, 01 Jan 1999 22:16:05 +0000

Apologies for my earlier and misleading reply. Here's what you do.

1. Extract the files (using WinZip). This will create 3 big files called A,
B and C.
2. Rename each extracted file (i.e. A becomes A.tar, B becomes B.tar, C
becomes C.tar).
3. Now re-open each .tar file (A.tar etc.) with WinZip, which will unpack
it into the individual corpus files.
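
For anyone without WinZip to hand, the same unpacking can probably be done
with a short Python script. This is only a rough sketch, not something
tested against the BNC CDs: it assumes the extracted files A, B and C are
ordinary tar archives, and the function name unpack_tar and the output
folder bnc_texts are made up for illustration. Python's tarfile module
looks at the file contents rather than the name, so the renaming step
should not be needed here.

```python
import tarfile
from pathlib import Path


def unpack_tar(archive, dest):
    """Unpack a tar archive (whatever its file name) into dest.

    Returns the list of member names found in the archive.
    """
    archive = Path(archive)
    dest = Path(dest)
    # Check the contents, not the extension: "A" is fine if it is really a tar.
    if not tarfile.is_tarfile(str(archive)):
        raise ValueError(f"{archive} does not look like a tar archive")
    dest.mkdir(parents=True, exist_ok=True)
    # mode "r:*" lets tarfile detect plain or compressed tar by itself.
    with tarfile.open(str(archive), mode="r:*") as tar:
        names = tar.getnames()
        tar.extractall(str(dest))
    return names


# Hypothetical usage, run from the folder holding the extracted files:
#     for name in ("A", "B", "C"):
#         unpack_tar(name, "bnc_texts")
```

Because the member names are kept, the individual texts should come out
under their original file names rather than as one concatenated file.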

Happy new year!

Lou

>> Christopher Tribble wrote:
>>
>> > Dear All - I've put this request to Lou, and appended his reply - I'd be
>> > really grateful if anyone has any ideas. (BTW - I'm using Winzip 32 (6.3)
>> > so I don't think Lou's thoughts that it's a Winzip problem hold)
>> >
>> > Really grateful for any comments / suggestions
>> >
>> > ----------------------------------------------
>> > PROBLEM
>> >
>> > I've been tinkering with the raw text files on the BNC CDROM and find some
>> > anomalies which confuse me.
>> >
>> > These are:
>> >
>> > 1. The compressed files for A.tgz, B.tgz, and C.tgz unpack from the CDROM
>> > to create a single text file - respectively A, B & C. All the other .TGZ
>> > files unpack to create a .TAR file. Each of these can in turn be unpacked
>> > to create a large number of individual corpus files. A useful arrangement
>> > if you want to work with subsets of the BNC - which is my intention.
>> >
>> > 2. The large A, B, & C files are text files. They appear to contain the
>> > original data files but in a concatenated form. There also appears to be a
>> > certain amount of "noise" in these files:
>> >
>> > a) "loose" tags are included which also seem to be associated with the loss
>> > of some tags - as in the example below where the tag for "should" has gone
>> > missing:
>> >
>> > <w NP0>EVELYN <w NP0>McEWEN <w NP0>Divisional <w NN1>Director<c PUN>,
>> > <w NN2>Services
>> > </p>
>> > </div1>
>> > </text>
>> > </bncDoc>
>> > should <w VBI>be
>> > <w VVN>ensured <w PRP>for <w AJC>older <w NN2>workers<c PUN>;
>> > These appear close to the end of texts at what seem to be arbitrary intervals
>> > throughout the file - without any corresponding <text> starting tag
>> >
>> > b) there is a block of unprintable characters at the beginning of each text
>> > - eg:
>> >
>> > </B/B0/B02
>> >
>> > 440 15530 15000 1621613 5725401633 5212
>> >
>> > Any idea what the problem / solution might be? I'm able to split the files
>> > back into constituent texts using WordSmith, but then lose the file names -
>> > it's all a bit confusing.
>> > ----------------------------------------------
>> > LOU'S REPLY
>> > On Thursday, December 31, 1998 11:49 PM, Lou Burnard
>> > [SMTP:lou.burnard@computing-services.oxford.ac.uk] wrote:
>> > > Hi Chris
>> > >
>> > > sorry not to have replied to your query earlier: it got swamped by
>> > > xmas xcesses.
>> > >
>> > > the short answer is: upgrade your version of winzip. or unpack the cds
>> > > with something that knows how to deal with a GNU tar file properly.
>> > >
>> > > I don't know why A B and C are different from the others (probably
>> > > because they were done first) but, clearly what you are getting is a
>> > > TAR archive instead of the proper file structure. Later versions of
>> > > Winzip (mine is 6) recognize this file format correctly, and will
>> > > unpack it into the correct file system.
>> > >
>> > > stand by for exciting announcements about the bnc sampler (wot? that
>> > > old thing?)
>> > >
>> > > best wishes to you for 99
>> > >
>> > > Lou
>> > ----------------------------------------------
>> > As I say above - the Winzip version doesn't seem to be the problem ...
>> >
>> > Enlightenment greatly welcomed!
>> >
>> > bestest
>> >
>> > Chris Tribble
>> >
>> > --
>> > Sri Lanka 21 Wijerama Mawatha, Colombo 7
>> > TEL +94 75 332 309
>> > UK 122, Queen Alexandra Mansions, Judd Street
>> > London WC1 H 9DQ
>> > TEL +44 171 833 4271
>> > UK Mailing c/o FCO (Colombo)
>> > The British Council: Sri Lanka
>> > King Charles Street, London SW1A 2AH
>> > E-mail ctribble@serendib.ccom.lk
>> > Home Page http://ourworld.compuserve.com/homepages/Christopher_Tribble