Corpora: Sizer / chunker

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Tue, 23 Mar 1999 14:19:10 +0000 (GMT)

in Perl it's easy - input is standard input (or any files named on the
command line, since the loop reads <>), output is a series of files
called outfilename.1, outfilename.2 etc.

adam

================================================================

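use strict;
use warnings;

my $chunk_size = 10000;   # target words per chunk (or whatever)
my $chunk_num  = 1;
my $count      = 0;       # words written to the current chunk so far
my $out;                  # opened lazily, so no empty last file is left behind

while (my $line = <>) {
    # open the next chunk file only when there is a line to put in it
    unless (defined $out) {
        open $out, '>', "outfilename.$chunk_num"
            or die "cannot open outfilename.$chunk_num: $!";
    }
    print $out $line;

    # something fancy would get exactly the right number - this
    # program doesn't split lines, so a chunk can overshoot by up to
    # a line's worth of words (a dozen or so)
    $count += words_in_line($line);

    if ($count >= $chunk_size) {
        close $out or die "cannot close outfilename.$chunk_num: $!";
        undef $out;
        $chunk_num++;
        $count = 0;
    }
}
close $out if defined $out;

# words_in_line is a subroutine returning - you guessed it! - the
# number of words in the line. this is the default case of no markup:
# count the whitespace-separated tokens. swap in something smarter
# if your text has markup.
sub words_in_line {
    my ($line) = @_;
    my @words = split ' ', $line;
    return scalar @words;
}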
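
The comments in the program point at two refinements: counting words when
the text has markup, and getting exactly the right number of words in each
chunk. What follows is only a rough sketch of both, not part of the original
program. The names words_ignoring_tags and exact_chunks are made up for
illustration. The first assumes SGML-style tags that open and close on the
same line; the second assumes the whole input fits in memory and cuts a line
in two when it straddles a chunk boundary.

```perl
use strict;
use warnings;

# Hypothetical stand-in for words_in_line: count whitespace-separated
# words, ignoring anything inside <...>. Assumes a tag never spans lines.
sub words_ignoring_tags {
    my ($line) = @_;
    (my $text = $line) =~ s/<[^>]*>/ /g;   # turn each tag into a space
    my @words = split ' ', $text;          # ' ' splits on runs of whitespace
    return scalar @words;
}

# Hypothetical helper: cut a list of lines into strings holding exactly
# $chunk_size words each (the last one may hold fewer). Whitespace is
# kept as it was, so joining the chunks gives back the original text.
sub exact_chunks {
    my ($chunk_size, @lines) = @_;
    die "chunk size must be positive" unless $chunk_size > 0;
    my @chunks = ('');
    my $count  = 0;                        # words in the current chunk
    for my $line (@lines) {
        # the capturing group makes split return the whitespace too
        for my $piece (split /(\s+)/, $line) {
            next if $piece eq '';
            if ($piece =~ /\S/) {          # a word, not a separator
                if ($count == $chunk_size) {
                    push @chunks, '';      # current chunk is full
                    $count = 0;
                }
                $count++;
            }
            $chunks[-1] .= $piece;
        }
    }
    pop @chunks if $chunks[-1] eq '';      # no input at all
    return @chunks;
}
```

To use the second one, read the input into a list, call
exact_chunks($chunk_size, @lines), and print each returned string to
outfilename.1, outfilename.2 and so on, as the program above does. Trailing
whitespace after the last word of a chunk stays with that chunk, so a chunk
may end in the middle of a line.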