Corpora: Summary: Sizer/Chunker

Tony Berber Sardinha (tony4@uol.com.br)
Sun, 28 Mar 1999 15:28:47 -0300

Hi,

Thanks to everyone who took the time to respond to my query about a
word-based file splitter, ie software or a command for breaking large text
files into chunks of x words each. (Email addresses are given below only
for those who included them in the body of their message.)

Adam Kilgarriff
Andreas Mengel
Doug Cooper doug@th.net
Ken Beesley
Kristen Precht
Mike Scott Mike.Scott@liv.ac.uk
Ole Norling-Christensen Norling@ddf.dk
Pascual Cantos
Stefan Thomas Gries
Ted Dunning

In what follows there are suggestions as to how to do the job in both Unix and
Windows. Windows users will be pleased to know that Ole Norling-Christensen
and Mike Scott have kindly produced W95 software for this task (Kristen
Precht is also writing one). For Ole's program, please contact him at
Norling@ddf.dk, and for Mike's splitter please download "wordsplt.zip" from
http://www.liv.ac.uk/~ms2928/downloads/_freebies or
http://www.ndirect.co.uk/~lexical/downloads/_freebies

Thanks a lot.
tony.

****** Adam Kilgarriff:

in perl it's easy - input is standard input, output is a series of
files called outfilename.1, outfilename.2 etc.

adam

================================================================

$chunk_size = 10000;   # words per chunk (or whatever)
$chunk_num  = 1;
$count      = 0;
open OUT, "> outfilename.$chunk_num";
while (<>) {
    print OUT;
    # This program doesn't split lines, so a chunk might end up with
    # up to a dozen extra words; something fancy would be needed to
    # get exactly the right number.
    $count += &words_in_line($_);
    if ($count >= $chunk_size) {
        close OUT;
        $chunk_num++;
        open OUT, "> outfilename.$chunk_num";
        $count = 0;
    }
}
close OUT;

# words_in_line returns - you guessed it! - the number of words in the
# line. In the default case of no markup, splitting on whitespace is
# enough; with markup you would do something fancier here.
sub words_in_line {
    my @words = split ' ', $_[0];
    return scalar @words;
}
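
A hypothetical invocation, assuming Adam's script is saved as chunker.pl
(the name is illustrative):

perl chunker.pl bigcorpus.txt

Thanks to the <> operator it reads the named file, or standard input if no
file is given, and writes outfilename.1, outfilename.2, ... to the current
directory.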

****** Andreas Mengel

if you have UNIX available, a first approximation would be to just use
the split command (perhaps with -l for lines).
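
A minimal sketch of that first approximation, assuming a file called
myfile.txt with roughly one sentence per line (filename and line count are
illustrative):

split -l 200 myfile.txt

Since this splits by lines rather than words, chunk sizes in words are only
approximate; the recipes below get closer to an exact word count.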

****** Doug Cooper
deroff -w writes its input out one word per line, so piping it through
split -l then gives files of exactly that many words each. Use any number
you want. Output files are named xaa, xab, ...

deroff -w foo | split -l 1000

****** Ken Beesley

UNIX 'split' can split a file into subfiles consisting
of a designated number of LINES, e.g.

split -500 myfile.txt myfile.txt

would split myfile.txt into files of 500 lines each,
named

myfile.txtaa
myfile.txtab
myfile.txtac

etc.

****** Mike Scott
Anyone who doesn't use Unix or who wants this facility at home may be
interested in downloading "wordsplt.zip" from
http://www.liv.ac.uk/~ms2928/downloads/_freebies
or
http://www.ndirect.co.uk/~lexical/downloads/_freebies

Freeware.

This 32-bit PC program does word-based file splitting in the way Tony
Berber Sardinha requested. It also allows you to cut out < > tags,
eliminate punctuation symbols and redundant spaces, and optionally create
an alphabetically ordered list of the words. The download is a zip file and
there's an .rtf help file which you can view within the program or within
your word processor.
Requirements: Windows 95 or better.

Do let me know if you find it useful...

****** Ted Dunning

The following is a reasonable approximation to what you want in TCL.
If you don't mind losing punctuation, then you can do a prettier job
in terms of getting exactly X words per chunk. The snippet shown here
will preserve punctuation, if not white-space.


# command line argument is number of words to keep in each chunk.

# standard input is munched into standard sized blocks which are then
# spit out to standard out with separators that look like
# <chunk> ... </chunk>


set n [lindex $argv 0]

set txt [read stdin]
set i 0
set out {}
foreach w [split $txt " \n\t\r"] {
    # split leaves empty strings where separators run together; skip them
    if {[string length $w] == 0} {continue}
    append out $w " "
    if {[incr i] % 20 == 0} {append out \n}
    if {$i % $n == 0} {
        puts <chunk>
        puts $out
        puts </chunk>
        set out {}
    }
}
# flush the final, possibly short, chunk
if {[string length $out] > 0} {
    puts <chunk>
    puts $out
    puts </chunk>
}
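
A possible invocation, assuming the script is saved as chunk.tcl (the name
and the chunk size are illustrative):

tclsh chunk.tcl 2000 < bigcorpus.txt > chunks.txt

The single command-line argument is the number of words per chunk; input is
read from standard input and the <chunk>-delimited output goes to standard
output.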

-------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://homepages.infoseek.com/~corpuslinguistics/homepage.html
-------------------------------