Corpora: kwic concordances with Perl

Noord G.J.M. van (vannoord@let.rug.nl)
Thu, 7 Oct 1999 17:04:47 +0200 (METDST)

Christer Geisler writes:
> The Perl script below (adapted from Dan Malamed's 2kwic.pl) will produce
> kwic concordances on a match, but
> a) will not detect multiple occurrencences on a line,
> b) nor find complex patterns across several lines.
>
> Can someone suggest other ways of writing simple kwic programs in Perl?
> Should I split into an array, use Perl's format, etc?

both a) and b) are treated by the script below. No warrenties!
Some comments are in Dutch (which is useful for some..).
It obtains b) by treating paragraphs at the time. It also does
sentence splitting which might not be what you want (exercise left
to the reader).

#!/usr/local/bin/perl -w
# kwic [-f W -l W -n W -r W -s Expr] Word [Files]
# geeft _per _zin_ die met Word matcht_ de linker- en rechtercontext of Word
# - iedere match per zin gerapporteerd
# - alleen context binnen dezelfde zin
# - breedte van context wordt gegeven door $opt_l en $opt_r
# - geeft ook bestandnaam en regelnummer (breedte met $opt_n $opt_f)
# - $opt_s bepaalt hoe einde van de zin gedefinieerd is.
use strict;
use vars qw($opt_f $opt_l $opt_n $opt_r $opt_s);
use Getopt::Std;

# assign command line options:
getopts('f:l:n:r:s:');

# assign default values to options
$opt_f = defined($opt_f) ? $opt_f : 0;
$opt_l = defined($opt_l) ? $opt_l : 30;
$opt_n = defined($opt_n) ? $opt_n : 0;
$opt_r = defined($opt_r) ? $opt_r : 30;
$opt_s ||= '[\.\?\!][\'\"]?\s';

# there must be at least one option remaining: the Word
@ARGV > 0 || die
"Usage: $0 [-f W -l W -n W -r W -s Expr] Word [Files]
Word is the Perl regular expression for the word(s) you are looking for,
-f argument determines width of file name (0 for full file name),
Default: $opt_f
NB. file name is printed only if there is more than one input file.
-l argument determines width of left context,
Default: $opt_l
-n argument determines width of line number field,
Default: $opt_n
-r argument determines width of right context,
Default: $opt_r
-s argument is a Perl regular expression for end of sentence.
Default: $opt_s
";

my $word = shift(@ARGV);
my $report_file_name=1;
$report_file_name=0 if @ARGV < 2;
# any remaining arguments are file names. If more than one file name,
# we report file name for each match.

$/=""; # reads a paragraph at a time. This gives unexpected results on
# dos files (more like slurp then...

while(<>) {
close ARGV if eof; # for $. (current record nr of input)
foreach $_ (split $opt_s) {
tr/\n\t\r / /s; # removes ^M, ^J, ^I
while (/$word/gio) { # report each match
if ($report_file_name) {
printf("%*s ",$opt_f,
length($ARGV)>$opt_f ? substr($ARGV,-$opt_f) : $ARGV);
}
printf("%*s %*s%s%-*s\n",
$opt_n,$.,
$opt_l,$opt_l ? (length($`)>$opt_l ? substr($`,-$opt_l):$`): "",
$&,
$opt_r,substr($',0,$opt_r));
}
}
}