Maybe I am missing something important here, but wouldn't it be a good idea
to 'chomp' the lines before splitting them, so that words at the end of
lines are not counted as separate words just because they have an end-of-line
character attached?
Institutionen för lingvistik
Yes - indeed. I had forgotten about that.
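
A minimal sketch of the effect, assuming the script splits each line on single spaces (the script itself is not quoted in this message, and the sample lines and variable names below are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical input: "dog" occurs once mid-line and once at the end of a line.
my @lines = ("the dog barked\n", "I saw the dog\n");

my (%raw, %chomped);
for my $line (@lines) {
    # Without chomp, the last word on each line keeps its newline.
    $raw{$_}++ for split / /, $line;

    # With chomp, the trailing newline is removed before splitting.
    my $copy = $line;
    chomp $copy;
    $chomped{$_}++ for split / /, $copy;
}

printf "without chomp: %d types\n", scalar keys %raw;      # 6
printf "with chomp:    %d types\n", scalar keys %chomped;  # 5
```

Without chomp, "dog\n" and "barked\n" are counted as types of their own, so "dog" shows up twice in the list.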
There are further problems with the script:
- It treats upper- and lower-case forms as different words ("The" vs. "the").
This could easily be remedied by adding "$line=lc($line);"
- What happens to punctuation? Usually, there is no space between the actual
word and a punctuation mark, so in the sentence "Something is missing.", there
would be a new type "missing." which isn't the same as "missing" in the middle
of a sentence...
If you add "$line=~s/[,.;:!?-]//g;" this would be taken care of - but no
distinction is made between periods at sentence boundaries and periods in
abbreviations.
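
Put together, the two fixes might look roughly like this. This is only a sketch: the helper name tokens() and the sample sentence are invented here, and the hyphen sits at the end of the character class so that Perl reads it as a literal character rather than as a range.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise one line and return its words.
sub tokens {
    my ($line) = @_;
    chomp $line;                  # drop the end-of-line character
    $line = lc $line;             # fold case: "Missing" and "missing" become one type
    $line =~ s/[,.;:!?-]//g;      # strip punctuation (hyphen last, so it is a literal)
    return split ' ', $line;      # split on runs of whitespace
}

my %count;
$count{$_}++ for tokens("Something is missing. Missing, e.g., this.\n");
print "$_ $count{$_}\n" for sort keys %count;
```

Here "missing" is counted twice, as intended, but "e.g." comes out as the type "eg": the substitution cannot tell an abbreviation from the end of a sentence. Removing the hyphen also fuses hyphenated words into one.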
I'm sure someone will point out a few other problems... ;-)
Sebastian Hoffmann
Englisches Seminar der Univ. Zürich
Plattenstrasse 47
CH-8032 Zürich
Tel: +41-1-634 3551
Fax: +41-1-634 4908
This archive was generated by hypermail 2b29 : Thu May 30 2002 - 10:55:08 MET DST