Kurzweil AI is looking for a temporary/summer employee to
help us with our text corpora. The position would be ideal for a
student in NLP and/or computing.
The job involves working with existing Perl scripts & C programs,
and writing some original scripts & programs, to collect written
corpora & prepare them for constructing language models.
The company is located in Waltham, MA, near Boston.
_____________________________________________________________
Specifically, the job will involve the following sorts of tasks:
COLLECT GENERAL & SPECIALIZED CORPORA
* from CD-ROMs
* from web sites, email collections, & Usenet news
* from the LDC & other collaborative groups
* specialized medical & legal corpora
* foreign language corpora
FORMAT THE CORPORA
* organize text into appropriate files & directories
* delete headers, comments, HTML/SGML tags, & similar text
* identify & delete quoted material in message text
* remove or mask proper names & confidential information
  in medical & legal report text
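As a rough illustration of the formatting tasks above, here is a minimal
Python sketch that drops a message's header block, skips quoted lines, and
strips HTML/SGML tags. It is not one of our existing scripts. The function
name clean_message, the ">" quoting convention, and the assumption that
headers end at the first blank line are all hypothetical choices made for
the example.

```python
import re

# Assumption: any <...> span is an HTML/SGML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: quoted material is marked with a leading ">".
QUOTE_RE = re.compile(r"^\s*>")


def clean_message(raw: str) -> str:
    """Return the body of a message without headers, quotes, or tags."""
    # Assumption: the header block ends at the first blank line.
    head, sep, body = raw.partition("\n\n")
    if not sep:
        body = raw  # no blank line found, so treat everything as body

    kept = []
    for line in body.splitlines():
        if QUOTE_RE.match(line):
            continue  # skip quoted material
        kept.append(TAG_RE.sub("", line))  # remove tags, keep the text
    return "\n".join(kept)


if __name__ == "__main__":
    raw = "From: a@b\nSubject: hi\n\n> old text\n<p>New <b>text</b></p>\nbye"
    print(clean_message(raw))
```

Real corpora need more care than this (nested quoting, multi-line tags,
signatures), so treat the sketch as a starting point only.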
NORMALIZE THE CORPORA ACCORDING TO A GIVEN LEXICON
* mark out-of-vocabulary words
* normalize punctuation, phrases, hyphenations, & so on, e.g.:
    He won't take Mr. Hill's high-stress job in New York.
      |
    he won't take Mr. hill 's high - stress job in New_York .
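The normalization example above could be approximated along the following
lines. This is only a sketch in Python: the rules are inferred from that
single example, and the toy LEXICON, the PHRASES table, and the <oov>
marker are hypothetical stand-ins for whatever the given lexicon actually
specifies.

```python
import re

# Hypothetical toy lexicon, just large enough for the example sentence.
LEXICON = {"he", "won't", "take", "Mr.", "hill", "'s", "high", "-",
           "stress", "job", "in", "New_York", "."}
# Hypothetical multi-word phrases that the lexicon treats as one entry.
PHRASES = {("New", "York"): "New_York"}


def split_token(tok):
    """Split punctuation, possessives, and hyphens off a raw token."""
    if tok in LEXICON:  # e.g. "Mr." and "won't" stay whole
        return [tok]
    m = re.fullmatch(r"(.+?)([.,;:!?])", tok)
    if m:  # trailing punctuation becomes its own token
        return split_token(m.group(1)) + [m.group(2)]
    if tok.endswith("'s") and len(tok) > 2:  # possessive
        return split_token(tok[:-2]) + ["'s"]
    if "-" in tok and len(tok) > 1:  # hyphenated compound
        return [p for p in re.split(r"(-)", tok) if p]
    return [tok]


def join_phrases(tokens):
    """Replace known two-word phrases with their single lexicon entry."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append(PHRASES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def to_lexicon_form(tok):
    """Use the lexicon's casing, or mark the word as out-of-vocabulary."""
    if tok in LEXICON:
        return tok
    if tok.lower() in LEXICON:
        return tok.lower()
    return f"<oov>{tok}</oov>"  # hypothetical OOV marker


def normalize(sentence):
    tokens = [t for raw in sentence.split() for t in split_token(raw)]
    return " ".join(to_lexicon_form(t) for t in join_phrases(tokens))


if __name__ == "__main__":
    print(normalize("He won't take Mr. Hill's high-stress job in New York."))
```

With this toy lexicon the sketch reproduces the normalized line shown in
the example, and a word missing from the lexicon, such as "Bob", comes out
as <oov>Bob</oov>.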
COMPUTE PRELIMINARY STATISTICS FROM THE CORPORA
* count n-grams for small values of n
* identify top-n lists of words for constructing lexicons
  for speech recognition
* compute perplexities, given various training & testing
  corpora & language models.
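For a sense of what these statistics involve, here is a small Python sketch
of n-gram counting, a top-n word list, and perplexity. The function names
are hypothetical, and the perplexity uses a unigram model with add-one
smoothing purely to keep the example short. That model is an assumption
made for illustration, not a description of the language models used here.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def top_words(tokens, k):
    """Return the k most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def unigram_perplexity(train, test):
    """Perplexity of test under an add-one-smoothed unigram model."""
    counts = Counter(train)
    # Assumption: the vocabulary is every word seen in train or test.
    vocab = set(train) | set(test)
    total = len(train) + len(vocab)  # add-one smoothing denominator
    log_prob = sum(math.log2((counts[w] + 1) / total) for w in test)
    return 2 ** (-log_prob / len(test))


if __name__ == "__main__":
    train = "a b a c".split()
    print(ngram_counts(train, 2))
    print(top_words(train, 1))
    print(unigram_perplexity(train, ["a", "b"]))
```

Lower perplexity means the model found the test corpus less surprising,
which is why it is computed across different pairings of training and
testing corpora.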
_____________________________________________________________
Interested applicants should contact Jeff Adams:
  jeffa@kurz-ai.com
  508-893-5151 x339
--
Jeff Adams
Language Modeling Scientist
Kurzweil Applied Intelligence
http://www.kurz-ai.com/people/jeffa