Corpora: NLP Summer Internships - Request for Nominations

WS '99 Internships
Mon, 1 Feb 1999 02:43:14 -0500 (EST)

Dear Colleague,

The Center for Language and Speech Processing at the Johns Hopkins
University is offering a unique summer internship opportunity which we
would like you to bring to the attention of your best students in the
current junior class.

This internship is unique in the sense that the selected students will
participate in cutting edge research as full members alongside leading
scientists from industry, academia and the government. The internship is
exciting because it exposes undergraduate students to
the emerging fields of text-to-speech synthesis, automatic speech
recognition and natural language processing.

We are specifically looking to attract new talent into the field and, as
such, do not require the students to have prior knowledge of the
technologies. Please take a few moments to nominate suitable bright
students who may be interested in this internship. Details are attached.

If you have any questions, please contact us by phone, e-mail or via
the internet.


Frederick Jelinek
Professor and Director.



The Center for Language and Speech Processing at the Johns Hopkins
University is seeking outstanding members of the current junior class to
participate in a summer workshop on language engineering from June 28 to
August 20, 1999.

No limitation is placed on the undergraduate major. Only relevant skills,
employment experience, past academic record and the strength of letters of
recommendation will be considered. In the past, students of Biomedical
Engineering, Computer Science, Cognitive Science, Electrical Engineering,
Linguistics, Mathematics, Physics, Psychology, etc., have been considered.
Women and minorities are encouraged to apply.

The internship includes:

* An opportunity to explore an exciting new area of research;

* A two week tutorial on speech and language technology;

* Mentoring by an experienced researcher;

* Use of a computer workstation throughout the workshop;

* A $4800 stipend and $1680 towards per diem expenses;

* Private accommodation for 8 weeks covering the workshop;

* Travel expenses to and from the workshop venue;

* Participation in project planning activities.

The eight week workshop provides a vigorously stimulating and enriching
intellectual environment and hopes to attract students to eventually
pursue graduate study or research in the field of human language

Application forms are available via the internet or by mail. Electronic
submission of applications is strongly encouraged. Applications must be
received at CLSP by February 10, 1999. For details, contact CLSP, Barton
Hall, 3400 N. Charles Street, Baltimore, MD 21218, visit our web site, or
call 410 516 4237.



Automated systems that interact with human users in spoken and written
communication will greatly enhance productivity and program usability.
These systems will act as on- and off-ramps to the information
superhighway, allowing friendly access to services. For some users and
tasks this convenience is essential, for example for handicapped users or
for accessing a database of maintenance manuals while performing
intricate repairs. Some other applications are conversion of
phone mail to text, transcription of radio or TV programs or of telephone
conversations, mechanical translation, and information retrieval.

Unfortunately, in many respects, current technology is inadequate for the
tasks at hand. For instance, automatic recognition of natural
conversational speech has a 40% error rate. Mechanical translation of
technical manuals results in confusing and ungrammatical instructions.
Even parsing of sentences of newspaper articles, while it has improved a
lot, leads to faulty analysis of over 50% of the sentences attempted.

There is a need to make progress in this important field. The number of
available personnel trained in the field is small, and solutions to
long-standing research problems must be found. At this time, relatively few
universities educate students capable of performing the required tasks.

We are organizing a six-week workshop on Language Engineering at Johns
Hopkins University from July 12 to August 20, 1999, in which mixed teams of
leading professionals and students would fully cooperate to advance the
state of the art. The professionals will be university professors and
industrial and governmental researchers presently working in widely
dispersed locations. Six or more undergraduates will be selected through a
nationwide search from the current junior class based on outstanding
academic promise. Graduate students will be familiar with the field and
will be selected in accordance with their demonstrated performance.

Four research topics are proposed for this workshop. They were determined
by a group of leading professionals in the field:

1. Statistical Machine Translation
2. Towards Language Independent Acoustic Modeling
3. Topic-based Novelty Detection
4. Normalization of Non-standard Words

The Center for Language and Speech Processing has successfully organized
similar workshops for the last three summers. Details of past workshops
are available at our web site.




1. Statistical Machine Translation

Automatic translation from one human language to another using computers,
better known as machine translation (MT), is a longstanding goal of
computer science. In order to be able to perform such a task, the
computer must "know" the two languages --- synonyms for words and phrases,
grammars of the two languages, and semantic or world knowledge. One way
to incorporate such knowledge into a computer is to use bilingual experts
to hand-craft the necessary information into the computer program. Another
is to let the computer learn some of these things automatically by
examining large amounts of parallel text: documents which are nearly exact
translations of each other. The Canadian government produces one such
resource, for example, in the form of parliamentary proceedings which are
recorded in both English and French.

Recently, statistical data analysis has been used to gather MT knowledge
automatically from parallel bilingual text. Unfortunately, the techniques
have not been disseminated to the scientific community in a very usable
form, and new follow-on ideas have not developed rapidly. In
pre-workshop activity, we plan to reconstruct a baseline statistical MT
system for distribution to all researchers, and to use it as a platform
for workshop experiments. These experiments will include working with
morphology, online dictionaries, widely available monolingual texts, and
syntax. The goal will be to improve the accuracy of the baseline and/or
achieve the same accuracy with only limited parallel corpora. We will
work with the French-English Hansard data as well as with a new language,
perhaps Czech or Chinese.
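
As a rough illustration of how translation knowledge can be gathered from
parallel text, the sketch below estimates word-translation probabilities
with a few rounds of expectation-maximization, in the style of IBM Model 1.
This is an assumption made for illustration only: the announcement does not
say which statistical model the baseline system will use. The three
sentence pairs and the names train_model1 and best_translation are
hypothetical, and the NULL word of the full model is left out.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM estimate of t(f | e): the probability that English word e
    is translated by French word f, from sentence-aligned pairs."""
    french_vocab = {f for french, _ in pairs for f in french}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference
    for _ in range(iterations):
        count = defaultdict(float)  # expected (f, e) link counts
        total = defaultdict(float)  # expected links per English word
        for french, english in pairs:
            for f in french:
                # E-step: share each French word among the English words
                # of its sentence, in proportion to the current t.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_translation(t, english_word, french_vocab):
    """Most probable French word for one English word."""
    return max(sorted(french_vocab), key=lambda f: t[(f, english_word)])

# Toy parallel corpus (illustrative only, not Hansard data).
pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]
t = train_model1(pairs)
french_vocab = {f for french, _ in pairs for f in french}
print(best_translation(t, "house", french_vocab))
```

No dictionary is supplied here. The word pairings emerge only from which
words co-occur across the aligned sentences, which is why the amount of
parallel text matters so much.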


2. Towards Language Independent Acoustic Modeling

The state of the art in automatic speech recognition (ASR) has advanced
considerably for those languages for which large amounts of data are
available to build the ASR system. Obtaining such data is usually very
difficult as it includes tens of hours of recorded speech along with
accurate transcriptions, an on-line dictionary or lexicon which lists how
words are pronounced in terms of elementary sound units such as phonemes,
and on-line text resources. These resources feed a hierarchy of models
that work together to recognize successive spoken words: the text
resources are used to train a language model, which helps the recognizer
anticipate likely words; the dictionary tells the recognizer how a word
will sound in terms of phonemes when it is spoken; and the speech
recordings are used to learn the acoustic signal pattern for each phoneme.
Relatively little research has been done on building speech recognition
systems for languages for which such data resources are not available ---
a situation which unfortunately is true for all but a few languages of the
world.
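
To make the role of the language model concrete, here is a minimal sketch
of a bigram model with add-one smoothing, trained on a toy corpus. The
bigram form, the smoothing, the corpus and the function names are all
assumptions chosen for illustration. Real recognizers train far larger
models, and the announcement does not specify the model type.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect the counts needed for a bigram language model."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(words[:-1])            # words that have a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
        vocab.update(words[1:])                # words that can be predicted
    return contexts, bigrams, vocab

def bigram_prob(word, prev, model):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def predict_next(prev, model):
    """The word the model considers most likely after prev."""
    _, _, vocab = model
    return max(sorted(vocab), key=lambda w: bigram_prob(w, prev, model))

# Toy text resource (illustrative only).
corpus = ["the news at nine", "the news at ten", "the weather at nine"]
model = train_bigram(corpus)
print(predict_next("the", model), bigram_prob("news", "the", model))
```

A recognizer would combine such probabilities with acoustic scores, so that
among similar-sounding candidates the more expected word is preferred.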

This project will investigate the use of speech from diverse source
languages to build an ASR system for a single target language. We will
study promising modeling techniques to develop ASR systems in languages
for which large amounts of training data are not available. We intend to
pursue three themes. The first concerns the development of algorithms to
map pronunciation dictionary entries in the target language to elements in
the dictionaries of the source languages. The second theme will be
Discriminative Model Combination of acoustic models in the individual
source languages for recognition of speech in the target language. The
third theme will be development of clustering and adaptation techniques to
train a single set of acoustic models using data pooled from the available
source languages. The goal is to develop Czech Broadcast News
transcription systems using a small amount of Czech adaptation data to
augment training data available in English, Spanish, and Mandarin. The
best data for this modeling task would be natural, unscripted speech
collected on a quiet, wide-band acoustic channel. News broadcasts are a
good source of such speech and are fairly easily obtained. Broadcast news
data of other source or target languages, possibly German or Russian, will
be used if they become available in a suitable amount and quality.


3. Topic-based Novelty Detection

Computers are increasingly used to manage the large volumes of news and
information now available in electronic form. The task of the
computer is to organize the incoming data into segments or stories which
are related and to index them in a way which makes it easier for the user
to digest them.

A key problem in digesting new data is deciding which parts contain
redundant information so attention can be focused on the new material.
This project proposes to investigate the problem of analyzing newly
arrived news stories for two purposes: (1) to decide if the story
discusses an event or topic that has not been seen earlier (first story
detection); and (2) to identify, within a sequence of stories on the same
pre-defined topic, which portions of subsequent stories contain new
information and to determine the new named entities that are central to
the topic (within-topic novelty detection). The project will focus on
extending and combining Information Retrieval and Natural Language
Processing/Information Extraction techniques toward addressing these
questions. Specifically, the team will look at identifying who/where/when
entities and how to use them in Information Retrieval and other language
modeling approaches for addressing this problem. An important component
of the proposed project is investigating the impact on the detection
results of using (degraded) text put out by a speech recognition system.
The evaluation of the project's results will be based on established
measures from the Topic Detection and Tracking initiative in the case of
first story detection, and on accuracy of aligning predicted new text with
actual new information (as identified by human experts prior to the
workshop) in the case of novelty detection.
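
One simple baseline for first story detection, sketched below under stated
assumptions, compares each incoming story with all earlier ones and flags
it as new when nothing seen so far is similar enough. The bag-of-words
cosine measure, the threshold of 0.3, the example stories and the function
names are illustrative choices, not the method the project commits to.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one story."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def first_story_flags(stories, threshold=0.3):
    """For each story in arrival order, True if no earlier story
    reaches the similarity threshold (assumed value)."""
    seen, flags = [], []
    for text in stories:
        vec = vectorize(text)
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags

# Hypothetical incoming stories.
stories = [
    "earthquake strikes coastal city overnight",
    "coastal city earthquake damage assessed",
    "parliament passes new budget bill",
]
print(first_story_flags(stories))
```

Replacing the plain word counts with who/where/when entities, or feeding in
recognizer output instead of clean text, would change only vectorize, which
makes this a convenient place to measure the effect of each.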


4. Normalization of Non-standard Words

Real text contains a variety of "non-standard" token types, such as digit
sequences; words, acronyms and letter sequences in all capitals; mixed
case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and
e-mail addresses. Many of these kinds of elements are pronounced
according to principles that are quite different from the pronunciation of
ordinary words. Furthermore, many items have more than one plausible
pronunciation, and the correct one must be disambiguated from context: IV
could be "four", "fourth", "the fourth", or "I.V."
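
A first step toward handling such tokens is to sort them into surface
types. The sketch below does this with regular expressions. The type
names, the patterns and their ordering are assumptions for illustration,
not the taxonomy the project intends to produce, and the result is only a
guess from spelling: a token such as IV is labeled a Roman numeral here
even though context may call for "I.V."

```python
import re

# Checked in order; the first matching pattern wins (assumed priority).
PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("url", re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)),
    ("digits", re.compile(r"\d+")),
    # At least two letters, so the pronoun "I" is not caught.
    ("roman_numeral", re.compile(
        r"(?=[IVXLCDM]{2})M{0,3}(CM|CD|D?C{0,3})"
        r"(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")),
    ("all_caps", re.compile(r"[A-Z]{2,}")),
    # A lowercase letter directly followed by an uppercase one.
    ("mixed_case", re.compile(r"[A-Za-z]*[a-z][A-Z][A-Za-z]*")),
    # Letters plus a final period; a sentence-final word would also match.
    ("abbreviation", re.compile(r"[A-Za-z]+\.")),
]

def classify(token):
    """Return the first surface type whose pattern matches the whole token."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "ordinary"

for token in ["1999", "WinNT", "IV", "NLP", "Dr.", "house"]:
    print(token, classify(token))
```

Choosing the right spoken form for each type, and resolving cases that fit
more than one type, still requires the surrounding context.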

Normalizing or rewriting such text using ordinary words is an important
issue for several applications. For instance, an essential feature of
natural human-computer interfaces is that the computer be capable of
responding with spoken replies or comments. A Text-to-Speech module
synthesizes the spoken response from such text input and must be able to
render such items appropriately into speech. In Automatic Speech
Recognition, non-standard token types cause problems for training acoustic
as well as language models. More sophisticated text normalization will be an
important tool for utilizing the vast amounts of on-line text resources.
Normalized text is likely to be of specific benefit in information
extraction applications.

This project will apply language modeling techniques to the creation of
wide-coverage models for disambiguating non-standard words in English. Its aim
is to create (1) a publicly available corpus of tagged examples, plus a
publicly available taxonomy of cases to be considered, and (2) a set of
tools that would represent the best state of the art in text normalization
for English.