Modified Roget Available

Patrick Cassidy (micra@tigger.jvnc.net)
Tue, 28 May 1996 15:41:40 -0400 (EDT)

Jing-Shin Chang inquired (28 May, 1996)
---------------
> Dear netters,
>
> I am trying to use thesaurus for nlp research.
> I found three electronic versions of the roget's thesaurus
> (1911) in the internet: roget13.zip, roget13a.zip, roget14a.zip.
> They are not really good for my purposes. And they are different
> from the original paper copy.
>
> I am wondering if there is any newer versions of this
> thesaurus or an electronic version of the 1911 edition
> in its original organization.
>
> Where can I find the electronic version?
> What is the distributor/publisher's contact
> address/FAX/email?
> What's the cost for academic research purposes?
>
> Thanks for your information.
>
> Best regards,
>
> Jing-Shin Chang

There is a modified version of the Roget, prepared by
MICRA, Inc., in which the Roget has been converted into a
hierarchical semantic network. The modifications consist primarily
of the addition of some more recent words and phrases, and the addition
of semantic relations to specify explicitly the relation of the words within
each main entry to the headword. This has been titled the
FACTOTUM Semantic Network.
This is a work still in progress, far from complete. More details can
be found by anonymous ftp from the site:

ftp.cs.cmu.edu
login anonymous
cd to user/ai/new
change mode to binary

get files:  readme.fsn      general information
            relation.asc    the list of semantic relations used and their
                            definitions
            FSNOUTL.ASC     The hierarchy of headwords (ca. 2000)
                            in the semantic network
            FSN_DOC.ASC     discussion of the purpose and content
                            of the work
            fsn001.asc  |   The semantic network, in ASCII form,
            fsn002.asc  |   split for convenience in transmission.
            fsn003.asc  |   Each file is about 650 kb.
            fsn004.asc  |   They should be concatenated in order,
            fsn005.asc  |   for easy processing.
            fsn006.asc  |

NOTE: these files are essentially pure ASCII, but in the fsn00x.asc
files there are occasional European accented characters, so these
files should be ftp'd in binary format.
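
The retrieval steps above can be sketched in Python as follows. This is
only an illustration under stated assumptions: it uses the standard
ftplib module, the host and directory are copied from this announcement
and may no longer be reachable, and the function names and the local
directory layout are hypothetical.

```python
from ftplib import FTP
from pathlib import Path

HOST = "ftp.cs.cmu.edu"      # from the announcement; may no longer be reachable
REMOTE_DIR = "user/ai/new"   # directory given in the announcement
PARTS = [f"fsn{n:03d}.asc" for n in range(1, 7)]  # fsn001.asc .. fsn006.asc


def fetch_binary(names, dest_dir, host=HOST, remote_dir=REMOTE_DIR):
    """Download each named file in binary mode over anonymous FTP."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()              # no arguments means an anonymous login
        ftp.cwd(remote_dir)
        for name in names:
            with open(dest / name, "wb") as out:
                # retrbinary transfers the bytes unchanged, which keeps
                # the occasional accented characters intact.
                ftp.retrbinary(f"RETR {name}", out.write)


def concatenate(names, src_dir, out_path):
    """Join the parts in the order given, byte for byte."""
    src = Path(src_dir)
    with open(out_path, "wb") as out:
        for name in names:
            out.write((src / name).read_bytes())
    return Path(out_path)
```

Calling fetch_binary(PARTS, "fsn") and then
concatenate(PARTS, "fsn", "fsn_all.asc") would leave the whole network in
a single file; the name fsn_all.asc is made up for this example. Opening
every file in binary mode avoids any newline or character translation.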
---------------------------------------------------------------

This semantic network is being developed for use in natural language
understanding. The initial task, almost completed now, has been the
definition of the semantic relations required to relate each headword to
the numerous words listed under it. In the original Roget, there are
about 1026 headwords, and many words listed under each headword, which
are presumably related in some way to the headword, but the relation
is not specified. In the FACTOTUM Semantic Network, the organization is
very similar to that of Roget, but the hierarchy is more explicit, and
the relation of each word to the headword is specified by one of
about 170 semantic relations.
The present version is the work of only one individual, and is
necessarily only a bare outline of what a proper semantic network
should be. Nevertheless, every word is linked by some relation to
the network. The number of links is, perhaps, only one percent of
the number of links specified in the CYC system; the word coverage is
somewhat less than that of WordNet. The organization differs from
both WordNet and CYC.
There are still over 2000 semantic relation references which
have not yet been disambiguated by word sense; the scan of the
thesaurus for semantic relations has been performed only once, and
there are doubtless many inconsistencies and errors still present.
No attempt has yet been made to systematically increase the word
coverage to that of a modern dictionary. Nevertheless, even in its
present form, this may have some utility in NLU, and in any case is
likely to be more useful than the original 1911 Roget.
A program has been written by Aleksandr Gelbukh to parse the
structure of the ASCII text and create an index. He has also
written a program to read the index and display the text of the
Semantic Network with the search word highlighted. The hierarchy
above any search word can also be displayed, and a list of
words related by the semantic relations can be viewed in a separate
screen. This program runs under DOS, or in a DOS window under OS/2.
A dump can be created with separate files for each semantic
relation explicitly marked in the network; however, we do not yet have
a full dump of all of the data present in a convenient form.
This could probably be done fairly quickly, if there is a need
for it.
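
The index-and-lookup idea described above can be illustrated with a
minimal sketch. It is not Aleksandr Gelbukh's program and assumes nothing
about the actual layout of the FSN files: it simply maps each lowercased
word to the numbers of the lines where it occurs, and marks the hits on
display. The function names, the ** highlight markers, and the sample
lines are all hypothetical.

```python
import re
from collections import defaultdict

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def build_index(lines):
    """Map each lowercased word to the line numbers where it occurs."""
    index = defaultdict(list)
    for number, line in enumerate(lines):
        # The set keeps a line from being listed twice for one word.
        for word in {m.group().lower() for m in WORD.finditer(line)}:
            index[word].append(number)
    return index


def show(lines, index, word):
    """Return the lines containing `word`, with each hit marked by **."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = index.get(word.lower(), [])
    return [pattern.sub(lambda m: f"**{m.group()}**", lines[n]) for n in hits]


sample = [
    "A thesaurus groups related words.",
    "Each word is linked to a headword.",
    "The headword heads an entry.",
]
sample_index = build_index(sample)
for line in show(sample, sample_index, "headword"):
    print(line)
```

Building the index once and looking words up in it afterwards avoids
rescanning the full text for every search, which matters for a text of
several megabytes.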
------------------------------------------------------------
The files on the CMU site are available by anonymous ftp and can be used
for personal use or for research without restriction. The only
restriction is that inclusion of any part of the semantic network in
a product for sale requires written permission of MICRA.
The index and viewer for use under DOS can be obtained from MICRA
for $50.00, supplied on 3-1/2 inch high-density DOS-formatted disks.
(Payable to MICRA, Inc.).
---------------------------------------------------------------
The utility of this semantic network can be evaluated from
the files at the CMU site. If anyone would like to explore
its use in more detail, and would like to obtain logical
data dumps, please contact Pat Cassidy.
Anyone with an interest in semantic networks is encouraged to
send comments or suggestions. If any others are engaged in
building semantic networks, and are willing to work in a collaborative
effort, please contact Pat Cassidy to discuss possible useful
sharing of data.
========================================================
The reason why Mr. Chang would like a version of Roget closer to the
original is not clear. The roget13.zip file is very close to the
original (unless it has been modified since it was submitted to
Project Gutenberg). It has only about 1000 words added to the original
text, and all other modifications were made solely to make it easier
to interpret by a computer program. The original should be primarily
of interest only for historical purposes. The electronic version
of the original lacked an index, because such an index is better
prepared directly from the text, with an indexing program, for
texts which will be modified. If there is really some interest
in a version even closer to the original, please contact me and I will
see if I can find an earlier version. If still accessible, it will,
unfortunately, be filled with typos.
=========================================================================

Contact:
Patrick Cassidy          cassidy@micra.com
MICRA Inc.               (908) 561-3416
735 Belvidere Ave.       FAX: (908) 668-5904
Plainfield, NJ 07062