[Corpora-List] morphological analysis: Russian

From: Amanda Stent (stent@cs.sunysb.edu)
Date: Sun Mar 12 2006 - 23:34:11 MET

    About a month ago, I requested information on this topic and promised to
    post a summary of the replies I received. Here it is (thanks to all you
    informative resources!):

    Grigori Sidorov writes:
    If you want just separation of words into stem and flexion, then you can
    use our system of morphological analysis.
    This system does not have information about suffixes or prefixes.

    www.cic.ipn.mx/~sidorov/rmorph

    The example at the page does not present separation into stem and flexion,
    but the system has this function.

    ---------

    Roman Yangarber writes:
    the only thing i am aware of that is in existence, is a tool (being?)
    developed by a team at the now impoverished academy of sciences in Moscow.
    (it's headed by Igor Boguslavsky.) we had talked about some collaboration
    a few years back (in the context of information extraction), but i've not
    had an opportunity to evaluate the tool myself. i just know something
    exists.

    ---------

    Eric Atwell writes:
    If you can't find a good morphological analyser for Russian,
    then try an unsupervised learning system: the EU PASCAL research network
    has just run the MorphoChallenge2005 contest to develop unsupervised
    learning systems that learn morphological analysis from corpus data. The
    contestants were evaluated on English, Finnish, and Turkish, but
    hopefully systems general enough to learn morphological segmentation for
    these 3 different languages should also cope with Russian. Winner(s)
    are still to be announced - see http://www.cis.hut.fi/morphochallenge2005/
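
    To give a feel for what "learning segmentation from corpus data" can
    mean, here is a minimal sketch of letter successor variety, a much
    older and simpler heuristic than anything entered in the contest. It
    is not the method of any MorphoChallenge system. The function names,
    the toy word list, and the min_stem and threshold parameters are all
    assumptions made for illustration.

```python
from collections import defaultdict


def successor_counts(words):
    """Map each prefix to the set of characters seen right after it.

    '#' stands for end-of-word, so a prefix that is itself a word
    counts the word boundary as one possible successor.
    """
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word) + 1):
            nxt = word[i] if i < len(word) else "#"
            followers[word[:i]].add(nxt)
    return followers


def split_word(word, followers, min_stem=2, threshold=2):
    """Split a word after the prefix with the highest successor variety.

    The word is left unsplit when no prefix of at least min_stem
    characters has threshold or more distinct successors.
    """
    best_pos, best_variety = len(word), threshold - 1
    for i in range(min_stem, len(word)):
        variety = len(followers.get(word[:i], ()))
        if variety > best_variety:
            best_pos, best_variety = i, variety
    return word[:best_pos], word[best_pos:]


if __name__ == "__main__":
    # Toy corpus, invented for this example.
    corpus = ["стол", "стола", "столу", "столом", "окно", "окна", "окну"]
    table = successor_counts(corpus)
    for w in corpus:
        print(w, split_word(w, table))
```

    On a corpus this small the result is only a toy; a real system would
    need far more data and a better criterion for choosing boundaries.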

    --------

    Jonathan Young writes:
    Here's what I found in my inventory and from a quick google search:

    - ispell has wordlists for Russian; Stanford pointed me to the broken
    link ftp://mch5.chem.msu.su/pub/russian/ispell/ ("rus-ispell"), but
    there appear to be several. While not a true morphological analyzer,
    the wordlists have significant structure because of the /XYZ suffix
    codes, which encode which endings follow each "root". It's totally
    uninterpreted (words are just character strings, not lemmas/morphs, no
    POS tags, etc.), but it might be a good starting point.

    - http://www.artint.ru/projects/frqlist/frqlist-en.asp contains Russian
    word and lemma frequency lists (similar to Adam Kilgarriff's frequency
    lists for the BNC; they appear to have a corpus of a similar size, but I
    can't find it), a paper, and a link to another morphological analyzer:
    Dialing, at http://www.aot.ru/ . My Russian isn't good enough to tell
    exactly what the AOT folks are doing, but there's plenty of technology
    documentation, as well as a free download of both Linux and Windows
    versions of their (probably commercially sold) lemmatizer and a Python
    scripting interface.

    - FreeBSD appears to have lemmatizers for English, German, and Russian -
    one source is
    http://osmirrors.cerias.purdue.edu/pub/FreeBSD/distfiles/lemmatizer/ .
    I haven't tested this, but it looks promising. It may also be the same
    technology as the aot.ru lemmatizer.

    - XRCE has a demo at
    www.xrce.xerox.com/competencies/content-analysis/demos/russian
    (commercial).

    - http://clr.nmsu.edu/Research/Projects/tide/Russian.html mentions an
    algorithm by Svetlana Sheremetyeva and Sergei Nirenburg, but all I can
    find on the web is links to papers.

    - There's a demo at http://starling.rinet.ru/morph.htm . Dictionaries
    and executable code are downloadable from
    http://starling.rinet.ru/downl.php?lan=en#dict , but the dictionaries
    are not really human-readable.

    - http://snowball.tartarus.org/algorithms/russian/stemmer.html documents
    in great detail a Russian stemmer written in Snowball. This is (IMHO)
    older, more primitive technology: similar to the well-known Porter
    stemmer for English, it is based on a small number of hand-coded rules
    and is unlikely to handle many well-known special cases (e.g. most
    irregular verbs).

    - RussianStemmer.java and other utilities in Lucene (my notes say
    LuBo?). Lucene is now part of Apache and can be found at
    http://lucene.apache.org/ . The code cites the Russian stemmer at
    http://snowball.sourceforge.net .

    - PyStemmer at http://sourceforge.net/projects/pystemmer (v 0.10 is the
    only version released). According to the project page, "PyStemmer
    provides stemmer functionality in Python for English, German, Norwegian,
    Italian, Dutch, Portuguese, French, Swedish. PyStemmer is based on the
    Snowball stemmer (snowball.sourceforge.net)" - but it also has rules for
    Russian. Note that the same snowball stemmer source is cited.

    - Unitex v 1.2 has support for Russian, but it appears to be mostly
    empty stubs.
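
    The ispell-style "root/XYZ" entries mentioned in the first item above
    can be expanded mechanically once the suffix codes are known. The
    sketch below assumes a drastically simplified format: the flag
    letters and endings in SUFFIX_FLAGS are invented for illustration and
    are not taken from rus-ispell, whose real affix files also carry
    conditions and stripping rules that are ignored here.

```python
# Hypothetical flag table: NOT copied from any real ispell affix file.
SUFFIX_FLAGS = {
    "A": ["а", "у", "ом", "е"],  # toy singular noun endings
    "B": ["ы", "ов", "ам"],      # toy plural noun endings
}


def parse_entry(line):
    """Split an ispell-style 'root/FLAGS' line into (root, [flags])."""
    root, _, flags = line.strip().partition("/")
    return root, list(flags)


def expand(line, table=SUFFIX_FLAGS):
    """Return the root followed by every form its flags license."""
    root, flags = parse_entry(line)
    forms = [root]
    for flag in flags:
        # Unknown flags are skipped rather than treated as errors.
        forms.extend(root + ending for ending in table.get(flag, []))
    return forms


if __name__ == "__main__":
    for form in expand("стол/AB"):
        print(form)
```

    Running the expansion in reverse (mapping each generated form back to
    its root and flag) gives a crude lookup table, which is about as far
    as uninterpreted wordlists of this kind can go.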

    --------

    Lars Borin writes:

    Please have a look at this site: <http://www.aot.ru/>

    --------

    As far as I know, there are relatively few corpora concerning Russian
    word segmentation. I can point to the following:

    1. http://www.philol.msu.ru/~lex/corpus/ (200,000 running words, about
    5,500 word-formation models). Unfortunately, the corpus has not been
    available since last autumn.
    2. http://www.ruscorpora.ru (more than 65 million running words; some
    information about word structure can be extracted using semantic
    annotation tags, such as diminutive or having the IK-suffix, as in sadIK
    "small garden")

    3. I know of only one (rather small) dictionary on the Net that represents
    the word structure of about 3,000 Russian words. It is the Russian
    Derivational and Morphemes Dictionary, prepared in Kazan'
    (http://www.kcn.ru/tat_ru/universitet/infres/slovar/index.htm)

    Two other resources are not available through the Internet, but one can
    try to contact the person in charge of each of the following projects.
    4. A Computer Implementation of Russian Derivational Morphology
    represented in DATR (http://www.surrey.ac.uk/LIS/SMG/lever_final_desc.htm)
    5. An Electronic Dictionary of Russian Morphemes (based on the
    comprehensive Dictionary of Russian Morphemes by A.I. Kuznetsov & T.F.
    Efremova). The author of the database is a school teacher,
    Tatiana Sentsova from Moscow, and I have no idea how to reach her.
    ....
    Finally, I can send you two reviews on the topic. Both are written in
    Russian.
    1. S. Koval', Resursy po russkoi morfologii v internete
    2. T. Reznikova, M. Kopotev. "Lingvisticheski annotirovannye korpusa
    russkogo yazyka (obzor obschedostupnyh resursov)". Natsional'nyi korpus
    russkogo yazyka 2003-2005. Moscow, 2006 [in press]

    ---------

    Jasper Holmes writes:
    I'm sure that the Surrey Morphology Group
    (http://www.surrey.ac.uk/LIS/SMG/) will have some relevant
    information.


