Re: [Corpora-List] American and British English spelling converter

From: Ben Hutchinson (ben.hutch@gmail.com)
Date: Fri Nov 03 2006 - 01:12:03 MET

    Stanford University's NLP group's POS tagger does some pre-processing
    that converts British spellings to US spellings based on variations in
    the spellings of certain common words and word endings.

    As an example of how it modifies word endings, it tags
    "sour flour our dour parlour rigour glamour colour Harbour"
    as
    "sour/JJ flour/NN our/PRP$ dour/NN parlor/NN rigor/NN glamor/NN
    color/NN Harbor/NNP".

    It even Americanizes unknown words ending in "-our", so, for example,
    it tags "nonsensour" as "nonsensor". Sometimes it is a bit
    overeager, as in "devour" -> "devor/NN".
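
    To illustrate the kind of suffix rule described above, here is a
    minimal Python sketch. It is not the tagger's actual Java code: the
    function name americanize_our and the KEEP_OUR exception list are
    my own assumptions, chosen so that the example words come out as
    shown and "devour" is left alone.

```python
# Hypothetical exception list (an assumption, not taken from the tagger):
# words ending in "-our" that are spelled the same in US English.
KEEP_OUR = {
    "our", "sour", "flour", "dour", "hour", "four", "pour", "tour",
    "your", "devour", "scour", "contour", "detour", "velour",
}


def americanize_our(word: str) -> str:
    """Rewrite a trailing "-our" as "-or" unless the word is a known exception."""
    lower = word.lower()
    if lower in KEEP_OUR or not lower.endswith("our"):
        return word
    # Keep an all-caps suffix all-caps; otherwise use lowercase "or".
    suffix = "OR" if word[-3:].isupper() else "or"
    return word[:-3] + suffix


if __name__ == "__main__":
    sample = "sour flour our dour parlour rigour glamour colour Harbour"
    print(" ".join(americanize_our(w) for w in sample.split()))
```

    Any exception list of this kind is incomplete, so unknown words such
    as "nonsensour" are still rewritten.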

    The tagger is under the GNU license, so I think it should be possible
    to adapt the Java code to suit your requirements as long as you
    redistribute your changes. I also think it should be fairly
    straightforward to invert their algorithm, although it's a while since
    I looked at the source. It is available from
    http://nlp.stanford.edu/software/index.shtml
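
    On inverting the conversion: a bare suffix rule does not invert
    cleanly, since turning "-or" back into "-our" would also change
    words like "doctor". One possible approach, sketched below in
    Python, is a lookup table that can be flipped for the other
    direction. The table contents and the names UK_TO_US, US_TO_UK and
    convert are illustrative assumptions, not part of the Stanford code.

```python
import re

# Tiny illustrative table (an assumption); a real one would be far larger.
UK_TO_US = {
    "colour": "color",
    "parlour": "parlor",
    "rigour": "rigor",
    "glamour": "glamor",
    "harbour": "harbor",
    "realise": "realize",
}
# Inverting the mapping gives the US -> UK direction.
US_TO_UK = {us: uk for uk, us in UK_TO_US.items()}

WORD = re.compile(r"[A-Za-z]+")


def _match_case(source: str, target: str) -> str:
    """Copy the capitalisation pattern of source onto target."""
    if source.isupper():
        return target.upper()
    if source[0].isupper():
        return target.capitalize()
    return target


def convert(text: str, table: dict) -> str:
    """Replace every word found in table, leaving all other text untouched."""
    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        return _match_case(word, replacement) if replacement else word
    return WORD.sub(swap, text)


if __name__ == "__main__":
    print(convert("The Harbor was full of color.", US_TO_UK))
    print(convert("Colour and rigour", UK_TO_US))
```

    This sketch only matches whole words in the table, so inflected
    forms such as "realised" would need their own entries or a stemming
    step.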

    On 03/11/06, Martin Wynne <martin.wynne@oucs.ox.ac.uk> wrote:
    > If you find such a program, let us know, and we can run it over the BNC
    > and change the 5849 occurrences of 'realize' and inflected forms to
    > 'realise' etc., and otherwise correct British English to your preferred
    > spellings ;)
    >
    > Martin Krallinger wrote:
    >
    > > Dear all,
    > >
    > > I was looking for some simple tool (preferably in Python) which
    > > is able to do automatic conversion of texts (or words) from
    > > British English (UK) to American (US) English and vice versa.
    > > (Example: realize <-> realise)
    > >
    > > This seems to be an easy task, but I could not find any ready to use
    > > stand alone tool capable of performing this task.
    > >
    > > I want to integrate this application into an information extraction
    > > system which handles scientific literature.
    > >
    > > I am also interested in references where aspects related to US/UK English
    > > spelling have been analyzed in the context of information extraction, text
    > > mining and terminology extraction.
    > >
    > > Best regards,
    > >
    > >
    > > Martin


