Re: [Corpora-List] American and British English spelling converter

From: Martin Krallinger (martink@cnb.uam.es)
Date: Fri Nov 03 2006 - 12:20:48 MET


    Dear all,

    Just to clarify the motivation behind my question (spelling conversion
    UK/US): I am not a linguist. I work in a cancer research center and
    want to combine bioinformatics tools with IE and text mining systems.
    I extracted the spelling example I used before from the PubMed
    database (maybe I did not choose the best example):

    realize:
    'By working toward team care, hospitals may achieve a successful
    intensivist model, and patients may realize the benefits of spending
    less for healthcare and living longer.'
    [PMID:17077695]

    realise:
    'However, these experiences have also illuminated a number of critical
    challenges that will have to be addressed in the development of
    effective drugs across different cancers, to fully realise the potential
    of individualised molecular therapy.'
    [PMID:17059381]

    In the life sciences, people are interested in using ontologies (e.g. Gene
    Ontology), controlled vocabularies and information extraction systems to
    increase the efficiency of information access. As the biomedical literature
    is written mainly in English, but by authors with different native
    languages, I suppose most of the articles are in either UK or US English.
    (For a study of the effect of different native languages on the writing
    of biomedical literature, see Netzel et al
    http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1319188).

    This mix of spellings makes information extraction, and the mapping of
    terms derived from existing biomedical ontologies, quite challenging.

    I want to use a spelling converter ONLY as a way to 'normalize' a
    large collection of biomedical text for subsequent IE, IR, document
    categorization and term mapping, and not for extensive lexical,
    grammatical and idiomatic analysis.
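
    A minimal sketch of the kind of normalization step described above,
    assuming a simple word-for-word lookup table. The table entries, the
    function names and the choice of US English as the target are all
    illustrative assumptions and not an existing tool; a usable table would
    have to be much larger and curated for biomedical vocabulary.

```python
import re

# Illustrative UK -> US mapping (an assumption for this sketch only).
# "realise" and "individualised" come from the PubMed example above;
# the other entries are typical biomedical variants added for illustration.
UK_TO_US = {
    "realise": "realize",
    "individualised": "individualized",
    "tumour": "tumor",
    "haemoglobin": "hemoglobin",
}

# Matches runs of ASCII letters; everything else is left untouched.
_WORD = re.compile(r"[A-Za-z]+")


def _match_case(source, target):
    """Give `target` the same capitalization pattern as `source`."""
    if source.isupper():
        return target.upper()
    if source[0].isupper():
        return target.capitalize()
    return target


def normalize_spelling(text, table=UK_TO_US):
    """Replace every word found in `table` and keep all other text as is."""
    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        return _match_case(word, replacement) if replacement else word

    return _WORD.sub(swap, text)


if __name__ == "__main__":
    print(normalize_spelling("to fully realise the potential of "
                             "individualised molecular therapy"))
```

    Inverting the table would give the opposite direction (US -> UK). A
    lookup like this only touches spelling variants and deliberately ignores
    lexical, grammatical and idiomatic differences.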

    Best regards,

    Martin Krallinger

    >It would be a grave mistake to think that the only difference between
    >British and American English is a few wayward spellings. There are
    >considerable and extensive lexical, grammatical and idiomatic
    >differences. The 1st and 3rd of those are more or less well known, but
    >the grammatical differences never cease to surprise me. I'd be
    >moderately interested to see what other examples corpora listers come up
    >with (though no doubt they will also remind me that there are
    >significant differences in usage between American dialects, not to
    >mention Canadian etc)
    >
    >To give just one example of each:
    >
    >Lift vs elevator
    >Have you got vs do you have
    >Half four vs 4:30
    >
    >Harold Somers
    >
    >
    >
    >>-----Original Message-----
    >>>Martin Krallinger wrote:
    >>>
    >>>>Dear all,
    >>>>
    >>>>I was looking for some simple tool (preferable in Python) which is
    >>>>able to do automatic conversion of texts (or words) from British
    >>>>English (UK) to American (US) English and vice versa.
    >>>>(Example: realize <-> realise)
    >>>>
    >>>>This seems to be an easy task, but I could not find any ready to use
    >>>>stand alone tool capable of performing this task.
    >>>>
    >>>>I want to integrate this application into an Information extraction
    >>>>system which handles scientific literature.
    >>>>
    >>>>I am also interested in references where aspects related to US/UK
    >>>>English spelling has been analyzed in the context of information
    >>>>extraction, text mining and terminology extraction.
    >>>>
    >>>>Best regards,
    >>>>
    >>>>Martin


