Re: [Corpora-List] 'Standard European English' ? Web Corpus-building tools

From: William Fletcher (fletcher@usna.edu)
Date: Thu Mar 02 2006 - 19:18:13 MET

    Dear Eric, and others with similar projects,

    For your students' "courseworks" (like Carmela I'd use a singular here, and probably say "course projects" instead) I have a suite of Windows tools that makes it easy to compile a "corpus" of Web pages in a few hours. While some of the tools have rough edges, I'd be glad to make them available to any understanding party who e-mails me (give me a day or two to make a rough guide to the process):

    1. KWiCFinder (free from KWiCFinder.com), which conducts searches on specific words on "AltaVista" (now actually the Yahoo search engine) and downloads matching webpages. Searches can be restricted to a specific domain and language. The developmental version I use allows bypassing search report generation, so it runs much faster than the current release version. With broadband you can conduct 30-40 searches simultaneously and download several thousand matching pages an hour.

    2. kfWinnow, which processes downloaded pages to eliminate duplicates and pages with very low or very high word counts (the former have a low signal/noise ratio, the latter a high chance of repetition).

    3. kfNgram (ditto), which helps identify highly repetitive (long) documents (HRDs) by looking for multiple occurrences of 10- and 25-grams, and which prepares n-grams for step 4. (This is not essential, but valuable for studying Euro-English phraseology and comparing it to English, e.g. from my BNC-based Phrases in English site, http://pie.usna.edu .)

    4. kfNgramDB, which imports the output of steps 2 and 3 into a MySQL database for further study. It supplies default DB schemas, generates default queries, and models more sophisticated queries for those willing to tinker a bit with SQL. It also downloads and imports datasets from PIE.
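
    For anyone who would rather script this than use the Windows tools, the filtering in steps 2 and 3 can be approximated in a few lines of Python. This is only a rough sketch of the idea, not the actual kfWinnow or kfNgram logic: the function names, the tokenizer, and the word-count thresholds are all illustrative assumptions, and duplicate detection here only catches pages whose lowercased word sequences are identical.

```python
import hashlib
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic tokenizer)."""
    return WORD_RE.findall(text.lower())


def winnow(pages, min_words=50, max_words=5000):
    """Keep pages within the word-count range, dropping exact duplicates.

    The thresholds are placeholders; suitable values depend on the corpus.
    """
    seen = set()
    kept = []
    for text in pages:
        words = tokenize(text)
        if not (min_words <= len(words) <= max_words):
            continue  # too short (noisy) or too long (likely repetitive)
        digest = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same word sequence as an earlier page
        seen.add(digest)
        kept.append(text)
    return kept


def repeated_ngrams(text, n=10):
    """Return the n-grams that occur more than once in a single document."""
    words = tokenize(text)
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in counts.items() if count > 1}
```

    A document for which repeated_ngrams(text, n=10) or repeated_ngrams(text, n=25) returns many entries is a candidate HRD and probably worth inspecting by hand before it goes into the corpus.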

    Looking forward to seeing your students' results eventually, whatever tools they use!

    Regards,
    Bill Fletcher

    PS My favorite Euro-English word is _beamer_ 'LCD data / video projector'.

    >>> Eric Atwell <eric@comp.leeds.ac.uk> 03/02/06 6:00 AM >>>
    My intuition is that, in addition to some "pan-European (except-UK)" English terms,
    as suggested by Harold, there will be national variants of English with
    local L1-inspired vocabulary and usages.

    I have just set my final-year undergrad Computing class a coursework challenge,
    "Finding English terms specific to a domain on the World Wide Web",
    where "domain" here means a national top-level domain like .DE or .UK
    - the 85 students in the class each have to study WWW-English in a different
    country, and many have signed up for European nations.
    So, I should have some answers for you after 24 March when the courseworks
    have to be submitted!

    Eric Atwell, School of Computing, Leeds University

    PS CORPORA readers are welcome to send me advice or tips to pass on to
    my students, esp on appropriate technologies they can use (so they
    don't have to write the programs themselves!) - the coursework outline is

    http://www.comp.leeds.ac.uk/eric/db32cw.doc

    On Thu, 2 Mar 2006, Parveen Lallmamode wrote:

    > Has anyone of you here ever heard of a 'Standard European English'? If yes:
    >
    > - What are its characteristics?
    > - Which researcher added that 'English' to the World Englishes?
    > - How does it differ from the 'Standard British English'?
    > - Where can I read more about it?
    >
    > Thanking you all in advance.

    -- 
    Eric Atwell, Senior Lecturer, Language research group, School of Computing,
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric
    


