Re: [Corpora-List] fast string replacement

From: Damon Allen Davison (allolex@gmail.com)
Date: Mon Mar 14 2005 - 12:47:35 MET


    Dear Jörg,

    In this case, you'll probably be happier having your lookup dictionary
    live in a database because access is faster. You can still use a
    scripting language like Perl to do the glue work for you, but it's
    conceivable to do this entirely in SQL. In our collocations dictionary
    project
    (http://www.romanistik.uni-koeln.de/home/blumenthal/colloc-en.shtml)
    we have a very large corpus collection, which is stored in a MySQL
    database with one record per token. I have written a multiword unit
    tagger in Perl and SQL that works like this:

    Given a corpus stored in a MySQL database with one token per record
    (in numerical order, using fldTokenID as a counter and fldToken as the
    actual token) and a multiword lookup table in which the order of each
    multiword unit's elements is clearly marked, the tagger does the
    following for each multiword unit (MWU):

    1. Read in a record of my multiword unit lookup table.

    2. Use the *final* element of the MWU to create a temporary table with
    all occurrences of that element. I start from the final element
    because I wrote the tagger for French, where the initial element of an
    MWU is often a preposition or another highly frequent part of speech,
    so starting there would make the first table needlessly large.

    3. Use the second-to-last MWU element to create a new temporary table
    with the following SQL code:

            # Keep only the hits whose token at the position stored in the
            # previous table matches $element, then move one position left.
            $query = "CREATE TABLE mwu_$index ";
            $query .= 'SELECT @a:=(a.fldTokenID-1) AS fldTokenID ';
            $query .= "FROM mwu_$previous_index a ";
            $query .= "INNER JOIN $tablename b ";
            $query .= 'USING(fldTokenID) ';
            $query .= "WHERE b.fldToken = \"$element\"";
            # You can also use lemmata; token was just more expedient for me.

    4. Repeat this until you run out of MWU elements.
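
    The steps above can be sketched end to end as follows. This is only
    an illustration, not the tagger itself: it swaps Perl and MySQL for
    Python's built-in sqlite3 module so that it runs without a server,
    and the names build_demo_corpus, tag_mwu and the table "corpus" are
    made up for the example. It also assumes that the first temporary
    table already stores the position *before* each hit, which is what
    the join in step 3 seems to expect.

```python
import sqlite3


def build_demo_corpus():
    """Create a tiny in-memory corpus with one token per record (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (fldTokenID INTEGER PRIMARY KEY, fldToken TEXT)")
    tokens = "il a mis en place le plan en place de la mise en place".split()
    conn.executemany("INSERT INTO corpus VALUES (?, ?)", list(enumerate(tokens, start=1)))
    return conn


def tag_mwu(conn, corpus_table, elements):
    """Return the fldTokenID at which each occurrence of the MWU starts.

    corpus_table is interpolated into the SQL, so it must be a trusted name.
    """
    cur = conn.cursor()

    # Step 2: seed with the position just before each hit of the final element.
    cur.execute("DROP TABLE IF EXISTS mwu_0")
    cur.execute("CREATE TEMP TABLE mwu_0 (fldTokenID INTEGER)")
    cur.execute(
        f"INSERT INTO mwu_0 SELECT fldTokenID - 1 FROM {corpus_table} WHERE fldToken = ?",
        (elements[-1],),
    )

    # Steps 3 and 4: walk the remaining elements from right to left.
    for index, element in enumerate(reversed(elements[:-1]), start=1):
        cur.execute(f"DROP TABLE IF EXISTS mwu_{index}")
        cur.execute(f"CREATE TEMP TABLE mwu_{index} (fldTokenID INTEGER)")
        cur.execute(
            f"INSERT INTO mwu_{index} "
            f"SELECT a.fldTokenID - 1 FROM mwu_{index - 1} a "
            f"INNER JOIN {corpus_table} b ON a.fldTokenID = b.fldTokenID "
            f"WHERE b.fldToken = ?",
            (element,),
        )

    # The last table holds (start - 1) for every complete match.
    last = len(elements) - 1
    rows = cur.execute(f"SELECT fldTokenID + 1 FROM mwu_{last} ORDER BY 1").fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_corpus()
    print(tag_mwu(demo, "corpus", ["mis", "en", "place"]))
```

    On the demo corpus, "mis en place" is found only at token 3; the
    later "mise en place" is rejected in the last round because "mise"
    does not match "mis".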

    You can make this algorithm more efficient by bundling the entire MWU
    into a single statement, which saves you the trouble of building
    temporary tables. I was pressed for time, so I stopped developing the
    program once the algorithm worked, but that is how you would make it
    faster.
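
    One way the single-statement variant might look is a self-join of
    the corpus table, one alias per MWU element, with each alias pinned
    to a fixed offset from the first. Again a sketch under assumptions:
    it uses Python's sqlite3 instead of Perl and MySQL, the helper names
    (build_demo_corpus, build_mwu_query, find_mwu) and the table
    "corpus" are invented for the example, and I have not benchmarked
    it against the temporary-table version.

```python
import sqlite3


def build_demo_corpus():
    """Create a tiny in-memory corpus with one token per record (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (fldTokenID INTEGER PRIMARY KEY, fldToken TEXT)")
    tokens = "il a mis en place le plan en place de la mise en place".split()
    conn.executemany("INSERT INTO corpus VALUES (?, ?)", list(enumerate(tokens, start=1)))
    return conn


def build_mwu_query(corpus_table, n):
    """Build one SELECT that self-joins the corpus n times, one alias per element.

    corpus_table is interpolated into the SQL, so it must be a trusted name.
    """
    parts = [f"SELECT t0.fldTokenID FROM {corpus_table} t0"]
    for i in range(1, n):
        # Alias t{i} must sit exactly i positions after the first element.
        parts.append(
            f"INNER JOIN {corpus_table} t{i} ON t{i}.fldTokenID = t0.fldTokenID + {i}"
        )
    conditions = " AND ".join(f"t{i}.fldToken = ?" for i in range(n))
    parts.append(f"WHERE {conditions} ORDER BY t0.fldTokenID")
    return " ".join(parts)


def find_mwu(conn, corpus_table, elements):
    """Return the fldTokenID at which each occurrence of the MWU starts."""
    sql = build_mwu_query(corpus_table, len(elements))
    return [row[0] for row in conn.execute(sql, tuple(elements))]


if __name__ == "__main__":
    demo = build_demo_corpus()
    print(find_mwu(demo, "corpus", ["en", "place"]))
```

    The tokens are passed as bound parameters rather than pasted into
    the SQL string, which also avoids quoting trouble with tokens that
    contain quotation marks. An index on fldToken would presumably help
    the database pick the rarest element first, but that is a guess.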

    The output table with the locations of the MWU in the corpus was very
    useful to us, since we wanted to be able to use the corpus and our
    statistics both with and without consideration of the MWU.

    A detailed description (in German) of our methods for extracting
    collocations is available in the journal Zeitschrift für Romanische
    Philologie [2005; 121 (1)], "Kombinatorische Wortprofile und
    Profilkontraste. Berechnungsverfahren und Anwendungen" by my
    colleagues Peter Blumenthal (project director), Sascha Diwersy, and
    Jörg Mielebacher.

    Warm Regards,

    Damon

    PS: Vim rules! ;)

    -- 
    

    Damon Allen Davison http://allolex.net


