Re: [Corpora-List] Comparing files

From: Bob Krovetz (bkrovetz@askjeeves.com)
Date: Sun Nov 16 2003 - 10:00:00 MET

  • Next message: Vlado Keselj: "Re: [Corpora-List] Comparing files"

    For files that are this small, the time difference between O(n) and O(nlogn)
    isn't a big deal on typical PC's. I would also recommend looking at the
    "comm"
    command in Unix/Linux. With the comm command you would only have to do two
    sorts (one for each file).
     
    >If you have enough memory to fit the smaller one there (and for any
    reasonable wordlist >size I can imagine, you should have), a hash table is
    much better than sorting - it is >linear in the size of both files.
     
    Preslav
     
    ----- Original Message -----
    From: "Tony Abou-Assaleh" <taa@cs.dal.ca>
    To: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen"
    <otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>;
    <CORPORA@HD.UIB.NO>; <radev@umich.edu>
    Sent: Saturday, November 15, 2003 3:25 PM
    Subject: Re: [Corpora-List] Comparing files
     
     
    > If the files are really big, sorting them first and then merging using
    > a merge sort is probably the fastest. Sorting is O(N lg N) and merging
    > is O(N). Alternatively, you could concatenate, sort, and then if
    > needed traverse.
    >
    > However, 40K and 70K words isn't really that much, especially if you
    > don't need to repeat every few seconds. So in your case, Drago's
    > method is the simplest and would be fast enough.
    >
    > TAA
    >
    > --------------------------------------------------
    > Tony Abou-Assaleh
    > Ph.D. Candidate, Faculty of Computer Science
    > Dalhousie University, Halifax, NS, Canada, B3H 1W5
    > Fax: 902-492-1517
    > Email: taa@acm.org
    > WWW: http://www.cs.dal.ca/~taa/ <http://www.cs.dal.ca/~taa/>
    > ---------------------[THE END]--------------------
    >
    >
    > On Sat, 15 Nov 2003 radev@umich.edu wrote:
    >
    > > Here is a UNIX script:
    > >
    > > % sort one | uniq > one.uniq
    > > % sort two | uniq > two.uniq
    > > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr >
    > > output
    > >
    > > Here is an example
    > >
    > > one:
    > > ==========
    > > cat
    > > dog
    > > cat
    > > mouse
    > >
    > > two:
    > > ==========
    > > cat
    > > rabbit
    > > elephant
    > > rabbit
    > >
    > > output:
    > > ==========
    > > 3 cat
    > > 2 mouse
    > > 2 dog
    > > 1 rabbit
    > > 1 elephant
    > >
    > >
    > > Words with a count of 3 appear in both "one" and "two". Words with a
    > > count of 2 appear in "one" only. Words with a count of 1 appear in
    > > "two" only.
    > >
    > > --
    > > Drago
    > >
    > >
    > > Miles Osborne wrote:
    > > >
    > > > that's far too slow -use a hash table instead.
    > > >
    > > > now, this wouldn't be homework, would it?
    > > >
    > > > Miles
    > > >
    > > > Quoting Otto Lassen <otto@lassen.mail.dk>:
    > > >
    > > > > Hi
    > > > > That could be done in any language:
    > > > > 1. sort then two lists
    > > > > 2. compare them word for word
    > > > > 3. output words which are not in both lists
    > > > > Regards
    > > > > Otto Lassen
    > > > >
    > > > > At 21:54 15-11-2003 +0100, you wrote:
    > > > > >Hi,
    > > > > >
    > > > > >I'm doing a project that involves comparing two very large word
    lists
    > > > >
    > > > > >(~40.000 and 70.000 words). What I need to find out, is which
    > > > > >words
    are
    > > > > on
    > > > > >one list and not on the other (and/or vice versa).
    > > > > >Can anyone give me a hint as to how to do this? (I was
    > > > > >thinking;
    maybe
    > > > > a
    > > > > >perl script?)
    > > > > >
    > > > > >Any help will be greatly appreciated.
    > > > > >Best,
    > > > > >Tine Lassen
    > > > >
    > > > >
    > > >
    > > >
    > >
    > >
    > > --
    > > Dragomir R. Radev
    radev@umich.edu
    > > Assistant Professor of Information, Electrical Engineering and
    > > Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    > > Phone: 734-615-5225 Fax: 734-764-2475
    http://www.si.umich.edu/~radev <http://www.si.umich.edu/~radev>



    This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 11:12:57 MET