Re: [Corpora-List] Comparing files

From: Bob Krovetz (bkrovetz@askjeeves.com)
Date: Sun Nov 16 2003 - 10:00:00 MET

Next message: Vlado Keselj: "Re: [Corpora-List] Comparing files"

Previous message: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"
Maybe in reply to: Tine Lassen: "[Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

For files that are this small, the time difference between O(n) and O(nlogn)
isn't a big deal on typical PC's. I would also recommend looking at the
"comm"
command in Unix/Linux. With the comm command you would only have to do two
sorts (one for each file).

>If you have enough memory to fit the smaller one there (and for any
reasonable wordlist >size I can imagine, you should have), a hash table is
much better than sorting - it is >linear in the size of both files.

Preslav

----- Original Message -----
From: "Tony Abou-Assaleh" <taa@cs.dal.ca>
To: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen"
<otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>;
<CORPORA@HD.UIB.NO>; <radev@umich.edu>
Sent: Saturday, November 15, 2003 3:25 PM
Subject: Re: [Corpora-List] Comparing files

> If the files are really big, sorting them first and then merging using
> a merge sort is probably the fastest. Sorting is O(N lg N) and merging
> is O(N). Alternatively, you could concatenate, sort, and then if
> needed traverse.
>
> However, 40K and 70K words isn't really that much, especially if you
> don't need to repeat every few seconds. So in your case, Drago's
> method is the simplest and would be fast enough.
>
> TAA
>
> --------------------------------------------------
> Tony Abou-Assaleh
> Ph.D. Candidate, Faculty of Computer Science
> Dalhousie University, Halifax, NS, Canada, B3H 1W5
> Fax: 902-492-1517
> Email: taa@acm.org
> WWW: http://www.cs.dal.ca/~taa/ <http://www.cs.dal.ca/~taa/>
> ---------------------[THE END]--------------------
>
>
> On Sat, 15 Nov 2003 radev@umich.edu wrote:
>
> > Here is a UNIX script:
> >
> > % sort one | uniq > one.uniq
> > % sort two | uniq > two.uniq
> > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr >
> > output
> >
> > Here is an example
> >
> > one:
> > ==========
> > cat
> > dog
> > cat
> > mouse
> >
> > two:
> > ==========
> > cat
> > rabbit
> > elephant
> > rabbit
> >
> > output:
> > ==========
> > 3 cat
> > 2 mouse
> > 2 dog
> > 1 rabbit
> > 1 elephant
> >
> >
> > Words with a count of 3 appear in both "one" and "two". Words with a
> > count of 2 appear in "one" only. Words with a count of 1 appear in
> > "two" only.
> >
> > --
> > Drago
> >
> >
> > Miles Osborne wrote:
> > >
> > > that's far too slow -use a hash table instead.
> > >
> > > now, this wouldn't be homework, would it?
> > >
> > > Miles
> > >
> > > Quoting Otto Lassen <otto@lassen.mail.dk>:
> > >
> > > > Hi
> > > > That could be done in any language:
> > > > 1. sort then two lists
> > > > 2. compare them word for word
> > > > 3. output words which are not in both lists
> > > > Regards
> > > > Otto Lassen
> > > >
> > > > At 21:54 15-11-2003 +0100, you wrote:
> > > > >Hi,
> > > > >
> > > > >I'm doing a project that involves comparing two very large word
lists
> > > >
> > > > >(~40.000 and 70.000 words). What I need to find out, is which
> > > > >words
are
> > > > on
> > > > >one list and not on the other (and/or vice versa).
> > > > >Can anyone give me a hint as to how to do this? (I was
> > > > >thinking;
maybe
> > > > a
> > > > >perl script?)
> > > > >
> > > > >Any help will be greatly appreciated.
> > > > >Best,
> > > > >Tine Lassen
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > Dragomir R. Radev
radev@umich.edu
> > Assistant Professor of Information, Electrical Engineering and
> > Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> > Phone: 734-615-5225 Fax: 734-764-2475
http://www.si.umich.edu/~radev <http://www.si.umich.edu/~radev>

Next message: Vlado Keselj: "Re: [Corpora-List] Comparing files"
Previous message: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"
Maybe in reply to: Tine Lassen: "[Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 11:12:57 MET