Re: [Corpora-List] Comparing files

From: Tony Abou-Assaleh (taa@cs.dal.ca)
Date: Sun Nov 16 2003 - 00:25:06 MET

Next message: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"

Previous message: radev@umich.edu: "Re: [Corpora-List] Comparing files"
In reply to: radev@umich.edu: "Re: [Corpora-List] Comparing files"
Reply: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

If the files are really big, sorting them first and then merging using a
merge sort is probably the fastest. Sorting is O(N lg N) and merging is
O(N). Alternatively, you could concatenate, sort, and then if needed
traverse.

However, 40K and 70K words isn't really that much, especially if you don't
need to repeat every few seconds. So in your case, Drago's method is the
simplest and would be fast enough.

TAA

--------------------------------------------------
Tony Abou-Assaleh
Ph.D. Candidate, Faculty of Computer Science
Dalhousie University, Halifax, NS, Canada, B3H 1W5
Fax: 902-492-1517
Email: taa@acm.org
WWW: http://www.cs.dal.ca/~taa/
---------------------[THE END]--------------------

On Sat, 15 Nov 2003 radev@umich.edu wrote:

> Here is a UNIX script:
>
> % sort one | uniq > one.uniq
> % sort two | uniq > two.uniq
> % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
>
> Here is an example
>
> one:
> ==========
> cat
> dog
> cat
> mouse
>
> two:
> ==========
> cat
> rabbit
> elephant
> rabbit
>
> output:
> ==========
> 3 cat
> 2 mouse
> 2 dog
> 1 rabbit
> 1 elephant
>
>
> Words with a count of 3 appear in both "one" and "two".
> Words with a count of 2 appear in "one" only.
> Words with a count of 1 appear in "two" only.
>
> --
> Drago
>
>
> Miles Osborne wrote:
> >
> > that's far too slow -use a hash table instead.
> >
> > now, this wouldn't be homework, would it?
> >
> > Miles
> >
> > Quoting Otto Lassen <otto@lassen.mail.dk>:
> >
> > > Hi
> > > That could be done in any language:
> > > 1. sort then two lists
> > > 2. compare them word for word
> > > 3. output words which are not in both lists
> > > Regards
> > > Otto Lassen
> > >
> > > At 21:54 15-11-2003 +0100, you wrote:
> > > >Hi,
> > > >
> > > >I'm doing a project that involves comparing two very large word lists
> > >
> > > >(~40.000 and 70.000 words). What I need to find out, is which words are
> > > on
> > > >one list and not on the other (and/or vice versa).
> > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe
> > > a
> > > >perl script?)
> > > >
> > > >Any help will be greatly appreciated.
> > > >Best,
> > > >Tine Lassen
> > >
> > >
> >
> >
>
>
> --
> Dragomir R. Radev radev@umich.edu
> Assistant Professor of Information, Electrical Engineering and
> Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev
>

Next message: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"
Previous message: radev@umich.edu: "Re: [Corpora-List] Comparing files"
In reply to: radev@umich.edu: "Re: [Corpora-List] Comparing files"
Reply: Preslav Ivanov Nakov: "Re: [Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 00:28:21 MET