[Corpora-List] 'imperfect' corpora

From: Hunter, Duncan (D.I.Hunter@warwick.ac.uk)
Date: Wed Nov 15 2006 - 21:58:07 MET

Next message: sciubba@uniroma3.it: "Re: [Corpora-List] transcribing video corpora"

Previous message: Hardie, Andrew: "RE: [Corpora-List] Re: transcribing video corpora"
Next in thread: Yannick Versley: "Re: [Corpora-List] 'imperfect' corpora"
Reply: Yannick Versley: "Re: [Corpora-List] 'imperfect' corpora"

Hi list members,

I have been given access to a large amount of data that has been OCR'd into plain-text (.txt) files. The data is extremely valuable for a number of reasons, and I would like to carry out, amongst other things, a keyword analysis. However, test runs with corpus investigation tools show a few problems with the reliability of the corpus caused by OCR errors (miscopied words, words fragmented across end-of-line boundaries, etc.).
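To make the end-of-line fragmentation problem concrete, here is a minimal Python sketch of one possible repair step. It rests on assumptions that may not hold for this data: that a fragmented word is marked by a trailing hyphen at the line end, and that a word list (the hypothetical `vocabulary` argument) is available to check the rejoined form. The function name `rejoin_split_words` is likewise only illustrative.

```python
import re

# Assumption: a word split over a line break is marked by a trailing hyphen,
# e.g. "frag-" at the end of one line and "ment" at the start of the next.
_SPLIT_WORD = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def rejoin_split_words(text, vocabulary):
    """Rejoin hyphen-split words, but only when the joined form is a known word.

    `vocabulary` is a hypothetical set of lower-cased word forms, for example
    taken from a clean reference corpus or a dictionary file.
    """
    def _join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            # Keep the line break, but move it after the rejoined word.
            return joined + "\n"
        # Unknown joined form: leave the text untouched (it may be a genuine
        # hyphenated compound such as "well-known").
        return match.group(0)

    return _SPLIT_WORD.sub(_join, text)
```

Checking the joined form against a word list is a deliberately conservative choice: it leaves doubtful cases unchanged, so the repair step itself does not introduce new errors into the corpus.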

How can valuable but 'imperfect' corpus data be utilised effectively? Does anyone have tips on how anomalous (but generally explicable) results can be described and accounted for in a principled, consistent manner?
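One way to describe the effect of OCR errors on a keyword analysis is to recompute a word's keyness score with and without its suspected error variants merged in. The sketch below assumes a log-likelihood keyness statistic of the kind commonly used in keyword tools; the post does not say which statistic or tool is involved, and all parameter names are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of a word in a study corpus vs. a reference corpus.

    freq_* are the word's frequencies, size_* are corpus sizes in tokens.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total

    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score
```

Calling the function once with the raw count of a word and once with the count after adding its OCR variants gives a rough measure of how far the errors shift that word's keyness, which could then be reported alongside the results.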

Duncan Hunter

This archive was generated by hypermail 2b29 : Wed Nov 15 2006 - 21:56:01 MET