One of the major assets of statistical approaches in NLP is their
robustness to errors in the training data. I was wondering whether anybody
has done research on the effect of error rates on the performance of
the trained system. I can imagine that with a large training set,
an error rate of 3-5% would not really make a difference, but if the
training set is rather small, things might be different.
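To make the intuition concrete, here is a small self-contained sketch of the kind of experiment I have in mind (all of it hypothetical: a synthetic 1-D two-class task and a nearest-centroid classifier of my own choosing, not taken from any published work). It flips a fraction of the training labels to simulate annotation errors and compares test accuracy for a small versus a large training set:

```python
import random

def make_data(n, rng):
    """Synthetic binary task: label y uniform in {0, 1}, feature x ~ N(2y, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

def flip_labels(data, rate, rng):
    """Simulate annotation errors: flip each label with probability `rate`."""
    return [(x, 1 - y) if rng.random() < rate else (x, y) for x, y in data]

def train_centroids(train):
    """Per-class mean of the feature -- a minimal 'trained system'."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    # max(..., 1) guards against a class being absent from a tiny sample
    return {c: sums[c] / max(counts[c], 1) for c in (0, 1)}

def accuracy(centroids, test):
    """Classify each test point by the nearest class centroid."""
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test)

def mean_accuracy(train_size, noise_rate, trials=30):
    """Average test accuracy over independent resamplings."""
    accs = []
    for t in range(trials):
        rng = random.Random(1000 + t)
        train = flip_labels(make_data(train_size, rng), noise_rate, rng)
        test = make_data(2000, rng)
        accs.append(accuracy(train_centroids(train), test))
    return sum(accs) / trials

if __name__ == "__main__":
    for size in (20, 2000):
        for noise in (0.0, 0.05):
            print(f"train={size:5d}  noise={noise:.2f}  "
                  f"acc={mean_accuracy(size, noise):.3f}")
```

Of course a toy centroid model is far more noise-tolerant than real NLP learners, so this only illustrates the shape of the question; the interesting empirical work would do the same with realistic systems and corpora.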
I am aware of Walter Daelemans's publications on memory-based learning,
where leaving out dubious cases causes a drop in performance. Has
anybody else done work on this?
Any hints would be appreciated.
Thanks in advance,
Sandra