[Corpora-List] Criteria to Building a Corpus for Text Classification

From: Mohsen Al-Thubaity (althubaity@gmail.com)
Date: Thu Jun 15 2006 - 17:56:46 MET DST


    Hi all

    My sincere thanks to Ylva, Eric and Ozlem for their responses. All responses
    are included in this e-mail.

    What I mean by "text classification" is "a program or algorithm to decide
    what genre or domain a text document belongs to".

    Actually, I am aware of text size.

    Is it possible to have different text sizes, ranging from 100 words to
    several thousand words?

    Governmental reports, as an example, have this variation in text size.

    Newspaper articles do not have this variation.
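One way to see how much text size varies within each class is to summarise word counts per label. The following is only a minimal sketch: the function name `length_summary`, the `(label, text)` pair layout, and the whitespace-based word count are assumptions for illustration, and whitespace splitting is a rough approximation for Arabic.

```python
from statistics import median


def length_summary(corpus):
    """Summarise document lengths (in whitespace-separated words) per class.

    `corpus` is assumed to be an iterable of (label, text) pairs; this
    layout is hypothetical and not taken from the thread.
    """
    lengths = {}
    for label, text in corpus:
        # Whitespace splitting is a crude word count, used only as a proxy.
        lengths.setdefault(label, []).append(len(text.split()))
    return {
        label: {
            "docs": len(counts),
            "min": min(counts),
            "median": median(counts),
            "max": max(counts),
        }
        for label, counts in lengths.items()
    }
```

Comparing the minimum, median and maximum per class shows whether one class (for example governmental reports) spans a much wider range of sizes than another (for example newspaper articles).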

    Best wishes

    ____________________________________________________________________

    On 15/06/06, Mohsen Al-Thubaity <althubaity@gmail.com> wrote:

    Hi all

    I am working on a research project investigating Arabic text classification.

    The first part of this project requires building a corpus to train and test
    the classifier.

    Are there any criteria or standards that must be followed to build such a
    corpus?

    Any suggestions or references are most appreciated.
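Whatever selection criteria are chosen, a corpus meant for both training and testing is commonly divided so that every class appears in both parts. The following is a minimal sketch of such a per-class (stratified) split, not a recommended standard: the function name `stratified_split`, the `(label, text)` pair layout, and the default 20% test share are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(corpus, test_fraction=0.2, seed=0):
    """Split (label, text) pairs into train and test sets, class by class.

    The pair layout and the default test fraction are hypothetical choices.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for label, text in corpus:
        by_label[label].append(text)

    train, test = [], []
    for label, texts in by_label.items():
        shuffled = texts[:]
        rng.shuffle(shuffled)
        # Keep at least one test document per class when the class has
        # more than one document; a single document stays in training.
        n_test = max(1, round(len(shuffled) * test_fraction)) if len(shuffled) > 1 else 0
        test.extend((label, t) for t in shuffled[:n_test])
        train.extend((label, t) for t in shuffled[n_test:])
    return train, test
```

Splitting within each class keeps the class proportions of the test set close to those of the whole corpus, which matters when some genres have far fewer documents than others.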

    Best wishes

    Mohsen

    --------------------------------------------------------------------------------------

    On 15/06/06, Ylva Berglund <ylva.berglund@oucs.ox.ac.uk> wrote:

    Dear Mohsen,

    Selection of texts for a (training) corpus is a very complex and
    important issue. Unfortunately I don't think there are any hard and fast
    rules defining what to include. You would have to consider not only what
    kind of text classes there are and what would be suitable examples of
    these, but also what is available to you (text resources as well as
    time, money, expertise etc). Some issues relating to corpus creation
    (including text selection) are discussed in the fairly recent book:
    'Developing Linguistic Corpora: A Guide to Good Practice' which is
    available online at
    http://www.ahds.ac.uk/creating/guides/linguistic-corpora/ (hard copies
    from Oxbow books: http://www.oxbowbooks.com/bookinfo.cfm/ID/32969 ).
    Maybe that can be of use to you.

    Good luck with your project.

    -- Ylva

    On 15/06/06, Eric Atwell <eric@comp.leeds.ac.uk> wrote:

    Mohsen,

    You don't say what you mean by "text classification" - do you mean you
    are developing a program or algorithm to decide what genre or domain
    a text document belongs to? Or are you trying to develop a set of
    genres which cover needs of Arabic corpus linguistics? Or something
    else?

    My colleague Latifa Al-Sulaiti and I have looked into text-types or
    genres which Arabic language teachers and language engineers would like
    to see in a Corpus of Contemporary Arabic, see

    Al-Sulaiti, Latifa; Atwell, Eric. The Design of a Corpus of Contemporary
    Arabic. To appear in International Journal of Corpus Linguistics,
    vol.11, 2006. [Preprint at http://www.comp.leeds.ac.uk/eric/rae/ ]

    Another colleague, Serge Sharoff, has developed a set of text
    classification categories which he has demonstrated apply to
    100-million-word corpora covering a range of languages, see
    http://www.comp.leeds.ac.uk/ssharoff/

    - I believe he has a paper forthcoming on this topic; you will have to
      ask him directly for a preprint.

    Please let me have any publication(s) you have on your work; I would
    like to find out more, as we have interests in common.

    regards

    Eric Atwell

    -------------------------------------------------------------------------------------------------------


