Re: [Corpora-List] free tagged corpus

From: Kristofer Franzén (franzen@sics.se)
Date: Thu Nov 17 2005 - 20:32:29 MET

    In what language?

    /Kristofer Franzén

    Delip Rao wrote:

    >Dear Martin/All,
    >
    >By "free" I meant $0, not "freedom". As a research
    >student I would be willing to comply with the
    >legal/ethical restrictions etc. Most standard
    >literature at good conferences uses corpora from
    >sources like the LDC, which are not available free of cost.
    >If my organization is not a member of the LDC then I would
    >not have access to these. Are there any free-of-cost
    >PoS-tagged corpora for experimentation that are well
    >accepted by the research community?
    >
    >Thanks,
    >Delip
    >
    >--- Martin Wynne <martin.wynne@oucs.ox.ac.uk> wrote:
    >
    >>Dear Delip,
    >>
    >>It depends on what you mean by 'freely available'.
    >>This has (at least)
    >>two meanings in this context. It can mean free of
    >>cost, or it can mean
    >>free of legal or ethical restrictions on its use.
    >>
    >>Many corpora do not cost money to use, although
    >>the ones mentioned
    >>so far in this thread, such as the BNC and resources
    >>from the LDC, do
    >>cost money.
    >>
    >>As for legal and ethical restrictions, it may be
    >>useful to look at what
    >>they say in the world of software, where several
    >>levels of freedom can
    >>be differentiated:
    >>
    >> * The freedom to run the program, for any
    >>purpose (freedom 0).
    >> * The freedom to study how the program works,
    >>and adapt it to your
    >>needs (freedom 1). Access to the source code is a
    >>precondition for this.
    >> * The freedom to redistribute copies so you can
    >>help your neighbor
    >>(freedom 2).
    >> * The freedom to improve the program, and
    >>release your improvements
    >>to the public, so that the whole community benefits
    >>(freedom 3). Access
    >>to the source code is a precondition for this.
    >>
    >>(from http://www.gnu.org/philosophy/free-sw.html)
    >>
    >>With corpora, a parallel classification may be
    >>possible:
    >>
    >> * The freedom to access and analyse the corpus
    >>(freedom 0).
    >> * The freedom to run your own tools on the
    >>corpus, and adapt it to
    >>your needs (freedom 1). Access to the full text of
    >>the corpus is a
    >>precondition for this.
    >> * The freedom to redistribute copies so you can
    >>help your neighbor
    >>(freedom 2).
    >> * The freedom to add texts or metadata or
    >>annotations, and release
    >>your improvements to the public, so that the whole
    >>community benefits
    >>(freedom 3).
    >>
    >>In most cases, any of the above freedoms may be
    >>restricted by only
    >>allowing the relevant freedoms in the context of
    >>academic or
    >>non-commercial research, though the precise terms of
    >>these restrictions
    >>may vary, and the boundaries of non-commercial may
    >>not be easy to draw.
    >>
    >>Usually a corpus creator cannot simply release a
    >>corpus under terms of
    >>their choosing, allowing whichever of the above
    >>freedoms they want to,
    >>because they don't own the rights over all of the
    >>texts contained in the
    >>corpus. A corpus usually contains texts written or
    >>spoken by various
    >>people, and these people, or publishers, or
    >>employers, or others, are
    >>likely to have intellectual property rights over
    >>these texts.
    >>(Furthermore, the corpus builders acquire rights
    >>over the
    >>collection, but these may reside not in the
    >>individuals but in their
    >>institution or funders). To complicate things
    >>further, the relevant laws
    >>relating to these rights vary in different
    >>countries, and have varied
    >>over time.
    >>
    >>My colleague Lou Burnard asked a similar question on
    >>this list in
    >>January this year. You can see the start of the
    >>thread in the archive at
    >>
    >>http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0501&L=CORPORA&D=0&I=-3&P=13048
    >>
    >>He was surprised to find virtually nothing which
    >>could be distributed
    >>under something like an open source software
    >>licence.
    >>
    >>The simplest answer to this is that you have to say
    >>a bit more precisely
    >>what it is you want to be free to do with the
    >>corpus, and then maybe
    >>you'll get some more answers.
    >>
    >>Best wishes,
    >>Martin
    >>
    >>
    >>Delip Rao wrote:
    >>
    >>
    >>>Hello All,
    >>>
    >>>Is there any freely available part-of-speech tagged
    >>>corpus for research/non-commercial use?
    >>>
    >>>Thanks,
    >>>Delip Rao
    >>>-----------
    >>>AIDB LAB,
    >>>IIT MADRAS
    >>>
    >>--
    >>Martin Wynne
    >>Head of the Oxford Text Archive and
    >>AHDS Literature, Languages and Linguistics
    >>
    >>Oxford University Computing Services
    >>13 Banbury Road
    >>Oxford
    >>UK - OX2 6NN
    >>Tel: +44 1865 283299
    >>Fax: +44 1865 273275
    >>martin.wynne@oucs.ox.ac.uk
    >>
    >>
    >>



    This archive was generated by hypermail 2b29 : Thu Nov 17 2005 - 20:35:18 MET