[Corpora-List] Re: license question

From: Peter Halacsy (peter@halacsy.com)
Date: Sat Aug 19 2006 - 09:39:41 MET DST

    Alexander Paile wrote:
    > Hej Lars,
    > The corpus consists mainly of Finnish legislation texts and public
    > annual reports from different companies in Finland. I guess that could,
    > theoretically speaking, be a problem if somebody wants to be nasty. The
    > languages in question are Finnish and Swedish. When I was calling around
    > asking for material many companies just shrugged and sent me what they
    > had to get rid of me. I'm afraid of scaring them away if I start asking
    > them to sign papers. Most people don't know what a corpus is and they
    > couldn't care less. And they don't want to sign papers they don't
    > understand. On the other hand both the legislation texts and the company
    > reports are freely available and nobody probably ever thought of
    > licensing them in any way.
    >
    > What kind of corpus is it? Well, it's a sentence aligned Finnish-Swedish
    > parallel corpus of some 4 million words. The markup is CES XML. No
    > morphosyntactic tagging yet.
    >
    > Oh, by the way. The sentences in the corpus files don't even necessarily
    > come in the same order that they did in the original texts. I'm not sure
    > that has any legal implications. We are thinking LGPL.
    >
    > cheers
    >
    > Alexander Paile
    >

    Hi Alexander!

    (I sent a similar post to this list some months ago.)

    We distribute our parallel corpus under the CC Attribution license. The
    LGPL is written for software (for example, the license text refers to
    source code, which does not make sense for a text corpus).

    I think sentence shuffling solves your problem. It's fair use.

    Our copyright notice is:

    Some raw materials used for the Hunglish corpus are under copyright
    (literature, film subtitles, magazines). We prevented the illegal use of
    copyrighted material by shuffling the texts at sentence level. This form
    is still useful for research purposes, while it does not infringe upon
    the rightholders' interests. If you are a copyright holder, and you
    consider the shuffled files infringing, please send email and we will
    remove the material in question from the corpus.

    The Hunglish corpus is open for use (with the above restrictions) under
    a Creative Commons Attribution licence; refer to our publication.

    This method can be used for web corpora as well. No URL lists are needed.
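
    To make the idea concrete, here is a minimal sketch of sentence-level
    shuffling for an aligned corpus. It is not the script used for Hunglish;
    the function name shuffle_aligned, the seed parameter and the sample
    sentences are assumptions made up for illustration. The only point is
    that each aligned pair is kept together while the order of the pairs is
    randomized, so the original running text cannot be read off the files.

```python
import random


def shuffle_aligned(pairs, seed=None):
    """Return a new list with the aligned sentence pairs in random order.

    Each pair (source sentence, target sentence) stays intact; only the
    order of the pairs changes. The input list is not modified.
    """
    rng = random.Random(seed)   # private generator, global state untouched
    shuffled = list(pairs)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical Finnish-Swedish sample, standing in for real corpus data.
pairs = [
    ("Hyvää päivää.", "God dag."),
    ("Kiitos.", "Tack."),
    ("Näkemiin.", "Adjö."),
    ("Tervetuloa.", "Välkommen."),
]

for fi, sv in shuffle_aligned(pairs):
    print(fi, "|", sv)
```

    Note that passing a fixed seed makes the order reproducible, which is
    convenient for testing but would also let anyone who knows the seed and
    the procedure restore the original order, so for a release it is
    probably better to leave the seed unset.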

    peter


