[Corpora-List] ACL-DCI and BLLIP corpora

From: David Brooks (D.J.Brooks@cs.bham.ac.uk)
Date: Tue Apr 11 2006 - 15:06:06 MET DST

    Dear All,

    Until very recently, I was under the impression that the sole
    distributions of Penn Treebank data were to be found in the Treebank
    projects at the LDC. However, I've been made aware that certain subsets
    of the data are also available through two other LDC projects: ACL-DCI
    and BLLIP. I'm looking into obtaining one or both of these corpora, but
    would like some advice as to their content, as the online descriptions
    are not all that thorough.

    Ideally, I'd like to get hold of the ATIS and Wall Street Journal
    corpora in PTB parsed format, for the purpose of parser evaluation. Now,
    ACL-DCI claims to have some Penn Treebank material (though I do not
    know if that covers ATIS), and some WSJ material. Does anyone know if
    the WSJ material is parsed in PTB format? Does that include the now
    infamous Sections 1-23 used in parser evaluation? Otherwise, can anyone
    tell me what the PTB datasets are, relative to the Treebank projects?

    If the ACL-DCI does not contain parsed WSJ material, does the BLLIP
    corpus contain the data I am looking for?

    Many thanks,
    David

    -- 
    David Brooks
    http://www.cs.bham.ac.uk/~djb
    


