RE: [Corpora-List] Word frequencies in English, French, German, Spanish, Dutch, Italian and Portuguese

From: Mark Davies (Mark_Davies@byu.edu)
Date: Mon Feb 12 2007 - 21:42:46 MET


    For Spanish, you might consult the Routledge Frequency Dictionary of Spanish, which came out in early 2006. It contains the top 5000 lemmas in Spanish, and is based on 20 million words from the late 1900s -- 1/3 spoken, 1/3 fiction, 1/3 non-fiction -- in the Corpus del Español (http://www.corpusdelespanol.org).

    >> You can get word frequencies lists for the Portuguese language in
    >> Linguateca (http://www.linguateca.pt/), for instance, here:
    >> http://acdc.linguateca.pt/acesso/tokens/tokens.todos (token list)
    >> http://acdc.linguateca.pt/acesso/tokens/lemas.todos (lemma list)

    For Portuguese, you might also consult the Corpus do Português:

         http://www.corpusdoportugues.org

    You can get the top x word forms overall, by register, between registers, etc. The corpus has 45 million words, 20 million of them from the 1900s: 2m spoken, 6m fiction, 6m newspaper, and 6m academic, half from Portugal and half from Brazil. In late 2007, Routledge will publish a frequency dictionary based on this data, similar to the Spanish one noted above.

    Best,

    Mark Davies

    ============================================
    Mark Davies
    Professor of (Corpus) Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906
    Web: davies-linguistics.byu.edu

    ** Corpus design and use // Linguistic databases **
    ** Historical linguistics // Language variation **
    ** English, Spanish, and Portuguese **
    ============================================


