[Corpora-List] string frequency reports for Project Gutenberg texts

From: Ronald Reck (rreck@iama.rrecktek.com)
Date: Mon Jul 08 2002 - 14:50:32 MET DST

    Hello all,

    I have created string frequency
    reports for 5400+ books (400M words)
    from Project Gutenberg:
    http://iama.rrecktek.com/text/frequency/
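
    For readers who want a feel for what such a report contains, here is
    a minimal Python sketch of counting string frequencies in one text.
    It is not the code behind the reports above (that source is linked
    from the CVS page below). The tokenization rule, the function name
    string_frequencies and the sample sentence are all assumptions made
    for illustration.

```python
import re
from collections import Counter


def string_frequencies(text):
    """Count occurrences of each lowercase alphabetic string in text.

    Assumption: a "string" is a run of letters or apostrophes; the real
    reports may tokenize differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, standing in for a full Project Gutenberg book.
sample = "The whale, the whale! Call me Ishmael."
freqs = string_frequencies(sample)

# Print the strings from most to least frequent.
for word, count in freqs.most_common():
    print(word, count)
```

    Running the same function over every book and summing the counters
    would give archive-wide totals.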

    They are searchable here:
    http://iama.rrecktek.com/cgi-bin/apps/wordfind/searchpg.pl

    The process is described briefly here, with links to
    all the source in CVS:
    http://iama.rrecktek.com/text/

    I am looking for help improving
    these string frequency histograms across the archive
    once they are rendered in SVG:
    http://iama.rrecktek.com/text/frequency/words/seeall.html

    I merged some of the results into an SVG
    (it's worth the plugin hassle):
    http://iama.rrecktek.com/~rreck/samplesvg

    I also extended the DAML ontology for PG presented here:
    http://www.daml.org/ontologies/113

    and created RDF metadata for the archive here:
    http://iama.rrecktek.com/text/frequency/meta/

    The metadata is loaded into a specialty RDF backend called
    Parka. This example query shows how to get RF values for an
    author's use of certain strings:
    http://iama.rrecktek.com/cgi-bin/apps/parka/parka.pl
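
    As a rough illustration of the kind of number such a query returns,
    the Python sketch below computes a per-string value from pooled
    counts for one author. It reads "RF" as relative frequency (a
    string's count divided by the total count), which is an assumption.
    It does not use Parka or its query language, and the function name
    and the counts are hypothetical.

```python
from collections import Counter


def relative_frequency(counts, string):
    """Return count(string) / total count, or 0.0 for an empty counter.

    Assumption: "RF" means relative frequency in this sense.
    """
    total = sum(counts.values())
    return counts[string] / total if total else 0.0


# Hypothetical pooled counts for all of one author's books.
author_counts = Counter({"the": 6, "whale": 3, "sea": 1})

rf = relative_frequency(author_counts, "whale")
print(f"whale: {rf:.2f}")
```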

    Comments and criticisms are much appreciated.
    (I know the PNG graphs aren't labeled well; that will all get fixed
    in the SVGs.)

    ----
    Ronald P. Reck                          rreck@iama.rrecktek.com
    


