[Corpora-List] string frequency reports for Project Gutenberg texts

From: Ronald Reck (rreck@iama.rrecktek.com)
Date: Mon Jul 08 2002 - 14:50:32 MET DST

Hello all,

I have created string frequency
reports for 5400+ books (400M words)
from Project Gutenberg:
http://iama.rrecktek.com/text/frequency/

They are searchable here:
http://iama.rrecktek.com/cgi-bin/apps/wordfind/searchpg.pl

The process is described briefly here, with links to
all the source in CVS:
http://iama.rrecktek.com/text/
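
For readers who want a feel for what a string frequency report involves,
here is a minimal Python sketch. It is not the code in CVS. The tokenizing
rule (lowercased runs of letters, digits and apostrophes), the
"count<TAB>string" line layout, and the function names are all assumptions
made for illustration.

```python
import re
from collections import Counter


def string_frequencies(text):
    """Count case-folded word-like strings in one text.

    Assumption: a "string" is a run of letters, digits or apostrophes.
    The real reports may split the text differently.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def format_report(counts, limit=None):
    """Return 'count<TAB>string' lines, most frequent first.

    Ties are broken alphabetically so the output is stable.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{n}\t{s}" for s, n in ranked[:limit]]


sample = "The whale, the white whale! The sea."
report = format_report(string_frequencies(sample))
for line in report:
    print(line)
```

Running the same two functions over each book in turn would give one
report per text.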

I am looking for help improving
the string frequency histograms across the archive
when they are rendered as SVG:
http://iama.rrecktek.com/text/frequency/words/seeall.html
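
One possible starting point for labeled SVG histograms is sketched below
in Python. It is only an illustration under assumed conventions: the
function name, the bar sizes and the layout are hypothetical and are not
taken from the existing graphs. Each bar gets a text label underneath and
a <title> element holding the exact count.

```python
from xml.sax.saxutils import escape


def svg_bar_chart(counts, bar_width=40, height=200, gap=10):
    """Build an SVG bar chart from a list of (label, count) pairs.

    Bars are scaled so the largest count fills the full height.
    All sizes are in SVG user units (assumed values, adjust freely).
    """
    peak = max((n for _, n in counts), default=0) or 1
    width = len(counts) * (bar_width + gap) + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height + 20}">'
    ]
    for i, (label, n) in enumerate(counts):
        bar_h = round(height * n / peak)
        x = gap + i * (bar_width + gap)
        y = height - bar_h
        safe = escape(label)
        # The <title> child carries the exact value for each bar.
        parts.append(
            f'<rect x="{x}" y="{y}" width="{bar_width}" height="{bar_h}">'
            f"<title>{safe}: {n}</title></rect>"
        )
        # Label centered under the bar.
        parts.append(
            f'<text x="{x + bar_width // 2}" y="{height + 15}" '
            f'text-anchor="middle" font-size="10">{safe}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


chart = svg_bar_chart([("the", 3), ("whale", 1)])
print(chart)
```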

I merged some of the results into an SVG
(it's worth the plugin hassle):
http://iama.rrecktek.com/~rreck/samplesvg

I also extended the DAML ontology for PG presented here:
http://www.daml.org/ontologies/113

and created RDF metadata for the archive here:
http://iama.rrecktek.com/text/frequency/meta/

The metadata is loaded into a specialty RDF backend called
Parka. This example query shows how to get RF values for an
author's use of certain strings:
http://iama.rrecktek.com/cgi-bin/apps/parka/parka.pl
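
The Parka query itself is not reproduced here. As a rough illustration of
the kind of value it returns, the Python sketch below assumes that "RF"
means relative frequency: a string's count divided by the total number of
strings across an author's books. That reading, the function name and the
sample counts are assumptions, and no RDF backend is involved.

```python
from collections import Counter


def relative_frequency(counts_per_book, strings):
    """Relative frequency of each string across an author's books.

    counts_per_book: one Counter of string counts per book.
    Assumption: RF = occurrences of the string / total strings.
    """
    total = Counter()
    for counts in counts_per_book:
        total.update(counts)
    n = sum(total.values())
    # Strings that never occur get 0.0; so does everything if n is 0.
    return {s: (total[s] / n if n else 0.0) for s in strings}


# Made-up counts for two books by one hypothetical author.
books = [
    Counter({"the": 6, "whale": 2}),
    Counter({"the": 1, "sea": 1}),
]
rf = relative_frequency(books, ["the", "whale", "ship"])
print(rf)
```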

Comments and criticisms are very much appreciated.
(I know the PNG graphs aren't labeled well; all of that will get fixed
in the SVGs.)

----
Ronald P. Reck                          rreck@iama.rrecktek.com
