Re: [Corpora-List] Corpus from Blogs required.

From: Jean-Phi (jpprost@gmail.com)
Date: Thu Mar 31 2005 - 00:35:11 MET DST

Next message: ELDA: "[Corpora-List] ELRA - Language Resources Catalogue - Update"

Previous message: Linguistic Data Consortium: "[Corpora-List] LDC Online and New Corpora"
In reply to: Trilok Khairnar: "[Corpora-List] Corpus from Blogs required."
Next in thread: Trilok Khairnar: "Re: [Corpora-List] Corpus from Blogs required."
Reply: Trilok Khairnar: "Re: [Corpora-List] Corpus from Blogs required."

Hi,

> In the absence of such corpus and APIs, I am thinking of doing this by
> 1] using RSS, ATOM feed parsers on some OPML files to get URLs for blog posts
> 2] Extracting the text (easier if the blog template format is known)

It might not be that easy: I suspect that many blogs use some sort of
Content Management System, which basically means that the texts are
stored in a database and are only rendered on the blog dynamically,
on request. In such cases my guess is that you'll need to know at
least a little about the database structure in order to query it,
unless, of course, the site provides you with an RSS feed. Or am I
missing something?
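
To illustrate the RSS route, here is a minimal sketch in Python of
pulling post titles, links and text out of a feed with the standard
library only. The sample feed, the example.org URLs and the function
name extract_posts are all made up for illustration; a real feed would
be fetched over HTTP, may be Atom rather than RSS 2.0, and often
carries only a summary rather than the full post.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/2005/03/first-post.html</link>
      <description>Hello, corpus builders.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2005/03/second-post.html</link>
      <description>More text here.</description>
    </item>
  </channel>
</rss>"""


def extract_posts(feed_xml):
    """Return one dict per <item>, with its title, link and text."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # <description> may hold only an excerpt of the post.
            "text": item.findtext("description", default=""),
        })
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_FEED):
        print(post["link"], "|", post["text"])
```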

Some blog hosting sites also couple the dynamic rendering with a
permanent HTML link for each text. http://www.blogger.com/ (now
owned by Google) provides both of these features: an RSS feed and
permanent links. I don't hold any shares, though...
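
Where a permanent HTML page is available for a post, the text can be
recovered without knowing the template, at the cost of also picking
up navigation and sidebar text. A rough sketch using Python's standard
html.parser module follows; the class name TextExtractor, the function
html_to_text and the sample page are assumptions for illustration, and
a real pipeline would still need template-specific filtering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page standing in for a downloaded permanent link.
SAMPLE_PAGE = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Title</h1><p>Some text here.</p>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    print(html_to_text(SAMPLE_PAGE))
```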

Cheers,

--
  Jean-Philippe Prost
    Centre for Language Technology
    Macquarie University ~ Sydney, Australia
and
    Laboratoire Parole et Langage (Speech & Language Lab.)
    Université de Provence ~ Aix-en-Provence, France
<http://www.ics.mq.edu.au/~jpprost/>
_______________________________________________


This archive was generated by hypermail 2b29 : Tue May 31 2005 - 00:52:12 MET DST