Re: [Corpora-List] Corpus from Blogs required.

From: Jean-Phi (jpprost@gmail.com)
Date: Thu Mar 31 2005 - 00:35:11 MET DST

    Hi,

    > In the absence of such corpus and APIs, I am thinking of doing this by
    > 1] using RSS, ATOM feed parsers on some OPML files to get URLs for blog posts
    > 2] Extracting the text (easier if the blog template format is known)

    It might not be that easy: I suspect that many blogs use some sort of
    Content Management System, which basically means that the texts are
    stored in a database, and are only presented in the blog dynamically,
    on request. In such cases my guess is that you'll probably need to know
    at least a little about the database structure in order to query it
    (unless, of course, the site provides you with an RSS feed). Or am I
    missing something?
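
    For the feed route, a minimal sketch using only the Python standard
    library might look like the following. It assumes the OPML file lists
    feeds as <outline> elements with an xmlUrl attribute and that the feeds
    are plain RSS 2.0 with <item> elements; Atom feeds use namespaced
    <entry> elements and would need separate handling. The function names
    and the sample data are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, standing in for a real OPML file and RSS feed.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Blogroll</title></head>
  <body>
    <outline text="Linguistics">
      <outline text="Blog A" type="rss" xmlUrl="http://example.org/a/rss.xml"/>
      <outline text="Blog B" type="rss" xmlUrl="http://example.org/b/rss.xml"/>
    </outline>
  </body>
</opml>"""

SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Blog A</title>
    <item>
      <title>First post</title>
      <link>http://example.org/a/2005/03/first.html</link>
      <description>Text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/a/2005/03/second.html</link>
      <description>Text of the second post.</description>
    </item>
  </channel>
</rss>"""


def feed_urls_from_opml(opml_text):
    """Return the xmlUrl of every <outline> element that carries one."""
    root = ET.fromstring(opml_text)
    return [o.attrib["xmlUrl"] for o in root.iter("outline") if "xmlUrl" in o.attrib]


def posts_from_rss(rss_text):
    """Return a (title, link, description) tuple for each RSS <item>."""
    root = ET.fromstring(rss_text)
    return [
        (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for url in feed_urls_from_opml(SAMPLE_OPML):
        print("feed:", url)
    for title, link, text in posts_from_rss(SAMPLE_RSS):
        print(title, link, text, sep=" | ")
```

    Note that many feeds carry only an excerpt in <description>, so the
    <link> of each item may still have to be fetched to get the full text.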

    Some blog hosting sites also couple the dynamic rendering with a
    permanent HTML link for each text. http://www.blogger.com/ (now
    owned by Google) provides both of these features: an RSS feed and a
    permanent link. I don't hold any shares, though...
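
    Where a permanent HTML page exists, the text can be pulled out of it
    once the template is known. The sketch below uses Python's standard
    html.parser module and assumes, purely for illustration, that the
    template wraps each post in a <div class="post-body">; the class name,
    the PostTextExtractor class and extract_post_text function are all
    hypothetical and would need adapting to the actual template.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text found inside <div class="..."> for a given class."""

    def __init__(self, container_class="post-body"):
        super().__init__()
        self.container_class = container_class  # assumed template class name
        self.depth = 0      # > 0 while inside the target <div>
        self.chunks = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # Nested <div> inside the post body: track it so that its
            # closing tag does not end the collection too early.
            self.depth += 1
        elif self.container_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_post_text(html_text, container_class="post-body"):
    """Return the whitespace-normalised text of the post container."""
    parser = PostTextExtractor(container_class)
    parser.feed(html_text)
    parser.close()
    # Fragments are joined with a space, then runs of whitespace collapsed.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        '<html><body><div class="sidebar">Links</div>'
        '<div class="post-body"><p>Hello <b>blog</b> world.</p>'
        '<div>Nested note.</div></div>'
        '<div class="footer">bye</div></body></html>'
    )
    print(extract_post_text(page))
```

    Tracking the nesting depth keeps sidebar and footer text out of the
    result even when the post body itself contains further <div> elements.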

    Cheers,

    --
      Jean-Philippe Prost
        Centre for Language Technology
        Macquarie University ~ Sydney, Australia
    and
        Laboratoire Parole et Langage (Speech & Language Lab.)
        Université de Provence ~ Aix-en-Provence, France
    <http://www.ics.mq.edu.au/~jpprost/>