Re: [Corpora-List] Corpus from Blogs required.

From: Trilok Khairnar (trilokgk@gmail.com)
Date: Mon Apr 04 2005 - 10:45:44 MET DST


    Hello Jean-Phi, Gilad

    Thanks for the inputs.

    Permalinks and Technorati APIs will definitely be useful.

    The Technorati APIs provide a blog's inbound and outbound links, basic
    user and blog info, etc., but not the list of posts on a blog or their
    text.

    On the other hand, permalinks should be useful for extracting the text of
    one blog post at a time, though surrounding text on the page, such as
    badges and the blogroll, will be included too. (It looks like a hack will
    be required to extract only the text of a post, even when a permalink is
    available.)
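
    A minimal sketch of such a hack, in Python and using only the standard
    library. It assumes the blog template wraps each post body in an element
    with a known class; the class name "post-body" and the names
    PostTextExtractor and extract_post_text are made up for illustration and
    would need adjusting per template.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostTextExtractor(HTMLParser):
    """Collects text found inside any element whose class list has post_class."""

    def __init__(self, post_class="post-body"):  # class name is an assumption
        super().__init__()
        self.post_class = post_class
        self.depth = 0      # 0 means we are outside the post container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1                      # nested tag inside the post
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self.depth = 1                       # entering the post container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_post_text(html, post_class="post-body"):
    """Return the post text only, skipping badges, blogroll, and so on."""
    parser = PostTextExtractor(post_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

    This relies on reasonably well-formed HTML; templates with unclosed tags
    would need a more forgiving parser.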

    I will try this sometime using the Atom.Net and RSS.Net libraries and
    let the list members know.
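
    For anyone who wants to try the feed-based approach in the meantime, here
    is a rough sketch in Python (standard library only, not the .NET libraries
    mentioned above). It assumes the OPML outlines carry an xmlUrl attribute
    and that the feeds are RSS 2.0 or Atom 1.0; older Atom drafts and RDF-based
    RSS 1.0 use other namespaces and are not handled. The function names are
    hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace


def feed_urls_from_opml(opml_text):
    """Return the feed URLs listed in an OPML subscription file."""
    root = ET.fromstring(opml_text)
    return [o.attrib["xmlUrl"] for o in root.iter("outline")
            if "xmlUrl" in o.attrib]


def post_links_from_feed(feed_text):
    """Return the permalink of every post in an RSS 2.0 or Atom 1.0 feed."""
    root = ET.fromstring(feed_text)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom 1.0: <entry><link rel="alternate" href="URL"/></entry>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # A missing rel attribute defaults to "alternate" in Atom 1.0.
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break
    return links
```

    Fetching the feeds and the permalink pages themselves is left out here.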

    Thanks,
    Trilok.

    On Mar 31, 2005 4:05 AM, Jean-Phi <jpprost@gmail.com> wrote:
    > Hi,
    >
    > > In the absence of such corpus and APIs, I am thinking of doing this by
    > > 1] using RSS, ATOM feed parsers on some OPML files to get URLs for blog posts
    > > 2] Extracting the text (easier if the blog template format is known)
    >
    > It might not be that easy: I suspect that many blogs use some sort of
    > Content Management System, which basically means that the texts are
    > stored in a database, and are only presented in the blog dynamically,
    > on request. In such cases my guess is that you'll probably need to know
    > at least a little about the database structure in order to query it,
    > unless, of course, the site provides you with an RSS feed. Or am I
    > missing something?
    >
    > Some blog hosting sites may also couple the dynamic rendering
    > with a permanent HTML link for each text. http://www.blogger.com/ (now
    > owned by Google) does provide both of these features: an RSS feed and
    > a permanent link. I don't hold any shares, though...
    >
    > Cheers,
    > --
    > Jean-Philippe Prost
    > Centre for Language Technology
    > Macquarie University ~ Sydney, Australia
    > and
    > Laboratoire Parole et Langage (Speech & Language Lab.)
    > Université de Provence ~ Aix-en-Provence, France
    > <http://www.ics.mq.edu.au/~jpprost/>


