Re: [Corpora-List] extract the content of an html element (was: chaker jebari)

From: Peter Adolphs (peter.adolphs@student.hu-berlin.de)
Date: Sat Dec 02 2006 - 13:31:57 MET


    Hi!

    Chaker Jabbari wrote:
    > I need a tool (under windows) to extract the content of any html tag
    > from a html/text file.

    Do you want to strip the tags, or do you want to extract the content of
    specific HTML elements?

    You could either extract the content with regular expressions or, for
    cleaner results, convert the HTML file to XML (tidy, jtidy) and
    transform that into the desired output with XSLT. In both cases, I
    would recommend jEdit -- a powerful text editor, Free Software, written
    in Java. There are numerous plugins and macros available that you could
    probably use for your task (plugins: JTidy and XSLT; macros: for
    instance, my own regular-expression-based "Extract Matches").
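
    To make the two routes concrete, here is a rough sketch in Python (not
    jEdit, and not the "Extract Matches" macro itself). The first function
    is the regular-expression route. The second swaps Python's standard
    html.parser module in for the tidy + XSLT pipeline, so it only
    illustrates the "parse first, then select" idea. All function and class
    names, and the sample markup, are made up for illustration.

```python
import re
from html.parser import HTMLParser


def extract_with_regex(html, tag):
    """Return the raw inner markup of every <tag>...</tag> pair.

    Quick but fragile: nested elements of the same name and unclosed
    tags are not handled.
    """
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [m.group(1) for m in pattern.finditer(html)]


class ElementText(HTMLParser):
    """Collect the text inside each outermost occurrence of one element."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=True)
        self.tag = tag.lower()
        self.depth = 0      # > 0 while inside the target element
        self.parts = []     # text pieces of the element being read
        self.results = []   # one string per outermost target element

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_with_parser(html, tag):
    """Return the text content (tags stripped) of every <tag> element."""
    parser = ElementText(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample input, only for demonstration.
sample = (
    "<html><body><p>First <b>bold</b> one.</p>"
    "<P class='x'>Second.</P></body></html>"
)
regex_result = extract_with_regex(sample, "p")
parser_result = extract_with_parser(sample, "p")
```

    The regex version keeps the inner markup (the <b> tags stay in), while
    the parser version returns text only. The parser sketch assumes every
    target element has an explicit end tag; for messy real-world HTML,
    running tidy first is still the safer option.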

    Hope that helps!

    -- 
    Peter Adolphs    peter.adolphs@student.hu-berlin.de    gpg/pgp welcome!
    


