Re: [Corpora-List] Translator_HTML_to_XML

From: Klaus Guenther (klaus@capitalfocus.org)
Date: Sat May 03 2003 - 01:52:10 MET DST


    ----- Original Message -----
    From: "d'Armond Speers" <speersdl@msn.com>
    To: <corpora@hd.uib.no>
    Sent: Saturday, May 03, 2003 1:36 AM
    Subject: Re: [Corpora-List] Translator_HTML_to_XML

    >
    > >Dear all,
    > >
    > >I'm working on an Internet Query System,
    > >Can somebody point me to : any system for translating
    > >HTML to XML (In Java)?
    >
    > Hmm, HTML is a form of XML, isn't it?

    HTML is an application of SGML, not of XML. XHTML, however, is HTML
    reformulated in XML. One of the big differences is that XML is very
    strict and doesn't tolerate mistakes (e.g., unclosed tags or improperly
    nested tags). So a simple transform isn't going to be enough: you need to
    parse the HTML and repair those errors before you can declare it an XML
    document. Even a tag like <br> will break an XML parser -- it needs to be
    written <br />. And then in HTML you have all the unclosed <p> tags.
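
To make the <br> point concrete, here is a minimal Java sketch that feeds small fragments to the XML parser shipped with the JDK (JAXP) and reports whether each one is well-formed. The class name WellFormedCheck and the method isWellFormed are made up for illustration. The sketch only detects the problems; it does not repair them, so it is not an HTML-to-XML converter.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether a string parses as well-formed XML.
public class WellFormedCheck {

    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: it is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Setup or I/O problems are not expected for in-memory strings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style <br> is never closed, so the parser rejects it.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));   // false
        // The XHTML spelling <br /> is a complete empty element.
        System.out.println(isWellFormed("<p>line one<br />line two</p>")); // true
        // Unclosed <p> tags, legal in HTML, also fail.
        System.out.println(isWellFormed("<div><p>one<p>two</div>"));       // false
    }
}
```

A real converter would have to go one step further and fix the markup (close the open tags, quote attributes, and so on) before handing it to an XML parser like this one.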

    I know there are converters. Macromedia Dreamweaver, for example, will
    update your code to be XHTML compliant. So there must be add-ons. I can't
    think of any for Java, but I'm sure they are out there.

    hth

    K.G.

    > For converting between different XML specs (as defined by a DTD or XML
    > Schema), you should take a look at XSLT (XSL Transformations). This is an
    > XML-based programming language. There are quite a few XSLT processors out
    > there that include Java libraries, such as Saxon and Xalan. You write the
    > XSLT, and apply the XSLT to the input XML to generate the output XML. Check
    > out XML, XSL and XML Schemas at the W3C (www.w3.org).
    >
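
The "write the XSLT, apply it to the input XML" step quoted above can be sketched in Java with the standard javax.xml.transform API, which processors such as Saxon and Xalan implement. The element names (page, title, document, heading), the stylesheet, and the class name XsltSketch are invented for illustration. The sketch assumes the input is already well-formed XML, so it would not accept raw HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: maps one made-up XML vocabulary onto another.
public class XsltSketch {

    // Illustrative stylesheet: turns <page><title>..</title></page>
    // into <document><heading>..</heading></document>.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/page\">"
      + "<document><heading><xsl:value-of select=\"title\"/></heading></document>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    public static String transform(String xml, String xslt) {
        try {
            // Compile the stylesheet, then run it over the input document.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need no checked exception.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "<page><title>Corpora</title></page>";
        System.out.println(transform(input, STYLESHEET));
    }
}
```

TransformerFactory.newInstance() picks up whichever XSLT processor is available on the classpath, so the same code should work whether the JDK's bundled processor or a separately installed one is used.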
    > >Thanks a lot,
    > >wassim
    >
    > --
    > d'Armond Speers, Ph.D.
    > speersd@georgetown.edu


