How do I build a well-formed XML document from HTML with Saxon? - PullRequest
0 votes
/ January 5, 2019

The specific error is Exception in thread "main" java.net.MalformedURLException: no protocol. However, because the HTML is printed to the console, the URL itself appears to be perfectly fine, so the error message may be misleading.

Sticking with Saxon-HE and tagsoup, do I need to validate the streamResult first?

Reading the console output, it almost looks as if wrapping the HTML in XML would be enough to then build a Document from the streamResult.

crash:

thufir@dur:~/NetBeansProjects/helloWorldSaxon$ gradle clean run

> Task :run
Exception in thread "main" java.net.MalformedURLException: no protocol: <?xml version="1.0" encoding="UTF-8"?><!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html xmlns:html="http://www.w3.org/1999/xhtml" class="no-js" lang="en-us"><!--<![endif]-->
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <title>
    All products | Books to Scrape - Sandbox
</title>
      <meta name="created" content="24th Jun 2016 09:29" />
      <meta name="description" content="" />
      <meta name="viewport" content="width=device-width" />
      <meta name="robots" content="NOARCHIVE,NOCACHE" />
      <!-- Le HTML5 shim, for IE6-8 support of HTML elements --><!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
      <link rel="shortcut icon" href="static/oscar/favicon.ico" />
      <link rel="stylesheet" type="text/css" href="static/oscar/css/styles.css" />
      <link rel="stylesheet" href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" />
      <link rel="stylesheet" type="text/css" href="static/oscar/css/datetimepicker.css" />
   </head>

..

      <!-- Version: N/A -->

      </body>
</html>
        at java.net.URL.<init>(URL.java:593)
        at java.net.URL.<init>(URL.java:490)
        at java.net.URL.<init>(URL.java:439)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:620)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at helloWorldSaxon.HandlerForXML.parseFromURL(HandlerForXML.java:53)
        at helloWorldSaxon.App.scrapeHTML(App.java:26)
        at helloWorldSaxon.App.main(App.java:19)

> Task :run FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':run'.
> Process 'command '/usr/lib/jvm/java-8-openjdk-amd64/bin/java'' finished with non-zero exit value 1

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 3s
4 actionable tasks: 4 executed
thufir@dur:~/NetBeansProjects/helloWorldSaxon$ 

Notably, there is no closing xml tag.

code:

    public void parseFromURL() throws SAXException, ParserConfigurationException, IOException, TransformerException {
        StringWriter writer = new StringWriter();
        StreamResult streamResult = new StreamResult(writer);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
        Source source = new SAXSource(xmlReader, new InputSource(url.toString()));

        Transformer transformer = transformerFactory.newTransformer();
        transformer.transform(source, streamResult);

        String stringResult = writer.toString();
        LOG.fine(stringResult);

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
        Document document;
        document = builder.parse(stringResult);

    }
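
Judging by the stack trace, the likely cause is that DocumentBuilder.parse(String) treats its argument as a URI rather than as XML content, so the serialized markup is passed to java.net.URL and rejected with "no protocol". Below is a minimal sketch of parsing the string's content instead, assuming the string already holds well-formed XML. The names XmlStringParser and parseXmlString are hypothetical and not part of the original code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses XML held in a String rather than at a URI.
final class XmlStringParser {

    static Document parseXmlString(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption: namespace awareness is wanted, since the output declares an XHTML namespace.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Wrapping the text in an InputSource makes the parser read the
        // characters themselves instead of resolving them as a system identifier.
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

In parseFromURL this would correspond to replacing builder.parse(stringResult) with builder.parse(new InputSource(new StringReader(stringResult))).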

I'm looking to build a well-formed XML document from stringResult.
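
One possible way to get a Document without the intermediate string is to let the identity transformer write into a DOMResult. The sketch below accepts any JAXP Source; in the setup above that would presumably be the SAXSource backed by the tagsoup reader. The names DomBuilder and toDocument are hypothetical, and this is an illustration under those assumptions, not a tested Saxon-specific recipe.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import org.w3c.dom.Document;

// Hypothetical helper: copies a Source into a DOM tree via the identity transform.
final class DomBuilder {

    static Document toDocument(Source source)
            throws ParserConfigurationException, TransformerException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Create the empty target document up front so the transformer
        // only has to append nodes to it.
        Document document = factory.newDocumentBuilder().newDocument();

        // newTransformer() with no stylesheet yields the identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(source, new DOMResult(document));
        return document;
    }
}
```

With the variables from parseFromURL, a call might look like DomBuilder.toDocument(new SAXSource(xmlReader, new InputSource(url.toString()))), which skips both the StringWriter and the second parse.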

...