I am using the Java XPath API to extract content from an XHTML file. I pass in the HTML and try to extract the contents of a specific div. The div contains text and several nested tags. Strangely, when I evaluate the XPath, it ignores all the HTML tags and extracts only the text content. Here is the HTML fragment.
<html>
<body>
<div class="content">
<div class="content_wrapper">
<table border="0" cellspacing="0" cellpadding="0" class="test_class">
<tr>
<td>
<p>
Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.
</p>
<p style="text-align: center;">
<img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" />
<img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" />
<img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" />
<img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" />
</p>
<p>
<br />
Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br />
</p>
<p>
<strong>Operating System</strong><br />
• Microsoft® Windows® XP Professional (SP 2 or higher)<br />
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br />
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)
</p>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>
Here is the code I am using. I need to do some cleanup before applying XPath.
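// Configure HtmlCleaner to tidy the markup before it is turned into a DOM.
CleanerProperties props = new CleanerProperties();
props.setOmitDoctypeDeclaration(true);
props.setAllowHtmlInsideAttributes(true);
props.setOmitUnknownTags(true);
// Clean the input and convert the resulting tag tree to a W3C DOM document.
TagNode tagNode = new HtmlCleaner(props).clean(urlXML, "UTF-8");
Document doc = new DomSerializer(props, true).createDOM(tagNode);
// Evaluate the XPath expression and convert the result to a string.
String content = XPathAPI.eval(doc, "/html/body//div[@class='content']/div[@class='content_wrapper']").toString();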
And here is the output.
Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.
Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it
Operating System
• Microsoft® Windows® XP Professional (SP 2 or higher)<br />
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br />
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)
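One possible explanation for output like this, offered as an assumption rather than a confirmed diagnosis: in XPath, the string value of an element is the concatenation of its descendant text nodes, so converting a result to a string drops the tags. The minimal sketch below uses only the JDK's `javax.xml.xpath` API (not HtmlCleaner or `XPathAPI`); the class name `StringValueDemo`, its helper methods, and the tiny sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: contrasts an XPath string result with a node result.
final class StringValueDemo {

    // Parses a well-formed XML string into a DOM document.
    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates the expression as a STRING: only text content comes back.
    static String stringValue(String xml, String xpath) {
        try {
            return (String) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, parse(xml), XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Evaluates the expression as a NODE: the element and its children are kept.
    static Node firstNode(String xml, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, parse(xml), XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><p>Hello <strong>world</strong></p></div>";
        // The string result contains no markup.
        System.out.println(stringValue(xml, "/div"));
        // The node result still exposes the child element.
        System.out.println(firstNode(xml, "/div").getFirstChild().getNodeName());
    }
}
```

Asking for a node (or node set) instead of a string keeps the element structure available for further processing.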
All I need is the complete content inside the content_wrapper div, markup included.
Any pointers will be highly appreciated.
EDIT
Sample code in response to yamburg's solution.
// Compile the XPath expression and request the result as a node set.
XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(contentPath);
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
// Walk the subtree under every matched node.
for (int i = 0; i < nodes.getLength(); i++) {
    Node n = nodes.item(i);
    traverseNodes(n);
}
// Recursively prints the name, value and type of every descendant of n.
public static void traverseNodes(Node n) {
    NodeList children = n.getChildNodes();
    if (children != null) {
        // The condition must be i < getLength(); with '>' the loop body never runs.
        for (int i = 0; i < children.getLength(); i++) {
            Node childNode = children.item(i);
            System.out.println("node name = " + childNode.getNodeName());
            System.out.println("node value = " + childNode.getNodeValue());
            System.out.println("node type = " + childNode.getNodeType());
            traverseNodes(childNode);
        }
    }
}
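As an alternative to walking the tree by hand, the matched node can be serialized back to markup with an identity `Transformer` from `javax.xml.transform`. This is only a sketch under assumptions: it parses a small, well-formed sample with the JDK's `DocumentBuilder` instead of HtmlCleaner, and the class name `NodeMarkup` and the methods `toMarkup` and `extract` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: turns an XPath match back into markup.
final class NodeMarkup {

    // Serializes a DOM node together with its child elements.
    static String toMarkup(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header, since this is only a fragment.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Parses well-formed XHTML and returns the markup of the first XPath match,
    // or an empty string when nothing matches.
    static String extract(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Node match = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            return match == null ? "" : toMarkup(match);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract markup", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML fragment in the question.
        String xhtml = "<html><body><div class=\"content\"><div class=\"content_wrapper\">"
                + "<p>Some text</p><p><strong>Operating System</strong><br/>Windows XP</p>"
                + "</div></div></body></html>";
        System.out.println(extract(xhtml,
                "/html/body//div[@class='content']/div[@class='content_wrapper']"));
    }
}
```

The result includes the matched div element itself. With the DOM produced by `DomSerializer` in the code above, the same `toMarkup` call should in principle apply to each node returned by the XPath evaluation, though that combination is not tested here.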