How can I get the categories when scraping with the libxml2 utilities?
thufir@dur:~/xmllint$
thufir@dur:~/xmllint$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"]' - 2>/dev/null
<div class="tides">
<div class="weather-sprite icon st_nl" title="Marées Oostende"></div>
<p>Hoogtij: <strong>08:16</strong> <strong>20:56</strong></p>
<p>Laagtij: <strong>02:10</strong> <strong>14:48</strong></p>
<div class="weather-sprite icon anv_nl clearFlt" title="Marées Anvers"></div>
<p>Hoogtij: <strong>10:52</strong> <strong></strong></p>
<p>Laagtij: <strong>04:45</strong> <strong>04:45</strong></p>
</div><div class="tides">
<div class="weather-sprite icon st_nl" title="Marées Oostende"></div>
<p>Hoogtij: <strong>09:21</strong> <strong>22:05</strong></p>
<p>Laagtij: <strong>03:22</strong> <strong>16:01</strong></p>
<div class="weather-sprite icon anv_nl clearFlt" title="Marées Anvers"></div>
<p>Hoogtij: <strong></strong> <strong>12:02</strong></p>
<p>Laagtij: <strong>05:51</strong> <strong>05:51</strong></p>
</div>thufir@dur:~/xmllint$
thufir@dur:~/xmllint$
thufir@dur:~/xmllint$ wget -q -O - http://books.toscrape.com | xmllint --html --xpath '//div[@class='side_categories']' - 2 > /dev/null
-:53: HTML parser error : Tag header invalid
<header class="header container-fluid">
^
-:78: HTML parser error : Tag aside invalid
<aside class="sidebar col-sm-4 col-md-3">
^
-:647: HTML parser error : Tag section invalid
<section>
^
-:660: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:735: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:810: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:885: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:960: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1035: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1110: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1185: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1260: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1335: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1410: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1485: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1560: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1635: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1710: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1785: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1860: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:1935: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:2010: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:2085: HTML parser error : Tag article invalid
<article class="product_pod">
^
-:2186: HTML parser error : Tag footer invalid
<footer class="footer container-fluid">
^
XPath set is empty
warning: failed to load external entity "2"
thufir@dur:~/xmllint$
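A side note on the second command, as a minimal sketch rather than a confirmed diagnosis: two of the messages above may come from how the shell splits the command line, not from xmllint itself. The inner single quotes around side_categories close the outer ones, so the expression probably reaches xmllint as //div[@class=side_categories], and the space in "2 > /dev/null" turns "2" into an extra file argument while stdout (not stderr) is redirected. The helper show_args below is hypothetical and only prints the words the shell hands to a program; it does not call xmllint.

```shell
#!/bin/sh
# Hypothetical helper (not part of libxml2): print each argument in brackets
# so the shell's word splitting and quote removal become visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Arguments as typed in the question (the trailing "> /dev/null" is left out
# here, since it would only discard the helper's stdout). The inner quotes
# vanish and "2" arrives as a separate argument.
as_typed=$(show_args --xpath '//div[@class='side_categories']' - 2)

# Variant: double quotes outside, single quotes inside, and no space
# between "2" and ">" so that stderr is what gets redirected.
fixed=$(show_args --xpath "//div[@class='side_categories']" - 2>/dev/null)

printf 'as typed:\n%s\n\n' "$as_typed"
printf 'fixed:\n%s\n' "$fixed"
```

Applied to the command from the question, the same change would read: wget -q -O - http://books.toscrape.com | xmllint --html --xpath "//div[@class='side_categories']" - 2>/dev/null. The "Tag ... invalid" messages are written to stderr, so they should disappear with that redirection, assuming the class name on the page is exactly side_categories.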
Presumably either xmllint or the tidy utility has this capability? Possibly via:
Enum htmlParserOption
Enum htmlParserOption {
HTML_PARSE_RECOVER = 1 : Relaxed parsing
HTML_PARSE_NODEFDTD = 4 : do not default a doctype if not found
HTML_PARSE_NOERROR = 32 : suppress error reports
HTML_PARSE_NOWARNING = 64 : suppress warning reports
HTML_PARSE_PEDANTIC = 128 : pedantic error reporting
HTML_PARSE_NOBLANKS = 256 : remove blank nodes
HTML_PARSE_NONET = 2048 : Forbid network access
HTML_PARSE_NOIMPLIED = 8192 : Do not add implied html/body... elements
HTML_PARSE_COMPACT = 65536 : compact small text nodes
HTML_PARSE_IGNORE_ENC = 2097152 : ignore internal document encoding hint
}
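The values in this listing are powers of two, which suggests they are meant to be combined with a bitwise OR into a single options value for the libxml2 C API (for example the options argument of htmlReadMemory). As far as I can tell, xmllint does not accept these numbers directly on the command line, so the following is only an arithmetic sketch of how such a combination would be computed, with the constants copied from the listing above.

```shell
#!/bin/sh
# Values copied from the htmlParserOption listing above.
HTML_PARSE_RECOVER=1
HTML_PARSE_NOERROR=32
HTML_PARSE_NOWARNING=64

# Bitwise OR merges independent flags into one integer:
# relaxed parsing with error and warning reports suppressed.
options=$(( HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING ))

echo "$options"   # prints 97
```

On the xmllint command line, the closest equivalents I am aware of are separate switches such as --recover and --nowarning, plus redirecting stderr to hide error reports.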
See also:
https://stackoverflow.com/a/12478652/262852
https://stackoverflow.com/a/3486809/262852