How to scrape, parse, and clean HTML with libxml2 (convert HTML to XML)
January 1, 2019

How can I extract the categories when scraping with the libxml2 utilities?

thufir@dur:~/xmllint$ 
thufir@dur:~/xmllint$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"]' - 2>/dev/null
<div class="tides">
            <div class="weather-sprite icon  st_nl" title="Marées Oostende"></div>
            <p>Hoogtij: <strong>08:16</strong>  <strong>20:56</strong></p>
            <p>Laagtij: <strong>02:10</strong>  <strong>14:48</strong></p>
            <div class="weather-sprite icon  anv_nl clearFlt" title="Marées Anvers"></div>
            <p>Hoogtij: <strong>10:52</strong>  <strong></strong></p>
            <p>Laagtij: <strong>04:45</strong>  <strong>04:45</strong></p>
        </div><div class="tides">
            <div class="weather-sprite icon  st_nl" title="Marées Oostende"></div>
            <p>Hoogtij: <strong>09:21</strong>  <strong>22:05</strong></p>
            <p>Laagtij: <strong>03:22</strong>  <strong>16:01</strong></p>
            <div class="weather-sprite icon  anv_nl clearFlt" title="Marées Anvers"></div>
            <p>Hoogtij: <strong></strong>  <strong>12:02</strong></p>
            <p>Laagtij: <strong>05:51</strong>  <strong>05:51</strong></p>
        </div>thufir@dur:~/xmllint$ 
thufir@dur:~/xmllint$ 
thufir@dur:~/xmllint$ wget -q -O - http://books.toscrape.com | xmllint --html --xpath '//div[@class='side_categories']' - 2 > /dev/null
-:53: HTML parser error : Tag header invalid
    <header class="header container-fluid">
                                          ^
-:78: HTML parser error : Tag aside invalid
            <aside class="sidebar col-sm-4 col-md-3">
                                                    ^
-:647: HTML parser error : Tag section invalid
        <section>
                ^
-:660: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:735: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:810: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:885: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:960: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1035: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1110: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1185: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1260: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1335: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1410: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1485: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1560: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1635: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1710: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1785: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1860: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:1935: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:2010: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:2085: HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
-:2186: HTML parser error : Tag footer invalid
<footer class="footer container-fluid">
                                      ^
XPath set is empty
warning: failed to load external entity "2"
thufir@dur:~/xmllint$ 
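
Two details in the second command may explain the output, independently of the HTML5 tag errors. The XPath is wrapped in single quotes and also uses single quotes inside, so the shell strips the inner pair before xmllint sees the expression. And `2 > /dev/null` (with a space) passes `2` as a file name and redirects stdout, which would match the `failed to load external entity "2"` warning. A minimal sketch of the quoting effect follows; the variable names and the `side_categories` helper are hypothetical and not part of the original session.

```shell
#!/bin/sh
# Sketch only: shows what the shell does to the two quoting styles.
# Needs no network access and no xmllint.

# Nested single quotes: the shell closes the string before side_categories
# and reopens it afterwards, so the inner quotes vanish.
broken='//div[@class='side_categories']'

# Double quotes inside single quotes survive and reach XPath intact.
fixed='//div[@class="side_categories"]'

printf 'broken: %s\n' "$broken"   # broken: //div[@class=side_categories]
printf 'fixed:  %s\n' "$fixed"    # fixed:  //div[@class="side_categories"]

# Hypothetical helper (not in the original post): the second command with
# both fixes applied. "2>/dev/null" must be written without a space;
# "2 > /dev/null" passes "2" as a file name and discards stdout instead.
side_categories() {
    wget -q -O - http://books.toscrape.com |
        xmllint --html --xpath "$fixed" - 2>/dev/null
}
```

Without the inner quotes, XPath reads `side_categories` as a child element name rather than a string, which is consistent with the `XPath set is empty` message. Calling `side_categories` requires network access and an installed xmllint; whether it returns the expected markup has not been verified here.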

Presumably either xmllint or the tidy utility can do this? Perhaps via:

Enum htmlParserOption {
    HTML_PARSE_RECOVER = 1 : Relaxed parsing
    HTML_PARSE_NODEFDTD = 4 : do not default a doctype if not found
    HTML_PARSE_NOERROR = 32 : suppress error reports
    HTML_PARSE_NOWARNING = 64 : suppress warning reports
    HTML_PARSE_PEDANTIC = 128 : pedantic error reporting
    HTML_PARSE_NOBLANKS = 256 : remove blank nodes
    HTML_PARSE_NONET = 2048 : Forbid network access
    HTML_PARSE_NOIMPLIED = 8192 : Do not add implied html/body... elements
    HTML_PARSE_COMPACT = 65536 : compact small text nodes
    HTML_PARSE_IGNORE_ENC = 2097152 : ignore internal document encoding hint
}
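
The values in this enum are distinct powers of two, so they read as bit flags that can be OR-ed into the single `int options` argument taken by libxml2's HTML reader functions such as `htmlReadFile`. As a small sketch, assuming one wants relaxed parsing with errors and warnings suppressed, the combined value can be computed like this (the choice of flags is an illustration, not something the original post settled on):

```shell
#!/bin/sh
# Values copied from the htmlParserOption enum quoted above.
HTML_PARSE_RECOVER=1
HTML_PARSE_NOERROR=32
HTML_PARSE_NOWARNING=64

# Bitwise OR merges the flags into a single options integer.
options=$(( HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING ))
printf 'options = %d\n' "$options"   # options = 97
```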

See also:

https://stackoverflow.com/a/12478652/262852

https://stackoverflow.com/a/3486809/262852

...