Java Android XPath HTML parsing

0 votes
/ November 4, 2011

I have an app that needs to take some HTML and add some tags to it.

I need to get every tr and every td and read their inner text.

Can you give me code for this?

I have been working on this for hours already...

The site's content:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">    
<!-- Updated: 03/11/2011 15:17:29-->    
<html xmlns="http://www.w3.org/1999/xhtml" >    
<head><title>    
    Untitled Page    
</title><meta http-equiv="Page-Exit" content="progid:DXImageTransform.Microsoft.GradientWipe(duration=1)" /><meta HTTP-EQUIV="CACHE-CONTROL" content="NO-CACHE" /><meta HTTP-EQUIV="PRAGMA" content="NO-CACHE" /><meta http-equiv="refresh" content="60" />    
    <style type="text/css">                
    .DisplayTable { width: 97%; }    
    .DisplayHeader { font-family: Arial; font-weight: bold; font-size: 25px; color: Black; text-align: center; }    
    .DisplayCell { font-family: Arial; font-weight: bold; font-size: 16px; color: Black; }                
    .MessageTable { width: 97%; }    
    .MessageHeader { font-family: Arial; font-size: 20px; color: SteelBlue; border-bottom: solid 3px SteelBlue; }    
    .MessageText { font-family: Arial; font-size: 20px; color: SteelBlue; text-align: right; }                
    .DisplayFillChange { font-family: Arial; font-weight: bold; font-size: 16px; color: MediumBlue; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayFreeChange { font-family: Arial; font-weight: bold; font-size: 16px; color: OrangeRed; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayEventChange { font-family: Arial; font-weight: bold; font-size: 16px; color: DarkGreen; background-color: LightCyan; border-bottom: solid 1px LightCyan; }    
    .DisplayExamChange { font-family: Arial; font-weight: bold; font-size: 16px; color: IndianRed; background-color: LightCyan; border-bottom: solid 1px LightCyan; }                
    </style>    
</head>    
<body dir="rtl" style="margin: 0px; background-color: LightCyan; overflow: hidden;" scroll="no" onload="resize()">    
    <form name="form1" method="post" action="MainScreen.aspx?pid=17&amp;mid=6264&amp;page=5&amp;msgof=0&amp;static=1" id="form1">    
<div>    
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJLTQwMjA0MzQzZGSqqj0xDnBRKxIgowwhNZzzyzQHVg==" />    
</div>            
        <table width="100%" cellspacing="0" cellpadding="0" border="0" style="background-image: url(fill.gif);">    
            <tr height="59" style="font-family: Arial; font-size: 34px; color: Yellow; vertical-align: middle;">    
                <td width="15">&nbsp;</td>    
                <td width="45%" align="right" id="clock">00:00</td>    
                <td align="center" nowrap><b>שינוי מערכת שעות לתאריך                        </b></td>    
                <td width="45%" align="left">04.11.2011</td>    
                <td width="15">&nbsp;</td>    
            </tr>    
        </table>    
        <br />    
        <div id="header" align="center"><table width='100%' class='DisplayTable' cellspacing='0' border='1'><tr class='DisplayHeader'><td width='1%' style='color: LightCyan;'>0</td><td width='14%'>יא - 1</td><td width='14%'>יא - 2</td><td width='14%'>יא - 3</td><td width='14%'>יא - 4</td><td width='14%'>יא - 5</td><td width='14%'>יא - 6</td><td width='14%'>יא - 7</td><td width='1%' style='color: LightCyan;'>0</td></tr></table></div>    
        <div id="scrollPanel" align="center" style="overflow: hidden;">    
            <div id="panel" align="center" style=""><table width='100%' class='DisplayTable' cellspacing='0' border='1'><tr><td width='1%' class='DisplayCell'>0</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>0</td></tr><tr><td width='1%' class='DisplayCell'>1</td><td width='14%' class='DisplayCell'><table width='100%'></table></td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>1</td></tr><tr><td width='1%' class='DisplayCell'>2</td><td width='14%' class='DisplayCell'><table width='100%'></table></td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>2</td></tr><tr><td width='1%' class='DisplayCell'>3</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>3</td></tr><tr><td width='1%' class='DisplayCell'>4</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' 
class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>4</td></tr><tr><td width='1%' class='DisplayCell'>5</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>5</td></tr><tr><td width='1%' class='DisplayCell'>6</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>6</td></tr><tr><td width='1%' class='DisplayCell'>7</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>7</td></tr><tr><td width='1%' class='DisplayCell'>8</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>8</td></tr><tr><td width='1%' class='DisplayCell'>9</td><td 
width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='14%' class='DisplayCell'>&nbsp;</td><td width='1%' class='DisplayCell'>9</td></tr></table></div>    
            <div id="messages" align="center"><table width='100%' class='MessageTable' cellspacing='0' cellpadding='7' border='0'><tr><td class='MessageHeader'>הודעות</td></tr></tr></table></div>    
        </div>    
    </form>    
    <script>                
    var sp;    
    var delay = 0;                
    function resize(){    
        sp = document.getElementById('scrollPanel');    
        sp.style.height = document.documentElement.clientHeight - sp.offsetTop;            
        delay = document.getElementById('panel').clientHeight - document.getElementById('scrollPanel').clientHeight;    
        if (delay > 0)    
            delay = delay / 5 * 120;    
        else    
            delay = 0;                    
        setTimeout("doScroll()", 3000);    
        setTimeout("doNextPage()", 500);    
    }                
    function doScroll()    
    {    
        sp.scrollTop += 5;    
        setTimeout("doScroll()", 100);    
    }                
    updateClock();    
    function nextUrl()    
    {    
        return 'MainScreen.aspx?pid=17&mid=6264&page=6&msgof=0&nd=0';    
    }                
    function doNextPage()    
    {                    
    }                
    function updateClock()    
    {    
        document.getElementById('clock').innerHTML = getClock();    
        setTimeout("updateClock()", 55000)    
    }
    function getClock()    
    {    
        var date = new Date();    
        var hours = date.getHours();    
        var minutes = date.getMinutes();                    
        if (hours < 10)    
            hours = '0' + hours;                        
        if (minutes < 10)    
            minutes = '0' + minutes;            
        return hours + ':' + minutes;    
    }    
    </script>    
</body>    
</html>

1 Answer

3 votes
/ November 4, 2011

The easiest way out is to use an HTML parsing library such as HTMLCleaner, TagSoup, HTML Parser, etc. That way you can simply extract all the elements you need from the document, or iterate over it manually with a "node visitor", or whatever the library in question calls it.

A quick look at the documentation of one of the libraries above, picked at random, suggests that something like the following should work for HTMLCleaner:

An example using the same library, but now with a TagNodeVisitor and filtered on <td>:

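// Requires the HtmlCleaner library on the classpath.
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.HtmlNode;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.TagNodeVisitor;

// Assumption: `html` is a String holding the page source; `node` is the cleaned document root.
TagNode node = new HtmlCleaner().clean(html);

node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode parentNode, HtmlNode htmlNode) {
        // The visitor is called for every node; only element nodes are TagNode instances.
        if (htmlNode instanceof TagNode) {
            TagNode tag = (TagNode) htmlNode;
            if ("td".equals(tag.getName())) {
                System.out.println("All text inside this <td> tag (including children): " + tag.getText());
            }
        }
        // Returning true tells the visitor to continue traversing the DOM tree.
        return true;
    }
});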
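
For comparison, here is a minimal sketch of the XPath route mentioned in the question's title, using only the standard javax.xml APIs. It is not part of the original answer. The class name TdTextExtractor and the method extractCellTexts are hypothetical, and the sketch assumes the input is already well-formed XML. The page in the question is not: it has a bare nowrap attribute and an unmatched </tr>, which is why an HTML-cleaning library is the more practical choice there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name does not come from the original answer.
class TdTextExtractor {

    // Returns the trimmed text of every <td> in a well-formed XML/XHTML string.
    static List<String> extractCellTexts(String wellFormedXhtml) {
        try {
            // The default factory is not namespace-aware, so "//td" matches by plain tag name.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedXhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate("//td", doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<String>();
            for (int i = 0; i < cells.getLength(); i++) {
                // getTextContent() concatenates the text of the cell and all of its descendants.
                texts.add(cells.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input (for example, the raw page from the question) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: a small, well-formed fragment modelled on the question's table.
        String sample = "<table><tr><td>00:00</td><td><b>04.11.2011</b></td></tr></table>";
        for (String text : extractCellTexts(sample)) {
            System.out.println(text);
        }
    }
}
```

Changing the expression to "//tr" would return the rows instead, and each row's cells could then be read from its child nodes.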