If an HTML file is missing closing </tr> or </td> tags, HTML Agility Pack cannot fully read that information
6 votes
/ March 19, 2010

I'm using HTML Agility Pack to parse HTML content and extract table data. It works. But when a closing </tr> or </td> tag is missing, it cannot fully parse the rows or cells that lack the closing tag.

For example:

    <html>
  <head>
    <meta name="generator" content=
    "HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
    <title></title>
  </head>
  <body>
    <table cellspacing="0" cellpadding="0" width="100%" border="0">
      <tbody>
        <tr>
          <td class="xl27" valign="bottom" colspan="9">
            Sir / Madam,<br>
            I/We have this day done by your order and on your account the
            following transactions:
          </td>
          <td class="xl27boTRL" align="middle" colspan="5">
            Stamp duty as required under the relevant stamp act to be paid on
            consolidated basis at the end of the month.
          </td>
        </tr>
        <tr height="30">
          <td class="xl27boTBL" align="middle" width="7%">
            Order No
          </td>
          <td class="xl27boTBL" align="middle" width="4%">
            Order Time
          </td>

          <td class="xl27boTBL" align="middle" width="5%">
            Net Rate
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
            Service Tax
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
           Amount
          </td>
          <td class="xl27boTRBL" style="BORDER-BOTTOM: windowtext 1pt solid;"
          align="middle" width="8%">
          Net Amount Rs
          </td>
        </tr>
        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            25222105
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:02:39
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            125288.00 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            122122141
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:01:56
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            249612.64 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap align="left" width="15%">
            [SERVICE TAX]
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="7%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            61.66
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

So what should I do?

<TABLE  cellpadding=1 cellspacing=0 Width='100%'  style='border:1px solid #FFFFFF;''>
<TRAlign='middle' VAlign='bottom' Class='clsTRFontBold'>
<TD NoWrap class=clsTRFontHdr>ORDER NO</TD><TD NoWrap class=clsTRFontHdr>ORD TIME</TD>
<TD  NoWrap class=clsTRFontHdr>TRADE NO</TD><TD  NoWrap class=clsTRFontHdr>TRD TIME</TD>
<TD  NoWrap class=clsTRFontHdr ALIGN=CENTER>SCRIPNAME</TD>
<TD  NoWrap class=clsTRFontHdr>BUY/SELL</TD><TD  NoWrap class=clsTRFontHdr>QUANTITY</TD>
<TD NoWrap class=clsTRFontHdr align=right>RATE (RS)</TD>
<TD NoWrap class=clsTRFontHdr align=right>TOTAL (RS)</TD>
<TD NoWrap class=clsTRFontHdr align=right>TOT BROK (RS)</TD>
<TD NoWrap class=clsTRFontHdr align=right>SER TAX (RS)</TD>
<TD NoWrap class=clsTRFontHdr align=right>STT (RS)</TD>
<TD NoWrap class=clsTRFontHdr align=right>NET TOTAL (RS)</TD>
</TR>

<TR Class='clsTRFont'>
<TD NoWrap>2009030267182768</TD>
<TD NoWrap>10:28:11</TD><TD NoWrap>66950592</TD>
<TD NoWrap>10:28:25</TD>
<TD NoWrap>SESA GOA LTD</TD>
<TD NoWrap>BUY</TD>
<TD NoWrap ALIGN='RIGHT'>366 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2000</TD>
<TD NoWrap ALIGN='RIGHT'>28621.20</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD>
<TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>-28621.21</TD></TR>
<!--tr tag missing-->
<TD NoWrap>2009030267182768</TD>
<TD NoWrap>10:28:11</TD><TD NoWrap>66950783</TD><TD NoWrap>10:28:27</TD>
<TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>100 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2000</TD><TD NoWrap ALIGN='RIGHT'>7820.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-7820.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267182768</TD><TD NoWrap>10:28:11</TD>
<TD NoWrap>66956828</TD><TD NoWrap>10:29:39</TD><TD NoWrap>SESA GOA LTD</TD>
<TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>534 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2000</TD><TD NoWrap ALIGN='RIGHT'>41758.80</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-41758.81</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD><TD NoWrap>11:06:12</TD><TD NoWrap>67137258</TD>
<TD NoWrap>11:09:24</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD>
<TD NoWrap ALIGN='RIGHT'>162 </TD><TD NoWrap ALIGN='RIGHT'>78.2500</TD>
<TD NoWrap ALIGN='RIGHT'>12676.50</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD>
<TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>3.1320</TD>
<TD NoWrap ALIGN='RIGHT'>12673.36</TD></TR><TD NoWrap>2009030267510894</TD>
<TD NoWrap>11:06:12</TD><TD NoWrap>67137465</TD><TD NoWrap>11:09:28</TD>
<TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>200 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2500</TD><TD NoWrap ALIGN='RIGHT'>15650.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>4.1010</TD><TD NoWrap ALIGN='RIGHT'>15645.89</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD><TD NoWrap>11:06:12</TD>
<TD NoWrap>67137479</TD><TD NoWrap>11:09:28</TD><TD NoWrap>SESA GOA LTD</TD>
<TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>4 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2500</TD><TD NoWrap ALIGN='RIGHT'>313.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.0773</TD><TD NoWrap ALIGN='RIGHT'>312.91</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD><TD NoWrap>11:06:12</TD><TD NoWrap>67137995</TD>
<TD NoWrap>11:09:32</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD>
<TD NoWrap ALIGN='RIGHT'>16 </TD><TD NoWrap ALIGN='RIGHT'>78.2500</TD>
<TD NoWrap ALIGN='RIGHT'>1252.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD>
<TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.3093</TD>
<TD NoWrap ALIGN='RIGHT'>1251.68</TD></TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD>
<TD NoWrap>11:06:12</TD><TD NoWrap>67138097</TD><TD NoWrap>11:09:34</TD>
<TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>100 </TD>
<TD NoWrap ALIGN='RIGHT'>78.2500</TD><TD NoWrap ALIGN='RIGHT'>7825.00</TD>
<TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD>
<TD NoWrap ALIGN='RIGHT'>1.9333</TD><TD NoWrap ALIGN='RIGHT'>7823.06</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD><TD NoWrap>11:06:12</TD><TD NoWrap>67138333</TD><TD NoWrap>11:09:39</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>200 </TD><TD NoWrap ALIGN='RIGHT'>78.2500</TD><TD NoWrap ALIGN='RIGHT'>15650.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>3.8666</TD><TD NoWrap ALIGN='RIGHT'>15646.12</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030267510894</TD><TD NoWrap>11:06:12</TD><TD NoWrap>67138344</TD><TD NoWrap>11:09:40</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>318 </TD><TD NoWrap ALIGN='RIGHT'>78.2500</TD><TD NoWrap ALIGN='RIGHT'>24883.50</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>6.1479</TD><TD NoWrap ALIGN='RIGHT'>24877.34</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268222556</TD><TD NoWrap>13:03:50</TD><TD NoWrap>67511545</TD><TD NoWrap>13:03:51</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>733 </TD><TD NoWrap ALIGN='RIGHT'>78.0000</TD><TD NoWrap ALIGN='RIGHT'>57174.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-57174.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268222556</TD><TD NoWrap>13:03:50</TD><TD NoWrap>67511621</TD><TD NoWrap>13:03:53</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>2 </TD><TD NoWrap ALIGN='RIGHT'>78.0000</TD><TD NoWrap ALIGN='RIGHT'>156.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-156.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268222556</TD><TD NoWrap>13:03:50</TD><TD NoWrap>67511797</TD><TD NoWrap>13:03:58</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>1 </TD><TD NoWrap ALIGN='RIGHT'>78.0000</TD><TD NoWrap ALIGN='RIGHT'>78.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-78.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268222556</TD><TD NoWrap>13:03:50</TD><TD NoWrap>67512082</TD><TD NoWrap>13:04:05</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>264 </TD><TD NoWrap ALIGN='RIGHT'>78.0000</TD><TD NoWrap ALIGN='RIGHT'>20592.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-20592.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268378000</TD><TD NoWrap>13:31:04</TD><TD NoWrap>67609079</TD><TD NoWrap>13:33:39</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>405 </TD><TD NoWrap ALIGN='RIGHT'>77.6000</TD><TD NoWrap ALIGN='RIGHT'>31428.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-31428.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268378000</TD><TD NoWrap>13:31:04</TD><TD NoWrap>67609374</TD><TD NoWrap>13:33:46</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>45 </TD><TD NoWrap ALIGN='RIGHT'>77.6000</TD><TD NoWrap ALIGN='RIGHT'>3492.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-3492.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030268779359</TD><TD NoWrap>14:32:04</TD><TD NoWrap>67870192</TD><TD NoWrap>14:32:41</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>BUY</TD><TD NoWrap ALIGN='RIGHT'>900 </TD><TD NoWrap ALIGN='RIGHT'>77.3000</TD><TD NoWrap ALIGN='RIGHT'>69570.00</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>-69570.01</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030269013760</TD><TD NoWrap>15:03:56</TD><TD NoWrap>68018179</TD><TD NoWrap>15:03:56</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>146 </TD><TD NoWrap ALIGN='RIGHT'>76.2500</TD><TD NoWrap ALIGN='RIGHT'>11132.50</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>2.8226</TD><TD NoWrap ALIGN='RIGHT'>11129.67</TD>
</TR>
<!--tr tag missing-->
<TD NoWrap>2009030269013760</TD><TD NoWrap>15:03:56</TD><TD NoWrap>68018180</TD><TD NoWrap>15:03:56</TD><TD NoWrap>SESA GOA LTD</TD><TD NoWrap>SELL</TD><TD NoWrap ALIGN='RIGHT'>10 </TD><TD NoWrap ALIGN='RIGHT'>76.2500</TD><TD NoWrap ALIGN='RIGHT'>762.50</TD><TD NoWrap ALIGN='RIGHT'>0.01</TD><TD NoWrap ALIGN='RIGHT'>0.00</TD><TD NoWrap ALIGN='RIGHT'>0.1933</TD><TD NoWrap ALIGN='RIGHT'>762.30</TD>
</TR>
<TABLE cellpadding=0 cellspacing=0 border=0><br>

Answers [ 4 ]

2 votes
/ March 22, 2010

Since you tested my other idea and it didn't work, I think you have only two options:

  1. Modify HTML Agility Pack to handle your case, or
  2. Fill in the missing </tr> tags yourself.

Here is a regular expression that can fill in the missing </tr> tags for you:

html = Regex.Replace(html, "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)", "$&</tr>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

(If anyone can improve my regular expression, please feel free.)
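To see what the expression above does on a small input, here is a minimal sketch that wraps the same pattern in a helper. The names `MissingTrFixer` and `CloseOpenRows` are made up for illustration and are not part of HTML Agility Pack. It only addresses rows whose closing </tr> is missing, as in the first sample in the question. It does not add a missing opening <TR>, which is the situation in the second sample.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; only the pattern itself comes from the answer above.
public static class MissingTrFixer
{
    // Matches an opening <tr ...> plus everything up to (but not including)
    // the next <tr ...>, </tbody> or </table>, provided no </tr> appears in between.
    private const string Pattern =
        "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)";

    // Appends </tr> to every row that was left open; closed rows are not matched.
    public static string CloseOpenRows(string html)
    {
        return Regex.Replace(
            html,
            Pattern,
            "$&</tr>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

Calling `MissingTrFixer.CloseOpenRows(html)` before handing the string to the parser should leave already well-formed rows untouched, because the negative lookahead refuses to step over an existing </tr>. Like any regex-based repair, this is fragile on markup it was not written for (nested tables, for instance), so it is worth checking the result on real input.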

0 votes
/ April 28, 2012

You could try HTML Tidy or Tidy.NET. It looks like it would solve your problem.

0 votes
/ May 15, 2011

I wrote a patch for HTML Agility Pack that should let it handle all optional end tags (although at the moment it still allows block elements inside <p>).

It hasn't been tested much yet, but you can always give it a try. It is attached to the following Html Agility Pack bug report: http://htmlagilitypack.codeplex.com/workitem/29218

0 votes
/ March 19, 2010

I haven't tried this, but does adding this code help?

// ElementsFlags is a static dictionary, so register the flags before loading the document.
HtmlNode.ElementsFlags.Add("tr", HtmlElementFlag.Closed);
HtmlNode.ElementsFlags.Add("td", HtmlElementFlag.Closed);
...