XPath does not work on HTML

0 votes
/ June 3, 2011

I have code that reads an HTML file from my local web server (localhost) and converts it to XHTML with tidy. I then load that XHTML into a DOMDocument. The code looks like this:

<?php


function getXHTML($html)
{
    $options = array("output-html" => true,"quote-nbsp" => true, "drop-proprietary-attributes" => true,"drop-font-tags" => true,"drop-empty-paras" => true,"hide-comments" => true);
    $tidy=new tidy();
    $xhtml=$tidy->repairString($html,$options);
    echo $xhtml;
    return $xhtml;
}
$content = file_get_contents("http://localhost/filename.htm");
$page = new DOMDocument();
$xpath=new DOMXPath($page);
$content = getXHTML($content);   // this is a tidy function to return XHTML
$page->loadHTML($content);   
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
echo $total->length;    // this shows zero
?> 

The contents of filename.htm look like this:

<!-- saved from url=(0041)http://www.rtu.ac.in/results/reformat.php -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="SHORTCUT ICON" href="http://www.rtu.ac.in/favicon.ico">
<link href="./Result - Rajasthan Technical University6_files/styleresults.css" rel="stylesheet" type="text/css">
<title>Result - Rajasthan Technical University</title>
</head>
<body>


<table width="773" cellpadding="5" cellspacing="0" align="center">
  <tbody><tr height="60">
    <td width="16%" height="60" valign="top"><font color="brown" size="+2"><img src="./Result - Rajasthan Technical University6_files/logo.jpg" width="100" height="102" border="0" align="right">&nbsp;</font></td>
    <td width="72%" height="60" align="center" valign="top"><p><font color="brown" size="+2"><strong>RAJASTHAN TECHNICAL UNIVERSITY </strong></font></p><font color="brown" size="+2">
      <p><font size="+1"><strong>B.Tech -IVth SEMESTER -2010(Main) 16.5.2011</strong></font></p><font size="+1">&nbsp;</font></font></td>      
    <td width="12%" height="80"><strong>www.rtu.ac.in</strong>&nbsp;</td>
  </tr>
</tbody></table>



<br>
<br>
<table width="783" align="center" cellpadding="5" cellspacing="0" class="table"> 
  <tbody>
    <tr>
      <td width="34%" align="center" valign="top" rowspan="2"><strong>Subject(s) Name </strong>&nbsp;</td>
      <td width="10%" align="center" valign="top" colspan="1" rowspan="2"> <strong>Subject(s) Code </strong>&nbsp;</td>

      <td align="center" valign="top" colspan="3" rowspan="1"><strong>Marks Obtained </strong>&nbsp;</td>
    </tr>


    <tr>
      <td width="20%" align="center"><strong>Internal</strong>&nbsp;</td>
      <td width="18%" align="center"><strong>Theory</strong>&nbsp;</td>
      <td width="18%" align="center">&nbsp;</td>
    </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4551</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 50</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;4552</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 17</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 61</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4553</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 49</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-4</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4554</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 68</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>
        <tr>
          <td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-5</strong>&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">4555</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
          <td align="center" style=" border-bottom: 0px none transparent;"> 36</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-6</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4556</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 48</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      </tr><tr>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;&nbsp;</td>
          <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;<strong>Internal</strong>&nbsp;</td>
          <td width="18%" align="center" style=" border-bottom: 0px none transparent;"><strong>Practical</strong>&nbsp;</td>
        </tr>

        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-1</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4174</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 29</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">48</td>
      </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-2</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4175</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">26</td>
      </tr>

      <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-3</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4171</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 15</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">27</td>
      </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-4</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4172</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 17</td>
        <td align="center" style=" border-bottom: 0px none transparent;">29</td>
        </tr>
      <tr>
        <td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-5</strong>&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;">4173</td>
        <td align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
        <td align="center" style=" border-bottom: 0px none transparent;"> 29</td>
        <td align="center" style=" border-bottom: 0px none transparent;">46</td>
        </tr>




        <tr>
          <td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>Disipline (Deca)</strong>&nbsp;</td>
          <td width="10%" align="center" style=" border-bottom: 0px none transparent;">4176</td>

      <td width="20%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">&nbsp;</td>
      <td width="18%" align="center" style=" border-bottom: 0px none transparent;">46</td>
      </tr>
  <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr></tbody>
</table>

<br><table width="783" align="center" cellpadding="5" cellspacing="0" class="table">
  <tbody><tr>

    <td width="18%" align="center" valign="top"><strong>Practical Marks   </strong>&nbsp;</td>
    <td width="18%" align="center" valign="top">328</td>
    <td width="19%" align="center" valign="top"><strong>Theory Marks </strong>&nbsp;</td>
    <td width="19%" align="center" valign="top">411</td>
  </tr>

  <tr>
    <td width="18%" align="center"><strong>Institute Code   </strong>&nbsp;</td>
    <td width="18%" align="center"> 1229 </td>
    <td width="19%" align="center"><strong>DECCA </strong>&nbsp;</td>
    <td width="19%" align="center">4176</td>
  </tr>

  <tr>

    <td width="18%" align="center"><strong>Division   </strong>&nbsp;</td>
    <td width="18%" align="center"> PASS </td>
    <td width="19%" align="center"><strong>Grand Total </strong>&nbsp;</td>
    <td width="19%" align="center">739</td>
  </tr>
  </tbody></table>


&nbsp;&nbsp; 
<!-- Reformatter by Shashank Kumar Jain (CS, IIIrd Year, 2010-11) -->


<div id="csscan-wrapper" style="display: none; "><h2 id="csscan-header">element</h2><table id="csscan-table"><tbody><tr><th colspan="2" id="csscan-header-font" class="csscan-header">Font</th></tr><tr id="csscan-row-font-family"><td id="csscan-property-font-family" class="csscan-property">font-family</td><td id="csscan-value-font-family" class="csscan-value"></td></tr><tr id="csscan-row-font-size"><td id="csscan-property-font-size" class="csscan-property">font-size</td><td id="csscan-value-font-size" class="csscan-value"></td></tr><tr id="csscan-row-font-style"><td id="csscan-property-font-style" class="csscan-property">font-style</td><td id="csscan-value-font-style" class="csscan-value"></td></tr><tr id="csscan-row-font-variant"><td id="csscan-property-font-variant" class="csscan-property">font-variant</td><td id="csscan-value-font-variant" class="csscan-value"></td></tr><tr id="csscan-row-font-weight"><td id="csscan-property-font-weight" class="csscan-property">font-weight</td><td id="csscan-value-font-weight" class="csscan-value"></td></tr><tr id="csscan-row-letter-spacing"><td id="csscan-property-letter-spacing" class="csscan-property">letter-spacing</td><td id="csscan-value-letter-spacing" class="csscan-value"></td></tr><tr id="csscan-row-line-height"><td id="csscan-property-line-height" class="csscan-property">line-height</td><td id="csscan-value-line-height" class="csscan-value"></td></tr><tr id="csscan-row-text-decoration"><td id="csscan-property-text-decoration" class="csscan-property">text-decoration</td><td id="csscan-value-text-decoration" class="csscan-value"></td></tr><tr id="csscan-row-text-align"><td id="csscan-property-text-align" class="csscan-property">text-align</td><td id="csscan-value-text-align" class="csscan-value"></td></tr><tr id="csscan-row-text-indent"><td id="csscan-property-text-indent" class="csscan-property">text-indent</td><td id="csscan-value-text-indent" class="csscan-value"></td></tr><tr id="csscan-row-text-transform"><td 
id="csscan-property-text-transform" class="csscan-property">text-transform</td><td id="csscan-value-text-transform" class="csscan-value"></td></tr><tr id="csscan-row-white-space"><td id="csscan-property-white-space" class="csscan-property">white-space</td><td id="csscan-value-white-space" class="csscan-value"></td></tr><tr id="csscan-row-word-spacing"><td id="csscan-property-word-spacing" class="csscan-property">word-spacing</td><td id="csscan-value-word-spacing" class="csscan-value"></td></tr><tr id="csscan-row-color"><td id="csscan-property-color" class="csscan-property">color</td><td id="csscan-value-color" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-background" class="csscan-header">Background</th></tr><tr id="csscan-row-background-attachment"><td id="csscan-property-background-attachment" class="csscan-property">bg-attachment</td><td id="csscan-value-background-attachment" class="csscan-value"></td></tr><tr id="csscan-row-background-color"><td id="csscan-property-background-color" class="csscan-property">bg-color</td><td id="csscan-value-background-color" class="csscan-value"></td></tr><tr id="csscan-row-background-image"><td id="csscan-property-background-image" class="csscan-property">bg-image</td><td id="csscan-value-background-image" class="csscan-value"></td></tr><tr id="csscan-row-background-position"><td id="csscan-property-background-position" class="csscan-property">bg-position</td><td id="csscan-value-background-position" class="csscan-value"></td></tr><tr id="csscan-row-background-repeat"><td id="csscan-property-background-repeat" class="csscan-property">bg-repeat</td><td id="csscan-value-background-repeat" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-size" class="csscan-header">Box</th></tr><tr id="csscan-row-width"><td id="csscan-property-width" class="csscan-property">width</td><td id="csscan-value-width" class="csscan-value"></td></tr><tr id="csscan-row-height"><td id="csscan-property-height" 
class="csscan-property">height</td><td id="csscan-value-height" class="csscan-value"></td></tr><tr id="csscan-row-border-top"><td id="csscan-property-border-top" class="csscan-property">border-top</td><td id="csscan-value-border-top" class="csscan-value"></td></tr><tr id="csscan-row-border-right"><td id="csscan-property-border-right" class="csscan-property">border-right</td><td id="csscan-value-border-right" class="csscan-value"></td></tr><tr id="csscan-row-border-bottom"><td id="csscan-property-border-bottom" class="csscan-property">border-bottom</td><td id="csscan-value-border-bottom" class="csscan-value"></td></tr><tr id="csscan-row-border-left"><td id="csscan-property-border-left" class="csscan-property">border-left</td><td id="csscan-value-border-left" class="csscan-value"></td></tr><tr id="csscan-row-margin"><td id="csscan-property-margin" class="csscan-property">margin</td><td id="csscan-value-margin" class="csscan-value"></td></tr><tr id="csscan-row-padding"><td id="csscan-property-padding" class="csscan-property">padding</td><td id="csscan-value-padding" class="csscan-value"></td></tr><tr id="csscan-row-max-height"><td id="csscan-property-max-height" class="csscan-property">max-height</td><td id="csscan-value-max-height" class="csscan-value"></td></tr><tr id="csscan-row-min-height"><td id="csscan-property-min-height" class="csscan-property">min-height</td><td id="csscan-value-min-height" class="csscan-value"></td></tr><tr id="csscan-row-max-width"><td id="csscan-property-max-width" class="csscan-property">max-width</td><td id="csscan-value-max-width" class="csscan-value"></td></tr><tr id="csscan-row-min-width"><td id="csscan-property-min-width" class="csscan-property">min-width</td><td id="csscan-value-min-width" class="csscan-value"></td></tr><tr id="csscan-row-outline-color"><td id="csscan-property-outline-color" class="csscan-property">outline-color</td><td id="csscan-value-outline-color" class="csscan-value"></td></tr><tr 
id="csscan-row-outline-style"><td id="csscan-property-outline-style" class="csscan-property">outline-style</td><td id="csscan-value-outline-style" class="csscan-value"></td></tr><tr id="csscan-row-outline-width"><td id="csscan-property-outline-width" class="csscan-property">outline-width</td><td id="csscan-value-outline-width" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-position" class="csscan-header">Positioning</th></tr><tr id="csscan-row-position"><td id="csscan-property-position" class="csscan-property">position</td><td id="csscan-value-position" class="csscan-value"></td></tr><tr id="csscan-row-top"><td id="csscan-property-top" class="csscan-property">top</td><td id="csscan-value-top" class="csscan-value"></td></tr><tr id="csscan-row-bottom"><td id="csscan-property-bottom" class="csscan-property">bottom</td><td id="csscan-value-bottom" class="csscan-value"></td></tr><tr id="csscan-row-right"><td id="csscan-property-right" class="csscan-property">right</td><td id="csscan-value-right" class="csscan-value"></td></tr><tr id="csscan-row-left"><td id="csscan-property-left" class="csscan-property">left</td><td id="csscan-value-left" class="csscan-value"></td></tr><tr id="csscan-row-float"><td id="csscan-property-float" class="csscan-property">float</td><td id="csscan-value-float" class="csscan-value"></td></tr><tr id="csscan-row-display"><td id="csscan-property-display" class="csscan-property">display</td><td id="csscan-value-display" class="csscan-value"></td></tr><tr id="csscan-row-clear"><td id="csscan-property-clear" class="csscan-property">clear</td><td id="csscan-value-clear" class="csscan-value"></td></tr><tr id="csscan-row-z-index"><td id="csscan-property-z-index" class="csscan-property">z-index</td><td id="csscan-value-z-index" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-list" class="csscan-header">List</th></tr><tr id="csscan-row-list-style-image"><td id="csscan-property-list-style-image" 
class="csscan-property">list-style-image</td><td id="csscan-value-list-style-image" class="csscan-value"></td></tr><tr id="csscan-row-list-style-type"><td id="csscan-property-list-style-type" class="csscan-property">list-style-type</td><td id="csscan-value-list-style-type" class="csscan-value"></td></tr><tr id="csscan-row-list-style-position"><td id="csscan-property-list-style-position" class="csscan-property">list-style-position</td><td id="csscan-value-list-style-position" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-table" class="csscan-header">Table</th></tr><tr id="csscan-row-vertical-align"><td id="csscan-property-vertical-align" class="csscan-property">vertical-align</td><td id="csscan-value-vertical-align" class="csscan-value"></td></tr><tr id="csscan-row-border-collapse"><td id="csscan-property-border-collapse" class="csscan-property">border-collapse</td><td id="csscan-value-border-collapse" class="csscan-value"></td></tr><tr id="csscan-row-border-spacing"><td id="csscan-property-border-spacing" class="csscan-property">border-spacing</td><td id="csscan-value-border-spacing" class="csscan-value"></td></tr><tr id="csscan-row-caption-side"><td id="csscan-property-caption-side" class="csscan-property">caption-side</td><td id="csscan-value-caption-side" class="csscan-value"></td></tr><tr id="csscan-row-empty-cells"><td id="csscan-property-empty-cells" class="csscan-property">empty-cells</td><td id="csscan-value-empty-cells" class="csscan-value"></td></tr><tr id="csscan-row-table-layout"><td id="csscan-property-table-layout" class="csscan-property">table-layout</td><td id="csscan-value-table-layout" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-effects" class="csscan-header">Effects</th></tr><tr id="csscan-row-text-shadow"><td id="csscan-property-text-shadow" class="csscan-property">text-shadow</td><td id="csscan-value-text-shadow" class="csscan-value"></td></tr><tr id="csscan-row--webkit-box-shadow"><td 
id="csscan-property--webkit-box-shadow" class="csscan-property">-webkit-box-shadow</td><td id="csscan-value--webkit-box-shadow" class="csscan-value"></td></tr><tr id="csscan-row-border-radius"><td id="csscan-property-border-radius" class="csscan-property">border-radius</td><td id="csscan-value-border-radius" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-other" class="csscan-header">Other</th></tr><tr id="csscan-row-overflow"><td id="csscan-property-overflow" class="csscan-property">overflow</td><td id="csscan-value-overflow" class="csscan-value"></td></tr><tr id="csscan-row-cursor"><td id="csscan-property-cursor" class="csscan-property">cursor</td><td id="csscan-value-cursor" class="csscan-value"></td></tr><tr id="csscan-row-visibility"><td id="csscan-property-visibility" class="csscan-property">visibility</td><td id="csscan-value-visibility" class="csscan-value"></td></tr></tbody></table></div></body></html>

The XPath above is correct; I verified it with FirePath. Can anyone tell me what I am doing wrong?

Answers [ 3 ]

4 votes
/ June 3, 2011

Try using loadHTML($string) instead of loadXML. From the manual:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
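
The difference the manual describes can be seen in a small sketch. The markup string below is a hypothetical example (its `<li>` elements are never closed), and the variable names are illustrative only:

```php
<?php
// Hypothetical malformed markup: the <li> elements are never closed.
$markup = '<ul><li id="1">item 1<li id="2">item 2</ul>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$asXml = new DOMDocument();
$xmlLoaded = $asXml->loadXML($markup);    // strict XML parser
libxml_clear_errors();

$asHtml = new DOMDocument();
$htmlLoaded = $asHtml->loadHTML($markup); // lenient HTML parser
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

// XPath works on the tree that the HTML parser managed to build.
$xpath = new DOMXPath($asHtml);
$count = $xpath->query('//li')->length;

echo 'loadXML succeeded: ', var_export($xmlLoaded, true), "\n";
echo 'loadHTML succeeded: ', var_export($htmlLoaded, true), "\n";
echo 'li elements found: ', $count, "\n";
```

The strict parser should reject the mismatched tags, while the HTML parser repairs them and yields a tree that XPath can query.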

Update 1

loadHTML builds the same in-memory DOM tree as loadXML; it simply uses a less strict parser. Here is sample code with XPath:

<?php
$content = file_get_contents("1.html");
$page = new DOMDocument();
$page->loadHTML($content);   // this will ignore most formatting errors
echo $page->saveHTML();
echo "=====\n";
$xpath = new DOMXPath($page); // use any "XML" parsing function
foreach ($xpath->query("//li[not(@id='3')]") as $elem) {
        echo "[".trim($elem->textContent)."]\n";
}

Contents of the 1.html file:

<li id="1">item 1
<li id="2">item 2
<li id="3">item 3
<li id="4">item 4

The output will be:

<!DOCTYPE html PUBLIC "...">
<html><body>
<li id="1">item 1
</li>
<li id="2">item 2
</li>
<li id="3">item 3
</li>
<li id="4">item 4
</li>
</body></html>
=====
[item 1]
[item 2]
[item 4]

Update 2

You simply missed the initialization of the $xpath variable. I also removed the getXHTML call because it is not needed:

$content = file_get_contents("2.html");
$page = new DOMDocument();
//$content = getXHTML($content); // not needed if you're using loadHTML
$page->loadHTML($content);
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$xpath = new DOMXPath($page); // creating $xpath object
$total = $xpath->query($totalPath);
echo "[",$total->length,"]";

0 votes
/ June 10, 2011

The answer to the question above is a bit subtle. My original code looked roughly like this:

$xpath=new DOMXPath($page);
..
...
...
$page->loadHTML($content);
..
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
...
...

What happens above is that $xpath is created on an empty document, because the HTML has not yet been loaded into the DOM. So whenever $xpath ran a query, it ran it against that empty document. I then swapped the order of the two statements:

...
...
$page->loadHTML($content);
$xpath=new DOMXPath($page);
...
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);

Now it works, because $xpath is created on a non-empty document.
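
The ordering issue can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical one-paragraph document; the variable names are illustrative and not taken from the original script:

```php
<?php
// Hypothetical, well-formed input used only for this demonstration.
$content = '<html><body><p>hello</p></body></html>';

// Order from the question: DOMXPath is created before the HTML is loaded,
// so it stays bound to the document as it was at construction time.
$early = new DOMDocument();
$earlyXpath = new DOMXPath($early);
$early->loadHTML($content);
$earlyCount = $earlyXpath->query('//p')->length;

// Corrected order: load first, then create DOMXPath on the populated document.
$late = new DOMDocument();
$late->loadHTML($content);
$lateXpath = new DOMXPath($late);
$lateCount = $lateXpath->query('//p')->length;

echo 'XPath created before loadHTML: ', $earlyCount, " match(es)\n";
echo 'XPath created after loadHTML: ', $lateCount, " match(es)\n";
```

Creating the DOMXPath object as late as possible, immediately before the first query, avoids the problem regardless of how the document was populated.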

0 votes
/ June 3, 2011

How much have you experimented with the PHP Tidy options? If the error you get relates to entities (specifically &nbsp;), I wonder whether turning numeric-entities on, or experimenting with the value of preserve-entities, would help.
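
A rough sketch of what that experiment might look like follows. The helper name `repairWithNumericEntities` and the sample input are hypothetical, and whether these options change the XPath result for the file in the question is untested here:

```php
<?php
/**
 * Repairs markup with Tidy, asking for XHTML output and numeric entities.
 * Hypothetical helper; returns null when the tidy extension is unavailable
 * or the repair fails.
 */
function repairWithNumericEntities(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }

    $options = [
        'output-xhtml'     => true, // emit XHTML rather than HTML
        'numeric-entities' => true, // prefer numeric references over named entities
    ];

    $result = tidy_repair_string($html, $options, 'utf8');

    return $result === false ? null : $result;
}

$sample = '<p>16&nbsp;</p>'; // hypothetical input containing a named entity
$xhtml = repairWithNumericEntities($sample);

echo $xhtml ?? "tidy extension not available\n";
```

Note that the function in the question sets "output-html" rather than "output-xhtml", so its result is HTML even though the surrounding text calls it XHTML.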

Plan B: Try this. XPath has worked for me even with badly formed HTML files.

<?php 

$oldSetting = libxml_use_internal_errors( true ); 
libxml_clear_errors(); 

$html = new DOMDocument(); 
$html->loadHTMLFile('myHtmlFile.html'); 

$xpath = new DOMXPath( $html ); 
$test = $xpath->query( "//div[@id='mydiv']" ); 

// item(0) returns null when nothing matched, so check before using it
$div = $test->item(0);
if ($div !== null) {
    echo $div->getAttribute('style');
}

libxml_clear_errors(); 
libxml_use_internal_errors( $oldSetting ); 
?>
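
With internal error handling switched on, the parser's complaints are stored rather than printed, and they can be inspected with libxml_get_errors() before being cleared. A minimal sketch, assuming a hypothetical malformed fragment in place of myHtmlFile.html:

```php
<?php
// Hypothetical malformed fragment: unknown element and a stray closing tag.
$markup = '<div id="mydiv" style="color:red"><foo>text</p></div>';

$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Whatever the parser recorded instead of raising PHP warnings.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with level, code, line, column and message.
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

// The query still works on the repaired tree.
$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@id='mydiv']")->item(0);
$style = $div !== null ? $div->getAttribute('style') : null;

echo 'style: ', var_export($style, true), "\n";
```

Which errors are reported, if any, depends on the libxml2 version PHP was built against, so the list is best treated as diagnostic output rather than something to rely on.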
...