Merging tags with the same attribute in XML with Python does not work the way I want
0 votes
/ April 18, 2020

I have an XML file structured like this:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.5">
<textline bbox="68.031,540.828,372.755,553.566">
        <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
          <text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
          <text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
          <text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
          <text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
          <text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
          <text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
          <text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
          <text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
          <text font="NUMPTY+ImprintMTnum" bbox="68.031,540.828,72.548,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
          <text font="NUMPTY+ImprintMTnum" bbox="72.404,540.828,77.499,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
          <text font="NUMPTY+ImprintMTnum" bbox="77.354,540.828,81.871,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
          <text font="NUMPTY+ImprintMTnum" bbox="81.726,540.828,84.975,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
          <text font="NUMPTY+ImprintMTnum" bbox="84.830,540.828,89.925,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
        </new_line>
      </textline>
<textline bbox="68.031,528.028,372.758,540.766">
        <new_line>
          <text font="NUMPTY+ImprintMTnum" bbox="106.735,540.828,113.110,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
          <text font="NUMPTY+ImprintMTnum" bbox="112.965,540.828,118.061,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
          <text font="NUMPTY+ImprintMTnum" bbox="117.916,540.828,121.164,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
          <text font="NUMPTY+ImprintMTnum" bbox="121.020,540.828,124.268,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
          <text font="NUMPTY+ImprintMTnum" bbox="124.124,540.828,129.219,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="129.074,540.828,131.978,553.310" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
</new_line>
</textline>
</textbox>
</page>
</pages>

I want to merge adjacent text elements that have the same size attribute, but only within each new_line block.
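To make the intended rule concrete, here is a minimal sketch of the grouping step alone, applied to plain (size, character) pairs rather than to the XML. The function name `merge_runs` and the two sample lines are made up for illustration; they are shortened stand-ins for the real new_line blocks.

```python
from itertools import groupby


def merge_runs(chars):
    """Merge adjacent (size, text) pairs that share the same size."""
    merged = []
    # groupby only groups *consecutive* items with an equal key,
    # which is exactly the "adjacent characters" requirement.
    for size, run in groupby(chars, key=lambda pair: pair[0]):
        merged.append((size, "".join(text for _, text in run)))
    return merged


# Two hypothetical new_line blocks, reduced to (size, character) pairs.
line_1 = [("10.685", "1"), ("12.482", "s"), ("12.482", "v"),
          ("12.727", "s"), ("12.727", "p"), ("12.482", "m"), ("12.482", "i")]
line_2 = [("12.482", "d"), ("12.482", "e")]

# Each block is merged on its own, so the trailing run of line_1
# is never joined with the leading run of line_2.
result = [merge_runs(line) for line in (line_1, line_2)]
```

The point of the sketch is the boundary: the last run of the first block and the first run of the second block have the same size, yet they must stay separate.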

My expected output would be:

  <pages>
      <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="191.745,592.218,249.042,603.5">
<textline bbox="68.031,540.828,372.755,553.566">
        <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text>
<text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">mi-seria</text>
        </new_line>
      </textline>
<textline>
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">della </text>
</new_line>
</textline>
</textbox>
</page>
</pages>

But instead I get this:

  <pages>
      <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="191.745,592.218,249.042,603.5">
<textline bbox="68.031,540.828,372.755,553.566">
        <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text>
        </new_line>
      </textline>
<textline>
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">mi-seria della </text>
</new_line>
</textline>
</textbox>
</page>

    </pages>

My code is:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output.xml', parser)
root = tree.getroot()

# Iterate over every new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.xpath('//text')

    # Iterate over all of them with the current and previous ones
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get size elements
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Add them to current element
            current_text.text = pt + ct
            # Remove previous element
            previous_text.getparent().remove(previous_text)


newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
#print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

What can I do to solve my problem?
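One likely cause, offered as a sketch rather than a definitive answer: in lxml, an XPath expression that starts with `//` is evaluated from the document root even when it is called on an element, so `new_line_block.xpath('//text')` returns every text element in the file, not only those inside the current block. The snippet below demonstrates the difference on a tiny made-up document (the XML and the variable names are illustrative, not taken from output.xml).

```python
import lxml.etree as etree

# A tiny made-up document with two new_line blocks.
xml = b"""<pages>
  <new_line><text size="1">a</text><text size="1">b</text></new_line>
  <new_line><text size="1">c</text></new_line>
</pages>"""
root = etree.fromstring(xml)
first_block = root.xpath('//new_line')[0]

# Absolute path: searched from the document root, ignores first_block.
absolute = first_block.xpath('//text')
# Relative path: searched only among the descendants of first_block.
relative = first_block.xpath('.//text')

absolute_count = len(absolute)
relative_count = len(relative)
```

If this is what happens here, changing the expression in the loop to `.//text` (or `text`, for direct children only) would restrict each pass to a single new_line block, so runs would no longer be merged across block boundaries.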

...