When inserting a parent element into XML, its own parent element is not preserved (Python)
0 votes
/ April 25, 2020

I have an XML document structured like this:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
        <text bbox="191.745,592.218,249.042,603.578">a
        </text>
        <text bbox="191.745,592.218,249.042,603.578">b
        </text>
      </textline>
    </textbox>
    <textbox id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
      <textline id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <text bbox="0.000,0.000,462.047,680.315">c
        </text>
      </textline>
    </textbox>
  </page>
</pages>

In this document, whenever the distance between the attribute value of one text element and that of its preceding sibling exceeds a certain threshold, I want to open a new_line tag, and close it when a new one needs to be opened.

The code works, but it has a problem: the parent textbox element is not preserved properly.

My code is:
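The grouping rule described above can be sketched on plain numbers, independently of any XML. This is only an illustration: the function name `split_on_gaps` and the default threshold of 20 are assumptions, and unlike the code below this sketch opens the first group immediately. The sample values are first bbox coordinates taken from the input shown later in the question.

```python
def split_on_gaps(xs, threshold=20.0):
    """Group consecutive x values; start a new group when the forward jump exceeds threshold."""
    groups = []
    previous = None
    for x in xs:
        # Open a new group for the first value or after a large forward gap
        if previous is None or x - previous > threshold:
            groups.append([])
        groups[-1].append(x)
        previous = x
    return groups


# 166.626 -> 186.807 is a jump of about 20.18, so a new group starts there
print(split_on_gaps([161.887, 166.626, 186.807, 191.201]))
```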

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()

# Get the first BBox value as float
# Return None if not found
def getBBoxFirstValue(line):
    if line is not None:
        bb = line.attrib.get('bbox')
        if bb is not None:
            try:
                return float(bb.split(",")[0])
            except ValueError:
                pass
    return None


new_line        = None
previous_bb     = None

for x in tree.xpath('//text'):
    # Get current bb value
    bb = getBBoxFirstValue(x)
    # Check current and past values aren't empty
    if bb is not None and previous_bb is not None:
        #print(bb, previous_bb, (bb-previous_bb))
        #print(abs(bb-previous_bb))
        # If the distance from the previous bb is > 20 (or the value jumps back by more than 1000)
        if (bb - previous_bb) > 20 or (bb - previous_bb) < -1000:
            # If new_line isn't empty: it's inserted into parent tag at position of current tag

            if new_line is not None:
                x.getparent().insert(x.getparent().index(x), new_line)
            # A new "new_line" element is created
            new_line = etree.Element("new_line")

        # If new_line isn't None (i.e. at least one large gap has already been found)
        if new_line is not None:
            new_line.append(x)

    # Keep latest non empty BBox 1st value
    if bb is not None:
        previous_bb = bb

# Add the last new_line element if it is not None
if new_line is not None:
    tree.xpath('//text')[-1].getparent().append(new_line)


newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("UTF-8")
print(newtree)
with open("output.xml", "wb") as f:
    f.write(newtree)

The full XML input file is here: link. And here is my output: link.

EDIT:

An example of how my input behaves:

enter image description here

An example of what my input XML looks like:

<textbox id="2" bbox="68.031,511.628,372.728,537.166">
<textline bbox="68.031,524.428,372.728,537.166">
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,524.872,72.786,535.557" colourspace="DeviceGray" ncolour="0" size="10.685">2</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="72.691,524.872,77.445,535.557" colourspace="DeviceGray" ncolour="0" size="10.685">4</text>
 
<text font="NUMPTY+ImprintMTnum" bbox="79.665,524.428,86.040,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">p</text>
<text font="NUMPTY+ImprintMTnum" bbox="85.917,524.428,91.013,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="90.890,524.428,95.407,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="95.051,524.428,98.299,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="98.177,524.428,103.272,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="103.150,524.428,106.977,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
<text font="NUMPTY+ImprintMTnum" bbox="106.855,524.428,111.950,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="111.827,524.428,115.076,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">;</text>
<text font="NUMPTY+ImprintMTnum" bbox="114.954,524.428,118.781,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="118.658,524.428,121.562,536.910" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="121.021,524.439,126.817,537.166" colourspace="DeviceGray" ncolour="0" size="12.727">d</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="126.706,524.439,132.502,537.166" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
 
<text font="NUMPTY+ImprintMTnum" bbox="134.697,524.428,141.072,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">p</text>
<text font="NUMPTY+ImprintMTnum" bbox="140.949,524.428,146.045,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="145.922,524.428,150.439,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="150.083,524.428,153.331,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="153.209,524.428,158.304,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="158.182,524.428,162.009,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
<text font="NUMPTY+ImprintMTnum" bbox="161.887,524.428,166.982,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="166.626,524.428,169.874,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">,</text>
 
<text font="NUMPTY+ImprintMTnum" bbox="186.807,524.428,191.323,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.201,524.428,196.296,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="196.352,524.428,202.148,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">g</text>
<text font="NUMPTY+ImprintMTnum" bbox="202.037,524.428,207.833,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">g</text>
<text font="NUMPTY+ImprintMTnum" bbox="207.722,524.428,210.970,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="210.859,524.428,216.655,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
<text font="NUMPTY+ImprintMTnum" bbox="216.544,524.428,219.792,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="219.681,524.428,225.477,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
<text font="NUMPTY+ImprintMTnum" bbox="225.366,524.428,231.740,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
<text font="NUMPTY+ImprintMTnum" bbox="231.629,524.428,236.724,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="236.324,524.428,239.572,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">.</text>
<text font="NUMPTY+ImprintMTnum" bbox="239.461,524.428,243.288,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="243.177,524.428,246.081,536.910" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="245.473,524.439,251.269,537.166" colourspace="DeviceGray" ncolour="0" size="12.727">d</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="251.158,524.439,256.954,537.166" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="256.843,524.439,259.746,537.166" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="259.149,524.428,263.665,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="263.543,524.428,268.638,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="268.694,524.428,274.490,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">g</text>
<text font="NUMPTY+ImprintMTnum" bbox="274.379,524.428,280.175,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">g</text>
<text font="NUMPTY+ImprintMTnum" bbox="280.064,524.428,283.312,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="283.201,524.428,288.997,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
<text font="NUMPTY+ImprintMTnum" bbox="288.886,524.428,292.134,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="292.023,524.428,297.819,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
<text font="NUMPTY+ImprintMTnum" bbox="297.708,524.428,304.082,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
<text font="NUMPTY+ImprintMTnum" bbox="303.971,524.428,309.066,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="308.955,524.428,312.204,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">;</text>
 
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="329.092,524.872,333.846,535.557" colourspace="DeviceGray" ncolour="0" size="10.685">2</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="333.755,524.872,338.509,535.557" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="338.418,524.872,340.800,535.557" colourspace="DeviceGray" ncolour="0" size="10.685"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="340.309,524.428,347.963,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">E</text>
<text font="NUMPTY+ImprintMTnum" bbox="347.841,524.428,351.090,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="350.967,524.428,354.216,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
<text font="NUMPTY+ImprintMTnum" bbox="354.093,524.428,359.189,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="359.066,524.428,361.970,536.910" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
<text font="NUMPTY+ImprintMTnum" bbox="361.380,524.428,367.755,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">h</text>
<text font="NUMPTY+ImprintMTnum" bbox="367.632,524.428,372.728,536.910" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>


</textline>
<textline bbox="68.031,511.628,308.559,524.366">
<text font="NUMPTY+ImprintMTnum" bbox="68.031,511.628,74.406,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
<text font="NUMPTY+ImprintMTnum" bbox="74.284,511.628,79.379,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="79.257,511.628,82.160,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="NUMPTY+ImprintMTnum" bbox="82.327,511.628,86.844,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="86.721,511.628,91.817,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="91.694,511.628,98.069,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="97.947,511.628,102.463,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="102.341,511.628,107.436,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="107.314,511.628,111.831,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="111.708,511.628,121.331,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="NUMPTY+ImprintMTnum" bbox="121.209,511.628,124.457,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="124.335,511.628,128.162,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="128.040,511.628,130.943,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="PYNIYO+ImprintMTnum-Italic" bbox="131.181,511.639,136.977,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="136.866,511.639,141.382,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">r</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="141.271,511.639,144.520,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="144.408,511.639,152.752,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">m</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="152.641,511.639,158.437,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
 
<text font="NUMPTY+ImprintMTnum" bbox="161.378,511.628,172.848,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">M</text>
<text font="NUMPTY+ImprintMTnum" bbox="172.725,511.628,175.974,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="175.851,511.628,178.755,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="NUMPTY+ImprintMTnum" bbox="178.922,511.628,183.439,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="183.316,511.628,188.412,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="188.289,511.628,194.664,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
<text font="NUMPTY+ImprintMTnum" bbox="194.541,511.628,199.058,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
<text font="NUMPTY+ImprintMTnum" bbox="198.936,511.628,202.184,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
 
<text font="NUMPTY+ImprintMTnum" bbox="219.105,511.628,224.201,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="224.078,511.628,226.982,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="NUMPTY+ImprintMTnum" bbox="227.149,511.628,236.772,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="NUMPTY+ImprintMTnum" bbox="236.650,511.628,239.898,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="239.776,511.628,246.150,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
<text font="NUMPTY+ImprintMTnum" bbox="246.028,511.628,251.123,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="251.001,511.628,256.096,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="255.974,511.628,261.069,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
<text font="NUMPTY+ImprintMTnum" bbox="260.947,511.628,264.195,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
<text font="NUMPTY+ImprintMTnum" bbox="264.073,511.628,269.168,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="269.046,511.628,273.562,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
<text font="NUMPTY+ImprintMTnum" bbox="273.329,511.628,278.424,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
<text font="NUMPTY+ImprintMTnum" bbox="278.302,511.628,282.129,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
<text font="NUMPTY+ImprintMTnum" bbox="282.006,511.628,284.910,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="NUMPTY+ImprintMTnum" bbox="285.077,511.628,290.172,524.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
<text font="NUMPTY+ImprintMTnum" bbox="290.050,511.628,292.953,524.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
 
<text font="PYNIYO+ImprintMTnum-Italic" bbox="293.240,511.639,296.488,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="296.377,511.639,302.173,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">n</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="302.062,511.639,305.889,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="305.310,511.639,308.559,524.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>


</textline>
</textbox>

An example of how the output behaves:

enter image description here

Essentially, when the new_line parent tag is inserted, the XML picks up the previous line.

EDIT: a sample of the output I get with Alexander's code (I left the page and pages tags out of the sample, since they are not relevant here):
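One thing that may help explain this behavior: in lxml an element has exactly one parent, so appending an element that is already in the tree moves it rather than copying it. The following minimal sketch (the tiny document is made up for illustration) shows that a text element collected into a detached new_line leaves its original textline, and ends up wherever new_line is inserted later.

```python
import lxml.etree as etree

root = etree.fromstring(
    "<textbox id='0'><textline><text>a</text><text>b</text></textline></textbox>"
)
first = root.find(".//text")

wrapper = etree.Element("new_line")
wrapper.append(first)  # moves <text>a</text> out of <textline>

print(first.getparent().tag)        # the text now lives in new_line
print(len(root.find("textline")))   # only one <text> is left in the textline
print(wrapper.getparent())          # new_line itself is not attached to the tree yet
```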

<textbox id="6" bbox="68.031,502.428,372.824,566.366">
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="68.031,553.639,76.375,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">T</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="76.231,553.639,79.479,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">i</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="79.334,553.639,83.161,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">t</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="83.017,553.639,88.112,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="87.968,553.639,91.216,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">l</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="91.071,553.639,96.167,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">o</text>
      <text font="NUMPTY+ImprintMTnum" bbox="99.311,553.628,104.406,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">I</text>
      <text font="NUMPTY+ImprintMTnum" bbox="104.261,553.628,107.510,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">l</text>
      <text font="NUMPTY+ImprintMTnum" bbox="107.365,553.628,110.269,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
      <text font="NUMPTY+ImprintMTnum" bbox="110.658,553.628,119.002,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">C</text>
      <text font="NUMPTY+ImprintMTnum" bbox="118.857,553.628,123.953,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
      <text font="NUMPTY+ImprintMTnum" bbox="123.808,553.628,130.183,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
      <text font="NUMPTY+ImprintMTnum" bbox="130.038,553.628,134.555,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
      <text font="NUMPTY+ImprintMTnum" bbox="134.410,553.628,137.659,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
      <text font="NUMPTY+ImprintMTnum" bbox="137.514,553.628,143.889,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
      <text font="NUMPTY+ImprintMTnum" bbox="143.744,553.628,146.993,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
      <text font="NUMPTY+ImprintMTnum" bbox="146.848,553.628,151.943,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">c</text>
      <text font="NUMPTY+ImprintMTnum" bbox="151.799,553.628,157.595,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
      <text font="NUMPTY+ImprintMTnum" bbox="157.450,553.628,161.277,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
      <text font="NUMPTY+ImprintMTnum" bbox="161.132,553.628,164.036,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="164.417,553.639,168.244,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="168.099,553.639,173.895,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="173.751,553.639,177.578,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="176.966,553.639,180.215,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="180.070,553.639,182.974,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
      <text font="PYNIYO+ImprintMTnum-Italic" bbox="183.363,553.639,189.159,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
      <text font="NUMPTY+ImprintMTnum" bbox="192.314,553.628,201.937,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">D</text>
      <text font="NUMPTY+ImprintMTnum" bbox="201.793,553.628,207.589,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
      <text font="NUMPTY+ImprintMTnum" bbox="207.444,553.628,213.819,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
      <text font="NUMPTY+ImprintMTnum" bbox="213.674,553.628,216.578,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
      <text font="NUMPTY+ImprintMTnum" bbox="216.967,553.628,225.311,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">R</text>
      <text font="NUMPTY+ImprintMTnum" bbox="225.166,553.628,230.962,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
      <text font="NUMPTY+ImprintMTnum" bbox="230.818,553.628,237.192,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">d</text>
      <text font="NUMPTY+ImprintMTnum" bbox="237.048,553.628,241.565,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
      <text font="NUMPTY+ImprintMTnum" bbox="241.420,553.628,244.668,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
      <text font="NUMPTY+ImprintMTnum" bbox="244.524,553.628,250.320,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">g</text>
      <text font="NUMPTY+ImprintMTnum" bbox="250.064,553.628,255.860,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">o</text>
      <new_line>
        <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
        <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
        <text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
        <text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
        <text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
        <text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
        <text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
        <text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
        <text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
        <text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
        <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
        <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
        <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
        <text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
        <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
        <text font="NUMPTY+ImprintMTnum" bbox="68.031,540.828,72.548,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
        <text font="NUMPTY+ImprintMTnum" bbox="72.404,540.828,77.499,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
        <text font="NUMPTY+ImprintMTnum" bbox="77.354,540.828,81.871,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
        <text font="NUMPTY+ImprintMTnum" bbox="81.726,540.828,84.975,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
        <text font="NUMPTY+ImprintMTnum" bbox="84.830,540.828,89.925,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
      </new_line>

Basically, new_line starts some way into the textbox rather than right after the opening textbox tag.

Link to the full output I get: link

1 Answer

0 votes
/ April 26, 2020

I see two points.

  • First, all <textline> elements can be removed with strip_tags. This discussion explains how the function works.

  • Then wrap the new_line search in a loop over all <textbox> elements, using iter.
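As a small illustration of the first point, here is what `etree.strip_tags` does on a made-up fragment: the `<textline>` wrappers disappear while their children are pulled up into the parent.

```python
import lxml.etree as etree

root = etree.fromstring(
    "<textbox><textline><text>a</text><text>b</text></textline>"
    "<textline><text>c</text></textline></textbox>"
)

# Remove the <textline> tags but keep their children in place
etree.strip_tags(root, "textline")

print([child.tag for child in root])
print([child.text for child in root])
```

After the call, all three `<text>` elements are direct children of `<textbox>`, so each textbox can be scanned in a single pass.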


Full code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)

# Get the first BBox value as float
# Return None if not found
def getBBoxFirstValue(line):
    if line is not None:
        bb = line.attrib.get('bbox')
        if bb is not None:
            try:
                return float(bb.split(",")[0])
            except ValueError:
                pass
    return None

# Remove all 'textline' elements
etree.strip_tags(tree, 'textline')

# Search for all text "textbox" elements
for textbox in tree.xpath('//textbox'):
    new_line = etree.Element("new_line")
    previous_bb = None

    # From a given textbox element, iterate over all the "text" elements.
    # list() takes a snapshot, because the loop moves elements around.
    for x in list(textbox.iter("text")):
        # Get current bb value
        bb = getBBoxFirstValue(x)
        # Check current and past values aren't empty
        if bb is not None and previous_bb is not None and (bb - previous_bb) > 10:
            # Insert new_line into the parent tag, just before the current element
            x.getparent().insert(x.getparent().index(x), new_line)

            # A new "new_line" element is created
            new_line = etree.Element("new_line")

        # Append the current element to the new_line tag
        new_line.append(x)

        # Keep latest non empty BBox 1st value
        if bb is not None:
            previous_bb = bb

    # Add the last new_line element to the textbox
    textbox.append(new_line)


tree.write("output.xml", pretty_print=True)
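For experimenting without input files, a compact variant of the same idea can be run on an in-memory string. This is a sketch under assumptions, not the code above: the helper names `first_x` and `wrap_runs` are made up, the sample is a shortened version of the question's input, and the condition uses `abs(...)` so that a jump back to the left margin (a line wrap) also starts a new new_line, which may or may not be the desired behavior.

```python
import lxml.etree as etree

SAMPLE = """<textbox id="0">
  <textline>
    <text bbox="68.031,524.872,72.786,535.557">2</text>
    <text bbox="72.691,524.872,77.445,535.557">4</text>
    <text bbox="186.807,524.428,191.323,536.910">s</text>
  </textline>
  <textline>
    <text bbox="68.031,511.628,74.406,524.110">d</text>
  </textline>
</textbox>"""


def first_x(elem):
    """Return the first bbox coordinate as a float, or None if missing/invalid."""
    try:
        return float(elem.get("bbox", "").split(",")[0])
    except ValueError:
        return None


def wrap_runs(textbox, gap=10.0):
    """Wrap runs of <text> elements in <new_line>; a run ends at a large jump in x."""
    texts = list(textbox.iter("text"))  # snapshot before mutating the tree
    current, previous = None, None
    for text in texts:
        x = first_x(text)
        jumped = x is not None and previous is not None and abs(x - previous) > gap
        if current is None or jumped:
            # New wrappers are appended at the end of the textbox, in order
            current = etree.SubElement(textbox, "new_line")
        current.append(text)  # moves the element into the wrapper
        if x is not None:
            previous = x


parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(SAMPLE, parser)
etree.strip_tags(root, "textline")
wrap_runs(root)
print(etree.tostring(root, pretty_print=True).decode())
```

Because every `<text>` is moved into a wrapper created inside its own textbox, no element can end up under a new_line that belongs to a different textbox.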

Hope this helps!
