The ICDAR 2009 dataset provides its ground truth in XML format:
<?xml version="1.0" encoding="UTF-8"?>
<bs-submission participant-id="0"
run-id="GROUNDTRUTH"
task="book-toc"
toc-creation="semi-automatic"
toc-source="full-content">
<source-files xml="no" pdf="no" />
<description>
This file contains the annotated groundtruth file (ideal ToCs), manually and collaboratively built by the participants of the ICDAR Structure Extraction competition 2009 and used for evaluation.
</description>
<book>
<bookid>049AA21392135223</bookid>
<toc-section page="11" /><toc-entry title="I. Introduction" page="15" />
<toc-entry title="II. List of the skeletal remains" page="20" />
<toc-entry title="III. The New Orleans skeleton" page="21" />
<toc-entry title="IV. The Quebec skeleton" page="22" />
<toc-entry title="V. The Natchez pelvic bone" page="22" />
<toc-entry title="VI. The Lake Monroe (Florida) bones" page="25" />
<toc-entry title="VII. The Soda Creek skeleton" page="26" />
<toc-entry title="VIII. The Charleston bones" page="26" />
<toc-entry title="IX. The Calaveras skull" page="27">
<toc-entry title="History" page="27" />
<toc-entry title="Physical characters." page="28" />
<toc-entry title="Comparisons" page="33" />
</toc-entry>
<toc-entry title="X. The Rock Bluff cranium" page="36" />
<toc-entry title="XI. The Man of Penon" page="42" />
<toc-entry title="XII. The crania of Trenton" page="45">
<toc-entry title="The Burlington County skull" page="46" />
<toc-entry title="The Riverview Cemetery skull" page="46" />
<toc-entry title="Racial affinities of the Burlington County and Riverview Cemetery skulls" page="55" />
</toc-entry>
<toc-entry title="XIII. The Trenton femur" page="60" />
<toc-entry title="XIV. The Lansing skeleton" page="61">
<toc-entry title="Somatological characters" page="62" />
<toc-entry title="Conclusion" page="68" />
</toc-entry>
<toc-entry title="XV. The fossil man of western Florida" page="69">
<toc-entry title="The Osprey skull" page="69" />
<toc-entry title="The North Osprey bones" page="70" />
<toc-entry title="The Hanson Landing remains" page="71" />
<toc-entry title="The South Osprey remains" page="71" />
<toc-entry title="Examination of the specimens" page="72" />
<toc-entry title="Physical characters" page="75" />
<toc-entry title="Resume" page="82">
<toc-entry title="Report of Dr. T. Way land Vaughan" page="86" />
</toc-entry>
</toc-entry>
<toc-entry title="XVI. Mound crania (Florida)" page="90" />
<toc-entry title="XVII. The Nebraska &quot;loess man&quot;" page="90">
<toc-entry title="History of finds" page="91" />
<toc-entry title="Description of the mound" page="98" />
<toc-entry title="Examination of the bones" page="100" />
<toc-entry title="Discussion" page="115" />
</toc-entry>
<toc-entry title="XVIII. General conclusion" page="130" />
<toc-entry title="XIX. Appendix: Recent Indian skulls of low type in the U.S. National Museum" page="147" />
<toc-entry title="Index" page="157" />
</book>
</bs-submission>
In this large XML file, some <book> elements have child elements called <toc-section>.
I would like to iterate over all <book> elements to find out whether any of them contain no such children. How can I do this in Python, for example with lxml.html?
Here is the beginning of my script:
import lxml.html

with open(icdaf_xmlfile) as infile:
    icdar2013_tree_string = infile.read()
root = lxml.html.fromstring(icdar2013_tree_string)
for book in root.iter('book'):
    # check if book contains toc-section
    pass  # placeholder: this check is what the question is about
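To make the goal concrete, here is a minimal sketch of the check I have in mind. It uses the standard library's xml.etree.ElementTree rather than lxml.html, which is my own substitution for the sake of a self-contained example. The helper name books_without_toc_section and the two-book sample are made up for illustration and are not part of the dataset.

```python
import xml.etree.ElementTree as ET


def books_without_toc_section(root):
    """Return the <book> elements that have no direct <toc-section> child."""
    # find() looks only at direct children and returns None when nothing matches.
    return [book for book in root.iter("book") if book.find("toc-section") is None]


# Hypothetical stand-in for the real file: book A has a <toc-section>, book B does not.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<bs-submission>
  <book><bookid>A</bookid><toc-section page="11" /><toc-entry title="I" page="15" /></book>
  <book><bookid>B</bookid><toc-entry title="I" page="3" /></book>
</bs-submission>"""

# Parsing from bytes lets the parser honour the encoding declaration.
root = ET.fromstring(sample)
missing_ids = [book.findtext("bookid") for book in books_without_toc_section(root)]
print(missing_ids)
```

For the real file, ET.parse(path).getroot() would presumably take the place of ET.fromstring(sample). If <toc-section> could also appear deeper than a direct child, book.find(".//toc-section") would search all descendants instead.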