Get the contents of specific tags in XML using ElementTree
0 votes
/ February 28, 2019

Below is my XML data:

<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
  <PMID Version="1">1883738</PMID>
  <DateCompleted>
    <Year>1991</Year>
    <Month>10</Month>
    <Day>07</Day>
  </DateCompleted>
  <DateRevised>
    <Year>2013</Year>
    <Month>11</Month>
    <Day>21</Day>
  </DateRevised>
  <Article PubModel="Print">
    <Journal>
      <ISSN IssnType="Print">0959-9673</ISSN>
      <JournalIssue CitedMedium="Print">
        <Volume>72</Volume>
        <Issue>4</Issue>
        <PubDate>
          <Year>1991</Year>
          <Month>Aug</Month>
        </PubDate>
      </JournalIssue>
      <Title>International journal of experimental pathology</Title>
      <ISOAbbreviation>Int J Exp Pathol</ISOAbbreviation>
    </Journal>
    <ArticleTitle>The effect of HeNe laser radiation on the thyroid gland of the rat.</ArticleTitle>
    <Pagination>
      <MedlinePgn>379-85</MedlinePgn>
    </Pagination>
    <Abstract>
      <AbstractText>Although laser irradiation is becoming common practice in medicine, there is not always a clear understanding of the possible side-effects. The present report is a light and electron microscopic study of the effects of fixed low intensity doses of soft HeNe laser on the thyroid of Wistar rats. The immediate effects are mild multifocal degenerative changes; these lesions recover in less than 3 months. Long-term lesions are identified only by electron microscopy; they consist of an increased number of peroxisomes and free or intramitochondrial crystalline structures. We discuss the laser's hypothetical functions.</AbstractText>
    </Abstract>
    <AuthorList CompleteYN="Y">
      <Author ValidYN="Y">
        <LastName>Lerma</LastName>
        <ForeName>E</ForeName>
        <Initials>E</Initials>
        <AffiliationInfo>
          <Affiliation>Department of Pathology and Radiology, Hospital Universitario Virgen Macarena, University of Seville, Spain.</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Hevia</LastName>
        <ForeName>A</ForeName>
        <Initials>A</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Rodrigo</LastName>
        <ForeName>P</ForeName>
        <Initials>P</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Gonzalez-Campora</LastName>
        <ForeName>R</ForeName>
        <Initials>R</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Armas</LastName>
        <ForeName>J R</ForeName>
        <Initials>JR</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Galera</LastName>
        <ForeName>H</ForeName>
        <Initials>H</Initials>
      </Author>
    </AuthorList>
    <Language>eng</Language>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList>
  </Article>
  <MedlineJournalInfo>
    <Country>England</Country>
    <MedlineTA>Int J Exp Pathol</MedlineTA>
    <NlmUniqueID>9014042</NlmUniqueID>
    <ISSNLinking>0959-9673</ISSNLinking>
  </MedlineJournalInfo>
  <ChemicalList>
    <Chemical>
      <RegistryNumber>06LU7C9H1V</RegistryNumber>
      <NameOfSubstance UI="D014284">Triiodothyronine</NameOfSubstance>
    </Chemical>
    <Chemical>
      <RegistryNumber>Q51BO43MG4</RegistryNumber>
      <NameOfSubstance UI="D013974">Thyroxine</NameOfSubstance>
    </Chemical>
  </ChemicalList>
  <CitationSubset>IM</CitationSubset>
  <CommentsCorrectionsList>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Histochem Cytochem. 1969 Oct;17(10):675-80</RefSource>
      <PMID Version="1">4194356</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Acta Anat (Basel). 1986;125(1):10-3</RefSource>
      <PMID Version="1">3953239</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Anat Anz. 1977;142(3):209-12</RefSource>
      <PMID Version="1">603070</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Cell Biol. 1964 Nov;23:383-5</RefSource>
      <PMID Version="1">14222822</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>J Cell Biol. 1967 Jun;33(3):605-23</RefSource>
      <PMID Version="1">6036524</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Am J Med. 1983 May;74(5):852-62</RefSource>
      <PMID Version="1">6837608</PMID>
    </CommentsCorrections>
    <CommentsCorrections RefType="Cites">
      <RefSource>Exp Eye Res. 1977 Jan;24(1):45-56</RefSource>
      <PMID Version="1">402283</PMID>
    </CommentsCorrections>
  </CommentsCorrectionsList>
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D007834" MajorTopicYN="N">Lasers</DescriptorName>
      <QualifierName UI="Q000009" MajorTopicYN="Y">adverse effects</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008297" MajorTopicYN="N">Male</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008830" MajorTopicYN="N">Microbodies</DescriptorName>
      <QualifierName UI="Q000528" MajorTopicYN="N">radiation effects</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D008854" MajorTopicYN="N">Microscopy, Electron</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D051381" MajorTopicYN="N">Rats</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D011919" MajorTopicYN="N">Rats, Inbred Strains</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D013961" MajorTopicYN="N">Thyroid Gland</DescriptorName>
      <QualifierName UI="Q000528" MajorTopicYN="Y">radiation effects</QualifierName>
      <QualifierName UI="Q000648" MajorTopicYN="N">ultrastructure</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D013974" MajorTopicYN="N">Thyroxine</DescriptorName>
      <QualifierName UI="Q000097" MajorTopicYN="N">blood</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName UI="D014284" MajorTopicYN="N">Triiodothyronine</DescriptorName>
      <QualifierName UI="Q000097" MajorTopicYN="N">blood</QualifierName>
    </MeshHeading>
  </MeshHeadingList>
  <OtherID Source="NLM">PMC2001961</OtherID>
</MedlineCitation>
<PubmedData>

I need to extract the last names of all the authors from the document. There are several such files, each with different authors. How can I parse these files and extract only the authors' last names into a list, so that I can build a database from them?

I used ElementTree to parse the document. Here is my code:

import xml.etree.ElementTree as ET

tree = ET.parse("file path"+file)
doc = tree.getroot()
for LastName in doc.iter('LastName'):
    # Serialize the element, then cut off the XML declaration and opening tag
    file1 = (ET.tostring(LastName, encoding='utf8').decode('utf8'))
    file2 = file1[48:(len(file1))]
    author_name_lastname = file2.split("<")[0]
    print(author_name_lastname)

With this I can only print the first author's last name, "Lerma".

1 Answer

0 votes
/ February 28, 2019
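import os
from lxml import etree as ET

# Raw string so the backslash in the Windows path is not read as an escape
DIR = r"D:\yourfilesdirectory/"

for filename in os.listdir(DIR):
    if filename.endswith(".xml"):
        # Parse from the path: lxml then handles the XML encoding declaration
        # itself (fromstring() rejects a str that carries such a declaration)
        _tree = ET.parse(os.path.join(DIR, filename))
        # './/LastName' matches every LastName element at any depth
        _all_metadata_tags = _tree.xpath('.//LastName')
        for i in _all_metadata_tags:
            print(i.text)
    else:
        print("skipping " + filename)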
...
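If lxml is not available, roughly the same result can be had with the standard-library ElementTree module the question already uses. The following is only a sketch under a few assumptions: the helper names `last_names_from_root` and `last_names_from_directory` are made up for illustration, the inline `SAMPLE` is a cut-down stand-in for the PubMed record above, and every file in the directory is assumed to be well-formed XML. Reading `.text` directly avoids serializing each element and slicing the string at a fixed offset.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def last_names_from_root(root):
    """Return the text of every LastName element under root, skipping empty ones."""
    return [el.text.strip() for el in root.iter("LastName")
            if el.text and el.text.strip()]


def last_names_from_directory(directory):
    """Parse each *.xml file in directory and gather all author last names."""
    names = []
    for path in sorted(Path(directory).glob("*.xml")):
        names.extend(last_names_from_root(ET.parse(path).getroot()))
    return names


# Hypothetical, shortened sample shaped like the record in the question
SAMPLE = """<PubmedArticle><MedlineCitation><Article><AuthorList>
<Author><LastName>Lerma</LastName><ForeName>E</ForeName></Author>
<Author><LastName>Hevia</LastName><ForeName>A</ForeName></Author>
</AuthorList></Article></MedlineCitation></PubmedArticle>"""

sample_names = last_names_from_root(ET.fromstring(SAMPLE))
print(sample_names)  # ['Lerma', 'Hevia']
```

Calling `last_names_from_directory` on the folder that holds the files returns one flat list, which can then be written to whatever database is being built.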