How to parse a 13 GB XML data file in Python - PullRequest
2 votes
/ April 14, 2020

I have a 13 GB XML file that I need to parse and then export to CSV and JSON using Python. The file is structured, and its format is shown below. Does anyone know how to do this extraction in Python?

  1. Accuracy and format of the output files
  2. Speed of your algorithm
  3. Design of your approach
  4. Memory usage of your algorithm
<ClinVarSet ID="38755811">
      <RecordStatus>current</RecordStatus>
      <Title>LIPA, 934G-A AND Lysosomal acid lipase deficiency</Title>
      <ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2019-03-30" ID="57601">
        <ClinVarAccession Acc="RCV000000098" Version="3" Type="RCV" DateUpdated="2019-03-31"/>
        <RecordStatus>current</RecordStatus>
        <ClinicalSignificance DateLastEvaluated="1996-04-01">
          <ReviewStatus>no assertion criteria provided</ReviewStatus>
          <Description>Pathogenic</Description>
        </ClinicalSignificance>
        <Assertion Type="variation to disease"/>
        <ObservedIn>
          <Sample>
            <Origin>germline</Origin>
            <Species TaxonomyId="9606">human</Species>
            <AffectedStatus>not provided</AffectedStatus>
          </Sample>
          <Method>
            <MethodType>literature only</MethodType>
          </Method>
          <ObservedData ID="37870920">
            <Attribute Type="Description">In a 12-year-old patient with cholesteryl ester storage disease (278000) from a nonconsanguineous Polish-German family, Klima et al. (1993) detected a 72-bp in-frame deletion resulting in the loss of amino acid codons 254 through 277. Analysis of genomic DNA revealed that the 72 bp represented an exon, indicating that the deletion in the mRNA was caused by defective splicing. Sequence analysis of the patient's genomic DNA revealed a G-to-A substitution in the last nucleotide of the 72-bp exon on 1 allele. No normal-sized mRNA was detectable in the propositus even though he was not homozygous for the splice site mutation. Klima et al. (1993) concluded that the patient was compound heterozygous for the splice site mutation and a null allele. The patient showed LIPA activity in cultured skin fibroblasts approximately 9% of normal. Hepatosplenomegaly had been present since age 5 years.</Attribute>
            <Citation Type="general">
              <ID Source="PubMed">8254026</ID>
            </Citation>
          </ObservedData>
          <ObservedData ID="37870920">
            <Attribute Type="Description">Aslanidis et al. (1996) restudied the patient of Klima et al. (1993) and defined the splice site mutation as a G-to-A mutation at position -1 of the splice donor site following exon 8, resulting in incorrect splicing and the removal of the 72-bp exon 8 of the LIPA gene. They determined that the other allele of the patient carried a premature termination mutation (613497.0003) as well as the L179P mutation (613497.0001); the LIPA mRNA was rendered unstable by the premature stop codon. Aslanidis et al. (1996) demonstrated that the splice site mutation allowed the production of approximately 3 to 4% of correctly spliced mRNA relative to wildtype. Aslanidis et al. (1996) also identified a mutation at the same splice donor site, and also resulting in deletion of exon 8, in 2 sibs with Wolman disease; that mutation, at the +1 position, allowed no correct splicing, and patient fibroblasts were devoid of enzymatic activity. See 613497.0005.</Attribute>
            <Citation Type="general">
              <ID Source="PubMed">8254026</ID>
            </Citation>
            <Citation Type="general">
              <ID Source="PubMed">8617513</ID>
            </Citation>
          </ObservedData>
          <ObservedData ID="37870920">
            <Attribute Type="Description">In 2 sibs with CESD, Maslen and Illingworth (1993) and Maslen et al. (1995) identified compound heterozygosity for this splice site mutation in the LIPA gene, inherited from their father, and the L179P mutation (613497.0001). The affected children were a sister and brother who presented with idiopathic hepatomegaly at ages 6 and 8 years, respectively. Subsequent analyses indicated that they also had hypercholesterolemia and a severe reduction in cholesteryl ester hydrolase activity in cultured fibroblasts.</Attribute>
            <Citation Type="general">
              <ID Source="PubMed">8598644</ID>
            </Citation>
            <Citation Type="general">
              <CitationText>Maslen, C. L., Illingworth, D. R. Molecular genetics of cholesterol ester hydrolase deficiency. (Abstract) Am. J. Hum. Genet. 53 (suppl.): A926, 1993.</CitationText>
            </Citation>
          </ObservedData>
          <ObservedData ID="37870920">
            <Attribute Type="Description">Muntoni et al. (1995) observed homozygosity for the splice site mutation (Klima et al., 1993) in a Spanish kindred with cholesterol ester storage disease. Exon 8 of the LIPA gene was deleted.</Attribute>
            <Citation Type="general">
              <ID Source="PubMed">7759067</ID>
            </Citation>
            <Citation Type="general">
              <ID Source="PubMed">8254026</ID>
            </Citation>
          </ObservedData>
        </ObservedIn>
        <MeasureSet Type="Variant" ID="79" Acc="VCV000000079" Version="1">
          <Measure Type="single nucleotide variant" ID="15118">
            <Name>
              <ElementValue Type="Preferred">LIPA, 934G-A</ElementValue>
            </Name>
            <AttributeSet>
              <Attribute Type="nucleotide change">934G-A</Attribute>
              <XRef Type="Allelic variant" ID="613497.0002" DB="OMIM"/>
            </AttributeSet>
            <CytogeneticLocation>10q23.31</CytogeneticLocation>
            <MeasureRelationship Type="asserted, but not computed">
              <Name>
                <ElementValue Type="Preferred">lipase A, lysosomal acid type</ElementValue>
              </Name>
              <Symbol>
                <ElementValue Type="Preferred">LIPA</ElementValue>
              </Symbol>
              <SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89213569" stop="89252039" display_start="89213569" display_stop="89252039" Strand="-"/>
              <SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90973325" stop="91011659" display_start="90973325" display_stop="91011659" variantLength="38335" Strand="-"/>
              <XRef ID="3988" DB="Gene"/>
              <XRef Type="MIM" ID="613497" DB="OMIM"/>
              <XRef ID="HGNC:6617" DB="HGNC"/>
            </MeasureRelationship>
            <XRef Type="Allelic variant" ID="613497.0002" DB="OMIM"/>
          </Measure>
          <Name>
            <ElementValue Type="Preferred">LIPA, 934G-A</ElementValue>
          </Name>
        </MeasureSet>
        <TraitSet Type="Disease" ID="41">
          <Trait ID="2626" Type="Disease">
            <Name>
              <ElementValue Type="Preferred">Lysosomal acid lipase deficiency</ElementValue>
              <XRef ID="Wolman+disease/7523" DB="Genetic Alliance"/>
            </Name>
            <Name>
              <ElementValue Type="Alternate">LAL DEFICIENCY</ElementValue>
              <XRef Type="MIM" ID="278000" DB="OMIM"/>
            </Name>
            <Name>
              <ElementValue Type="Alternate">CHOLESTEROL ESTER HYDROLASE DEFICIENCY</ElementValue>
              <XRef Type="MIM" ID="278000" DB="OMIM"/>
            </Name>
            <AttributeSet>
              <Attribute Type="public definition">The phenotypic spectrum of lysosomal acid lipase (LAL) deficiency ranges from the infantile-onset form (Wolman disease) to later-onset forms collectively known as cholesterol ester storage disease (CESD). Wolman disease is characterized by infantile-onset malabsorption that results in malnutrition, storage of cholesterol esters and triglycerides in hepatic macrophages that results in hepatomegaly and liver disease, and adrenal gland calcification that results in adrenal cortical insufficiency. Unless successfully treated with hematopoietic stem cell transplantation (HSCT), infants with classic Wolman disease do not survive beyond age one year. CESD may present in childhood in a manner similar to Wolman disease or later in life with such findings as serum lipid abnormalities, hepatosplenomegaly, and/or elevated liver enzymes long before a diagnosis is made. The morbidity of late-onset CESD results from atherosclerosis (coronary artery disease, stroke), liver disease (e.g., altered liver function ± jaundice, steatosis, fibrosis, cirrhosis and related complications of esophageal varices, and/or liver failure), complications of secondary hypersplenism (i.e., anemia and/or thrombocytopenia), and/or malabsorption. Individuals with CESD may have a normal life span depending on the severity of disease manifestations.</Attribute>
              <XRef ID="NBK305870" DB="GeneReviews"/>
            </AttributeSet>
            <Citation Type="Suggested Reading" Abbrev="Shirley, 2015">
              <ID Source="PubMed">26452566</ID>
            </Citation>
            <Citation Type="review" Abbrev="GeneReviews">
              <ID Source="PubMed">26225414</ID>
              <ID Source="BookShelf">NBK305870</ID>
            </Citation>
            <XRef ID="C0043208" DB="MedGen"/>
            <XRef Type="MIM" ID="278000" DB="OMIM"/>
          </Trait>
        </TraitSet>
      </ReferenceClinVarAssertion>
      <ClinVarAssertion ID="20241">
        <ClinVarSubmissionID localKey="613497.0002_CHOLESTERYL ESTER STORAGE DISEASE" submitter="OMIM" submitterDate="2017-12-22" title="LIPA, 934G-A_CHOLESTERYL ESTER STORAGE DISEASE"/>
        <ClinVarAccession Acc="SCV000020241" Version="2" Type="SCV" OrgID="3" OrganizationCategory="resource" OrgType="primary" DateUpdated="2019-03-31"/>
        <RecordStatus>current</RecordStatus>
        <ClinicalSignificance DateLastEvaluated="1996-04-01">
          <ReviewStatus>no assertion criteria provided</ReviewStatus>
          <Description>Pathogenic</Description>
        </ClinicalSignificance>
        <Assertion Type="variation to disease"/>
        <ExternalID DB="OMIM" ID="613497.0002" Type="Allelic variant"/>
        <ObservedIn>
          <Sample>
            <Origin>germline</Origin>
            <Species>human</Species>
            <AffectedStatus>not provided</AffectedStatus>
          </Sample>
          <Method>
            <MethodType>literature only</MethodType>
          </Method>
          <ObservedData>
            <Attribute Type="Description">In a 12-year-old patient with cholesteryl ester storage disease (278000) from a nonconsanguineous Polish-German family, Klima et al. (1993) detected a 72-bp in-frame deletion resulting in the loss of amino acid codons 254 through 277. Analysis of genomic DNA revealed that the 72 bp represented an exon, indicating that the deletion in the mRNA was caused by defective splicing. Sequence analysis of the patient's genomic DNA revealed a G-to-A substitution in the last nucleotide of the 72-bp exon on 1 allele. No normal-sized mRNA was detectable in the propositus even though he was not homozygous for the splice site mutation. Klima et al. (1993) concluded that the patient was compound heterozygous for the splice site mutation and a null allele. The patient showed LIPA activity in cultured skin fibroblasts approximately 9% of normal. Hepatosplenomegaly had been present since age 5 years.</Attribute>
            <Citation>
              <ID Source="PubMed">8254026</ID>
            </Citation>
            <XRef DB="OMIM" ID="278000" Type="MIM"/>
          </ObservedData>
          <ObservedData>
            <Attribute Type="Description">Aslanidis et al. (1996) restudied the patient of Klima et al. (1993) and defined the splice site mutation as a G-to-A mutation at position -1 of the splice donor site following exon 8, resulting in incorrect splicing and the removal of the 72-bp exon 8 of the LIPA gene. They determined that the other allele of the patient carried a premature termination mutation (613497.0003) as well as the L179P mutation (613497.0001); the LIPA mRNA was rendered unstable by the premature stop codon. Aslanidis et al. (1996) demonstrated that the splice site mutation allowed the production of approximately 3 to 4% of correctly spliced mRNA relative to wildtype. Aslanidis et al. (1996) also identified a mutation at the same splice donor site, and also resulting in deletion of exon 8, in 2 sibs with Wolman disease; that mutation, at the +1 position, allowed no correct splicing, and patient fibroblasts were devoid of enzymatic activity. See 613497.0005.</Attribute>
            <Citation>
              <ID Source="PubMed">8617513</ID>
            </Citation>
            <Citation>
              <ID Source="PubMed">8254026</ID>
            </Citation>
          </ObservedData>
          <ObservedData>
            <Attribute Type="Description">In 2 sibs with CESD, Maslen and Illingworth (1993) and Maslen et al. (1995) identified compound heterozygosity for this splice site mutation in the LIPA gene, inherited from their father, and the L179P mutation (613497.0001). The affected children were a sister and brother who presented with idiopathic hepatomegaly at ages 6 and 8 years, respectively. Subsequent analyses indicated that they also had hypercholesterolemia and a severe reduction in cholesteryl ester hydrolase activity in cultured fibroblasts.</Attribute>
            <Citation>
              <CitationText>Maslen, C. L., Illingworth, D. R. Molecular genetics of cholesterol ester hydrolase deficiency. (Abstract) Am. J. Hum. Genet. 53 (suppl.): A926, 1993.</CitationText>
            </Citation>
            <Citation>
              <ID Source="PubMed">8598644</ID>
            </Citation>
          </ObservedData>
          <ObservedData>
            <Attribute Type="Description">Muntoni et al. (1995) observed homozygosity for the splice site mutation (Klima et al., 1993) in a Spanish kindred with cholesterol ester storage disease. Exon 8 of the LIPA gene was deleted.</Attribute>
            <Citation>
              <ID Source="PubMed">7759067</ID>
            </Citation>
            <Citation>
              <ID Source="PubMed">8254026</ID>
            </Citation>
          </ObservedData>
        </ObservedIn>
        <MeasureSet Type="Variant">
          <Measure Type="Variation">
            <Name>
              <ElementValue Type="Preferred">LIPA, 934G-A</ElementValue>
            </Name>
            <AttributeSet>
              <Attribute Type="NonHGVS">934G-A</Attribute>
            </AttributeSet>
            <MeasureRelationship Type="variant in gene">
              <Symbol>
                <ElementValue Type="Preferred">LIPA</ElementValue>
              </Symbol>
            </MeasureRelationship>
            <XRef DB="OMIM" ID="613497.0002" Type="Allelic variant"/>
          </Measure>
        </MeasureSet>
        <TraitSet Type="Disease">
          <Trait Type="Disease">
            <Name>
              <ElementValue Type="Preferred">CHOLESTERYL ESTER STORAGE DISEASE</ElementValue>
            </Name>
          </Trait>
        </TraitSet>
      </ClinVarAssertion>
    </ClinVarSet>
...
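
One possible approach, given as a minimal sketch and not a definitive solution: stream the file with `xml.etree.ElementTree.iterparse` so that only one `ClinVarSet` record is held in memory at a time, and write each record out as soon as it has been read. The sample above is truncated, so the sketch assumes that the `ClinVarSet` elements are children of a single root element. The names `convert`, `extract_record` and `FIELDS`, and the choice of which fields to export, are hypothetical and only for illustration. JSON is written as one object per line (JSON Lines) so that the output never has to be built up in memory as one large array.

```python
import csv
import json
import xml.etree.ElementTree as ET

# Illustrative column set; adjust to the fields you actually need.
FIELDS = ["set_id", "title", "record_status", "accession",
          "significance", "pubmed_ids"]


def extract_record(elem):
    """Flatten one <ClinVarSet> element into a plain dict."""
    acc = elem.find("ReferenceClinVarAssertion/ClinVarAccession")
    # Collect distinct PubMed IDs found anywhere inside the record.
    pubmed = {id_elem.text
              for id_elem in elem.iterfind(".//ID[@Source='PubMed']")
              if id_elem.text}
    return {
        "set_id": elem.get("ID"),
        "title": elem.findtext("Title"),
        "record_status": elem.findtext("RecordStatus"),
        "accession": acc.get("Acc") if acc is not None else None,
        "significance": elem.findtext(
            "ReferenceClinVarAssertion/ClinicalSignificance/Description"),
        "pubmed_ids": sorted(pubmed),
    }


def convert(xml_source, csv_path, jsonl_path, record_tag="ClinVarSet"):
    """Stream xml_source (path or binary file object) into CSV and JSON Lines.

    Returns the number of records written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file, \
         open(jsonl_path, "w", encoding="utf-8") as json_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
        writer.writeheader()

        context = ET.iterparse(xml_source, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root

        for event, elem in context:
            if event == "end" and elem.tag == record_tag:
                record = extract_record(elem)
                # JSON keeps the list of IDs; CSV gets them joined by ';'.
                json_file.write(json.dumps(record, ensure_ascii=False) + "\n")
                writer.writerow(
                    {**record, "pubmed_ids": ";".join(record["pubmed_ids"])})
                count += 1
                # Drop the finished record so memory use stays flat.
                elem.clear()
                root.clear()
    return count
```

Calling `convert("input.xml", "out.csv", "out.jsonl")` (the file names are placeholders) would process the file in a single pass. Regarding the four criteria: memory stays roughly constant because finished records are cleared from the tree, and speed is bounded mostly by parsing and disk I/O. If the third-party `lxml` package is available, its `iterparse` is often faster, but that is an optional swap and not something this sketch depends on.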