Parsing complex XML into a POJO or JSON
April 8, 2020

OK. I'm currently looking for ways to solve a specific problem. I start with a PDF file, and my goal is to turn it into a JSON object. I used a third-party library to convert the PDF to XML, and now I'm trying to parse the XML into JSON. I could also parse the PDF straight into a POJO or JSON and skip the XML step if that were easier, but I believe parsing from XML is simpler.

I have complex XML (or PDF) representing research papers with multiple sections, each marked by a title that is not known in advance. The title can be anything. I do know some of the titles, such as "Abstract" and "Introduction". Those would be easy for me to parse, since we have specific sections for them in our JSON schema. However, we also have a "full_text" section that holds the text from the rest of the document. Each paragraph in "full_text" also carries a "section" identifier showing which section the paragraph belongs to (for example, "Conclusion", "Results", or "Discussion"). My very specific requirement comes from the fact that each section in the PDF or XML may have several nested subsections. I don't need the nested subsection headings, though. I only care about the text inside them, because any paragraph inside a subsection belongs to the topmost section that contains it.
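To make the target shape concrete, here is a minimal sketch of a POJO for the schema described above. Only the "full_text" and "section" names come from the description; the class name `PaperDocument`, the "abstract", "introduction", and "text" keys, and the hand-written `toJson` method are assumptions for illustration. A real project would more likely hand the POJO to a JSON library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO for the target JSON; only "full_text" and "section"
// are names taken from the question, everything else is assumed.
final class PaperDocument {

    /** One paragraph of "full_text", tagged with its top-level section. */
    static final class FullTextParagraph {
        final String section;
        final String text;

        FullTextParagraph(String section, String text) {
            this.section = section;
            this.text = text;
        }
    }

    final String abstractText;   // "abstract" is a Java keyword, hence the field name
    final String introduction;
    final List<FullTextParagraph> fullText = new ArrayList<>();

    PaperDocument(String abstractText, String introduction) {
        this.abstractText = abstractText;
        this.introduction = introduction;
    }

    /** Appends a paragraph and returns this object so calls can be chained. */
    PaperDocument add(String section, String text) {
        fullText.add(new FullTextParagraph(section, text));
        return this;
    }

    /** Hand-rolled serialization so the sketch needs no external library. */
    String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"abstract\":").append(quote(abstractText))
          .append(",\"introduction\":").append(quote(introduction))
          .append(",\"full_text\":[");
        for (int i = 0; i < fullText.size(); i++) {
            if (i > 0) sb.append(',');
            FullTextParagraph p = fullText.get(i);
            sb.append("{\"section\":").append(quote(p.section))
              .append(",\"text\":").append(quote(p.text)).append('}');
        }
        return sb.append("]}").toString();
    }

    /** Escapes a string as a JSON string literal; null becomes JSON null. */
    static String quote(String s) {
        if (s == null) return "null";
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Keeping "section" as a plain String on every paragraph means the schema does not need to know the section titles in advance.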

In short, once I find a section, say Results, I need to collect all the text inside it and its subsections, and then move on to the next section at the same level as Results. I don't know the section and subsection names in advance, but the section names will go into a simple String field called "section".
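The collection step described above could be sketched as follows, using only the JDK's DOM parser. This is a sketch under stated assumptions, not a definitive implementation. It assumes the XML is TEI in which a subsection may appear either as a nested `<div>` or as a sibling `<div>` whose `<head n="...">` number has more than one component (for example "2.1"). A `<head>` with no `n` attribute directly under a top-level `<div>` is treated as a new top-level section. The class and method names (`TeiSectionFlattener`, `flatten`) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: flattens TEI <body> sections into paragraphs
// tagged with the title of the top-level section that contains them.
final class TeiSectionFlattener {

    /** One "full_text" entry. */
    static final class Paragraph {
        final String section;
        final String text;

        Paragraph(String section, String text) {
            this.section = section;
            this.text = text;
        }
    }

    private final List<Paragraph> paragraphs = new ArrayList<>();
    private String currentSection; // null until the first top-level <head> is seen

    static List<Paragraph> flatten(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getLocalName()
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        TeiSectionFlattener flattener = new TeiSectionFlattener();
        // "*" matches any namespace, so only <body> content is visited
        // and the <abstract> inside the header is skipped.
        NodeList bodies = doc.getElementsByTagNameNS("*", "body");
        for (int i = 0; i < bodies.getLength(); i++) {
            flattener.walk((Element) bodies.item(i), 0);
        }
        return flattener.paragraphs;
    }

    private void walk(Element parent, int divDepth) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            Element el = (Element) n;
            String name = el.getLocalName();
            if ("div".equals(name)) {
                walk(el, divDepth + 1);
            } else if ("head".equals(name)) {
                // Only a head directly inside a top-level div, numbered like
                // "2." rather than "2.1", starts a new section.
                if (divDepth == 1 && isTopLevelNumber(el.getAttribute("n"))) {
                    currentSection = normalize(el.getTextContent());
                }
            } else if ("p".equals(name)) {
                // getTextContent() also includes the text of inline <ref> markers.
                paragraphs.add(new Paragraph(currentSection, normalize(el.getTextContent())));
            }
        }
    }

    /** "1." and "" count as top level; "2.1" and "2.2.1" do not. */
    static boolean isTopLevelNumber(String n) {
        return !n.replaceAll("\\.+$", "").contains(".");
    }

    private static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Calling `TeiSectionFlattener.flatten(xml)` returns the paragraphs in document order. Each one can then be copied into the "full_text" array with its `section` value, while the subsection headings themselves are dropped.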

Edit: Looking at the complexity of the XML files, I now believe that parsing directly from the PDF may be even better.

 <abstract>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <p>The new coronavirus COVID-19, also known as SARS-CoV-2, has infected more than 300,000 patients and become a global health emergency due to the very high risk of spread and impact of COVID-19. There are no specific drugs or vaccines against COVID-19, thus effective antiviral agents are still urgently needed to combat this virus. Herein, the FEP (free energy perturbation)-based screening strategy is newly derived as a rapid protocol to accurately reposition potential agents against COVID-19 by targeting viral proteinase Mpro. Restrain energy distribution (RED) function was derived to optimize the alchemical pathway of FEP, which greatly accelerated the calculations and first made FEP possible in the virtual screening of the FDA-approved drugs database. As a result, fifteen out of twenty-five drugs validated in vitro exhibited considerable inhibitory potencies towards Mpro. Among them, the most potent Mpro inhibitor dipyridamole potentially inhibited NF-B signaling pathway and inflammatory responses, and has just finished the first round clinical trials. Our result demonstrated that the FEP-based screening showed remarkable advantages in prompting drug repositioning against COVID-19.</p>
                </div>
            </abstract>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en">
        <body>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="1.">Introduction</head>
                <p>The novel coronavirus 2019-nCoV (also known as HCoV-19 or SARS-CoV-2) outbreak had emerged from Wuhan, Hubei Province, China in December 2019 
                    <ref type="bibr" target="#b0">1,</ref>
                    <ref type="bibr">2</ref> . On March 22, there were 813,00 confirmed COVID-19 cases including 3,253 deaths in China. This virus has also infected more than 220,000 patients in all of the continents and over 180 other countries, such as Italy, Spain, U.S.A, Germany, France, and Iran gradually became a global pandemic due to the very high risk of spread and impact of COVID-19. To date, there is no specific treatment or vaccine against COVID-19, thus it is urgently need to repositioning potential agents against COVID-19. 
                    <ref type="bibr" target="#b2">3</ref> The COVID-19's replicase gene encodes two over-lapping translation products, polyproteins 1a and 1ab (pp1a and pp1ab), which mediate all of the functions required for viral replication. Mpro, as the key enzyme in proteolytic processing of viral replication, is initially released by the auto-cleavage of pp1a and pp1ab. Then Mpro in turn cleaves pp1a and pp1ab to release functional proteins necessary for viral replication. 
                    <ref type="bibr" target="#b3">4</ref> In the view of essential functions of Mpro in viral life cycle and its high conservatism, it is an attractive target for the discovery of anti-COVID-19 agents.
                </p>
                <p>Great efforts from various research groups have been done to discover new agents from several databases by targeting the target Mpro via several virtual screening strategy, 
                    <ref type="bibr" target="#b4">5,</ref>
                    <ref type="bibr" target="#b5">6</ref> which consists of pharmacophore, molecule docking, and molecular simulations approaches. As a result, six drugs inhibited Mpro with IC50 values ranging from 0.67 to 21.4 μM. 
                    <ref type="bibr" target="#b4">5</ref> These drug design methods contributed considerably to the lead discovery, but the computational accuracy and efficiency need to be improved especially when dealing with emergency situations such as the COVID-19 outbreak. Free energy perturbation (FEP) method is a promising method with satisfactory accuracy 
                    <ref type="bibr" target="#b6">[7]</ref>
                    <ref type="bibr" target="#b7">[8]</ref>
                    <ref type="bibr" target="#b8">[9]</ref>
                    <ref type="bibr" target="#b9">[10]</ref>
                    <ref type="bibr" target="#b10">[11]</ref>
                    <ref type="bibr" target="#b11">[12]</ref>
                    <ref type="bibr" target="#b12">[13]</ref>
                    <ref type="bibr" target="#b13">[14]</ref> , but their actual applications to drug design are still limited to simulate minor structural changes of the ligands, thus predicting the relative binding free energy (RBFE). 
                    <ref type="bibr" target="#b6">7,</ref>
                    <ref type="bibr" target="#b13">14</ref> In order to perform virtual screening of a large molecule database, the absolute binding free energy (ABFE) calculation must be performed for each ligand without using of a reference ligand structure. The FEP approach has an advantage in predicting the affinities more precisely between drugs and their targets than conventional methods, such as pharmacophore, molecule docking, and molecular simulations. However, the FEP-ABFE approaches are extremely expensive/time-consumpting and therefore not used for virtual screening purpose. 
                    <ref type="bibr" target="#b14">15,</ref>
                    <ref type="bibr" target="#b15">16</ref> To accelerate the discovery of Mpro inhibitors from the small molecule database to combat COVID- 
                    <ref type="bibr" target="#b18">19</ref>, we represent a newly derived FEP-ABFE-accelerated screening strategy together with bioassay validation to rapidly reposition potential agents against COVID-19 by targeting viral proteinase Mpro.
                </p>
                <p>As a result, fifteen of twenty-five drugs were validated in vitro to exhibit considerable inhibitory potencies towards Mpro. Among them, the most potent and representative Mpro inhibitor dipyridamole just finished its first-round clinical trials, and showed significant clinical outcomes. 
                    <ref type="bibr" target="#b16">17</ref> In short, this is the first report to screen the FDA-approved database by using the FEP-ABFE approach, and this FEP-based method showed significant advantages by means of improving the hit rates and repositioning more potent leads.
                </p>
            </div>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="2.">Methods</head>
            </div>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="2.1">Molecular docking</head>
                <p>The crystal structure of viral proteinase Mpro (PDB ID: 6LU7) 
                    <ref type="bibr" target="#b17">18</ref> for COVID-19 was used for the molecule docking purpose. Based on the crystal structure, more than 2500 small molecules in the FDAapproved drug database were first screened by using molecular docking program Glide 
                    <ref type="bibr" target="#b18">19</ref> . Considering
                </p>
                <p>Mpro being a protease, Cys145-His41/Ser144-His163 can act as the nucleophilic agent and acid that assist the hydrolysis reaction of the substrate proteins, and Gly143 and Gln166 can form hydrogen bonds with the "CO-NH-Cα-CO-NH-Cα" structure of the backbone of the substrate protein. Thus, these 6 residues were considered as the key residues of Mpro in the screening. After docking, the binding modes of all the ligands were carefully checked, and 100 molecules with specific interaction with the key residues and relatively high docking scores were selected for further FEP studies to evaluate their ABFE.</p>
            </div>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="2.2">Free energy perturbation (FEP)</head>
            </div>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="2.2.1">Preliminary MD simulations.</head>
                <p>All the 100 ligands selected by molecular docking were further evaluated by FEP calculations carried out in Gromacs-2019 
                    <ref type="bibr" target="#b19">20,</ref>
                    <ref type="bibr" target="#b20">21</ref> . Before FEP calculations, 4 ns preliminary MD simulations were performed for each receptor-ligand complex to improve the fit of the ligand into the binding pocket. All the ligands are parameterized by the general AMBER force field (GAFF) 
                    <ref type="bibr" target="#b21">22</ref> . Restrained electrostatic potential (RESP) charges calculation of relative ligand was performed with Gaussian 03 program 
                    <ref type="bibr" target="#b22">23</ref> at the HF/6-31G* level. The parameters of protein were described by the AMBER FF14SB force field 24 . The TIP3P model 
                    <ref type="bibr" target="#b24">25</ref> was used for water molecules.
                </p>
                <p>The systems were neutralized by adding counter ions (either Na + or Cl ions). The systems were first minimized by using steepest descent method for 5000 cycles and then heated from 0 to 298 K in an NVT ensemble within 100 ps. The systems were then equilibrated in an NPT ensemble with weak restraints of 1000 kJ/mol/nm 2 for 500 ps followed by a 4 ns unconstrained production simulation. The last snapshot of the MD simulations was used for the following FEP calculations, and the trajectory of the last 2 ns was analyzed to get the parameters for adding restraints between receptors and ligands.</p>
            </div>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head n="2.2.3">Protocol for automatically adding restraints.</head>
                <p>Based on the preliminary MD simulations results, the FEP-ABFE calculations were carried based on the thermodynamic cycle given in 
                    <ref type="figure">Figure 1</ref>. As shown in the thermodynamic cycle, a restraint should be added to the receptor and ligand for each FEP calculation. The strategy of adding restraints first reported by Boresch et al. 
                    <ref type="bibr" target="#b25">26</ref> was used in this study, which consists of one distance, two angles, and three dihedrals harmonic potentials with a force constant of 10 kcal/mol/Å 2 [rad 
                    <ref type="bibr">2</ref> ]. The contribution of the restraints to the Lig (∆ ) system was calculated analytically, and the contribution of the restraint to the Rec-Lig system (∆ ) was calculated by FEP. According to the strategy, three atoms of the ligand and three atoms of the receptor will be selected to add the restraint. In order to add the restraints at the equilibrium position, a program was designed to automatically detect the required parameters and select the three ligand atoms and the three receptors atoms. For ligand, the heavy atom which is closest to the geometry center was selected as the first atom; the heavy atom which is most distant from the first atom was selected as the second atom; the heavy atom forms an angle that is larger than 90 degrees with the first two atoms and most distant from the first atom is selected as the third atom. For the receptor, based on the last 2 ns trajectory of the 4 ns preliminary MD simulations, the distances, angles, dihedrals between the three ligand atoms and Cα, Cc (carbon of the carboxyl group) and N (N atom of the amino group) atoms of all the residues within 5 Å of the ligand were calculated along the MD trajectory. Cα, Cc and N atoms from the same residue with the most stable (lowest standard deviation) distances, angles, and dihedrals values were selected as the three receptor atoms. The determined mean values of distance, angles, and dihedrals were used for adding the restraints between the three ligand atoms and the three receptor atoms.
                </p>
            </div>
...