Using a .py script that cleans and then splits a large MODS XML record to do the same for a Dublin Core XML record, and I get no output
0 votes
/ October 23, 2019

I took an OpenRefine template that converts a csv into one giant MODS XML record, then a .py script that cleans it up and splits it into several smaller XML files, each named after one of the tags. It works great. However, when I tried to modify it to fit my needs for Dublin Core XML records ... not so much.

I have an OpenRefine template that gives me this from my csv:

<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance">

<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">

    <dc:title>[Mary Adams at the organ]</dc:title>
    <dc:creator>MacAfee, Don</dc:creator>


    <dc:date>4/14/1964</dc:date>
    <dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject>
    <dc:description>Music instructor Mary C. Adams playing the organ.</dc:description>


    <dc:format>1 print : b&amp;w ; 6.5 x 6.5 in.</dc:format>


    <dcterms:spatial>Alexandria, Virginia</dcterms:spatial>

    <dc:type>Photograph</dc:type>
    <dc:format>Image</dc:format>


    <dc:identifier>MS332-01-01-001</dc:identifier>
    <dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>

</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">

    <dc:title>[Portrait of Dr. Robert Adeson]</dc:title>



    <dc:date>1980</dc:date>
    <dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject>
    <dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description>


    <dc:format>1 print : b&amp;w ; 5 x 7 in.</dc:format>


    <dcterms:spatial>Alexandria, Virginia</dcterms:spatial>

    <dc:type>Photograph</dc:type>
    <dc:format>Image</dc:format>


    <dc:identifier>MS332-01-01-002</dc:identifier>
    <dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>

</record>
</collection>

I have the Python program that cleans and splits the MODS record, which I modified. It now looks like this:

import os, lxml.etree as ET

output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'

# parse source.xml with lxml
tree = ET.parse('source.xml')

# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None

# remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '') 
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')

# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)

# remove recursively empty nodes
def recursively_empty(e):
   if e.text:
       return False
   return all((recursively_empty(c) for c in e.iterchildren()))

context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)

# remove nodes with blank attribute
for element in clean.xpath(".//*[@*='']"):
    element.getparent().remove(element)

# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)

# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")

# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))

# find the <dc> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':

# name new files using the <dc:identifier> tag
        identifier = elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier').text
        filename = format(identifier + "_DC.xml")

        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)

# remove the intermediate file
os.remove('clean.xml')
print("All done!")

The command prompt prints the "XML is now clean" and "All done!" statements, HOWEVER, there are no files in the SplitXML directory (or anywhere else). My attempt at debugging was to comment out the os.remove('clean.xml') line so I could look at the cleaned xml. I did this with the MODS .py script, and the XML file looks the way you would expect. The clean.xml file for DC, however, is clean but is one long line of code rather than using separate lines and tabs, like this:

<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance"><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Mary Adams at the organ]</dc:title><dc:creator>MacAfee, Don</dc:creator><dc:date>4/14/1964</dc:date><dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject><dc:description>Music instructor Mary C. Adams playing the organ.</dc:description><dc:format>1 print : b&amp;w ; 6.5 x 6.5 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-001</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. 
Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Portrait of Dr. Robert Adeson]</dc:title><dc:date>1980</dc:date><dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject><dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description><dc:format>1 print : b&amp;w ; 5 x 7 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-002</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record></collection>

If it helps, here is the original Python code for cleaning and splitting the MODS. I got it from Calhist on GitHub.

# Split XML containing many <mods> elements into individual files
# Modified from script found here: http://stackoverflow.com/questions/36155049/splitting-xml-file-into-multiple-at-given-tags
# by Bill Levay for California Historical Society

import os, lxml.etree as ET
# uncomment below modules if doing MODS cleanup on existing Islandora objects
import codecs, json

output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'

# parse source.xml with lxml
tree = ET.parse('source.xml')

# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None

# remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '') 
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')

# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)

# remove recursively empty nodes
# found here: https://stackoverflow.com/questions/12694091/python-lxml-how-to-remove-empty-repeated-tags
def recursively_empty(e):
   if e.text:
       return False
   return all((recursively_empty(c) for c in e.iterchildren()))

context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)

# remove nodes with blank attribute
# for element in clean.xpath(".//*[@*='']"):
#    element.getparent().remove(element)

# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)

# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")

# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))

###
# uncomment this section if doing MODS cleanup on existing Islandora objects
# getting islandora IDs for existing collections
###
# item_list = []

# json_path = 'C:\\mods\\data.json'

# with codecs.open(json_path, encoding='utf-8') as filename:
#     item_list = json.load(filename)
# filename.close
###

# find the <mods> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://www.loc.gov/mods/v3}mods':

        # the filenames of the resulting xml files will be based on the <identifier> element
        # edit the specific element or attribute if necessary
        identifier = elem.find('{http://www.loc.gov/mods/v3}identifier[@type="local"]').text
        filename = format(identifier + "_MODS.xml")

        ### 
        # uncomment this section if doing MODS cleanup on existing Islandora objects
        # look through the list of object metadata and get the islandora ID by matching the digital object ID
        ###
        # for item in item_list:
        #     local_ID = item["identifier-type:local"]
        #     islandora_ID = item["PID"]

        #     if identifier == local_ID:
        #         filename = format(islandora_ID + "_MODS.xml")
        ###

        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)

# remove the intermediate file
os.remove('clean.xml')
print("All done!")

1 Answer

1 vote
/ October 23, 2019

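I found two namespace-related problems:

  1. The record element is not in any namespace. Only its dc:-prefixed children are, so the tag comparison never matches and the loop body never runs. You need to change

    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
    

    to

    if elem.tag == 'record':
    
  2. elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier') is not correct. The dc: part must be removed, because the namespace URI in braces already takes the place of the prefix: elem.find('{http://purl.org/dc/elements/1.1/}identifier').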
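To illustrate both points, here is a minimal sketch of the corrected lookup. It makes a few assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (both spell qualified names the same way for this purpose), the SAMPLE string is a trimmed copy of the records in the question, and record_filenames and DC_NS are names invented for this example.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Namespace URI bound to the dc: prefix in the question's XML
DC_NS = "http://purl.org/dc/elements/1.1/"

# Trimmed copy of the question's data: <record> itself has no namespace,
# only its dc:-prefixed children do.
SAMPLE = b"""<collection>
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>[Mary Adams at the organ]</dc:title>
    <dc:identifier>MS332-01-01-001</dc:identifier>
</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
    <dc:identifier>MS332-01-01-002</dc:identifier>
</record>
</collection>"""


def record_filenames(xml_bytes):
    """Return the output filename for each <record> (hypothetical helper)."""
    names = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        # Point 1: compare against the bare tag name, no namespace in braces
        if elem.tag == "record":
            # Point 2: namespace URI in braces plus the local name, no "dc:"
            identifier = elem.find("{%s}identifier" % DC_NS).text
            names.append(identifier + "_DC.xml")
    return names


filenames = record_filenames(SAMPLE)
print(filenames)
```

Because the original comparison never matched, the incorrect find() call was never reached, which would explain why the script finished without any error message.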

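One more thing to watch for once the records start matching, assuming the script runs under Python 3 (the question does not say): the file is opened in 'wb' mode, so writing the XML declaration as a str would raise a TypeError. A rough sketch of the write step with a bytes literal follows. It again uses xml.etree.ElementTree in place of lxml.etree, and write_record is a made-up helper name, not part of the original script.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

DC_NS = "http://purl.org/dc/elements/1.1/"
# Keep the dc: prefix when serializing with the standard library
ET.register_namespace("dc", DC_NS)

# A single record, trimmed from the question's data
record = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:identifier>MS332-01-01-001</dc:identifier>"
    "</record>"
)


def write_record(elem, output_dir):
    """Write one record to <identifier>_DC.xml (hypothetical helper)."""
    identifier = elem.find("{%s}identifier" % DC_NS).text
    path = os.path.join(output_dir, identifier + "_DC.xml")
    with open(path, "wb") as f:
        # The file is binary, so the declaration must be bytes, not str
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        # tostring() with an explicit encoding returns bytes
        f.write(ET.tostring(elem, encoding="utf-8"))
    return path


# Demonstrate against a temporary directory instead of the real SplitXML folder
with tempfile.TemporaryDirectory() as out_dir:
    path = write_record(record, out_dir)
    with open(path, "rb") as f:
        written = f.read()

print(os.path.basename(path))
```

In the real script the same idea would apply to the two f.write(...) calls inside the loop, with output_path in place of the temporary directory.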