I took an OpenRefine template for converting a CSV into one giant MODS XML record, then a .py script that cleans it up and splits it into several smaller XML files, each named after one of the tags. That works great. However, when I tried to adapt it to my needs for Dublin Core XML records... not so much.
I have an OpenRefine template that gives me this from my CSV:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance">
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Mary Adams at the organ]</dc:title>
<dc:creator>MacAfee, Don</dc:creator>
<dc:date>4/14/1964</dc:date>
<dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject>
<dc:description>Music instructor Mary C. Adams playing the organ.</dc:description>
<dc:format>1 print : b&amp;w ; 6.5 x 6.5 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-001</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
<dc:date>1980</dc:date>
<dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject>
<dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description>
<dc:format>1 print : b&amp;w ; 5 x 7 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-002</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
</collection>
I modified the Python program that cleans and splits the MODS record, and it now looks like this:
import os, lxml.etree as ET
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
for element in clean.xpath(".//*[@*='']"):
    element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
# find the <dc> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
        # name new files using the <dc:identifier> tag
        identifier = elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier').text
        filename = format(identifier + "_DC.xml")
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
The cmd window prints the "XML is now clean" and "All done!" statements, HOWEVER, there are no files in the SplitXML directory (or anywhere else). My attempt at debugging was to comment out the os.remove('clean.xml') line so I could look at the cleaned XML. I did this with the MODS .py script, and its XML file looks as you would expect. The clean.xml file for DC is clean too, but it is one long single line of code rather than being spread over separate lines with tabs, like this:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance"><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Mary Adams at the organ]</dc:title><dc:creator>MacAfee, Don</dc:creator><dc:date>4/14/1964</dc:date><dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject><dc:description>Music instructor Mary C. Adams playing the organ.</dc:description><dc:format>1 print : b&amp;w ; 6.5 x 6.5 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-001</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Portrait of Dr. Robert Adeson]</dc:title><dc:date>1980</dc:date><dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject><dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description><dc:format>1 print : b&amp;w ; 5 x 7 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-002</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record></collection>
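Since the split loop only writes a file when `elem.tag` matches a specific string, one quick check is to print the tags the parser actually reports for this structure. The following is only a minimal diagnostic sketch: it uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra installs (both report tags in the same `{namespace}localname` form), and `SAMPLE`, `DC` and `script_tag` are hypothetical names introduced here, with `SAMPLE` being a trimmed-down copy of one record from the output above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of one record from the sample above (most elements omitted).
SAMPLE = (
    '<collection>'
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>[Mary Adams at the organ]</dc:title>'
    '<dc:identifier>MS332-01-01-001</dc:identifier>'
    '</record>'
    '</collection>'
)

DC = "http://purl.org/dc/elements/1.1/"

root = ET.fromstring(SAMPLE)
record = root.find("record")

# <record> has no prefix and there is no default namespace,
# so its tag is just the bare local name.
print(record.tag)  # record

# Prefixed children are reported as {namespace-uri}localname,
# without the "dc:" prefix in the tag string.
for child in record:
    print(child.tag)

# The string the modified script compares elem.tag against, for reference.
script_tag = "{%s}record" % DC
print(record.tag == script_tag)  # False

# Looking the identifier up by namespace URI plus local name.
identifier = record.find("{%s}identifier" % DC)
print(identifier.text)  # MS332-01-01-001
```

If the tags in the real clean.xml look the same, the `if elem.tag == ...` comparison in the modified script would never be true for this data, which would be consistent with both messages printing while no files are written.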
In case it helps, here is the original Python code for cleaning and splitting MODS. I got it from Calhist on GitHub.
# Split XML containing many <mods> elements into individual files
# Modified from script found here: http://stackoverflow.com/questions/36155049/splitting-xml-file-into-multiple-at-given-tags
# by Bill Levay for California Historical Society
import os, lxml.etree as ET
# uncomment below modules if doing MODS cleanup on existing Islandora objects
import codecs, json
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
# found here: https://stackoverflow.com/questions/12694091/python-lxml-how-to-remove-empty-repeated-tags
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
# for element in clean.xpath(".//*[@*='']"):
#     element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
###
# uncomment this section if doing MODS cleanup on existing Islandora objects
# getting islandora IDs for existing collections
###
# item_list = []
# json_path = 'C:\\mods\\data.json'
# with codecs.open(json_path, encoding='utf-8') as filename:
#     item_list = json.load(filename)
# filename.close
###
# find the <mods> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://www.loc.gov/mods/v3}mods':
        # the filenames of the resulting xml files will be based on the <identifier> element
        # edit the specific element or attribute if necessary
        identifier = elem.find('{http://www.loc.gov/mods/v3}identifier[@type="local"]').text
        filename = format(identifier + "_MODS.xml")
        ###
        # uncomment this section if doing MODS cleanup on existing Islandora objects
        # look through the list of object metadata and get the islandora ID by matching the digital object ID
        ###
        # for item in item_list:
        #     local_ID = item["identifier-type:local"]
        #     islandora_ID = item["PID"]
        #     if identifier == local_ID:
        #         filename = format(islandora_ID + "_MODS.xml")
        ###
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
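For comparison with the write step in both scripts, here is a rough sketch of splitting the Dublin Core structure into one file per record. It is not the Calhist script's approach: it uses the standard library's `xml.etree.ElementTree` rather than lxml, it assumes `<record>` is un-namespaced as in the sample output, and `split_records`, `SAMPLE` and the temporary output directory are hypothetical names made up for the illustration. It also builds the XML declaration through `tostring(..., xml_declaration=True)` (Python 3.8+), because under Python 3 writing a `str` to a file opened in `'wb'` mode raises `TypeError`.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
# Keep the familiar "dc" prefix when serializing.
ET.register_namespace("dc", DC)


def split_records(xml_text, out_dir):
    """Write each un-namespaced <record> to <identifier>_DC.xml; return the file names."""
    written = []
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        identifier = record.find("{%s}identifier" % DC)
        if identifier is None or not identifier.text:
            continue  # skip records without a usable identifier
        filename = identifier.text + "_DC.xml"
        # With an explicit byte encoding, tostring() returns bytes, and
        # xml_declaration=True prepends the <?xml ...?> line.
        data = ET.tostring(record, encoding="UTF-8", xml_declaration=True)
        with open(os.path.join(out_dir, filename), "wb") as f:
            f.write(data)
        written.append(filename)
    return written


# Hypothetical two-record input, trimmed from the sample data.
SAMPLE = (
    '<collection>'
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>MS332-01-01-001</dc:identifier>'
    '</record>'
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>MS332-01-01-002</dc:identifier>'
    '</record>'
    '</collection>'
)

out_dir = tempfile.mkdtemp()
files = split_records(SAMPLE, out_dir)
print(files)
```

Skipping records that lack an identifier, instead of calling `.text` on a possibly missing element, is a design choice of this sketch; the original scripts would raise `AttributeError` in that case.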