XSLT for LOC XML to PRONOM #15

Open
archivist-liz opened this issue Nov 23, 2019 · 5 comments
Comments

@archivist-liz
Contributor

archivist-liz commented Nov 23, 2019

Taking @kmurmur's mapping in Issue #7, I tried to start writing an XSLT to extract the relevant fields and output them to a CSV. This is a work in progress at best. The LOC FDD contains multiple entries for certain fields, so I'm having trouble mapping/extracting them, for example MIME Type. (This probably mostly has to do with my weak XML/XSLT skills, but I'm still at a bit of a loss here.) Example from LOC FDD 119 on MIDI:

<fdd:internetMediaType>
    <fdd:sigValues>
        <fdd:sigValue>audio/mid</fdd:sigValue>
        <fdd:sigValue>audio/m </fdd:sigValue>
        <fdd:sigValue>audio/midi </fdd:sigValue>
        <fdd:sigValue>audio/x-midi </fdd:sigValue>
        <fdd:sigValue>application/x-midi</fdd:sigValue>
    </fdd:sigValues>
    <fdd:note>
        Selected from <a href="http://filext.com/">The File Extension Source</a>; for
        <i>kar</i> karaoke files, <i>x-music/x-midi</i> is added. No Internet Media Type for MIDI found at
        <a href="http://www.iana.org/assignments/media-types/">http://www.iana.org/assignments/media-types/</a>.
    </fdd:note>
</fdd:internetMediaType>

Secondly, due to the formatting within the XML, I can't get a string delimiter to function right. It works better with a single quote, but still not all the time. I'm attaching a screenshot (Screenshot-loc-to-pronom) of the command running over a directory with all the LOC FDD XML files (which can be downloaded here: https://www.loc.gov/preservation/digital/formats/fddXML.zip), and the resulting CSV file (saved as xlsx). I'm not really sure whether it's worth continuing to work on this in this way.
pronomtest1.xlsx

Here's the xslt code:

<?xml version="1.0" encoding="utf-8"?>

<!--
Basic XSLT to transform LOC FDD XML fields mapped to PRONOM to output to a CSV file
Usage:
xsltproc LOC-to-PRONOM.xslt LOC_number.xml > PRONOM_output.csv
Version 1.01
Elizabeth Kata, 2019
-->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fdd="http://www.loc.gov/preservation/digital/formats/schemas/fdd/v1">

    <xsl:output method="text" />

    <xsl:template match="fdd:FDD">
        <xsl:variable name="var1" select="@id" />

        <xsl:text>LOC FDD: https://www.loc.gov/preservation/digital/formats/fdd/</xsl:text><xsl:value-of select="$var1"/><xsl:text>.shtml,</xsl:text>
        <xsl:for-each select="fdd:identificationAndDescription">
            <xsl:text>&#39;</xsl:text><xsl:value-of select="fdd:fullName"/><xsl:text>&#39;</xsl:text><xsl:text>,</xsl:text>
            <xsl:text>&#39;</xsl:text><xsl:value-of select="fdd:shortDescription"/><xsl:text>&#39;</xsl:text><xsl:text>,</xsl:text>
            <xsl:text>&#39;</xsl:text><xsl:value-of select="fdd:description"/><xsl:text>&#39;</xsl:text><xsl:text>,</xsl:text>
        </xsl:for-each>
        <xsl:for-each select="fdd:sustainabilityFactors">
            <xsl:text>&#39;</xsl:text><xsl:value-of select="fdd:documentation"/><xsl:text>&#39;</xsl:text><xsl:text>,</xsl:text>
            <xsl:text>&#39;</xsl:text><xsl:value-of select="fdd:adoption"/><xsl:text>&#39;</xsl:text><xsl:text>,</xsl:text>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
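
A likely reason the delimiter misbehaves is that the description fields can themselves contain apostrophes, commas, and line breaks, so a bare single quote around the value is not enough. The following is only a sketch of one common workaround in XSLT 1.0, not a tested fix for every FDD file: wrap each field in double quotes, collapse whitespace with `normalize-space()`, and double any embedded double quote. The template names `csv-field` and `double-quotes` are hypothetical helpers introduced here for illustration, and only two of the fields from the stylesheet above are shown.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fdd="http://www.loc.gov/preservation/digital/formats/schemas/fdd/v1">

  <xsl:output method="text" encoding="utf-8"/>

  <!-- Hypothetical helper: writes $value as one double-quoted CSV field.
       normalize-space() collapses line breaks and runs of spaces. -->
  <xsl:template name="csv-field">
    <xsl:param name="value"/>
    <xsl:text>"</xsl:text>
    <xsl:call-template name="double-quotes">
      <xsl:with-param name="text" select="normalize-space($value)"/>
    </xsl:call-template>
    <xsl:text>"</xsl:text>
  </xsl:template>

  <!-- Hypothetical helper: recursively replaces each " with "",
       since XSLT 1.0 has no built-in string replace function. -->
  <xsl:template name="double-quotes">
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="contains($text, '&quot;')">
        <xsl:value-of select="substring-before($text, '&quot;')"/>
        <xsl:text>""</xsl:text>
        <xsl:call-template name="double-quotes">
          <xsl:with-param name="text" select="substring-after($text, '&quot;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- One CSV row per FDD: full name, then short description. -->
  <xsl:template match="fdd:FDD">
    <xsl:call-template name="csv-field">
      <xsl:with-param name="value"
          select="fdd:identificationAndDescription/fdd:fullName"/>
    </xsl:call-template>
    <xsl:text>,</xsl:text>
    <xsl:call-template name="csv-field">
      <xsl:with-param name="value"
          select="fdd:identificationAndDescription/fdd:shortDescription"/>
    </xsl:call-template>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

It can be run the same way as the stylesheet above, with `xsltproc` and one FDD file at a time. Double quotes are used here because most spreadsheet tools treat them as the default CSV text qualifier, whereas a single quote usually has to be configured on import.
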
@archivist-liz
Contributor Author

Also, sorry for the poor formatting

@kmurmur

kmurmur commented Nov 25, 2019

Many thanks @archivist-liz! If I can help untangle our XML, please shout. You are correct that we can have multiples of certain fields, depending on the scope of the FDD.

@adamretter

@archivist-liz no worries, I fixed the formatting :-)

@archivist-liz
Contributor Author

Thanks @adamretter! This makes it look cleaner. Sadly there are still some intellectual issues (like what to do when there are multiple MIME types, etc).

@marhop
Contributor

marhop commented Nov 26, 2019

That depends on how you'd like such multiple-valued fields to be displayed. Do you want them all concatenated in one CSV cell like "audio/mid audio/m audio/midi ..." or do you want one CSV column per MIME type?
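
For the first option (all values in one cell), a minimal XSLT 1.0 sketch could look like the following. It assumes `fdd:internetMediaType` sits somewhere below `fdd:FDD` (hence the `.//` search; the exact parent element is not shown in this thread), and the `; ` separator is an arbitrary choice.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fdd="http://www.loc.gov/preservation/digital/formats/schemas/fdd/v1">

  <xsl:output method="text" encoding="utf-8"/>

  <xsl:template match="fdd:FDD">
    <!-- Open the CSV cell. -->
    <xsl:text>"</xsl:text>
    <!-- Assumption: the location of internetMediaType under FDD is unknown,
         so search all descendants. -->
    <xsl:for-each select=".//fdd:internetMediaType/fdd:sigValues/fdd:sigValue">
      <!-- Emit the separator before every value except the first. -->
      <xsl:if test="position() &gt; 1">
        <xsl:text>; </xsl:text>
      </xsl:if>
      <!-- normalize-space() trims stray blanks such as in "audio/m ". -->
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:for-each>
    <!-- Close the cell and end the row. -->
    <xsl:text>"&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Applied to an FDD whose only media type block is the MIDI fragment quoted in the opening post, this produces the single cell `"audio/mid; audio/m; audio/midi; audio/x-midi; application/x-midi"`. One column per MIME type is harder to do in a per-file transform, because the number of values differs between FDDs and the CSV rows would end up with different column counts.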
