PDF Portfolio 1.7 Files #12
Jack, do you suggest a simple signature for identifying a file as a PDF Portfolio, or should we discuss adding it as an Archive/Container format so that the contents can get some identification as well? |
I think a simple signature is a priority so that we can at least recognise these different files. I can see the use of marking it as an archive/container format, but I suspect that getting anything useful from that would require a decent amount of coding work. Adding an issue for that might be worthwhile, though I'm not sure I'd get time to look at it in this window! |
… Portfolio files
Jack, having a little difficulty getting your signature to work for me. I am not seeing any offsets for I like the use of the Collection dictionary, should be a good identifier. |
I kind of disagree with a specific PUID for portfolio files. My reasons:
I believe that the fact that a PDF can be viewed as a portfolio is something that should be determined at characteristic or risk level, so something that e.g. JHOVE should pick up and not the file format identification chain. |
Good point!
... and for generic WAVE files, WAVE files with PCM encoding, WAVE files with the WAVE_FORMAT_EXTENSIBLE extension, WAVE files with the Broadcast WAVE extension (further split into generic, PCM and MPEG encodings), WAVE files with Exif metadata ... SCNR ;-) It's hard to draw a line between format identification and characterization (or whatever you'd like to call it) and I really wish PRONOM had clear criteria for that. PS: To be fair, both Broadcast WAVE and Exif audio are commonly advertised as file formats of their own. Personally, I think they are just extended (mainly added metadata) WAVE files, so I added them to the above list. PPS: Sorry for hijacking this thread with generic rants! Now, back to business. |
good point - but i think the connection between WAVE / BWF / AV-container of your choice vs. PDF is different. P.S. and, again, we might need portfolios for every sub-profile of PDF, depending on whether they allow portfolios by design. so |
@thorsted - good spot on the MIME type. I recycled files from previous signature work, oops. I will fix and update the PR. There are deliberately no offsets for the /Collection or <</CI<< strings, as the structure of PDF means that they can be almost anywhere in the file (with the exceptions of the protected BOF and EOF areas, which the overall PDF byte sequences would protect against anyway). I used the PUID rather than the ID specifically because PUIDs are persistent, whereas I'm not sure how constant those IDs actually are.
@asciim0 - I agree there is a risk of format explosion, and I can see the argument that it's not hugely different from a PDF with an attachment, or an embedded video, but I do think Collections are an inherently different use case. PDF is generically a document. That document can have attachments, but the basic use case is still a single primary document and some supporting material. I agree that finding those attachments and dealing with them is more about characterising them and handling the associated risks than about treating a PDF with an embedded Word doc as a fundamentally different format from a PDF with an embedded video, or from a PDF with no other content.
Portfolios change the nature of the file from essentially a document to essentially a container. There is no "primary document" with attachments hanging off it; there is just a set of equally primary files. This changes the entire purpose of the file, which to me makes it a different enough format to merit its own entry in PRONOM. |
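To make the "no offsets" point concrete, here is a rough Python sketch of the same matching idea. It is not the PRONOM/DROID signature itself: the function name, the exact `%PDF-1.7` header check and the 1024-byte window for `%%EOF` are assumptions made purely for illustration.

```python
def looks_like_pdf_portfolio(data: bytes, require_ci: bool = True) -> bool:
    """Heuristic check that loosely mirrors the signature idea in this thread."""
    # BOF anchor: a PDF 1.7 header at the very start (assumed exact form).
    if not data.startswith(b"%PDF-1.7"):
        return False
    # EOF anchor: %%EOF somewhere in the last 1024 bytes (assumed window size).
    if b"%%EOF" not in data[-1024:]:
        return False
    # Variable-position sequences: no offsets, they may sit anywhere in the file.
    if b"/Collection" not in data:
        return False
    # Optional second marker, so the stricter and looser variants can be compared.
    if require_ci and b"<</CI<<" not in data:
        return False
    return True
```

A plain byte search like this also matches longer names that merely begin with `/Collection`, and it may miss markers stored inside compressed object streams, so treat a result either way as a hint rather than a verdict.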
…rityover field to make clear this is a reference to the format puid rather than an attempt to reference another node in the XML
Issue #12 - Added signature file and notes for PDF Portfolio files
@jackdos Playing with the new signature released today and it looks like some of my Portfolio samples do not have the "<</CI<<" string. Is this constant in all your samples? |
Hi @thorsted, apologies, only just saw this. All of the examples I have would have had that string, IIRC that's the specifier for a CollectionItem, and I was assuming that a collection wouldn't really exist without items. Are you seeing additional characters between your object delimiters (<<) and the label (/CI)? Or just not seeing those objects at all? |
@jackdos Not seeing the /CI at all, only the /Collection tag. I'll do some more digging. |
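For anyone wanting to check their own samples for the two cases asked about above (extra characters between `<<` and `/CI`, or no `/CI` at all), a small diagnostic along these lines might help. It is only a sketch: `report_markers` and the result keys are made-up names, and the loose pattern allows only ASCII whitespace between tokens.

```python
import re

# "<<", optional ASCII whitespace, the /CI key, optional whitespace, nested "<<".
CI_LOOSE = re.compile(rb"<<\s*/CI\s*<<")
# The /CI key anywhere, as long as it is not the prefix of a longer name.
CI_KEY = re.compile(rb"/CI(?![A-Za-z0-9])")

def report_markers(data: bytes) -> dict:
    """Report which of the discussed byte sequences occur in the raw file bytes."""
    return {
        "collection": b"/Collection" in data,
        "ci_exact": b"<</CI<<" in data,
        "ci_loose": CI_LOOSE.search(data) is not None,
        "ci_key_anywhere": CI_KEY.search(data) is not None,
    }
```

If `ci_key_anywhere` is true while `ci_exact` is false, the key is present but not laid out the way the exact string expects. If all three `ci_*` values are false, the key is either absent or not visible in the raw bytes.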
See https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (12.3.5) for the following:
The effect of this feature is to use a file structured as a PDF as a ZIP-like container of other content. The intent as expressed in the spec implies that this usage should be treated as distinct from a "standard" PDF file, as the intent is not to present a single coherent document but rather to contain a series of related files. This makes files using this feature much closer to traditional Containers such as ZIP, TAR and WARC than to traditional Documents such as Word and PDF.
Given this significant change in intent, any software consuming these files has a different purpose, and may have to act in very different ways, from software consuming "standard" PDF documents. Converting the document to HTML markup in the way PDF.js does, for example, will not be sufficient to render the file in line with the intention of the creator. For this reason, treating it as a separate format and assigning it a separate PUID makes sense, so that separate Preservation Actions can be taken on files of this type.
The format is named "PDF Portfolio", rather than anything to do with "collections", because "Portfolio" is the term that Adobe Acrobat uses for files using this feature.
I have a candidate signature and examples for the 1.7 version. I'm assuming a similar signature will work for 2.0 as well, although I haven't got any examples to test that with.