PDF Portfolio 1.7 Files #12

Open
jackdos opened this issue Oct 5, 2020 · 10 comments

Comments

@jackdos
Contributor

jackdos commented Oct 5, 2020

See https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (12.3.5) for the following:

"Beginning with PDF 1.7, PDF documents may specify how a conforming reader's user interface presents collections of file attachments, where the attachments are related in structure or content. Such a presentation is called a portable collection.

The intent of portable collections is to present, sort and search collections of related documents embedded in the containing PDF document, such as email archives, photo collections, and engineering bid sets. There is no requirement that documents in a collection have an implicit relationship or even a similarity; however, showing differentiating characteristics of related documents can be helpful for document navigation."

The effect of this feature is to use a file structured as a PDF as a ZIP-like container of other content. The intent as expressed in the spec implies that this usage should be treated as distinct from a "standard" PDF file: the aim is not to present a single coherent document, but rather to contain a series of related files. This makes files using this feature much closer to traditional Containers such as ZIP, TAR and WARC than to traditional Documents such as Word and PDF.

Given this significant change in intent, any software consuming these files has a different purpose, and may have to act in very different ways, from software consuming "standard" PDF documents. Converting the document to HTML markup in the way PDF.js does, for example, will not be sufficient to render the file in line with the intention of the creator. For this reason, treating it as a separate format and assigning it a separate PUID makes sense, so that separate Preservation Actions can be taken on files of this type.

The format is named "PDF Portfolio", rather than anything to do with "collections", because "Portfolio" is the term that Adobe Acrobat uses in relation to files using this feature.

I have a candidate signature and examples for the 1.7 version. I'm assuming a similar signature will work for 2.0 as well, although I haven't got any examples to test that with.

@thorsted
Contributor

thorsted commented Oct 5, 2020

Jack,

Do you suggest a simple signature for identification as a PDF Portfolio, or should we discuss adding it as an Archive/Container format so the contents may get some identification as well?

@jackdos
Contributor Author

jackdos commented Oct 5, 2020

I think a simple signature is a priority so that we can at least recognise these different files. I can see the use of marking it as an archive/container format, but I suspect that to get anything useful from that would require a decent amount of coding work. Adding an issue for that might be worthwhile, I'm not sure I'd get time to look at that in this window though!

@jackdos jackdos changed the title PDF Portfolio Files PDF Portfolio 1.7 Files Oct 6, 2020
jackdos added a commit to preservica/pronom-research-week that referenced this issue Oct 6, 2020
@thorsted
Contributor

thorsted commented Oct 6, 2020

Jack,

Having a little difficulty getting your signature to work for me. I am not seeing any offsets for /Collection or the other strings.
Also, the mimetype is set to json and the format priority should be set to 289 instead of the PUID fmt/276.

I like the use of the Collection dictionary; it should be a good identifier.

@asciim0

asciim0 commented Oct 7, 2020

I kind of disagree with a specific PUID for portfolio files. My reasons:

  • they are not a PDF profile per se, but a possibility integrated in PDF 1.7 and 2.0
  • different readers present portfolios differently. Readability is best in Adobe Acrobat; others tend to show the contents as bookmarks that you can use to navigate through the files

I believe that the fact that a PDF can be viewed as a portfolio is something that should be determined at characteristic or risk level, so something that e.g. JHOVE should pick up and not the file format identification chain.
If portfolios receive their own PUID, the same argument could be made for PDFs containing attachments, PDFs containing embedded AV streams, etc.

@marhop
Contributor

marhop commented Oct 7, 2020

Good point!

If portfolios receive their own PUID, the same argument could be made for PDFs containing attachments, PDFs containing embedded AV streams, etc.

... and for generic WAVE files, WAVE files with PCM encoding, WAVE files with the WAVE_FORMAT_EXTENSIBLE extension, WAVE files with the Broadcast WAVE extension (further split into generic, PCM and MPEG encodings), WAVE files with Exif metadata ... SCNR ;-)

It's hard to draw a line between format identification and characterization (or whatever you'd like to call it) and I really wish PRONOM had clear criteria for that.

PS: To be fair, both Broadcast WAVE and Exif audio are commonly advertised as file formats of their own. Personally, I think they are just extended (mainly added metadata) WAVE files, so I added them to the above list.

PPS: Sorry for hijacking this thread with generic rants! Now, back to business.

@asciim0

asciim0 commented Oct 7, 2020

good point - but i think the connection between WAVE / BWF / AV-container of your choice vs. PDF is different.
A/V containers have to have the payload in a specific encoding per design. it's an expected behavior of the format in every case.
PDF attachments or portfolios are optional as per standard. it's a feature, not a bug ;-D

P.S. and, again, we might need portfolio PUIDs for every sub-profile of PDF, depending on whether they allow portfolios by design. so extra PUIDs for every PDF/A, PDF/UA, PDF/X, PDF/VT, etc. that is presented as a portfolio, in addition to the PUIDs already in existence.
P.P.S. I would hope for smart PDF readers to turn the portfolio rendering off anyways and just present them as attachments, as it's a pain to load, even in Acrobat.

@jackdos
Contributor Author

jackdos commented Oct 7, 2020

@thorsted - good spot on the mime type, recycled files from previous signature work, oops, will fix and update the PR. There are deliberately no offsets for the /Collection or <</CI<< strings, as the structure of PDF means that they can be almost anywhere in the file (with the exceptions of the protected BOF and EOF areas, which the overall PDF byte sequences would protect against anyway). I used the PUID rather than the ID specifically because PUIDs are persistent, whereas I'm not sure how constant those IDs actually are.
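
As a rough illustration of the offset-free matching described above, the following Python sketch checks for a PDF 1.7 header, an `%%EOF` marker near the end, and the two marker strings anywhere in the file. It is not the PRONOM signature itself: the function name `looks_like_portfolio`, the 1024-byte tail window, and the (structurally incomplete) sample bytes are assumptions made for the example, and the `<</CI<<` form is taken from the string quoted later in this thread.

```python
PDF_HEADER = b"%PDF-1.7"
EOF_MARKER = b"%%EOF"


def looks_like_portfolio(data: bytes) -> bool:
    """Heuristic check: PDF 1.7 header, EOF marker near the end,
    and both marker strings anywhere in the file (no fixed offsets)."""
    if not data.startswith(PDF_HEADER):
        return False
    # The EOF marker is expected close to the end of the file;
    # the 1024-byte window is an assumption for this sketch.
    if EOF_MARKER not in data[-1024:]:
        return False
    # Variable-position matches: plain substring search over the whole file.
    return b"/Collection" in data and b"<</CI<<" in data


# Hypothetical, hand-written bytes (not a valid PDF) for demonstration only.
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<</Type/Catalog/Collection 2 0 R>>\nendobj\n"
    b"3 0 obj\n<</CI<</Subject(x)>>/Type/Filespec>>\nendobj\n"
    b"%%EOF\n"
)

if __name__ == "__main__":
    print(looks_like_portfolio(SAMPLE))
```

Because the search is a literal substring match on raw bytes, it only sees objects that are stored uncompressed and written exactly in that form.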

@asciim0 - I agree there is a risk of format explosion, and I can see the argument that it's not hugely different from a PDF with an attachment, or an embedded video, but I do think Collections are an inherently different use case.

PDF is generically a document. That document can have attachments, but the basic use case is still a single primary document and some supporting material. I agree that finding those attachments and dealing with them is more about characterising them and dealing with the associated risks than about treating a PDF with an embedded Word doc as a fundamentally different format from a PDF with an embedded video, or from a PDF with no other content.

Portfolios change the nature of the file from essentially a document to essentially a container. There is no "primary document" with attachments hanging off it, there is just a set of equally primary files. This changes the entire purpose of the file, which to me makes it a different enough format to merit its own entry in PRONOM.

jackdos added a commit to preservica/pronom-research-week that referenced this issue Oct 7, 2020
…rityover field to make clear this is a reference to the format puid rather than an attempt to reference another node in the XML
Dclipsham pushed a commit that referenced this issue Oct 7, 2020
Issue #12 - Added signature file and notes for PDF Portfolio files
@thorsted
Contributor

@jackdos Playing with the new signature released today and it looks like some of my Portfolio samples do not have the "<</CI<<" string. Is this constant in all your samples?

@jackdos
Contributor Author

jackdos commented Nov 30, 2021

Hi @thorsted, apologies, only just saw this.

All of the examples I have would have had that string; IIRC that's the specifier for a CollectionItem, and I was assuming that a collection wouldn't really exist without items. Are you seeing additional characters between your object delimiters (<<) and the label (/CI)? Or just not seeing those objects at all?
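
One way to tell those cases apart on a sample is sketched below in Python. The function name `diagnose_ci` and the four result labels are made up for this example. It only inspects raw bytes, so anything stored inside a compressed stream would be reported as absent, and as far as I can tell the CI entry is optional in a file specification dictionary, so "absent" does not by itself mean the file is not a collection.

```python
import re

# PDF whitespace characters: NUL, tab, LF, FF, CR, space.
PDF_WS = rb"[\x00\t\n\x0c\r ]*"

EXACT = b"<</CI<<"
# Same structure, but tolerating whitespace around the /CI name.
SPACED = re.compile(rb"<<" + PDF_WS + rb"/CI" + PDF_WS + rb"<<")
# /CI as a complete name anywhere: the next byte must not be a regular
# character, so longer names such as /CIDToGIDMap are not counted.
ANY_CI = re.compile(rb"/CI(?![^\x00\t\n\x0c\r ()<>\[\]{}/%])")


def diagnose_ci(data: bytes) -> str:
    """Classify how (or whether) a /CI entry appears in the raw bytes."""
    if EXACT in data:
        return "exact"      # the literal string is present
    if SPACED.search(data):
        return "spaced"     # whitespace between << and /CI, or /CI and <<
    if ANY_CI.search(data):
        return "elsewhere"  # /CI exists but is not the first key after <<
    return "absent"         # no uncompressed /CI name found


if __name__ == "__main__":
    print(diagnose_ci(b"<< /CI << /Subject (a) >> >>"))
```

A result of "spaced" or "elsewhere" would suggest the literal string is too strict for that sample, while "absent" matches the "not seeing those objects at all" case.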

@thorsted
Contributor

@jackdos Not seeing the /CI at all, only the /Collection tag. I'll do some more digging.
