Feature request: implement im.getxmp() to return all embedded XMP meta data as XML #5076
Hello @laynr @hugovk, I'd like to take this issue; I'm already working on it. I've already got the XMP tags out of the file. I'm just wondering which output structure would be best for getxmp() to return. I was thinking it could be an object like the one getexif() returns, but since I already have the tag names, I could use the actual tag name and its value instead of the tag number. |
Btw I also could just simply return its xml tree |
Great, thanks @UrielMaD! It is probably best to keep it similar to getexif() if that returns an object, and using tag names instead of tag numbers would be awesome... That said, there is definite value in just returning the XML tree. For one, returning the XML tree may be the most future-proof option, as you wouldn't need to stay current on new tag names. I guess if you return an object, one of the object's functions could return the XML tree. Perhaps that would be the best of both worlds (but more work)! For my personal project I just used the XML tree, as XML parsers are well established. One thing I am noticing is that there can be multiple XMP sections in one file, and they are not necessarily adjacent. Thank you for taking this on. I believe it will be very useful for many people! |
Thank you @Layn, then I'll send a PR implementing an XML object; I can also return the whole XML tree as a string. I get the tag names directly from the XML tree, so I will return only the tags that are present in that file. If more XMP tags are added in the future, it will still return the new ones, since they simply come as tag attributes. |
@hugovk Changes were merged into my PR and all tests have passed |
Hi. Something to be aware of. Since the
in Pillow 8.3.0, we've added a new requirement - you will have to install |
Any thoughts on adding a parameter to getxmp() to tell it to just return the XML as a string, rather than a nested dict of lists and dicts that is very hard to work with when you want to extract anything useful? (It is not a nice set of name:value pairs but an entire XML tree shoehorned in.) Even after flattening it is an issue, given the complex overuse of namespaces and the non-standardized prefixes in the IPTC spec. There are a large number of XML parsers and tools, such as bs4, that can properly find and extract information from XML while handling namespace prefix differences across implementations. No defusedxml would be needed, as bs4 can use lxml for parsing. It would be extremely easy to add, given that the XMP XML string data is available when _getxmp() is called. Right now I have to walk the bytes of all image files looking for xmpmeta and then backtrack to check that the proper namespace is used. The prefix used to represent the namespace is not always "x:". If you ever plan to support modifying or writing XMP metadata, accepting complete, validated XML as input would certainly be easier than fighting with changes to nested dictionaries and lists. As accessibility becomes more important to publishers of all sorts, direct access to the XML as a string will simplify things for everyone. Please consider making this slight change and thereby getting yourself out of the game of processing and repackaging XML. Thank you for your time and consideration. |
FWIW, I thought about writing a routine to convert your nested dict back to real XML, but found that using dictionaries presents issues for sequences of identical tags (in this case "li" tags): the same key overwrites the earlier one. Here is an example, a simple snippet from the official IPTC sample image.
And here is that same snippet extracted from the getxmp() command:
Notice that there is only one "li" tag in the dict version, while the actual XML has two "li" tags with separate values. Also notice the missing namespace information. So there is no easy way to walk what getxmp() returns and rebuild the actual XML, even for this very simple example that just happened to be near the top of the tree. Trying to suss out anything farther down in the nested dict structure is an exercise in futility unless you know the entire structure in advance, which kind of defeats the whole purpose. The actual XML is much easier to work with and is simpler for you. Hope this helps. |
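To make the point about repeated "li" tags concrete, here is a minimal sketch of reading them from the raw XML with the standard library. The XMP fragment is made up for illustration (it is not the IPTC sample), and `subject_keywords` is a hypothetical helper name. Elements are matched by namespace URI, so the prefix a given writer chose does not matter. Note that `xml.etree.ElementTree` is not hardened against malicious XML, so for untrusted files a hardened parser such as the defusedxml package mentioned above may be the safer choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical XMP fragment, invented for illustration only.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>first keyword</rdf:li>
          <rdf:li>second keyword</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

# The keys here are local aliases; only the namespace URIs have to match the file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def subject_keywords(xmp_xml):
    """Return the text of every rdf:li under dc:subject, in document order."""
    root = ET.fromstring(xmp_xml)
    return [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]


print(subject_keywords(XMP))  # ['first keyword', 'second keyword']
```

Both list items survive because the XML tree keeps sibling elements in order instead of keying them by tag name.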
The following code shows how to retrieve the XMP string from each format that Pillow gathers it from:
from PIL import Image

# WebP: the raw XMP is stored in the info dictionary
with Image.open("Tests/images/flower2.webp") as im:
    print(im.info["xmp"])

# PNG: the XMP text chunk is keyed by "XML:com.adobe.xmp"
with Image.open("Tests/images/color_snakes.png") as im:
    print(im.info["XML:com.adobe.xmp"])

# TIFF: tag 700 holds the XMP packet
with Image.open("Tests/images/lab.tif") as im:
    print(im.tag_v2[700])

# JPEG: look for the APP1 segment that starts with the Adobe XMP marker
with Image.open("Tests/images/xmp_test.jpg") as im:
    for segment, content in im.applist:
        if segment == "APP1":
            marker, xmp_tags = content.split(b"\x00")[:2]
            if marker == b"http://ns.adobe.com/xap/1.0/":
                print(xmp_tags)
                break
From that, I would think the best way to unify this is to add |
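As a rough illustration of what unifying those four lookups could look like in user code, here is a minimal sketch. `raw_xmp` is a hypothetical helper, not part of Pillow's API; it only reads the attributes used in the snippet above (`info`, `tag_v2`, `applist`). The stand-in object at the bottom is an assumption that lets the sketch run without image files; with Pillow you would pass an opened image instead.

```python
from types import SimpleNamespace

XMP_APP1_MARKER = b"http://ns.adobe.com/xap/1.0/"


def raw_xmp(im):
    """Return the raw XMP packet from an opened image, or None if not found.

    Hypothetical helper: it mirrors the per-format lookups shown above.
    """
    info = getattr(im, "info", {})
    # WebP and PNG keep the packet in the info dictionary.
    for key in ("xmp", "XML:com.adobe.xmp"):
        if key in info:
            return info[key]
    # TIFF stores the packet in tag 700.
    tags = getattr(im, "tag_v2", None)
    if tags is not None and 700 in tags:
        return tags[700]
    # JPEG: first APP1 segment that starts with the Adobe XMP marker.
    for segment, content in getattr(im, "applist", []):
        if segment == "APP1":
            marker, _, payload = content.partition(b"\x00")
            if marker == XMP_APP1_MARKER:
                return payload
    return None


# Stand-in for an opened JPEG, so the sketch runs without a test image.
jpeg_like = SimpleNamespace(
    info={},
    applist=[
        ("APP1", b"Exif\x00\x00II*\x00"),
        ("APP1", XMP_APP1_MARKER + b"\x00<x:xmpmeta/>"),
    ],
)
print(raw_xmp(jpeg_like))  # b'<x:xmpmeta/>'
```

The sketch uses `partition` rather than `split(...)[:2]` so that an APP1 segment without a null byte does not raise, and so the payload is not truncated at a later null byte. Like the original snippet, it stops at the first matching segment.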
That would work fine for me. Thank you. I read that for JPEG, if the XMP metadata is larger than the 64k segment limit, it is split across multiple segments. If that is correct, breaking after the first segment may not be quite enough. Thank you for your code snippet. It is nice to have something working with the current version of Pillow. |
I've created #8069
We have seen multiple segments for EXIF. Do you know where you read that specifically about XMP? Or do you have an image that demonstrates this happening? |
I have no test case other than the official IPTC one. I think that, if it exists, it would be quite rare. Although, since some of these metadata values are open text fields with no spec'd size limits, you could create one easily enough. I read it here: [QUOTE] So it may just be a hypothetical case, but one that someone will probably try to exploit if at all possible. |
Actually the issue of > 64k is real and the adobe spec describes how to deal with it: |
Here is an Adobe spec quote from that issue: Quoting Adobe XMP Specification Part 3: Following the normal rules for JPEG sections, the header plus the following data can be at most 65535 bytes long. If the XMP packet is not split across multiple APP1 sections, the size of the XMP packet can be at most 65502 bytes. It is unusual for XMP to exceed this size; typically, it is around 2 KB. If the serialized XMP packet becomes larger than the 64 KB limit, you can divide it into a main portion (StandardXMP) and an extended portion (ExtendedXMP), and store it in multiple JPEG marker segments. A reader must check for the existence of ExtendedXMP, and if it is present, integrate the data with the main XMP. Each portion (standard and extended) is a fully formed XMP metadata tree, although only the standard portion contains a complete packet wrapper. If the data is more than twice the 64 KB limit, the extended portion can also be split and stored in multiple marker segments; in this case, the split portions are not fully formed metadata trees. When ExtendedXMP is required, the metadata must be split according to some algorithm that assigns more important data to the main portion, and less important data to the extended portion or portions. |
So it sounds like just walking the applist and appending the XMP sections in the sequence found will work. With more than two sections' worth of XMP, the trees can be split so that a single section is no longer a valid tree on its own, meaning the sections could never be parsed separately. |
Page 20 of https://archimedespalimpsest.net/Documents/External/XMP/XMPSpecificationPart3.pdf states
That's not as simple as just concatenating the chunks. I would prefer to work from an example of this type of file. If there is no example, then that sounds like an argument that this feature may not be so vital. |
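For reference, here is a minimal sketch of the reassembly step only. It rests on one reading of the ExtendedXMP segment layout in XMP Specification Part 3, which is an assumption here and should be checked against the spec and real files: a null-terminated `http://ns.adobe.com/xmp/extension/` signature, a 32-byte ASCII GUID, a 4-byte big-endian total length, a 4-byte big-endian offset, then the chunk data. `split_xmp_segments` is a hypothetical name, and the input is a list of `(name, bytes)` pairs shaped like `im.applist`.

```python
import struct

STANDARD_SIG = b"http://ns.adobe.com/xap/1.0/\x00"
EXTENDED_SIG = b"http://ns.adobe.com/xmp/extension/\x00"
EXT_HEADER_LEN = 32 + 4 + 4  # GUID + full length + offset (assumed layout)


def split_xmp_segments(applist):
    """Return (standard_xmp, {guid: extended_xmp}) from (name, bytes) pairs.

    Sketch only: extended chunks are grouped by GUID, ordered by offset, and
    kept only when they are contiguous and match the declared total length.
    """
    standard = None
    chunks = {}  # guid -> (full_length, {offset: data})
    for name, content in applist:
        if name != "APP1":
            continue
        if content.startswith(STANDARD_SIG):
            if standard is None:
                standard = content[len(STANDARD_SIG):]
        elif content.startswith(EXTENDED_SIG):
            body = content[len(EXTENDED_SIG):]
            if len(body) < EXT_HEADER_LEN:
                continue  # too short to hold the assumed header
            guid = body[:32].decode("ascii", "replace")
            full_length, offset = struct.unpack(">II", body[32:EXT_HEADER_LEN])
            parts = chunks.setdefault(guid, (full_length, {}))[1]
            parts[offset] = body[EXT_HEADER_LEN:]

    extended = {}
    for guid, (full_length, parts) in chunks.items():
        buf = b""
        for offset in sorted(parts):
            if offset != len(buf):
                break  # gap or overlap: treat this GUID as incomplete
            buf += parts[offset]
        if len(buf) == full_length:
            extended[guid] = buf
    return standard, extended
```

This deliberately stops short of the hard part. As I understand the spec, a reader should use only the extended portion whose GUID matches the `xmpNote:HasExtendedXMP` value in the standard packet, and then merge the two metadata trees; neither step is attempted here.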
Yes, not that simple. Of course, returning a list of XMP segment strings and letting the user fight with them is always an option! I will look for a couple of official or unofficial examples of images using ExtendedXMP and get back to you with links to them. My understanding is that some Android camera apps make liberal use of XMP metadata to store depth and other special-effects info larger than 64k, so examples should be out there. |
I started checking GitHub repos for software that manipulates ExtendedXMP and found a Java project called icafe that supports XMP across a number of formats: https://github.com/dragon66/icafe In their set of test images I found this sample, a depth image whose XMP information takes up a number of segments: https://github.com/dragon66/icafe/blob/master/images/table.jpg I will look for others. |
And here is a second image that uses ExtendedXMP in JPEG, from https://github.com/drewnoakes/metadata-extractor-images/: https://github.com/drewnoakes/metadata-extractor-images/blob/main/jpg/Google%20Cardboard.jpg If you need more, please let me know. |
Implement image.getxmp() similar to image.getexif(), that returns all embedded XMP meta data out of an image as XML
Something like:
FYI: This didn't work for me:
xmp_tags = self.info.get("XML:com.adobe.xmp")
I am sure this feature request has been asked before... but a search of 'XMP' in issues yielded nothing. Just asking for MVP, not write support, or tag comprehension.
XMP documentation:
Official: https://www.adobe.com/devnet/xmp.html
Helpful: https://exiftool.org/TagNames/XMP.html
Requesting output similar to the output of:
exiftool.exe -xmp:all -X image.jpg