Remove "generated by python-docx" from description tag #1387

Closed
shoang22 opened this issue May 2, 2024 · 5 comments

shoang22 commented May 2, 2024

Hello,

I'm building a translation app that converts PDFs to docx files. From those I generate XLIFF, which gets parsed so that I can perform machine translation before merging the translations into the output doc (expected to be in the same format as the input docx).

When I use the parser to extract text from the generated docx file, I get extra text that I'm assuming comes from here. I tried simply removing the lines, but the parser still cannot merge the source and target language docs.

Is there a way to ensure the tags don't get generated?

icy-comet commented May 2, 2024

shoang22 commented May 2, 2024

Thanks for the reference. What if I wanted to remove selected components from core_properties entirely?
Initially, I set them to empty strings:

docx_doc.core_properties.comments = ""
docx_doc.core_properties.author = ""

The problem with this is that the parser (Tikal) still recognizes them and reads them as two blank lines. When merging with the document containing the target text, I have to add two blank lines to the end of the target file to make it work. I was wondering whether there's a more elegant solution.

I tried to delete them but was met with the following:

AttributeError: property 'comments' of 'CoreProperties' object has no deleter
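
The error message indicates that `comments` is a Python property with no deleter, so `del` has nothing to call. A minimal sketch of that behavior, using a hypothetical stand-in class (`FakeCoreProperties` is illustrative only, not python-docx's actual `CoreProperties`):

```python
class FakeCoreProperties:
    """Hypothetical stand-in; not python-docx's real class."""

    def __init__(self):
        self._comments = "generated by python-docx"

    @property
    def comments(self):
        return self._comments

    @comments.setter
    def comments(self, value):
        self._comments = value


props = FakeCoreProperties()
props.comments = ""  # allowed: the property defines a setter

try:
    del props.comments  # fails: the property defines no deleter
except AttributeError as exc:
    delete_error = exc
else:
    delete_error = None
```

Here `delete_error` ends up holding an `AttributeError`; the exact wording of its message depends on the Python version.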

scanny commented May 2, 2024

This should do the trick:

# -- corresponds to "comments" --
core_properties._element._remove_description()
# -- corresponds to "author" --
core_properties._element._remove_creator()

scanny closed this as completed May 2, 2024

shoang22 commented May 3, 2024

> This should do the trick:
>
> # -- corresponds to "comments" --
> core_properties._element._remove_description()
> # -- corresponds to "author" --
> core_properties._element._remove_creator()

After both of these calls, comments and author still read back as an empty string:

-> docx_doc.core_properties.comments
'generated by python-docx'

-> docx_doc.core_properties._element._remove_description()

-> docx_doc.core_properties.comments
''
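
An empty string from the getter does not by itself show whether an empty tag is still in the file, since a reader could return `''` for a missing element as well as for an empty one. One way to check, and to strip the elements outright, is to edit the core properties part inside the saved .docx directly. The following is a rough sketch using only the standard library. The function name `strip_core_properties` is hypothetical, the part name `docProps/core.xml` is assumed to be the conventional one, and none of this is python-docx API.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
CORE_PART = "docProps/core.xml"  # assumed location of the core properties part

# Keep the usual prefixes on output; attribute values such as
# xsi:type="dcterms:W3CDTF" depend on the "dcterms" prefix surviving.
for prefix, uri in {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": DC_NS,
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}.items():
    ET.register_namespace(prefix, uri)


def strip_core_properties(docx_bytes, tags=("description", "creator")):
    """Return a copy of the .docx with the given Dublin Core elements removed."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buf = io.BytesIO()
    with src, zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for tag in tags:
                    # dc:* elements are direct children of cp:coreProperties
                    for el in root.findall(f"{{{DC_NS}}}{tag}"):
                        root.remove(el)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)  # all other parts are copied unchanged
    return out_buf.getvalue()
```

Usage would be along the lines of reading the saved file's bytes, passing them through `strip_core_properties`, and writing the result back out. Whether Tikal then stops emitting the two blank lines has not been verified here.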

scanny commented May 3, 2024

@shoang22 okay, well I'm sure there's a reason we did it that way, possibly because Dublin Core (the "core" in core-properties) attributes should always be type str, even if they are not "filled".

If for your use case you prefer the value None you can use the expression:
comments = core_properties.comments or None

>>> core_properties = document.core_properties
>>> core_properties.comments
''
>>> print(core_properties.comments or None)
None
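
The `or None` expression works because an empty string is falsy in Python, so `or` falls through to its right-hand operand. A small sketch, with `none_if_empty` as a hypothetical helper name:

```python
def none_if_empty(value):
    """Map an empty (falsy) string to None; return any other value unchanged."""
    return value or None


empty_result = none_if_empty("")
filled_result = none_if_empty("generated by python-docx")
```

With this, `empty_result` is `None` while `filled_result` keeps its original text.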
