Remove "generated by python-docx" from description tag #1387

shoang22 · 2024-05-02T15:37:03Z

Hello,

I'm building a translation app that converts PDFs to docx files. I use the docx files to generate XLIFF, which gets parsed so that I can perform machine translation before merging the translations into the output doc (expected to be in the same format as the input docx).

When I use the parser to extract text from the generated docx file, I get extra text that I'm assuming comes from here. I tried simply removing the lines, but the parser still cannot merge the source and target language docs.

Is there a way to ensure the tags don't get generated?

icy-comet · 2024-05-02T16:25:22Z

You can just modify these properties through a document's core_properties attribute.

References:

shoang22 · 2024-05-02T17:07:51Z

Thanks for the reference. What if I wanted to remove select components from core_properties entirely?
Initially, I set them to empty strings:

docx_doc.core_properties.comments = ""
docx_doc.core_properties.author = ""

The problem with this is that the parser (tikal) still recognizes them and reads them as two blank lines. When attempting to merge with the document containing the target text, I have to add two blank lines to the end of the target file to make it work. I was wondering if there's a more elegant solution.

I tried to delete them but was met with the following:

AttributeError: property 'comments' of 'CoreProperties' object has no deleter
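
For context on the error above: Python raises this AttributeError whenever `del` is applied to a property that defines a getter and setter but no deleter. The following is a minimal sketch of that behaviour using a hypothetical stand-in class (`FakeCoreProperties` is an assumption for illustration, not python-docx's actual implementation):

```python
class FakeCoreProperties:
    """Hypothetical stand-in: a read/write property with no deleter."""

    def __init__(self):
        self._comments = "generated by python-docx"

    @property
    def comments(self):
        return self._comments

    @comments.setter
    def comments(self, value):
        self._comments = value


props = FakeCoreProperties()
props.comments = ""  # assignment works because a setter is defined

deletion_failed = False
try:
    del props.comments  # no deleter is defined, so this raises
except AttributeError:
    deletion_failed = True

print(deletion_failed)
```

So the error only says that `del` is not supported on the attribute; it says nothing about whether the underlying XML element can be removed some other way.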

scanny · 2024-05-02T18:32:55Z

This should do the trick:

# -- corresponds to "comments" --
core_properties._element._remove_description()
# -- corresponds to "author" --
core_properties._element._remove_creator()

shoang22 · 2024-05-03T13:45:54Z

> This should do the trick:
>
> # -- corresponds to "comments" --
> core_properties._element._remove_description()
> # -- corresponds to "author" --
> core_properties._element._remove_creator()

Both of these still set comments and author to an empty string:

-> docx_doc.core_properties.comments
'generated by python-docx'

-> docx_doc.core_properties._element._remove_description()

-> docx_doc.core_properties.comments
''
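
An empty string from the getter does not by itself show whether the element is still present in the file the parser reads. One way to check is to list the child elements of the core-properties part in the saved .docx. This is a rough sketch using only the standard library; the helper name `core_property_tags` is hypothetical, and it assumes the part lives at the conventional `docProps/core.xml` path inside the package:

```python
import zipfile
import xml.etree.ElementTree as ET


def core_property_tags(docx_file):
    """Return the local names of the child elements of docProps/core.xml.

    `docx_file` may be a path or a binary file-like object.
    Assumes the core-properties part is stored at docProps/core.xml.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    # Tags look like "{namespace-uri}description"; keep only the local name.
    return [child.tag.rsplit("}", 1)[-1] for child in root]
```

If `"description"` or `"creator"` appears in the returned list, the element is still in the file (possibly empty); if it is absent, the element was removed.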

scanny · 2024-05-03T17:25:20Z

@shoang22 okay, well I'm sure there's a reason we did it that way, possibly because Dublin Core (the "core" in core-properties) attributes should always be type str, even if they are not "filled".

If for your use case you prefer the value None you can use the expression:
comments = core_properties.comments or None

>>> core_properties = document.core_properties
>>> core_properties.comments
''
>>> print(core_properties.comments or None)
None

scanny closed this as completed May 2, 2024
