Include language directives? [proposed label: enhancement] #715

Mrodent · 2024-05-17T08:36:12Z

Feature request

I often work with documents that are not written in English, or that contain a mixture of languages.

I'm doing a project where identifying the language is important because I'm putting the text in an Elasticsearch index. Stemming using an English analyser on French text, for example, makes absolutely no sense, and in fact will tend to deliver worse results than no stemmer at all. So identifying the languages correctly matters.
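To illustrate the point about analysers, here is a minimal sketch of picking an Elasticsearch analyser name from a Word language tag. The analyser names ("english", "french", and so on) are Elasticsearch's built-in language analysers; the `ANALYZERS` table and the `analyzer_for` helper are hypothetical names made up for this example, and the fallback to "standard" (no stemming) is just one possible choice.

```python
# Hypothetical mapping from primary language subtag to a built-in
# Elasticsearch language analyser. Extend as needed.
ANALYZERS = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
}


def analyzer_for(lang_tag):
    """Pick an analyser name for a tag such as "fr-FR".

    Falls back to "standard" (no stemming) when the tag is missing or
    not in the table, rather than stemming with the wrong language.
    """
    if not lang_tag:
        return "standard"
    primary = lang_tag.split("-")[0].lower()
    return ANALYZERS.get(primary, "standard")
```

With this, text tagged "fr-FR" would be indexed with the "french" analyser, and untagged or unrecognised text would get no language-specific stemming.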

Sometimes the language of a document, or of a fragment within it, is properly indicated by setting the right language on the text. In the .docx file this usually ends up as <w:lang> elements, either in styles.xml or in document.xml. Below is the output of grep on a decompressed .docx file (NB this is a French document, hence "fr-FR"):

./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t>elle a subi du fait des livraisons
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t> ;</w:t></w:r></w:p><w:p w14:paraI
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr></w:p><w:p w14:paraId="5C118616
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr><w:r w:rsidRPr="00D000FB"><w:rP
./word/document.xml:" w:cs="Arial"/><w:b/><w:bCs/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t xml:space="preserve">CONDAMNER </w

./word/styles.xml:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="fr-FR" w:eastAsia="en-US" w:bidi="ar-SA"/><w14:ligature
./word/styles.xml:hanging="283"/></w:pPr><w:rPr><w:lang w:eastAsia="fr-FR"/></w:rPr></w:style><w:style w:type="paragraph" w
./word/styles.xml:val="32"/><w:szCs w:val="32"/><w:lang w:eastAsia="fr-FR"/><w14:ligatures w14:val="none"/></w:rPr></w:styl
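As a rough illustration of what reading these directives per run could look like, here is a sketch using only Python's standard library. It is not how this crate or docx-rust works; `run_languages` and `SAMPLE` are hypothetical names, and the sample XML is a cut-down imitation of the grep output above. It only looks at a `w:lang` set directly on the run, so it ignores anything inherited from paragraph or character styles.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def run_languages(document_xml):
    """Return a list of (text, lang_attrs) pairs, one per w:r run.

    lang_attrs holds the attributes of the run's own w:lang element
    (e.g. {"eastAsia": "fr-FR"}), or {} if the run has none.
    """
    root = ET.fromstring(document_xml)
    results = []
    for run in root.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        lang = run.find("w:rPr/w:lang", NS)
        attrs = {}
        if lang is not None:
            # Attribute keys look like "{namespace}eastAsia"; keep the local name.
            attrs = {k.split("}")[-1]: v for k, v in lang.attrib.items()}
        results.append((text, attrs))
    return results


# Made-up sample: one run with a language directive, one without.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:rPr><w:lang w:eastAsia="fr-FR"/></w:rPr>'
    '<w:t>elle a subi</w:t></w:r>'
    '<w:r><w:t> du fait</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
```

Calling `run_languages(SAMPLE)` gives one entry per run, which is roughly the shape of information that would be useful alongside each text run in the delivered output.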

Solution I'd like

It'd be nice if language indications, both for the document as a whole (i.e. from styles.xml) and for individual text runs (as found in document.xml), were included in the JSON object delivered.

Alternatives?

It's fairly practical to find the global setting for the document's language, i.e. by examining styles.xml. This is what the crate docx-rust lets you do, for example. But this only gives you the "globally set" language for the docx. In fact it appears to be the "lang" property for the default character style.
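For comparison, a sketch of reading that global setting directly, again with Python's standard library rather than either crate. It assumes the default language lives at `w:docDefaults/w:rPrDefault/w:rPr/w:lang` in styles.xml, which matches the first styles.xml line in the grep output but should be checked against real files. `default_language`, `default_language_of_docx` and `SAMPLE_STYLES` are hypothetical names.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def default_language(styles_xml):
    """Return the attributes of the document-default w:lang, or {}."""
    root = ET.fromstring(styles_xml)
    # Assumed location of the default run properties in styles.xml.
    lang = root.find("w:docDefaults/w:rPrDefault/w:rPr/w:lang", NS)
    if lang is None:
        return {}
    return {k.split("}")[-1]: v for k, v in lang.attrib.items()}


def default_language_of_docx(path):
    """Read word/styles.xml out of a .docx (a zip file) and inspect it."""
    with zipfile.ZipFile(path) as docx:
        return default_language(docx.read("word/styles.xml"))


# Made-up sample modelled on the styles.xml grep line above.
SAMPLE_STYLES = (
    f'<w:styles xmlns:w="{W_NS}"><w:docDefaults><w:rPrDefault><w:rPr>'
    '<w:lang w:val="fr-FR" w:eastAsia="en-US" w:bidi="ar-SA"/>'
    '</w:rPr></w:rPrDefault></w:docDefaults></w:styles>'
)
```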

But getting indications concerning individual runs of text seems currently to be impossible using either that crate or this one.

Having said that, there is a crate, lingua, which is intended to identify languages from fragments of text. It's pretty good, but usually the directives actually found in Word documents will be better (at least when these directives state a language other than English).
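The preference described here (trust a non-English directive, otherwise fall back to detection) could be sketched as follows. The detector is passed in as a plain callable so that no particular library API is assumed; in practice it might wrap lingua, but `choose_language` and the `detect` parameter are hypothetical names for this example only.

```python
def choose_language(directive, text, detect):
    """Decide which language tag to use for a fragment of text.

    directive: language tag from the document (e.g. "fr-FR"), or None.
    text:      the fragment itself.
    detect:    callable taking text and returning a tag or None
               (assumed to wrap a detector such as lingua).
    """
    if directive and directive.split("-")[0].lower() != "en":
        # A non-English directive was set deliberately, so trust it.
        return directive
    # English (often just the default) or missing: ask the detector,
    # keeping the directive only if detection yields nothing.
    return detect(text) or directive
```

For example, `choose_language("fr-FR", "bonjour", detect)` returns "fr-FR" without calling the detector, while a fragment tagged "en-US" is re-checked by the detector.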


