Feature request
I'm often examining documents that are not written in English, or that contain a mixture of languages.
I'm doing a project where identifying the language is important because I'm putting the text in an Elasticsearch index. Stemming using an English analyser on French text, for example, makes absolutely no sense, and in fact will tend to deliver worse results than no stemmer at all. So identifying the languages correctly matters.
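To make the indexing concern concrete, here is a minimal sketch of choosing an Elasticsearch analyzer from a language tag. The function name `analyzer_for` is hypothetical and the list of languages is deliberately short; `english`, `french`, `german`, `spanish`, `italian` and `standard` are names of built-in Elasticsearch analyzers.

```rust
/// Map a language tag such as "fr-FR" to the name of a built-in
/// Elasticsearch analyzer. Unknown languages fall back to "standard",
/// which does no language-specific stemming.
fn analyzer_for(lang_tag: &str) -> &'static str {
    // Keep only the primary subtag ("fr" from "fr-FR" or "fr_FR").
    let primary = lang_tag
        .split(|c| c == '-' || c == '_')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase();
    match primary.as_str() {
        "en" => "english",
        "fr" => "french",
        "de" => "german",
        "es" => "spanish",
        "it" => "italian",
        // Illustrative only: extend this list for the languages you index.
        _ => "standard",
    }
}
```

With this, `analyzer_for("fr-FR")` selects the French analyzer rather than applying English stemming to French text.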
Sometimes the language of a document, or of a fragment within it, is properly marked because the author set the text (or the fragment) to the right language. That setting usually ends up as a w:lang element, in either styles.xml or document.xml. For example, here is grep output from a decompressed .docx file for a French document (hence "fr-FR"):
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t>elle a subi du fait des livraisons
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t> ;</w:t></w:r></w:p><w:p w14:paraI
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr></w:p><w:p w14:paraId="5C118616
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr><w:r w:rsidRPr="00D000FB"><w:rP
./word/document.xml:" w:cs="Arial"/><w:b/><w:bCs/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t xml:space="preserve">CONDAMNER </w
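As a rough illustration of what reading these directives could look like, the sketch below scans a chunk of document.xml for `w:lang` elements and pulls out their attributes. It is a naive string scan using only the standard library, not how this crate or docx-rust works; the names `LangTag`, `attr` and `find_lang_tags` are hypothetical, and real code should use a proper XML parser.

```rust
/// Language attributes found on one `<w:lang .../>` element.
#[derive(Debug, Default, PartialEq)]
struct LangTag {
    val: Option<String>,
    east_asia: Option<String>,
    bidi: Option<String>,
}

/// Return the value of `name="..."` inside the text of a single tag, if present.
fn attr(tag: &str, name: &str) -> Option<String> {
    let needle = format!("{}=\"", name);
    let start = tag.find(&needle)? + needle.len();
    let end = tag[start..].find('"')? + start;
    Some(tag[start..end].to_string())
}

/// Collect every `<w:lang .../>` element in a chunk of WordprocessingML.
/// Naive scan for illustration: it ignores namespaces bound to other prefixes.
fn find_lang_tags(xml: &str) -> Vec<LangTag> {
    let mut out = Vec::new();
    let mut rest = xml;
    while let Some(pos) = rest.find("<w:lang ") {
        let after = &rest[pos..];
        // Stop if the element is truncated and has no closing '>'.
        let end = match after.find('>') {
            Some(e) => e,
            None => break,
        };
        let tag = &after[..end];
        out.push(LangTag {
            val: attr(tag, "w:val"),
            east_asia: attr(tag, "w:eastAsia"),
            bidi: attr(tag, "w:bidi"),
        });
        // Continue scanning after this element.
        rest = &after[end..];
    }
    out
}
```

Calling `find_lang_tags` on one of the fragments above yields a single `LangTag` whose `east_asia` field is `Some("fr-FR")` and whose other fields are `None`. Associating each tag with the text of its run would need real XML parsing, which this sketch does not attempt.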
Solution I'd like
It'd be nice if language indications, both for the document as a whole (i.e. from styles.xml) and for individual text runs (as found in document.xml), could be included in the JSON object delivered.
Alternatives?
It's fairly practical to find the global settings for the document's language, i.e. by examining styles.xml. This is what the crate docx-rust lets you do, for example. But this only gives you the "globally set" language for the docx. In fact it appears to be the "lang" property for the default character style.
But getting indications concerning individual runs of text seems currently to be impossible using either that crate or this one.
Having said that, there is a crate, lingua, which is intended to identify languages from fragments of text. It's pretty good, but usually the directives actually found in Word documents will be better (at least when these directives state a language other than English).
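The preference described above can be sketched as a small decision rule. This is only an illustration under stated assumptions: `pick_language` and `is_english` are hypothetical names, both inputs are plain language tags, and the detector result is passed in as a string rather than obtained through lingua's actual API.

```rust
/// True if the primary subtag of `tag` is "en" (e.g. "en", "en-US", "EN-gb").
fn is_english(tag: &str) -> bool {
    tag.split('-')
        .next()
        .map_or(false, |p| p.eq_ignore_ascii_case("en"))
}

/// Choose a language for a text run.
/// `declared`: tag taken from the document's own directives, if any.
/// `detected`: tag guessed by a statistical detector, if any.
fn pick_language<'a>(declared: Option<&'a str>, detected: Option<&'a str>) -> Option<&'a str> {
    match (declared, detected) {
        // A non-English directive is treated as more reliable than a guess.
        (Some(d), _) if !is_english(d) => Some(d),
        // Otherwise (no directive, or an English one) prefer the detector.
        (_, Some(g)) => Some(g),
        // No detector result: fall back to whatever the document declares.
        (d, None) => d,
    }
}
```

For example, `pick_language(Some("fr-FR"), Some("en"))` returns `Some("fr-FR")`, while `pick_language(Some("en-US"), Some("fr"))` returns `Some("fr")`.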
Mrodent changed the title from "Include language directives?" to "Include language directives? [proposed label: enhancement]" on May 17, 2024