Word documents and content_type_norm #199

Open
tokee opened this issue Oct 29, 2018 · 1 comment
Comments

@tokee
Collaborator

tokee commented Oct 29, 2018

If we search for content_type_ext:doc AND content_type:"application/msword" in the Danish Netarchive Search, we get these facet counts for content_type_norm:

  • other : 3577875
  • word : 17606

There seems to be a problem with deriving the normalised content type for Word documents: the vast majority end up as other instead of word.

Maybe a more general approach would be to search for all records that have other as normalised content type and facet on the different content type fields, to see if there are more heavy hitters that are not handled?
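
A rough sketch of what that check could look like as a Solr request, in case it helps. It only builds the query string. The helper name `build_other_facet_query` and the default limit are made up for illustration, and the field list assumes `content_type` and `content_type_ext` are the fields worth faceting on; others can be passed in. The query parameters (`q`, `rows`, `facet`, `facet.field`, `facet.limit`, `facet.mincount`) are standard Solr ones.

```python
from urllib.parse import urlencode


def build_other_facet_query(fields=("content_type", "content_type_ext"), limit=20):
    """Build a Solr query string that facets 'other' records on raw content type fields.

    Hypothetical helper: the name, defaults and field list are assumptions,
    not part of the indexer or the Netarchive Search setup.
    """
    params = [
        ("q", "content_type_norm:other"),  # only records normalised to 'other'
        ("rows", "0"),                     # no documents needed, only facet counts
        ("facet", "true"),
        ("facet.limit", str(limit)),       # top N values per field
        ("facet.mincount", "1"),           # skip values with zero hits
    ]
    # One facet.field parameter per field to inspect.
    params.extend(("facet.field", field) for field in fields)
    return urlencode(params)


query_string = build_other_facet_query()
print(query_string)
```

The resulting string can be appended to the select handler URL of whichever Solr collection holds the index. The top facet values would then show which raw content types most often fall through to other.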

@anjackson
Contributor

anjackson commented Aug 3, 2022

This may be related to #289, where at least part of the problem is that the content type does not fall back to content_type_served when format identification via Tika/DROID fails. EDIT: Hmm, probably not.
