Process needed for directly editing story entries in the ES archive #345

Open
pgulley opened this issue Oct 30, 2024 · 2 comments

@pgulley
Member

pgulley commented Oct 30, 2024

We recently indexed a bunch of stories with an "original_url" of "mediacloud.org/need_canonical_url" without correctly setting the canonical domain for those stories, so they're effectively hidden from our front end entirely. This presents a new problem: can we update a document in place within ES? We need a script which will, for each problematic story:

  1. Grab the story id
  2. Determine the correct url/canonical domain
  3. Write an update to ES to correct the canonical_domain field
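
A minimal sketch of those three steps for one page of results, under several assumptions that are not confirmed in this issue: the real URL is stored in a `url` field, `original_url` can be matched with a `match_phrase` clause (this depends on how the field is mapped), and the client accepts elasticsearch-py 8.x style keyword arguments (`query=`, `doc=`). `naive_canonical_domain` and `fix_one_page` are hypothetical helpers; the former is only a stand-in for whatever canonicalization logic the indexer actually uses.

```python
from urllib.parse import urlsplit

# Placeholder value quoted in this issue.
PLACEHOLDER_URL = "mediacloud.org/need_canonical_url"


def naive_canonical_domain(url: str) -> str:
    """Stand-in only: lowercased host with a leading 'www.' removed."""
    # urlsplit only populates hostname when a scheme or leading '//' is present.
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def fix_one_page(es, index: str, size: int = 100) -> int:
    """Find affected stories and correct canonical_domain in place.

    `es` is expected to behave like an elasticsearch.Elasticsearch client.
    Returns the number of documents for which an update was sent.
    """
    resp = es.search(
        index=index,
        # Assumption: adjust the clause to match the original_url mapping.
        query={"match_phrase": {"original_url": PLACEHOLDER_URL}},
        size=size,
    )
    fixed = 0
    for hit in resp["hits"]["hits"]:
        # Step 1: the story id is the document _id.
        # Step 2: assumed field holding the real URL.
        real_url = hit["_source"].get("url")
        if not real_url:
            continue  # nothing to derive the domain from; leave untouched
        # Step 3: partial update, only canonical_domain is rewritten.
        es.update(
            index=hit["_index"],
            id=hit["_id"],
            doc={"canonical_domain": naive_canonical_domain(real_url)},
        )
        fixed += 1
    return fixed
```

A partial update via `doc` leaves every other field of the story as it is, which fits the idea that the data is present and only `canonical_domain` needs correcting.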

@m453h, I think the first step is just to duplicate @philbudne's histogram of all of the offending stories, and go from there!
If I'm understanding the problem correctly, the data is all present, just in the wrong place, so there's no need to go back to the original data that we queued up, right?

@philbudne
Contributor

My (updated) query script (using raw ES API):
mc.py.txt

@philbudne
Contributor

Looks like multiple entries can be updated in one call with the bulk() API, so could multiple entries be pulled at once and then sent back in a single call?
https://elasticsearch-py.readthedocs.io/en/stable/helpers.html
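
A rough sketch of that pull-then-send-back idea, using `helpers.scan` to stream the matching documents and `helpers.bulk` to send partial updates. The `url` field name, the `domain_for` callable, and the function names `update_actions` and `run_bulk_fix` are assumptions made for illustration; nothing here is taken from the attached script.

```python
from typing import Callable, Iterable, Iterator


def update_actions(
    hits: Iterable[dict],
    domain_for: Callable[[str], str],
    url_field: str = "url",  # assumed name of the field holding the real URL
) -> Iterator[dict]:
    """Turn search hits into bulk 'update' actions for canonical_domain."""
    for hit in hits:
        real_url = hit.get("_source", {}).get(url_field)
        if not real_url:
            continue  # skip stories with no URL to derive a domain from
        yield {
            "_op_type": "update",  # partial update, not a full reindex
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {"canonical_domain": domain_for(real_url)},
        }


def run_bulk_fix(es, index: str, query: dict, domain_for: Callable[[str], str]):
    """Stream affected stories and send the corrections back in batches.

    `query` is a full search body, e.g. {"query": {...}}.
    Returns whatever helpers.bulk returns: (success_count, errors).
    """
    # Imported lazily so update_actions can be exercised without a cluster.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=index, query=query)
    return helpers.bulk(es, update_actions(hits, domain_for))
```

Because `update_actions` is a generator, `helpers.bulk` can chunk the requests itself rather than holding every affected story in memory at once.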

3 participants