Process needed for directly editing story entries in the ES archive #345

pgulley · 2024-10-30T15:39:55Z

We recently indexed a batch of stories with an "original_url" of "mediacloud.org/need_canonical_url" without correctly setting the canonical domain for those stories, so they're effectively hidden from our front end. This presents a new problem: can we update a document in place within ES? We need a script which will, for each problematic story:

1. Grab the story id
2. Determine the correct url/canonical domain
3. Write an update to ES to correct the canonical_domain field
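The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script that was eventually merged: it assumes the real URL is still present in each document under a field called `url` (a guess, since the thread only names `original_url` and `canonical_domain`), it uses a deliberately naive `naive_canonical_domain` helper as a stand-in for whatever domain logic the indexer really applies, and it assumes a recent elasticsearch-py client whose `update()` accepts a `doc=` keyword for partial updates.

```python
from urllib.parse import urlsplit

# Placeholder value quoted in the issue description.
PLACEHOLDER_URL = "mediacloud.org/need_canonical_url"


def naive_canonical_domain(url: str) -> str:
    """Hypothetical stand-in: lower-cased host name with a leading 'www.' removed.

    The real pipeline may reduce hosts differently (e.g. to a registered domain).
    """
    if "://" not in url:
        url = "//" + url  # let urlsplit treat a scheme-less URL as host + path
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_partial_update(hit: dict, url_field: str = "url"):
    """Steps 1 and 2: return (doc_id, partial_doc) for one search hit.

    Returns None when the hit has no usable URL. `url_field` is an assumption
    about where the real URL lives in the document.
    """
    url = hit.get("_source", {}).get(url_field)
    if not url:
        return None
    domain = naive_canonical_domain(url)
    if not domain:
        return None
    return hit["_id"], {"canonical_domain": domain}


def fix_one(es, index: str, hit: dict) -> bool:
    """Step 3: send a partial update that touches only canonical_domain."""
    update = build_partial_update(hit)
    if update is None:
        return False
    doc_id, partial = update
    # Partial-document update; other fields of the story are left untouched.
    es.update(index=index, id=doc_id, doc=partial)
    return True
```

Here `es` would be an `elasticsearch.Elasticsearch` client and `hit` one entry from a search response. Keeping the "work out the fix" step separate from the "write it" step makes it easy to dry-run the script and inspect the proposed changes before anything is written.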

@m453h, I think the first step is just to duplicate @philbudne's histogram of all of the offending stories, and go from there!
If I'm understanding the problem correctly, the data is all present, just in the wrong place, so there's no need to go back to the original data that we queued up, right?

philbudne · 2024-10-31T15:43:51Z

My (updated) query script (using raw ES API):
mc.py.txt

philbudne · 2024-11-03T20:07:25Z

Looks like multiple entries can be updated in one call with the bulk() helper, so we could pull multiple entries at once and then send the corrections back in a single call?
https://elasticsearch-py.readthedocs.io/en/stable/helpers.html
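A rough sketch of that pull-then-bulk-update idea, using the `scan()` and `bulk()` helpers from the linked page. The function names, the `domain_for` callback, and the shape of the query passed in are illustrative assumptions rather than anything taken from the attached script; the action dictionaries follow the update format documented for `helpers.bulk()`.

```python
from typing import Callable, Iterable, Iterator, Optional


def update_actions(
    hits: Iterable[dict],
    domain_for: Callable[[dict], Optional[str]],
) -> Iterator[dict]:
    """Turn search hits into bulk 'update' actions that set canonical_domain.

    `domain_for` is a hypothetical callback that returns the corrected domain
    for a hit, or None/"" to skip it.
    """
    for hit in hits:
        domain = domain_for(hit)
        if not domain:
            continue
        yield {
            "_op_type": "update",      # partial update rather than a full reindex
            "_index": hit["_index"],   # reuse the index the hit actually came from
            "_id": hit["_id"],
            "doc": {"canonical_domain": domain},
        }


def run_bulk_fix(es, index_pattern: str, query: dict, domain_for):
    """Scan matching stories and send all corrections through helpers.bulk()."""
    # Imported lazily so update_actions() can be exercised without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=index_pattern, query=query)
    # With default settings, returns (number of successes, list of errors).
    return helpers.bulk(es, update_actions(hits, domain_for))
```

Taking `_index` from each hit rather than from the search pattern matters if the archive spans several indices, since an update has to address the concrete index that holds the document. The query itself is left to the caller because the right clause for matching the placeholder "original_url" depends on how that field is mapped.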

pgulley assigned pgulley, m453h and philbudne Oct 30, 2024

m453h mentioned this issue Nov 5, 2024

Implement canonical domain update script #348

Merged
