If we ever reindex... #344
@m453h These are some of the example changes. I think the top three @philbudne listed are the priority: removing fielddata should free up a bunch of memory off the bat, and the field type changes to language and canonical_domain will potentially really speed up our aggregator query.
So to test the changes described by @philbudne above, I've made the following changes.
Deployed test on http://posey.angwin:8601
An issue arises when we try to eliminate some of the fields in the old (current) indices, such as `full_language` and `original_url`. To eliminate these we'd need to disable dynamic mapping on the new index template.
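Disabling dynamic mapping on the new template might look like the sketch below. This is a hypothetical fragment, not a tested config: the template name and index pattern are placeholders, and whether to use `false` (silently ignore unmapped fields) or `"strict"` (reject them) is an open choice not settled in this thread.

```json
PUT _index_template/mc_search_test
{
  "index_patterns": ["mc_search-*"],
  "template": {
    "mappings": {
      "dynamic": false,
      "properties": {
        "article_title": { "type": "text" }
      }
    }
  }
}
```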
Next steps that we discussed:
Xavier wrote:
An issue arises when we try and eliminate some of the fields in the old(current) indices such as `full_language` and `original_url`??
They're not large, so keeping them isn't a problem.
Keeping them as keyword could be less expensive, and setting `index`
to false might also reduce storage:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html
But could come back to bite us.
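If we do keep `full_language` and `original_url`, a mapping fragment along these lines (a sketch, not a tested config) would store them as keyword while skipping the inverted index for them, which is the storage trade-off being discussed:

```json
"full_language": {
  "type": "keyword",
  "index": false
},
"original_url": {
  "type": "keyword",
  "index": false
}
```

Note that with `"index": false` the fields remain retrievable in `_source` but can no longer be searched directly, which is the "come back to bite us" risk.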
Changing all the fields at the same time will tell us if the new
schema works at all, which is a good thing, and a good starting place.
If we want/need to understand the storage and query speed impacts of
each change (especially if the changes make anything worse), I think
we'll want to make single-field changes. Assuming there aren't
interactions, testing single-field changes would give us the clearest
picture of the impact of each change.
"article_title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"fielddata": false // Turned off to reduce heap memory demand
},
fielddata _might_ be necessary to do "top terms" sampling/aggregation?
We currently do "top terms" on title.
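One way to do a "top terms" aggregation on title without fielddata is to aggregate on the `keyword` sub-field instead (a sketch; the index name is a placeholder). The caveat, and it may be why fielddata was used, is that this buckets whole title strings rather than individual analyzed words:

```json
GET mc_search/_search
{
  "size": 0,
  "aggs": {
    "top_titles": {
      "terms": {
        "field": "article_title.keyword",
        "size": 10
      }
    }
  }
}
```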
"canonical_domain": {
"type": "wildcard" // Updated to wildcard for more efficient prefix searches
},
prefix matching on "url" is of interest, but not on canonical_domain
(which has already been reduced to a very simple form).
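If `url` were mapped as `wildcard` (suggested later in this thread), a prefix-style match might look like this sketch (field and index names are assumptions; the `wildcard` field type is designed to make this kind of pattern query cheaper than it is on `keyword`):

```json
GET mc_search/_search
{
  "query": {
    "wildcard": {
      "url": {
        "value": "https://example.com/politics/*"
      }
    }
  }
}
```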
"indexed_date": {
"type": "date"
},
From what I've read, date is supposed to be stored in milliseconds
(three decimal places) but we currently put in times with
microseconds, (six decimal places) and they seem to be retrieved with
microseconds intact?! This is a GOOD thing, because it's the most
useful pagination key!!! From what I've read, if you want better than
millisecond granularity, you're _SUPPOSED_ to need to use "date_nanos"??!
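If sub-millisecond ordering ever needs to be guaranteed rather than incidental, the mapping change would be a one-liner (a sketch; note the `date_nanos` range limit of roughly 1970 through 2262 mentioned at the end of this thread):

```json
"indexed_date": {
  "type": "date_nanos"
}
```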
"language": {
"type": "keyword" // Changed to keyword type for more efficient aggregations
},
I don't know that it makes aggregation more efficient, but keyword
could be more storage efficient (tho the field should only ever be two
letters), and should only ever be queried with fixed, two letter values.
"publication_date": {
"type": "date",
"ignore_malformed": true
},
In THIS case, we don't care about anything more granular than a date
(no time) YYYY-mm-dd (so long as we ignore any included time, rather
than rejecting it altogether), IF there is a storage win.
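A date-only mapping for this might look like the fragment below (a sketch; the exact `format` string is an assumption). The second format alternative accepts values that include a time component instead of rejecting them, matching the "ignore any included time" requirement:

```json
"publication_date": {
  "type": "date",
  "format": "yyyy-MM-dd||strict_date_optional_time",
  "ignore_malformed": true
}
```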
Regarding aggregations failing with the new indexing: in the production ES, this change in news-search-api
got the canonical_domain aggregation working (canonical_domain is a keyword field), so I hope/suspect changing
@thepsalmist does the index still exist? I wanted to test against it. I tried http://posey:9200 but didn't find any indices.
The dev stack runs on
@thepsalmist
Thanks!
changing
so, as I initially feared, the top words aggregation (as written) requires fielddata, and often fails due to memory circuit breaker trips. Since the Based on 6 samples each,
I was able to run
Created new indices based off this comment, representing each field change. Base index: The This should allow us to benchmark query changes for each field-level change and for the entire schema change.
Great!
An index that maps url as wildcard would also be of interest (to me).
Things to consider if we ever reindex (i.e., when moving to a new cluster), after testing!!
I had put a note here about truncating indexed_date to days, but it's used as a pagination key (and a very good/stable one at that!). As a `date` it should only have millisecond granularity, but we're sending in microseconds, and they seem to come back?! If we ever create more than 1000 stories in a second, ms is not good enough for pagination, and `date_nanos` might be a better choice (but is limited to the range 1970 thru 2262?).
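Since indexed_date's value is as a pagination key, a `search_after`-style page fetch might look like the sketch below (the index name and sort values are placeholders; in practice the `search_after` array is copied from the last hit's `sort` values on the previous page, and the `_id` tiebreaker guards against equal timestamps):

```json
GET mc_search/_search
{
  "size": 1000,
  "sort": [
    { "indexed_date": "asc" },
    { "_id": "asc" }
  ],
  "search_after": ["2024-03-01T12:34:56.123456Z", "abc123"]
}
```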