If we ever reindex... #344

Open
philbudne opened this issue Oct 29, 2024 · 15 comments

@philbudne
Contributor

philbudne commented Oct 29, 2024

Things to consider if we ever reindex (i.e., when moving to a new cluster), after testing!!

  • Make language a keyword field (text doesn't seem to be needed for aggregation, since I've made aggregation work on canonical_domain, which is a keyword field)
  • See if turning off fielddata for the full text reduces heap memory demand (fielddata is only needed for doing aggregations?). BUT I suppose it's possible a better cluster design might make it (more) feasible to do top words on the full text?!
  • See if making canonical_domain a "wildcard" field makes leading-wildcard (prefix *) queries less painful. Specifying the full prefix could be clunky in the long run (possibly needing to create multiple sources with different full prefixes that could have been matched with a single leading wildcard).
  • Since snapshots only need to copy data segments that have changed (data written in the past doesn't need to be re-copied once its segments have settled, i.e. been consolidated), keeping indices based on processing order makes less sense. We could see if splitting indices based on published date makes sense (again!), IF limiting the indices searched by date range makes queries run faster (we could test this by searching some of our old indices for dates they don't contain!). Undated stories will grow without bound, but could be managed using ILM. Since indices split by published_date can grow over time, we would need to either (a) pick a number of shards that allows for growth (aiming for 30GB shards), (b) use ILM on date-based indices(!), or (c) name date-based indices by quarter.
  • Eliminate full_language?
  • Eliminate original_url (always the same as url in our use of mcmetadata?)??

I had put a note here about truncating indexed_date to days, but it's used as a pagination key (and a very good/stable one at that!). As a date field it should only have millisecond granularity, but we're sending in microseconds, and they seem to come back?! If we ever create more than 1000 stories in a second, milliseconds are not good enough for pagination, and date_nanos might be a better choice (but is limited to the range 1970 through 2262?).
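
For reference, a minimal sketch of what an indexed_date mapping with nanosecond precision could look like (the template and index-pattern names here are hypothetical; the point is only the date_nanos type):

PUT _index_template/mc_search_nanos_template
{
  "index_patterns": "mc_search_nanos-*",
  "template": {
    "mappings": {
      "properties": {
        "indexed_date": {
          "type": "date_nanos"  // nanosecond resolution; valid range roughly 1970 through 2262
        }
      }
    }
  }
}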

@pgulley
Member

pgulley commented Nov 6, 2024

@m453h These are some of the example changes. I think the top three @philbudne listed are the priority: removing fielddata should free up a bunch of memory off the bat, and the field type changes to language and canonical_domain will potentially really speed up our aggregator query.
I suppose a test would involve spinning up a news-search-api that points at the new index and getting some benchmarks. IIRC we just pass an environment variable to extend the exposed indices; @thepsalmist can confirm.

@thepsalmist
Contributor

thepsalmist commented Nov 13, 2024

So, to test the changes described by @philbudne above, I've done the following.

  1. I've created a new test index template with the suggested mapping changes. This is to ensure the mapping changes do not apply to the new indices created on rollover by the current ILM policy.
PUT _index_template/test_mediacloud_search_template
{
  "index_patterns": "mc_search_test-*",
  "template": {
    "settings": {
      "index.lifecycle.name": "mediacloud-lifecycle-policy",
      "index.lifecycle.rollover_alias": "mc_search",
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "article_title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "fielddata": false  // Turned off to reduce heap memory demand
        },
        "canonical_domain": {
          "type": "wildcard"  // Updated to wildcard for more efficient prefix searches
        },
        "indexed_date": {
          "type": "date"
        },
        "language": {
          "type": "keyword"  // Changed to keyword type for more efficient aggregations
        },
        "publication_date": {
          "type": "date",
          "ignore_malformed": true
        },
        "text_content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "fielddata": false  // Turned off to reduce heap memory demand
        },
        "url": {
          "type": "keyword"
        }
      }
    }
  }
}
  2. Created a test ILM policy to achieve shorter rollover times than the existing production cluster ILM policy:
PUT _ilm/policy/test-mediacloud-lifecycle-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "2d"
          }
        }
      }
    }
  }
}
  3. Bootstrapped an initial test index:
PUT mc_search_test-000001
{
  "aliases": {
    "mc_search_test": {
      "is_write_index": true
    }
  }
}
  4. Currently re-indexing from mc_search-XXXX to mc_search_test-XXXX.
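
For reference, the reindex calls look roughly like this (a sketch; one production source index shown, and wait_for_completion=false runs the copy as a background task):

POST _reindex?wait_for_completion=false
{
  "source": { "index": "mc_search-000001" },
  "dest": { "index": "mc_search_test-000001" }
}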

Deployed test on http://posey.angwin:8601

@thepsalmist
Contributor

eliminate full_language?
eliminate original_url (always same as url in our use of mcmetadata?)??

An issue arises when we try to eliminate some of the fields in the old (current) indices, such as full_language and original_url.

To eliminate these we'd need to disable dynamic mapping on the new index template, i.e. set "dynamic": "strict".
However, this fails on re-indexing with the error:

"failures": [
{
"index": "mc_search_test-000001",
"id": "495304d59a34dfc7b1870de6553a4fb0fa57a7ac3bfcc7ab1e63b5ff2e957e66",
"cause": {
"type": "strict_dynamic_mapping_exception",
"reason": "[1:17] mapping set to strict, dynamic introduction of [original_url] within [_doc] is not allowed"
},
"status": 400
},
{
"index": "mc_search_test-000001",
"id": "2b70fd87b0961bcc6a6e28b29aea77ce8f6e54fecc36e5abb73969cfca63e2dc",
"cause": {
"type": "strict_dynamic_mapping_exception",
"reason": "[1:17] mapping set to strict, dynamic introduction of [original_url] within [_doc] is not allowed"
},
"status": 400
},
{
"index": "mc_search_test-000001",
"id": "bfc4459583c6ee7816c3cf5ab6979632441d334cf6f0b6b9d16b748448d88d66",
"cause": {
"type": "strict_dynamic_mapping_exception",
"reason": "[1:17] mapping set to strict, dynamic introduction of [original_url] within [_doc] is not allowed"
},
"status": 400
},
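
One possible workaround (just a sketch, not something tested here): strip the dropped fields in the reindex request itself with a Painless script, so the strict mapping never sees them:

POST _reindex
{
  "source": { "index": "mc_search-000001" },
  "dest": { "index": "mc_search_test-000001" },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('original_url'); ctx._source.remove('full_language')"
  }
}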

@thepsalmist
Contributor

Search results from the original index vs a re-index with the new mapping.
Initial results show that aggregations do not appear with the re-indexed template.

  1. New index from re-indexing
[Screenshot 2024-11-13 at 18 01 54]
  2. Old index
[Screenshot 2024-11-13 at 18 00 17]

@pgulley
Member

pgulley commented Nov 13, 2024

Next steps that we discussed:

  1. Create another new index on posey with the same data, but indexed with our production ES mapping
  2. Lift up an A/B testing version of the news_search_api, to hit both the new index and the copy of the old index
    2.a Benchmark query return times and get profiles to compare
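
For the profiling half of 2.a, ES can return a per-shard query profile directly; a minimal sketch (the alias and query term are just placeholders):

GET mc_search_test/_search
{
  "profile": true,
  "query": { "match": { "text_content": "climate" } }
}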

@philbudne
Contributor Author

philbudne commented Nov 13, 2024 via email

@philbudne
Contributor Author

philbudne commented Nov 13, 2024

Regarding aggregations failing with the new indexing: in the production ES, this change in news-search-api:

--- a/client.py
+++ b/client.py
@@ -57,7 +57,7 @@ class QueryBuilder:
         }
         TOP_LANGS = {"toplangs": {"terms": {"field": "language.keyword", "size": 100}}}
         TOP_DOMAINS = {
-            "topdomains": {"terms": {"field": "canonical_domain.keyword", "size": 100}}
+            "topdomains": {"terms": {"field": "canonical_domain", "size": 100}}
         }

got the canonical_domain aggregation working (canonical_domain is a keyword field there), so I hope/suspect changing language.keyword in TOP_LANGS above would get the language aggregation working with a keyword language field.
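
Against the test index, the equivalent raw aggregation request would look roughly like this (a sketch; the alias is the one bootstrapped above):

GET mc_search_test/_search
{
  "size": 0,
  "aggs": {
    "toplangs": { "terms": { "field": "language", "size": 100 } }
  }
}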

@philbudne
Contributor Author

@thepsalmist does the index still exist? I wanted to test against it.

I tried http://posey:9200 but didn't find any indices.

@thepsalmist
Contributor

@thepsalmist does the index still exist? I wanted to test against it.

I tried http://posey:9200 but didn't find any indices.

The dev stack runs on :9220; it should be accessible via http://posey:9220

@philbudne
Contributor Author

philbudne commented Nov 19, 2024 via email

@philbudne
Contributor Author

Changing language.keyword to language (in my direct-to-ES version of mc-providers) fixed the languages aggregation, BUT the top words query dies with:

elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'Fielddata is disabled on [article_title] in [mc_search_test-000001]. Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [article_title] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')

So, as I initially feared, the top words aggregation (as written) requires fielddata, and often fails due to memory circuit breaker trips.

Since the canonical_domain field in this index is wildcard (and not url), I tried both canonical_url:goo.ne.jp (the second to top domain in the index), which took from 15 to 30ms, and canonical_url:*.ne.jp, which didn't seem much different. The noise/variance between tests may be as big as the signal/data, so it may be hard to draw any conclusions.
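
For reference, the leading-wildcard form as a raw query against the wildcard-typed canonical_domain field would be roughly (a sketch):

GET mc_search_test/_search
{
  "query": {
    "wildcard": { "canonical_domain": { "value": "*.ne.jp" } }
  }
}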

Based on 6 samples each, ministat (which uses Student's t-test: https://en.wikipedia.org/wiki/Student%27s_t-test) reported:

@philbudne
Contributor Author

mail% cat goo 
12
17
19
22
15
13
mail% cat star
23
17
22
19
15
16
mail% ministat goo star
x goo
+ star
+------------------------------------------------------------------------------------------------------------------+
|x         x                    *         +         *                    *                              *         +|
|      |____________________________|_____M___A________________M_____A______________|__________________|           |
+------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   6            12            22            16     16.333333     3.7771241
+   6            15            23            18     18.666667     3.2659863
No difference proven at 95.0% confidence

@philbudne
Contributor Author

I was able to run significant_text aggregation on the article_title and text_content fields, for whatever that's worth (maybe nothing?)
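
A minimal sketch of that kind of aggregation (the foreground query term is just a placeholder, and the sampler size is arbitrary):

GET mc_search_test/_search
{
  "size": 0,
  "query": { "match": { "text_content": "election" } },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 500 },
      "aggs": {
        "top_title_terms": { "significant_text": { "field": "article_title" } }
      }
    }
  }
}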

@thepsalmist
Contributor

thepsalmist commented Nov 20, 2024

Xavier wrote:
An issue arises when we try and eliminate some of the fields in the old(current) indices such as full_language and original_url??
They're not large, so keeping them isn't a problem. Keeping them as keyword could be less expensive, and setting index to false might also reduce storage (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html), but could come back to bite us. Changing all the fields at the same time will tell us whether the new schema works at all, which is a good thing and a good starting place. If we want/need to understand the storage and query speed impacts of each change (especially if the changes make anything worse), I think we'll want to make single-field changes. Assuming there aren't interactions, testing single-field changes would give us the clearest picture of the impact of each change.
"article_title": { "type": "text", "fields": { "keyword": { "type": "keyword" } }, "fielddata": false // Turned off to reduce heap memory demand },
fielddata might be necessary to do "top terms" sampling/aggregation? We currently do "top terms" on title.
"canonical_domain": { "type": "wildcard" // Updated to wildcard for more efficient prefix searches },
prefix matching on "url" is of interest, but not on canonical_domain (which has already been reduced to a very simple form).
"indexed_date": { "type": "date" },
From what I've read, date is supposed to be stored in milliseconds (three decimal places) but we currently put in times with microseconds, (six decimal places) and they seem to be retrieved with microseconds intact?! This is a GOOD thing, because it's the most useful pagination key!!! From what I've read, if you want better than millisecond granularity, you're SUPPOSED to need to use "date_nanos"??!
"language": { "type": "keyword" // Changed to keyword type for more efficient aggregations },
I don't know that it makes aggregation more efficient, but keyword could be more storage efficient (though the field should only ever be two letters), and it should only ever be queried with fixed two-letter values.
"publication_date": { "type": "date", "ignore_malformed": true },
In THIS case, we don't care about anything more granular than a date (no time) YYYY-mm-dd (so long as we ignore any included time, rather than rejecting it altogether), IF there is a storage win.
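
A sketch of what keeping them as cheap, non-searchable fields might look like (mapping fragment only; untested here):

"full_language": { "type": "keyword", "index": false },  // kept in _source and doc values, but not searchable
"original_url": { "type": "keyword", "index": false }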

each change (especially if the changes make anything worse), I think we'll want to make single-field changes.

Created new indices based off this comment, one per field change:

Base index: mc_search-000001
Modification to text_content field : mc_search_text_content-000001
Modification to language field : mc_search_language-000001
Modification to canonical_domain: mc_search_canonical_domain-000001
Modification to article_title field: mc_search_article_title-000001

The mc_search_test-000001 index represents all the schema changes combined.

This should allow us to benchmark query changes for each individual field change and for the entire schema change.
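
For the storage side of the comparison, a quick way to line the indices up (a sketch using the _cat API):

GET _cat/indices/mc_search*?v&h=index,docs.count,pri.store.size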

@philbudne
Contributor Author

philbudne commented Nov 20, 2024 via email
