Use Elasticsearch index aliases to avoid search downtime (mdn#3409)
* Use ElasticSearch index aliases to avoid search downtime

Fixes mdn#3098

* Use ElasticSearch index aliases to avoid search downtime

Part of mdn#3098

(The reason this PR doesn't close issue mdn#3098 is that `prod-build.yml` still uses `--update`.)

* flake8

* Apply suggestions from code review

Co-authored-by: Ryan Johnson <[email protected]>

* remove 'already'

* feedbacked

* cleaning up

* cope with --update the first time ever

* Apply suggestions from code review

Co-authored-by: Ryan Johnson <[email protected]>

* feedbacked

Co-authored-by: Ryan Johnson <[email protected]>
peterbe and escattone authored Apr 14, 2021
1 parent f414a9b commit d95e03f
Showing 6 changed files with 136 additions and 96 deletions.
11 changes: 1 addition & 10 deletions .github/workflows/prod-build.yml
@@ -231,16 +231,7 @@ jobs:
# poetry run deployer upload --prune --archived-files ../content/archived.txt ../client/build
poetry run deployer upload ../client/build
poetry run deployer update-lambda-functions ./aws-lambda
# TODO: Depending on how long the upload takes, consider switching to
# use `--update` in this command. It will skip the index re-creation
# and "cake on". This is slightly faster and means no window of time
# where the Elasticsearch server is nearly empty.
# Consider doing something like a clean indexing on Sundays and
# use `--update` on all the other week days.
# poetry run deployer search-index ../client/build --priority-prefix=en-us/docs/web
# TEMPORARY
poetry run deployer search-index ../client/build --update
poetry run deployer search-index ../client/build
# Record the deployment in our Speedcurve account. This should always
# be done after the upload of the documents and Lambda functions, as well
11 changes: 1 addition & 10 deletions .github/workflows/stage-build.yml
@@ -237,16 +237,7 @@ jobs:
# poetry run deployer upload --prune --archived-files ../content/archived.txt ../client/build
poetry run deployer upload ../client/build
poetry run deployer update-lambda-functions ./aws-lambda
# TODO: Depending on how long the upload takes, consider switching to
# use `--update` in this command. It will skip the index re-creation
# and "cake on". This is slightly faster and means no window of time
# where the Elasticsearch server is nearly empty.
# Consider doing something like a clean indexing on Sundays and
# use `--update` on all the other week days.
# poetry run deployer search-index ../client/build --priority-prefix=en-us/docs/web
# TEMPORARY
poetry run deployer search-index ../client/build --update
poetry run deployer search-index ../client/build
# Record the deployment in our Speedcurve account. This should always
# be done after the upload of the documents and Lambda functions, as well
44 changes: 24 additions & 20 deletions deployer/README.md
@@ -131,6 +131,30 @@ This is to make it convenient in GitHub Actions to control the
execution purely based on the presence of the
environment variable.

### About Elasticsearch aliases

The default behavior is that each fresh indexing run creates an index with a new,
timestamped name, e.g. `mdn_docs_20210331093714`. On top of that there's an alias
with a more "generic" name, e.g. `mdn_docs`. The alias name is what Kuma sends its
search queries to.

The way indexing works is that we leave the existing index and its alias in place,
fill up a new index, and once that has succeeded, we atomically "move the alias"
and delete the old index. To demonstrate, consider this example timeline:

- Yesterday: index `mdn_docs_20210330093714` and `mdn_docs --> mdn_docs_20210330093714`
- Today:
- create new index `mdn_docs_20210331094500`
- populate `mdn_docs_20210331094500` (could take a long time)
  - atomically re-assign the alias `mdn_docs --> mdn_docs_20210331094500` and delete the old index `mdn_docs_20210330093714`

Note, this only applies if you _don't_ use `--update`.
If you use `--update`, it just keeps adding to the existing timestamped index
instead of creating a new one.

What this means is that **there is zero downtime for search queries**. Nothing
needs to be reconfigured on the Kuma side.
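
To make this concrete, here is a minimal sketch (not part of the deployer code) of
what the atomic alias swap looks like with the low-level Python Elasticsearch
client; the URL and index names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical URL

# One atomic request: point the alias at the new index and delete the old
# one. Searches against "mdn_docs" never observe a missing or empty index.
es.indices.update_aliases(
    {
        "actions": [
            {"add": {"index": "mdn_docs_20210331094500", "alias": "mdn_docs"}},
            {"remove_index": {"index": "mdn_docs_20210330093714"}},
        ]
    }
)
```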

### To update or to start afresh

The default behavior is that it deletes the index first and immediately creates
@@ -148,26 +172,6 @@ only from time to time omit it for a fresh new start.
But note, if you omit `--update` (i.e. recreate the index), search
will still work. It just may find less than it will once it's fully indexed.
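
For example, to keep adding to the existing index:

```sh
poetry run deployer search-index ../client/build --update
```

and to start a fresh index (with the alias swap described above happening at the end):

```sh
poetry run deployer search-index ../client/build
```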

### Priority prefixes

When you index without `--update` it will delete and recreate the index.
That means that during the time you're indexing, the searches that are happening
concurrently will not find much. That's probably OK in most cases but you
can adjust the priority of what gets indexed first. This has the advantage
that most searches, that are expecting to find content in the popular document,
will get something useful while the indexing is going on.

To set up one or multiple priority prefixes use the `--priority-prefixes`
(or just `-p` for short). Best described with an example:

```sh
poetry run deployer search-index ../client/build -p en-us/docs/web -p en-us/docs
```

This will first index all the files whose `index.json` file path (relative
to the root) matches `en-us/docs/web`. Then it does all that match `en-us/docs`.
And lastly, it does all the files that don't match any of the prefixes.

## Analyze PR builds

When you've built files you can analyze those built files to produce a Markdown
7 changes: 0 additions & 7 deletions deployer/src/deployer/main.py
@@ -337,12 +337,6 @@ def speedcurve_deploy(ctx, **kwargs):
    default=CI,
    show_default=True,
)
@click.option(
    "--priority-prefix",
    "-p",
    multiple=True,
    help="Specific folder prefixes to index first.",
)
@click.argument("buildroot", type=click.Path(), callback=validate_directory)
@click.pass_context
def search_index(ctx, buildroot: Path, **kwargs):
@@ -360,7 +354,6 @@ def search_index(ctx, buildroot: Path, **kwargs):
        url,
        update=kwargs["update"],
        no_progressbar=kwargs["no_progressbar"],
        priority_prefixes=kwargs["priority_prefix"],
    )


147 changes: 99 additions & 48 deletions deployer/src/deployer/search/__init__.py
@@ -2,22 +2,26 @@
import re
import time
from pathlib import Path
from collections import defaultdict, Counter
from collections import Counter

import click
from elasticsearch.helpers import streaming_bulk
from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections
from selectolax.parser import HTMLParser

from .models import Document
from .models import Document, INDEX_ALIAS_NAME


class IndexAliasError(Exception):
    """When there's something wrong with finding the index alias."""


def index(
    buildroot: Path,
    url: str,
    update=False,
    no_progressbar=False,
    priority_prefixes: (str) = (),
):
    # We can confidently use a single host here because we're not searching
    # a cluster.
@@ -34,48 +38,51 @@ def index(

click.echo(f"Found {count_todo:,} (potential) documents to index")

    # Confusingly, `._index` is actually not a private API.
    # It's the documented way you're supposed to reach it.
    document_index = Document._index
    if not update:
    if update:
        index_name = None
        has_old_index = False
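        # Note: with no arguments, connection.indices.get_alias() returns a
        # dict of all indices keyed by name, e.g.
        # {"mdn_docs_20210330093714": {"aliases": {"mdn_docs": {}}}}
        # (example data; shape per elasticsearch-py).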
        for name in connection.indices.get_alias():
            if name.startswith("mdn_docs_"):
                index_name = name
                break
            elif name == INDEX_ALIAS_NAME:
                has_old_index = True
        else:
            if not has_old_index:
                raise IndexAliasError("Unable to find an index called mdn_docs_*")

        document_index = Index(index_name)
    else:
        # Confusingly, `._index` is actually not a private API.
        # It's the documented way you're supposed to reach it.
        document_index = Document._index
        click.echo(
            "Deleting any possible existing index "
            f"and creating a new one called {document_index._name}"
            f"and creating a new one called {document_index._name!r}"
        )
        document_index.delete(ignore=404)
        document_index.create()

    search_prefixes = [None]
    for prefix in reversed(priority_prefixes):
        search_prefixes.insert(0, prefix)

    count_by_prefix = defaultdict(int)

    already = set()

    skipped = []

    def generator():
        for prefix in search_prefixes:
            root = Path(buildroot)
            if prefix:
                root /= prefix
            for doc in walk(root):
                if doc in already:
                    continue
                already.add(doc)
                search_doc = to_search(doc)
                if search_doc:
                    count_by_prefix[prefix] += 1
                    yield search_doc.to_dict(True)
                else:
                    # The reason something might be chosen to be skipped is because
                    # there's logic that kicks in only when the `index.json` file
                    # has been opened and parsed.
                    # Keep a count of all of these. It's used to make sure the
                    # progressbar, if used, ticks as many times as the estimate
                    # count was.
                    skipped.append(1)
        root = Path(buildroot)
        for doc in walk(root):
            # The reason for specifying the exact index name is that we might
            # be doing an update and if you don't specify it, elasticsearch_dsl
            # will fall back to using whatever Document._meta.Index automatically
            # becomes in this moment.
            search_doc = to_search(doc, _index=document_index._name)
            if search_doc:
                yield search_doc.to_dict(True)
            else:
                # The reason something might be chosen to be skipped is because
                # there's logic that kicks in only when the `index.json` file
                # has been opened and parsed.
                # Keep a count of all of these. It's used to make sure the
                # progressbar, if used, ticks as many times as the estimate
                # count was.
                skipped.append(1)

    def get_progressbar():
        if no_progressbar:
@@ -90,7 +97,6 @@ def get_progressbar():
        for success, info in streaming_bulk(
            connection,
            generator(),
            index=document_index._name,
            # It's an inexact science as to what the perfect chunk_size should be.
            # The default according to
            # https://elasticsearch-py.readthedocs.io/en/v7.12.0/helpers.html#elasticsearch.helpers.streaming_bulk
@@ -122,6 +128,56 @@ def get_progressbar():
        for skip in skipped:
            bar.update(1)

    # Now when the index has been filled, we need to make sure we
    # correct any previous indexes.
    if update:
        # When you do an update, Elasticsearch will internally delete the
        # previous docs (based on the _id primary key we set).
        # Normally, Elasticsearch will do this when you restart the cluster
        # but that's not something we usually do.
        # See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
        if not has_old_index:
            document_index.forcemerge()
    else:
        # (Apr 2021) In the olden days, before using aliases, we used to call
        # the index what is today the alias. We have to delete that index
        # to make room for using that name as the alias.
        # This code can be deleted any time after it's been useful once
        # in all deployments.
        legacy_index = Index(INDEX_ALIAS_NAME)
        if legacy_index.exists():
            # But `.exists()` will be true if it's an alias as well! It basically
            # answers: "Yes, it's an index or an alias".
            # We only want to delete it if it's an index. The test for that
            # is to see if it does *not* have an alias.
            if (
                INDEX_ALIAS_NAME in legacy_index.get_alias()
                and not legacy_index.get_alias()[INDEX_ALIAS_NAME]["aliases"]
            ):
                click.echo(
                    f"Delete the old {INDEX_ALIAS_NAME!r} index from when it was "
                    "an actual index."
                )
                legacy_index.delete(ignore=404)

        # Now we're going to bundle the change to set the alias to point
        # to the new index and delete all old indexes.
        # The reason for doing this together in one update is to make it atomic.
        alias_updates = [
            {"add": {"index": document_index._name, "alias": INDEX_ALIAS_NAME}}
        ]
        for index_name in connection.indices.get_alias():
            if index_name.startswith("mdn_docs_"):
                if index_name != document_index._name:
                    alias_updates.append({"remove_index": {"index": index_name}})
                    click.echo(f"Delete old index {index_name!r}")

        connection.indices.update_aliases({"actions": alias_updates})
        click.echo(
            f"Reassign the {INDEX_ALIAS_NAME!r} alias from old index "
            f"to {document_index._name}"
        )

    t1 = time.time()
    took = t1 - t0
    rate = count_done / took
@@ -139,15 +195,6 @@ def get_progressbar():
        for error, count in errors_counter.most_common():
            click.echo(f"{count:,}\t{error[:80]}")

    if priority_prefixes:
        click.echo("Counts per priority prefixes:")
        rest = sum(v for v in count_by_prefix.values())
        for prefix in priority_prefixes:
            click.echo(f"\t{prefix:<30} {count_by_prefix[prefix]:,}")
            rest -= count_by_prefix[prefix]
        prefix = "*rest*"
        click.echo(f"\t{prefix:<30} {rest:,}")


class VoidProgressBar:
    def __enter__(self):
@@ -184,7 +231,7 @@ def walk(root):
            yield path


def to_search(file):
def to_search(file, _index=None):
    with open(file) as f:
        data = json.load(f)
    if "doc" not in data:
@@ -207,7 +254,8 @@ def to_search(file):
        # files.
        return
    locale = locale[1:]
    return Document(
    d = Document(
        _index=_index,
        _id=doc["mdn_url"],
        title=doc["title"],
        # This is confusing. But it's worth hacking around for now until
@@ -268,6 +316,9 @@ def to_search(file):
        slug=slug.lower(),
        locale=locale.lower(),
    )
    return d


_display_none_regex = re.compile(r"display:\s*none")
12 changes: 11 additions & 1 deletion deployer/src/deployer/search/models.py
@@ -1,3 +1,5 @@
import datetime

from elasticsearch_dsl import (
    Boolean,
    Document as ESDocument,
@@ -9,6 +11,12 @@
    char_filter,
)

# Note, this is the name that the Kuma code will use when sending Elasticsearch
# search queries.
# We always build an index whose name is this alias name plus a
# `_YYYYMMDDHHMMSS` timestamp suffix.
INDEX_ALIAS_NAME = "mdn_docs"
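# For example, an index created at 2021-04-14 10:20:30 UTC would be named
# "mdn_docs_20210414102030" (hypothetical timestamp).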

"""
A great way to debug analyzers is with the `Document._index.analyze()` API which
is a convenient abstraction for the Elasticsearch Analyze API
@@ -172,4 +180,6 @@ class Document(ESDocument):
    popularity = Float()

    class Index:
        name = "mdn_docs"
        name = (
            f'{INDEX_ALIAS_NAME}_{datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")}'
        )
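        # Note: this name is computed once, at import (class definition) time,
        # so a single deployer run uses one fresh timestamped index name
        # throughout.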
