Refresh metastore on Lambda searcher #4985

Merged 10 commits on May 16, 2024
12 changes: 9 additions & 3 deletions .github/workflows/publish_lambda_packages.yml
@@ -16,9 +16,15 @@ jobs:
- name: Install rustup
run: curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain none -y
- name: Install python dependencies
run: pip install ./distribution/lambda
- name: Mypy lint
run: mypy distribution/lambda/
run: |
pip install --user pipenv
pipenv install --system
working-directory: ./distribution/lambda
- name: Lint and format
run: |
mypy .
black . --check
working-directory: ./distribution/lambda
- name: Retrieve and export commit date, hash, and tags
run: |
echo "QW_COMMIT_DATE=$(TZ=UTC0 git log -1 --format=%cd --date=format-local:%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_ENV
9 changes: 5 additions & 4 deletions distribution/lambda/Makefile
@@ -27,6 +27,7 @@ package:
if [ "$${QW_LAMBDA_BUILD:-0}" = "1" ]
then
pushd ../../quickwit/
rustc --version
cargo lambda build \
-p quickwit-lambda \
--release \
@@ -60,10 +61,10 @@ bootstrap:
cdk bootstrap aws://$$CDK_ACCOUNT/$$CDK_REGION

deploy-hdfs: package check-env
cdk deploy -a cdk/app.py HdfsStack
cdk deploy --require-approval never -a cdk/app.py HdfsStack

deploy-mock-data: package check-env
cdk deploy -a cdk/app.py MockDataStack
cdk deploy --require-approval never -a cdk/app.py MockDataStack

print-mock-data-metastore: check-env
python -c 'from cdk import cli; cli.print_mock_data_metastore()'
@@ -76,11 +77,11 @@ before-destroy:

destroy-hdfs: before-destroy check-env
python -c 'from cdk import cli; cli.empty_hdfs_bucket()'
cdk destroy --force -a cdk/app.py HdfsStack
cdk destroy --force -a cdk/app.py HdfsStack

destroy-mock-data: before-destroy check-env
python -c 'from cdk import cli; cli.empty_mock_data_buckets()'
cdk destroy --force -a cdk/app.py MockDataStack
cdk destroy --force -a cdk/app.py MockDataStack

clean:
rm -rf cdk.out
23 changes: 23 additions & 0 deletions distribution/lambda/Pipfile
@@ -0,0 +1,23 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
cdk = {file = "cdk", editable = true}
aws-cdk-lib = "2.95.1"
cargo-lambda = "1.1.0"
constructs = "10.3.0"
pyyaml = "6.0.1"
black = "24.3.0"
boto3 = "1.28.59"
mypy = "1.7.0"
ziglang = "0.11.0"

# types
boto3-stubs = "1.28.59"
types-requests = "2.31.0.2"
types-pyyaml = "6.0.12.11"

[requires]
python_version = "3.10"
441 changes: 441 additions & 0 deletions distribution/lambda/Pipfile.lock

Large diffs are not rendered by default.

67 changes: 39 additions & 28 deletions distribution/lambda/README.md
@@ -31,36 +31,13 @@ console](https://console.aws.amazon.com/servicequotas/home/services/lambda/quota

### Python venv

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
executable in your path with access to the `venv` package. If for any reason the
automatic creation of the virtualenv fails, you can create the virtualenv
manually.
The Python environment is configured using pipenv:

To manually create a virtualenv on MacOS and Linux:

```bash
python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```bash
source .venv/bin/activate
```

Once the virtualenv is activated, you can install the required dependencies.

```bash
pip install .
```

If you prefer using Poetry, achieve the same by running:
```bash
poetry shell
poetry install
# Install pipenv if needed.
pip install --user pipenv
pipenv shell
pipenv install
```

### Example stacks
@@ -99,6 +76,40 @@ make deploy-mock-data
make invoke-mock-data-searcher
```

### Configurations

The following environment variables can be configured on the Lambda functions.
Note that only a small subset of Quickwit's configuration is exposed, to
simplify setup and avoid unstable deployments.

| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_INDEX_ID | the index this Lambda interacts with (one and only one) | required |
| QW_LAMBDA_METASTORE_BUCKET | bucket name for metastore files | required |
| QW_LAMBDA_INDEX_BUCKET | bucket name for split files | required |
| QW_LAMBDA_OPENTELEMETRY_URL | HTTP OTEL tracing collector endpoint | none, OTEL disabled |
| QW_LAMBDA_OPENTELEMETRY_AUTHORIZATION | Authorization header value for HTTP OTEL calls | none, OTEL disabled |
| QW_LAMBDA_ENABLE_VERBOSE_JSON_LOGS | true to enable JSON logging of spans and logs in CloudWatch | false |
| RUST_LOG | [Rust logging config][1] | info |

[1]: https://rust-lang-nursery.github.io/rust-cookbook/development_tools/debugging/config_log.html


Indexer only:
| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_INDEX_CONFIG_URI | location of the index configuration file, e.g. `s3://mybucket/index-config.yaml` | required |
| QW_LAMBDA_DISABLE_MERGE | true to disable compaction merges | false |
| QW_LAMBDA_DISABLE_JANITOR | true to disable retention enforcement and garbage collection | false |
| QW_LAMBDA_MAX_CHECKPOINTS | maximum number of ingested file names to keep in source history | 100 |

Searcher only:
| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_SEARCHER_METASTORE_POLLING_INTERVAL_SECONDS | refresh interval of the metastore | 60 |
| QW_LAMBDA_PARTIAL_REQUEST_CACHE_CAPACITY | `searcher.partial_request_cache_capacity` node config | 64M |


### Set up a search API

You can configure an HTTP API endpoint around the Quickwit Searcher Lambda. The
7 changes: 4 additions & 3 deletions distribution/lambda/cdk/app.py
@@ -4,9 +4,9 @@

import aws_cdk as cdk

from cdk.stacks.services.quickwit_service import DEFAULT_LAMBDA_MEMORY_SIZE
from cdk.stacks.examples.hdfs_stack import HdfsStack
from cdk.stacks.examples.mock_data_stack import MockDataStack
from stacks.services.quickwit_service import DEFAULT_LAMBDA_MEMORY_SIZE
from stacks.examples.hdfs_stack import HdfsStack
from stacks.examples.mock_data_stack import MockDataStack

HDFS_STACK_NAME = "HdfsStack"
MOCK_DATA_STACK_NAME = "MockDataStack"
@@ -50,6 +50,7 @@ def package_location_from_env(type: Literal["searcher"] | Literal["indexer"]) ->
indexer_package_location=package_location_from_env("indexer"),
searcher_package_location=package_location_from_env("searcher"),
search_api_key=os.getenv("SEARCHER_API_KEY", None),
data_generation_interval_sec=int(os.getenv("DATA_GENERATION_INTERVAL_SEC", 300)),
)

app.synth()
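
The interval is read with `int(os.getenv(...))`, so a non-numeric value surfaces as a bare `ValueError` at synth time. EventBridge rate expressions are written in minutes, hours, or days, so a value that is not a whole number of minutes is unlikely to yield a valid schedule. A minimal sketch of a stricter parser follows; the helper name `interval_from_env` and the multiple-of-60 check are assumptions for illustration, not code from this PR.

```python
import os


def interval_from_env(
    name: str = "DATA_GENERATION_INTERVAL_SEC", default: int = 300
) -> int:
    """Hypothetical helper: read a schedule interval in seconds from the environment."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        seconds = int(raw)
    except ValueError:
        raise ValueError(
            f"{name} must be an integer number of seconds, got {raw!r}"
        ) from None
    # Assumption: only accept a positive whole number of minutes, because
    # EventBridge rate expressions use minutes as their smallest unit.
    if seconds <= 0 or seconds % 60 != 0:
        raise ValueError(f"{name} must be a positive multiple of 60, got {seconds}")
    return seconds
```

With such a helper, the stack argument could be written as `data_generation_interval_sec=interval_from_env()`, which keeps the 300-second default used in the diff.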
4 changes: 2 additions & 2 deletions distribution/lambda/cdk/cli.py
@@ -19,8 +19,8 @@
import boto3
import botocore.config
import botocore.exceptions
from cdk import app
from cdk.stacks.examples import hdfs_stack, mock_data_stack
from . import app
from stacks.examples import hdfs_stack, mock_data_stack

region = os.environ["CDK_REGION"]

7 changes: 7 additions & 0 deletions distribution/lambda/cdk/setup.py
@@ -0,0 +1,7 @@
from setuptools import setup, find_packages

setup(
name="cdk",
version="0.1.0",
packages=find_packages(),
)
1 change: 1 addition & 0 deletions distribution/lambda/cdk/stacks/examples/hdfs_stack.py
@@ -51,6 +51,7 @@ def __init__(
searcher_memory_size=searcher_memory_size,
indexer_package_location=indexer_package_location,
searcher_package_location=searcher_package_location,
indexer_timeout=aws_cdk.Duration.minutes(10),
)

aws_cdk.CfnOutput(
14 changes: 12 additions & 2 deletions distribution/lambda/cdk/stacks/examples/mock_data_stack.py
@@ -30,6 +30,7 @@ def __init__(
construct_id: str,
index_id: str,
qw_svc: quickwit_service.QuickwitService,
data_generation_interval_sec: int,
**kwargs,
):
super().__init__(scope, construct_id, **kwargs)
@@ -59,7 +60,9 @@ def __init__(
rule = aws_events.Rule(
self,
"ScheduledRule",
schedule=aws_events.Schedule.rate(aws_cdk.Duration.minutes(5)),
schedule=aws_events.Schedule.rate(
aws_cdk.Duration.seconds(data_generation_interval_sec)
),
)
rule.add_target(aws_events_targets.LambdaFunction(generator_lambda))

@@ -139,6 +142,7 @@ def __init__(
indexer_package_location: str,
searcher_package_location: str,
search_api_key: str | None = None,
data_generation_interval_sec: int = 300,
**kwargs,
) -> None:
"""If `search_api_key` is not set, the search API is not deployed."""
@@ -167,7 +171,13 @@ def __init__(
searcher_package_location=searcher_package_location,
)

Source(self, "Source", index_id=index_id, qw_svc=qw_svc)
Source(
self,
"Source",
index_id=index_id,
qw_svc=qw_svc,
data_generation_interval_sec=data_generation_interval_sec,
)

if search_api_key is not None:
SearchAPI(
4 changes: 2 additions & 2 deletions distribution/lambda/cdk/stacks/services/indexer_service.py
@@ -13,6 +13,7 @@ def __init__(
index_config_bucket: str,
index_config_key: str,
memory_size: int,
timeout: aws_cdk.Duration,
environment: dict[str, str],
asset_path: str,
**kwargs,
@@ -32,8 +33,7 @@ def __init__(
"QW_LAMBDA_INDEX_CONFIG_URI": f"s3://{index_config_bucket}/{index_config_key}",
**environment,
},
# use a strict timeout and retry policy to avoid unexpected costs
timeout=aws_cdk.Duration.minutes(1),
timeout=timeout,
retry_attempts=0,
reserved_concurrent_executions=1,
memory_size=memory_size,
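
Because `**environment` is unpacked after the `QW_LAMBDA_INDEX_CONFIG_URI` entry, a key supplied by the caller takes precedence over the value computed in this hunk. The snippet below is a standalone illustration of that dict-unpacking rule; the function name `build_environment` and the sample bucket and key are made up for the example.

```python
def build_environment(
    index_config_bucket: str,
    index_config_key: str,
    environment: dict[str, str],
) -> dict[str, str]:
    """Illustrative only: mirrors the merge order visible in the diff."""
    return {
        "QW_LAMBDA_INDEX_CONFIG_URI": f"s3://{index_config_bucket}/{index_config_key}",
        # Later keys win, so caller-supplied entries override the default above.
        **environment,
    }


if __name__ == "__main__":
    print(build_environment("mybucket", "index-config.yaml", {"RUST_LOG": "debug"}))
```

Callers who want the computed URI to be authoritative would need to place it after the unpacked dictionary instead.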
3 changes: 3 additions & 0 deletions distribution/lambda/cdk/stacks/services/quickwit_service.py
@@ -32,6 +32,8 @@ def __init__(
indexer_package_location: str,
indexer_memory_size: int = DEFAULT_LAMBDA_MEMORY_SIZE,
indexer_environment: dict[str, str] = {},
# small default timeout to avoid unexpected costs and hanging indexers
indexer_timeout: aws_cdk.Duration = aws_cdk.Duration.minutes(1),
searcher_memory_size: int = DEFAULT_LAMBDA_MEMORY_SIZE,
searcher_environment: dict[str, str] = {},
**kwargs,
@@ -55,6 +57,7 @@ def __init__(
index_config_bucket=index_config_bucket,
index_config_key=index_config_key,
memory_size=indexer_memory_size,
timeout=indexer_timeout,
environment=indexer_environment,
asset_path=indexer_package_location,
)