Extract indexing
rejasupotaro committed Nov 4, 2024
1 parent f3fda40 commit c8e86e8
Showing 31 changed files with 5,284 additions and 2,204 deletions.
12 changes: 10 additions & 2 deletions Makefile
@@ -1,5 +1,6 @@
 REGION?=asia-northeast1
 TRAINING_IMAGE_URI:=gcr.io/$(PROJECT_ID)/training:latest
+INDEXING_IMAGE_URI:=gcr.io/$(PROJECT_ID)/indexing:latest
 TEMPLATES_DIR:=templates

 # -------------------------------------
@@ -11,9 +12,16 @@ lint:
 	python -m mypy src/dense-retrieval/src --explicit-package-bases --namespace-packages
 	python -m mypy src/amazon-product-search/src --explicit-package-bases --namespace-packages
 	python -m mypy src/training/src --explicit-package-bases --namespace-packages
+	python -m mypy src/indexing/src --explicit-package-bases --namespace-packages

-.PHONY: build
-build:
+.PHONY: build_training
+build_training:
 	gcloud builds submit . \
 	    --config=cloudbuild.yaml \
 	    --substitutions=_DOCKERFILE=src/training/Dockerfile,_IMAGE=${TRAINING_IMAGE_URI}
+
+.PHONY: build_indexing
+build_indexing:
+	gcloud builds submit . \
+	    --config=cloudbuild.yaml \
+	    --substitutions=_DOCKERFILE=src/indexing/Dockerfile,_IMAGE=${INDEXING_IMAGE_URI}
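
The two build targets differ only in the Dockerfile path and the image URI passed to Cloud Build as substitutions. A minimal Python sketch of that pattern follows. The helper name `build_command` and the project ID `my-project` are hypothetical and do not appear in the repository; the sketch only assembles the argument list and does not invoke `gcloud`.

```python
def build_command(component: str, project_id: str) -> list[str]:
    """Assemble the gcloud call that the build_<component> target runs.

    Hypothetical helper for illustration; the repository uses Makefile targets.
    """
    # Mirrors TRAINING_IMAGE_URI / INDEXING_IMAGE_URI in the Makefile.
    image_uri = f"gcr.io/{project_id}/{component}:latest"
    # Each component keeps its own Dockerfile under src/<component>/.
    dockerfile = f"src/{component}/Dockerfile"
    return [
        "gcloud", "builds", "submit", ".",
        "--config=cloudbuild.yaml",
        f"--substitutions=_DOCKERFILE={dockerfile},_IMAGE={image_uri}",
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    for component in ("training", "indexing"):
        print(" ".join(build_command(component, "my-project")))
```

Because both targets share the root `cloudbuild.yaml`, adding another image would presumably only require a new Dockerfile path and image URI variable.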
18 changes: 0 additions & 18 deletions src/amazon-product-search/Dockerfile.indexing

This file was deleted.

15 changes: 0 additions & 15 deletions src/amazon-product-search/README.md
@@ -25,18 +25,3 @@ Clone https://github.com/amazon-science/esci-data and copy `esci-data/shopping_q
 ```shell
 $ poetry run inv data.merge-and-split
 ```
-
-## Index Products
-
-This project involves indexing products into search engines. If you'd like to test it on your own machine, you can start by launching Elasticsearch or Vespa locally. Then, execute the document indexing pipeline against the created index.
-
-```shell
-$ docker compose --profile elasticsearch up
-$ poetry run inv es.create-index --index-name=products_jp
-$ poetry run inv indexing.feed \
-    --index-name=products_jp \
-    --locale=jp \
-    --dest=es \
-    --dest-host=http://localhost:9200 \
-    --nrows=10
-```
12 changes: 0 additions & 12 deletions src/amazon-product-search/cloudbuild.yaml

This file was deleted.

1,071 changes: 3 additions & 1,068 deletions src/amazon-product-search/poetry.lock

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions src/amazon-product-search/pyproject.toml
@@ -5,9 +5,8 @@ description = ""
 authors = ["rejasupotaro"]

 [tool.poetry.dependencies]
-python = "3.11.8"
+python = "~3.11"
 pandas = "^1.5.0"
-apache-beam = {version = "2.50.0", extras = ["gcp"]}
 fugashi = {extras = ["unidic"], version = "^1.2.0"}
 ipadic = "^1.0.0"
 elasticsearch = "8.12.1"
2 changes: 0 additions & 2 deletions src/amazon-product-search/tasks/__init__.py
@@ -3,7 +3,6 @@
 from tasks import (
     data_tasks,
     es_tasks,
-    indexing_tasks,
     model_tasks,
     synonyms_tasks,
     vespa_tasks,
@@ -20,7 +19,6 @@ def verify(c):
 ns.add_task(verify)
 ns.add_collection(Collection.from_module(data_tasks, name="data"))
 ns.add_collection(Collection.from_module(es_tasks, name="es"))
-ns.add_collection(Collection.from_module(indexing_tasks, name="indexing"))
 ns.add_collection(Collection.from_module(model_tasks, name="model"))
 ns.add_collection(Collection.from_module(synonyms_tasks, name="synonyms"))
 ns.add_collection(Collection.from_module(vespa_tasks, name="vespa"))
4 changes: 2 additions & 2 deletions src/dense-retrieval/poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion src/dense-retrieval/pyproject.toml
@@ -5,7 +5,7 @@ description = ""
 authors = ["rejasupotaro"]

 [tool.poetry.dependencies]
-python = "^3.11"
+python = "~3.11"
 pandas = "^1.5.0"
 pytest = "^7.3.1"
 transformers = "^4.28.1"
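
Both `pyproject.toml` changes in this commit settle on `python = "~3.11"`, replacing the exact pin `3.11.8` in amazon-product-search and the caret constraint `^3.11` in dense-retrieval. The sketch below illustrates how Poetry interprets these three styles. It is a simplified illustration, not Poetry's implementation: the helper names `parse`, `pad`, and `allows` are hypothetical, only plain numeric versions are handled, and the caret branch assumes a non-zero major version.

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a plain numeric version such as "3.11.8" into integers."""
    return tuple(int(part) for part in version.split("."))


def pad(parts: tuple[int, ...], length: int = 3) -> tuple[int, ...]:
    """Right-pad with zeros so "3.11" compares as (3, 11, 0)."""
    return parts + (0,) * (length - len(parts))


def allows(constraint: str, version: str) -> bool:
    """Return True if `version` satisfies the Poetry-style `constraint`."""
    candidate = pad(parse(version))
    if constraint[0] in "~^":
        base = parse(constraint[1:])
        lower = pad(base)
        if constraint[0] == "^" or len(base) == 1:
            # Caret (non-zero major) or "~<major>": stay below the next major.
            upper = (base[0] + 1, 0, 0)
        else:
            # Tilde with a minor given: stay below the next minor.
            upper = (base[0], base[1] + 1, 0)
        return lower <= candidate < upper
    # No operator: an exact pin.
    return candidate == pad(parse(constraint))


if __name__ == "__main__":
    for constraint in ("3.11.8", "^3.11", "~3.11"):
        for version in ("3.11.8", "3.11.9", "3.12.0"):
            print(constraint, version, allows(constraint, version))
```

Under these rules `~3.11` accepts any 3.11.x patch release but rejects 3.12, which is looser than the exact pin and tighter than `^3.11`.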
30 changes: 30 additions & 0 deletions src/indexing/Dockerfile
@@ -0,0 +1,30 @@
+ARG PYTHON_VERSION=3.11
+ARG BEAM_VERSION=2.49.0
+
+FROM apache/beam_python${PYTHON_VERSION}_sdk:${BEAM_VERSION}
+
+ENV PYTHONPATH src:
+
+WORKDIR /app/src/dense-retrieval
+
+COPY src/dense-retrieval/pyproject.toml pyproject.toml
+COPY src/dense-retrieval/poetry.lock poetry.lock
+COPY src/dense-retrieval/src src
+
+WORKDIR /app/src/amazon-product-search
+
+COPY src/amazon-product-search/pyproject.toml pyproject.toml
+COPY src/amazon-product-search/poetry.lock poetry.lock
+COPY src/amazon-product-search/src src
+
+WORKDIR /app/src/indexing
+
+COPY src/indexing/pyproject.toml pyproject.toml
+COPY src/indexing/poetry.lock poetry.lock
+COPY src/indexing/src src
+
+RUN pip install --upgrade pip && \
+    pip install -U poetry --no-cache-dir
+RUN poetry config virtualenvs.create false && \
+    poetry install --without dev --no-interaction --no-ansi
+RUN python -m unidic download
34 changes: 34 additions & 0 deletions src/indexing/README.md
@@ -0,0 +1,34 @@
+# Indexing - Amazon Product Search
+
+## Installation
+
+```shell
+$ pyenv install 3.11.8
+$ pyenv local 3.11.8
+$ pip install poetry
+$ poetry env use python
+$ poetry install
+```
+
+The following libraries are necessary for Japanese text processing.
+
+```shell
+# For macOS
+$ brew install mecab mecab-ipadic
+$ poetry run python -m unidic download
+```
+
+## Index Products
+
+This project involves indexing products into search engines. If you'd like to test it on your own machine, you can start by launching Elasticsearch or Vespa locally. Then, execute the document indexing pipeline against the created index.
+
+```shell
+$ docker compose --profile elasticsearch up
+$ poetry run inv es.create-index --index-name=products_jp
+$ poetry run inv indexing.feed \
+    --index-name=products_jp \
+    --locale=jp \
+    --dest=es \
+    --dest-host=http://localhost:9200 \
+    --nrows=10
+```