namehash/namegraph

Install

pip3 install -e .

Usage

The application can be run in two ways:

  • reading queries from stdin
  • taking the query as an argument

Additional resources need to be downloaded:

python download.py # dictionaries, embeddings
python download_names.py

Queries from stdin

python ./generator/app.py app.input=stdin

Query as an argument

The application can be run with

python ./generator/app.py

It will generate suggestions for the default query.

The default parameters are defined in conf/config.yaml. Any parameter can be overridden on the command line by giving its dot-separated path and a new value, e.g.

python ./generator/app.py app.query=firepower

will substitute the default query with the provided one.
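
The effect of a dot-separated override can be pictured as an update to a nested dictionary. The snippet below is only a minimal sketch of that idea, not the project's configuration code: `apply_override` and the sample defaults are hypothetical, and values are kept as plain strings.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply a single 'dotted.path=value' override to a nested dict in place."""
    path, _, value = override.partition("=")
    keys = path.split(".")
    node = config
    # Walk down to the parent of the final key, creating levels if needed.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


# Hypothetical defaults standing in for conf/config.yaml.
config = {"app": {"query": "default", "input": "query"}}
apply_override(config, "app.query=firepower")
print(config["app"]["query"])  # firepower
```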

The parameters are documented in the config.

REST API

Start server:

python -m uvicorn web_api:app --reload

Query with POST:

curl -d '{"label":"fire"}' -H "Content-Type: application/json" -X POST http://localhost:8000
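
The same request can be sent from Python with the standard library only. This is a sketch under two assumptions: the server was started locally as shown above, and it answers with a JSON body. The helper names `build_request` and `query` are hypothetical.

```python
import json
import urllib.request


def build_request(label: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    body = json.dumps({"label": label}).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(label: str, base_url: str = "http://localhost:8000"):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(label, base_url)) as response:
        return json.load(response)


# Requires a running server:
# suggestions = query("fire")
```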

Tests

Run:

pytest

or without slow tests:

pytest -m "not slow"

Debugging

Run app with app.logging_level=DEBUG to see debug information:

python generator/app.py app.input=stdin app.logging_level=DEBUG

Deployment

Build Docker image locally

Set image TAG:

export TAG=0.1.0

Build the image:

docker compose -f docker-compose.build.yml build

Authenticate with AWS (if you use MFA, you have to obtain temporary access keys from AWS STS):

aws configure

Authorize to ECR:

./authorize-ecr.sh

Push image to ECR:

docker push 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator:${TAG}

Deploy image on remote instance

Set image TAG:

export TAG=0.1.0

Authorize the EC2 instance with ECR:

aws ecr get-login-password | docker login --username AWS --password-stdin 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator

(Re)deploy the image:

docker compose up -d

Check that the service works:

curl -d '{"label":"firestarter"}' -H "Content-Type: application/json" -X POST http://44.203.61.202

Learning-To-Rank

To use the LTR features, you need to configure them in the Elasticsearch instance (see here for more details).

Pipelines, weights, sampler

conf/prod_config_new.yaml defines generator_limits, which cap the number of suggestions produced by each generator. This is a performance optimization. For example:

  generator_limits:
    HyphenGenerator: 128
    AbbreviationGenerator: 128
    EmojiGenerator: 150
    Wikipedia2VGenerator: 100
    RandomAvailableNameGenerator: 20000
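
One way to picture such a cap is to truncate each generator's output lazily. The sketch below is illustrative only: the `limited` helper is hypothetical, and treating a generator without an entry as uncapped is an assumption, not documented behavior.

```python
from itertools import islice

# Hypothetical subset of the generator_limits mapping.
GENERATOR_LIMITS = {"HyphenGenerator": 128, "EmojiGenerator": 150}


def limited(generator_name, suggestions, limits=GENERATOR_LIMITS):
    """Yield at most limits[generator_name] suggestions.

    Assumption: a generator with no entry is not capped (islice stop=None).
    """
    return islice(suggestions, limits.get(generator_name))


capped = list(limited("HyphenGenerator", (f"name-{i}" for i in range(1000))))
print(len(capped))  # 128
```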

conf/pipelines/prod_new.yaml defines the pipelines. Each pipeline has:

  • a name
  • one generator
  • a list of filters, e.g. SubnameFilter, ValidNameFilter, ValidNameLengthFilter, DomainFilter
  • weights for each interpretation type (ngram, person, other) and each language
  • mode_weights_multiplier - a multiplier of the above weights for each mode (e.g. instant, domain_detail, full)
  • global_limits for each mode, each of which can be an integer (absolute number) or a float (percentage of all results); values for the grouped_by_category endpoint can be overridden by adding the prefix grouped_ (e.g. grouped_instant, grouped_domain_detail, grouped_full)

Setting 0 in mode_weights_multiplier or global_limits disables the pipeline in a given mode.
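
The integer/float distinction and the grouped_ prefix can be sketched as a small lookup. This is not the project's code: `resolve_limit` is a hypothetical helper, floats are assumed to be fractions of the total (0.25 meaning 25%), and a mode without an entry is assumed to be uncapped.

```python
def resolve_limit(global_limits: dict, mode: str, total: int, grouped: bool = False) -> int:
    """Return how many suggestions a pipeline may contribute in the given mode."""
    key = f"grouped_{mode}"
    # Use the grouped_ override only for the grouped endpoint and only if present.
    if not (grouped and key in global_limits):
        key = mode
    limit = global_limits.get(key)
    if limit is None:
        return total  # assumption: no entry means no extra cap
    if isinstance(limit, float):
        return int(limit * total)  # assumption: floats are fractions of all results
    return limit  # integers are absolute numbers; 0 disables the pipeline


# Hypothetical limits for one pipeline.
limits = {"instant": 5, "full": 0.25, "grouped_full": 0}
print(resolve_limit(limits, "full", total=100))                # 25
print(resolve_limit(limits, "full", total=100, grouped=True))  # 0
```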

Sampler

Each request defines:

  • mode
  • min_suggestions
  • max_suggestions
  • min_available_fraction

A name can have many interpretations. Every interpretation has a type (ngram, person, other), a language, and a probability. There may be more than one interpretation with the same type and language.

For each pair of type and language, the probability of each pipeline is computed. Sampling then proceeds as follows:

  1. If there are enough suggestions, stop.
  2. If all pipeline probabilities for every pair of type and language are 0, stop.
  3. Sample a type and language, then sample an interpretation within this type and language.
  4. Sample a pipeline for the sampled interpretation. The first pass of sampling is without replacement to increase diversity in the top suggestions.
  5. If the pipeline exceeds its global limit, go to 4.
  6. Get a suggestion from the pipeline (the generator is executed here). If there are no more suggestions, go to 4.
  7. If the suggestion has already been sampled, go to 6.
  8. If the suggestion is not available and there is room only for available suggestions, go to 6.
  9. If the suggestion is not normalized, go to 6.
  10. Go to 1.

Exhausted pipelines are removed from sampling.
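
The loop above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the project's sampler: each (type, language) pair is reduced to a single probability, pipelines are plain iterables that do not depend on the sampled interpretation, the first pass without replacement is omitted, and the availability and normalization checks (steps 8 and 9) are folded into one hypothetical `accept` predicate. All names are hypothetical.

```python
import random


def sample_suggestions(pair_probs, pipeline_weights, pipelines, limits,
                       max_suggestions, accept=lambda s: True, seed=None):
    """Simplified sampling loop; returns a list of (pipeline_name, suggestion)."""
    rng = random.Random(seed)
    iterators = {name: iter(items) for name, items in pipelines.items()}
    active = list(iterators)              # pipelines still allowed to produce
    counts = dict.fromkeys(iterators, 0)
    seen, results = set(), []

    while len(results) < max_suggestions:                      # step 1
        # Step 2: keep (type, language) pairs that can still reach a pipeline.
        usable = {
            pair: [n for n in active if pipeline_weights.get(pair, {}).get(n, 0) > 0]
            for pair in pair_probs
        }
        pairs = [p for p in pair_probs if pair_probs[p] > 0 and usable[p]]
        if not pairs:
            break
        # Step 3: sample a (type, language) pair.
        pair = rng.choices(pairs, weights=[pair_probs[p] for p in pairs])[0]
        # Step 4: sample a pipeline for that pair.
        names = usable[pair]
        name = rng.choices(names, weights=[pipeline_weights[pair][n] for n in names])[0]
        # Step 5: a pipeline at its global limit leaves the pool.
        if counts[name] >= limits.get(name, max_suggestions):
            active.remove(name)
            continue
        # Steps 6-9: pull suggestions until one is new and acceptable.
        for suggestion in iterators[name]:
            if suggestion not in seen and accept(suggestion):
                seen.add(suggestion)
                results.append((name, suggestion))
                counts[name] += 1
                break
        else:
            active.remove(name)           # exhausted pipelines are removed
    return results


# Hypothetical inputs.
pair_probs = {("ngram", "en"): 0.7, ("person", "en"): 0.3}
pipeline_weights = {
    ("ngram", "en"): {"hyphen": 1.0, "abbrev": 0.5},
    ("person", "en"): {"abbrev": 1.0},
}
pipelines = {"hyphen": ["fire-power", "fi-repower"], "abbrev": ["frpwr", "fp"]}
picked = sample_suggestions(pair_probs, pipeline_weights, pipelines,
                            limits={"abbrev": 1}, max_suggestions=5, seed=0)
```

Every iteration either adds a suggestion or removes a pipeline from the pool, so the loop always terminates, even when fewer than max_suggestions can be produced.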

Grouped by category

Parameters:

  • mode
  • min_available_fraction
  • max number of categories
  • max number of suggestions per category
  • max related categories
  • min total categories?
  • max total categories?

Requirements:

  • order of categories is fixed
  • every generator must be mapped to only one category
  • a flag generator suggestion should appear in 10% of suggestions; maybe we should detect whether it is the user's first search
    • should we remove the first pass of sampling with every generator?

Algorithm:

  1. Shuffle the order of categories (using weights?) if the min number of categories is smaller than the number of all categories. If a category does not return suggestions, take the next one.
  2. Within each category: sample the type and language of the interpretation, then sample an interpretation with this type and language. Sample a pipeline (pipeline weights depend on type and language). Do it in parallel?
  3. Sample up to the max number of suggestions per category. How to handle min_available_fraction?

Suggestions by category

For each category, a MetaSampler is created with the appropriate pipelines. All MetaSamplers are executed in parallel. Within one MetaSampler:

  1. Apply global limits.
  2. For each interpretation (interpretation_type, lang, tokenization) a sampler is created.

After generation of suggestions for all categories:

  1. For each category, the number of suggestions is limited by the category's max_suggestions.
  2. If count_real_suggestions < min_total_suggestions, RandomAvailable names are appended as the other category.
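
The two post-processing steps can be sketched as follows. The `finalize` helper and its arguments are hypothetical. The sketch also assumes that count_real_suggestions is measured after trimming and that exactly enough random names are appended to reach min_total_suggestions, neither of which is stated above.

```python
def finalize(categories, max_suggestions, min_total_suggestions, random_available):
    """Trim each category, then top up with random available names if needed."""
    # Step 1: limit each category by its own max_suggestions (no entry: keep all).
    trimmed = {
        name: items[:max_suggestions.get(name, len(items))]
        for name, items in categories.items()
    }
    # Step 2: append random available names as the "other" category.
    count_real_suggestions = sum(len(items) for items in trimmed.values())
    if count_real_suggestions < min_total_suggestions:
        missing = min_total_suggestions - count_real_suggestions  # assumption
        trimmed["other"] = list(random_available[:missing])
    return trimmed


# Hypothetical data.
result = finalize(
    categories={"emoji": ["a", "b", "c"], "wikipedia": ["d"]},
    max_suggestions={"emoji": 2},
    min_total_suggestions=5,
    random_available=["r1", "r2", "r3"],
)
print(result)  # {'emoji': ['a', 'b'], 'wikipedia': ['d'], 'other': ['r1', 'r2']}
```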