palomar (search) iteration (#263)
Larger refactors in this branch:

- [x] local docker dev env documented
- [x] specify mappings (schemas) for post and profile indices
- [x] transform raw records into the index schemas
- [x] different doc _id syntax
- [x] skip read+deserialization of records other than profile and post,
for efficiency
- [x] don't store records in database; database only used for firehose
cursor state
- [x] switch to informal /xrpc/app.bsky.unspecced.search*Skeleton
endpoints
- [x] return only skeleton responses (eg, AT-URI or DID lists)
- [x] handle non-success OpenSearch responses as errors
- [x] auto-create indices with schema when in indexing mode (not
READONLY) (with `go:embed` schemas)
- [x] switch logging to `log/slog`, including echo integration
- [x] use `atproto/identity` package for identity caching and handling,
not `User` database record
- [x] merged in backfill worker code
- [x] use `analysis-icu` plugin for (hopefully) better internationalized
search
- [x] special typeahead indexing and query parameter
- [x] basic/simple query string parsing, which should be safe, supports
quoted phrases, and `from:` filtering

This branch includes a couple of small commits to SDK code, which I've
cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo:
bluesky-social/atproto#1594

This is not compatible with the previous version of `palomar` at the
HTTP API, opensearch index, or database schema levels. The config vars
should be backwards compatible. The operational plan for staging and
prod is to deploy this as an entirely new environment (eg, "prod2",
"staging2"), get everything backfilled, and then flip the AppView, and
then the client app, over to the new lexicons/endpoints instead of the
older version.

----

I think this is ready for review, merge, and deploy to staging. Some
things to check before prod:

- [ ] compare index size and performance to existing version/schema
- [ ] real-world testing of profile typeahead (eg, do we need fuzzy?)
- [ ] real-world search relevancy checks
- [ ] real-world CJK text analysis checks
(#302)

Out of scope for this PR:

- [ ] deal with `created_at` timestamp not being reliable, by adding a
`sort_at` hybrid field, for future "sort by date"
- [ ] instrumentation and metrics (Jaz to implement on top of this
branch)
- [ ] better bulk indexing performance, especially during backfill:
disable refresh during backfill? longer refresh window? bulk (batch)
indexing would be best
- [x] integrate a better identity service/cache; the current one is
probably OK in the context of backfill. Or perhaps just bump the cache
size to ~50k or ~100k identities in prod?
bnewbold authored Sep 15, 2023
2 parents 4ef5ee5 + 32a9856 commit 0e409ee
Showing 21 changed files with 1,636 additions and 443 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -29,6 +29,9 @@ test-coverage.out
/lexgen
/stress
/labelmaker
/palomar
/sonar-cli
/supercollider

# Don't ignore this file itself, or other specific dotfiles
!.gitignore
1 change: 1 addition & 0 deletions Makefile
@@ -26,6 +26,7 @@ build: ## Build all executables
go build ./cmd/labelmaker
go build ./cmd/supercollider
go build -o ./sonar-cli ./cmd/sonar
go build ./cmd/palomar

.PHONY: all
all: build
2 changes: 2 additions & 0 deletions cmd/palomar/Dockerfile.opensearch
@@ -0,0 +1,2 @@
FROM opensearchproject/opensearch:2.5.0
RUN /usr/share/opensearch/bin/opensearch-plugin install --batch analysis-icu
98 changes: 71 additions & 27 deletions cmd/palomar/README.md
@@ -1,48 +1,92 @@
# Palomar

Palomar is a backend search service for atproto, specifically the `bsky.app` post and profile record types. It works by consuming a repo event stream ("firehose") and updating an OpenSearch cluster (a fork of Elasticsearch) with documents.

Almost all the code for this service is actually in the `search/` directory at the top of this repo.

In September 2023, this service was substantially re-written. It no longer stores records in a local database, returns only "skeleton" results (lists of AT-URIs or DIDs) via the HTTP API, and defines index mappings.

## Query String Syntax

Currently only a simple query string syntax is supported. Double-quotes can surround phrases, a `-` prefix negates a single keyword, and the following initial filters are supported:

- `from:<handle>` will filter to results from that account, based on current (cached) identity resolution
- a full DID given as an unquoted keyword will filter to results from that account

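A tokenizer for the syntax above might look like the following sketch (illustrative only; `ParseQuery` and its field names are made up for this example, not palomar's actual parser):

```go
package main

import (
	"fmt"
	"strings"
)

// ParsedQuery is an illustrative result shape for the syntax described
// above: quoted phrases, `-` negation, and a `from:` account filter.
type ParsedQuery struct {
	Terms   []string // plain keywords and quoted phrases
	Negated []string // keywords prefixed with '-'
	From    string   // value of a `from:` filter, if any
}

// ParseQuery tokenizes a raw query string, keeping double-quoted
// phrases together as single terms.
func ParseQuery(raw string) ParsedQuery {
	var pq ParsedQuery
	var quoted bool
	var buf strings.Builder
	flush := func() {
		tok := buf.String()
		buf.Reset()
		switch {
		case tok == "":
		case strings.HasPrefix(tok, "from:"):
			pq.From = strings.TrimPrefix(tok, "from:")
		case strings.HasPrefix(tok, "-"):
			pq.Negated = append(pq.Negated, strings.TrimPrefix(tok, "-"))
		default:
			pq.Terms = append(pq.Terms, tok)
		}
	}
	for _, r := range raw {
		switch {
		case r == '"':
			quoted = !quoted // toggle phrase mode; the quote itself is dropped
		case r == ' ' && !quoted:
			flush()
		default:
			buf.WriteRune(r)
		}
	}
	flush()
	return pq
}

func main() {
	fmt.Printf("%+v\n", ParseQuery(`from:alice.test "hello world" -spam cats`))
}
```

Because the string is tokenized rather than passed through to the search backend's own query parser, special operators can't be injected, which is what makes this kind of approach "safe".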

## Configuration

Palomar uses environment variables for configuration.

- `ATP_BGS_HOST`: URL of firehose to subscribe to, either global BGS or individual PDS (default: `wss://bsky.social`)
- `ATP_PLC_HOST`: PLC directory for identity lookups (default: `https://plc.directory`)
- `DATABASE_URL`: connection string for database to persist firehose cursor subscription state
- `PALOMAR_BIND`: IP/port to have HTTP API listen on (default: `:3999`)
- `ES_USERNAME`: Elasticsearch username (default: `admin`)
- `ES_PASSWORD`: Password for Elasticsearch authentication
- `ES_CERT_FILE`: Optional, for TLS connections
- `ES_HOSTS`: Comma-separated list of Elasticsearch endpoints
- `ES_POST_INDEX`: name of index for post docs (default: `palomar_post`)
- `ES_PROFILE_INDEX`: name of index for profile docs (default: `palomar_profile`)
- `PALOMAR_READONLY`: Set this if the instance should act as a readonly HTTP server (no indexing)

## HTTP API

### Query Posts: `/xrpc/app.bsky.unspecced.searchPostsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)

Response:

- `posts`: array of AT-URI strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated

### Query Profiles: `/xrpc/app.bsky.unspecced.searchActorsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)
- `typeahead`: boolean, for typeahead behavior (vs. full search)

Response:

- `actors`: array of DID strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated

## Development Quickstart

Run an ephemeral OpenSearch instance on local port 9200, with SSL disabled and the `analysis-icu` plugin installed, using Docker:

```
docker build -f Dockerfile.opensearch . -t opensearch-palomar
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearch-palomar
```

See [README.opensearch.md](README.opensearch.md) for more OpenSearch operational tips.

From the top level of the repository:

```
# run combined indexing and search service
make run-dev-search

# run just the search service
READONLY=true make run-dev-search
```

You'll need to get some content into the index. An easy way to do this is to have palomar consume from the public production firehose.

You can run test queries from the top level of the repository:

```
go run ./cmd/palomar search-post "hello"
go run ./cmd/palomar search-profile "hello"
go run ./cmd/palomar search-profile -typeahead "h"
```

For more commands and args:

```
go run ./cmd/palomar --help
```
90 changes: 90 additions & 0 deletions cmd/palomar/README.opensearch.md
@@ -0,0 +1,90 @@

# Basic OpenSearch Operations

We use OpenSearch version 2.5+, with the `analysis-icu` plugin. The plugin is included automatically in the AWS-hosted version of OpenSearch; otherwise you need to install it:

```
sudo /usr/share/opensearch/bin/opensearch-plugin install analysis-icu
sudo service opensearch restart
```

If you are trying to use Elasticsearch 7.10 instead of OpenSearch, you can install the plugin with:

```
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
sudo service elasticsearch restart
```

## Local Development

These commands assume OpenSearch is running locally.

To manually drop and re-build the indices with new schemas (palomar will create these automatically if they don't exist, but this can be helpful when developing the schema itself):

```
http delete :9200/palomar_post
http delete :9200/palomar_profile
http put :9200/palomar_post < post_schema.json
http put :9200/palomar_profile < profile_schema.json
```

Put a single object (good for debugging):

```
head -n1 examples.json | http post :9200/palomar_post/_doc/0
http get :9200/palomar_post/_doc/0
```

Bulk insert from a file on disk:

```
# esbulk is a golang CLI tool which must be installed separately
esbulk -verbose -id ident -index palomar_post -type _doc examples.json
```

## Index Aliases

To make re-indexing and schema changes easier, we can create versioned (or
time-stamped) elasticsearch indexes, and then point to them using index
aliases. The index alias updates are fast and atomic, so we can slowly build up
a new index and then cut over with no downtime.

```
http put :9200/palomar_post_v04 < post_schema.json
```

To do an atomic swap from one alias to a new one ("zero downtime"):

```
http post :9200/_aliases << EOF
{
  "actions": [
    { "remove": { "index": "palomar_post_v05", "alias": "palomar_post" }},
    { "add":    { "index": "palomar_post_v06", "alias": "palomar_post" }}
  ]
}
EOF
```

To replace an existing ("real") index with an alias pointer, do two actions
(not truly zero-downtime, but pretty fast):

```
http delete :9200/palomar_post
http put :9200/palomar_post_v03/_alias/palomar_post
```

## Full-Text Querying

A generic full-text "query string" query looks like this (replace "blood" with the actual query string, and the "size" field with the max results to return):

```
GET /palomar_post/_search
{
  "query": {
    "query_string": {
      "query": "blood",
      "analyzer": "textIcuSearch",
      "default_operator": "AND",
      "analyze_wildcard": true,
      "lenient": true,
      "fields": ["handle^5", "text"]
    }
  },
  "size": 3
}
```

In the results, take `.hits.hits[]._source` as the objects; `.hits.total` is the
total number of search hits.


## Index Debugging

Check index size:

```
http get :9200/palomar_post/_count
http get :9200/palomar_profile/_count
```
