palomar (search) iteration (#263)
Larger refactors in this branch:

- [x] local docker dev env documented
- [x] specify mappings (schemas) for post and profile indices
- [x] transform raw records into the index schemas
- [x] different doc _id syntax
- [x] skip read+deserialization of records other than profile and post,
for efficiency
- [x] don't store records in database; database only used for firehose
cursor state
- [x] switch to informal /xrpc/app.bsky.unspecced.search*Skeleton
endpoints
- [x] return only skeleton responses (eg, AT-URI or DID lists)
- [x] handle non-success OpenSearch responses as errors
- [x] auto-create indices with schema when in indexing mode (not
READONLY) (with `go:embed` schemas)
- [x] switch logging to `log/slog`, including echo integration
- [x] use `atproto/identity` package for identity caching and handling,
not `User` database record
- [x] merged in backfill worker code
- [x] use `analysis-icu` plugin for (hopefully) better internationalized
search
- [x] special typeahead indexing and query parameter
- [x] basic/simple query string parsing, which should be safe, supports
quoted phrases, and `from:` filtering

This branch includes a couple of small commits to SDK code, which I've
cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo:
bluesky-social/atproto#1594

This is not compatible with the previous version of `palomar` at the
HTTP API, opensearch index, or database schema levels. The config vars
should be backwards compatible. The operational plan for staging and
prod is to deploy this as an entirely new environment (eg, "prod2",
"staging2"), get everything backfilled, and then flip the AppView, and
then the client app, over to the new lexicons/endpoints instead of the
older version.

----

I think this is ready for review, merge, and deploy to staging. Some
things to check before prod:

- [ ] compare index size and performance to existing version/schema
- [ ] real-world testing of profile typeahead (eg, do we need fuzzy?)
- [ ] real-world search relevancy checks
- [ ] real-world CJK text analysis checks
(#302)

Out of scope for this PR:

- [ ] deal with `created_at` timestamp not being reliable, by adding a
`sort_at` hybrid field, for future "sort by date"
- [ ] instrumentation and metrics (Jaz to implement on top of this
branch)
- [ ] better bulk indexing performance, especially during backfill:
disable refresh during backfill? longer refresh window? bulk (batch)
indexing would be best
- [x] integrate a better identity service/cache; the current one is
probably OK in the context of backfill. Or perhaps just bump the cache
size to ~50k or ~100k identities in prod?
bnewbold authored Sep 15, 2023
2 parents 4ef5ee5 + 32a9856 commit 0e409ee
Showing 21 changed files with 1,636 additions and 443 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -29,6 +29,9 @@ test-coverage.out
/lexgen
/stress
/labelmaker
/palomar
/sonar-cli
/supercollider

# Don't ignore this file itself, or other specific dotfiles
!.gitignore
1 change: 1 addition & 0 deletions Makefile
@@ -26,6 +26,7 @@ build: ## Build all executables
go build ./cmd/labelmaker
go build ./cmd/supercollider
go build -o ./sonar-cli ./cmd/sonar
go build ./cmd/palomar

.PHONY: all
all: build
2 changes: 2 additions & 0 deletions cmd/palomar/Dockerfile.opensearch
@@ -0,0 +1,2 @@
FROM opensearchproject/opensearch:2.5.0
RUN /usr/share/opensearch/bin/opensearch-plugin install --batch analysis-icu
98 changes: 71 additions & 27 deletions cmd/palomar/README.md
@@ -1,48 +1,92 @@
# Palomar

Palomar is a backend search service for atproto, specifically the `bsky.app` post and profile record types. It works by consuming a repo event stream ("firehose") and updating an OpenSearch cluster (a fork of Elasticsearch) with documents.

Almost all the code for this service is actually in the `search/` directory at the top of this repo.

In September 2023, this service was substantially re-written. It no longer stores records in a local database, returns only "skeleton" results (lists of AT-URIs or DIDs) via the HTTP API, and defines index mappings.

## Query String Syntax

Currently only a simple query string syntax is supported. Double-quotes can surround phrases, a `-` prefix negates a single keyword, and the following initial filters are supported:

- `from:<handle>` will filter to results from that account, based on current (cached) identity resolution
- a full DID given as an unquoted keyword will filter to results from that account

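A tokenizer for the syntax above might look like the following sketch (illustrative only; `ParseQuery` and its field names are made up for this example, not palomar's actual parser):

```go
package main

import (
	"fmt"
	"strings"
)

// ParsedQuery is an illustrative result shape for the syntax described
// above: quoted phrases, `-` negation, and a `from:` account filter.
type ParsedQuery struct {
	Terms   []string // plain keywords and quoted phrases
	Negated []string // keywords prefixed with '-'
	From    string   // value of a `from:` filter, if any
}

// ParseQuery tokenizes a raw query string, keeping double-quoted
// phrases together as single terms.
func ParseQuery(raw string) ParsedQuery {
	var pq ParsedQuery
	var quoted bool
	var buf strings.Builder
	flush := func() {
		tok := buf.String()
		buf.Reset()
		switch {
		case tok == "":
		case strings.HasPrefix(tok, "from:"):
			pq.From = strings.TrimPrefix(tok, "from:")
		case strings.HasPrefix(tok, "-"):
			pq.Negated = append(pq.Negated, strings.TrimPrefix(tok, "-"))
		default:
			pq.Terms = append(pq.Terms, tok)
		}
	}
	for _, r := range raw {
		switch {
		case r == '"':
			quoted = !quoted // toggle phrase mode; the quote itself is dropped
		case r == ' ' && !quoted:
			flush()
		default:
			buf.WriteRune(r)
		}
	}
	flush()
	return pq
}

func main() {
	fmt.Printf("%+v\n", ParseQuery(`from:alice.test "hello world" -spam cats`))
}
```

Because the string is tokenized rather than passed through to the search backend's own query parser, special operators can't be injected, which is what makes this kind of approach "safe".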

## Configuration

Palomar uses environment variables for configuration.

- `ATP_BGS_HOST`: URL of firehose to subscribe to, either global BGS or individual PDS (default: `wss://bsky.social`)
- `ATP_PLC_HOST`: PLC directory for identity lookups (default: `https://plc.directory`)
- `DATABASE_URL`: connection string for database to persist firehose cursor subscription state
- `PALOMAR_BIND`: IP/port to have HTTP API listen on (default: `:3999`)
- `ES_USERNAME`: Elasticsearch username (default: `admin`)
- `ES_PASSWORD`: Password for Elasticsearch authentication
- `ES_CERT_FILE`: Optional, for TLS connections
- `ES_HOSTS`: Comma-separated list of Elasticsearch endpoints
- `ES_POST_INDEX`: name of index for post docs (default: `palomar_post`)
- `ES_PROFILE_INDEX`: name of index for profile docs (default: `palomar_profile`)
- `PALOMAR_READONLY`: Set this if the instance should act as a readonly HTTP server (no indexing)

## HTTP API

### Query Posts: `/xrpc/app.bsky.unspecced.searchPostsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)

Response:

- `posts`: array of AT-URI strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated

### Query Profiles: `/xrpc/app.bsky.unspecced.searchActorsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)
- `typeahead`: boolean, for typeahead behavior (vs. full search)

Response:

- `actors`: array of DID strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated

## Development Quickstart

Run an ephemeral OpenSearch instance on local port 9200, with SSL disabled and the `analysis-icu` plugin installed, using Docker:

```
docker build -f Dockerfile.opensearch . -t opensearch-palomar
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearch-palomar
```

See [README.opensearch.md](README.opensearch.md) for more OpenSearch operational tips.

From the top level of the repository:

```
# run combined indexing and search service
make run-dev-search

# run just the search service
READONLY=true make run-dev-search
```

You'll need to get some content into the index. An easy way to do this is to have palomar consume from the public production firehose.

You can run test queries from the top level of the repository:

```
go run ./cmd/palomar search-post "hello"
go run ./cmd/palomar search-profile "hello"
go run ./cmd/palomar search-profile -typeahead "h"
```

For more commands and args:

```
go run ./cmd/palomar --help
```
90 changes: 90 additions & 0 deletions cmd/palomar/README.opensearch.md
@@ -0,0 +1,90 @@

# Basic OpenSearch Operations

We use OpenSearch version 2.5+, with the `analysis-icu` plugin. The plugin is included automatically in the AWS-hosted version of OpenSearch; otherwise you need to install it:

```
sudo /usr/share/opensearch/bin/opensearch-plugin install analysis-icu
sudo service opensearch restart
```

If you are trying to use Elasticsearch 7.10 instead of OpenSearch, you can install the plugin with:

```
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
sudo service elasticsearch restart
```

## Local Development

These commands assume OpenSearch is running locally.

To manually drop and re-build the indices with new schemas (palomar will create these automatically if they don't exist, but this can be helpful when developing the schema itself):

```
http delete :9200/palomar_post
http delete :9200/palomar_profile
http put :9200/palomar_post < post_schema.json
http put :9200/palomar_profile < profile_schema.json
```

Put a single object (good for debugging):

```
head -n1 examples.json | http post :9200/palomar_post/_doc/0
http get :9200/palomar_post/_doc/0
```

Bulk insert from a file on disk:

```
# esbulk is a golang CLI tool which must be installed separately
esbulk -verbose -id ident -index palomar_post -type _doc examples.json
```

## Index Aliases

To make re-indexing and schema changes easier, we can create versioned (or
time-stamped) elasticsearch indexes, and then point to them using index
aliases. The index alias updates are fast and atomic, so we can slowly build up
a new index and then cut over with no downtime.

```
http put :9200/palomar_post_v04 < post_schema.json
```

To do an atomic swap from one alias to a new one ("zero downtime"):

```
http post :9200/_aliases << EOF
{
  "actions": [
    { "remove": { "index": "palomar_post_v05", "alias": "palomar_post" }},
    { "add":    { "index": "palomar_post_v06", "alias": "palomar_post" }}
  ]
}
EOF
```

To replace an existing ("real") index with an alias pointer, do two actions
(not truly zero-downtime, but pretty fast):

```
http delete :9200/palomar_post
http put :9200/palomar_post_v03/_alias/palomar_post
```

## Full-Text Querying

A generic full-text "query string" query looks like this (replace "blood" with the actual query string, and the "size" field with the max results to return):

```
GET /palomar_post/_search
{
  "query": {
    "query_string": {
      "query": "blood",
      "analyzer": "textIcuSearch",
      "default_operator": "AND",
      "analyze_wildcard": true,
      "lenient": true,
      "fields": ["handle^5", "text"]
    }
  },
  "size": 3
}
```

In the results, take `.hits.hits[]._source` as the objects; `.hits.total` is the
total number of search hits.


## Index Debugging

Check index size:

```
http get :9200/palomar_post/_count
http get :9200/palomar_profile/_count
```
