Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

palomar (search) iteration #263

Merged
merged 41 commits into from
Sep 15, 2023
Merged

palomar (search) iteration #263

merged 41 commits into from
Sep 15, 2023

Conversation

bnewbold
Copy link
Collaborator

@bnewbold bnewbold commented Aug 2, 2023

Larger refactors in this branch:

  • local docker dev env documented
  • specify mappings (schemas) for post and profile indices
  • transform raw records in to the index schemas
  • different doc _id syntax
  • skip read+deserialization of records other than profile and post, for efficiency
  • don't store records in database; database only used for firehose cursor state
  • switch to informal /xrpc/app.bsky.unspecced.search*Skeleton endpoints
  • return only skeleton responses (eg, AT-URI or DID lists)
  • handle non-success OpenSearch responses as errors
  • auto-create indices with schema when in indexing mode (not READONLY) (with go:embed schemas)
  • switch logging to log/slog, including echo integration
  • use atproto/identity package for identity caching and handling, not User database record
  • merged in backfill worker code
  • use analysis-icu plugin for (hopefully) better internationalized search
  • special typeahead indexing and query parameter
  • basic/simple query string parsing, which should be safe, supports quoted phrases, and from: filtering

This branch includes a couple small commits to SDK code, which i've cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo: bluesky-social/atproto#1594

This is not compatible with the previous version of palomar at the HTTP API, opensearch index, or database schema levels. The config vars should be backwards compatible. The operational plan for staging and prod is to deploy this as an entirely new environment (eg, "prod2", "staging2"), get everything backfilled, and then flip over the AppView and then client app to use the lexicons/endpoints instead of the older version.


I think this is ready for review, merge, and deploy to staging. Some things to check before prod:

Out of scope for this PR:

  • deal with created_at timestamp not being reliable, by adding a sort_at hybrid field, for future "sort by date"
  • instrumentation and metrics (Jaz to implement on top of this branch)
  • better bulk indexing performance, especially during backfill: disable refresh during backfill? longer refresh window? bulk (batch) indexing would be best
  • integrate a better identity service/cache; current is probably Ok in context of backfill. or perhaps just bump the cache size to ~50k or ~100k identities in prod?

@bnewbold bnewbold force-pushed the bnewbold/palomar-iterate branch from 365962a to 9174fd5 Compare August 14, 2023 06:30
@bnewbold bnewbold changed the title palomar search schema iteration palomar (search) iteration Aug 14, 2023
@bnewbold bnewbold force-pushed the bnewbold/palomar-iterate branch from 777ef41 to cafef6e Compare September 1, 2023 05:13
@bnewbold bnewbold marked this pull request as ready for review September 14, 2023 04:08
@bnewbold bnewbold requested a review from ericvolp12 September 14, 2023 04:08
cmd/palomar/main.go Outdated Show resolved Hide resolved
Comment on lines 61 to 63
if cur != 0 {
u.RawPath = fmt.Sprintf("cursor=%d", cur)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cursor of 0 asks for the farthest back event that the host is willing to provide. Maybe we want to default the value to -1 as s sentinel so we can allow for an explicit 0 cursor?

Doesn't need to be blocking but just a thing to think about.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to be able to configure this, the same way Kafka lets you default a consumer group to "start at current offset", "start at earliest available", "start at a configured time delta", etc.

For now it is in the "zero means empty/nil" situation. I'm going to leave as-is, which I think is a sane default with the backfill stuff now handling broader enumeration of backfill work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just clarifying, this means if you provide no cursor we'll start from as far back as the BGS/PDS will allow us, not from live.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I guess I had a typo and it should be RawQuery not RawPath.

But if we don't provide a cursor HTTP query param at all, shouldn't the PDS (or BGS) give us the current stream, not start at the beginning? The intent of this snippet of code is to not set the parameter at all, instead of defaulting to cursor=0 (which would scroll back to oldest available).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh oops yeah you're right, I read that wrong

for _, r := range resp.Hits.Hits {
var doc PostDoc
if err := json.Unmarshal(r.Source, &doc); err != nil {
return nil, fmt.Errorf("decoding post doc from search response: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to fail the entire query or just log it with some context and move to the next result?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty bad error which should "never" happen, so I think we want to fail-and-bail on this situation.

An example where this would happen is if there was a config mistake where a query worker got pointed at the wrong elasticsearch index (either post instead of profile, or an older schema version). I think in that kind of situation we want to fail loudly.


did, err := syntax.ParseDID(doc.DID)
if err != nil {
return nil, fmt.Errorf("invalid DID in indexed document: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above: this is a bad and very unexpected situation

search/handlers.go Outdated Show resolved Hide resolved
for _, r := range resp.Hits.Hits {
var doc ProfileDoc
if err := json.Unmarshal(r.Source, &doc); err != nil {
return nil, fmt.Errorf("decoding profile doc from search response: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, do we want to error here or log it and keep going?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above: this is a bad and very unexpected situation


did, err := syntax.ParseDID(doc.DID)
if err != nil {
return nil, fmt.Errorf("invalid DID in indexed document: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

search/query.go Outdated Show resolved Hide resolved
search/query.go Outdated Show resolved Hide resolved
search/server.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@ericvolp12 ericvolp12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks solid, have a few comments on how we want to handle different error cases and some dead code bits etc.

@bnewbold
Copy link
Collaborator Author

Addressed a bunch of these. The main outstanding thing is failing hard on some of the query error cases. I think that is the right thing to do, but could be wrong and we could get bitten.

I'll probably merge this later/tomorrow after thinking a bit more.

@bnewbold bnewbold merged commit 0e409ee into main Sep 15, 2023
6 checks passed
@bnewbold bnewbold deleted the bnewbold/palomar-iterate branch September 15, 2023 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants