palomar (search) iteration #263

bnewbold · 2023-08-02T01:42:00Z

Larger refactors in this branch:

This branch includes a couple of small commits to SDK code, which I've cherry-picked out as separate PRs for easier review.

See also Lexicon PR in atproto repo: bluesky-social/atproto#1594

This is not compatible with the previous version of palomar at the HTTP API, opensearch index, or database schema levels. The config vars should be backwards compatible. The operational plan for staging and prod is to deploy this as an entirely new environment (eg, "prod2", "staging2"), get everything backfilled, and then flip over the AppView and then the client app to use the new lexicons/endpoints instead of the older version.

I think this is ready for review, merge, and deploy to staging. Some things to check before prod:

compare index size and performance to existing version/schema
real-world testing of profile typeahead (eg, do we need fuzzy?)
real-world search relevancy checks
real-world CJK text analysis checks (Improve the search quality especially for CJK queries #302)

Out of scope for this PR:

deal with created_at timestamp not being reliable, by adding a sort_at hybrid field, for future "sort by date"
instrumentation and metrics (Jaz to implement on top of this branch)
better bulk indexing performance, especially during backfill: disable refresh during backfill? longer refresh window? bulk (batch) indexing would be best
integrate a better identity service/cache; the current one is probably OK in the context of backfill. Or perhaps just bump the cache size to ~50k or ~100k identities in prod?

cmd/palomar/main.go

ericvolp12 · 2023-09-14T16:53:49Z

search/firehose.go

+	if cur != 0 {
+		u.RawPath = fmt.Sprintf("cursor=%d", cur)
+	}


A cursor of 0 asks for the farthest-back event that the host is willing to provide. Maybe we want to default the value to -1 as a sentinel so we can allow for an explicit 0 cursor?

Doesn't need to be blocking but just a thing to think about.

We'll probably want to be able to configure this, the same way Kafka lets you default a consumer group to "start at current offset", "start at earliest available", "start at a configured time delta", etc.

For now it is in the "zero means empty/nil" situation. I'm going to leave as-is, which I think is a sane default with the backfill stuff now handling broader enumeration of backfill work.

Just clarifying, this means if you provide no cursor we'll start from as far back as the BGS/PDS will allow us, not from live.

Looking at this again, I guess I had a typo and it should be RawQuery not RawPath.

But if we don't provide a cursor HTTP query param at all, shouldn't the PDS (or BGS) give us the current stream, not start at the beginning? The intent of this snippet of code is to not set the parameter at all, instead of defaulting to cursor=0 (which would scroll back to oldest available).

Oh oops yeah you're right, I read that wrong
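
For reference, a minimal sketch of the intended behavior discussed in this thread, assuming the fix is simply to switch RawPath to RawQuery. The helper name `subscribeReposURL` and the host `bgs.example.com` are made up for illustration and are not taken from the branch; only `net/url` from the standard library is used.

```go
package main

import (
	"fmt"
	"net/url"
)

// subscribeReposURL is a hypothetical helper (not from this PR) that builds
// the firehose subscription URL. A cursor of 0 is treated as "not set", so
// the query parameter is omitted entirely and the host decides where to start.
func subscribeReposURL(host string, cur int64) string {
	u := url.URL{
		Scheme: "wss",
		Host:   host, // placeholder host, supplied by the caller
		Path:   "/xrpc/com.atproto.sync.subscribeRepos",
	}
	if cur != 0 {
		// RawQuery (not RawPath) is the field that ends up after the "?".
		u.RawQuery = fmt.Sprintf("cursor=%d", cur)
	}
	return u.String()
}

func main() {
	fmt.Println(subscribeReposURL("bgs.example.com", 0))
	fmt.Println(subscribeReposURL("bgs.example.com", 1234))
}
```

One detail worth noting: `net/url` only honors RawPath when it is a valid encoding of Path, so with the original `u.RawPath = ...` line the cursor would most likely have been dropped silently rather than producing a malformed URL.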

ericvolp12 · 2023-09-14T17:06:56Z

search/handlers.go

+	for _, r := range resp.Hits.Hits {
+		var doc PostDoc
+		if err := json.Unmarshal(r.Source, &doc); err != nil {
+			return nil, fmt.Errorf("decoding post doc from search response: %w", err)


Do we want this to fail the entire query or just log it with some context and move to the next result?

This is a pretty bad error which should "never" happen, so I think we want to fail-and-bail on this situation.

An example where this would happen is if there was a config mistake where a query worker got pointed at the wrong elasticsearch index (either post instead of profile, or an older schema version). I think in that kind of situation we want to fail loudly.

ericvolp12 · 2023-09-14T17:07:04Z

search/handlers.go

+
+		did, err := syntax.ParseDID(doc.DID)
+		if err != nil {
+			return nil, fmt.Errorf("invalid DID in indexed document: %w", err)


Same as above

same as above: this is a bad and very unexpected situation

search/handlers.go

ericvolp12 · 2023-09-14T17:08:27Z

search/handlers.go

+	for _, r := range resp.Hits.Hits {
+		var doc ProfileDoc
+		if err := json.Unmarshal(r.Source, &doc); err != nil {
+			return nil, fmt.Errorf("decoding profile doc from search response: %w", err)


Similarly, do we want to error here or log it and keep going?

same as above: this is a bad and very unexpected situation

ericvolp12 · 2023-09-14T17:08:34Z

search/handlers.go

+
+		did, err := syntax.ParseDID(doc.DID)
+		if err != nil {
+			return nil, fmt.Errorf("invalid DID in indexed document: %w", err)


same as above

search/query.go

search/server.go

ericvolp12

Overall looks solid, have a few comments on how we want to handle different error cases and some dead code bits etc.

bnewbold · 2023-09-15T00:26:12Z

Addressed a bunch of these. The main outstanding thing is failing hard on some of the query error cases. I think that is the right thing to do, but could be wrong and we could get bitten.

I'll probably merge this later/tomorrow after thinking a bit more.

bnewbold force-pushed the bnewbold/palomar-iterate branch from 365962a to 9174fd5 Compare August 14, 2023 06:30

bnewbold changed the title ~~palomar search schema iteration~~ palomar (search) iteration Aug 14, 2023

bnewbold added 12 commits August 31, 2023 18:01

palomar: add second README with ES ops stuff

af1829a

palomar: proposed post and profile schema iteration

d01531d

palomar: tweak proposed schemas

ca73914

search: search doc transform helpers

2c782e3

search: incorporate transforms

42078d8

palomar: update post+profile index schemas

4cdbdb0

palomar: update README and dev setup

e382b9e

palomar: more progress

550fc57

search: lint fixes

ae34ff3

palomar: more tweaks to schema

adc05ba

palomar: bit of progress

a9e1c05

palomar: basic query parsing, handle 'from:'

cafef6e

bnewbold force-pushed the bnewbold/palomar-iterate branch from 777ef41 to cafef6e Compare September 1, 2023 05:13

bnewbold added 15 commits September 12, 2023 00:09

Merge branch 'main' into bnewbold/palomar-iterate

8e78c4b

Makefile: build palomar (search)

10fd2dd

gitignore: add more executables

89c0ff4

palomar: construct subscribeRepos URL using struct

b5055aa

palomar: switch to slog; add prometheus and other common middleware

2cc193a

palomar: don't force refresh for most indexing ops

35f09f3

This can be left to the configured default for the index; default is 1sec. The exception is post deletions, which we process immediately.

palomar: progress on removing user+record database tables

41f3ad3

identity: handle errors when doing LookupDID should not error, just i…

4489a89

…nvalid handle

util: allow non-fractional-second timestamps

5251558

palomar: more cleanup

8c914b9

palomar: switch HTTP API to skeleton

f8ad174

palomar: fix bug in query marshal

d842aa9

palomar: clarify weird double-marshal

31a7212

palomar: auto-create indices if needed; check existence

cc587df

palomar: logging, var names

1812e1e

bnewbold added 2 commits September 13, 2023 19:52

identity: support skipping DNS resolution for some hosts (like bsky.s…

ee3f286

…ocial)

palomar: skip DNS resolution on bsky.social; do try authoritative DNS

fff34f1

bnewbold marked this pull request as ready for review September 14, 2023 04:08

bnewbold requested a review from ericvolp12 September 14, 2023 04:08

bnewbold added 5 commits September 13, 2023 23:17

palomar: fix unclosed HTTP connections

25e1e95

palomar: default index shard sizes

4f7f097

palomar: tune backfill a bit

2ead460

palomar: clear createdAt on error, not skip record

c4bcc0e

palomar: fix go:embed schemas

bcec416

bnewbold mentioned this pull request Sep 14, 2023

Improve the search quality especially for CJK queries #302

Closed

bnewbold added 2 commits September 14, 2023 00:20

make lint

cddee8f

palomar: fix bad slog invocation

8b8ab88