vocab-mapper: feedback, ideas, next steps #164

Open
josephjclark opened this issue Jan 29, 2025 · 2 comments
josephjclark (Collaborator) commented Jan 29, 2025

Randomish thoughts:

  • we need to save the Sheets extension to a repo here on GitHub, and work out how to publish it outside of Sheets
  • work out how to add more logging so we can see more stuff from the workflow view. I think this means a) log more frequently in apollo itself, and b) work out how to get log lines into the job
  • work out how to feed errors back into the sheet
  • Instantly update the sheet columns with "loading..."
  • Add a streaming mode, which processes terms one at a time and emits them through a websocket to the caller. This lets us update the sheet in real time. Kind of hard but very achievable (see the sketch after this list).
  • Run faster: it currently takes 3 minutes to process results. Does this matter? Can we easily speed it up? What about streaming mode? vocab-mapper: use batching for better performance #165
  • When we're done, show a toast saying "we've finished!" in the sheet
  • Is there any way we can poll apollo to get a progress update for the job?
  • How can we limit calls, so that if the workflow is already running for that sheet, we return an error? Oh, collections!
  • Can users paste their own target values? We'd have to process and embed them. When do we remove them? Should we cache?
  • How do users control target data sets?
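
A minimal sketch of what streaming mode could look like on the apollo side, using Python's `websockets` library. Everything here is illustrative, not the real apollo API: `process_term` is a hypothetical stand-in for the actual search + selection pipeline, and the message shapes are assumptions.

```python
# Sketch only: `process_term` and the message shapes are assumptions.
import asyncio
import json

import websockets  # pip install websockets


async def process_term(term: str) -> dict:
    # Placeholder for the real search + LLM selection steps.
    await asyncio.sleep(1)
    return {"term": term, "match": "..."}


async def handler(ws):
    # Caller sends a JSON list of terms; we emit one result per term
    # as soon as it's ready, so the sheet can update in real time.
    terms = json.loads(await ws.recv())
    for term in terms:
        result = await process_term(term)
        await ws.send(json.dumps(result))
    await ws.send(json.dumps({"done": True}))


async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```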
josephjclark (Collaborator, Author) commented:
Regarding logs:

Right now the workflow calls out to apollo via the REST API. It's a black box: all logs are invisible to Lightning (and therefore to the user).

How can we use websocket events in the http adaptor in a workflow to stream logs? Presumably we'd need special adaptor support for this?

The alternative would be an Apollo adaptor, which would be in a much better place to handle this use-case. It would almost be a copy-and-paste of the core CLI code.
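
For reference, a hypothetical client-side sketch of consuming apollo logs over a websocket in Python. The endpoint path and payload shape are assumptions for illustration, not apollo's actual protocol:

```python
# Hypothetical log consumer: the endpoint and payload shape are
# assumptions, not apollo's real protocol.
import asyncio
import json

import websockets  # pip install websockets


async def stream_logs(job_id: str):
    uri = f"ws://localhost:3000/ws/logs/{job_id}"  # assumed endpoint
    async with websockets.connect(uri) as ws:
        async for message in ws:
            event = json.loads(message)
            # Forward each log line to the caller (e.g. Lightning).
            print(f"[{event.get('level', 'info')}] {event.get('message')}")


if __name__ == "__main__":
    asyncio.run(stream_logs("example-job-id"))
```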

hanna-paasivirta (Contributor) commented Feb 12, 2025

  • Speed:

    • We have added some concurrent processing (vocab-mapper: use batching for better performance #165), but testing the ideal settings (fast while stable) will cost some time and API tokens. The settings for the different steps need to be tuned individually, and the logic could be improved.
    • With our Anthropic tokens/minute limit, we can only manage a maximum of two concurrent calls. I have only tested 25–50 inputs at a time, and we will probably hit the limit with more inputs. This means that with two users we would need to slow the pipeline down and wait for 60 seconds whenever limits are reached (see the throttling sketch after this list).
  • Cost & Speed: this first version maximised accuracy. We can probably make the mapper faster and cheaper with optimisations such as:

    • Combining the top-n and top-1 selection steps – this is the most obvious one, and could lower the cost by up to 40%. Since splitting them improved performance a little, I've kept them separate for now.
    • Limiting the total number of search results.
    • Integrating the batch API option for 50% cheaper 24-hour processing (rough sketch at the end of this comment).
    • I haven't implemented skipping user locked-in answers yet!
  • Performance:

    • The biggest bottleneck might be the search. If we could optimise it to return fewer specialised results for non-specialised inputs, that would make the selection step easier (and cheaper).
    • The vector search in particular favours long, specialised terms over simple ones. Different embeddings, a different search algorithm, or different text preprocessing might help here.
    • For keyword search, something rule-based could work (trim results by length relative to the length of the input?).
    • The LLM tends to pick specialised results. If this isn't caused by the search step, a different model or added reasoning steps might help.
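
A minimal sketch of the throttling described above, assuming the `anthropic` Python SDK: cap concurrency at two calls and pause for 60 seconds when a rate-limit error is raised. The model name, prompt, and limits are placeholders, not the real vocab-mapper settings.

```python
# Sketch: at most two concurrent Anthropic calls, with a 60s pause
# whenever the tokens/minute limit is hit. All settings are placeholders.
import asyncio

import anthropic  # pip install anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY
semaphore = asyncio.Semaphore(2)     # observed max before hitting limits


async def select_match(term: str) -> str:
    async with semaphore:
        while True:
            try:
                response = await client.messages.create(
                    model="claude-3-5-sonnet-latest",
                    max_tokens=256,
                    messages=[{"role": "user",
                               "content": f"Pick the best match for: {term}"}],
                )
                return response.content[0].text
            except anthropic.RateLimitError:
                # Tokens/minute exhausted: slow the whole pipeline down.
                await asyncio.sleep(60)


async def main():
    terms = ["fever", "headache", "cough"]
    results = await asyncio.gather(*(select_match(t) for t in terms))
    print(results)


asyncio.run(main())
```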

I tracked speed with LangSmith, but for cost optimisation it might work better with OpenAI.
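For the batch option, Anthropic's Message Batches API trades latency (up to 24 hours) for roughly half the price. A rough sketch, with placeholder prompts and IDs:

```python
# Sketch of submitting the selection step via Anthropic's Message
# Batches API. Model, prompts, and IDs are placeholders.
import anthropic  # pip install anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"term-{i}",
            "params": {
                "model": "claude-3-5-sonnet-latest",
                "max_tokens": 256,
                "messages": [{"role": "user",
                              "content": f"Pick the best match for: {term}"}],
            },
        }
        for i, term in enumerate(["fever", "headache", "cough"])
    ]
)

# Poll `processing_status` later, then fetch results with
# client.messages.batches.results(batch.id).
print(batch.id, batch.processing_status)
```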
