Speed-up calls to LLM by parallelization of the topic categorization #11

Closed

jucor opened this issue Jan 21, 2025 · 12 comments

@jucor
Collaborator

jucor commented Jan 21, 2025

Dear Jigsaw team

As discussed by email, it would be really helpful if the library could run faster. The topic learning is already very fast; it's the categorization that could do with being faster. In our IRL conversations I remember you mentioned that you definitely had this in mind, so I'm just adding it here to follow up.

Looking at the categorization code, I suspect you were probably thinking of parallelizing the calls across the mini-batches, i.e. this loop:

for (
  let i = 0;
  i < comments.length;
  i += this.modelSettings.defaultModel.categorizationBatchSize
) {
  const uncategorizedBatch = comments.slice(
    i,
    i + this.modelSettings.defaultModel.categorizationBatchSize
  );
  const categorizedBatch = await categorizeWithRetry(
    this.modelSettings.defaultModel,
    instructions,
    uncategorizedBatch,
    includeSubtopics,
    topics,
    additionalInstructions
  );
  categorized.push(...categorizedBatch);
}

Parallelizing this loop seems like the highest-level approach, with the least amount of work needed and the maximum return.

Of course, as you also pointed out, there's the question of whether Vertex will throttle the requests. Does Vertex offer an async caller which automatically respects its throttling limits? That would be neat :)
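To illustrate what I mean (with the caveat that I barely know TypeScript, and none of these names come from your library), a minimal sketch of a client-side limiter, in case Vertex doesn't provide one, could look like this: it just keeps at most a fixed number of batch calls in flight at a time.

// Hypothetical helper, not part of the library: runs promise-returning
// tasks with at most `limit` of them in flight at any moment.
async function runWithConcurrencyLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let nextIndex = 0;

  // Each worker repeatedly picks the next unstarted task and awaits it.
  async function worker(): Promise<void> {
    while (nextIndex < tasks.length) {
      const index = nextIndex++;
      results[index] = await tasks[index]();
    }
  }

  // Start at most `limit` workers and wait for them to drain the queue.
  const workers = Array.from(
    {length: Math.min(limit, tasks.length)},
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

Each mini-batch from the loop above would then become one entry in tasks, and limit would be set to whatever the Vertex quota allows.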

Thanks!

@dborkan

dborkan commented Jan 22, 2025

@jucor thanks for sharing and pointing out the relevant code locations. We're definitely interested in speeding this up, and are planning to implement parallelization, likely after our current sprint tackling hallucinations.

There are Vertex quota limits, so we'll need to do some rate limiting ourselves or find a helper library. We did recently create a helper function, resolvePromisesInParallel, for summarization; this could be a good starting point for categorization.
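Roughly, the idea would be to turn each batch into a deferred call and hand them all to the helper. The sketch below assumes resolvePromisesInParallel accepts an array of promise-returning functions plus a concurrency limit, which may not match its actual signature:

// Sketch only: the real helper's signature may differ.
const batchSize = this.modelSettings.defaultModel.categorizationBatchSize;
const batchCalls = [];
for (let i = 0; i < comments.length; i += batchSize) {
  const uncategorizedBatch = comments.slice(i, i + batchSize);
  // Wrap each batch in a thunk so the LLM call only starts when scheduled.
  batchCalls.push(() =>
    categorizeWithRetry(
      this.modelSettings.defaultModel,
      instructions,
      uncategorizedBatch,
      includeSubtopics,
      topics,
      additionalInstructions
    )
  );
}
// Run only a couple of batches at a time to stay within the Vertex quota.
const categorizedBatches = await resolvePromisesInParallel(batchCalls, 2);
const categorized = categorizedBatches.flat();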

@jucor
Collaborator Author

jucor commented Jan 27, 2025

Nice, thanks! I know very little about TypeScript (mostly a Python guy) and its Promise premise, so it's great to see the parallelization helper :)

@alyssachvasta
Collaborator

Hi @jucor

I just submitted this commit, which parallelizes this loop. For a test set of 300 comments it brought the categorization time down from 3.58 minutes to 1.4 minutes. It uses the resolvePromisesInParallel function that @dborkan mentioned, with the default of 2 parallel calls at once, which is what the Vertex models allow on the free tier.

@jucor
Collaborator Author

jucor commented Jan 29, 2025

Woohoo! Amazing! A 2.5x speedup! Thanks a lot team! 🚀🎉

@jucor
Collaborator Author

jucor commented Jan 29, 2025

Oh, and that's a super interesting observation in the comments of the commit:

// TODO: Consider the effects of smaller batch sizes. 1 comment per batch was much faster, but
// the distribution was significantly different from what we're currently seeing. More testing
// is needed to determine the ideal size and distribution.

I'm super excited to hear discussion of the distribution, and the factors that affect it! That's super linked to evals discussed in compdemocracy/polis#1866 and compdemocracy/polis#1878 !

Could you say a bit more about what you observed, @alyssachvasta, please? @akonya, is this dependence of the distribution on batch size something you have observed too in your own LLM experiments?

@alyssachvasta
Collaborator

With a batch size of one comment per LLM call, I found that the model was 50% more likely to categorize a comment under multiple topics/subtopics than when 100 comments were categorized at once. The current categorization behavior is quite good, so I didn't want to make that change. In the future I may come back to it with some additional prompt changes / other tweaks, but only if I can ensure the categorization behavior will stay the same.

@jucor
Collaborator Author

jucor commented Jan 29, 2025

It's great you've observed that behavior. For me that would be a reason to dig a little more into it, to double-check how robust the results are. In theory, if we were using a regular classifier, then conditional on the topics the classification of each comment should be independent of the others, and thus independent of the batch size and the batch content.
Here it seems the LLM doesn't quite do that.

The way I would suggest investigating the robustness (which would also quantify the "quite good" behaviour of the current categorization) would be, for a fixed set of comments and a fixed set of topics, to run the categorization for several batch sizes, several times for each batch size, and look even just at the histograms of the count of comments per topic (or per subtopic).
Visualizing these histograms in a faceted grid (batch sizes on one grid axis, replicates on the other) would give us a great at-a-glance visual assessment of stability, at least for the marginal distribution of categories.

It would also allow us to diagnose at a glance whether there is any change in categorization behaviour, and provide a useful tool for debugging if there is.
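Concretely, the experiment loop could look something like the rough sketch below, where categorizeComments and its options object are placeholders for whatever the library actually exposes, not its real API:

// Rough sketch of the robustness experiment; categorizeComments is a
// placeholder, not the library's actual API.
const batchSizes = [1, 10, 50, 100];
const replicates = 5;

for (const batchSize of batchSizes) {
  for (let rep = 0; rep < replicates; rep++) {
    // Same comments and same fixed topics every run; only the batch size
    // and the replicate index change.
    const categorized = await categorizeComments(comments, topics, {batchSize});

    // Tally comments per topic for this (batchSize, replicate) cell.
    const countsPerTopic = new Map<string, number>();
    for (const comment of categorized) {
      for (const topic of comment.topics) {
        countsPerTopic.set(topic.name, (countsPerTopic.get(topic.name) ?? 0) + 1);
      }
    }
    console.log(`batchSize=${batchSize} replicate=${rep}`, countsPerTopic);
  }
}

Each logged tally is one histogram, i.e. one cell of the faceted grid.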

What do you think?

@tevko

tevko commented Jan 30, 2025

For reference, our current benchmark on unparallelized code with 318 comments: https://github.com/compdemocracy/polis/actions/runs/12589605245/job/35089707403#step:9:752

At 17 minutes, a 2.5x speedup would still leave us over 6 minutes, which would work for a microservice that runs periodically but is still above acceptable thresholds for real-time users over HTTP. Just flagging that.

@akonya

akonya commented Jan 31, 2025

@jucor -- Yep, we've seen similar batch-dependent effects in our LLM tagger. As batch size increased -- i.e., more comments being topic-tagged per prompt -- we saw fewer tags per comment as well as some degradation in general tag accuracy.

We focus on two levers to optimize the quality and speed tradeoff: model size and batch size.

The optimum would obviously be the biggest/best model, with a batch size of 1, run fully in parallel. But bigger models from 3rd-party providers have more aggressive throttling limits (at least for us), so you get a tradeoff between speed and quality that is mediated by batch size.

Smaller models are way faster and have less aggressive throttling, but quality can be lower in general (and the degradation with batch size ramps up much more quickly). This creates different speed-quality Pareto curves for different models.

So you get the best possible speed-quality Pareto curve if you make model choice an optimization parameter.

Currently, a mid-size off-the-shelf model with batch size 10 seems to be an OK sweet spot. But this changes as new models are released and as we graduate to higher rate limits.

The batch-size degradation effects generally seem to kick in once you go above 10.

@jucor
Collaborator Author

jucor commented Jan 31, 2025

Amazing, thanks @akonya for sharing your experience. This is super valuable.
@alyssachvasta, do you think this is the kind of tuning Jigsaw would consider doing, to find the degradation point? Or maybe setting the default batch size a bit lower?
In a dream world, if we were super rigorous, we would have either some human-tagged conversations to evaluate on, or some human preference comparisons, to quantify where the degradation becomes perceptible.

@akonya

akonya commented Feb 1, 2025

We actually have a collection of human-tagged datasets we could use to do this, if you're interested in the super-rigorous approach. There are 5 different datasets, each with 300-500 statements, ground-truth topic and subtopic taxonomies, and per-statement tags done by experts.

@alyssachvasta
Collaborator

@akonya Can you share the dataset if it's public? I'd be interested in testing with it!

@jucor It's interesting to see the tradeoffs you've already observed. I'm looking into improving our evals for categorization generally.
