Speed-up calls to LLM by parallelization of the topic categorization #11

Closed

jucor opened this issue Jan 21, 2025 · 12 comments

@jucor
Collaborator

jucor commented Jan 21, 2025

Dear Jigsaw team

As discussed by email, it would be really helpful if the library could run faster. The topic learning is already very fast; it's the categorization that could do with being faster. In our IRL conversations I remember you mentioned that you definitely had this in mind, so I'm just adding it here to follow up.

Looking at the categorization code, I suspect you were probably thinking of parallelizing the calls across the mini-batches, i.e. this loop:

for (
  let i = 0;
  i < comments.length;
  i += this.modelSettings.defaultModel.categorizationBatchSize
) {
  const uncategorizedBatch = comments.slice(
    i,
    i + this.modelSettings.defaultModel.categorizationBatchSize
  );
  const categorizedBatch = await categorizeWithRetry(
    this.modelSettings.defaultModel,
    instructions,
    uncategorizedBatch,
    includeSubtopics,
    topics,
    additionalInstructions
  );
  categorized.push(...categorizedBatch);
}

Parallelizing this loop seems like the highest-level approach, with the least amount of work needed and the maximum return.

Of course, as you also pointed out, there's the question of whether Vertex will throttle the requests. Does Vertex offer an async caller which automatically respects its throttling limits? That would be neat :)
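To illustrate what I mean (with the caveat that I barely know TypeScript, and none of these names come from your library), a minimal sketch of a client-side limiter, in case Vertex doesn't provide one, could look like this: it just keeps at most a fixed number of batch calls in flight at a time.

// Hypothetical helper, not part of the library: runs promise-returning
// tasks with at most `limit` of them in flight at any moment.
async function runWithConcurrencyLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let nextIndex = 0;

  // Each worker repeatedly picks the next unstarted task and awaits it.
  async function worker(): Promise<void> {
    while (nextIndex < tasks.length) {
      const index = nextIndex++;
      results[index] = await tasks[index]();
    }
  }

  // Start at most `limit` workers and wait for them to drain the queue.
  const workers = Array.from(
    {length: Math.min(limit, tasks.length)},
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

Each mini-batch from the loop above would then become one entry in tasks, and limit would be set to whatever the Vertex quota allows.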

Thanks!

@dborkan

dborkan commented Jan 22, 2025

@jucor thanks for sharing and pointing out the relevant code locations. We're definitely interested in speeding this up, and are planning to implement parallelization, likely after our current sprint tackling hallucinations.

There are Vertex quota limits, so we'll need to do some rate limiting ourselves or find a helper library. We did recently create a helper function, resolvePromisesInParallel, for summarization; this could be a good starting point for categorization.
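Roughly, the idea would be to turn each batch into a deferred call and hand them all to the helper. The sketch below assumes resolvePromisesInParallel accepts an array of promise-returning functions plus a concurrency limit, which may not match its actual signature:

// Sketch only: the real helper's signature may differ.
const batchSize = this.modelSettings.defaultModel.categorizationBatchSize;
const batchCalls = [];
for (let i = 0; i < comments.length; i += batchSize) {
  const uncategorizedBatch = comments.slice(i, i + batchSize);
  // Wrap each batch in a thunk so the LLM call only starts when scheduled.
  batchCalls.push(() =>
    categorizeWithRetry(
      this.modelSettings.defaultModel,
      instructions,
      uncategorizedBatch,
      includeSubtopics,
      topics,
      additionalInstructions
    )
  );
}
// Run only a couple of batches at a time to stay within the Vertex quota.
const categorizedBatches = await resolvePromisesInParallel(batchCalls, 2);
const categorized = categorizedBatches.flat();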

@jucor
Collaborator Author

jucor commented Jan 27, 2025

Nice, thanks! I know very little about TypeScript (mostly a Python guy) and its Promise premise, so it's great to see the parallelization helper :)

@alyssachvasta
Collaborator

Hi @jucor

I just submitted this commit, which parallelizes this loop. For a test set of 300 comments it brought the categorization time down from 3.58 minutes to 1.4 minutes. It uses the resolvePromisesInParallel function that @dborkan mentioned, with the default of 2 parallel calls at once, which is what the Vertex models allow on the free tier.

@jucor
Collaborator Author

jucor commented Jan 29, 2025

Woohoo! Amazing! A 2.5x speedup! Thanks a lot team! 🚀🎉

@jucor
Collaborator Author

jucor commented Jan 29, 2025

Oh, and that's a super interesting observation in the comments of the commit:

// TODO: Consider the effects of smaller batch sizes. 1 comment per batch was much faster, but
// the distribution was significantly different from what we're currently seeing. More testing
// is needed to determine the ideal size and distribution.

I'm super excited to hear discussion of the distribution, and the factors that affect it! That's super linked to evals discussed in compdemocracy/polis#1866 and compdemocracy/polis#1878 !

Could you say a bit more about what you observed, @alyssachvasta, please? @akonya, is this dependence of the distribution on batch size something you have observed too in your own LLM experiments?

@alyssachvasta
Collaborator

With a batch size of one comment per LLM call, I found that the model was 50% more likely to categorize a comment under multiple topics/subtopics than when 100 comments were categorized at once. The current categorization behavior is quite good, so I didn't want to make that change. In the future I may come back to it with some additional prompt changes / other tweaks, but only if I can ensure the categorization behavior will stay the same.

@jucor
Collaborator Author

jucor commented Jan 29, 2025

It's great you've observed that behavior. For me that would be a reason to dig a little more into it, to double-check how robust the results are. In theory, if we were using a regular classifier, then conditional on the topics the classification of each comment should be independent of the others, and thus independent of the batch size and the batch content.
Here it seems the LLM doesn't quite do that.

The way I would suggest investigating the robustness (which would also quantify the "quite good" behaviour of the current categorization) would be, for a fixed set of comments and a fixed set of topics, to run the categorization for several batch sizes, several times for each batch size, and look even just at the histograms of the count of comments per topic (or per subtopic).
Visualizing these histograms in a faceted grid (batch sizes on one grid axis, replicates on the other) would give us a great at-a-glance visual assessment of stability, at least for the marginal distribution of categories.

It would also allow us to diagnose at a glance whether there is any change in categorization behaviour, and provide a useful tool for debugging if there is.
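Concretely, the experiment loop could look something like the rough sketch below, where categorizeComments and its options object are placeholders for whatever the library actually exposes, not its real API:

// Rough sketch of the robustness experiment; categorizeComments is a
// placeholder, not the library's actual API.
const batchSizes = [1, 10, 50, 100];
const replicates = 5;

for (const batchSize of batchSizes) {
  for (let rep = 0; rep < replicates; rep++) {
    // Same comments and same fixed topics every run; only the batch size
    // and the replicate index change.
    const categorized = await categorizeComments(comments, topics, {batchSize});

    // Tally comments per topic for this (batchSize, replicate) cell.
    const countsPerTopic = new Map<string, number>();
    for (const comment of categorized) {
      for (const topic of comment.topics) {
        countsPerTopic.set(topic.name, (countsPerTopic.get(topic.name) ?? 0) + 1);
      }
    }
    console.log(`batchSize=${batchSize} replicate=${rep}`, countsPerTopic);
  }
}

Each logged tally is one histogram, i.e. one cell of the faceted grid.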

What do you think?

@tevko

tevko commented Jan 30, 2025

For reference, our current benchmark on unparallelized code with 318 comments: https://github.com/compdemocracy/polis/actions/runs/12589605245/job/35089707403#step:9:752

At 17 minutes, a 2.5x speedup would still leave us over 6 minutes, which would work for a microservice that runs periodically but is still above acceptable thresholds for real-time users over HTTP. Just flagging that.

@akonya

akonya commented Jan 31, 2025

@jucor -- Yep, we've seen similar batch-dependent effects in our LLM tagger. As batch size increased -- i.e., more comments being topic-tagged per prompt -- we saw fewer tags per comment as well as some degradation in general tag accuracy.

We focus on two levers to optimize the quality and speed tradeoff: model size and batch size.

The optimum would obviously be the biggest/best model, with a batch size of 1, run fully in parallel. But bigger models from 3rd-party providers have more aggressive throttling limits (at least for us), so you get a tradeoff between speed and quality that is mediated by batch size.

Smaller models are way faster and have less aggressive throttling, but quality can be lower in general (and the degradation with batch size ramps up much more quickly). This creates different speed-quality Pareto curves for different models.

So you get the best possible speed-quality Pareto curve if you make model choice an optimization parameter.

Currently, a mid-size off-the-shelf model with batch size 10 seems to be an OK sweet spot. But this changes as new models are released and as we graduate to higher rate limits.

The batch-size degradation effects generally seem to kick in once you go above 10.

@jucor
Collaborator Author

jucor commented Jan 31, 2025

Amazing, thanks @akonya for sharing your experience. This is super valuable.
@alyssachvasta, do you think this is the kind of tuning Jigsaw would consider doing, to find the degradation point? Or maybe setting the default batch size a bit lower?
In a dream world, if we were super rigorous, we would have either some human-tagged conversations to evaluate on, or some human preference comparisons, to quantify where the degradation becomes perceptible.

@akonya

akonya commented Feb 1, 2025

We actually have a collection of human-tagged datasets we could use to do this, if you're interested in the super-rigorous approach. There are 5 different datasets, each with 300-500 statements, ground-truth topic and subtopic taxonomies, and per-statement tags done by experts.

@alyssachvasta
Collaborator

@akonya Can you share the dataset if it's public? I'd be interested in testing with it!

@jucor It's interesting to see the tradeoffs you've already observed. I'm looking into improving our evals for categorization generally.
