This study aims to investigate the potential of combining schedule optimization with in-context learning (ICL) to optimize LLM utilization.
- RQ1: What kinds of prediction techniques are helpful for evaluating the accuracy objective in the fitness function without actually submitting the queries to the LLM?
- RQ2: How do various search-based techniques perform in identifying optimal solutions for the prompt template allocation problem?
- RQ3: How generalizable are our findings across different LLMs and datasets?
- Original log data: LogPai benchmark
- LLM used: GPT-3.5 (from OpenAI)
- Prompt templates: Simple, Standard, Enhanced, 1-shot, 2-shot, and 4-shot
- Original text data: Overruling benchmark
- LLM used: J2-Mid (from AI21)
- Prompt templates: Simple, Standard, Enhanced, 1-shot, 2-shot, and 4-shot
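To make the allocation problem concrete, one candidate solution can be encoded as an integer vector that assigns one of the six prompt templates above to each query, and evaluated against two objectives (predicted accuracy and token cost). The sketch below is a minimal illustration of such an encoding; the per-template costs and the `predict_correct` stub are hypothetical placeholders rather than values from the study.

```python
# Minimal sketch of the prompt-template allocation encoding (assumed representation).
# Each query is assigned one of the six templates; the two objectives are the
# (predicted) accuracy to maximize and the token cost to minimize.
import random

TEMPLATES = ["Simple", "Standard", "Enhanced", "1-shot", "2-shot", "4-shot"]
# Hypothetical per-template token costs; real costs depend on the prompt contents.
TEMPLATE_COST = {"Simple": 50, "Standard": 120, "Enhanced": 200,
                 "1-shot": 300, "2-shot": 500, "4-shot": 900}

def predict_correct(query: str, template: str) -> float:
    """Placeholder surrogate: probability the LLM answers `query` correctly
    under `template`; to be replaced by a trained predictor (see RQ1)."""
    return random.random()

def evaluate(allocation: list[int], queries: list[str]) -> tuple[float, float]:
    """Return (expected accuracy, total cost) of one candidate allocation."""
    names = [TEMPLATES[i] for i in allocation]
    acc = sum(predict_correct(q, t) for q, t in zip(queries, names)) / len(queries)
    cost = sum(TEMPLATE_COST[t] for t in names)
    return acc, cost

queries = ["parse this log line ...", "classify this sentence ..."]
candidate = [random.randrange(len(TEMPLATES)) for _ in queries]
print(evaluate(candidate, queries))
```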
Impact of Training Data Size on Prediction Accuracy and Label Collection Cost
- Prediction accuracy generally increases with larger label sizes for both SCE and MLBP across all systems
- However, the gains in accuracy diminish once the training size reaches a certain point (e.g., the 30-50% range)
- Label collection costs increase significantly with label size, roughly doubling from 1% to 5%, from 5% to 10%, and so on
Considering the trade-off between accuracy improvement and label collection cost, we use 30% of the training data for prediction accuracy evaluation in this study.
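To illustrate the prediction step, the sketch below trains a simple surrogate classifier on a 30% labeled subset and estimates, for the remaining queries, the probability that a given prompt template would yield a correct answer, without further LLM calls. This is only an illustrative stand-in (assuming scikit-learn, TF-IDF features, and logistic regression), not the study's SCE or MLBP implementation.

```python
# Illustrative surrogate predictor (not the study's SCE/MLBP implementation):
# label 30% of the training queries by actually submitting them to the LLM,
# then predict correctness for the remaining queries without further LLM calls.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

queries = ["log line A ...", "log line B ...", "sentence C ...", "sentence D ..."] * 25
labels = [1, 0, 1, 1] * 25  # 1 = LLM answered correctly under a given template

# Keep 30% as the labeled portion (the label-collection budget discussed above).
labeled_q, unlabeled_q, labeled_y, _ = train_test_split(
    queries, labels, train_size=0.3, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labeled_q, labeled_y)

# Estimated per-query probability of a correct answer, usable in the fitness function.
predicted_acc = model.predict_proba(unlabeled_q)[:, 1].mean()
print(f"Predicted accuracy on unlabeled queries: {predicted_acc:.3f}")
```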
In our study, we present the average performance across all instances based on several metrics: the highest accuracy achieved, the inverted generational distance (IGD), and the delta (Δ) metric.
However, it is also important to examine the performance on individual instances, as the characteristics of the data may influence the outcomes. Therefore, we provide comparisons for each algorithm using different fitness functions on 16 separate instances, covering accuracy, IGD, and delta (Δ).
Comparisons of NSGA-II with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of SPEA2 with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of MOPSO with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of MOACO with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of MOEA/D with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of RNSGA-II with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
Comparisons of MOEA/D-GEN with Different Fitness Functions in terms of Accuracy, IGD, and Delta (Δ)
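For reference, the IGD values reported in the comparisons above can be computed from a reference Pareto front and an obtained solution set, e.g., with the IGD indicator shipped with Pymoo (which we already use for the algorithms below). The fronts in this sketch are made-up illustrative points, not results from the study.

```python
# Sketch of computing the IGD quality indicator with Pymoo (illustrative points only).
import numpy as np
from pymoo.indicators.igd import IGD

# Reference (approximate true) Pareto front: objectives are (1 - accuracy, cost).
reference_front = np.array([[0.10, 100.0], [0.20, 60.0], [0.35, 30.0]])

# Non-dominated set returned by one algorithm run.
obtained_front = np.array([[0.12, 110.0], [0.22, 65.0], [0.40, 40.0]])

igd = IGD(reference_front)
print("IGD:", igd(obtained_front))  # lower is better
```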
$ python exp.py
We used the standard implementations of NSGA-II and R-NSGA-II from the Pymoo library; MOACO, MOPSO, MOEA/D, and MOEA/D-GEN from the Pygmo library; and SPEA2 from the jMetalPy library.
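As a minimal sketch of how the Pymoo NSGA-II baseline could be instantiated for a bi-objective allocation problem (minimize predicted error, minimize cost), with one decision variable per query rounded to a template index: the problem sizes, bounds, cost/error arrays, and population settings below are illustrative assumptions, not the exact configuration used in exp.py.

```python
# Illustrative NSGA-II setup with Pymoo (not the exact configuration in exp.py).
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

N_QUERIES, N_TEMPLATES = 20, 6                           # hypothetical sizes
COST = np.array([50, 120, 200, 300, 500, 900])           # hypothetical per-template costs
ERROR = np.array([0.40, 0.30, 0.25, 0.20, 0.15, 0.10])   # placeholder predicted error rates

class AllocationProblem(ElementwiseProblem):
    def __init__(self):
        # One real-coded variable per query, rounded to a template index in [0, 5].
        super().__init__(n_var=N_QUERIES, n_obj=2, xl=0.0, xu=N_TEMPLATES - 1.0)

    def _evaluate(self, x, out, *args, **kwargs):
        idx = np.round(x).astype(int)
        # Objective 1: mean predicted error; Objective 2: total token cost.
        out["F"] = [ERROR[idx].mean(), COST[idx].sum()]

res = minimize(AllocationProblem(), NSGA2(pop_size=50), ("n_gen", 100), seed=1, verbose=False)
print(res.F[:5])  # first few points of the obtained non-dominated front
```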
The source code of the baselines is available under the llm_prompts_allocation/baselines directory.