Study on Large Language Model Performance

This study investigates the potential of combining schedule optimization with in-context learning (ICL) to optimize LLM utilization.

1. Study Design


2. Research Questions

  • RQ1: What kinds of prediction techniques help evaluate the accuracy objective in the fitness function without actually submitting the queries to the LLM? (A minimal surrogate-prediction sketch follows this list.)

  • RQ2: How do various search-based techniques perform in identifying optimal solutions for the prompt template allocation problem?

  • RQ3: How generalizable are our findings across different LLMs and datasets?
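
To make RQ1 concrete, the sketch below illustrates one possible surrogate predictor: a lightweight classifier trained on a small labeled sample to estimate whether a (query, prompt template) pair would be answered correctly, so the accuracy objective can be evaluated without calling the LLM. The features, model choice (scikit-learn TF-IDF plus logistic regression), and toy data are illustrative assumptions, not the prediction methods evaluated in the study.

```python
# Hypothetical surrogate predictor for RQ1: estimate, without calling the LLM,
# whether a (query, prompt template) pair would be answered correctly.
# The features, model, and toy data below are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled sample: query text tagged with the template used, plus whether
# the LLM's answer turned out to be correct (1) or incorrect (0).
train_texts = [
    "parse log: Connection closed by remote host [template=1-shot]",
    "parse log: Failed password for invalid user guest [template=Simple]",
    "parse log: session opened for user root [template=4-shot]",
    "parse log: Invalid block size reported [template=Standard]",
]
train_correct = [1, 0, 1, 0]

surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression())
surrogate.fit(train_texts, train_correct)

# The predicted probability of correctness stands in for the accuracy objective
# of the fitness function during the search.
new_pair = ["parse log: Connection reset by peer [template=2-shot]"]
print(surrogate.predict_proba(new_pair)[:, 1])
```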

3. Benchmark Datasets

3.1 Log Parsing

  • Original log data: LogPai benchmark
  • LLM interacted with: GPT-3.5 (from OpenAI)
  • Prompt templates: Simple, Standard, Enhanced, 1-shot, 2-shot, and 4-shot

3.2 Text Classification

  • Original text data: Overruling benchmark
  • LLM interacted with: J2-Mid (from AI21)
  • Prompt templates: Simple, Standard, Enhanced, 1-shot, 2-shot, and 4-shot (a toy allocation sketch follows this list)
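
For both datasets, the search assigns one of the six prompt templates above to each query. The toy sketch below illustrates one plausible solution encoding and objective evaluation (average predicted accuracy versus total token cost). The template costs and accuracies are invented for illustration, and the two-objective formulation is an assumption, not a quote of the study's exact fitness function.

```python
# Toy sketch (not the repository's actual encoding): a candidate solution assigns
# one of the six prompt templates to each query; the objectives assumed here are
# average predicted accuracy (to maximize) and total token cost (to minimize).
TEMPLATES = ["Simple", "Standard", "Enhanced", "1-shot", "2-shot", "4-shot"]

# Hypothetical per-template token costs for one query.
TOKEN_COST = {"Simple": 50, "Standard": 80, "Enhanced": 120,
              "1-shot": 160, "2-shot": 260, "4-shot": 450}

def evaluate(allocation, predicted_acc):
    """allocation[q] is the template index chosen for query q;
    predicted_acc[q][t] is the surrogate-predicted accuracy of template t on query q."""
    acc = sum(predicted_acc[q][t] for q, t in enumerate(allocation)) / len(allocation)
    cost = sum(TOKEN_COST[TEMPLATES[t]] for t in allocation)
    return acc, cost

# Three queries with made-up predicted accuracies for the six templates.
predicted_acc = [[0.50, 0.60, 0.70, 0.80, 0.85, 0.90]] * 3
print(evaluate([0, 3, 5], predicted_acc))  # -> (0.7333..., 660)
```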

4. Additional Results

4.1 Training Data Size Impact on Prediction Accuracy and Label Collection Cost


Impact of Training Data Size on Prediction Accuracy and Label Collection Cost

  • Prediction accuracy generally increases with larger label sizes for both SCE and MLBP across all systems
  • However, the gains in accuracy diminish once the training size reaches a certain point (e.g., the 30-50% range)
  • Label collection costs increase significantly with label size, roughly doubling at each step (e.g., from 1% to 5%, from 5% to 10%, and so on)

Considering the trade-off between accuracy improvement and label collection cost, we use 30% of the training data for prediction accuracy evaluation in this study.

4.2 Completed Comparisons

In our study, we present the average performance across all instances based on several metrics: the highest accuracy achieved, the inverted generational distance (IGD), the delta ($\Delta$) metric, and the count of non-dominated solutions ($M_n$) obtained by combining all algorithms with three different prediction methods.

However, it's important to examine the performance on individual instances, as the characteristics of the data may influence the outcomes. Therefore, we provide comparisons for each algorithm using different fitness functions on 16 separate instances. These comparisons include the accuracy, IGD, delta ($\Delta$), and the number of non-dominated solutions ($M_n$) specific to each instance.
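
For reference, the sketch below shows one way two of these indicators could be computed for a single instance, assuming a pymoo 0.6-style indicator API. The reference and obtained fronts are placeholder (1 - accuracy, cost) vectors, not results from the study.

```python
# Hedged sketch of computing two of the reported indicators for one instance,
# assuming a pymoo 0.6-style indicator API. The reference and obtained fronts are
# placeholder (1 - accuracy, cost) vectors, not results from the study.
import numpy as np
from pymoo.indicators.igd import IGD

reference_front = np.array([[0.10, 300.0], [0.15, 220.0], [0.25, 150.0]])
obtained_front = np.array([[0.12, 310.0], [0.20, 180.0]])

# IGD: average distance from each reference point to its closest obtained point.
print("IGD:", IGD(reference_front)(obtained_front))

# M_n: number of non-dominated solutions in the obtained front (both objectives minimized).
def non_dominated_count(points):
    dominated = [np.any(np.all(points <= p, axis=1) & np.any(points < p, axis=1))
                 for p in points]
    return int(np.sum(~np.array(dominated)))

print("M_n:", non_dominated_count(obtained_front))
```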


Comparisons of NSGA-II with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of SPEA2 with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of MOPSO with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of MOACO with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of MOEA/D with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of RNSGA-II with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$


Comparisons of MOEA/D-GEN with Different Fitness Functions in terms of Accuracy, IGD, $\Delta$, and $M_n$

5. Implementation

$ python exp.py

We used the standard implementations of NSGA-II and R-NSGA-II from the Pymoo library; MOACO, MOPSO, MOEA/D, and MOEA/D-GEN from the Pygmo library; and SPEA2 from the jMetalPy library. The source code of the baselines is available under the llm_prompts_allocation/baselines directory.
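
For orientation, the following is a minimal sketch of how one of these baselines (NSGA-II via Pymoo) could be wired to a toy allocation problem. It is not the code in exp.py, and all problem data are placeholders.

```python
# Minimal, self-contained sketch of running Pymoo's NSGA-II on a toy version of the
# prompt template allocation problem. All data (predicted accuracies, token costs,
# problem size) are invented placeholders; see exp.py and
# llm_prompts_allocation/baselines for the actual experimental setup.
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

N_QUERIES, N_TEMPLATES = 20, 6
rng = np.random.default_rng(0)
pred_acc = rng.uniform(0.4, 0.95, size=(N_QUERIES, N_TEMPLATES))  # surrogate predictions
token_cost = np.array([50, 80, 120, 160, 260, 450], dtype=float)  # per-template cost

class PromptAllocation(ElementwiseProblem):
    def __init__(self):
        # One decision variable per query; its rounded value selects a template.
        super().__init__(n_var=N_QUERIES, n_obj=2, xl=0, xu=N_TEMPLATES - 1)

    def _evaluate(self, x, out, *args, **kwargs):
        t = np.clip(np.round(x).astype(int), 0, N_TEMPLATES - 1)
        acc = pred_acc[np.arange(N_QUERIES), t].mean()
        cost = token_cost[t].sum()
        out["F"] = [1.0 - acc, cost]  # both objectives minimized

res = minimize(PromptAllocation(), NSGA2(pop_size=40), ("n_gen", 50), seed=1, verbose=False)
print(res.F[:5])  # a few non-dominated (1 - accuracy, cost) trade-offs
```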
