\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{float}
\usepackage{listings}
\usepackage{color}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{deepred}{rgb}{0.6,0,0}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{
basicstyle=\ttfamily\small,
keywordstyle=\color{blue},
emph={Validator},
emphstyle=\bfseries\color{mauve},
emph={[2]name,message,predicate,minimum_success_percentage},
emphstyle={[2]\color{deepred}},
stringstyle=\color{dkgreen},
aboveskip=2mm,
belowskip=2mm,
breaklines=true,
columns=fullflexible,
frame=single,
language=Python,
showstringspaces=false
}
\title{Reliability Testing for LLM-Based Systems}
\author{Robert Cunningham}
\date{August 30th, 2024}
\begin{document}
\maketitle
\section{Introduction}
As large language models (LLMs) become increasingly integrated into various applications, ensuring their reliability is critical. These systems often take multiple inputs and produce corresponding outputs, each of which must adhere to specific guidelines or criteria. Assessing the reliability of such systems is essential for maintaining trust, safety, and effectiveness. This white paper introduces a framework for conducting reliability tests on LLM-based systems. The framework utilizes validators and verifiers to evaluate the system's behavior across multiple dimensions, providing a comprehensive assessment of its performance.
\pagebreak
\section{Key Concepts in Reliability Testing}
\subsection{Validators: Ensuring Consistent Behavior}
Validators are the foundational elements of the reliability testing framework. They are designed to measure how reliably the system adheres to specific instructions or behaviors. For instance, consider a scenario where an LLM is instructed not to use contractions like ``isn't,'' ``doesn't,'' or ``can't.'' A validator can be implemented to assess how well the model follows this rule.
\vspace{1em}
\textbf{Example Validator:}
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
Validator(
name = "contraction_validator",
message = "Output contains too many contractions",
predicate = lambda o: o.count("'") <= 3,
minimum_success_percentage = 0.95
)
\end{lstlisting}
\end{minipage}
\end{center}
Each validator operates on a collection of outputs generated by the system, determining a success percentage that reflects the proportion of outputs meeting the specified criterion. Sometimes validators can have conditional predicates that rely on the input as well:
\vspace{1em}
\textbf{Conditional Validator:}
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
Validator(
name = "politeness_validator",
message = "System seems to have forgotten its manners",
predicate = lambda i, o:
"You're welcome" in o"You're welcome" in o
if "Thank you" in i
else True,
minimum_success_percentage = 0.90
)
\end{lstlisting}
\end{minipage}
\end{center}
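To make these examples concrete, the following is a minimal sketch of what such a \texttt{Validator} might look like in Python. The field names mirror the examples above; the \texttt{success\_percentage} and \texttt{passes} methods are hypothetical additions for evaluating a collection of outputs.
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
from dataclasses import dataclass
from inspect import signature
from typing import Callable, Sequence

@dataclass
class Validator:
    name: str
    message: str
    predicate: Callable[..., bool]
    minimum_success_percentage: float

    def success_percentage(self, inputs: Sequence[str],
                           outputs: Sequence[str]) -> float:
        # Support output-only and (input, output) predicates.
        takes_input = len(signature(self.predicate).parameters) == 2
        results = [self.predicate(i, o) if takes_input else self.predicate(o)
                   for i, o in zip(inputs, outputs)]
        return sum(results) / len(results)

    def passes(self, inputs: Sequence[str], outputs: Sequence[str]) -> bool:
        return (self.success_percentage(inputs, outputs)
                >= self.minimum_success_percentage)
\end{lstlisting}
\end{minipage}
\end{center}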
\subsection{Running Binomial Experiments with Validators}
Binomial experiments are used to quantify the reliability of the system as determined by a validator. In a Continuous Alignment Testing (CAT) environment, each validator has a minimum success percentage threshold, and the outcome of the binomial experiment is compared against this threshold to determine whether the system's behavior is reliable. There are two primary methods for conducting these experiments, and they can be combined into a third.
\subsubsection{Varying Inputs, Single Output}
The system generates a single output for each varied input, and the validator assesses the entire collection of outputs.
\begin{equation*}
\begin{aligned}
&\text{input}_1 \rightarrow \text{output}_{11} \\
&\text{input}_2 \rightarrow \text{output}_{12} \\
&\vdots \\
&\text{input}_N \rightarrow \text{output}_{1N}
\end{aligned}
\end{equation*}
\begin{equation*}
\text{Validator}: (\text{output}_{11}, \text{output}_{12}, \ldots, \text{output}_{1N}) \rightarrow \text{``success percentage''}
\end{equation*}
\subsubsection{Fixed Input, Multiple Outputs}
A single input is used to generate multiple outputs, and the validator assesses this set of outputs.
\begin{equation*}
\begin{aligned}
&\text{input}_J \rightarrow \text{output}_{1J}, \text{output}_{2J}, \text{output}_{3J}, \ldots, \text{output}_{NJ}
\end{aligned}
\end{equation*}
\begin{equation*}
\text{Validator}: (\text{output}_{1J}, \text{output}_{2J}, \ldots, \text{output}_{NJ}) \rightarrow \text{``success percentage''}
\end{equation*}
\subsubsection{Varying Inputs, Multiple Outputs}
In this approach, the system generates multiple outputs for each varied input, resulting in a comprehensive set of \( N^2 \) outputs. This method enables a thorough examination of the system's behavior across a wide range of scenarios, capturing both the variability in inputs and the stochastic nature of output generation.
\begin{equation*}
\begin{aligned}
\text{input}_1 &\rightarrow \text{output}_{11}, \text{output}_{12}, \text{output}_{13}, \ldots, \text{output}_{1N} \\
\text{input}_2 &\rightarrow \text{output}_{21}, \text{output}_{22}, \text{output}_{23}, \ldots, \text{output}_{2N} \\
\text{input}_3 &\rightarrow \text{output}_{31}, \text{output}_{32}, \text{output}_{33}, \ldots, \text{output}_{3N} \\
&\vdots \\
\text{input}_N &\rightarrow \text{output}_{N1}, \text{output}_{N2}, \text{output}_{N3}, \ldots, \text{output}_{NN}
\end{aligned}
\end{equation*}
Due to the significant number of outputs, an effective validation strategy is crucial to efficiently assess the system's reliability. There are three primary methods for applying validators in this context:
\paragraph{Validating Rows}
Validating rows involves assessing all outputs generated from a single input. For each input \( i \), the validator is applied to the set of outputs \( \{\text{output}_{i1}, \text{output}_{i2}, \ldots, \text{output}_{iN}\} \).
\begin{equation*}
\begin{split}
\text{Validator for input } i: (\text{output}_{i1}, \text{output}_{i2}, \ldots, \text{output}_{iN}) \\
\rightarrow \text{``Success Percentage for Input } i\text{''}
\end{split}
\end{equation*}
This method evaluates the system's consistency and reliability in handling a specific input across multiple output variations. It helps identify inputs for which the system consistently performs well or poorly, highlighting potential input-specific issues.
\paragraph{Validating Columns}
Validating columns focuses on outputs generated across different inputs under the same conditions or iterations. For each output index \( j \), the validator is applied to the set
\( \{\text{output}_{1j}, \text{output}_{2j}, \ldots, \text{output}_{Nj}\} \).
\begin{equation*}
\begin{split}
\text{Validator for iteration } j: (\text{output}_{1j}, \text{output}_{2j}, \ldots, \text{output}_{Nj}) \\
\rightarrow \text{``Success Percentage for Iteration } j\text{''}
\end{split}
\end{equation*}
This approach assesses the system's performance across a variety of inputs for a particular output generation instance. It can reveal systemic issues that affect all inputs under certain generation conditions, such as biases introduced by specific random seeds or sampling methods.
\paragraph{Validating All \( N^2 \) Outputs}
Validating all \( N^2 \) outputs involves applying the validator to every output individually and aggregating the results.
\begin{equation*}
\begin{split}
\text{Validator}: (\text{output}_{11}, \text{output}_{12}, \ldots, \text{output}_{NN}) \\ \rightarrow \text{``Overall Success Percentage''}
\end{split}
\end{equation*}
This comprehensive method provides a holistic view of the system's reliability across all inputs and outputs. While thorough, it may be resource-intensive, necessitating strategies like parallel processing or intelligent sampling to remain practical.
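As a sketch of all three strategies for a single validator with an output-only predicate, assuming a hypothetical \texttt{generate(input, j)} call that produces output \( j \) for a given input:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
from typing import Callable, List, Sequence, Tuple

def grid_success_percentages(
    inputs: Sequence[str],
    generate: Callable[[str, int], str],
    predicate: Callable[[str], bool],
    n_outputs: int,
) -> Tuple[List[float], List[float], float]:
    # grid[i][j] is True when output j for input i passes.
    grid = [[predicate(generate(inp, j)) for j in range(n_outputs)]
            for inp in inputs]
    # Validating rows: success percentage per input.
    rows = [sum(row) / n_outputs for row in grid]
    # Validating columns: success percentage per iteration.
    cols = [sum(grid[i][j] for i in range(len(inputs))) / len(inputs)
            for j in range(n_outputs)]
    # Validating all N*N outputs at once.
    overall = sum(map(sum, grid)) / (len(inputs) * n_outputs)
    return rows, cols, overall
\end{lstlisting}
\end{minipage}
\end{center}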
\pagebreak
\section{Scaling Reliability Testing with Multiple Validators}
\subsection{Understanding the Role of Multiple Validators}
In real-world applications, it is often necessary to assess the reliability of a system across multiple dimensions simultaneously. This requires deploying multiple validators, each designed to measure a specific aspect of the system's behavior. In this section, we extend the framework discussed earlier to accommodate the use of \( K \) validators. This approach allows for a more comprehensive evaluation of the system's reliability, as it accounts for the diverse requirements and constraints that a system may need to satisfy.
\paragraph{Example Use Case:}
Consider a content generation system where the LLM must adhere to the following rules, each represented by a corresponding validator:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
Validator(
name = "contraction_validator",
message = "Output contains too many contractions",
predicate = lambda o: o.count("'") <= 3,
minimum_success_percentage = 0.95
)
Validator(
name = "factual_accuracy_validator",
message = "Output contains factual inaccuracies",
predicate = lambda o: is_factually_correct(o),
minimum_success_percentage = 0.98
)
Validator(
name = "ethical_compliance_validator",
message = "Output contains unethical content",
predicate = lambda o: is_ethical(o),
minimum_success_percentage = 0.99
)
Validator(
name = "tone_consistency_validator",
message = "Output tone is inconsistent",
predicate = lambda o: is_tone_consistent(o),
minimum_success_percentage = 0.97
)
\end{lstlisting}
\end{minipage}
\end{center}
\begin{enumerate}
\item No Contractions: Avoid using contractions in the output.
\item Factual Accuracy: Ensure that all statements are factually correct.
\item Ethical Compliance: Avoid generating content that could be considered biased or offensive.
\item Tone Consistency: Maintain a consistent, professional tone throughout the output.
\end{enumerate}
\subsection{Running Binomial Experiments with Multiple Validators}
When running binomial experiments with multiple validators, the process can be scaled to evaluate the system's output against each validator independently. The success percentage for each validator is computed based on how well the outputs satisfy the corresponding criterion. The overall reliability of the system is then assessed by combining the results of all validators.
\subsubsection{Method 1: Varying Inputs with Multiple Validators}
In this method, we vary the inputs, generate outputs for each input, and then apply all \( K \) validators to the resulting set of outputs. This approach allows us to assess the system's performance across different scenarios.
\begin{equation*}
\begin{aligned}
&\text{input}_1 \rightarrow \text{output}_{11} \\
&\text{input}_2 \rightarrow \text{output}_{12} \\
&\vdots \\
&\text{input}_N \rightarrow \text{output}_{1N}
\end{aligned}
\end{equation*}
\begin{equation*}
\begin{aligned}
&\text{Validator}_1: (\text{output}_{11}, \text{output}_{12}, \ldots, \text{output}_{1N}) \rightarrow \text{``success percentage''}_1 \\
&\text{Validator}_2: (\text{output}_{11}, \text{output}_{12}, \ldots, \text{output}_{1N}) \rightarrow \text{``success percentage''}_2 \\
&\vdots \\
&\text{Validator}_K: (\text{output}_{11}, \text{output}_{12}, \ldots, \text{output}_{1N}) \rightarrow \text{``success percentage''}_K
\end{aligned}
\end{equation*}
\subsubsection{Method 2: Fixed Input with Multiple Validators}
In this method, we hold a single input constant and generate multiple outputs for that input. Each validator is then applied to the set of outputs. This method is particularly useful for assessing the consistency of the system’s behavior when responding to a single prompt.
\begin{equation*}
\text{input}_J \rightarrow \text{output}_{1J}, \text{output}_{2J}, \text{output}_{3J}, \ldots, \text{output}_{NJ}
\end{equation*}
\begin{equation*}
\begin{aligned}
&\text{Validator}_1: (\text{output}_{1J}, \text{output}_{2J}, \ldots, \text{output}_{NJ}) \rightarrow \text{``success percentage''}_1 \\
&\text{Validator}_2: (\text{output}_{1J}, \text{output}_{2J}, \ldots, \text{output}_{NJ}) \rightarrow \text{``success percentage''}_2 \\
&\vdots \\
&\text{Validator}_K: (\text{output}_{1J}, \text{output}_{2J}, \ldots, \text{output}_{NJ}) \rightarrow \text{``success percentage''}_K
\end{aligned}
\end{equation*}
\subsection{Aggregating Results from Multiple Validators}
After obtaining the success percentages from all \( K \) validators, the next step is to aggregate these results to form a comprehensive view of the system's reliability. There are several ways to approach this aggregation, depending on the specific requirements of the system and the relative importance of each validator.
\subsubsection{Simple Average Method}
One straightforward approach is to calculate the simple average of the success percentages across all validators. This method treats each validator equally, providing a general measure of the system’s overall reliability.
\begin{equation*}
\text{Overall Success Percentage} = \frac{1}{K} \sum_{i=1}^{K} \text{Success Percentage}_i
\end{equation*}
\subsubsection{Weighted Average Method}
In cases where certain behaviors are more critical than others, a weighted average can be used. Each validator is assigned a weight based on its importance, and the overall success percentage is calculated as the weighted sum of the individual success percentages.
\begin{equation*}
\text{Overall Success Percentage} = \frac{\sum_{i=1}^{K} \text{Weight}_i \times \text{Success Percentage}_i}{\sum_{i=1}^{K} \text{Weight}_i}
\end{equation*}
\subsubsection{Minimum Threshold Method}
Another approach is to set a minimum success threshold that the system must meet across all validators. The overall reliability is then determined by the lowest success percentage recorded among the validators. This method is stringent, ensuring that the system performs reliably across all critical dimensions.
\begin{equation*}
\text{Overall Success Percentage} = \min \left( \begin{array}{c}
\text{Success Percentage}_1, \\
\text{Success Percentage}_2, \\
\vdots \\
\text{Success Percentage}_K \\
\end{array} \right)
\end{equation*}
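All three aggregation rules are small enough to state directly in code (a sketch; \texttt{sps} holds the \( K \) success percentages, and the example numbers are illustrative):
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
def simple_average(sps):
    return sum(sps) / len(sps)

def weighted_average(sps, weights):
    return sum(w * s for w, s in zip(weights, sps)) / sum(weights)

def minimum_threshold(sps):
    return min(sps)

# Illustrative values for the four validators above:
# simple_average([0.96, 0.99, 0.99, 0.97])    -> 0.9775
# minimum_threshold([0.96, 0.99, 0.99, 0.97]) -> 0.96
\end{lstlisting}
\end{minipage}
\end{center}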
\subsection{Confidence Intervals with Multiple Validators}
Confidence intervals provide a range within which the true success percentage likely falls. When dealing with multiple validators, confidence intervals can be calculated for each validator’s success percentage. These intervals can then be reported individually or combined to provide a more nuanced understanding of the system’s reliability.
For each validator, the confidence interval is calculated using the formula:
\begin{equation*}
\text{Confidence Interval for Validator } i = \text{Success Percentage}_i \pm Z \times \text{SE}_i
\end{equation*}
where \( \text{SE}_i \) is the standard error for validator \( i \), calculated as:
\begin{equation*}
\text{SE}_i = \sqrt{\frac{\text{Success Percentage}_i \times (1 - \text{Success Percentage}_i)}{N_i}}
\end{equation*}
The combined confidence interval for the system’s overall reliability can be determined based on the method of aggregation used.
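A sketch of the per-validator interval (the normal approximation; \( Z = 1.96 \) corresponds to 95\% confidence, and the example numbers are illustrative):
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
import math

def confidence_interval(success_pct: float, n: int, z: float = 1.96):
    # Standard error of a binomial proportion over n outputs.
    se = math.sqrt(success_pct * (1 - success_pct) / n)
    return success_pct - z * se, success_pct + z * se

# Illustrative: a 0.93 success percentage over 200 outputs
# gives roughly (0.895, 0.965).
\end{lstlisting}
\end{minipage}
\end{center}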
\subsection{Parallel Execution of Validators}
One of the key advantages of using multiple validators is that they can be executed in parallel. This parallelism allows for efficient and scalable testing, particularly in Continuous Alignment Testing (CAT) environments where real-time feedback is crucial.
By running multiple validators simultaneously, the system can quickly identify areas where it meets or falls short of expectations, enabling prompt adjustments and improvements.
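A sketch of this parallelism using Python's standard library and the hypothetical \texttt{Validator} class from earlier; threads pay off mainly when predicates make external calls, such as to an LLM judge:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
from concurrent.futures import ThreadPoolExecutor

def run_validators_in_parallel(validators, outputs):
    def score(v):
        # Success percentage of one validator over all outputs.
        return sum(v.predicate(o) for o in outputs) / len(outputs)

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(score, validators))
    return {v.name: r for v, r in zip(validators, results)}
\end{lstlisting}
\end{minipage}
\end{center}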
\subsection{Generalizing to a Tensor Framework for Reliability Analysis}
When extending reliability testing to include multiple inputs, multiple outputs per input, and multiple validators, the system's performance can be represented as a three-dimensional tensor. This \textbf{Reliability Tensor} captures the interplay between inputs, outputs, and validators, allowing for a nuanced analysis of the system's reliability.
\subsubsection{Constructing the Reliability Tensor}
The Reliability Tensor \( R \) can be defined with dimensions corresponding to:
\begin{itemize}
\item \textbf{Input Dimension (I):} Represents the set of varied inputs:\newline \( \{\text{input}_1, \text{input}_2, \ldots, \text{input}_N\} \).
\item \textbf{Output Dimension (J):} Represents the multiple outputs generated per input: \( \{\text{output}_1, \text{output}_2, \ldots, \text{output}_M\} \).
\item \textbf{Validator Dimension (K):} Represents the set of validators:\newline \( \{\text{validator}_1, \text{validator}_2, \ldots, \text{validator}_K\} \).
\end{itemize}
Each element \( R[i][j][k] \) in the tensor represents the result (e.g., pass/fail, success percentage) of validator \( k \) applied to output \( \text{output}_j \) generated from input \( \text{input}_i \).
\begin{equation*}
R[i][j][k] = \text{Result of validator } k \text{ on } \text{output}_j \text{ from } \text{input}_i
\end{equation*}
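A sketch of the construction in NumPy, with binary pass/fail entries and the hypothetical \texttt{generate} and \texttt{Validator} helpers from earlier sketches:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
import numpy as np

def build_reliability_tensor(inputs, generate, validators, n_outputs):
    # R[i, j, k] = 1 if validator k passes output j of input i, else 0.
    R = np.zeros((len(inputs), n_outputs, len(validators)), dtype=int)
    for i, inp in enumerate(inputs):
        for j in range(n_outputs):
            out = generate(inp, j)
            for k, v in enumerate(validators):
                R[i, j, k] = int(v.predicate(out))
    return R
\end{lstlisting}
\end{minipage}
\end{center}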
\subsubsection{Analyzing Success Percentages Along Tensor Axes}
By examining the tensor along different axes, we can derive various success percentages:
\begin{itemize}
\item \textbf{Per-Input Success Rates:} For each input \( i \), aggregate results across outputs and validators to assess how reliably the system handles that specific input.
\begin{equation*}
\text{Success Percentage for Input } i = \text{Aggregate}_{j,k} \, R[i][j][k]
\end{equation*}
\item \textbf{Per-Output Success Rates:} For each output iteration \( j \), aggregate results across inputs and validators to evaluate the reliability of outputs generated under specific conditions.
\begin{equation*}
\text{Success Percentage for Output } j = \text{Aggregate}_{i,k} \, R[i][j][k]
\end{equation*}
\item \textbf{Per-Validator Success Rates:} For each validator \( k \), aggregate results across inputs and outputs to measure how well the system performs regarding a specific criterion.
\begin{equation*}
\text{Success Percentage for Validator } k = \text{Aggregate}_{i,j} \, R[i][j][k]
\end{equation*}
\end{itemize}
\subsubsection{Developing Terms of Art}
To facilitate discussion and analysis, we introduce the following terms:
\begin{itemize}
\item \textbf{Input Reliability Profile (IRP):} The collection of success percentages for a specific input across all outputs and validators.
\item \textbf{Output Reliability Profile (ORP):} The collection of success percentages for a specific output iteration across all inputs and validators.
\item \textbf{Validator Reliability Profile (VRP):} The collection of success percentages for a specific validator across all inputs and outputs.
\end{itemize}
These profiles help identify patterns and anomalies in the system's performance, enabling targeted improvements.
\subsubsection{Marginal Success Percentages and Reliability Profiles}
By aggregating over specific dimensions of the tensor, we can compute marginal success percentages that provide insights into different aspects of system performance.
\begin{itemize}
\item \textbf{Input Marginal Success Percentage (Input MSP):} The success percentage for each input \( i \), aggregated over outputs and validators.
\begin{equation*}
\text{Input MSP}[i] = \frac{1}{J \times K} \sum_{j,k} R[i][j][k]
\end{equation*}
\item \textbf{Output Marginal Success Percentage (Output MSP):} The success percentage for each output iteration \( j \), aggregated over inputs and validators.
\begin{equation*}
\text{Output MSP}[j] = \frac{1}{I \times K} \sum_{i,k} R[i][j][k]
\end{equation*}
\item \textbf{Validator Marginal Success Percentage (Validator MSP):} The success percentage for each validator \( k \), aggregated over inputs and outputs.
\begin{equation*}
\text{Validator MSP}[k] = \frac{1}{I \times J} \sum_{i,j} R[i][j][k]
\end{equation*}
\end{itemize}
These marginal success percentages form the basis of the Input Reliability Profile (IRP), Output Reliability Profile (ORP), and Validator Reliability Profile (VRP), respectively.
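With a binary tensor \( R \) as constructed in the earlier sketch, each marginal is simply a mean over the other two axes:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
input_msp = R.mean(axis=(1, 2))      # Input MSP[i]: one per input
output_msp = R.mean(axis=(0, 2))     # Output MSP[j]: one per iteration
validator_msp = R.mean(axis=(0, 1))  # Validator MSP[k]: one per validator
\end{lstlisting}
\end{minipage}
\end{center}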
\subsubsection{Interpreting Reliability Profiles}
\begin{itemize}
\item \textbf{Input Reliability Profile (IRP):} Highlights inputs where the system performs exceptionally well or poorly, guiding efforts to improve handling of specific inputs.
\item \textbf{Output Reliability Profile (ORP):} Reveals output iterations that consistently yield better or worse results, potentially indicating issues with certain generation methods or configurations.
\item \textbf{Validator Reliability Profile (VRP):} Indicates areas where the system meets or fails to meet specific criteria, informing adjustments to enhance compliance with critical requirements.
\end{itemize}
\subsubsection{Visualizing the Reliability Tensor}
To aid in interpreting the data, visualization techniques such as heatmaps or 3D plots can represent the tensor's elements and marginal percentages. Such visualizations can make patterns and outliers more apparent, facilitating a deeper understanding of the system's performance.
\subsubsection{Framework for Combining Success Percentages}
To report an overall reliability score for the system, we can aggregate success percentages from the tensor using various methods:
\begin{itemize}
\item \textbf{Mean Aggregation:} Compute the average success percentage across all elements.
\begin{equation*}
\text{Overall Success Percentage} = \frac{1}{I \times J \times K} \sum_{i,j,k} R[i][j][k]
\end{equation*}
\item \textbf{Weighted Aggregation:} Assign weights to inputs, outputs, or validators based on their importance.
\begin{equation*}
\text{Overall Success Percentage} = \frac{\sum_{i,j,k} W[i][j][k] \times R[i][j][k]}{\sum_{i,j,k} W[i][j][k]}
\end{equation*}
\item \textbf{Minimum Threshold Method:} Identify the lowest success percentage across any dimension to ensure reliability standards are met in all areas.
\begin{equation*}
\text{Overall Success Percentage} = \min_{i,j,k} R[i][j][k]
\end{equation*}
\end{itemize}
The choice of aggregation method depends on the specific requirements and priorities of the system being evaluated.
\pagebreak
\section{Verifiers: Assessing System-Wide Reliability}
Verifiers provide a holistic assessment of the system's reliability. Unlike validators, which focus on specific aspects of behavior that can be evaluated programmatically, verifiers evaluate the overall performance of the system. A verifier reviews input-output pairs and determines whether the system's output passes or fails based on a comprehensive set of instructions.
\textbf{Verifier Process:}
\begin{equation*}
\begin{aligned}
&\text{input}_1 \rightarrow \text{output}_1 \\
&\text{input}_2 \rightarrow \text{output}_2 \\
&\vdots \\
&\text{input}_N \rightarrow \text{output}_N
\end{aligned}
\end{equation*}
\begin{equation*}
\begin{aligned}
&\text{Verifier}: (\text{input}_1, \text{output}_1) \rightarrow \text{PASS/FAIL} \\
&\text{Verifier}: (\text{input}_2, \text{output}_2) \rightarrow \text{PASS/FAIL} \\
&\vdots \\
&\text{Verifier}: (\text{input}_N, \text{output}_N) \rightarrow \text{PASS/FAIL}
\end{aligned}
\end{equation*}
The results of these verification steps are aggregated to assess the overall reliability of the system. Because the verification step adds another LLM call, it also opens up the possibility for the system to self-correct on a per input-output pair basis.
\pagebreak
\subsection{Verifier-Driven Retry Mechanism}
Once you have integrated AI into your application, there are several ways to make the system auto-correct. One of the simplest is to use the verification step to trigger a ``retry.''
Since the verifier step uses an LLM transaction to decide whether an input-output pair ``passes'' our test, that same LLM can also provide a list of reasons for a failure. The input can then be augmented with those reasons and sent back through the system to produce another output. This cycle can repeat up to \( \text{MAX} \) times.
\begin{align*}
&(\text{input}, \text{output}_1) \rightarrow \text{Verifier: PASS} \\
&\quad \text{\# publish output}_1
\end{align*}
\begin{align*}
&(\text{input}, \text{output}_1) \rightarrow \text{Verifier: FAIL, reasons} \\
&\quad \rightarrow (\text{input + reasons}, \text{output}_2) \rightarrow \text{Verifier: PASS} \\
&\quad \text{\# publish output}_2
\end{align*}
\begin{align*}
&(\text{input}, \text{output}_1) \rightarrow \text{Verifier: FAIL, reasons}_1 \\
&\quad \rightarrow (\text{input + reasons}_1, \text{output}_2) \rightarrow \text{Verifier: FAIL, reasons}_2 \\
&\quad \rightarrow (\text{input + reasons}_2, \text{output}_3) \rightarrow \text{Verifier: FAIL, reasons}_3 \\
&\quad \vdots \\
&\quad \rightarrow (\text{input + reasons}_{\text{MAX}-1}, \text{output}_{\text{MAX}}) \rightarrow \text{Verifier: FAIL, reasons}_{\text{MAX}} \\
&\quad \text{\# no output published}
\end{align*}
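A sketch of this loop, where \texttt{generate} and \texttt{verify} stand in for the two hypothetical LLM transactions and \texttt{verify} returns a pass flag together with its reasons for failure:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
def generate_with_retries(input_text, generate, verify, max_attempts=3):
    prompt = input_text
    for _ in range(max_attempts):
        output = generate(prompt)
        passed, reasons = verify(input_text, output)
        if passed:
            return output  # publish this output
        # Fold the verifier's reasons back into the prompt and retry.
        prompt = f"{input_text}\n\nPrevious attempt failed because:\n{reasons}"
    return None  # no output published
\end{lstlisting}
\end{minipage}
\end{center}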
\subsection{Retry Mechanisms with Validators}
In complex LLM-based systems, outputs may occasionally fail to meet all the criteria specified by multiple validators due to the inherent stochasticity of language models. To enhance reliability, a \textbf{retry mechanism} can be implemented, allowing the system to generate new outputs for a given input up to a maximum of \( m \) attempts. This section explores how to design such a retry mechanism using all validators and how to predict the expected number of retries required for an output to pass all validators based on their success percentages.
\pagebreak
\subsubsection{Concept of the Retry Mechanism}
The retry mechanism operates as follows:
\begin{enumerate}
\item \textbf{Initial Generation:} For a given input \( i \), the system generates an output \( o_1 \).
\item \textbf{Validation:} All validators \( \{V_1, V_2, \ldots, V_K\} \) are applied to \( o_1 \).
\item \textbf{Check Pass/Fail:}
\begin{itemize}
\item If \( o_1 \) passes all validators, the process stops, and \( o_1 \) is accepted.
\item If \( o_1 \) fails any validator, the system retries up to a maximum of \( m \) times.
\end{itemize}
\item \textbf{Subsequent Generations:} On each retry \( j \), the system generates a new output \( o_j \) for the same input \( i \) and repeats the validation process.
\item \textbf{Termination Conditions:}
\begin{itemize}
\item \textbf{Success:} If any \( o_j \) passes all validators before reaching \( m \) retries, the output is accepted.
\item \textbf{Failure:} If none of the outputs pass all validators after \( m \) retries, the process terminates without an accepted output.
\end{itemize}
\end{enumerate}
\subsubsection{Predicting the Expected Number of Retries}
To optimize the retry mechanism, it is crucial to predict:
\begin{itemize}
\item The expected number of retries needed for an output to pass all validators.
\item The optimal value of \( m \) to balance reliability and resource consumption.
\end{itemize}
\paragraph{Success Percentages of Validators}
Each validator \( V_k \) has an inherent success percentage \( p_k \), representing the probability that a randomly generated output will pass \( V_k \).
\begin{itemize}
\item \textbf{Validator Success Probability:} \( p_k = \text{Success Percentage of } V_k \)
\item \textbf{Assumption:} The validators operate independently, and the success probabilities are consistent across outputs for a given input.
\end{itemize}
\paragraph{Combined Success Probability}
The probability that an output passes all validators is:
\begin{equation*}
P_{\text{pass}} = \prod_{k=1}^{K} p_k
\end{equation*}
This formula assumes independence among validators.
\paragraph{Expected Number of Retries}
The expected number of retries \( E[R] \) required for an output to pass all validators is:
\begin{itemize}
\item \textbf{Geometric Distribution:} Since each attempt is independent, and the probability of success remains constant, the number of trials until the first success follows a geometric distribution.
\item \textbf{Expected Number of Trials (including the first attempt):}
\begin{equation*}
E[R] = \frac{1}{P_{\text{pass}}}
\end{equation*}
\item \textbf{Expected Number of Retries (excluding the first attempt):}
\begin{equation*}
E[\text{Retries}] = E[R] - 1 = \frac{1}{P_{\text{pass}}} - 1
\end{equation*}
\end{itemize}
\paragraph{Determining Maximum Retries \( m \)}
To choose an appropriate maximum number of retries \( m \):
\begin{itemize}
\item \textbf{Probability of Success Within \( m \) Attempts:}
\begin{equation*}
P_{\text{success within } m \text{ attempts}} = 1 - (1 - P_{\text{pass}})^{m}
\end{equation*}
\item \textbf{Selecting \( m \):} Choose \( m \) such that \( P_{\text{success within } m \text{ attempts}} \) meets a desired confidence level (e.g., 95\%).
\end{itemize}
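Solving the inequality for the smallest such \( m \) takes only a few lines (a sketch; the guard avoids a domain error when \( P_{\text{pass}} = 1 \)):
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
import math

def max_attempts(p_pass: float, confidence: float = 0.99) -> int:
    # Smallest m with 1 - (1 - p_pass)**m >= confidence.
    if p_pass >= 1.0:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_pass))

# max_attempts(0.72675) -> 4, matching the worked example below.
\end{lstlisting}
\end{minipage}
\end{center}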
\subsubsection{Is \texorpdfstring{$m$}{m} Input-Specific?}
The value of \( m \) can be:
\begin{itemize}
\item \textbf{Input-Agnostic:} If the success probabilities \( p_k \) are consistent across all inputs, \( m \) can be set globally.
\item \textbf{Input-Specific:} If success probabilities vary significantly with different inputs (as indicated by the Input Reliability Profile), \( m \) may need adjustment per input.
\end{itemize}
\paragraph{Using the Reliability Tensor}
The Reliability Tensor \( R[i][j][k] \) provides empirical success data for each input \( i \), output attempt \( j \), and validator \( k \).
\begin{itemize}
\item \textbf{Empirical Success Probability for Input \( i \):}
\begin{equation*}
P_{\text{pass}, i} = \frac{1}{J} \sum_{j=1}^{J} \left( \prod_{k=1}^{K} R[i][j][k] \right)
\end{equation*}
\item \textbf{Expected Retries for Input \( i \):}
\begin{equation*}
E[R]_i = \frac{1}{P_{\text{pass}, i}}
\end{equation*}
\end{itemize}
If \( P_{\text{pass}, i} \) varies significantly across inputs, it indicates that \( m \) should be adjusted per input to optimize performance.
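Both quantities fall out of the tensor in a couple of lines (a NumPy sketch; the product over the validator axis encodes ``passes all validators,'' and the clip guards against division by zero for inputs that never pass):
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
import numpy as np

# R has shape (I, J, K) with binary entries, as constructed earlier.
p_pass_per_input = R.prod(axis=2).mean(axis=1)                 # P_pass,i
expected_trials = 1.0 / np.clip(p_pass_per_input, 1e-9, None)  # E[R]_i
\end{lstlisting}
\end{minipage}
\end{center}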
\subsubsection{Practical Calculation Example}
\textbf{Assumptions:}
\begin{itemize}
\item Validators and their success percentages:
\begin{itemize}
\item \( V_1 \): \( p_1 = 0.95 \)
\item \( V_2 \): \( p_2 = 0.90 \)
\item \( V_3 \): \( p_3 = 0.85 \)
\end{itemize}
\item \textbf{Combined Success Probability:}
\begin{equation*}
P_{\text{pass}} = p_1 \times p_2 \times p_3 = 0.95 \times 0.90 \times 0.85 = 0.72675
\end{equation*}
\item \textbf{Expected Number of Trials:}
\begin{equation*}
E[R] = \frac{1}{0.72675} \approx 1.376
\end{equation*}
\item \textbf{Expected Number of Retries:}
\begin{equation*}
E[\text{Retries}] = E[R] - 1 \approx 0.376
\end{equation*}
\item \textbf{Conclusion:} On average, less than one retry is needed for an output to pass all validators.
\end{itemize}
\textbf{Determining \( m \) for 99\% Confidence:}
\begin{itemize}
\item Desired \( P_{\text{success within } m \text{ attempts}} = 0.99 \)
\item Solve for \( m \):
\begin{align*}
0.99 &= 1 - (1 - 0.72675)^{m} \\
(1 - 0.72675)^{m} &= 0.01 \\
(0.27325)^{m} &= 0.01 \\
m \log(0.27325) &= \log(0.01) \\
m &= \frac{\log(0.01)}{\log(0.27325)} \approx 3.55
\end{align*}
\item \textbf{Conclusion:} Round up and allow \( m = 4 \) attempts (the initial generation plus three retries) to have at least a 99\% chance of success.
\end{itemize}
\subsubsection{Factors Influencing \texorpdfstring{$m$}{m}}
\paragraph{Validator Independence}
\begin{itemize}
\item \textbf{Assumption of Independence:} The calculation assumes validators act independently.
\item \textbf{Correlation Between Validators:} If validators are correlated, the combined success probability may differ, affecting \( m \).
\end{itemize}
\paragraph{Input Variability}
\begin{itemize}
\item \textbf{Input-Specific Success Rates:} Use the Reliability Tensor to identify inputs with lower success probabilities.
\item \textbf{Adaptive Retry Mechanism:} Adjust \( m \) based on input-specific data to optimize resource usage.
\end{itemize}
\paragraph{System Constraints}
\begin{itemize}
\item \textbf{Resource Limitations:} Higher \( m \) increases computational load and latency.
\item \textbf{User Experience:} Excessive retries may delay responses; balance is necessary.
\end{itemize}
\subsubsection{Implementing the Retry Mechanism}
\textbf{Algorithm Steps:}
\begin{enumerate}
\item \textbf{Initialize:} Set maximum retries \( m \), initialize attempt counter \( j = 1 \).
\item \textbf{Generate Output:} Produce output \( o_j \) for input \( i \).
\item \textbf{Validation:} Apply all validators \( V_k \) to \( o_j \).
\item \textbf{Check Pass/Fail:}
\begin{itemize}
\item \textbf{If Pass:} Accept \( o_j \), terminate.
\item \textbf{If Fail:} Increment \( j \).
\end{itemize}
\item \textbf{Retry Condition:}
\begin{itemize}
\item \textbf{If } \( j \leq m \): Go back to Step 2.
\item \textbf{If } \( j > m \): Fail the input, terminate.
\end{itemize}
\end{enumerate}
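A sketch of these steps, reusing the hypothetical \texttt{Validator} and \texttt{generate} helpers from earlier and logging each attempt, per the considerations below:
\begin{center}
\begin{minipage}{0.9\linewidth}
\begin{lstlisting}
import logging

def generate_until_valid(input_text, generate, validators, m):
    for attempt in range(1, m + 1):
        output = generate(input_text)
        # Names of validators whose predicate rejects this output.
        failures = [v.name for v in validators if not v.predicate(output)]
        logging.info("attempt %d/%d: %s", attempt, m, failures or "pass")
        if not failures:
            return output  # accept this output and terminate
    return None  # fail the input after m attempts
\end{lstlisting}
\end{minipage}
\end{center}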
\textbf{Considerations:}
\begin{itemize}
\item \textbf{Logging:} Record each attempt and validation results for analysis.
\item \textbf{Timeouts:} Implement time constraints to prevent indefinite processing.
\item \textbf{Feedback Loop:} Analyze failed inputs to improve model or validators.
\end{itemize}
\end{document}