From a56d3e291498f2011332b125defae596d913fdf1 Mon Sep 17 00:00:00 2001
From: jasonjabbour
Date: Mon, 11 Nov 2024 16:10:24 -0500
Subject: [PATCH] cutting down on repetitive metrics descriptions

---
 contents/core/benchmarking/benchmarking.qmd | 44 +++++++--------------
 1 file changed, 15 insertions(+), 29 deletions(-)

diff --git a/contents/core/benchmarking/benchmarking.qmd b/contents/core/benchmarking/benchmarking.qmd
index e6327c75..bd2e436a 100644
--- a/contents/core/benchmarking/benchmarking.qmd
+++ b/contents/core/benchmarking/benchmarking.qmd
@@ -603,59 +603,45 @@ Machine learning model evaluation has evolved from a narrow focus on accuracy to
#### Accuracy
-Accuracy is one of the most intuitive and commonly used metrics for evaluating machine learning models. At its core, accuracy measures the proportion of correct predictions made by the model out of all predictions. For example, imagine we have developed a machine learning model to classify images as either containing a cat or not. If we test this model on a dataset of 100 images, and it correctly identifies 90 of them, we would calculate its accuracy as 90%.
+Accuracy is one of the most intuitive and commonly used metrics for evaluating machine learning models. In the early stages of machine learning, accuracy was often the primary, if not the only, metric considered when evaluating model performance. However, as the field has evolved, it's become clear that relying solely on accuracy can be misleading, especially in applications where certain types of errors carry significant consequences.
-In the initial stages of machine learning, accuracy was often the primary, if not the only, metric considered when evaluating model performance. This is understandable, given its straightforward nature and ease of interpretation. However, as the field has progressed, the limitations of relying solely on accuracy have become more apparent.
+Consider the example of a medical diagnosis model with an accuracy of 95%. While at first glance this may seem impressive, we must look deeper to assess the model's performance fully. Suppose the model fails to accurately diagnose severe conditions that, while rare, carry serious consequences; its high overall accuracy may then be far less meaningful. A well-known example of this limitation is [Google's diabetic retinopathy model](https://about.google/intl/ALL_us/stories/seeingpotential/). While it achieved high accuracy in lab settings, it encountered challenges when deployed in real-world clinics in Thailand, where variations in patient populations, image quality, and environmental factors reduced its effectiveness. This example illustrates that even models with high accuracy need to be tested for their ability to generalize across diverse, unpredictable conditions to ensure reliability and impact in real-world settings.
-Consider the example of a medical diagnosis model with an accuracy of 95%. While at first glance this may seem impressive, we must look deeper to assess the model's performance fully. Suppose the model fails to accurately diagnose severe conditions that, while rare, can have severe consequences; its high accuracy may not be as meaningful. A pertinent example of this is [Google's retinopathy machine learning model](https://about.google/intl/ALL_us/stories/seeingpotential/), which was designed to diagnose diabetic retinopathy and diabetic macular edema from retinal photographs.
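
To make the pitfall concrete, here is a minimal sketch that computes overall accuracy alongside accuracy on a rare but critical subset of cases. The labels and predictions are invented purely for illustration (they are not taken from the Google study); the point is only that an impressive aggregate figure can coexist with complete failure on the cases that matter most.

```python
# Toy illustration: overall accuracy can hide failures on rare, critical cases.
# All labels and predictions below are made up for demonstration purposes.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 1 = severe condition present (rare), 0 = condition absent (common).
y_true = [0] * 95 + [1] * 5   # 5% of cases are the severe condition
y_pred = [0] * 100            # the model predicts "absent" for everyone

overall = accuracy(y_true, y_pred)
severe_only = accuracy(y_true[95:], y_pred[95:])  # accuracy on the rare cases only

print(f"Overall accuracy: {overall:.0%}")              # 95% -- looks impressive
print(f"Accuracy on severe cases: {severe_only:.0%}")  # 0% -- clinically useless
```
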
+Similarly, if the model performs well on average but exhibits significant disparities in performance across different demographic groups, this, too, would be cause for concern. The evolution of machine learning has thus seen a shift towards a more holistic approach to model evaluation, taking into account not just accuracy, but also other crucial factors such as fairness, transparency, and real-world applicability. A prime example is the [Gender Shades project](http://gendershades.org/) at MIT Media Lab, led by Joy Buolamwini, which found that commercial facial recognition systems performed better on lighter-skinned and male faces than on darker-skinned and female faces.
-The Google model demonstrated impressive accuracy levels in lab settings. Still, when deployed in real-world clinical environments in Thailand, [it faced significant challenges](https://www.technologyreview.com/2020/04/27/1000658/google-medical-ai-accurate-lab-real-life-clinic-covid-diabetes-retina-disease/). In the real-world setting, the model encountered diverse patient populations, varying image quality, and a range of different medical conditions that it had not been exposed to during its training. Consequently, its performance could have been better, and it struggled to maintain the same accuracy levels observed in lab settings. This example serves as a clear reminder that while high accuracy is an important and desirable attribute for a medical diagnosis model, it must be evaluated in conjunction with other factors, such as the model's ability to generalize to different populations and handle diverse and unpredictable real-world conditions, to understand its value and potential impact on patient care truly.
-
-Similarly, if the model performs well on average but exhibits significant disparities in performance across different demographic groups, this, too, would be cause for concern.
-
-The evolution of machine learning has thus seen a shift towards a more holistic approach to model evaluation, taking into account not just accuracy, but also other crucial factors such as fairness, transparency, and real-world applicability. A prime example is the [Gender Shades project](http://gendershades.org/) at MIT Media Lab, led by Joy Buolamwini, highlighting significant racial and gender biases in commercial facial recognition systems. The project evaluated the performance of three facial recognition technologies developed by IBM, Microsoft, and Face++. It found that they all exhibited biases, performing better on lighter-skinned and male faces compared to darker-skinned and female faces.
-
-While accuracy remains a fundamental and valuable metric for evaluating machine learning models, a more comprehensive approach is required to fully assess a model's performance. This means considering additional metrics that account for fairness, transparency, and real-world applicability, as well as conducting rigorous testing across diverse datasets to uncover and mitigate any potential biases. The move towards a more holistic approach to model evaluation reflects the maturation of the field and its increasing recognition of the real-world implications and ethical considerations associated with deploying machine learning models.
+While accuracy remains essential for evaluating machine learning models, a comprehensive approach is needed to fully assess performance. This includes additional metrics for fairness, transparency, and real-world applicability, along with rigorous testing across diverse datasets to identify and address biases. This holistic evaluation approach reflects the field's growing awareness of real-world implications in deploying models.
#### Fairness
-Fairness in machine learning models is a multifaceted and critical aspect that requires careful attention, particularly in high-stakes applications that significantly affect people's lives, such as in loan approval processes, hiring, and criminal justice. It refers to the equitable treatment of all individuals, irrespective of their demographic or social attributes such as race, gender, age, or socioeconomic status.
-
-Simply relying on accuracy can be insufficient and potentially misleading when evaluating models. For instance, consider a loan approval model with a 95% accuracy rate. While this figure may appear impressive at first glance, it does not reveal how the model performs across different demographic groups. If this model consistently discriminates against a particular group, its accuracy is less commendable, and its fairness is questioned.
+Fairness in machine learning involves ensuring that models perform consistently across diverse groups, especially in high-impact applications like loan approvals, hiring, and criminal justice. Relying solely on accuracy can be misleading if the model exhibits biased outcomes across demographic groups. For example, a loan approval model with high accuracy may still consistently deny loans to certain groups, raising questions about its fairness.
-Discrimination can manifest in various forms, such as direct discrimination, where a model explicitly uses sensitive attributes like race or gender in its decision-making process, or indirect discrimination, where seemingly neutral variables correlate with sensitive attributes, indirectly influencing the model's outcomes. An infamous example of the latter is the COMPAS tool used in the US criminal justice system, which exhibited racial biases in predicting recidivism rates despite not explicitly using race as a variable.
+Bias in models can arise directly, when sensitive attributes like race or gender influence decisions, or indirectly, when seemingly neutral features correlate with these attributes and affect outcomes. A well-known example of the latter is the COMPAS tool used in the US criminal justice system, which showed racial biases in predicting recidivism despite not explicitly using race as a variable.
+Addressing fairness requires analyzing a model's performance across groups, identifying biases, and applying corrective measures like re-balancing datasets or using fairness-aware algorithms. Researchers and practitioners continuously develop metrics and methodologies tailored to specific use cases to evaluate fairness in real-world scenarios. For example, disparate impact analysis, demographic parity, and equal opportunity are some of the metrics employed to assess fairness. Additionally, transparency and interpretability of models are fundamental to achieving fairness. Tools like [AI Fairness 360](https://ai-fairness-360.org/) and [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators) help explain how a model makes decisions, allowing developers to detect and correct fairness issues in machine learning models.
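
As a minimal illustration of how such an analysis works, the sketch below computes per-group approval rates for a hypothetical loan model and compares them with a disparate impact ratio. The records, group names, and threshold are illustrative assumptions rather than output from any real system; the four-fifths (80%) figure is a commonly cited rule of thumb for flagging disparate impact.

```python
# Minimal demographic parity / disparate impact check for a hypothetical loan model.
# Each record is (group, approved); the data is illustrative only.
from collections import defaultdict

decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", False), ("group_b", False), ("group_b", True), ("group_b", False),
]

approved = defaultdict(int)
total = defaultdict(int)
for group, was_approved in decisions:
    total[group] += 1
    approved[group] += was_approved

rates = {group: approved[group] / total[group] for group in total}
print("Approval rate per group:", rates)  # group_a: 0.75, group_b: 0.25

# Disparate impact ratio: lowest group rate divided by highest group rate.
ratio = min(rates.values()) / max(rates.values())
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.33 here, well below the 0.8 rule of thumb
```

Toolkits such as AI Fairness 360 and Fairness Indicators package these and related metrics, but the core comparison is essentially the per-group rate calculation shown here.
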
-Additionally, transparency and interpretability of models are fundamental to achieving fairness. Understanding how a model makes decisions can reveal potential biases and enable stakeholders to hold developers accountable. Open-source tools like [AI Fairness 360](https://ai-fairness-360.org/) by IBM and [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators) by TensorFlow are being developed to facilitate fairness assessments and mitigation of biases in machine learning models.
-
-Ensuring fairness in machine learning models, particularly in applications that significantly impact people's lives, requires rigorous evaluation of the model's performance across diverse groups, careful identification and mitigation of biases, and implementation of transparency and interpretability measures. By comprehensively addressing fairness, we can work towards developing machine learning models that are equitable, just, and beneficial for society.
+While accuracy is a valuable metric, it doesn't always provide the full picture; assessing fairness ensures models are effective and equitable across real-world scenarios. Ensuring fairness in machine learning models, particularly in applications that significantly impact people's lives, requires rigorous evaluation of the model's performance across diverse groups, careful identification and mitigation of biases, and implementation of transparency and interpretability measures.
#### Complexity
##### Parameters
-In the initial stages of machine learning, model benchmarking often relied on parameter counts as a proxy for model complexity. The rationale was that more parameters typically lead to a more complex model, which should, in turn, deliver better performance. However, this approach has proven inadequate as it needs to account for the computational cost associated with processing many parameters.
-
-For example, GPT-3, developed by OpenAI, is a language model that boasts an astounding 175 billion parameters. While it achieves state-of-the-art performance on various natural language processing tasks, its size and the computational resources required to run it make it impractical for deployment in many real-world scenarios, especially those with limited computational capabilities.
+In the initial stages of machine learning, model benchmarking often relied on parameter counts as a proxy for model complexity. The rationale was that more parameters typically lead to a more complex model, which should, in turn, deliver better performance. However, this approach overlooks the practical costs associated with processing large models. As parameter counts increase, so do the computational resources required, making such models impractical for deployment in real-world scenarios, particularly on devices with limited processing power.
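
To see how quickly parameter counts grow, the following back-of-the-envelope sketch counts the weights and biases of a small, hypothetical fully connected network; the layer sizes are arbitrary and chosen purely for illustration.

```python
# Back-of-the-envelope parameter count for a hypothetical fully connected network.
# A dense layer mapping n_in inputs to n_out outputs has n_in * n_out weights
# plus n_out biases.

layer_sizes = [784, 512, 256, 10]  # e.g., an MNIST-style classifier (illustrative)

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    layer_params = n_in * n_out + n_out
    total_params += layer_params
    print(f"{n_in:>4} -> {n_out:<4} layer: {layer_params:,} parameters")

print(f"Total: {total_params:,} parameters")
# Widening the hidden layers grows the weight matrices multiplicatively,
# which is why parameter counts (and the memory to store them) climb so quickly.
```
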
-Relying on parameter counts as a proxy for model complexity also fails to consider the model's efficiency. If optimized for efficiency, a model with fewer parameters might be just as effective, if not more so, than a model with a higher parameter count. For instance, MobileNets, developed by Google, is a family of models designed specifically for mobile and edge devices. They use depth-wise separable convolutions to reduce the number of parameters and computational costs while still achieving competitive performance.
+Relying on parameter counts as a proxy for model complexity also fails to consider the model's efficiency. A well-optimized model with fewer parameters can often achieve comparable or even superior performance to a larger model. For instance, MobileNets, developed by Google, is a family of models designed specifically for mobile and edge devices. They use depth-wise separable convolutions to reduce parameter counts and computational demands while still maintaining strong performance.
-In light of these limitations, the field has moved towards a more holistic approach to model benchmarking that considers parameter counts and other crucial factors such as floating-point operations per second (FLOPs), memory consumption, and latency. FLOPs, in particular, have emerged as an important metric as they provide a more accurate representation of the computational load a model imposes. This shift towards a more comprehensive approach to model benchmarking reflects a recognition of the need to balance performance with practicality, ensuring that models are effective, efficient, and deployable in real-world scenarios.
+In light of these limitations, the field has moved towards a more holistic approach to model benchmarking that considers parameter counts and other crucial factors such as floating-point operations (FLOPs), memory consumption, and latency. This comprehensive approach balances performance with deployability, ensuring that models are not only accurate but also efficient and suitable for real-world applications.
##### FLOPS
-The size of a machine learning model is an essential aspect that directly impacts its usability in practical scenarios, especially when computational resources are limited. Traditionally, the number of parameters in a model was often used as a proxy for its size, with the underlying assumption being that more parameters would translate to better performance. However, this simplistic view does not consider the computational cost of processing these parameters. This is where the concept of floating-point operations per second (FLOPs) comes into play, providing a more accurate representation of the computational load a model imposes.
+FLOPs, or floating-point operations, have become a critical metric for representing a model's computational load. Traditionally, parameter count was used as a proxy for model complexity, based on the assumption that more parameters would yield better performance. However, this approach overlooks the computational cost of processing these parameters, which can impact a model's usability in real-world scenarios with limited resources.
-FLOPs measure the number of floating-point operations a model performs to generate a prediction. A model with many FLOPs requires substantial computational resources to process the vast number of operations, which may render it impractical for certain applications. Conversely, a model with a lower FLOP count is more lightweight and can be easily deployed in scenarios where computational resources are limited. @fig-flops, from [@bianco2018benchmark], shows the relationship between Top-1 Accuracy on ImageNet (_y_-axis), the model's G-FLOPs (_x_-axis), and the model's parameter count (circle-size).
+FLOPs measure the number of floating-point operations a model performs to generate a prediction. A model with many FLOPs requires substantial computational resources to process the vast number of operations, which may render it impractical for certain applications. Conversely, a model with a lower FLOP count is more lightweight and can be easily deployed in scenarios where computational resources are limited. @fig-flops, from [@bianco2018benchmark], illustrates the trade-off between ImageNet accuracy, FLOPs, and parameter count, showing that some architectures achieve higher efficiency than others.
![A graph that depicts the top-1 ImageNet accuracy vs. the FLOP count of a model along with the model's parameter count. The figure shows an overall tradeoff between model complexity and accuracy, although some model architectures are more efficient than others. Source: @bianco2018benchmark.](images/png/model_FLOPS_VS_TOP_1.png){#fig-flops}
-Let's consider an example. BERT---Bidirectional Encoder Representations from Transformers [@devlin2018bert]---is a popular natural language processing model, has over 340 million parameters, making it a large model with high accuracy and impressive performance across various tasks. However, the sheer size of BERT, coupled with its high FLOP count, makes it a computationally intensive model that may not be suitable for real-time applications or deployment on edge devices with limited computational capabilities.
-In light of this, there has been a growing interest in developing smaller models that can achieve similar performance levels as their larger counterparts while being more efficient in computational load. DistilBERT, for instance, is a smaller version of BERT that retains 97% of its performance while being 40% smaller in terms of parameter count. The size reduction also translates to a lower FLOP count, making DistilBERT a more practical choice for resource-constrained scenarios.
+Let's consider an example. BERT---Bidirectional Encoder Representations from Transformers [@devlin2018bert]---is a popular natural language processing model with over 340 million parameters, making it a large model with high accuracy and impressive performance across various tasks. However, the sheer size of BERT, coupled with its high FLOP count, makes it a computationally intensive model that may not be suitable for real-time applications or deployment on edge devices with limited computational capabilities. In light of this, there has been a growing interest in developing smaller models that can achieve similar performance levels to their larger counterparts while being more efficient in computational load. DistilBERT, for instance, is a smaller version of BERT that retains 97% of its performance while being 40% smaller in terms of parameter count. The size reduction also translates to a lower FLOP count, making DistilBERT a more practical choice for resource-constrained scenarios.
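
To make the comparison concrete, the sketch below estimates forward-pass operation counts for two hypothetical fully connected networks, using the common approximation that a dense layer costs roughly 2 × n_in × n_out floating-point operations (one multiply and one add per weight). The layer sizes are illustrative, and real profilers account for many details this sketch ignores.

```python
# Rough forward-pass FLOP estimate for hypothetical fully connected networks.
# Approximation: a dense layer costs about 2 * n_in * n_out floating-point
# operations (one multiply and one accumulate per weight); biases and
# activation functions are ignored here.

def dense_flops(layer_sizes):
    return sum(2 * n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

small_model = [784, 256, 10]         # illustrative compact model
large_model = [784, 4096, 4096, 10]  # illustrative large model

for name, sizes in [("small", small_model), ("large", large_model)]:
    print(f"{name} model: ~{dense_flops(sizes):,} FLOPs per prediction")

# The large model needs roughly 100x more operations per prediction, which
# translates directly into latency and energy cost on resource-constrained devices.
```
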
-In summary, while parameter count provides a useful indication of model size, it is not a comprehensive metric as it needs to consider the computational cost associated with processing these parameters. FLOPs, on the other hand, offer a more accurate representation of a model's computational load and are thus an essential consideration when deploying machine learning models in real-world scenarios, particularly when computational resources are limited. The evolution from relying solely on parameter count to considering FLOPs signifies a maturation in the field, reflecting a greater awareness of the practical constraints and challenges of deploying machine learning models in diverse settings.
+While parameter count indicates model size, it does not fully capture the computational cost. FLOPs provide a more accurate measure of computational load, highlighting the practical trade-offs in model deployment. This shift from parameter count to FLOPs reflects the field's growing awareness of deployment challenges in diverse settings.
##### Efficiency