diff --git a/frameworks.qmd b/frameworks.qmd
index 85f2ab9b..1c64f52b 100644
--- a/frameworks.qmd
+++ b/frameworks.qmd
@@ -1,119 +1,1738 @@
# AI Frameworks
-::: {.callout-tip collapse="true"}
-## Learning Objectives
+Learning Objectives
-* coming soon.
+- The evolution, core components, and advanced features of ML frameworks
-:::
+- How frameworks specialize for cloud, edge, and tinyML environments
+
+- Challenges of embedded ML and how frameworks optimize models
+
+- Criteria for selecting the right framework based on model, hardware, and software factors
+
+- How to match framework capabilities to the constraints and requirements of a project
+
+- Ongoing innovations in frameworks for next-generation machine learning
## Introduction
-Explanation: Discuss what ML frameworks are and why they are important. Also, elaborate on the aspects involved in understanding how an ML framework is developed and deployed.
+Machine learning frameworks provide the tools and infrastructure to
+efficiently build, train, and deploy machine learning models. In this
+chapter, we will explore the evolution and key capabilities of major
+frameworks like [[TensorFlow (TF)]{.underline}](https://www.tensorflow.org/), [[PyTorch]{.underline}](https://pytorch.org/), and specialized frameworks for
+embedded devices. We will dive into the components like computational
+graphs, optimization algorithms, hardware acceleration, and more that
+enable developers to quickly construct performant models. Understanding
+these frameworks is essential to leverage the power of deep learning
+across the spectrum from cloud to edge devices.
+
+ML frameworks handle much of the complexity of model development through
+high-level APIs and domain-specific languages that allow practitioners
+to quickly construct models by combining pre-made components and
+abstractions. For example, frameworks like TensorFlow and PyTorch
+provide Python APIs to define neural network architectures using layers,
+optimizers, datasets, and more. This enables rapid iteration compared to
+coding every model detail from scratch.
+
+A key capability offered by frameworks is distributed training engines
+that can scale model training across clusters of GPUs and TPUs. This
+makes it feasible to train state-of-the-art models with billions or
+trillions of parameters on vast datasets. Frameworks also integrate with
+specialized hardware like NVIDIA GPUs to further accelerate training via
+optimizations like parallelization and efficient matrix operations.
+
+In addition, frameworks simplify deploying finished models into
+production through tools like [[TensorFlow Serving]{.underline}](https://www.tensorflow.org/tfx/guide/serving) for scalable model
+serving and [[TensorFlow Lite]{.underline}](https://www.tensorflow.org/lite) for optimization on mobile and edge devices.
+Other valuable capabilities include visualization, model optimization
+techniques like quantization and pruning, and monitoring metrics during
+training.
+
+Leading open source frameworks like TensorFlow, PyTorch, and [[MXNet]{.underline}](https://mxnet.apache.org/versions/1.9.1/) power
+much of AI research and development today. Commercial offerings like
+[[Amazon SageMaker]{.underline}](https://aws.amazon.com/pm/sagemaker/) and [[Microsoft Azure Machine Learning]{.underline}](https://azure.microsoft.com/en-us/free/machine-learning/search/) integrate these
+open source frameworks with proprietary capabilities and enterprise
+tools.
+
+Machine learning engineers and practitioners leverage these robust
+frameworks to focus on high-value tasks like model architecture, feature
+engineering, and hyperparameter tuning instead of infrastructure. The
+goal is to efficiently build and deploy performant models that solve
+real-world problems.
+
+In this chapter, we will explore today\'s leading cloud frameworks and
+how they have adapted models and tools specifically for embedded and
+edge deployment. We will compare programming models, supported hardware,
+optimization capabilities, and more to fully understand how frameworks
+enable scalable machine learning from the cloud to the edge.
+
+## Framework Evolution
+
+Machine learning frameworks have evolved considerably over time to meet
+the expanding needs of practitioners and rapid advances in deep learning
+techniques. A few decades ago, building and training machine learning
+models required extensive low-level coding and infrastructure, and early
+neural network research was constrained by insufficient data and compute
+power. But the release of large datasets like [[ImageNet]{.underline}](https://www.image-net.org/) in 2009 and advancements
+in parallel GPU computing unlocked the potential for far deeper neural
+networks.
+
+Early ML frameworks, such as [[Theano]{.underline}](https://pypi.org/project/Theano/) (2007) and [[Caffe]{.underline}](https://caffe.berkeleyvision.org/) (2014), were developed
+by academic institutions (Montreal Institute for Learning Algorithms,
+Berkeley Vision and Learning Center). Amid a growing interest in deep
+learning due to state-of-the-art performance of AlexNet (2012) on the
+ImageNet dataset, private companies and individuals began developing ML
+frameworks, resulting in frameworks such as [[Keras]{.underline}](https://keras.io/) by Google researcher
+François Chollet (2015), [[Chainer]{.underline}](https://chainer.org/) by Preferred Networks (2015),
+TensorFlow by Google (2015), [[CNTK]{.underline}](https://learn.microsoft.com/en-us/cognitive-toolkit/) by Microsoft (2016), and PyTorch by
+Facebook (2016).
+
+Many of these ML frameworks can be divided into categories, namely
+high-level vs. low-level frameworks and static vs. dynamic computational
+graph frameworks. High-level frameworks provide a higher level of
+abstraction than low-level frameworks. That is, high-level frameworks
+have pre-built functions and modules for common ML tasks, such as
+creating, training, and evaluating common ML models as well as
+preprocessing data, engineering features, and visualizing data, which
+low-level frameworks do not have. Thus, high-level frameworks may be
+easier to use, but are not as customizable as low-level frameworks (i.e.
+users of low-level frameworks can define custom layers, loss functions,
+optimization algorithms, etc.). Examples of high-level frameworks
+include TensorFlow/Keras and PyTorch. Examples of low-level ML
+frameworks include TensorFlow with low-level APIs, Theano, Caffe,
+Chainer, and CNTK.
+
+Frameworks like Theano and Caffe used static computational graphs, which
+required rigidly defining the full model architecture upfront and
+therefore limited flexibility. Dynamic graphs, by contrast, are
+constructed on-the-fly as the code runs, which supports more iterative
+development. Starting around 2016, frameworks such as PyTorch and, later,
+TensorFlow 2.0 adopted dynamic graphs, giving developers greater
+flexibility during model development. We will discuss these concepts and
+details later on in the AI Training section.
+
+The development of these frameworks facilitated an explosion in model
+size and complexity over time---from early multilayer perceptrons and
+convolutional networks to modern transformers with billions or trillions
+of parameters. In 2015, ResNet models achieved record ImageNet accuracy
+with up to 152 layers and roughly 60 million parameters. Then in 2020, the GPT-3
+language model pushed parameters to an astonishing 175 billion using
+model parallelism in frameworks to train across thousands of GPUs.
+
+Each generation of frameworks unlocked new capabilities that powered
+advancement:
+
+- Theano (2007) and TensorFlow (2015) introduced computational graphs and automatic differentiation to simplify model building.
+
+- CNTK (2016) pioneered efficient distributed training by combining model and data parallelism.
+
+- PyTorch (2016) provided imperative programming and dynamic graphs for flexible experimentation.
+
+- TensorFlow 2.0 (2019) made eager execution default for intuitiveness and debugging.
+
+- TensorFlow Graphics (2019) added 3D data structures to handle point clouds and meshes.
+
+In recent years, the field has converged on a small number of frameworks.
+TensorFlow and PyTorch have become the overwhelmingly dominant ML
+frameworks, representing more than 95% of ML frameworks used in research
+and production. Keras was integrated into TensorFlow in 2019; Preferred
+Networks transitioned Chainer to PyTorch in 2019; and Microsoft stopped
+actively developing CNTK in 2022 in favor of supporting PyTorch on
+Windows.
+
+![Popularity of ML frameworks in the United States as measured by Google
+web searches](images_ml_frameworks/image6.png){width="3.821385608048994in"
+height="2.5558081802274715in"}
+
+However, a one-size-fits-all approach does not work well across the
+spectrum from cloud to tiny edge devices. Different frameworks represent
+various philosophies around graph execution, declarative versus
+imperative APIs, and more. A declarative API defines what the program should
+do, while an imperative API specifies how to do it step-by-step. For
+instance, TensorFlow 1.x used graph execution and declarative-style modeling,
+while PyTorch adopts eager execution and imperative modeling for more
+Pythonic flexibility. Each approach carries tradeoffs that we will
+discuss later in the Basic Components section.
+
+Today\'s advanced frameworks enable practitioners to develop and deploy
+increasingly complex models - a key driver of innovation in the AI
+field. But they continue to evolve and expand their capabilities for the
+next generation of machine learning. To understand how these systems
+continue to evolve, we will dive deeper into TensorFlow as an example of
+how the framework grew in complexity over time.
+
+## Deep Dive into TensorFlow
+
+TensorFlow was developed by the Google Brain team and was released as an
+open-source software library on November 9, 2015. It was designed for
+numerical computation using data flow graphs and has since become
+popular for a wide range of machine learning and deep learning
+applications.
+
+TensorFlow is both a training and inference framework and provides
+built-in functionality to handle everything from model creation and
+training, to deployment. Since its initial development, the TensorFlow
+ecosystem has grown to include many different "varieties" of TensorFlow
+that are each intended to allow users to support ML on different
+platforms. In this section, we will mainly discuss the core
+package.
+
+### TF Ecosystem
+
+1. [[TensorFlow Core]{.underline}](https://www.tensorflow.org/tutorials): primary package that most developers engage with. It provides a comprehensive, flexible platform for defining, training, and deploying machine learning models. It includes tf.keras as its high-level API.
+
+2. [[TensorFlow Lite]{.underline}](https://www.tensorflow.org/lite): designed for deploying lightweight models on mobile, embedded, and edge devices. It offers tools to convert TensorFlow models to a more compact format suitable for limited-resource devices and provides optimized pre-trained models for mobile.
+
+3. [[TensorFlow.js]{.underline}](https://www.tensorflow.org/js): JavaScript library that allows training and deployment of machine learning models directly in the browser or on Node.js. It also provides tools for porting pre-trained TensorFlow models to the browser-friendly format.
+
+4. [[TensorFlow on Edge Devices (Coral)]{.underline}](https://developers.googleblog.com/2019/03/introducing-coral-our-platform-for.html): platform of hardware components and software tools from Google that allows the execution of TensorFlow models on edge devices, leveraging Edge TPUs for acceleration.
+
+5. [[TensorFlow Federated (TFF)]{.underline}](https://www.tensorflow.org/federated): framework for machine learning and other computations on decentralized data. TFF facilitates federated learning, allowing model training across many devices without centralizing the data.
+
+6. [[TensorFlow Graphics]{.underline}](https://www.tensorflow.org/graphics): library for using TensorFlow to carry out graphics-related tasks, including 3D shapes and point clouds processing, using deep learning.
+
+7. [[TensorFlow Hub]{.underline}](https://www.tensorflow.org/hub): repository of reusable, pre-trained machine learning model components that developers can build on, facilitating transfer learning and model composition.
+
+8. [[TensorFlow Serving]{.underline}](https://www.tensorflow.org/tfx/guide/serving): framework designed for serving and deploying machine learning models for inference in production environments. It provides tools for versioning and dynamically updating deployed models without service interruption.
+
+9. [[TensorFlow Extended (TFX)]{.underline}](https://www.tensorflow.org/tfx): end-to-end platform designed to deploy and manage machine learning pipelines in production settings. TFX encompasses components for data validation, preprocessing, model training, validation, and serving.
+
+TensorFlow was developed to address the limitations of DistBelief[^2]---the
+framework in use at Google from 2011 to 2015---by providing flexibility
+along three axes: 1) defining new layers, 2) refining training
+algorithms, and 3) defining new training algorithms. To understand what
+limitations in DistBelief led to the development of TensorFlow, we will
+first give a brief overview of the Parameter Server Architecture that
+DistBelief employed.[^3]
+
+The Parameter Server (PS) architecture is a popular design for
+distributing the training of machine learning models, especially deep
+neural networks, across multiple machines. The fundamental idea is to
+separate the storage and management of model parameters from the
+computation used to update these parameters:
+
+**Storage**: The storage and management of model parameters were handled
+by the stateful parameter server processes. Given the large scale of
+models and the distributed nature of the system, these parameters were
+sharded across multiple parameter servers. Each server maintained a
+portion of the model parameters, making it \"stateful\" as it had to
+maintain and manage this state across the training process.
+
+**Computation**: The worker processes, which could be run in parallel,
+were stateless and purely computational, processing data and computing
+gradients without maintaining any state or long-term memory.[^4]
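+
+The split between stateful parameter servers and stateless workers can
+be sketched in plain Python. This is a minimal, single-process
+illustration under stated assumptions, not DistBelief code: the class
+`ParameterServerShard`, the function `worker_gradients`, the toy linear
+model, and the learning rate are all hypothetical, and the workers run
+one after another rather than in parallel.
+
+```python
+# Hypothetical sketch of the parameter server pattern; not DistBelief code.
+
+class ParameterServerShard:
+    """Stateful process: owns one shard of the model parameters."""
+
+    def __init__(self, params):
+        self.params = dict(params)  # parameter name -> float value
+
+    def pull(self):
+        # Workers read a copy of the current parameter values.
+        return dict(self.params)
+
+    def push(self, grads, lr):
+        # Apply a plain SGD update to the parameters this shard owns.
+        for name, grad in grads.items():
+            self.params[name] -= lr * grad
+
+
+def worker_gradients(params, batch):
+    """Stateless worker: gradients of mean squared error for y = w * x + b."""
+    n = len(batch)
+    grad_w = sum(2 * (params["w"] * x + params["b"] - y) * x for x, y in batch) / n
+    grad_b = sum(2 * (params["w"] * x + params["b"] - y) for x, y in batch) / n
+    return {"w": grad_w, "b": grad_b}
+
+
+# The model state is sharded: one shard holds w, the other holds b.
+shards = {"w": ParameterServerShard({"w": 0.0}), "b": ParameterServerShard({"b": 0.0})}
+
+# Each worker sees a different slice of data generated from y = 2x + 1.
+worker_batches = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
+
+for step in range(2000):
+    for batch in worker_batches:
+        # Pull the current parameters from every shard.
+        params = {}
+        for shard in shards.values():
+            params.update(shard.pull())
+        grads = worker_gradients(params, batch)
+        # Push each gradient to the shard that owns that parameter.
+        for name, grad in grads.items():
+            shards[name].push({name: grad}, lr=0.05)
+
+final = {name: shard.pull()[name] for name, shard in shards.items()}
+print(final)  # values close to w = 2 and b = 1
+```
+
+Note that `worker_gradients` keeps nothing between calls, while each
+shard carries its parameters across the whole training run, which is the
+stateless/stateful division described above.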
+
+DistBelief and its architecture defined above were crucial in enabling
+distributed deep learning at Google but also introduced limitations that
+motivated the development of TensorFlow:
+
+### Static Computation Graph
+
+In the parameter server architecture, model parameters are distributed
+across various parameter servers. Since DistBelief was primarily
+designed for the neural network paradigm, parameters corresponded to a
+fixed structure of the neural network. If the computation graph were
+dynamic, the distribution and coordination of parameters would become
+significantly more complicated. For example, a change in the graph might
+require the initialization of new parameters or the removal of existing
+ones, complicating the management and synchronization tasks of the
+parameter servers. This made it harder to implement models outside the
+neural network paradigm or models that required dynamic computation graphs.
+
+TensorFlow was designed to be a more general computation framework[^2] where
+the computation is expressed as a data flow graph. This allows for a
+wider variety of machine learning models and algorithms outside of just
+neural networks, and provides flexibility in refining models.
+
+### Usability & Deployment
+
+The parameter server model involves a clear delineation of roles (worker
+nodes and parameter servers) and is optimized for data center
+deployments, which may not suit all use cases. For instance,
+on edge devices or in other non-data center environments, this division
+introduces overheads or complexities.
+
+TensorFlow was built to run on multiple platforms, from mobile devices
+and edge devices, to cloud infrastructure. It also aimed to make moving
+between local and distributed training easy, and to be more
+lightweight and developer-friendly.
+
+### Architecture Design
+
+Rather than using the parameter server architecture, TensorFlow instead
+deploys tasks across a cluster. These tasks are named processes that can
+communicate over a network. Each task can execute TensorFlow\'s core
+construct, the dataflow graph, and can interface with various computing
+devices (like CPUs or GPUs). This graph is a directed representation
+where nodes symbolize computational operations, and edges depict the
+tensors (data) flowing between these operations.
+
+Despite the absence of traditional parameter servers, some tasks, called
+"PS tasks", still perform the role of storing and managing parameters,
+reminiscent of parameter servers in other systems. The remaining tasks,
+which usually handle computation, data processing, and gradient
+calculations, are referred to as \"worker tasks.\" TensorFlow\'s PS
+tasks can execute any computation representable by the dataflow graph,
+meaning they aren\'t just limited to parameter storage, and the
+computation can be distributed. This capability makes them significantly
+more versatile and gives users the power to program the PS tasks using
+the standard TensorFlow interface, the same one they\'d use to define
+their models. As mentioned above, the structure of dataflow graphs also
+makes them inherently well suited to parallelism, allowing large
+datasets to be processed efficiently.
+
+### Built-in Functionality & Keras
+
+TensorFlow includes libraries to help users develop and deploy more
+use-case specific models, and since this framework is open-source, this
+list continues to grow. These libraries address the entire ML
+development life-cycle: data preparation, model building, deployment, as
+well as responsible AI.
+
+Additionally, one of TensorFlow's biggest advantages is its integration
+with Keras, though as we will cover in the next section, PyTorch
+recently also added a Keras integration. Keras is another ML framework
+that was built to be extremely user-friendly and as a result has a high
+level of abstraction. We will cover Keras in more depth later in this
+chapter, but when discussing its integration with TensorFlow, the most
+important thing to note is that it was originally built to be backend
+agnostic. This means Keras could abstract away the complexities of the
+underlying backend, offering users a cleaner, more intuitive way to
+define and train models without worrying about compatibility issues
+with different backends. TensorFlow users had some complaints about the
+usability and readability of TensorFlow's API, so as TF gained
+prominence it integrated Keras as its high-level API. This integration
+offered major benefits to TensorFlow users since it introduced more
+intuitive readability and portability of models while still taking
+advantage of powerful backend features, Google support, and
+infrastructure to deploy models on various platforms.
+
+### Limitations and Challenges
+
+TensorFlow is one of the most popular deep learning frameworks but does
+have criticisms and weaknesses, mostly centered on usability and
+resource usage. The rapid pace of updates through its support from
+Google, while advantageous, has sometimes led to issues of backward
+compatibility, deprecated functions, and shifting documentation.
+Additionally, even with the Keras integration, the syntax and
+learning curve of TensorFlow can be difficult for new users. One major
+critique of TensorFlow is its high overhead and memory consumption due
+to the range of built-in libraries and support. Some of these concerns
+can be addressed by using pared-down versions, but TensorFlow can still
+be limiting in resource-constrained environments.
+
+### PyTorch vs. TensorFlow
+
+PyTorch and TensorFlow have established themselves as frontrunners in
+the industry. Both frameworks offer robust functionalities, but they
+differ in terms of their design philosophies, ease of use, ecosystem,
+and deployment capabilities.
+
+**Design Philosophy and Programming Paradigm:** PyTorch uses a dynamic
+computational graph, an approach known as eager execution. This makes it intuitive
+and facilitates debugging since operations are executed immediately and
+can be inspected on-the-fly. In comparison, earlier versions of
+TensorFlow were centered around a static computational graph, which
+required the graph\'s complete definition before execution. However,
+TensorFlow 2.0 introduced eager execution by default, making it more
+aligned with PyTorch in this regard. PyTorch\'s dynamic nature and
+Python-based approach make it simple and flexible,
+particularly for rapid prototyping. TensorFlow\'s static graph approach
+in its earlier versions had a steeper learning curve; the introduction
+of TensorFlow 2.0, with its Keras integration as the high-level API, has
+significantly simplified the development process.
+
+**Deployment:** Although PyTorch is heavily favored in research environments,
+deploying PyTorch models in production settings was traditionally
+challenging. However, with the introduction of TorchScript and the
+TorchServe tool, deployment has become more feasible. One of
+TensorFlow\'s strengths lies in its scalability and deployment
+capabilities, especially on embedded and mobile platforms with
+TensorFlow Lite. TensorFlow Serving and TensorFlow.js further facilitate
+deployment in various environments, thus giving it a broader reach in
+the ecosystem.
+
+**Performance:** Both frameworks offer efficient hardware acceleration
+for their operations. However, TensorFlow has a slightly more robust
+optimization workflow, such as the XLA (Accelerated Linear Algebra)
+compiler, which can further boost performance. Its static computational
+graph, in the early versions, was also advantageous for certain
+optimizations.
+
+**Ecosystem:** PyTorch has a growing ecosystem with tools like
+TorchServe for serving models and libraries like TorchVision, TorchText,
+and TorchAudio for specific domains. As we mentioned earlier, TensorFlow
+has a broad and mature ecosystem. TensorFlow Extended (TFX) provides an
+end-to-end platform for deploying production machine learning pipelines.
+Other tools and libraries include TensorFlow Lite, TensorFlow.js,
+TensorFlow Hub, and TensorFlow Serving.
+
+Here's a summarizing comparative analysis:
+
+| Feature/Aspect | PyTorch | TensorFlow |
+|-----------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
+| Design Philosophy | Dynamic computational graph (eager execution) | Static computational graph (early versions); Eager execution in TensorFlow 2.0 |
+| Deployment | Traditionally challenging; Improved with TorchScript & TorchServe | Scalable, especially on embedded platforms with TensorFlow Lite |
+| Performance & Optimization | Efficient GPU acceleration | Robust optimization with XLA compiler |
+| Ecosystem | TorchServe, TorchVision, TorchText, TorchAudio | TensorFlow Extended (TFX), TensorFlow Lite, TensorFlow.js, TensorFlow Hub, TensorFlow Serving |
+| Ease of Use | Preferred for its Pythonic approach and rapid prototyping | Initially steep learning curve; Simplified with Keras in TensorFlow 2.0 |
+
+
+## Basic Framework Components
+
+### Tensor data structures
+
+To understand tensors, let us start from the familiar concepts in linear
+algebra. Vectors can be represented as a stack of numbers in a
+1-dimensional array. Matrices follow the same idea, and one can think of
+them as many vectors being stacked on each other, making it 2
+dimensional. Higher dimensional tensors work the same way. A
+3-dimensional tensor is simply a set of matrices stacked on top of each
+other in another direction, as the figure below illustrates.
+Vectors and matrices can therefore be considered special cases of
+tensors, with 1 and 2 dimensions respectively.
+
+![Visualization of Tensor Data Structure](images_ml_frameworks/image2.png){width="3.9791666666666665in" height="1.9672287839020122in" caption="Visualization of Tensor Data Structure" align="center"}
+
+Formally, in machine learning, a tensor is a multi-dimensional
+array of numbers. The number of dimensions defines the rank of the
+tensor. As a generalization of linear algebra, the study of tensors is
+called multilinear algebra. There are noticeable similarities between
+matrices and higher ranked tensors. First, it is possible to extend the
+definitions given in linear algebra to tensors, such as with
+eigenvalues, eigenvectors, and rank (in the linear algebra sense).
+Furthermore, with the way that we have defined tensors, it is possible
+to turn higher dimensional tensors into matrices. This turns out to be
+critical in practice, as multiplication of higher dimensional tensors
+is often carried out by first converting them into matrices.
+
+Tensors offer a flexible data structure because they can represent
+data in higher dimensions. For example, to represent color image data,
+for each of the pixel locations (in 2 dimensions), one needs the color
+values for red, green and blue. With tensors, it is easy to contain
+image data in a single 3-dimensional tensor with each of the numbers
+within it representing a certain color value at a certain location in
+the image. Extending even further, if we wanted to store a series of
+images, we can simply add a dimension such that the new dimension
+(creating a 4-dimensional tensor) indexes the different images that
+we have. This is how the famous MNIST dataset is typically handled:
+loading the dataset returns all of the images in a single tensor
+(4-dimensional once the channel axis is counted, although MNIST is
+grayscale and has only one channel), allowing a compact representation of all the data in one place. [^5]
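+
+The ranks and shapes described above can be inspected directly. The
+snippet below is a small sketch that uses NumPy arrays as stand-ins for
+framework tensors; the image size (32 by 32), the batch size (10), and
+the `weights` matrix are arbitrary values chosen for illustration.
+
+```python
+import numpy as np
+
+vector = np.array([1.0, 2.0, 3.0])           # rank 1, shape (3,)
+matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # rank 2, shape (2, 2)
+image = np.zeros((32, 32, 3))                # rank 3: height, width, RGB channels
+batch = np.zeros((10, 32, 32, 3))            # rank 4: 10 images stacked together
+
+for name, t in [("vector", vector), ("matrix", matrix),
+                ("image", image), ("batch", batch)]:
+    print(name, "rank:", t.ndim, "shape:", t.shape)
+
+# Flatten each image into one row so the rank-4 batch becomes a matrix,
+# then apply a single matrix multiplication to every image at once.
+flat = batch.reshape(10, -1)                 # shape (10, 3072)
+weights = np.ones((flat.shape[1], 4))        # hypothetical weight matrix
+outputs = flat @ weights                     # shape (10, 4)
+print(outputs.shape)
+```
+
+The reshape step is the "turn a higher dimensional tensor into a matrix"
+idea from the previous paragraphs: the data is unchanged, only its
+layout is reinterpreted so that an ordinary matrix product applies.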
+
+### Computational graphs
+
+#### Graph Definition
+
+Computational graphs are a key component of deep learning frameworks
+like TensorFlow and PyTorch. They allow us to express complex neural
+network architectures in a way that can be efficiently executed and
+differentiated. A computational graph consists of a directed acyclic
+graph (DAG) where each node represents an operation or variable, and
+edges represent data dependencies between them.
+
+For example, a node might represent a matrix multiplication operation,
+taking two input matrices (or tensors) and producing an output matrix
+(or tensor). To visualize this, consider the simple example below. The
+directed acyclic graph below computes $z = x \times y$, where each of
+the variables is just a number.
+
+![Basic Example of Computational Graph](images_ml_frameworks/image1.png){width="50%" height="auto" align="center" caption="Basic Example of Computational Graph"}
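+
+The graph for $z = x \times y$ can also be written out in a few lines of
+plain Python. This is an illustrative toy, not how TensorFlow or PyTorch
+implement graphs internally; the `Node` class and the `backward`
+function are hypothetical names introduced only for this sketch.
+
+```python
+class Node:
+    """A node in a tiny computational graph over scalars."""
+
+    def __init__(self, value, parents=(), local_grads=()):
+        self.value = value              # result of the forward computation
+        self.parents = parents          # nodes this node depends on (incoming edges)
+        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent
+        self.grad = 0.0                 # d(output)/d(self), filled in by backward()
+
+    def __mul__(self, other):
+        # The product node records its inputs and the local derivatives.
+        return Node(self.value * other.value,
+                    (self, other),
+                    (other.value, self.value))
+
+
+def backward(output):
+    """Propagate gradients from the output node back through the DAG."""
+    order, seen = [], set()
+
+    def visit(node):
+        # Depth-first traversal that yields a topological ordering.
+        if id(node) not in seen:
+            seen.add(id(node))
+            for parent in node.parents:
+                visit(parent)
+            order.append(node)
+
+    visit(output)
+    output.grad = 1.0
+    # Walk from the output towards the inputs, applying the chain rule.
+    for node in reversed(order):
+        for parent, local in zip(node.parents, node.local_grads):
+            parent.grad += node.grad * local
+
+
+x = Node(3.0)
+y = Node(4.0)
+z = x * y          # builds the graph and runs the forward pass
+backward(z)
+print(z.value, x.grad, y.grad)  # 12.0 4.0 3.0
+```
+
+Because every node stores its parents, the same structure that computes
+`z` can be traversed in reverse to obtain the derivatives of `z` with
+respect to `x` and `y`, which is the basis of automatic differentiation.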
+
+Underneath the hood, the computational graphs represent abstractions for
+common layers like convolutional, pooling, recurrent, and dense layers,
+with data including activations, weights, and biases represented in
+tensors. Convolutional layers form the backbone of CNN models for
+computer vision. They detect spatial patterns in input data through
+learned filters. Recurrent layers like LSTMs and GRUs enable processing
+sequential data for tasks like language translation. Attention layers
+are used in transformers to draw global context from the entire input.
+
+Broadly speaking, layers are higher level abstractions that define
+computations on top of those tensors. For example, a Dense layer
+performs a matrix multiplication and addition between input/weight/bias
+tensors. Note that a layer operates on tensors as inputs and outputs and
+the layer itself is not a tensor. Some key differences:
+
+- Layers contain states like weights and biases. Tensors are
+ stateless, just holding data.
+
+- Layers can modify internal state during training. Tensors are
+ immutable/read-only.
+
+- Layers are higher level abstractions. Tensors are lower level,
+ directly representing data and math operations.
+
+- Layers define fixed computation patterns and are used when building
+ models. Tensors flow between layers during execution.
+
+So while tensors are a core data structure that layers consume and
+produce, layers have additional functionality for defining parameterized
+operations and training. While a layer configures tensor operations
+under the hood, the layer itself remains distinct from the tensor
+objects. The layer abstraction makes building and training neural
+networks much more intuitive. This sort of abstraction enables
+developers to build models by stacking these layers together, without
+having to implement the layer logic themselves. For example, calling
+tf.keras.layers.Conv2D in TensorFlow creates a convolutional layer. The
+framework handles computing the convolutions, managing parameters, etc.
+This simplifies model development, allowing developers to focus on
+architecture rather than low-level implementations. Layer abstractions
+utilize highly optimized implementations for performance. They also
+enable portability, as the same architecture can run on different
+hardware backends like GPUs and TPUs.
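+
+The distinction between a layer and the tensors it consumes and produces
+can be made concrete with a small sketch. The `Dense` class below is a
+hypothetical, NumPy-only stand-in written for illustration; it is not
+the implementation of tf.keras.layers.Dense, and the shapes and the
+random initialization are arbitrary choices.
+
+```python
+import numpy as np
+
+
+class Dense:
+    """Hypothetical minimal dense layer: holds state, operates on tensors."""
+
+    def __init__(self, in_features, out_features, seed=0):
+        rng = np.random.default_rng(seed)
+        # The layer's state: a weight matrix and a bias vector.
+        self.weights = rng.normal(scale=0.1, size=(in_features, out_features))
+        self.bias = np.zeros(out_features)
+
+    def __call__(self, inputs):
+        # Matrix multiplication plus bias addition on the input tensor.
+        return inputs @ self.weights + self.bias
+
+
+layer = Dense(in_features=4, out_features=2)
+inputs = np.ones((3, 4))   # a batch of 3 examples with 4 features each
+outputs = layer(inputs)    # a new tensor of shape (3, 2)
+print(outputs.shape)
+
+# Training would update the layer's state, not the input tensor.
+layer.bias += 1.0
+print(np.allclose(layer(inputs), outputs + 1.0))
+```
+
+The layer object persists and carries its weights and bias between
+calls, while `inputs` and `outputs` are plain data that pass through it.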
+
+In addition, computational graphs include activation functions like
+ReLU, sigmoid, and tanh that are essential to neural networks. These
+functions introduce non-linearities that enable models to approximate
+complex functions. Frameworks provide them as simple, pre-defined
+operations that can be used when constructing models, for example
+tf.nn.relu in TensorFlow. This abstraction enables flexibility, as
+developers can easily swap activation functions for tuning performance.
+Pre-defined activations are also optimized by the framework for faster
+execution.
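+
+As a rough illustration of how interchangeable these functions are, the
+sketch below defines ReLU and sigmoid with NumPy and applies them,
+together with NumPy's built-in tanh, to the same input. The
+`activations` dictionary is a hypothetical convenience for this example;
+frameworks ship their own optimized versions.
+
+```python
+import numpy as np
+
+
+def relu(x):
+    # Keep positive values, replace negative values with zero.
+    return np.maximum(x, 0.0)
+
+
+def sigmoid(x):
+    # Squash values into the open interval (0, 1).
+    return 1.0 / (1.0 + np.exp(-x))
+
+
+# Swapping the activation is just a matter of choosing a different callable.
+activations = {"relu": relu, "sigmoid": sigmoid, "tanh": np.tanh}
+
+x = np.array([-2.0, 0.0, 2.0])
+for name, fn in activations.items():
+    print(name, fn(x))
+```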
+
+In recent years, models like ResNets and MobileNets have emerged as
+popular architectures, with current frameworks pre-packaging these as
+computational graphs. Rather than worrying about the fine details,
+developers can utilize them as a starting point, customizing as needed
+by substituting layers. This simplifies and speeds up model development,
+avoiding reinventing architectures from scratch. Pre-defined models
+include well-tested, optimized implementations that ensure good
+performance. Their modular design also enables transferring learned
+features to new tasks via transfer learning. In essence, these
+pre-defined architectures provide high-performance building blocks to
+quickly create robust models.
+
+These layer abstractions, activation functions, and predefined
+architectures provided by the frameworks are what constitute a
+computational graph. When a user defines a layer in a framework (e.g.
+tf.keras.layers.Dense()), the framework is configuring computational
+graph nodes and edges to represent that layer. The layer parameters like
+weights and biases become variables in the graph. The layer computations
+become operation nodes (such as the x and y in the figure above). When
+you call an activation function like tf.nn.relu(), the framework adds a
+ReLU operation node to the graph. Predefined architectures are just
+pre-configured subgraphs that can be inserted into your model\'s graph.
+Thus, model definition via high-level abstractions creates a
+computational graph. The layers, activations, and architectures we use
+become graph nodes and edges.
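+
The mapping from layer calls to graph nodes can be sketched with a toy graph builder. This is an illustrative model only, not how TensorFlow represents graphs internally; the `Node` and `Graph` classes and the node naming scheme are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One vertex in a toy computational graph."""
    name: str
    kind: str                      # "input", "variable", or "op"
    inputs: list = field(default_factory=list)

class Graph:
    def __init__(self):
        self.nodes = []

    def add(self, name, kind, inputs=()):
        node = Node(name, kind, list(inputs))
        self.nodes.append(node)
        return node

    def dense(self, x, name):
        """A 'layer' call expands into variables plus matmul and add ops."""
        w = self.add(f"{name}/weights", "variable")
        b = self.add(f"{name}/bias", "variable")
        mm = self.add(f"{name}/matmul", "op", [x, w])
        return self.add(f"{name}/add", "op", [mm, b])

    def relu(self, x, name):
        """An activation call adds a single op node."""
        return self.add(f"{name}/relu", "op", [x])

g = Graph()
x = g.add("x", "input")
h = g.relu(g.dense(x, "dense1"), "dense1")
node_names = [n.name for n in g.nodes]
```

One dense layer followed by a ReLU yields six nodes here: the input, two variables, and three operations, with the `inputs` lists acting as the graph's edges.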
+
+When we define a neural network architecture in a framework, we are
+implicitly constructing a computational graph. The framework uses this
+graph to determine operations to run during training and inference.
+Computational graphs bring several advantages over raw code, which is
+why they are one of the core functionalities offered by a good ML framework:
+
+- Explicit representation of data flow and operations
+
+- Ability to optimize graph before execution
+
+- Automatic differentiation for training
+
+- Language agnosticism - graph can be translated to run on GPUs, TPUs, etc.
+
+- Portability - graph can be serialized, saved, and restored later
+
+Computational graphs are the fundamental building blocks of ML
+frameworks. The framework compilers and optimizers operate on these
+graphs to generate executable code. Essentially, the abstractions
+provide a developer-friendly API for building computational graphs.
+Under the hood, it\'s still graphs all the way down! So while you may
+not directly manipulate graphs as a framework user, they enable your
+high-level model specifications to be efficiently executed. The
+abstractions simplify model-building while computational graphs make it
+possible.
+
+#### Static vs. Dynamic Graphs
+
+Deep learning frameworks have traditionally followed one of two
+approaches for expressing computational graphs.
+
+**Static graphs (declare-then-execute):** With this model, the entire
+computational graph must be defined upfront before it can be run. All
+operations and data dependencies must be specified during the
+declaration phase. TensorFlow originally followed this static approach -
+models were defined in a separate context, then a session was created to
+run them. The benefit of static graphs is that they allow more aggressive
+optimization, since the framework can see the full graph. But they also
+tend to be less flexible for research and interactivity. Changes to the
+graph require re-declaring the full model.
+
+For example:
+
+``x = tf.placeholder(tf.float32)``
+
+``y = tf.matmul(x, weights) + biases``
+
+The model is defined separately from execution, like building a
+blueprint. For TensorFlow 1.x, this is done using tf.Graph(). All ops
+and variables must be declared upfront. Subsequently, the graph is
+compiled and optimized before running. Execution is done later by
+feeding in tensor values.
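+
The declare-then-execute pattern can be mimicked in a few lines of plain Python. This is a conceptual sketch, not TensorFlow 1.x code: `Placeholder`, `Op`, and `run` are hypothetical stand-ins for placeholders, graph operations, and a session's run call.

```python
class Placeholder:
    """A named slot whose value is supplied only at run time."""
    def __init__(self, name):
        self.name = name

class Op:
    """A deferred operation: stores what to compute, not the result."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

def run(node, feed):
    """Execution phase: recursively evaluate the graph using fed values."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    if isinstance(node, Op):
        return node.fn(*(run(i, feed) for i in node.inputs))
    return node  # plain Python constant

# Declaration phase: nothing is computed here.
x = Placeholder("x")
y = Op(lambda a, b: a + b, Op(lambda a, b: a * b, x, 3.0), 1.0)  # y = 3x + 1

# Execution phase: the same graph is reused with different inputs.
results = [run(y, {"x": v}) for v in (0.0, 2.0)]
```

Because the whole expression exists before any value is fed in, a real framework has the chance to inspect and optimize it first, which is the advantage the static approach trades flexibility for.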
+
+**Dynamic graphs (define-by-run):** In contrast to declaring everything
+first and then executing, the graph is built dynamically as execution
+happens. There is no separate declaration phase - operations execute
+immediately as they are defined. This style is more imperative and
+flexible, facilitating experimentation.
+
+PyTorch uses dynamic graphs. For example, in the following code snippet
+the graph is built on-the-fly as each line executes:
+
+``x = torch.randn(4,784)``
+
+``y = torch.matmul(x, weights) + biases``
+
+In the above example, there are no separate compile/build/run phases.
+Ops define and execute immediately. With dynamic graphs, definition is
+intertwined with execution. This provides a more intuitive, interactive
+workflow. But the downside is less potential for optimizations, since
+the framework only sees the graph as it is built.
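+
A define-by-run system can be sketched as values that compute immediately and record their history as a side effect. The `Value` class below is a hypothetical illustration in plain Python, not PyTorch's tensor implementation.

```python
class Value:
    """A number that records how it was produced as a side effect of running."""
    def __init__(self, data, parents=(), op="leaf"):
        self.data = data
        self.parents = parents
        self.op = op

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "add")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "mul")

def depth(v):
    """Length of the longest chain of recorded ops leading to v."""
    return 0 if not v.parents else 1 + max(depth(p) for p in v.parents)

def model(x, steps):
    # Ordinary Python control flow decides the graph's shape on each call.
    out = x
    for _ in range(steps):
        out = out * Value(2.0) + Value(1.0)
    return out

a = model(Value(1.0), steps=1)   # computed immediately: 1*2 + 1
b = model(Value(1.0), steps=3)   # a deeper graph from the same code
```

The same `model` function produces graphs of different depth depending on its arguments, which is easy to write in this style and awkward to express when the graph must be fully declared up front.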
+
+Recently, however, the distinction has blurred as frameworks adopt both
+modes. TensorFlow 2.0 defaults to dynamic graph mode (eager execution),
+while still letting users work with static graphs when needed, for
+example by wrapping code in tf.function. Dynamic declaration makes
+frameworks easier to use, while static models provide optimization
+benefits. The ideal framework offers both options.
+
+Static graph declaration provides optimization opportunities but less
+interactivity. While dynamic execution offers flexibility and ease of
+use, it may have performance overhead. Here is a table comparing the
+pros and cons of static vs dynamic execution graphs:
+
+| Execution Graph | Pros | Cons |
+| --- | --- | --- |
+| Static (Declare-then-execute) | Enables graph optimizations by seeing the full model ahead of time; can export and deploy frozen graphs; graph is packaged independently of code | Less flexible for research and iteration; changes require rebuilding the graph; execution has separate compile and run phases |
+| Dynamic (Define-by-run) | Intuitive imperative style like Python code; interleaves graph build with execution; easy to modify graphs; debugging seamlessly fits workflow | Harder to optimize without the full graph; possible slowdowns from graph building during execution; can require more memory |
+
+### Data Pipeline Tools
+
+Computational graphs can only be as good as the data they learn from and
+work on. Therefore, feeding training data efficiently is crucial for
+optimizing deep neural network performance, though it is often
+overlooked as one of the core functionalities. Many modern AI frameworks
+provide specialized pipelines to ingest, process, and augment datasets
+for model training.
+
+#### Data Loaders
+
+At the core of these pipelines are data loaders, which handle reading
+training examples from storage formats like CSV files or image folders
+and from sources like databases and object storage. Deep learning
+models require diverse data formats depending on the application. Among
+the popular formats are:
+
+- CSV: A versatile, simple format often used for tabular data.
+
+- TFRecord: TensorFlow\'s native binary format, optimized for performance.
+
+- Parquet: Columnar storage, offering efficient data compression and retrieval.
+
+- JPEG/PNG: Commonly used for image data.
+
+- WAV/MP3: Prevalent formats for audio data.
+
+For instance, tf.data is TensorFlow\'s data loading pipeline:
+https://www.tensorflow.org/guide/data
+
+Data loaders batch examples, grouping multiple data points for
+simultaneous processing to leverage the vectorized computation
+capabilities of hardware like GPUs. While typical batch sizes range from
+32-512 examples, the optimal size often depends on the memory footprint
+of the data and the specific hardware constraints. Advanced loaders can
+also stream large datasets from disk, cloud storage, or networks instead
+of loading them fully into memory, which enables virtually unlimited
+dataset sizes.
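+
Batching over a streamed source can be sketched with Python generators. The `stream_examples` function is a hypothetical stand-in for lazily reading records from disk or a network, and `batched` is an illustrative helper, not a framework API.

```python
from itertools import islice

def stream_examples(n):
    """Stand-in for reading records lazily from disk or network."""
    for i in range(n):
        yield {"id": i, "feature": float(i)}

def batched(examples, batch_size):
    """Group a (possibly unbounded) iterator into lists of batch_size items."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batch_sizes = [len(b) for b in batched(stream_examples(10), batch_size=4)]
```

Only one batch is held in memory at a time, so the dataset size is bounded by storage rather than RAM. The final batch may be smaller than the rest, which real loaders usually let you either keep or drop.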
+
+Data loaders can also shuffle data across epochs for randomization, and
+preprocess features in parallel with model training to expedite the
+training process. Randomly shuffling the order of examples between
+training epochs reduces bias and improves generalization.
+
+Data loaders also support caching and prefetching strategies to optimize
+data delivery for fast, smooth model training. Caching keeps preprocessed
+batches in memory so they can be reused across multiple training steps,
+eliminating redundant processing. Prefetching, on the other hand,
+involves preloading subsequent batches so that the model rarely idles
+waiting for data.
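+
Prefetching can be illustrated with a background thread that reads ahead into a bounded queue. This is a minimal sketch using only the Python standard library; `prefetch` and its `buffer_size` parameter are hypothetical names, and framework implementations are considerably more elaborate.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from iterable while a background thread reads ahead."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)          # blocks when the buffer is full
        q.put(sentinel)          # signals that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

batches = [[0, 1], [2, 3], [4, 5]]
seen = list(prefetch(iter(batches)))
```

While the consumer is busy with one batch, the producer thread is already loading the next ones, up to `buffer_size` ahead, so loading time overlaps with compute time.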
+
+#### Data Augmentation
+
+Besides loading, data augmentation expands datasets synthetically.
+Augmentations apply random transformations like flipping, cropping,
+rotating, altering color, adding noise etc. for images. For audio,
+common augmentations involve mixing clips with background noise, or
+modulating speed/pitch/volume.
+
+Frameworks like TensorFlow and PyTorch simplify applying random
+augmentations each epoch by integrating them into the data pipeline. By
+programmatically increasing variation in the training data
+distribution, augmentations reduce overfitting and improve model
+generalization.
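+
As a rough sketch of on-the-fly image augmentation, the example below applies a random horizontal flip and small pixel noise to an image stored as nested lists. The function names, the noise scale, and the [0, 1] pixel range are assumptions for illustration; frameworks operate on tensors and offer many more transformations.

```python
import random

def random_flip(image, rng, p=0.5):
    """Mirror each row left-to-right with probability p."""
    return [row[::-1] for row in image] if rng.random() < p else image

def add_noise(image, rng, scale=0.05):
    """Add small uniform noise to every pixel, clamped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.uniform(-scale, scale))) for px in row]
            for row in image]

def augment(image, rng):
    """Compose the transformations; called afresh for every example and epoch."""
    return add_noise(random_flip(image, rng), rng)

image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(3)]
```

Because `augment` draws fresh random numbers on every call, the model sees a slightly different version of each example in each epoch while the stored dataset stays unchanged.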
+
+Together, performant
+data loaders and extensive augmentations enable practitioners to feed
+massive, varied datasets to neural networks efficiently. Hands-off data
+pipelines represent a significant improvement in usability and
+productivity. They allow developers to focus more on model architecture
+and less on data wrangling when training deep learning models.
+
+### Optimization Algorithms
+
+Training a neural network is fundamentally an iterative process that
+seeks to minimize a loss function. At its core, the goal is to fine-tune
+the model weights and parameters to produce predictions as close as
+possible to the true target labels. Machine learning frameworks have
+greatly streamlined this process by offering extensive support in three
+critical areas: loss functions, optimization algorithms, and
+regularization techniques.
+
+Loss functions quantify the difference between the model\'s predictions
+and the true values. Different tasks require different loss functions
+to perform properly, as the loss function tells the computer the
+"objective" to aim for. Commonly used loss functions are Mean Squared
+Error (MSE) for regression tasks and Cross-Entropy Loss for
+classification tasks.
+
+To demonstrate some of the loss functions, imagine that you have a set of $N$ inputs and the corresponding outputs, where $Y_n$ denotes the true output for the $n$-th input. The inputs are fed into the model, and the model outputs a prediction, which we can call $\hat{Y_n}$. With the predicted value and the real value, we can, for example, use the MSE to calculate the loss:
+
+$$\text{MSE} = \frac{1}{N}\sum_{n=1}^{N}(Y_n - \hat{Y_n})^2$$
+
+If the problem is a classification problem, we do not want to use the MSE, since the distance between the predicted value and the real value does not have significant meaning. For example, if one wants to recognize handwritten digits and the true digit is 2, predicting 9 is no more wrong than predicting 3, even though 9 is numerically further from 2. Therefore, we use the cross-entropy loss function. Here $Y_n$ is 1 if class $n$ is the true class and 0 otherwise, $\hat{Y_n}$ is the predicted probability of class $n$, and the sum runs over the $N$ classes:
+
+$$\text{Cross-Entropy} = -\sum_{n=1}^{N}Y_n\log(\hat{Y_n})$$
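+
The two loss functions above can be evaluated by hand in a few lines of Python. This sketch assumes the cross-entropy is applied to a single example with a one-hot label and a vector of predicted class probabilities; the numbers are made up for illustration.

```python
import math

def mse(targets, predictions):
    """Mean squared error over N (target, prediction) pairs."""
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n

def cross_entropy(one_hot, probabilities):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probabilities) if y > 0)

# Regression: errors of 0.5, 0.0 and 1.0 give (0.25 + 0 + 1) / 3.
regression_loss = mse([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])

# Three-class example where the true class is index 1.
label = [0.0, 1.0, 0.0]
confident = cross_entropy(label, [0.05, 0.90, 0.05])
unsure = cross_entropy(label, [0.30, 0.40, 0.30])
```

The cross-entropy depends only on the probability assigned to the correct class, so a confident correct prediction gives a lower loss than an unsure one, regardless of how the remaining probability is spread across the wrong classes.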
+
+
+
+Once a loss like the ones above is computed, we need methods to adjust
+the model\'s parameters to reduce this loss or error during the training
+process. To do so, current frameworks use a gradient-based approach,
+which computes how much the value of the loss function changes when the
+weights are adjusted in a certain way. Knowing this gradient, the model
+moves its weights in the direction that reduces the loss. There are many
+challenges associated with this, however, primarily stemming from the
+fact that the optimization problem is not convex, making it difficult to
+solve. Modern frameworks come equipped with efficient implementations of
+several optimization algorithms, many of which are variants of gradient
+descent with stochastic methods and adaptive learning rates. More
+details and clear examples can be found in the AI Training section.
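+
A minimal sketch of gradient descent on a one-parameter toy loss shows the update rule in action. The loss here is deliberately convex, unlike real neural network losses, and the learning rate and step count are arbitrary choices for the example.

```python
def loss(w):
    """Toy convex loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient to reduce the loss."""
    history = [loss(w)]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(loss(w))
    return w, history

w_final, history = gradient_descent(w=0.0)
```

Each step moves `w` opposite to the gradient, so the loss shrinks on every iteration and `w_final` ends up close to 3. Framework optimizers apply the same idea to millions of parameters at once, with gradients supplied by automatic differentiation.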
+
+Last but not least, overly complex models tend to overfit, meaning they
+perform well on the training data but fail to generalize to new, unseen
+data (see Overfitting). To counteract this, regularization methods are
+employed to penalize model complexity and encourage it to learn simpler
+patterns. Dropout for instance randomly sets a fraction of input units
+to 0 at each update during training, which helps prevent overfitting.
+
+However, there are cases where the problem is more complex than what the model can represent, and this may result in underfitting. Therefore, choosing the right model architecture is also a critical step in the training process. Further heuristics and techniques are discussed in the AI Training section.
+
+Frameworks also provide efficient implementations of gradient descent,
+Adagrad, Adadelta, and Adam. Adding regularization like dropout and
+L1/L2 penalties helps prevent overfitting during training. Batch
+normalization accelerates training by normalizing inputs to layers.
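+
The dropout technique mentioned above can be sketched as follows. This version uses the common "inverted dropout" variant, which rescales the surviving units so the expected activation is unchanged; that rescaling is an assumption of the sketch and is not spelled out in the text.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random subset, rescale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(activations)   # inference: pass values through unchanged
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000
dropped = dropout(acts, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
```

Roughly half of the units are zeroed on each call, while the mean activation stays near its original value. At inference time the function is a no-op.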
+
+### Model Training Support
+
+Before training a defined neural network model, a compilation step is
+required. During this step, the high-level architecture of the neural
+network is transformed into an optimized, executable format. This
+process comprises several steps. The construction of the computational
+graph is the first step. It represents all the mathematical operations
+and data flow within the model. We discussed this earlier.
+
+During training, the focus is on executing the computational graph.
+Before that can happen, every parameter within the graph, such as
+weights and biases, is assigned an initial value. This value might be
+random or based on a predefined logic, depending on the chosen
+initialization method.
+
+The next critical step is memory allocation. Essential memory is
+reserved for the model\'s operations on both CPUs and GPUs, ensuring
+efficient data processing. The model\'s operations are then mapped to
+the available hardware resources, particularly GPUs or TPUs, to expedite
+computation. Once compilation is finalized, the model is prepared for
+training.
+
+The training process employs various tools to enhance efficiency. Batch
+processing is commonly used to maximize computational throughput.
+Techniques like vectorization enable operations on entire data arrays,
+rather than proceeding element-wise, which bolsters speed. Optimizations
+such as kernel fusion (refer to the Optimizations chapter) amalgamate
+multiple operations into a single action, minimizing computational
+overhead. Operations can also be segmented into phases, facilitating the
+concurrent processing of different mini-batches at various stages.
+
+Frameworks can periodically checkpoint the state, preserving
+intermediate model versions during training. This ensures that if an
+interruption occurs, the progress isn\'t wholly lost, and training can
+resume from the last checkpoint. Additionally, the system can monitor
+the model\'s performance against a validation data set. Should the model
+begin to overfit (that is, if its performance on the validation set
+declines), training can be halted automatically, a technique known as
+early stopping, conserving computational resources and time.
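+
The interplay of checkpointing and early stopping can be sketched with a loop over per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions for illustration; real frameworks expose this logic through callbacks or training utilities.

```python
def train_with_early_stopping(val_losses, patience=2):
    """Walk through per-epoch validation losses and stop when they stall.

    Returns (epoch at which training stopped, epoch of the best checkpoint).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch   # "save a checkpoint"
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch              # halt early
    return len(val_losses) - 1, best_epoch

# Validation loss improves, then rises as the model starts to overfit.
stopped_at, restore_from = train_with_early_stopping(
    [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
)
```

Training halts at epoch 4, after two epochs without improvement, and the checkpoint from epoch 2 is the one to restore. The remaining epochs are never run.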
+
+ML frameworks incorporate a blend of model compilation, enhanced batch
+processing methods, and utilities such as checkpointing and early
+stopping. These resources manage the complex aspects of performance,
+enabling practitioners to zero in on model development and training. As
+a result, developers experience both speed and ease when utilizing the
+capabilities of neural networks. [^6]
+
+### Validation and Analysis
+
+After training deep learning models, frameworks provide utilities to
+evaluate performance and gain insights into the models\' workings. These
+tools enable disciplined experimentation and debugging.
+
+#### Evaluation Metrics
+
+Frameworks include implementations of common evaluation metrics for
+validation:
+
+- Accuracy - Fraction of correct predictions overall. Widely used for classification.
+
+- Precision - Of positive predictions, how many were actually positive. Useful for imbalanced datasets.
+
+- Recall - Of actual positives, how many did we predict correctly. Measures completeness.
+
+- F1-score - Harmonic mean of precision and recall. Combines both metrics.
-- Definition of ML Frameworks
-- What is an ML framework?
-- Why are ML frameworks important?
-- Go over the design and implementation
-- Examples of ML frameworks
-- Challenges of embedded systems
+- AUC-ROC - Area under ROC curve. Used for classification threshold analysis.
-## Evolution of AI Frameworks
+- MAP - Mean Average Precision. Evaluates ranked predictions in retrieval/detection.
-- High-level vs. low-level frameworks
-- Static vs. dynamic computation graph frameworks
-- Plot showing number of different frameworks and shrinking
+- Confusion Matrix - Matrix that shows the true positives, true negatives, false positives, and false negatives. Provides a more detailed view of classification performance.
-## Types of AI Frameworks
+These metrics quantify model performance on validation data for
+comparison.
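+
Several of the metrics listed above follow directly from the four confusion-matrix counts. The sketch below computes them for a small, made-up binary classification result; it is plain Python, not a framework metrics API.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

With 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, accuracy is 0.75 while precision, recall, and F1 are all 2/3. On imbalanced data, accuracy alone can look better than the per-class picture.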
-- Cloud-based AI frameworks
-- Edge AI frameworks
-- TinyML frameworks
+#### Visualization
-## Popular AI Frameworks
+Visualization tools provide insight into models:
-Explanation: Discuss the most common types of ML frameworks available and provide a high-level overview, so that we can set into motion what makes embedded ML frameworks unique.
+- Loss curves - Plot training and validation loss over time to spot overfitting.
-- TensorFlow, PyTorch, Keras, ONNX Runtime, Scikit-learn
-- Key Features and Advantages
-- API and Programming Paradigms
-- Table comparing the different frameworks
-## Basic Components
-- Computational graphs
-- Tensor data structures
-- Distributed training
-- Model optimizations
-- Code generation
-- Differentiable programming
-- Hardware acceleration support (GPUs, TPUs)
+- Activation grids - Illustrate features learned by convolutional filters.
+
+- Projection - Reduce dimensionality for intuitive visualization.
+
+- Precision-recall curves - Assess classification tradeoffs.
+
+Tools like [[TensorBoard]{.underline}](https://www.tensorflow.org/tensorboard/scalars_and_keras)
+for TensorFlow and [[TensorWatch]{.underline}](https://github.com/microsoft/tensorwatch) for PyTorch enable
+real-time metrics and visualization during training.
+
+### Differentiable programming
+
+With machine learning training methods such as backpropagation relying
+on the change in the loss function with respect to the change in
+weights (which essentially is the definition of derivatives), the
+ability to quickly and efficiently train large machine learning models
+relies on the computer's ability to take derivatives. This makes
+differentiable programming one of the most important elements of a
+machine learning framework.
+
+There are primarily four methods that we can use to make computers take
+derivatives. First, we can manually figure out the derivatives by hand
+and input them to the computer. One can see that this would quickly
+become a nightmare with many layers of neural networks, if we had to
+compute all the derivatives in the backpropagation steps by hand.
+Another method is symbolic differentiation using computer algebra
+systems such as Mathematica, but this can introduce a layer of
+inefficiency, as the symbolic expressions for derivatives can grow much
+larger than the original function. Numerical derivatives, the practice
+of approximating gradients using finite difference methods, suffer from
+many problems including high computational costs, and a larger step
+size can lead to a significant amount of error.
+
+This leads to automatic differentiation, which applies the chain rule
+to the primitive operations that computers use to represent a function
+to obtain an exact derivative. With automatic differentiation, the
+computational complexity of computing the gradient is proportional to
+that of computing the function itself. The intricacies of automatic
+differentiation are not dealt with by end users now, but resources to
+learn more can be found widely, such as
+[[here]{.underline}](https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf).
+Automatic differentiation and differentiable programming are
+ubiquitous today and are done efficiently and automatically by modern
+machine learning frameworks.
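+
To contrast automatic differentiation with finite differences, here is a small sketch using dual numbers. Note the assumption: this is forward-mode automatic differentiation, chosen because it fits in a few lines, whereas frameworks typically use reverse mode (backpropagation) for training. The `Dual` class and the test function `f` are hypothetical.

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic propagates both exactly."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(x):
    # Chain rule for a primitive: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def f(x):
    return x * x + sin(x)          # f(x) = x^2 + sin(x)

x0 = 1.5
autodiff = f(Dual(x0, 1.0)).deriv  # seed dx/dx = 1
exact = 2 * x0 + math.cos(x0)      # hand-derived f'(x) for comparison

h = 1e-3                           # finite-difference step size
numeric = (f(Dual(x0 + h)).value - f(Dual(x0)).value) / h
```

The automatic derivative matches the hand-derived one to floating-point precision, while the finite-difference estimate carries an error that grows with the step size `h`.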
+
+### Hardware Acceleration
+
+The trend to continuously train and deploy larger machine learning
+models has essentially made hardware acceleration support a necessity
+for machine learning platforms. Deep layers of neural networks require
+many matrix multiplications, which favors hardware that can compute
+matrix operations fast and in parallel. In this landscape, two types of
+hardware architectures, the [[GPU and
+TPU]{.underline}](https://cloud.google.com/tpu/docs/intro-to-tpu), have
+emerged as leading choices for training machine learning models.
+
+The use of hardware accelerators began with
+[[AlexNet]{.underline}](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf),
+which paved the way for future works to utilize GPUs as hardware
+accelerators for training computer vision models. GPUs, or Graphics
+Processing Units, excel in handling a large number of computations at once, making them
+ideal for the matrix operations that are central to neural network
+training. Their architecture, designed for rendering graphics, turns out
+to be perfect for the kind of mathematical operations required in
+machine learning. While they are very useful for machine learning tasks
+and have been implemented in many hardware platforms, GPUs are still
+general purpose in that they can be used for other applications.
+
+On the other hand, [[Tensor Processing
+Units]{.underline}](https://cloud.google.com/tpu/docs/intro-to-tpu)
+(TPUs) are hardware units designed specifically for neural networks. They
+focus on the multiply and accumulate (MAC) operation, and their hardware
+essentially consists of a large hardware matrix that contains elements
+efficiently computing the MAC operation. This design, called the [[systolic
+array
+architecture]{.underline}](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1653825),
+was pioneered in 1979 by H.T. Kung and Charles E. Leiserson, and has
+proven to be a useful structure to efficiently compute matrix products
+and other operations within neural networks (such as convolutions).
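+
To see why the MAC operation is the unit of work worth accelerating, the sketch below writes matrix multiplication as explicit multiply-accumulate steps and counts them. It runs sequentially on a CPU and does not model a systolic array. It only makes the operation count visible.

```python
def matmul_with_mac_count(a, b):
    """Multiply matrices a (m x k) and b (k x n), counting MAC operations."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]   # one multiply-accumulate
                macs += 1
            c[i][j] = acc
    return c, macs

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, macs = matmul_with_mac_count(a, b)
```

An m x k by k x n product needs m * k * n MACs, and the MACs for different output elements are independent of each other. That independence is what hardware with many parallel MAC units exploits.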
+
+While TPUs can drastically reduce training times, they also have
+disadvantages. For example, many operations within the machine learning
+frameworks (primarily TensorFlow here since the TPU directly integrates
+with it) are not supported by TPUs. They also cannot support custom
+operations from the machine learning frameworks, and the network
+design must closely align to the hardware capabilities.
+
+Today, NVIDIA GPUs dominate training, aided by software libraries like
+[[CUDA]{.underline}](https://developer.nvidia.com/cuda-toolkit),
+[[cuDNN]{.underline}](https://developer.nvidia.com/cudnn), and
+[[TensorRT.]{.underline}](https://developer.nvidia.com/tensorrt#:~:text=NVIDIA%20TensorRT%2DLLM%20is%20an,knowledge%20of%20C%2B%2B%20or%20CUDA.)
+Frameworks also tend to include optimizations to maximize performance on
+these hardware types, like pruning unimportant connections and fusing
+layers. Combining these techniques with hardware acceleration provides
+greater efficiency. For inference, hardware is increasingly moving
+towards optimized ASICs and SoCs. Google\'s TPUs accelerate models in
+data centers. Apple, Qualcomm, and others now produce AI-focused mobile
+chips. The NVIDIA Jetson family targets autonomous robots.
## Advanced Features
-- AutoML, No-Code/Low-Code ML
-- Transfer learning
-- Federated learning
-- Model conversion
-- Distributed training
-- End-to-End ML Platforms
+### Distributed training
-## Embedded AI Constraints
+As machine learning models have become larger over the years, it has
+become essential for large models to utilize multiple computing nodes in
+the training process. This process, called distributed learning, has
+allowed for higher training capabilities, but has also imposed
+challenges in implementation.
-Explanation: Describe the constraints of embedded systems, referring to the previous chapters, and remind readers about the challenges and why we need to consider creating lean and efficient solutions.
+We can consider three different ways to spread the work of training
+machine learning models to multiple computing nodes. The first is input
+data partitioning, which refers to multiple processors running the same
+model on different input partitions. This is the easiest to implement
+and is available in many machine learning frameworks. The more
+challenging distribution of work comes with model parallelism, which
+refers to multiple computing nodes working on different parts of the
+model, and pipelined model parallelism, which refers to multiple
+computing nodes working on different layers of the model on the same
+input. The latter two mentioned here are active research areas.
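+
Input data partitioning can be sketched by splitting a batch into shards, computing a gradient per shard, and averaging. The example simulates the workers sequentially in one process, with a one-parameter linear model and made-up data. The averaging step stands in for the gradient exchange that a real distributed setup performs over a network.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, shards, learning_rate):
    """Each worker computes a gradient on its shard; the results are averaged."""
    local_grads = [gradient(w, shard) for shard in shards]   # runs per worker
    avg_grad = sum(local_grads) / len(local_grads)           # gradient exchange
    return w - learning_rate * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]      # y = 2x
shards = [data[:2], data[2:]]                                # equal-sized

w_parallel = data_parallel_step(0.0, shards, learning_rate=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, data)
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so the parallel update matches the single-node update while each worker only touches its own part of the data.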
-### Hardware
+ML frameworks that support distributed learning include TensorFlow
+(through its
+[[tf.distribute]{.underline}](https://www.tensorflow.org/api_docs/python/tf/distribute)
+module), PyTorch (through its
+[[torch.nn.DataParallel]{.underline}](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
+and
+[[torch.nn.parallel.DistributedDataParallel]{.underline}](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
+modules), and MXNet (through its
+[[gluon]{.underline}](https://mxnet.apache.org/versions/1.9.1/api/python/docs/api/gluon/index.html)
+API).
-- Memory Usage
-- Processing Power
-- Energy Efficiency
-- Storage Limitations
-- Hardware Diversity
+### Model Conversion
-### Software
+Machine learning models can be represented in various ways in order
+to be used within different frameworks and for different device types.
+For example, a model can be converted to be compatible with inference
+frameworks on a mobile device. TensorFlow models are commonly saved as
+checkpoint files containing the weights, or in the SavedModel format,
+which also stores the computation graph; these are needed in case we
+have to retrain the models. But for mobile deployment, models are
+typically converted to TensorFlow Lite format. TensorFlow Lite uses a
+compact flatbuffer representation and optimizations for fast inference
+on mobile hardware, discarding all the unnecessary baggage associated
+with training metadata such as checkpoint file structures.
+
+Model optimizations like quantization (see Optimizations chapter) can
+further optimize models for target architectures like mobile.
+Quantization reduces the precision of weights and activations to uint8
+or int8 for a smaller footprint and faster execution with supported
+hardware accelerators. For post-training quantization, TensorFlow\'s
+converter handles analysis and conversion automatically.
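+
The arithmetic behind int8 quantization can be sketched with a simple affine scheme: a scale and a zero point map floats to 8-bit integers and back. This is an illustrative sketch under those assumptions, not the algorithm used by TensorFlow\'s converter. The function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Affine-quantize floats to int8 and return (ints, scale, zero_point)."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)   # range must include 0
    scale = (hi - lo) / 255.0 or 1.0                        # avoid divide-by-zero
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now occupies one byte instead of four, at the cost of a rounding error no larger than the quantization step `scale`. Zero is represented exactly, which matters for padding and for activations such as ReLU.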
+
+Frameworks like TensorFlow simplify deploying trained models to mobile
+and embedded IoT devices through easy conversion APIs for TFLite format
+and quantization. Ready-to-use conversion enables high performance
+inference on mobile without manual optimization burden. Besides TFLite,
+other common targets include TensorFlow.js for web deployment,
+TensorFlow Serving for cloud services, and TensorFlow Hub for transfer
+learning. TensorFlow\'s conversion utilities handle these scenarios to
+streamline end-to-end workflows.
+
+More information about model conversion in TensorFlow is linked
+[[here]{.underline}](https://www.tensorflow.org/lite/models/convert).
+
+### AutoML, No-Code/Low-Code ML
+
+In many cases, machine learning can have a relatively high barrier of
+entry compared to other fields. To successfully train and deploy models,
+one needs to have a critical understanding of a variety of disciplines,
+from data science (data processing, data cleaning), model structures
+(hyperparameter tuning, neural network architecture), hardware
+(acceleration, parallel processing), and more depending on the problem
+at hand. The complexity of these problems has led to the introduction
+of frameworks such as AutoML, which aims to make "Machine learning
+available for non-Machine Learning experts" and to "automate research in
+machine learning". Its developers have constructed AutoWEKA, which aids
+in the complex process of hyperparameter selection, as well as
+Auto-sklearn and Auto-pytorch, which extend this approach to the
+popular sklearn and PyTorch libraries.
+
+While these efforts to automate parts of machine learning tasks are
+underway, others have focused on making it easier to construct machine
+learning models through no-code/low-code machine learning, utilizing a
+drag-and-drop interface with an easy-to-navigate user interface.
+Companies such as Apple, Google, and Amazon have already created these
+easy-to-use platforms to allow users to construct machine learning
+models that can integrate into their ecosystems.
+
+These steps to lower the barrier to entry continue to democratize
+machine learning, making it easier to access for beginners and
+simplifying the workflow for experts.
+
+### Advanced Learning Methods
+
+#### Transfer Learning
+
+Transfer learning is the practice of using knowledge gained from a
+pretrained model to train and improve the performance of a model for a
+different task. For example, models such as MobileNet and ResNet that
+have been trained on the ImageNet dataset can help classify other image
+datasets. To do so, one may freeze the pretrained model, utilizing it
+as a feature extractor to train a much smaller model that is built on
+top of the feature extraction. One can also fine-tune the entire model
+to fit the new task. Transfer learning has a series of challenges, in
+that the modified model may not be able to conduct its original tasks
+after transfer learning. Papers such as [["Learning without
+Forgetting"]{.underline}](https://browse.arxiv.org/pdf/1606.09282.pdf)
+aim to address these challenges, and such methods have been implemented
+in modern machine learning platforms.
+
+#### Federated Learning
+
+Consider the problem of labeling items that are present in a photo from
+personal devices. One may consider moving the image data from the
+devices to a central server, where a single model will train using the
+image data provided by the devices. However, this presents many
+potential challenges. First, with many devices one needs a massive
+network infrastructure to move and store data from these devices to a
+central location. With the number of devices that are present today this
+is often not feasible, and very costly. Furthermore, there are privacy
+challenges associated with moving personal data, such as photos, to
+central servers.
+
+[[Federated learning]{.underline}](https://arxiv.org/abs/1602.05629) is
+a form of distributed computing that resolves these issues by
+distributing the models into personal devices for them to be trained on
+device. At the beginning, a base global model is trained on a central
+server to be distributed to all devices. Using this base model, the
+devices individually compute the gradients and send them back to the
+central hub. Intuitively this is the transfer of model parameters
+instead of the data itself. This innovative approach allows the model to
+be trained with many different datasets (which, in our example, would be
+the set of images that are on personal devices), without the need to
+transfer a large amount of potentially sensitive data. However,
+federated learning also comes with a series of challenges.
+
+
+In many real-world situations, data collected from devices may not come with suitable labels. This issue is compounded by the fact that users, who are often the primary source of data, can be unreliable, so even labeled data carries no guarantee of accuracy or relevance. Furthermore, each user's data is unique, resulting in significant variance across the data generated by different users. This non-IID (not independent and identically distributed) nature of the data, coupled with unbalanced data production in which some users generate far more data than others, can adversely impact the performance of the global model. Researchers have worked to compensate for this, for example by
+adding a proximal term that balances the local and global models, and
+by adding a frozen [[global hypersphere
+classifier]{.underline}](https://arxiv.org/abs/2207.09413).
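+
+To make the proximal-term idea concrete, one common formulation (used,
+for example, in the FedProx method) has each device $k$ minimize its
+local loss plus a penalty for drifting away from the current global
+model. The symbols are notation introduced here for illustration:
+$F_k$ is the local loss on device $k$, $w^t$ is the global model at
+round $t$, and $\mu \ge 0$ controls the strength of the penalty.
+
+$$
+\min_{w} \; F_k(w) + \frac{\mu}{2} \, \lVert w - w^t \rVert^2
+$$
+
+With $\mu = 0$, this reduces to ordinary local training. Larger values
+of $\mu$ keep local models closer to the global model, which limits
+the divergence caused by non-IID data.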
+
+There are additional challenges associated with federated learning. The number of mobile device owners can far exceed the average number of training samples on each device, leading to substantial communication overhead. This issue is particularly pronounced in the context of mobile networks, which are often used for such communication and can be unstable. This instability can result in delayed or failed transmission of model updates, thereby affecting the overall training process.
+
+The heterogeneity of device resources is another hurdle. Devices participating in federated learning can have varying computational power and memory capacity. This diversity makes it challenging to design algorithms that are efficient across all devices. Privacy and security are also not guaranteed by federated learning: techniques such as gradient inversion attacks can be used to extract information about the training data from the model parameters. Despite these challenges, its many potential benefits continue to make federated learning a popular research area. Open source programs such as [[Flower]{.underline}](https://flower.dev/) have been developed to make it simpler to implement federated learning with a variety of machine learning frameworks.
+
+
+
+
+
+## Framework Specialization
+
+Thus far, we have talked about ML frameworks generally. In practice,
+frameworks are typically optimized for the computational capabilities
+and application requirements of their target environment, ranging from
+the cloud to the edge to tiny devices. Choosing a framework that
+matches the deployment environment is therefore crucial. This section
+provides an overview of the major types of AI frameworks tailored for
+cloud, edge, and tinyML environments to help understand the similarities
+and differences between these ecosystems.
-- Library Dependency
-- Lack of OS
+### Cloud
+
+Cloud-based AI frameworks assume access to ample computational power,
+memory, and storage resources in the cloud. They generally support both
+training and inference. Cloud-based AI frameworks are suited for
+applications where data can be sent to the cloud for processing, such as
+cloud-based AI services, large-scale data analytics, and web
+applications. Popular cloud AI frameworks include the ones we mentioned
+earlier such as TensorFlow, PyTorch, MXNet, Keras, and others. These
+frameworks utilize technologies like GPUs, TPUs, distributed training,
+and AutoML to deliver scalable AI. Concepts like model serving, MLOps,
+and AIOps relate to the operationalization of AI in the cloud. Cloud AI
+powers services like Google Cloud AI and enables transfer learning using
+pre-trained models.
+
+### Edge
+
+Edge AI frameworks are tailored for deploying AI models on edge devices,
+such as IoT devices, smartphones, and edge servers. They are optimized
+for devices with moderate computational resources, offering a balance
+between power and performance, and are ideal for applications requiring
+real-time or near-real-time processing, including robotics, autonomous
+vehicles, and smart devices. Key edge AI frameworks include TensorFlow
+Lite, PyTorch Mobile, Core ML, and others.
+They employ optimizations like model compression, quantization, and
+efficient neural network architectures. Hardware support includes CPUs,
+GPUs, NPUs and accelerators like the Edge TPU. Edge AI enables use cases
+like mobile vision, speech recognition, and real-time anomaly detection.
+
+### Embedded
+
+TinyML frameworks are specialized for deploying AI models on extremely
+resource-constrained devices, specifically microcontrollers and sensors
+within the IoT ecosystem. They are designed for devices with severely
+limited resources, emphasizing minimal memory and power consumption,
+and target applications such as predictive maintenance, gesture
+recognition, and environmental monitoring. Major tinyML frameworks
+include TensorFlow Lite Micro, uTensor, and ARM NN. They optimize
+complex models to fit within kilobytes of memory through techniques
+like quantization-aware training and reduced precision. TinyML allows
+intelligent sensing across battery-powered devices and can enable
+collaborative learning via federated learning.
+
+The choice of framework involves balancing model performance against
+the computational constraints of the target platform, whether cloud,
+edge, or tinyML. Here is a summary table comparing the major AI
+frameworks across cloud, edge, and tinyML environments:
+
+
+| Framework Type | Examples | Key Technologies | Use Cases |
+|----------------|-----------------------------------|-------------------------------------------------------------------------|------------------------------------------------------|
+| Cloud AI | TensorFlow, PyTorch, MXNet, Keras | GPUs, TPUs, distributed training, AutoML, MLOps | Cloud services, web apps, big data analytics |
+| Edge AI | TensorFlow Lite, PyTorch Mobile, Core ML | Model optimization, compression, quantization, efficient NN architectures | Mobile apps, robots, autonomous systems, real-time processing |
+| TinyML | TensorFlow Lite Micro, uTensor, ARM NN | Quantization-aware training, reduced precision, neural architecture search | IoT sensors, wearables, predictive maintenance, gesture recognition |
+
+
+**Key differences:**
+
+- Cloud AI leverages massive computational power for complex models
+ using GPUs/TPUs and distributed training.
+
+- Edge AI optimizes models to run locally on resource-constrained edge
+ devices.
+
+- TinyML fits models into extremely low memory and compute
+ environments like microcontrollers.
## Embedded AI Frameworks
-Explanation: Now, discuss specifically about the unique embedded AI frameworks that are available and why they are special, etc.
+### Resource Constraints
+
+Embedded systems face severe resource constraints that pose unique
+challenges for deploying machine learning models compared to traditional
+computing platforms. For example, microcontroller units (MCUs) commonly
+used in IoT devices often have:
+
+- **RAM** in the range of tens of kilobytes to a few megabytes. The
+ popular ESP8266 MCU has around 80KB RAM available to developers.
+ This contrasts with 8GB or more on typical laptops and desktops
+ today.
+
+- **Flash storage** ranging from tens of kilobytes to a few
+ megabytes. The Arduino Uno microcontroller provides just 32KB of
+ storage for code. Standard computers today have disk storage on
+ the order of terabytes.
+
+- **Processing power** from just a few MHz to approximately 200MHz.
+ The ESP8266 operates at 80MHz. This is far slower than the
+ multi-GHz, multi-core CPUs in servers and high-end laptops.
+
+These tight constraints make training machine learning models directly
+on microcontrollers infeasible in most cases. The limited RAM precludes
+handling large datasets for training. Energy usage for training would
+also quickly deplete battery-powered devices. Instead, models are
+trained on resource-rich systems and deployed on microcontrollers for
+optimized inference. But even inference poses challenges:
+
+1. **Model Size:** AI models are often too large to fit on embedded
+ and IoT devices. This necessitates model compression techniques,
+ such as quantization, pruning, and knowledge distillation.
+ Additionally, as we will see later in this section, many of the
+ frameworks used by developers for AI development have large amounts
+ of overhead and built-in libraries that embedded systems can't
+ support.
+
+2. **Complexity of Tasks:** With only tens of KBs to a few MBs of RAM,
+ IoT devices and embedded systems are constrained in the complexity
+ of tasks they can handle. Tasks that require large datasets or
+ sophisticated algorithms (for example, LLMs), which would run
+ smoothly on traditional computing platforms, might be infeasible
+ on embedded systems without compression or other optimization
+ techniques due to memory limitations.
+
+3. **Data Storage and Processing:** Embedded systems often process data
+ in real-time and might not store large amounts of data locally.
+ Conversely, traditional computing systems can hold and process
+ large datasets in memory, enabling faster data operations and
+ analysis as well as real-time updates.
+
+4. **Security and Privacy:** Limited memory also restricts the
+ complexity of security algorithms and protocols, data encryption,
+ reverse engineering protections, and more that can be implemented
+ on the device. This can potentially make some IoT devices more
+ vulnerable to attacks.
+
+Consequently, specialized software optimizations and ML frameworks
+tailored for microcontrollers are necessary to work within these tight
+resource bounds. Clever optimization techniques like quantization,
+pruning and knowledge distillation compress models to fit within limited
+memory (see Optimizations section). Learnings from neural architecture
+search help guide model designs.
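+
+A back-of-the-envelope calculation shows why reduced precision matters
+for fitting a model into flash. The sketch below assumes weight storage
+dominates model size and ignores metadata and activation memory. The
+parameter count and the flash budget are hypothetical numbers chosen
+only for illustration.
+
+```python
+# Rough size estimate: number of parameters times bytes per parameter.
+BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}
+
+def model_size_kb(num_params, dtype):
+    """Approximate weight storage in KB (ignores metadata and activations)."""
+    return num_params * BYTES_PER_PARAM[dtype] / 1024
+
+def fits(num_params, dtype, budget_kb):
+    """True if the weights alone fit within the given storage budget."""
+    return model_size_kb(num_params, dtype) <= budget_kb
+
+# Hypothetical 250,000-parameter model and a hypothetical 512 KB flash budget.
+NUM_PARAMS = 250_000
+FLASH_BUDGET_KB = 512
+
+for dtype in BYTES_PER_PARAM:
+    size = model_size_kb(NUM_PARAMS, dtype)
+    verdict = "fits" if fits(NUM_PARAMS, dtype, FLASH_BUDGET_KB) else "too large"
+    print(f"{dtype}: {size:.0f} KB -> {verdict}")
+```
+
+Under these assumptions the 32-bit floating-point weights exceed the
+budget, while the 16-bit and 8-bit versions fit, before accounting for
+the framework's own code size.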
+
+Hardware improvements like dedicated ML accelerators also help
+alleviate constraints. For instance, Qualcomm's Hexagon DSP
+provides acceleration for TensorFlow Lite models on Snapdragon mobile
+chips. Google's Edge TPU packs ML performance into a tiny ASIC for edge
+devices. ARM Ethos-U55 offers efficient inference on Cortex-M class
+microcontrollers. These customized ML chips unlock advanced capabilities
+for resource-constrained applications.
+
+Generally, due to the limited processing power, it's almost always
+infeasible to train AI models on IoT or embedded systems. Instead,
+models are trained on powerful traditional computers (often with GPUs)
+and then deployed on the embedded device for inference. TinyML
+specifically deals with this, ensuring models are lightweight enough for
+real-time inference on these constrained devices.
+
+### Frameworks & Libraries
+
+Embedded AI frameworks are software tools and libraries designed to
+enable artificial intelligence (AI) and machine learning (ML)
+capabilities on embedded systems. These frameworks are essential for
+bringing AI to IoT (Internet of Things) devices, robotics, and other
+edge computing platforms and they are designed to work where
+computational resources, memory, and power consumption are limited.
+
+### Challenges
+
+While embedded systems present an enormous opportunity for deploying
+machine learning to enable intelligent capabilities at the edge, these
+resource-constrained environments also pose significant challenges.
+Unlike typical cloud or desktop environments rich with computational
+resources, embedded devices introduce severe constraints around memory,
+processing power, energy efficiency, and specialized hardware. As a
+result, existing machine learning techniques and frameworks designed for
+server clusters with abundant resources do not directly translate to
+embedded systems. This section uncovers some of the challenges and
+opportunities for embedded systems and ML frameworks.
+
+#### Fragmented Ecosystem
+
+The lack of a unified ML framework led to a highly fragmented ecosystem.
+Engineers at companies like STMicroelectronics, NXP Semiconductors, and
+Renesas had to develop custom solutions tailored to their specific
+microcontroller and DSP architectures. These ad-hoc frameworks required
+extensive manual optimization for each low-level hardware platform. This
+made porting models extremely difficult, requiring redevelopment for new
+Arm, RISC-V or proprietary architectures.
+
+#### Disparate Hardware Needs
+
+Without a shared framework, there was no standard way to assess
+hardware capabilities. Vendors like Intel, Qualcomm and NVIDIA
+created integrated solutions blending model, software and hardware
+improvements. This made it hard to discern the source of performance
+gains: were new chip designs, such as Intel's low-power x86 cores,
+responsible, or were software optimizations? A standard framework was
+needed so vendors could evaluate their hardware's capabilities in a
+fair, reproducible way.
+
+#### Lack of Portability
+
+Adapting models trained in common frameworks like TensorFlow or PyTorch
+to run efficiently on microcontrollers was very challenging without
+standardized tools. It required time-consuming manual translation of
+models to run on specialized DSPs from companies like CEVA or low-power
+Arm Cortex-M cores. There were no turnkey tools enabling portable
+deployment across different architectures.
+
+#### Incomplete Infrastructure
+
+The infrastructure to support key model development workflows was
+lacking. There was minimal support for compression techniques to fit
+large models within constrained memory budgets. Tools for quantization
+to lower precision for faster inference were missing. Standardized APIs
+for integration into applications were incomplete. Essential
+functionality like on-device debugging, metrics, and performance
+profiling was absent. These gaps increased the cost and difficulty of
+embedded ML development.
+
+#### No Standard Benchmark
-- TensorFlow Lite
-- ONNX Runtime
-- MicroPython
-- CMSIS-NN
-- Edge Impulse
-- Others (briefly mention some less common but significant frameworks)
+Without unified benchmarks, there was no standard way to assess and
+compare the capabilities of different hardware platforms from vendors
+like NVIDIA, Arm and Ambiq Micro. Existing evaluations relied on
+proprietary benchmarks tailored to showcase the strengths of particular
+chips. This made it impossible to objectively measure hardware
+improvements in a fair, neutral manner.
+
+#### Minimal Real-World Testing
+
+Many of the benchmarks relied on synthetic data. Rigorously testing
+models on real-world embedded applications was difficult without
+standardized datasets and benchmarks. This raised questions about how
+performance claims would translate to real-world usage. More extensive
+testing was needed to validate chips in actual use cases.
+
+The lack of shared frameworks and infrastructure slowed TinyML adoption,
+hampering the integration of ML into embedded products. Recent
+standardized frameworks have begun addressing these issues through
+improved portability, performance profiling, and benchmarking support.
+But ongoing innovation is still needed to enable seamless,
+cost-effective deployment of AI to edge devices.
+
+### Summary
+
+The absence of standardized frameworks, benchmarks, and infrastructure
+for embedded ML has traditionally hampered adoption. However, recent
+progress has been made in developing shared frameworks like TensorFlow
+Lite Micro and benchmark suites like MLPerf Tiny that aim to accelerate
+the proliferation of TinyML solutions. But overcoming the fragmentation
+and difficulty of embedded deployment remains an ongoing process.
+
+#### Examples
+
+Machine learning deployment on microcontrollers and other embedded
+devices often requires specially optimized software libraries and
+frameworks to work within the tight constraints of memory, compute, and
+power. Several options exist for performing inference on such
+resource-limited hardware, each with its own approach to optimizing
+model execution. This section will explore the key characteristics and
+design principles behind TFLite Micro, TinyEngine, and CMSIS-NN,
+providing insight into how each framework tackles the complex problem of
+high-accuracy yet efficient neural network execution on
+microcontrollers. They showcase different approaches for implementing
+efficient TinyML frameworks.
+
+The table summarizes the key differences and similarities between these
+three specialized machine learning inference frameworks for embedded
+systems and microcontrollers.
+
+| Framework | TensorFlow Lite Micro | TinyEngine | CMSIS-NN |
+|------------------------|:----------------------------:|:--------------------------------------:|:--------------------------------------:|
+| **Approach** | Interpreter-based | Static compilation | Optimized neural network kernels |
+| **Hardware Focus** | General embedded devices | Microcontrollers | ARM Cortex-M processors |
+| **Arithmetic Support** | Floating point | Floating point, fixed point | Floating point, fixed point |
+| **Model Support** | General neural network models| Models co-designed with TinyNAS | Common neural network layer types |
+| **Code Footprint** | Larger due to inclusion of interpreter and ops | Small, includes only ops needed for model | Lightweight by design |
+| **Latency** | Higher due to interpretation overhead | Very low due to compiled model | Low latency focus |
+| **Memory Management** | Dynamically managed by interpreter | Model-level optimization | Tools for efficient allocation |
+| **Optimization Approach** | Some code generation features | Specialized kernels, operator fusion | Architecture-specific assembly optimizations |
+| **Key Benefits** | Flexibility, portability, ease of updating models | Maximizes performance, optimized memory usage | Hardware acceleration, standardized API, portability |
+
+
+In the following sections, we will dive into understanding each of these
+in greater detail.
+
+##### TFLM (Interpreter)
+
+TensorFlow Lite Micro (TFLM) is a machine learning inference framework
+designed for embedded devices with limited resources. It uses an
+interpreter to load and execute machine learning models, which provides
+flexibility and ease of updating models in the field.
+
+Traditional interpreters often have significant branching overhead,
+which can reduce performance. However, machine learning model
+interpretation benefits from the efficiency of long-running kernels,
+where each kernel runtime is relatively large and helps mitigate
+interpreter overhead.
+
+An alternative to an interpreter-based inference engine is to generate
+native code from a model during export. This can improve performance,
+but it sacrifices portability and flexibility, as the generated code
+needs recompilation for each target platform and must be replaced
+entirely to modify a model.
+
+TFLM strikes a balance between the simplicity of code compilation and
+the flexibility of an interpreter-based approach by incorporating
+certain code-generation features. For example, the library can be
+constructed solely from source files, offering much of the compilation
+simplicity associated with code generation while retaining the benefits
+of an interpreter-based model execution framework.
+
+An interpreter-based approach offers several benefits over code
+generation for machine learning inference on embedded devices:
+
+- Flexibility: Models can be updated in the field without recompiling
+ the entire application.
+
+- Portability: The interpreter can be used to execute models on
+ different target platforms without porting the code.
+
+- Memory efficiency: The interpreter can share code across multiple
+ models, reducing memory usage.
+
+- Ease of development: Interpreters are easier to develop and maintain
+ than code generators.
+
+TensorFlow Lite Micro is a powerful and flexible framework for machine
+learning inference on embedded devices. Its interpreter-based approach
+offers several benefits over code generation, including flexibility,
+portability, memory efficiency, and ease of development.
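+
+The flexibility argument can be illustrated with a toy interpreter.
+This is a simplified sketch, not TFLM's actual API or model format:
+the op names, the `KERNELS` table, and the list-of-tuples model
+representation are hypothetical. The point is that the kernels are
+built into the application once, while the model itself is plain data
+that can be swapped without changing or recompiling any code.
+
+```python
+# Kernels are compiled into the application once; models are plain data.
+KERNELS = {
+    "scale": lambda xs, p: [x * p["factor"] for x in xs],
+    "add_bias": lambda xs, p: [x + p["bias"] for x in xs],
+    "relu": lambda xs, p: [max(0.0, x) for x in xs],
+}
+
+def run_model(model, inputs):
+    """Walk the op list and dispatch each op to its kernel."""
+    values = inputs
+    for op_name, params in model:
+        values = KERNELS[op_name](values, params)
+    return values
+
+# Two model versions expressed purely as data.
+model_v1 = [("scale", {"factor": 2.0}), ("relu", {})]
+model_v2 = [("scale", {"factor": 2.0}), ("add_bias", {"bias": 1.0}), ("relu", {})]
+
+print(run_model(model_v1, [-1.0, 0.5]))
+print(run_model(model_v2, [-1.0, 0.5]))
+```
+
+Moving from `model_v1` to `model_v2` changes only the data passed to
+`run_model`. The cost is the per-op dictionary lookup and dispatch,
+which is small when each kernel does a large amount of work.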
+
+##### TinyEngine (Compiler-based)
+
+TinyEngine is an ML inference framework designed specifically for
+resource-constrained microcontrollers. It employs several optimizations
+to enable high-accuracy neural network execution within the tight
+constraints of memory, compute, and storage on microcontrollers.
+
+Inference frameworks like TFLite Micro use interpreters to execute the
+neural network graph dynamically at runtime. This adds overhead in the
+form of memory used to store metadata, interpretation latency, and
+missed optimization opportunities, although the TFLite Micro authors
+argue that the overhead is small. TinyEngine eliminates this overhead
+by employing a code generation approach. During compilation, it
+analyzes the network graph and generates specialized code to execute
+just that model. This code is natively compiled into the application
+binary, avoiding runtime interpretation costs.
+
+Conventional ML frameworks schedule memory per layer, trying to minimize
+usage for each layer separately. TinyEngine does model-level scheduling
+instead, analyzing memory usage across layers. It allocates a common
+buffer size based on the max memory needs of all layers. This buffer is
+then shared efficiently across layers to increase data reuse.
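+
+A simplified way to picture model-level scheduling is to size one
+shared buffer from the most demanding layer instead of giving every
+activation tensor its own allocation. The sketch below is an
+illustration under strong assumptions, not TinyEngine's actual
+planner: it treats the network as a simple chain, counts only input
+and output activations, and uses hypothetical layer sizes.
+
+```python
+# Hypothetical per-layer activation sizes in bytes: (input, output).
+LAYER_ACTIVATIONS = [
+    (12_288, 16_384),
+    (16_384, 8_192),
+    (8_192, 8_192),
+    (8_192, 1_024),
+]
+
+def shared_buffer_size(layers):
+    """Peak working set: the largest input + output any single layer needs."""
+    return max(inp + out for inp, out in layers)
+
+def naive_total(layers):
+    """Total memory if every activation tensor had its own allocation."""
+    return layers[0][0] + sum(out for _, out in layers)
+
+print("separate allocations:", naive_total(LAYER_ACTIVATIONS), "bytes")
+print("one shared buffer:   ", shared_buffer_size(LAYER_ACTIVATIONS), "bytes")
+```
+
+Because only one layer runs at a time, a buffer sized for the peak
+layer can be reused by all the others.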
+
+TinyEngine also specializes the kernels for each layer through
+techniques like tiling, unrolling, and fusing operators. For example, it
+will generate unrolled compute kernels with the exact number of loops
+needed for a 3x3 or 5x5 convolution. These specialized kernels extract
+maximum performance from the microcontroller hardware. TinyEngine also
+uses depthwise convolutions that are optimized to minimize memory
+allocations by computing each channel's output in place over the input
+channel data. This technique exploits the channel-separable nature of
+depthwise convolutions to reduce peak memory size.
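+
+Kernel specialization through unrolling can be sketched with a tiny
+code generator. This is a hypothetical illustration in Python, not
+TinyEngine's generator, which emits C for microcontrollers. The
+function `generate_conv1d_kernel` and the 1-D convolution are
+assumptions made to keep the example short. The generator bakes the
+kernel weights and the exact number of taps into straight-line code,
+so the inner loop over taps disappears.
+
+```python
+def generate_conv1d_kernel(weights):
+    """Emit Python source for a 1-D convolution unrolled for these weights."""
+    taps = " + ".join(f"{w!r} * x[i + {k}]" for k, w in enumerate(weights))
+    return (
+        "def conv1d(x):\n"
+        f"    return [{taps} for i in range(len(x) - {len(weights) - 1})]\n"
+    )
+
+def generic_conv1d(x, weights):
+    """Reference version with a runtime inner loop over the kernel taps."""
+    n = len(weights)
+    return [sum(weights[k] * x[i + k] for k in range(n))
+            for i in range(len(x) - n + 1)]
+
+WEIGHTS = [1.0, 0.0, -1.0]
+source = generate_conv1d_kernel(WEIGHTS)
+print(source)
+
+namespace = {}
+exec(source, namespace)  # stand-in for compiling the generated code
+conv1d = namespace["conv1d"]
+
+signal = [1.0, 2.0, 4.0, 7.0]
+print(conv1d(signal), generic_conv1d(signal, WEIGHTS))
+```
+
+Both versions compute the same result. The generated one trades
+generality for a fixed, loop-free inner computation, which is the same
+trade-off a compiler-based inference engine makes.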
+
+Similar to TFLite Micro, the compiled TinyEngine binary only includes
+ops needed for a specific model rather than all possible operations.
+This results in a very small binary footprint, keeping code size low for
+memory-constrained devices.
+
+One difference between TFLite Micro and TinyEngine is that the latter is
+co-designed with "TinyNAS," an architecture search method for
+microcontroller models, similar to differentiable NAS. The efficiency
+of TinyEngine allows exploring larger and more accurate models through
+NAS. It also provides feedback to TinyNAS on which models can fit
+within the hardware constraints.
+
+Through all these various custom techniques like static compilation,
+model-based scheduling, specialized kernels, and co-design with NAS,
+TinyEngine enables high-accuracy deep learning inference within the
+tight resource constraints of microcontrollers.
+
+##### CMSIS-NN (Library)
+
+CMSIS-NN, standing for Cortex Microcontroller Software Interface
+Standard for Neural Networks, is a software library devised by ARM. It
+offers a standardized interface for deploying neural network inference
+on microcontrollers and embedded systems, with a particular focus on
+optimization for ARM Cortex-M processors.
+
+**Neural Network Kernels:** CMSIS-NN is equipped with highly efficient
+kernels that handle fundamental neural network operations such as
+convolution, pooling, fully connected layers, and activation functions.
+It caters to a broad range of neural network models by supporting both
+floating-point and fixed-point arithmetic. The latter is especially
+beneficial for resource-constrained devices as it reduces memory and
+computational requirements (see Quantization).
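+
+The benefit of fixed-point arithmetic can be seen in a small sketch of
+a Q7 dot product, where each value is stored as an 8-bit signed
+integer with 7 fractional bits. This is plain Python written for
+illustration, not a CMSIS-NN function; the helper names `to_q7` and
+`q7_dot` are hypothetical.
+
+```python
+Q7_SCALE = 128  # Q7: 1 sign bit, 7 fractional bits; represents [-1, 1)
+
+def to_q7(x):
+    """Quantize a real value to an 8-bit signed integer, saturating at the limits."""
+    return max(-128, min(127, round(x * Q7_SCALE)))
+
+def q7_dot(a_q7, b_q7):
+    """Integer multiply-accumulate; the accumulator holds Q14 values."""
+    acc = sum(a * b for a, b in zip(a_q7, b_q7))
+    return acc / (Q7_SCALE * Q7_SCALE)  # convert the Q14 result to a real number
+
+weights = [0.5, -0.25, 0.75]
+inputs = [0.5, 0.5, -0.5]
+
+w_q7 = [to_q7(w) for w in weights]
+x_q7 = [to_q7(x) for x in inputs]
+
+exact = sum(w * x for w, x in zip(weights, inputs))
+print("q7 weights:", w_q7)
+print("fixed-point result:", q7_dot(w_q7, x_q7), "exact:", exact)
+```
+
+Each weight occupies one byte instead of four, and the inner loop uses
+only integer multiplies and adds, which are the operations that SIMD
+and DSP instructions accelerate.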
+
+**Hardware Acceleration:** CMSIS-NN harnesses the power of Single
+Instruction, Multiple Data (SIMD) instructions available on many
+Cortex-M processors. This allows for parallel processing of multiple
+data elements within a single instruction, thereby boosting
+computational efficiency. Certain Cortex-M processors feature Digital
+Signal Processing (DSP) extensions that CMSIS-NN can exploit for
+accelerated neural network execution. The library also incorporates
+assembly-level optimizations tailored to specific microcontroller
+architectures to further enhance performance.
+
+**Standardized API:** CMSIS-NN offers a consistent and abstracted API
+that protects developers from the complexities of low-level hardware
+details. This makes the integration of neural network models into
+applications simpler. It may also encompass tools or utilities for
+converting popular neural network model formats into a format that is
+compatible with CMSIS-NN.
+
+**Memory Management:** CMSIS-NN provides functions for efficient memory
+allocation and management, which is vital in embedded systems where
+memory resources are scarce. It ensures optimal memory usage during
+inference and in some instances, allows for in-place operations to
+further decrease memory overhead.
+
+**Portability:** CMSIS-NN is designed with portability in mind across
+various Cortex-M processors. This enables developers to write code that
+can operate on different microcontrollers without significant
+modifications.
+
+**Low Latency:** CMSIS-NN minimizes inference latency, making it an
+ideal choice for real-time applications where swift decision-making is
+paramount.
+
+**Energy Efficiency:** The library is designed with a focus on energy
+efficiency, making it suitable for battery-powered and
+energy-constrained devices.
## Choosing the Right Framework
-- Factors to consider: ease of use, community support, performance, scalability, etc.
-- Integration with data engineering tools
-- Integration with model optimization tools
+Choosing the right machine learning framework for a given application
+requires carefully evaluating three aspects: models, hardware, and
+software. By analyzing these aspects, ML engineers can select the
+optimal framework and customize it as needed for efficient and
+performant on-device ML applications. The goal is to balance model
+complexity, hardware limitations, and software integration to design a
+tailored ML pipeline for embedded and edge devices.
+
+![TensorFlow Framework Comparison - General](images_ml_frameworks/image4.png){width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - General"}
+
+### Model
+
+TensorFlow supports significantly more operators (ops) than TensorFlow
+Lite and TensorFlow Lite Micro, as it is typically used for research or
+cloud deployment, which require a large number of ops and more
+flexibility in how they are used. TensorFlow Lite supports select ops
+for on-device training, whereas TensorFlow Lite Micro does not.
+TensorFlow Lite also supports dynamic shapes and quantization-aware
+training, but TensorFlow Lite Micro does not. In contrast, TensorFlow
+Lite and TensorFlow Lite Micro offer native quantization tooling and
+support, where quantization refers to the process of transforming an
+ML program into an approximated representation that uses the available
+lower-precision operations.
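+
+As a worked illustration of what this approximation means, a widely
+used affine quantization scheme maps a real value $x$ to an integer $q$
+using a scale $s$ and a zero point $z$, and recovers an approximation
+$\hat{x}$ from $q$. The notation is introduced here for illustration
+and is not taken from any particular framework's documentation.
+
+$$
+q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right),
+\qquad
+\hat{x} = s \, (q - z)
+$$
+
+For example, to represent values in $[0, 2.55]$ with 8-bit unsigned
+integers ($q_{\min} = 0$, $q_{\max} = 255$), one can choose $s = 0.01$
+and $z = 0$. The value $x = 1.234$ then maps to
+$q = \mathrm{round}(123.4) = 123$, which is recovered as
+$\hat{x} = 1.23$, an error of $0.004$ in exchange for storing one byte
+instead of four.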
+
+### Software
+![TensorFlow Framework Comparison - Software](images_ml_frameworks/image5.png){width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - Software"}
+
+
+
+TensorFlow Lite Micro does not have OS support, while TensorFlow and
+TensorFlow Lite do. Dropping the OS dependency reduces memory overhead,
+makes startup times faster, and consumes less energy. TensorFlow Lite
+Micro can still be used in conjunction with real-time operating systems
+(RTOS) like FreeRTOS, Zephyr, and Mbed OS. TensorFlow Lite and
+TensorFlow Lite Micro support model memory mapping, allowing models to
+be directly accessed from flash storage rather than loaded into RAM,
+whereas TensorFlow does not. TensorFlow and TensorFlow Lite support
+accelerator delegation to schedule code to different accelerators,
+whereas TensorFlow Lite Micro does not, as embedded systems tend not to
+have a rich array of specialized accelerators.
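+
+The idea behind model memory mapping can be demonstrated on a desktop
+with Python's standard `mmap` module. This is only an analogy under
+stated assumptions: on a microcontroller the model is read directly
+from flash, whereas here a temporary file with made-up bytes stands in
+for the model, and the operating system pages data in on demand.
+
+```python
+import mmap
+import os
+import tempfile
+
+# Write a stand-in "model file" (hypothetical format: raw bytes).
+with tempfile.NamedTemporaryFile(delete=False) as f:
+    f.write(bytes(range(256)) * 16)  # 4096 bytes of fake weights
+    model_path = f.name
+
+with open(model_path, "rb") as f:
+    # Map the file read-only; data is paged in only when it is accessed.
+    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as weights:
+        model_bytes = len(weights)
+        first_layer = weights[0:4]          # reads only the first few bytes
+        last_byte = weights[model_bytes - 1]
+
+os.remove(model_path)
+print(model_bytes, list(first_layer), last_byte)
+```
+
+The program never reads the whole file into a Python buffer. It
+indexes into the mapping, much as an embedded runtime indexes into
+weights that remain in flash.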
+
+### Hardware
+
+![TensorFlow Framework Comparison - Hardware](images_ml_frameworks/image3.png){width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - Hardware"}
+
+TensorFlow Lite and TensorFlow Lite Micro have significantly smaller
+base binary sizes and base memory footprints compared to TensorFlow. For
+example, a typical TensorFlow Lite Micro binary is less than 200KB,
+whereas TensorFlow is much larger. This reflects the
+resource-constrained environments of embedded systems that they target.
+TensorFlow provides support for x86, TPUs, and GPUs from NVIDIA, AMD,
+and Intel. TensorFlow Lite provides support for Arm Cortex-A and x86
+processors commonly used in mobile phones and tablets, and it is
+stripped of all the training logic that is not necessary for on-device
+deployment. TensorFlow Lite Micro provides support for
+microcontroller-focused Arm Cortex-M cores like M0, M3, M4, and M7, as
+well as DSPs like Hexagon and SHARC and MCUs like STM32, NXP Kinetis,
+and Microchip AVR.
+
+Selecting the appropriate AI framework is essential to ensure that
+embedded systems can efficiently execute AI models. There are key
+factors to consider when choosing a machine learning framework, with a
+focus on ease of use, community support, performance, scalability,
+integration with data engineering tools, and integration with model
+optimization tools. By understanding these factors, you can make
+informed decisions and maximize the potential of your machine learning
+initiatives.
+
+### Other Factors
+
+When evaluating AI frameworks for embedded systems, several other key
+factors beyond models, hardware, and software should be considered.
+
+#### Performance
+
+Performance is critical in embedded systems where computational
+resources are limited. Evaluate the framework's ability to optimize
+model inference for embedded hardware. Factors such as model
+quantization and hardware acceleration support play a crucial role in
+achieving efficient inference.
+
+#### Scalability
-## Framework Comparison
+Scalability is essential when considering the potential growth of an
+embedded AI project. The framework should support the deployment of
+models on a variety of embedded devices, from microcontrollers to more
+powerful processors. It should also handle both small-scale and
+large-scale deployments seamlessly.
-Explanation: Provide a high-level comparison of the different frameworks based on class slides, etc.
+#### Integration with Data Engineering Tools
-- Table of differences and similarities
+Data engineering tools are essential for data preprocessing and pipeline
+management. An ideal AI framework for embedded systems should seamlessly
+integrate with these tools, allowing for efficient data ingestion,
+transformation, and model training.
-## Trends in ML Frameworks
+#### Integration with Model Optimization Tools
-Explanation: Discuss where these ML frameworks are heading in the future. Perhaps consider discussing ML for ML frameworks?
+Model optimization is crucial to ensure that AI models are well-suited
+for embedded deployment. Evaluate whether the framework integrates with
+model optimization tools, such as TensorFlow Lite Converter or ONNX
+Runtime, to facilitate model quantization and size reduction.
-- Framework Developments on the Horizon
-- Anticipated Innovations in the Field
+#### Ease of Use
-## Challenges and Limitations
+The ease of use of an AI framework significantly impacts development
+efficiency. A framework with a user-friendly interface and clear
+documentation reduces the learning curve for developers. Consideration
+should be given to whether the framework supports high-level APIs,
+allowing developers to focus on model design rather than low-level
+implementation details. This factor is especially important for embedded
+systems, which have fewer of the features that typical developers might
+be accustomed to.
-Explanation: None of the frameworks are perfect, so it is important to understand their limitations and challenges.
+#### Community Support
-- Model compatibility and interoperability issues
-- Scalability and performance challenges
-- Addressing the evolving needs of AI developers
+Community support is another essential factor. Frameworks with active
+and engaged communities often have well-maintained codebases, receive
+regular updates, and provide valuable forums for problem-solving. As a
+result, community support also contributes to ease of use because it
+ensures that developers have access to a wealth of resources, including
+tutorials and example projects. Community support provides some
+assurance that the framework will continue to be supported with future
+updates. There are only a handful of frameworks that cater to TinyML
+needs. Of these, TensorFlow Lite Micro is the most popular and has the
+most community support.
+
+## Future Trends in ML Frameworks
+
+### Decomposition
+
+Currently, the ML system stack consists of four abstractions, namely (1)
+computational graphs, (2) tensor programs, (3) libraries and runtimes,
+and (4) hardware primitives.
+
+![](images_ml_frameworks/image8.png){width="2.557292213473316in"
+height="2.9092125984251966in"}
+
+This has led to vertical (i.e. between abstraction levels) and
+horizontal (i.e. library-driven vs. compilation-driven approaches to
+tensor computation) boundaries, which hinder innovation for ML. Future
+work in ML frameworks can look toward breaking these boundaries. In
+December 2021, Apache TVM Unity was proposed, which aimed to facilitate
+interactions between the different abstraction levels (as well as the
+people behind them, such as ML scientists, ML engineers, and hardware
+engineers) and co-optimize decisions in all four abstraction levels.[^1]
+
+### High-Performance Compilers & Libraries
+
+As ML frameworks further develop, high-performance compilers and
+libraries will continue to emerge. Some current examples include
+[[TensorFlow
+XLA]{.underline}](https://www.tensorflow.org/xla/architecture) and
+Nvidia's
+[[CUTLASS]{.underline}](https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/),
+which accelerate linear algebra operations in computational graphs, and
+Nvidia's
+[[TensorRT]{.underline}](https://developer.nvidia.com/tensorrt), which
+accelerates and optimizes inference.
+
+### ML for ML Frameworks
+
+We can also use ML to improve ML frameworks in the future. Some current
+uses of ML for ML frameworks include:
+
+- hyperparameter optimization using techniques such as Bayesian
+  optimization, random search, and grid search
+
+- neural architecture search (NAS) to automatically search for optimal
+  network architectures
+
+- AutoML, which, as described in the Advanced Features section,
+  automates the ML pipeline
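+
+As a rough illustration of the first item, the sketch below compares
+grid search and random search over two hyperparameters. The
+`validation_loss` function is a hypothetical stand-in for training a
+model and measuring its validation loss, and the search ranges are
+assumptions chosen for this example. Bayesian optimization is not
+shown, since it needs a surrogate model that would not fit in a short
+sketch.
+

```python
import itertools
import random


def validation_loss(learning_rate, batch_size):
    """Hypothetical objective; a real one would train and evaluate a model."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (batch_size - 64) ** 2 / 1e3


def grid_search(lr_values, batch_values):
    """Evaluate every combination on a fixed grid and keep the best one."""
    candidates = itertools.product(lr_values, batch_values)
    return min(candidates, key=lambda c: validation_loss(*c))


def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the best one."""
    rng = random.Random(seed)
    candidates = [
        # Learning rate is sampled log-uniformly between 1e-4 and 1e-1.
        (10 ** rng.uniform(-4, -1), rng.choice([16, 32, 64, 128, 256]))
        for _ in range(n_trials)
    ]
    return min(candidates, key=lambda c: validation_loss(*c))


# Both searches get the same budget of 16 evaluations.
grid_best = grid_search([0.0001, 0.001, 0.01, 0.1], [16, 32, 64, 128])
random_best = random_search(n_trials=16)

print("grid search best:  ", grid_best)
print("random search best:", random_best)
```
+
+Grid search can only return points that lie on its grid, while random
+search explores values between grid points for the same budget. Which
+one does better depends on the objective and the search ranges.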
## Conclusion
-- Summary of Key Takeaways
-- Recommendations for Further Learning
\ No newline at end of file
+In summary, selecting the optimal framework requires thoroughly
+evaluating options against criteria like usability, community support,
+performance, hardware compatibility, and model conversion abilities.
+There is no universal best solution, as the right framework depends on
+the specific constraints and use case.
+
+For extremely resource-constrained microcontroller-based platforms,
+TensorFlow Lite Micro currently provides a strong starting point. Its
+comprehensive optimization tooling like quantization mapping and kernel
+optimizations enables high performance on devices like Arm Cortex-M and
+RISC-V processors. The active developer community ensures accessible
+technical support. Seamless integration with TensorFlow for training and
+converting models makes the workflow cohesive.
+
+For platforms with more capable CPUs like Cortex-A, fuller-featured
+frameworks such as TensorFlow Lite expand the possibilities. They
+provide greater flexibility for custom and advanced models beyond the
+core operators in TFLite Micro. However, this comes at the cost of a
+larger memory footprint. These frameworks are well suited to
+automotive systems, drones, and more powerful edge devices that can
+benefit from greater model sophistication.
+
+Frameworks built for specific hardware, such as CMSIS-NN for Cortex-M
+processors, can further maximize performance but sacrifice
+portability. Integrated frameworks from processor vendors tailor the
+stack to their architectures. This can unlock the full potential of
+their chips but also locks developers into the vendor's ecosystem.
+
+Ultimately, choosing the right framework involves finding the best match
+between its capabilities and the requirements of the target platform.
+This requires balancing tradeoffs between performance needs, hardware
+constraints, model complexity, and other factors. Thoroughly assessing
+the intended models and use cases, and evaluating options against key
+metrics, will guide developers toward picking the ideal framework for
+their embedded ML application.
+
+[^1]: Sampson et al. 2021. "Apache TVM Unity: a vision for the ML software & hardware ecosystem in 2022." [[https://tvm.apache.org/2021/12/15/tvm-unity]{.underline}](https://tvm.apache.org/2021/12/15/tvm-unity).
+[^2]: Abadi et al. 2015. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems." [[https://arxiv.org/pdf/1603.04467.pdf]{.underline}](https://arxiv.org/pdf/1603.04467.pdf).
+[^3]: Dean et al. 2012. "Large Scale Distributed Deep Networks." *Proceedings of the 25th International Conference on Neural Information Processing Systems* 1: 1223–1231. [[https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40565.pdf]{.underline}](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40565.pdf).
+[^4]: Li et al. 2014. "Communication Efficient Distributed Machine Learning with the Parameter Server." *Proceedings of the 27th International Conference on Neural Information Processing Systems* 1: 19–27. [[https://proceedings.neurips.cc/paper_files/paper/2014/file/1ff1de774005f8da13f42943881c655f-Paper.pdf]{.underline}](https://proceedings.neurips.cc/paper_files/paper/2014/file/1ff1de774005f8da13f42943881c655f-Paper.pdf).
+[^5]: [[TensorFlow: Large-scale machine learning on heterogeneous systems,
+2015.]{.underline}](https://www.tensorflow.org/datasets/catalog/mnist)
+
+[^6]: [[Patrick McClanahan, Introduction to Operating Systems, 2023]{.underline}](https://eng.libretexts.org/Courses/Delta_College/Introduction_to_Operating_Systems/03%3A_The_Operating_System/3.06%3A_Types_of_Operating_Systems)
diff --git a/images_ml_frameworks/image1.png b/images_ml_frameworks/image1.png
new file mode 100644
index 00000000..ba76db9c
Binary files /dev/null and b/images_ml_frameworks/image1.png differ
diff --git a/images_ml_frameworks/image2.png b/images_ml_frameworks/image2.png
new file mode 100644
index 00000000..a3ca27e6
Binary files /dev/null and b/images_ml_frameworks/image2.png differ
diff --git a/images_ml_frameworks/image3.png b/images_ml_frameworks/image3.png
new file mode 100644
index 00000000..91043427
Binary files /dev/null and b/images_ml_frameworks/image3.png differ
diff --git a/images_ml_frameworks/image4.png b/images_ml_frameworks/image4.png
new file mode 100644
index 00000000..70dfb2b0
Binary files /dev/null and b/images_ml_frameworks/image4.png differ
diff --git a/images_ml_frameworks/image5.png b/images_ml_frameworks/image5.png
new file mode 100644
index 00000000..644fed12
Binary files /dev/null and b/images_ml_frameworks/image5.png differ
diff --git a/images_ml_frameworks/image6.png b/images_ml_frameworks/image6.png
new file mode 100644
index 00000000..39feb8f8
Binary files /dev/null and b/images_ml_frameworks/image6.png differ
diff --git a/images_ml_frameworks/image7.png b/images_ml_frameworks/image7.png
new file mode 100644
index 00000000..b8a50fec
Binary files /dev/null and b/images_ml_frameworks/image7.png differ
diff --git a/images_ml_frameworks/image8.png b/images_ml_frameworks/image8.png
new file mode 100644
index 00000000..e193c05f
Binary files /dev/null and b/images_ml_frameworks/image8.png differ