
TinyML: It’s a Small World After All!

Learning about 'TinyML' changed the way that I see the world!

What’s So Tiny About It?

I think this excerpt from a talk by Dennis Laudick, VP of marketing at Arm, a company that creates TinyML tech, provides a great introduction to the tiny behind TinyML:

Interviewer: “How tiny is tiny?”

Laudick: “Pretty much anywhere where you have a microcontroller. And there are tens of billions of microcontrollers shipped annually. This way you can run machine learning workloads in the device at the source of the information. Everything for awhile has been about making better and better models in order to achieve better and better accuracy, with the gold standard of beating humans. But in order to actually run the models you have to be close to where the data is. Think of an elevator, where you have sensors on the motors, the shafts, the chains, and the pulleys, and you want to know when the device is getting dangerously close to failing. You want to know when to perform maintenance. You want to listen to vibrations and motors and things. Well, the best place to do that is there where the data is, where the device happens to be. So you have a microcontroller controlling something and you’re sampling data, you know, at a very low Hz rate, you can run a small machine learning model that listens for variability in the motor, so you know when you need to shut it down for maintenance before something bad happens.”

Swap elevator for escalator or airplane engine and one definitely starts to appreciate the idea. Actually, after thinking about TinyML in that way, one might start thinking, “what can’t we monitor and/or optimize?” In other words, prepare for the tiny takeover of machine learning!

How the Framework Works: It’s Not Magic, It’s Tiny Magic!

Otherwise known as magical tiny science! Or is it tiny magical science??

*Ahem*

A quick primer :

TinyML, in simple terms, is like teaching an ant to do backflips — it’s all about doing a lot with a little. TinyML usually lives on extremely small, microcontroller-based devices with ultra-low power budgets, often just a few milliwatts. A fun comparison: if a tiny device runs on 10 mW, it draws roughly 1/200th to 1/500th of the power of a standard smartphone in active use (~2–5 W). This might not sound that impressive, since we think so highly of our phones and all that they can do, but that tiny device can perform real-time inference directly on the device. For example, it could run a tiny model for specific sound recognition or gesture detection. It can process input data from its sensors, like an accelerometer or microphone, and make decisions based on the machine learning model’s output, all while consuming only a few milliwatts of power.

How Do They Make It So Tiny?!?

To run ML at this size, these devices rely on specialized processors and architectures that focus on doing more with less. How? Through some neat algorithmic optimizations. Yep, it’s optimization all the way down (to the low level)!

Optimization as a term also includes processes like compression, pruning (think of it as removing unnecessary parts of the network), and quantization (making your calculations take up fewer bits). And there’s always transfer learning — the ability to build on existing tiny base models and train them for new tasks. Plus, on-device learning means these devices get smarter while staying where they are, no need to chat with the cloud all the time.

To maximize efficiency, TinyML employs advanced methods like neural architecture search (NAS) and knowledge distillation. NAS automates the design of neural network architectures to find models that are both accurate and lightweight. Knowledge distillation, on the other hand, involves training a compact model (student) to replicate the behavior of a larger, more complex model (teacher), thereby retaining performance while reducing size.

But these are only some of the techniques used, and even the section below will not cover them all! This is meant to be a tiny dive into the world of TinyML. That being said, let’s go a tiny bit deeper.

Here are some further, concise explanations of hot TinyML terms:

Compression: reduces the overall model size.

Methods of compression include: pruning, quantization, weight sharing, low-rank factorization, sparse representations, knowledge distillation, parameter sharing, and network architecture optimization.

On some compression techniques and more:

https://medium.com/marionete/tinyml-models-whats-happening-behind-the-scenes-5e61d1555be9

Pruning: eliminates redundant or non-contributory neurons.

Pruning is an important method in machine learning that helps make big models more efficient. Imagine you have a large, detailed model that works really well. Pruning is like trimming down this model to make it smaller, but still trying to keep it working just as well. It’s like cutting off the branches of a tree that aren’t needed, so the tree stays healthy but slimmer overall and it’s easier to direct its growth.

Here’s how pruning works in simple terms:

  1. Identifying Unimportant Parts: First, you look at the model and find parts that aren’t doing much. These are like tiny gears in a machine that aren’t really contributing to the overall function. For example, if a part of the model has a value like 0.00001, it might not be very important, so it can be set to zero.

  2. Pruning and Retraining Cycle: After removing these small parts, the model can be retrained with the data. This is like giving the model a chance to adjust to the changes made. By doing this, the parts of the model that are left might work a bit harder to make up for what was removed. One can keep doing this — pruning a bit, then retraining — while making sure two things happen: the model still fits within the space allocated (like fitting a big book into a smaller bookshelf), and it still performs well (like making sure the book still tells the same story).

  3. Effectiveness in Big Models: In models that have millions of parts (parameters), this pruning method is really useful. It finds and gets rid of a small percentage of parts that aren’t needed. One can think of this process as gradually increasing the amount of pruning from the start to the end, like slowly turning up a dial from 0 to 100. In the end, one might find that they’ve removed up to 90% of the unnecessary parts, making the model much more manageable and efficient.
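To make the “dial turning” concrete, here is a minimal, hypothetical NumPy sketch of magnitude pruning on a single weight matrix. It is not how any particular TinyML framework implements pruning, just an illustration of zeroing out the smallest weights and gradually raising the sparsity between (re)training rounds.

import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights (illustrative only)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)            # how many weights to remove
    if k == 0:
        return weights
    threshold = np.partition(flat, k)[k]     # k-th smallest magnitude
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Gradually "turn up the dial": prune a bit, retrain, prune a bit more...
w = np.random.randn(256, 128)
for sparsity in (0.2, 0.5, 0.7, 0.9):
    w = magnitude_prune(w, sparsity)
    # ...retrain the model here so the remaining weights can compensate...
    print(f"target sparsity {sparsity:.0%} -> zeros: {np.mean(w == 0.0):.0%}")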

Guide on pruning:

https://towardsdatascience.com/model-compression-via-pruning-ac9b730a7c7b

Quantization: a technique that simplifies how numbers are represented by using fewer bits than standard floating-point precision (typically 32-bit or 64-bit). When employing quantization one often uses formats like 16-bit or 8-bit, which require less memory. For instance, a 16-bit number needs four times less memory than a 64-bit number, and an 8-bit number needs eight times less.

This process involves mapping a number from its original range to a new, smaller range. Quantization includes two main steps:

  1. Quantization Process: This step converts a number into its simpler, quantized form.

  2. Dequantization Process: This step approximates the original number from its quantized form.

While quantization reduces the precision of the numbers and limits the range of calculations, the benefits often outweigh this drawback. For example, training results from quantized neural networks show that the loss in precision is usually small compared to the significant reduction in model size. A notable example of a quantized model is I-BERT, a version of the BERT model that uses only integer arithmetic. This model can perform complex operations more efficiently and can be up to four times faster using 8-bit integers compared to the standard 32-bit floating-point format.
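As a toy illustration of the quantization and dequantization steps above, here is a hypothetical NumPy sketch of a simple affine (scale and zero-point) mapping from 32-bit floats to 8-bit integers. Real toolchains are more sophisticated, but the idea is the same.

import numpy as np

def quantize_int8(x):
    """Map float values onto int8 with a simple affine (scale / zero-point) scheme."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximate the original floats from their quantized form."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize_int8(weights)
approx = dequantize_int8(q, scale, zp)
print("max absolute error:", np.abs(weights - approx).max())
print("memory:", weights.nbytes, "bytes as float32 ->", q.nbytes, "bytes as int8")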

I-BERT: Integer-only BERT Quantization

https://arxiv.org/abs/2101.01321

A White Paper on Neural Network Quantization https://arxiv.org/pdf/2106.08295.pdf

Weight Sharing: using the same weights for multiple connections in the neural network. By sharing weights across different parts of the network, the overall number of unique weights is reduced, leading to a smaller model size.

Low-Rank Factorization : reduces the model size by representing large matrices with smaller ones, which can significantly decrease the number of parameters.

Sparse Representations: modifying the network to have more zeros in the weight matrices. Unlike pruning, which removes connections entirely, sparse representations maintain the network structure but with many weights set to zero, which can be efficiently stored and computed.

Parameter Sharing: especially effective in models processing sequential data, where the same weights can be used at each time step.

Network Architecture Optimization: using techniques like Neural Architecture Search (NAS) to find optimal architectures that maintain performance with fewer parameters.

Neural Architecture Search (NAS): the automated design of neural network architectures.

Data Serialization: the transformation of complex data structures into streamlined, compact formats.

Transfer Learning: allows for the adaptation of pre-trained models to new tasks, significantly reducing the computational resources required for training. For example, one could use a tiny model that another has created as a base model to be trained to do another task. Models that do similar tasks and/or deal with similar data can transfer the most learning to a new task.

Knowledge Distillation: leverages the strength of a larger, more accurate model to enhance the training quality for a smaller model. Operates on the principle that a smaller model can achieve accurate results if trained with high-quality data. This approach is often described using a student-teacher model analogy.

Here’s how it works:

  1. Training the Teacher: First, a larger and more complex machine learning model, referred to as the “teacher,” is trained on a given dataset. This teacher model is capable of generating highly accurate predictions.

  2. Generating New Data: The teacher model’s predictions are then used to create a new dataset. This dataset combines the original data with the additional insights gained from the teacher model. Essentially, it enriches the original data with more detailed information.

  3. Training the Student: The new, enriched dataset is then used to train a smaller, less complex model, known as the “student.” The idea is that the high-quality data, now containing more detailed insights, can help improve the performance of this smaller model.
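For a flavor of how step 3 is often implemented in practice, here is a hedged TensorFlow/Keras sketch of a distillation loss that blends the hard labels with the teacher’s softened predictions. The temperature and alpha values are illustrative, not recommendations.

import tensorflow as tf
from tensorflow import keras

def distillation_loss(labels, teacher_logits, student_logits,
                      temperature=3.0, alpha=0.1):
    """Combine a hard-label loss with a soft-label loss from the teacher."""
    # Soft targets: the teacher's predictions smoothed by the temperature
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    soft_loss = keras.losses.kl_divergence(soft_teacher, soft_student)
    # Hard targets: the original one-hot labels
    hard_loss = keras.losses.categorical_crossentropy(
        labels, tf.nn.softmax(student_logits))
    # The temperature**2 factor keeps the soft-loss gradients on a comparable scale
    return alpha * hard_loss + (1.0 - alpha) * (temperature ** 2) * soft_loss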

On Knowledge Distillation:

https://medium.com/@zone24x7_inc/shrinking-the-giants-how-knowledge-distillation-is-changing-the-landscape-of-deep-learning-models-83dffde577ec

Decision Trees: a viable model choice in TinyML due to their inherent simplicity and interpretability. They operate by making a series of binary decisions, each based on specific data features, making them particularly suitable for scenarios with limited computational resources. It’s like plinko, but rigged!

On-Device Learning in TinyML: On-device learning in TinyML signifies a shift towards localized data processing, enabling devices to make decisions without reliance on continuous cloud connectivity. When TinyML devices do connect to the cloud, they send only essential data. This approach not only conserves bandwidth but also enhances privacy and reduces latency, which is crucial for real-time applications.

Edge Computing in TinyML: Edge computing plays a crucial role in TinyML. By processing data locally, we reduce latency and enhance privacy. This is crucial for applications like real-time language translation or emergency response systems in smart cities.

Examples of TinyTech

Imagine your morning alarm clock not just waking you up, but easing you into a wakeup and telling you how well you slept, thanks to a TinyML device monitoring your sleep patterns. It could also report all kinds of information, like the news, your schedule, or any important reminders, all according to what you have it set to report.

More examples?

In fitness, your running shoes could give you feedback on your jogging technique.

In healthcare, TinyML is like a mini-doctor you wear on your wrist, keeping an eye on your heart rate, oxygen levels, and maybe even predicting if you’re about to catch a cold.

In agriculture, it’s like having a tiny farmer in your field, monitoring soil moisture and crop health. That tiny farmer has a tiny farmer’s almanac and it knows things.

Some case studies that have been performed:

TinyML for environmental data collection, sign language detection, handwriting recognition, medical face mask detection, gesture recognition, speech recognition, and autonomous tiny vehicles.

These aren’t scenes from a sci-fi movie; they’re real possibilities with TinyML!

“Aren’t these sci-fi things expensive??”

Actually, more optimization means more scalability, and more scalability means leveraging economies of scale!! In other words, these things will be so tiny, in demand, and therefore numerous that we will be able to make them extremely cheaply after building the means of production. This will tap economies of scale at a magnitude that could summon the ghost of Andrew Carnegie to praise us all.

Now, let’s get our hands tiny! With TensorFlow Lite, Arduino, and even Raspberry Pi (not the tiniest) even hobbyists can join in the fun, turning everyday objects into smart gadgets. You could create a gadget that tells you when your pet needs food, or a system that understands sign language. You could create a bot that determines what type of dance move you are performing and starts playing the relevant music to your jig. The possibilities are endless!!

Much of this section was influenced by the paper: “TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge Devices for AI Applications”, Section 4.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9227753/

Why I’m Excited: The Takeover of the Tinytech

Uhhh, because tiny is such a cute way to explain microtech (read the section title again: tinytech.. hilarious). But really, what excites me about TinyML is its potential to be everywhere, like invisible helpers making our lives easier without us even noticing. And they are built to be extremely efficient and focused on specific tasks, which is a huge boon to desirability in a world where we are all starting to think about ‘our data’ and privacy concerns.

Currently, machine learning seems useful only when applied to very large projects. Soon, a great deal of TinyML software and hardware will be deployed and become part of our everyday lives.

Outro: From Tiny Steps to Tiny Leaps

In the near-term, expect to see more gadgets getting smarter and more efficient. Imagine a world where many little technologies seem to have a bit of ‘intelligence’, communicating and learning from each other actively in the environment. At first, this will be very task-focused and implemented sparsely. But in the moonshot future we are looking at swarms of ‘smart’ tech.

An example: when AI automated vehicles are swarming the roads, they will all be communicating their positions and the data of their environment with each other allowing the AI to know the movements of all objects around every road. This is the moonshot future that makes some automated vehicle engineers so sure that this type of future would be the safest one for vehicle passengers and pedestrians. Currently, we simply don’t have the data bandwidth or the devices deployed in order to monitor streets in such a way, though this is one of the purposes of 5G deployment.

One of my favorite concepts has always been nanobots. I feel like nanobots are the real TinyML moonshot. This is where little bots with nano-tiny power sources perform cellular level repair. Is this a real thing we can do? I don’t know but it sounds very cool.

That brings us to the end of this tiny blog. As we step into this paradigm of expanding TinyML and BigML, let’s embrace that it’s a small world, and it’s about to get a lot smarter!

Sources

TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge Devices for AI Applications

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9227753/

On Compression:

https://medium.com/marionete/tinyml-models-whats-happening-behind-the-scenes-5e61d1555be9

Tiny guide on pruning:

https://towardsdatascience.com/model-compression-via-pruning-ac9b730a7c7b

I-BERT: Integer-only BERT Quantization:

https://arxiv.org/abs/2101.01321

A White Paper on Neural Network Quantization: https://arxiv.org/pdf/2106.08295.pdf

On Knowledge Distillation:

https://medium.com/@zone24x7_inc/shrinking-the-giants-how-knowledge-distillation-is-changing-the-landscape-of-deep-learning-models-83dffde577ec


Optimizing Alzheimer’s Disease Classification using Bayesian Optimization and Transfer Learning

An extremely fun project with great results classifying Alzheimer’s MRI images using bayesian optimization and transfer learning.

Introduction

Alzheimer’s disease is a progressive brain disorder that affects memory, thinking, and behavior. Early diagnosis plays a critical role in managing the disease. In this article we will discuss the process of using bayesian optimization techniques to fine-tune a convolutional neural network (CNN) aimed at classifying Alzheimer’s disease based on MRI scans. Specifically, we’ll focus on classifying Alzheimer’s disease stages. Transfer learning and from-scratch methods are employed. We’ll discuss the thought process behind each decision, the statistical concepts involved, and the conclusions drawn from the optimization process.

Process and Reasoning: Why This Project Structure?

When I was brainstorming how to approach this project I decided that the main goals were personal learning and training efficiency, as I basically constructed this project in 5 days. Both of these objectives influenced the choices made in dataset selection, code structuring, and overall project architecture. Below are the key considerations:

Dataset: Why MRI Images?

  • Clinical Relevance: MRI scans are one of the most reliable forms of diagnostic tools for Alzheimer’s disease, making them a clinically relevant choice.

  • Learning Objective: Given my keen interest in computational neuroscience, working with medical images like MRIs offers an invaluable learning experience. It provides a practical application of machine learning techniques in a neuroscience context.

Code Structure: Modular and Reusable

The codebase for this project was designed with reusability and scalability in mind. Here’s how:

  • Function-Based Design: The code is structured into separate functions for data loading, model creation, and optimization. This modularity makes it easier to understand, debug, and scale.

  • Dynamic Hyperparameters: Hyperparameters like batch size are made dynamic, allowing the bayesian optimization process to have full control, thereby improving learning efficiency. This design choice enables the model to adapt better to the data, leading to more robust performance.

  • Custom Metrics: Implementing custom metrics like F1-Score serves both as a learning exercise and as a way to fine-tune the model’s performance. The F1-Score is particularly useful when the class distribution is imbalanced, which is often the case in medical datasets.

Hyperparameter Optimization: Bayesian Over Grid Search

Hyperparameter tuning is often the most time-consuming part of a machine learning project. Traditional methods like grid search or random search are computationally expensive and less efficient. That’s where bayesian optimization comes in.

Bayesian optimization is generally more efficient than other methods like grid search, especially for high-dimensional spaces. It uses a probabilistic model to predict the objective function and intelligently selects the next set of hyperparameters to evaluate, thereby reducing the number of evaluations needed.

Training Efficiency: Why Transfer Learning?

Transfer learning is the practice of taking a pre-trained model and fine-tuning it for a different but related task. Here’s why it’s beneficial:

  • Transfer Learning: Using a pre-trained model allows us to leverage prior learning, reducing training time and computational resources. For instance, a model trained on general brain images can be fine-tuned to classify Alzheimer’s stages, saving both time and computational power.

  • Note: in this project we will only utilize models that were pre-trained on ImageNet, so the knowledge they can transfer to this task is limited, but they are very easy to implement using Keras.

Primer on Concepts

What is a Gaussian Process?

A Gaussian Process (GP) is a powerful tool for probabilistic modeling. In the context of bayesian optimization, GPs are used to model the unknown objective function. A GP assumes that any finite set of points from the function are jointly gaussian distributed. This property allows us to not just make point estimates of the function but also quantify our uncertainty about those estimates.

In simpler terms, it’s a way to predict a function based on prior observations. GPs are the backbone of bayesian optimization, providing a probabilistic model of the objective function. Gaussian processes are widely used in machine learning to make predictions when the underlying distribution is unknown.
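Here is a small, hypothetical scikit-learn sketch (assuming scikit-learn is available alongside the Keras stack) of fitting a gaussian process to a handful of observations and getting back both predictions and uncertainty estimates, which is exactly what bayesian optimization leans on.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Pretend this is an "expensive" function we can only afford to sample a few times
def expensive_function(x):
    return np.sin(3 * x) + 0.5 * x

X_train = np.array([[0.2], [1.0], [2.5], [4.0]])     # a handful of observations
y_train = expensive_function(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)

# The GP returns a prediction AND an uncertainty estimate at unseen points
X_new = np.linspace(0.0, 5.0, 6).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x = {x:.1f}  predicted = {m:+.2f}  uncertainty = {s:.2f}")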

What is Bayesian Optimization?

Bayesian optimization (BO) aims to find the hyperparameter combination that will give the best score on the objective function, but without having to test every single possible combination, which can be computationally expensive. It uses probabilistic models, like gaussian processes, to predict the performance of untested hyperparameters and intelligently decides the next set of hyperparameters to test. This makes the optimization process more efficient. It’s particularly useful for optimizing functions that are expensive to evaluate, like high-dimensional hyperparameter tuning for deep learning models.
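To make this concrete, here is a hedged sketch of how such a search might be wired up with scikit-optimize’s gp_minimize (an assumption on my part: the actual project code may use a different library). The objective function, the build_model helper it references, and the search ranges below are placeholders.

from skopt import gp_minimize
from skopt.space import Real, Integer

# Placeholder objective so the sketch runs; a real one would train a model with
# these hyperparameters and return e.g. the negative validation F1 score.
def objective(params):
    learning_rate, dense_units, dropout_rate = params
    # model = build_model(learning_rate, dense_units, dropout_rate)
    # history = model.fit(...)
    # return -max(history.history['val_f1_score'])
    return abs(learning_rate - 1e-3) + dropout_rate  # stand-in value

search_space = [
    Real(1e-5, 1e-2, prior='log-uniform', name='learning_rate'),
    Integer(64, 512, name='dense_units'),
    Real(0.0, 0.5, name='dropout_rate'),
]

result = gp_minimize(objective,        # the expensive function to optimize
                     search_space,
                     acq_func='EI',    # Expected Improvement (more on this below)
                     n_calls=30,       # total hyperparameter sets to evaluate
                     random_state=42)
print("best score:", result.fun, "best hyperparameters:", result.x)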

The Model(s)

MobileNetV2

MobileNetV2 is designed for mobile and embedded vision applications. Given its efficiency and lower computational requirements, MobileNetV2 is an excellent choice for quick experiments and prototyping. It’s also a good fit for real-world applications where computational resources may be limited. Also, I had used it for a previous transfer learning project and wanted to test the accuracy using it versus much more robust off-the-shelf CNNs.

VGG16

VGG16 has a simple architecture with a series of convolutional layers followed by max-pooling and fully connected layers, making it easy to understand and modify (though those fully connected layers actually give it more parameters than models like ResNet50). This was used as a step up from MobileNetV2.

DenseNet169

The final model chosen: DenseNet is another CNN architecture. Specifically, DenseNet169 is a variant with 169 layers. Unlike traditional convolutional networks, where each layer obtains new features and passes them forward, DenseNet layers also receive features from all preceding layers. The architecture employs “dense blocks” where each layer receives the feature maps from all preceding layers and passes on its own feature maps to all subsequent layers. This results in fewer parameters, reduced overfitting, and better gradient flow. That also makes it useful for this project which is aiming for low computational overhead.

From Scratch

Some exploration was done with from scratch models, but this idea was scrapped due to time constraints.

Hyperparameters and Why They Were Chosen

  1. Learning Rate: This controls how quickly or slowly a model learns. Too high a learning rate can cause the model to converge too quickly and possibly overshoot the minimum cost. Too low a learning rate will make the model slow to converge.

  2. Dense Units: This refers to the number of units in the dense layers of the network. More units allow for more complex representations but can also lead to overfitting.

  3. Dropout Rate: Dropout is a regularization technique where a fraction of the input units are randomly set to zero during training. This prevents overfitting.

  4. L2 Weight: This is the weight for L2 regularization in the loss function. Regularization helps prevent overfitting by adding a penalty for large weights.

  5. Batch Size: This is the number of samples that will be used to update the model weights in one iteration. Smaller batch sizes often provide a regularizing effect and lower generalization error.

These hyperparameters were chosen because they have a significant impact on the model’s performance and are commonly tuned in deep learning projects.
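As a rough illustration (not the project’s actual code), a model-builder that wires these hyperparameters together might look like the sketch below. The input size, the four output classes, and the DenseNet169 base are assumptions based on the descriptions in this post; batch size, the fifth hyperparameter, is handed to fit() rather than the builder.

from tensorflow import keras as K

def build_model(learning_rate, dense_units, dropout_rate, l2_weight):
    """Hypothetical builder showing where each tuned hyperparameter plugs in."""
    base = K.applications.DenseNet169(include_top=False, weights='imagenet',
                                      input_shape=(176, 176, 3), pooling='avg')
    base.trainable = False  # transfer learning: keep the pre-trained weights frozen

    model = K.Sequential([
        base,
        K.layers.Dense(dense_units, activation='relu',
                       kernel_regularizer=K.regularizers.l2(l2_weight)),
        K.layers.Dropout(dropout_rate),
        K.layers.Dense(4, activation='softmax'),  # assumed: four Alzheimer's stages
    ])
    model.compile(optimizer=K.optimizers.Adam(learning_rate=learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Batch size is passed to fit() instead:
# model.fit(x_train, y_train, batch_size=batch_size, ...)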

Satisficing Metrics

When I started this project my go-to metric for model evaluation was the good ol’ validation loss. It’s a classic choice, often used as a quick and dirty way to gauge how well a model is performing. However, as I continued my research into how best to deal with a dataset of MRI images in the context of satisficing metrics, I realized that this project had its own set of unique challenges — chief among them being class imbalance.

The Problem with Validation Loss in Imbalanced Classification

Validation loss is a great metric when you’re dealing with a balanced dataset. But here’s the kicker: Alzheimer’s disease stages are not uniformly distributed in the real world, and neither were they in my dataset. Some stages of the disease are more common than others, leading to an imbalanced dataset. Using validation loss in such a scenario can be misleading. The model might perform well on the majority class but terribly on the minority class, and yet show a deceptively low validation loss. This is a classic trap in machine learning, and it’s one that I fell into.

The Aha Moment: Enter F1 Score

As I was falling through this rabbit hole, I stumbled upon the concept of ‘satisficing metrics’ which are metrics that satisfy the minimum criteria for adequacy. This led me to the F1 Score, a metric that balances both precision and recall. In the context of imbalanced classification, the F1 Score shines because it gives you a more holistic view of how well your model is performing across all classes.

The F1 Score is calculated as follows:

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

Here, Precision is the number of True Positives divided by the number of True Positives and False Positives.

Recall, or Sensitivity, is the number of True Positives divided by the number of True Positives and False Negatives.

The F1 Score harmoniously combines these two metrics into a single number that ranges from 0 to 1, with 1 being the best possible F1 Score.
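The custom F1 metric mentioned earlier can be sketched in Keras along these lines (a macro-averaged, batch-level approximation; the project’s actual implementation may differ):

import tensorflow as tf

def f1_score(y_true, y_pred):
    """Macro-averaged F1 computed per batch from one-hot labels and predicted probabilities."""
    y_pred = tf.round(y_pred)                      # threshold probabilities at 0.5
    tp = tf.reduce_sum(y_true * y_pred, axis=0)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=0)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=0)
    epsilon = tf.keras.backend.epsilon()           # avoids division by zero
    precision = tp / (tp + fp + epsilon)
    recall = tp / (tp + fn + epsilon)
    f1 = 2.0 * precision * recall / (precision + recall + epsilon)
    return tf.reduce_mean(f1)                      # average over the classes

# Used as a metric when compiling, e.g.:
# model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[f1_score])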

Why F1 Score Over Other Metrics?

You might be wondering, “Why not use other metrics like ROC-AUC?” While ROC-AUC is a strong contender, it’s not as interpretable as the F1 score. Given that one of my goals was to make this project as accessible as possible, even to those without a deep technical background, the F1 score was the clear winner.

The Takeaway

Switching from validation loss to the F1 score was a pivotal moment in this project. It not only provided a more accurate measure of the model’s performance but also deepened my understanding of the nuances involved in choosing the right metric. The F1 score became my satisficing metric, ensuring that the model met the minimum criteria for adequacy while dealing with an imbalanced dataset.

So, the next time you find yourself knee-deep in a classification problem with an imbalanced dataset, remember: the choice of metric can make or break your model. Choose wisely!

Acquisition Functions

What is an Acquisition Function?

In bayesian optimization, the acquisition function is a heuristic that provides a measure of the utility of evaluating the objective function at a given point.

What is an objective function anyway? What is this gobbledygook speak? Well, it’s any mathematical formula that you want to optimize — either maximize or minimize — while solving a problem. In machine learning, for example, the objective function could measure how well a model predicts data; you’d aim to find the model settings that make this score as good as possible. In the context of bayesian optimization, the objective function represents the performance of a machine learning model across different hyperparameter settings. The “score” given by this function could be something like accuracy or F1 score for a classification task, or mean squared error for a regression task.

The acquisition function guides the selection of the next point to evaluate in the hyperparameter space. It balances the trade-off between exploration (searching unknown or less certain regions) and exploitation (zooming in on known good regions). The point that maximizes the acquisition function is chosen as the next point to evaluate in the objective function.

Types of Acquisition Functions

Here’s a quick rundown of some commonly used acquisition functions:

  1. Probability of Improvement (PI): It picks the point with the highest probability of improving over the current best-known value, regardless of how large that improvement might be.

  2. Expected Improvement (EI): A balanced choice that considers both exploration and exploitation. It measures the expected improvement over the current best-known value.

  3. Upper Confidence Bound (UCB): It takes into account both the mean and variance of the predicted values, making it more explorative.

  4. Thompson Sampling: It samples from the posterior distribution and picks the point that maximizes this sample, offering a more randomized approach.

  5. Knowledge Gradient (KG): It considers the value of information collected from future evaluations, making it computationally more expensive but potentially more effective in some cases.

Choosing Expected Improvement (EI)

Among these, expected improvement (EI) is often the go-to choice for many practitioners, and for good reasons. EI provides a good balance between exploration and exploitation. Mathematically, it’s defined as:

EI(x) = E[max(f(x) − f(x+), 0)]

Here, f(x) is the objective function, and f(x+) is the value of the best sample so far. The expectation is computed with respect to the posterior distribution over the objective function.
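Under a gaussian process model, this expectation has a closed form, which is part of why EI is so cheap to use. Here is a small illustrative sketch (for maximization, with a hypothetical exploration parameter xi):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for maximization, given the GP's mean and std at candidate points."""
    sigma = np.maximum(sigma, 1e-9)       # guard against division by zero
    improvement = mu - f_best - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# Example: the GP predicts these means and uncertainties at three candidate points
mu = np.array([0.80, 0.85, 0.82])
sigma = np.array([0.01, 0.05, 0.20])
print(expected_improvement(mu, sigma, f_best=0.84))
# The third point scores highest: its mean is lower, but its large uncertainty
# makes it worth exploring -- the exploration/exploitation balance in action.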

Why expected improvement?

  1. Balance: EI naturally balances exploration and exploitation, making it a versatile choice for a wide range of problems.

  2. Analytical Tractability: The EI can be computed analytically, which makes it computationally efficient. And we are focused on computational efficiency here.

  3. Intuitiveness: It’s easy to interpret. A higher EI value simply means that there’s either a high likelihood of improvement, a large magnitude of potential improvement, or both.

  4. Parameter-Free: Unlike UCB, which has a tunable parameter to balance exploration and exploitation, EI is parameter-free, making it easier to use out-of-the-box.

The Takeaway

There’s no shame in the EI game!

Other project design choices

  • Docker: For containerization and easy deployment.

  • Colab: For leveraging free GPU resources and accessibility.

Training and Results

Everything works and the F1 score is increasing with bayesian optimization and final model training, but I am still planning and executing the final training sessions, which will use larger epoch and patience counts.

Conclusions

Bayesian optimization proved to be a powerful tool for hyperparameter tuning with high computational efficiency. It’s important to note that the pre-trained models we used have limitations, such as being initially trained on ImageNet rather than medical images. In terms of performance metrics, the F1-Score was our chosen evaluation method, particularly because it is effective for imbalanced datasets. We chose the Expected Improvement (EI) as our acquisition function due to its balance between exploration and exploitation, making it highly effective for navigating the hyperparameter space in fewer iterations. On the coding side, the emphasis was on writing code that is not only functional but also modular, maintainable, and portable.

Looking forward, there are several avenues for building on this project. Current ideas include: using bayesian optimization in deeper ways with a broader hyperparameter space, increasing visualization, finding and utilizing a pre-trained model on MRI data, and more.

Final Thoughts

This project was not just an academic exercise but an exploration into the real-world applications of machine learning in healthcare. It was inspiring to actually work with a dataset from the web and train a model locally, even if it was a very small project and only humble beginnings. This process opened my eyes to the possibilities of using machine learning in medical diagnostics.


Classifying the CIFAR-10 Dataset with Transfer Learning (and Tensorflow Keras)

A project from the beginning of my machine learning journey that I am very proud of, as I experimented and learned a lot.

“Research is what I’m doing when I don’t know what I’m doing.” — Wernher von Braun

Abstract

In this blog post, I will share my journey of developing a Python script that utilizes transfer learning to train a Convolutional Neural Network (CNN) to classify the CIFAR-10 dataset. The goal was to achieve a validation accuracy of 87% or higher, using only TensorFlow’s Keras API and one of the pre-trained models from Keras Applications. The journey was filled with trials, errors, laughter, and a lot of learning. It involved setting up a Docker environment, experimenting with different models, data augmentation, fine-tuning, and tweaking hyperparameters of course. Although I am admittedly a noob, the results surprised me (the first training went very well for such a simple model)!

Introduction

The goal was to train a model that could accurately classify images from the CIFAR-10 dataset. However, the challenge was not just to build a model, but to build it in a specific way: using only TensorFlow’s Keras API, using one of the pre-trained models from Keras Applications, and saving the trained model in the current working directory as ‘cifar10.h5’. Some additional technical rules were: the model should be compiled, should not run when the file is imported, and should have a validation accuracy of 87% or higher.

Understanding the CIFAR-10 Dataset

The CIFAR-10 dataset is a well-known dataset in the machine learning community. It is often used as a benchmark for image classification algorithms. The dataset consists of 60,000 32x32 color images, divided into 10 classes, with 6,000 images per class. The classes are mutually exclusive and include various types of objects and animals, such as airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

The images in the CIFAR-10 dataset are low-resolution (32x32 pixels), which makes the classification task challenging. The images are also diverse and varied, which means that a good model needs to be able to recognize a wide range of features and patterns.

Materials and Methods

Model Selection Process

The first step in the model selection process was to decide which pre-trained model to choose from Keras Applications. This was a great tool to have available, as it allowed me to leverage the features that the model has already learned from a large image dataset like ImageNet. This approach, known as transfer learning, is particularly effective when the new task is similar to the task that the pre-trained model was trained on.

I started with the idea of training several models of different ‘compute loads’. That is, I wanted a very simple network that would train very quickly, a ‘mid-tier’ net that implements all the tools but with a shallow network and low epoch count, and a very robust, deep net with a high epoch count; all nets had to at least pass the mandatory 87% accuracy mark. After finalizing the plan, I chose to start by building the simple network, which is where the decision to use the pre-trained MobileNetV2 model from Keras Applications as the base model came from. MobileNetV2 is the smallest model by megabyte count, with the fewest parameters of all the models in Keras Applications. Thus, ‘lilTinyNet’ was born. The ‘mid-tier’ was to be one of the EfficientNet models, and the ‘large-tier’ was supposed to be ResNet50, VGG16, or InceptionResNetV2.

Things rarely go to plan, of course. What ended up happening is that the first run of ‘lilTinyNet’ went so well that, after trying 5 other pre-trained models (all with accuracy over 87%, but none near ‘lilTinyNet’), I decided instead to use as many tools as I could to expand ‘lilTinyNet’ while changing things like the optimizer and other hyperparameters (epoch count, batch size, learning rate schedule). Therefore, all models and code covered here are the result of transfer learning from MobileNetV2.

Setting up the Environment (skip if you don’t want to read my gpu rant xD)

The first step in this journey was setting up the environment. I started with Google Colab, but soon realized that I wanted to unleash the full power of my own personal GPU for some deep learning fun. Besides, who doesn’t love a good experiment? So I decided to ditch the cloud and go local. But don’t worry, if you’re still wanting to use that sweet, free Colab GPU, I have a small guide on how to use it for training the model below.

I was also curious about how my RTX 2060 super would compare to the free GPU that Google Colab offers. I mean, it’s free, but is it fast? And how much would it cost me in terms of wear-and-tear on my precious GPU? I’ve heard horror stories of GPUs dying after intensive training sessions, but I’ve also put my GPU through some serious gaming and overclocking tests, and it always came out alive and kicking. So I figured it could handle some deep learning as well, as long as I kept an eye on the load and temperature.

But before I risked frying my GPU, I did what any sensible person would do: I asked Reddit. And lo and behold, the wise and anonymous redditors agreed with me that it should be fine, as long as I was careful and monitored the situation. So, seeing as they agreed with me, I took their word as gospel and immediately put my GPU back to work! (Disclaimer: don’t sue me if your GPU explodes or something. But please inform me about it, because that’s hilarious... I didn’t know they can actually explode. Do your own research and/or be prepared for the consequences!)

A tip: I have nice fans on my GPU, so I cranked them up to 100% and lowered the power consumption cap to 85–90% in my graphics card management software. This kept my GPU at a cool maximum of 72 degrees during training! Note: most of my training sessions were very short (average 15m) or cut off short.

Also, to be honest, I had never used Docker before this project and saw it as an opportunity to learn something new. Plus, it was actually necessary, because WSL2 Ubuntu doesn’t play nice with my GPU without Docker. That being said, I’m a total Docker noob!

Quick Docker set-up (For those interested in training the model using their own GPU in WSL2 Ubuntu)

Now, let’s dive into setting up Docker. Docker is a platform that allows you to automate the deployment, scaling, and management of applications using containerization. It’s a great tool to ensure that your application runs the same, regardless of the environment.

To use Docker, you first need to install it. (WINDOWS) You can download Docker Desktop for Windows from the official Docker website. After installation, you’ll need to enable the WSL 2 backend. This can be done from the Docker Desktop settings.

Once Docker is set up, you can run a Docker container with GPU support using the following command in your terminal (I use inside VSCode). This will (put simply) open the Docker container within the directory you are located in the terminal:

docker run -it --gpus all -v $(pwd):/scripts tensorflow/tensorflow:latest-gpu bash

This command does a few things:

  • docker run starts a new Docker container.

  • -it ensures that you're running Docker interactively (i.e., it will provide a terminal interface).

  • --gpus all allows the Docker container to access your GPU.

  • -v $(pwd):/scripts mounts the current directory ($(pwd)) to the '/scripts' directory in the Docker container. This means that all the files in your current directory will be accessible in the '/scripts' directory in the Docker container.

  • tensorflow/tensorflow:latest-gpu is the Docker image that the container is based on. This image comes with TensorFlow pre-installed and is configured to use a GPU.

  • bash starts a bash shell inside the Docker container.

This command is quite handy as it not only runs a Docker container with GPU support but also mounts the current directory to the ‘/scripts’ directory in the container. This means that any files saved in the Docker container will actually be saved in the directory you were in when you started the Docker session!

Also, look up ‘Dockerfile’ when you get a chance.

Running the Code in Google Colab

Google Colab is a free cloud service that provides a coding environment for AI researchers. It comes with GPU support and is a great tool for anyone who wants to experiment with machine learning and deep learning without setting up their own environment.

Look in the file directory on the left for a Google Drive logo. Click it to mount your Drive so your model will save to your Google account’s Drive.

To run your code in Google Colab, follow these steps:

  1. Go to the Google Colab website and sign in with your Google account.

  2. Click on ‘File’ -> ‘New notebook’ to create a new notebook.

  3. You can now write your code in the cells. You can add new cells by clicking on ‘+ Code’ or ‘+ Text’ for code and text cells respectively.

  4. To run a cell, click on the play button on the left side of the cell or press ‘Shift+Enter’.

  5. To use a GPU, click on ‘Runtime’ -> ‘Change runtime type’, select ‘GPU’ under ‘Hardware accelerator’, and then click on ‘Save’.

Remember to save your work regularly. Google Colab notebooks are saved to your Google Drive.

In conclusion, whether you choose to use Docker or Google Colab largely depends on your specific needs and resources. Docker allows you to utilize your own GPU and provides a consistent environment, while Google Colab is a hassle-free option that comes with a free GPU.

Preprocessing the Data

The next step was to preprocess the data for the model. The function preprocess_data(X, Y) was written to preprocess the CIFAR-10 data and labels. The function uses the preprocess_input function from the Keras Applications to normalize the pixel values of the images and the to_categorical function from Keras utils to convert the labels into one-hot encoded vectors. As stated earlier, CIFAR-10 is a dataset of 60,000 32x32 color images in 10 different classes, with 6,000 images per class.

How to preprocess the data?

The data preprocessing consists of two main steps:

  1. Normalizing the pixel values of the images. This means scaling the values from 0 to 255 to a range between -1 and 1. This helps the model learn faster and more accurately, as it reduces the variance of the input data. The preprocess_input function from Keras Applications does this normalization for us, as it is designed for models that use MobileNetV2 as their base.

  2. Converting the labels into one-hot encoded vectors. This means transforming the labels from integers (0 to 9) to binary arrays of length 10, where only one element is 1 and the rest are 0. For example, the label 3 (cat) would be converted to [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. This helps the model output probabilities for each class, as it can use a softmax activation function at the last layer. The to_categorical function from Keras utils does this conversion for us, as it takes the number of classes as an argument.

The code for the preprocessing function is:

from tensorflow import keras as K


def preprocess_data(X, Y):
    """
    Pre-processes the data for your model.

    X is a numpy.ndarray of shape (m, 32, 32, 3) containing the CIFAR 10 data,
      where m is the number of data points.
    Y is a numpy.ndarray of shape (m,) containing the CIFAR 10 labels for X.

    Returns: X_p, Y_p.
    X_p is a numpy.ndarray containing the preprocessed X.
    Y_p is a numpy.ndarray containing the preprocessed Y.
    """
    X_p = K.applications.mobilenet_v2.preprocess_input(X)
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p

Transfer Learning: What, When, and How?

Transfer learning is a machine learning technique where a pre-trained model is used on a new problem. It’s called transfer learning because the knowledge from the pre-trained model is transferred to the new problem.

What to transfer: In our case, we want to transfer the knowledge embedded in the convolutional layers of a pre-existing, pre-trained model, such as MobileNetV2, EfficientNet, InceptionV3, VGG16, ResNet50, or others available in Keras Applications. These models have been trained on large image datasets like ImageNet, and their convolutional layers have learned a robust set of features from thousands of images and hundreds of classes. These features range from simple edges and textures to more complex ones like object parts. These learned features are generally transferable to other image recognition tasks, including our CIFAR-10 classification problem.

When to transfer: in our scenario we will be working with the CIFAR-10 dataset which, while varied and well-labelled, is comparatively small and lacks the diversity found in large-scale datasets like ImageNet. Training a deep learning model from scratch on a smaller dataset may lead to overfitting, meaning the model may not generalize well to unseen data. Additionally, training deep learning models from scratch requires significant computational resources and time. So, by using a pre-trained model, we can leverage the features it has learned and save on training time and resources, making transfer learning an attractive choice in our context.

How to transfer: Here’s how we plan to transfer the knowledge from the pre-existing model to our task:

  1. Select a pre-trained model: As mentioned, we can use a model available in Keras Applications.

  2. Preprocess the CIFAR-10 dataset: Since CIFAR-10 images are smaller than what the pre-existing models are trained on, we’ll need to resize the images. Also, we’ll need to normalize the pixel values and one-hot encode the labels.

  3. Freeze the convolutional base of the pre-trained model: This involves setting the trainable attribute of the model layers to False to preserve the weights and biases.

  4. Add a new classifier on top of the pre-trained model: We’ll add a few layers that will be trained on our specific task. These layers should end with a dense layer with 10 units (one for each class in CIFAR-10) with a softmax activation function to output class probabilities.

  5. Compile and train the model: Compile the model with an appropriate optimizer and loss function, and then train it on the CIFAR-10 data.

  6. Fine-tune the model (optional): Once the top layers are well-trained, we could unfreeze a few layers in the pre-trained model and train it further with a very low learning rate to fine-tune the model to our specific task.

It’s important to keep track of the model’s performance (accuracy) on a validation set during training to ensure the model is learning well and not overfitting. That is, if the training accuracy is going up but the validation loss is rising and the validation accuracy is falling, then the model is ‘overfitting’: its ability to recognize images in the ‘real world’ is not as high as the training accuracy makes it seem, because it is getting too ‘accustomed’ to trends in the training data.
This is a good place to note that one can set early stopping to make sure that the model stops training after enough consecutive epochs of worsening validation loss. This is especially important if one needs to grab a drink and a snack, followed by a ‘short video’, followed by a nap, followed by forgetting what you were doing…. This is not an exact, personal example. Anyway, in our case, the code for early stopping looks like:

early_stopping = K.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

This makes it so that after 3 consecutive epochs with no improvement in validation loss, the model will quit training and restore the best weights seen so far.

Also, checkpoint saving was implemented in the code, so whether or not the training runs to completion, the saved model will be the one from the epoch with the lowest validation loss (along with its associated accuracy). This code looks like:

model_checkpoint = K.callbacks.ModelCheckpoint('cifar10.h5', save_best_only=True)

Building the Models

The Birth of ‘lilTinyNet’

The first model I built, which I affectionately named ‘lilTinyNet’, was based on the MobileNetV2 model from Keras Applications. The model was compiled with the RMSprop optimizer with a learning rate of 0.001, the categorical cross entropy loss function, and the accuracy metric. The model was trained for up to 10 epochs with a batch size of 32, but the training was stopped early if the validation loss did not improve for 3 consecutive epochs. The best model weights were saved using a model checkpoint callback.

Model Architecture

‘lilTinyNet’ was built using the Keras library with TensorFlow as the backend. The model architecture was based on the MobileNetV2 architecture, which was pre-trained on the ImageNet dataset. The top layers of the MobileNetV2 model were replaced with custom dense layers to adapt the model to the CIFAR-10 classification task.

The model structure was defined as follows:

  1. A lambda layer was used to resize images from 32x32 to 128x128 to match the input size that MobileNetV2 was trained on.

  2. The base model (MobileNetV2) was added.

  3. The output of the base model was flattened to 1 dimension.

  4. A dense layer with 1024 units and ReLU activation was added.

  5. Dropout was applied to prevent overfitting.

  6. A final dense layer with 10 units (for the 10 classes) was added with softmax activation to output probabilities for the classes.
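Putting those six steps together, a minimal sketch of ‘lilTinyNet’ might look like the following. The exact dropout rate and the use of tf.image.resize inside the lambda layer are assumptions on my part; the compile settings match the description above.

import tensorflow as tf
from tensorflow import keras as K

def build_lilTinyNet():
    """Sketch of the described architecture; details are illustrative."""
    base = K.applications.MobileNetV2(include_top=False, weights='imagenet',
                                      input_shape=(128, 128, 3))
    base.trainable = False  # use the pre-trained base as a frozen feature extractor

    inputs = K.Input(shape=(32, 32, 3))
    # 1. Resize the 32x32 CIFAR-10 images up to 128x128 for the base model
    x = K.layers.Lambda(lambda img: tf.image.resize(img, (128, 128)))(inputs)
    x = base(x, training=False)                             # 2. the MobileNetV2 base
    x = K.layers.Flatten()(x)                               # 3. flatten to 1 dimension
    x = K.layers.Dense(1024, activation='relu')(x)          # 4. custom dense layer
    x = K.layers.Dropout(0.3)(x)                            # 5. dropout (rate assumed)
    outputs = K.layers.Dense(10, activation='softmax')(x)   # 6. class probabilities

    model = K.Model(inputs, outputs)
    model.compile(optimizer=K.optimizers.RMSprop(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model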

Model Training

The model was trained with early stopping and model checkpointing. Early stopping was used to prevent overfitting by stopping the training process when the validation performance stopped improving. Model checkpointing was used to save the model weights at the end of each epoch if the model’s performance on the validation set had improved.

A learning rate scheduler callback was also used. This callback function adjusted the learning rate according to a schedule. Specifically, the learning rate was reduced by an order of magnitude (a factor of 10) every 5 epochs. This helped achieve better convergence of the model. I had initially planned to switch from this scheduler to continuous learning rate decay, but the results with the scheduler were oddly better, though that is surely mostly coder error :) The code for the learning rate scheduler is as follows:

class LearningRateScheduler(K.callbacks.Callback):
    """Learning rate scheduler callback"""
    def on_epoch_end(self, epoch, logs=None):
        if (epoch+1) % 5 == 0:
            lr = K.backend.get_value(self.model.optimizer.lr)
            K.backend.set_value(self.model.optimizer.lr, lr * 0.1)
            print(" ...Adjusted learning rate to:", lr*0.1)

The model was trained for 10 epochs with a batch size of 32. The training and validation data were the preprocessed CIFAR-10 data.

Results and Insights

‘lilTinyNet’ reached its training peak in the 7th epoch with 92.17% accuracy and 0.2936 validation loss after 11m53s of training! This was the first model that I ever really trained, so I had no idea how fast that was before I beat my head against the wall for days trying to beat that accuracy with 5 ‘new and improved’ models. That is, until….

‘lilTinyNet’ evolved into a ‘megaNet’!

After much trial and error, the final form was ascended to, dubbed ‘megaNet’, and it indeed is a much more sophisticated version of ‘lilTinyNet’. The development of ‘megaNet’ involved the incorporation of data augmentation techniques, fine-tuning of the base model, batch normalization in the layers added to the pre-trained model, batch size increased to 64 from 32, and a switch to Adam optimization.

Data Augmentation

One of the key enhancements in ‘megaNet’ was the use of data augmentation techniques. Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. This is particularly useful when dealing with image data, where the acquisition of new data can be costly and time-consuming.

In ‘megaNet’, I used Keras’s ImageDataGenerator to perform data augmentation. This included random rotations, width and height shifts, horizontal flips, zooming, and brightness adjustments. These transformations introduced variability in the training set, helping the model to generalize better to unseen data. The code looks as so:

    datagen = K.preprocessing.image.ImageDataGenerator(
        featurewise_center=False,  # Set input mean to 0 over the dataset
        featurewise_std_normalization=False,  # Divide inputs by std of the dataset
        rotation_range=10,  # Degree range for random rotations
        width_shift_range=0.1,  # Range for random horizontal shifts
        height_shift_range=0.1,  # Range for random vertical shifts
        horizontal_flip=True,  # Randomly flip inputs horizontally
        zoom_range=0.2,  # Range for random zoom
        brightness_range=[0.8, 1.2])  # Range for picking a brightness shift value
    datagen.fit(x_train)

Fine-tuning the Base Model

Another significant enhancement in ‘megaNet’ was the fine-tuning of the base model. While in ‘lilTinyNet’ the base model was frozen and used as a feature extractor, in ‘megaNet’ I decided to unfreeze the last 20 layers of the base model and train them on the CIFAR-10 data. In theory, this allowed the model to adapt better to the specific features of the CIFAR-10 dataset.

Fine-tuning was performed after the initial training of the top layers. The learning rate was reduced to 0.0001 to prevent large updates that could destroy the pre-learned features. The fine-tuned model was trained for up to 24 epochs, with early stopping if the validation loss did not improve for 5 consecutive epochs.
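In code, the fine-tuning stage might look roughly like this. It is a sketch reusing names like base, model, datagen, and model_checkpoint from earlier; the exact variable names and the validation split (x_valid, y_valid) are assumptions.

# Unfreeze only the last 20 layers of the MobileNetV2 base
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False

# Recompile with a much lower learning rate so the pre-learned features survive
model.compile(optimizer=K.optimizers.Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

early_stopping = K.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(datagen.flow(x_train, y_train, batch_size=64),
          validation_data=(x_valid, y_valid),
          epochs=24,
          callbacks=[early_stopping, model_checkpoint])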

Adam Optimization

In ‘megaNet’, I decided to switch from RMSprop to Adam optimization. Adam is an optimization algorithm that can handle sparse gradients on noisy problems. It’s known for combining the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle a wide range of data and parameter scales.

Why the switch, you ask? Well, it was mostly a fun shot in the dark! I wanted to see if Adam could provide any improvements over RMSprop. And as it turns out, ‘megaNet’ seemed to enjoy the change!

Learning Rate Scheduler

Just like in ‘lilTinyNet’, ‘megaNet’ also incorporated a learning rate scheduler. This helped to achieve better convergence of the model. The learning rate was reduced by a factor of 10 every 5 epochs, just as it was in ‘lilTinyNet’.

Results and Insights

Before fine-tuning, the ‘megaNet’ model achieved a training accuracy of 94.81% with a loss of 0.1804! After fine-tuning, the accuracy slightly improved to 94.99% with a loss of 0.1777. The improvement from fine-tuning was much smaller than expected, which might suggest that the fine-tuning was not performed properly. However, learning was reasonably steady throughout the training process, and it’s possible that letting the model run for even more epochs would have yielded a slight further increase. I did not log the exact time of this training (probably around 40m), but it wouldn’t be proper to compare the two anyway, as ‘megaNet’ ran many more epochs in both the new-layer training and the fine-tuning.

The journey of developing ‘megaNet’ was filled with learning and experimentation. I chased every rabbit hole that I could to try and improve ‘lilTinyNet’, most of which led to negative or stagnating changes. Through this experimentation I was much better able to grasp how deep learning actually works and why they say hyperparameter tuning is a regular issue. I was very happy to finally get a positive result from data augmentation, and even though fine-tuning yielded barely any improvement, I was just happy that the model kept improving with training!

The ‘megaNet’ model, with its more sophisticated architecture and advanced techniques, represents a significant improvement over ‘lilTinyNet’ in both architecture and performance. Even if the accuracy gain was only 2.82 percentage points, it sure seemed like a lot after training larger models for longer with worse results!

Discussion

The journey of building these models was an enlightening experience filled with learning and experimentation. I delved into the intricacies of data preprocessing, discovered the power of transfer learning, and explored the impact of different hyperparameter tweaks. I also got to experience firsthand the utility of callbacks in Keras.

One of the key lessons I learned was the trade-off between model complexity and training time. The ‘lilTinyNet’ model, despite its simplicity and faster training time, achieved nearly the same accuracy as the more complex ‘megaNet’ model. This highlighted the importance of model selection and optimization in machine learning.

A significant challenge I faced was resizing the images from 32x32 to the input size that the pre-trained models were trained on. I used a lambda layer with the resize_images function from the Keras backend, but I am still exploring if this is the best approach.

Another challenge was choosing the appropriate base model for transfer learning. I chose MobileNetV2 because it is lightweight and efficient, but I am still uncertain if it is the best choice for the CIFAR-10 dataset. My early experiments with other models didn’t yield great results, but I am eager to continue experimenting.

Interestingly, I discovered that it’s possible to continue training an already trained and compiled model. This was a revelation to me, and it opened up new possibilities for model improvement. I initially ran ResNet50 for 2 epochs, but switched to MobileNetV2 because it was faster and had higher accuracy.

However, MobileNetV2 was overfitting, as evidenced by the decreasing validation accuracy and large validation loss. To combat this, I added dropout layers and adjusted the learning rate. I started with a learning rate of 0.0001, thinking that a lower rate would be beneficial since I was using a pretrained model with already established weights.

I used the Adam optimization algorithm because of its efficiency and because it requires little memory. Adam also adjusts the learning rate adaptively, which can lead to better results.

Intriguingly, I found that I could experiment with a very high learning rate on MobileNetV2. This led me to learn about ‘learning rate scheduling’, a technique that adjusts the learning rate during training. After implementing learning rate scheduling, dropout, early stopping, and model checkpointing, I readjusted the learning rate of MobileNetV2 by two orders of magnitude.

I ran the model for 10 epochs, but the model was saved on the 7th epoch with the lowest validation loss. The entire process took 11.88 minutes, demonstrating the efficiency of MobileNetV2.

In conclusion, this project was a valuable learning experience. It taught me the importance of model selection, hyperparameter tuning, and various techniques to combat overfitting. It also showed me that there’s always room for experimentation and improvement in machine learning.

Acknowledgments

I would like to thank the creators of the CIFAR-10 dataset and the developers of TensorFlow and Keras for providing the tools and resources necessary for this project. It has been great as a newcomer to learn that there are so many resources which can be utilized reasonably easily by anyone, anywhere.

I would also like to thank my GPU for its hard work and resilience throughout this journey. Ape, AI, and silicon together strong.

Literature Cited

Congratulations! Congratulations! Congratulations!

Appendices

You made it this far and for that you get code! *the crowd goes wild*

holbertonschool-machine_learning/supervised_learning/transfer_learning at master · spindouken/holbertonschool-machine_learning (github.com)

(megaNet is 0-transfer and 0-main will evaluate the trained megaNet models)
