Optimizing Alzheimer’s Disease Classification using Bayesian Optimization and Transfer Learning

Introduction

Alzheimer’s disease is a progressive brain disorder that affects memory, thinking, and behavior. Early diagnosis plays a critical role in managing the disease. In this article we will walk through using Bayesian optimization to fine-tune a convolutional neural network (CNN) that classifies Alzheimer’s disease stages from MRI scans, employing both transfer learning and from-scratch approaches. We’ll discuss the thought process behind each decision, the statistical concepts involved, and the conclusions drawn from the optimization process.

Process and Reasoning: Why This Project Structure?

When I was brainstorming how to approach this project, I decided that the main goals were personal learning and training efficiency, since I put the whole project together in about 5 days. Both of these objectives influenced the choices made in dataset selection, code structuring, and overall project architecture. Below are the key considerations:

Dataset: Why MRI Images?

  • Clinical Relevance: MRI scans are among the most reliable diagnostic tools for Alzheimer’s disease, making them a clinically relevant choice.

  • Learning Objective: Given my keen interest in computational neuroscience, working with medical images like MRIs offers an invaluable learning experience. It provides a practical application of machine learning techniques in a neuroscience context.

Code Structure: Modular and Reusable

The codebase for this project was designed with reusability and scalability in mind. Here’s how:

  • Function-Based Design: The code is structured into separate functions for data loading, model creation, and optimization. This modularity makes it easier to understand, debug, and scale.

  • Dynamic Hyperparameters: Hyperparameters like batch size are made dynamic, allowing the Bayesian optimization process to have full control, thereby improving search efficiency. This design choice enables the model to adapt better to the data, leading to more robust performance.

  • Custom Metrics: Implementing custom metrics like F1-Score serves both as a learning exercise and as a way to fine-tune the model’s performance. The F1-Score is particularly useful when the class distribution is imbalanced, which is often the case in medical datasets.
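
To make the Custom Metrics point concrete, here is a minimal sketch of a streaming F1 metric for TensorFlow 2.x Keras. It is illustrative rather than the project’s exact implementation: the class name is arbitrary, and it micro-averages precision and recall over all predicted entries.

    import tensorflow as tf

    class F1Score(tf.keras.metrics.Metric):
        """Streaming F1 built from Keras' Precision and Recall metrics."""

        def __init__(self, name="f1_score", **kwargs):
            super().__init__(name=name, **kwargs)
            self.precision = tf.keras.metrics.Precision()
            self.recall = tf.keras.metrics.Recall()

        def update_state(self, y_true, y_pred, sample_weight=None):
            self.precision.update_state(y_true, y_pred, sample_weight)
            self.recall.update_state(y_true, y_pred, sample_weight)

        def result(self):
            p = self.precision.result()
            r = self.recall.result()
            return 2.0 * p * r / (p + r + tf.keras.backend.epsilon())

        def reset_state(self):
            self.precision.reset_state()
            self.recall.reset_state()

    # Example usage when compiling a model:
    # model.compile(optimizer="adam",
    #               loss="categorical_crossentropy",
    #               metrics=["accuracy", F1Score()])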

Hyperparameter Optimization: Bayesian Over Grid Search

Hyperparameter tuning is often the most time-consuming part of a machine learning project. Traditional methods like grid search or random search are computationally expensive and less efficient. That’s where Bayesian optimization comes in.

Bayesian optimization is generally more efficient than other methods like grid search, especially for high-dimensional spaces. It uses a probabilistic model to predict the objective function and intelligently selects the next set of hyperparameters to evaluate, thereby reducing the number of evaluations needed.

Training Efficiency: Why Transfer Learning?

Transfer learning is the practice of taking a pre-trained model and fine-tuning it for a different but related task. Here’s why it’s beneficial:

  • Efficiency: Using a pre-trained model allows us to leverage prior learning, reducing training time and computational resources. For instance, a model trained on general brain images could be fine-tuned to classify Alzheimer’s stages, saving both time and computational power.

  • Note: In this project we only use models that were pre-trained on ImageNet rather than on medical images, so the knowledge they transfer is limited, but they are very easy to load and fine-tune with Keras, as sketched below.
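
As a rough illustration of how little code this takes in Keras, the following sketch loads DenseNet169 with ImageNet weights, freezes it, and adds a small classification head. The number of classes, input shape, head sizes, and learning rate are assumptions for the example, not the project’s exact values.

    import tensorflow as tf

    # Assumptions for illustration: 4 Alzheimer's stage classes, 224x224 RGB inputs.
    NUM_CLASSES = 4
    IMG_SHAPE = (224, 224, 3)

    # Load DenseNet169 pre-trained on ImageNet, without its classification head.
    base = tf.keras.applications.DenseNet169(
        weights="imagenet", include_top=False, input_shape=IMG_SHAPE
    )
    base.trainable = False  # freeze the pre-trained features for the first training phase

    # Add a small custom head for the Alzheimer's stage classes.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])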

Primer on Concepts

What is a Gaussian Process?

A Gaussian Process (GP) is a powerful tool for probabilistic modeling. In the context of Bayesian optimization, GPs are used to model the unknown objective function. A GP assumes that any finite set of function values is jointly Gaussian distributed. This property allows us not just to make point estimates of the function but also to quantify our uncertainty about those estimates.

In simpler terms, it’s a way to predict a function, along with our uncertainty about it, based on prior observations. GPs are the backbone of Bayesian optimization and are widely used in machine learning to make predictions when the underlying function is unknown.
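
As a quick illustration (using scikit-learn, which is not part of the project code), a GP fitted to a few observations returns both a mean prediction and a standard deviation at new points:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # A handful of observations of an "unknown" function.
    X_train = np.array([[0.0], [1.0], [2.5], [4.0]])
    y_train = np.sin(X_train).ravel()

    # Fit a GP with an RBF kernel to those observations.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
    gp.fit(X_train, y_train)

    # Predict at new points: we get both a mean estimate and an uncertainty (std).
    X_new = np.linspace(0, 5, 6).reshape(-1, 1)
    mean, std = gp.predict(X_new, return_std=True)
    print(np.round(mean, 2))
    print(np.round(std, 2))  # std grows where there are no nearby observations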

What is Bayesian Optimization?

Bayesian optimization (BO) aims to find the hyperparameter combination that gives the best score on the objective function without having to test every possible combination, which can be computationally expensive. It uses probabilistic models, like Gaussian processes, to predict the performance of untested hyperparameters and intelligently decides the next set of hyperparameters to test. This makes the optimization process more efficient, and it is particularly useful for optimizing functions that are expensive to evaluate, like high-dimensional hyperparameter tuning for deep learning models.
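
Here is a minimal sketch of that loop using scikit-optimize’s gp_minimize on a deliberately cheap toy objective; in a real run the objective would train a model and return something like 1 minus the validation F1 score. The library choice and settings are illustrative, not the project’s exact configuration.

    from skopt import gp_minimize
    from skopt.space import Real

    # A cheap stand-in for an expensive objective (e.g., "train a model and
    # return 1 - validation F1"). BO treats it as a black box either way.
    def objective(params):
        x = params[0]
        return (x - 2.0) ** 2 + 0.5  # minimum at x = 2

    result = gp_minimize(
        objective,
        dimensions=[Real(-5.0, 5.0, name="x")],
        acq_func="EI",       # Expected Improvement (discussed later)
        n_calls=20,          # total objective evaluations
        n_initial_points=5,  # random points before the GP surrogate takes over
        random_state=0,
    )
    print(result.x, result.fun)  # best parameters found and their objective value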

The Model(s)

MobileNetV2

MobileNetV2 is designed for mobile and embedded vision applications. Given its efficiency and lower computational requirements, MobileNetV2 is an excellent choice for quick experiments and prototyping. It’s also a good fit for real-world applications where computational resources may be limited. I had also used it in a previous transfer learning project and wanted to compare its accuracy against larger, more robust off-the-shelf CNNs.

VGG16

VGG16 has a simple architecture, a series of convolutional layers followed by max-pooling and fully connected layers, which makes it easy to understand and modify. It is not actually a lightweight model (its fully connected layers push it to roughly 138 million parameters, far more than ResNet), but its straightforward design made it a natural step up from MobileNetV2.

DenseNet169

The final model chosen. DenseNet is another CNN architecture, and DenseNet169 is the variant with 169 layers. Unlike traditional convolutional networks, where each layer only passes its output forward to the next layer, DenseNet organizes layers into “dense blocks” in which every layer receives the feature maps of all preceding layers and passes its own feature maps to all subsequent layers. This feature reuse results in relatively few parameters for its depth, reduced overfitting, and better gradient flow, which also suits a project aiming for low computational overhead.
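
As a rough sanity check of the relative sizes, the Keras applications API makes it easy to count parameters for the three backbones with their classification heads excluded; the exact numbers depend on the input shape, but DenseNet169 and VGG16 land on the order of 13 million and 15 million parameters respectively, with MobileNetV2 closer to 2 million. This snippet is just an illustration, not part of the project code.

    import tensorflow as tf

    # Rough size comparison of the three candidate backbones (no classification
    # heads, no downloaded weights). Counts depend on the chosen input shape.
    for build in (tf.keras.applications.MobileNetV2,
                  tf.keras.applications.VGG16,
                  tf.keras.applications.DenseNet169):
        model = build(weights=None, include_top=False, input_shape=(224, 224, 3))
        print(f"{build.__name__}: {model.count_params():,} parameters")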

From Scratch

Some exploration was done with from-scratch models, but this idea was scrapped due to time constraints.

Hyperparameters and Why They Were Chosen

  1. Learning Rate: This controls how quickly or slowly a model learns. Too high a learning rate can cause training to overshoot the minimum or converge to a suboptimal solution; too low a learning rate makes the model slow to converge.

  2. Dense Units: This refers to the number of units in the dense layers of the network. More units allow for more complex representations but can also lead to overfitting.

  3. Dropout Rate: Dropout is a regularization technique where a fraction of the input units are randomly set to zero during training. This helps prevent overfitting.

  4. L2 Weight: This is the weight for L2 regularization in the loss function. Regularization helps prevent overfitting by adding a penalty for large weights.

  5. Batch Size: This is the number of samples that will be used to update the model weights in one iteration. Smaller batch sizes often provide a regularizing effect and lower generalization error.

These hyperparameters were chosen because they have a significant impact on the model’s performance and are commonly tuned in deep learning projects.
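
For illustration, a search space over these five hyperparameters could be defined like this with scikit-optimize; the bounds and candidate batch sizes are assumptions, not the project’s exact values. The list can be passed as the dimensions argument of gp_minimize, as in the earlier sketch.

    from skopt.space import Real, Integer, Categorical

    # Illustrative search space for the five hyperparameters above.
    search_space = [
        Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
        Integer(64, 512, name="dense_units"),
        Real(0.1, 0.6, name="dropout_rate"),
        Real(1e-6, 1e-2, prior="log-uniform", name="l2_weight"),
        Categorical([16, 32, 64], name="batch_size"),
    ]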

Satisficing Metrics

When I started this project my go-to metric for model evaluation was the good ol’ validation loss. It’s a classic choice, often used as a quick and dirty way to gauge how well a model is performing. However, as I continued my research into how best to deal with a dataset of MRI images in the context of satisficing metrics, I realized that this project had its own set of unique challenges, chief among them being class imbalance.

The Problem with Validation Loss in Imbalanced Classification

Validation loss is a great metric when you’re dealing with a balanced dataset. But here’s the kicker: Alzheimer’s disease stages are not uniformly distributed in the real world, and neither were they in my dataset. Some stages of the disease are more common than others, leading to an imbalanced dataset. Using validation loss in such a scenario can be misleading. The model might perform well on the majority class but terribly on the minority class, and yet show a deceptively low validation loss. This is a classic trap in machine learning, and it’s one that I fell into.

The Aha Moment: Enter F1 Score

As I was going down this rabbit hole, I stumbled upon the concept of ‘satisficing metrics’, which are metrics that satisfy a minimum criterion of adequacy. This led me to the F1 Score, a metric that balances both precision and recall. In the context of imbalanced classification, the F1 Score shines because it gives you a more holistic view of how well your model is performing across all classes.

The F1 Score is calculated as follows:

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

Here, Precision is the number of True Positives divided by the number of True Positives and False Positives.

Recall, or Sensitivity, is the number of True Positives divided by the number of True Positives and False Negatives.

The F1 Score harmoniously combines these two metrics into a single number that ranges from 0 to 1, with 1 being the best possible F1 Score.
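
As a quick worked example: a model with a precision of 0.90 but a recall of only 0.50 would score (2 × 0.90 × 0.50) / (0.90 + 0.50) ≈ 0.64, so the weak recall drags the F1 Score down even though precision alone looks excellent.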

Why F1 Score Over Other Metrics?

You might be wondering, “Why not use other metrics like ROC-AUC?” While ROC-AUC is a strong contender, it’s not as interpretable as the F1 score. Given that one of my goals was to make this project as accessible as possible, even to those without a deep technical background, the F1 score was the clear winner.

The Takeaway

Switching from validation loss to the F1 score was a pivotal moment in this project. It not only provided a more accurate measure of the model’s performance but also deepened my understanding of the nuances involved in choosing the right metric. The F1 score became my satisficing metric, ensuring that the model met the minimum criteria for adequacy while dealing with an imbalanced dataset.

So, the next time you find yourself knee-deep in a classification problem with an imbalanced dataset, remember: the choice of metric can make or break your model. Choose wisely!

Acquisition Functions

What is an Acquisition Function?

In Bayesian optimization, the acquisition function is a heuristic that provides a measure of the utility of evaluating the objective function at a given point.

What is an objective function anyway? What is this gobbledygook speak? Well, it’s any mathematical formula that you want to optimize — either maximize or minimize — while solving a problem. In machine learning, for example, the objective function could measure how well a model predicts data; you’d aim to find the model settings that make this score as good as possible. In the context of Bayesian optimization, the objective function represents the performance of a machine learning model across different hyperparameter settings. The “score” given by this function could be something like accuracy or F1 score for a classification task, or mean squared error for a regression task.

The acquisition function guides the selection of the next point to evaluate in the hyperparameter space. It balances the trade-off between exploration (searching unknown or less certain regions) and exploitation (zooming in on known good regions). The point that maximizes the acquisition function is chosen as the next point to evaluate in the objective function.

Types of Acquisition Functions

Here’s a quick rundown of some commonly used acquisition functions:

  1. Probability of Improvement (PI): It picks the point with the highest probability of improving on the current best-known value, regardless of how large that improvement might be.

  2. Expected Improvement (EI): A balanced choice that considers both exploration and exploitation. It measures the expected improvement over the current best-known value.

  3. Upper Confidence Bound (UCB): It takes into account both the mean and variance of the predicted values, making it more explorative.

  4. Thompson Sampling: It samples from the posterior distribution and picks the point that maximizes this sample, offering a more randomized approach.

  5. Knowledge Gradient (KG): It considers the value of information collected from future evaluations, making it computationally more expensive but potentially more effective in some cases.

Choosing Expected Improvement (EI)

Among these, expected improvement (EI) is often the go-to choice for many practitioners, and for good reasons. EI provides a good balance between exploration and exploitation. Mathematically, it’s defined as:

EI(x) = E[max(f(x) − f(x+), 0)]

Here, f(x) is the objective function, and f(x+) is the value of the best sample so far. The expectation is computed with respect to the posterior distribution over the objective function.
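
For a Gaussian posterior this expectation has a closed form, which libraries like scikit-optimize evaluate internally. A minimal numeric sketch (for a maximization problem, with illustrative numbers) looks like this:

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, f_best):
        """Closed-form EI for maximization, given the GP posterior mean (mu) and
        standard deviation (sigma) at a candidate point and the best observed
        objective value so far (f_best)."""
        sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
        z = (mu - f_best) / sigma
        return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

    # A point with an uncertain but promising prediction gets a higher EI than a
    # point we are confident is only about as good as the current best.
    print(expected_improvement(mu=0.80, sigma=0.10, f_best=0.78))  # ~0.05
    print(expected_improvement(mu=0.78, sigma=0.01, f_best=0.78))  # ~0.004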

Why expected improvement?

  1. Balance: EI naturally balances exploration and exploitation, making it a versatile choice for a wide range of problems.

  2. Analytical Tractability: The EI can be computed analytically, which makes it computationally efficient. And we are focused on computational efficiency here.

  3. Intuitiveness: It’s easy to interpret. A higher EI value simply means that there’s either a high likelihood of improvement, a large magnitude of potential improvement, or both.

  4. Parameter-Free: Unlike UCB, which has a tunable parameter to balance exploration and exploitation, EI is parameter-free, making it easier to use out-of-the-box.

The Takeaway

There’s no shame in the EI game!

Other project design choices

  • Docker: For containerization and easy deployment.

  • Colab: For leveraging free GPU resources and accessibility.

Training and Results

Everything works end to end, and the F1 score improves over the course of Bayesian optimization and final model training. I am still planning and executing the final training sessions, which will use much larger epoch and patience counts.

Conclusions

Bayesian optimization proved to be a powerful tool for hyperparameter tuning with high computational efficiency. It’s important to note that the pre-trained models we used have limitations, such as being initially trained on ImageNet rather than medical images. In terms of performance metrics, the F1-Score was our chosen evaluation method, particularly because it is effective for imbalanced datasets. We chose the Expected Improvement (EI) as our acquisition function due to its balance between exploration and exploitation, making it highly effective for navigating the hyperparameter space in fewer iterations. On the coding side, the emphasis was on writing code that is not only functional but also modular, maintainable, and portable.

Looking forward, there are several avenues for building on this project. Current ideas include: using Bayesian optimization in deeper ways with a broader hyperparameter space, adding more visualization, finding and using a model pre-trained on MRI data, and more.

Final Thoughts

This project was not just an academic exercise but an exploration into the real-world applications of machine learning in healthcare. It was inspiring to work with a real dataset from the web and train a model locally, even if it was a very small project and only a humble beginning. The process opened my eyes to the possibilities of using machine learning in medical diagnostics.
