TinyML: It’s a Small World After All!

What’s So Tiny About It?

I think this excerpt from a talk by Dennis Laudick, VP of marketing at ARM, a company that creates TinyML tech, provides a great introduction to the tiny behind TinyML:

Interviewer: “How tiny is tiny?”

Laudick: “Pretty much anywhere where you have a microcontroller. And there are tens of billions of microcontrollers shipped annually. This way you can run machine learning workloads in the device at the source of the information. Everything for a while has been about making better and better models in order to achieve better and better accuracy, with the gold standard of beating humans. But in order to actually run the models you have to be close to where the data is. Think of an elevator, where you have sensors on the motors, the shafts, the chains, and the pulleys, and you want to know when the device is getting dangerously close to failing. You want to know when to perform maintenance. You want to listen to vibrations and motors and things. Well, the best place to do that is there where the data is, where the device happens to be. So you have a microcontroller controlling something and you’re sampling data, you know, at a very low Hz rate, you can run a small machine learning model that listens for variability in the motor, so you know when you need to shut it down for maintenance before something bad happens.”

Swap the elevator for an escalator or an airplane engine and one starts to appreciate the idea. In fact, once you think about TinyML that way, the question becomes, “what can’t we monitor and/or optimize?” In other words, prepare for the tiny takeover of machine learning!

How the Framework Works: It’s Not Magic, It’s Tiny Magic!

Otherwise known as magical tiny science! Or is it tiny magical science??

*Ahem*

A quick primer:

TinyML, in simple terms, is like teaching an ant to do backflips: it’s all about doing a lot with a little. TinyML usually lives on extremely small, microcontroller-based devices with ultra-low power budgets, often just a few milliwatts. For a fun comparison: a tiny device running on 10 mW draws roughly 200–500 times less power than a typical smartphone in active use (~2–5 W). That might not sound impressive given how much we rely on our phones and all that they can do, but that tiny device can perform real-time inference directly on the device. For example, it could run a tiny model for specific sound recognition or gesture detection, processing input from its sensors, like an accelerometer or microphone, and making decisions based on the model’s output, all while consuming only a few milliwatts of power.

How Do They Make It So Tiny?!?

To run ML at this size, TinyML devices rely on specialized processors and architectures that focus on doing more with less. How? Through some neat algorithmic optimizations. Yep, it’s optimization all the way down (to the low level)!

Optimization as a term also covers processes like compression, pruning (think of it as removing unnecessary parts of the network) and quantization (making your calculations take up fewer bits). And there’s always transfer learning: the ability to use tiny base models to build on and train for new tasks. Plus, on-device learning means these devices get smarter while staying where they are, no need to chat with the cloud all the time.

To maximize efficiency, TinyML employs advanced methods like neural architecture search (NAS) and knowledge distillation. NAS automates the design of neural network architectures to find models that are both accurate and lightweight. Knowledge distillation, on the other hand, involves training a compact model (student) to replicate the behavior of a larger, more complex model (teacher), thereby retaining performance while reducing size.

But these are only some of the techniques used, and even the section below will not cover them all! This is meant to be a tiny dive into the world of TinyML. That being said, let’s go a tiny bit deeper.

Further, concise explanations of some hot TinyML terms:

Compression: reduces the overall model size.

Methods of compression include: pruning, quantization, weight sharing, low-rank factorization, sparse representations, knowledge distillation, parameter sharing, and network architecture optimization.

On some compression techniques and more:

https://medium.com/marionete/tinyml-models-whats-happening-behind-the-scenes-5e61d1555be9

Pruning: eliminates redundant or non-contributory neurons.

Pruning is an important method in machine learning that helps make big models more efficient. Imagine you have a large, detailed model that works really well. Pruning is like trimming down this model to make it smaller, but still trying to keep it working just as well. It’s like cutting off the branches of a tree that aren’t needed, so the tree stays healthy but slimmer overall and it’s easier to direct its growth.

Here’s how pruning works in simple terms (a small code sketch follows the list):

  1. Identifying Unimportant Parts: First, you look at the model and find parts that aren’t doing much. These are like tiny gears in a machine that aren’t really contributing to the overall function. For example, if a part of the model has a value like 0.00001, it might not be very important, so it can be set to zero.

  2. Pruning and Retraining Cycle: After removing these small parts, the model can be retrained with the data. This is like giving the model a chance to adjust to the changes made. By doing this, the parts of the model that are left might work a bit harder to make up for what was removed. One can keep doing this — pruning a bit, then retraining — while making sure two things happen: the model still fits within the space allocated (like fitting a big book into a smaller bookshelf), and it still performs well (like making sure the book still tells the same story).

  3. Effectiveness in Big Models: In models that have millions of parts (parameters), this pruning method is really useful. It finds and gets rid of a small percentage of parts that aren’t needed. One can think of this process as gradually increasing the amount of pruning from the start to the end, like slowly turning up a dial from 0 to 100. In the end, one might find that they’ve removed up to 90% of the unnecessary parts, making the model much more manageable and efficient.
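To make that concrete, here is a minimal magnitude-pruning sketch in Python/NumPy. It is only an illustration under simple assumptions (a single weight matrix, retraining between rounds left out), not a production recipe:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until roughly `sparsity`
    fraction of the entries are zero (step 1 above)."""
    pruned = weights.copy()
    k = int(sparsity * pruned.size)
    if k == 0:
        return pruned
    threshold = np.partition(np.abs(pruned).ravel(), k - 1)[k - 1]
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# "Turning up the dial" from 0% to 90% sparsity (step 3). In a real
# pipeline you would retrain between rounds (step 2).
rng = np.random.default_rng(0)
layer = rng.normal(size=(128, 128))
for sparsity in (0.0, 0.3, 0.6, 0.9):
    pruned = magnitude_prune(layer, sparsity)
    print(f"target {sparsity:.0%} -> actual zeros {(pruned == 0).mean():.0%}")
```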

Guide on pruning:

https://towardsdatascience.com/model-compression-via-pruning-ac9b730a7c7b

Quantization: a technique that simplifies how numbers are represented by using fewer bits than full floating-point precision (such as 64-bit doubles). Quantization typically uses formats like 16-bit or 8-bit, which require less memory. For instance, a 16-bit number needs four times less memory than a 64-bit number, and an 8-bit number needs eight times less.

This process involves mapping a number from its original range to a new, smaller range. Quantization includes two main steps:

  1. Quantization Process: This step converts a number into its simpler, quantized form.

  2. Dequantization Process: This step approximates the original number from its quantized form.

While quantization reduces the precision of the numbers and limits the range of calculations, the benefits often outweigh this drawback. For example, training results from quantized neural networks show that the loss in precision is usually small compared to the significant reduction in model size. A notable example of a quantized model is I-BERT, a version of the BERT model that uses only integer arithmetic. This model can perform complex operations more efficiently and can be up to four times faster using 8-bit integers compared to the standard 32-bit floating-point format.
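As a rough sketch of those two steps, here is a simple symmetric 8-bit quantize/dequantize pair in Python/NumPy. This is only an illustration; it is not the scheme used by I-BERT or by any particular framework:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Step 1: map floats onto small signed integers."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits
    scale = max(float(np.max(np.abs(x))) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Step 2: approximate the original floats from the quantized form."""
    return q.astype(np.float32) * scale

weights = np.array([0.91, -0.43, 0.07, -1.20], dtype=np.float32)
q, scale = quantize(weights)
print(q)                      # small integers, e.g. [ 96 -46   7 -127]
print(dequantize(q, scale))   # close to the originals, but not exact
```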

I-BERT: Integer-only BERT Quantization

https://arxiv.org/abs/2101.01321

A White Paper on Neural Network Quantization: https://arxiv.org/pdf/2106.08295.pdf

Weight Sharing: using the same weights for multiple connections in the neural network. By sharing weights across different parts of the network, the overall number of unique weights is reduced, leading to a smaller model size.

Low-Rank Factorization: reduces the model size by representing large matrices with smaller ones, which can significantly decrease the number of parameters.
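As a tiny, hypothetical illustration of the idea, here is a truncated-SVD factorization of a single weight matrix in Python/NumPy (the rank of 16 is an arbitrary choice):

```python
import numpy as np

# Replace one 256x256 weight matrix with two thin factors of rank 16.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 16
A = U[:, :rank] * S[:rank]          # 256 x 16
B = Vt[:rank, :]                    # 16 x 256
W_approx = A @ B                    # same shape as W, lower-rank approximation

print(W.size, A.size + B.size)      # 65536 parameters vs 8192 parameters
```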

Sparse Representations: modifying the network to have more zeros in the weight matrices. Unlike pruning, which removes connections entirely, sparse representations maintain the network structure but with many weights set to zero, which can be efficiently stored and computed.

Parameter Sharing: especially effective in models processing sequential data, where the same weights can be used at each time step.

Network Architecture Optimization: using techniques like Neural Architecture Search (NAS) to find optimal architectures that maintain performance with fewer parameters.

Neural Architecture Search (NAS): the automated design of neural network architectures.

Data Serialization: the transformation of complex data structures into streamlined, compact formats.

Transfer Learning: allows for the adaptation of pre-trained models to new tasks, significantly reducing the computational resources required for training. For example, one could use a tiny model that another has created as a base model to be trained to do another task. Models that do similar tasks and/or deal with similar data can transfer the most learning to a new task.

Knowledge Distillation: leverages the strength of a larger, more accurate model to enhance the training quality for a smaller model. Operates on the principle that a smaller model can achieve accurate results if trained with high-quality data. This approach is often described using a student-teacher model analogy.

Here’s how it works (a small loss-function sketch follows the list):

  1. Training the Teacher: First, a larger and more complex machine learning model, referred to as the “teacher,” is trained on a given dataset. This teacher model is capable of generating highly accurate predictions.

  2. Generating New Data: The teacher model’s predictions are then used to create a new dataset. This dataset combines the original data with the additional insights gained from the teacher model. Essentially, it enriches the original data with more detailed information.

  3. Training the Student: The new, enriched dataset is then used to train a smaller, less complex model, known as the “student.” The idea is that the high-quality data, now containing more detailed insights, can help improve the performance of this smaller model.
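Here is a minimal sketch of the usual soft-label (student–teacher) distillation loss in Python/NumPy. The temperature and blending weight are arbitrary illustrative choices, not values from any particular paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Blend a hard-label cross-entropy term with a soft term that pushes
    the student toward the teacher's softened predictions (step 3 above)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = -np.sum(soft_teacher * np.log(soft_student + 1e-9))
    hard_loss = -np.log(softmax(student_logits)[true_label] + 1e-9)
    return alpha * hard_loss + (1 - alpha) * (temperature ** 2) * soft_loss

# Toy 3-class example: a confident teacher, a smaller student still learning.
teacher = np.array([4.0, 1.5, 0.2])
student = np.array([2.0, 1.0, 0.5])
print(distillation_loss(student, teacher, true_label=0))
```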

On Knowledge Distillation:

https://medium.com/@zone24x7_inc/shrinking-the-giants-how-knowledge-distillation-is-changing-the-landscape-of-deep-learning-models-83dffde577ec

Decision Trees: a viable model choice in TinyML due to their inherent simplicity and interpretability. They operate by making a series of binary decisions, each based on specific data features, which makes them particularly suitable for scenarios with limited computational resources. It’s like plinko, but rigged!
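For a flavor of how small such a model can be, here is a hypothetical hand-written two-level tree (made-up feature names and thresholds) in the spirit of the elevator example above, the kind of logic that fits comfortably on a microcontroller:

```python
def motor_status(vibration_rms, temperature_c):
    """A tiny, hand-written decision tree: two binary checks on sensor
    features, nothing heavier than comparisons."""
    if vibration_rms > 0.8:          # made-up threshold
        return "shut_down_now"
    if temperature_c > 70.0:         # made-up threshold
        return "schedule_maintenance"
    return "ok"

print(motor_status(0.3, 45.0))       # -> "ok"
print(motor_status(0.9, 45.0))       # -> "shut_down_now"
```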

On-Device Learning in TinyML: On-device learning in TinyML signifies a shift towards localized data processing, enabling devices to make decisions without reliance on continuous cloud connectivity. When TinyML devices do connect to the cloud, they send only essential data. This approach not only conserves bandwidth but also enhances privacy and reduces latency, which is crucial for real-time applications.

Edge Computing in TinyML: Edge computing plays a central role in TinyML. By processing data locally, devices reduce latency and enhance privacy, which is essential for applications like real-time language translation or emergency response systems in smart cities.

Examples of TinyTech

Imagine your morning alarm clock not just waking you up, but easing you into a wakeup and telling you how well you slept, thanks to a TinyML device monitoring your sleep patterns. It could also report whatever else you set it to: the news, your schedule, any important reminders.

More examples?

In fitness, your running shoes could give you feedback on your jogging technique.

In healthcare, TinyML is like a mini-doctor you wear on your wrist, keeping an eye on your heart rate, oxygen levels, and maybe even predicting if you’re about to catch a cold.

In agriculture, it’s like having a tiny farmer in your field, monitoring soil moisture and crop health. That tiny farmer has a tiny farmer’s almanac and it knows things.

Some case studies that have been performed:

TinyML for environmental data collection, sign language detection, handwriting recognition, medical face mask detection, gesture recognition, speech recognition, and autonomous tiny vehicles.

These aren’t scenes from a sci-fi movie; they’re real possibilities with TinyML!

“Aren’t these sci-fi things expensive??”

Actually, more optimization means more scalability, and more scalability means leveraging economies of scale!! In other words, these things will be so tiny, in demand, and therefore numerous that we will be able to make them extremely cheaply after building the means of production. This will tap economies of scale at a magnitude that could summon the ghost of Andrew Carnegie to praise us all.

Now, let’s get our hands tiny! With TensorFlow Lite, Arduino, and even Raspberry Pi (not the tiniest), even hobbyists can join in the fun, turning everyday objects into smart gadgets. You could create a gadget that tells you when your pet needs food, a system that understands sign language, or a bot that figures out what dance move you are performing and plays the music relevant to your jig. The possibilities are endless!!
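For a taste of the hobbyist workflow, here is a minimal TensorFlow Lite inference sketch in Python. The model file name and the zeroed-out input are hypothetical placeholders; on an actual microcontroller you would typically use TensorFlow Lite for Microcontrollers in C++ instead:

```python
import numpy as np
import tensorflow as tf

# Load a (hypothetical) pre-converted TinyML model.
interpreter = tf.lite.Interpreter(model_path="gesture_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Stand-in for one window of accelerometer readings; the real shape and
# dtype come from the model itself.
sample = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```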

Much of this section was influenced by the paper: “TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge Devices for AI Applications”, Section 4.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9227753/

Why I’m Excited: The Takeover of the Tinytech

Uhhh, because tiny is such a cute way to describe microtech (read the section title again: tinytech... hilarious). But really, what excites me about TinyML is its potential to be everywhere, like invisible helpers making our lives easier without us even noticing. These devices are built to be extremely efficient and focused on specific tasks, which is a huge selling point in a world where we are all starting to think about ‘our data’ and privacy concerns.

Currently, machine learning seems useful mainly for very large projects. Soon, plenty of TinyML software and hardware will be deployed and become part of our everyday lives.

Outro: From Tiny Steps to Tiny Leaps

In the near-term, expect to see more gadgets getting smarter and more efficient. Imagine a world where many little technologies seem to have a bit of ‘intelligence’, communicating and learning from each other actively in the environment. At first, this will be very task-focused and implemented sparsely. But in the moonshot future we are looking at swarms of ‘smart’ tech.

An example: when AI-automated vehicles swarm the roads, they will all be communicating their positions and environmental data with each other, allowing the AI to track the movements of every object around every road. This is the moonshot future that makes some automated-vehicle engineers so sure that it would be the safest one for vehicle passengers and pedestrians. Currently, we simply don’t have the data bandwidth or the deployed devices to monitor streets in such a way, though this is one of the purposes of 5G deployment.

One of my favorite concepts has always been nanobots. I feel like nanobots are the real TinyML moonshot. This is where little bots with nano-tiny power sources perform cellular level repair. Is this a real thing we can do? I don’t know but it sounds very cool.

That brings us to the end of this tiny blog. As we step into this paradigm of expanding TinyML and BigML, let’s embrace that it’s a small world, and it’s about to get a lot smarter!

Sources

TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge Devices for AI Applications

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9227753/

On Compression:

https://medium.com/marionete/tinyml-models-whats-happening-behind-the-scenes-5e61d1555be9

Tiny guide on pruning (Model Compression via Pruning):

https://towardsdatascience.com/model-compression-via-pruning-ac9b730a7c7b

I-BERT: Integer-only BERT Quantization:

https://arxiv.org/abs/2101.01321

A White Paper on Neural Network Quantization: https://arxiv.org/pdf/2106.08295.pdf

On Knowledge Distillation:

https://medium.com/@zone24x7_inc/shrinking-the-giants-how-knowledge-distillation-is-changing-the-landscape-of-deep-learning-models-83dffde577ec
