Reducing Core ML 2 Model Size by 4X Using Quantization in iOS 12

This year, Apple introduced Core ML 2 at WWDC 2018, with a focus on making machine learning more flexible and powerful for developers to use.

With the release of this new and improved framework, Apple also announced a freshly updated version of coremltools. This handy Python library can be used to convert trained models into the Core ML format, as well as to make predictions directly from your machine.

For mobile applications using Core ML, one of the main burdens is the model size. A heavy app can discourage some users from downloading it. As such, developers often end up storing models in the cloud, costing both time and money.

Introducing Quantization

Quantization is one of the new features in the updated coremltools, and it can help solve this size problem by reducing — sometimes drastically — the size of Core ML models.

It works by trimming the number of bits used to describe weights in models. As some models can have millions of weights, shaving a few bits on each one can have a tremendous impact on overall size.
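
As a rough, back-of-envelope illustration (the weight count below is hypothetical, not taken from any particular model), here is how the raw weight payload shrinks as you cut the number of bits per weight:

num_weights = 1700000  # hypothetical weight count (~6.8MB as 32-bit floats)
for bits in [32, 16, 8, 4, 1]:
    size_mb = num_weights * bits / 8.0 / 1000000  # bits -> bytes -> megabytes
    print(str(bits)+"-bit weights: ~"+str(round(size_mb, 2))+" MB")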

In iOS 11, models could only use 32-bit floats to describe their weights. This was improved in iOS 11.2 with the introduction of half-precision floats, which use 16 bits for essentially the same accuracy.

With iOS 12, weights can now be encoded using any number of bits, all the way down to just 1 bit.

So now, instead of the continuous range of values that weights can take with floats, we end up with a discrete subset. Of course, this will lower your prediction accuracy, but it’s up to you to decide what tradeoff you can accept.
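
To make that concrete, here’s a minimal NumPy sketch of linear quantization (for intuition only; this is not the coremltools implementation): the weights get snapped onto 2^n evenly spaced levels between their minimum and maximum.

import numpy as np

def linear_quantize(weights, nbits):
    # Snap each float weight onto one of 2**nbits evenly spaced levels
    # between the smallest and largest weight, then map back to floats.
    levels = 2 ** nbits
    w_min, w_max = weights.min(), weights.max()
    step = (w_max - w_min) / (levels - 1)
    indices = np.round((weights - w_min) / step)  # integer codes in [0, levels-1]
    return w_min + indices * step                 # de-quantized approximation

weights = np.random.randn(1000).astype(np.float32)
approx = linear_quantize(weights, 4)              # only 16 distinct values remain
print("max error: "+str(np.abs(weights - approx).max()))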

There are multiple ways to choose the values representing those new, quantized weights. The most straightforward is to distribute them linearly, like in the example above.

We can also have them distributed in other arbitrary ways using a lookup table (or LUT), based on the particularities of your model.

In total, you have access to these four distribution modes: linear, kmeans, linear_lut, and custom_lut.
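
For intuition, here is roughly what a k-means-based lookup table does (a conceptual sketch using scikit-learn, not coremltools’ internal code): the weights are clustered into 2^n centroids, the centroids become the LUT, and each weight is stored as a small index into that table.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_lut(weights, nbits):
    # Cluster the weights into 2**nbits centroids. The centroids form the
    # lookup table; each weight is then represented by an index into it.
    k = 2 ** nbits
    km = KMeans(n_clusters=k, n_init=4).fit(weights.reshape(-1, 1))
    lut = km.cluster_centers_.flatten()
    indices = km.predict(weights.reshape(-1, 1))
    return lut, lut[indices]

weights = np.random.randn(1000).astype(np.float32)
lut, approx = kmeans_lut(weights, 2)  # 4-entry LUT, so 2 bits per weight index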

Quantizing your Core ML model

Let’s dive into the code!

First, there are a few prerequisites:

  • Be running macOS 10.14 (Mojave)
  • Have the latest version of coremltools installed. A simple pip install coremltools==2.0 should do it.

Once your environment is ready, you should be able to run the following script. Just edit the last few lines to feed it your own model filename, along with the combination of bits-per-weight and distribution functions you wish to use.

import sys
import coremltools
from coremltools.models.neural_network.quantization_utils import *

def quantize(file, bits, functions):
    """
    Processes a file to quantize it for each bit-per-weight
    and function listed.
    file : Core ML file to process (example : mymodel.mlmodel)
    bits : Array of bits per weight (example : [16,8,6,4,2,1])
    functions : Array of distribution functions (example : ["linear", "linear_lut", "kmeans"])
    """
    if not file.endswith(".mlmodel"): return # We only consider .mlmodel files
    model_name = file.split(".")[0]
    model = coremltools.models.MLModel(file)
    for function in functions:
        for bit in bits:
            print("--------------------------------------------------------------------")
            print("Processing "+model_name+" for "+str(bit)+"-bit with "+function+" function")
            sys.stdout.flush()
            quantized_model = quantize_weights(model, bit, function)
            sys.stdout.flush()
            quantized_model.author = "Alexis Creuzot"
            quantized_model.short_description = str(bit)+"-bit per quantized weight, using "+function+"."
            quantized_model.save(model_name+"_"+function+"_"+str(bit)+".mlmodel")

# Launch quantization
quantize("starry_night.mlmodel", 
        [6,4,3,2], 
        ["linear"])

Right away, you should see that the fewer bits per weight you use, the lighter the quantized file will be.
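
To check this quickly, a small snippet like the one below (assuming the naming scheme used by the script above) lists each resulting file and its size on disk:

import os
import glob

# Print the size of the original model and each quantized variant.
for path in sorted(glob.glob("starry_night*.mlmodel")):
    size_mb = os.path.getsize(path) / 1000000.0
    print(path+": "+str(round(size_mb, 1))+" MB")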

This script was used on one of Looq’s Neural Style Transfer models to give you an idea of the accuracy loss in a real-world use case.

The first image is the un-styled picture, for reference. After it are the outputs of the original model, followed by the 16-bit, 8-bit, 6-bit, 4-bit, 3-bit, 2-bit, and 1-bit quantized models.

Now let’s look at the actual size difference between each of those files.

In this case, we can see that the 8-bit model’s output is nearly identical to the original one. This means we can take the model size from 6.7MB down to 1.7MB, making it 4 times lighter!

This example is, of course, very specific to NST models, and a linear distribution won’t always work. Fortunately, you should be able to apply the same approach to your own models, as long as you have a good understanding of how their weight values are distributed.
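
If you want a sanity check beyond eyeballing outputs, quantization_utils also ships a compare_models helper that runs both the original and quantized models on sample data so you can judge the accuracy loss yourself. A minimal sketch (the sample-data folder below is a placeholder for your own test images):

from coremltools.models import MLModel
from coremltools.models.neural_network.quantization_utils import compare_models

original = MLModel("starry_night.mlmodel")
quantized = MLModel("starry_night_linear_6.mlmodel")

# Compares the predictions of the two models on a folder of sample images.
# "sample_images/" is a placeholder; point it at your own test data.
compare_models(original, quantized, "sample_images/")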

Save your models!

As of now, there is no way to “re-quantize” a model, meaning that if you’ve quantized a model to 16-bit and later decide to make an 8-bit version, you won’t be able to unless you still have your original file.

I hope you found this article helpful. If you want to read more on machine learning (and more specifically, Core ML), have a look at my piece on how to build a Neural Style Transfer app!

Discuss this post on Hacker News.

