Building a real-time object detection app on Android using Firebase ML Kit

This tutorial is the 9th part in the series, ML Kit for Mobile Developers. If you’re not quite caught up, you can start here:


Introducing Firebase ML Kit Object Detection API

Earlier this month at Google I/O, the team behind Firebase ML Kit announced the addition of 2 new APIs into their arsenal: object detection and an on-device translation API.

This article focuses on the object detection API, and we’ll look into how we can detect and track objects in real-time using this API without using any network connectivity! Yes, this API uses on-device machine learning to perform object detection.

Just to give you a quick sense of what could be possible with this API, imagine a shopping app that tracks items in the camera feed in real time and shows the user items similar to those on the screen—perhaps even suggesting recipes to go along with certain ingredients!

For this blog post, we’ll be building a demo app that detects items from a real-time camera feed.

Step 1: Create a new Android Studio project and add the necessary dependencies

First, we need to create a new Android Studio project and add the relevant dependencies to it.

The first is a simple one: set up Firebase in your project (you can find a good tutorial here). To use this API, you'll need to add the following dependencies to your app:

dependencies {
  implementation 'com.google.firebase:firebase-ml-vision:20.0.0'
  implementation 'com.google.firebase:firebase-ml-vision-object-detection-model:16.0.0'
}
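If this is your first time setting up Firebase, remember that the standard setup also involves the google-services Gradle plugin (the tutorial linked above covers this in detail). Roughly, that means the following in your Gradle files; the plugin version shown here is only indicative, so use the latest one:

// Project-level build.gradle
buildscript {
  dependencies {
    classpath 'com.google.gms:google-services:4.2.0'
  }
}

// App-level build.gradle, at the bottom of the file
apply plugin: 'com.google.gms.google-services'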

You might also want to add a camera library to your project to integrate camera features easily. I personally recommend CameraView by Otaliastudios, which is the library used throughout this post.
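To pull it in, add its dependency as well (the version below is only indicative; check the library's releases page for the latest):

dependencies {
  implementation 'com.otaliastudios:cameraview:1.6.1'
}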

Step 2: Creating a basic layout and adding the camera preview

We need to create a basic layout in our app that hosts a Camera Preview and lets us interact with it:

<?xml version="1.0" encoding="utf-8"?>
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    tools:context=".MainActivity">

    <com.otaliastudios.cameraview.CameraView
        android:id="@+id/cameraView"
        android:layout_width="match_parent"
        android:layout_height="match_parent" />

    <TextView
        android:textSize="24sp"
        android:textStyle="bold"
        android:textColor="#fff"
        android:id="@+id/tvDetectedItem"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_gravity="center" />

</FrameLayout>

Step 3: Starting the camera preview

This is super simple: inside your activity's onCreate method, set the lifecycle owner for cameraView, and the library handles the rest of the work for you.

class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        cameraView.setLifecycleOwner(this) //Automatically handles the camera lifecycle 
    }
}    

At this point, if you run the app, you should see a screen showing a continuous preview from your camera.

Next up, we’ll be adding a method that gives us these preview frames so we can perform some machine learning magic on them!

Step 4: Adding a FrameProcessor to our CameraView

On our CameraView, we can add a FrameProcessor that gives us the captured frames and lets us perform object detection and tracking on those frames.

The code for doing that is pretty straightforward:

class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        cameraView.setLifecycleOwner(this)

        cameraView.addFrameProcessor { frame ->
            //This gives us a Frame object, which can be used to retrieve the preview frame's data
        }
    }
}

Step 5: Convert the frame to a FirebaseVisionImage

This is relatively easy as well: from the received frame, we can get the byte array containing the image data, along with some more information like the rotation, width, height, and format of the captured frame.

These data points can then be used to construct a FirebaseVisionImage, which can then be passed on to the object detector that we'll be creating next.

The steps for creating a FirebaseVisionImage have been outlined in the function below:

private fun getVisionImageFromFrame(frame: Frame): FirebaseVisionImage {
    //ByteArray for the captured frame
    val data = frame.data

    //The metadata builder expects one of the ROTATION_ constants, so map the frame's rotation (in degrees) to it
    val rotation = when (frame.rotation) {
        90 -> FirebaseVisionImageMetadata.ROTATION_90
        180 -> FirebaseVisionImageMetadata.ROTATION_180
        270 -> FirebaseVisionImageMetadata.ROTATION_270
        else -> FirebaseVisionImageMetadata.ROTATION_0
    }

    //Metadata that gives more information on the image that is to be converted to FirebaseVisionImage
    val imageMetaData = FirebaseVisionImageMetadata.Builder()
        .setFormat(FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21)
        .setRotation(rotation)
        .setHeight(frame.size.height)
        .setWidth(frame.size.width)
        .build()

    return FirebaseVisionImage.fromByteArray(data, imageMetaData)
}

Step 6: Perform inference on the converted FirebaseVisionImage

This part is quite similar to implementations of the other Firebase ML Kit APIs; you get access to a detector according to your needs, pass in the FirebaseVisionImage to the detector, and then attach success/failure callbacks to get the output.

The object detector provided by the Object Detection API can operate in one of two modes:

  1. STREAM_MODE:
    Detects and tracks objects from an input stream (e.g. a live camera feed). Has low latency but lower accuracy.
  2. SINGLE_IMAGE_MODE:
    Detects objects from a still image. Has higher latency but higher accuracy (tracking IDs aren't assigned in this mode).

For this post, we'll be using STREAM_MODE to showcase how this API applies to real-time video processing.
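For comparison, a detector configured for still images would simply swap the mode. Here's a quick sketch, not used anywhere else in this post:

//A detector configured for still images instead of a camera stream
val singleImageOptions = FirebaseVisionObjectDetectorOptions.Builder()
    .setDetectorMode(FirebaseVisionObjectDetectorOptions.SINGLE_IMAGE_MODE)
    .enableClassification()
    .build()

val stillImageDetector = FirebaseVision.getInstance().getOnDeviceObjectDetector(singleImageOptions)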

You can further specify whether you want to classify each detected item into a category, or whether you want to track multiple items at once, by setting the corresponding options when creating the detector.

For instance, a sample detector that detects and classifies multiple objects from a stream input can be created as follows:

private fun extractDataFromFrame(frame: Frame, callback: (String) -> Unit) {
        val options = FirebaseVisionObjectDetectorOptions.Builder()
            .setDetectorMode(FirebaseVisionObjectDetectorOptions.STREAM_MODE)
            .enableMultipleObjects()  //Add this if you want to detect multiple objects at once
            .enableClassification()  // Add this if you want to classify the detected objects into categories
            .build()

        val objectDetector = FirebaseVision.getInstance().getOnDeviceObjectDetector(options)
    }

Once we have the detector, we can simply pass in the FirebaseVisionImage that we receive from the getVisionImageFromFrame method we defined above.

Once that's done, we can attach success and failure listeners to the returned task and get the inferred values. The code to achieve that is outlined below:

private fun extractDataFromFrame(frame: Frame, callback: (String) -> Unit) {
        ...
        objectDetector.processImage(getVisionImageFromFrame(frame))
            .addOnSuccessListener { objects ->
                //Build the displayed result from the Knowledge Graph ID of each detected object
                val result = objects.joinToString(separator = "\n") { it.entityId ?: "Unknown" }
                Log.e("TAG", result)
                callback(result)
            }
            .addOnFailureListener {
                callback("Unable to detect an object")
            }
    }

Once done, the complete code should look something like this:

class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        cameraView.setLifecycleOwner(this)
        cameraView.addFrameProcessor {
            extractDataFromFrame(it) { result ->
                tvDetectedItem.text = result
            }
        }
    }

    private fun extractDataFromFrame(frame: Frame, callback: (String) -> Unit) {
    
        val options = FirebaseVisionObjectDetectorOptions.Builder()
            .setDetectorMode(FirebaseVisionObjectDetectorOptions.STREAM_MODE)
            .enableMultipleObjects()  //Add this if you want to detect multiple objects at once
            .enableClassification()  // Add this if you want to classify the detected objects into categories
            .build()

        val objectDetector = FirebaseVision.getInstance().getOnDeviceObjectDetector(options)

        objectDetector.processImage(getVisionImageFromFrame(frame))
            .addOnSuccessListener { objects ->
                //Build the displayed result from the Knowledge Graph ID of each detected object
                val result = objects.joinToString(separator = "\n") { it.entityId ?: "Unknown" }
                Log.e("TAG", result)
                callback(result)
            }
            .addOnFailureListener {
                callback("Unable to detect an object")
            }
    }

    private fun getVisionImageFromFrame(frame: Frame): FirebaseVisionImage {
        //ByteArray for the captured frame
        val data = frame.data

        //Map the frame's rotation (in degrees) to the corresponding ROTATION_ constant
        val rotation = when (frame.rotation) {
            90 -> FirebaseVisionImageMetadata.ROTATION_90
            180 -> FirebaseVisionImageMetadata.ROTATION_180
            270 -> FirebaseVisionImageMetadata.ROTATION_270
            else -> FirebaseVisionImageMetadata.ROTATION_0
        }

        //Metadata that gives more information on the image that is to be converted to FirebaseVisionImage
        val imageMetaData = FirebaseVisionImageMetadata.Builder()
            .setFormat(FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21)
            .setRotation(rotation)
            .setHeight(frame.size.height)
            .setWidth(frame.size.width)
            .build()

        return FirebaseVisionImage.fromByteArray(data, imageMetaData)
    }

}

And that’s about it!
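One practical note: the camera can deliver preview frames faster than the detector can process them. A simple way to keep up (a rough sketch, not part of the code above) is to skip new frames while a detection is still in flight:

//Inside MainActivity; requires import java.util.concurrent.atomic.AtomicBoolean
private val isProcessing = AtomicBoolean(false)

//A guarded version of the frame processor from onCreate
cameraView.addFrameProcessor { frame ->
    //Only start a new detection if the previous one has finished
    if (isProcessing.compareAndSet(false, true)) {
        extractDataFromFrame(frame) { result ->
            tvDetectedItem.text = result
            isProcessing.set(false)
        }
    }
}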

The API also provides the bounding-box coordinates for each object it detects, along with some more info (such as a tracking ID and a coarse classification category), which you can find in the official reference docs.
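For instance, inside the success listener you could read these values off each detected object. Something along these lines (a small sketch using the same objects list as in the code above):

objects.forEach { item ->
    val box = item.boundingBox                    //android.graphics.Rect for the detected object
    val trackingId = item.trackingId              //Consistent across frames in STREAM_MODE
    val category = item.classificationCategory    //e.g. FirebaseVisionObject.CATEGORY_FOOD
    val confidence = item.classificationConfidence
    Log.d("TAG", "Category $category at $box (id=$trackingId, confidence=$confidence)")
}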

Further, if you want to, you can use the Google Knowledge Graph API to fetch relevant results for the Knowledge Graph ID that the object detection API returns.
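As a rough illustration (this part isn't in the demo app, and you'd need to enable the Knowledge Graph Search API and supply your own API key), the request URL for a detected entity ID could be built like this:

//Hypothetical helper that builds a Knowledge Graph Search API request URL for an entity ID
fun knowledgeGraphUrl(entityId: String, apiKey: String): String =
    "https://kgsearch.googleapis.com/v1/entities:search" +
            "?ids=" + java.net.URLEncoder.encode(entityId, "UTF-8") +
            "&key=" + apiKey +
            "&limit=1"

You could then fetch that URL with your networking library of choice and parse the JSON response.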

There are so many other possible use cases for this API. In addition to the recipe recommender example at the beginning of this post, another good one could be an e-commerce app that suggests products similar to the item being scanned.

The full source code for the app built in this article and shown in the screenshots above can be found here:

Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.

Have feedback? Let’s connect on Twitter.

