Card Scanner on Android Using Firebase’s ML Kit and CameraX

Machine learning has changed the way users interact with mobile applications. It offers brand-new experiences, making apps capable of features such as providing accurate, location-based recommendations, detecting and manipulating text in images, detecting diseases early from medical images, and much more!

ML can make mobile apps more user-friendly and easy-to-use…exactly the kinds of user experiences that we as developers want to create. 😉

So in this blog post, we'll be learning about text recognition: the process of recognizing textual information in images, videos, documents, and other sources. With it, you can build apps that translate between languages (Google Translate) or convert text to a PDF or other formats (CamScanner, Google Keep), just to name a couple.

Here, we’ll be making an Android app that scans business cards and recognizes the most important text, like contact numbers, email addresses, and the name of the person or organization, from the scanned image.

Thanks to ML Kit’s Text Recognition API, this task is actually super simple. This pre-trained model can recognize text from the image in any Latin-based language (and more, with cloud-based text recognition).

On top of this, we’ll use the CameraX library to scan a given card and then feed the result to ML Kit’s API. CameraX’s image analysis step makes it easy to pass frames to the text detector, reducing the boilerplate you’d need with the native camera APIs or an additional camera library.

1. Set up Firebase in your project and add the Vision dependency

  • This is a simple step: add Firebase to your project by following the official Firebase Android setup guide.
  • Add the dependencies for the ML Kit Android libraries to your module’s (app-level) Gradle file (usually app/build.gradle):
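For reference, a minimal dependency block might look like the following. The version number here is only an example; check the ML Kit release notes for the latest one:

```groovy
dependencies {
    // ML Kit for Firebase: on-device and cloud Vision APIs
    implementation 'com.google.firebase:firebase-ml-vision:24.0.3'
}
```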

Now, like ML Kit’s other APIs, Text Recognition comes in two variants:

  1. The On-Device API runs text recognition on the device itself. It’s free and recognizes more than 50 languages.
  2. The Cloud API runs on Google Cloud. It’s more accurate and recognizes a wider range of languages and special characters. It’s free for the first 1000 requests, which is good enough if you just want to play around.

2. Implement camera functionality in your app

The Vision API needs an image to extract data from, so you can either build an app that lets the user pick images from the gallery, or one that uses the camera to capture a picture and process it instead.

As discussed earlier, we’ll be using CameraX to implement the camera functionality in the app. This gives us the added advantage of writing less code: the Firebase Vision API has built-in methods that work with CameraX, making it easier to call the detector with less overhead.
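To make this concrete, here's a rough sketch of wiring an analyzer into CameraX. It targets the early CameraX alpha API that matches the `ImageAnalysis.Analyzer` signature used later in this post; newer CameraX releases use `ImageAnalysis.Builder` and `setAnalyzer(executor, analyzer)` instead:

```kotlin
// Configure the analysis use case to always analyze the latest frame
val analysisConfig = ImageAnalysisConfig.Builder()
    .setImageReaderMode(ImageAnalysis.ImageReaderMode.ACQUIRE_LATEST_IMAGE)
    .build()

val imageAnalysis = ImageAnalysis(analysisConfig)
// YourImageAnalyzer is the ImageAnalysis.Analyzer implemented in step 3
imageAnalysis.setAnalyzer(YourImageAnalyzer())

// Bind the use case to the activity's lifecycle (typically alongside a Preview use case)
CameraX.bindToLifecycle(this, imageAnalysis)
```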

3. Recognize text in images

After selecting an input image for processing, we run the text recognizer to extract useful information from it.

a) Run the text recognizer

  • To create a FirebaseVisionImage object from a media.Image object, such as when capturing an image from a device’s camera, pass the media.Image object and the image’s rotation to FirebaseVisionImage.fromMediaImage(). As we’re using CameraX, the OnImageCapturedListener and ImageAnalysis.Analyzer classes calculate the rotation value for you, so you just need to convert the rotation to one of ML Kit’s ROTATION_ constants before calling FirebaseVisionImage.fromMediaImage():
private class YourImageAnalyzer : ImageAnalysis.Analyzer {
    private fun degreesToFirebaseRotation(degrees: Int): Int = when(degrees) {
        0 -> FirebaseVisionImageMetadata.ROTATION_0
        90 -> FirebaseVisionImageMetadata.ROTATION_90
        180 -> FirebaseVisionImageMetadata.ROTATION_180
        270 -> FirebaseVisionImageMetadata.ROTATION_270
        else -> throw IllegalArgumentException("Rotation must be 0, 90, 180, or 270.")
    }

    override fun analyze(imageProxy: ImageProxy?, degrees: Int) {
        val mediaImage = imageProxy?.image
        val imageRotation = degreesToFirebaseRotation(degrees)
        if (mediaImage != null) {
            val image = FirebaseVisionImage.fromMediaImage(mediaImage, imageRotation)
            // Pass image to an ML Kit Vision API
            // ...
        }
    }
}

b) Get an instance of FirebaseVisionTextRecognizer

The code snippet below shows how to use the on-device model:
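Assuming the standard firebase-ml-vision setup, this is a single call (a minimal sketch):

```kotlin
// Use cloudTextRecognizer here instead to call the Cloud API
val detector = FirebaseVision.getInstance().onDeviceTextRecognizer
```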

c) Pass the image to the processImage() method:

Pass the image to the FirebaseVisionTextRecognizer object created above.

val result = detector.processImage(image)
        .addOnSuccessListener { firebaseVisionText ->
            // Task completed successfully
            // ...
        }
        .addOnFailureListener { e ->
            // Task failed with an exception
            // ...
        }

4. Extract text from blocks of recognized text

If the text recognition operation succeeds, a FirebaseVisionText object will be passed to the success listener.

  • It contains the full text recognized in the image and zero or more TextBlock objects.
  • Each TextBlock represents a rectangular block of text, which contains zero or more Line objects.
  • Each Line object contains zero or more Element objects, which represent words and word-like entities (dates, numbers, and so on).

For each TextBlock, Line, and Element object, you can get the text recognized in the region and the bounding coordinates of that region.

The received text is organized into pages, blocks, paragraphs, words, and symbols. For each unit of organization, you can get information such as its dimensions and the languages it contains.

Given that we want to extract specific details from the card like name, phone number, email address, etc., we can follow a specific hierarchy to get to the elements within the line (block->line->element->element-text). Once you get to the text, you can apply your logic to get the details in a proper format (as different cards have different formats).

Below is an example that shows how to extract a phone number of the form “011-22246488”, which uses ‘-’ as a separator. You can handle other phone number formats with a when expression or whatever logic fits your needs. Similar logic can be applied to extract the email address or the name of the person/organization, etc.

// detector is an instance of FirebaseVisionTextRecognizer
detector.processImage(image)
    .addOnSuccessListener { firebaseVisionText ->
        // Task completed successfully
        for (block in firebaseVisionText.textBlocks) {
            val blockText = block.text
            val blockFrame = block.boundingBox

            for (line in block.lines) {
                val lineText = line.text
                // Once you reach the elements, you can extract the phone number,
                // email address, name, etc. from the visiting card (or any other
                // image containing text). Apply your own logic accordingly.
                // One sample is shown below for extracting a number of the form 011-22246488.
                for (element in line.elements) {
                    // Drop the '-' separator: "011-22246488" -> "01122246488"
                    val elementText = element.text.replace("-", "")

                    // Treat a non-empty run of digits as a phone number. Keeping it
                    // as a String (rather than parsing it to a number) preserves
                    // leading zeros.
                    if (elementText.isNotEmpty() && elementText.all { it.isDigit() }) {
                        print("Phone number detected is $elementText")
                    }
                }
            }
        }
    }
    .addOnFailureListener { e ->
        // Task failed with an exception
    }

Thanks for reading! If you enjoyed this story, please click the 👏 button and share it to help others find it!

Have feedback? Let’s connect on Twitter.

Fritz

Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.
