Face Recognition and Detection on iOS Using Native Swift Code, Core ML, and ARKit

Leveraging the native Swift library to perform face recognition and detection in an iOS app

In the landscape of new technologies that are capable of revolutionizing our daily lives, few are as tantalizing as facial recognition technologies. With all the recent controversy around Clearview AI, people are paying more and more attention to the technology, and they’re eager to understand how it works and what its limitations are. This article won’t cover the ethical debate, but I’ll try my hand at explaining some facial recognition and detection techniques.

In recent months, Apple has been pushing new features and major improvements for its Vision API, its main framework for all things related to computer vision. The Vision API makes analyzing camera frames quick, easy, and intuitive while offering a multitude of possibilities, such as:

  • Native face detection API
  • Core ML model integration for image processing (e.g., classification, object detection)
  • Barcode recognition
  • Text recognition

Since the most-used cameras are (by far) the ones we have in our pockets, this tutorial will be covering native mobile solutions for iOS.

I have included code in this article where it’s most instructive. Full code and data can be found on my GitHub page. Let’s get started!

Create a Single View Application

To begin, we need to create an iOS project with a single view app. Make sure to choose Storyboard in the “User interface” dropdown menu (Xcode 11 only):

Now we have our project ready to go. I don’t like using storyboards myself, so the app in this tutorial is built programmatically, which means no buttons or switches to toggle — just pure code 🤗.

To follow this method, you’ll have to delete the main.storyboard and set your SceneDelegate.swift file (Xcode 11 only) like so:

var window: UIWindow?
func scene(_ scene: UIScene, willConnectTo session: UISceneSession, options connectionOptions: UIScene.ConnectionOptions) {
    guard let windowScene = (scene as? UIWindowScene) else { return }
    window = UIWindow(frame: windowScene.coordinateSpace.bounds)
    window?.windowScene = windowScene
    window?.rootViewController = ViewController()
    window?.makeKeyAndVisible()
}

With Xcode 11, you’ll have to change the Info.plist file like so:

You need to delete the “Storyboard Name” entry (the UISceneStoryboardFile key) from the Application Scene Manifest, and that’s about it.

Setting the layout

Set up ViewController():

This view controller will be our entry point for all others and will have four buttons:

  • Face Mask will open FaceMaskViewController(), which will show a face mesh using ARKit.
  • Face Detection will open FaceDetectionViewController(), which will count the number of faces in the capture session.
  • Face Classification will open FaceClassificationViewController(), which will try to recognize my face and output “unknown” if I’m not in the capture session.
  • Face Object Detection will open ObjectDetectionViewController(), which detects whether I’m in the capture session and tries to draw a bounding box around my face.

Here are the four buttons:

    let faceMask: BtnPleinLarge = {
        let button = BtnPleinLarge()
        button.translatesAutoresizingMaskIntoConstraints = false
        button.addTarget(self, action: #selector(buttonToFaceMask(_:)), for: .touchUpInside)
        button.setTitle("Face mask", for: .normal)
        let icon = UIImage(systemName: "eye")?.resized(newSize: CGSize(width: 50, height: 30))
        button.addRightImage(image: icon!, offset: 30)
        button.backgroundColor = .systemGreen
        button.layer.borderColor = UIColor.systemGreen.cgColor
        button.layer.shadowOpacity = 0.3
        button.layer.shadowColor = UIColor.systemGreen.cgColor
        
        return button
    }()
    
    let faceDetection: BtnPleinLarge = {
        let button = BtnPleinLarge()
        button.translatesAutoresizingMaskIntoConstraints = false
        button.addTarget(self, action: #selector(buttonToFaceDetection(_:)), for: .touchUpInside)
        button.setTitle("Face detection", for: .normal)
        let icon = UIImage(systemName: "person.3.fill")?.resized(newSize: CGSize(width: 50, height: 25))
        button.addRightImage(image: icon!, offset: 30)
        button.backgroundColor = .systemOrange
        button.layer.borderColor = UIColor.systemOrange.cgColor
        button.layer.shadowOpacity = 0.3
        button.layer.shadowColor = UIColor.systemOrange.cgColor
        
        return button
    }()
    
    let faceClassification: BtnPleinLarge = {
        let button = BtnPleinLarge()
        button.translatesAutoresizingMaskIntoConstraints = false
        button.addTarget(self, action: #selector(buttonToFaceClassification(_:)), for: .touchUpInside)
        button.setTitle("Face classification", for: .normal)
        let icon = UIImage(systemName: "tray.fill")?.resized(newSize: CGSize(width: 50, height: 40))
        button.addRightImage(image: icon!, offset: 30)
        button.backgroundColor = .systemBlue
        button.layer.borderColor = UIColor.systemBlue.cgColor
        button.layer.shadowOpacity = 0.3
        button.layer.shadowColor = UIColor.systemBlue.cgColor
        
        return button
    }()
    
    let objectDetection: BtnPleinLarge = {
        let button = BtnPleinLarge()
        button.translatesAutoresizingMaskIntoConstraints = false
        button.addTarget(self, action: #selector(buttonToObjectDetection(_:)), for: .touchUpInside)
        button.setTitle("Object detection", for: .normal)
        let icon = UIImage(systemName: "crop")?.resized(newSize: CGSize(width: 50, height: 50))
        button.addRightImage(image: icon!, offset: 30)
        button.backgroundColor = .systemPurple
        button.layer.borderColor = UIColor.systemPurple.cgColor
        button.layer.shadowOpacity = 0.3
        button.layer.shadowColor = UIColor.systemPurple.cgColor
        
        return button       
    }()

Set up the layout and add the buttons as subviews:

    private func setupButtons() {
        
        view.addSubview(faceMask)
        view.addSubview(faceDetection)
        view.addSubview(faceClassification)
        view.addSubview(objectDetection)
        
        faceMask.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        faceMask.widthAnchor.constraint(equalToConstant: view.frame.width - 40).isActive = true
        faceMask.heightAnchor.constraint(equalToConstant: 70).isActive = true
        faceMask.centerYAnchor.constraint(equalTo: view.centerYAnchor).isActive = true
        
        faceDetection.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        faceDetection.widthAnchor.constraint(equalToConstant: view.frame.width - 40).isActive = true
        faceDetection.heightAnchor.constraint(equalToConstant: 70).isActive = true
        faceDetection.topAnchor.constraint(equalTo: faceMask.bottomAnchor, constant: 30).isActive = true
        
        faceClassification.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        faceClassification.widthAnchor.constraint(equalToConstant: view.frame.width - 40).isActive = true
        faceClassification.heightAnchor.constraint(equalToConstant: 70).isActive = true
        faceClassification.topAnchor.constraint(equalTo: faceDetection.bottomAnchor, constant: 30).isActive = true
        
        objectDetection.centerXAnchor.constraint(equalTo: view.centerXAnchor).isActive = true
        objectDetection.widthAnchor.constraint(equalToConstant: view.frame.width - 40).isActive = true
        objectDetection.heightAnchor.constraint(equalToConstant: 70).isActive = true
        objectDetection.topAnchor.constraint(equalTo: faceClassification.bottomAnchor, constant: 30).isActive = true
    }

Each button has a selector that will instantiate a ViewController and present it. The same logic applies to all buttons:

  @objc func buttonToFaceMask(_ sender: BtnPleinLarge) {
        
        let controller = FaceMaskViewController()

        let navController = UINavigationController(rootViewController: controller)
        
        self.present(navController, animated: true, completion: nil)
    }

Now we’re all set up and can start configuring each controller with a capture session and its face logic.

Face Mask

Instantiate a Scene View

We’ll use ARKit and instantiate an ARSCNView that automatically renders the live video feed from the device camera as the scene background. It also automatically moves the SceneKit camera to match the real-world movement of the device, which means that we don’t need an anchor to track the positions of objects we add to the scene.

The scene view needs to have the whole screen bounds for the camera session:

let sceneView = ARSCNView(frame: UIScreen.main.bounds)

Start an ARFaceTrackingConfiguration session, which is an AR session that detects the user’s face (if visible in the front-facing camera image) and adds to its list of anchors an ARFaceAnchor object representing the face.

Here’s how we set up the viewDidLoad() function:

override func viewDidLoad() {
    super.viewDidLoad()
    self.view.addSubview(sceneView)
    sceneView.delegate = self
    guard ARFaceTrackingConfiguration.isSupported else { return }
    let configuration = ARFaceTrackingConfiguration()
    configuration.isLightEstimationEnabled = true
    sceneView.session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
    setupTabBar()
}
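One thing the snippet above doesn’t show: it’s good practice to pause the AR session when the controller goes off screen so the camera and face tracking stop consuming resources. A minimal sketch (not part of the original snippet):

// Not in the original code: stop the AR session when leaving the screen.
override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    sceneView.session.pause()
}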

I’m using an extension to conform to ARSCNViewDelegate and render the ARSCNFaceGeometry. I also added a delegate method that updates the face geometry in real time:

extension FaceMaskViewController: ARSCNViewDelegate {
    
    func renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor) -> SCNNode? {
        
        guard let device = sceneView.device else {
            return nil
        }
        
        let faceGeometry = ARSCNFaceGeometry(device: device)
        let node = SCNNode(geometry: faceGeometry)
        node.geometry?.firstMaterial?.fillMode = .lines
        
        return node
    }
    
    func renderer(_ renderer: SCNSceneRenderer, didUpdate node: SCNNode, for anchor: ARAnchor) {
        
        guard let faceAnchor = anchor as? ARFaceAnchor,
            let faceGeometry = node.geometry as? ARSCNFaceGeometry else {
                return
        }
        faceGeometry.update(from: faceAnchor.geometry)
    }
}

Final Result:

Face geometry can be extremely helpful if you want to put objects on a specific region of the face, say makeup on the nose. In that case, you can look up the right vertices in the face anchor’s ARFaceGeometry and attach a node there, as sketched below.
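Here’s a rough sketch of that idea (not from the project’s code): inside the didUpdate delegate method, attach a small marker node to a single vertex of the tracked mesh. The vertex index below is an arbitrary placeholder, so inspect ARFaceGeometry.vertices to find the region you actually want:

// Hypothetical variant of the didUpdate callback: pin a small sphere to one
// vertex of the face mesh. Index 9 is an assumption, not a documented landmark.
func renderer(_ renderer: SCNSceneRenderer, didUpdate node: SCNNode, for anchor: ARAnchor) {
    guard let faceAnchor = anchor as? ARFaceAnchor,
        let faceGeometry = node.geometry as? ARSCNFaceGeometry else { return }
    faceGeometry.update(from: faceAnchor.geometry)

    // Reuse the marker node if it already exists; otherwise create it once.
    let marker: SCNNode
    if let existing = node.childNode(withName: "marker", recursively: false) {
        marker = existing
    } else {
        let sphere = SCNSphere(radius: 0.005)
        sphere.firstMaterial?.diffuse.contents = UIColor.systemRed
        marker = SCNNode(geometry: sphere)
        marker.name = "marker"
        node.addChildNode(marker)
    }

    // Mesh vertices are expressed in the face anchor's (and node's) local space.
    marker.simdPosition = faceAnchor.geometry.vertices[9]
}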

Face Detection

Setting up the capture session

Nothing fancy here: we need to set up an AVCaptureSession and add a preview layer as a sublayer of the view’s layer. (If you haven’t already, add an NSCameraUsageDescription entry to Info.plist, or the app will crash when it asks for camera permission.)

fileprivate func setupCamera() {
    let captureSession = AVCaptureSession()
    captureSession.sessionPreset = .high
    
    guard let captureDevice = AVCaptureDevice.default(for: .video) else { return }
    guard let input = try? AVCaptureDeviceInput(device: captureDevice) else { return }
    captureSession.addInput(input)
    
    captureSession.startRunning()
    
    let previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    view.layer.addSublayer(previewLayer)
    previewLayer.frame = view.frame
    
    let dataOutput = AVCaptureVideoDataOutput()
    dataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
    captureSession.addOutput(dataOutput)
}
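Note that AVCaptureDevice.default(for: .video) hands back the default (rear) camera. If you’d rather detect faces with the front camera, you can swap the device lookup, for example:

// Variation on the snippet above: explicitly request the front-facing camera.
guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                  for: .video,
                                                  position: .front) else { return }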

Updating the face count

Face detection is part of the Vision framework, and it’s very fast and pretty accurate. A VNDetectFaceRectanglesRequest returns an array of VNFaceObservation objects, one per detected face, each carrying a normalized bounding box. To get the face count, we only need to count the elements in that array and update the label accordingly:

 func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        
    guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    let request = VNDetectFaceRectanglesRequest { (req, err) in
        
        if let err = err {
            print("Failed to detect faces:", err)
            return
        }
        DispatchQueue.main.async {
            if let results = req.results {
                self.numberOfFaces.text = "\(results.count) face(s)"
            }
        }
    }
    
    DispatchQueue.global(qos: .userInteractive).async {
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        do {
            try handler.perform([request])
        } catch let reqErr {
            print("Failed to perform request:", reqErr)
        }
    }
}

The label needs to be updated on the main thread; otherwise the app will crash (I found that out the hard way).

Final result

The API returns an array of observations containing the detected faces as rectangular bounding boxes, but here I chose to only count the number of faces.
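If you do want to draw those rectangles instead of just counting them, one approach (not part of this project’s code) is to convert each observation’s normalized boundingBox into view coordinates and add a border layer. A rough sketch, assuming the preview layer fills the whole view:

// Hypothetical helper: draw one red border layer per detected face.
// Vision's boundingBox is normalized (0...1) with a lower-left origin, so it
// has to be flipped vertically and scaled up. Call this on the main thread.
func drawBoxes(for observations: [VNFaceObservation]) {
    // Clear the boxes from the previous frame.
    view.layer.sublayers?.filter { $0.name == "faceBox" }.forEach { $0.removeFromSuperlayer() }

    for observation in observations {
        let box = observation.boundingBox
        let rect = CGRect(x: box.minX * view.bounds.width,
                          y: (1 - box.maxY) * view.bounds.height,
                          width: box.width * view.bounds.width,
                          height: box.height * view.bounds.height)

        let layer = CALayer()
        layer.name = "faceBox"
        layer.frame = rect
        layer.borderColor = UIColor.systemRed.cgColor
        layer.borderWidth = 2
        view.layer.addSublayer(layer)
    }
}

For a more exact mapping, you’d go through AVCaptureVideoPreviewLayer’s layerRectConverted(fromMetadataOutputRect:) instead of this manual scaling, since the preview layer may crop the camera frame.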

Face Classification

Setting up Create ML

As of WWDC 2019, Create ML is a standalone macOS application, which makes it much easier to use. You can create multiple kinds of models, ranging from text classifiers to image classifiers.

The image classifier needs a training folder containing at least two subfolders, one for each class we want to predict.

The idea is to create one folder named after me containing portraits of my face, and another named “unknown” containing portraits of other people or even random images. The second folder is what teaches the model to differentiate between the classes, i.e., that the person in front of the camera isn’t me.

I would recommend pushing the number of iterations well above the default of 20. You can also use augmentation settings (crop, blur, etc.) to improve the training process, especially if you don’t have many captured images to work with.
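If you’d rather script this step than click through the Create ML app, the CreateML framework on macOS exposes the same workflow. Here’s a minimal sketch with placeholder paths, using the default training parameters (you can pass an MLImageClassifier.ModelParameters value to raise the iteration count or enable augmentations):

// macOS-only sketch (e.g. a Swift playground); the paths are placeholders.
import CreateML
import Foundation

// Expects one subfolder per class, e.g. Faces/Omar and Faces/unknown.
let trainingDir = URL(fileURLWithPath: "/path/to/Faces")

// Train with the default parameters.
let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

// Export the model so it can be dragged into the Xcode project.
try classifier.write(to: URL(fileURLWithPath: "/path/to/FaceRecognition.mlmodel"))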

Setting up the capture session

We do this the same way as we did with the FaceDetectionViewController():

fileprivate func setupCamera() {
    let captureSession = AVCaptureSession()
    captureSession.sessionPreset = .high
    
    guard let captureDevice = AVCaptureDevice.default(for: .video) else { return }
    guard let input = try? AVCaptureDeviceInput(device: captureDevice) else { return }
    captureSession.addInput(input)
    
    captureSession.startRunning()
    
    let previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    view.layer.addSublayer(previewLayer)
    previewLayer.frame = view.frame
    
    let dataOutput = AVCaptureVideoDataOutput()
    dataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
    captureSession.addOutput(dataOutput)
}

Setting up the model and updating the label

In the captureOutput() delegate method, we grab the CVPixelBuffer and instantiate the model, FaceRecognition(). We then feed every frame to the model and update the label with the top identifier, which corresponds to the best prediction:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        
    guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    guard let model = try? VNCoreMLModel(for: FaceRecognition().model) else {
                fatalError("Unable to load model")
            }
            
    let coreMlRequest = VNCoreMLRequest(model: model) {[weak self] request, error in
        guard let results = request.results as? [VNClassificationObservation],
            let topResult = results.first
            else {
                fatalError("Unexpected results")
        }
        DispatchQueue.main.async {[weak self] in
            self?.label.text = topResult.identifier
        }
    }
    
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    DispatchQueue.global().async {
        do {
            try handler.perform([coreMlRequest])
        } catch {
            print(error)
        }
    }
}
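One refinement you might add here (it’s not in the code above) is to fall back to “unknown” whenever the top prediction’s confidence is low, instead of blindly trusting the first label. The 0.8 threshold below is an arbitrary value to tune:

// Inside the VNCoreMLRequest completion handler: only trust confident predictions.
DispatchQueue.main.async { [weak self] in
    self?.label.text = topResult.confidence > 0.8 ? topResult.identifier : "unknown"
}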

Final result

Face Object Detection

Data preparation

Here, you’ll need a few hundred images in order to build an accurate model that will predict a bounding box.

In order to train the model, we need to annotate every single image (I know, it’s tedious, but it’s necessary). For the annotation part, we’ll use Sebastian G. Perez’s GitHub repository: a simple Flask application that handles image annotation and generates a .csv file with 7 columns, formatted so it can easily be used with Turi Create for model training. To understand a bit more about how this works, it’s pretty similar to a project I worked on for license plate recognition.

Training

You can use Create ML or Turi Create with Python. I chose the latter because it gives me more control and I can run it in a Google Colab notebook. Model training is very similar to what I did in the license plate recognition article, so feel free to use the same code and extract the .mlmodel file.

Setting up the capture session

We need a live camera preview so we can run inference on every frame and draw a bounding box. We’re going to take extra care in this step not to trigger a memory leak, because processing every frame, making a prediction, and drawing a box is resource-intensive and can easily crash the application:

    var videoCapture: VideoCapture!
    let semaphore = DispatchSemaphore(value: 1)
    
    let videoPreview: UIView = {
       let view = UIView()
        view.translatesAutoresizingMaskIntoConstraints = false
        return view
    }()
    
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        self.videoCapture.start()
    }
    
    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        self.videoCapture.stop()
    }
    
    // MARK: - SetUp Camera preview
    func setUpCamera() {
        videoCapture = VideoCapture()
        videoCapture.delegate = self
        videoCapture.fps = 30
        videoCapture.setUp(sessionPreset: .vga640x480) { success in
            
            if success {
                if let previewLayer = self.videoCapture.previewLayer {
                    self.videoPreview.layer.addSublayer(previewLayer)
                    self.resizePreviewLayer()
                }
                self.videoCapture.start()
            }
        }
    }
    
extension ObjectDetectionViewController: VideoCaptureDelegate {
    func videoCapture(_ capture: VideoCapture, didCaptureVideoFrame pixelBuffer: CVPixelBuffer?, timestamp: CMTime) {
        if !self.isInferencing, let pixelBuffer = pixelBuffer {
            self.isInferencing = true
            self.predictUsingVision(pixelBuffer: pixelBuffer)
        }
    }
}

Here’s an explanation:

  • Create a UIView instance to host the VideoCapture preview layer.
  • Start the VideoCapture in the viewWillAppear lifecycle method.
  • Stop the VideoCapture in the viewWillDisappear lifecycle method.
  • Set up the VideoCapture with the number of frames per second, as well as the video quality. I’d recommend VGA 640×480 to get a steady frames-per-second (FPS) rate.
  • Conform the view controller to VideoCaptureDelegate so it receives each frame (the VideoCapture helper wraps AVCaptureVideoDataOutputSampleBufferDelegate internally).
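The setUpCamera() snippet above also calls resizePreviewLayer(), which isn’t shown; it’s just a small helper that keeps the preview layer matched to its hosting view, something like:

// Assumed helper: keep the capture preview layer sized to the videoPreview view.
func resizePreviewLayer() {
    videoCapture.previewLayer?.frame = videoPreview.bounds
}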

Setting up the model and drawing the bounding boxes

Now that the preview is set up, we need to instantiate the Core ML model and start predicting:

func setUpModel() {
    if let visionModel = try? VNCoreMLModel(for: model_face_omar_turi().model) {
        self.visionModel = visionModel
        request = VNCoreMLRequest(model: visionModel, completionHandler: visionRequestDidComplete)
        request?.imageCropAndScaleOption = .scaleFill
    } else {
        fatalError("fail to create vision model")
    }
}

The model here is called model_face_omar_turi. That only sets up the Vision request, though; we still need to trigger the model by feeding it the frames. Because the model was exported with Turi Create, Vision returns its predictions as VNRecognizedObjectObservation objects, each with a label and a normalized bounding box. The final steps are to predict, parse those observations, and draw a box around the face.

extension ObjectDetectionViewController: VideoCaptureDelegate {
    func videoCapture(_ capture: VideoCapture, didCaptureVideoFrame pixelBuffer: CVPixelBuffer?, timestamp: CMTime) {
        // the captured image from camera is contained on pixelBuffer
        if !self.isInferencing, let pixelBuffer = pixelBuffer {
            self.isInferencing = true
            // predict!
            self.predictUsingVision(pixelBuffer: pixelBuffer)
        }
    }
}

extension ObjectDetectionViewController {
    func predictUsingVision(pixelBuffer: CVPixelBuffer) {
        guard let request = request else { fatalError() }
        // vision framework configures the input size of image following our model's input configuration automatically which is 416X416
        self.semaphore.wait()
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
        try? handler.perform([request])
    }
    
    // MARK: - Post-processing
    func visionRequestDidComplete(request: VNRequest, error: Error?) {
        if let predictions = request.results as? [VNRecognizedObjectObservation] {
            DispatchQueue.main.async {
                self.BoundingBoxView.predictedObjects = predictions
                self.isInferencing = false
            }
        } else {
            
            self.isInferencing = false
        }
        self.semaphore.signal()
    }
}
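Both extensions rely on an isInferencing flag that isn’t declared in the snippets above. It’s just a Bool property on the controller that, together with the semaphore, makes sure only one Vision request is in flight at a time:

// Assumed property on ObjectDetectionViewController: true while a frame is being
// processed, so incoming frames are dropped instead of piling up.
var isInferencing = false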

Final result

Conclusion

With Apple’s built-in Vision API, processing images and creating powerful models has never been easier. While more powerful facial recognition models can be built with frameworks like PyTorch or TensorFlow, those models tend to lack the attributes needed to run on device, such as small size and fast inference, and as such they’re much harder to fit on mobile devices.

Moving forward, you can improve these models by either improving your dataset or creating your own network and converting it using coremltools.

The most important aspect of working with native APIs is that, most of the time, they’re optimized to work across a wide range of devices and are accelerated by dedicated hardware like Apple’s Neural Engine.

Thank you for reading this article. If you have any questions, don’t hesitate to send me an email at [email protected].

Here’s the full code, available on my GitHub page.


