More and more, organizations are integrating machine learning (ML) into their apps to solve their most pressing business problems. From powering self-driving cars to conducting sophisticated document analysis, machine learning enables your application to process and interact with the world to make predictions and inform decisions.

But if you’ve never worked with pre-trained machine learning models before, knowing where to start can be intimidating. So let’s begin with a game so simple that even children can play it (and often do) – Rock, Paper, Scissors. By walking through the basics of using ML to bring this playground classic to life, you’ll start to see how to apply the same techniques in your more sophisticated business applications.

Creating a Plan

In our hypothetical application, we’re going to create a Rock, Paper, Scissors game that a user can play against the computer. The player will choose their option and show it to the computer’s camera. At the same time, the computer will play an option at random and detect the player’s gesture, allowing it to determine the winner of the match and keep score.

Just like any application, we need a plan of what to build before we put everything together. To build this application, we’ll need to:

  • Capture the camera feed
  • Teach the app to find the hand and fingers in the image
  • Teach the app to identify the gesture
  • Compute the result and display the score

I’ll build this application using the OutSystems platform, so if you want to follow this tutorial step-by-step, you can sign up for a free edition here.

Now, let’s get started!

Step 1. Capturing Video

We can capture the camera feed using the browser’s MediaStream API. To do so, we add a video HTML element to our document object model (DOM). Once we obtain the stream from the camera, we set it as the video element’s source object.
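
If you want a concrete picture of this step, here’s a minimal sketch in plain JavaScript; the element id "camera-feed" is just an illustrative name, not something OutSystems requires.

```javascript
// Minimal sketch: grab the user's camera and pipe it into a <video> element.
// Assumes <video id="camera-feed" autoplay playsinline></video> exists in the DOM.
const video = document.getElementById("camera-feed");

async function startCamera() {
  // Request a video-only media stream from the browser.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });

  // Set the stream as the video element's source so the feed starts rendering.
  video.srcObject = stream;
  await video.play();
}

startCamera();
```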


Finding Hands and Fingers in an Image

Detecting fingers or a hand gesture in an image is too complex for conventional, hand-coded algorithms, especially with real-time video. That’s where ML comes in. In this step, we’ll use supervised learning, training the model with samples that have already been labeled. For our game, we’ll train the model with a set of images showing hands in different configurations, each classified as rock, paper, or scissors.


To train the model, we’ll select a learning algorithm; there are several options, so we’ll want to take the time to understand the differences between each to find the right one for the job. We’ll then supply the model with the training data it needs to detect patterns.

The training data must contain a correct answer, which is called the “target attribute.” After it processes the training data and target attributes, we’ll present the model with new images that it hasn’t seen before to see if it can determine the gesture. The more images we give the model to train with, the better our model will be.

Create an ML Model From Scratch or Use an Existing One

Should you create a model from scratch or use an existing one? If the problem you’re solving is very specific, you’ll need to create your own model; often, though, the community has already created and shared models that solve the same problem you’re trying to solve.

For our game, we found an excellent pre-trained model for hand detection and gesture recognition: MediaPipe HandPose, a lightweight machine learning pipeline consisting of a palm detector and a hand-skeleton finger-tracking model. This makes hand gesture recognition much simpler to solve.

To integrate the pre-trained model in OutSystems, we’re going to use TensorFlow.js, a library for machine learning in JavaScript. TensorFlow’s object detection API can identify which objects from an established set are potentially present in an image.

Using our camera feed, we then run HandPose frame by frame so it can estimate the position of our fingers and the base of our palm. In addition, we need to define the backend that TensorFlow.js will use to do the calculations and run the machine learning model.
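
As a rough sketch of how those pieces fit together (not the exact OutSystems integration), the loop can look like this with TensorFlow.js and the HandPose model; the detectHands helper and the requestAnimationFrame loop are illustrative choices on my part.

```javascript
import * as tf from "@tensorflow/tfjs";
import * as handpose from "@tensorflow-models/handpose";

async function detectHands(video) {
  // Define the backend TensorFlow.js uses for its calculations (WebGL runs on the GPU).
  await tf.setBackend("webgl");
  await tf.ready();

  // Load the pre-trained HandPose model.
  const model = await handpose.load();

  async function onFrame() {
    // Estimate hand landmarks for the current video frame. Each prediction
    // includes 21 landmarks plus named annotations such as palmBase,
    // indexFinger, and pinky.
    const predictions = await model.estimateHands(video);
    if (predictions.length > 0) {
      // e.g. predictions[0].annotations.indexFinger -> four [x, y, z] points
    }
    requestAnimationFrame(onFrame);
  }

  onFrame();
}
```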

Identifying the Gestures

Now that we can identify the hand, it’s time to detect the gestures. To make things simple, we only need to detect two fingers on the hand to identify the gesture – the index finger and the pinky.


Depending on whether these two fingers are up or down, the app can tell which gesture is being shown. To account for differences in hand size and in the hand’s position relative to the camera, the app compares two points per finger by their distance from the base of the palm: if the fingertip is further from the palm base than the lower knuckle, the finger is stretched; if it’s closer, the finger is curled.

After we find what position the fingers are in, it’s easy to determine what gesture has been played:

  • Index stretched + pinky stretched = Paper
  • Index stretched + pinky curled = Scissors
  • Index curled + pinky curled = Rock
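
Here’s a minimal sketch of that rule, assuming HandPose-style annotations where each finger is an array of [x, y, z] landmarks and palmBase holds a single point; the helper names isStretched and classifyGesture are mine.

```javascript
// Distance between two landmarks in the image plane.
function distance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// A finger counts as stretched when its tip is further from the palm base
// than its lower knuckle; otherwise we treat it as curled.
function isStretched(finger, palmBase) {
  const knuckle = finger[0];
  const tip = finger[finger.length - 1];
  return distance(tip, palmBase) > distance(knuckle, palmBase);
}

// Map the two finger states to rock, paper, or scissors.
function classifyGesture(annotations) {
  const palm = annotations.palmBase[0];
  const indexUp = isStretched(annotations.indexFinger, palm);
  const pinkyUp = isStretched(annotations.pinky, palm);

  if (indexUp && pinkyUp) return "paper";
  if (indexUp) return "scissors";
  return "rock";
}
```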

Compute the Result and Display the Score

Once gesture detection is in place, computing the result is straightforward: in the OutSystems logic, the application compares what was played by the user with what was played by the CPU opponent.

For example, if the user plays rock and the computer plays paper, it’s a win for the CPU. If the computer plays scissors instead, it’s a loss for the CPU. If they both play rock, it’s a tie. A counter for each outcome (win, loss, or tie) keeps the score, which is then displayed.
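
For illustration, here’s what that comparison and scoring can look like as a small JavaScript sketch (the real application keeps this logic in OutSystems, and the score here is tracked from the player’s point of view).

```javascript
// Which gesture each gesture beats.
const beats = { rock: "scissors", paper: "rock", scissors: "paper" };

// Running totals for the player's wins, losses, and ties.
const score = { win: 0, loss: 0, tie: 0 };

function playRound(player, cpu) {
  let result;
  if (player === cpu) {
    result = "tie";
  } else if (beats[player] === cpu) {
    result = "win";  // the player's gesture beats the CPU's
  } else {
    result = "loss"; // the CPU's gesture beats the player's
  }
  score[result] += 1;
  return result;
}

playRound("rock", "paper"); // "loss" for the player, a win for the CPU
```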