Revolutionizing Home Automation: Gesture control for Analog settings


Smart speakers are taking over the world and automating our homes. You tell your smart speaker to turn devices on and off, close the blinds, and so on. Voice commands work amazingly well for digital controls, where you want something in a particular state, but what if you also want to quantify that state (an analog value)? Think of scenarios where you want to adjust an analog setting: changing the volume, the temperature, the brightness, and so on. This is where voice-based systems get tricky, and the existing implementations feel unusable and unnatural. In this post, I explore gesture control as a solution for controlling analog settings around me using simple and intuitive gestures.

Controlling Analog settings: Before vs Now

In the non-smart world, you would use a physical controller to reach the analog setting you want, typically through a mechanism to increase or decrease it. We have all used remote controls and hardware such as knobs to control devices around us.

The smart-home world has made some progress towards controlling analog settings. Some devices can be controlled using smartphone apps with sliders, which resemble the good old knob-based control mechanism. Alexa lets you set the speaker volume to a value on a scale of 1 to 10.

 

It is a no-brainer that the smart solutions are not so smart! Specifying an absolute value for an analog setting is an unnatural way of controlling things; we never tell a TV to set the volume to 58! App-based controls solve the problem of reading the current setting and nudging it relatively higher or lower, but they tie us to yet another kind of remote. This is the gap I want to bridge: there is no natural method for controlling analog settings in the smart-home world.

Setting our goals

The way I approach this problem is to ask: what is so convenient about the hardware solutions? What are the most important goals to hold onto when building a system to control analog settings?

If we think a little about remotes and knobs (and even sliders), two properties stand out. First, we are looking at a system where the absolute value isn’t important at all; what matters is where we want to take the value from there, i.e. the relative value. For instance, we either increase or decrease the volume using a remote or a knob. The other interesting property is immediate feedback: we see whether changing the setting produced the desired result, and if not, we try a different value.

If we keep those as the primary motivators of our modelling, it is fairly clear that voice-based interfaces are not going to fit this use case at all, given that they are much slower and require a longer feedback loop. At the same time, I want to add a few more requirements that a smart solution should definitely meet:

  • Picking up an additional sensor or a device like a smartphone is no better than picking up a remote control.
  • Make the control as natural as possible for humans. I personally dislike how a lot of “gesture recognition” today requires you to remember gestures that achieve a particular outcome. Technology should be shaped by simplicity of use; it shouldn’t change our behaviours: no strange hand signals, no code language. Take the example of smart-home speakers: they don’t ask me to talk in a coded language of instructions, but accept my commands in plain English. (They can definitely do much better, but the only thing I had to memorize was a wake word.)
  • Latency: As mentioned earlier, I rely on a feedback mechanism to determine whether the end state is the desired one or not. That feedback should be immediate, allowing me to take a different action.
  • Accuracy: Here’s where I believe it is okay to make a tradeoff. Analog control does not need to be accurate; it doesn’t matter if I signal to reduce the volume by 3 points and it gets reduced by 4, as long as the volume is going down. Again, what’s important is the way humans think about and modify analog controls: in terms of relative changes.

Gesture control for volume

I picked a gesture control system as the solution, since I feel gestures are a great interface for meeting the above requirements. Enough talking, let’s see it in action!

How does it work?

Here’s the workflow:

  • Announce to Alexa that I am trying to control a device.
  • Use hand movements as gestures to control the device.
  • Have realtime feedback on whether the output is desired or should be modified.

Here’s a complete diagram of how the system works end to end. It might seem daunting at first, given that there are a lot of components to deal with, but don’t worry: I will cover each of them in detail in the following sections, starting with the most interesting ones first:

Workflow of how Analog Gesture control works

Detecting and interpreting gestures

The coolest part of the project comes first! Which gestures to use and why? How do we interpret gestures and convert them into a number that can control the volume?

Choosing the hardware

The hardware is the most important aspect, since it will influence all of the implementation going forward. Before I begin, one completely different way to solve this problem might be to start with a Microsoft Kinect sensor, which was a pioneer of gesture recognition research way before all the advancements in this field. It uses a fundamentally different approach from the one in this post, relying on multiple sensors to get 3D data, which allows easy skeletal detection.

I wanted to start with a more basic camera rather than a Kinect. After spending some time researching which camera to pick, I decided that a security camera was the best fit and purchased an Amcrest ProHD camera. Here are some general advantages you get from a security camera:

  • Much higher resolution than the now defunct USB webcams. Amcrest ProHD: 3 MP
  • Designed to cover a wide viewing angle. Amcrest ProHD: 90°
  • Self-contained with LAN / WiFi connectivity. I don’t need my camera to be anywhere near the server which is processing the images.
  • The most important reason to use a security camera is the ability to work in the dark. They are equipped with an IR LED grid, allowing them to keep capturing video even when it is dark. Of course, for my application of controlling my projector volume, this becomes critical.

Linux + Nvidia + Deep learning

My code runs on a home server running PopOS, a Linux-based OS. The reason for choosing this particular OS is how nicely it works out of the box with Nvidia drivers, especially on laptops which have both an Nvidia and an integrated GPU. If you are adventurous, you can look into configuring Nvidia drivers yourself, something that I have tried and given up on.

The body pose detection is done using the revolutionary OpenPose project developed at CMU, which in my opinion has dethroned Kinect, the king of pose detection. It has made it much simpler to work with traditional cameras to detect gestures and poses. Copying an excerpt from the GitHub page recognizing the authors:

“It is authored by Gines Hidalgo, Zhe Cao, Tomas Simon, Shih-En Wei, Hanbyul Joo, and Yaser Sheikh. Currently, it is being maintained by Gines Hidalgo and Yaadhav Raaj. In addition, OpenPose would not be possible without the CMU Panoptic Studio dataset. We would also like to thank all the people who helped OpenPose in any way.”

The gesture recognition part of our code simply reads frames from the camera’s RTSP stream and sends them to OpenPose, which detects all the body’s keypoints for us using a pre-trained model.
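To make that concrete, here is a minimal sketch of the frame-grabbing loop, assuming an Amcrest-style RTSP URL (substitute your own credentials and IP). The detect_pose and handle_keypoints helpers are hypothetical placeholders; one possible detect_pose is sketched in the next section.

```python
import cv2

# Hypothetical Amcrest-style RTSP URL -- substitute your own credentials and IP.
RTSP_URL = "rtsp://user:password@192.168.1.50:554/cam/realmonitor?channel=1&subtype=0"

cap = cv2.VideoCapture(RTSP_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open the RTSP stream")

while True:
    ok, frame = cap.read()            # grab the next frame from the stream
    if not ok:
        break                         # stream dropped; reconnection logic would go here
    keypoints = detect_pose(frame)    # hypothetical helper, sketched in the next section
    handle_keypoints(keypoints)       # hypothetical helper: turn keypoints into actions
```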

Faster detection

The other major kudos goes to the folks at OpenCV for integrating a CUDA (cuDNN) backend into the OpenCV DNN module, giving a massive speedup that makes realtime body pose detection possible. This was a GSoC 2019 project by Yashas Samaga B L (mentored by Davis King).

The detection time using OpenPose goes down from 4 s on my Intel integrated GPU to 0.2 s on my old Nvidia 650M. This incredible speedup is the reason we are able to achieve a near-realtime feedback loop. It also becomes a limitation: you can’t host this on AWS Lambda or any cheap server, since having a GPU is a major part of the solution.

Don’t have a GPU? Don’t be disheartened. From my experiments, the OpenPose network is extremely powerful. Even at reduced settings, with a CNN input size of 100×100 instead of the default 384×384 and the input image scaled to 480 px height instead of 720 px, I wasn’t able to observe any degradation in detection performance. This should bring down the execution time on a CPU / EC2 significantly.
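For reference, here is a hedged sketch of how the OpenPose Caffe model can be run through OpenCV’s DNN module with the CUDA backend and the reduced resolutions discussed above. The model file paths are assumptions based on a standard OpenPose download, and this is one possible detect_pose, not the exact code from the repository.

```python
import cv2

# Standard OpenPose BODY_25 model files -- assumed paths, adjust to your install.
PROTO = "models/pose/body_25/pose_deploy.prototxt"
WEIGHTS = "models/pose/body_25/pose_iter_584000.caffemodel"

net = cv2.dnn.readNetFromCaffe(PROTO, WEIGHTS)
# CUDA backend contributed during GSoC 2019 (needs OpenCV >= 4.2 built with CUDA support).
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

def detect_pose(frame, net_size=(100, 100), target_height=480):
    """Run the pose network on a downscaled frame and return its raw output
    (keypoint heatmaps), from which keypoint coordinates are extracted."""
    scale = target_height / frame.shape[0]
    small = cv2.resize(frame, None, fx=scale, fy=scale)
    blob = cv2.dnn.blobFromImage(small, 1.0 / 255, net_size, (0, 0, 0),
                                 swapRB=False, crop=False)
    net.setInput(blob)
    return net.forward()
```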

Converting a gesture into an analog value

This sounds like a fairly challenging part: how do you take a body pose and convert it into an absolute value? How do I create a knob-like control using my body pose? Of course, there are multiple solutions here with no single right answer. I tried to solve it in a simple way given the principles we discussed above: humans interpret relative values much better than absolute ones, and accuracy isn’t that important.

Based on that principle, I use my right hand as an indicator of whether I want to increase or decrease the volume, with the position of my hand signifying how much to change it. For reference, I use my left and right shoulders to determine the magnitude of the desired volume change. The formal definitions look far more complex than they are; in practice the gesture is intuitive and natural for a human to perform.

The mathematics of implementing this is trivial and needs only a basic knowledge of vectors (a code sketch follows the list below).

  • We have 3 points of interest from the pose retrieved from OpenPose: the left shoulder(LSho), right shoulder(RSho) and the right wrist(RWr)
  • Define 2 vectors: \overrightarrow{R} (from RSho to RWr) and \overrightarrow{Sho} (from RSho to LSho)
  • The projection of \overrightarrow{R} on \overrightarrow{Sho} gives the magnitude of change that the user wants to indicate, with the value ranging from 0 to |\overrightarrow{Sho}|. We further normalize it by dividing it by |\overrightarrow{Sho}| to get a value ranging from 0 to 1. Finally subtract the obtained value from 0.5 to get a linear fitting with intercept at the origin as shown below.
    normalizedDelta = 0.5 - \frac{\overrightarrow{R} \cdot \overrightarrow{Sho}}{|\overrightarrow{Sho}|^2}
  • We only want to consider scenarios where the user’s wrist is above shoulder level; otherwise we don’t treat it as a valid gesture. This can be computed from the cross product of the same two vectors (treated as a scalar in 2D) by checking its sign.
  • gestureConsidered = {\overrightarrow{R} \times \overrightarrow{Sho}} > 0
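Here is a minimal NumPy sketch of the math above. The keypoint arguments are (x, y) pairs from the pose detector, and the sign of the 2D cross-product check depends on the image coordinate system, so treat the comparison direction as an assumption.

```python
import numpy as np

def analog_delta(l_sho, r_sho, r_wr):
    """Map the right-wrist position to a volume delta in roughly [-0.5, 0.5].

    l_sho, r_sho, r_wr are (x, y) keypoints from the pose detector.
    Returns None when the gesture should not be considered.
    """
    R = np.asarray(r_wr, dtype=float) - np.asarray(r_sho, dtype=float)     # RSho -> RWr
    Sho = np.asarray(l_sho, dtype=float) - np.asarray(r_sho, dtype=float)  # RSho -> LSho

    # 2D cross product treated as a scalar; its sign tells which side of the
    # shoulder line the wrist is on (i.e. whether the wrist is raised).
    if R[0] * Sho[1] - R[1] * Sho[0] <= 0:
        return None

    # Projection of R on Sho normalized to [0, 1], then centred around zero.
    return 0.5 - np.dot(R, Sho) / np.dot(Sho, Sho)
```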

If you are thinking that OpenPose sounds like overkill for detecting just these 3 points given the simplistic solution, you are probably right! However, there is no reason to limit ourselves to the way I have used the information from OpenPose; I intend to explore more intuitive gestures for controlling the devices around me.

See it in action in both light and dark modes, which clearly highlight the advantage of using a security camera.

Controlling hardware using detected gestures

In this section, I focus on the specifics of how to control my Vankyo V600 projector volume using the analog value extracted from the above gesture. If you missed it, I covered the electronics in a previous post, where I talked about how to use Alexa voice commands to switch the device on and off.

Here are the IR codes for volume control on the V600 projector. To learn how to capture these for your own device, go through the previous post. All the signals are in the NEC format.

  • Volume up: 0x00FF31CE
  • Volume down: 0x00FF39C6

The ESP code is a simple HTTP server which accepts a volumeDelta parameter and returns a success response after sending the corresponding IR signals to the device. We make REST calls from Python to the ESP endpoint. The ESP uses the ESPWebServer library to run the server and the IRSend library to transmit the signals through the IR LED to the projector.
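On the Python side, the call to the ESP is a plain HTTP request. Here is a minimal sketch; the ESP’s address and path are assumptions, while the volumeDelta parameter is the one described above.

```python
import requests

ESP_ENDPOINT = "http://192.168.1.60/volume"   # assumed local address and path of the ESP server

def send_volume_delta(delta, timeout=1.0):
    """Forward the computed delta to the ESP, which turns it into the
    corresponding NEC volume-up / volume-down IR signals."""
    try:
        resp = requests.get(ESP_ENDPOINT, params={"volumeDelta": delta}, timeout=timeout)
        return resp.ok
    except requests.RequestException:
        return False   # ESP unreachable or the call timed out; drop this command
```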

There are a lot of other devices in my home which can be controlled using the same set of gestures which I plan to support sometime in the future:

  • Smart light brightness: Unfortunately, a lot of smart-home APIs are non-standardized, so it is not easy to control a device unless the provider publishes an open API for it. This will hopefully become easier with the upcoming Zigbee standardization for IoT devices.
  • Fan control: I don’t have a smart fan, so I plan to hack together a simple fan control using a triac circuit.

Communication between detection and control layers

One aspect common to both the Detection and the Control layers is that each can be represented as a system which consumes information from a producer and processes it.

  • Detection: Camera produces frames and they are consumed by OpenPose
  • Control: Our Detection setup produces values and they are consumed by our ESP volume control circuit

We need some sort of serialization in processing: there has to be an order in which both layers execute. This is usually done with a queue which processes information in FIFO order. Only one operation is performed at a time on both the detection and the control layers.

Between the Detection and the Control layers, we have a queue whose entries are dispatched as HTTP REST calls. The queue ensures that only a single command is being sent to the Control layer at any given time.

Slow consumer problem

Since a major goal is low latency, it is important to ensure that we are processing information that is as fresh as possible. In this scenario, we have a “slow consumer” problem in both places. The camera can publish frames at probably 30 FPS on the RTSP stream (I haven’t tested), but given the time OpenPose takes, we can’t process that many frames. Similarly, OpenPose can process a frame in hundreds of milliseconds, but sending IR signals reliably needs a wait of around a second, otherwise some commands get dropped.

Solving the problem comes down to tuning the queue size to balance two things: ensuring that the freshest possible information is sent to the next layer, while keeping the queue large enough that the consumer doesn’t end up waiting for the producer.

On the control layer, since we own both the producer and the consumer, we also add a timeout of 1 s to discard stale actions. The most likely scenario is that you put your hand down to stop detection, but by then at least one frame has already been processed and queued, and it would otherwise get acted on even though it was captured a while ago.
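Putting the two ideas together, here is a sketch of the hand-off, assuming a queue size of 2 and the 1 s staleness timeout; send_volume_delta is the hypothetical helper from the earlier sketch.

```python
import queue
import time

# A small bounded queue keeps only the freshest detections. The maxsize is a
# tuning knob (assumed to be 2 here), and STALE_AFTER mirrors the 1 s timeout
# used on the control layer.
commands = queue.Queue(maxsize=2)
STALE_AFTER = 1.0  # seconds

def produce(delta):
    """Called by the detection layer for every processed frame."""
    try:
        commands.put_nowait((time.time(), delta))
    except queue.Full:
        try:
            commands.get_nowait()  # drop the oldest entry so the newest gesture wins
        except queue.Empty:
            pass
        commands.put_nowait((time.time(), delta))

def consume():
    """Control-layer loop: forwards fresh deltas to the ESP, discards stale ones."""
    while True:
        produced_at, delta = commands.get()   # blocks until a command arrives
        if time.time() - produced_at > STALE_AFTER:
            continue                          # the hand already went down; skip it
        send_volume_delta(delta)              # hypothetical REST helper from earlier
        time.sleep(1.0)                       # IR signals need ~1 s between sends
```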

Can we speed up communication?

Before getting into optimization, I should call out that I don’t yet know which paths in the system are the slowest and need optimizing; that can only be determined once some instrumentation is added on top of the entire system.

I will just point out that the communication between the Detection and the Control layers has a few potential alternatives which might be better suited:

  • MQTT is designed to be a low-overhead machine-to-machine protocol. MQTT 5 also deals with the slow consumer problem inherently and is designed to handle exactly the problem we are solving with queues (a sketch follows this list).
  • If latency is more important than dropped information, we could use UDP as the communication backend, which might lead to a more realtime experience.
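For illustration, here is a hedged sketch of what the MQTT alternative could look like, using the paho-mqtt 1.x client API. The broker address and topic name are assumptions, and send_volume_delta is the same hypothetical helper as before.

```python
import json
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

BROKER = "192.168.1.10"          # assumed address of a broker on the home server
TOPIC = "gesture/volumeDelta"    # hypothetical topic name

# Detection side: fire-and-forget publish of each computed delta (QoS 0).
def publish_delta(delta):
    publish.single(TOPIC, json.dumps({"delta": delta}), hostname=BROKER)

# Control side: react to deltas as they arrive.
def on_message(client, userdata, msg):
    delta = json.loads(msg.payload)["delta"]
    send_volume_delta(delta)     # same hypothetical helper as in the earlier sketch

subscriber = mqtt.Client()       # paho-mqtt 1.x constructor
subscriber.on_message = on_message
subscriber.connect(BROKER)
subscriber.subscribe(TOPIC)
subscriber.loop_forever()
```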

Here’s the complete workflow for detection and device control:

Gesture control: Sequence diagram showing workflow of Gesture Detection and Volume Control


Alexa skill for control

This step is not strictly a requirement for the gesture control system, but from my perspective it rounds off the entire setup. Announcing to Alexa that I want to control a particular device seems like a very small price to pay for the advantages I get out of it:

  • No stray movements get detected as gestures. This is already unlikely, given that we only consider a gesture valid when the right wrist is above shoulder height.
  • Saved compute resources. For the remaining 23 hours and 58 minutes of the day, when you are not trying to control any devices, it would be wasteful to keep the detection algorithm running.

For me, it does not feel unnatural to have Alexa as an intermediary, but some of you might find it an annoying extra step; feel free to try skipping it and let me know in the comments whether that still works for you!

Sequence diagram explaining Custom Alexa skill working for Gesture control


Developing and Hosting the Alexa skill

I wrote a simple custom Alexa skill running on Flask and hosted it on my local server. The goal of the skill is simple: handle the intent where I say I want to control the volume and start the gesture recognition step, and likewise turn gesture recognition off when I ask it to. The starting and stopping is done using the Python subprocess module, which allows you to spawn and kill other processes from within Python.
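Here is a simplified sketch of what such a handler could look like. The intent names, script name and route are assumptions, and Alexa request signature verification is omitted for brevity.

```python
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)
detector = None  # handle to the gesture-recognition child process


def alexa_response(text):
    """Build a minimal Alexa JSON response."""
    return jsonify({
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    })


@app.route("/alexa", methods=["POST"])
def handle_skill():
    global detector
    # Assumes an IntentRequest; LaunchRequest and signature checks omitted for brevity.
    intent = request.json["request"]["intent"]["name"]
    if intent == "StartGestureControlIntent" and detector is None:
        detector = subprocess.Popen(["python3", "gesture_detection.py"])  # hypothetical script
        return alexa_response("Gesture control started")
    if intent == "StopGestureControlIntent" and detector is not None:
        detector.terminate()  # kill the detection process
        detector = None
        return alexa_response("Gesture control stopped")
    return alexa_response("Sorry, I did not understand that")


if __name__ == "__main__":
    app.run(port=5000)
```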

Bringing my computer on the internet

Since my ISP uses a NAT, my computer isn’t directly accessible from the internet. I use ngrok, which acts as an amazing free tunnel proxy, to make my server reachable by my Alexa skill.

Privacy and Security

Security is an extremely important aspect in this project and that should be visible from my design choices. None of the data leaves the confines of the internal home network, and the only externally exposed mechanism is the Alexa skill handler.

Amcrest supports webcam access over the internet through Amcrest servers by default, but I would recommend disabling it. Similarly, although a lot of fantastic and cheap security cameras are available in the market today, I would suggest staying away from them unless they let you disable all proprietary protocols and communication. Some folks have been able to hack these cameras and install open-source software on them, but I couldn’t find a supported camera available in the Indian market.

Can this become the future in our smart homes?

I believe that gesture control is the best mechanism for solving this problem, and I hope that this gets adopted to make our lives easier!

There are a few challenges in bringing something like this to the average consumer. For nearly everybody, it is not feasible to afford dedicated GPU hardware for this kind of processing. With the amazing breakthroughs on both the hardware and the software side, I would say it is only a matter of time before this becomes really easy to achieve. The other major limitation is privacy: a lot of us might be uncomfortable letting a company watch us in our homes! Maybe GPUs integrated into the cameras themselves would allow companies to do the detection and processing locally? All of us already carry a fairly capable GPU in our pockets in the form of a mobile phone; could they be used as our personal GPU systems to ensure privacy without requiring additional hardware?

Until these challenges are solved, those of you who are ready to get your hands dirty can find all the code in my GitHub repository: analog-settings-gesture-control

 
