A video codec designed for AI analysis

Although the techno-thriller The Circle (2017) is more a commentary on the ethical implications of social media than on the practicalities of external video analysis, the improbable little “SeeChange” camera at the center of the plot is what really pushes the film into “science fiction” territory.

The “SeeChange” camera/surveillance device from the techno-thriller “The Circle” (2017).

A wireless, free-roaming device the size of a large marble, SeeChange is an unlikely prospect not because of its lack of solar panels or the inefficiency of drawing power from other ambient sources (such as radio waves), but because it would have to compress video 24/7, no matter what load it is able to sustain.

Powering inexpensive sensors of this type is a critical area of research in computer vision (CV) and video analytics, especially in non-urban environments where the sensor will need to squeeze maximum performance from very limited energy resources (batteries, solar, etc.).

In cases where a peripheral IoT/CV device of this type needs to send image content to a central server (often via conventional cellular networks), the choices are difficult: either the device runs some kind of lightweight neural network locally in order to send only optimized, relevant data segments for server-side processing, or it sends “dumb” video for the connected cloud resources to evaluate.

Although motion activation via event-based smart vision sensors (SVS) can reduce those overheads, the activation monitoring itself also costs energy.

Clinging to power

Moreover, even with infrequent activation (i.e. a sheep pops up occasionally), the device doesn’t have enough power to send gigabytes of uncompressed video; nor does it have enough power to continuously run popular video compression codecs such as H.264/5, which expect hardware that is either plugged in or never far from its next charging session.


Video analysis pipelines for three typical computer vision tasks. The video encoding architecture must be trained for the task at hand, and generally for the neural network that will receive the data. Source: https://arxiv.org/pdf/2204.12534.pdf

Although the widely distributed H.264 codec consumes less power than its successor, it has lower compression efficiency; H.265 compresses more efficiently, but draws more power. While Google’s open source VP9 codec beats them both on each count, it requires greater local computational resources, which presents additional problems for a supposedly cheap IoT sensor.

As for local stream analysis: by the time you have run even the most lightweight local neural network to determine which frames (or areas of a frame) are worth sending to the server, you have often spent the power you would have saved by simply sending all the frames.


Extraction of masked representations of livestock with a sensor that is unlikely to be connected to the network. Should it devote its limited power capacity to local semantic segmentation with a lightweight neural network; send limited information to a server for further instructions (introducing latency); or send “dumb” data (wasting energy on bandwidth)? Source: https://arxiv.org/pdf/1807.01972.pdf

It is clear that “in the wild” computer vision projects need dedicated video compression codecs optimized for specific neural network requirements across diverse tasks such as semantic segmentation, keypoint estimation (human motion analysis) and object detection, among other possible end uses.

If you can strike the perfect balance between video compression efficiency and minimal data transmission, you’re one step closer to SeeChange and the ability to deploy affordable sensor networks in harsh environments.

AccMPEG

New research from the University of Chicago may have taken a step closer to such a codec, in the form of AccMPEG – a new video encoding and streaming framework that operates at low latency, delivers high accuracy for server-side deep neural networks (DNNs), and has remarkably low local computational requirements.


Architecture of AccMPEG. Source: https://arxiv.org/pdf/2204.12534.pdf

The system is able to economize over previous methods by evaluating how each 16x16px macroblock might affect server-side DNN accuracy. Previous methods, by contrast, typically had to assess this kind of accuracy for every pixel in an image, or perform electrically expensive local operations to determine which regions of the image might be of most interest.

In AccMPEG, this accuracy is estimated by a custom module called AccGrad, which measures how much the encoding quality of a macroblock is likely to matter to the end use case, such as a server-side DNN trying to count people, perform skeletal estimation of human motion, or carry out other common computer vision tasks.
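As a rough illustration of the idea (a sketch, not the authors’ implementation), a per-macroblock importance map can be approximated by pooling the magnitude of a DNN’s input gradient over 16x16 windows; the classifier, frame size and “sum of outputs” saliency proxy below are assumptions.

```python
# A minimal sketch, not AccMPEG's code: approximate per-macroblock importance
# by pooling the magnitude of the server-side DNN's input gradient over
# 16x16 windows. The classifier, frame size and "sum of outputs" proxy are
# assumptions for illustration only.
import torch
import torch.nn.functional as F
import torchvision

MB = 16  # macroblock size in pixels

def macroblock_saliency(frame: torch.Tensor, dnn: torch.nn.Module) -> torch.Tensor:
    """frame: (1, 3, H, W) in [0, 1]; returns an (H//MB, W//MB) importance map."""
    frame = frame.clone().requires_grad_(True)
    out = dnn(frame)
    out.sum().backward()                                     # scalar proxy for task accuracy
    grad_mag = frame.grad.abs().sum(dim=1, keepdim=True)     # (1, 1, H, W)
    return F.avg_pool2d(grad_mag, kernel_size=MB).squeeze()  # (H/MB, W/MB)

# Off-the-shelf classifier standing in for the server-side DNN.
dnn = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
importance = macroblock_saliency(torch.rand(1, 3, 480, 640), dnn)
print(importance.shape)  # torch.Size([30, 40]) -> one score per macroblock
```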

When a video frame comes into the system, AccMPEG first processes it through a cheap quality-picker model called AccModel. Any areas that are unlikely to contribute to the useful calculations of a server-side DNN are essentially ballast and are marked for encoding at the lowest possible quality, unlike salient regions, which are sent at higher quality.
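To make the “ballast vs. salient” split concrete, here is a hedged sketch that converts a per-macroblock importance map into a two-level quality assignment; the QP values and the 20% keep-fraction are assumptions, not AccMPEG’s published settings.

```python
# Illustrative only: map a per-macroblock importance array to a two-level
# quality assignment, lowest quality for "ballast" blocks and higher quality
# for salient ones. The QP values and the 20% keep-fraction are assumptions.
import numpy as np

QP_LOW_QUALITY = 40   # coarse quantization for unimportant blocks (assumed)
QP_HIGH_QUALITY = 24  # finer quantization for salient blocks (assumed)

def quality_map(importance: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
    """Mark the top `keep_fraction` of macroblocks for high-quality encoding."""
    threshold = np.quantile(importance, 1.0 - keep_fraction)
    return np.where(importance >= threshold, QP_HIGH_QUALITY, QP_LOW_QUALITY)

qp_map = quality_map(np.random.rand(30, 40))
print((qp_map == QP_HIGH_QUALITY).mean())  # fraction of blocks kept at higher quality (~0.2)
```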

This process presents three challenges: Can the process run fast enough to achieve acceptable latency without using power-intensive local compute resources? Can we establish an optimal relationship between frame rate and quality? And can a model be quickly trained for an individual server-side DNN?

Training logistics

Ideally, a computer vision codec would be pre-trained on systems tuned to the exact requirements of a specific neural network. The AccGrad module, however, can be derived directly from a DNN with only two forward propagations, cutting the standard overhead roughly tenfold.

AccMPEG trains AccGrad for only 15 epochs of three propagations each through the final DNN, and can potentially be retrained “live”, using its current model state as a template, at least for CV tasks of the same specification.
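A hedged sketch of what such a lightweight training loop could look like: per-macroblock targets are derived from two forward passes of the final DNN (one on a high-quality encode, one on a low-quality encode of each frame), and the cheap quality picker is fitted to them for a handful of epochs. The names quality_picker and server_dnn, the loss, and the sensitivity proxy are illustrative assumptions, not the paper’s exact recipe.

```python
# Hedged sketch of a lightweight training loop along these lines; the helper
# names, the loss and the sensitivity proxy are assumptions, not the paper's.
import torch
import torch.nn.functional as F

MB = 16  # macroblock size in pixels

def accgrad_targets(hq, lq, server_dnn):
    """Two forward passes only: how much does degrading the frame move the DNN output?"""
    with torch.no_grad():
        sensitivity = (server_dnn(hq) - server_dnn(lq)).abs().mean()   # scalar proxy
    pixel_diff = (hq - lq).abs().sum(dim=1, keepdim=True)              # where encodes differ
    return sensitivity * F.avg_pool2d(pixel_diff, MB)                  # (N, 1, H/MB, W/MB)

def train(quality_picker, server_dnn, loader, epochs=15):
    opt = torch.optim.Adam(quality_picker.parameters(), lr=1e-4)
    for _ in range(epochs):
        for hq_frames, lq_frames in loader:            # paired encodes of the same frames
            target = accgrad_targets(hq_frames, lq_frames, server_dnn)
            pred = quality_picker(hq_frames)           # per-macroblock importance prediction
            loss = F.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```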

AccModel uses the MobileNet-SSD pre-trained feature extractor common in affordable edge devices. At 12 GFLOPs, the model uses only a third of the compute of typical ResNet18-based approaches. Apart from batch normalization and activation layers, the architecture consists only of convolutional layers, and its computational overhead is proportional to the frame size.
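For illustration, a quality picker along these lines might look like the following PyTorch sketch, using torchvision’s MobileNetV2 features as a stand-in for the MobileNet-SSD extractor mentioned above; the head layers and output resolution are assumptions.

```python
# A rough PyTorch sketch of such a quality picker, using torchvision's
# MobileNetV2 features as a stand-in for the MobileNet-SSD extractor named
# above; the head layers and output resolution are assumptions.
import torch
import torch.nn as nn
import torchvision

class QualityPicker(nn.Module):
    def __init__(self):
        super().__init__()
        # MobileNet features downsample by 32; macroblocks are 16px, so upsample 2x at the end.
        self.backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
        self.head = nn.Sequential(
            nn.Conv2d(1280, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, x):
        scores = self.head(self.backbone(x))                          # (N, 1, H/32, W/32)
        scores = nn.functional.interpolate(scores, scale_factor=2.0)  # (N, 1, H/16, W/16)
        return torch.sigmoid(scores)                                  # per-macroblock score

picker = QualityPicker().eval()
print(picker(torch.rand(1, 3, 480, 640)).shape)  # torch.Size([1, 1, 30, 40])
```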


AccGrad removes the need for back end DNN inference, improving deployment logistics.

Frame rate

The architecture operates optimally at 10 fps, which would make it suitable for purposes such as agricultural monitoring, building-degradation monitoring, high-visibility traffic analysis, and representative skeletal inference of human movement; however, very fast-paced scenarios, such as low-visibility traffic (of cars or people) and other situations where high frame rates are beneficial, are not a good fit for this approach.

Part of the method’s frugality lies in the premise that adjacent macroblocks are likely to have similar value, up to the point where a macroblock falls below the estimated accuracy threshold. The regions obtained this way are more clearly delimited and can be computed more quickly.
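A toy illustration of that premise (not the authors’ code): threshold the per-macroblock scores, then dilate the resulting mask so that high-quality regions form clean, contiguous clusters of blocks rather than scattered singletons.

```python
# Toy illustration: threshold per-macroblock scores, then dilate the mask so
# high-quality regions form contiguous clusters rather than isolated blocks.
import numpy as np
from scipy.ndimage import binary_dilation

def clean_block_mask(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """scores: (H/16, W/16) in [0, 1]; returns a boolean high-quality mask."""
    mask = scores > threshold
    # Grow each selected block to include its immediate neighbours.
    return binary_dilation(mask, structure=np.ones((3, 3), dtype=bool))

mask = clean_block_mask(np.random.rand(30, 40))
print(mask.sum(), "of", mask.size, "macroblocks marked for high-quality encoding")
```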

Performance improvement

The researchers tested the system on a $60 Jetson Nano board with a single 128-core Maxwell GPU, as well as various other cheap equivalents. OpenVINO was used to offload some of the very scarce local DNN compute requirements to the CPU.
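For context, offloading a model to the CPU with OpenVINO’s Python runtime typically looks like the minimal sketch below (post-2022 openvino.runtime API); the model file is a hypothetical exported IR, not the researchers’ deployment script.

```python
# A minimal sketch of CPU offload with OpenVINO's Python runtime (post-2022
# `openvino.runtime` API). "quality_picker.xml" is a hypothetical exported IR
# model, not the researchers' deployment artifact.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("quality_picker.xml")             # hypothetical IR model
compiled = core.compile_model(model, device_name="CPU")   # run on the CPU, sparing the GPU
output_layer = compiled.output(0)

frame = np.random.rand(1, 3, 480, 640).astype(np.float32)
scores = compiled([frame])[output_layer]                  # e.g. one score per 16x16 macroblock
print(scores.shape)
```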

AccModel itself was originally trained offline on a server with 8 GeForce RTX 2080S GPUs. While this is a substantial amount of computational power for an initial model build, the light retraining the system makes possible, and the way a model can be tuned to certain tolerance settings across different DNNs tackling similar tasks, mean that AccMPEG can be part of a system that requires minimal presence in the wild.

First published May 1, 2022.
