This position article is cited from T. Huang, “Surveillance Video: The Biggest Big Data,” Computing Now, vol. 7, no. 2, Feb. 2014, IEEE Computer Society [online]; http://www.computer.org/portal/web/computingnow/archive/february2014.
Big data continues to grow exponentially, and surveillance video has become the largest source. Against that backdrop, this issue of Computing Now presents five articles from the IEEE Computer Society Digital Library focused on research activities related to surveillance video. It also includes some related references on how to compress and analyze the huge amount of video data that’s being generated.
Surveillance Video in the Digital Universe
In recent years, more and more video cameras have been appearing throughout our surroundings, including surveillance cameras in elevators, ATMs, and the walls of office buildings, as well as those along roadsides for traffic-violation detection, cameras for caring for kids or seniors, and those embedded in laptops and on the front and back sides of mobile phones. All of these cameras are capturing huge amounts of video and feeding it into cyberspace daily. For example, a city such as Beijing or London has about one million cameras deployed. Now consider that these cameras capture more in one hour than all the TV programs in the archives of the British Broadcasting Corporation (BBC) or China Central Television (CCTV). According to the International Data Corporation’s recent report, “The Digital Universe in 2020,” half of global big data — the valuable matter for analysis in the digital universe — was surveillance video in 2012, and the percentage is set to increase to 65 percent by 2015.
To understand about R&D activities related to video surveillance, I searched the keywords video and surveillance in IEEE Xplore (within metadata only) and the IEEE CSDL (by exact phrase). The search results showed 6,832 (in Xplore) and 3,111 (in CS Digital Library) related papers published in IEEE conferences, journals, or magazines. Figure 1 shows the annual histogram of these publications. Obviously, the sharp increasing in the past ten years indicates that research on surveillance video is very active.
Surveillance-video big data introduces many technological challenges, including compression, storage, transmission, analysis, and recognition. Among these, the two most critical challenges are how to efficiently transmit and store the huge amount of data, and how to intelligently analyze and understand the visual information inside.
Higher-efficiency video compression technology is urgently needed to reduce the storage and transmission cost of big surveillance data. The state-of-the-art High Efficiency Video Coding (HEVC) standard, featured in the October 2013 CN theme, can compress a video to about 3 percent of its original data size. In other words, HEVC doubles the data compression ratio of the H.264/MPEG-4 AVC approved in 2003. In fact, the latter doubled the ratio of the previous-generation standards MPEG-2/H.262, which were approved in 1993. Despite these advances, this doubling of video-compression performance every ten years is too slow to keep pace with the growth of surveillance video in our physical world, which is now doubling every two years, on average!
To achieve a higher compression ratio, the unique characteristics of surveillance video must be factored into the design of new video-encoding standards. Unlike standard video, for instance, surveillance footage is usually captured in a specific place day after day, or even month after month. Yet, previous standards fail to account for the specific residuals that exist in surveillance video (for example, unchanging backgrounds or foreground objects that appear many times). The new IEEE std 1857, entitled Standard for Advanced Audio and Video Coding, contains a surveillance profile that can further remove background residuals. The profile doubles the AVC/H.264 compression ratio with even lower complexity. In “IEEE 1857 Standard Empowering Smart Video Surveillance Systems,” Wen Gao, our colleagues, and I present an overview of the standard, highlighting its background-model-based coding technology and recognition-friendly functionalities. The new approach is also employed to enhance HEVC/H.265 and nearly double its performance as well. (Additional technical details can be found in “Background-Modeling Based Adaptive Prediction for Surveillance Video Coding,” which is available to subscribers via IEEE Xplore.)
Much like the physical universe, the vast majority of the digital universe is so-called digital dark matter — it’s there, but what we know about it is very limited. According to the IDC report I mentioned earlier, 23 percent of the information in the digital universe would be useful for big data if it were tagged and analyzed. Yet, technology is far from where it needs to be, and in practice, only 3 percent of potentially useful data is tagged — and even less is currently being analyzed. In fact, people, vehicles, and other moving objects appearing in millions of cameras will be a rich source for machine analysis to understand the complicated society and world. As guest editor Dorée Duncan Seligmann discussed in CN’s April 2012 theme, video is even more challenging than other data types for automatic analysis and understanding. This month we add three articles on the topic that have been published since then.
Human beings are generally the major objects of interest in surveillance video analysis. In the best paper from the 2013 IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), “Reference-Based Person Re-identification” (available to IEEE Xplore subscribers), Le An and his colleagues propose a reference-based method for learning a subspace in which the correlations among reference data from different cameras are maximized. From there, the system can identify people who are present in different camera views with significant illumination changes.
Human behavior analysis is the next step for deeper understanding. Shuiwang Ji and colleagues’ “3D Convolutional Neural Networks for Human Action Recognition” introduces the deep learning underlying human-action recognition. The proposed 3D convolutional neural networks model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent video frames. Experiments conducted using airport videos achieved superior performance compared to baseline methods.
In “Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes,” Christian Wojek and his colleagues present a novel probabilistic 3D scene model that integrates geometric 3D reasoning with state-of-the-art multiclass object detection, object tracking, and scene labeling. This model uses inference to jointly recover the 3D scene context and perform 3D multi-object tracking, using only monocular video as input. The article includes an evaluation of several challenging sequences captured by onboard cameras, which illustrate that the approach shows substantial improvement over the current state of the art in 3D multiperson tracking and multiclass 3D tracking of cars and trucks on a challenging data set.
Toward a Scene Video Age
This month’s theme also includes a video from John Roese, the CTO of EMC Corp., with his technical insight on this topic.
Much like surveillance, the number of videos captured in classrooms, courts, and other site-specific cases is increasing quickly as well. This is the prelude to a “scene video” age in which most videos will be captured from specific scenes. In the near future, these pervasive cameras will cover all the spaces the human race is able to reach.
In this new age, the ‘scene’ will become the bridge to connect video coding and computer vision research. Modeling these scenes could facilitate further video compression as demonstrated by the IEEE 1857 standard. And then, with the assistance of such scene models encoded in the video stream, the foreground-object detection, tracking, and recognition becomes less difficult. In this sense, the massive growth in surveillance and other kinds of scene video presents big challenges, as well as big opportunities, for the video- and vision-related research communities.
In 2015, the IEEE Computer Society’s Technical Committee on Multimedia Computing (TCMC) and Technical Committee on Semantic Computing (TCSEM) will jointly sponsor the first IEEE International Conference on Multimedia Big Data, a world premier forum of leading scholars in the highly active multimedia big data research, development and applications. Interested readers are welcome to join us at this new conference next spring in Beijing for more discussion on the rapidly growing multimedia big data.