Surveillance Video: The Biggest Big Data

This position article is reproduced from T. Huang, “Surveillance Video: The Biggest Big Data,” Computing Now, vol. 7, no. 2, Feb. 2014, IEEE Computer Society [online].


Big data continues to grow exponentially, and surveillance video has become the largest source. Against that backdrop, this issue of Computing Now presents five articles from the IEEE Computer Society Digital Library focused on research activities related to surveillance video. It also includes some related references on how to compress and analyze the huge amount of video data that’s being generated.

Surveillance Video in the Digital Universe

In recent years, more and more video cameras have been appearing throughout our surroundings: surveillance cameras in elevators, in ATMs, and on the walls of office buildings, as well as cameras along roadsides for traffic-violation detection, cameras for watching over kids or seniors, and cameras embedded in laptops and on the front and back of mobile phones. All of these cameras are capturing huge amounts of video and feeding it into cyberspace daily. For example, a city such as Beijing or London has about one million cameras deployed. These cameras capture more in one hour than all the TV programs in the archives of the British Broadcasting Corporation (BBC) or China Central Television (CCTV). According to the International Data Corporation’s recent report, “The Digital Universe in 2020,” half of global big data — the valuable matter for analysis in the digital universe — was surveillance video in 2012, and the percentage is set to increase to 65 percent by 2015.

To gauge R&D activity related to video surveillance, I searched for the keywords video and surveillance in IEEE Xplore (within metadata only) and the IEEE CSDL (by exact phrase). The searches returned 6,832 (Xplore) and 3,111 (CS Digital Library) related papers published in IEEE conferences, journals, or magazines. Figure 1 shows the annual histogram of these publications. The sharp increase over the past ten years indicates that research on surveillance video is very active.

Figure 1. Histogram of publications in the IEEE Computer Society Digital Library and IEEE Xplore whose metadata contains the keywords video and surveillance. Note: “~1989” aggregates all articles up to 1989. The numbers for 2013 may still increase, as some articles are waiting to be archived into the databases.

Theme Articles

Surveillance-video big data introduces many technological challenges, including compression, storage, transmission, analysis, and recognition. Among these, the two most critical challenges are how to efficiently transmit and store the huge amount of data, and how to intelligently analyze and understand the visual information inside.

Higher-efficiency video compression technology is urgently needed to reduce the storage and transmission cost of big surveillance data. The state-of-the-art High Efficiency Video Coding (HEVC) standard, featured in the October 2013 CN theme, can compress a video to about 3 percent of its original data size. In other words, HEVC doubles the data compression ratio of the H.264/MPEG-4 AVC approved in 2003. In fact, the latter doubled the ratio of the previous-generation standards MPEG-2/H.262, which were approved in 1993. Despite these advances, this doubling of video-compression performance every ten years is too slow to keep pace with the growth of surveillance video in our physical world, which is now doubling every two years, on average!
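The mismatch between those two growth rates can be made concrete with a back-of-the-envelope calculation (illustrative only, using the doubling periods quoted above — compression efficiency doubling every ten years, raw surveillance data doubling every two):

```python
# Back-of-the-envelope comparison of the two growth rates cited in the text:
# compression ratio doubles every 10 years, raw surveillance data every 2 years.

def growth(initial, doubling_period_years, years):
    """Exponential growth: value after `years`, doubling every `doubling_period_years`."""
    return initial * 2 ** (years / doubling_period_years)

years = 10
raw_data = growth(1.0, 2, years)        # raw video volume (normalized to 1 today)
compression = growth(1.0, 10, years)    # achievable compression ratio (normalized)
stored = raw_data / compression         # storage actually needed

print(f"After {years} years: raw data x{raw_data:.0f}, "
      f"compression x{compression:.0f}, storage still grows x{stored:.0f}")
```

Even granting a generation of new codecs, storage demand still grows sixteenfold over the decade, which is the gap the article is pointing at.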

To achieve a higher compression ratio, the unique characteristics of surveillance video must be factored into the design of new video-encoding standards. Unlike standard video, for instance, surveillance footage is usually captured in a specific place day after day, or even month after month. Yet previous standards fail to exploit the specific redundancies that exist in surveillance video (for example, unchanging backgrounds or foreground objects that appear many times). The new IEEE Std 1857, Standard for Advanced Audio and Video Coding, contains a surveillance profile that can further remove background residuals. The profile doubles the AVC/H.264 compression ratio with even lower complexity. In “IEEE 1857 Standard Empowering Smart Video Surveillance Systems,” Wen Gao and I, together with our colleagues, present an overview of the standard, highlighting its background-model-based coding technology and recognition-friendly functionalities. The new approach has also been employed to enhance HEVC/H.265 and nearly double its performance as well. (Additional technical details can be found in “Background-Modeling Based Adaptive Prediction for Surveillance Video Coding,” which is available to subscribers via IEEE Xplore.)
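The intuition behind background-model-based prediction can be sketched with a toy NumPy example (my own illustration, not the IEEE 1857 algorithm): for a static-camera clip, a per-pixel temporal-median background model predicts almost every pixel exactly, so the residual left to encode is confined to the few foreground pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "surveillance" clip: a fixed background plus a small moving foreground block.
T, H, W = 20, 32, 32
background = rng.integers(0, 256, (H, W)).astype(np.int16)
frames = np.repeat(background[None], T, axis=0)
for t in range(T):
    frames[t, 10:14, t:t + 4] = 255  # a 4x4 bright object sliding across the scene

# Background model: per-pixel temporal median over the clip. Each pixel is
# covered by the moving object in only a few frames, so the median recovers
# the true background exactly.
bg_model = np.median(frames, axis=0).astype(np.int16)

# Signal energy with no prediction vs. residual against the background model.
raw_energy = np.abs(frames).mean()
residual = np.abs(frames - bg_model)
print(f"mean magnitude, raw pixels: {raw_energy:.1f}")
print(f"mean magnitude, residual:   {residual.mean():.2f}")
print(f"pixels with nonzero residual: {100 * (residual > 0).mean():.2f}%")
```

Only the foreground pixels (under 2 percent of the clip here) produce a nonzero residual, which is why a good background model leaves so little left to code.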

Much like the physical universe, the vast majority of the digital universe is so-called digital dark matter — it’s there, but what we know about it is very limited. According to the IDC report I mentioned earlier, 23 percent of the information in the digital universe would be useful for big data if it were tagged and analyzed. Yet, technology is far from where it needs to be, and in practice, only 3 percent of potentially useful data is tagged — and even less is currently being analyzed. In fact, people, vehicles, and other moving objects appearing in millions of cameras will be a rich source for machine analysis to understand the complicated society and world. As guest editor Dorée Duncan Seligmann discussed in CN’s April 2012 theme, video is even more challenging than other data types for automatic analysis and understanding. This month we add three articles on the topic that have been published since then.

Human beings are generally the major objects of interest in surveillance video analysis. In the best paper from the 2013 IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), “Reference-Based Person Re-identification” (available to IEEE Xplore subscribers), Le An and his colleagues propose a reference-based method for learning a subspace in which the correlations among reference data from different cameras are maximized. From there, the system can identify people who are present in different camera views with significant illumination changes.
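The general flavor of such cross-camera subspace learning can be sketched with classic canonical correlation analysis (CCA); this is a minimal illustration of the idea of maximizing correlations between views, not the reference-based method of the paper, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the re-identification setting: features of the same people
# observed by two cameras — a shared latent identity signal plus camera noise.
n, d = 200, 10
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, d)) + 0.1 * rng.normal(size=(n, d))  # camera A
Y = latent @ rng.normal(size=(2, d)) + 0.1 * rng.normal(size=(n, d))  # camera B

def cca_first_correlation(X, Y, reg=1e-6):
    """Top canonical correlation between the two views (classic CCA via SVD)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / (len(X) - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (len(Y) - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (len(X) - 1)
    # Whiten each view with the inverse Cholesky factor; the singular values
    # of the whitened cross-covariance are the canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    s = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
    return s[0]

print(f"top canonical correlation: {cca_first_correlation(X, Y):.3f}")
```

Because both views share the same latent signal, the learned subspace aligns them with a correlation close to 1; in the re-identification setting this alignment is what lets descriptors from different cameras be compared despite illumination changes.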

Human behavior analysis is the next step for deeper understanding. Shuiwang Ji and colleagues’ “3D Convolutional Neural Networks for Human Action Recognition” introduces the deep learning underlying human-action recognition. The proposed 3D convolutional neural networks model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent video frames. Experiments conducted using airport videos achieved superior performance compared to baseline methods.
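The core operation in that model, convolving a kernel across time as well as space, can be illustrated with a minimal NumPy sketch (shapes and kernel values are arbitrary choices of mine, not those of the paper):

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid 3D convolution of a (T, H, W) clip with a (t, h, w) kernel.
    Sliding the kernel along the time axis as well as the spatial axes lets
    the output respond to motion across adjacent frames, not just texture."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# A temporal-difference kernel: responds only where pixels change between frames.
kernel = np.zeros((2, 3, 3))
kernel[0, 1, 1], kernel[1, 1, 1] = -1.0, 1.0

clip = np.zeros((4, 8, 8))
clip[2, 4, 4] = 1.0                 # a pixel "turns on" in frame 2
response = conv3d(clip, kernel)
print(response.shape)               # (3, 6, 6): the time axis is convolved too
```

A purely spatial (2D) kernel applied frame by frame would see identical texture before and after the change; the 3D kernel fires exactly at the moment the pixel changes, which is the motion information the paper's features capture.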

In “Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes,” Christian Wojek and his colleagues present a novel probabilistic 3D scene model that integrates geometric 3D reasoning with state-of-the-art multiclass object detection, object tracking, and scene labeling. The model performs joint inference to recover the 3D scene context and track multiple objects in 3D, using only monocular video as input. An evaluation on several challenging sequences captured by onboard cameras shows substantial improvement over the state of the art in both 3D multiperson tracking and multiclass 3D tracking of cars and trucks.

Toward a Scene Video Age

This month’s theme also includes a video from John Roese, the CTO of EMC Corp., with his technical insight on this topic.

Much like surveillance video, video captured in classrooms, courtrooms, and other site-specific settings is increasing quickly as well. This is the prelude to a “scene video” age in which most videos will be captured from specific scenes. In the near future, these pervasive cameras will cover all the spaces the human race is able to reach.

In this new age, the “scene” will become the bridge connecting video coding and computer vision research. Modeling these scenes could facilitate further video compression, as demonstrated by the IEEE 1857 standard. In turn, with the assistance of scene models encoded in the video stream, foreground-object detection, tracking, and recognition become less difficult. In this sense, the massive growth of surveillance and other kinds of scene video presents big challenges, as well as big opportunities, for the video- and vision-related research communities.

In 2015, the IEEE Computer Society’s Technical Committee on Multimedia Computing (TCMC) and Technical Committee on Semantic Computing (TCSEM) will jointly sponsor the first IEEE International Conference on Multimedia Big Data, a premier world forum for leading scholars in the highly active field of multimedia big data research, development, and applications. Interested readers are welcome to join us at this new conference next spring in Beijing for further discussion of rapidly growing multimedia big data.



  • From the application perspective, video sensing technologies and systems have achieved initial success in security and traffic management, but applications in city operations (such as city-wide situational analysis) and social services (such as live streaming of major scenic areas or home care) are only just beginning. Moving from “management and control” to “service” is therefore the key direction for the development of video big data applications.
  • From the data perspective, video sensing and analysis technologies and systems based on a single camera are already in preliminary use; a typical example is electronic traffic-police monitoring and violation detection. However, as urban video surveillance systems continue to grow in scale and application demand explodes, processing video big data that spans multiple cameras, covers large camera networks, and even fuses various kinds of video imagery and associated data has become an urgent priority.
  • From the technology perspective, the past 20 years have largely solved the digitization of surveillance camera systems, and the past five years have begun to address the shift of large-scale surveillance cameras to high definition. At present, however, the massively deployed surveillance camera nodes have little intelligence and cannot process information in real time, so the prevailing human-monitoring mode of operation is inefficient and slow to react to emergencies. Making video surveillance intelligent will therefore remain a research and application focus for a long time to come. Going further, because technologies for the coordinated processing and computation of multi-camera video data over wide areas are still lacking, it is hard to deeply exploit and mine the rich information about people, objects, behaviors, and even events contained in such wide-area video. Intelligent analysis of big data (“big-data-ization” for short) is thus the inevitable research and development trend for the years ahead.

1 The core of video big data is surveillance video, but video big data in the broad sense also includes conference, home, classroom, and courtroom video, which is typically captured by fixed cameras pointed at a particular scene (such as a home, classroom, meeting room, courtroom, or traffic intersection) over a period of time. Because such video has relatively fixed scene characteristics and is shot without any predefined script or intent, we call it “scenic video.” Compared with other types of video (such as news, film and television, or sports video), scenic video is the most likely breakthrough point for machine vision research and the best experimental data for tackling the video big data problem. First, scenic video is collected by a camera that watches a single scene over a long period, so the “changing” and “unchanging” elements of the scene can be effectively modeled, analyzed, and reasoned about, making efficient compression, scene understanding, and even accurate recognition possible. Second, scenic video is enormous in volume and is the main object of study within video big data, so advances in scenic-video processing and analysis will directly raise the state of the art in video big data.
From a broader perspective, the field of information technology is incubating major technological breakthroughs and transformations, and video big data is both one of the main driving forces and the best application problem. On one hand, related fields such as cognitive and brain science and machine vision are moving from “quantitative” to “qualitative” change, and theoretical breakthroughs can be expected within the next several years. On the other hand, over the past year or two, artificial intelligence technologies represented by deep learning and electronic brains have developed rapidly, yielding a series of landmark results, including Google Brain, Baidu Brain, and IBM’s brain-like chip TrueNorth. These advances will provide powerful computing and processing capabilities for scenic-video processing and analysis.

PKU-RSD Dataset

We constructed the PKU-RSD (Regional Saliency Dataset) to capture spatiotemporal visual saliency for evaluating different video saliency models. The dataset contains 431 short videos covering various scenes (surveillance, advertisements, news, cartoons, movies, etc.), along with annotations of salient objects in sampled key frames, manually labeled by 23 subjects. Some samples of the annotation results are shown below:

[Figure: sample annotation results from the PKU-RSD dataset]


  • The videos and the corresponding annotation results available for download are part of the PKU-RSD (Regional Saliency Dataset).
  • The videos and the corresponding annotation results may be used for ACADEMIC PURPOSES only. NO COMMERCIAL USE is allowed.
  • Copyright © National Engineering Laboratory for Video Technology (NELVT) and Institute of Digital Media, Peking University (PKU-IDM). All rights reserved.

All publications using PKU-RSD should cite the paper below:


You can download the agreement (pdf) by clicking the DOWNLOAD link.
After filling it in, please send the electronic version to our email: pkuml at (Subject: PKU-RSD Agreement)
Once we have confirmed your information, we will send the download link and password to you via email. You must abide by the agreement.