<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Towards Surveillance Video-and-Language Understanding</title>
<link href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap" rel="stylesheet">
<link rel="stylesheet" href="css/style.css" media="screen"/>
</head>
<body>
<div class="container">
<h1 class="project-title">Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges (CVPR 2024)</h1>
<div class="authors">
<p>Tongtong Yuan<sup>1</sup>, Xuange Zhang<sup>1</sup>, Kun Liu, Bo Liu<sup>1</sup>, Chen Chen<sup>2</sup>, Jian Jin<sup>3</sup>, Zhenzhen Jiao<sup>4</sup></p>
<p><sup>1</sup> Beijing University of Technology, CN</p>
<p><sup>2</sup> Center for Research in Computer Vision, University of Central Florida, USA</p>
<p><sup>3</sup> Institute of Industrial Internet of Things, CAICT, CN</p>
<p><sup>4</sup> Beijing Teleinfo Technology Co., Ltd., CAICT, CN</p>
</div>
<div class="icon-container">
<ul class="icon-list">
<li class="icon-item">
<a href="https://arxiv.org/abs/2309.13925">
<img src="img/file-pdf.png" alt="Paper">
<h4><strong>Paper</strong></h4>
</a>
</li>
<li class="icon-item">
<a href="https://github.com/Xuange923/Surveillance-Video-Understanding">
<img src="img/github.png" alt="Code">
<h4><strong>Code and Dataset</strong></h4>
</a>
</li>
</ul>
</div>
<h2>Abstract</h2>
<p>Surveillance videos are an essential component of daily life, with critical applications in public security in particular. However, current surveillance video tasks mainly focus on classifying and localizing anomalous events. Although existing methods have achieved considerable performance, they are limited to detecting and classifying predefined events and offer only limited semantic understanding.
To address this issue, we propose a new research direction of surveillance video-and-language understanding and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences with an average length of 20 words, and its annotated videos total 110.7 hours.
Furthermore, we benchmark SOTA models for four multimodal tasks on this newly created dataset, which serve as new baselines for surveillance video-and-language understanding. Through our experiments, we find that mainstream models used on previous publicly available datasets perform poorly on surveillance video, which reveals new challenges in surveillance video-and-language understanding. To validate the effectiveness of UCA, we also conduct experiments on multimodal anomaly detection. The results demonstrate that multimodal surveillance learning can improve the performance of conventional anomaly detection tasks. All these experiments highlight the necessity of constructing this dataset to advance surveillance AI.</p>
<p>The following figure shows Annotation Examples in our UCA dataset.</p>
<img src="https://i.postimg.cc/ZqyVxR0W/fig-visual.jpg" alt="Abstract Image">
<hr>
<h2>Dataset Description</h2>
<h3>Comparative Analysis with Other Video Datasets</h3>
<p>The following table provides a statistical comparison between the UCA dataset and other traditional video datasets in multimodal learning tasks. Our dataset is specifically designed for the surveillance domain, featuring the longest average word count per sentence.</p>
<img src="https://s2.loli.net/2023/12/04/pu2UQBCXsPYxikF.png" alt="Comparative Analysis Table">
<h3>Quality and Fairness Assurance</h3>
<p>During the video collection process for UCA, we conducted a meticulous screening of the original UCF-Crime dataset to filter out videos of lower quality. This ensures the quality and fairness of our UCA dataset. The low-quality videos identified had issues like repetitions, severe obstructions, or excessively fast playback speeds, which impeded the clarity of manual annotations and the precision of event time localization.</p>
<p>Consequently, we removed 46 videos from the original UCF-Crime dataset, resulting in a total of 1,854 videos for UCA. The data split in UCA is outlined in the table below.</p>
<img src="https://s2.loli.net/2023/12/04/TwCVjQvGf5PmU6H.png" alt="UCA Data Split Table">
<h3>Parts of Speech Distribution</h3>
<p>The table below displays the number of query descriptions for the events we labeled and the average number of words per query. The average word count in our annotations is approximately 20 words. The distribution of parts of speech (nouns, verbs, and adjectives) is approximately 2:2:1 in all sentences of the Train, Val, and Test splits.</p>
<img src="https://i.postimg.cc/DyQDMyqg/tab-3.png" alt="Parts of Speech Distribution">
<h3>Format Explanation</h3>
<p>The UCA dataset is available in two formats: <code>txt</code> and <code>json</code>.</p>
<p><strong>txt format:</strong></p>
<pre>
VideoName StartTime EndTime ##Video event description
</pre>
<p><strong>json format:</strong></p>
<pre>
"VideoName": {
    "duration": xx.xx,
    "timestamps": [
        [StartTime 1, EndTime 1],
        [StartTime 2, EndTime 2]
    ],
    "sentences": ["Video event description 1", "Video event description 2"]
}
</pre>
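<p>As an illustration, the following Python sketch shows one way to read both formats. It is a minimal example, not official loader code; the file names (<code>UCA_train.txt</code>, <code>UCA_train.json</code>) are hypothetical placeholders for the actual files in the repository.</p>
<pre>
import json

# Parse one line of the txt format:
#   VideoName StartTime EndTime ##Video event description
def parse_txt_line(line):
    meta, description = line.split("##", 1)
    video_name, start, end = meta.split()
    return video_name, float(start), float(end), description.strip()

# Load the json format: each video maps to parallel
# "timestamps" and "sentences" lists.
with open("UCA_train.json") as f:  # hypothetical file name
    annotations = json.load(f)

for video_name, info in annotations.items():
    for (start, end), sentence in zip(info["timestamps"], info["sentences"]):
        print(video_name, start, end, sentence)
</pre>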
<hr>
<h2>Experimental Tasks</h2>
<p>We conducted four types of experiments on our dataset:</p>
<ul style="font-size: 18px; line-height: 1.6; ">
<li>Temporal Sentence Grounding in Videos (TSGV): localizing the temporal segment of a video that matches a natural language query.</li>
<li>Video Captioning (VC): understanding a video clip and describing it in natural language.</li>
<li>Dense Video Captioning (DVC): jointly localizing and captioning the dense events in an untrimmed video.</li>
<li>Multimodal Anomaly Detection (MAD): using captions as a text feature source to enhance traditional anomaly detection in complex surveillance videos.</li>
</ul>
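<p>For context, TSGV models on UCA, as on other grounding benchmarks, are commonly evaluated with the temporal Intersection over Union (tIoU) between a predicted segment and the annotated ground-truth segment, reported as Recall@K at several tIoU thresholds. The metric itself is simple; a minimal sketch:</p>
<pre>
# Temporal IoU between two segments, each given as (start_s, end_s).
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction that overlaps half of the annotated event:
print(temporal_iou((10.0, 30.0), (20.0, 40.0)))  # ~0.33
</pre>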
<hr>
<h2>Visualizations</h2>
<p>To better understand the dataset and the experimental outcomes, the following visualizations are included:</p>
<div class="visualization">
<h3>TSGV Visualization</h3>
<p>Example by MMN.</p>
<img src="https://i.postimg.cc/mhfbhhc7/fig-tsgv-2.jpg" alt="TSGV Visualization">
</div>
<div class="visualization">
<h3>VC Visualization</h3>
<p>Example by SwinBert.</p>
<img src="https://i.postimg.cc/pVqj6Hwk/fig-vc.jpg" alt="VC Visualization">
</div>
<div class="visualization">
<h3>DVC Visualization</h3>
<p>Example by PDVC.</p>
<img src="https://i.postimg.cc/wjd95twr/fig-dvc.jpg" alt="DVC Visualization">
</div>
<div class="visualization">
<h3>MAD Captioning Results</h3>
<p>Examples of different video captioning results.</p>
<img src="https://i.postimg.cc/V5PPFTWt/fig-mad.jpg" alt="MAD Captioning Results">
</div>
<hr>
<div class="ucf-crime-dataset">
<h2>Original UCF-Crime Dataset Reference</h2>
<p>Our UCA dataset is built upon the foundational UCF-Crime dataset. For those interested in exploring the original data, the UCF-Crime dataset can be downloaded directly from <a href="http://www.crcv.ucf.edu/data1/chenchen/UCF_Crimes.zip" download="UCF_Crimes.zip">www.crcv.ucf.edu/data1/chenchen/UCF_Crimes.zip</a>.</p>
<p>Additionally, further details about the UCF-Crime project are available on their official website: <a href="https://www.crcv.ucf.edu/research/real-world-anomaly-detection-in-surveillance-videos/">Visit here</a></p>
<p>If you wish to reference the UCF-Crime dataset in your work, please cite the following paper:</p>
<pre>
@inproceedings{sultani2018real,
    title={Real-world anomaly detection in surveillance videos},
    author={Sultani, Waqas and Chen, Chen and Shah, Mubarak},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    pages={6479--6488},
    year={2018}
}
</pre>
<p>Each annotation in our UCA dataset is associated with a corresponding video in the original UCF-Crime dataset, so after downloading, users can easily match each video to its annotations.</p>
</div>
<hr>
<h2>Usage and Contact</h2>
<p>Our dataset is available exclusively for academic and research purposes. Please feel free to contact the authors with inquiries, suggestions, or collaboration proposals.</p>
<h3>Citation</h3>
<pre>
@misc{yuan2023surveillance,
    title={Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges},
    author={Tongtong Yuan and Xuange Zhang and Kun Liu and Bo Liu and Chen Chen and Jian Jin and Zhenzhen Jiao},
    year={2023},
    eprint={2309.13925},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
</pre>
</div>
</body>
</html>