Complex Event Detection in Large-Scale Video Data (lecture slides; original file: 大规模视频数据复杂事件检测课件.pptx)

  • Uploader (seller): 三亚风情
  • Document ID: 3496817
  • Upload date: 2022-09-07
  • Format: PPTX
  • Pages: 43
  • Size: 12 MB

Complex Event Detection in Large-Scale Video Data

Outline
• Introduction
• Standard pipeline
• MED with few exemplars
• A discriminative CNN representation for MED
• A new pooling method for MED

Introduction

Challenge 1: An event is usually characterized by a longer video clip. Ten years ago the field worked on constrained videos, e.g., news videos; now the videos are unconstrained. The length of videos in the TRECVID MED dataset varies from one minute to one hour.

Challenge 2: Multimedia events are higher-level descriptions, e.g., "landing a fish."

Challenge 3: Huge intra-class variations. (A slide shows two very different videos of the same event, "marriage proposal.")

Standard pipeline

Standard components in the CDR pipeline:
• Visual analysis: SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH), Motion SIFT (MoSIFT), STIP, Dense Trajectory, CNN
• Audio analysis: MFCC, Acoustic Unit Descriptors (AUDs)
• Text analysis: OCR, ASR
• High-level concept analysis: SIN 11 concepts, Object Bank

Each analysis phase turns the input video into low-level feature vectors or high-level concept scores, which feed the CDR.

MED with few exemplars

Motivation: there are three tasks in MED:
• EK 100: 100 positive exemplars per event
• EK 10: 10 positive exemplars per event
• EK 0: no positive exemplars, only text descriptions

Our solution for event detection with few (i.e., 10) exemplars: knowledge adaptation, leveraging related exemplars.

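As an illustrative sketch only (the dict layout and function name below are our own, not from the slides' system), the multi-modal feature bank above can be written down as a simple mapping from modality to extractors:

```python
# Sketch of the CDR feature bank from the slides, organized per modality.
# The extractor names mirror the slide table; the dict structure itself
# is an illustrative assumption, not the authors' code.
FEATURE_BANK = {
    "visual": ["SIFT", "CSIFT", "TCH", "MoSIFT", "STIP",
               "Dense Trajectory", "CNN"],
    "audio": ["MFCC", "AUD"],
    "text": ["OCR", "ASR"],
    "concept": ["SIN concepts", "Object Bank"],
}

def plan_extraction(modalities):
    """Return the ordered list of (modality, extractor) jobs to run."""
    return [(m, f) for m in modalities for f in FEATURE_BANK[m]]

jobs = plan_extraction(["audio", "text"])
```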
Leveraging related videos. Two examples of videos related to "marriage proposal":
• A girl plays music, dances down a hallway in school, and asks a boy to prom.
• A large crowd cheers after a boy, with a bouquet of flowers and a huge sign, asks his girlfriend to go to prom with him.

Our solution: automatically assess the relatedness of each related video for event detection.

Experimental results: frames sampled from two video sequences marked by NIST as related exemplars to the event "birthday party," and from two sequences marked as related to the event "town hall meeting."

Take-home messages:
• Exact positive training exemplars are difficult to obtain, but related samples are easier to collect.
• Appropriately leveraging related samples helps event detection.
• The improvement is more significant when exact positive exemplars are few.
• There are many other settings in which related samples are largely available.

For details, refer to our paper: How Related Exemplars Help Complex Event Detection in Web Videos? Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan and Alexander Hauptmann. ICCV 2013.

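The ICCV 2013 method learns the relatedness of each related video; as a much simpler illustrative stand-in, related videos can be treated as down-weighted positives in a weighted classifier. All data, the fixed 0.3 weight, and the plain logistic-regression trainer below are assumptions of this sketch, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy features: a few exact positives, many related videos, many negatives.
X_pos = rng.normal(loc=2.0, size=(10, 5))
X_rel = rng.normal(loc=1.2, size=(40, 5))
X_neg = rng.normal(loc=0.0, size=(200, 5))
X = np.vstack([X_pos, X_rel, X_neg])
y = np.concatenate([np.ones(50), np.zeros(200)])

# Down-weight related videos: treat them as soft positives. The paper
# instead *learns* a relatedness score per related video.
w = np.concatenate([np.ones(10), np.full(40, 0.3), np.ones(200)])

def fit_weighted_logreg(X, y, w, lr=0.1, steps=500):
    """Weighted logistic regression by plain gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        grad = Xb.T @ (w * (p - y)) / w.sum()
        theta -= lr * grad
    return theta

def predict(X, theta):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ theta > 0).astype(int)

theta = fit_weighted_logreg(X, y, w)
preds = predict(np.array([[2.5] * 5, [-1.0] * 5]), theta)
```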
A discriminative CNN representation for MED

Video analysis costs a lot. Dense Trajectories and their enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection, with performance superior to other features such as the motion feature STIP and the static appearance feature Dense SIFT. (Credits: Heng Wang)

Even parallelized over 1,000 cores, it takes about one week to extract the IDT features for the 200,000 videos, totaling 8,000 hours, in the TRECVID MEDEval 14 collection.

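The quoted extraction cost can be sanity-checked with back-of-envelope arithmetic (taking "about one week" as 168 hours):

```python
cores = 1_000
wall_hours = 7 * 24          # "about one week"
video_hours = 8_000          # total duration of the 200,000 videos

core_hours = cores * wall_hours              # total compute spent
per_video_hour = core_hours / video_hours    # core-hours per hour of video
```

So roughly 21 core-hours of IDT extraction per hour of video, which is why a single machine is hopeless for this feature at MED scale.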
As a result of this unaffordable computation cost (a cluster with 1,000 cores), it is extremely difficult for a relatively small research group with limited computational resources to process large-scale MED datasets. It therefore becomes important to propose an efficient representation for complex event detection that requires only affordable computational resources, e.g., a single machine.

Turn to CNNs? One instinctive idea is to utilize deep learning, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and their fast processing speed, achieved by leveraging the massive parallel processing power of GPUs. However, it has been reported that the event detection performance of CNN-based video representations was worse than improved Dense Trajectories in TRECVID MED 2013.

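The frame-level average pooling that the following slides critique is simply a mean over per-frame CNN features; a minimal numpy sketch (the final L2 normalization is a common convention we assume here, not something the slides specify):

```python
import numpy as np

def average_pool(frame_features: np.ndarray) -> np.ndarray:
    """Collapse (num_frames, dim) frame features into one video vector
    by taking the mean, then L2-normalize it."""
    v = frame_features.mean(axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

frames = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy frame descriptors
video_vec = average_pool(frames)
```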
Technical problems of utilizing CNNs for MED:
• First, CNNs require a large amount of labeled video data to train good models from scratch, while TRECVID MED datasets have only 100 positive examples per event.
• Second, fine-tuning from ImageNet to video data requires changing the network structure, e.g., the convolutional pooling layer proposed in "Beyond Short Snippets: Deep Networks for Video Classification."
• Finally, average pooling over frames to generate the video representation is not effective for CNN features.

Average pooling for videos: the winning solution for the TRECVID MED 2013 competition used average pooling of CNN frame features, i.e., CNNs with the standard approach (average pooling) to generate the video representation from frame-level features. Performance (mAP in percentage):

                              MEDTest 13   MEDTest 14
Improved Dense Trajectories      34.0         27.6
CNN in CMU MED 2013              29.0         N.A.
CNN from VGG-16                  32.7         24.8

Video pooling on CNN descriptors: video pooling computes the video representation over the entire video by pooling all the descriptors from all the frames. For local descriptors such as HOG, HOF, and MBH in improved Dense Trajectories, the Fisher vector and the Vector of Locally Aggregated Descriptors (VLAD) are applied to generate the video representation. To our knowledge, this is the first work on video pooling of CNN descriptors, broadening these encoding methods from local descriptors to CNN descriptors in video analysis.

Discriminative ability analysis on the training set of TRECVID MEDTest 14; performance comparison (mAP in percentage) on MEDTest 14 100Ex:

                   fc6    fc6_relu   fc7    fc7_relu
Average pooling    19.8     24.8     18.8     23.8
Fisher vector      28.3     28.4     27.4     29.1
VLAD               33.1     32.6     33.2     31.5

(A figure in the slides compares performance on MEDTest 13 and MEDTest 14, for both 100Ex and 10Ex.)

Latent Concept Descriptors (LCD): convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.
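The VLAD encoding compared above aggregates per-frame descriptors as residuals against codebook centers; a minimal numpy sketch (the toy centers stand in for a k-means codebook learned on training descriptors):

```python
import numpy as np

def vlad(descriptors: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """VLAD: assign each descriptor to its nearest center, accumulate
    residuals per center, then apply signed square-root (power) and L2
    normalization. Output dimension is num_centers * descriptor_dim."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)               # nearest-center assignment
    K, D = centers.shape
    agg = np.zeros((K, D))
    for k in range(K):
        if np.any(assign == k):
            agg[k] = (descriptors[assign == k] - centers[k]).sum(axis=0)
    v = agg.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[1.0, 0.0], [9.0, 10.0]])
code = vlad(desc, centers)   # 4-dimensional VLAD vector
```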

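The pool5-to-LCD conversion used in this section amounts to a reshape of the activation map; a sketch, assuming a channel-last (a, a, M) layout:

```python
import numpy as np

def to_latent_concept_descriptors(pool5: np.ndarray) -> np.ndarray:
    """Turn an (a, a, M) pool5 activation map into a*a latent concept
    descriptors of dimension M: each spatial location contributes the
    responses of all M filters ("latent concepts") at that location."""
    a1, a2, M = pool5.shape
    return pool5.reshape(a1 * a2, M)

pool5 = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # a=2, M=3
lcd = to_latent_concept_descriptors(pool5)                  # shape (4, 3)
```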
From this interpretation, a pool5 layer of size a × a × M can be converted into a² latent concept descriptors with M dimensions each; every latent concept descriptor represents the responses from the M filters at a specific pooling location. These descriptors are then encoded, e.g., with VLAD.

LCD results on pool5 (mAP in percentage).

Performance comparisons for pool5 on MEDTest 13:
                     100Ex   10Ex
Average pooling       31.2   18.8
LCD + VLAD            38.2   25.0
LCD + VLAD + SPP      40.3   25.6

Performance comparisons for pool5 on MEDTest 14:
                     100Ex   10Ex
Average pooling       24.6   15.3
LCD + VLAD            33.9   22.8
LCD + VLAD + SPP      35.7   23.2

Representation compression: we utilize Product Quantization (PQ) to compress the video representation. Without PQ, the storage size of the features for 200,000 videos would be 48.8 GB, which severely compromises execution time due to I/O cost. With PQ, we can store the features of the whole collection in less than 1 GB, which a normal SSD can read in a few seconds, and fast predictions can be made through an efficient look-up table.

Comparison with the previous best feature, IDT (mAP in percentage):

                     Ours    IDT    Relative improvement
MEDTest 13 100Ex     44.6    34.0         31.2%
MEDTest 13 10Ex      29.8    18.0         65.6%
MEDTest 14 100Ex     36.8    27.6         33.3%
MEDTest 14 10Ex      24.5    13.9         76.3%

Notes: the proposed representation is extensible, and its performance can be further improved by better CNN models, appropriate fine-tuning, or better descriptor encoding techniques. It is also generic for video analysis, not limited to multimedia event detection; we tested on the MED datasets because they are the largest available video analysis datasets. The representation is simple yet very effective, and it is easy to generate using the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and vlfeat/Yael (for the encoding).

Take-home messages:
1. Utilize VLAD/FV encoding techniques to generate video representations from frame-level CNN features: simple but effective.
2. Formulate the intermediate convolutional features into latent concept descriptors (LCD).
3. Apply Product Quantization to compress the generated CNN representation.

For details, please refer to our paper: A Discriminative CNN Video Representation for Event Detection. Zhongwen Xu, Yi Yang and Alexander G. Hauptmann. CVPR 2015.

A new pooling method for MED

Motivation: only some shots in a long video are relevant to the event, while others are less relevant or even useless. Representative approaches (average pooling and max pooling) largely ignore this difference.

Our solution:
• Define a novel notion of semantic saliency that evaluates the relevance of each shot to the event of interest, and re-order the shots according to their semantic saliency.
• Propose a new isotonic regularizer that respects the order information, leading to a nearly-isotonic SVM with more discriminative power.
• Develop an efficient implementation using the proximal gradient algorithm, enhanced with newly proven, exact closed-form proximal steps.
• Extensive experiments on three real-world large-scale video datasets confirm the effectiveness of the proposed approach.

Re-ordering according to semantic saliency. (A slide illustrates shots sorted by decreasing saliency.)

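Re-ordering shots by semantic saliency is, mechanically, a descending sort; a minimal sketch that takes the saliency scores as given (in the paper they come from the concept/skip-gram step):

```python
import numpy as np

def reorder_by_saliency(shot_features: np.ndarray,
                        saliency: np.ndarray) -> np.ndarray:
    """Sort shot features so the most event-relevant shots come first."""
    order = np.argsort(-saliency)       # descending saliency
    return shot_features[order]

shots = np.array([[0.0], [1.0], [2.0]])     # toy per-shot features
sal = np.array([0.2, 0.9, 0.5])             # toy saliency scores
ordered = reorder_by_saliency(shots, sal)
```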
Our method:
1. Each input video is divided into multiple shots, and each event has a short textual description.
2. A CNN is used to extract features.
3. Semantic concept names and a skip-gram model are used to derive a probability vector and a relevance vector, which are combined to yield the new semantic saliency and used for temporal alignment.

Experimental results (compared with state-of-the-art algorithms). Observations:
• Average pooling performs consistently better than max pooling on all events.
• The proposed approaches generally outperform average pooling, demonstrating that properly exploiting the order information can significantly boost performance.
• Using the additional regularizer consistently performs better than using the plain regularizer. We hypothesize that this is because our CNN features are very discriminative, hence sparsity does not help much.
• The nonnegative variants perform worse than the others, albeit being convex and leading to even sparser weights.

Take-home messages: re-ordering the video shots according to semantic saliency improves event detection performance, and with the proposed nearly-isotonic SVM we are able to exploit the carefully constructed order information.

For details, refer to our paper: Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM. Xiaojun Chang, Yi Yang, Eric Xing and Yaoliang Yu. ICML 2015.
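The isotonic regularizer above can be illustrated by the penalty it charges for violating a non-increasing weight order; this sketch shows only the penalty term, not the paper's exact formulation or its closed-form proximal steps:

```python
import numpy as np

def nearly_isotonic_penalty(w: np.ndarray, lam: float = 1.0) -> float:
    """After shots are sorted by decreasing saliency, later shots should
    not receive larger weights; only the upward jumps
    max(0, w[i+1] - w[i]) are penalized."""
    diffs = np.diff(w)
    return lam * np.maximum(diffs, 0.0).sum()

w_good = np.array([3.0, 2.0, 1.0])   # non-increasing: no penalty
w_bad = np.array([1.0, 3.0, 2.0])    # one upward jump of 2
```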

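For completeness, the Product Quantization compression from the CVPR 2015 section can be sketched as follows; real PQ learns the per-subspace codebooks with k-means, whereas this toy uses fixed codebooks:

```python
import numpy as np

def pq_encode(x, codebooks):
    """codebooks: (num_subvectors, codebook_size, sub_dim).
    Store one small integer index per subvector instead of the floats."""
    m, ks, d = codebooks.shape
    parts = x.reshape(m, d)
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        codes[i] = ((codebooks[i] - parts[i]) ** 2).sum(axis=1).argmin()
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

# Toy: 4-dim vectors, 2 subvectors of dim 2, 2 centroids per subspace.
codebooks = np.array([[[0.0, 0.0], [1.0, 1.0]],
                      [[0.0, 1.0], [1.0, 0.0]]])
x = np.array([0.9, 1.1, 0.1, 0.9])
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)
```

The look-up-table prediction the slides mention follows the same idea: per-subspace distances to each centroid are precomputed once, so scoring a compressed vector needs only m table look-ups.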