Radosław Mantiuk, Michał Kowalik, Adam Nowosielski and Bartosz Bazyluk
Do-It-Yourself Eye Tracker: Low-Cost Pupil-Based Eye Tracker for Computer Graphics Applications
Eye tracking technologies offer sophisticated methods for capturing humans’ gaze direction, but their popularity in multimedia and computer graphics systems is still low. One of the main reasons is the high cost of commercial eye trackers, which reaches 25,000 euros. Interestingly, this price seems to stem from the costs incurred in research rather than the value of the hardware components used. In this work we show that an eye tracker of satisfactory precision can be built within a budget of 30 euros. The paper presents detailed instructions on how to construct a low-cost pupil-based eye tracker and utilise open source software to control its behaviour. We test the accuracy of our eye tracker and reveal that its precision is comparable to commercial video-based devices. We give an example of an application in which our eye tracker is used to control depth-of-field rendering in a real-time virtual environment.
Francesco Cricri, Igor Curcio, Sujeet Mate, Kostadin Dabov and Moncef Gabbouj
Sensor-Based Analysis of User Generated Video for Multi-Camera Video Remixing
In this work we propose to exploit context sensor data for analyzing user generated videos. Firstly, we perform a low-level indexing of the recorded media with the instantaneous compass orientations of the recording device. Subsequently, we exploit the low level indexing to obtain a higher level indexing for discovering camera panning movements, classifying them, and for identifying the Region of Interest (ROI) of the recorded event. Thus, we extract information about the content without performing content analysis but by leveraging sensor data analysis. Furthermore, we develop an automatic remixing system that exploits the obtained high-level indexing for producing a video remix. We show that the proposed sensor-based analysis can correctly detect and classify camera panning and identify the ROI; in addition, we provide examples of their application to automatic video remixing.
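The idea of finding panning movements from compass orientations alone can be sketched as follows. This is a minimal illustration, not the authors' method: the function names (`heading_delta`, `detect_pans`) and both thresholds are assumptions chosen for the example. A pan is taken to be a run of consecutive samples that rotate in one direction and accumulate enough total rotation; the sign of the total distinguishes the two panning directions.

```python
def heading_delta(a, b):
    """Signed smallest angular difference b - a in degrees, in [-180, 180)."""
    return (b - a + 180.0) % 360.0 - 180.0


def detect_pans(headings, min_total_deg=20.0, min_step_deg=1.0):
    """Return (start_index, end_index, total_rotation_deg) per detected pan.

    headings: compass orientations in degrees, one per sample.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    pans = []
    start = None
    total = 0.0
    for i in range(1, len(headings)):
        step = heading_delta(headings[i - 1], headings[i])
        moving = abs(step) >= min_step_deg
        same_dir = start is None or (step > 0) == (total > 0)
        if moving and same_dir:
            if start is None:
                start = i - 1
                total = 0.0
            total += step
        else:
            # The run ended: keep it only if it rotated far enough.
            if start is not None and abs(total) >= min_total_deg:
                pans.append((start, i - 1, total))
            start, total = None, 0.0
            if moving:
                # Direction reversed: this step starts a new candidate pan.
                start, total = i - 1, step
    if start is not None and abs(total) >= min_total_deg:
        pans.append((start, len(headings) - 1, total))
    return pans
```

Because `heading_delta` wraps around north, a sweep from 350 degrees through 0 to 20 degrees is treated as one continuous 30-degree rotation rather than a jump.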
Charles-Frederik Hollemeersch, Bart Pieters, Aljosha Demeulemeester, Peter Lambert and Rik Van De Walle
Real-time visualizations of gigapixel texture data sets using HTML5
With the recent standardization of WebGL as part of HTML5, new possibilities have arisen for graphically intensive web-based applications. This paper presents our gigapixel texture visualization system which runs entirely within the limitations of a standards-compatible browser. Compared to existing approaches, our system offers high-performance 3D texture visualization and streaming without any dedicated plugins. We show that real-time performance can be achieved (less than 12ms render time per frame) on current-generation desktop hardware for texture data sets of at least 15 gigapixels.
Tina Walber, Ansgar Scherp and Steffen Staab
Identifying Objects in Images from Analyzing the Users' Gaze Movements for Provided Tags
Millions of users share, tag, and search for images on social media platforms and social networking sites today. Annotating and searching for specific image regions, however, is still very hard. Assuming that eye tracking will be a common input device in the near future in notebooks equipped with cameras and mobile devices like iPads, it is possible to implicitly gain information about images and image regions from these users' gaze movements. In this paper, we investigate the principal idea of finding specific objects shown in images by looking at the users' gaze path information only. We have analyzed 547 gaze paths from 20 subjects viewing different image-tag pairs with the task of deciding whether the tag presented is actually found in the image or not. By analyzing the gaze paths, we are able to correctly identify 67% of the image regions and significantly outperform two baselines. In addition, we have investigated whether different regions of the same image can be differentiated by the gaze information. Here, we are able to correctly identify two different regions in the same image with an accuracy of 38%.
Xiaowei Ding, Yi Xu, Lei Deng and Xiaokang Yang
Colorization Using Quaternion Algebra with Automatic Scribble Generation
In current colorization techniques, major user intervention is required in the form of tedious, time-consuming scribble drawing. In image patches with abundant structures, the user has to modify the scribble input repeatedly to achieve satisfying outputs. Moreover, color leakage usually occurs across contours and object boundaries. In this paper, we focus on automatic scribble generation and structure-preservation mechanism, which are still open issues of colorization. Firstly, we generate scribbles automatically along points where the spatial distribution entropy achieves locally extreme value. As a result, the requested color information of each homogeneous region is contained dominantly in the neighborhood of these scribbles. Given the color scribbles, we compute quaternion wavelet phases to conduct colorization along equal-phase lines. These lines across scribbles and monochrome patches locate textures with similar pattern distribution. Contour 'strength' model is also established in scale space to direct color propagation among similar edge structures. Finally, we reconstruct color image patches as vector elements using polar representation in quaternion algebra, well-preserving interrelationship between color channels. The experimental results demonstrate that the proposed colorization method can achieve natural color transitions between different objects with automatically generated scribbles.
Camille Simon, Rainer Schütze, Frank Boochs and Franck Marzani
Asserting the Precise Position of 3D and Multispectral Acquisition Systems for Multisensor Registration Applied to Cultural Heritage Analysis
We present a novel method to register multispectral acquisitions on a 3D model. The method is based on the external tracking of the acquisition systems using close-range photogrammetric techniques: multiple calibrated cameras simultaneously observe the successive acquisition systems in use. The views from these cameras are used to precisely determine the position of each acquisition system. All datasets can then be projected in the same coordinate system. The registration is thus independent from the quality and content of the data. This method is well suited to the study of cultural heritage or any other application where we do not wish to place targets on the object. We describe the method and the simulation pipeline used to find an adequate setup for two case studies.
Bogdan Ionescu, Klaus Seyerlehner, Christoph Rasche, Constantin Vertan and Patrick Lambert
Content-based Video Description for Automatic Video Genre Categorization
In this paper, we propose an audio-visual approach to video genre categorization. It exploits audio, color, temporal and contour information, which are in general genre specific. Audio information is extracted at block level, which has the advantage of capturing local temporal information. At the temporal level, we assess action content with respect to human perception. Further, color perception is quantified with statistics of color distribution, elementary hues, color properties and relationships between colors. The final descriptor set determines statistics of contour geometry. Validation is performed on more than 91 hours of video footage and 7 common video genres. We obtain average precision and recall ratios within [87% − 100%] and [77% − 100%], respectively, while average correct classification is up to 97%. Additionally, we observe that movies displayed according to feature-based coordinates (we use a specially designed 3D browsing environment) tend to regroup with respect to genre, which has potential application in real content-based browsing systems (e.g. commercial video selling/rental platforms).
Pere Obrador, Michele Saad, Poonam Suryanarayan and Nuria Oliver
Towards Category-Based Aesthetic Models of Photographs
We present a novel data-driven category-based approach to automatically assess the aesthetic appeal of photographs. In order to tackle this problem, a novel set of image segmentation methods based on feature contrast is introduced, such that luminance, sharpness, saliency, color chroma, and a measure of region appeal are computed to generate different image partitions. Traditional image aesthetic features are computed in these regions (e.g. sharpness, light exposure, colorfulness). In addition, image composition, color harmony and image simplicity features are measured on the overall image. Support Vector Regression models are generated for each of 7 popular image categories: animals, architecture, cityscape, floral, landscape, portraiture and seascapes. These models are analyzed to understand which features have greater influence in each of those categories, and how they perform with respect to a generic state-of-the-art model.
Ehsan Younessian and Deepu Rajan
Multi-modal Solution for Unconstrained News Story Retrieval
In this paper we propose a multi-modal approach to retrieve associated news stories sharing the same main topic. In the textual domain, we utilize Automatic Speech Recognition (ASR) and refined Optical Character Recognition (OCR) transcripts, while in the visual domain we employ a Near Duplicate Keyframe detection method to identify stories with common visual clues. In addition, we adopt another visual representation, namely the semantic signature, which indicates pre-defined semantic concepts included in the news story, to improve the discriminativeness of the visual modality. We propose a query-class weighting scheme to integrate the retrieval outcomes gained from visual modalities. Experimental results show the distinguishing power of the enhanced representation in individual modalities and the superiority of our fusion approach compared to existing strategies.
Rafael Rodriguez-Sanchez, Jose Luis Martínez, Gerardo Fernández-Escribano, Jose Luis Sánchez and Jose Manuel Claver
A Fast GPU-Based Motion Estimation Algorithm for H.264/AVC
H.264/AVC is the most recent predictive video compression standard; it outperforms other existing video coding standards at the cost of higher computational complexity. In recent years, heterogeneous computing has emerged as a cost-efficient solution for high-performance computing. In the literature, several algorithms have been proposed to accelerate video compression, but so far there have not been many solutions that deal with video codecs using heterogeneous systems. This paper proposes an algorithm to perform H.264/AVC inter prediction. The proposed algorithm performs the motion estimation, both with full-pixel and sub-pixel accuracy, using CUDA to assist the CPU, obtaining remarkable time reductions while maintaining rate-distortion performance.
Michiel Hildebrand and Jacco Van Ossenbruggen
Linking user-generated video annotations to the web of data
In the audiovisual domain tagging games are explored as a method to collect user-generated metadata. For example, the Netherlands Institute for Sound and Vision deployed the video labelling game "Waisda?" to collect user tags for videos from their collection. These tags are potentially useful to improve the access to the content within the videos. However, the uncontrolled and often incomplete tags allow for multiple interpretations, preventing long term access. In this paper we investigate a semi-automatic process to define the interpretation of the tags by linking them to concepts from the Linked Open Data cloud. More specifically, we investigate if existing web services are suited to find a number of candidate concepts, and if human users can select the most appropriate concept from these suggestions in the context of the video. We present a prototype application that supports this process and discuss the results of a user experiment where this application is used with different data sources.
Danil Korchagin, Stefan Duffner, Petr Motlicek and Carl Scheffler
Multimodal Cue Detection Engine for Orchestrated Entertainment
In this paper, we describe a low-delay real-time multimodal cue detection engine for a living room environment. The system is designed to be used in open, unconstrained environments to allow multiple people to enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, and their association and fusion. The system is designed as a coupled component for an orchestrated video conferencing system to improve the overall experience of interaction between spatially separated families and friends. Reduced latency levels achieved to date have shown improved responsiveness of the system.
Markus Waltl, Benjamin Rainer, Christian Timmerer and Hermann Hellwagner
Enhancing the User Experience with the Sensory Effect Media Player and AmbientLib
Multimedia content is increasing in every area of our life. Still, each type of content only stimulates the visual and/or the hearing system. Thus, the user experience depends only on those two stimuli. In this paper we introduce a standard which offers the possibility to add additional effects to multimedia content. Furthermore, we present a multimedia player and a Web browser plug-in which use this standard to stimulate further senses by using additional sensory effects (i.e., wind, vibration, and light) to enhance the user experience, resulting in a unique, worthwhile sensory experience.
Apostolos Axenopoulos, Stavroula Manolopoulou and Petros Daras
Optimizing Multimedia Retrieval using Multimodal Fusion and Relevance Feedback Techniques
This paper introduces a novel approach for search and retrieval of multimedia content. The proposed framework retrieves multiple media types simultaneously, namely 3D objects, 2D images and audio files, by utilizing an appropriately modified manifold learning algorithm. The latter, which is based on Laplacian Eigenmaps, is able to map the mono-modal low-level descriptors of the different modalities into a new low-dimensional multimodal feature space. In order to accelerate search and retrieval and make the framework suitable even for large-scale applications, a new multimedia indexing scheme is adopted. The retrieval accuracy of the proposed method is further improved through relevance feedback, which enables users to refine their queries by marking the retrieved results as relevant or non-relevant. Experiments performed on a multimodal dataset demonstrate the effectiveness and efficiency of our approach. Finally, the proposed framework can be easily extended to involve as many heterogeneous modalities as possible.
Harald Kosch and Andreas Wölfl
Large-Scale Similarity-Based Join Processing in Multimedia Databases
This paper presents efficient parallelization strategies for processing large-scale multimedia database operations. These strategies were implemented by extending and parallelizing the GiST (Generalized Search Tree) framework. We integrate the parallelized framework into an Oracle 11g Multimedia Database using its extension mechanisms. Our strategies and their implementations are tested and validated with large real and random data sets consisting of up to 10 million image objects.
Yingbo Li and Bernard Merialdo
Video Summarization Based on Balanced AV-MMR
Among the techniques of video processing, video summarization is a promising approach to processing multimedia content. In this paper we present a novel summarization algorithm, Balanced Audio Video Maximal Marginal Relevance (Balanced AV-MMR or BAV-MMR), for multi-video summarization based on both audio and visual information. Balanced AV-MMR exploits the balance between audio information and visual information, and the balance of temporal information in different videos. Furthermore, the audio genres and human faces of each frame are analyzed in order to be exploited in Balanced AV-MMR. Compared with its predecessors, Video Maximal Marginal Relevance (Video-MMR) and Audio Video Maximal Marginal Relevance (AV-MMR), we design a novel mechanism to combine these indispensable features from the video track and audio track and achieve better summaries.
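For readers unfamiliar with the Maximal Marginal Relevance principle that these algorithms build on, the following is a minimal sketch of generic greedy MMR selection over frame feature vectors. It is not Balanced AV-MMR: the cosine similarity, the "mean similarity to all other frames" relevance term, the trade-off weight `lam`, and the names `cosine` and `mmr_select` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(features, k, lam=0.5):
    """Greedily pick up to k frame indices that are representative yet non-redundant.

    features: list of feature vectors, one per candidate frame.
    lam: assumed trade-off weight; 1.0 means relevance only.
    """
    n = len(features)
    # Relevance of a frame: mean similarity to every other frame (an assumption).
    relevance = [
        sum(cosine(features[i], features[j]) for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    selected = []
    candidates = set(range(n))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: similarity to the closest frame already in the summary.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-identical frames and one different frame, a balanced `lam` selects one of the near-duplicates and then the different frame, whereas `lam=1.0` ignores redundancy and selects both near-duplicates.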
Huibo Zhong, Sha Shen, Yibo Fan and Xiaoyang Zeng
A Low Complexity Macroblock Layer Rate Control Scheme Base on Weighted-Window for H.264 Encoder
Rate control plays a very important role in video coding. A low complexity macroblock (MB) layer rate control scheme for the H.264 encoder is presented in this paper. Based on the analysis of the relationship among the quantization parameter (QP), mean absolute distortion (MAD) and the coded bits, a weighted-window model is proposed. A weighted-window based QP decision and MAD prediction model is proposed to reduce the computational complexity of MB-layer rate control. A new rate control scheme based on these models is presented in detail. The experimental results show that the proposed scheme gives a quality improvement of about 0.80 dB on average over all sequences, and about 58% reduction in bit rate mismatch.
Xiaojian Zhao, Jin Yuan, Richang Hong, Meng Wang, Zhoujun Li and Tat-Seng Chua
On Video Recommendation Over Social Network
Video recommendation is a hot research topic that helps people access interesting videos. The existing video recommendation approaches include content-based filtering (CBF), collaborative filtering (CF) and approaches that combine both. However, these approaches treat the relationships between all users as equal and neglect the important fact that acquaintances or friends may be a more reliable source than strangers for recommending interesting videos. Thus, in this paper we propose a novel approach to improve the accuracy of video recommendation. For a given user, our approach calculates a recommendation score for each video candidate that is composed of two parts: the interest degree of this video by the user’s friends, and the relationship strengths between the user and his friends. We measure the interest degree of each video by considering its textual and visual information, while the relationship strengths between different users are calculated by considering the users’ profile information, the interaction activities as well as the activity domains from the online social network. The final recommended videos are ranked according to the accumulated recommendation scores from different recommenders. We conducted experiments with 45 participants who were all active Facebook and YouTube users, and the results demonstrated the feasibility and effectiveness of our approach.
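The two-part score described above can be illustrated with a small sketch. The abstract does not state how the two parts are combined, so the product of relationship strength and interest degree, summed over friends, is an assumption here, as are the function name `rank_videos` and the dictionary layout of the inputs.

```python
from collections import defaultdict


def rank_videos(strengths, interests):
    """Rank candidate videos by accumulated friend-weighted interest.

    strengths: {friend: relationship strength between the user and that friend}
    interests: {friend: {video: interest degree of that friend in the video}}
    Combining the two parts by multiplication is an illustrative assumption.
    """
    scores = defaultdict(float)
    for friend, strength in strengths.items():
        for video, interest in interests.get(friend, {}).items():
            # Each friend acts as a recommender contributing to the video's score.
            scores[video] += strength * interest
    # Highest score first; ties broken by video identifier for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Under this assumption, a video that a close friend likes moderately can outrank one that a distant acquaintance likes strongly, which is the behaviour the abstract motivates.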
Martin Halvey, David Hannah, Graham Wilson and Stephen Brewster
Investigating Gestural and Pressure Interaction with a 3D Display
We examine the use of a mobile device to provide multifunctional input and output for a stereoscopic 3D television (TV) display. Through a number of example applications, we demonstrate how a combination of gestural and haptic input (touch and pressure) can be successfully deployed to allow the user to navigate a complex information space (multimedia and TV content), while at the same time visual feedback can be used to provide additional information to the user enriching the experience. The use of a mobile device also provides a number of other benefits over a traditional remote control. In order to investigate the usefulness of our example applications a user evaluation was conducted, where our prototypes are compared with more traditional devices for multimedia interaction. The results of the user evaluations highlight the benefits of our approach and also provide some design guidelines.
Alberto Corrales-Garcia, Jose Luis Martinez, Gerardo Fernandez-Escribano and Francisco Jose Quiles
Scalable mobile-to-mobile video communications based on an improved WZ-to-SVC transcoder
Nowadays, video communications between mobile devices are one of the most demanded multimedia services. Since Wyner-Ziv coding provides low-cost video encoding, it is a suitable codec for encoding video with fewer resources. On the other hand, the video delivery provided by Scalable Video Coding covers the needs of a wide range of homogeneous networks and different devices. As a consequence, Wyner-Ziv to Scalable Video Coding transcoding can offer a suitable framework to support scalable video communications between low-cost devices. However, the transcoder accumulates most of the complexity of both codecs, and this complexity must be reduced. In this paper, we introduce an improved Wyner-Ziv to Scalable Video Coding transcoding framework to support homogeneous mobile video communications. In our transcoder the complexity of the second stage is reduced by reusing information generated during the first stage. The experimental results show that the complexity is reduced by around 83.5% without a significant rate-distortion penalty.
Alberto Corrales-Garcia, Jose Luis Martinez, Gerardo Fernández-Escribano and Francisco Jose Quiles
Forward Wyner-Ziv Fast Video Decoding Using Multicore Processors
With the aim of providing low-complexity encoders, Wyner-Ziv video coding provides a new paradigm where the complexity of the encoder is moved to the decoder. However, this high decoding complexity can be a problem in applications that have delay restrictions. Nowadays parallel computing is a growing field in the computing market. In particular, most personal computers and video coding hardware include multicore processors, which allow parallel execution on several independent cores in the same chip. As a consequence, several DVC parallel decoding approaches are beginning to appear. This work proposes a parallel DVC decoding scheme for multicore processors, which decodes each GOP independently and in parallel. This scheme achieves a time reduction above 70% without any rate-distortion penalty.
Lei Huang, Tian Xia, Yongdong Zhang and Shouxun Lin
Finding Suits in Images of People
Clothing style is a salient feature for understanding images of people. Automatically identifying the style of clothing that people wear is a challenging task. The suit, as one clothing style, is a key element in many important activities. In this paper, we propose a novel suit detection method. By analyzing the style of clothing, we propose color features, shape features and statistical features for suit detection. Experiments with five popular classifiers have been conducted to demonstrate that the proposed features are effective and robust. Comparative experiments with the Bag of Words (BoW) method demonstrate that the proposed features are superior to BoW, which is a popular method for object detection. The proposed method has achieved promising performance on our dataset, which is a challenging web image set with various styles of clothing.
Musab Al-Hadrusi and Nabil Sarhan
Client-Driven Price Selection for Scalable Video Streaming with Advertisements
This paper presents an extensive analysis of a system for scalable delivery of streaming video content with advertisements. In the envisioned delivery system, the revenues from the ads are used to subsidize the cost and thus attract more clients. We analyze a predictive scheme that provides clients with multiple price options, each with a certain number of expected viewed ads. The price depends on the royalty fee of the requested video, its delivery cost based on the current system state, the applied scheduling policy, and the number of viewed ads. The price is lower when the number of viewed ads is larger.
Jonas Etzold, Arnaud Brousseau, Paul Grimm and Thomas Steiner
Context-aware Querying for Multimodal Search Engines
Multimodal interaction provides the user with multiple modes of interacting with a system, such as gestures, speech, text, video, audio, etc. A multimodal system allows for several distinct means for input and output of data. In this paper, we present our work in the context of the I-SEARCH project, which aims at enabling context-aware querying of a multimodal search framework including real-world data such as user location or temperature. We introduce the concepts of MuSeBag for multimodal query interfaces, UIIFace for multimodal interaction handling, and CoFind for collaborative search as the core components behind the I-SEARCH multimodal user interface, which we evaluate via a user study.
Werner Bailer
Sequence Kernels for Clustering and Visualizing Near Duplicate Video Segments
Organizing and visualizing video collections containing a high number of near duplicates is an important problem in film and video post-production. While kernels for matching sequences of feature vectors have been used e.g. for classification of video segments, kernel-based methods have not yet been applied to matching near duplicate video segments. In this paper we survey the application of six sequence-based kernels to clustering near duplicate video segments using kernel k-means and hierarchical clustering, and the application of kernel PCA for generating visualizations of content sets suitable for browsing. Evaluation on the TRECVID 2007 BBC rushes data set shows that the results of the kernel based methods are comparable to other approaches for matching near duplicates, eliminating differences between dynamic time warping and string matching based approaches. These results show that hierarchical clustering on the kernel matrix outperforms kernel k-means. We also show that well-arranged visualizations of both single- and multi-view content sets can be obtained using kernel PCA.
Robert Sorschag
How to Select and Customize Object Recognition Approaches for an Application?
Recently, object recognition has been successfully implemented in a couple of multimedia content annotation and retrieval applications. The employed recognition approaches are carefully selected and adapted to the specific needs of their tasks. In this work, we propose a framework to automate the simultaneous selection and customization of the entire recognition process. This framework only requires an annotated set of sample images or videos and precisely specified task requirements to select an appropriate setup among thousands of possibilities. We use an efficient recognition infrastructure and iterative analysis strategies to make this approach practicable for real-world applications. A case study for face recognition from a single image per person demonstrates the strength of this holistic approach.
Miriam Redi and Bernard Merialdo
A Multimedia Retrieval Framework Based on Automatic Graded Relevance Judgments
Traditional Content Based Multimedia Retrieval (CBMR) systems measure the relevance of visual samples using a binary scale (Relevant/Non Relevant). However, a picture can be relevant to a semantic category to different degrees, depending on the way the concept is represented in the image. In this paper, we build a CBMR framework that supports graded relevance judgments. In order to quickly build graded ground truths, we propose a measure to reassess binary-labeled databases without involving manual effort: we automatically assign a reliable relevance degree (Non, Weakly, Average, Very Relevant) to each sample, based on its position with respect to the hyperplane drawn by support vector machines in the feature space. We test the effectiveness of our system on two large-scale databases, and we show that our approach outperforms the traditional binary relevance-based frameworks in both scene recognition and video retrieval.
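As a rough illustration of grading a sample by its position relative to a separating hyperplane, the sketch below maps the signed decision value of a linear classifier to one of the four degrees named above. The linear form, the threshold values, and the names `decision_value` and `grade` are assumptions for the example; the abstract does not specify how positions are converted into degrees.

```python
def decision_value(w, b, x):
    """Signed score w.x + b of sample x for a linear separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def grade(value, weak=0.0, average=0.5, very=1.0):
    """Map a signed decision value to a graded relevance label.

    The three cut-offs are illustrative assumptions, not values from the paper:
    negative side -> Non, just past the hyperplane -> Weakly, and so on.
    """
    if value < weak:
        return "Non"
    if value < average:
        return "Weakly"
    if value < very:
        return "Average"
    return "Very"
```

For example, with `w = [1.0, -1.0]` and `b = 0.0`, the sample `[0.75, 0.0]` has decision value 0.75 and would be labelled "Average" under these assumed cut-offs.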
Savvas Chatzichristofis, Konstantinos Zagoris, Yiannis S. Boutalis and Avi Arampatzis
A Fuzzy Rank-Based Late Fusion Method for Image Retrieval
Rank-based fusion is indispensable in multiple search setups that lack item retrieval scores, such as meta-search with non-cooperative engines. We introduce a novel, simple, and efficient method for rank-based late fusion of retrieval result lists. The approach taken is rule-based, employs a fuzzy system, and does not require training data. We evaluate on an image database by fusing results retrieved by three MPEG-7 descriptors, and find statistically significant improvements in effectiveness over other widely used rank-based fusion methods.
Rémi Vieux, Jenny Benois-Pineau and Jean Philippe Domenger
Content Based Image Retrieval using Bag-Of-Regions: an Efficient Approach
In this work we introduce the Bag-Of-Regions model, inspired by the Bag-Of-Visual-Words. Instead of clustering local image patches represented by SIFT or related descriptors, low-level descriptors are extracted and clustered from image regions, as given by a segmentation algorithm. The Bag-Of-Regions model makes it possible to define visual dictionaries that capture extra information with respect to Bag-Of-Visual-Words. Experiments on three public datasets show that Bag-Of-Regions signatures outperform Bag-Of-Visual-Words for specific queries, demonstrating the complementarity with the latter approach. The Bag-Of-Regions model allows the creation of an outstanding number of different visual vocabularies. We present an efficient incremental clustering algorithm that has lower computational and memory complexity than k-means clustering, while providing the same retrieval efficiency. Finally, we study methods to combine Bag-Of-Visual-Words and Bag-Of-Regions dictionaries that outperform the retrieval efficiency of any of the single systems.
Mario Doeller, Florian Stegmaier, Simone Jans and Harald Kosch
TempoM2: A Multi Feature Index Structure for Temporal Video Search
Efficient temporal video search will play an important role in the future, given the vast growth of video data on the Web. Here, access methods are one way to ensure efficient retrieval over a large amount of data. However, access methods that target the indexing of video data are rare. In this context, the paper introduces the TempoM^2-tree framework, which features a two-level index structure supporting the retrieval of similar video segments in combination with temporal relations.
Rosario Garrido-Cantos, Jan De Cock, Sebastiaan Van Leuven, Pedro Cuenca Castillo, Antonio Garrido and Rik Van De Walle
Fast Mode Decision Algorithm for H.264/AVC-to-SVC Transcoding with Temporal Scalability
Scalable Video Coding (SVC) uses a notion of layers within the encoded bitstream for providing temporal, spatial and quality scalability, separately or combined. By truncating layers, the bitstream can be adapted to devices with different characteristics and to varying network constraints. Since the majority of existing video content is encoded using H.264/AVC without scalability, this content cannot benefit from these scalability tools, so a transcoding process should be applied to provide scalability to the existing encoded content. In this paper, an algorithm based on Machine Learning techniques for temporal scalability transcoding from H.264/AVC to SVC, focusing on the mode decision task, is discussed. The results show that when our technique is applied, the complexity is reduced by 82% while maintaining coding efficiency.
Christian Vilsmaier, Rolf Karp, Mario Doeller, Harald Kosch and Lionel Brunie
Towards automatic detection of CBIRs configuration
Many Content Based Image Retrieval systems (CBIRs) have been invented in the last decade. The general mechanism of the search process is very similar for all of these CBIRs, and the calculation of rankings is determined by the comparison of features (low-, mid-, high-level). Nevertheless, the respective realization leads, even under equal circumstances, to different results. Knowledge about the internal configuration (used features, weights and metrics) of such systems would be beneficial for many usage scenarios (e.g., a query forwarding strategy that is sensitive to the content of the query image, or improved result ranking strategies in meta search engines). In this context, the paper presents an approach that supports an automatic detection of the configuration of CBIR systems. We demonstrate that the problem can be partly traced back to an optimization problem and test several optimization algorithms. The approach has been evaluated based on the ImageCLEF test set and shows good results.
Kevin Mcguinness, Kealan Mccusker, Neil O'Hare and Noel O'Connor
Efficient Storage and Decoding of SURF Feature Points
Practical use of SURF feature points in large-scale indexing and retrieval engines requires an efficient means for storing and decoding these features. This paper investigates several methods for compression and storage of SURF feature points, considering both storage consumption and disk-read efficiency. We compare each scheme with a baseline plain-text encoding scheme as used by many existing SURF implementations. Our final proposed scheme significantly reduces both the time required to load and decode feature points, and the space required to store them on disk.
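To make the contrast between plain-text and binary storage concrete, the following is a minimal sketch and not the scheme proposed in the paper. The record layout (x, y, scale, orientation, followed by a 64-value descriptor, all stored as 32-bit floats) and the function names are assumptions chosen for illustration.

```python
import struct

DESCRIPTOR_DIM = 64  # common SURF descriptor length

# Hypothetical record layout: x, y, scale, orientation, then the descriptor,
# all stored as little-endian 32-bit floats.
RECORD = struct.Struct("<" + "f" * (4 + DESCRIPTOR_DIM))


def encode_text(points):
    """Plain-text baseline: one line of whitespace-separated numbers per point."""
    lines = (" ".join(repr(v) for v in p) for p in points)
    return "\n".join(lines).encode("ascii")


def encode_binary(points):
    """Fixed-size binary records, so point i starts at byte i * RECORD.size."""
    return b"".join(RECORD.pack(*p) for p in points)


def decode_binary(blob):
    """Inverse of encode_binary; returns a list of tuples of floats."""
    return list(RECORD.iter_unpack(blob))
```

Decoding the binary form needs no number parsing, which is one plausible reason a binary layout loads faster than text. Values are rounded to 32-bit precision in this sketch.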
Toshihiko Yamasaki and Tomoaki Matsunami
Pedestrian Attribute Analysis Using a Top-View Camera in a Public Space
In this paper, we propose a method to determine the gender of a pedestrian in a public space and whether he or she is carrying baggage. The challenging part of this work is that we use only top-view camera images to protect the pedestrians' privacy. We focused on temporal changes in their position, shape, and contours over the frames because their appearances do not provide much information. We extracted the pedestrians' features using their position, area, aspect ratio, histogram of oriented gradients (HoG), and Fourier descriptors. The temporal information was taken into consideration by employing Gaussian mixture models (GMM), the GMM universal background model (GMM-UBM), and the bag of features (BoF) model. The attributes were classified by using support vector machines (SVM). We conducted experiments using a 60-minute video captured by a top-view camera installed at an airport. Experimental results show that the classification accuracy is 69% for gender classification and 79% for baggage possession classification.
Christian Weissig, Oliver Schreer, Peter Eisert and Peter Kauff
The Ultimate Immersive Experience: Panoramic 3D Video Acquisition
The paper presents a new approach to an omni-directional omni-stereo multi-camera system that allows the recording of panoramic 3D video with high resolution and quality. It has been developed in the framework of the TiME Lab at Fraunhofer HHI, an experimental platform for immersive media and related content creation. The new system uses a mirror rig to enable a multi-camera constellation that is close to the concept of concentric mosaics. A proof of concept has shown that the systematic approximation error related to concentric mosaics is negligible in practice and that parallax-free stitching of stereoscopic video panoramas can be achieved with high 3D quality for arbitrary scenes with depth ranges from 2 meters to infinity.
Daniel Kuettel, Matthieu Guillaumin and Vittorio Ferrari
Combining Image-level and Segment-level Models for Automatic Annotation
For the task of assigning labels to an image to summarize its contents, many early attempts use segment-level information and try to determine which parts of the images correspond to which labels. The best performing methods use global image similarity and nearest neighbor techniques to transfer labels from training images to test images. However, global methods cannot localize the labels in the images, unlike segment-level methods. Also, they cannot take advantage of training images that are only locally similar to a test image. We propose several ways to combine recent image-level and segment-level techniques to predict both image and segment labels jointly. We cast our experimental study in a unified framework for both image-level and segment-level annotation tasks. On three challenging datasets, our joint prediction of image and segment labels outperforms either prediction alone on both tasks. This confirms that the two levels offer complementary information.
Taiga Yoshida, Go Irie, Takashi Satou, Akira Kojima and Suguru Higashino
Improving Item Recommendation Based on Social Tag Ranking
Content-based filtering is a popular framework for item recommendation. Typical methods determine items to be recommended by measuring the similarity between items based on the tags provided by users. However, because the usefulness of tags depends on the annotator's skills, vocabularies and feelings, many tags are irrelevant. This fact degrades the accuracy of simple content-based recommendation methods. To tackle this issue, this paper enhances content-based filtering by introducing the idea of tag ranking, a state-of-the-art framework that ranks tags according to their relevance levels. We conduct experiments on videos from a video-sharing site. The results show that tag ranking significantly improves item recommendation performance, despite its simplicity.
Wei Yang, Masahiro Toyoura and Xiaoyang Mao
Hairstyle Suggestion Using Statistical Learning
Hairstyle is one of the most important features people use to characterize a person's appearance. Whether a hairstyle is suitable or not is said to be closely related to the person's facial shape. This paper proposes a new technique for automatically retrieving a suitable hairstyle from a collection of hairstyle examples by learning the relationship between facial shapes and suitable hairstyles. A method of hair-face image composition utilizing a modern matting technique was also developed to synthesize realistic hairstyle images. The effectiveness of the proposed technique was validated through evaluation experiments.
Svebor Karaman, Jenny Benois-Pineau, Rémi Mégret and Aurélie Bugeau
Multi-Layer Local Graph Words for Object Recognition
In this paper, we propose a new multi-layer structural approach for the task of object based image retrieval. In our work we tackle the problem of structural organization of local features. The structural features we propose are nested multi-layered local graphs built upon sets of SURF feature points with Delaunay triangulation. A Bag-of-Visual-Words (BoVW) framework is applied on these graphs, giving rise to a Bag-of-Graph-Words representation. The multi-layer nature of the descriptors consists in scaling from trivial Delaunay graphs (isolated feature points) by increasing the number of nodes layer by layer up to graphs with the maximal number of nodes. For each layer of graphs its own visual dictionary is built. The experiments conducted on the SIVAL and Caltech-101 data sets reveal that the graph features at different layers exhibit complementary performances on the same content and perform better than the baseline BoVW approach. The combination of all existing layers yields a significant improvement of the object recognition performance compared to single level approaches.
Wu Liu, Tian Xia, Ji Wan, Yongdong Zhang and Jintao Li
RGB-D based Multi-Attribute People Search in Intelligent Visual Surveillance
Searching for people in surveillance videos is a typical task in intelligent visual surveillance (IVS). However, current IVS techniques can hardly handle multi-attribute queries, which are a natural way of finding people in the real world. The challenges arise from the extraction of multiple attributes, which suffers greatly from illumination change, shadow and complicated background in real-world surveillance environments. In this paper, we investigate how these challenges can be addressed when IVS is equipped with RGB-D information obtained by an RGB-D camera. With the RGB-D information, we propose methods that accurately and robustly segment human regions and extract three groups of attributes: biometrical attributes, appearance attributes and motion attributes. Furthermore, we introduce a novel IVS system which is capable of handling multi-attribute queries for searching for people in surveillance videos. Experimental evaluations demonstrate the effectiveness of the proposed method and system, and also the promising applications of bringing RGB-D information into IVS.
Masahiro Toyoura, Mamoru Kunihiro and Xiaoyang Mao
Film Comic Reflecting Camera-Works
We propose a novel technique for automatically creating film comics reflecting the camera-works of an original movie. Camera-works are one of the most important effects contributing to the mise en scene of the movie. A skilled director can use camera-works dexterously for drawing the attention of audiences, representing sentiments, and giving a change of pace in the movie. When creating film comics, camera-works are detected from the original movie, and mapped to panels and layouts of special comic styles. The technique is called the grammar of manga. A new algorithm is presented for automatically tiling the stylized panels into comic pages based on the grammar of manga. The results of our subject study show that reflecting camera-works in film comics enables stories to be presented in a more readable, vivid and immersive way.
Wanxia Lin, Tong Lu and Feng Su
A Novel Multi-modal Integration and Propagation Model for Cross-Media Information Retrieval
In this paper, we present a novel PLSA-based aspect model and divide cross-media retrieval into two parts: multi-modal integration and correlation propagation. We first use multivariate Gaussian distributions to model continuous quantities in PLSA, avoiding information loss between feature-instance versus real-world matching. Multi-modal correlations are learned in an asymmetrical manner, giving better control of the respective influence of each modality in the latent space. Then we propose a new propagation pattern to refine multi-modal correlations by efficiently exploiting the complementarity of multiple modalities. Experimental results demonstrate that our method is accurate and robust for cross-media information retrieval.
Ehsan Younessian and Deepu Rajan
Scene Signatures for Unconstrained News Video Stories
In this paper we propose a novel video signature called scene signature, which is applicable to a variety of tasks in the unconstrained news video domain. The same news stories that originate from different channels appear with different layouts, lengths, temporal order, additional visual content etc. This can significantly affect the effectiveness and robustness of existing video signatures. In this paper we aim to represent the visual cues appearing in a news story scene in a compact and comprehensive manner in the context of scene signature. To this end we detect Near Duplicate Keyframe clusters within a news story and then for each of them we generate an initial scene signature including the most informative mutual and distinctive visual cues. A scene signature is defined as a collection of SIFT descriptors. Compared to conventional keypoint-trajectory-based signatures, we take the co-occurrence of SIFT keypoints into account. This is beneficial especially when we deal with picture-in-picture, split screen, or long shots with significant object/camera movement. Next, through three refinement steps on the initial scene signature, we try to shorten the semantic gap to obtain a final scene signature which is more compact and semantically meaningful. The experimental results confirm the efficiency as well as the robustness and uniqueness of our proposed scene signature compared to other global and local video signatures.
Yin-Tzu Lin, Shuen-Huei Guan, Yuan-Chang Yao, Wen-Huang Cheng and Ja-Ling Wu
U-Drumwave: An Interactive Performance System for Drumming
In this paper, we share our experience of applying modern multimedia technologies to a traditional performing art in a drumming performance project, U-Drumwave. By deploying an interactive system on the drumming stage, the audience will see augmented visual objects moving on the stage in accord with the performer's drumming rhythms. The creation and display of the visual objects are integrated with the concept of the story intensity curve in order to vary the perceptual degree of tension given to the audience during the performance.
Zhiguo Yang, Yuxin Peng and Jianguo Xiao
Visual Vocabulary Optimization with Spatial Context for Image Annotation and Classification
In this paper, we propose a new approach to visual vocabulary optimization with spatial context, which contains important spatial information that has not been fully exploited. The novelty of our method mainly lies in two aspects: when spatial information is considered, and how spatial information is used. For the first aspect, the existing methods generally consider spatial information after the visual vocabulary is built, while we employ the spatial information in the construction of the visual vocabulary, to produce a more accurate visual vocabulary. For the second aspect, unlike existing methods, which use spatial information to re-rank the original retrieval results, to generate local keypoint groups such as visual phrases, or in the spatial pyramid matching kernel, we propose a novel method that employs spatial information to optimize the visual vocabulary. Instead of simply assigning keypoints to the nearest cluster centers, we also take the spatial context of keypoints into consideration. With the proposed approach, a more accurate visual vocabulary can be generated, and the evaluation results can be improved in both image annotation and classification tasks. Experiments on the widely used 15-scenes dataset demonstrate the effectiveness of the proposed approach.
Xiaohua Zhai, Yuxin Peng and Jianguo Xiao
Effective Heterogeneous Similarity Measure with Nearest Neighbors for Cross-Media Retrieval
Emerging multimedia content, including images and texts, is always jointly utilized to describe the same semantics. As a result, cross-media retrieval, which retrieves results that share the semantics of the query but have different media types, becomes increasingly important. In this paper, we propose a novel heterogeneous similarity measure with nearest neighbors (HSNN). Unlike traditional similarity measures, which are limited to a homogeneous feature space, HSNN can compute the similarity between media objects with different media types. The heterogeneous similarity is obtained by computing the probability of two media objects belonging to the same semantic category. The probability is obtained by analyzing the homogeneous nearest neighbors of each media object. HSNN is flexible so that any traditional similarity measure can be incorporated, which is further regarded as a weak ranker. An effective ranking model is learned from multiple weak rankers through AdaRank for cross-media retrieval. Experiments on the Wikipedia dataset show the effectiveness of the proposed approach, compared with state-of-the-art methods. The cross-media retrieval is also shown to outperform image retrieval systems on a unimedia retrieval task.
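One way to read the idea of a same-category probability built from homogeneous nearest neighbors is sketched below. This is an illustrative interpretation, not the authors' formulation: the k-nearest-neighbor label frequencies used as category posteriors, the independence assumption in the final sum, and all function names are assumptions.

```python
import math
from collections import Counter


def category_posterior(query, labeled_items, k, categories):
    """Estimate P(category | query) from the labels of the k nearest
    neighbors inside the query's own (homogeneous) feature space.

    labeled_items is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(labeled_items, key=lambda item: math.dist(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: counts[c] / len(nearest) for c in categories}


def heterogeneous_similarity(p_a, p_b):
    """Probability that two objects share a category, treating their
    category assignments as independent given the two posteriors."""
    return sum(p_a[c] * p_b[c] for c in p_a)
```

Because each posterior is computed within one media type, an image and a text can be compared even though their feature vectors have different dimensions.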
Hichem Bannour and Hudelot Céline
Building Semantic Hierarchies Faithful to Image Semantics
Achieving high level semantic interpretation of images is necessary to match user expectations in image retrieval systems. Effective tools are then required to allow a precise semantic description of images and allow at the same time a good interpretation of them. This paper proposes a new image-semantic measure, named "Semantico-Visual Relatedness of Concepts" (SVRC), to estimate the semantic similarity between concepts. The proposed measure incorporates visual, conceptual and contextual information to provide a measure which is more meaningful and more representative of image semantics. We also propose a new methodology to automatically build a semantic hierarchy suitable for the purpose of image annotation and/or classification. The construction is based on the SVRC measure and on a new heuristic, named TRUST-ME, which connects concepts with higher relatedness until the final hierarchy is built. The built hierarchy explicitly encodes a general-to-specific relationship between concepts and therefore provides a semantic structure to concepts which facilitates the semantic interpretation of images. Our experiments showed that the use of the constructed semantic hierarchies as a hierarchical classification framework provides better image annotation.
Claudiu Tanase and Bernard Merialdo
Efficient Spatio-temporal Edge Descriptor
Concept-based video retrieval is a developing area of current multimedia content analysis research. The use of spatio-temporal descriptors in content-based video retrieval has always seemed like a promising way to bridge the semantic gap in ways that typical visual retrieval methods cannot. In this paper we propose a spatio-temporal descriptor called ST-MP7EH, which can address some of the challenges encountered in practical systems, and we present our experimental results in support of our participation at TRECVid 2011 Semantic Indexing. This descriptor combines the MPEG-7 Edge Histogram descriptor with motion information and is designed to be computationally efficient, scalable and highly parallel. We show that our descriptor performs well in SVM classification compared to a baseline spatio-temporal descriptor, which is inspired by some of the state-of-the-art systems that make the top lists of TRECVid. We highlight the importance of the temporal component by comparing to the initial edge histogram descriptor, and the potential of feature fusion with other classifiers.
Christian Beecks and Thomas Seidl
On Stability of Adaptive Similarity Measures for Content-Based Image Retrieval
Retrieving similar images is a challenging task for today's content-based retrieval systems. Aiming at high retrieval performance, these systems frequently capture the user's notion of similarity through expressive image models and adaptive similarity measures, which try to approximate the individual user-dependent notion of similarity as closely as possible. As image models appearing on the query side can differ significantly in quality from those stored in the multimedia database, similarity measures have to be robust against these individual changes in quality in order to maintain high retrieval performance. In order to evaluate the robustness of similarity measures, we introduce the general concept of the stability of a similarity measure with respect to query modifying transformations describing the change in quality on the query side. In addition, we include a comparison of the stability of the major state-of-the-art adaptive similarity measures based on different benchmark image databases.
Britta Meixner, Michael Ettengruber and Harald Kosch
Challenges in Storing Multimedia Data for the Future – an Overview
Preserving access to multimedia data over time may prove to be the most challenging task in all things concerning multimedia. Preserving access to data from previous technical generations has always been a rather difficult endeavor, but multimedia data, with its almost endless succession of encoding and compression algorithms, raises the stakes even higher, especially when data must be migrated not just from the previous generation to a current technology but from decades ago. The time to start thinking about and developing techniques and methodologies to keep data accessible over time is right now, because the first challenges are becoming visible on the horizon: how can the ever-growing (indeed exponentially growing) amounts of data be archived without major manual intervention as soon as a storage medium runs out of free space? Is there such a thing as endless storage capacity? Would an endless storage capacity really help? Or do we need totally new ways of thinking in regard to archiving digital data for the future?
Maia Zaharieva and Christian Breiteneder
Recurring Element Detection in Movies
Recurring elements in movies contribute significantly to the development of narration, themes, or even mood. The detection of such elements is impeded by the large variance of their visual appearance and usually relies on the experience and attentiveness of the viewer. In this paper, we present a new approach for the automated detection of recurring elements in movies such as motifs and main characters. Performed experiments show the reliability of the algorithm and its potential for automated high-level film analysis.
Ardhendu Behera, Anthony G Cohn and David C Hogg
Workflow Activity Monitoring using the Dynamics of Pair-wise Qualitative Spatial Relations
We present a method for real-time monitoring of workflows in a constrained environment. The monitoring system should not only be able to recognise the current step but also provide instructions about the possible next steps in an ongoing workflow. In this paper, we address this issue by using a robust approach (HMM-pLSA) which relies on a Hidden Markov Model (HMM) and a generative model, probabilistic Latent Semantic Analysis (pLSA). The proposed method exploits the dynamics of the qualitative spatial relation between pairs of objects involved in a workflow. The novel view-invariant relational feature is based on distance and its rate of change in 3D space. The multiple pair-wise relational features are represented in a multi-dimensional relational state space using an HMM. The workflow monitoring task is inferred from the relational state space using pLSA on datasets, which consist of workflow activities such as 'hammering nails' and 'driving screws'. The proposed approach is evaluated in both 'off-line' (complete observation) and 'on-line' (partial observation) settings. The evaluation of the novel approach demonstrates the robustness of the technique in overcoming issues of noise arising from object tracking and occlusions.
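The relational feature described above (distance between two objects and its rate of change in 3D) can be illustrated with a short sketch. The track representation, the backward finite difference, and the function name are assumptions made for this example; the paper's actual feature extraction and qualitative discretization are not reproduced here.

```python
import math


def relational_features(track_a, track_b, fps):
    """Per-frame (distance, rate of change) for two 3D object tracks.

    track_a and track_b are equal-length lists of (x, y, z) positions.
    The rate is a backward finite difference in distance units per second.
    The first frame has no predecessor, so its rate is set to 0.0.
    """
    distances = [math.dist(a, b) for a, b in zip(track_a, track_b)]
    rates = [0.0] + [(d1 - d0) * fps for d0, d1 in zip(distances, distances[1:])]
    return list(zip(distances, rates))
```

Since Euclidean distance does not depend on the camera viewpoint once positions are expressed in 3D, a feature of this kind is view-invariant, which matches the property claimed in the abstract.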
Hazem Wannous, Vladislavs Dovgalecs, Rémi Mégret and Mohamed Daoudi
Place Recognition via 3D Modeling for Personal Activity Lifelog using Wearable Camera
In this paper, a method for location recognition in a visual lifelog is presented. Its motivation is the detection of activity related places within an indoor environment to facilitate navigation in the lifelog. It takes advantage of a camera mounted on the shoulder, which is primarily designed for the behavioral analysis of Instrumental Activities of Daily Living (IADL). The proposed approach provides an automatic indexing of the content stream, based on the presence in specific 3D places related to instrumental activities. It relies on 3D models of the places of interest that are built using a lightweight semi-supervised approach. Performance evaluation on real data shows the potential of this approach compared to 2D-only recognition.
Markus Mühling, Ralph Ewerth, Jun Zhou and Bernd Freisleben
Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning
State-of-the-art systems for video concept detection mainly rely on visual features, e.g., on the bag of visual words representation in conjunction with scale-invariant feature transform descriptors. Some previous approaches have also included audio features in video concept detection systems, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. This vocabulary is used to describe video shots via a histogram of auditory words, and audio models are learned using support vector machines (SVM). Furthermore, the resulting BoAW features are combined with visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: First, it is demonstrated for SVM that a χ²-kernel yields almost 45% improvement compared to a radial basis function kernel. Second, the system using the BoAW features and a χ²-kernel is superior to a state-of-the-art approach relying on audio features and probabilistic latent semantic indexing. Finally, it is shown that an early fusion approach for visual and auditory bag of words features degrades video concept detection performance, whereas the combination of BoAW features with state-of-the-art visual features via MKL yields a relative improvement of more than 7% in terms of mean average precision.
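As an illustration of the histogram representation and the χ²-kernel mentioned above, here is a minimal sketch. It assumes MFCC frames have already been quantized to word indices, and it uses one common form of the exponential χ²-kernel; the exact kernel variant, normalization, and `gamma` value used in the paper are not stated in the abstract, so these are assumptions.

```python
import math


def bag_of_words_histogram(word_ids, vocabulary_size):
    """L1-normalized histogram of quantized (auditory) word indices."""
    counts = [0] * vocabulary_size
    for w in word_ids:
        counts[w] += 1
    total = sum(counts) or 1  # avoid division by zero for an empty shot
    return [c / total for c in counts]


def chi2_distance(h1, h2):
    """Chi-square distance between two non-negative histograms.

    Bins that are empty in both histograms contribute nothing.
    """
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel: exp(-gamma * chi2_distance(h1, h2))."""
    return math.exp(-gamma * chi2_distance(h1, h2))
```

A kernel matrix built from `chi2_kernel` over all pairs of shots could then be passed to an SVM that accepts precomputed kernels.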
Naeem Akhter
Fusing template and point information to track planes with large interframe displacement
This paper presents a hybrid approach that fuses template and keypoint based tracking to track the pose of planar textured targets with large interframe displacement. The fusion is made such that it adds to the accuracy and convergence of template based tracking without involving feature selection, introducing a pose drift strategy, or incorporating a sophisticated prediction or motion model. The approach is not only robust against illumination changes and partial occlusion, but also free from offline pose learning and prior knowledge about the background, which makes it flexible in adapting to changes in the scene.
Bahjat Safadi, Stéphane Ayache and Georges Quénot
Active Cleaning for Video Corpus Annotation
In this paper, we describe the active cleaning approach that was used to complement the active learning approach in the TRECVID collaborative annotation. It consists in using a classification system to select the most informative samples for multiple annotations, in order to improve the quality and the reliability of the annotations. We have evaluated the actual impact of the active cleaning approach on the TRECVID 2007 collection. The evaluations were conducted using complete annotations that were collected from different resources, including the TRECVID collaborative annotations and the MCG-ICT-CAS annotations. In our experiments, a significant improvement of the annotation quality was observed when applying the cleaning by Cross-Val strategy, which selects the samples to be re-annotated. Experiments show that higher performance can be reached with minimum double annotations of 10% of negative samples or 5% of all the annotated samples, retrieved by the proposed cleaning strategy using cross-validation. It has been shown that, with an appropriate strategy, using a small fraction of the annotations for cleaning improves the system's performance much more than using the same fraction for adding more annotations.
Anh-Phuong Ta, Mathieu Ben and Guillame Gravier
Improving cluster selection and event modeling in unsupervised mining for automatic audiovisual video structuring
Can we discover audio-visually consistent events from videos in a totally unsupervised manner? And how can videos of different genres be mined? In this paper we present our new results in automatically discovering audio-visual events. A new measure is proposed to select audio-visually consistent elements from the two dendrograms respectively representing hierarchical clustering results for the audio and visual modalities. Each selected element corresponds to a candidate event. In order to construct a model for each event, each candidate event is represented as a group of clusters, and a voting mechanism is applied to select training examples for discriminative classifiers. Finally, the trained model is tested on the entire video to select video segments that belong to the event discovered. Experimental results on different and challenging genres of videos show the effectiveness of our approach.
Kong-Wah Wan, Ah-Hwee Tan, Joo-Hwee Lim and Liang-Tien Chia
Topic Based Query Suggestions for Video Search
Query suggestion is an assistive technology mechanism commonly used in search engines to enable a user to formulate their search queries by predicting or completing the next few query words that the user is likely to type. In most implementations, the suggestions are mined from a query log and use some simple measure of query similarity such as query frequency or lexicographical matching. In this paper, we propose an alternative method of presenting query suggestions by their thematic topics. Our method adopts a document-centric approach to mine topics in the corpus, and does not require the availability of a query log. The heart of our algorithm is a probabilistic topic model that assumes that topics are multinomial distributions of words, and jointly learns the co-occurrence of textual words and the visual information in the video stream. Empirical results show that this alternate way of organizing query suggestions can better elucidate the high level query intent, and more effectively help a user meet their information need.
Zhenzhong Lan, Lei Bao, Shoou-I Yu, Wei Liu and Alexander Hauptmann
Double Fusion for Multimedia Event Detection
Multimedia event detection (MED) is a multimedia retrieval task where sample videos for events are given. It aims to detect multimedia events of interest in given videos, which requires a combination of multiple complementary features. Generally, early fusion and late fusion are two popular combination strategies. The former fuses features before performing classification and the latter combines the outputs of classifiers trained on different features. In this paper, we introduce a new fusion scheme named double fusion, which combines early fusion and late fusion to incorporate their advantages. Results are reported on TRECVID MED 2010 and 2011. For MED 2010 we get a mean minimal normalized detection cost (MNDC) of 0.49, which exceeds the state-of-the-art performance by more than 12 percent.
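The two fusion strategies named in this abstract, and one plausible way of combining them, can be sketched as follows. This is only an illustrative reading of the abstract, not the authors' implementation: the nearest-centroid scorer, the plain averaging of scores, and all function names are assumptions introduced here.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train_centroid_scorer(X, y):
    """Toy stand-in classifier (assumption): score > 0 means closer to the positive class."""
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    def score(x):
        return math.dist(x, neg) - math.dist(x, pos)
    return score

def early_fuse(*feature_vectors):
    """Early fusion: concatenate feature vectors before classification."""
    fused = []
    for v in feature_vectors:
        fused.extend(v)
    return fused

def late_fuse(scores):
    """Late fusion: combine classifier outputs (here by a plain average)."""
    return sum(scores) / len(scores)

def train_double_fusion(feature_sets, y):
    """feature_sets: one list of per-sample vectors for each feature type."""
    # One classifier per individual feature type.
    single = [train_centroid_scorer(X, y) for X in feature_sets]
    # One classifier on the early-fused (concatenated) representation.
    X_early = [early_fuse(*vecs) for vecs in zip(*feature_sets)]
    early = train_centroid_scorer(X_early, y)
    def score(sample_vectors):
        scores = [s(v) for s, v in zip(single, sample_vectors)]
        scores.append(early(early_fuse(*sample_vectors)))
        # Late-fuse the single-feature scores with the early-fusion score.
        return late_fuse(scores)
    return score

# Hypothetical toy data: two feature types, four training videos.
visual = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
audio = [[0.0], [0.2], [1.0], [0.8]]
labels = [0, 0, 1, 1]
double_fusion_score = train_double_fusion([visual, audio], labels)
```

A positive value of `double_fusion_score([[1.0, 1.0], [1.0]])` indicates that the toy sample is judged closer to the positive (event) class.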
Hung-Wei Lin, Min-Chun Hu and Ja-Ling Wu
Gait-Based Action Recognition via Accelerated Minimum Incremental Coding Length Classifier
In this paper, we present a novel human action recognition approach based on the gait energy image (GEI) and the minimum incremental coding length (MICL) classifier. GEIs are extracted from video clips and transformed into vectors as input features, and MICL is employed to classify each GEI. We also use multiple cameras to capture GEIs from different views, and a voting strategy is applied to the MICL classification results to improve the overall system performance. Experimental results show that the proposed approach can achieve approximately 95% accuracy. For practical usage, we also accelerate the classification so that it can be accomplished in a very short time. Moreover, other classification methods are used to classify GEIs, and the experimental results show that MICL is the most suitable classifier for this approach. Besides our recorded action clips, the Weizmann dataset is also used to verify the capability of our approach. The experimental results show that our approach is competitive with other state-of-the-art approaches. In other words, the proposed approach can be integrated as a useful component for detecting events in video surveillance applications.
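The feature extraction and voting steps mentioned above can be sketched as follows, assuming the usual definition of a gait energy image as the pixel-wise average of aligned binary silhouettes over a sequence. The MICL classifier itself is not reproduced here; the function names and the per-view labels are hypothetical.

```python
from collections import Counter

def gait_energy_image(silhouettes):
    """Pixel-wise mean of aligned binary silhouette frames (lists of rows of 0/1)."""
    n = len(silhouettes)
    height = len(silhouettes[0])
    width = len(silhouettes[0][0])
    return [
        [sum(frame[r][c] for frame in silhouettes) / n for c in range(width)]
        for r in range(height)
    ]

def to_feature_vector(gei):
    """Flatten the GEI row by row into a single input vector."""
    return [value for row in gei for value in row]

def majority_vote(labels):
    """Combine the per-camera predictions by picking the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

# Two tiny 2x2 silhouette frames standing in for a real sequence.
frames = [
    [[0, 1], [1, 1]],
    [[0, 0], [1, 1]],
]
gei = gait_energy_image(frames)
vector = to_feature_vector(gei)

# Hypothetical labels predicted independently for three camera views.
action = majority_vote(["walk", "run", "walk"])
```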
Longfei Zhang, Yue Gao, Rongrong Ji, Alexander Hauptmann and Boaz Super
Symbiotic Black-Box Tracker
Many trackers have been proposed for tracking objects individually in previous research. However, it is still difficult to trust any single tracker over a variety of circumstances. Therefore, it is important to estimate how well each tracker performs and to fuse the tracking results. In this paper, we propose a symbiotic black-box tracker (SBB) that learns only from the output of individual trackers, which run in parallel, without any detailed information about these trackers, and chooses the best one to generate the tracking result. In other words, all the individual trackers are considered as black boxes and SBB learns the best combination scheme for all existing tracking results. SBB estimates confidence scores of these trackers, which are then used to weight Gaussian components. The confidence score is estimated based on how well the tracker is currently tracking the target, and how consistent the tracker is with the other trackers. Initially, a frame-to-frame (F2F) intra-tracker prediction procedure estimates the confidence score for each tracker using the relationship between the previous tracking results and the current frame's result. The confidence scores for all trackers in the current frame are further adjusted using Tracker-to-Tracker (T2T) spatial correlation propagation for inter-tracker consistency analysis within the frame. The output of SBB can be the hypothesis of the best tracker with the maximum confidence score. Experiments and comparisons conducted on the "Caremedia" dataset and the "Caviar" dataset demonstrate the effectiveness of the proposed method with superior tracking.
Rui Hu, Stuart James and John Collomosse
Annotated Free-hand Sketches for Video Retrieval using Object Semantics and Motion
We present a novel video retrieval system that accepts annotated free-hand sketches as queries. Existing sketch based video retrieval (SBVR) systems enable the appearance and movements of objects to be searched naturally through pictorial representations. Whilst visually expressive, such systems present an imprecise vehicle for conveying the semantics (e.g. object types) within a scene. Our contribution is to fuse the semantic richness of text with the expressivity of sketch, to create a hybrid 'semantic sketch' based video retrieval system. Trajectory extraction and clustering are applied to pre-process each clip into a video object representation that we augment with object classification and colour information. The result is a system capable of searching videos based on the desired colour, motion path, and semantic labels of the objects present. We evaluate the performance of our system over the TSF dataset of broadcast sports footage.
Nipun Pande, Mayank Jain, Dhawal Kapil and Prithwijit Guha
The Video Face Book
Videos are often characterized by the human participants, who in turn, are identified by their faces. We present a completely unsupervised system to index videos through faces. A multiple face detector-tracker combination bound by a reasoning scheme and operational in both forward and backward directions is used to extract face tracks from individual shots of a shot segmented video. These face tracks collectively form a face log which is filtered further to remove outliers or non-face regions. The face instances from the face log are clustered using a GMM variant to capture the facial appearance modes of different people. A face Track-Cluster-Correspondence-Matrix (TCCM) is formed further to identify the equivalent face tracks. The face track equivalences are analyzed to identify the shot presences of a particular person, thereby indexing the video in terms of faces, which we call the “Video Face Book”.
Andrei Bursuc and Titus Zaharia
Retrieval of Multiple Instances of Objects in Videos
This paper tackles the issue of retrieving different instances of an object of interest within a given video document or in a video database. The principle consists in considering a semi-global image representation based on an over-segmentation of image frames. An aggregation mechanism is then applied in order to group a set of sub-regions into an object similar to the query, under a global similarity criterion. Two different algorithms are proposed. The first one involves a greedy, dynamic region construction method. The second is based on simulated annealing, and aims at determining a global optimum. Experimental results show promising performance, with object detection rates of up to 79%.
Lijuan Zhou and Cathal Gurrin
A Novel Music Retrieval System That Concerns About What You Are Feeling
Music is inherently expressive of emotional meaning and affects the mood of people. In this paper, we present EMIR (Emotional Music Information Retrieval System), which uses latent emotion elements both in music and in non-descriptive queries (NDQs) to detect implicit emotional associations between users and music and thereby enhance Music Information Retrieval (MIR). We try to understand the latent emotional intent of queries via machine learning for emotion classification and compare the performance of emotion detection approaches on different feature sets. For this purpose, we extract music emotion features from lyrics and social tags crawled from the Internet, label some of them for training, model them in a high-dimensional emotion space, and recognize the latent emotion of users by query emotion analysis. The similarity between queries and music is computed by a verified BM25 model.
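For orientation, the standard Okapi BM25 ranking function can be sketched as below. The abstract does not specify which BM25 variant the authors use, so this is the textbook formulation with a smoothed, non-negative IDF; the parameter defaults `k1=1.5` and `b=0.75` and the toy tag documents are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical emotion tags attached to three songs.
songs = [
    ["sad", "rain", "slow"],
    ["happy", "dance", "party"],
    ["sad", "sad", "lonely"],
]
scores = bm25_scores(["sad"], songs)
```

In this toy example the third song, which carries the query tag twice, ranks above the first, and the second song scores zero.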
Manfred Del Fabro and Laszlo Böszörmenyi
Summarization and Presentation of Real-Life Events Using Community-Contributed Content
We present an algorithm for the summarization of social events with community-contributed content from Flickr and YouTube. A clustering algorithm groups content related to the searched event. Date information, GPS coordinates, user ratings and visual features are used to select relevant photos and videos. The composed event summaries are presented in an innovative way with our video browser.
Rene Kaiser, Wolfgang Weiss and Gert Kienast
The FascinatE Production Scripting Engine
In the realm of a format agnostic live event broadcast system, the FascinatE Scripting Engines are software components that automate taking decisions on what is visible and audible at each playout device and prepare the audiovisual content streams for display. Essentially, they act together as a Virtual Director with the production team possibly steering it via a backend user interface. We present an architecture for this real-time system and describe interfaces to other production components. Details of subcomponents of the distributed engine, design decisions and technology choices are discussed.
Marcus Thaler, Rene Kaiser, Werner Bailer and Andreas Kriechbaum
Tracking Persons in Ultra-HD Panoramic Video
We present a demo for person detection and tracking in high-resolution panoramic video streams, obtained from a panoramic camera stitching video streams from 6 HD resolution tiles. The AV content analysis uses a CUDA accelerated feature point tracker, a blob detector and a CUDA HOG person detector, which are used for region tracking in each of the tiles. The results of each tile are then fused for the entire panorama to track persons over multiple tiles.
Zhengwei Qiu, Cathal Gurrin, Aiden Doherty and Alan Smeaton
A real-time Life Experience Logging Tool
E-memories attempt to digitally encode all life experiences in an archive for later search and real-time recommendation. In this paper we describe a prototype real-time e-memory gathering infrastructure that uses smartphones to gather a semantically rich e-memory.
Duan-Yu Chen and Chia-Hsun Chen
Visual-Based Spatiotemporal Analysis for Nighttime Vehicle Braking Event Detection
In this paper, we propose a novel visual-based approach that detects brake lights at night by analyzing the tail lights with three-dimensional Nakagami imaging, which provides robust information about brake lights. Instead of using knowledge of heuristic features, such as the symmetry and position of the rear-facing vehicle, its size and so forth, we focus on extracting invariant features by modeling the scattering of brake lights, and can therefore conduct the detection process in a part-based manner. Experiments on an extensive dataset show that our proposed system can effectively detect vehicle braking under different lighting and traffic conditions, and thus prove its feasibility in real-world environments.
Jun-Wei Hsieh, Fu-Jiang Fang, Guo-Jin Lin and Yu-Shi Wang
Template Matching and Monte Carlo Markova Chain for People Counting under Occlusions
It is challenging to count and analyze people in crowds due to changes in lighting, occlusions, shadows, backgrounds, and weather conditions. The occlusion problem in particular remains ill-posed. To deal with the occlusion problem, the MCMC (Monte Carlo Markova Chain) scheme is used in this paper to estimate all possible pedestrian positions across different frames. However, it requires good initial head positions for parameter searching and people counting. Thus, an intelligent head-shoulder-region detector is developed for detecting all possible pedestrian candidates from videos. One key problem in head-shoulder detection is that the feature contrast between the objects and their background should be larger. To tackle this problem, a Linear Discriminant Analysis (LDA) approach is used to enhance the boundaries between objects and features. Three contributions are made in this paper: (1) an intelligent head-shoulder-region detector; (2) people detection under occlusions; (3) an integrated people counting system using LDA. Experimental results demonstrate the superiority of the proposed method in people detection and counting.
Jing-Ming Guo, Yun-Fu Liu, Chao-Yu Lin, Wen-Jan Lin and Thanh-Nam Le
Automatic Lips Recognition System Using Edge Detection and Color Mapping Methods
Biometrics has been widely used because of the invariance of the adopted features. Among various biometric approaches, face recognition has the peculiar property that it can perform recognition from a remote distance. Yet, the recognition rate decreases when part of the face is covered. This study presents a lip-based recognition system that uses Edge Detection (ED), Color Mapping (CM), and two different feature extraction algorithms to cope with various environments. Moreover, eight lip points are located for calculating six distance features, which have been proven to be effective for classifying individuals. As documented in the experimental results, the proposed system can provide a reliable recognition rate and can be considered a competitive biometric scheme for practical applications.