
Dissertations / Theses on the topic 'Computer vision and multimedia computation'



Consult the top 47 dissertations / theses for your research on the topic 'Computer vision and multimedia computation.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Gong, Shaogang. "Parallel computation of visual motion." Thesis, University of Oxford, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.238149.

2

Gavin, Andrew S. (Andrew Scott). "Low computation vision-based navigation for mobile robots." Thesis, Massachusetts Institute of Technology, 1994. http://hdl.handle.net/1721.1/38006.

3

Bryant, Bobby. "A computer-based multimedia prototype for night vision goggles." Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1994. http://handle.dtic.mil/100.2/ADA286208.

Abstract:
Thesis (M.S. in Information Technology Management), Naval Postgraduate School, September 1994.
Thesis advisor(s): Kishore Sengupta, Alice Crawford. Bibliography: p. 35. Also available online.
4

Bryant, Bobby. "A computer-based multimedia prototype for night vision goggles." Thesis, Monterey, California. Naval Postgraduate School, 1994. http://hdl.handle.net/10945/30923.

Abstract:
Naval aviators who employ night vision goggles (NVG) face additional risks during nighttime operations. In an effort to reduce these risks, increased training with NVGs is suggested. Our goal was to design a computer-based, interactive multimedia system that would assist in the training of pilots who use NVGs. This thesis details the methods and techniques used in the development of the NVG multimedia prototype. It describes which hardware components and software applications were utilized, as well as how the prototype was developed. Several facets of multimedia technology (sound, animation, video and three-dimensional graphics) have been incorporated into the interactive prototype. For a more robust successive prototype, recommendations are submitted for future enhancements, including alternative methodologies as well as expanded interactions. Keywords: multimedia, computer aided instruction, night vision goggles.
5

Sahiner, Ali Vahit. "A computation model for parallelism : self-adapting parallel servers." Thesis, University of Westminster, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.305872.

6

Liu, Jianguo (劉建國). "Fast computation of moments with applications to transforms." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1996. http://hub.hku.hk/bib/B31235086.

7

Liu, Jianguo. "Fast computation of moments with applications to transforms /." Hong Kong : University of Hong Kong, 1996. http://sunzi.lib.hku.hk/hkuto/record.jsp?B17664986.

8

Battiti, Roberto. "Multiscale methods, parallel computation, and neural networks for real-time computer vision." Diss., Pasadena, Calif. : California Institute of Technology, 1990. http://resolver.caltech.edu/CaltechETD:etd-06072007-074441.

9

Hsiao, Hsu-Feng. "Multimedia streaming congestion control over heterogeneous networks : from distributed computation and end-to-end perspectives." Thesis, University of Washington (UW restricted), 2005. http://hdl.handle.net/1773/5946.

10

Nóbrega, Rui Pedro da Silva. "Interactive acquisition of spatial information from images for multimedia applications." Doctoral thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11079.

Abstract:
Dissertation presented to obtain the degree of Doctor in Informatics.
This dissertation addresses the problem of creating interactive mixed reality applications where virtual objects interact in a real-world scenario. These scenarios are intended to be captured by the users with cameras. In other words, the goal is to produce applications where virtual objects are introduced into photographs taken by the users. This is relevant for creating games and architectural and space planning applications that interact with visual elements in the images such as walls, floors and empty spaces. Introducing virtual objects into photographs or video sequences presents several challenges, such as pose estimation and visually correct interaction with the boundaries of such objects. Furthermore, the introduced virtual objects should be interactive and respond to the real physical environments. The proposed detection system is semi-automatic and thus depends partially on the user to obtain the elements it needs. This operation should be simple enough to accommodate the needs of a non-expert user. The system analyzes a photo captured by the user and detects high-level features such as vanishing points, floor and scene orientation. Using these features, it is possible to create virtual mixed and augmented reality applications where the user takes one or more photos of a certain place and interactively introduces virtual objects or elements that blend with the picture in real time. This document discusses the computer vision, computer graphics and human-computer interaction techniques required to acquire images and information about the scenario involving the user. To demonstrate the framework and the proposed solutions, several proof-of-concept projects are presented and studied. Additionally, to validate the solution, several system tests are described, and each case-study interface was the subject of different user studies.
Fundação para a Ciência e Tecnologia - research grant SFRH/BD/47511/2008
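The abstract above mentions detecting high-level features such as vanishing points. As a rough illustration of that step, the following Python/OpenCV sketch intersects pairs of detected line segments and takes a robust central estimate; it is an assumption-laden simplification, not the author's method:

```python
import cv2
import numpy as np

def vanishing_point(image_bgr):
    """Rough vanishing-point estimate: intersect pairs of detected line
    segments and take the median intersection. A sketch, not a robust detector."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                               minLineLength=40, maxLineGap=5)
    if segments is None:
        return None
    # Represent each segment as a homogeneous line l = p1 x p2.
    lines = [np.cross([x1, y1, 1.0], [x2, y2, 1.0])
             for x1, y1, x2, y2 in segments[:, 0]]
    points = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            p = np.cross(lines[i], lines[j])  # intersection in homogeneous coords
            if abs(p[2]) > 1e-6:
                points.append(p[:2] / p[2])
    return np.median(np.array(points), axis=0) if points else None
```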
11

Muñiz, Pablo E. (Muñiz Aponte). "Detection of launch frame in long jump videos using computer vision and discreet computation." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123277.

Abstract:
Thesis: S.B., Massachusetts Institute of Technology, Department of Mechanical Engineering, 2019
Cataloged from PDF version of thesis.
Includes bibliographical references (page 44).
Pose estimation, a computer vision technique, can be used to develop a quantitative feedback training tool for long jumping. Key performance indicators (KPIs) such as launch velocity would allow a long jumping athlete to optimize their technique while training. However, these KPIs require prior knowledge of when the athlete jumped, referred to as the launch frame in the context of videos and computer vision. Thus, an algorithm for estimating the launch frame was developed using the OpenPose Demo and Matlab. The algorithm estimates the launch frame to within 0.8±0.91 frames. Implementing the algorithm in a training tool would give an athlete real-time, quantitative feedback from a video. This process of developing an algorithm to flag an event can be used in other sports as well, especially with the rise of KPIs in the sports industry (e.g. launch angle and velocity in baseball).
by Pablo E. Muniz. S.B., Massachusetts Institute of Technology, Department of Mechanical Engineering.
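The thesis's algorithm is not reproduced here, but the underlying idea — find the frame where the take-off foot leaves the ground — can be sketched in a few lines of Python; `ankle_y` is a hypothetical per-frame keypoint track such as OpenPose would provide:

```python
import numpy as np

def estimate_launch_frame(ankle_y, vel_thresh=-2.0):
    """Estimate the launch frame from the take-off ankle's vertical trajectory.

    ankle_y: per-frame vertical pixel coordinate of the take-off ankle
             (image y grows downward, so the jump shows up as a strongly
             negative frame-to-frame velocity).
    vel_thresh: upward-velocity threshold in pixels/frame (negative = upward).
    """
    y = np.asarray(ankle_y, dtype=float)
    y_smooth = np.convolve(y, np.ones(5) / 5.0, mode="same")  # suppress jitter
    velocity = np.gradient(y_smooth)                          # pixels per frame
    candidates = np.flatnonzero(velocity < vel_thresh)
    return int(candidates[0]) if candidates.size else None
```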
12

Kaloskampis, Ioannis. "Recognition of complex human activities in multimedia streams using machine learning and computer vision." Thesis, Cardiff University, 2013. http://orca.cf.ac.uk/59377/.

Abstract:
Modelling human activities observed in multimedia streams as temporal sequences of their constituent actions has been the object of much research effort in recent years. However, most of this work concentrates on tasks where the action vocabulary is relatively small and/or each activity can be performed in a limited number of ways. In this thesis, a novel and robust framework for modelling and analysing composite, prolonged activities arising in tasks which can be effectively executed in a variety of ways is proposed. Additionally, the proposed framework is designed to handle cognitive tasks, which cannot be captured using conventional types of sensors. It is shown that the proposed methodology is able to efficiently analyse and recognise complex activities arising in such tasks and also detect potential errors in their execution. To achieve this, a novel activity classification method is introduced, comprising a feature selection stage based on the novel Key Actions Discovery method and a classification stage based on the combination of Random Forests and Hierarchical Hidden Markov Models. Experimental results captured in several scenarios arising from real-life applications, including a novel application to a bridge design problem, show that the proposed framework offers higher classification accuracy compared to current activity identification schemes.
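A toy version of such a classification stage can be put together with scikit-learn; the sketch below uses a Random Forest over per-frame feature vectors, with the Hierarchical Hidden Markov Model simplified to majority-vote temporal smoothing (an illustrative stand-in, not the thesis's method):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_action_classifier(X, y, n_trees=200):
    """Per-frame action classifier; X is (n_frames, n_features)."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X, y)
    return clf

def smooth_predictions(labels, window=15):
    """Majority-vote temporal smoothing -- a crude stand-in for the HHMM."""
    labels = np.asarray(labels)
    out = labels.copy()
    half = window // 2
    for t in range(len(labels)):
        seg = labels[max(0, t - half):t + half + 1]
        vals, counts = np.unique(seg, return_counts=True)
        out[t] = vals[np.argmax(counts)]
    return out
```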
13

Baró i Solé, Xavier. "Probabilistic Darwin Machines: A new approach to develop Evolutionary Object Detection Systems." Doctoral thesis, Universitat Autònoma de Barcelona, 2009. http://hdl.handle.net/10803/5793.

Abstract:
Ever since computers were invented, we have wondered whether they might perform some of our everyday tasks. One of the most studied, and still least understood, problems is the capacity to learn from our experiences and to generalize the knowledge we acquire.
One of these tasks, unconscious for people yet attracting growing interest across scientific fields, is what is known as pattern recognition. Creating models of the world that surrounds us helps us to recognize objects in our environment, predict situations, and identify behaviours. All this information allows us to adapt to and interact with our environment. The adaptability of an individual to its environment has even been related to the number of patterns it is capable of identifying.
When we speak about pattern recognition in the field of Computer Vision, we refer to the ability to identify objects using the information contained in one or more images. Despite the progress of recent years, and the fact that we are now able to obtain "useful" results in real environments, we are still very far from a system with the same capacity for abstraction and the same robustness as the human visual system.
In this thesis, the face detector of Viola and Jones is studied as the paradigmatic and most widespread approach to the object detection problem. First, we analyze the way objects are described using comparisons of illumination values in adjacent zones of the images, and how this information is later organized to create more complex structures. As a result of this study, two weak points are identified in this family of methods: the first concerns the description of the objects, and the second is a limitation of the learning algorithm, which hampers the use of better descriptors.
Describing objects with Haar-like features limits the extracted information to connected regions of the object. If we want to compare distant zones, large contiguous regions must be used, which makes the obtained values depend more on the average lighting of the object than on the regions we actually want to compare. In order to use this type of non-local information, we introduce Dissociated Dipoles into the object detection scheme.
The problem with this type of descriptor is that the great cardinality of the feature set makes the use of Adaboost as the learning algorithm unfeasible. The reason is that the learning process performs an exhaustive search over the space of hypotheses, and since this space is enormous, the time necessary for learning becomes prohibitive. Although we studied this phenomenon in the Viola and Jones approach, it is a general problem for most approaches, where the learning method imposes a limitation on the descriptors that can be used, and therefore on the quality of the object description. To remove this limitation, we introduce evolutionary methods into the Adaboost algorithm and study the effects of this modification on the learning ability. Our experiments conclude that not only does it remain able to learn, but its convergence speed is not significantly altered.
This new Adaboost with evolutionary strategies opens the door to feature sets of arbitrary cardinality, which allows us to investigate new ways of describing our objects, such as the use of Dissociated Dipoles. We first compare the learning ability of this evolutionary Adaboost using Haar-like features and Dissociated Dipoles; the results of this comparison show that the two types of descriptor have similar representation power, although, depending on the problem they are applied to, one adapts somewhat better than the other. With the aim of obtaining a descriptor that shares the strong points of both, we propose a new type of feature, the Weighted Dissociated Dipoles, which combines the robustness of the structure detectors present in Haar-like features with the ability of Dissociated Dipoles to use non-local information. In the experiments we carried out, this new feature set obtains better results in all the problems on which it was compared with Haar-like features and Dissociated Dipoles.
To assess the performance of each method, and to compare the different methods, we use a set of public databases covering face detection, text detection, pedestrian detection and car detection. In addition, our methods are tested on a traffic sign detection problem, over large databases containing both road and urban scenes.
14

Liu, Yixian. "Reasoning scene geometry from single images." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/9131.

Abstract:
Holistic scene understanding is one of the major goals in recent research of computer vision. Most popular recognition algorithms focus on semantic understanding and are incapable of providing the global depth information of the scene structure from the 2D projection of the world. Yet it is obvious that recovery of scene surface layout could be used to help many practical 3D-based applications, including 2D-to-3D movie re-production, robotic navigation, view synthesis, etc. Therefore, we identify scene geometric reasoning as the key problem of scene understanding. This PhD work makes a contribution to the reconstruction problem of 3D shape of scenes from monocular images. We propose an approach to recognise and reconstruct the geometric structure of the scene from a single image. We have investigated several typical scene geometries and built a few corresponding reference models in a hierarchical order for scene representation. The framework is set up based on the analysis of image statistical features and scene geometric features. Correlation is introduced to theoretically integrate these two types of features. Firstly, an image is categorized into one of the reference geometric models using the spatial pattern classification. Then, we estimate the depth profile of the specific scene by proposing an algorithm for adaptive automatic scene reconstruction. This algorithm employs specifically developed reconstruction approaches for different geometric models. The theory and algorithms are instantiated in a system for the scene classification and visualization. The system is able to find the best fit model for most of the images from several benchmark datasets. Our experiments show that un-calibrated low-quality monocular images could be efficiently and realistically reconstructed in simulated 3D space. By our approach, computers could interpret a single still image as its underlying geometry straightforwardly, avoiding usual object occlusion, semantic overlapping and deficiency problems.
15

Cai, Bill Yang. "Applications of deep learning and computer vision in large scale quantification of tree canopy cover and real-time estimation of street parking." Thesis, Massachusetts Institute of Technology, 2018. https://hdl.handle.net/1721.1/122317.

Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Computation for Design and Optimization Program, 2018
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 73-77).
A modern city generates a large volume of digital information, especially in the form of unstructured image and video data. Recent advancements in deep learning techniques have enabled effective learning and estimation of high-level attributes and meaningful features from large digital datasets of images and videos. In my thesis, I explore the potential of applying deep learning to image and video data to quantify urban tree cover and street parking utilization. Large-scale and accurate quantification of urban tree cover is important towards informing government agencies in their public greenery efforts, and useful for modelling and analyzing city ecology and urban heat island effects. We apply state-of-the-art deep learning models, and compare their performance to a previously established benchmark of an unsupervised method.
Our training procedure for deep learning models is novel; we utilize the abundance of openly available and similarly labelled street-level image datasets to pre-train our model. We then perform additional training on a small training dataset consisting of GSV images. We also employ a recently developed method called gradient-weighted class activation map (Grad-CAM) to interpret the features learned by the end-to-end model. The results demonstrate that deep learning models are highly accurate, can be interpretable, and can also be efficient in terms of data-labelling effort and computational resources. Accurate parking quantification would inform developers and municipalities in space allocation and design, while real-time measurements would provide drivers and parking enforcement with information that saves time and resources. We propose an accurate and real-time video system for future Internet of Things (IoT) and smart cities applications.
Using recent developments in deep convolutional neural networks (DCNNs) and a novel intelligent vehicle tracking filter, the proposed system combines information across multiple image frames in a video sequence to remove noise introduced by occlusions and detection failures. We demonstrate that the proposed system achieves higher accuracy than pure image-based instance segmentation, and is comparable in performance to industry benchmark systems that utilize more expensive sensors such as radar. Furthermore, the proposed system can be easily configured for deployment in different parking scenarios, and can provide spatial information beyond traditional binary occupancy statistics.
by Bill Yang Cai. S.M., Massachusetts Institute of Technology, Computation for Design and Optimization Program.
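The intelligent tracking filter described above is more elaborate than what fits here, but its core idea — combining evidence across frames so that occlusions and missed detections do not flip the occupancy state — can be sketched as an exponential moving average with hysteresis; `per_frame_occupancy` is a hypothetical detector output and the thresholds are arbitrary:

```python
import numpy as np

def filter_occupancy(per_frame_occupancy, alpha=0.1, on=0.7, off=0.3):
    """Temporal filtering of noisy per-frame parking-spot detections.

    per_frame_occupancy: sequence of 0/1 detector outputs for one parking spot.
    An exponential moving average with hysteresis suppresses flicker caused by
    occlusions and detection failures (a simplification of the thesis's filter).
    """
    state, ema, out = 0, 0.0, []
    for z in per_frame_occupancy:
        ema = (1 - alpha) * ema + alpha * float(z)
        if state == 0 and ema > on:
            state = 1
        elif state == 1 and ema < off:
            state = 0
        out.append(state)
    return np.array(out)
```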
16

Orriols, Majoral Xavier. "Generative Models for Video Analysis and 3D Range Data Applications." Doctoral thesis, Universitat Autònoma de Barcelona, 2004. http://hdl.handle.net/10803/3037.

Abstract:
The majority of problems in Computer Vision do not involve a direct relation between the stimuli provided by a general-purpose sensor and the corresponding perceptual category; a complex learning task must be involved in order to provide such a connection. In fact, the basic forms of energy, and their possible combinations, are few in number compared to the infinite perceptual categories corresponding to objects, actions, relations among objects, and so on. Two main factors determine the level of difficulty of a specific problem: i) the different levels of information that are employed, and ii) the complexity of the model that is intended to explain the observations.
The choice of an appropriate representation for the data takes on particular relevance when dealing with invariances, since these usually imply that the number of intrinsic degrees of freedom in the data distribution is lower than the number of coordinates used to represent it. Therefore, the decomposition into basic units (model parameters) and the change of representation make it possible to transform a complex problem into a manageable one. This simplification of the estimation problem has to rely on a proper mechanism for combining those primitives in order to give an optimal description of the global complex model. This thesis shows how Latent Variable Models reduce dimensionality, take into account the internal symmetries of a problem, provide a manner of dealing with missing data, and make it possible to predict new observations.
The lines of research of this thesis are directed to the management of multiple data sources. More specifically, this thesis presents a set of new algorithms applied to two different areas in Computer Vision: i) video analysis and summarization, and ii) 3D range data. Both areas have been approached through the Generative Models framework, where similar protocols for representing data have been employed.
17

Rehn, Martin. "Aspects of memory and representation in cortical computation." Doctoral thesis, KTH, Numerisk Analys och Datalogi, NADA, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4161.

Abstract:
In this thesis I take a modular approach to cortical function. I investigate how the cerebral cortex may realise a number of basic computational tasks, within the framework of its generic architecture. I present novel mechanisms for certain assumed computational capabilities of the cerebral cortex, building on the established notions of attractor memory and sparse coding. A sparse binary coding network for generating efficient representations of sensory input is presented. It is demonstrated that this network model well reproduces the simple cell receptive field shapes seen in the primary visual cortex and that its representations are efficient with respect to storage in associative memory. I show how an autoassociative memory, augmented with dynamical synapses, can function as a general sequence learning network. I demonstrate how an abstract attractor memory system may be realised on the microcircuit level -- and how it may be analysed using tools similar to those used experimentally. I outline some predictions from the hypothesis that the macroscopic connectivity of the cortex is optimised for attractor memory function. I also discuss methodological aspects of modelling in computational neuroscience.
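As background to the sparse coding component, the classical formulation seeks codes a minimizing 0.5*||x - D*a||^2 + lam*||a||_1 for a fixed dictionary D. Below is a small numpy sketch using iterative soft-thresholding (ISTA) — a generic optimizer for this objective, not the network dynamics studied in the thesis:

```python
import numpy as np

def sparse_code(D, x, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1.

    D: (n_pixels, n_atoms) dictionary, x: (n_pixels,) input patch.
    """
    step = 1.0 / np.linalg.norm(D, ord=2) ** 2  # 1/Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = a - step * grad
        a = np.sign(a) * np.maximum(np.abs(a) - lam * step, 0.0)  # soft threshold
    return a
```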
18

Silva, João Miguel Ferreira da. "People and object tracking for video annotation." Master's thesis, Faculdade de Ciências e Tecnologia, 2012. http://hdl.handle.net/10362/8953.

Abstract:
Dissertation presented to obtain the degree of Master in Informatics Engineering.
Object tracking is a thoroughly researched problem, with a body of associated literature dating at least as far back as the late 1970s. However, and despite the development of some satisfactory real-time trackers, it has not yet seen widespread use. This is not due to a lack of applications for the technology, since several interesting ones exist. In this document, it is postulated that this status quo is due, at least in part, to a lack of easy to use software libraries supporting object tracking. An overview of the problems associated with object tracking is presented and the process of developing one such library is documented. This discussion includes how to overcome problems like heterogeneities in object representations and requirements for training or initial object position hints. Video annotation is the process of associating data with a video’s content. Associating data with a video has numerous applications, ranging from making large video archives or long videos searchable, to enabling discussion about and augmentation of the video’s content. Object tracking is presented as a valid approach to both automatic and manual video annotation, and the integration of the developed object tracking library into an existing video annotator, running on a tablet computer, is described. The challenges involved in designing an interface to support the association of video annotations with tracked objects in real-time are also discussed. In particular, we discuss our interaction approaches to handle moving object selection on live video, which we have called “Hold and Overlay” and “Hold and Speed Up”. In addition, the results of a set of preliminary tests are reported.
project “TKB – A Transmedia Knowledge Base for contemporary dance” (PTDC/EA /AVP/098220/2008 funded by FCT/MCTES), the UTAustin – Portugal, Digital Media Program (SFRH/BD/42662/2007 FCT/MCTES) and by CITI/DI/FCT/UNL (Pest-OE/EEI/UI0527/2011)
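One of the library-design problems mentioned above — heterogeneous object representations behind a single interface — is commonly solved with a thin abstraction layer. A hypothetical Python sketch (not the actual API of the library developed in this thesis):

```python
from abc import ABC, abstractmethod

class Tracker(ABC):
    """Uniform tracking interface; concrete trackers adapt their own
    object representations (templates, models, ...) to plain bounding boxes."""

    @abstractmethod
    def start(self, frame, bbox):
        """Initialise with a first frame and an (x, y, w, h) box."""

    @abstractmethod
    def update(self, frame):
        """Return (ok, bbox) for the next frame."""

class OpenCVTracker(Tracker):
    """Adapter for OpenCV's tracking module (requires opencv-contrib-python)."""

    def __init__(self, factory):
        self._tracker = factory()  # e.g. cv2.TrackerKCF_create

    def start(self, frame, bbox):
        self._tracker.init(frame, bbox)

    def update(self, frame):
        return self._tracker.update(frame)
```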
19

Anistratov, Pavel. "Computation of Autonomous Safety Maneuvers Using Segmentation and Optimization." Licentiate thesis, Linköpings universitet, Fordonssystem, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-162164.

Abstract:
This thesis studies motion planning for future autonomous vehicles, with the main focus on passenger cars. By having automatic steering and braking together with information about the environment, such as other participants in the traffic or obstacles, it would be possible to perform autonomous maneuvers while taking limitations of the vehicle and road-tire interaction into account. Motion planning is performed to find such maneuvers that bring the vehicle from the current state to a desired future state, here by formulating the motion-planning problem as an optimal control problem. There are a number of challenges for such an approach to motion planning; some of them are how to formulate the criterion in the motion planning (the objective function in the corresponding optimal control problem), and how to make the solution of motion-planning problems efficient enough to be useful in online applications. These challenges are addressed in this thesis. As a criterion for motion-planning problems of passenger vehicles on double-lane roads, the use of a lane-deviation penalty function is investigated, capturing the observation that it is dangerous to drive in the opposing lane but safe to drive in the original lane after the obstacle. The penalty function is augmented with certain additional terms to address also the recovery behavior of the vehicle. The resulting formulation is shown to provide efficient and steady maneuvers and gives a lower time in the opposing lane compared to other objective functions. Under varying parameters of the scenario formulation, the resulting maneuvers change in a way that exhibits structured characteristics. As an approach to improving the efficiency of computations for the motion-planning problem, segmenting the motion planning of the full maneuver into several smaller maneuvers is investigated. A way to extract segments is considered from a vehicle dynamics point of view, based on extrema of the vehicle orientation and the yaw rate. The segmentation points determined using this approach are observed to allow efficient splitting of the optimal control problem for the full maneuver into subproblems. Having a method to segment maneuvers, this thesis further studies methods to allow parallel computation of these maneuvers. One investigated method is based on Lagrangian relaxation and dual decomposition. Smaller subproblems are formulated, which are governed by solving a low-complexity coordination problem. Lagrangian relaxation is performed on a subset of the dynamic constraints at the segmentation points, while the remaining variables are predicted. The prediction is possible because of the observed structured characteristics resulting from the used lane-deviation penalty function. An alternative approach is based on the alternating augmented Lagrangian method. Augmentation of the Lagrangian allows relaxation to be applied to all dynamic constraints at the segmentation points, and the alternating approach makes it possible to decompose the full problem into subproblems and coordinate their solutions by analytically solving an overall coordination problem. The presented decomposition methods allow computation of maneuvers with high correspondence and lower computational times compared to the results obtained for solving the full maneuver in one step.
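In symbols, the segmentation idea amounts to splitting the full optimal control problem into segment subproblems that are coupled only through continuity constraints at the segmentation points, and then relaxing those couplings. A schematic formulation (the notation here is ours, not the thesis's):

```latex
\min_{x_i(\cdot),\,u_i(\cdot)} \; \sum_{i=1}^{N} J_i(x_i, u_i)
\quad \text{s.t.} \quad \dot{x}_i = f(x_i, u_i), \qquad
x_i(t_i) = x_{i+1}(t_i), \quad i = 1,\dots,N-1,
```

where the continuity constraints at the segmentation times are relaxed in the augmented Lagrangian

```latex
\mathcal{L}_\rho = \sum_{i=1}^{N} J_i(x_i, u_i)
 + \sum_{i=1}^{N-1} \left( \lambda_i^{\top} r_i
 + \tfrac{\rho}{2}\,\lVert r_i \rVert_2^2 \right),
\qquad r_i = x_i(t_i) - x_{i+1}(t_i),
```

so that each subproblem can be solved in parallel while a low-complexity coordination step updates the multipliers.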
20

Ebadat, Ali-Reza. "Toward Robust Information Extraction Models for Multimedia Documents." PhD thesis, INSA de Rennes, 2012. http://tel.archives-ouvertes.fr/tel-00760383.

Abstract:
Over the last decade, enormous quantities of multimedia documents have been generated. It is therefore important to find ways to manage these data, in particular from a semantic point of view, which requires fine-grained knowledge of their content. There are two families of approaches for doing so: extracting information from the document itself (e.g., audio, image), or using textual data extracted from the document or from external sources (e.g., the Web). Our work belongs to this second family of approaches; the information extracted from the texts can then be used to annotate the multimedia documents and facilitate their management. The objective of this thesis is thus to develop such information extraction models. Since the texts extracted from multimedia documents are generally short and noisy, this work also attends to their necessary robustness. We have therefore favoured simple techniques requiring little external knowledge as a guarantee of robustness, drawing on work in information retrieval and statistical text analysis. We concentrated on three tasks: supervised extraction of relations between entities, relation discovery, and discovery of entity classes. For relation extraction, we propose a supervised approach based on language models and the k-nearest-neighbours learning algorithm. Experimental results show the effectiveness and robustness of our models, which surpass state-of-the-art systems while using linguistic information that is simpler to obtain. In the second task, we move to an unsupervised model to discover relations instead of extracting predefined ones. We model this problem as a clustering task with a similarity function again based on language models. The performance, evaluated on a corpus of football match videos, shows the interest of our approach compared to classical models. Finally, in the last task, we turn from relations to entities, an essential source of information in documents. We propose an entity clustering technique to bring out, without prior assumptions, semantic classes among them, adopting a new data representation that better accounts for each occurrence of the entities. In conclusion, we have shown experimentally that simple techniques, requiring little prior knowledge and using easily accessible linguistic information, can be sufficient to effectively extract precise information from text. In our case, these good results are obtained by choosing a representation adapted to the data, based on statistical analysis or information retrieval models. The road is still long before multimedia documents can be processed directly, but we hope that our proposals can serve as a springboard for future research in this field.
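The spirit of the supervised relation-extraction pipeline (k-nearest neighbours over text between entity pairs) can be sketched with off-the-shelf scikit-learn components; here bag-of-words features stand in for the thesis's language models, and the data is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Each training example is the text around an entity pair, labelled with
# the relation it expresses (toy data, for illustration only).
texts = ["X was born in Y", "X plays for Y", "X is the capital of Y"]
labels = ["birthplace", "member_of", "capital_of"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      KNeighborsClassifier(n_neighbors=1))
model.fit(texts, labels)
print(model.predict(["Z was born in W"]))  # -> ['birthplace']
```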
21

Vukotic, Vedran. "Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data." Thesis, Rennes, INSA, 2017. http://www.theses.fr/2017ISAR0015/document.

Abstract:
In this dissertation, the thesis that deep neural networks are suited for the analysis of visual, textual, and fused visual-textual content is discussed. This work evaluates the ability of deep neural networks to learn automatic multimodal representations in either an unsupervised or a supervised manner, and brings the following main contributions: 1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared for this task with the aim of modeling both the input context and output label dependencies. 2) Action prediction from single images: we propose an architecture that allows us to predict human actions from a single image. The architecture is evaluated on videos, using only one frame as input. 3) Bidirectional multimodal encoders: the main contribution of this thesis is a neural architecture that translates from one modality to the other, and conversely, and offers an improved multimodal representation space in which the initially disjoint representations can be translated and fused. This enables improved multimodal fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks within the task of video hyperlinking, where it defined the state of the art. 4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations; in addition to providing multimodal representations, generative adversarial networks make it possible to visualize the learned model directly in the image domain.
22

Ringaby, Erik. "Optical Flow Computation on Compute Unified Device Architecture." Thesis, Linköping University, Department of Electrical Engineering, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-15426.

Abstract:

There has been rapid progress in graphics processors in recent years, largely because of the demands of computer games on speed and image quality. Because of its special architecture, the graphics processor is much faster at solving parallel problems than a normal processor, and its increasing programmability makes it possible to use it for tasks other than those it was originally designed for.

Even though graphics processors have been programmable for some time, it has been quite difficult to learn how to use them. CUDA enables the programmer to use C code, with a few extensions, to program NVIDIA's graphics processors and completely skip the traditional programming models. This thesis investigates whether the graphics processor can be used for computations without knowledge of how the hardware mechanisms work. An image processing algorithm calculating the optical flow has been implemented. The results show that it is rather easy to implement programs using CUDA, but some knowledge of how the graphics processor works is required to achieve high performance.
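For orientation, here is what a dense optical-flow computation looks like with OpenCV's CPU implementation of Farnebäck's method in Python — an illustration of the task, not the CUDA implementation from the thesis:

```python
import cv2

def dense_flow(prev_bgr, next_bgr):
    """Dense optical flow between two frames (Farneback's algorithm)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Returns an (H, W, 2) array of per-pixel (dx, dy) displacements.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
```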

23

Gonzalez Preciado, Matilde. "Méthodes de vision par ordinateur pour la reconnaissance de gestes naturelles dans le contexte de l'annotation en langue des signes." PhD thesis, Université Paul Sabatier - Toulouse III, 2012. http://tel.archives-ouvertes.fr/tel-00768440.

Abstract:
This thesis studies computer vision methods for the recognition of natural gestures in the context of Sign Language annotation. Annotations of Sign Language video are produced manually by linguists or Sign Language experts, which is error-prone, non-reproducible and extremely time-consuming. Moreover, the quality of the annotations depends on the annotator's knowledge of Sign Language. Combining the annotator's expertise with automatic processing facilitates this task and brings gains in time and robustness. We studied a set of methods for carrying out gloss annotation. As a first step, we seek to detect the start and end boundaries of signs. This annotation method requires several low-level processing stages in order to segment the signs and extract hand motion and hand shape features. First, we propose a method for tracking body parts that is robust to occlusions, based on particle filtering. Then, a hand segmentation algorithm is developed to extract the hand region even when the hands are in front of the face. Motion features are then used to produce a first temporal segmentation of the signs, which is subsequently improved using shape features; these make it possible to remove segmentation boundaries detected in the middle of signs. Once the signs are segmented, visual features are extracted for their recognition in terms of glosses using phonological models.
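The occlusion-robust tracking stage rests on particle filtering; a minimal generic particle filter for 2D position tracking is sketched below, with the observation likelihood left as a stub since the thesis's observation model (skin colour, shape, etc.) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observe, motion_std=5.0):
    """One predict/update/resample cycle for 2D position tracking.

    particles: (N, 2) candidate body-part positions.
    observe:   function mapping positions to likelihoods under the current
               frame (in the thesis this would score skin colour, shape, ...).
    """
    # Predict: random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: re-weight by the observation likelihood.
    weights = weights * observe(particles)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(weights) / 2:
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```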
24

Bondyfalat, Didier. "Interaction entre symbolique et numérique : application à la vision artificielle." PhD thesis, Université de Nice Sophia-Antipolis, 2000. http://tel.archives-ouvertes.fr/tel-00685629.

Abstract:
The initial motivation for this work comes from camera calibration in computer vision. We were mainly interested in ways of exploiting measurements in images (object detection) together with formal geometric considerations. We broadened our research to the following problem: "the interaction between symbolic and numerical computation". This work is divided into three parts. The first part deals with solving polynomial equations with approximate coefficients. We study matrix methods that transform the resolution into the search for the eigenvalues and eigenvectors of a matrix. These transformations and the computations of eigenvalues and eigenvectors are continuous with respect to the coefficients, and therefore make it possible to solve equations with approximate coefficients. The second part presents an algebraic framework in which geometric constraints can be expressed simply. This formalism allowed us to model in a fine-grained way the calibration of one or more cameras with the help of a plane. In practice, the calibration can be carried out using only numerical solutions of linear systems. The third part is devoted to the study, and above all the use, of automatic theorem-proving tools in geometry for the construction of articulated 3D models. Through numerical optimization, we determine the parameters of the articulated models that allow the images of these models to coincide with the data extracted from the photographs.
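The matrix methods of the first part can be illustrated by their simplest instance: the roots of a univariate polynomial are the eigenvalues of its companion matrix, and the eigenvalue computation degrades gracefully when the coefficients are only approximate. A small numpy check:

```python
import numpy as np

def roots_via_companion(coeffs):
    """Roots of x^n + c[0] x^(n-1) + ... + c[n-1] via the companion matrix."""
    c = np.asarray(coeffs, dtype=float)
    n = len(c)
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)   # subdiagonal of ones
    C[:, -1] = -c[::-1]          # last column holds the negated coefficients
    return np.linalg.eigvals(C)

# p(x) = x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)
print(np.sort(roots_via_companion([-6.0, 11.0, -6.0])))  # ~ [1. 2. 3.]
```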
25

Piemontese, Cristiano. "Progettazione e implementazione di una applicazione didattica interattiva per il riconoscimento di oggetti basata sull'algoritmo SIFT." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10883/.

Abstract:
This thesis introduces the field of Computer Vision and explains how the SIFT algorithm fits into its landscape. SIFT itself is then described, together with the various stages it consists of and an application to the object recognition problem. Finally, an implementation of SIFT in the Python language, created to obtain an interactive educational application, is presented, and examples of this application are shown.
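For readers who want the flavour of SIFT-based object recognition, a compact equivalent can be written with OpenCV's built-in implementation (the thesis implements SIFT from scratch; this sketch does not reproduce that code):

```python
import cv2

def match_object(template_path, scene_path, ratio=0.75):
    """Match a template object into a scene with SIFT + Lowe's ratio test."""
    sift = cv2.SIFT_create()
    img1 = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(scene_path, cv2.IMREAD_GRAYSCALE)
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Keep a match only when clearly better than the second-best candidate.
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]
    return kp1, kp2, good
```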
26

Fröml, Vojtěch. "API datového úložiště pro práci s videem a obrázky." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2013. http://www.nusl.cz/ntk/nusl-236419.

Abstract:
This master's thesis proposes and implements an extension of the database interface VTApi, which is being developed as a part of the MV ČR project "Tools and methods for video and image processing for terrorism prevention" at FIT VUT. This interface provides support for the representation, management and indexation of multimedia data and related descriptive metadata used by analytic applications based on computer vision. It currently uses the DBMS PostgreSQL as its default datastore. The thesis describes basic techniques for processing image and video data and the VTApi concept, and proposes and implements modifications to it for the purpose of supporting multiple types of datastores. As an example of an alternative datastore, support for a SQLite database is integrated into VTApi.
27

Tjondronegoro, Dian W. "Content-based Video Indexing for Sports Applications using a Multi-modal Approach." PhD thesis, Deakin University, 2005. https://eprints.qut.edu.au/2199/1/PhDThesis_Tjondronegoro.pdf.

Abstract:
Triggered by technology innovations, there has been a huge increase in the utilization of video, as one of the most preferred types of media due to its content richness, for many significant applications. To sustain an ongoing rapid growth of video information, there is an emerging demand for a sophisticated content-based video indexing system. However, current video indexing solutions are still immature and lack any standard. One solution, namely annotation-based indexing, allows video retrieval using textual annotations. However, the major limitations are the restriction to pre-defined keywords and the expensive manual work of annotating video. Another solution, called feature-based indexing, allows video search by low-level feature comparison, such as query by a sample image. Even though this approach can use automatically extracted features, users are not able to retrieve video intuitively, based on high-level concepts. This predicament is caused by the so-called semantic gap, which highlights the fact that users recall video contents at a high level of abstraction while video is generally stored as an arbitrary sequence of audio-visual tracks. To bridge the semantic gap, this thesis demonstrates the use of a domain-specific approach, which aims to utilize domain knowledge in facilitating the extraction of high-level concepts directly from the audiovisual features. The main idea behind the domain-specific approach is the use of domain knowledge to guide the integration of features from multi-modal tracks. For example, to extract goal segments from soccer and basketball video, slow-motion replay scenes (visual) and excitement (audio) should be detected, as they are played during most goal segments. Domain-specific indexing also exploits specific browsing and querying methods which are driven by the requirements of specific users and applications. Sports video is selected as the primary domain due to its content richness and popularity. Moreover, broadcast sports videos generally span hours with many redundant activities, and the key segments could make up only 30% to 60% of the entire data depending on the progress of the match. This thesis presents research based on an integrated multi-modal approach for sports video indexing and retrieval. By combining specific features extractable from multiple (audio-visual) modalities, generic structure and specific events can be detected and classified. During browsing and retrieval, users benefit from the integration of high-level semantics and some descriptive mid-level features such as whistle and close-up views of player(s). The main objective is to contribute to the three major components of sports video indexing systems. The first component is a set of powerful techniques to extract audio-visual features and semantic contents automatically. The main purposes are to reduce manual annotation and to summarize the lengthy contents into a compact, meaningful and more enjoyable presentation. The second component is an expressive and flexible indexing technique that supports gradual index construction. The indexing scheme is essential to determine the methods by which users can access a video database. The third and last component is a query language that can generate dynamic video summaries for smart browsing and support user-oriented retrievals.
28

Buratti, Luca. "Valutazione sperimentale di metodologie di rettificazione e impatto su algoritmi di visione stereo." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/11648/.

Abstract:
Extracting information from the surrounding environment is a very important goal of modern computer science, enabling the design of robots, self-driving vehicles, recognition systems and much more. Computer vision is the branch of computer science that deals with this, and it is steadily gaining ground. To reach this goal, a stereo vision pipeline is used, whose rectification and disparity-map generation steps are the subject of this thesis. In particular, since these steps are often delegated to dedicated hardware devices (such as FPGAs), there is a need for algorithms that are portable to this kind of technology, where resources are far more limited. This thesis shows how approximation techniques can be applied to these algorithms so as to save resources while still guaranteeing very good results.
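Assuming an already rectified pair, the disparity-map step of such a pipeline looks as follows with OpenCV's semi-global matcher — an illustration of the pipeline stage, not the approximated FPGA-oriented algorithms evaluated in the thesis:

```python
import cv2

def disparity_map(left_path, right_path, max_disp=128):
    """Disparity from a rectified stereo pair using semi-global block matching."""
    left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)
    sgbm = cv2.StereoSGBM_create(minDisparity=0,
                                 numDisparities=max_disp,  # must be a multiple of 16
                                 blockSize=5,
                                 P1=8 * 5 * 5,
                                 P2=32 * 5 * 5,
                                 uniquenessRatio=10)
    # OpenCV returns fixed-point disparities scaled by 16.
    return sgbm.compute(left, right).astype(float) / 16.0
```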
APA, Harvard, Vancouver, ISO, and other styles
29

Huisman, Maximiliaan. "Vision Beyond Optics: Standardization, Evaluation and Innovation for Fluorescence Microscopy in Life Sciences." eScholarship@UMMS, 2019. https://escholarship.umassmed.edu/gsbs_diss/1017.

Full text
Abstract:
Fluorescence microscopy is an essential tool in biomedical sciences that allows specific molecules to be visualized in the complex and crowded environment of cells. The continuous introduction of new imaging techniques makes microscopes more powerful and versatile, but there is more than meets the eye. In addition to developing new methods, we can work towards getting the most out of existing data and technologies. By harnessing unused potential, this work aims to increase the richness, reliability, and power of fluorescence microscopy data in three key ways: through standardization, evaluation and innovation. A universal standard makes it easier to assess, compare and analyze imaging data, from the level of a single laboratory to the broader life sciences community. We propose a data standard for fluorescence microscopy that can increase the confidence in experimental results, facilitate the exchange of data, and maximize compatibility with current and future data analysis techniques. Cutting-edge imaging technologies often rely on sophisticated hardware and multi-layered algorithms for reconstruction and analysis. Consequently, the trustworthiness of new methods can be difficult to assess. To evaluate the reliability and limitations of complex methods, quantitative analyses, such as the one presented here for the 3D SPEED method, are paramount. The limited resolution of optical microscopes prevents direct observation of macromolecules like DNA and RNA. We present a multi-color, achromatic, cryogenic fluorescence microscope that has the potential to produce multi-color images with sub-nanometer precision. This innovation would move fluorescence imaging beyond the limitations of optics and into the world of molecular resolution.
APA, Harvard, Vancouver, ISO, and other styles
30

Karaman, Svebor. "Indexation de la Vidéo Portée : Application à l'Étude Épidémiologique des Maladies Liées à l'Âge." Phd thesis, Université Sciences et Technologies - Bordeaux I, 2011. http://tel.archives-ouvertes.fr/tel-00689855.

Full text
Abstract:
The research work of this doctoral thesis concerns the medical monitoring of patients with age-related dementia using video cameras worn by the patients. The idea is to provide physicians with a new tool for the early diagnosis of age-related dementias such as Alzheimer's disease. More precisely, Instrumental Activities of Daily Living (IADL) must be indexed automatically in the videos recorded by a wearable recording device. These videos exhibit specific characteristics, such as strong motion and strong changes in illumination. Moreover, the targeted recognition task has a very high semantic level. In this difficult context, the first step of analysis is the definition of an equivalent to the notion of a "shot" in edited video content. We therefore developed a method for partitioning a continuously recorded video into "viewpoints" based on apparent motion. For IADL recognition, we developed a solution within the formalism of Hidden Markov Models (HMM). A hierarchical two-level HMM was introduced, modelling semantic activities or intermediate states. A complex set of descriptors (dynamic, static, low-level and mid-level) was exploited, and the optimal joint description spaces were identified experimentally. Among mid-level descriptors for activity recognition, we were particularly interested in the semantic objects that the person manipulates in the camera's field of view. We proposed a new concept for the description of objects or images that makes use of local descriptors (SURF) and the underlying topological structure of local graphs. A nested approach to graph construction was introduced, in which the same scene can be described by several levels of graphs with an increasing number of nodes. We build these graphs by a Delaunay triangulation on SURF points, thus preserving the good properties of local descriptors, namely their invariance to affine transformations in the image plane such as rotation, translation and scale changes. We use these graph descriptors within the Bag-of-Visual-Words approach. The problem of defining a distance, or dissimilarity, between graphs for unsupervised classification and recognition necessarily arises. We propose a dissimilarity measure based on the Context-Dependent Kernel (CDK) proposed by H. Sahbi and show its relation to the classical L2 norm when comparing trivial graphs (the SURF points). For activity recognition with HMMs, experiments are conducted on the world's first wearable-camera video corpus intended for the observation of IADL, and on public databases such as SIVAL and Caltech-101 for object recognition.
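The graph-construction step described above, a Delaunay triangulation over local interest points, is easy to sketch. The fragment below uses ORB keypoints as a stand-in for SURF (which is patented and absent from stock OpenCV builds) and `scipy.spatial.Delaunay`; the image path is a placeholder, and this is only an illustration of the construction, not the thesis's descriptor.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Detect local keypoints (ORB stands in for SURF here).
orb = cv2.ORB_create(nfeatures=200)
keypoints = orb.detect(img, None)
points = np.array([kp.pt for kp in keypoints])

# Delaunay triangulation over keypoint locations; the triangle edges
# form a local graph whose structure is stable under in-plane
# rotation, translation and scaling of the keypoint set.
tri = Delaunay(points)
edges = set()
for simplex in tri.simplices:          # each simplex is a triangle (3 indices)
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edges.add((a, b))

print(f"{len(points)} nodes, {len(edges)} graph edges")
```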
APA, Harvard, Vancouver, ISO, and other styles
31

Kumar, Tushar. "Characterizing and controlling program behavior using execution-time variance." Diss., Georgia Institute of Technology, 2016. http://hdl.handle.net/1853/55000.

Full text
Abstract:
Immersive applications, such as computer gaming, computer vision and video codecs, are an important emerging class of applications with QoS requirements that are difficult to characterize and control using traditional methods. This thesis proposes new techniques reliant on execution-time variance to both characterize and control program behavior. The proposed techniques are intended to be broadly applicable to a wide variety of immersive applications and easy for programmers to apply without specialized expertise. First, we create new QoS controllers that programmers can easily apply to their applications to achieve desired application-specific QoS objectives on any platform or application data-set, provided the programmers verify that their applications satisfy some simple domain requirements specific to immersive applications. The controllers adjust programmer-identified knobs every application frame to effect desired values for programmer-identified QoS metrics. The control techniques are novel in that they do not require the user to provide any kind of application behavior model, and they are effective for immersive applications that defy the traditional requirements for feedback controller construction. Second, we create new profiling techniques that provide visibility into the behavior of a large, complex application, inferring behavior relationships across application components based on the execution-time variance observed at all levels of granularity of the application functionality. Additionally, for immersive applications, some of the most important QoS requirements relate to managing the execution-time variance of key application components, for example the frame rate. The profiling techniques not only identify and summarize behavior directly relevant to the timing-related QoS aspects, but also indirectly reveal non-timing-related properties of behavior, such as the identification of components that are sensitive to data, or those whose behavior changes based on the call context.
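A toy version of the per-frame knob controller described above, with no application model, just an error-driven update steering a quality knob toward a target frame time, might look like the following. The knob semantics (a `detail` value scaling per-frame work) and the gain are invented for illustration.

```python
import time

class FrameTimeController:
    """Nudge a quality knob each frame so the measured frame time
    converges to a target, with no model of the application."""
    def __init__(self, target_s, knob=1.0, gain=0.5, lo=0.1, hi=10.0):
        self.target, self.knob, self.gain = target_s, knob, gain
        self.lo, self.hi = lo, hi

    def update(self, measured_s):
        # Relative error; positive when frames are too slow.
        err = (measured_s - self.target) / self.target
        self.knob = min(self.hi, max(self.lo, self.knob * (1 - self.gain * err)))
        return self.knob

def render_frame(detail):
    time.sleep(0.002 * detail)  # stand-in workload: cost grows with detail

ctrl = FrameTimeController(target_s=0.010, knob=8.0)
for frame in range(30):
    t0 = time.perf_counter()
    render_frame(ctrl.knob)
    knob = ctrl.update(time.perf_counter() - t0)
print(f"settled knob approx {knob:.2f}")
```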
APA, Harvard, Vancouver, ISO, and other styles
32

Arcila, Romain. "Séquences de maillages : classification et méthodes de segmentation." Phd thesis, Université Claude Bernard - Lyon I, 2011. http://tel.archives-ouvertes.fr/tel-00653542.

Full text
Abstract:
Mesh sequences are used more and more, and this growing demand is driving the development of methods for generating them. These generation methods can produce mesh sequences of different natures. The number of applications using such sequences has also grown, with examples including compression and pose transfer. These applications often require computing a partition of the sequence. In this thesis, we are particularly interested in the segmentation of mesh sequences into rigid components. First, we formalize the notion of a mesh sequence and propose a classification that designates the properties attached to each type of sequence, and thus precisely describes which type of sequence a given application requires. Second, we formalize the notion of mesh-sequence segmentation and present the state of the art in segmentation methods for mesh sequences. We then propose a first, global method for stable mesh sequences, based on region merging. Next, we present two further methods, both built on spectral clustering: the first produces a set of global segmentations, while the second generates either a global segmentation or a temporally varying segmentation. We also put in place a system for the quantitative evaluation of segmentations. Finally, we present various perspectives related to segmentation.
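The spectral route to rigid-component segmentation can be sketched by clustering vertex trajectories: vertices that move rigidly together trace similar displacement paths. The stand-in below uses synthetic trajectories and scikit-learn's SpectralClustering; the thesis's actual affinities and pipeline are more elaborate than this.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)

# Synthetic "mesh sequence": 60 vertices x 10 frames x 3 coordinates.
# Two rigid parts: one translates along x, the other rotates about z.
n, frames = 60, 10
base = rng.uniform(-1, 1, (n, 3))
traj = np.zeros((n, frames, 3))
for f in range(frames):
    theta = 0.1 * f
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0, 0, 1]])
    traj[:30, f] = base[:30] + np.array([0.2 * f, 0, 0])  # translating part
    traj[30:, f] = base[30:] @ rot.T                      # rotating part
traj += rng.normal(0, 0.005, traj.shape)                  # mild noise

# Feature per vertex: its flattened displacement trajectory.
feats = (traj - traj[:, :1]).reshape(n, -1)
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(feats)
print(labels[:30], labels[30:])  # the two rigid parts should separate
```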
APA, Harvard, Vancouver, ISO, and other styles
33

Khan, Asim. "Automated Detection and Monitoring of Vegetation Through Deep Learning." Thesis, 2022. https://vuir.vu.edu.au/43941/.

Full text
Abstract:
Healthy vegetation is essential not just for environmental sustainability but also for the development of sustainable and liveable cities. It is undeniable that human activities are altering the vegetation landscape, with harmful implications for the climate. As a result, autonomous detection, health evaluation, and continual monitoring of plants are required to ensure environmental sustainability. This thesis presents research on autonomous vegetation management using recent advances in deep learning. Currently, most towns do not have a system in place for detection and continual vegetation monitoring. On the one hand, a lack of public knowledge and political will could be a factor; on the other hand, no efficient and cost-effective technique for monitoring vegetation health has been established. Health data on individual plants is essential since urban trees often develop as stand-alone objects. Manual annotation of these individual trees is a time-consuming, expensive, and inefficient operation that is normally done in person. As a result, skilled manual annotation cannot cover broad areas, and the data it creates quickly goes out of date. However, autonomous vegetation management poses a number of challenges due to its multidisciplinary nature. It includes automated detection, health assessment, and monitoring of vegetation and trees by integrating techniques from computer vision, machine learning, and remote sensing. Other challenges include a lack of analysis-ready data and imaging diversity, as well as dependence on weather variability. With a core focus on automating vegetation management using deep learning and transfer learning, this thesis contributes novel techniques for multi-view vegetation detection, robust calculation of vegetation indices, and real-time vegetation health assessment using deep convolutional neural networks (CNNs) and deep learning frameworks. The thesis focuses on four general aspects: a) training CNNs with possibly inaccurate labels and noisy image datasets; b) deriving semantic vegetation segmentation from the ordinal information contained in the image; c) retrieving semantic vegetation indices from street-level imagery; and d) developing a vegetation health assessment and monitoring system. First, it is essential to detect and segment the vegetation, and then calculate the pixel values of the semantic vegetation index. However, because the images in multi-sensory data are not identical, all image datasets must be registered before being fed into model training. The dataset used for vegetation detection and segmentation was acquired from multiple sensors. The whole dataset was multi-temporal; therefore, it was registered using deep affine features through a convolutional neural network. Second, after preparing the dataset, vegetation was segmented using a deep CNN, a fully convolutional network, and U-net. Although the vegetation index reflects the health of a particular area's vegetation when assessing both small and large vegetation (trees, shrubs, grass, etc.), the health of large plants such as trees is determined from the stem, whereas small plants' leaves are evaluated to decide whether they are healthy or unhealthy. Therefore, small-plant health was first assessed through the leaves by training a deep neural network and integrating the trained model into an internet-of-things (IoT) device such as AWS DeepLens. Another deep CNN was trained to assess the health of large plants and trees like Eucalyptus; it could also tell which trees were healthy and which were unhealthy, as well as their geo-location. Thus, we may ultimately analyse the vegetation's health in terms of a semantic-based vegetation index over time, computing the index in a time-series fashion. This thesis shows that computer vision, deep learning and remote sensing approaches can be used to process street-level imagery in different places and cities, helping to manage urban forests in new ways, such as biomass surveillance and remote vegetation monitoring.
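A simple pixel-level vegetation index of the kind computable from street-level RGB imagery (where near-infrared is unavailable) is the Excess Green index. The sketch below is generic, not the thesis's semantic index; the image path and threshold are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("street_view.jpg")  # placeholder path, BGR uint8
b, g, r = [c.astype(np.float32) / 255.0 for c in cv2.split(img)]

# Excess Green (ExG) = 2g - r - b over chromaticity-normalised channels.
total = b + g + r + 1e-6
exg = 2 * (g / total) - (r / total) - (b / total)

veg_mask = exg > 0.05                 # ad hoc threshold
green_fraction = veg_mask.mean()
print(f"vegetation cover approx {green_fraction:.1%}")
```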
APA, Harvard, Vancouver, ISO, and other styles
34

"A Cooperative algorithm for stereo disparity computation." Chinese University of Hong Kong, 1991. http://library.cuhk.edu.hk/record=b5886947.

Full text
Abstract:
by Or Siu Hang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1991.
Bibliography: leaves [102]-[105].
Acknowledgements --- p.V
Chapter 1 --- Introduction
1.1 --- The problem --- p.1
1.1.1 --- The correspondence problem --- p.5
1.1.2 --- The problem of surface reconstruction --- p.6
1.2 --- Our goal --- p.8
1.3 --- Previous works --- p.8
1.3.1 --- Constraints on matching --- p.10
1.3.2 --- Interpolation of disparity surfaces --- p.12
Chapter 2 --- Preprocessing of images
2.1 --- Which operator to use --- p.14
2.2 --- Directional zero-crossing --- p.14
2.3 --- Laplacian of Gaussian --- p.16
2.3.1 --- Theoretical background of the Laplacian of Gaussian --- p.18
2.3.2 --- Implementation of the operator --- p.21
Chapter 3 --- Disparity layers generation
3.1 --- Geometrical constraint --- p.23
3.2 --- Basic idea of disparity layer --- p.26
3.3 --- Consideration in matching --- p.28
3.4 --- Effect of vertical misalignment of sensor --- p.37
3.5 --- Final approach --- p.39
Chapter 4 --- Disparity combination
4.1 --- Ambiguous match from different layers --- p.52
4.2 --- Our approach --- p.54
Chapter 5 --- Generation of dense disparity map
5.1 --- Introduction --- p.58
5.2 --- Cooperative computation --- p.58
5.2.1 --- Formulation of oscillation algorithm --- p.59
5.3 --- Interpolation by gradient descent method --- p.69
5.3.1 --- Formulation of constraints --- p.70
5.3.2 --- Gradient projection interpolation algorithm --- p.72
5.3.3 --- Implementation of the algorithm --- p.78
Chapter 6 --- Conclusion --- p.89
Reference
Appendix (Dynamical behavior of the cooperative algorithm)
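The Laplacian-of-Gaussian preprocessing listed in Chapter 2 of this thesis is a standard construction: smooth with a Gaussian, apply the Laplacian, and mark zero-crossings as candidate edges for matching. A generic sketch (kernel sizes and the image path are illustrative, not taken from the thesis):

```python
import cv2
import numpy as np

img = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Laplacian of Gaussian: smooth first, then take the Laplacian.
smoothed = cv2.GaussianBlur(img, (9, 9), sigmaX=2.0)
log = cv2.Laplacian(smoothed, cv2.CV_32F, ksize=3)

# Zero-crossings: sign changes between horizontal or vertical neighbours.
sign = log > 0
zc = np.zeros_like(sign)
zc[:, :-1] |= sign[:, :-1] != sign[:, 1:]
zc[:-1, :] |= sign[:-1, :] != sign[1:, :]
cv2.imwrite("zero_crossings.png", zc.astype(np.uint8) * 255)
```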
APA, Harvard, Vancouver, ISO, and other styles
35

Sun, Jun. "Efficient computation of MRF for low-level vision problems." Master's thesis, 2012. http://hdl.handle.net/1885/156022.

Full text
Abstract:
Low-level computer vision problems, such as image restoration, stereo matching and image segmentation, have received substantial attention from computer vision researchers. These problems, though apparently distinct, share a common structure: given the observed image data, estimate a hidden label at each pixel position. Due to this similarity, one effective solution is to deal with low-level vision problems through a unified approach: the Markov Random Field (MRF) framework, which comprises MRF modeling and inference. To be specific, the relationship between observed image data and hidden labels is modeled by an MRF network, formulating a joint distribution. Inference then efficiently finds a local optimum of the distribution, producing an estimate of the underlying labels. We study how to efficiently solve low-level vision problems within the MRF framework. To achieve this target, we mainly focus on two aspects: (1) optimizing the MRF structure; (2) improving the efficiency of inference. MRF structures express how we describe the relationship between observed data and hidden labels, as well as the relationships among the hidden labels themselves. We aim to optimize the MRF structure for fast inference. Besides the structure, there are two important terms used in inference: the data term and the prior term. In this work, we also study generating a reliable and robust data term to increase the accuracy and efficiency of inference. In this thesis, a multi-spanning-tree decomposition is first proposed. The traditional 4-connected MRF is broken down into a set of spanning trees, which are loop-free; edges in the original grid are uniformly distributed among these spanning trees. Using the multi-spanning-tree structure, inference can be performed quickly and in parallel, and inference results can be combined by an efficient median filter. In the second place, a deeper analysis of spanning-tree MRFs is proposed. Tree structures are popularly utilized for inference, but how to select an optimal spanning tree has been widely overlooked by researchers. The problem is formulated as finding an optimal spanning tree to approximate the 4-connected grid, and it is solved by minimizing the KL-divergence between the tree and grid distributions. It is also demonstrated that for different low-level vision problems, the selection criteria for optimal spanning trees are distinct. Besides the MRF structure analysis, we also optimize the data-term generation process. This work is specialized for image denoising: we use a non-local technique to create a reliable, low-dimensional label space from which the data term is generated. This data term helps to provide results comparable with state-of-the-art algorithms in quite limited time. To sum up, we efficiently solve low-level vision problems by MRF inference. Although MRF-based low-level vision problems have been analyzed for a long time, some interesting work has still been overlooked. After a careful scrutiny of the literature, we extend existing ideas and develop new algorithms, empirically demonstrating that MRF computation can be improved in both speed and accuracy.
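The first step of the spanning-tree decomposition described above, extracting a loop-free subgraph from the 4-connected pixel grid, can be sketched with SciPy's minimum-spanning-tree routine. The random edge weights below are stand-ins for whatever data-dependent weights a real selection criterion would use.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

h, w = 4, 5                     # tiny image grid; node id = row * w + col
rng = np.random.default_rng(0)

rows, cols, weights = [], [], []
for r in range(h):
    for c in range(w):
        u = r * w + c
        if c + 1 < w:           # horizontal 4-connectivity edge
            rows.append(u); cols.append(u + 1); weights.append(rng.random())
        if r + 1 < h:           # vertical 4-connectivity edge
            rows.append(u); cols.append(u + w); weights.append(rng.random())

grid = coo_matrix((weights, (rows, cols)), shape=(h * w, h * w))
tree = minimum_spanning_tree(grid)   # loop-free: exactly h*w - 1 edges
print(f"grid edges: {len(weights)}, tree edges: {tree.nnz}")
```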
APA, Harvard, Vancouver, ISO, and other styles
36

Zareian, Alireza. "Learning Structured Representations for Understanding Visual and Multimedia Data." Thesis, 2021. https://doi.org/10.7916/d8-94j1-yb14.

Full text
Abstract:
Recent advances in Deep Learning (DL) have achieved impressive performance in a variety of Computer Vision (CV) tasks, leading to an exciting wave of academic and industrial efforts to develop Artificial Intelligence (AI) facilities for every aspect of human life. Nevertheless, there are inherent limitations in the understanding ability of DL models, which limit the potential of AI in real-world applications, especially in the face of complex, multimedia input. Despite tremendous progress in solving basic CV tasks, such as object detection and action recognition, state-of-the-art CV models can merely extract a partial summary of visual content, which lacks a comprehensive understanding of what happens in the scene. This is partly due to the oversimplified definition of CV tasks, which often ignore the compositional nature of semantics and scene structure. It is even less studied how to understand the content of multiple modalities, which requires processing visual and textual information in a holistic and coordinated manner, and extracting interconnected structures despite the semantic gap between the two modalities. In this thesis, we argue that a key to improve the understanding capacity of DL models in visual and multimedia domains is to use structured, graph-based representations, to extract and convey semantic information more comprehensively. To this end, we explore a variety of ideas to define more realistic DL tasks in both visual and multimedia domains, and propose novel methods to solve those tasks by addressing several fundamental challenges, such as weak supervision, discovery and incorporation of commonsense knowledge, and scaling up vocabulary. More specifically, inspired by the rich literature of semantic graphs in Natural Language Processing (NLP), we explore innovative scene understanding tasks and methods that describe images using semantic graphs, which reflect the scene structure and interactions between objects. In the first part of this thesis, we present progress towards such graph-based scene understanding solutions, which are more accurate, need less supervision, and have more human-like common sense compared to the state of the art. In the second part of this thesis, we extend our results on graph-based scene understanding to the multimedia domain, by incorporating the recent advances in NLP and CV, and developing a new task and method from the ground up, specialized for joint information extraction in the multimedia domain. We address the inherent semantic gap between visual content and text by creating high-level graph-based representations of images, and developing a multitask learning framework to establish a common, structured semantic space for representing both modalities. In the third part of this thesis, we explore another extension of our scene understanding methodology, to open-vocabulary settings, in order to make scene understanding methods more scalable and versatile. We develop visually grounded language models that use naturally supervised data to learn the meaning of all words, and transfer that knowledge to CV tasks such as object detection with little supervision. Collectively, the proposed solutions and empirical results set a new state of the art for the semantic comprehension of visual and multimedia content in a structured way, in terms of accuracy, efficiency, scalability, and robustness.
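The graph-based representations advocated above reduce, at their simplest, to a set of objects plus subject-predicate-object triples. A toy scene-graph container, with invented scene content, only meant to make the data structure concrete:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)    # node labels
    relations: list = field(default_factory=list)  # (subj_idx, predicate, obj_idx)

    def add_object(self, label):
        self.objects.append(label)
        return len(self.objects) - 1

    def relate(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

    def triples(self):
        return [(self.objects[s], p, self.objects[o])
                for s, p, o in self.relations]

g = SceneGraph()
person, horse, fence = (g.add_object(l) for l in ("person", "horse", "fence"))
g.relate(person, "riding", horse)
g.relate(horse, "jumping over", fence)
print(g.triples())
```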
APA, Harvard, Vancouver, ISO, and other styles
37

Battiti, Roberto. "Multiscale methods, parallel computation, and neural networks for real-time computer vision." Thesis, 1990. https://thesis.library.caltech.edu/2496/1/Battiti_r_1990.pdf.

Full text
Abstract:
This thesis presents new algorithms for low and intermediate level computer vision. The guiding ideas in the presented approach are those of hierarchical and adaptive processing, concurrent computation, and supervised learning. Processing of the visual data at different resolutions is used not only to reduce the amount of computation necessary to reach the fixed point, but also to produce a more accurate estimation of the desired parameters. The presented adaptive multiple scale technique is applied to the problem of motion field estimation. Different parts of the image are analyzed at a resolution that is chosen in order to minimize the error in the coefficients of the differential equations to be solved. Tests with video-acquired images show that velocity estimation is more accurate over a wide range of motion with respect to the homogeneous scheme. In some cases introduction of explicit discontinuities coupled to the continuous variables can be used to avoid propagation of visual information from areas corresponding to objects with different physical and/or kinematic properties. The human visual system uses concurrent computation in order to process the vast amount of visual data in "real-time." Although with different technological constraints, parallel computation can be used efficiently for computer vision. All the presented algorithms have been implemented on medium grain distributed memory multicomputers with a speed-up approximately proportional to the number of processors used. A simple two-dimensional domain decomposition assigns regions of the multiresolution pyramid to the different processors. The inter-processor communication needed during the solution process is proportional to the linear dimension of the assigned domain, so that efficiency is close to 100% if a large region is assigned to each processor. Finally, learning algorithms are shown to be a viable technique to engineer computer vision systems for different applications starting from multiple-purpose modules. In the last part of the thesis a well known optimization method (the Broyden-Fletcher-Goldfarb-Shanno memoryless quasi-Newton method) is applied to simple classification problems and shown to be superior to the "error back-propagation" algorithm for numerical stability, automatic selection of parameters, and convergence properties.
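The multiscale idea above, estimating at coarse resolution and refining only where needed, rests on an image pyramid. A minimal coarse-to-fine skeleton using OpenCV pyramids follows; the per-level `estimate` is a stand-in for the thesis's differential motion solver, and the frame paths are placeholders.

```python
import cv2
import numpy as np

def build_pyramid(img, levels=4):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr[::-1]                       # coarsest level first

def estimate(img_a, img_b, init_flow):
    """Stand-in for a differential motion estimator refined from init_flow."""
    return init_flow                        # placeholder: keep upsampled guess

a = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
b = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
pyr_a, pyr_b = build_pyramid(a), build_pyramid(b)

flow = np.zeros((*pyr_a[0].shape, 2), np.float32)
for la, lb in zip(pyr_a, pyr_b):
    if flow.shape[:2] != la.shape:          # upsample previous level's flow
        flow = cv2.resize(flow, (la.shape[1], la.shape[0])) * 2.0
    flow = estimate(la, lb, flow)
print("final flow field:", flow.shape)
```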
APA, Harvard, Vancouver, ISO, and other styles
38

Jou, Brendan Wesley. "Large-scale Affective Computing for Visual Multimedia." Thesis, 2016. https://doi.org/10.7916/D8474B0B.

Full text
Abstract:
In recent years, Affective Computing has arisen as a prolific interdisciplinary field for engineering systems that integrate human affections. While human-computer relationships have long revolved around cognitive interactions, it is becoming increasingly important to account for human affect, or feelings or emotions, to avert user experience frustration, provide disability services, predict virality of social media content, etc. In this thesis, we specifically focus on Affective Computing as it applies to large-scale visual multimedia, and in particular, still images, animated image sequences and video streams, above and beyond the traditional approaches of face expression and gesture recognition. By taking a principled psychology-grounded approach, we seek to paint a more holistic and colorful view of computational affect in the context of visual multimedia. For example, should emotions like 'surprise' and 'fear' be assumed to be orthogonal output dimensions? Or does a 'positive' image in one culture's view elicit the same feelings of positivity in another culture? We study affect frameworks and ontologies to define, organize and develop machine learning models with such questions in mind to automatically detect affective visual concepts. In the push for what we call "Big Affective Computing," we focus on two dimensions of scale for affect, scaling up and scaling out, which we propose are both imperative if we are to scale the Affective Computing problem successfully. Intuitively, simply increasing the number of data points corresponds to "scaling up". However, less intuitive is when problems like Affective Computing "scale out," or diversify. We show that this latter dimension of introducing data variety, alongside the former of introducing data volume, can yield particular insights, since human affections naturally depart from traditional Machine Learning and Computer Vision problems where there is an objectively truthful target. While no one might debate that a picture of a 'dog' should be tagged as a 'dog,' not all may agree that it looks 'ugly'. We present extensive discussions on why scaling out is critical and how it can be accomplished in the context of large-volume visual data. At a high level, the main contributions of this thesis include: Multiplicity of Affect Oracles: Prior to the work in this thesis, little consideration had been paid to the affective label-generating mechanism when learning functional mappings between inputs and labels. Throughout this thesis, but first in Chapter 2, starting in Section 2.1.2, we make a case for a conceptual partitioning of the affect oracle governing the label generation process in Affective Computing problems, resulting in a multiplicity of oracles, whereas prior works assumed there was a single universal oracle. In Chapter 3, the differences between intended versus expressed versus induced versus perceived emotion are discussed, where we argue that perceived emotion is particularly well-suited for scaling up because it reduces label variance due to its more objective nature compared to other affect states. And in Chapters 4 and 5, a division of the affect oracle along cultural lines, with manifestations in both language and geography, is explored. We accomplish all this without sacrificing the 'scale up' dimension, and tackle significantly larger-volume problems than prior comparable visual affective computing research.
Content-driven Visual Affect Detection: Traditionally, in most Affective Computing work, prediction tasks use psycho-physiological signals from subjects viewing the stimuli of interest, e.g., a video advertisement, as the system inputs. In essence, this means that the machine learns to label a proxy signal rather than the stimuli itself. In this thesis, with the rise of strong Computer Vision and Multimedia techniques, we focus on learning to label the stimuli directly, without a biometric proxy signal provided by a human subject (except in the unique circumstances of Chapter 7). This shift toward learning from the stimuli directly is important because it allows us to scale up with much greater ease, given that biometric measurement acquisition is both low-throughput and somewhat invasive while stimuli are often readily available. In addition, moving toward learning directly from the stimuli will allow researchers to precisely determine which low-level features in the stimuli are actually coupled with affect states, e.g., which set of frames caused viewer discomfort, rather than a broad sense that a video was discomforting. In Part I of this thesis, we illustrate an emotion prediction task with a psychology-grounded affect representation. In particular, in Chapter 3, we develop a prediction task over semantic emotional classes, e.g., 'sad,' 'happy' and 'angry,' using animated image sequences, given annotations from over 2.5 million users. Subsequently, in Part II, we develop visual sentiment and adjective-based semantics models from million-scale digital imagery mined from a social multimedia platform. Mid-level Representations for Visual Affect: While discrete semantic emotions and sentiment are classical representations of affect with decades of psychology grounding, the interdisciplinary nature of Affective Computing, now only about two decades old, allows for new avenues of representation. Mid-level representations have been proposed in numerous Computer Vision and Multimedia problems as an intermediary, and often more computable, step toward bridging the semantic gap between low-level system inputs and high-level label semantic abstractions. In Part II, inspired by this work, we adapt it for vision-based Affective Computing and adopt a semantic construct called adjective-noun pairs. Specifically, in Chapter 4, we explore the use of such adjective-noun pairs in the context of a social multimedia platform and develop a multilingual visual sentiment ontology with over 15,000 affective mid-level visual concepts across 12 languages, associated with over 7.3 million images and representations from over 235 countries, resulting in the largest affective digital image corpus to date in both depth and breadth. In Chapter 5, we develop computational methods to predict such adjective-noun pairs and also explore their usefulness in traditional sentiment analysis, but with a previously unexplored cross-lingual perspective. And in Chapter 6, we propose a new learning setting called 'cross-residual learning,' building off recent successes in deep neural networks, and specifically in residual learning; we show that cross-residual learning can be used effectively to jointly learn across multiple related tasks in object detection (nouns), more traditional affect modeling (adjectives), and affective mid-level representations (adjective-noun pairs), giving us a framework for better grounding the adjective-noun pair bridge in both vision and affect simultaneously.
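One way to picture the cross-residual coupling of related heads (noun, adjective, adjective-noun pair) is as residual shortcuts across task branches. The PyTorch fragment below is schematic, with invented layer sizes and wiring; the thesis's exact architecture may differ.

```python
import torch
import torch.nn as nn

class CrossResidualHeads(nn.Module):
    """Three task heads over a shared feature; the ANP head receives
    residual shortcuts from the noun and adjective branches (schematic)."""
    def __init__(self, feat_dim=512, n_nouns=100, n_adjs=50, n_anps=500):
        super().__init__()
        self.noun_branch = nn.Linear(feat_dim, feat_dim)
        self.adj_branch = nn.Linear(feat_dim, feat_dim)
        self.anp_branch = nn.Linear(feat_dim, feat_dim)
        self.noun_out = nn.Linear(feat_dim, n_nouns)
        self.adj_out = nn.Linear(feat_dim, n_adjs)
        self.anp_out = nn.Linear(feat_dim, n_anps)

    def forward(self, feat):
        h_noun = torch.relu(self.noun_branch(feat))
        h_adj = torch.relu(self.adj_branch(feat))
        # Cross-residual: the ANP branch adds shortcuts from both tasks.
        h_anp = torch.relu(self.anp_branch(feat) + h_noun + h_adj)
        return self.noun_out(h_noun), self.adj_out(h_adj), self.anp_out(h_anp)

model = CrossResidualHeads()
nouns, adjs, anps = model(torch.randn(4, 512))
print(nouns.shape, adjs.shape, anps.shape)
```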
APA, Harvard, Vancouver, ISO, and other styles
39

Sidhu, Reetinder P. S. "Novel Energy Transfer Computation Techniques For Radiosity Based Realistic Image Synthesis." Thesis, 1995. http://etd.iisc.ernet.in/handle/2005/1737.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

(9089423), Daniel Mas Montserrat. "Machine Learning-Based Multimedia Analytics." Thesis, 2020.

Find full text
Abstract:
Machine learning is widely used to extract meaningful information from video, images, audio, text, and other multimedia data.  Through a hierarchical structure, modern neural networks coupled with backpropagation learn to extract information from large amounts of data and to perform specific tasks such as classification or regression. In this thesis, we explore various approaches to multimedia analytics with neural networks. We present several image synthesis and rendering techniques to generate new images for training neural networks. Furthermore, we present multiple neural network architectures and systems for commercial logo detection, 3D pose estimation and tracking, deepfakes detection, and manipulation detection in satellite images.
APA, Harvard, Vancouver, ISO, and other styles
41

Avinash, Ramakanth S. "Approximate Nearest Neighbour Field Computation and Applications." Thesis, 2014. http://etd.iisc.ernet.in/2005/3503.

Full text
Abstract:
Approximate Nearest-Neighbour Field (ANNF) maps between two related images are commonly used by the computer vision and graphics community for image editing, completion, retargeting and denoising. In this work we generalize ANNF computation to unrelated image pairs. For accurate ANNF map computation we propose FeatureMatch, in which low-dimensional features approximate image patches, combined with global colour adaptation. Unlike existing approaches, the proposed algorithm does not assume any relation between the image pair and thus generalizes ANNF maps to any unrelated image pairs. This generalization enables the ANNF approach to handle a wider range of vision applications more efficiently. The following is a brief description of the applications developed using the proposed FeatureMatch framework. The first application addresses the problem of detecting the optic disk in retinal images. The combination of ANNF maps and salient properties of optic disks leads to an efficient optic disk detector that does not require tedious training or parameter tuning. The proposed approach is evaluated on many publicly available datasets, and an average detection accuracy of 99% is achieved with a computation time of 0.2 s per image. The second application aims to super-resolve a given synthetic image using a single source image as dictionary, avoiding the expensive training involved in conventional approaches. In the third application, we make use of ANNF maps to accurately propagate labels across video for segmenting video objects. The proposed approach outperforms the state of the art on the widely used benchmark SegTrack dataset. In the fourth application, ANNF maps obtained between two consecutive frames of video are enhanced for estimating sub-pixel accurate optical flow, a critical step in many vision applications. Finally, a summary of the framework for various possible applications, such as image encryption and scene segmentation, is provided.
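The core of an ANNF map is, for every patch in image A, an approximate best match in image B. A crude sketch with low-dimensional patch features and a k-d tree (scikit-learn) follows; FeatureMatch's actual features and colour adaptation are richer than the quadrant means used here, and the images are random stand-ins.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def patch_features(img, p=8):
    """Low-dimensional feature per patch: mean of each p/2 x p/2 quadrant."""
    h, w = img.shape[:2]
    feats, coords = [], []
    for y in range(0, h - p + 1, p):
        for x in range(0, w - p + 1, p):
            patch = img[y:y + p, x:x + p].astype(np.float32)
            q = p // 2
            feats.append([patch[:q, :q].mean(), patch[:q, q:].mean(),
                          patch[q:, :q].mean(), patch[q:, q:].mean()])
            coords.append((y, x))
    return np.array(feats), coords

rng = np.random.default_rng(0)
img_a = rng.integers(0, 255, (64, 64)).astype(np.uint8)   # stand-in images
img_b = rng.integers(0, 255, (64, 64)).astype(np.uint8)

fa, ca = patch_features(img_a)
fb, cb = patch_features(img_b)
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(fb)
_, idx = nn.kneighbors(fa)
annf = {ca[i]: cb[j[0]] for i, j in enumerate(idx)}       # patch -> match
print(len(annf), "patch correspondences")
```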
APA, Harvard, Vancouver, ISO, and other styles
42

Gupta, Sonal. "Activity retrieval in closed captioned videos." Thesis, 2009. http://hdl.handle.net/2152/ETD-UT-2009-08-305.

Full text
Abstract:
Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle & zoom, occlusion and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This thesis explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting 'labeled' data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
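The weak-supervision step above can be approximated with a simple retrieval rule: pair each activity keyword found in the closed captions with the surrounding video clip and keep those clips as noisy training examples. The toy sketch below uses a bare keyword lookup; the thesis's caption classifier applies real linguistic analysis, and the keywords, timestamps and window are invented.

```python
# Toy caption stream: (timestamp_seconds, caption_text).
captions = [
    (12.0, "and he kicks the ball toward the far post"),
    (47.5, "a great save by the goalkeeper"),
    (63.2, "the referee checks his watch"),
]

ACTIVITY_KEYWORDS = {"kick": "kicking", "save": "save", "pass": "passing"}
CLIP_HALF_WINDOW = 4.0  # seconds of video kept around each caption

def harvest_weak_labels(captions):
    """Yield (label, clip_start, clip_end) for captions naming an activity."""
    for t, text in captions:
        for kw, label in ACTIVITY_KEYWORDS.items():
            if kw in text.lower():
                yield label, max(0.0, t - CLIP_HALF_WINDOW), t + CLIP_HALF_WINDOW

for label, start, end in harvest_weak_labels(captions):
    print(f"{label}: clip [{start:.1f}s, {end:.1f}s]")
```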
APA, Harvard, Vancouver, ISO, and other styles
43

Kun-Chih, Shih, and 施昆志. "An creative and interactive multimedia system for playing comfortable music in general spaces based on computer vision and image processing technique, and combined analyses of color, psychology, and music information." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/r4366v.

Full text
Abstract:
Master's thesis
Southern Taiwan University of Science and Technology
Graduate Institute of Multimedia and Computer Entertainment Science
Academic year 94 (2005/06)
Systems based on computer vision and image processing are widely developed for scientific and medical applications. On the other hand, integrated analyses of color, psychology, music, and multimedia presentation are useful and helpful in everyday entertainment. The association of the two fields has become more and more popular in recent years, and it will be a trend in the future. This motivates us to design a creative and interactive multimedia system that can recognize and capture the color information of a person's clothing when he or she enters a space. After the color recognition and extraction, we relate the color information to psychological theory to analyze the characteristics and feelings of the people in the space. Moreover, we relate the psychological theory to music theory to play appropriate music that comforts the minds of the people in the space. This application can easily be extended to exhibition centers, conference halls, coffee bars, or any space needing special music. Successful experimental results confirm the effectiveness of the proposed approach.
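The colour-to-music mapping described above can be caricatured in a few lines: extract the dominant hue of the detected clothing region, bucket it into a colour family, and look up a music mood. The hue buckets and mood table below are invented for illustration, not taken from the thesis.

```python
import cv2
import numpy as np

MOOD_BY_COLOR = {  # invented lookup table
    "red": "energetic rhythms", "yellow": "bright upbeat tunes",
    "green": "calm ambient music", "blue": "relaxing slow pieces",
    "magenta": "playful melodies",
}

def dominant_color_family(bgr_region):
    hsv = cv2.cvtColor(bgr_region, cv2.COLOR_BGR2HSV)
    hue = np.median(hsv[:, :, 0])             # OpenCV hue range: 0-179
    bounds = [(15, "red"), (45, "yellow"), (90, "green"),
              (135, "blue"), (170, "magenta")]
    for upper, name in bounds:
        if hue < upper:
            return name
    return "red"                               # hue wraps around

region = np.full((40, 40, 3), (200, 50, 50), np.uint8)  # stand-in: bluish clothing
family = dominant_color_family(region)
print(family, "->", MOOD_BY_COLOR[family])
```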
APA, Harvard, Vancouver, ISO, and other styles
44

Vieira, Leonardo Machado Alves. "Development of an Automatic Image Enhancement Framework." Master's thesis, 2020. http://hdl.handle.net/10316/92489.

Full text
Abstract:
Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology
Image enhancement is an image processing procedure in which an image is made better suited for a task, and so it is very relevant across multiple fields, such as medical imagery, space imagery, biometrics, etc. Image enhancement can be used to alter an image in several different ways, for instance by highlighting a specific feature in order to ease post-processing analysis by a human or a machine, or by increasing its perceived aesthetic quality. The main objective of this work is the study and development of a possible automatic image enhancement system, with digital real-estate marketing as a case study, in the context of the project "Indest - Indicador de composicion estética". We explored existing research in image enhancement and propose an end-to-end image enhancement pipeline architecture that takes advantage of classical, evolutionary and machine learning approaches from the literature. The framework is very modular, allowing changes to its components and parameters. We tested it using a provided dataset of real-estate pictures of varying quality. The output enhanced images were evaluated using four image quality assessment tools and by conducting a user survey to assess their user-perceived quality. We confirmed the initial presupposition that manipulating multiple image attributes at the same time is a complex problem. Also, looking at the survey results, we arrived at the conclusion that, in our scenario, similarity between an enhanced version and the original image is, to some extent, more important than improving its aesthetic value. Enhancement can sometimes be exaggerated, causing the loss of useful contextual information or highlighting image defects. As such, a balance between similarity and aesthetics is desirable. Nevertheless, the attained results suggest that a modular and hybrid architecture like the one proposed has potential in the area of image enhancement. Automatic image enhancement is closely tied to the capability of automated image quality assessment systems, and so progress in the two areas is intrinsically connected.
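The balance the survey points to, improving appearance without drifting too far from the original, can be expressed as a guarded pipeline: apply candidate enhancement stages, then keep the result only if structural similarity to the original stays above a floor. A minimal sketch with two classical stages follows; the stages, the 0.85 floor and the file path are arbitrary choices, not the thesis's pipeline.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def enhance(img):
    """Two classical stages: CLAHE contrast boost, then mild sharpening."""
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    out = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
    blur = cv2.GaussianBlur(out, (0, 0), sigmaX=2.0)
    return cv2.addWeighted(out, 1.3, blur, -0.3, 0)   # unsharp mask

img = cv2.imread("listing_photo.jpg")                  # placeholder path
enhanced = enhance(img)

# Keep the enhancement only if it stays close to the original.
similarity = ssim(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY),
                  cv2.cvtColor(enhanced, cv2.COLOR_BGR2GRAY))
result = enhanced if similarity >= 0.85 else img
print(f"SSIM to original: {similarity:.3f}")
```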
APA, Harvard, Vancouver, ISO, and other styles
45

Héon-Morissette, Barah. "L’espace du geste-son, vers une nouvelle pratique performative." Thèse, 2016. http://hdl.handle.net/1866/19567.

Full text
Abstract:
This research-creation thesis is a reflection on the gesture-sound space. The author's artistic research, based on six elements (body, sound, gesture, video, physical space, and technological space), was integrated into the conception of a motion capture system based on computer vision, the SICMAP (Système Interactif de Captation du Mouvement en Art Performatif, or Interactive Motion Capture System for Performative Arts). This approach proposes a new hybrid performative practice. In the first part, the author situates her artistic practice, supported by the three pillars of transdisciplinary research methodology: the levels of Reality and perception (the body and space as matter), the logic of the included middle (the gesture-sound space) and complexity (elements of the creative process). These transdisciplinary concepts are juxtaposed through the analysis of works bearing an element common to the author's artistic practice: the body at the center of a sensorial universe. The author then puts forth elements relative to the scenic practice raised by this innovative artistic approach through the expressive body. The path taken by the performer-creator, leading to the conception of the SICMAP, is then explained through a reflection on the "dream instrument" and the realization of two preparatory gestural interfaces. Implying a new gestural vocabulary in the context of an interface without haptic feedback, that of the free-body gesture, the typology of the instrumental gesture is revisited in response to the new paradigm of the gesture-sound space. In reply to this research, the details of the SICMAP are then presented from the angle of the technological space and then applied to the gesture-sound space. The compositions realized during the development of the SICMAP are then presented; these works are discussed from an artistic and poietic point of view through the founding elements of the author's creative process. The conclusion summarizes the objectives of this research-creation as well as the contributions of this new hybrid performative practice.
APA, Harvard, Vancouver, ISO, and other styles
46

(8786558), Mehul Nanda. "You Only Gesture Once (YouGo): American Sign Language Translation using YOLOv3." Thesis, 2020.

Find full text
Abstract:
The study focused on creating and proposing a model that could accurately and precisely predict the occurrence of an American Sign Language gesture for an alphabet in the English language using the You Only Look Once (YOLOv3) algorithm. The training dataset used for this study was custom created and was further divided into clusters based on the uniqueness of the ASL sign; three diverse clusters were created. Each cluster was trained with the network known as Darknet. Testing was conducted using images and videos for the fully trained models of each cluster, and the Average Precision for each alphabet in each cluster and the Mean Average Precision for each cluster were noted. In addition, a Word Builder script was created. This script combined the trained models of all three clusters to create a comprehensive system that would build words when the trained models were supplied with images of alphabets in the English language as depicted in ASL.
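The Word Builder step described above amounts to running each cluster's detector on a sequence of letter images and concatenating the highest-confidence predictions. A schematic sketch, assuming stand-in models in place of the three trained YOLOv3 cluster detectors, which are not reproduced here:

```python
def detect_letters(image, cluster_models):
    """Stand-in for YOLOv3 inference: each model returns (letter, confidence)
    candidates for one image; real inference would run Darknet here."""
    candidates = []
    for model in cluster_models:
        candidates.extend(model(image))
    return candidates

def build_word(images, cluster_models):
    word = []
    for image in images:
        candidates = detect_letters(image, cluster_models)
        if candidates:
            letter, conf = max(candidates, key=lambda c: c[1])
            word.append(letter)
    return "".join(word)

# Toy stand-ins: each "model" recognises certain letters in certain frames.
def make_model(table):
    return lambda img: table.get(img, [])

models = [
    make_model({"img_c": [("C", 0.91)]}),
    make_model({"img_a": [("A", 0.88)]}),
    make_model({"img_t": [("T", 0.79)], "img_a": [("I", 0.40)]}),
]
frames = ["img_c", "img_a", "img_t"]   # placeholder ASL frames
print(build_word(frames, models))       # -> "CAT"
```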
APA, Harvard, Vancouver, ISO, and other styles
47

(8771429), Ashley S. Dale. "3D OBJECT DETECTION USING VIRTUAL ENVIRONMENT ASSISTED DEEP NETWORK TRAINING." Thesis, 2021.

Find full text
Abstract:

An RGBZ synthetic dataset consisting of five object classes in a variety of virtual environments and orientations was combined with a small sample of real-world image data and used to train the Mask R-CNN (MR-CNN) architecture in a variety of configurations. When the MR-CNN architecture was initialized with MS COCO weights and the heads were trained with a mix of synthetic data and real-world data, F1 scores improved in four of the five classes: the average maximum F1-score over all classes and all epochs for the networks trained with synthetic data is F1* = 0.91, compared to F1 = 0.89 for the networks trained exclusively with real data, and the standard deviation of the maximum mean F1-score for synthetically trained networks is σ*_F1 = 0.015, compared to σ_F1 = 0.020 for the networks trained exclusively with real data. Various backgrounds in synthetic data were shown to have negligible impact on F1 scores, opening the door to abstract backgrounds and minimizing the need for intensive synthetic data fabrication. When the MR-CNN architecture was initialized with MS COCO weights and depth data was included in the training data, the network was shown to rely heavily on the initial convolutional input to feed features into the network, the image depth channel was shown to influence mask generation, and the image color channels were shown to influence object classification. A set of latent variables for a subset of the synthetic dataset was generated with a Variational Autoencoder, then analyzed using Principal Component Analysis and Uniform Manifold Approximation and Projection (UMAP). The UMAP analysis showed no meaningful distinction between real-world and synthetic data, and a small bias towards clustering based on image background.
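The latent-space analysis described above (VAE latents inspected with PCA and UMAP) can be reproduced on stand-in data in a few lines with scikit-learn and umap-learn; the latents below are random surrogates, not the thesis's, and the centroid-gap readout is only a rough proxy for the overlap the thesis reports.

```python
import numpy as np
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
# Surrogate VAE latents: 300 "synthetic" and 300 "real" samples, 64-D.
latents = np.vstack([rng.normal(0.0, 1.0, (300, 64)),
                     rng.normal(0.1, 1.0, (300, 64))])
labels = np.array([0] * 300 + [1] * 300)  # 0 = synthetic, 1 = real

pca2 = PCA(n_components=2).fit_transform(latents)
umap2 = umap.UMAP(n_neighbors=15, random_state=0).fit_transform(latents)

# If the 2-D embeddings of the two groups overlap heavily, the latent
# space draws no meaningful distinction between real and synthetic data.
for name, emb in (("PCA", pca2), ("UMAP", umap2)):
    gap = np.linalg.norm(emb[labels == 0].mean(0) - emb[labels == 1].mean(0))
    print(f"{name}: centroid gap = {gap:.2f}")
```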

APA, Harvard, Vancouver, ISO, and other styles
