A selection of scholarly literature on the topic "Audio-visual scene analysis"

Format your source according to APA, MLA, Chicago, Harvard, and other citation styles


Browse the lists of current articles, books, dissertations, conference papers, and other scholarly sources on the topic "Audio-visual scene analysis".

Next to every work in the list of references there is an "Add to bibliography" button. Use it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of a scholarly publication in .pdf format and read its abstract online, provided the relevant details are available in the metadata.

Journal articles on the topic "Audio-visual scene analysis":

1

Parekh, Sanjeel, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Perez, and Gael Richard. "Weakly Supervised Representation Learning for Audio-Visual Scene Analysis." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 416–28. http://dx.doi.org/10.1109/taslp.2019.2957889.

2

O’Donovan, Adam, Ramani Duraiswami, Dmitry Zotkin, and Nail Gumerov. "Audio visual scene analysis using spherical arrays and cameras." Journal of the Acoustical Society of America 127, no. 3 (March 2010): 1979. http://dx.doi.org/10.1121/1.3385079.

3

Ahrens, Axel, and Kasper Duemose Lund. "Auditory spatial analysis in reverberant multi-talker environments with congruent and incongruent audio-visual room information." Journal of the Acoustical Society of America 152, no. 3 (September 2022): 1586–94. http://dx.doi.org/10.1121/10.0013991.

Abstract:
In a multi-talker situation, listeners have the challenge of identifying a target speech source out of a mixture of interfering background noises. In the current study, it was investigated how listeners analyze audio-visual scenes with varying complexity in terms of number of talkers and reverberation. The visual information of the room was either congruent with the acoustic room or incongruent. The listeners' task was to locate an ongoing speech source in a mixture of other speech sources. The three-dimensional audio-visual scenarios were presented using a loudspeaker array and virtual reality glasses. It was shown that room reverberation, as well as the number of talkers in a scene, influence the ability to analyze an auditory scene in terms of accuracy and response time. Incongruent visual information of the room did not affect this ability. When few talkers were presented simultaneously, listeners were able to detect a target talker quickly and accurately even in adverse room acoustical conditions. Reverberation started to affect the response time when four or more talkers were presented. The number of talkers became a significant factor for five or more simultaneous talkers.
4

Motlicek, Petr, Stefan Duffner, Danil Korchagin, Hervé Bourlard, Carl Scheffler, Jean-Marc Odobez, Giovanni Del Galdo, Markus Kallinger, and Oliver Thiergart. "Real-Time Audio-Visual Analysis for Multiperson Videoconferencing." Advances in Multimedia 2013 (2013): 1–21. http://dx.doi.org/10.1155/2013/175745.

Abstract:
We describe the design of a system consisting of several state-of-the-art real-time audio and video processing components enabling multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to allow multiple people to enter, interact, and leave the observable scene with no constraints. They comprise continuous localisation of audio objects and its application for spatial audio object coding, detection, and tracking of faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, and the association and fusion of these different events. Combined all together, they represent multimodal streams with audio objects and semantic video objects and provide semantic information for stream manipulation systems (like a virtual director). Various experiments have been performed to evaluate the performance of the system. The obtained results demonstrate the effectiveness of the proposed design, the various algorithms, and the benefit of fusing different modalities in this scenario.
5

Gebru, Israel Dejene, Xavier Alameda-Pineda, Florence Forbes, and Radu Horaud. "EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence 38, no. 12 (December 1, 2016): 2402–15. http://dx.doi.org/10.1109/tpami.2016.2522425.

6

Mulachela, Husen, Aurelius RL Teluma, and Eka Putri Paramita. "Gender Equality Messages in Film Marlina The Murderer In Four Acts." JCommsci - Journal of Media and Communication Science 2, no. 3 (September 13, 2019): 136. http://dx.doi.org/10.29303/jcommsci.v2i3.57.

Abstract:
This research analyzes the meaning of symbols in the film Marlina the Murderer in Four Acts based on the indicators of gender equality, namely access, participation, control, and benefits. The unit of analysis includes the audio and visual elements of selected scenes, which are then examined using Roland Barthes' semiotic method, known as the "two orders of signification", to uncover denotative and connotative meanings and the myths contained in both orders. The whole study follows a framework of thinking aimed at answering the research problem. The researchers found 17 scenes containing messages of gender equality, with indicators of gender equality present in both audio and visual elements. After the scene analysis using Roland Barthes' semiotic method, the control indicator of gender equality proved the most prominent in the film, followed by the indicators of access, participation, and benefits. This shows the important role the control indicator plays in gender equality, enabling the other indicators to work. Keywords: Semiotic; film; gender equality; Marlina the Murderer in Four Acts
7

Xiao, Mei, May Wong, Michelle Umali, and Marc Pomplun. "Using Eye-Tracking to Study Audio — Visual Perceptual Integration." Perception 36, no. 9 (September 2007): 1391–95. http://dx.doi.org/10.1068/p5731.

Abstract:
Perceptual integration of audio—visual stimuli is fundamental to our everyday conscious experience. Eye-movement analysis may be a suitable tool for studying such integration, since eye movements respond to auditory as well as visual input. Previous studies have shown that additional auditory cues in visual-search tasks can guide eye movements more efficiently and reduce their latency. However, these auditory cues were task-relevant since they indicated the target position and onset time. Therefore, the observed effects may have been due to subjects using the cues as additional information to maximize their performance, without perceptually integrating them with the visual displays. Here, we combine a visual-tracking task with a continuous, task-irrelevant sound from a stationary source to demonstrate that audio—visual perceptual integration affects low-level oculomotor mechanisms. Auditory stimuli of constant, increasing, or decreasing pitch were presented. All sound categories induced more smooth-pursuit eye movement than silence, with the greatest effect occurring with stimuli of increasing pitch. A possible explanation is that integration of the visual scene with continuous sound creates the perception of continuous visual motion. Increasing pitch may amplify this effect through its common association with accelerating motion.
8

Nahorna, Olha, Frédéric Berthommier, and Jean-Luc Schwartz. "Audio-visual speech scene analysis: Characterization of the dynamics of unbinding and rebinding the McGurk effect." Journal of the Acoustical Society of America 137, no. 1 (January 2015): 362–77. http://dx.doi.org/10.1121/1.4904536.

9

Habib, Muhammad Alhada Fuadilah, Asik Putri Ayusari Ratnaningsih, and Michael Jeffri Sinabutar. "SEMIOTICS ANALYSIS OF AHOK-DJAROT’S CAMPAIGN VIDEO ON YOUTUBE SOCIAL MEDIA FOR THE SECOND ROUND OF THE 2017 DKI JAKARTA GUBERNATORIAL ELECTION." Journal of Urban Sociology 4, no. 2 (December 22, 2021): 76. http://dx.doi.org/10.30742/jus.v4i2.1772.

Abstract:
This study focuses on the messages conveyed in Ahok-Djarot's campaign video on Youtube social media for the second round of the 2017 DKI Jakarta gubernatorial election by exploring and analyzing the icons, indexes, symbols, lyrics, and storyline using Peirce's semiotics within a visual methodology. The messages conveyed through the video titled "Video Kampanye Ahok-Djarot: Pastikan Pancasila Hadir di Jakarta" (Ahok-Djarot's Campaign Video: Ensure Pancasila is Present in Jakarta) are very interesting to study because the video went viral during the heated political climate of the second round of the 2017 DKI Jakarta gubernatorial election. Support for Ahok-Djarot, as well as criticism of the current condition of Jakarta, is articulated through a two-minute video combining visual images and audio lyrics with subtitles, packaged in an engaging way. The results of the analysis are as follows: when Ahok-Djarot's campaign video for the second round of the 2017 DKI Jakarta gubernatorial election is mapped, it has three main scenes, and each scene tries to convey a message to the public, especially the citizens of Jakarta. The first scene shows that the current condition of Jakarta is still intolerant and tends to discriminate against minorities. The second scene tries to show that Jakarta should truly uphold the motto "Bhineka Tunggal Ika" (Unity in Diversity) and not just use it as a mere slogan. The third scene tries to show that Ahok-Djarot are suitable leaders for the citizens of Jakarta because Ahok has worked for Jakarta and has made real efforts for the city; moreover, Ahok-Djarot are a pair of candidates who can solve the problems described in the previous scenes. In general, the issue raised in this controversial video is intolerance, currently considered a major problem for the citizens of Jakarta. Keywords: Semiotics, Intolerance, Political Campaign, Youtube Social Media, Ahok-Djarot, Jakarta
10

Ramenahalli, Sudarshan. "A Biologically Motivated, Proto-Object-Based Audiovisual Saliency Model." AI 1, no. 4 (November 3, 2020): 487–509. http://dx.doi.org/10.3390/ai1040030.

Abstract:
The natural environment and our interaction with it are essentially multisensory, where we may deploy visual, tactile and/or auditory senses to perceive, learn and interact with our environment. Our objective in this study is to develop a scene analysis algorithm using multisensory information, specifically vision and audio. We develop a proto-object-based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes. A specialized audiovisual camera with 360° field of view, capable of locating sound direction, is used to collect spatiotemporally aligned audiovisual data. We demonstrate that the performance of a proto-object-based audiovisual saliency map in detecting and localizing salient objects/events is in agreement with human judgment. In addition, the proto-object-based AVSM that we compute as a linear combination of visual and auditory feature conspicuity maps captures a higher number of valid salient events compared to unisensory saliency maps. Such an algorithm can be useful in surveillance, robotic navigation, video compression and related applications.
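A note on the combination step described in this abstract: the AVSM is computed as a linear combination of visual and auditory conspicuity maps. The NumPy sketch below illustrates only that final step, under the assumption that the two maps have already been computed, spatially registered, and are non-negative; the function name and the weight w are illustrative and not the paper's API.

import numpy as np

def audiovisual_saliency(visual_map, auditory_map, w=0.5):
    # Both maps are assumed to be non-negative 2-D arrays over the same
    # spatial grid; w sets the relative weight of the visual modality.
    v = visual_map / (visual_map.max() + 1e-12)    # normalize each map to [0, 1]
    a = auditory_map / (auditory_map.max() + 1e-12)
    return w * v + (1.0 - w) * a                   # linear combination (AVSM)

# Toy example: a visual hotspot on the left, an auditory hotspot on the right.
vis = np.zeros((4, 8)); vis[2, 1] = 1.0
aud = np.zeros((4, 8)); aud[2, 6] = 1.0
avsm = audiovisual_saliency(vis, aud, w=0.6)
print(np.unravel_index(avsm.argmax(), avsm.shape))  # (2, 1): the stronger, visual peak wins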

Dissertations on the topic "Audio-visual scene analysis":

1

Parekh, Sanjeel. "Learning representations for robust audio-visual scene analysis." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT015/document.

Abstract:
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
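Since this thesis abstract centers on audio source separation within the non-negative matrix factorization (NMF) framework, a minimal, generic sketch of that underlying decomposition may help situate the work. It factorizes a toy magnitude spectrogram with scikit-learn's NMF and reconstructs per-source estimates with soft masks; it deliberately omits the motion coupling and cross-modal regression that are the thesis's actual contributions, and all names and parameters are illustrative.

import numpy as np
from sklearn.decomposition import NMF

# Toy magnitude "spectrogram": 64 frequency bins x 100 frames built from two
# synthetic sources with distinct spectral templates and random activations.
rng = np.random.default_rng(0)
freqs = np.arange(64)
template_low = np.exp(-0.5 * ((freqs - 10) / 3.0) ** 2)
template_high = np.exp(-0.5 * ((freqs - 45) / 3.0) ** 2)
activations = rng.random((2, 100))
V = np.outer(template_low, activations[0]) + np.outer(template_high, activations[1])

# Rank-2 NMF: V is approximated by W @ H, with W holding spectral templates
# and H their temporal activations.
model = NMF(n_components=2, init="nndsvda", max_iter=500)
W = model.fit_transform(V)   # (64, 2) spectral basis
H = model.components_        # (2, 100) activations

# Wiener-style soft masks split the mixture into per-source spectrograms.
estimates = [np.outer(W[:, k], H[k]) for k in range(2)]
total = sum(estimates) + 1e-12
sources = [(est / total) * V for est in estimates]
print([s.shape for s in sources])  # two (64, 100) source estimates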
2

Phillips, Nicola Jane. "Audio-visual scene analysis : attending to music in film." Thesis, University of Cambridge, 2000. https://www.repository.cam.ac.uk/handle/1810/251745.

3

Alameda-Pineda, Xavier. "Egocentric Audio-Visual Scene Analysis : a machine learning and signal processing approach." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM024/document.

Abstract:
Over the past two decades, the industry has developed several commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc.). Others, such as the Kinect, include depth sensors and/or small microphone arrays. There are also mobile phones equipped with a stereo camera pair. At the same time, many research-oriented systems have become available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors are close to each other. Therefore, they are not able to capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "Egocentric Audio-Visual Scene Analysis". This thesis contributes to this field in several aspects. Firstly, by providing a publicly available data set targeting applications such as action/gesture recognition, speaker localization, tracking and diarisation, sound source localization, dialogue modelling, etc. This work has since been used both inside and outside the thesis. We also investigated the problem of AV event detection. We showed how the trust in one of the modalities (the visual one, to be precise) can be modeled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). Afterwards we modified the approach to target audio-visual speaker detection, yielding an on-line method working on the humanoid robot NAO. In parallel to the work on audio-visual speaker detection, we developed a new approach for audio-visual command recognition. We explored different features and classifiers and confirmed that the use of audio-visual data increases the performance when compared to audio-only and video-only classifiers. Later, we sought the best method for tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt and learn new commands from the user, and such systems need to be operational with only a few examples for general public usage. Finally, we contributed to the field of sound source localization, in the particular case of non-coplanar microphone arrays. This is interesting because the geometry of the array can be arbitrary. Consequently, this opens the door to dynamic microphone arrays that would adapt their geometry to fit particular tasks, and to commercial systems whose design may be subject to constraints for which circular or linear arrays are not suited.
4

Khalidov, Vasil. "Modèles de mélanges conjugués pour la modélisation de la perception visuelle et auditive." Grenoble, 2010. http://www.theses.fr/2010GRENM064.

Abstract:
In this thesis, the modelling of audio-visual perception with a head-like device is considered. The related problems, namely audio-visual calibration, audio-visual object detection, localization, and tracking, are addressed. A spatio-temporal approach to the head-like device calibration is proposed, based on probabilistic multimodal trajectory matching. The formalism of conjugate mixture models is introduced, along with a family of efficient optimization algorithms to perform multimodal clustering. One instance of this algorithm family, namely the conjugate expectation maximization (ConjEM) algorithm, is further improved to gain attractive theoretical properties. Multimodal object detection and object number estimation methods are developed, and their theoretical properties are discussed. Finally, the proposed multimodal clustering method is combined with the object detection and object number estimation strategies and with known tracking techniques to perform multimodal multi-object tracking. The performance is demonstrated on simulated data and on a database of realistic audio-visual scenarios (the CAVA database).
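The conjugate mixture model formalism summarized above extends expectation-maximization for mixture models to bind audio and visual observations. As an assumed point of reference only, the sketch below shows a plain, single-modality EM iteration for a 1-D Gaussian mixture; it does not implement the conjugate (audio-visual) coupling or the ConjEM algorithm from the thesis.

import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=50, seed=0):
    # Plain EM for a 1-D Gaussian mixture model (illustrative building block only).
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, n_components, replace=False)   # initial means drawn from the data
    var = np.full(n_components, x.var())
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        lik = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Toy data: two clusters of observations (e.g., azimuths of two sound sources).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-20, 3, 200), rng.normal(35, 3, 200)])
print(em_gmm_1d(x))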
5

Stauffer, Chris. "Automated Audio-visual Activity Analysis." 2005. http://hdl.handle.net/1721.1/30568.

Abstract:
Current computer vision techniques can effectively monitor gross activities in sparse environments. Unfortunately, visual stimulus is often not sufficient for reliably discriminating between many types of activity. In many cases where the visual information required for a particular task is extremely subtle or non-existent, there is often audio stimulus that is extremely salient for a particular classification or anomaly detection task. Unfortunately, unlike visual events, independent sounds are often very ambiguous and not sufficient to define useful events themselves. Without an effective method of learning causally-linked temporal sequences of sound events that are coupled to the visual events, these sound events are generally only useful for independent anomalous sound detection, e.g., detecting a gunshot or breaking glass. This paper outlines a method for automatically detecting a set of audio events and visual events in a particular environment, for determining statistical anomalies, for automatically clustering these detected events into meaningful clusters, and for learning salient temporal relationships between the audio and visual events. This results in a compact description of the different types of compound audio-visual events in an environment.
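As a loose illustration of the last step mentioned in this abstract (learning salient temporal relationships between audio and visual events), the sketch below simply counts how often each audio event label occurs within a short window after each visual event label. The event streams, labels, and window length are hypothetical and not taken from the paper.

import numpy as np

def temporal_cooccurrence(audio_events, visual_events, n_audio, n_visual, window=2.0):
    # audio_events, visual_events: lists of (timestamp_seconds, label_index) pairs.
    # Returns an (n_visual, n_audio) count matrix; large entries suggest a salient
    # temporal link between a visual event type and an audio event type.
    counts = np.zeros((n_visual, n_audio), dtype=int)
    for vt, vl in visual_events:
        for at, al in audio_events:
            if 0.0 <= at - vt <= window:
                counts[vl, al] += 1
    return counts

# Toy streams: visual label 0 = "door opens", audio label 1 = "door slam".
visual = [(1.0, 0), (10.0, 0), (20.0, 1)]
audio = [(1.5, 1), (10.8, 1), (25.0, 0)]
print(temporal_cooccurrence(audio, visual, n_audio=2, n_visual=2))  # [[0 2] [0 0]]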

Book chapters on the topic "Audio-visual scene analysis":

1

Saraceno, Caterina, and Riccardo Leonardi. "Audio-visual processing for scene change detection." In Image Analysis and Processing, 124–31. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997. http://dx.doi.org/10.1007/3-540-63508-4_114.

2

Tsekeridou, Sofia, Stelios Krinidis, and Ioannis Pitas. "Scene Change Detection Based on Audio-Visual Analysis and Interaction." In Multi-Image Analysis, 214–25. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001. http://dx.doi.org/10.1007/3-540-45134-x_16.

3

Owens, Andrew, and Alexei A. Efros. "Audio-Visual Scene Analysis with Self-Supervised Multisensory Features." In Computer Vision – ECCV 2018, 639–58. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01231-1_39.

4

Ganesh, Attigodu Chandrashekara, Frédéric Berthommier, and Jean-Luc Schwartz. "Audio Visual Integration with Competing Sources in the Framework of Audio Visual Speech Scene Analysis." In Advances in Experimental Medicine and Biology, 399–408. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-25474-6_42.

5

Gupta, Vaibhavi, Vinay Detani, Vivek Khokar, and Chiranjoy Chattopadhyay. "C2VNet: A Deep Learning Framework Towards Comic Strip to Audio-Visual Scene Synthesis." In Document Analysis and Recognition – ICDAR 2021, 160–75. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-86331-9_11.

6

Pham, Lam, Alexander Schindler, Mina Schutz, Jasmin Lampert, Sven Schlarb, and Ross King. "Deep Learning Frameworks Applied For Audio-Visual Scene Classification." In Data Science – Analytics and Applications, 39–44. Wiesbaden: Springer Fachmedien Wiesbaden, 2022. http://dx.doi.org/10.1007/978-3-658-36295-9_6.


Conference papers on the topic "Audio-visual scene analysis":

1

Wang, Shanshan, Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. "A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9415085.

2

Schwartz, Jean-Luc, Frédéric Berthommier, and Christophe Savariaux. "Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception." In 7th International Conference on Spoken Language Processing (ICSLP 2002). ISCA: ISCA, 2002. http://dx.doi.org/10.21437/icslp.2002-437.

3

"ColEnViSon: Color Enhanced Visual Sonifier - A Polyphonic Audio Texture and Salient Scene Analysis." In International Conference on Computer Vision Theory and Applications. SciTePress - Science and and Technology Publications, 2009. http://dx.doi.org/10.5220/0001805105660572.

4

Schott, Gareth, and Raphael Marczak. "Understanding game actions: The development of a post-processing method for audio-visual scene analysis." In 2016 Future Technologies Conference (FTC). IEEE, 2016. http://dx.doi.org/10.1109/ftc.2016.7821657.

5

Fayek, Haytham M., and Anurag Kumar. "Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/78.

Abstract:
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models, and state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on Audioset, outperforming prior state of the art by approximately +4.35 mAP (relative: 10.4%).
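The attention-based fusion described in this abstract dynamically weights the outputs of the audio and visual models before combining them. The NumPy sketch below shows one simple form such a fusion could take, with modality weights computed by a softmax over scoring parameters that are untrained and purely illustrative here; it is an assumed simplification, not the authors' model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(audio_scores, visual_scores, w_att):
    # audio_scores, visual_scores: (n_classes,) outputs of each single-modality model.
    # w_att: (2, n_classes) parameters of a tiny scoring layer (hypothetical).
    stacked = np.stack([audio_scores, visual_scores])   # (2, n_classes)
    att = softmax((w_att * stacked).sum(axis=1))        # one attention weight per modality
    return att @ stacked, att                           # attention-weighted class scores

audio = np.array([2.0, 0.1, -1.0])   # toy 3-class scores from the audio model
visual = np.array([0.5, 1.5, 0.0])   # toy 3-class scores from the visual model
w = np.full((2, 3), 0.1)             # untrained, illustrative attention parameters
fused, weights = attention_fusion(audio, visual, w)
print(weights, fused)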
6

YANG, LING, and SHENG-DONG YUE. "AN ANALYSIS OF THE CHARACTERISTICS OF MUSIC CREATION IN MEFISTOFELE." In 2021 International Conference on Education, Humanity and Language, Art. Destech Publications, Inc., 2021. http://dx.doi.org/10.12783/dtssehs/ehla2021/35726.

Abstract:
Successful opera art cannot be separated from literary elements, nor from the support of music. Opera scripts build plots with words; compared with emotional resonance that comes directly from the senses, music can give plastic form to the abstract literary image from the perspective of sensibility. An excellent opera work can effectively advance the development of the plot through music design and deepen the dramatic conflict with the "ingenious leverage" of music. This article analyzes the music design of the famous opera Mefistofele and explores the fusion of music and drama and its role in advancing the plot. After its birth at the end of the 16th century and the beginning of the 17th century, western opera art quickly received widespread attention and affection. The reason for its success lies mainly in its fusion of the essence of classical music and dramatic literature. Because of this, there have always been debates about the relative importance of music and drama throughout the long history of opera art. In the book Opera as Drama, Joseph Kerman, a well-known contemporary musicologist, firmly believes that "opera is first and foremost a drama to show conflicts, emotions and thoughts among people through actions and events. In this process, music assumes the most important performance responsibilities."[1] Objectively speaking, these two elements, with very different external forms and internal structures, play an indispensable role in opera art. A classic opera is inseparable from the organic integration of music and drama; otherwise it will be difficult to meet the aesthetic experience expected by the audience. On the stage, it is necessary to present wonderful audio-visual enjoyment while pursuing thematic expression with deep thought, but the expression of emotion in music creation must be realized through music's own specific language rather than separated from its own consciousness. Only through the superb expression of music can conflicts, thoughts and emotions be fully reflected; otherwise the work may be reduced to empty preaching. Joseph Kerman once pointed out that "the true meaning of opera is to carry drama with music". He believes that opera expresses thoughts and emotions through many factors such as scenes, actions, characters and plots, but the carrier of these elements lies in music. Only under the guidance and support of music can the characters, thoughts and emotions of the drama be truly portrayed. Indeed, opera scripts construct fictional plots with words, and music presents the abstract literary image concretely and re-creatively, allowing complex emotions that are difficult to express in words to be perceived by the audience in the flow of notes, thereby resonating with people.[2] Mefistofele, which this article explores, is such an opera, exemplary in its organic integration of music and drama.
