
Journal articles on the topic 'Generative audio models'


Consult the top 50 journal articles for your research on the topic 'Generative audio models.'


1

Evans, Zach, Scott H. Hawley, and Katherine Crowson. "Musical audio samples generated from joint text embeddings." Journal of the Acoustical Society of America 152, no. 4 (2022): A178. http://dx.doi.org/10.1121/10.0015956.

Abstract:
The field of machine learning has benefited from the appearance of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models are currently an active area of research. We present work on short samples of musical instrument sounds generated by a model conditioned on text descriptions and the file-structure labels of large sample libraries. Preliminary findings indicate that the generation of wide-spectrum sounds such as percussion is not difficult, while the generation of harmonic music […]
2

Kang, Hyunju, Geonhee Han, Yoonjae Jeong, and Hogun Park. "AudioGenX: Explainability on Text-to-Audio Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17733–41. https://doi.org/10.1609/aaai.v39i17.33950.

Abstract:
Text-to-audio generation (TAG) models have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. […]
3

Samson, Grzegorz. "Perspectives on Generative Sound Design: A Generative Soundscapes Showcase." Arts 14, no. 3 (2025): 67. https://doi.org/10.3390/arts14030067.

Abstract:
Recent advancements in generative neural networks, particularly transformer-based models, have introduced novel possibilities for sound design. This study explores the use of generative pre-trained transformers (GPT) to create complex, multilayered soundscapes from textual and visual prompts. A custom pipeline is proposed, featuring modules for converting the source input into structured sound descriptions and subsequently generating cohesive auditory outputs. As a complementary solution, a granular synthesizer prototype was developed to enhance the usability of generative audio samples […]
4

Jeong, Yujin, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. "Read, Watch and Scream! Sound Generation from Text and Video." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17590–98. https://doi.org/10.1609/aaai.v39i17.33934.

Abstract:
Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and limits the flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. […]
5

Wang, Heng, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15492–501. http://dx.doi.org/10.1609/aaai.v38i14.29475.

Abstract:
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. On the other hand, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. […]
6

Ji, Wenliang, Ming Jin, and Yixin Chen. "Optimization of Digital Media Content Generation and Communication Effect Combined with Deep Learning Technology." Journal of Combinatorial Mathematics and Combinatorial Computing 127a (April 15, 2025): 1449–66. https://doi.org/10.61091/jcmcc127a-084.

Abstract:
The combination of deep learning and digital media technology provides great scope for content creation. The article uses Generative Adversarial Networks (GANs) in deep learning for content generation. Based on the three major forms of digital media content (image, audio, and video), content is generated by the U-Net_GAN, MAS-GAN, and SSFLVGAN models, respectively, to construct a digital media content generation model based on generative adversarial networks. Subsequently, the model is validated for performance, and the generated images, audio, and video are evaluated for […]
7

Sakirin, Tam, and Siddartha Kusuma. "A Survey of Generative Artificial Intelligence Techniques." Babylonian Journal of Artificial Intelligence 2023 (March 10, 2023): 10–14. http://dx.doi.org/10.58496/bjai/2023/003.

Abstract:
Generative artificial intelligence (AI) refers to algorithms capable of creating novel, realistic digital content autonomously. Recently, generative models have attained groundbreaking results in domains like image and audio synthesis, spurring vast interest in the field. This paper surveys the landscape of modern techniques powering the rise of creative AI systems. We structurally examine predominant algorithmic approaches including generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models. Architectural innovations and illustrations of generated […]
8

Broad, Terence, Frederic Fol Leymarie, and Mick Grierson. "Network Bending: Expressive Manipulation of Generative Models in Multiple Domains." Entropy 24, no. 1 (2021): 28. http://dx.doi.org/10.3390/e24010028.

Abstract:
This paper presents the network bending framework, a new approach for manipulating and interacting with deep generative models. We present a comprehensive set of deterministic transformations that can be inserted as distinct layers into the computational graph of a trained generative neural network and applied during inference. In addition, we present a novel algorithm for analysing the deep generative model and clustering features based on their spatial activation maps. This allows features to be grouped together based on spatial similarity in an unsupervised fashion. […]
9

Cao, Yongnian, Xuechun Yang, and Rui Sun. "Generative AI Models: Theoretical Foundations and Algorithmic Practices." Journal of Industrial Engineering and Applied Science 3, no. 1 (2025): 1–9. https://doi.org/10.70393/6a69656173.323633.

Abstract:
Generative models in AI are an entirely new paradigm for machine learning, allowing computers to create realistic data in all kinds of categories, like text (NLP), images, and even physics simulations. In this paper, this formalism is used to guide the theory, algorithms, and applications of generative models, with particular focus on a few well-established techniques like VAEs, GANs, and diffusion models. It stresses the importance of probabilistic generative modelling and information theory (i.e., KL divergence, ELBO, adversarial optimization, etc.). We cover algorithmic practices such as […]
10

Aldausari, Nuha, Arcot Sowmya, Nadine Marcus, and Gelareh Mohammadi. "Video Generative Adversarial Networks: A Review." ACM Computing Surveys 55, no. 2 (2023): 1–25. http://dx.doi.org/10.1145/3487891.

Abstract:
With the increasing interest in the content creation field in multiple sectors such as media, education, and entertainment, there is an increased trend in papers that use AI algorithms to generate content such as images, videos, audio, and text. Generative Adversarial Networks (GANs) are among the promising models that synthesize data samples similar to real data samples. While variations of GAN models in general have been covered to some extent in several survey papers, to the best of our knowledge, this is the first paper that reviews the state-of-the-art video GAN models […]
11

Dzwonczyk, Luke, Carmine-Emanuele Cella, and David Ban. "Generating Music Reactive Videos by Applying Network Bending to Stable Diffusion." Journal of the Audio Engineering Society 73, no. 6 (2025): 388–98. https://doi.org/10.17743/jaes.2022.0210.

Abstract:
This paper presents the first steps toward the creation of a tool which enables artists to create music visualizations using pretrained, generative, machine learning models. First, the authors investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. A number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools, are identified. […]
12

Neto, Wilson A. de Oliveira, Elloá B. Guedes, and Carlos Maurício S. Figueiredo. "Anomaly Detection in Sound Activity with Generative Adversarial Network Models." Journal of Internet Services and Applications 15, no. 1 (2024): 313–24. http://dx.doi.org/10.5753/jisa.2024.3897.

Abstract:
In state-of-the-art anomaly detection research, prevailing methodologies predominantly employ Generative Adversarial Networks and Autoencoders for image-based applications. Despite the efficacy demonstrated in the visual domain, there remains a notable dearth of studies showcasing the application of these architectures in anomaly detection within the sound domain. This paper introduces tailored adaptations of cutting-edge architectures for anomaly detection in audio and conducts a comprehensive comparative analysis to substantiate the viability of this novel approach. The evaluation is performed […]
13

Shen, Qiwei, Junjie Xu, Jiahao Mei, Xingjiao Wu, and Daoguo Dong. "EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance." Applied Sciences 14, no. 8 (2024): 3193. http://dx.doi.org/10.3390/app14083193.

Abstract:
With the flourishing development of generative models, image manipulation is receiving increasing attention. Rather than text modality, several elegant designs have delved into leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditional on semantic alignment, ignoring the vivid affective information depicted in the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework where affective information from audio can be explicitly extracted and further utilized during image manipulation. Specifically, we first leverage […]
14

Gupta, Jyoti, Monica Bhutani, Pramod Kumar, et al. "A comprehensive review of recent advances and future prospects of generative AI." Journal of Information and Optimization Sciences 46, no. 1 (2025): 205–11. https://doi.org/10.47974/jios-1864.

Abstract:
Generative AI has evolved rapidly and demonstrated accuracy in creating content with diverse yet realistic styles. This paper provides a complete overview of the field, starting with its core principles and continuing with some recent results and potential future applications. It also covers requirements for new task-specific and data models, including the main generative model families (GANs, VAEs, and more) across four modalities: image, audio, text, and video. The paper emphasizes that generative AI has the potential to transform industries and lists some of these possible applications. […]
15

Meshram, Sahil. "Genius AI: A Unified Platform for Text, Image, Audio, Video, and Code AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 6 (2025): 825–29. https://doi.org/10.22214/ijraset.2025.71461.

Abstract:
The rapid evolution of artificial intelligence (AI) has led to the development of specialized models across different modalities such as text, image, video, audio, and program code. This paper presents the design and conceptual framework for a multimodal AI platform that harmoniously brings together multiple AI systems into a single, user-friendly platform. The proposed platform leverages state-of-the-art AI models, each tailored for a specific modality: Natural Language Processing (NLP) models for text understanding and generation, Computer Vision models for image analysis and synthesis, […]
16

Assudani, Purshottam J., P. Balakrishnan, A. Anny Leema, and Rajesh K. Nasare. "Generative AI-Powered Framework for Audio Analysis and Conversational Exploration." Metallurgical and Materials Engineering 31, no. 4 (2025): 206–11. https://doi.org/10.63278/1425.

Abstract:
This paper introduces a hybrid deep learning system for complex audio interpretation and communication that combines Convolutional Neural Networks (CNNs) with transformer-based Large Language Models (LLMs) over spectrograms. The system takes raw audio signals as input, maps them into spectrograms, extracts high-level features using CNNs, and fuses these with LLM-produced embeddings to add semantic understanding and contextual discussion. The multimodal attention technique helps bridge the audio-linguistic gap […]
17

S, Manimala. "GenNarrate: AI-Powered Story Synthesis with Visual and Audio Outputs." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 2352–58. https://doi.org/10.22214/ijraset.2025.70567.

Abstract:
The emergence of generative artificial intelligence has redefined the boundaries of digital content creation, particularly in the domain of computational storytelling. This paper presents GenNarrate, a modular, multi-modal generative AI system engineered to synthesize coherent narratives augmented with corresponding visual and auditory elements. The architecture leverages advanced machine learning models, including LLaMA2 for text generation, DALL·E for image synthesis, and a combination of Google Text-to-Speech (GTTS) and AudioLDM for expressive audio narration and sound design. […]
18

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Abstract:
Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. […]
19

Lattner, Stefan, and Javier Nistal. "Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks." Electronics 10, no. 11 (2021): 1349. http://dx.doi.org/10.3390/electronics10111349.

Abstract:
Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test […]
20

Thorat, Madhuri. "From Words to Wonders: AI-Generated Multimedia for Poetry Learning." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 3382–94. https://doi.org/10.22214/ijraset.2025.70946.

Abstract:
The rise of Generative AI has led to the development of various tools that present new opportunities for businesses and professionals engaged in content creation. The education sector is undergoing a significant transformation in the methods of content development and delivery. AI models and tools facilitate the creation of customized learning materials and effective visuals that enhance and simplify the educational experience. The advent of Large Language Models (LLMs) such as GPT and Text-to-Image models like Stable Diffusion, Flux-Schnell has fundamentally changed and expedited the content […]
21

Giudici, Gregorio Andrea, Franco Caspe, Leonardo Gabrielli, Stefano Squartini, and Luca Turchet. "Distilling DDSP: Exploring Real-Time Audio Generation on Embedded Systems." Journal of the Audio Engineering Society 73, no. 6 (2025): 331–45. https://doi.org/10.17743/jaes.2022.0211.

Abstract:
This paper investigates the feasibility of running neural audio generative models on embedded systems, by comparing the performance of various models and evaluating their trade-offs in audio quality, inference speed, and memory usage. This work focuses on differentiable digital signal processing (DDSP) models, due to their hybrid architecture, which combines the efficiency and interoperability of traditional DSP with the flexibility of neural networks. In addition, the application of knowledge distillation (KD) is explored to improve the performance of smaller models. Two types of distillation […]
22

G, Ananya. "RAG based Chatbot using LLMs." International Journal of Scientific Research in Engineering and Management 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35600.

Abstract:
Historically, Artificial Intelligence (AI) was used to understand and recommend information. Now, Generative AI can also help us create new content. Generative AI builds on existing technologies, like Large Language Models (LLMs), which are trained on large amounts of text and learn to predict the next word in a sentence. Generative AI can create not only new text, but also images, videos, or audio. This project focuses on the implementation of a chatbot based on the concepts of Generative AI and Large Language Models which can answer any query regarding the content provided in the PDFs. […]
23

Yang, Junpeng, and Haoran Zhang. "Development And Challenges of Generative Artificial Intelligence in Education and Art." Highlights in Science, Engineering and Technology 85 (March 13, 2024): 1334–47. http://dx.doi.org/10.54097/vaeav407.

Abstract:
Thanks to the rapid development of generative deep learning models, Artificial Intelligence Generated Content (AIGC) has attracted more and more research attention in recent years, which aims to learn models from massive data to generate relevant content based on input conditions. Different from traditional single-modal generation tasks that focus on content generation for a particular modality, such as image generation, text generation, or semantic generation, AIGC trains a single model that can simultaneously understand language, images, videos, audio, and more. AIGC marks the transition […]
24

Choi, Ha-Yeong, Sang-Hoon Lee, and Seong-Whan Lee. "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 17862–70. http://dx.doi.org/10.1609/aaai.v38i16.29740.

Abstract:
Diffusion-based generative models have recently exhibited powerful generative performance. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, we introduce decoupled denoising diffusion models (DDDMs) with disentangled representations, which can enable effective style transfers for each attribute in generative models. In particular, we apply DDDMs for voice conversion (VC) tasks […]
25

Zhou, Zhenghao, Yongjie Liu, and Chen Cao. "Advancing Audio-Based Text Generation with Imbalance Preference Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 26120–28. https://doi.org/10.1609/aaai.v39i24.34808.

Abstract:
Human feedback in generative systems is a highly active frontier of research that aims to improve the quality of generated content and align it with subjective preferences. Existing efforts predominantly focus on text-only large language models (LLMs) or text-based image generation, while cross-modal generation between audio and text remains largely unexplored. Moreover, there is currently no open-source preference dataset to support the deployment of alignment algorithms in this domain. In this work, we take audio speech translation (AST) and audio captioning (AAC) tasks as examples to […]
26

Singh, Viomesh. "VidTextBot using Generative AI." Journal of Information Systems Engineering and Management 10, no. 18s (2025): 128–32. https://doi.org/10.52783/jisem.v10i18s.2894.

Abstract:
Introduction: This research paper presents the design and implementation of VidTextBot, a cutting-edge system that integrates video-to-text conversion using generative AI for analyzing video content. The system allows users to upload a video or a YouTube link, which is processed to extract the audio, transcribe it into text, and extract subtitles if available. These outputs are stored in a database for smooth future reference and efficient data retrieval. By utilizing advanced NLP models like ChatGPT, the chatbot will help the […]
27

Gupta, Jyoti, Monica Bhutani, Mahesh Kumar, Aman Dureja, Shyla Singh, and Mohit Dayal. "State-of-the-art review and critical analysis of emerging trends in generative artificial intelligence." Journal of Information and Optimization Sciences 46, no. 5 (2025): 1691–704. https://doi.org/10.47974/jios-1945.

Abstract:
Generative AI technology now leads the way as a transformative innovation that produces diverse, realistic content across various content modalities. This research extensively reviews generative AI by examining its core mechanisms, architectural improvements, and current breakthroughs in making images, text, audio, and videos. The paper discusses three main generative models, GANs, VAEs and diffusion models, to explain their distinctive advantages and technical constraints. The document demonstrates industrial uses in different sectors yet focuses on essential issues connected to biased data […]
28

Gupta, Chitralekha, Shreyas Sridhar, Denys J. C. Matthies, Christophe Jouffrais, and Suranga Nanayakkara. "SonicVista: Towards Creating Awareness of Distant Scenes through Sonification." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 2 (2024): 1–32. http://dx.doi.org/10.1145/3659609.

Abstract:
Spatial awareness, particularly awareness of distant environmental scenes known as vista-space, is crucial and contributes to the cognitive and aesthetic needs of People with Visual Impairments (PVI). In this work, through a formative study with PVIs, we establish the need for vista-space awareness amongst people with visual impairments, and the possible scenarios where this awareness would be helpful. We investigate the potential of existing sonification techniques as well as AI-based audio generative models to design sounds that can create awareness of vista-space scenes. Our first user […]
29

Lin, Hong, Xuan Liu, Chaomurilige Chaomurilige, et al. "LongMergent: Pioneering audio mixing strategies for exquisite music generation." Computer Software and Media Applications 8, no. 1 (2025): 11516. https://doi.org/10.24294/csma11516.

Abstract:
Artificial intelligence-empowered music processing is a domain that involves the use of artificial intelligence technologies to enhance music analysis, understanding, and generation. This field encompasses a variety of tasks from music generation to music comprehension. In practical applications, the complexity of interwoven tasks, differences in data representation, scattered distribution of tool resources, and the threshold of professional music knowledge often become barriers that hinder developers from smoothly carrying out generative tasks. Therefore, it is essential to establish a system […]
30

Yang, Chenyu, Shuai Wang, Hangting Chen, et al. "SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25597–605. https://doi.org/10.1609/aaai.v39i24.34750.

Abstract:
The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research about partial adjustments or editing of existing songs is still underexplored, which allows for more flexible and effective production. In this paper, we present SongEditor, the first song editing paradigm that introduces editing capabilities into language-modeling song generation approaches […]
31

Adithya, Suresh, A. Faras, Habeeba K. M. Ummu, Eldho Anu, J. George Asha, and Roy Meckamalil Rotney. "Autism Detection Using Self-Stimulatory Behaviors." Advancement in Image Processing and Pattern Recognition 8, no. 3 (2025): 13–24. https://doi.org/10.5281/zenodo.15516090.

Abstract:
This paper introduces a novel video-audio-based model for the early detection of Autism Spectrum Disorder (ASD), focusing on analyzing self-stimulatory behaviors (stimming) such as arm flapping, head banging, and spinning, which are critical diagnostic markers. Traditional diagnostic approaches often depend on subjective clinical observations, leading to inconsistencies, delays, and limited accessibility in diverse settings. The proposed model combines video analysis with audio detection to address these shortcomings, supported by a generative AI-based method to create an audio […]
32

Prudhvi, Y., T. Adinarayana, T. Chandu, S. Musthak, and G. Sireesha. "Vocal Visage: Crafting Lifelike 3D Talking Faces from Static Images and Sound." International Journal of Innovative Research in Computer Science and Technology 11, no. 6 (2023): 13–17. http://dx.doi.org/10.55524/ijircst.2023.11.6.3.

Abstract:
In the field of computer graphics and animation, the challenge of generating lifelike and expressive talking face animations has historically necessitated extensive 3D data and complex facial motion capture systems. However, this project presents an innovative approach to tackle this challenge, with the primary goal of producing realistic 3D motion coefficients for stylized talking face animations driven by a single reference image synchronized with audio input. Leveraging state-of-the-art deep learning techniques, including generative models, image-to-image translation networks, and audio […]
33

A M, Vandana Pranavi, and Nagaraj G. Cholli. "Comprehensive Survey On Generative AI, Plethora Of Applications And Impacts." IOSR Journal of Computer Engineering 26, no. 5 (2024): 06–15. http://dx.doi.org/10.9790/0661-2605020615.

Abstract:
The primary objective for the AI subfield of "generative artificial intelligence" is to develop systems that can produce new, novel, and creative content including text, photos, audio, music, and movies. These models are able to generate fresh content that nearly mimics realistic content created by humans by utilizing deep learning techniques. These models of GenAI have gained significant importance in research and have a plethora of applications in a wide variety of fields. The impact of GenAI is not just on abled but also on disabled communities who are sometimes unnoticed. This survey […]
34

Liang, Kai, and Haijun Zhao. "Application of Generative Adversarial Nets (GANs) in Active Sound Production System of Electric Automobiles." Shock and Vibration 2020 (October 28, 2020): 1–10. http://dx.doi.org/10.1155/2020/8888578.

Abstract:
To improve the diversity and quality of sound mimicry of electric automobile engines, a generative adversarial network (GAN) model was used to construct an active sound production model for electric automobiles. The structure of each layer in the network in this model and the size of its convolution kernel were designed. The gradient descent in network training was optimized using the adaptive moment estimation (Adam) algorithm. To demonstrate the quality difference of the generated samples from different input signals, two GAN models with different inputs were constructed. The experimental […]
35

Li, Lianghao. "Overview of Multimodal Generative Models in Natural Language Processing and Computer Vision." Journal of Computer Technology and Applied Mathematics 1, no. 4 (2024): 69–78. https://doi.org/10.5281/zenodo.13988327.

Full text
Abstract:
Multimodal generative models have become essential in the deep learning renaissance, as they provide unparalleled flexibility across a diverse range of applications within Natural Language Processing (NLP) and Computer Vision (CV). In this paper, we systematically review the basic concepts and technical improvements in multimodal generative models by discussing their applications across different modalities such as text, images, audio, and video. By coalescing data from various modalities, these models augment the ability of AI to comprehend and perform complicated tasks. In this paper,
APA, Harvard, Vancouver, ISO, and other styles
36

Agarwal, Pratham. "MedBot: A GenAI based Chatbot for Healthcare." International Journal of Scientific Research in Engineering and Management 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35757.

Full text
Abstract:
Generative Artificial intelligence (GenAI) is transforming the healthcare industry by providing innovative solutions for patient care and information retrieval. MedBot is an innovative GenAI-driven chatbot designed to improve healthcare services by providing accurate and timely medical information. Utilizing advanced generative AI models, MedBot can respond to text, image, and audio queries, making it a versatile tool for diverse healthcare needs. The chatbot offers functionalities such as document summarization and insight extraction, aiding users in comprehending complex medical data. MedBot
APA, Harvard, Vancouver, ISO, and other styles
37

Li, Jing, Zhengping Li, Ying Li, and Lijun Wang. "P‐2.12: A Comprehensive Study of Content Generation Using Diffusion Model." SID Symposium Digest of Technical Papers 54, S1 (2023): 522–24. http://dx.doi.org/10.1002/sdtp.16346.

Full text
Abstract:
The essence of the Metaverse is the process of integrating a large number of existing technologies to virtualize and digitize the real world. With the development of artificial intelligence technology, a large amount of digital-native content in the Metaverse needs to be created by artificial intelligence. Current artificial intelligence technology allows computers to automatically and efficiently generate text, pictures, audio, video, and even 3D models. With the further development of natural language processing technology and generative network models, future artificial intelligence g
APA, Harvard, Vancouver, ISO, and other styles
38

Cheng, Liehai, Zhenli Zhang, Giuseppe Lacidogna, Xiao Wang, Mutian Jia, and Zhitao Liu. "Sound Sensing: Generative and Discriminant Model-Based Approaches to Bolt Loosening Detection." Sensors 24, no. 19 (2024): 6447. http://dx.doi.org/10.3390/s24196447.

Full text
Abstract:
The detection of bolt looseness is crucial to ensure the integrity and safety of bolted connection structures. Percussion-based bolt looseness detection provides a simple and cost-effective approach. However, this method has some inherent shortcomings that limit its application. For example, it highly depends on the inspector’s hearing and experience and is more easily affected by ambient noise. In this article, a whole set of signal processing procedures are proposed and a new kind of damage index vector is constructed to strengthen the reliability and robustness of this method. Firstly, a se
APA, Harvard, Vancouver, ISO, and other styles
39

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (2023): A99. http://dx.doi.org/10.1121/10.0022922.

Full text
Abstract:
In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction
APA, Harvard, Vancouver, ISO, and other styles
40

Cheng, Hsu-Yung, Chia-Cheng Su, Chi-Lun Jiang, and Chih-Chang Yu. "Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet." Electronics 14, no. 6 (2025): 1179. https://doi.org/10.3390/electronics14061179.

Full text
Abstract:
In recent years, generative AI has become popular in areas like natural language processing, as well as image and audio processing, significantly expanding AI’s creative capabilities. Particularly in the realm of image generation, diffusion models have achieved remarkable success across various applications, such as image synthesis and transformation. However, traditional diffusion models operate at the pixel level when learning image features, which inevitably demands significant computational resources. To address this issue, this paper proposes a pose transfer model that integrates the late
APA, Harvard, Vancouver, ISO, and other styles
41

Sheikh, Dr Shagufta Mohammad Sayeed. "Empowering Learning: Crafting Educational Podcasts with GEN AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 4 (2025): 4517–28. https://doi.org/10.22214/ijraset.2025.69144.

Full text
Abstract:
The integration of Generative AI has facilitated innovative tools that are transforming content creation in the educational landscape, ushering in a shift towards more efficient and accessible learning paradigms. In this project, AI is leveraged to automate podcast production, transforming text-based educational content into high-fidelity audio content that caters to varied learning needs. Leveraging advanced frameworks like Large Language Models (LLMs) and Text-to-Speech (TTS) technologies, the system streamlines otherwise time-consuming processes like scripting, recording, and editing, thus s
APA, Harvard, Vancouver, ISO, and other styles
42

B, Yeshitha, Vinitha V, Anubha Mittal, Harshitha Reddy P., and Katiyar Rajani. "Emotion Detection and Voice-Emotion Conversions using Deep Learning." International Journal of Microsystems and IoT 2, no. 3 (2024): 685–91. https://doi.org/10.5281/zenodo.11159090.

Full text
Abstract:
Emotion, especially through speech, is a powerful tool humans possess that conveys much more information than any text can describe. Using artificial intelligence to tap into this can have a big positive impact on a variety of industries, including audio mining, customer service applications, security and forensics, and more. A growing field of research, spoken emotion recognition has relied heavily on models that employ audio data to create effective classifiers. This paper presents a convolutional neural network as a deep learning classification algorithm t
APA, Harvard, Vancouver, ISO, and other styles
43

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (2023): 1834. http://dx.doi.org/10.3390/s23041834.

Full text
Abstract:
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating dat
APA, Harvard, Vancouver, ISO, and other styles
44

Xi, Wang, Guillaume Devineau, Fabien Moutarde, and Jie Yang. "Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images." Algorithms 13, no. 12 (2020): 319. http://dx.doi.org/10.3390/a13120319.

Full text
Abstract:
Generative models for images, audio, text, and other low-dimension data have achieved great success in recent years. Generating artificial human movements can also be useful for many applications, including improvement of data augmentation methods for human gesture recognition. The objective of this research is to develop a generative model for skeletal human movement, allowing to control the action type of generated motion while keeping the authenticity of the result and the natural style variability of gesture execution. We propose to use a conditional Deep Convolutional Generative Adversari
APA, Harvard, Vancouver, ISO, and other styles
45

R, Arun Kumar, Lisa C, Rashmi V R, and Sandhya K. "GENERATIVE ADVERSARIAL NETWORKS (GANs) IN MULTIMODAL AI USING BRIDGING TEXT, IMAGE, AND AUDIO DATA FOR ENHANCED MODEL PERFORMANCE." ICTACT Journal on Soft Computing 15, no. 3 (2025): 3567–77. https://doi.org/10.21917/ijsc.2025.0497.

Full text
Abstract:
The integration of multimodal data is critical in advancing artificial intelligence models capable of interpreting diverse and complex inputs. While standalone models excel in processing individual data types like text, image, or audio, they often fail to achieve comparable performance when these modalities are combined. Generative Adversarial Networks (GANs) have emerged as a transformative approach in this domain due to their ability to synthesize and learn across disparate data types effectively. This study addresses the challenge of bridging multimodal datasets to improve the generalizatio
APA, Harvard, Vancouver, ISO, and other styles
46

Gong, Yuan, Cheng-I. Lai, Yu-An Chung, and James Glass. "SSAST: Self-Supervised Audio Spectrogram Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10699–709. http://dx.doi.org/10.1609/aaai.v36i10.21315.

Full text
Abstract:
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer mod
APA, Harvard, Vancouver, ISO, and other styles
47

Appiani, Andrea, and Cigdem Beyan. "VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection." Information 16, no. 3 (2025): 233. https://doi.org/10.3390/info16030233.

Full text
Abstract:
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individu
APA, Harvard, Vancouver, ISO, and other styles
48

Juby Nedumthakidiyil Zacharias. "Generative product content using vision-language models: Transforming e-commerce experiences." World Journal of Advanced Engineering Technology and Sciences 15, no. 3 (2025): 1130–37. https://doi.org/10.30574/wjaets.2025.15.3.1046.

Full text
Abstract:
Vision-language models (VLMs) are fundamentally transforming product content creation in e-commerce, representing a paradigm shift in how digital retail platforms manage product information. These sophisticated systems, which leverage dual-encoder architectures and contrastive learning methods, establish meaningful connections between visual attributes and textual descriptions to generate comprehensive product content directly from images. By analyzing product photographs, these models automatically create detailed descriptions, ingredient lists, and usage recommendations with remarkable accur
APA, Harvard, Vancouver, ISO, and other styles
49

Davis, Jason. "In a Digital World With Generative AI Detection Will Not be Enough." Newhouse Impact Journal 1, no. 1 (2024): 9–12. http://dx.doi.org/10.14305/jn.29960819.2024.1.1.01.

Full text
Abstract:
Recent and dramatic improvements in AI driven large language models (LLMs), image generators, audio and video have fed an exponential growth in Generative AI applications and accessibility. The disruptive ripples of this rapid evolution have already begun to fundamentally impact how we create and consume content on a global scale. While the use of Generative AI has and will continue to enable massive increases in the speed and efficiency of content creation, it has come at the cost of uncomfortable conversations about transparency and the erosion of digital trust. To have any chance at actuall
APA, Harvard, Vancouver, ISO, and other styles
50

Armstrong Joseph J and Senthil S. "The Dark Side of Generative AI: Ethical, Security, and Social Concerns." International Research Journal on Advanced Engineering Hub (IRJAEH) 3, no. 04 (2025): 1720–23. https://doi.org/10.47392/irjaeh.2025.0247.

Full text
Abstract:
Generative Artificial Intelligence (AI) represents a significant leap in technology, enabling the creation of novel content from text, images, videos, and audio. While its potential to drive innovation and improve productivity is immense, the risks associated with its application are equally formidable. This paper explores the darker aspects of generative AI, focusing on ethical dilemmas, social implications, security threats, and the potential for misuse. We examine issues such as misinformation, biases in AI models, job displacement, and the dangers posed by AI-driven automation. Finally, we
APA, Harvard, Vancouver, ISO, and other styles