Journal articles on the topic 'Generative audio models'

Consult the top 50 journal articles for your research on the topic 'Generative audio models.'

1

Evans, Zach, Scott H. Hawley, and Katherine Crowson. "Musical audio samples generated from joint text embeddings." Journal of the Acoustical Society of America 152, no. 4 (2022): A178. http://dx.doi.org/10.1121/10.0015956.

Abstract:
The field of machine learning has benefited from the appearance of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models are currently an active area of research. We present work on generating short samples of musical instrument sounds with a model conditioned on text descriptions and the file-structure labels of large sample libraries. Preliminary findings indicate that the generation of wide-spectrum sounds such as percussion is not difficult, while the generation of harmonic musical sounds presents challenges for audio diffusion models.
2

Kang, Hyunju, Geonhee Han, Yoonjae Jeong, and Hogun Park. "AudioGenX: Explainability on Text-to-Audio Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17733–41. https://doi.org/10.1609/aaai.v39i17.33950.

Abstract:
Text-to-audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.
3

Samson, Grzegorz. "Perspectives on Generative Sound Design: A Generative Soundscapes Showcase." Arts 14, no. 3 (2025): 67. https://doi.org/10.3390/arts14030067.

Abstract:
Recent advancements in generative neural networks, particularly transformer-based models, have introduced novel possibilities for sound design. This study explores the use of generative pre-trained transformers (GPT) to create complex, multilayered soundscapes from textual and visual prompts. A custom pipeline is proposed, featuring modules for converting the source input into structured sound descriptions and subsequently generating cohesive auditory outputs. As a complementary solution, a granular synthesizer prototype was developed to enhance the usability of generative audio samples by enabling their recombination into seamless and non-repetitive soundscapes. The integration of GPT models with granular synthesis demonstrates significant potential for innovative audio production, paving the way for advancements in professional sound-design workflows and immersive audio applications.
4

Jeong, Yujin, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. "Read, Watch and Scream! Sound Generation from Text and Video." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17590–98. https://doi.org/10.1609/aaai.v39i17.33934.

Abstract:
Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. In particular, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our method becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency.
5

Wang, Heng, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15492–501. http://dx.doi.org/10.1609/aaai.v38i14.29475.

Abstract:
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/.
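As an illustration of the mapping idea described in this abstract, the sketch below shows a small regression-style mapper that translates a CLIP-like image embedding into a CLAP-like embedding, which would then condition a pretrained audio generator such as AudioLDM. This is not the authors' implementation; the embedding sizes, layer widths, and the random stand-in embedding are assumptions.

```python
# Hypothetical sketch of a V2A-style mapper (not the paper's code).
import torch
import torch.nn as nn

class V2AMapper(nn.Module):
    def __init__(self, clip_dim=512, clap_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, clap_dim),
        )

    def forward(self, clip_embedding):
        # Translate visual (CLIP-space) features into the auditory (CLAP) space;
        # the result would serve as the conditioning input of an audio foundation model.
        return self.net(clip_embedding)

mapper = V2AMapper()
clip_embedding = torch.randn(1, 512)   # stand-in for a CLIP image embedding
clap_like = mapper(clip_embedding)     # embedding that would condition AudioLDM
```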
6

Ji, Wenliang, Ming Jin, and Yixin Chen. "Optimization of Digital Media Content Generation and Communication Effect Combined with Deep Learning Technology." Journal of Combinatorial Mathematics and Combinatorial Computing 127a (April 15, 2025): 1449–66. https://doi.org/10.61091/jcmcc127a-084.

Abstract:
The combination of deep learning and digital media technology provides great scope for content creation. The article uses Generative Adversarial Networks (GANs) for content generation. Covering the three major forms of digital media content, images, audio, and video are generated by the U-Net_GAN, MAS-GAN, and SSFLVGAN models, respectively, to construct a digital media content generation model based on generative adversarial networks. The model is then validated for performance, and the generated images, audio, and video are evaluated for effectiveness. By studying the shortcomings of digital media content generation, we propose suggestions to improve its dissemination effect. The U-Net_GAN model outperforms other image generation models on all image-generation indexes. The speech generation and enhancement performance of MAS-GAN is much better than that of other audio generation and enhancement models. The HDR video generated by SSFLVGAN has an average score of 4.20 and an average DMOS score of 5.97, both 0.16 points higher than the traditional scheme. SSFLVGAN and the traditional scheme are comparable in terms of the overall picture impact of the generated video, while the picture detail of the SSFLVGAN-generated video is much better than the traditional scheme.
7

Sakirin, Tam, and Siddartha Kusuma. "A Survey of Generative Artificial Intelligence Techniques." Babylonian Journal of Artificial Intelligence 2023 (March 10, 2023): 10–14. http://dx.doi.org/10.58496/bjai/2023/003.

Abstract:
Generative artificial intelligence (AI) refers to algorithms capable of creating novel, realistic digital content autonomously. Recently, generative models have attained groundbreaking results in domains like image and audio synthesis, spurring vast interest in the field. This paper surveys the landscape of modern techniques powering the rise of creative AI systems. We structurally examine predominant algorithmic approaches including generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models. Architectural innovations and illustrations of generated outputs are highlighted for major models under each category. We give special attention to generative techniques for constructing realistic images, tracing rapid progress from early GAN samples to modern diffusion models like Stable Diffusion. The paper further reviews generative modeling to create convincing audio, video, and 3D renderings, which introduce critical challenges around fake media detection and data bias. Additionally, we discuss common datasets that have enabled advances in generative modeling. Finally, open questions around evaluation, technique blending, controlling model behaviors, commercial deployment, and ethical considerations are outlined as active areas for future work. This survey presents both long-standing and emerging techniques molding the state and trajectory of generative AI. The key goals are to overview major algorithm families, highlight innovations through example models, synthesize capabilities for multimedia generation, and discuss open problems around data, evaluation, control, and ethics.
8

Broad, Terence, Frederic Fol Leymarie, and Mick Grierson. "Network Bending: Expressive Manipulation of Generative Models in Multiple Domains." Entropy 24, no. 1 (2021): 28. http://dx.doi.org/10.3390/e24010028.

Abstract:
This paper presents the network bending framework, a new approach for manipulating and interacting with deep generative models. We present a comprehensive set of deterministic transformations that can be inserted as distinct layers into the computational graph of a trained generative neural network and applied during inference. In addition, we present a novel algorithm for analysing the deep generative model and clustering features based on their spatial activation maps. This allows features to be grouped together based on spatial similarity in an unsupervised fashion. This results in the meaningful manipulation of sets of features that correspond to the generation of a broad array of semantically significant features of the generated results. We outline this framework, demonstrating our results on deep generative models for both image and audio domains. We show how it allows for the direct manipulation of semantically meaningful aspects of the generative process as well as allowing for a broad range of expressive outcomes.
9

Cao, Yongnian, Xuechun Yang, and Rui Sun. "Generative AI Models: Theoretical Foundations and Algorithmic Practices." Journal of Industrial Engineering and Applied Science 3, no. 1 (2025): 1–9. https://doi.org/10.70393/6a69656173.323633.

Abstract:
Generative models in AI are an entirely new paradigm for machine learning, allowing computers to create realistic data in all kinds of categories, like text (NLP), images, and even physics simulations. In this paper, this formalism is used to guide the theory, algorithms, and applications of generative models, with particular focus on a few well-established techniques like VAEs, GANs, and diffusion models. It stresses the importance of probabilistic generative modelling and information theory (i.e., KL divergence, ELBO, adversarial optimization, etc.). We cover algorithmic practices such as optimization techniques, multimodal and conditional generation, and efficient data-driven strategies, demonstrating the impact of these methods in various real-world applications including text, image, and audio generation, industrial design, and scientific discovery. However, the field is still grappling with significant challenges: training instability, the need for huge computational resources, and a lack of consistent, unified treatment across applications. The paper finishes with an optimistic vision of what the future holds, such as finding more sample-efficient ways to learn, architectures that facilitate scalability on a global scale, and cohesive theoretical frameworks to bring out the very best in generative AI. By combining this theoretical understanding with practical implications, this paper explores generative AI technologies and their potential to transform whole industries and scientific disciplines.
10

Aldausari, Nuha, Arcot Sowmya, Nadine Marcus, and Gelareh Mohammadi. "Video Generative Adversarial Networks: A Review." ACM Computing Surveys 55, no. 2 (2023): 1–25. http://dx.doi.org/10.1145/3487891.

Abstract:
With the increasing interest in the content creation field in multiple sectors such as media, education, and entertainment, there is an increased trend in papers that use AI algorithms to generate content such as images, videos, audio, and text. Generative Adversarial Networks (GANs) are among the promising models that synthesize data samples similar to real data samples. While variations of GAN models in general have been covered to some extent in several survey papers, to the best of our knowledge, this is the first paper that reviews the state-of-the-art video GAN models. This paper first categorizes GAN review papers into general GAN review papers, image GAN review papers, and special-field GAN review papers such as anomaly detection, medical imaging, or cybersecurity. The paper then summarizes the main improvements in GANs that were not necessarily applied in the video domain in the first run but have been adopted in multiple video GAN variations. Then, a comprehensive review of video GAN models is provided under two main divisions based on the existence of a condition. The conditional models are then further classified according to the provided condition into audio, text, video, and image. The paper concludes with the main challenges and limitations of the current video GAN models.
11

Dzwonczyk, Luke, Carmine-Emanuele Cella, and David Ban. "Generating Music Reactive Videos by Applying Network Bending to Stable Diffusion." Journal of the Audio Engineering Society 73, no. 6 (2025): 388–98. https://doi.org/10.17743/jaes.2022.0210.

Abstract:
This paper presents the first steps toward the creation of a tool which enables artists to create music visualizations using pretrained, generative, machine learning models. First, the authors investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. A number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools, are identified. The authors find that this process allows for continuous, fine-grain control of image generation, which can be helpful for creative applications. Next, music-reactive videos are generated using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, the authors comment on certain transforms that radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms. This paper is an extended version of the paper “Network Bending of Diffusion Models,” which appeared in the 27th International Conference on Digital Audio Effects.
12

Neto, Wilson A. de Oliveira, Elloá B. Guedes, and Carlos Maurício S. Figueiredo. "Anomaly Detection in Sound Activity with Generative Adversarial Network Models." Journal of Internet Services and Applications 15, no. 1 (2024): 313–24. http://dx.doi.org/10.5753/jisa.2024.3897.

Abstract:
In state-of-art anomaly detection research, prevailing methodologies predominantly employ Generative Adversarial Networks and Autoencoders for image-based applications. Despite the efficacy demonstrated in the visual domain, there remains a notable dearth of studies showcasing the application of these architectures in anomaly detection within the sound domain. This paper introduces tailored adaptations of cutting-edge architectures for anomaly detection in audio and conducts a comprehensive comparative analysis to substantiate the viability of this novel approach. The evaluation is performed on the DCASE 2020 dataset, encompassing over 180 hours of industrial machinery sound recordings. Our results indicate superior anomaly classification, with an average Area Under the Curve (AUC) of 88.16% and partial AUC of 78.05%, surpassing the performance of established baselines. This study not only extends the applicability of advanced architectures to the audio domain but also establishes their effectiveness in the challenging context of industrial sound anomaly detection.
13

Shen, Qiwei, Junjie Xu, Jiahao Mei, Xingjiao Wu, and Daoguo Dong. "EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance." Applied Sciences 14, no. 8 (2024): 3193. http://dx.doi.org/10.3390/app14083193.

Abstract:
With the flourishing development of generative models, image manipulation is receiving increasing attention. Rather than text modality, several elegant designs have delved into leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditional on semantic alignment, ignoring the vivid affective information depicted in the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework where affective information from audio can be explicitly extracted and further utilized during image manipulation. Specifically, we first leverage the multi-modality model ImageBind for initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. Simultaneously, by extracting sentiment polarity from the lyrics of the audio, we generate an emotionally rich auxiliary music branch to accentuate the affective information. We then leverage pre-trained encoders to encode audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via a direct latent optimization method. We conduct objective and subjective evaluations on the generated images, and our results show that our framework is capable of generating images with specified human emotions conveyed in the audio.
14

Gupta, Jyoti, Monica Bhutani, Pramod Kumar, et al. "A comprehensive review of recent advances and future prospects of generative AI." Journal of Information and Optimization Sciences 46, no. 1 (2025): 205–11. https://doi.org/10.47974/jios-1864.

Abstract:
Generative AI has evolved rapidly and demonstrated the ability to create content in diverse yet strikingly realistic styles. This paper provides a complete overview of the field, starting with its core principles and continuing with recent results and potential future applications. It also covers requirements for new task-specific and data models, including the key generative model families (GANs, VAEs, and more) across the four modalities of image, audio, text, and video. The paper emphasizes that generative AI has the potential to transform industries and lists some of these possible applications. It also reviews the limitations of GAN technology, including data bias, ethical questions, and model interpretability. To maximize the potential of this technology, we stress that it is crucial to construct reliable, interpretable, and self-sustainable generative models. The paper ends with a discussion of future directions, new trends in multimodal models, and the scope for energy-efficient approaches. Knowing the risks and potential ahead enables researchers and practitioners to make informed choices about using generative AI for better or worse.
15

Meshram, Sahil. "Genius AI A Unified Platform for Text, Image, Audio, Video, and Code AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 6 (2025): 825–29. https://doi.org/10.22214/ijraset.2025.71461.

Abstract:
The rapid evolution of artificial intelligence (AI) has led to the development of specialized models across different modalities such as text, image, video, audio, and program code. This paper presents the design and conceptual framework for a multimodal AI platform that harmoniously brings together multiple AI systems into a single, user-friendly platform. The proposed platform leverages state-of-the-art AI models, each tailored for a specific modality: Natural Language Processing (NLP) models for text understanding and generation, Computer Vision models for image analysis and synthesis, Generative Video AI for dynamic scene creation, Audio AI for speech recognition and generation, and Code AI for intelligent code completion, debugging, and generation. This paper outlines the core design principles, technical challenges, system integration methods, and practical use cases, including educational tools and content creation. Our approach marks a significant step toward the realization of truly general-purpose AI platforms.
16

Assudani, Purshottam J., Balakrishnan P, A. Anny Leema, and Rajesh K. Nasare. "Generative AI-Powered Framework for Audio Analysis and Conversational Exploration." Metallurgical and Materials Engineering 31, no. 4 (2025): 206–11. https://doi.org/10.63278/1425.

Abstract:
This paper introduces a hybrid deep learning system for complex audio interpretation and conversational exploration that couples Convolutional Neural Networks (CNNs) with transformer-based Large Language Models (LLMs) operating over spectrograms. The system takes raw audio signals, maps them into spectrograms, extracts high-level features with CNNs, and fuses these features with LLM-produced embeddings to add semantic understanding and contextual discussion. A multimodal attention technique helps bridge the audio-linguistic gap, making meaningful and context-aware responses possible. The proposed system supports applications such as intelligent assistants, education, and intelligent monitoring. Experimental evaluation shows gains over the state of the art in both experiments, with accuracy of 93.8%, latency of 420 ms, and high semantic coherence (a BLEU score of 0.74). These results indicate that the proposed system can offer user-friendly and intelligent audio exploration.
17

S, Manimala. "GenNarrate: AI-Powered Story Synthesis with Visual and Audio Outputs." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 2352–58. https://doi.org/10.22214/ijraset.2025.70567.

Abstract:
The emergence of generative artificial intelligence has redefined the boundaries of digital content creation, particularly in the domain of computational storytelling. This paper presents GenNarrate, a modular, multi-modal generative AI system engineered to synthesize coherent narratives augmented with corresponding visual and auditory elements. The architecture leverages advanced machine learning models, including LLaMA2 for text generation, DALL·E for image synthesis, and a combination of Google Text-to-Speech (GTTS) and AudioLDM for expressive audio narration and sound design. GenNarrate facilitates user-driven content generation by accepting configurable parameters (such as genre, tone, character elements, and desired multimedia outputs) through an interactive front-end interface. These inputs are orchestrated through a Flask-based backend pipeline, which integrates the constituent modules and produces downloadable outputs comprising narrated stories, image-enhanced documents, and synchronized audio tracks. The proposed system demonstrates a novel approach to narrative automation, emphasizing cross-modal coherence, scalability, and personalization. This study further situates GenNarrate within the broader context of AI-enhanced storytelling technologies, offering comparative insights with existing open-source models such as GPT-3 and Stable Diffusion. Potential applications are explored across educational content delivery, therapeutic interventions, creative industries, and interactive media. The findings underscore the transformative potential of multi-modal AI systems in facilitating immersive, user-centric storytelling experiences, while also identifying avenues for future development in real-time interaction, fine-grained customization, and adaptive content generation.
18

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Abstract:
Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. Using a lower-dimensional mel spectrogram as the conditioner allows both user controllability and a way for the network to generate more diversity. Additionally, it gives the model style transfer capabilities. We evaluate several models in terms of the quality and variability of the generated sounds using both quantitative and subjective evaluations. The results suggest that there is a trade-off between quality and diversity. Nevertheless, our method achieves a quality level similar to that of the training set while generating perceivable variations according to a perceptual study that includes game audio experts.
19

Lattner, Stefan, and Javier Nistal. "Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks." Electronics 10, no. 11 (2021): 1349. http://dx.doi.org/10.3390/electronics10111349.

Abstract:
Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals with 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.
20

Thorat, Madhuri. "From Words to Wonders: AI-Generated Multimedia for Poetry Learning." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 3382–94. https://doi.org/10.22214/ijraset.2025.70946.

Abstract:
The rise of Generative AI has led to the development of various tools that present new opportunities for businesses and professionals engaged in content creation. The education sector is undergoing a significant transformation in the methods of content development and delivery. AI models and tools facilitate the creation of customized learning materials and effective visuals that enhance and simplify the educational experience. The advent of Large Language Models (LLMs) such as GPT and Text-to-Image models like Stable Diffusion, Flux-Schnell has fundamentally changed and expedited the content generation process. The capability to generate high-quality visuals from textual descriptions has exceeded expectations from just a few years ago. Nevertheless, current research predominantly concentrates on text generation from text, with a notable lack of studies exploring the use of multimodal generation capabilities to tackle critical challenges in instruction supported by multimodal data. In this paper, we propose a framework for generating situational video content based on English poetry, which is executed through several phases: context analysis, prompt generation, image generation, and video synthesis. This comprehensive process necessitates various types of AI models, including text-to-text, text-to-video, text-to-audio, and image-to-image. This project illustrates the potential of combining multiple generative AI models to produce rich multimedia experiences derived from textual content.
21

Giudici, Gregorio Andrea, Franco Caspe, Leonardo Gabrielli, Stefano Squartini, and Luca Turchet. "Distilling DDSP: Exploring Real-Time Audio Generation on Embedded Systems." Journal of the Audio Engineering Society 73, no. 6 (2025): 331–45. https://doi.org/10.17743/jaes.2022.0211.

Abstract:
This paper investigates the feasibility of running neural audio generative models on embedded systems, by comparing the performance of various models and evaluating their trade-offs in audio quality, inference speed, and memory usage. This work focuses on differentiable digital signal processing (DDSP) models, due to their hybrid architecture, which combines the efficiency and interoperability of traditional DSP with the flexibility of neural networks. In addition, the application of knowledge distillation (KD) is explored to improve the performance of smaller models. Two types of distillation strategies were implemented and evaluated: audio distillation and control distillation. These methods were applied to three foundation DDSP generative models that integrate Harmonic-Plus-Noise, FM, and Wavetable synthesis. The results demonstrate the overall effectiveness of KD: the authors were able to train student models that are up to 100× smaller than their teacher counterparts while maintaining comparable performance and significantly improving inference speed and memory efficiency. However, cases where KD failed to improve or even degrade student performance have also been observed. The authors provide a critical reflection on the advantages and limitations of KD, exploring its application in diverse use cases and emphasizing the need for carefully tailored strategies to maximize its potential.
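To make the audio-distillation idea above concrete, here is a minimal, hypothetical sketch in which a small student network is trained to match the output frames of a larger frozen teacher. The toy architectures, control features, and loss choice are assumptions standing in for the DDSP synthesizers evaluated in the paper.

```python
# Toy knowledge-distillation step (audio distillation): student mimics the teacher's output.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(16, 512), nn.ReLU(), nn.Linear(512, 2048)).eval()
student = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2048))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

controls = torch.randn(4, 16)                 # stand-in pitch/loudness control features
with torch.no_grad():
    target_frames = teacher(controls)         # frozen teacher's rendered audio frames

loss = nn.functional.l1_loss(student(controls), target_frames)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```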
22

G, Ananya. "RAG based Chatbot using LLMs." International Journal of Scientific Research in Engineering and Management 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35600.

Abstract:
Historically, Artificial Intelligence (AI) was used to understand and recommend information. Now, Generative AI can also help us create new content. Generative AI builds on existing technologies, like Large Language Models (LLMs), which are trained on large amounts of text and learn to predict the next word in a sentence. Generative AI can not only create new text, but also images, videos, or audio. This project focuses on the implementation of a chatbot based on the concepts of Generative AI and Large Language Models which can answer any query regarding the content provided in the PDFs. The primary technologies utilized include Python libraries like LangChain, PyTorch for model training, and Hugging Face's Transformers library for accessing pre-trained models like Llama2 and GPT-3.5 (Generative Pre-trained Transformer) architectures. The responses are generated using the Retrieval Augmented Generation (RAG) approach. The project aims to develop a chatbot which can generate sensible responses from data in the form of PDF files. The project demonstrates the capabilities and applications of advanced Natural Language Processing (NLP) techniques in creating conversational agents that can be deployed across various platforms in the corporation, to enhance user interaction and support automated tasks. Index Terms: Generative AI, Artificial Intelligence, Natural Language Processing, Large Language Model, Llama2, Transformers, Document Loaders, Retrieval Augmented Generation, Vector Database, Langchain, Chainlit
23

Yang, Junpeng, and Haoran Zhang. "Development And Challenges of Generative Artificial Intelligence in Education and Art." Highlights in Science, Engineering and Technology 85 (March 13, 2024): 1334–47. http://dx.doi.org/10.54097/vaeav407.

Abstract:
Thanks to the rapid development of generative deep learning models, Artificial Intelligence Generated Content (AIGC) has attracted more and more research attention in recent years, which aims to learn models from massive data to generate relevant content based on input conditions. Different from traditional single-modal generation tasks that focus on content generation for a particular modality, such as image generation, text generation, or semantic generation, AIGC trains a single model that can simultaneously understand language, images, videos, audio, and more. AIGC marks the transition from traditional decision-based artificial intelligence to generative artificial intelligence, which has been widely applied in various fields. Focusing on the key technologies and representative applications of AIGC, this paper identifies several key technical challenges and controversies in the field. These include defects in cross-modal and multimodal generation, issues related to model stability and data consistency, privacy concerns, and questions about whether advanced generative models like ChatGPT can be considered general artificial intelligence (AGI). While this dissertation provides valuable insights into the revolution and challenge of generative AI in art and education, it acknowledges the sensitivity of generated content and the ethical dilemmas it may pose, and ownership rights for AI-generated works and the need for new intellectual property norms are subjects of ongoing discussion. To address the current technical bottlenecks in cross-modal and multimodal generation, future research aims to quantitatively analyze and compare existing models, proposing practical optimization strategies. With the rapid advancement of generative AI, we anticipate a transition from user-generated content (UGC) to artificial intelligence-generated content (AIGC) and, ultimately, a new era of human-computer co-creation with strong interactive potential in the near future.
24

Choi, Ha-Yeong, Sang-Hoon Lee, and Seong-Whan Lee. "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 17862–70. http://dx.doi.org/10.1609/aaai.v38i16.29740.

Abstract:
Diffusion-based generative models have recently exhibited powerful generative performance. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, we introduce decoupled denoising diffusion models (DDDMs) with disentangled representations, which can enable effective style transfers for each attribute in generative models. In particular, we apply DDDMs to voice conversion (VC) tasks, tackling the intricate challenge of disentangling and individually transferring each speech attribute, such as linguistic information, intonation, and timbre. First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for style transfer with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance even when using a smaller model size. Audio samples are available at https://hayeong0.github.io/DDDM-VC-demo/.
25

Zhou, Zhenghao, Yongjie Liu, and Chen Cao. "Advancing Audio-Based Text Generation with Imbalance Preference Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 26120–28. https://doi.org/10.1609/aaai.v39i24.34808.

Abstract:
Human feedback in generative systems is a highly active frontier of research that aims to improve the quality of generated content and align it with subjective preferences. Existing efforts predominantly focus on text-only large language models (LLMs) or text-based image generation, while cross-modal generation between audio and text remains largely unexplored. Moreover, there is currently no open-source preference dataset to support the deployment of alignment algorithms in this domain. In this work, we take audio speech translation (AST) and audio captioning (AAC) tasks as examples to explore how to enhance the performance of mainstream audio-based text generation models with limited human annotation. Specifically, we propose a novel framework named IPO that includes a model adversarial sampling concept: human annotators act as referees to determine model outcomes, using these results as pseudo-labels for the corresponding beam search hypotheses. Given these imbalanced win-loss results, IPO effectively enables the two models to update interactively to win the next round of adversarial sampling. We conduct both subjective and objective evaluations to demonstrate the alignment benefits of IPO and its enhancement of model perception and generation capacities. On both AAC and AST, a few hundred annotations significantly enhance the weak model, and the strong model can also be encouraged to achieve new state-of-the-art results in terms of different objective metrics. Additionally, we show the extensibility of IPO by applying it to the reverse task of text-to-speech generation, improving the robustness of the system on unseen reference speakers.
26

Singh, Viomesh. "VidTextBot using Generative AI." Journal of Information Systems Engineering and Management 10, no. 18s (2025): 128–32. https://doi.org/10.52783/jisem.v10i18s.2894.

Abstract:
Introduction: This research paper presents the design and implementation of VidTextBot, a cutting-edge system that integrates video-to-text conversion with generative AI for analyzing video content. The system allows users to upload a video or provide a YouTube link. The link or video is processed to extract the audio, transcribe it into text, and extract subtitles if available. These outputs are stored in a database for smooth future reference and efficient data retrieval. By utilizing advanced NLP models like ChatGPT, the chatbot helps the user interact with the video content and answers queries in real time. The system's architecture ensures seamless integration of transcription, subtitle extraction, and AI interaction, making it a user-friendly platform. Objectives: VidTextBot provides a unique solution compared to ordinary transcription tools, with a focus on real-time capabilities and scalability. Moreover, the paper explores potential system enhancements, such as multi-language transcription support, personalized user experiences through authentication, and optimization for mobile platforms. Future advancements can involve integrating sentiment analysis and predictive models for deeper insights into video content. VidTextBot displays the potential of video processing and Generative AI, offering an efficient way to analyze and interpret video data. It addresses the growing demand for tools capable of making video data more accessible, insightful, and actionable. Methods: The VidTextBot system allows users to upload a video or provide a YouTube link for processing. The system then extracts the audio and transcribes it into text. It can also extract subtitles if the YouTube video provides them. This information is stored in a database for efficient retrieval and future reference. The system then provides an AI-driven chatbot so that users can interact with the video content and get real-time answers to their queries. Results: VidTextBot is an innovative product changing the face of interaction with video content. Combining video/audio transcription, subtitle extraction, and AI-driven chatbot capabilities, the system makes video content accessible and more user-friendly. This project is motivated by real-world challenges, such as the time-consuming process of analyzing videos manually and the difficulty of deriving valuable insights from video content. The system lets users upload any video or provide a link from YouTube, allowing its audio to be converted into text that can be queried in real time. Integration of advanced AI ensures users get correct and context-related responses to their questions, making the system both practical and efficient. Conclusions: The project illustrates a significant leap in how people consume and interact with video content. It combines speech recognition and generative AI to create an efficient, interactive, and user-centric solution, a major step forward for smarter video content analysis, making it accessible and leading the way for further advancements in the field.
27

Gupta, Chitralekha, Shreyas Sridhar, Denys J. C. Matthies, Christophe Jouffrais, and Suranga Nanayakkara. "SonicVista: Towards Creating Awareness of Distant Scenes through Sonification." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 2 (2024): 1–32. http://dx.doi.org/10.1145/3659609.

Abstract:
Spatial awareness, particularly awareness of distant environmental scenes known as vista-space, is crucial and contributes to the cognitive and aesthetic needs of People with Visual Impairments (PVI). In this work, through a formative study with PVIs, we establish the need for vista-space awareness amongst people with visual impairments, and the possible scenarios where this awareness would be helpful. We investigate the potential of existing sonification techniques as well as AI-based audio generative models to design sounds that can create awareness of vista-space scenes. Our first user study, consisting of a listening test with sighted participants as well as PVIs, suggests that current AI generative models for audio have the potential to produce sounds that are comparable to existing sonification techniques in communicating sonic objects and scenes in terms of their intuitiveness, and learnability. Furthermore, through a wizard-of-oz study with PVIs, we demonstrate the utility of AI-generated sounds as well as scene audio recordings as auditory icons to provide vista-scene awareness, in the contexts of navigation and leisure. This is the first step towards addressing the need for vista-space awareness and experience in PVIs.
28

Lin, Hong, Xuan Liu, Chaomurilige Chaomurilige, et al. "LongMergent: Pioneering audio mixing strategies for exquisite music generation." Computer Software and Media Applications 8, no. 1 (2025): 11516. https://doi.org/10.24294/csma11516.

Abstract:
Artificial intelligence-empowered music processing is a domain that involves the use of artificial intelligence technologies to enhance music analysis, understanding, and generation. This field encompasses a variety of tasks from music generation to music comprehension. In practical applications, the complexity of interwoven tasks, differences in data representation, scattered distribution of tool resources, and the threshold of professional music knowledge often become barriers that hinder developers from smoothly carrying out generative tasks. Therefore, it is essential to establish a system that can automatically analyze their needs and invoke appropriate tools to simplify the music processing workflow. Inspired by the recent success of Large Language Models (LLMs) in task automation, we have developed a system named LongMergent, which integrates numerous music-related tools and autonomous workflows to address user requirements. By granting users the freedom to effortlessly combine tools, this system provides a seamless and rich musical experience.
29

Yang, Chenyu, Shuai Wang, Hangting Chen, et al. "SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25597–605. https://doi.org/10.1609/aaai.v39i24.34750.

Abstract:
The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research about partial adjustments or editing of existing songs is still underexplored, which allows for more flexible and effective production. In this paper, we present SongEditor, the first song editing paradigm that introduces the editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as synthesizing songs from scratch. The core components of SongEditor include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling generating an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that the proposed SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics.
30

Adithya, Suresh, A. Faras, Habeeba K. M. Ummu, Eldho Anu, J. George Asha, and Roy Meckamalil Rotney. "Autism Detection Using Self-Stimulatory Behaviors." Advancement in Image Processing and Pattern Recognition 8, no. 3 (2025): 13–24. https://doi.org/10.5281/zenodo.15516090.

Abstract:
This paper introduces a novel video-audio-based model for the early detection of Autism Spectrum Disorder (ASD), focusing on analyzing self-stimulatory behaviors (stimming) such as arm flapping, head banging, and spinning, which are critical diagnostic markers. Traditional diagnostic approaches often depend on subjective clinical observations, leading to inconsistencies, delays, and limited accessibility in diverse settings. The proposed model combines video analysis with audio detection to address these shortcomings, supported by a generative AI-based method to create an audio dataset. This dataset enhances the model's robustness by incorporating diverse audio features merged with video data to provide a comprehensive analysis. Video analysis employs YOLO for face detection, MediaPipe for facial landmarks, pose tracking, and gaze estimation, while CNN-LSTM models identify repetitive behaviors. Audio processing extracts features via Librosa, NoiseReduce, and PyDub, with Gradient Boosting models analyzing speech anomalies. Classification integrates CNN-LSTM for video, Gradient Boosting for audio, and a Stacking Classifier with Logistic Regression for final predictions. By merging video and audio cues, this model enhances the objectivity, accuracy, and scalability of ASD detection, providing a reliable and efficient framework for early diagnosis and intervention across varied real-world environments.
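The classification stage described above stacks per-modality outputs with a logistic-regression meta-classifier. The following minimal sketch shows that late-fusion step only; the probability arrays are placeholders for the CNN-LSTM (video) and Gradient Boosting (audio) branch outputs, and all numbers are illustrative assumptions.

```python
# Hypothetical late-fusion sketch: logistic regression stacked on per-modality scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

video_prob = np.array([[0.8], [0.3], [0.6], [0.2]])   # placeholder P(ASD) from video branch
audio_prob = np.array([[0.7], [0.4], [0.5], [0.1]])   # placeholder P(ASD) from audio branch
labels = np.array([1, 0, 1, 0])

meta_features = np.hstack([video_prob, audio_prob])
fusion = LogisticRegression().fit(meta_features, labels)
print(fusion.predict_proba(meta_features)[:, 1])      # fused ASD probabilities
```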
31

Prudhvi, Y., T. Adinarayana, T. Chandu, S. Musthak, and G. Sireesha. "Vocal Visage: Crafting Lifelike 3D Talking Faces from Static Images and Sound." International Journal of Innovative Research in Computer Science and Technology 11, no. 6 (2023): 13–17. http://dx.doi.org/10.55524/ijircst.2023.11.6.3.

Abstract:
In the field of computer graphics and animation, the challenge of generating lifelike and expressive talking face animations has historically necessitated extensive 3D data and complex facial motion capture systems. However, this project presents an innovative approach to tackle this challenge, with the primary goal of producing realistic 3D motion coefficients for stylized talking face animations driven by a single reference image synchronized with audio input. Leveraging state-of-the-art deep learning techniques, including generative models, image-to-image translation networks, and audio processing methods, the methodology bridges the gap between static images and dynamic, emotionally rich facial animations. The ultimate aim is to synthesize talking face animations that exhibit seamless lip synchronization and natural eye blinking, thereby achieving an exceptional degree of realism and expressiveness, revolutionizing the realm of computer-generated character interactions.
32

A M, Vandana Pranavi, and Nagaraj G. Cholli. "Comprehensive Survey On Generative AI, Plethora Of Applications And Impacts." IOSR Journal of Computer Engineering 26, no. 5 (2024): 06–15. http://dx.doi.org/10.9790/0661-2605020615.

Abstract:
The primary objective of the AI subfield of "generative artificial intelligence" is to develop systems that can produce new, novel, and creative content including text, photos, audio, music, and movies. These models are able to generate fresh content that closely mimics realistic content created by humans by utilizing deep learning techniques. These GenAI models have gained significant importance in research and have a plethora of applications in a wide variety of fields. The impact of GenAI is not just on abled but also on disabled communities, who sometimes go unnoticed. This survey provides a thorough overview of generative artificial intelligence (GenAI), its applications, the current cutting-edge models, its effects on disabled groups, and its challenges and future directions.
33

Liang, Kai, and Haijun Zhao. "Application of Generative Adversarial Nets (GANs) in Active Sound Production System of Electric Automobiles." Shock and Vibration 2020 (October 28, 2020): 1–10. http://dx.doi.org/10.1155/2020/8888578.

Abstract:
To improve the diversity and quality of sound mimicry of electric automobile engines, a generative adversarial network (GAN) model was used to construct an active sound production model for electric automobiles. The structure of each layer in the network in this model and the size of its convolution kernel were designed. The gradient descent in network training was optimized using the adaptive moment estimation (Adam) algorithm. To demonstrate the quality difference of the generated samples from different input signals, two GAN models with different inputs were constructed. The experimental results indicate that the model can accurately learn the characteristic distributions of raw audio signals. Results from a human ear auditory test show that the generated audio samples mimicked the real samples well, and a leave-one-out (LOO) test shows that the diversity of the samples generated from the raw audio signals was higher than that of samples generated from a two-dimensional spectrogram.
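For readers unfamiliar with the training loop implied above, the sketch below shows one generic GAN update with Adam on placeholder 1-D signals. It is an illustrative assumption, not the paper's architecture; the layer sizes, learning rates, and waveform length are invented for the example.

```python
# Generic GAN step with Adam on toy 1-D "audio" frames (illustrative only).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 1024), nn.Tanh())
D = nn.Sequential(nn.Linear(1024, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 1024)      # placeholder batch of real audio frames
fake = G(torch.randn(8, 100))    # generated frames from latent noise

# Discriminator update: push real toward 1 and fake toward 0.
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator update: try to make the discriminator label fakes as real.
loss_g = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```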
34

Li, Lianghao. "Overview of Multimodal Generative Models in Natural Language Processing and Computer Vision." Journal of Computer Technology and Applied Mathematics 1, no. 4 (2024): 69–78. https://doi.org/10.5281/zenodo.13988327.

Abstract:
Multimodal generative models have become essential in the deep learning renaissance, as they provide unparalleled flexibility over a diverse range of applications within Natural Language Processing (NLP) and Computer Vision (CV). In this paper, we systematically review the basic concepts and technical improvements in multimodal generative models by discussing their applications across different modalities such as text, images, audio, and video. These models augment the ability of AI to comprehend and perform complicated tasks by coalescing data from various modalities. We investigate how these principles apply to many of the existing mainstream models (including CLIP, DALL·E, and Flamingo) and consider their applications in VQA, text-to-image synthesis, medical image analysis, edutainment content creation, and user research developments. This paper also examines the existing difficulties of such technologies, including paucity of data availability, modality-fusion effectiveness, and constraints on computational resources, while suggesting pathways for future research. The paper goes on to raise privacy concerns around multimodal generative models and calls for a model of safety and responsibility when it comes to technological innovation.
APA, Harvard, Vancouver, ISO, and other styles
35

Agarwal, Pratham. "MedBot: A GenAI-based Chatbot for Healthcare." International Journal of Scientific Research in Engineering and Management 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35757.

Full text
Abstract:
Generative Artificial Intelligence (GenAI) is transforming the healthcare industry by providing innovative solutions for patient care and information retrieval. MedBot is an innovative GenAI-driven chatbot designed to improve healthcare services by providing accurate and timely medical information. Utilizing advanced generative AI models, MedBot can respond to text, image, and audio queries, making it a versatile tool for diverse healthcare needs. The chatbot offers functionalities such as document summarization and insight extraction, aiding users in comprehending complex medical data. MedBot aims to enhance patient care by ensuring efficient and accessible interactions between users and healthcare information. This work signifies a substantial advancement in the integration of AI technologies within the healthcare sector, aiming to improve the overall efficiency and accessibility of medical support and information. Keywords: Generative Artificial Intelligence, Large Language Models, Chatbots, Healthcare
APA, Harvard, Vancouver, ISO, and other styles
36

Li, Jing, Zhengping Li, Ying Li, and Lijun Wang. "P‐2.12: A Comprehensive Study of Content Generation Using Diffusion Model." SID Symposium Digest of Technical Papers 54, S1 (2023): 522–24. http://dx.doi.org/10.1002/sdtp.16346.

Full text
Abstract:
The essence of the Metaverse is the process of integrating a large number of existing technologies to virtualize and digitize the real world. With the development of artificial intelligence technology, much of the digitally native content in the Metaverse will need to be produced by artificial intelligence. Current artificial intelligence technology allows computers to automatically and efficiently generate text, pictures, audio, video, and even 3D models. With the further development of natural language processing technology and generative network models, AI-driven generation will transform existing content creation methods and gradually become an accelerator for the development of the Metaverse. In this paper, we analyze the basic principles of diffusion models, clarify some of their existing problems, consider the key areas of concern and the potential directions to be explored further, and look forward to future applications of diffusion models in the Metaverse.
APA, Harvard, Vancouver, ISO, and other styles
37

Cheng, Liehai, Zhenli Zhang, Giuseppe Lacidogna, Xiao Wang, Mutian Jia, and Zhitao Liu. "Sound Sensing: Generative and Discriminant Model-Based Approaches to Bolt Loosening Detection." Sensors 24, no. 19 (2024): 6447. http://dx.doi.org/10.3390/s24196447.

Full text
Abstract:
The detection of bolt looseness is crucial to ensure the integrity and safety of bolted connection structures. Percussion-based bolt looseness detection provides a simple and cost-effective approach. However, the method has some inherent shortcomings that limit its application: it depends heavily on the inspector's hearing and experience and is easily affected by ambient noise. In this article, a complete set of signal processing procedures is proposed and a new kind of damage-index vector is constructed to strengthen the reliability and robustness of this method. First, a series of audio signal preprocessing algorithms, including denoising, segmenting, and smoothing filters, are applied to the raw audio signal. Then, the cumulative energy entropy (CEE) and mel frequency cepstrum coefficients (MFCCs) are used to extract damage-index vectors, which serve as input vectors for a generative and a discriminative classifier model (Gaussian discriminant analysis and support vector machine), respectively. Finally, multiple repeated experiments are conducted to verify the effectiveness of the proposed method and its ability to detect bolt looseness from the audio signal. The testing accuracies of the trained generative and discriminative models approach 90% and 96.7%, respectively, under different combinations of torque levels.
APA, Harvard, Vancouver, ISO, and other styles
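As a rough illustration of the classification stage described above, the sketch below extracts MFCC statistics from synthetic percussion-like taps and trains a support vector machine to separate tight from loose bolts. The synthetic signals, their frequencies and decay rates, and the labels are placeholders rather than the paper's measured data; the cumulative energy entropy feature and the Gaussian discriminant model are omitted.

```python
# MFCC features + SVM classifier on synthetic "tap" signals.
# Frequencies, decay rates, and tight/loose labels are illustrative assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

SR = 22050

def synthetic_tap(freq, decay, rng, n=SR // 2):
    # Decaying sinusoid plus noise as a stand-in for a percussion recording.
    t = np.arange(n) / SR
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq * t) + 0.05 * rng.standard_normal(n)

def mfcc_features(y, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y.astype(np.float32), sr=SR, n_mfcc=n_mfcc)
    # Summarise each coefficient over time with its mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

rng = np.random.default_rng(0)
X, labels = [], []
for _ in range(40):
    X.append(mfcc_features(synthetic_tap(freq=900, decay=30, rng=rng)))  # "tight" bolt
    labels.append(0)
    X.append(mfcc_features(synthetic_tap(freq=600, decay=12, rng=rng)))  # "loose" bolt
    labels.append(1)

X_tr, X_te, y_tr, y_te = train_test_split(np.stack(X), np.array(labels),
                                          test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```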
38

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (2023): A99. http://dx.doi.org/10.1121/10.0022922.

Full text
Abstract:
In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction loss into the original training objective to penalize imperfect audio reconstruction and compare it with neural vocoders and traditional spectrogram inversion methods. We use a Wasserstein GAN (WGAN) as an example model to explore the synthesis quality of generated sound effects, such as footsteps, birds, guns, rain, and engine sounds. In addition to synthesis quality, we also consider the range of sound variation that is possible with our generative model. We report on the trade-off that we obtain with our model regarding the quality and diversity of synthesized sound effects.
APA, Harvard, Vancouver, ISO, and other styles
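A minimal sketch of the core idea, adding a reconstruction penalty to the generator's adversarial objective, is given below. The tiny fully connected networks, the L1 waveform penalty, the pairing of each generated clip with a reference clip, and the weight lambda_rec are simplifying assumptions, not the WGAN configuration used in the study.

```python
# Generator objective combining a WGAN-style adversarial term with a
# reconstruction penalty.  Networks, pairing, and lambda_rec are assumptions.
import torch
import torch.nn as nn

latent_dim, audio_len, lambda_rec = 64, 8192, 10.0
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                  nn.Linear(512, audio_len), nn.Tanh())
D = nn.Sequential(nn.Linear(audio_len, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))

real = torch.randn(4, audio_len)        # stand-in batch of reference sound effects
z = torch.randn(4, latent_dim)
fake = G(z)

# Adversarial term: the generator wants high critic scores for its samples
# (critic updates and gradient penalty / weight clipping omitted here).
adv_loss = -D(fake).mean()
# Reconstruction term: penalize imperfect reconstruction of reference waveforms.
rec_loss = nn.functional.l1_loss(fake, real)

g_loss = adv_loss + lambda_rec * rec_loss
g_loss.backward()
```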
39

Cheng, Hsu-Yung, Chia-Cheng Su, Chi-Lun Jiang, and Chih-Chang Yu. "Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet." Electronics 14, no. 6 (2025): 1179. https://doi.org/10.3390/electronics14061179.

Full text
Abstract:
In recent years, generative AI has become popular in areas like natural language processing, as well as image and audio processing, significantly expanding AI’s creative capabilities. Particularly in the realm of image generation, diffusion models have achieved remarkable success across various applications, such as image synthesis and transformation. However, traditional diffusion models operate at the pixel level when learning image features, which inevitably demands significant computational resources. To address this issue, this paper proposes a pose transfer model that integrates the latent diffusion model, ControlNet, and a multi-scale feature extraction module. Moreover, the proposed method incorporates a semantic extraction filter into the attention neural network layer. This approach enables the model to train images in the latent space, subsequently focusing on critical image features and the relationships between poses. As a result, the architecture can be efficiently trained using an RTX 4090 GPU instead of multiple A100 GPUs. This study advances generative AI by optimizing diffusion models for enhanced efficiency and scalability. Our integrated approach reduces computational demands and accelerates training, making advanced image generation more accessible to organizations with limited resources and paving the way for future innovations in AI efficiency.
APA, Harvard, Vancouver, ISO, and other styles
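For readers who want a concrete starting point for pose-conditioned latent diffusion, the snippet below uses an off-the-shelf ControlNet pipeline from the Hugging Face diffusers library. It is only a generic analogue, not the paper's architecture with multi-scale feature extraction and a semantic extraction filter, and the model identifiers, blank pose map, and prompt are placeholders.

```python
# Generic latent diffusion + ControlNet pipeline via diffusers; NOT the custom
# architecture from the cited paper.  Model IDs, the blank pose map, and the
# prompt are placeholder assumptions.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose_map = Image.new("RGB", (512, 512))  # placeholder: a real OpenPose skeleton image goes here
result = pipe(
    "a person in a new pose, photorealistic",  # placeholder prompt
    image=pose_map,
    num_inference_steps=30,
).images[0]
result.save("pose_transfer_sketch.png")
```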
40

Sheikh, Dr Shagufta Mohammad Sayeed. "Empowering Learning: Crafting Educational Podcasts with GEN AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 4 (2025): 4517–28. https://doi.org/10.22214/ijraset.2025.69144.

Full text
Abstract:
The integration of Generative AI has facilitated innovative tools that are transforming content creation in the educational landscape, ushering in a shift towards more efficient and accessible learning paradigms. In this project, AI is leveraged to automate podcast production, transforming text-based educational content into high-fidelity audio that caters to varied learning needs. Leveraging advanced frameworks such as Large Language Models (LLMs) and Text-to-Speech (TTS) technologies, the system streamlines otherwise time-consuming processes like scripting, recording, and editing, thus speeding up content creation and improving accessibility for instructors. In addition, the system automates title-image and metadata generation, which improves the discoverability and professionalism of each podcast episode. By combining multiple AI capabilities, this project demonstrates the potential of Generative AI to personalize content, improve accessibility, and increase efficiency in educational environments. It enables personalized learning pathways and empowers educators to deliver more engaging and effective content, with the potential to reach an international audience and fit into different educational settings with ease. Additionally, this approach reduces the resource load on educational institutions, enabling them to deliver high-quality audio content even with limited resources and staff.
APA, Harvard, Vancouver, ISO, and other styles
41

B, Yeshitha, Vinitha V, Anubha Mittal, Harshitha Reddy P., and Katiyar Rajani. "Emotion Detection and Voice-Emotion Conversions using Deep Learning." International Journal of Microsystems and IoT 2, no. 3 (2024): 685–91. https://doi.org/10.5281/zenodo.11159090.

Full text
Abstract:
Emotion, especially as conveyed through speech, is a powerful tool that carries far more information than text alone can describe. Using artificial intelligence to tap into this can have a large positive impact on a variety of industries, including audio mining, customer service applications, security and forensics, and more. Spoken emotion recognition, a growing field of research, has relied heavily on models that employ audio data to build effective classifiers. This paper presents a convolutional neural network as a deep learning classification algorithm to classify 7 emotions with an accuracy of 69.45% on the combined SAVEE, RAVDESS, and TESS datasets. It also proposes a new system to replicate emotions on neutral audio (voice conversion). The emotional audio is produced using MelGAN, a special type of Generative Adversarial Network (GAN).
APA, Harvard, Vancouver, ISO, and other styles
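A minimal CNN classifier over mel-spectrogram inputs with seven emotion classes, sketched below, illustrates the kind of model the abstract describes. The layer sizes, the 128x128 input resolution, and the random batch are assumptions for illustration only, and the MelGAN voice-conversion stage is not shown.

```python
# Small CNN emotion classifier over mel-spectrogram inputs (7 classes).
# Architecture and input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, 128, 128) mel spectrograms
        return self.classifier(self.features(x))

model = EmotionCNN()
batch = torch.randn(4, 1, 128, 128)   # stand-in mel-spectrogram batch
logits = model(batch)                 # (4, 7) scores over emotion classes
print(logits.argmax(dim=1))
```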
42

Xi, Wang, Guillaume Devineau, Fabien Moutarde, and Jie Yang. "Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images." Algorithms 13, no. 12 (2020): 319. http://dx.doi.org/10.3390/a13120319.

Full text
Abstract:
Generative models for images, audio, text, and other low-dimensional data have achieved great success in recent years. Generating artificial human movements can also be useful for many applications, including improving data augmentation methods for human gesture recognition. The objective of this research is to develop a generative model for skeletal human movement that allows control over the action type of the generated motion while preserving the authenticity of the result and the natural style variability of gesture execution. We propose to use a conditional Deep Convolutional Generative Adversarial Network (DC-GAN) applied to pseudo-images representing skeletal pose sequences in the tree-structure skeleton image format. We evaluate our approach on the 3D skeletal data provided in the large NTU_RGB+D public dataset. Our generative model can output qualitatively correct skeletal human movements for any of the 60 action classes. We also quantitatively evaluate the performance of our model by computing Fréchet inception distances, which show strong correlation with human judgement. To the best of our knowledge, our work is the first successful class-conditioned generative model for human skeletal motions based on a pseudo-image representation of skeletal pose sequences.
APA, Harvard, Vancouver, ISO, and other styles
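The class-conditioning mechanism, in which an action-label embedding is combined with the noise vector, can be sketched as follows. The fully connected layers, embedding size, and 64x64 pseudo-image resolution are illustrative assumptions (the cited work uses a convolutional DC-GAN on tree-structured pseudo-images), while the 60-class setting mirrors NTU_RGB+D.

```python
# Class-conditioned generator: a label embedding is concatenated with the
# noise vector before being decoded into a pseudo-image.
import math
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, n_classes=60, latent_dim=100, embed_dim=32, out_shape=(3, 64, 64)):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, embed_dim)
        self.out_shape = out_shape
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 512), nn.ReLU(),
            nn.Linear(512, math.prod(out_shape)), nn.Tanh(),
        )

    def forward(self, z, labels):
        h = torch.cat([z, self.label_embed(labels)], dim=1)  # condition on the action class
        return self.net(h).view(-1, *self.out_shape)

gen = ConditionalGenerator()
z = torch.randn(8, 100)
labels = torch.randint(0, 60, (8,))   # one action-class index per sample
pseudo_images = gen(z, labels)        # (8, 3, 64, 64) skeleton pseudo-images
print(pseudo_images.shape)
```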
43

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (2023): 1834. http://dx.doi.org/10.3390/s23041834.

Full text
Abstract:
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2—Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.
APA, Harvard, Vancouver, ISO, and other styles
44

R, Arun Kumar, Lisa C, Rashmi V R, and Sandhya K. "GENERATIVE ADVERSARIAL NETWORKS (GANs) IN MULTIMODAL AI USING BRIDGING TEXT, IMAGE, AND AUDIO DATA FOR ENHANCED MODEL PERFORMANCE." ICTACT Journal on Soft Computing 15, no. 3 (2025): 3567–77. https://doi.org/10.21917/ijsc.2025.0497.

Full text
Abstract:
The integration of multimodal data is critical in advancing artificial intelligence models capable of interpreting diverse and complex inputs. While standalone models excel in processing individual data types like text, image, or audio, they often fail to achieve comparable performance when these modalities are combined. Generative Adversarial Networks (GANs) have emerged as a transformative approach in this domain due to their ability to synthesize and learn across disparate data types effectively. This study addresses the challenge of bridging multimodal datasets to improve the generalization and performance of AI models. The proposed framework employs a novel GAN architecture that integrates textual, visual, and auditory data streams. Using a shared latent space, the system generates coherent representations for cross-modal understanding, ensuring seamless data fusion. The GAN model is trained on a benchmark dataset comprising 50,000 multimodal instances, with 25% allocated for testing. Results indicate significant improvements in multimodal synthesis and classification accuracy. The model achieves a text-to-image synthesis FID score of 14.7, an audio-to-text BLEU score of 35.2, and a cross-modal classification accuracy of 92.3%. These outcomes surpass existing models by 8-15% across comparable metrics, highlighting the GAN’s effectiveness in handling data heterogeneity. The findings suggest potential applications in areas such as virtual assistants, multimedia analytics, and cross-modal content generation.
APA, Harvard, Vancouver, ISO, and other styles
45

Gong, Yuan, Cheng-I. Lai, Yu-An Chung, and James Glass. "SSAST: Self-Supervised Audio Spectrogram Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10699–709. http://dx.doi.org/10.1609/aaai.v36i10.21315.

Full text
Abstract:
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to reduce the need for large amounts of labeled data for the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
APA, Harvard, Vancouver, ISO, and other styles
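The masking step behind masked spectrogram patch modeling can be sketched independently of the Transformer itself: split a spectrogram into square patches and hide a random subset, which the model is then trained to discriminate and reconstruct. The patch size, mask ratio, and random stand-in spectrogram below are assumptions, not the SSAST configuration.

```python
# Split a spectrogram into square patches and mask a random subset, the
# pretraining targets in masked spectrogram patch modeling (illustrative only).
import torch

def mask_spectrogram_patches(spec, patch=16, mask_ratio=0.4):
    """spec: (freq, time) tensor whose sides are multiples of `patch`."""
    f, t = spec.shape
    patches = spec.reshape(f // patch, patch, t // patch, patch).permute(0, 2, 1, 3)
    patches = patches.reshape(-1, patch, patch)        # (num_patches, patch, patch)
    n_mask = int(mask_ratio * patches.shape[0])
    idx = torch.randperm(patches.shape[0])[:n_mask]    # indices of masked patches
    masked = patches.clone()
    masked[idx] = 0.0                                  # hidden patches become targets
    return masked, patches, idx

spec = torch.randn(128, 256)                           # stand-in log-mel spectrogram
masked, targets, masked_idx = mask_spectrogram_patches(spec)
print(masked.shape, masked_idx.shape)                  # torch.Size([128, 16, 16]) torch.Size([51])
```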
46

Appiani, Andrea, and Cigdem Beyan. "VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection." Information 16, no. 3 (2025): 233. https://doi.org/10.3390/info16030233.

Full text
Abstract:
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets.
APA, Harvard, Vancouver, ISO, and other styles
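The fusion stage described above, where embeddings from a visual encoder and a text encoder are combined by a deep network to make a speaking/not-speaking decision, can be sketched with a small MLP. The 512-dimensional random embeddings stand in for real CLIP and LLaVA-derived features, and the layer sizes are assumptions.

```python
# Late fusion of visual and text embeddings for voice activity detection.
# Random embeddings stand in for CLIP / LLaVA-derived features.
import torch
import torch.nn as nn

class FusionVAD(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # logit: speaking (1) vs. not speaking (0)
        )

    def forward(self, vis_emb, txt_emb):
        return self.mlp(torch.cat([vis_emb, txt_emb], dim=-1))

model = FusionVAD()
vis = torch.randn(16, 512)   # placeholder visual embeddings per video segment
txt = torch.randn(16, 512)   # placeholder embeddings of generated descriptions
probs = torch.sigmoid(model(vis, txt))   # per-segment speaking probability
print(probs.shape)                       # torch.Size([16, 1])
```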
47

Juby Nedumthakidiyil Zacharias. "Generative product content using vision-language models: Transforming e-commerce experiences." World Journal of Advanced Engineering Technology and Sciences 15, no. 3 (2025): 1130–37. https://doi.org/10.30574/wjaets.2025.15.3.1046.

Full text
Abstract:
Vision-language models (VLMs) are fundamentally transforming product content creation in e-commerce, representing a paradigm shift in how digital retail platforms manage product information. These sophisticated systems, which leverage dual-encoder architectures and contrastive learning methods, establish meaningful connections between visual attributes and textual descriptions to generate comprehensive product content directly from images. By analyzing product photographs, these models automatically create detailed descriptions, ingredient lists, and usage recommendations with remarkable accuracy and efficiency. Implementation studies demonstrate significant reductions in manual copywriting requirements while improving content quality, search engine visibility, and customer engagement metrics. Despite their transformative potential, these technologies face challenges including hallucination prevention and brand voice alignment, which researchers address through knowledge graph integration, confidence scoring systems, and adaptive fine-tuning mechanisms. Ongoing innovation focuses on inventory-aware content generation and multimodal enhancement through audio, 3D, and video integration. As these technologies mature, they promise to revolutionize how e-commerce platforms create, maintain, and personalize product information while delivering meaningful operational efficiencies and enhanced shopping experiences.
APA, Harvard, Vancouver, ISO, and other styles
48

Davis, Jason. "In a Digital World With Generative AI Detection Will Not be Enough." Newhouse Impact Journal 1, no. 1 (2024): 9–12. http://dx.doi.org/10.14305/jn.29960819.2024.1.1.01.

Full text
Abstract:
Recent and dramatic improvements in AI driven large language models (LLMs), image generators, audio and video have fed an exponential growth in Generative AI applications and accessibility. The disruptive ripples of this rapid evolution have already begun to fundamentally impact how we create and consume content on a global scale. While the use of Generative AI has and will continue to enable massive increases in the speed and efficiency of content creation, it has come at the cost of uncomfortable conversations about transparency and the erosion of digital trust. To have any chance at actually diminishing the societal impact of digital disinformation in an age of generative AI, approaches strategically designed to assist human decision making must move past simple detection and provide more robust solutions.
APA, Harvard, Vancouver, ISO, and other styles
49

Armstrong Joseph J and Senthil S. "The Dark Side of Generative AI: Ethical, Security, and Social Concerns." International Research Journal on Advanced Engineering Hub (IRJAEH) 3, no. 04 (2025): 1720–23. https://doi.org/10.47392/irjaeh.2025.0247.

Full text
Abstract:
Generative Artificial Intelligence (AI) represents a significant leap in technology, enabling the creation of novel content from text, images, videos, and audio. While its potential to drive innovation and improve productivity is immense, the risks associated with its application are equally formidable. This paper explores the darker aspects of generative AI, focusing on ethical dilemmas, social implications, security threats, and the potential for misuse. We examine issues such as misinformation, biases in AI models, job displacement, and the dangers posed by AI-driven automation. Finally, we discuss the need for effective governance and regulatory measures to mitigate these risks and ensure responsible AI development.
APA, Harvard, Vancouver, ISO, and other styles
50

Charpe, Aditya. "Real-Time Deepfake Detection: A Systematic Review of Generative Adversarial Networks (GANs) and Generative Transformer Networks (GTNs)." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 2801–18. https://doi.org/10.22214/ijraset.2025.71021.

Full text
Abstract:
Deepfakes, synthetic videos generated by artificial intelligence, pose severe threats to multimedia integrity, enabling misinformation, financial fraud, and identity theft [34]. Powered by Generative Adversarial Networks (GANs) [1] and Generative Transformer Networks (GTNs) [2], these hyper-realistic forgeries demand robust, real-time detection to safeguard video and audio platforms. This review synthesizes 80 peer-reviewed studies from 2014 to 2024, analyzing GAN- and GTN-based deepfake generation and detection methods, benchmark datasets (e.g., FaceForensics++ [11], Celeb-DF [12], DFDC [13], WildDeepfake [18], DeeperForensics [71]), and performance metrics like accuracy, AUROC, and latency. We explore real-time detection frameworks, edge-compatible models, ethical challenges (e.g., dataset bias, privacy risks) [35], and global regulatory frameworks. Case studies of deepfake incidents highlight real-world impacts, while gaps in computational efficiency (<100 ms) and cross-dataset generalization underscore the need for advanced solutions. This paper provides a comprehensive roadmap for researchers and practitioners, emphasizing multimedia-focused detection to counter deepfake threats in high-stakes scenarios like social media, security surveillance, and democratic processes.
APA, Harvard, Vancouver, ISO, and other styles