Log in

Relevant bibliographies by topics / Generative audio models / Journal articles

To see the other types of publications on this topic, follow the link: Generative audio models.

Journal articles on the topic 'Generative audio models'

Author: Grafiati

Published: 25 May 2024

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Generative audio models.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Evans, Zach, Scott H. Hawley, and Katherine Crowson. "Musical audio samples generated from joint text embeddings." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A178. http://dx.doi.org/10.1121/10.0015956.

Full text

Abstract:

The field of machine learning has benefited from the appearance of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models are currently an active area of research. We present work on generating short samples of musical instrument sounds generated by a model which was conditioned on text descriptions and the file structure labels of large sample libraries. Preliminary findings indicate that generation of wide-spectrum sounds such as percussion are not difficult, while the generation of harmonic musical sounds presents challenges for audio diffusion models.

APA, Harvard, Vancouver, ISO, and other styles

2

Wang, Heng, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (March 24, 2024): 15492–501. http://dx.doi.org/10.1609/aaai.v38i14.29475.

Full text

Abstract:

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/.

APA, Harvard, Vancouver, ISO, and other styles

3

Sakirin, Tam, and Siddartha Kusuma. "A Survey of Generative Artificial Intelligence Techniques." Babylonian Journal of Artificial Intelligence 2023 (March 10, 2023): 10–14. http://dx.doi.org/10.58496/bjai/2023/003.

Full text

Abstract:

Generative artificial intelligence (AI) refers to algorithms capable of creating novel, realistic digital content autonomously. Recently, generative models have attained groundbreaking results in domains like image and audio synthesis, spurring vast interest in the field. This paper surveys the landscape of modern techniques powering the rise of creative AI systems. We structurally examine predominant algorithmic approaches including generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models. Architectural innovations and illustrations of generated outputs are highlighted for major models under each category. We give special attention to generative techniques for constructing realistic images, tracing rapid progress from early GAN samples to modern diffusion models like Stable Diffusion. The paper further reviews generative modeling to create convincing audio, video, and 3D renderings, which introduce critical challenges around fake media detection and data bias. Additionally, we discuss common datasets that have enabled advances in generative modeling. Finally, open questions around evaluation, technique blending, controlling model behaviors, commercial deployment, and ethical considerations are outlined as active areas for future work. This survey presents both long-standing and emerging techniques molding the state and trajectory of generative AI. The key goals are to overview major algorithm families, highlight innovations through example models, synthesize capabilities for multimedia generation, and discuss open problems around data, evaluation, control, and ethics. Please let me know if you would like any clarification or modification of this proposed abstract.

APA, Harvard, Vancouver, ISO, and other styles

4

Broad, Terence, Frederic Fol Leymarie, and Mick Grierson. "Network Bending: Expressive Manipulation of Generative Models in Multiple Domains." Entropy 24, no. 1 (December 24, 2021): 28. http://dx.doi.org/10.3390/e24010028.

Full text

Abstract:

This paper presents the network bending framework, a new approach for manipulating and interacting with deep generative models. We present a comprehensive set of deterministic transformations that can be inserted as distinct layers into the computational graph of a trained generative neural network and applied during inference. In addition, we present a novel algorithm for analysing the deep generative model and clustering features based on their spatial activation maps. This allows features to be grouped together based on spatial similarity in an unsupervised fashion. This results in the meaningful manipulation of sets of features that correspond to the generation of a broad array of semantically significant features of the generated results. We outline this framework, demonstrating our results on deep generative models for both image and audio domains. We show how it allows for the direct manipulation of semantically meaningful aspects of the generative process as well as allowing for a broad range of expressive outcomes.

APA, Harvard, Vancouver, ISO, and other styles

5

Aldausari, Nuha, Arcot Sowmya, Nadine Marcus, and Gelareh Mohammadi. "Video Generative Adversarial Networks: A Review." ACM Computing Surveys 55, no. 2 (March 31, 2023): 1–25. http://dx.doi.org/10.1145/3487891.

Full text

Abstract:

With the increasing interest in the content creation field in multiple sectors such as media, education, and entertainment, there is an increased trend in the papers that use AI algorithms to generate content such as images, videos, audio, and text. Generative Adversarial Networks (GANs) is one of the promising models that synthesizes data samples that are similar to real data samples. While the variations of GANs models in general have been covered to some extent in several survey papers, to the best of our knowledge, this is the first paper that reviews the state-of-the-art video GANs models. This paper first categorizes GANs review papers into general GANs review papers, image GANs review papers, and special field GANs review papers such as anomaly detection, medical imaging, or cybersecurity. The paper then summarizes the main improvements in GANs that are not necessarily applied in the video domain in the first run but have been adopted in multiple video GANs variations. Then, a comprehensive review of video GANs models are provided under two main divisions based on existence of a condition. The conditional models are then further classified according to the provided condition into audio, text, video, and image. The paper concludes with the main challenges and limitations of the current video GANs models.

APA, Harvard, Vancouver, ISO, and other styles

6

Shen, Qiwei, Junjie Xu, Jiahao Mei, Xingjiao Wu, and Daoguo Dong. "EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance." Applied Sciences 14, no. 8 (April 10, 2024): 3193. http://dx.doi.org/10.3390/app14083193.

Full text

Abstract:

With the flourishing development of generative models, image manipulation is receiving increasing attention. Rather than text modality, several elegant designs have delved into leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditional on semantic alignment, ignoring the vivid affective information depicted in the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework where affective information from audio can be explicitly extracted and further utilized during image manipulation. Specifically, we first leverage the multi-modality model ImageBind for initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. Simultaneously, by extracting sentiment polarity from the lyrics of the audio, we generate an emotionally rich auxiliary music branch to accentuate the affective information. We then leverage pre-trained encoders to encode audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via a direct latent optimization method. We conduct objective and subjective evaluations on the generated images, and our results show that our framework is capable of generating images with specified human emotions conveyed in the audio.

APA, Harvard, Vancouver, ISO, and other styles

7

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (October 11, 2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Full text

Abstract:

Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. Using a lower-dimensional mel spectrogram as the conditioner allows both user controllability and a way for the network to generate more diversity. Additionally, it gives the model style transfer capabilities. We evaluate several models in terms of the quality and variability of the generated sounds using both quantitative and subjective evaluations. The results suggest that there is a trade-off between quality and diversity. Nevertheless, our method achieves a quality level similar to that of the training set while generating perceivable variations according to a perceptual study that includes game audio experts.

APA, Harvard, Vancouver, ISO, and other styles

8

Lattner, Stefan, and Javier Nistal. "Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks." Electronics 10, no. 11 (June 5, 2021): 1349. http://dx.doi.org/10.3390/electronics10111349.

Full text

Abstract:

Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals with 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.

APA, Harvard, Vancouver, ISO, and other styles

9

Yang, Junpeng, and Haoran Zhang. "Development And Challenges of Generative Artificial Intelligence in Education and Art." Highlights in Science, Engineering and Technology 85 (March 13, 2024): 1334–47. http://dx.doi.org/10.54097/vaeav407.

Full text

Abstract:

Thanks to the rapid development of generative deep learning models, Artificial Intelligence Generated Content (AIGC) has attracted more and more research attention in recent years, which aims to learn models from massive data to generate relevant content based on input conditions. Different from traditional single-modal generation tasks that focus on content generation for a particular modality, such as image generation, text generation, or semantic generation, AIGC trains a single model that can simultaneously understand language, images, videos, audio, and more. AIGC marks the transition from traditional decision-based artificial intelligence to generative artificial intelligence, which has been widely applied in various fields. Focusing on the key technologies and representative applications of AIGC, this paper identifies several key technical challenges and controversies in the field. These include defects in cross-modal and multimodal generation, issues related to model stability and data consistency, privacy concerns, and questions about whether advanced generative models like ChatGPT can be considered general artificial intelligence (AGI). While this dissertation provides valuable insights into the revolution and challenge of generative AI in art and education, it acknowledges the sensitivity of generated content and the ethical dilemmas it may pose, and ownership rights for AI-generated works and the need for new intellectual property norms are subjects of ongoing discussion. To address the current technical bottlenecks in cross-modal and multimodal generation, future research aims to quantitatively analyze and compare existing models, proposing practical optimization strategies. With the rapid advancement of generative AI, we anticipate a transition from user-generated content (UGC) to artificial intelligence-generated content (AIGC) and, ultimately, a new era of human-computer co-creation with strong interactive potential in the near future.

APA, Harvard, Vancouver, ISO, and other styles

10

Choi, Ha-Yeong, Sang-Hoon Lee, and Seong-Whan Lee. "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17862–70. http://dx.doi.org/10.1609/aaai.v38i16.29740.

Full text

Abstract:

Diffusion-based generative models have recently exhibited powerful generative performance. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, we introduce decoupled denoising diffusion models (DDDMs) with disentangled representations, which can enable effective style transfers for each attribute in generative models. In particular, we apply DDDMs for voice conversion (VC) tasks, tackling the intricate challenge of disentangling and individually transferring each speech attributes such as linguistic information, intonation, and timbre. First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for style transfer with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance even when using a smaller model size. Audio samples are available at https://hayeong0.github.io/DDDM-VC-demo/.

APA, Harvard, Vancouver, ISO, and other styles

11

Gupta, Chitralekha, Shreyas Sridhar, Denys J. C. Matthies, Christophe Jouffrais, and Suranga Nanayakkara. "SonicVista: Towards Creating Awareness of Distant Scenes through Sonification." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 2 (May 13, 2024): 1–32. http://dx.doi.org/10.1145/3659609.

Full text

Abstract:

Spatial awareness, particularly awareness of distant environmental scenes known as vista-space, is crucial and contributes to the cognitive and aesthetic needs of People with Visual Impairments (PVI). In this work, through a formative study with PVIs, we establish the need for vista-space awareness amongst people with visual impairments, and the possible scenarios where this awareness would be helpful. We investigate the potential of existing sonification techniques as well as AI-based audio generative models to design sounds that can create awareness of vista-space scenes. Our first user study, consisting of a listening test with sighted participants as well as PVIs, suggests that current AI generative models for audio have the potential to produce sounds that are comparable to existing sonification techniques in communicating sonic objects and scenes in terms of their intuitiveness, and learnability. Furthermore, through a wizard-of-oz study with PVIs, we demonstrate the utility of AI-generated sounds as well as scene audio recordings as auditory icons to provide vista-scene awareness, in the contexts of navigation and leisure. This is the first step towards addressing the need for vista-space awareness and experience in PVIs.

APA, Harvard, Vancouver, ISO, and other styles

12

Prudhvi, Y., T. Adinarayana, T. Chandu, S. Musthak, and G. Sireesha. "Vocal Visage: Crafting Lifelike 3D Talking Faces from Static Images and Sound." International Journal of Innovative Research in Computer Science and Technology 11, no. 6 (November 9, 2023): 13–17. http://dx.doi.org/10.55524/ijircst.2023.11.6.3.

Full text

Abstract:

In the field of computer graphics and animation, the challenge of generating lifelike and expressive talking face animations has historically necessitated extensive 3D data and complex facial motion capture systems. However, this project presents an innovative approach to tackle this challenge, with the primary goal of producing realistic 3D motion coefficients for stylized talking face animations driven by a single reference image synchronized with audio input. Leveraging state-of-the-art deep learning techniques, including generative models, image-to-image translation networks, and audio processing methods, the methodology bridges the gap between static images and dynamic, emotionally rich facial animations. The ultimate aim is to synthesize talking face animations that exhibit seamless lip synchronization and natural eye blinking, thereby achieving an exceptional degree of realism and expressiveness, revolutionizing the realm of computer-generated character interactions.

APA, Harvard, Vancouver, ISO, and other styles

13

Liang, Kai, and Haijun Zhao. "Application of Generative Adversarial Nets (GANs) in Active Sound Production System of Electric Automobiles." Shock and Vibration 2020 (October 28, 2020): 1–10. http://dx.doi.org/10.1155/2020/8888578.

Full text

Abstract:

To improve the diversity and quality of sound mimicry of electric automobile engines, a generative adversarial network (GAN) model was used to construct an active sound production model for electric automobiles. The structure of each layer in the network in this model and the size of its convolution kernel were designed. The gradient descent in network training was optimized using the adaptive moment estimation (Adam) algorithm. To demonstrate the quality difference of the generated samples from different input signals, two GAN models with different inputs were constructed. The experimental results indicate that the model can accurately learn the characteristic distributions of raw audio signals. Results from a human ear auditory test show that the generated audio samples mimicked the real samples well, and a leave-one-out (LOO) test show that the diversity of the samples generated from the raw audio signals was higher than that of samples generated from a two-dimensional spectrogram.

APA, Harvard, Vancouver, ISO, and other styles

14

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (October 1, 2023): A99. http://dx.doi.org/10.1121/10.0022922.

Full text

Abstract:

In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction loss into the original training objective to penalize imperfect audio reconstruction and compare it with neural vocoders and traditional spectrogram inversion methods. We use a Wasserstein GAN (WGAN) as an example model to explore the synthesis quality of generated sound effects, such as footsteps, birds, guns, rain, and engine sounds. In addition to synthesis quality, we also consider the range of sound variation that is possible with our generative model. We report on the trade-off that we obtain with our model regarding the quality and diversity of synthesized sound effects.

APA, Harvard, Vancouver, ISO, and other styles

15

Li, Jing, Zhengping Li, Ying Li, and Lijun Wang. "P‐2.12: A Comprehensive Study of Content Generation Using Diffusion Model." SID Symposium Digest of Technical Papers 54, S1 (April 2023): 522–24. http://dx.doi.org/10.1002/sdtp.16346.

Full text

Abstract:

The essence of the Metaverse is the process of integrating a large number of existing technologies to virtualize and digitize the real world. With the development of artificial intelligence technology, a large number of digital native content in the Metaverse needs to be completed by artificial intelligence. The current artificial intelligence technology allows computers to automatically and efficiently generate text, pictures, audio, video, and even 3D models. With the further development of natural language processing technology and generative network models, future artificial intelligence generation will subvert the existing content generation methods and gradually become an accelerator for the development of the Metaverse. In this paper, we have analyzed and studied the basic principles of diffusion models, clarified some existing problems of diffusion models, considered the key areas of concern and potential areas to be further explored, and looked forward to the future application of diffusion models in the field of Metaverse.

APA, Harvard, Vancouver, ISO, and other styles

16

Xi, Wang, Guillaume Devineau, Fabien Moutarde, and Jie Yang. "Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images." Algorithms 13, no. 12 (December 3, 2020): 319. http://dx.doi.org/10.3390/a13120319.

Full text

Abstract:

Generative models for images, audio, text, and other low-dimension data have achieved great success in recent years. Generating artificial human movements can also be useful for many applications, including improvement of data augmentation methods for human gesture recognition. The objective of this research is to develop a generative model for skeletal human movement, allowing to control the action type of generated motion while keeping the authenticity of the result and the natural style variability of gesture execution. We propose to use a conditional Deep Convolutional Generative Adversarial Network (DC-GAN) applied to pseudo-images representing skeletal pose sequences using tree structure skeleton image format. We evaluate our approach on the 3D skeletal data provided in the large NTU_RGB+D public dataset. Our generative model can output qualitatively correct skeletal human movements for any of the 60 action classes. We also quantitatively evaluate the performance of our model by computing Fréchet inception distances, which shows strong correlation to human judgement. To the best of our knowledge, our work is the first successful class-conditioned generative model for human skeletal motions based on pseudo-image representation of skeletal pose sequences.

APA, Harvard, Vancouver, ISO, and other styles

17

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (February 7, 2023): 1834. http://dx.doi.org/10.3390/s23041834.

Full text

Abstract:

This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2—Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.

APA, Harvard, Vancouver, ISO, and other styles

18

Gong, Yuan, Cheng-I. Lai, Yu-An Chung, and James Glass. "SSAST: Self-Supervised Audio Spectrogram Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10699–709. http://dx.doi.org/10.1609/aaai.v36i10.21315.

Full text

Abstract:

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to reduce the need for large amounts of labeled data for the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.

APA, Harvard, Vancouver, ISO, and other styles

19

Vujović, Dušan. "Generative AI: Riding the new general purpose technology storm." Ekonomika preduzeca 72, no. 1-2 (2024): 125–36. http://dx.doi.org/10.5937/ekopre2402125v.

Full text

Abstract:

Generative AI promises to revolutionize many industries (entertainment, marketing, healthcare, finance, and research) by empowering machines to create new data content inspired by existing data. It experienced exponential growth in recent years. In 2023 breakout year Gen AI impact reached 2.6-4.4 trillion USD (2.5-4.2% of global GDP). The development of modern LLM-based models has been facilitated by improvements in computing power, data availability, and algorithms. These models have diverse applications in text, visual, audio, and code generation across various domains. Leading companies are rapidly deploying Gen AI for strategic decision-making at corporate executive levels. While AI-related risks have been identified, mitigation measures are still in early stages. Leaders in Gen AI adoption anticipate workforce changes and re-skilling needs. Gen AI is primarily used for text functions, big data analysis, and customer services, with the strongest impact in knowledge-based sectors. High-performing AI companies prioritize revenue generation over cost reduction, rapidly expand the use of Gen AI across various business functions, and link business value to organizational performance and structure. There is a notable lack of attention to addressing broader societal risks and the impact on the labor force. Gen AI creates new job opportunities and improves productivity in key areas. Future investment in AI is expected to rise. Concerns about the potential AI singularity, where machines surpass human intelligence, are subject to debate. Some view singularity as a risk, others are more optimistic based on human control and societal constraints. Leading experts in Gen AI predict that the coming decade can be the most prosperous in history if we manage to harness the benefits of Gen AI and control its downside.

APA, Harvard, Vancouver, ISO, and other styles

20

Davis, Jason. "In a Digital World With Generative AI Detection Will Not be Enough." Newhouse Impact Journal 1, no. 1 (2024): 9–12. http://dx.doi.org/10.14305/jn.29960819.2024.1.1.01.

Full text

Abstract:

Recent and dramatic improvements in AI driven large language models (LLMs), image generators, audio and video have fed an exponential growth in Generative AI applications and accessibility. The disruptive ripples of this rapid evolution have already begun to fundamentally impact how we create and consume content on a global scale. While the use of Generative AI has and will continue to enable massive increases in the speed and efficiency of content creation, it has come at the cost of uncomfortable conversations about transparency and the erosion of digital trust. To have any chance at actually diminishing the societal impact of digital disinformation in an age of generative AI, approaches strategically designed to assist human decision making must move past simple detection and provide more robust solutions.

APA, Harvard, Vancouver, ISO, and other styles

21

Samanta, Bidisha, Abir DE, Gourhari Jana, Pratim Kumar Chattaraj, Niloy Ganguly, and Manuel Gomez Rodriguez. "NeVAE: A Deep Generative Model for Molecular Graphs." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1110–17. http://dx.doi.org/10.1609/aaai.v33i01.33011110.

Full text

Abstract:

Deep generative models have been praised for their ability to learn smooth latent representation of images, text, and audio, which can then be used to generate new, plausible data. However, current generative models are unable to work with molecular graphs due to their unique characteristics—their underlying structure is not Euclidean or grid-like, they remain isomorphic under permutation of the nodes labels, and they come with a different number of nodes and edges. In this paper, we propose NeVAE, a novel variational autoencoder for molecular graphs, whose encoder and decoder are specially designed to account for the above properties by means of several technical innovations. In addition, by using masking, the decoder is able to guarantee a set of valid properties in the generated molecules. Experiments reveal that our model can discover plausible, diverse and novel molecules more effectively than several state of the art methods. Moreover, by utilizing Bayesian optimization over the continuous latent representation of molecules our model finds, we can also find molecules that maximize certain desirable properties more effectively than alternatives.

APA, Harvard, Vancouver, ISO, and other styles

22

Saxena, Divya, and Jiannong Cao. "Generative Adversarial Networks (GANs)." ACM Computing Surveys 54, no. 3 (June 2021): 1–42. http://dx.doi.org/10.1145/3446374.

Full text

Abstract:

Generative Adversarial Networks (GANs) is a novel class of deep generative models that has recently gained significant attention. GANs learn complex and high-dimensional distributions implicitly over images, audio, and data. However, there exist major challenges in training of GANs, i.e., mode collapse, non-convergence, and instability, due to inappropriate design of network architectre, use of objective function, and selection of optimization algorithm. Recently, to address these challenges, several solutions for better design and optimization of GANs have been investigated based on techniques of re-engineered network architectures, new objective functions, and alternative optimization algorithms. To the best of our knowledge, there is no existing survey that has particularly focused on the broad and systematic developments of these solutions. In this study, we perform a comprehensive survey of the advancements in GANs design and optimization solutions proposed to handle GANs challenges. We first identify key research issues within each design and optimization technique and then propose a new taxonomy to structure solutions by key research issues. In accordance with the taxonomy, we provide a detailed discussion on different GANs variants proposed within each solution and their relationships. Finally, based on the insights gained, we present promising research directions in this rapidly growing field.

APA, Harvard, Vancouver, ISO, and other styles

23

Avdeeff, Melissa. "Artificial Intelligence & Popular Music: SKYGGE, Flow Machines, and the Audio Uncanny Valley." Arts 8, no. 4 (October 11, 2019): 130. http://dx.doi.org/10.3390/arts8040130.

Full text

Abstract:

This article presents an overview of the first AI-human collaborated album, Hello World, by SKYGGE, which utilizes Sony’s Flow Machines technologies. This case study is situated within a review of current and emerging uses of AI in popular music production, and connects those uses with myths and fears that have circulated in discourses concerning the use of AI in general, and how these fears connect to the idea of an audio uncanny valley. By proposing the concept of an audio uncanny valley in relation to AIPM (artificial intelligence popular music), this article offers a lens through which to examine the more novel and unusual melodies and harmonization made possible through AI music generation, and questions how this content relates to wider speculations about posthumanism, sincerity, and authenticity in both popular music, and broader assumptions of anthropocentric creativity. In its documentation of the emergence of a new era of popular music, the AI era, this article surveys: (1) The current landscape of artificial intelligence popular music focusing on the use of Markov models for generative purposes; (2) posthumanist creativity and the potential for an audio uncanny valley; and (3) issues of perceived authenticity in the technologically mediated “voice”.

APA, Harvard, Vancouver, ISO, and other styles

24

Zhou, Kun, Wenyong Wang, Teng Hu, and Kai Deng. "Time Series Forecasting and Classification Models Based on Recurrent with Attention Mechanism and Generative Adversarial Networks." Sensors 20, no. 24 (December 16, 2020): 7211. http://dx.doi.org/10.3390/s20247211.

Full text

Abstract:

Time series classification and forecasting have long been studied with the traditional statistical methods. Recently, deep learning achieved remarkable successes in areas such as image, text, video, audio processing, etc. However, research studies conducted with deep neural networks in these fields are not abundant. Therefore, in this paper, we aim to propose and evaluate several state-of-the-art neural network models in these fields. We first review the basics of representative models, namely long short-term memory and its variants, the temporal convolutional network and the generative adversarial network. Then, long short-term memory with autoencoder and attention-based models, the temporal convolutional network and the generative adversarial model are proposed and applied to time series classification and forecasting. Gaussian sliding window weights are proposed to speed the training process up. Finally, the performances of the proposed methods are assessed using five optimizers and loss functions with the public benchmark datasets, and comparisons between the proposed temporal convolutional network and several classical models are conducted. Experiments show the proposed models’ effectiveness and confirm that the temporal convolutional network is superior to long short-term memory models in sequence modeling. We conclude that the proposed temporal convolutional network reduces time consumption to around 80% compared to others while retaining the same accuracy. The unstable training process for generative adversarial network is circumvented by tuning hyperparameters and carefully choosing the appropriate optimizer of “Adam”. The proposed generative adversarial network also achieves comparable forecasting accuracy with traditional methods.

APA, Harvard, Vancouver, ISO, and other styles

25

Skitsko, Volodymyr I. "Data Analysis Using Generative AI: Opportunities and Challenges." PROBLEMS OF ECONOMY 4, no. 58 (2023): 217–25. http://dx.doi.org/10.32983/2222-0712-2023-4-217-225.

Full text

Abstract:

The article examines topical issues of using generative artificial intelligence, in particular large language models ChatGPT and Claude, for data analysis. The essence of the term «artificial intelligence» changes over time. And if earlier, when using this term, was talked about expert systems, machine learning, etc., now this term means primarily large language models, among which the most famous are ChatGPT, Claude, Bing, Bard. These models allow you to generate texts, images, audio, and video based on user requests. The purpose of the article is to explore the opportunities and challenges of using ChatGPT and Claude in data analysis based on existing publications and our own experience. After all, the ability of large language models to communicate in natural language makes them a powerful analytics tool. The paper considers aspects of the main stages of data analysis in the context of the use of large language models: obtaining, collecting and loading input data, their pre-processing, application of mathematical models, visualization and interpretation of results. Practical recommendations for formulating requests to ChatGPT and Claude at each stage of data analysis are provided. It is noted that ChatGPT, thanks to the built-in Advanced Data Analysis service, allows an effectively analysis of data, powered by Python language. This provides higher accuracy of results compared to other large language models. Using a conditional example, a comparison of the capabilities of ChatGPT and Claude in data analysis is carried out. It is shown that ChatGPT allows you to build dependency models, generate graphs and give meaningful explanations of the results obtained. At the same time, Claude’s capabilities in data analysis are quite limited. It is concluded that ChatGPT has significantly greater potential for data analysis compared to Claude and other chatbots. However, so far, large language models cannot completely replace data analysts and powerful decision support systems. In further research, it is proposed to focus on the study of the practical application of the capabilities of ChatGPT and, in particular, its Advanced Data Analysis service to solve various data analysis problems.

APA, Harvard, Vancouver, ISO, and other styles

26

Shi, Yi, and Lin Sun. "How Generative AI Is Transforming Journalism: Development, Application and Ethics." Journalism and Media 5, no. 2 (May 10, 2024): 582–94. http://dx.doi.org/10.3390/journalmedia5020039.

Full text

Abstract:

Generative artificial intelligence (GAI) is a technology based on algorithms, models, etc., that creates content such as text, audio, images, videos, and code. GAI is deeply integrated into journalism as tools, platforms and systems. However, GAI’s role in journalism dilutes the power of media professionals, changes traditional news production and poses ethical questions. This study attempts to systematically answer these ethical questions in specific journalistic practices from the perspectives of journalistic professionalism and epistemology. Building on the review of GAI’s development and application, this study identifies the responsibilities of news organizations, journalists and audiences, ensuring that they realize the potential of GAI while adhering to journalism professionalism and universal human values to avoid negative technological effects.

APA, Harvard, Vancouver, ISO, and other styles

27

Raikwar, Piyush, Renato Cardoso, Nadezda Chernyavskaya, Kristina Jaruskova, Witold Pokorski, Dalila Salamani, Mudhakar Srivatsa, Kalliopi Tsolaki, Sofia Vallecorsa, and Anna Zaborowska. "Transformers for Generalized Fast Shower Simulation." EPJ Web of Conferences 295 (2024): 09039. http://dx.doi.org/10.1051/epjconf/202429509039.

Full text

Abstract:

Recently, transformer-based foundation models have proven to be a generalized architecture applicable to various data modalities, ranging from text to audio and even a combination of multiple modalities. Transformers by design should accurately model the non-trivial structure of particle showers thanks to the absence of strong inductive bias, better modeling of long-range dependencies, and interpolation and extrapolation capabilities. In this paper, we explore a transformer-based generative model for detector-agnostic fast shower simulation, where the goal is to generate synthetic particle showers, i.e., the energy depositions in the calorimeter. When trained with an adequate amount and variety of showers, these models should learn better representations compared to other deep learning models, and hence should quickly adapt to new detectors. In this work, we will show the prototype of a transformer-based generative model for fast shower simulation, as well as explore certain aspects of transformer architecture such as input data representation, sequence formation, and the learning mechanism for our unconventional shower data.

APA, Harvard, Vancouver, ISO, and other styles

28

Li, Siyuan, Lei Cheng, Jun Li, Zichen Wang, and Jianlong Li. "Learning data distribution of three-dimensional ocean sound speed fields via diffusion models." Journal of the Acoustical Society of America 155, no. 5 (May 1, 2024): 3410–25. http://dx.doi.org/10.1121/10.0026026.

Full text

Abstract:

The probability distribution of three-dimensional sound speed fields (3D SSFs) in an ocean region encapsulates vital information about their variations, serving as valuable data-driven priors for SSF inversion tasks. However, learning such a distribution is challenging due to the high dimensionality and complexity of 3D SSFs. To tackle this challenge, we propose employing the diffusion model, a cutting-edge deep generative model that has showcased remarkable performance in diverse domains, including image and audio processing. Nonetheless, applying this approach to 3D ocean SSFs encounters two primary hurdles. First, the lack of publicly available well-crafted 3D SSF datasets impedes training and evaluation. Second, 3D SSF data consist of multiple 2D layers with varying variances, which can lead to uneven denoising during the reverse process. To surmount these obstacles, we introduce a novel 3D SSF dataset called 3DSSF, specifically designed for training and evaluating deep generative models. In addition, we devise a high-capacity neural architecture for the diffusion model to effectively handle variations in 3D sound speeds. Furthermore, we employ state-of-the-art continuous-time-based optimization method and predictor-corrector scheme for high-performance training and sampling. Notably, this paper presents the first evaluation of the diffusion model's effectiveness in generating 3D SSF data. Numerical experiments validate the proposed method's strong ability to learn the underlying data distribution of 3D SSFs, and highlight its effectiveness in assisting SSF inversion tasks and subsequently characterizing the transmission loss of underwater acoustics.

APA, Harvard, Vancouver, ISO, and other styles

29

Sanjeeva, Polepaka, Vanipenta Balasri Nitin Reddy, Jagirdar Indraj Goud, Aavula Guru Prasad, and Ashish Pathani. "TEXT2AV – Automated Text to Audio and Video Conversion." E3S Web of Conferences 430 (2023): 01027. http://dx.doi.org/10.1051/e3sconf/202343001027.

Full text

Abstract:

The paper aims to develop a machine learning-based system that can automatically convert text to audio and text to video as per the user’s request. Suppose Reading a large text is difficult for anyone, but this TTS model makes it easy by converting text into audio by producing the audio output by an avatar with lip sync to make it look more attractive and human-like interaction in many languages. The TTS model is built based on Waveform Recurrent Neural Networks (WaveRNN). It is a type of auto-regressive model that predicts future data based on the present. The system identifies the keywords in the input texts and uses diffusion models to generate high-quality video content. The system uses GAN (Generative Adversarial Network) to generate videos. Frame Interpolation is used to combine different frames into two adjacent frames to generate a slow- motion video. WebVid-20M, Image-Net, and Hugging-Face are the datasets used for Text video and LibriTTS corpus, and Lip Sync are the dataset used for text-to-audio. The System provides a user-friendly and automated platform to the user which takes text as input and produces either a high-quality audio or high-resolution video quickly and efficiently.

APA, Harvard, Vancouver, ISO, and other styles

30

Nakano, Tomoyasu, Kazuyoshi Yoshii, and Masataka Goto. "Musical Similarity and Commonness Estimation Based on Probabilistic Generative Models of Musical Elements." International Journal of Semantic Computing 10, no. 01 (March 2016): 27–52. http://dx.doi.org/10.1142/s1793351x1640002x.

Full text

Abstract:

This paper proposes a novel concept we call musical commonness, which is the similarity of a song to a set of songs; in other words, its typicality. This commonness can be used to retrieve representative songs from a set of songs (e.g. songs released in the 80s or 90s). Previous research on musical similarity has compared two songs but has not evaluated the similarity of a song to a set of songs. The methods presented here for estimating the similarity and commonness of polyphonic musical audio signals are based on a unified framework of probabilistic generative modeling of four musical elements (vocal timbre, musical timbre, rhythm, and chord progression). To estimate the commonness, we use a generative model trained from a song set instead of estimating musical similarities of all possible song-pairs by using a model trained from each song. In experimental evaluation, we used two song-sets: 3278 Japanese popular music songs and 415 English songs. Twenty estimated song-pair similarities for each element and each song-set were compared with ratings by a musician. The comparison with the results of the expert ratings suggests that the proposed methods can estimate musical similarity appropriately. Estimated musical commonnesses are evaluated on basis of the Pearson product-moment correlation coefficients between the estimated commonness of each song and the number of songs having high similarity with the song. Results of commonness evaluation show that a song having higher commonness is similar to songs of a song set.

APA, Harvard, Vancouver, ISO, and other styles

31

Smith, Jason, and Jason Freeman. "Effects of Deep Neural Networks on the Perceived Creative Autonomy of a Generative Musical System." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 17, no. 1 (October 4, 2021): 91–98. http://dx.doi.org/10.1609/aiide.v17i1.18895.

Full text

Abstract:

Collaborative AI agents allow for human-computer collaboration in interactive software. In creative spaces such as musical performance, they are able to exhibit creative autonomy through independent actions and decision-making. These systems, called co-creative systems, autonomously control some aspects of the creative process while a human musician manages others. When users perceive a co-creative system to be more autonomous, they may be willing to cede more creative control to it, leading to an experience that users may find more expressive and engaging. This paper describes the design and implementation of a co-creative musical system that captures gestural motion and uses that motion to filter pre-existing audio content. The system hosts two neural network architectures, enabling comparison of their use as a collaborative musical agent. This paper also presents a preliminary study in which subjects recorded short musical performances using this software while alternating between deep and shallow models. The analysis includes a comparison of users' perceptions of the two models' creative roles and the models' impact on the subjects' sense of self-expression.

APA, Harvard, Vancouver, ISO, and other styles

32

De Silva, Daswin, Shalinka Jayatilleke, Mona El-Ayoubi, Zafar Issadeen, Harsha Moraliyage, and Nishan Mills. "The Human-Centred Design of a Universal Module for Artificial Intelligence Literacy in Tertiary Education Institutions." Machine Learning and Knowledge Extraction 6, no. 2 (May 18, 2024): 1114–25. http://dx.doi.org/10.3390/make6020051.

Full text

Abstract:

Generative Artificial Intelligence (AI) is heralding a new era in AI for performing a spectrum of complex tasks that are indistinguishable from humans. Alongside language and text, Generative AI models have been built for all other modalities of digital data, image, video, audio, and code. The full extent of Generative AI and its opportunities, challenges, contributions, and risks are still being explored by academic researchers, industry practitioners, and government policymakers. While this deep understanding of Generative AI continues to evolve, the lack of fluency, literacy, and effective interaction with Generative and conventional AI technologies are common challenges across all domains. Tertiary education institutions are uniquely positioned to address this void. In this article, we present the human-centred design of a universal AI literacy module, followed by its four primary constructs that provide core competence in AI to coursework and research students and academic and professional staff in a tertiary education setting. In comparison to related work in AI literacy, our design is inclusive due to the collaborative approach between multiple stakeholder groups and is comprehensive given the descriptive formulation of the primary constructs of this module with exemplars of how they activate core operational competence across the four groups.

APA, Harvard, Vancouver, ISO, and other styles

33

Lopez Duarte, Alvaro E. "Algorithmic interactive music generation in videogames." SoundEffects - An Interdisciplinary Journal of Sound and Sound Experience 9, no. 1 (January 22, 2020): 38–59. http://dx.doi.org/10.7146/se.v9i1.118245.

Full text

Abstract:

In this article, I review the concept of algorithmic generative and interactive music and discuss the advantages and challenges of its implementation in videogames. Excessive repetition caused by low interactivity in music sequences through gameplay has been tackled primarily by using random or sequential containers, coupled with overlapping rules and adaptive mix parameters, as demonstrated in the Dynamic Music Units in Audiokinetic’s Wwise middleware. This approach provides a higher variety through re-combinatorial properties of music tracks and also a responsive and interactive music stream. However, it mainly uses prerecorded music sequences that reappear and are easy to recognize throughout gameplay. Generative principles such as single-seed design have been occasionally applied in game music scoring to generate material. Some of them are complemented with rules and are assigned to sections with low emotional requirements, but support for real-time interaction in gameplay situations, although desirable, is rarely found.While algorithmic note-by-note generation can offer interactive flexibility and infinite diversity, it poses significant challenges such as achieving human-like performativity and producing a distinctive narrative style through measurable parameters or program arguments. Starting with music generation, I examine conceptual implementations and technical challenges of algorithmic composition studies that use Markov models, a-life/evolutionary music, generative grammars, agents, and artificial neural networks/deep learning. For each model, I evaluate rule-based strategies for interactive music transformation using parameters provided by contextual gameplay situations. Finally, I propose a compositional tool design based in modular instances of algorithmic music generation, featuring stylistic interactive control in connection with an audio engine rendering system.

APA, Harvard, Vancouver, ISO, and other styles

34

Arandas, Luís, Miguel Carvalhais, and Mick Grierson. "Computing Short Films Using Language-Guided Diffusion and Vocoding Through Virtual Timelines of Summaries." INSAM Journal of Contemporary Music, Art and Technology, no. 10 (July 15, 2023): 71–89. http://dx.doi.org/10.51191/issn.2637-1898.2023.6.10.71.

Full text

Abstract:

Language-guided generative models are increasingly used in audiovisual production. Image diffusion allows for the development of video sequences and some of its coordination can be established by text prompts. This research automates a video production pipeline leveraging CLIP-guidance with longform text inputs and a separate text-to-speech system. We introduce a method for producing frame-accurate video and audio summaries using a virtual timeline and document a set of video outputs with diverging parameters. Our approach was applied in the production of the film Irreplaceable Biography and contributes to a future where multimodal generative architectures are set as underlying mechanisms to establish visual sequences in time. We contribute to a practice where language modelling is part of a shared and learned representation which can support professional video production, specifically used as a vehicle throughout the composition process as potential videography in physical space.

APA, Harvard, Vancouver, ISO, and other styles

35

Dong, Zhongping, Yan Xu, Andrew Abel, and Dong Wang. "Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features." Applied Sciences 14, no. 2 (January 17, 2024): 798. http://dx.doi.org/10.3390/app14020798.

Full text

Abstract:

In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive lowdimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.

APA, Harvard, Vancouver, ISO, and other styles

36

Yakura, Hiromu, Youhei Akimoto, and Jun Sakuma. "Generate (Non-Software) Bugs to Fool Classifiers." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 1070–78. http://dx.doi.org/10.1609/aaai.v34i01.5457.

Full text

Abstract:

In adversarial attacks intended to confound deep learning models, most studies have focused on limiting the magnitude of the modification so that humans do not notice the attack. On the other hand, during an attack against autonomous cars, for example, most drivers would not find it strange if a small insect image were placed on a stop sign, or they may overlook it. In this paper, we present a systematic approach to generate natural adversarial examples against classification models by employing such natural-appearing perturbations that imitate a certain object or signal. We first show the feasibility of this approach in an attack against an image classifier by employing generative adversarial networks that produce image patches that have the appearance of a natural object to fool the target model. We also introduce an algorithm to optimize placement of the perturbation in accordance with the input image, which makes the generation of adversarial examples fast and likely to succeed. Moreover, we experimentally show that the proposed approach can be extended to the audio domain, for example, to generate perturbations that sound like the chirping of birds to fool a speech classifier.

APA, Harvard, Vancouver, ISO, and other styles

37

Mu, Jin. "Pose Estimation-Assisted Dance Tracking System Based on Convolutional Neural Network." Computational Intelligence and Neuroscience 2022 (June 3, 2022): 1–10. http://dx.doi.org/10.1155/2022/2301395.

Full text

Abstract:

In the field of music-driven, computer-assisted dance movement generation, traditional music movement adaptations and statistical mapping models have the following problems: Firstly, the dance sequences generated by the model are not powerful enough to fit the music itself. Secondly, the integrity of the dance movements produced is not sufficient. Thirdly, it is necessary to improve the suppleness and rationality of long-term dance sequences. Fourthly, traditional models cannot produce new dance movements. How to create smooth and complete dance gesture sequences after music is a problem that needs to be investigated in this paper. To address these problems, we design a deep learning dance generation algorithm to extract the association between sound and movement characteristics. During the feature extraction phase, rhythmic features extracted from music and audio beat features are used as musical features, and coordinates of the main points of human bones extracted from dance videos are used for training as movement characteristics. During the model building phase, the model’s generator module is used to achieve a basic mapping of music and dance movements and to generate gentle dance gestures. The identification module is used to achieve consistency between dance and music. The self-encoder module is used to make the audio function more representative. Experimental results on the DeepFashion dataset show that the generated model can synthesize the new view of the target person in any human posture of a given posture, complete the transformation of different postures of the same person, and retain the external features and clothing textures of the target person. Using a whole-to-detail generation strategy can improve the final video composition. For the problem of incoherent character movements in video synthesis, we propose to optimize the character movements by using a generative adversarial network, specifically by inserting generated motion compensation frames into the incoherent movement sequences to improve the smoothness of the synthesized video.

APA, Harvard, Vancouver, ISO, and other styles

38

Vejare, Ritvij, Abhishek Vaish, Kapish Singh, and Mrunali Desai. "Removal of Image Steganography using Generative Adversarial Network." Indian Journal of Artificial Intelligence and Neural Networking 2, no. 4 (June 30, 2022): 6–10. http://dx.doi.org/10.54105/ijainn.d1054.062422.

Full text

Abstract:

Secret messages can be concealed in ordinary media like audio, video and images. This is called as Steganography. Steganography is used by cyber attackers to send malicious content that could harm victims. Digital steganography, or steganography in photographs, is exceedingly difficult to detect. The detection of steganography in images, has been investigated in thoroughly by a variety of parties. The use of steganographic techniques to send more malware to a compromised host in order to undertake different post-exploitation operations that affect the exploited system. Many steganalysis algorithms, on the other hand, are limited to working with a subset of all potential photos in the wild or have a high false positive rate. As a result, barring any suspected image becomes an arbitrary policy. Filtering questionable photos before they are received by the host machine is a more practical policy. In this paper, a Generative Adversarial Network based model is proposed that may be optimized to delete steganographic content while maintaining the original image's perceptual quality. For removing steganography from photos while keeping the maximum visual image quality, a model is built utilizing a combination of Generative Adversarial Network (GAN) and Image Processing. In the future, utilizing a generator to synthesize a picture will become more popular, and detection of steganography in images will become very difficult. In comparison to other models that have been addressed further, the proposed model is able to give a mean square error of 5.4204 between the generated image and the cover image, as well as better outcomes based on several metrics. As a result, a GAN-based steganography eradication method will aid in this endeavor.

APA, Harvard, Vancouver, ISO, and other styles

39

Saeed, Tabish, Aneeqa Ijaz, Ismail Sadiq, Haneya Naeem Qureshi, Ali Rizwan, and Ali Imran. "An AI-Enabled Bias-Free Respiratory Disease Diagnosis Model Using Cough Audio." Bioengineering 11, no. 1 (January 5, 2024): 55. http://dx.doi.org/10.3390/bioengineering11010055.

Full text

Abstract:

Cough-based diagnosis for respiratory diseases (RDs) using artificial intelligence (AI) has attracted considerable attention, yet many existing studies overlook confounding variables in their predictive models. These variables can distort the relationship between cough recordings (input data) and RD status (output variable), leading to biased associations and unrealistic model performance. To address this gap, we propose the Bias-Free Network (RBF-Net), an end-to-end solution that effectively mitigates the impact of confounders in the training data distribution. RBF-Net ensures accurate and unbiased RD diagnosis features, emphasizing its relevance by incorporating a COVID-19 dataset in this study. This approach aims to enhance the reliability of AI-based RD diagnosis models by navigating the challenges posed by confounding variables. A hybrid of a Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks is proposed for the feature encoder module of RBF-Net. An additional bias predictor is incorporated in the classification scheme to formulate a conditional Generative Adversarial Network (c-GAN) that helps in decorrelating the impact of confounding variables from RD prediction. The merit of RBF-Net is demonstrated by comparing classification performance with a State-of-The-Art (SoTA) Deep Learning (DL) model (CNN-LSTM) after training on different unbalanced COVID-19 data sets, created by using a large-scale proprietary cough data set. RBF-Net proved its robustness against extremely biased training scenarios by achieving test set accuracies of 84.1%, 84.6%, and 80.5% for the following confounding variables—gender, age, and smoking status, respectively. RBF-Net outperforms the CNN-LSTM model test set accuracies by 5.5%, 7.7%, and 8.2%, respectively.

APA, Harvard, Vancouver, ISO, and other styles

40

Singh, Aman, and Tokunbo Ogunfunmi. "An Overview of Variational Autoencoders for Source Separation, Finance, and Bio-Signal Applications." Entropy 24, no. 1 (December 28, 2021): 55. http://dx.doi.org/10.3390/e24010055.

Full text

Abstract:

Autoencoders are a self-supervised learning system where, during training, the output is an approximation of the input. Typically, autoencoders have three parts: Encoder (which produces a compressed latent space representation of the input data), the Latent Space (which retains the knowledge in the input data with reduced dimensionality but preserves maximum information) and the Decoder (which reconstructs the input data from the compressed latent space). Autoencoders have found wide applications in dimensionality reduction, object detection, image classification, and image denoising applications. Variational Autoencoders (VAEs) can be regarded as enhanced Autoencoders where a Bayesian approach is used to learn the probability distribution of the input data. VAEs have found wide applications in generating data for speech, images, and text. In this paper, we present a general comprehensive overview of variational autoencoders. We discuss problems with the VAEs and present several variants of the VAEs that attempt to provide solutions to the problems. We present applications of variational autoencoders for finance (a new and emerging field of application), speech/audio source separation, and biosignal applications. Experimental results are presented for an example of speech source separation to illustrate the powerful application of variants of VAE: VAE, β-VAE, and ITL-AE. We conclude the paper with a summary, and we identify possible areas of research in improving performance of VAEs in particular and deep generative models in general, of which VAEs and generative adversarial networks (GANs) are examples.

APA, Harvard, Vancouver, ISO, and other styles

41

Marchi, Erik, Fabio Vesperini, Stefano Squartini, and Björn Schuller. "Deep Recurrent Neural Network-Based Autoencoders for Acoustic Novelty Detection." Computational Intelligence and Neuroscience 2017 (2017): 1–14. http://dx.doi.org/10.1155/2017/4694860.

Full text

Abstract:

In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or state-space models. Only recent studies introduced (pseudo-)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short term frame are predicted from the previous frames by means of Long-Short Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as activation signal to detect novel events. There is no evidence of studies focused on comparing previous efforts to automatically recognize novel events from audio signals and giving a broad and in depth evaluation of recurrent neural network-based autoencoders. The present contribution aims to consistently evaluate our recent novel approaches to fill this white spot in the literature and provide insight by extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and state-of-the-art methods, the article shows how RNN-based autoencoders outperform statistical approaches up to an absolute improvement of 16.4% averageF-measure over the three databases.

APA, Harvard, Vancouver, ISO, and other styles

42

Wang, Tianmeng. "Research and Application Analysis of Correlative Optimization Algorithms for GAN." Highlights in Science, Engineering and Technology 57 (July 11, 2023): 141–47. http://dx.doi.org/10.54097/hset.v57i.9992.

Full text

Abstract:

Generative Adversarial Networks (GANs) have been one of the most successful deep learning architectures in recent years, providing a powerful way to model high-dimensional data such as images, audio, and text data. GANs use two neural networks, generator and discriminator, to generate samples that resemble real data. The generator tries to create realistic looking samples while the discriminator tries to differentiate the generated samples from real ones. Through this adversarial training process, the generator learns to produce high-quality samples indistinguishable from the real ones.Different optimization algorithms have been utilized in GAN research, including different types of loss functions and regularization techniques, to improve the performance of GANs. Some of the most significant recent developments in GANs include M-DCGAN, which stands for multi-scale deep convolutional generative adversarial network, designed for image dataset augmentation; StackGAN, which is a text-to-image generation technique designed to produce high-resolution images with fine details and BigGAN, a scaled-up version of GAN that has shown improved performance in generating high-fidelity images.Moreover, the potential applications of GANs are vast and cross-disciplinary. They have been applied in various fields such as image and video synthesis, data augmentation, image translation, and style transfer. GANs also show promise in extending their use to healthcare, finance, and creative art fields. Despite their significant advancements and promising applications, GANs face several challenges such as mode collapse, vanishing gradients, and instability, which need to be addressed to achieve better performance and broader applicability.In conclusion, this review gives insights into the current state-of-the-art in GAN research, discussing its core ideas, structure, optimization techniques, applications, and challenges faced. This knowledge aims to help researchers and practitioners alike to understand the current GAN models' strengths and weaknesses and guide future GAN developments. As GANs continue to evolve, they have the potential to transform the way we understand and generate complex datasets across various fields.

APA, Harvard, Vancouver, ISO, and other styles

43

Zagallo, Patricia, Shanice Meddleton, and Molly S. Bolger. "Teaching Real Data Interpretation with Models (TRIM): Analysis of Student Dialogue in a Large-Enrollment Cell and Developmental Biology Course." CBE—Life Sciences Education 15, no. 2 (June 2016): ar17. http://dx.doi.org/10.1187/cbe.15-11-0239.

Full text

Abstract:

We present our design for a cell biology course to integrate content with scientific practices, specifically data interpretation and model-based reasoning. A 2-yr research project within this course allowed us to understand how students interpret authentic biological data in this setting. Through analysis of written work, we measured the extent to which students’ data interpretations were valid and/or generative. By analyzing small-group audio recordings during in-class activities, we demonstrated how students used instructor-provided models to build and refine data interpretations. Often, students used models to broaden the scope of data interpretations, tying conclusions to a biological significance. Coding analysis revealed several strategies and challenges that were common among students in this collaborative setting. Spontaneous argumentation was present in 82% of transcripts, suggesting that data interpretation using models may be a way to elicit this important disciplinary practice. Argumentation dialogue included frequent co-construction of claims backed by evidence from data. Other common strategies included collaborative decoding of data representations and noticing data patterns before making interpretive claims. Focusing on irrelevant data patterns was the most common challenge. Our findings provide evidence to support the feasibility of supporting students’ data-interpretation skills within a large lecture course.

APA, Harvard, Vancouver, ISO, and other styles

44

Journal, IJSREM. "Deep Fake Face Detection Using Deep Learning Tech with LSTM." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 02 (February 8, 2024): 1–10. http://dx.doi.org/10.55041/ijsrem28624.

Full text

Abstract:

The fabrication of extremely life like spoof films and pictures that are getting harder to tell apart from actual content is now possible because to the quick advancement of deep fake technology. A number of industries, including cybersecurity, politics, and journalism, are greatly impacted by the widespread use of deepfakes, which seriously jeopardizes the accuracy of digital media. In computer vision, machine learning, and digital forensics, detecting deepfakes has emerged as a crucial topic for study and development. An outline of the most recent cutting-edge methods and difficulties in deep fake detection is given in this abstract. In this article, we go over the fundamental ideas behind deepfake creation and investigate the many approaches used to spot and stop the spread of fake news. Methods include sophisticated machine learning algorithms trained on enormous datasets of real and fake media, as well as conventional forensic investigation. We explore the principal characteristics and artifacts that differentiate authentic video from deepfakes, such as disparities in audio-visual synchronization, aberrant eye movements, and inconsistent facial emotions. Convolutional neural networks (CNNs) and generative adversarial networks (GANs), two deep learning frameworks, have been used by researchers to create sophisticated detection models that can recognize minute modifications in multimedia information. The fast developments in deep fake generating techniques, however, continue to exceed efforts in detection and mitigation, making deep fake detection a daunting problem. The issue is made worse by the democratization of deepfake technology and its accessibility to non-experts, which calls for creative solutions and multidisciplinary cooperation to counter this expanding danger. Keywords: convolutional neural network,, generative adversarial network, deep fake ,long short term memory,

APA, Harvard, Vancouver, ISO, and other styles

45

Bonde, Lossan. "A Conceptual Design of a Generative Artificial Intelligence System for Education." International Journal of Research and Innovation in Applied Science IX, no. IV (2024): 457–69. http://dx.doi.org/10.51584/ijrias.2024.904034.

Full text

Abstract:

Artificial Intelligence (AI) and, more specifically, Generative Artificial Intelligence (GenAI) are among the most trending disruptive technologies, offering unprecedented possibilities for innovation in education by customising learning paths, fostering creativity, and bridging knowledge gaps. Gen AI is a subfield of AI which generates content (text, image, audio or video) that is comparable to human-generated content. It uses complex Machine Learning (ML) models and Neural Networks (NN) trained on large datasets. The launch of ChatGPT in November 2022 has significantly raised the standards for generative artificial intelligence capabilities. Since then, researchers and practitioners have extensively investigated the potential applications of this technology. Consequently, GenAI has been applied in various sectors, including business and marketing, media and entertainment, arts and design, research and innovation, and education. While it is evident that GenAI has a high potential transformative power on education, its actual application in classical educational settings such as schools and universities remains limited to individual (student or teacher) trials without institutional guidance and support. The aim of this is expressed in two points: (1) to explore the advances in GenAI and provide guidelines on leveraging the technology in the context of learning and teaching. (2) to identify the main barriers to adopting GenAI in education and how to address them when building a Gen AI application for education. The paper’s main contribution is conceptualising a Gen AI system for learning and teaching. Such a system produces course content, exercises, and supplementary materials tailored to the curriculum and each student’s learning pace and style. This system will enhance comprehension and retention, making education more accessible and effective. Secondarily, the research addresses the ethical issues and other barriers to adopting GenAI in education.

APA, Harvard, Vancouver, ISO, and other styles

46

В. В., Кабанова,, and Логунова, О. С. "Application of artificial intelligence in multimedia." Cherepovets State University Bulletin, no. 6(111) (December 27, 2022): 23–41. http://dx.doi.org/10.23859/1994-0637-2022-6-111-2.

Full text

Abstract:

Целью данной работы является изучение и обобщение существующих задач, методов анализа и обработки изображений, видеопотоков, аудиофайлов с применением искусственного интеллекта для дальнейшего развития направления. В основной части работы рассматривается принцип работы полносвязной нейронной сети, приводится пример, указываются основные типы нейронных сетей и ссылки на работы по тематике, описываются и анализируются разработки в области искусственного интеллекта и мультимедиа. В работе проведен литературный обзор научных трудов за последние 5 лет. Тезисно раскрывается суть генеративного и дискриминативного моделирования, определяется проблема, решаемая генеративно-состязательными сетями. Рассматривается применение нейронных сетей при генерации монофонической и полифонической музыки, определении жанра мелодии, при распознавании и классификации образов на изображении, стилизации изображений и генерации новых изображений на основе набора данных и описания на английском языке, при различных манипуляциях с лицом на изображении: морфинг лица, ретушь лица, генерирование уникальных лиц и обмен идентичностью, а также при использовании глубокого обучения в медицине. При этом кратко описываются модели сетей, используемые при различных манипуляциях, представленных в работе. Определяются сферы использования сверточных нейронных сетей, рекуррентных нейронных сетей, а также описываются основные характеристики и отличительные особенности моделей СNN, RNN, GAN. Также рассматривается создание deepfake-видео и их угроза обществу, методы распознавания deepfake-видео. Определяются перспективы генеративного моделирования и искусственного интеллекта при работе с мультимедийной информацией, подчеркивается важность нейронных сетей для общества. The aim of the work is to study and generalize existing tasks, methods for analyzing and processing images, video streams and audio files applying artificial intelligence for further development of the direction. The main part of the work considers the principle of a fully connected neural network, gives examples, indicating the main types of neural networks and references to works on the subject, describing and analyzing developments in the field of artificial intelligence and multimedia. The paper provides a literature review of scientific papers over the past 5 years. The authors highlight the essence of generative and discriminative modeling; determine the problem solved by generative adversarial networks. They also focus on the application of neural networks in monophonic and polyphonic music generation, melody genre identification, image recognition and classification, image stylization and new image generation based on data set and description in English, face manipulation in images: face morphing, face attribute, generation of unique faces and identity swap and also applying deep learning in medicine. In doing so, the network models used in the various manipulations presented in the paper are briefly described. The application spheres of convolutional neural networks, recurrent neural networks, as well as the main characteristics and distinctive features of CNN, RNN, GAN models are described. The paper also discusses the development of deepfake videos and their threat to society, as well as methods of deepfake video recognition. The authors determine the prospects of generative modeling and artificial intelligence when dealing with multimedia information; emphasize the importance of neural networks for society.

APA, Harvard, Vancouver, ISO, and other styles

47

Grassucci, Eleonora, Danilo Comminiello, and Aurelio Uncini. "An Information-Theoretic Perspective on Proper Quaternion Variational Autoencoders." Entropy 23, no. 7 (July 3, 2021): 856. http://dx.doi.org/10.3390/e23070856.

Full text

Abstract:

Variational autoencoders are deep generative models that have recently received a great deal of attention due to their ability to model the latent distribution of any kind of input such as images and audio signals, among others. A novel variational autoncoder in the quaternion domain H, namely the QVAE, has been recently proposed, leveraging the augmented second order statics of H-proper signals. In this paper, we analyze the QVAE under an information-theoretic perspective, studying the ability of the H-proper model to approximate improper distributions as well as the built-in H-proper ones and the loss of entropy due to the improperness of the input signal. We conduct experiments on a substantial set of quaternion signals, for each of which the QVAE shows the ability of modelling the input distribution, while learning the improperness and increasing the entropy of the latent space. The proposed analysis will prove that proper QVAEs can be employed with a good approximation even when the quaternion input data are improper.

APA, Harvard, Vancouver, ISO, and other styles

48

Ignatious, Henry Alexander, Hesham El-Sayed, and Salah Bouktif. "IFGAN—A Novel Image Fusion Model to Fuse 3D Point Cloud Sensory Data." Journal of Sensor and Actuator Networks 13, no. 1 (February 7, 2024): 15. http://dx.doi.org/10.3390/jsan13010015.

Full text

Abstract:

To enhance the level of autonomy in driving, it is crucial to ensure optimal execution of critical maneuvers in all situations. However, numerous accidents involving autonomous vehicles (AVs) developed by major automobile manufacturers in recent years have been attributed to poor decision making caused by insufficient perception of environmental information. AVs employ diverse sensors in today’s technology-driven settings to gather this information. However, due to technical and natural factors, the data collected by these sensors may be incomplete or ambiguous, leading to misinterpretation by AVs and resulting in fatal accidents. Furthermore, environmental information obtained from multiple sources in the vehicular environment often exhibits multimodal characteristics. To address this limitation, effective preprocessing of raw sensory data becomes essential, involving two crucial tasks: data cleaning and data fusion. In this context, we propose a comprehensive data fusion engine that categorizes various sensory data formats and appropriately merges them to enhance accuracy. Specifically, we suggest a general framework to combine audio, visual, and textual data, building upon our previous research on an innovative hybrid image fusion model that fused multispectral image data. However, this previous model faced challenges when fusing 3D point cloud data and handling large volumes of sensory data. To overcome these challenges, our study introduces a novel image fusion model called Image Fusion Generative Adversarial Network (IFGAN), which incorporates a multi-scale attention mechanism into both the generator and discriminator of a Generative Adversarial Network (GAN). The primary objective of image fusion is to merge complementary data from various perspectives of the same scene to enhance the clarity and detail of the final image. The multi-scale attention mechanism serves two purposes: the first, capturing comprehensive spatial information to enable the generator to focus on foreground and background target information in the sensory data, and the second, constraining the discriminator to concentrate on attention regions rather than the entire input image. Furthermore, the proposed model integrates the color information retention concept from the previously proposed image fusion model. Furthermore, we propose simple and efficient models for extracting salient image features. We evaluate the proposed models using various standard metrics and compare them with existing popular models. The results demonstrate that our proposed image fusion model outperforms the other models in terms of performance.

APA, Harvard, Vancouver, ISO, and other styles

49

Zielinski, Chris, Margaret Winker, Rakesh Aggarwal, Lorraine Ferris, Markus Heinemann, Jose Florencio Lapeña, Sanjay Pai, et al. "Chatbots, Generative AI, and Scholarly Manuscripts: WAME Recommendations on Chatbots and Generative Artificial Intelligence in Relation to Scholarly Publications Revised May 31, 2023." Philippine Journal of Otolaryngology Head and Neck Surgery 38, no. 1 (June 4, 2023): 7. http://dx.doi.org/10.32412/pjohns.v38i1.2135.

Full text

Abstract:

Introduction This statement revises our earlier “WAME Recommendations on ChatGPT and Chatbots in Relation to Scholarly Publications” (January 20, 2023). The revision reflects the proliferation of chatbots and their expanding use in scholarly publishing over the last few months, as well as emerging concerns regarding lack of authenticity of content when using chatbots. These Recommendations are intended to inform editors and help them develop policies for the use of chatbots in papers published in their journals. They aim to help authors and reviewers understand how best to attribute the use of chatbots in their work, and to address the need for all journal editors to have access to manuscript screening tools. In this rapidly evolving field, we will continue to modify these recommendations as the software and its applications develop. A chatbot is a tool “[d]riven by [artificial intelligence], automated rules, natural-language processing (NLP), and machine learning (ML)…[to] process data to deliver responses to requests of all kinds.”1 Artificial intelligence (AI) is “the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.”2 “Generative modeling is an artificial intelligence technique that generates synthetic artifacts by analyzing training examples; learning their patterns and distribution; and then creating realistic facsimiles. Generative AI (GAI) uses generative modeling and advances in deep learning (DL) to produce diverse content at scale by utilizing existing media such as text, graphics, audio, and video.”3, 4 Chatbots are activated by a plain-language instruction, or “prompt,” provided by the user. They generate responses using statistical and probability-based language models.5 This output has some characteristic properties. It is usually linguistically accurate and fluent but, to date, it is often compromised in various ways. For example, chatbot output currently carries the risk of including biases, distortions, irrelevancies, misrepresentations, and plagiarism many of which are caused by the algorithms governing its generation and heavily dependent on the contents of the materials used in its training. Consequently, there are concerns about the effects of chatbots on knowledge creation and dissemination – including their potential to spread and amplify mis- and disinformation6 – and their broader impact on jobs and the economy, as well as the health of individuals and populations. New legal issues have also arisen in connection with chatbots and generative AI.7 Chatbots retain the information supplied to them, including content and prompts, and may use this information in future responses. Therefore, scholarly content that is generated or edited using AI would be retained and as a result, could potentially appear in future responses, further increasing the risk of inadvertent plagiarism on the part of the user and any future users of the technology. Anyone who needs to maintain confidentiality of a document, including authors, editors, and reviewers, should be aware of this issue before considering using chatbots to edit or generate work.9 Chatbots and their applications illustrate the powerful possibilities of generative AI, as well as the risks. These Recommendations seek to suggest a workable approach to valid concerns about the use of chatbots in scholarly publishing.

APA, Harvard, Vancouver, ISO, and other styles

50

Liu, Jinglin, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11020–28. http://dx.doi.org/10.1609/aaai.v36i10.21350.

Full text

Abstract:

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial network (GAN) to reconstruct the acoustic features, while they suffer from over-smoothing and unstable training issues respectively, which hinder the naturalness of synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts the noise into mel-spectrogram conditioned on the music score. By implicitly optimizing variational bound, DiffSinger can be stably trained and generate realistic outputs. To further improve the voice quality and speed up inference, we introduce a shallow diffusion mechanism to make better use of the prior knowledge learned by the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, according to the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. Besides, we propose boundary prediction methods to locate the intersection and determine the shallow step adaptively. The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech). Audio samples: https://diffsinger.github.io. Codes: https://github.com/MoonInTheRiver/DiffSinger.

APA, Harvard, Vancouver, ISO, and other styles

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!