Academic literature on the topic 'Automated audio captioning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Automated audio captioning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Automated audio captioning"

1

Bokhove, Christian, and Christopher Downey. "Automated generation of ‘good enough’ transcripts as a first step to transcription of audio-recorded data." Methodological Innovations 11, no. 2 (2018): 205979911879074. http://dx.doi.org/10.1177/2059799118790743.

Abstract:
In the last decade, automated captioning services have appeared in mainstream technology use. Until now, the focus of these services has been on the technical aspects, supporting pupils with special educational needs and supporting teaching and learning of second language students. Only limited explorations have been attempted regarding its use for research purposes: transcription of audio recordings. This article presents a proof-of-concept exploration utilising three examples of automated transcription of audio recordings from different contexts; an interview, a public hearing and a classroom …
2

Jayanth, P., K. Lakshmi Sree, K. Karthik Kumar Reddy, G. Om Prakash, and G. Reddy Prasad. "Vision-to-Voice: AI for generating Description & Audio of Visual Content." International Research Journal of Innovations in Engineering and Technology 09, Special Issue ICCIS (2025): 206–13. https://doi.org/10.47001/irjiet/2025.iccis-202533.

Abstract:
The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advancement at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper explores the development of an end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, thereby enabling AI systems to narrate visual content for human users. The proposed methodology integrates Transformer-based image captioning …
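
As a rough illustration of the kind of pipeline this abstract describes, the sketch below chains an off-the-shelf image-captioning model with a text-to-speech step. It is not the authors' implementation: the model name, file names, and the choice of the transformers and gTTS packages are assumptions made here purely for illustration.

    # Minimal Vision-to-Voice sketch: image -> caption -> spoken audio.
    # Assumes `transformers`, `Pillow`, and `gTTS` are installed; the model name
    # and file paths are illustrative, not those used in the cited paper.
    from transformers import pipeline
    from gtts import gTTS

    def vision_to_voice(image_path, audio_path):
        # Transformer-based image captioning (BLIP used here as a stand-in).
        captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
        caption = captioner(image_path)[0]["generated_text"]
        # Speech synthesis of the generated description.
        gTTS(text=caption, lang="en").save(audio_path)
        return caption

    if __name__ == "__main__":
        print(vision_to_voice("street_scene.jpg", "description.mp3"))
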
3

Pawar, Sejal, Shruti Mulay, Jivani Suryawanshi, Vaishnavi Walgude, and K. V. Patil. "Enhancing Traffic Scene and Understanding Through Image Captioning and Audio." International Research Journal on Advanced Engineering and Management (IRJAEM) 6, no. 07 (2024): 2423–29. http://dx.doi.org/10.47392/irjaem.2024.0349.

Abstract:
Navigating large collections of traffic images on the web is a tremendous task, especially for users searching for particular information. Many images lack captions, making it difficult to locate relevant content. Our project addresses this difficulty by developing an automated labelling service that generates object-based descriptions and gives auditory cues about object distances, using a combination of computer vision and audio description techniques. With this, automation has become the need of the hour. The use of automation in motor vehicles is one similar area …
4

Haapaniemi, Riku, Annamaria Mesaros, Manu Harju, Irene Martín Morató, and Maija Hirvonen. "Primerjava semiotične konceptualizacije prevoda z besedilom, ki ga tvori UI" [A comparison of the semiotic conceptualization of translation with text generated by AI]. STRIDON: Journal of Studies in Translation and Interpreting 4, no. 1 (2024): 25–51. http://dx.doi.org/10.4312/stridon.4.1.25-51.

Abstract:
Using a semiotically-informed material approach to the study of translation, this paper analyses an artificial intelligence (AI) system developed for automatic audio captioning (AAC), which is the automated production of written descriptions for non-lingual environmental sounds. Comparing human and AI text production processes against a semiotic framework suggests that AI uses computational methods to reach textual outcomes which humans arrive at through semiotic means. Our analysis of sound description examples produced by an AAC system makes it apparent that this distinction is useful in art …
5

Sankalp, Kala, and Sridhar Ranganathan. "Deep Learning Based Lipreading for Video Captioning." Engineering and Technology Journal 9, no. 05 (2024): 3935–46. https://doi.org/10.5281/zenodo.11120548.

Abstract:
Visual speech recognition, often referred to as lipreading, has garnered significant attention in recent years due to its potential applications in various fields such as human-computer interaction, accessibility technology, and biometric security systems. This paper explores the challenges and advancements in the field of lipreading, which involves deciphering speech from visual cues, primarily movements of the lips, tongue, and teeth. Despite being an essential aspect of human communication, lipreading presents inherent difficulties, especially in noisy environments or when contextual information …
6

Koenecke, Allison, Andrew Nam, Emily Lake, et al. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117, no. 14 (2020): 7684–89. http://dx.doi.org/10.1073/pnas.1915768117.

Abstract:
Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population …
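
The disparity question raised in this abstract ultimately comes down to comparing error rates across speaker subgroups. Below is a minimal sketch of such a comparison using the jiwer package; the transcripts and group labels are invented for illustration and are not data from the cited study.

    # Sketch: compare ASR word error rate (WER) across speaker subgroups.
    # Uses the `jiwer` package; the transcripts below are made-up examples.
    from collections import defaultdict
    import jiwer

    # (subgroup, reference transcript, ASR hypothesis) triples -- illustrative only.
    samples = [
        ("group_a", "turn the lights off in the kitchen", "turn the lights off in the kitchen"),
        ("group_a", "what is the weather tomorrow", "what is the weather tomorrow"),
        ("group_b", "turn the lights off in the kitchen", "turn the light of in the kitchen"),
        ("group_b", "what is the weather tomorrow", "what is whether tomorrow"),
    ]

    refs, hyps = defaultdict(list), defaultdict(list)
    for group, ref, hyp in samples:
        refs[group].append(ref)
        hyps[group].append(hyp)

    for group in sorted(refs):
        # jiwer.wer accepts lists of reference and hypothesis strings.
        print(group, round(jiwer.wer(refs[group], hyps[group]), 3))
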
7

Mirzaei, Maryam Sadat, Kourosh Meshgi, Yuya Akita, and Tatsuya Kawahara. "Partial and synchronized captioning: A new tool to assist learners in developing second language listening skill." ReCALL 29, no. 2 (2017): 178–99. http://dx.doi.org/10.1017/s0958344017000039.

Abstract:
This paper introduces a novel captioning method, partial and synchronized captioning (PSC), as a tool for developing second language (L2) listening skills. Unlike conventional full captioning, which provides the full text and allows comprehension of the material merely by reading, PSC promotes listening to the speech by presenting a selected subset of words, where each word is synched to its corresponding speech signal. In this method, word-level synchronization is realized by an automatic speech recognition (ASR) system, dedicated to the desired corpora. This feature allows the learner …
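
The core of the method described above is selecting a subset of words and displaying each one in sync with its speech interval. The sketch below illustrates only that selection step; the word timings are hard-coded stand-ins for what an ASR or forced-alignment pass would produce, and the stop-word list and length heuristic are assumptions, not the paper's actual word-selection criteria.

    # Sketch of partial and synchronized captioning: show only a selected
    # subset of words, each tied to its speech interval.
    word_timings = [                      # (word, start_s, end_s), illustrative
        ("the", 0.00, 0.12), ("committee", 0.12, 0.70), ("postponed", 0.70, 1.30),
        ("the", 1.30, 1.40), ("vote", 1.40, 1.75), ("indefinitely", 1.75, 2.60),
    ]

    # Very common words the learner is assumed to know; everything else is shown.
    easy_words = {"the", "a", "an", "is", "are", "of", "to", "and"}

    def partial_captions(words, min_len=6):
        """Keep words that are uncommon or long; drop the rest from the caption."""
        return [(w, s, e) for w, s, e in words
                if w.lower() not in easy_words or len(w) >= min_len]

    for word, start, end in partial_captions(word_timings):
        print(f"{start:5.2f}-{end:5.2f}s  {word}")
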
8

Guo, Rundong. "Advancing real-time close captioning: blind source separation and transcription for hearing impairments." Applied and Computational Engineering 30, no. 1 (2024): 125–30. http://dx.doi.org/10.54254/2755-2721/30/20230084.

Abstract:
This project investigates the potential of integrating Blind Source Separation (DUET algorithm) and Automatic Speech Recognition (Wav2Vec2 model) for real-time, accurate transcription in multi-speaker scenarios. Specifically targeted towards improving accessibility for individuals with hearing impairments, the project addresses the challenging task of separating and transcribing speech from simultaneous speakers in various contexts. The DUET algorithm effectively separates individual voices from complex audio scenarios, which are then accurately transcribed into text by the machine learning model …
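
The abstract describes a two-stage design: separate the speakers, then transcribe each separated stream. The sketch below mirrors that structure with the Hugging Face ASR pipeline and a Wav2Vec2 checkpoint; the separation step is only a placeholder for a DUET-style algorithm (not implemented here), and the input file name is illustrative.

    # Two-stage sketch: (1) separate speakers from a stereo mixture, then
    # (2) transcribe each separated source with a Wav2Vec2 ASR model.
    import numpy as np
    import soundfile as sf
    from transformers import pipeline

    def separate_sources(mixture):
        # Placeholder for DUET-style time-frequency masking on a stereo mixture.
        # A real implementation would return one mono signal per speaker;
        # here we just downmix so the script stays runnable end to end.
        return [mixture.mean(axis=1) if mixture.ndim > 1 else mixture]

    asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

    mixture, sample_rate = sf.read("meeting_stereo.wav")   # illustrative file name
    for i, source in enumerate(separate_sources(mixture)):
        result = asr({"raw": source.astype(np.float32), "sampling_rate": sample_rate})
        print(f"speaker {i}: {result['text']}")
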
9

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation." Applied Mathematics and Sciences: An International Journal (MathSJ) 10, no. 1/2 (2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains unknown amounts of speech from unknown speakers and an unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This …
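
The optimization in this paper concerns where to place the similarity threshold that decides whether two speech segments come from the same speaker. Below is a minimal sketch of how such a threshold drives agglomerative clustering of segment embeddings; the random embeddings and the SciPy average-linkage recipe on cosine distances are assumptions for illustration, not the paper's formulation.

    # Sketch: a similarity threshold controlling speaker clustering in diarization.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    # Six segment embeddings: three near +1 ("speaker A"), three near -1 ("speaker B").
    embeddings = np.vstack([rng.normal(0, 0.05, (3, 16)) + 1.0,
                            rng.normal(0, 0.05, (3, 16)) - 1.0])

    similarity_threshold = 0.5                       # tunable: higher finds more speakers
    distances = pdist(embeddings, metric="cosine")   # cosine distance = 1 - similarity
    labels = fcluster(linkage(distances, method="average"),
                      t=1.0 - similarity_threshold, criterion="distance")
    print("segment -> speaker:", labels)             # e.g. [1 1 1 2 2 2]
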
10

Nam, Somang, and Deborah Fels. "Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models." International Journal of Semantic Computing 13, no. 01 (2019): 45–65. http://dx.doi.org/10.1142/s1793351x19400038.

Abstract:
As a primary user group, Deaf or Hard of Hearing (D/HOH) audiences use Closed Captioning (CC) services to enjoy TV programs with audio by reading text. However, the D/HOH communities are not completely satisfied with the quality of CC even though government regulators impose certain rules on CC quality factors. The measure of CC quality is often interpreted as translation accuracy, and regulators use empirical models to assess it. The need for a subjective quality scale comes from the gap between current empirical assessment models and the audience-perceived quality. It …

Dissertations / Theses on the topic "Automated audio captioning"

1

Labbé, Etienne. "Description automatique des événements sonores par des méthodes d'apprentissage profond" [Automatic description of sound events using deep learning methods]. Electronic Thesis or Diss., Université de Toulouse (2023-....), 2024. http://www.theses.fr/2024TLSES054.

Abstract:
In the audio domain, most machine learning systems focus on recognizing a restricted set of sound events. However, when a machine interacts with real-world data, it must be able to handle far more varied and complex situations. To address this problem, annotators resort to natural language, which can summarize any acoustic information. Automated Audio Captioning (AAC; in French, Description Textuelle Automatique de l'Audio, DTAA) was recently introduced in order to develop systems …
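
For readers new to the task, the sketch below shows the typical shape of an automated audio captioning model: an audio front-end producing time-frequency features and an autoregressive text decoder that attends to them. It is a toy, untrained model written for illustration; the dimensions, vocabulary size, and PyTorch/torchaudio building blocks are assumptions, not the architecture studied in the thesis.

    # Conceptual AAC sketch: log-mel frames encoded, then a transformer decoder
    # emits the caption token by token while cross-attending to the audio.
    import torch
    import torch.nn as nn
    import torchaudio

    class TinyAACModel(nn.Module):
        def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
            super().__init__()
            self.frontend = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
            self.proj = nn.Linear(n_mels, d_model)        # frame features -> model dim
            self.embed = nn.Embedding(vocab_size, d_model)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, waveform, tokens):
            mel = self.frontend(waveform).transpose(1, 2)  # (batch, frames, n_mels)
            memory = self.proj(torch.log1p(mel))           # encoded audio frames
            hidden = self.decoder(self.embed(tokens), memory)
            return self.out(hidden)                        # logits per caption token

    model = TinyAACModel()
    audio = torch.randn(1, 16000)                          # one second of fake audio
    logits = model(audio, torch.zeros(1, 8, dtype=torch.long))
    print(logits.shape)                                    # torch.Size([1, 8, 5000])
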

Book chapters on the topic "Automated audio captioning"

1

M., Nivedita, Asnath Victy Phamila Y., Umashankar Kumaravelan, and Karthikeyan N. "Voice-Based Image Captioning System for Assisting Visually Impaired People Using Neural Networks." In Principles and Applications of Socio-Cognitive and Affective Computing. IGI Global, 2022. http://dx.doi.org/10.4018/978-1-6684-3843-5.ch011.

Abstract:
Many people worldwide have the problem of visual impairment. The authors' idea is to design a novel image captioning model for assisting blind people by using a deep learning-based architecture. Automatic understanding of an image and providing a description of that image involves tasks from two complex fields: computer vision and natural language processing. The first task is to correctly identify objects along with their attributes present in the given image, and the next is to connect all the identified objects along with actions and generate the statements, which should be syntactically correct. From the real-time video, the features are extracted using a convolutional neural network (CNN), and the feature vectors are given as input to a long short-term memory (LSTM) network to generate the appropriate captions in a natural language (English). The captions can then be converted into audio files, which visually impaired people can listen to. The model is tested on the two standardized image captioning datasets Flickr 8K and MSCOCO and evaluated using the BLEU score.
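
The pipeline this chapter describes (CNN feature extraction feeding an LSTM caption generator, with the caption then spoken aloud) can be outlined as below. This is a schematic, untrained skeleton assuming PyTorch and torchvision; the ResNet-18 backbone, dimensions, and vocabulary size are illustrative choices, not the chapter's exact configuration.

    # Skeleton of a CNN-encoder / LSTM-decoder image captioner.
    import torch
    import torch.nn as nn
    from torchvision import models

    class CaptionDecoder(nn.Module):
        def __init__(self, feat_dim=512, embed_dim=256, hidden_dim=512, vocab_size=5000):
            super().__init__()
            self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial state
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, features, tokens):
            h0 = torch.tanh(self.init_h(features)).unsqueeze(0)
            c0 = torch.zeros_like(h0)
            hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
            return self.out(hidden)                        # logits over the vocabulary

    cnn = models.resnet18(weights=None)                    # CNN encoder backbone
    cnn.fc = nn.Identity()                                 # drop the classification head

    frame = torch.randn(1, 3, 224, 224)                    # stand-in for a video frame
    features = cnn(frame)                                  # (1, 512) feature vector
    logits = CaptionDecoder()(features, torch.zeros(1, 10, dtype=torch.long))
    print(logits.shape)                                    # torch.Size([1, 10, 5000])
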
2

Venturini, Shamira, Michaela Mae Vann, Martina Pucci, and Giulia M. L. Bencini. "Towards a More Inclusive Learning Environment: The Importance of Providing Captions That Are Suited to Learners’ Language Proficiency in the UDL Classroom." In Studies in Health Technology and Informatics. IOS Press, 2022. http://dx.doi.org/10.3233/shti220884.

Abstract:
Captions have been found to benefit diverse learners, supporting comprehension, memory for content, vocabulary acquisition, and literacy. Captions may, thus, be one feature of universally designed learning (UDL) environments [1, 4]. The primary aim of this study was to examine whether captions are always useful, or whether their utility depends on individual differences, specifically proficiency in the language of the audio. To study this, we presented non-native speakers of English with an audio-visual recording of an unscripted seminar-style lesson in English retrieved from a University website. We assessed English language proficiency with an objective test. To test comprehension, we administered a ten-item comprehension test on the content of the lecture. Our secondary aim was to compare the effects of different types of captions on viewer comprehension. We, therefore, created three viewing conditions: video with no captions (NC), video with premade captions (downloaded from the university website) (UC) and video with automatically generated captions (AC). Our results showed an overall strong effect of proficiency on lecture comprehension, as expected. Interestingly, we also found that whether captions helped or not depended on proficiency and caption type. The captions provided by the University website benefited our learners only if their English language proficiency was high enough. When their proficiency was lower, however, the captions provided by the university were detrimental and performance was worse than having no captions. For the lower proficiency levels, automatic captions (AC) provided the best advantage. We attribute this finding to pre-existing characteristics of the captions provided by the university website. Taken together, these findings caution institutions with a commitment to UDL against thinking that one type of caption suits all. The study highlights the need for testing captioning systems with diverse learners, under different conditions, to better understand what factors are beneficial for whom and when.

Conference papers on the topic "Automated audio captioning"

1

Tan, Liwen, Yi Zhou, Yin Liu, and Wang Chen. "Enhanced Automated Audio Captioning Method Based on Heterogeneous Feature Fusion." In 2024 IEEE International Conference on Software System and Information Processing (ICSSIP). IEEE, 2024. https://doi.org/10.1109/icssip63203.2024.11012230.

2

Kim, Minkyu, Kim Sung-Bin, and Tae-Hyun Oh. "Prefix Tuning for Automated Audio Captioning." In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023. http://dx.doi.org/10.1109/icassp49357.2023.10096877.

3

Drossos, Konstantinos, Sharath Adavanne, and Tuomas Virtanen. "Automated audio captioning with recurrent neural networks." In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017. http://dx.doi.org/10.1109/waspaa.2017.8170058.

4

Chen, Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, and Eng Siong Chng. "Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning." In Interspeech 2022. ISCA, 2022. http://dx.doi.org/10.21437/interspeech.2022-10510.

5

Liu, Weizhuo, and Zhe Gao. "Improving Automated Audio Captioning with LLM Decoder and BEATs Audio Encoder." In CSAI 2024: 2024 8th International Conference on Computer Science and Artificial Intelligence (CSAI). ACM, 2024. https://doi.org/10.1145/3709026.3709118.

6

Kim, Jaeyeon, Jaeyoon Jung, Jinjoo Lee, and Sang Hoon Woo. "EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning." In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024. http://dx.doi.org/10.1109/icassp48485.2024.10446672.

7

Liu, Jizhong, Gang Li, Junbo Zhang, et al. "Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding." In Interspeech 2024. ISCA, 2024. http://dx.doi.org/10.21437/interspeech.2024-65.

8

Ye, Zhongjie, Yuqing Wang, Helin Wang, Dongchao Yang, and Yuexian Zou. "FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning." In 2022 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022. http://dx.doi.org/10.23919/apsipaasc55919.2022.9980325.

9

Koh, Andrew, Soham Tiwari, and Chng Eng Siong. "Automated Audio Captioning with Epochal Difficult Captions for curriculum learning." In 2022 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022. http://dx.doi.org/10.23919/apsipaasc55919.2022.9980242.

10

Wijngaard, Gijs, Elia Formisano, Bruno L. Giordano, and Michel Dumontier. "ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds." In 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023. http://dx.doi.org/10.23919/eusipco58844.2023.10289793.
