
Journal articles on the topic 'Automated audio captioning'

Consult the top 30 journal articles for your research on the topic 'Automated audio captioning.'


1

Bokhove, Christian, and Christopher Downey. "Automated generation of ‘good enough’ transcripts as a first step to transcription of audio-recorded data." Methodological Innovations 11, no. 2 (2018): 205979911879074. http://dx.doi.org/10.1177/2059799118790743.

Abstract:
In the last decade, automated captioning services have appeared in mainstream technology use. Until now, the focus of these services has been on the technical aspects, supporting pupils with special educational needs and supporting teaching and learning of second-language students. Only limited explorations have been attempted regarding their use for research purposes: transcription of audio recordings. This article presents a proof-of-concept exploration utilising three examples of automated transcription of audio recordings from different contexts: an interview, a public hearing and a classroom
2

P. Jayanth, K. Lakshmi Sree, K. Karthik Kumar Reddy, G. Om Prakash, and G. Reddy Prasad. "Vision-to-Voice: AI for generating Description & Audio of Visual Content." International Research Journal of Innovations in Engineering and Technology 09, Special Issue ICCIS (2025): 206–13. https://doi.org/10.47001/irjiet/2025.iccis-202533.

Abstract:
The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advancement at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper explores the development of an end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, thereby enabling AI systems to narrate visual content for human users. The proposed methodology integrates Transformer-based image captioning
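Read as a pipeline, this abstract describes three stages: scene understanding, description generation, and speech synthesis. Below is a minimal sketch of that flow, assuming a publicly available Hugging Face image-captioning checkpoint and the pyttsx3 offline speech engine as stand-ins for the authors' own models.

```python
# Hypothetical Vision-to-Voice sketch: image -> caption -> spoken description.
# Assumes the `transformers`, `Pillow`, and `pyttsx3` packages; the checkpoint
# below is a public stand-in, not the model from the cited paper.
from transformers import pipeline
import pyttsx3

def vision_to_voice(image_path: str) -> str:
    # Stages 1-2: visual scene understanding + semantic description generation.
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    caption = captioner(image_path)[0]["generated_text"]

    # Stage 3: speech synthesis of the generated description.
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()
    return caption

if __name__ == "__main__":
    print(vision_to_voice("street_scene.jpg"))  # hypothetical input image
```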
3

Sejal Pawar, Shruti Mulay, Jivani Suryawanshi, Vaishnavi Walgude, and K. V. Patil. "Enhancing Traffic Scene and Understanding Through Image Captioning and Audio." International Research Journal on Advanced Engineering and Management (IRJAEM) 6, no. 07 (2024): 2423–29. http://dx.doi.org/10.47392/irjaem.2024.0349.

Abstract:
Navigating significant collections of traffic images on the net provides a tremendous task, mainly for users in search of particular facts. Many images lack captions, making it difficult to locate applicable content. Our undertaking addresses this difficulty by way of developing an automated labelling service that generates object-based totally description and gives auditory cues about their distances, the usage of a combination of computer vision and audio description techniques. With this, automation has become the need of the hour. The use of automation in motor vehicles is one similar area
4

Haapaniemi, Riku, Annamaria Mesaros, Manu Harju, Irene Martín Morató, and Maija Hirvonen. "Primerjava semiotične konceptualizacije prevoda z besedilom, ki ga tvori UI" [Comparing the semiotic conceptualisation of translation with text produced by AI]. STRIDON: Journal of Studies in Translation and Interpreting 4, no. 1 (2024): 25–51. http://dx.doi.org/10.4312/stridon.4.1.25-51.

Abstract:
Using a semiotically-informed material approach to the study of translation, this paper analyses an artificial intelligence (AI) system developed for automatic audio captioning (AAC), which is the automated production of written descriptions for non-lingual environmental sounds. Comparing human and AI text production processes against a semiotic framework suggests that AI uses computational methods to reach textual outcomes which humans arrive at through semiotic means. Our analysis of sound description examples produced by an AAC system makes it apparent that this distinction is useful in art
5

Kala, Sankalp, and Sridhar Ranganathan. "Deep Learning Based Lipreading for Video Captioning." Engineering and Technology Journal 9, no. 05 (2024): 3935–46. https://doi.org/10.5281/zenodo.11120548.

Abstract:
Visual speech recognition, often referred to as lipreading, has garnered significant attention in recent years due to its potential applications in various fields such as human-computer interaction, accessibility technology, and biometric security systems. This paper explores the challenges and advancements in the field of lipreading, which involves deciphering speech from visual cues, primarily movements of the lips, tongue, and teeth. Despite being an essential aspect of human communication, lipreading presents inherent difficulties, especially in noisy environments or when contextual inform
6

Koenecke, Allison, Andrew Nam, Emily Lake, et al. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117, no. 14 (2020): 7684–89. http://dx.doi.org/10.1073/pnas.1915768117.

Abstract:
Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the popu
7

Mirzaei, Maryam Sadat, Kourosh Meshgi, Yuya Akita, and Tatsuya Kawahara. "Partial and synchronized captioning: A new tool to assist learners in developing second language listening skill." ReCALL 29, no. 2 (2017): 178–99. http://dx.doi.org/10.1017/s0958344017000039.

Abstract:
This paper introduces a novel captioning method, partial and synchronized captioning (PSC), as a tool for developing second language (L2) listening skills. Unlike conventional full captioning, which provides the full text and allows comprehension of the material merely by reading, PSC promotes listening to the speech by presenting a selected subset of words, where each word is synched to its corresponding speech signal. In this method, word-level synchronization is realized by an automatic speech recognition (ASR) system, dedicated to the desired corpora. This feature allows the learne
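The selection-and-synchronisation idea can be illustrated with a small sketch: given word-level timestamps from an ASR system, display only the words a learner is least likely to catch. The frequency table and cut-off below are illustrative assumptions, not PSC's actual selection criteria.

```python
# Hypothetical sketch of partial captioning: keep only low-frequency words,
# each paired with its ASR-derived time span. The frequency table and the
# threshold are illustrative stand-ins for PSC's real selection criteria.
from dataclasses import dataclass

@dataclass
class TimedWord:
    word: str
    start: float  # seconds, from ASR word-level alignment
    end: float

# Toy corpus frequencies (per million words) -- illustrative only.
FREQ = {"the": 50000, "results": 300, "corroborate": 4, "hypothesis": 25}

def partial_caption(words: list[TimedWord], max_freq: float = 100.0) -> list[TimedWord]:
    """Return only the words rare enough to be worth displaying."""
    return [w for w in words if FREQ.get(w.word.lower(), 0.0) <= max_freq]

if __name__ == "__main__":
    asr_output = [
        TimedWord("The", 0.00, 0.12),
        TimedWord("results", 0.12, 0.55),
        TimedWord("corroborate", 0.55, 1.30),
        TimedWord("the", 1.30, 1.40),
        TimedWord("hypothesis", 1.40, 2.05),
    ]
    for w in partial_caption(asr_output):
        print(f"{w.start:5.2f}-{w.end:5.2f}  {w.word}")
```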
8

Guo, Rundong. "Advancing real-time close captioning: blind source separation and transcription for hearing impairments." Applied and Computational Engineering 30, no. 1 (2024): 125–30. http://dx.doi.org/10.54254/2755-2721/30/20230084.

Abstract:
This project investigates the potential of integrating Blind Source Separation (DUET algorithm) and Automatic Speech Recognition (Wav2Vec2 model) for real-time, accurate transcription in multi-speaker scenarios. Specifically targeted towards improving accessibility for individuals with hearing impairments, the project addresses the challenging task of separating and transcribing speech from simultaneous speakers in various contexts. The DUET algorithm effectively separates individual voices from complex audio scenarios, which are then accurately transcribed into text by the machine learning mo
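A hedged sketch of the transcription half of such a pipeline, assuming the separated single-speaker signals are already available (the DUET separation stage is reduced to a placeholder) and using the public facebook/wav2vec2-base-960h checkpoint as a stand-in for the model used in the paper.

```python
# Hypothetical sketch: transcribe already-separated speaker channels with
# Wav2Vec2. The separation step (DUET in the cited work) is a placeholder here.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL = "facebook/wav2vec2-base-960h"  # stand-in checkpoint, expects 16 kHz audio
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL)

def transcribe(waveform: torch.Tensor, sample_rate: int = 16000) -> str:
    """Transcribe one separated, single-speaker waveform (1-D float tensor)."""
    inputs = processor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

def transcribe_mixture(separated_sources: list[torch.Tensor]) -> list[str]:
    # `separated_sources` would come from a blind-source-separation front end
    # (e.g. DUET); here it is simply assumed to be available.
    return [transcribe(src) for src in separated_sources]
```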
9

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation." Applied Mathematics and Sciences: An International Journal (MathSJ) 10, no. 1/2 (2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains unknown amounts of speech from unknown speakers and unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This
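The role of the similarity threshold can be made concrete with a small clustering sketch over speaker embeddings; the cosine metric, the agglomerative clustering, and the toy data below are assumptions for illustration, not the paper's mathematical formulation.

```python
# Hypothetical sketch: group speech segments by speaker using a distance
# threshold over embeddings. The threshold plays the role of the similarity
# cut-off whose optimisation the cited paper formalises.
# Requires scikit-learn >= 1.2 (older versions use `affinity=` instead of `metric=`).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings: np.ndarray, similarity_threshold: float = 0.75) -> np.ndarray:
    """Assign a speaker label to each segment embedding (rows of `embeddings`)."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                                # number of speakers unknown
        metric="cosine",
        linkage="average",
        distance_threshold=1.0 - similarity_threshold,  # similarity -> distance
    )
    return clustering.fit_predict(embeddings)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spk_a = rng.normal(0.0, 0.05, size=(5, 16)) + 1.0   # toy "speaker A" segments
    spk_b = rng.normal(0.0, 0.05, size=(5, 16)) - 1.0   # toy "speaker B" segments
    print(diarize(np.vstack([spk_a, spk_b])))
```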
10

Nam, Somang, and Deborah Fels. "Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models." International Journal of Semantic Computing 13, no. 01 (2019): 45–65. http://dx.doi.org/10.1142/s1793351x19400038.

Abstract:
As a primary user group, Deaf or Hard of Hearing (D/HOH) audiences use Closed Captioning (CC) service to enjoy the TV programs with audio by reading text. However, the D/HOH communities are not completely satisfied with the quality of CC even though the government regulators entail certain rules in the CC quality factors. The measure of the CC quality is often interpreted as an accuracy on translation and regulators use the empirical models to assess. The need of a subjective quality scale comes from the gap in between current empirical assessment models and the audience perceived quality. It
11

Zhang, Ruijing. "A Comparative Analysis of LSTM and Transformer-based Automatic Speech Recognition Techniques." Transactions on Computer Science and Intelligent Systems Research 5 (August 12, 2024): 272–76. http://dx.doi.org/10.62051/zq6v0d49.

Abstract:
Automatic Speech Recognition (ASR) is a technology that leverages artificial intelligence to convert spoken language into written text. It utilizes machine learning algorithms, specifically deep learning models, to analyze audio signals and extract linguistic features. This technology has revolutionized the way that people interact with voice-enabled devices, enabling efficient and accurate transcription of human speech in various applications, including voice assistants, captioning, and transcription services. Among previous works for ASR, Long Short-Term Memory (LSTM) networks and Transforme
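The architectural contrast drawn here can be sketched as two interchangeable acoustic encoders that map frame-level features to hidden states; the layer counts and dimensions are arbitrary illustration values, not those of the systems the paper compares.

```python
# Hypothetical sketch of the two encoder families compared for ASR:
# a recurrent (LSTM) encoder vs. a self-attention (Transformer) encoder.
# Both map a sequence of acoustic frames to a sequence of hidden states.
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        out, _ = self.lstm(feats)
        return out                                            # (B, T, 2*hidden)

class TransformerAcousticEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256,
                 heads: int = 4, layers: int = 3):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        return self.encoder(self.proj(feats))                 # (B, T, d_model)

if __name__ == "__main__":
    frames = torch.randn(2, 120, 80)  # batch of 2 utterances, 120 frames each
    print(LSTMEncoder()(frames).shape, TransformerAcousticEncoder()(frames).shape)
```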
12

Gotmare, Abhay, Gandharva Thite, and Laxmi Bewoor. "A multimodal machine learning approach to generate news articles from geo-tagged images." International Journal of Electrical and Computer Engineering (IJECE) 14, no. 3 (2024): 3434. http://dx.doi.org/10.11591/ijece.v14i3.pp3434-3442.

Abstract:
Classical machine learning algorithms typically operate on unimodal data and hence it can analyze and make predictions based on data from a single source (modality). Whereas multimodal machine learning algorithm, learns from information across multiple modalities, such as text, images, audio, and sensor data. The paper leverages the functionalities of multimodal machine learning (ML) application for generating text from images. The proposed work presents an innovative multimodal algorithm that automates the creation of news articles from geo-tagged images by leveraging cutting-edge development
13

Verma, Neeta. "Assistive Vision Technology using Deep Learning Techniques." International Journal for Research in Applied Science and Engineering Technology 9, no. VII (2021): 2695–704. http://dx.doi.org/10.22214/ijraset.2021.36815.

Abstract:
One of the most important functions of the human visual system is automatic captioning. Caption generation is one of the more interesting and focused areas of AI, with numerous challenges to overcome. If there is an application that automatically captions the scenes in which a person is present and converts the caption into a clear message, people will benefit from it in a variety of ways. In this, we offer a deep learning model that detects things or features in images automatically, produces descriptions for the images, and transforms the descriptions to audio for louder readout. The model u
14

Gotmare, Abhay, Gandharva Thite, and Laxmi Bewoor. "A multimodal machine learning approach to generate news articles from geo-tagged images." International Journal of Electrical and Computer Engineering (IJECE) 14, no. 3 (2024): 3434–42. https://doi.org/10.11591/ijece.v14i3.pp3434-3442.

Abstract:
Classical machine learning algorithms typically operate on unimodal data and hence it can analyze and make predictions based on data from a single source (modality). Whereas multimodal machine learning algorithm, learns from information across multiple modalities, such as text, images, audio, and sensor data. The paper leverages the functionalities of multimodal machine learning (ML) application for generating text from images. The proposed work presents an innovative multimodal algorithm that automates the creation of news articles from geo-tagged images by
15

Eren, Aysegul Ozkaya, and Mustafa Sert. "Automated Audio Captioning with Topic Modeling." IEEE Access, 2023, 1. http://dx.doi.org/10.1109/access.2023.3235733.

16

Xiao, Feiyang, Jian Guan, Qiaoxi Zhu, and Wenwu Wang. "Graph Attention for Automated Audio Captioning." IEEE Signal Processing Letters, 2023, 1–5. http://dx.doi.org/10.1109/lsp.2023.3266114.

17

Mei, Xinhao, Xubo Liu, Mark D. Plumbley, and Wenwu Wang. "Automated audio captioning: an overview of recent progress and new challenges." EURASIP Journal on Audio, Speech, and Music Processing 2022, no. 1 (2022). http://dx.doi.org/10.1186/s13636-022-00259-2.

Abstract:
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly faci
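As a heavily simplified illustration of the encoder–decoder formulation the survey covers, the sketch below pairs a Transformer audio encoder over log-mel frames with a Transformer caption decoder; all sizes are illustrative assumptions rather than any specific system reviewed in the paper.

```python
# Hypothetical minimal automated-audio-captioning model: an audio encoder over
# log-mel frames feeds a Transformer decoder that emits caption tokens.
import torch
import torch.nn as nn

class TinyAudioCaptioner(nn.Module):
    def __init__(self, vocab_size: int, n_mels: int = 64, d_model: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)          # frame-wise encoder input
        enc_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """mels: (B, T, n_mels) log-mel frames; tokens: (B, L) caption prefix."""
        memory = self.encoder(self.audio_proj(mels))
        tgt = self.embed(tokens)
        L = tokens.size(1)
        # Causal mask so each caption position only attends to earlier tokens.
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                                # (B, L, vocab_size)

if __name__ == "__main__":
    model = TinyAudioCaptioner(vocab_size=1000)
    logits = model(torch.randn(2, 300, 64), torch.randint(0, 1000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 1000])
```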
18

Bagheri, Majid Haji, Emma Gu, Asif Abdullah Khan, et al. "Machine Learning‐Enabled Triboelectric Nanogenerator for Continuous Sound Monitoring and Captioning." Advanced Sensor Research, January 8, 2025. https://doi.org/10.1002/adsr.202400156.

Abstract:
Advancements in live audio processing, specifically in sound classification and audio captioning technologies, have widespread applications ranging from surveillance to accessibility services. However, traditional methods encounter scalability and energy efficiency challenges. To overcome these, Triboelectric Nanogenerators (TENG) are explored for energy harvesting, particularly in live‐streaming sound monitoring systems. This study introduces a sustainable methodology integrating TENG‐based sensors into live sound monitoring pipelines, enhancing energy‐efficient sound classification a
19

Won, Hyejin, Baekseung Kim, Il-Youp Kwak, and Changwon Lim. "Using various pre-trained models for audio feature extraction in automated audio captioning." Expert Systems with Applications, June 2023, 120664. http://dx.doi.org/10.1016/j.eswa.2023.120664.

20

Poongodi, M., Mounir Hamdi, and Huihui Wang. "Image and audio caps: automated captioning of background sounds and images using deep learning." Multimedia Systems, February 26, 2022. http://dx.doi.org/10.1007/s00530-022-00902-0.

Abstract:
Image recognition based on computers is something human beings have been working on for many years. It is one of the most difficult tasks in the field of computer science, and improvements to this system are made when we speak. In this paper, we propose a methodology to automatically propose an appropriate title and add a specific sound to the image. Two models have been extensively trained and combined to achieve this effect. Sounds are recommended based on the image scene and the headings are generated using a combination of natural language processing and state-of-the-art computer v
21

Kala, Sankalp, and Sridhar Ranganathan. "Deep Learning Based Lipreading for Video Captioning." Engineering and Technology Journal 09, no. 05 (2024). http://dx.doi.org/10.47191/etj/v9i05.08.

Abstract:
Visual speech recognition, often referred to as lipreading, has garnered significant attention in recent years due to its potential applications in various fields such as human-computer interaction, accessibility technology, and biometric security systems. This paper explores the challenges and advancements in the field of lipreading, which involves deciphering speech from visual cues, primarily movements of the lips, tongue, and teeth. Despite being an essential aspect of human communication, lipreading presents inherent difficulties, especially in noisy environments or when contextual inform
22

Gencyilmaz, Izel Zeynep, and Kürşat Mustafa Karaoğlan. "Optimizing Speech to Text Conversion in Turkish: An Analysis of Machine Learning Approaches." Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, March 20, 2024. http://dx.doi.org/10.17798/bitlisfen.1434925.

Abstract:
The Conversion of Speech to Text (CoST) is crucial for developing automated systems to understand and process voice commands. Studies have focused on developing this task, especially for Turkish-specific voice commands, a strategic language in the international arena. However, researchers face various challenges, such as Turkish's suffixed structure, phonological features and unique letters, dialect and accent differences, word stress, word-initial vowel effects, background noise, gender-based sound variations, and dialectal differences. To address the challenges above, this study aims to conv
23

Lucia-Mulas, Maria Jose, Pablo Revuelta-Sanz, Belen Ruiz-Mezcua, and Israel Gonzalez-Carrasco. "Automatic music emotion classification model for movie soundtrack subtitling based on neuroscientific premises." Applied Intelligence, September 1, 2023. http://dx.doi.org/10.1007/s10489-023-04967-w.

Abstract:
The ability of music to induce emotions has been arousing a lot of interest in recent years, especially due to the boom in music streaming platforms and the use of automatic music recommenders. Music Emotion Recognition approaches are based on combining multiple audio features extracted from digital audio samples and different machine learning techniques. In these approaches, neuroscience results on musical emotion perception are not considered. The main goal of this research is to facilitate the automatic subtitling of music. The authors approached the problem of automatic musical emo
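A minimal sketch of the generic feature-plus-classifier setup such approaches build on; the MFCC features, the random-forest classifier, and the label set are illustrative stand-ins, not the neuroscience-informed model proposed in the paper.

```python
# Hypothetical sketch: classify the emotion of a music clip from aggregated
# MFCC features. Features, classifier, and label set are illustrative only.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

EMOTIONS = ["happy", "sad", "tense", "calm"]  # assumed label set

def clip_features(path: str) -> np.ndarray:
    """Summarise a clip as the mean and standard deviation of its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train(paths: list[str], labels: list[str]) -> RandomForestClassifier:
    X = np.stack([clip_features(p) for p in paths])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)
    return clf

def predict(clf: RandomForestClassifier, path: str) -> str:
    return clf.predict(clip_features(path)[None, :])[0]
```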
24

Bochner, Joseph, Mark Indelicato, and Pralhad Konnur. "Effects of Sound Quality on the Accuracy of Telephone Captions Produced by Automatic Speech Recognition: A Preliminary Investigation." American Journal of Audiology, December 14, 2022, 1–8. http://dx.doi.org/10.1044/2022_aja-22-00102.

Abstract:
Purpose: Automatic speech recognition (ASR) is commonly used to produce telephone captions to provide telecommunication access for individuals who are d/Deaf and hard of hearing (DHH). However, little is known about the effects of degraded telephone audio on the intelligibility of ASR captioning. This research note investigates the accuracy of telephone captions produced by ASR under degraded audio conditions. Method: Packet loss, delay, and repetition are common sources of degradation in sound quality for telephone audio. Eleven sets of wideband filtered sentences were degraded by high and lo
25

Starr, Kim Linda, Sabine Braun, and Jaleh Delfani. "Taking a Cue From the Human." Journal of Audiovisual Translation 3, no. 2 (2020). http://dx.doi.org/10.47476/jat.v3i2.2020.138.

Abstract:
Human beings find the process of narrative sequencing in written texts and moving imagery a relatively simple task. Key to the success of this activity is establishing coherence by using critical cues to identify key characters, objects, actions and locations as they contribute to plot development.
 In the drive to make audiovisual media more widely accessible (through audio description), and media archives more searchable (through content description), computer vision experts strive to automate video captioning in order to supplement human description activities. Existing models for auto
26

Kuhn, Korbinian, Verena Kersken, Benedikt Reuter, Niklas Egger, and Gottfried Zimmermann. "Measuring the Accuracy of Automatic Speech Recognition Solutions." ACM Transactions on Accessible Computing, December 8, 2023. http://dx.doi.org/10.1145/3636513.

Abstract:
For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available - but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time the DHH community reports serious issues with the accuracy and reliability o
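Accuracy claims of this kind ultimately come down to an error-rate computation between reference and ASR transcripts. Below is a minimal sketch using the jiwer package; the light normalisation shown is a common convention, not necessarily the study's protocol.

```python
# Hypothetical sketch: word error rate (WER) between a human reference caption
# and an ASR hypothesis, using the jiwer package. Normalisation is illustrative.
import jiwer

def caption_wer(reference: str, hypothesis: str) -> float:
    # Lower-case both sides so capitalisation alone is not counted as an error.
    return jiwer.wer(reference.lower(), hypothesis.lower())

if __name__ == "__main__":
    ref = "captioning is an essential accessibility tool"
    hyp = "captioning is essential accessibility tool"
    print(f"WER = {caption_wer(ref, hyp):.2f}")  # one deletion over six reference words
```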
27

Hekanaho, Laura, Maija Hirvonen, and Tuomas Virtanen. "Language-based machine perception: linguistic perspectives on the compilation of captioning datasets." Digital Scholarship in the Humanities, June 21, 2024. http://dx.doi.org/10.1093/llc/fqae029.

Abstract:
Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods with a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets with an interest in understanding the linguistic factors that affe
28

Man, Xin, Jie Shao, Feiyu Chen, Mingxing Zhang, and Heng Tao Shen. "TEVL: Trilinear Encoder for Video-Language Representation Learning." ACM Transactions on Multimedia Computing, Communications, and Applications, February 24, 2023. http://dx.doi.org/10.1145/3585388.

Abstract:
Pre-training model on large-scale unlabeled web videos followed by task-specific fine-tuning is a canonical approach to learning video and language representations. However, the accompanying Automatic Speech Recognition (ASR) transcripts in these videos are directly transcribed from audio, which may be inconsistent with visual information and would impair the language modeling ability of the model. Meanwhile, previous V-L models fuse visual and language modality features using single- or dual-stream architectures, which are not suitable for the current situation. Besides, traditional V-L resea
29

Ellis, Katie, Mike Kent, and Gwyneth Peaty. "Captioned Recorded Lectures as a Mainstream Learning Tool." M/C Journal 20, no. 3 (2017). http://dx.doi.org/10.5204/mcj.1262.

Abstract:
In Australian universities, many courses provide lecture notes as a standard learning resource; however, captions and transcripts of these lectures are not usually provided unless requested by a student through dedicated disability support officers (Worthington). As a result, to date their use has been limited. However, while the requirement for—and benefits of—captioned online lectures for students with disabilities is widely recognised, these captions or transcripts might also represent further opportunity for a personalised approach to learning for the mainstream student population (Podszeb
30

Burwell, Catherine. "New(s) Readers: Multimodal Meaning-Making in AJ+ Captioned Video." M/C Journal 20, no. 3 (2017). http://dx.doi.org/10.5204/mcj.1241.

Abstract:
In 2013, Facebook introduced autoplay video into its newsfeed. In order not to produce sound disruptive to hearing users, videos were muted until a user clicked on them to enable audio. This move, recognised as a competitive response to the popularity of video-sharing sites like YouTube, has generated significant changes to the aesthetics, form, and modalities of online video. Many video producers have incorporated captions into their videos as a means of attracting and maintaining user attention. Of course, captions are not simply a replacement or translation of sound, but have in