Academic literature on the topic 'Image Captioning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Image Captioning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Image Captioning"

1

Bahl, Vasudha, Nidhi Sengar, Gaurav Joshi, and Amita Goel. "Image Captioning System." International Journal for Modern Trends in Science and Technology 6, no. 12 (December 4, 2020): 40–44. http://dx.doi.org/10.46501/ijmtst061208.

Full text
Abstract:
Deep Learning is a relatively new field that has attracted a lot of attention because it recognizes objects with higher accuracy than ever before. NLP is another field that has created a huge impact on our lives: it has come a long way, from producing readable summaries of texts to analyzing mental illness. Image captioning combines both NLP and Deep Learning, and it allows images to be described in a meaningful way. Describing an image does not just mean recognizing objects: to describe an image properly, we first need to identify the objects present in the image and then the relationships between those objects. In this study we use a CNN-LSTM based framework: a CNN extracts features of the image, while an LSTM generates meaningful sentences. This study also discusses applications of image captioning and the major challenges faced in achieving this task.
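The CNN-LSTM pipeline summarized in this abstract is a common captioning baseline. Below is a minimal sketch of such an encoder-decoder, assuming PyTorch and torchvision; the backbone choice and dimensions are illustrative placeholders, not the paper's own code.

```python
# Minimal CNN encoder + LSTM decoder sketch for image captioning (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        resnet = models.resnet50(weights=None)                        # pre-trained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)    # (B, 2048)
        return self.fc(feats)                       # (B, embed_dim)

class LSTMDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([image_feats.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                     # logits over the vocabulary
```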
APA, Harvard, Vancouver, ISO, and other styles
2

Beddiar, Djamila Romaissa, Mourad Oussalah, Tapio Seppänen, and Rachid Jennane. "ACapMed: Automatic Captioning for Medical Imaging." Applied Sciences 12, no. 21 (November 1, 2022): 11092. http://dx.doi.org/10.3390/app122111092.

Full text
Abstract:
Medical image captioning is a very challenging task that has rarely been addressed in the literature on natural image captioning. Some existing image captioning techniques exploit objects present in the image alongside the visual features while generating descriptions. However, this is not possible for medical image captioning, where clinician-like explanations are required in image content descriptions. Motivated by this, this paper proposes using medical concepts associated with images, in combination with their visual features, to generate new captions. Our end-to-end trainable network is composed of a semantic feature encoder based on a multi-label classifier to identify medical concepts related to images, a visual feature encoder, and an LSTM model for text generation. Beam search is employed to ensure the best selection of the next word for a given sequence of words based on the merged features of the medical image. We evaluated our proposal on the ImageCLEF medical captioning dataset, and the results demonstrate the effectiveness and efficiency of the developed approach.
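Beam search, as used here for next-word selection, keeps only the top-scoring partial captions at each step. A minimal sketch follows, assuming a hypothetical `step(seq)` function that returns log-probabilities over candidate next words; none of these names come from the paper's code.

```python
# Generic beam-search decoding sketch (illustrative, not the authors' implementation).
def beam_search(step, start_token, end_token, beam_size=3, max_len=20):
    beams = [([start_token], 0.0)]                  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                # finished beams are carried over as-is
                candidates.append((seq, score))
                continue
            log_probs = step(seq)                   # dict: word -> log-probability
            for word, lp in log_probs.items():
                candidates.append((seq + [word], score + lp))
        # keep only the `beam_size` highest-scoring partial captions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```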
APA, Harvard, Vancouver, ISO, and other styles
3

Upadhyay, Mukund, and Shallu Bashambu. "Image Captioning Bot." International Journal for Modern Trends in Science and Technology 6, no. 12 (December 15, 2020): 348–54. http://dx.doi.org/10.46501/ijmtst061265.

Full text
Abstract:
Image captioning means automatically generating a caption for an image. With the development of deep learning, the combination of computer vision and natural language processing has attracted great attention in the last few years. Image captioning is representative of this field: it makes the computer learn to use one or more sentences to express the visual content of an image. Generating a meaningful description of high-level image semantics requires not only recognition of the objects and the scene, but also the ability to analyze the states, the attributes, and the relationships among these objects. Neural-network-based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets. Following that, discussions of future research directions are presented.
APA, Harvard, Vancouver, ISO, and other styles
4

Mualla, Rasha Mohammed, Jafar Alkheir, and Samer Sulaiman. "Improving the Performance of the Image Captioning Systems Using a Pre-Classification Stage: تحسين أداء أنظمة وصف الصور باستخدام مرحلة التصنيف المسبق للصور." Journal of Engineering Sciences and Information Technology 6, no. 1 (March 27, 2022): 150–64. http://dx.doi.org/10.26389/ajsrp.l270721.

Full text
Abstract:
In this research, we introduce a novel image classification and captioning system that adds a classification layer before the image captioning models. The suggested approach consists of three main steps and is inspired by the state of the art, which suggests that generating captions within small sub-class categories works better than over one large unclassified dataset. First, we collected a dataset from two international datasets (MS-COCO and Flickr2k) comprising 10,778 images, of which 80% is used for training and 20% for validation. Next, the images were classified into 11 classes (10 indoor and outdoor categories and one "Null" category) and fed into a deep learning classifier, which was re-trained on our classes and learned to assign each image to the corresponding category. Finally, each classified image is passed to one of 11 pre-trained class-specific image captioning models, which generates the final caption. The experiments show that adding the pre-classification step before the image captioning stage improves performance significantly, by (8.15% and 8.44%) and (12.7407% and 16.7048%) for Top-1 and Top-5 of the English and Arabic systems, respectively. The classification step achieves a true classification rate of 71.32% and 73.09% for the English and Arabic systems, respectively.
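The pre-classification idea amounts to routing each image to a captioner trained for its predicted class. A minimal sketch under that assumption, with hypothetical `classifier` and `captioners` objects (none of these names come from the paper):

```python
# Route an image to the class-specific captioning model chosen by a classifier (illustrative sketch).
def caption_with_preclassification(image, classifier, captioners, null_captioner):
    """Return a caption generated by the model trained for the image's predicted class."""
    predicted_class = classifier.predict(image)          # e.g. an indoor/outdoor category or "Null"
    model = captioners.get(predicted_class, null_captioner)
    return model.generate(image)
```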
APA, Harvard, Vancouver, ISO, and other styles
5

Yang, Zhenyu, Qiao Liu, and Guojing Liu. "Better Understanding: Stylized Image Captioning with Style Attention and Adversarial Training." Symmetry 12, no. 12 (November 30, 2020): 1978. http://dx.doi.org/10.3390/sym12121978.

Full text
Abstract:
Compared with traditional image captioning technology, stylized image captioning has broader application scenarios, such as providing a better understanding of images. However, stylized image captioning faces many challenges, the most important of which is how to make the model take into account both the image meta-information and the style factor of the generated captions. In this paper, we propose a novel end-to-end stylized image captioning framework (ST-BR). Specifically, we first use a style transformer to model the factual information of images, while the style attention module learns style factors from a multi-style corpus; as a whole, this forms a symmetric structure. At the same time, we use back-reinforcement to evaluate the degree of consistency between the generated stylized captions and the image knowledge and the specified style, respectively. These two parts further enhance the learning ability of the model through adversarial learning. Our experiments achieve strong performance on the benchmark dataset.
APA, Harvard, Vancouver, ISO, and other styles
6

Nivedita, M., and Asnath Victy Phamila Y. "Image Captioning for Spatially Rotated Images in Video Surveillance Applications Using Neural Networks." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 29, Supp02 (December 2021): 193–209. http://dx.doi.org/10.1142/s0218488521400110.

Full text
Abstract:
Video surveillance has become an essential tool in the security industry because of its sophisticated and fool-proof technology. Recent developments in image recognition and captioning have enabled us to adopt these technologies in the field of video surveillance. The biggest problem in image captioning is that it is sensitive to the rotation angle of the image: different angles of the same image generate different captions. We aim to address and eliminate the rotation variance of image captioning. We have implemented a custom image rotation network using a Convolutional Neural Network (CNN). The input image is rotated back to its original angle using this network and passed on to the image captioning model. The caption of the image is then generated and sent to the user for situation analysis.
APA, Harvard, Vancouver, ISO, and other styles
7

Iwamura, Kiyohiko, Jun Younes Louhi Kasahara, Alessandro Moro, Atsushi Yamashita, and Hajime Asama. "Image Captioning Using Motion-CNN with Object Detection." Sensors 21, no. 4 (February 10, 2021): 1270. http://dx.doi.org/10.3390/s21041270.

Full text
Abstract:
Automatic image captioning has many important applications, such as the depiction of visual contents for visually impaired people or the indexing of images on the internet. Recently, deep learning-based image captioning models have been researched extensively. For caption generation, they learn the relation between image features and words included in the captions. However, image features might not be relevant for certain words such as verbs. Therefore, our earlier reported method included the use of motion features along with image features for generating captions including verbs. However, all the motion features were used. Since not all motion features contributed positively to the captioning process, unnecessary motion features decreased the captioning accuracy. As described herein, we use experiments with motion features for thorough analysis of the reasons for the decline in accuracy. We propose a novel, end-to-end trainable method for image caption generation that alleviates the decreased accuracy of caption generation. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. Results demonstrate that our proposed method improves caption generation performance.
APA, Harvard, Vancouver, ISO, and other styles
8

Junaid, Mohd Wasiuddin. "Image Captioning with Face Recognition using Transformers." International Journal for Research in Applied Science and Engineering Technology 10, no. 1 (January 31, 2022): 1426–32. http://dx.doi.org/10.22214/ijraset.2022.40057.

Full text
Abstract:
The process of generating text from images is called image captioning. It requires not only recognition of the objects and the scene but also the ability to analyze their states and identify the relationships among these objects; therefore, image captioning integrates the fields of computer vision and natural language processing. We introduce a novel image captioning model that is capable of recognizing human faces in a given image using a transformer model. The proposed Faster R-CNN-Transformer model architecture comprises feature extraction from images, extraction of semantic keywords from captions, and encoder-decoder transformers. Faster R-CNN is implemented for face recognition, and features are extracted from images using InceptionV3. The model aims to identify and recognize the known faces in the images. The Faster R-CNN module creates a bounding box across each face, which helps in better interpretation of an image and its caption. The dataset used in this model has images with celebrity faces and captions that include the celebrity names, covering 232 celebrities in total. Due to the small size of the dataset, we augmented the images and added 100 images with their corresponding captions to increase the size of the vocabulary for our model. The BLEU and METEOR scores were generated to evaluate the accuracy and quality of the generated captions. Keywords: Image Captioning, Faster R-CNN, Transformers, BLEU score, METEOR score.
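BLEU, mentioned above as an evaluation metric, scores n-gram overlap between a generated caption and reference captions. A minimal sketch using NLTK; the captions below are made-up examples, not data from the paper.

```python
# Sentence-level BLEU for a single caption pair using NLTK (illustrative example).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "riding", "a", "horse", "on", "a", "beach"]]   # list of tokenized references
candidate = ["a", "person", "riding", "a", "horse", "on", "the", "beach"]

smooth = SmoothingFunction().method1          # avoids zero scores for short captions
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```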
APA, Harvard, Vancouver, ISO, and other styles
9

Al-Malla, Muhammad Abdelhadie, Assef Jafar, and Nada Ghneim. "Pre-trained CNNs as Feature-Extraction Modules for Image Captioning." ELCVIA Electronic Letters on Computer Vision and Image Analysis 21, no. 1 (May 10, 2022): 1–16. http://dx.doi.org/10.5565/rev/elcvia.1436.

Full text
Abstract:
In this work, we present a thorough experimental study about feature extraction using Convolutional Neural Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are extracted from the last layer after removing the fully connected layer and fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all the experiments to study the effect of changing the CNN feature extractor on image captioning quality. The scores are calculated using the standard metrics in image captioning. We find a strong relationship between the model structure and the image captioning dataset and prove that VGG models give the least quality for image captioning feature extraction among the tested CNNs. In the end, we recommend a set of pre-trained CNNs for each of the image captioning evaluation metrics we want to optimise, and show the connection between our results and previous works. To our knowledge, this work is the most comprehensive comparison between feature extractors for image captioning.
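Extracting features from the last layer of a pre-trained classification CNN, as the study describes, typically means dropping the fully connected head. A minimal sketch assuming torchvision; ResNet-50 and the image path are placeholders only (the paper compares 12 different CNNs).

```python
# Use a pre-trained CNN as a fixed feature extractor by removing its FC layer (illustrative sketch).
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()   # keep everything up to the pooled features

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)    # placeholder path, shape (1, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(image).flatten(1)            # (1, 2048), fed to the captioning model
```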
APA, Harvard, Vancouver, ISO, and other styles
10

Chang, Yeong-Hwa, Yen-Jen Chen, Ren-Hung Huang, and Yi-Ting Yu. "Enhanced Image Captioning with Color Recognition Using Deep Learning Methods." Applied Sciences 12, no. 1 (December 26, 2021): 209. http://dx.doi.org/10.3390/app12010209.

Full text
Abstract:
Automatically describing the content of an image is an interesting and challenging task in artificial intelligence. In this paper, an enhanced image captioning model—including object detection, color analysis, and image captioning—is proposed to automatically generate the textual descriptions of images. In an encoder–decoder model for image captioning, VGG16 is used as an encoder and an LSTM (long short-term memory) network with attention is used as a decoder. In addition, Mask R-CNN with OpenCV is used for object detection and color analysis. The integration of the image caption and color recognition is then performed to provide better descriptive details of images. Moreover, the generated textual sentence is converted into speech. The validation results illustrate that the proposed method can provide more accurate descriptions of images.
APA, Harvard, Vancouver, ISO, and other styles
More sources

Dissertations / Theses on the topic "Image Captioning"

1

Hoxha, Genc. "Image Captioning for Remote Sensing Image Analysis." Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/351752.

Full text
Abstract:
Image Captioning (IC) aims to generate a coherent and comprehensive textual description that summarizes the complex content of an image. It combines computer vision and natural language processing techniques to encode the visual features of an image and translate them into a sentence. In the context of remote sensing (RS) analysis, IC has been emerging as a new research area of high interest, since it not only recognizes the objects within an image but also describes their attributes and relationships. In this thesis, we propose several IC methods for RS image analysis. We focus on the design of different approaches that take into consideration the peculiarities of RS images (e.g. spectral, temporal and spatial properties) and study the benefits of IC in challenging RS applications. In particular, we focus our attention on developing a new decoder based on support vector machines. Compared to traditional decoders based on deep learning, the proposed decoder is particularly interesting for situations in which only a few training samples are available, alleviating the problem of overfitting. The peculiarity of the proposed decoder is its simplicity and efficiency: it has only one hyperparameter, does not require expensive processing units, and is very fast in terms of training and testing time, making it suitable for real-life applications. Despite the efforts made in developing reliable and accurate IC systems, the task is far from being solved. The generated descriptions are affected by several errors related to the attributes and the objects present in an RS scene. Once an error occurs, it is propagated through the recurrent layers of the decoders, leading to inaccurate descriptions. To cope with this issue, we propose two post-processing techniques that aim to improve the generated sentences by detecting and correcting potential errors. They are based on the Hidden Markov Model and the Viterbi algorithm: the former generates a set of possible states, while the latter finds the optimal sequence of states. The proposed post-processing techniques can be injected into any IC system at test time to improve the quality of the generated sentences. While the captioning systems developed in the RS community are devoted to single RGB images, we propose two captioning systems that can be applied to multitemporal and multispectral RS images. The proposed captioning systems are able to describe the changes that have occurred in a given geographical area through time. We refer to this new paradigm of analysing multitemporal and multispectral images as change captioning (CC). To test the proposed CC systems, we construct two novel datasets composed of bitemporal RS images: the first consists of very high-resolution RGB images, while the second consists of medium-resolution multispectral satellite images. To advance the task of CC, the constructed datasets are publicly available at the following link: https://disi.unitn.it/~melgani/datasets.html. Finally, we analyse the potential of IC for content-based image retrieval (CBIR) and show its applicability and advantages compared to traditional techniques. Specifically, we focus our attention on developing a CBIR system that represents an image with generated descriptions and uses sentence similarity to search for and retrieve relevant RS images. Compared to traditional CBIR systems, the proposed system is able to search for and retrieve images using either an image or a sentence as a query, making it more convenient for end-users. The achieved results show the promising potential of our proposed methods compared to the baselines and state-of-the-art methods.
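The post-processing step relies on the Viterbi algorithm to pick the most likely sequence of corrected words given candidate states. A minimal generic sketch of Viterbi decoding follows; the transition and emission models here are placeholders, not the thesis's trained HMM.

```python
# Generic Viterbi decoding over a toy HMM (illustrative sketch of the post-processing idea).
def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of reaching state s at step t, best predecessor state)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for t in range(1, len(observations)):
        V.append({})
        for s in states:
            best_prev, best_score = max(
                ((p, V[t - 1][p][0] * trans_p[p][s] * emit_p[s][observations[t]])
                 for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = (best_score, best_prev)
    # backtrack from the most probable final state
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        last = V[t][last][1]
        path.append(last)
    return list(reversed(path))
```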
APA, Harvard, Vancouver, ISO, and other styles
2

Hossain, Md Zakir. "Deep Learning Techniques for Image Captioning." PhD thesis, Murdoch University, 2020. https://researchrepository.murdoch.edu.au/id/eprint/60782/.

Full text
Abstract:
Generating a description of an image is called image captioning. Image captioning is a challenging task because it involves understanding the main objects, their attributes, and their relationships in an image. It also involves generating syntactically and semantically meaningful descriptions of the images in natural language. A typical image captioning pipeline comprises an image encoder and a language decoder. Convolutional Neural Networks (CNNs) are widely used as the encoder, while Long Short-Term Memory (LSTM) networks are used as the decoder. A variety of LSTMs and CNNs, including attention mechanisms, are used to generate meaningful and accurate captions. Traditional image captioning techniques have limitations in generating semantically meaningful and superior captions. In this research, we focus on advanced image captioning techniques that are able to generate semantically more meaningful and superior captions. As such, we make four contributions in this thesis. First, we investigate an attention-based LSTM on image features extracted by DenseNet, a newer type of CNN. We integrate DenseNet features with an attention mechanism and show that this combination can generate more relevant image captions than other CNNs. Second, we use bi-directional self-attention as a language decoder. A bi-directional decoder can capture the context in both forward and backward directions, i.e., past context as well as any future context, in caption generation. Consequently, the generated captions are more meaningful and superior to those generated by typical LSTMs and CNNs. Third, we further extend the work by using an additional CNN layer to incorporate the structured local context together with the past and future contexts attained by the bi-directional LSTM. A pooling scheme, namely Attention Pooling, is also used to enhance the information extraction capability of the pooling layer. Consequently, it is able to generate contextually superior captions. Fourth, existing image captioning techniques use human-annotated real images for training and testing, which involves an expensive and time-consuming process. Moreover, nowadays the bulk of images are synthetic or generated by machines, and there is a need to generate captions for such images as well. We investigate the use of synthetic images for training and testing image captioning. We show that such images can help improve the captions of real images and can effectively be used in caption generation for synthetic images.
APA, Harvard, Vancouver, ISO, and other styles
3

Tu, Guoyun. "Image Captioning on General Data and Fashion Data: An Attribute-Image-Combined Attention-Based Network for Image Captioning on Multi-Object Images and Single-Object Images." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-282925.

Full text
Abstract:
Image captioning is a crucial field across computer vision and natural language processing. It can be widely applied to high-volume web images, for example to convey image content to visually impaired users. Many methods have been adopted in this area, such as attention-based methods and semantic-concept-based models. These achieve excellent performance on general image datasets such as the MS COCO dataset. However, the area is still left unexplored on single-object images. In this paper, we propose a new attribute-information-combined attention-based network (AIC-AB Net). At each time step, attribute information is added as a supplement to visual information. For sequential word generation, spatial attention determines specific regions of images to pass to the decoder. The sentinel gate decides whether to attend to the image or to the visual sentinel (what the decoder already knows, including the attribute information). Text attribute information is synchronously fed in to help image recognition and reduce uncertainty. We build a new fashion dataset consisting of fashion images to establish a benchmark for single-object images. This fashion dataset consists of 144,422 images from 24,649 fashion products, with one description sentence for each image. Our method is tested on the MS COCO dataset and the proposed Fashion dataset. The results show the superior performance of the proposed model on both multi-object images and single-object images. Our AIC-AB Net outperforms the state-of-the-art network, the Adaptive Attention Network, by 0.017, 0.095, and 0.095 (CIDEr score) on the COCO dataset, the Fashion dataset (bestsellers), and the Fashion dataset (all vendors), respectively. The results also reveal the complementarity of the attention architecture and attribute information.
APA, Harvard, Vancouver, ISO, and other styles
4

Karayil, Tushar. "Affective Image Captioning: Extraction and Semantic Arrangement of Image Information with Deep Neural Networks." Supervised by Andreas Dengel. Kaiserslautern: Technische Universität Kaiserslautern, 2020. http://d-nb.info/1214640958/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Gennari, Riccardo. "End-to-end Deep Metric Learning con Vision-Language Model per il Fashion Image Captioning." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amslaurea.unibo.it/25772/.

Full text
Abstract:
Image captioning is a machine learning task that consists of generating a caption describing the characteristics of an input image. It can be applied, for example, to describe in detail the products for sale on an e-commerce site, improving the website's accessibility and enabling more informed purchases for customers with visual impairments. Generating accurate descriptions for online fashion items is important not only to improve customers' shopping experiences but also to increase online sales. Beyond the need to present item attributes correctly, describing products with the right language can help capture customers' attention. In this thesis, we aim to develop a system capable of generating a caption that describes in detail the image of a fashion-industry product given as input, whether a garment or some kind of accessory. In recent years, many studies have proposed solutions based on convolutional networks and LSTMs. In this project we instead propose an encoder-decoder architecture that uses the Vision Transformer model to encode images and GPT-2 to generate the text. We also study how deep metric learning techniques applied end-to-end during training affect the metrics and the quality of the captions generated by our model.
APA, Harvard, Vancouver, ISO, and other styles
6

Kan, Jichao. "Visual-Text Translation with Deep Graph Neural Networks." Thesis, University of Sydney, 2020. https://hdl.handle.net/2123/23759.

Full text
Abstract:
Visual-text translation is the task of producing textual descriptions in natural languages from images and videos. In this thesis, we investigate two topics in the field, image captioning and continuous sign language recognition, by exploring structural representations of visual content. Image captioning is the task of generating text descriptions for a given image. Deep learning based methods have achieved impressive performance on this topic. However, the relations among objects in an image have not been fully explored. Thus, a topic-guided local-global graph neural network is proposed to extract graph properties at both local and global levels. The local features are built with visual objects, while the global features are characterized by topics, both modelled with two individual graphs. Experimental results on the MS-COCO dataset show that our proposed method outperforms several state-of-the-art image captioning methods. Continuous sign language recognition (SLR) takes video clips of a sign language sentence as input and produces a sentence in a natural language as output, which can be regarded as a machine translation problem. However, SLR differs from the general machine translation problem because of the unique features of its input, e.g., facial expressions and the relationships among body parts. The facial and hand features can be extracted with neural networks, while the interaction between body parts has not yet been fully exploited. Therefore, a hierarchical spatio-temporal graph neural network is proposed, which takes both appearance and motion features into account and models the relationship between body parts with a hierarchical graph convolution network. Experimental results on two widely used datasets, PHOENIX-2014-T and Chinese Sign Language, show the effectiveness of our proposed method. In summary, our studies demonstrate that structural representations with graph neural networks are helpful for improving the translation performance from visual content to text descriptions.
APA, Harvard, Vancouver, ISO, and other styles
7

Ma, Yufeng. "Going Deeper with Images and Natural Language." Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/99993.

Full text
Abstract:
One aim in the area of artificial intelligence (AI) is to develop a smart agent with high intelligence that is able to perceive and understand the complex visual environment around us. More ambitiously, it should be able to interact with us about its surroundings in natural languages. Thanks to the progress made in deep learning, we've seen huge breakthroughs towards this goal over the last few years. The developments have been extremely rapid in visual recognition, in which machines now can categorize images into multiple classes and detect various objects within an image, with an ability that is competitive with or even surpasses that of humans. Meanwhile, we have witnessed similar strides in natural language processing (NLP): computers can now perform text classification, machine translation, and similar tasks almost perfectly. However, despite much inspiring progress, most of the achievements made are still within one domain, not handling inter-domain situations. The interaction between the visual and textual areas is still quite limited, although there has been progress in image captioning, visual question answering, etc. In this dissertation, we design models and algorithms that enable us to build in-depth connections between images and natural languages, which help us to better understand their inner structures. In particular, we first study how to make machines generate image descriptions that are indistinguishable from ones expressed by humans, which as a result also achieves better quantitative evaluation performance. Second, we devise a novel algorithm for measuring review congruence, which takes an image and review text as input and quantifies the relevance of each sentence to the image. The whole model is trained without any supervised ground truth labels. Finally, we propose a brand new AI task called Image Aspect Mining, to detect visual aspects in images and identify aspect-level ratings within the review context. On the theoretical side, this research contributes to multiple research areas in Computer Vision (CV), Natural Language Processing (NLP), interactions between CV and NLP, and Deep Learning. Regarding impact, these techniques will benefit related users such as the visually impaired, customers reading reviews, merchants, and AI researchers in general.
Doctor of Philosophy
APA, Harvard, Vancouver, ISO, and other styles
8

Kvita, Jakub. "Popis fotografií pomocí rekurentních neuronových sítí." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255324.

Full text
Abstract:
This thesis deals with the automatic generation of image descriptions using several kinds of neural networks. The work is based on papers from the MS COCO Captioning Challenge 2015 and on character-level language models popularized by A. Karpathy. The proposed model is a combination of a convolutional and a recurrent neural network with an encoder-decoder architecture. The vector representing the encoded image is passed to the language model as the memory values of the LSTM layers in the network. The thesis examines how well a model with such a simple architecture can describe images and how it compares with other current models. One of the conclusions of the work is that the proposed architecture is not sufficient for image captioning of any kind.
APA, Harvard, Vancouver, ISO, and other styles
9

Devarapalli, Hemanth. "Forced Attention for Image Captioning." Thesis, 2019.

Find full text
Abstract:

Automatic generation of captions for a given image is an active research area in Artificial Intelligence. The architectures have evolved from classical machine learning applied to image metadata to neural networks. Two different styles of architecture evolved in the neural network space for image captioning: the Encoder-Attention-Decoder architecture and the transformer architecture. This study is an attempt to modify the attention mechanism to allow any object to be specified. An archetypical Encoder-Attention-Decoder architecture (Show, Attend, and Tell (Xu et al., 2015)) is employed as a baseline for this study, and a modification of the Show, Attend, and Tell architecture is proposed. Both architectures are evaluated on the MSCOCO (Lin et al., 2014) dataset, and seven metrics are calculated: BLEU-1, 2, 3, 4 (Papineni, Roukos, Ward & Zhu, 2002), METEOR (Banerjee & Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam, Lawrence & Parikh, 2015). Finally, the statistical significance of the results is evaluated by performing paired t-tests.
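The significance test mentioned above is a paired t-test over per-image metric scores from the baseline and the modified model. A minimal sketch with SciPy; the score arrays are made-up examples, not results from the thesis.

```python
# Paired t-test over per-image metric scores for two captioning models (illustrative example).
from scipy import stats

baseline_cider = [0.92, 1.05, 0.87, 1.10, 0.95]   # per-image CIDEr, baseline model (made-up)
modified_cider = [0.96, 1.08, 0.90, 1.12, 0.93]   # per-image CIDEr, modified model (made-up)

t_stat, p_value = stats.ttest_rel(baseline_cider, modified_cider)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")      # p < 0.05 would suggest a significant difference
```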

APA, Harvard, Vancouver, ISO, and other styles
10

Mathews, Alexander Patrick. "Automatic Image Captioning with Style." Phd thesis, 2018. http://hdl.handle.net/1885/151929.

Full text
Abstract:
This thesis connects two core topics in machine learning, vision and language. The problem of choice is image caption generation: automatically constructing natural language descriptions of image content. Previous research into image caption generation has focused on generating purely descriptive captions; I focus on generating visually relevant captions with a distinct linguistic style. Captions with style have the potential to ease communication and add a new layer of personalisation. First, I consider naming variations in image captions, and propose a method for predicting context-dependent names that takes into account visual and linguistic information. This method makes use of a large-scale image caption dataset, which I also use to explore naming conventions and report naming conventions for hundreds of animal classes. Next I propose the SentiCap model, which relies on recent advances in artificial neural networks to generate visually relevant image captions with positive or negative sentiment. To balance descriptiveness and sentiment, the SentiCap model dynamically switches between two recurrent neural networks, one tuned for descriptive words and one for sentiment words. As the first published model for generating captions with sentiment, SentiCap has influenced a number of subsequent works. I then investigate the sub-task of modelling styled sentences without images. The specific task chosen is sentence simplification: rewriting news article sentences to make them easier to understand. For this task I design a neural sequence-to-sequence model that can work with limited training data, using novel adaptations for word copying and sharing word embeddings. Finally, I present SemStyle, a system for generating visually relevant image captions in the style of an arbitrary text corpus. A shared term space allows a neural network for vision and content planning to communicate with a network for styled language generation. SemStyle achieves competitive results in human and automatic evaluations of descriptiveness and style. As a whole, this thesis presents two complete systems for styled caption generation that are first of their kind and demonstrate, for the first time, that automatic style transfer for image captions is achievable. Contributions also include novel ideas for object naming and sentence simplification. This thesis opens up inquiries into highly personalised image captions; large scale visually grounded concept naming; and more generally, styled text generation with content control.
APA, Harvard, Vancouver, ISO, and other styles
More sources

Book chapters on the topic "Image Captioning"

1

Sarang, Poornachandra. "Image Captioning." In Artificial Neural Networks with TensorFlow 2, 471–522. Berkeley, CA: Apress, 2020. http://dx.doi.org/10.1007/978-1-4842-6150-7_10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

He, Sen, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, and Nicolas Pugeault. "Image Captioning Through Image Transformer." In Computer Vision – ACCV 2020, 153–69. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-69538-5_10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Deng, Chaorui, Ning Ding, Mingkui Tan, and Qi Wu. "Length-Controllable Image Captioning." In Computer Vision – ECCV 2020, 712–29. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58601-0_42.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Yang, Huan, Dandan Song, and Lejian Liao. "Image Captioning with Relational Knowledge." In Lecture Notes in Computer Science, 378–86. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-97310-4_43.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Wang, Ziwei, Zi Huang, and Yadan Luo. "PAIC: Parallelised Attentive Image Captioning." In Lecture Notes in Computer Science, 16–28. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-39469-1_2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Meng, Zihang, David Yang, Xuefei Cao, Ashish Shah, and Ser-Nam Lim. "Object-Centric Unsupervised Image Captioning." In Lecture Notes in Computer Science, 219–35. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_13.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Bathija, Pranav, Harsh Chawla, Ashish Bhat, and Arti Deshpande. "Image Captioning Using Ensemble Model." In ICT Systems and Sustainability, 345–55. Singapore: Springer Singapore, 2022. http://dx.doi.org/10.1007/978-981-16-5987-4_35.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Cetinic, Eva. "Iconographic Image Captioning for Artworks." In Pattern Recognition. ICPR International Workshops and Challenges, 502–16. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-68796-0_36.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Wang, Shihao, Hong Mo, Yue Xu, Wei Wu, and Zhong Zhou. "Intra-Image Region Context for Image Captioning." In Advances in Multimedia Information Processing – PCM 2018, 212–22. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-00764-5_20.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Alsharid, Mohammad, Harshita Sharma, Lior Drukker, Aris T. Papageorgiou, and J. Alison Noble. "Weakly Supervised Captioning of Ultrasound Images." In Medical Image Understanding and Analysis, 187–98. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-12053-4_14.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Image Captioning"

1

Adhikari, Aashish, and Sushil Ghimire. "Nepali Image Captioning." In 2019 Artificial Intelligence for Transforming Business and Society (AITB). IEEE, 2019. http://dx.doi.org/10.1109/aitb48515.2019.8947436.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. "Convolutional Image Captioning." In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. http://dx.doi.org/10.1109/cvpr.2018.00583.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Feng, Yang, Lin Ma, Wei Liu, and Jiebo Luo. "Unsupervised Image Captioning." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. http://dx.doi.org/10.1109/cvpr.2019.00425.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Ge, Xuri, Fuhai Chen, Chen Shen, and Rongrong Ji. "Colloquial Image Captioning." In 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019. http://dx.doi.org/10.1109/icme.2019.00069.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Puscasiu, Adela, Alexandra Fanca, Dan-Ioan Gota, and Honoriu Valean. "Automated image captioning." In 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR). IEEE, 2020. http://dx.doi.org/10.1109/aqtr49680.2020.9129930.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Byrd, Emmanuel, and Miguel Gonzalez-Mendoza. "OSCAR and ActivityNet: an Image Captioning model can effectively learn a Video Captioning dataset." In LatinX in AI at Computer Vision and Pattern Recognition Conference 2021. Journal of LatinX in AI Research, 2021. http://dx.doi.org/10.52591/lxai202106257.

Full text
Abstract:
Activity Recognition and Classification in video sequences is an area of research that has received attention recently. However, video processing is computationally expensive, and its advances have not been as remarkable as those of Image Captioning. This work, created by Latinx individuals from Mexico, uses a computationally limited environment and transforms the Video Captioning dataset of ActivityNet into an Image Captioning dataset. By generating features with Bottom-Up attention, training an OSCAR Image Captioning model, and using different NLP data augmentation techniques, we show a viable and promising approach to simplifying the Video Captioning task.
APA, Harvard, Vancouver, ISO, and other styles
7

Yan, Xu, Zhengcong Fei, Zekang Li, Shuhui Wang, Qingming Huang, and Qi Tian. "Semi-Autoregressive Image Captioning." In MM '21: ACM Multimedia Conference. New York, NY, USA: ACM, 2021. http://dx.doi.org/10.1145/3474085.3475179.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Mason, Rebecca, and Eugene Charniak. "Domain-Specific Image Captioning." In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Stroudsburg, PA, USA: Association for Computational Linguistics, 2014. http://dx.doi.org/10.3115/v1/w14-1602.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Zeng, Pengpeng, Haonan Zhang, Jingkuan Song, and Lianli Gao. "S2 Transformer for Image Captioning." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/224.

Full text
Abstract:
Transformer-based architectures with grid features represent the state of the art in visual and language reasoning tasks, such as visual question answering and image-text matching. However, directly applying them to image captioning may result in the loss of spatial and fine-grained semantic information, and their applicability to image captioning is still largely under-explored. Towards this goal, we propose a simple yet effective method, the Spatial- and Scale-aware Transformer (S2 Transformer), for image captioning. Specifically, we first propose a Spatial-aware Pseudo-supervised (SP) module, which resorts to feature clustering to help preserve spatial information for grid features. Next, to maintain the model size and produce superior results, we build a simple weighted residual connection, named the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features with rich semantics. Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves new state-of-the-art performance without adding excessive parameters compared with the vanilla transformer. The source code is available at https://github.com/zchoi/S2-Transformer
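A scale-wise weighted residual connection, in the spirit of the SR module described above, can be sketched as a learnable weighted sum over the outputs of all encoder layers. This is only an illustration of the general idea, not the paper's implementation.

```python
# Learnable per-layer weights combining low- and high-level encoder outputs (illustrative sketch).
import torch
import torch.nn as nn

class ScaleWiseResidual(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_layers))   # one scalar weight per encoder layer

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors, each of shape (B, N, d), one per encoder layer
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * layer_outputs[i] for i in range(len(layer_outputs)))
```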
APA, Harvard, Vancouver, ISO, and other styles
10

Guo, Longteng, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, and Hanqing Lu. "Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/107.

Full text
Abstract:
Most image captioning models are autoregressive, i.e., they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use the word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider sentence-level consistency, thus resulting in inferior generation quality from these non-autoregressive models. In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL). CMAL formulates NAIC as a multi-agent reinforcement learning system where positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward. Besides, we propose to utilize massive unlabeled images to boost captioning performance. Extensive experiments on the MSCOCO image captioning benchmark show that our NAIC model achieves performance comparable to state-of-the-art autoregressive models, while bringing a 13.9x decoding speedup.
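The latency difference described above comes from how many forward passes decoding needs. A minimal sketch contrasting the two regimes, assuming a hypothetical `model(image, tokens)` that returns logits of shape (batch, seq_len, vocab); everything here is illustrative, not the paper's code.

```python
# Autoregressive decoding (one forward pass per word) vs. non-autoregressive decoding
# (a single forward pass predicting all positions in parallel). Illustrative sketch only.
import torch

def autoregressive_decode(model, image, bos_id, max_len=20):
    tokens = [bos_id]
    for _ in range(max_len):                              # one model call per generated word
        logits = model(image, torch.tensor([tokens]))
        tokens.append(int(logits[0, -1].argmax()))
    return tokens[1:]

def non_autoregressive_decode(model, image, mask_id, max_len=20):
    placeholder = torch.full((1, max_len), mask_id)       # all positions filled with a mask token
    logits = model(image, placeholder)                    # single forward pass
    return logits[0].argmax(dim=-1).tolist()              # all words predicted in parallel
```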
APA, Harvard, Vancouver, ISO, and other styles