
Journal articles on the topic 'Image Captioning'


Consult the top 50 journal articles for your research on the topic 'Image Captioning.'


1

Bahl, Vasudha, Nidhi Sengar, Gaurav Joshi, and Amita Goel. "Image Captioning System." International Journal for Modern Trends in Science and Technology 6, no. 12 (December 4, 2020): 40–44. http://dx.doi.org/10.46501/ijmtst061208.

Abstract:
Deep learning is a relatively new field that has attracted a lot of attention because it recognizes objects with higher accuracy than ever before. NLP has likewise had a large impact on our lives, having come a long way from producing readable text summaries to analysing mental illness. The image captioning task combines NLP and deep learning. Image captioning makes it possible to describe images in a meaningful way: describing an image does not just mean recognizing objects; to describe an image properly, we first need to identify the objects present in the image and then the relationships between those objects. In this study we use a CNN-LSTM based framework: a CNN is used to extract features of the image, while an LSTM generates meaningful sentences. The study also discusses applications of image captioning and the major challenges faced in achieving this task.
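To make the CNN-LSTM pipeline described in this abstract concrete, here is a minimal PyTorch sketch of the two components: a pre-trained CNN truncated to produce an image feature vector, and an LSTM that generates the caption word by word. The backbone choice (ResNet-50), the embedding and hidden sizes, and the vocabulary size are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Map an image to a fixed-length feature vector with a pre-trained CNN."""

    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                        # images: (B, 3, 224, 224)
        with torch.no_grad():                         # keep the backbone frozen
            feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                         # (B, embed_size)


class LSTMDecoder(nn.Module):
    """Generate caption logits conditioned on the image feature."""

    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feat, captions):          # captions: (B, T) word ids
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                        # (B, T+1, vocab_size)


encoder, decoder = CNNEncoder(), LSTMDecoder(vocab_size=5000)
images = torch.rand(2, 3, 224, 224)                   # dummy batch
captions = torch.randint(0, 5000, (2, 12))            # dummy word ids
logits = decoder(encoder(images), captions)           # (2, 13, 5000)
```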
2

Beddiar, Djamila Romaissa, Mourad Oussalah, Tapio Seppänen, and Rachid Jennane. "ACapMed: Automatic Captioning for Medical Imaging." Applied Sciences 12, no. 21 (November 1, 2022): 11092. http://dx.doi.org/10.3390/app122111092.

Abstract:
Medical image captioning is a very challenging task that has been rarely addressed in the literature on natural image captioning. Some existing image captioning techniques exploit objects present in the image next to the visual features while generating descriptions. However, this is not possible for medical image captioning when one requires following clinician-like explanations in image content descriptions. Inspired by the preceding, this paper proposes using medical concepts associated with images, in accordance with their visual features, to generate new captions. Our end-to-end trainable network is composed of a semantic feature encoder based on a multi-label classifier to identify medical concepts related to images, a visual feature encoder, and an LSTM model for text generation. Beam search is employed to ensure the best selection of the next word for a given sequence of words based on the merged features of the medical image. We evaluated our proposal on the ImageCLEF medical captioning dataset, and the results demonstrate the effectiveness and efficiency of the developed approach.
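The beam search step mentioned above can be written independently of any particular decoder. The sketch below keeps the highest-scoring partial captions at each step; `log_prob_next`, the start/end token ids, and the beam size are placeholders for whatever the captioning model actually provides, not the paper's implementation.

```python
def beam_search(log_prob_next, start_token, end_token, beam_size=3, max_len=20):
    """Return the highest-scoring caption found by beam search.

    `log_prob_next(seq)` stands in for the decoder: given a partial word-id
    sequence, it must return a dict {word_id: log_probability} for the next word.
    """
    beams = [([start_token], 0.0)]                    # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                  # finished captions are carried over
                candidates.append((seq, score))
                continue
            for word, logp in log_prob_next(seq).items():
                candidates.append((seq + [word], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]
```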
3

Upadhyay, Mukund, and Shallu Bashambu. "Image Captioning Bot." International Journal for Modern Trends in Science and Technology 6, no. 12 (December 15, 2020): 348–54. http://dx.doi.org/10.46501/ijmtst061265.

Abstract:
Image captioning means automatically generating a caption for an image. With the development of deep learning, the combination of computer vision and natural language processing has attracted great attention in the last few years, and image captioning is representative of this field: it makes the computer learn to use one or more sentences to describe the visual content of an image. Generating meaningful descriptions of high-level image semantics requires not only recognizing the objects and the scene, but also the ability to analyse the state, the attributes, and the relationships among these objects. Neural-network-based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets, and discussions of future research directions are presented.
4

Mualla, Rasha Mohammed, Jafar Alkheir, and Samer Sulaiman. "Improving the Performance of Image Captioning Systems Using a Pre-Classification Stage: تحسين أداء أنظمة وصف الصور باستخدام مرحلة التصنيف المسبق للصور." Journal of Engineering Sciences and Information Technology 6, no. 1 (March 27, 2022): 150–64. http://dx.doi.org/10.26389/ajsrp.l270721.

Abstract:
In this research, we introduce a novel image classification and captioning system by adding a classification layer before the image captioning models. The suggested approach consists of three main steps and is inspired by the state of the art, in which generating image captions within small sub-class categories works better than over a large unclassified dataset. In the first step, we collected a dataset from two international datasets (MS-COCO and Flickr2k) comprising 10778 images, of which 80% is used for training and 20% for validation. In the next step, the dataset images were classified into 11 classes (10 classes of indoor and outdoor categories and one "Null" class) and fed into a deep learning classifier. The classifier is re-trained using our classes and learns to assign each image to the corresponding category. In the final step, each classified image is used as input to 11 pre-trained class-specific image captioning models, and the final caption sentence is generated. The experiments show that adding the pre-classification step before the image captioning stage improves performance significantly, by (8.15% and 8.44%) and (12.7407% and 16.7048%) for Top-1 and Top-5 of the English and Arabic systems respectively. The classification step achieves a true classification rate of 71.32% and 73.09% for the English and Arabic systems respectively.
5

Yang, Zhenyu, Qiao Liu, and Guojing Liu. "Better Understanding: Stylized Image Captioning with Style Attention and Adversarial Training." Symmetry 12, no. 12 (November 30, 2020): 1978. http://dx.doi.org/10.3390/sym12121978.

Abstract:
Compared with traditional image captioning, stylized image captioning has broader application scenarios, such as a better understanding of images. However, stylized image captioning faces many challenges, the most important of which is how to make the model take into account both the image meta-information and the style factor of the generated captions. In this paper, we propose a novel end-to-end stylized image captioning framework (ST-BR). Specifically, we first use a style transformer to model the factual information of images, while a style attention module learns style factors from a multi-style corpus; on the whole, this forms a symmetric structure. At the same time, we use back-reinforcement to evaluate the degree of consistency of the generated stylized captions with the image knowledge and the specified style, respectively. These two parts further enhance the learning ability of the model through adversarial learning. Our experiments achieve effective performance on the benchmark dataset.
6

Nivedita, M., and Asnath Victy Phamila Y. "Image Captioning for Spatially Rotated Images in Video Surveillance Applications Using Neural Networks." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 29, Supp02 (December 2021): 193–209. http://dx.doi.org/10.1142/s0218488521400110.

Abstract:
Video surveillance has become an essential tool in the security industry because of its sophisticated and fool-proof technology. Recent developments in image recognition and captioning have enabled us to adopt these technologies in the field of video surveillance. The biggest problem in image captioning is that it varies with the rotation angle of the image: different angles of the same image generate different captions. We aim to address and eliminate this rotation variance. We have implemented a custom image rotation network using a Convolutional Neural Network (CNN). The input image is rotated back to its original angle using this network and passed on to the image captioning model. The caption of the image is then generated and sent to the user for situation analysis.
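As an illustration of the rotation-correction idea, the sketch below uses a small classifier over four canonical angles and rotates the frame back before it is handed to the captioner. The four-angle assumption, the ResNet-18 backbone, and all sizes are hypothetical; the paper's custom rotation network is not reproduced here.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from torchvision import models


class RotationNet(nn.Module):
    """Predict which of four angles (0, 90, 180, 270 degrees) the frame was rotated by."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, 4)   # four rotation classes
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)


def normalize_rotation(frame, net):
    """Rotate the surveillance frame back to its canonical orientation."""
    with torch.no_grad():
        angle_class = net(frame.unsqueeze(0)).argmax(dim=1).item()
    return TF.rotate(frame, -90 * angle_class)    # undo the predicted rotation

net = RotationNet().eval()
frame = torch.rand(3, 224, 224)                   # dummy frame
upright = normalize_rotation(frame, net)          # pass `upright` to the captioning model
```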
7

Iwamura, Kiyohiko, Jun Younes Louhi Kasahara, Alessandro Moro, Atsushi Yamashita, and Hajime Asama. "Image Captioning Using Motion-CNN with Object Detection." Sensors 21, no. 4 (February 10, 2021): 1270. http://dx.doi.org/10.3390/s21041270.

Abstract:
Automatic image captioning has many important applications, such as the depiction of visual contents for visually impaired people or the indexing of images on the internet. Recently, deep learning-based image captioning models have been researched extensively. For caption generation, they learn the relation between image features and words included in the captions. However, image features might not be relevant for certain words such as verbs. Therefore, our earlier reported method included the use of motion features along with image features for generating captions including verbs. However, all the motion features were used. Since not all motion features contributed positively to the captioning process, unnecessary motion features decreased the captioning accuracy. As described herein, we use experiments with motion features for thorough analysis of the reasons for the decline in accuracy. We propose a novel, end-to-end trainable method for image caption generation that alleviates the decreased accuracy of caption generation. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. Results demonstrate that our proposed method improves caption generation performance.
8

Junaid, Mohd Wasiuddin. "Image Captioning with Face Recognition using Transformers." International Journal for Research in Applied Science and Engineering Technology 10, no. 1 (January 31, 2022): 1426–32. http://dx.doi.org/10.22214/ijraset.2022.40057.

Abstract:
The process of generating text from images is called image captioning. It requires not only recognition of the objects and the scene but also the ability to analyze their state and identify the relationships among these objects, so image captioning integrates the fields of computer vision and natural language processing. We introduce a novel image captioning model that is capable of recognizing human faces in a given image using a transformer model. The proposed Faster R-CNN-Transformer architecture comprises feature extraction from images, extraction of semantic keywords from captions, and encoder-decoder transformers. Faster R-CNN is implemented for face recognition, and features are extracted from images using InceptionV3. The model aims to identify and recognize the known faces in the images; the Faster R-CNN module creates a bounding box around each face, which helps in better interpretation of the image and its caption. The dataset used for this model has images with celebrity faces and captions with celebrity names, covering 232 celebrities in total. Due to the small size of the dataset, we augmented the images and added 100 images with their corresponding captions to increase the size of the vocabulary for our model. BLEU and METEOR scores were generated to evaluate the accuracy and quality of the generated captions. Keywords: image captioning, Faster R-CNN, transformers, BLEU score, METEOR score.
9

Al-Malla, Muhammad Abdelhadie, Assef Jafar, and Nada Ghneim. "Pre-trained CNNs as Feature-Extraction Modules for Image Captioning." ELCVIA Electronic Letters on Computer Vision and Image Analysis 21, no. 1 (May 10, 2022): 1–16. http://dx.doi.org/10.5565/rev/elcvia.1436.

Abstract:
In this work, we present a thorough experimental study about feature extraction using Convolutional Neural Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are extracted from the last layer after removing the fully connected layer and fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all the experiments to study the effect of changing the CNN feature extractor on image captioning quality. The scores are calculated using the standard metrics in image captioning. We find a strong relationship between the model structure and the image captioning dataset and prove that VGG models give the least quality for image captioning feature extraction among the tested CNNs. In the end, we recommend a set of pre-trained CNNs for each of the image captioning evaluation metrics we want to optimise, and show the connection between our results and previous works. To our knowledge, this work is the most comprehensive comparison between feature extractors for image captioning.
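The core operation in this comparison, taking a pre-trained classifier, removing its fully connected head, and using the remaining network as a feature extractor, looks roughly like the torchvision sketch below. The three backbones shown and the quoted feature sizes are illustrative assumptions; the paper covers twelve CNNs and its own unified captioning model, which is not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Turn a pre-trained classifier into a feature extractor by dropping its head.
def build_extractor(name):
    if name == "resnet50":
        net = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        return nn.Sequential(*list(net.children())[:-1]), 2048
    if name == "vgg16":
        net = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        return nn.Sequential(net.features, net.avgpool, nn.Flatten(),
                             *list(net.classifier[:-1])), 4096
    if name == "densenet201":
        net = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
        return nn.Sequential(net.features, nn.ReLU(), nn.AdaptiveAvgPool2d(1)), 1920
    raise ValueError(name)

images = torch.rand(4, 3, 224, 224)               # dummy batch
for name in ["resnet50", "vgg16", "densenet201"]:
    extractor, dim = build_extractor(name)
    extractor.eval()
    with torch.no_grad():
        feats = extractor(images).flatten(1)      # features fed to a fixed captioning model
    print(name, feats.shape, "expected dim:", dim)
```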
10

Chang, Yeong-Hwa, Yen-Jen Chen, Ren-Hung Huang, and Yi-Ting Yu. "Enhanced Image Captioning with Color Recognition Using Deep Learning Methods." Applied Sciences 12, no. 1 (December 26, 2021): 209. http://dx.doi.org/10.3390/app12010209.

Abstract:
Automatically describing the content of an image is an interesting and challenging task in artificial intelligence. In this paper, an enhanced image captioning model—including object detection, color analysis, and image captioning—is proposed to automatically generate the textual descriptions of images. In an encoder–decoder model for image captioning, VGG16 is used as an encoder and an LSTM (long short-term memory) network with attention is used as a decoder. In addition, Mask R-CNN with OpenCV is used for object detection and color analysis. The integration of the image caption and color recognition is then performed to provide better descriptive details of images. Moreover, the generated textual sentence is converted into speech. The validation results illustrate that the proposed method can provide more accurate description of images.
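The colour-analysis step can be approximated in a few lines of OpenCV: cluster the pixels inside a detected object's bounding box and report the largest cluster centre. The box format, the choice of k, and the k-means settings below are assumptions for illustration, not the paper's exact procedure.

```python
import cv2
import numpy as np


def dominant_color(image_bgr, box, k=3):
    """Return the dominant BGR colour inside a detected object's bounding box (x, y, w, h)."""
    x, y, w, h = box
    roi = image_bgr[y:y + h, x:x + w].reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(roi, k, None, criteria, 3, cv2.KMEANS_RANDOM_CENTERS)
    counts = np.bincount(labels.flatten())
    return centers[np.argmax(counts)]             # BGR triplet of the largest cluster

image = cv2.imread("frame.jpg")                   # placeholder path
if image is not None:
    print(dominant_color(image, (50, 40, 120, 160)))
```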
11

Cao, Tingjia, Ke Han, Xiaomei Wang, Lin Ma, Yanwei Fu, Yu-Gang Jiang, and Xiangyang Xue. "Feature Deformation Meta-Networks in Image Captioning of Novel Objects." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 10494–501. http://dx.doi.org/10.1609/aaai.v34i07.6620.

Abstract:
This paper studies the task of image captioning with novel objects, which only exist in testing images. Intrinsically, this task can reflect the generalization ability of models in understanding and captioning the semantic meanings of visual concepts and objects unseen in training set, sharing the similarity to one/zero-shot learning. The critical difficulty thus comes from that no paired images and sentences of the novel objects can be used to help train the captioning model. Inspired by recent work (Chen et al. 2019b) that boosts one-shot learning by learning to generate various image deformations, we propose learning meta-networks for deforming features for novel object captioning. To this end, we introduce the feature deformation meta-networks (FDM-net), which is trained on source data, and learn to adapt to the novel object features detected by the auxiliary detection model. FDM-net includes two sub-nets: feature deformation, and scene graph sentence reconstruction, which produce the augmented image features and corresponding sentences, respectively. Thus, rather than directly deforming images, FDM-net can efficiently and dynamically enlarge the paired images and texts by learning to deform image features. Extensive experiments are conducted on the widely used novel object captioning dataset, and the results show the effectiveness of our FDM-net. Ablation study and qualitative visualization further give insights of our model.
12

S, Thivaharan, Srivatsun G, Pranav Kiran S, and Johan Benoni Raul J. "Image Captioning in Tamil Language using Encoder-Decoder Architecture." March 2023 5, no. 1 (March 2, 2023): 36–48. http://dx.doi.org/10.36548/jucct.2023.1.003.

Abstract:
Image captioning is the process of using clear, meaningful words to describe the characteristics of an image. This feature has wide applications in social networking applications such as Facebook and Instagram, and video streaming platforms such as YouTube and Netflix, where the need to verbalize an image or video is evident. Image captioning is also one of the most requested features in next-generation AI systems. It has huge applications in the Deep Learning domain. Much research is actively being done on image captioning, which can solve a good deal of real time problems such as the need for a system that can aid visually disabled people, creating effective captions that can be incorporated in self-driving vehicles, etc. This elaborate yet useful feature can be incorporated with the help of various technical concepts such as Natural Language Processing, Computer vision, Image Processing, etc. The image captioning feature has already been attempted on English language and with the help of extensive research and technical advancements these attempts have been fruitful and successful. Nowadays, there are many applications and models available based on image captioning of English language. This has paved a path for further advancements in this domain. A lot of research are now being undertaken to incorporate this highly useful feature with non-English languages. English being the native language for a relatively smaller proportion of people, it would be helpful for people whose native language is not English, to get their images captioned in the language of their choice. This research focuses on image captioning in Tamil language and its underlying methodology and architecture. Moreover, the paper also includes experiments related to this with the help of an image captioning model which uses a combination of Convolution Neural Network and Long Short -Term Memory models.
13

Banda, Anish. "Image Captioning using CNN and LSTM." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (August 31, 2021): 2666–69. http://dx.doi.org/10.22214/ijraset.2021.37846.

Abstract:
In the proposed model, we examine a deep-neural-network-based image caption generation technique. Given an image as input, the technique produces output in three different forms: a sentence in three different languages describing the image, an mp3 audio file, and a generated image file. The model uses techniques from both computer vision and natural language processing. We aim to combine a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to build a model that generates a caption. The target image is compared with the training images from a large dataset; this is done by the convolutional neural network, and the model generates a reasonable description using the trained data. To extract features from images we need an encoder, for which we use the CNN; to decode the generated image description we use the LSTM. To evaluate the accuracy of the generated captions we use the BLEU metric, which grades the quality of the generated content, and performance is calculated using the standard evaluation metrics. Keywords: CNN, RNN, LSTM, BLEU score, encoder, decoder, captions, image description.
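BLEU scoring of generated captions against reference captions, as used above, is available off the shelf. A minimal NLTK example with toy tokenised captions follows; the sentences, n-gram weights, and smoothing choice are illustrative assumptions.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption scored against its two reference captions (toy data).
references = [
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
candidates = [["a", "dog", "is", "running", "on", "grass"]]

smooth = SmoothingFunction().method1              # avoids zero scores on short captions
for n, weights in [(1, (1, 0, 0, 0)), (2, (0.5, 0.5, 0, 0)), (4, (0.25, 0.25, 0.25, 0.25))]:
    score = corpus_bleu(references, candidates, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```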
14

Atliha, Viktar, and Dmitrij Šešok. "Image-Captioning Model Compression." Applied Sciences 12, no. 3 (February 4, 2022): 1638. http://dx.doi.org/10.3390/app12031638.

Abstract:
Image captioning is a very important task, which is on the edge between natural language processing (NLP) and computer vision (CV). The current quality of the captioning models allows them to be used for practical tasks, but they require both large computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model size compression in order to prepare them for use on mobile devices. Furthermore, these works usually only investigate decoder compression in a typical encoder–decoder architecture, while the encoder traditionally occupies most of the space. We applied the most efficient model-compression techniques such as architectural changes, pruning and quantization to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including encoder), but lost no more than 2% and 4.5% in metrics such as CIDEr and SPICE, respectively. At the same time, the best model showed results of 127.4 CIDEr and 21.4 SPICE, with a size equal to only 34.8 MB, which sets a strong baseline for compression problems for image-captioning models, and could be used for practical applications.
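Two of the compression techniques named in the abstract, magnitude pruning and post-training quantization, can be sketched in a few lines of PyTorch on a stand-in decoder. The toy layer sizes and the 50% sparsity are assumptions; the paper applies these ideas to full captioning architectures, including the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in decoder; the paper compresses complete captioning models.
decoder = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 10000))

# 1) Unstructured magnitude pruning: zero out 50% of the smallest weights per layer.
for module in decoder:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")            # make the pruning permanent

# 2) Post-training dynamic quantization: store Linear weights in int8.
quantized = torch.quantization.quantize_dynamic(decoder, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```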
15

Cho, Suhyun, and Hayoung Oh. "Generalized Image Captioning for Multilingual Support." Applied Sciences 13, no. 4 (February 14, 2023): 2446. http://dx.doi.org/10.3390/app13042446.

Abstract:
Image captioning is a problem of viewing images and describing images in language. This is an important problem that can be solved by understanding the image, and combining two fields of image processing and natural language processing into one. The purpose of image captioning research so far has been to create general explanatory captions in the learning data. However, various environments in reality must be considered for practical use, as well as image descriptions that suit the purpose of use. Image caption research requires processing new learning data to generate descriptive captions for specific purposes, but it takes a lot of time and effort to create learnable data. In this study, we propose a method to solve this problem. Popular image captioning can help visually impaired people understand their surroundings by automatically recognizing and describing images into text and then into voice and is an important issue that can be applied to many places such as image search, art therapy, sports commentary, and real-time traffic information commentary. Through the domain object dictionary method proposed in this study, we propose a method to generate image captions without the need to process new learning data by adjusting the object dictionary for each domain application. The method proposed in the study is to change the dictionary of the object to focus on the domain object dictionary rather than processing the learning data, leading to the creation of various image captions by intensively explaining the objects required for each domain. In this work, we propose a filter captioning model that induces generation of image captions from various domains while maintaining the performance of existing models.
16

Jain, Uday. "Image Captioning - A Deep Learning Approach." International Journal for Research in Applied Science and Engineering Technology 10, no. 7 (July 31, 2022): 3068–72. http://dx.doi.org/10.22214/ijraset.2022.45638.

Abstract:
Image captioning is a relatively new research area in computer vision. Its primary goal is to create a natural language description for an input image. In recent years, research in natural language processing and computer vision has become increasingly interested in the problem of automatically synthesising descriptive phrases for photos. Image captioning is a crucial task that demands both the ability to create precise and accurate description phrases and a semantic understanding of the images. Long Short-Term Memory (LSTM) is used to precisely organise data, using the available keywords to form meaningful sentences. The authors of this research propose a hybrid system based on multilayer Convolutional Neural Networks to create a lexicon for characterising the visuals. The convolutional neural network employs trained captions to deliver an accurate description after comparing the target image to a sizable dataset of training images. We demonstrate the effectiveness of our suggested methodology using the Flickr 8K dataset.
17

Javanmardi, Shima, Ali Mohammad Latif, Mohammad Taghi Sadeghi, Mehrdad Jahanbanifard, Marcello Bonsangue, and Fons J. Verbeek. "Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network." Sensors 22, no. 21 (November 1, 2022): 8376. http://dx.doi.org/10.3390/s22218376.

Abstract:
In image captioning models, the main challenge in describing an image is identifying all the objects by precisely considering the relationships between the objects and producing various captions. Over the past few years, many methods have been proposed, from an attribute-to-attribute comparison approach to handling issues related to semantics and their relationships. Despite the improvements, the existing techniques suffer from inadequate positional and geometrical attributes concepts. The reason is that most of the abovementioned approaches depend on Convolutional Neural Networks (CNNs) for object detection. CNN is notorious for failing to detect equivariance and rotational invariance in objects. Moreover, the pooling layers in CNNs cause valuable information to be lost. Inspired by the recent successful approaches, this paper introduces a novel framework for extracting meaningful descriptions based on a parallelized capsule network that describes the content of images through a high level of understanding of the semantic contents of an image. The main contribution of this paper is proposing a new method that not only overrides the limitations of CNNs but also generates descriptions with a wide variety of words by using Wikipedia. In our framework, capsules focus on the generation of meaningful descriptions with more detailed spatial and geometrical attributes for a given set of images by considering the position of the entities as well as their relationships. Qualitative experiments on the benchmark dataset MS-COCO show that our framework outperforms state-of-the-art image captioning models when describing the semantic content of the images.
18

Pandey, Subash, Rabin Kumar Dhamala, Bikram Karki, Saroj Dahal, and Rama Bastola. "Automatic Image Captioning Using Neural Networks." Journal of Innovations in Engineering Education 3, no. 1 (March 31, 2020): 138–46. http://dx.doi.org/10.3126/jiee.v3i1.34335.

Abstract:
Automatically generating a natural language description of an image is a major challenging task in the field of artificial intelligence. Generating description of an image bring together the fields: Natural Language Processing and Computer Vision. There are two types of approaches i.e. top-down and bottom-up. For this paper, we approached top-down that starts from the image and converts it into the word. Image is passed to Convolutional Neural Network (CNN) encoder and the output from it is fed further to Recurrent Neural Network (RNN) decoder that generates meaningful captions. We generated the image description by passing the real time images from the camera of a smartphone as well as tested with the test images from the dataset. To evaluate the model performance, we used BLEU (Bilingual Evaluation Understudy) score and match predicted words to the original caption.
19

Fei, Zhengcong. "Memory-Augmented Image Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1317–24. http://dx.doi.org/10.1609/aaai.v35i2.16220.

Abstract:
Current deep learning-based image captioning systems have been proven to store practical knowledge in their parameters and achieve competitive performance on public datasets. Nevertheless, their ability to access and precisely manipulate the mastered knowledge is still limited. Besides, providing evidence for decisions and updating memory information are also important yet underexplored. Towards this goal, we introduce a memory-augmented method, which extends an existing image caption model by incorporating extra explicit knowledge from a memory bank. Adequate knowledge is recalled according to the similarity distance in the embedding space of the history context, and the memory bank can be constructed conveniently from any matched image-text set, e.g., the previous training data. Incorporating this non-parametric memory-augmented method into various captioning baselines, the performance of the resulting captioners improves consistently on the evaluation benchmark. More encouragingly, extensive experiments demonstrate that our approach holds the capability for efficiently adapting to larger training datasets, by simply transferring the memory bank without any additional training.
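The recall step, finding the memory entries whose stored context embeddings are closest to the current context, is essentially a nearest-neighbour lookup. The NumPy sketch below uses cosine similarity over a toy memory bank; the embedding dimension and the similarity measure are assumptions rather than the paper's exact choices.

```python
import numpy as np


def recall_from_memory(query, memory_keys, memory_texts, k=3):
    """Return the k memory captions whose context embeddings are most similar to `query`."""
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                                # cosine similarity to every memory entry
    top = np.argsort(-sims)[:k]
    return [(memory_texts[i], float(sims[i])) for i in top]

# Toy memory bank built from any matched image-text set (illustrative only).
memory_keys = np.random.randn(100, 4)
memory_texts = [f"caption {i}" for i in range(100)]
print(recall_from_memory(np.random.randn(4), memory_keys, memory_texts))
```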
20

Ushiku, Yoshitaka. "1. Image/Video Captioning." Journal of The Institute of Image Information and Television Engineers 72, no. 9 (2018): 649–54. http://dx.doi.org/10.3169/itej.72.649.

21

Yang, Zuopeng, Pengbo Wang, Tianshu Chu, and Jie Yang. "Human-Centric Image Captioning." Pattern Recognition 126 (June 2022): 108545. http://dx.doi.org/10.1016/j.patcog.2022.108545.

22

Wu, Hanjie, Yongtuo Liu, Hongmin Cai, and Shengfeng He. "Learning Transferable Perturbations for Image Captioning." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (May 31, 2022): 1–18. http://dx.doi.org/10.1145/3478024.

Abstract:
Present studies have discovered that state-of-the-art deep learning models can be attacked by small but well-designed perturbations. Existing attack algorithms for the image captioning task is time-consuming, and their generated adversarial examples cannot transfer well to other models. To generate adversarial examples faster and stronger, we propose to learn the perturbations by a generative model that is governed by three novel loss functions. Image feature distortion loss is designed to maximize the encoded image feature distance between original images and the corresponding adversarial examples at the image domain, and local-global mismatching loss is introduced to separate the mapping encoding representation of the adversarial images and the ground true captions from a local and global perspective in the common semantic space as far as possible cross image and caption domain. Language diversity loss is to make the image captions generated by the adversarial examples as different as possible from the correct image caption at the language domain. Extensive experiments show that our proposed generative model can efficiently generate adversarial examples that successfully generalize to attack image captioning models trained on unseen large-scale datasets or with different architectures, or even the image captioning commercial service.
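The image-feature-distortion idea can be illustrated with a one-step, FGSM-style perturbation that pushes the encoded feature of the perturbed image away from the clean feature. This is only a simplified stand-in: the paper trains a generative model with three losses, which is not reproduced here, and the ResNet-50 encoder and the 8/255 budget are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
encoder.fc = nn.Identity()                        # expose the 2048-d pooled feature
for p in encoder.parameters():
    p.requires_grad_(False)

image = torch.rand(1, 3, 224, 224)                # dummy input in [0, 1]
with torch.no_grad():
    clean_feat = encoder(image)

delta = torch.zeros_like(image, requires_grad=True)
distortion = torch.norm(encoder(image + delta) - clean_feat, p=2)
distortion.backward()                             # gradient of the feature distance w.r.t. the noise

epsilon = 8 / 255                                 # perturbation budget (assumption)
adversarial = (image + epsilon * delta.grad.sign()).clamp(0, 1)
```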
23

Nursikuwagus, Agus, Rinaldi Munir, and Masayu Layla Khodra. "Image Captioning menurut Scientific Revolution Kuhn dan Popper." Jurnal Manajemen Informatika (JAMIKA) 10, no. 2 (October 1, 2020): 110–21. http://dx.doi.org/10.34010/jamika.v10i2.2630.

Abstract:
Generating captions for images is a new area of development in the field of artificial intelligence. Image captioning combines several fields such as computer vision, natural language, and machine learning. The main concern in image captioning is the accuracy of the modelled neural network architecture in obtaining results as close as possible to the ground truth provided by a person. Several studies that have been carried out still produce sentences that remain far from that ground truth. The problems generally discussed in image captioning concern the image generator and the text generator, i.e., the use of deep learning models such as CNN and LSTM to solve the captioning problem. This forms the basis for making a new contribution to the field of image captioning, covering the image extractor, text generator, and evaluator that can be used in the proposed model. From the perspective of Kuhn and Popper on image captioning, it is found that captions in the field of geology are much needed and have reached a stage of crisis. A newly proposed method is needed to present captions for geological images.
24

Sharma, Himanshu, and Anand Singh Jalal. "Incorporating external knowledge for image captioning using CNN and LSTM." Modern Physics Letters B 34, no. 28 (July 16, 2020): 2050315. http://dx.doi.org/10.1142/s0217984920503157.

Abstract:
Image captioning is a multidisciplinary artificial intelligence (AI) research task that has captured the interest of both image and natural language processing experts. Image captioning is a complex problem, as it sometimes requires accessing information that may not be directly visible in a given scene; it may require common-sense interpretation or detailed knowledge about the objects present in the image. In this paper, we present a method that utilizes both visual knowledge and external knowledge from knowledge bases such as ConceptNet to better describe images. We demonstrate the usefulness of the method on two publicly available datasets, Flickr8k and Flickr30k. The results show that the proposed model outperforms state-of-the-art approaches for generating image captions. Finally, we discuss possible future prospects in image captioning.
25

Dognin, Pierre, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, and Brian Belgodere. "Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge." Journal of Artificial Intelligence Research 73 (January 31, 2022): 437–59. http://dx.doi.org/10.1613/jair.1.13113.

Abstract:
Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated dataset like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems. This article appears in the special track on AI & Society.
26

Vellakani, Sivamurugan, and Indumathi Pushbam. "An enhanced OCT image captioning system to assist ophthalmologists in detecting and classifying eye diseases." Journal of X-Ray Science and Technology 28, no. 5 (September 19, 2020): 975–88. http://dx.doi.org/10.3233/xst-200697.

Abstract:
The human eye is affected by different diseases, including choroidal neovascularization (CNV), diabetic macular edema (DME) and age-related macular degeneration (AMD). This work aims to design an artificial intelligence (AI) based clinical decision support system for eye disease detection and classification, to help ophthalmologists more effectively detect and classify CNV, DME and drusen from Optical Coherence Tomography (OCT) images depicting different tissues. The methodology used for designing this system involves different deep learning convolutional neural network (CNN) models and long short-term memory (LSTM) networks. The best image captioning model is selected after a performance analysis comparing nine different image captioning systems for assisting ophthalmologists in detecting and classifying eye diseases. The quantitative analysis shows that the image captioning model built with DenseNet201 and LSTM has superior performance, with an overall accuracy of 0.969, a positive predictive value of 0.972 and a true-positive rate of 0.969, using OCT images enhanced by a generative adversarial network (GAN). The corresponding performance values for the Xception-with-LSTM image captioning model are 0.969, 0.969 and 0.938, respectively. Thus, these two models yield superior performance and have the potential to assist ophthalmologists in making optimal diagnostic decisions.
27

Mohamad Nezami, Omid, Mark Dras, Stephen Wan, and Cecile Paris. "Image Captioning using Facial Expression and Attention." Journal of Artificial Intelligence Research 68 (August 6, 2020): 661–89. http://dx.doi.org/10.1613/jair.1.12025.

Abstract:
Benefiting from advances in machine vision and natural language processing techniques, current image captioning systems are able to generate detailed visual descriptions. For the most part, these descriptions represent an objective characterisation of the image, although some models do incorporate subjective aspects related to the observer’s view of the image, such as sentiment; current models, however, usually do not consider the emotional content of images during the caption generation process. This paper addresses this issue by proposing novel image captioning models which use facial expression features to generate image captions. The models generate image captions using long short-term memory networks applying facial features in addition to other visual features at different time steps. We compare a comprehensive collection of image captioning models with and without facial features using all standard evaluation metrics. The evaluation metrics indicate that applying facial features with an attention mechanism achieves the best performance, showing more expressive and more correlated image captions, on an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the generated captions finds that, perhaps unexpectedly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.
28

Mayank and Naveen Kumar Gondhi. "Comparative Assessment of Image Captioning Models." Journal of Computational and Theoretical Nanoscience 17, no. 1 (January 1, 2020): 473–78. http://dx.doi.org/10.1166/jctn.2020.8693.

Abstract:
Image Captioning is the combination of Computer Vision and Natural Language Processing (NLP) in which simple sentences have been automatically generated describing the content of the image. This paper presents the comparative analysis of different models used for the generation of descriptive English captions for a given image. Feature extractions of the images are done using Convolutional Neural Networks (CNN). These features are then, passed onto Recurrent Neural Networks (RNN) or Long Short-term Memory (LSTM) to generate captions in English language. The evaluation metrics used to appraise the conduct of the models are BLEU score, CIDEr and METEOR.
29

Veena, S., K. S. Ashwin, and Prateek Gupta. "Comparison of various CNN encoders for image captioning." Journal of Physics: Conference Series 2335, no. 1 (September 1, 2022): 012029. http://dx.doi.org/10.1088/1742-6596/2335/1/012029.

Abstract:
Image captioning is the ability of machines to study the features of an image and then give a textual description of that image. It uses computer vision to extract the features from the image and natural language processing to generate the caption, so it requires both image analysis and language generation. It is an important task for further research on visual intelligence in line with human perception. In this project we compare the different pre-trained CNN models that are available, such as VGG16, ResNet50, Xception and Inception, and find out which performs better for image captioning.
30

Yang, Haiyu, Haiyu Song, Wei Li, Kexin Qin, Haoyu Shi, and Qi Jiao. "Social Image Annotation Based on Image Captioning." WSEAS TRANSACTIONS ON SIGNAL PROCESSING 18 (May 19, 2022): 109–15. http://dx.doi.org/10.37394/232014.2022.18.15.

Abstract:
With the popularity of new social media, automatic image annotation (AIA) has been an active research topic due to its great importance in image retrieval, understanding, and management. Despite their relative success, most annotation models suffer from low-level visual representations and the semantic gap. To address these shortcomings, we propose a novel annotation method utilizing textual features generated by image captioning, in contrast to all previous methods that use visual features as the image feature. In our method, each image is regarded as a label vector of k user-provided textual tags rather than a visual vector. We summarize our method as follows. First, the image visual features are extracted by combining the deep residual network and the object detection model, which are encoded and decoded by the mesh-connected Transformer network model. Then, the textual modal feature vector of the image is constructed by removing stop-words and retaining high-frequency tags. Finally, the textual feature vector of the image is applied to the propagation annotation model to generate high-quality image annotation labels. Experimental results conducted on the standard MS-COCO dataset demonstrate that the proposed method significantly outperforms existing classical models, mainly benefiting from the proposed textual feature generated by image captioning technology.
31

Bhalekar, M., and M. Bedekar. "The New Dataset MITWPU-1K for Object Recognition and Image Captioning Tasks." Engineering, Technology & Applied Science Research 12, no. 4 (August 7, 2022): 8803–8. http://dx.doi.org/10.48084/etasr.5039.

Abstract:
In the domain of image captioning, many pre-trained datasets are available. Using these datasets, models can be trained to automatically generate image descriptions regarding the contents of an image. Researchers usually do not spend much time in creating and training the new dataset before using it for a specific application, instead, they simply use existing pre-trained datasets. MS COCO, ImageNet, Flicker, and Pascal VOC, are well-known datasets that are widely used in the task of generating image captions. In most available image captioning datasets, image textual information, which can play a vital role in generating more precise image descriptions, is missing. This paper presents the process of creating a new dataset that consists of images along with text and captions. Images of the nearby vicinity of the campus of MIT World Peace University-MITWPU, India, were taken for the new dataset named MITWPU-1K. This dataset can be used in object detection and caption generation of images. The objective of this paper is to highlight the steps required for creating a new dataset. This necessitated a review of the existing dataset models prior to creating the new dataset. A sequential convolutional model for detecting objects on a new dataset is also presented. The process of creating a new image captioning dataset and the gained insights are described.
32

Song, Lingyun, Jun Liu, Buyue Qian, and Yihe Chen. "Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8885–92. http://dx.doi.org/10.1609/aaai.v33i01.33018885.

Abstract:
Image captioning and visual language grounding are two important tasks for image understanding, but are seldom considered together. In this paper, we propose a Progressive Attention-Guided Network (PAGNet), which simultaneously generates image captions and predicts bounding boxes for caption words. PAGNet mainly has two distinctive properties: i) It can progressively refine the predictive results of image captioning, by updating the attention map with the predicted bounding boxes. ii) It learns bounding boxes of the words using a weakly supervised strategy, which combines the frameworks of Multiple Instance Learning (MIL) and Markov Decision Process (MDP). By using the attention map generated in the captioning process, PAGNet significantly reduces the search space of the MDP. We conduct experiments on benchmark datasets to demonstrate the effectiveness of PAGNet and results show that PAGNet achieves the best performance.
33

Song, Zeliang, Xiaofei Zhou, Zhendong Mao, and Jianlong Tan. "Image Captioning with Context-Aware Auxiliary Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 3 (May 18, 2021): 2584–92. http://dx.doi.org/10.1609/aaai.v35i3.16361.

Abstract:
Image captioning is a challenging computer vision task which aims to generate a natural language description of an image. Most recent research follows the encoder-decoder framework, which depends heavily on the previously generated words for the current prediction. Such methods cannot effectively take advantage of future predicted information to learn complete semantics. In this paper, we propose a Context-Aware Auxiliary Guidance (CAAG) mechanism that can guide the captioning model to perceive global contexts. On top of the captioning model, CAAG performs semantic attention that selectively concentrates on useful information from the global predictions to reproduce the current generation. To validate the adaptability of the method, we apply CAAG to three popular captioners, and our proposal achieves competitive performance on the challenging Microsoft COCO image captioning benchmark, e.g. a 132.2 CIDEr-D score on the Karpathy split and a 130.7 CIDEr-D (c40) score on the official online evaluation server.
34

Singh, Yajush Pratap, Sayed Abu Lais Ezaz Ahmed, Prabhishek Singh, Neeraj Kumar, and Manoj Diwakar. "Image Captioning using Artificial Intelligence." Journal of Physics: Conference Series 1854, no. 1 (April 1, 2021): 012048. http://dx.doi.org/10.1088/1742-6596/1854/1/012048.

35

Li, Nannan, Zhenzhong Chen, and Shan Liu. "Meta Learning for Image Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8626–33. http://dx.doi.org/10.1609/aaai.v33i01.33018626.

Abstract:
Reinforcement learning (RL) has shown its advantages in image captioning by optimizing the non-differentiable metric directly in the reward learning process. However, due to the reward hacking problem in RL, maximizing reward may not lead to better quality of the caption, especially from the aspects of propositional content and distinctiveness. In this work, we propose to use a new learning method, meta learning, to utilize supervision from the ground truth whilst optimizing the reward function in RL. To improve the propositional content and the distinctiveness of the generated captions, the proposed model provides the global optimal solution by taking different gradient steps towards the supervision task and the reinforcement task, simultaneously. Experimental results on MS COCO validate the effectiveness of our approach when compared with the state-of-the-art methods.
36

Fei, Zhengcong. "Partially Non-Autoregressive Image Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1309–16. http://dx.doi.org/10.1609/aaai.v35i2.16219.

Abstract:
Current state-of-the-art image captioning systems usually generate descriptions autoregressively, i.e., every forward step conditions on the given image and the previously produced words. This sequential nature causes an unavoidable decoding latency. Non-autoregressive image captioning, on the other hand, predicts the entire sentence simultaneously and accelerates the inference process significantly. However, it removes the dependencies within a caption and commonly suffers from repetition or missing-word issues. To make a better trade-off between speed and quality, we introduce a partially non-autoregressive model, named PNAIC, which considers a caption as a series of concatenated word groups. The groups are generated in parallel globally, while each word within a group is predicted from left to right, and thus the captioner can create multiple discontinuous words concurrently at each time step. More importantly, by incorporating curriculum learning-based training tasks of group length prediction and invalid group deletion, our model is capable of generating accurate captions as well as preventing common incoherence errors. Extensive experiments on the MS COCO benchmark demonstrate that our proposed method achieves more than a 3.5× speedup while maintaining competitive performance.
37

Aghav, Jagannath. "Image Captioning using Deep Learning." International Journal for Research in Applied Science and Engineering Technology 8, no. 6 (June 30, 2020): 1430–35. http://dx.doi.org/10.22214/ijraset.2020.6232.

38

Han, Meng, Wenyu Chen, and Alemu Dagmawi Moges. "Fast image captioning using LSTM." Cluster Computing 22, S3 (March 29, 2018): 6143–55. http://dx.doi.org/10.1007/s10586-018-1885-9.

39

Long, Cuirong, Xiaoshan Yang, and Changsheng Xu. "Cross-domain personalized image captioning." Multimedia Tools and Applications 79, no. 45-46 (March 27, 2019): 33333–48. http://dx.doi.org/10.1007/s11042-019-7441-7.

40

Li, Jiangyun, Peng Yao, Longteng Guo, and Weicun Zhang. "Boosted Transformer for Image Captioning." Applied Sciences 9, no. 16 (August 9, 2019): 3260. http://dx.doi.org/10.3390/app9163260.

Abstract:
Image captioning attempts to generate a description given an image, usually taking Convolutional Neural Network as the encoder to extract the visual features and a sequence model, among which the self-attention mechanism has achieved advanced progress recently, as the decoder to generate descriptions. However, this predominant encoder-decoder architecture has some problems to be solved. On the encoder side, without the semantic concepts, the extracted visual features do not make full use of the image information. On the decoder side, the sequence self-attention only relies on word representations, lacking the guidance of visual information and easily influenced by the language prior. In this paper, we propose a novel boosted transformer model with two attention modules for the above-mentioned problems, i.e., “Concept-Guided Attention” (CGA) and “Vision-Guided Attention” (VGA). Our model utilizes CGA in the encoder, to obtain the boosted visual features by integrating the instance-level concepts into the visual features. In the decoder, we stack VGA, which uses the visual information as a bridge to model internal relationships among the sequences and can be an auxiliary module of sequence self-attention. Quantitative and qualitative results on the Microsoft COCO dataset demonstrate the better performance of our model than the state-of-the-art approaches.
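For readers unfamiliar with the transformer decoders these captioners build on, the sketch below runs a stock PyTorch transformer decoder over a set of image-region features with a causal mask. It is a plain baseline only; the CGA and VGA modules proposed in the paper are not reproduced, and all dimensions, the region count, and the vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embed = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
out_proj = nn.Linear(d_model, vocab_size)

regions = torch.randn(2, 36, d_model)             # e.g. 36 region features per image
words = torch.randint(0, vocab_size, (2, 12))     # partially generated captions
causal_mask = nn.Transformer.generate_square_subsequent_mask(words.size(1))

hidden = decoder(embed(words), memory=regions, tgt_mask=causal_mask)
logits = out_proj(hidden)                         # (2, 12, vocab_size) next-word scores
```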
41

Yang, Xiaoshan, and Changsheng Xu. "Image Captioning by Asking Questions." ACM Transactions on Multimedia Computing, Communications, and Applications 15, no. 2s (August 12, 2019): 1–19. http://dx.doi.org/10.1145/3313873.

42

Omri, Mohamed, Sayed Abdel-Khalek, Eied M. Khalil, Jamel Bouslimi, and Gyanendra Prasad Joshi. "Modeling of Hyperparameter Tuned Deep Learning Model for Automated Image Captioning." Mathematics 10, no. 3 (January 18, 2022): 288. http://dx.doi.org/10.3390/math10030288.

Abstract:
Image processing remains a hot research topic among research communities due to its applicability in several areas. An important application of image processing is automatic image captioning, which aims to generate a proper description of an image in natural language automatically. Image captioning is a recently developed research topic that has started to receive significant attention in the fields of computer vision and natural language processing (NLP). Since image captioning is considered a challenging task, the recently developed deep learning (DL) models have attained significant performance, though with increased complexity and computational cost. Keeping these issues in mind, in this paper a novel hyperparameter-tuned DL for automated image captioning (HPTDL-AIC) technique is proposed. The HPTDL-AIC technique encompasses two major parts, namely the encoder and the decoder. The encoder utilizes Faster SqueezeNet with RMSProp to generate an effective depiction of the input image as a vector of predefined length. At the same time, the decoder employs a bird swarm algorithm (BSA) with a long short-term memory (LSTM) model to concentrate on the generation of description sentences. The design of RMSProp and BSA for the hyperparameter tuning of the Faster SqueezeNet and LSTM models for image captioning shows the novelty of the work and helps accomplish enhanced image captioning performance. The experimental validation of the HPTDL-AIC technique is carried out on two benchmark datasets, and the extensive comparative study points out the improved performance of the HPTDL-AIC technique over recent approaches.
43

Bhalekar, M., and M. Bedekar. "D-CNN: A New model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals." Engineering, Technology & Applied Science Research 12, no. 2 (April 9, 2022): 8366–73. http://dx.doi.org/10.48084/etasr.4772.

Abstract:
Automatically describing the information of an image using properly constructed sentences is a tricky task in any language. However, it has the potential to have a significant effect by enabling visually challenged individuals to better understand their surroundings. This paper proposes an image captioning system that generates detailed captions and extracts text from an image, if any, and uses it as a part of the caption to provide a more precise description of the image. To extract the image features, the proposed model uses Convolutional Neural Networks (CNNs) followed by Long Short-Term Memory (LSTM) that generates corresponding sentences based on the learned image features. Further, using the text extraction module, the extracted text (if any) is included in the image description and the captions are presented in audio form. Publicly available benchmark datasets for image captioning like MS COCO, Flickr-8k, Flickr-30k have a variety of images, but they hardly have images that contain textual information. These datasets are not sufficient for the proposed model and this has resulted in the creation of a new image caption dataset that contains images with textual content. With the newly created dataset, comparative analysis of the experimental results is performed on the proposed model and the existing pre-trained model. The obtained experimental results show that the proposed model is equally effective as the existing one in subtitle image captioning models and provides more insights about the image by performing text extraction.
APA, Harvard, Vancouver, ISO, and other styles
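A hedged sketch of the caption-plus-text-extraction idea from the entry above, not the authors' D-CNN: a trained captioner produces a base caption, an off-the-shelf OCR engine extracts any embedded text, and the two are merged into one description. generate_caption is a hypothetical stand-in for a trained CNN+LSTM model, and pytesseract assumes the Tesseract binary is installed.

# Illustrative caption + OCR pipeline (assumptions noted in the lead-in).
from PIL import Image
import pytesseract                      # requires the Tesseract OCR binary

def generate_caption(image):
    # Placeholder for a trained captioning model such as the one sketched earlier.
    return "a street sign next to a road"

def describe(image_path):
    image = Image.open(image_path)
    caption = generate_caption(image)
    embedded_text = pytesseract.image_to_string(image).strip()
    if embedded_text:
        # Fold the OCR result into the caption, mirroring the idea in the entry above.
        caption += f' with the text "{embedded_text}" written on it'
    return caption

# print(describe("sign.jpg"))           # e.g. 'a street sign ... with the text "STOP" written on it'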
44

Yang, Li-Chuan, Chih-Yuan Yang, and Jane Yung-jen Hsu. "Object Relation Attention for Image Paragraph Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 4 (May 18, 2021): 3136–44. http://dx.doi.org/10.1609/aaai.v35i4.16423.

Full text
Abstract:
Image paragraph captioning aims to automatically generate a paragraph from a given image. It extends image captioning by generating multiple sentences instead of a single one, and it is more challenging because paragraphs are longer, more informative, and more linguistically complicated. Because a paragraph consists of several sentences, an effective image paragraph captioning method should generate consistent sentences rather than contradictory ones. How to achieve this goal is still an open question, and to this end we propose a method that incorporates objects' spatial coherence into a language-generating model. For every two overlapping objects, the proposed method concatenates their raw visual features to create two directional pair features and learns weights that turn those pair features into relation-aware object features for the language-generating model. Experimental results show that the proposed network extracts effective object features for image paragraph captioning and achieves promising performance against existing methods. (A code sketch of the pair-feature idea follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
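A minimal sketch of relation-aware object features in the spirit of the entry above, not the authors' implementation: for each pair of overlapping detected objects, the region features are concatenated in both directions, scored by a small learned network, and aggregated back onto each object. Detection boxes and region features are assumed to come from an upstream detector; dimensions are assumptions (PyTorch).

# Hedged sketch of relation-aware object features from overlapping object pairs.
import torch
import torch.nn as nn

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

class RelationAttention(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feats, boxes):                     # feats: (N, dim), boxes: list of N boxes
        relation = feats.clone()
        for i in range(feats.size(0)):
            # Directional pair features: object i concatenated with every overlapping object j.
            pairs = [torch.cat([feats[i], feats[j]]) for j in range(feats.size(0))
                     if j != i and iou(boxes[i], boxes[j]) > 0]
            if not pairs:
                continue
            pairs = torch.stack(pairs)                           # (P, 2*dim)
            weights = torch.softmax(self.score(pairs), dim=0)    # learned weight per pair
            relation[i] = relation[i] + (weights * self.proj(pairs)).sum(dim=0)
        return relation                                  # relation-aware features for the decoder

feats = torch.randn(5, 2048)
boxes = [(0, 0, 50, 50), (30, 30, 80, 80), (100, 100, 150, 150), (40, 10, 90, 60), (0, 40, 60, 90)]
print(RelationAttention()(feats, boxes).shape)           # torch.Size([5, 2048])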
45

Poleak, Chanrith, and Jangwoo Kwon. "Parallel Image Captioning Using 2D Masked Convolution." Applied Sciences 9, no. 9 (May 7, 2019): 1871. http://dx.doi.org/10.3390/app9091871.

Full text
Abstract:
Automatically generating a novel description of an image is a challenging and important problem that brings together advanced research in both computer vision and natural language processing. In recent years, image captioning has significantly improved its performance by using long short-term memory (LSTM) as the decoder of the language model. Despite this improvement, however, LSTM has its own shortcomings as a model: its structure is complicated and its computation is inherently sequential. This paper proposes a model that uses a simple convolutional network for both the encoder and the decoder of an image captioning system, instead of the current state-of-the-art approach. Our experiments with this model on the Microsoft Common Objects in Context (MSCOCO) captioning dataset yielded results that are competitive with the state-of-the-art image captioning model across different evaluation metrics, while having a much simpler model and enabling parallel graphics processing unit (GPU) computation during training, resulting in a faster training time. (A minimal masked-convolution decoder sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
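The paper above uses 2D masked convolutions; the sketch below substitutes a simpler 1D causal convolution to illustrate the same property: each position only sees earlier tokens, so the whole caption can be decoded in parallel during training instead of step by step as with an LSTM. All sizes and the conditioning scheme are assumptions (PyTorch).

# Minimal causal-convolution caption decoder (a stand-in for 2D masked convolution).
import torch
import torch.nn as nn

class CausalConvDecoder(nn.Module):
    def __init__(self, vocab_size=5000, dim=256, kernel_size=3, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList([nn.Conv1d(dim, dim, kernel_size) for _ in range(layers)])
        self.pad = kernel_size - 1                           # left padding keeps the decoder causal
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_vec):                    # tokens: (B, T), image_vec: (B, dim)
        x = self.embed(tokens) + image_vec.unsqueeze(1)      # condition every position on the image
        x = x.transpose(1, 2)                                # (B, dim, T) for Conv1d
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.pad, 0))))  # pad on the left only
        return self.out(x.transpose(1, 2))                   # (B, T, vocab_size)

decoder = CausalConvDecoder()
tokens = torch.randint(0, 5000, (2, 12))
image_vec = torch.randn(2, 256)
print(decoder(tokens, image_vec).shape)                      # torch.Size([2, 12, 5000])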
46

Kwon, Hyun, and SungHwan Kim. "Restricted-Area Adversarial Example Attack for Image Captioning Model." Wireless Communications and Mobile Computing 2022 (July 7, 2022): 1–9. http://dx.doi.org/10.1155/2022/9962972.

Full text
Abstract:
Deep neural networks provide good performance in the fields of image recognition, speech recognition, and text recognition. For example, image captioning models use recurrent neural networks to generate text after an image recognition step, thereby providing captions for the images. The image captioning model first extracts features from the image and generates a representation vector; it then generates the text for the image caption using the recurrent neural network. This model has a weakness, however: it is vulnerable to adversarial examples. In this paper, we propose a method for generating restricted adversarial examples that target image captioning models. By adding a minimal amount of noise to only a specific area of an original sample image, the proposed method creates an adversarial example that remains correctly recognizable to humans yet is misinterpreted by the target model. We evaluated the method's performance through experiments on the MS COCO dataset using TensorFlow as the machine learning library. The results show that the proposed method generates a restricted adversarial example that is misinterpreted by the target model while minimizing its distortion from the original sample. (A masked-perturbation sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
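The following is a hedged sketch of the restricted-area idea, not the paper's exact attack: a generic FGSM-style step whose noise is confined to a rectangular region by a binary mask. The target captioning model and its loss function are stand-ins passed in by the caller (PyTorch).

# Restricted-area adversarial perturbation: noise only inside a masked region.
import torch

def restricted_fgsm(model, loss_fn, image, target, region, epsilon=0.03):
    """image: (1, 3, H, W) in [0, 1]; region: (top, left, height, width) allowed to change."""
    mask = torch.zeros_like(image)
    t, l, h, w = region
    mask[..., t:t + h, l:l + w] = 1.0                # noise is permitted only inside this area

    image = image.clone().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()

    # Perturb only the masked region, in the direction that increases the loss.
    adversarial = image + epsilon * mask * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Usage (hypothetical model and loss):
# adv = restricted_fgsm(captioner, caption_loss, image, target_tokens, region=(50, 50, 40, 40))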
47

Li, Yangyang, Shuangkang Fang, Licheng Jiao, Ruijiao Liu, and Ronghua Shang. "A Multi-Level Attention Model for Remote Sensing Image Captions." Remote Sensing 12, no. 6 (March 13, 2020): 939. http://dx.doi.org/10.3390/rs12060939.

Full text
Abstract:
The task of image captioning involves generating a sentence that describes an image appropriately, at the intersection of computer vision and natural language processing. Although research on remote sensing image captioning has only just started, it has great significance. The attention mechanism, inspired by the way humans think, is widely used in remote sensing image captioning tasks. However, the attention mechanism currently used in this task is mainly aimed at images alone, which is too simple to express such a complex task well. Therefore, in this paper, we propose a multi-level attention model that imitates human attention mechanisms more closely. The model contains three attention structures, representing attention to different areas of the image, attention to different words, and attention to vision versus semantics. Experiments show that our model achieves better results than previous methods and is currently state of the art. In addition, the existing datasets for remote sensing image captioning contain a large number of errors; therefore, considerable work has been done in this paper to correct the existing datasets in order to promote research on remote sensing image captioning. (A sketch of the three attention levels follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
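An illustrative decoding step loosely following the three attention levels named in the entry above: attention over image regions, attention over previously generated word embeddings, and a learned gate that balances the visual and semantic contexts. Dimensions, module names, and the gating form are assumptions, not the authors' implementation (PyTorch).

# Three-level attention step: regions, past words, and a vision-vs-semantics gate.
import torch
import torch.nn as nn

class MultiLevelAttentionStep(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.visual_att = nn.Linear(2 * dim, 1)
        self.word_att = nn.Linear(2 * dim, 1)
        self.gate = nn.Linear(dim, 1)

    def attend(self, scorer, items, hidden):             # items: (B, N, dim), hidden: (B, dim)
        h = hidden.unsqueeze(1).expand(-1, items.size(1), -1)
        weights = torch.softmax(scorer(torch.cat([items, h], dim=-1)), dim=1)
        return (weights * items).sum(dim=1)               # (B, dim) weighted context

    def forward(self, regions, word_embs, hidden):
        visual = self.attend(self.visual_att, regions, hidden)      # attention over image areas
        semantic = self.attend(self.word_att, word_embs, hidden)    # attention over past words
        beta = torch.sigmoid(self.gate(hidden))                     # vision-vs-semantics gate
        return beta * visual + (1 - beta) * semantic                # fused context for word prediction

step = MultiLevelAttentionStep()
context = step(torch.randn(2, 49, 512), torch.randn(2, 7, 512), torch.randn(2, 512))
print(context.shape)                                                 # torch.Size([2, 512])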
48

Guan, Zhibin, Kang Liu, Yan Ma, Xu Qian, and Tongkai Ji. "Sequential Dual Attention: Coarse-to-Fine-Grained Hierarchical Generation for Image Captioning." Symmetry 10, no. 11 (November 12, 2018): 626. http://dx.doi.org/10.3390/sym10110626.

Full text
Abstract:
Image caption generation is a fundamental task that builds a bridge between an image and its textual description, and it is drawing increasing interest in artificial intelligence. Images and textual sentences are viewed as two different carriers of information, symmetric and unified in the same content of a visual scene. Existing image captioning methods rarely consider generating the final description sentence in a coarse-grained to fine-grained way, which is how humans understand their surrounding scenes, and the generated sentence sometimes describes only coarse-grained image content. Therefore, we propose a coarse-to-fine-grained hierarchical generation method for image captioning, named SDA-CFGHG, to address these two problems. The core of the SDA-CFGHG method is a sequential dual attention that fuses visual information of different granularities in a sequential manner. The advantage of SDA-CFGHG is that it achieves image captioning in a coarse-to-fine-grained way and the generated sentence can capture details of the raw image to some degree. Moreover, we validate the impressive performance of our method on the benchmark datasets MS COCO and Flickr with several popular evaluation metrics: CIDEr, SPICE, METEOR, ROUGE-L, and BLEU. (A sequential dual-attention sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
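A minimal sketch of one way sequential dual attention could fuse coarse and fine visual information, offered only as an illustration under assumed shapes and names: a first attention pass over coarse grid features produces a context that then queries a second pass over fine-grained region features, so the focus is refined from coarse to fine before the next word is predicted (PyTorch).

# Sequential dual attention: the second pass is queried with the first pass's context.
import torch
import torch.nn as nn

class SequentialDualAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.coarse_score = nn.Linear(2 * dim, 1)
        self.fine_score = nn.Linear(2 * dim, 1)

    def attend(self, scorer, feats, query):               # feats: (B, N, dim), query: (B, dim)
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = torch.softmax(scorer(torch.cat([feats, q], dim=-1)), dim=1)
        return (w * feats).sum(dim=1)

    def forward(self, coarse_feats, fine_feats, hidden):
        coarse_ctx = self.attend(self.coarse_score, coarse_feats, hidden)
        # Query the fine-grained features with the coarse context, not the raw hidden state.
        fine_ctx = self.attend(self.fine_score, fine_feats, coarse_ctx)
        return torch.cat([coarse_ctx, fine_ctx], dim=-1)   # fused coarse + fine context

sda = SequentialDualAttention()
out = sda(torch.randn(2, 16, 512), torch.randn(2, 36, 512), torch.randn(2, 512))
print(out.shape)                                           # torch.Size([2, 1024])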
49

Chen, Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, and Qi Ju. "Improving Image Captioning with Conditional Generative Adversarial Nets." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8142–50. http://dx.doi.org/10.1609/aaai.v33i01.33018142.

Full text
Abstract:
In this paper, we propose a novel conditional-generative-adversarial-nets-based image captioning framework as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture. To deal with the inconsistent evaluation problem among different objective language metrics, we are motivated to design "discriminator" networks that automatically and progressively determine whether a generated caption is human-described or machine-generated. Two kinds of discriminator architectures (CNN- and RNN-based structures) are introduced, since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing RL-based image captioning framework, and we show that the conventional RL training method is just a special case of our approach. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. In addition, the well-trained discriminators can also be viewed as objective image captioning evaluators. (A discriminator sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
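A hedged sketch of the discriminator idea from the entry above, not the paper's network: a small CNN-over-text discriminator scores an (image, caption) pair as human-written or machine-generated, and that probability can be mixed into the RL reward of an existing captioner. Vocabulary size, feature dimensions, and the reward-mixing formula are assumptions (PyTorch).

# Conditional caption discriminator: P(caption is human-written | image features).
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, img_dim=2048, kernels=(2, 3, 4), channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.convs = nn.ModuleList([nn.Conv1d(embed_dim, channels, k) for k in kernels])
        self.classify = nn.Linear(channels * len(kernels), 1)

    def forward(self, image_feat, captions):               # image_feat: (B, img_dim), captions: (B, T)
        # Prepend the projected image feature so the caption is judged conditionally on the image.
        x = torch.cat([self.img_proj(image_feat).unsqueeze(1), self.embed(captions)], dim=1)
        x = x.transpose(1, 2)                               # (B, embed_dim, T+1)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.sigmoid(self.classify(torch.cat(pooled, dim=1)))   # (B, 1)

disc = CaptionDiscriminator()
p_human = disc(torch.randn(2, 2048), torch.randint(0, 5000, (2, 12)))
# Hypothetical reward mix for RL training: reward = lam * metric_score + (1 - lam) * p_human
print(p_human.shape)                                        # torch.Size([2, 1])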
50

Atliha, Viktar, and Dmitrij Šešok. "Text Augmentation Using BERT for Image Captioning." Applied Sciences 10, no. 17 (August 28, 2020): 5978. http://dx.doi.org/10.3390/app10175978.

Full text
Abstract:
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly and a number of impressive results have been achieved. The typical models are based on neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. In addition, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality by a lack of data: generating a variety of descriptions of objects in different situations requires a large training set. The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods, including augmentation with synonyms as a baseline and the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on a dataset without augmentation. (A BERT fill-mask augmentation sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
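An illustrative caption-augmentation routine in the spirit of the entry above: randomly mask one word of an existing caption and let a pretrained BERT fill-mask model propose a replacement, producing extra training captions. The model name, the single-word masking choice, and the sampling strategy are assumptions; the Hugging Face transformers library and a downloaded checkpoint are required.

# Caption augmentation with a BERT fill-mask model (illustrative sketch).
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption, num_variants=3):
    words = caption.split()
    variants = []
    for _ in range(num_variants):
        i = random.randrange(len(words))                    # pick one word position to mask
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        prediction = fill_mask(masked, top_k=1)[0]          # best BERT suggestion for the blank
        variants.append(prediction["sequence"])
    return variants

print(augment_caption("a man riding a horse on the beach"))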