Abstract
Generating descriptive text from images, known as caption generation, is a noteworthy research field with potential applications, including aiding the visually impaired. Recently, numerous methods based on deep learning have been proposed. Previous methods learn the relationship between image features and captions from a large dataset of image-caption pairs. However, it is difficult to correctly learn all objects, object attributes, and relationships between objects. Consequently, incorrect captions are occasionally generated, for example, captions describing objects that do not appear in the image. In this study, we propose a scoring method that uses object detection and Word2Vec to output a caption consistent with the objects in the image. First, multiple candidate captions are generated. Next, object detection is performed, and a score is calculated from the detected object labels and the nouns extracted from each caption. Finally, the caption with the highest score is output. Experimental evaluation on the Microsoft Common Objects in Context (MSCOCO) dataset demonstrates that the proposed method is effective in improving the accuracy of caption generation.
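The abstract's scoring step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny hand-written vectors stand in for a trained Word2Vec model, the noun lists stand in for a part-of-speech extraction step, and the scoring rule (each caption noun matched to its most similar detected label, then averaged) is one plausible reading of "the score is calculated using the resulting labels from object detection and the nouns extracted from each caption."

```python
import numpy as np

# Toy word vectors standing in for a trained Word2Vec model (values are illustrative only).
EMBED = {
    "dog":     np.array([0.90, 0.10, 0.00]),
    "puppy":   np.array([0.85, 0.20, 0.05]),
    "frisbee": np.array([0.10, 0.90, 0.10]),
    "cat":     np.array([0.80, 0.00, 0.30]),
    "car":     np.array([0.00, 0.20, 0.90]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def caption_score(caption_nouns, detected_labels):
    """Average, over the caption's nouns, of each noun's best similarity
    to any label produced by the object detector."""
    labels = [l for l in detected_labels if l in EMBED]
    sims = []
    for noun in caption_nouns:
        if noun in EMBED and labels:
            sims.append(max(cosine(EMBED[noun], EMBED[l]) for l in labels))
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical detector output and candidate captions with pre-extracted nouns.
detected = ["dog", "frisbee"]
candidates = {
    "a puppy catches a frisbee": ["puppy", "frisbee"],
    "a car parked near a cat":   ["car", "cat"],
}

# Output the candidate whose nouns best agree with the detected objects.
best_caption = max(candidates, key=lambda c: caption_score(candidates[c], detected))
```

Here the first caption wins because "puppy" is close to the detected "dog" and "frisbee" matches exactly, whereas the second caption mentions an object ("car") absent from the detections.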
Original language | English |
---|---|
Pages (from-to) | 2195-2204 |
Number of pages | 10 |
Journal | Sensors and Materials |
Volume | 35 |
Issue number | 7 |
DOIs | |
State | Published - 2023 |
Keywords
- Word2Vec
- deep learning
- image caption generation
- object detection
- scoring
ASJC Scopus subject areas
- Instrumentation
- General Materials Science