TY - JOUR
T1 - Image Caption Generation Using Scoring Based on Object Detection and Word2Vec
AU - Misawa, Tadanobu
AU - Morizumi, Nozomi
AU - Yamashita, Kazuya
N1 - Publisher Copyright: © MYU K.K.
PY - 2023
Y1 - 2023
AB - Caption generation, the task of producing descriptive text from images, is an active research field with potential applications such as aiding the visually impaired. Numerous deep-learning methods have recently been proposed. These methods learn the relationship between image features and captions from a large dataset of image-caption pairs. However, it is difficult to learn all objects, object attributes, and relationships between objects correctly, so incorrect captions are occasionally generated, for instance captions that mention objects not present in the image. In this study, we propose a scoring method that uses object detection and Word2Vec to output a correct caption for the objects in the image. First, multiple candidate captions are generated. Next, object detection is performed, and a score is calculated for each caption from the labels returned by object detection and the nouns extracted from that caption. Finally, the caption with the highest score is output. Experimental evaluation on the Microsoft Common Objects in Context (MSCOCO) dataset demonstrates that the proposed method is effective in improving the accuracy of caption generation.
KW - Word2Vec
KW - deep learning
KW - image caption generation
KW - object detection
KW - scoring
UR - http://www.scopus.com/inward/record.url?scp=85167710456&partnerID=8YFLogxK
U2 - 10.18494/SAM4410
DO - 10.18494/SAM4410
M3 - Article
AN - SCOPUS:85167710456
SN - 0914-4935
VL - 35
SP - 2195
EP - 2204
JO - Sensors and Materials
JF - Sensors and Materials
IS - 7
ER -