High-accuracy image captioning

Mateusz Bartosiewicz

supervisor: Marcin Iwanowski



Image captioning focuses on generating sentences that describe the content of a given image. Two types of description can be distinguished: the first, in which sentences are formulated from the direct positions of objects in the image, and the second, in which the algorithm tries to infer what happened in the scene.

During the research work, both of these motifs were addressed. The first focuses on generating sentences from objects detected in the image. The proposed method calculates relative bounding-box positions in a fuzzification process and stores them in a fuzzy mutual position matrix, which allows the complexity of the image to be represented in a 2-D structure. Finally, the most relevant predicates are selected according to saliency-based criteria and passed to a language model, which formulates semantically correct and human-friendly sentences.
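The fuzzification of mutual positions can be illustrated with a minimal sketch. The membership functions below (cosine-based degrees for four directional predicates between box centres) and the function names are assumptions for illustration only, not the method proposed in the work:

```python
import numpy as np

def fuzzy_mutual_position(box_a, box_b):
    """Hypothetical fuzzification: membership degrees describing the
    position of box_b relative to box_a.
    Boxes are (x_min, y_min, x_max, y_max) in image coordinates
    (y grows downward)."""
    # Centres of both bounding boxes
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    # Angle from a's centre to b's centre; 0 rad points right
    angle = np.arctan2(by - ay, bx - ax)
    # Cosine/sine-shaped membership functions for four predicates
    return {
        "right_of": max(0.0, float(np.cos(angle))),
        "below":    max(0.0, float(np.sin(angle))),
        "left_of":  max(0.0, float(-np.cos(angle))),
        "above":    max(0.0, float(-np.sin(angle))),
    }

def mutual_position_matrix(boxes):
    """2-D structure holding the fuzzy relation for every object pair."""
    n = len(boxes)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = fuzzy_mutual_position(boxes[i], boxes[j])
    return matrix
```

A saliency-based selection step would then rank the entries of this matrix and keep only the strongest predicates for the language model.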

The second motif is an end-to-end solution, in which a neural network is trained on a large number of images and their corresponding captions. Most of this research concerns English, which is semantically a relatively simple language compared with Polish. In this regard, an experimental study was conducted that examined the application of neural image captioning methods to the Polish language. The paper presents the results of using a generative model to produce sentences in Polish. In addition, the use of an automatically translated Flickr8k dataset was investigated.
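The generation step of such an end-to-end model can be sketched as greedy decoding over a vocabulary. Everything below is a toy stand-in, not the trained model from the study: the tiny Polish vocabulary and the seeded random logit table merely show the shape of the decoding loop that a trained CNN/RNN or Transformer decoder would drive:

```python
import numpy as np

# Toy Polish vocabulary; a real model would use thousands of tokens.
VOCAB = ["<start>", "<end>", "pies", "biega", "po", "trawie"]

# Hypothetical stand-in for a trained decoder: a fixed table of
# next-token logits indexed by the previous token.
rng = np.random.default_rng(0)
LOGITS = rng.normal(size=(len(VOCAB), len(VOCAB)))

def greedy_decode(max_len=10):
    """Greedily pick the highest-scoring next token until <end>."""
    tokens = [VOCAB.index("<start>")]
    for _ in range(max_len):
        nxt = int(np.argmax(LOGITS[tokens[-1]]))
        tokens.append(nxt)
        if VOCAB[nxt] == "<end>":
            break
    return [VOCAB[t] for t in tokens]
```

In a real captioning system the logits would come from a decoder conditioned on image features, and beam search is often preferred over pure greedy decoding.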

In the next step, a model that integrates direct recognition of object positions in the image with the Transformer architecture will be developed.