Detecting human-object interactions

Marcin Grząbka

supervisor: Marcin Iwanowski



Understanding interactions between objects is a crucial step toward detailed scene understanding.

Recent achievements in object detection using deep neural networks allow a step forward from image-level interaction classification to instance-based interaction detection. An important subtask, with a human as the activity performer, is called Human-Object Interaction (HOI) detection and has recently gained much attention in the computer vision community.

HOI detection aims to detect humans and objects and to infer the interactions between them, producing a triplet <human, verb, object> with the corresponding bounding boxes. Because a human can, for example, sit on a chair and work on a laptop at the same time, HOI is a multi-label classification problem. In contrast to the typical action recognition task, HOI is performed on static images (frames). Potential applications include image captioning, image retrieval and human behavior analysis.
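To make the output format concrete, the following is a minimal Python sketch of such a triplet representation; the class and field names (HOIDetection, Box, verb_scores) are illustrative assumptions, not part of any specific framework.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Box:
    # Axis-aligned bounding box in pixel coordinates.
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class HOIDetection:
    human_box: Box                      # bounding box of the person
    object_box: Box                     # bounding box of the interacted object
    object_class: str                   # e.g. "chair", "laptop"
    verb_scores: Dict[str, float] = field(default_factory=dict)  # multi-label: one score per verb

# The same person can appear in several triplets at once,
# e.g. <human, sit_on, chair> and <human, work_on, laptop>.
detections: List[HOIDetection] = [
    HOIDetection(Box(10, 20, 110, 320), Box(5, 200, 200, 340), "chair",
                 {"sit_on": 0.92}),
    HOIDetection(Box(10, 20, 110, 320), Box(90, 150, 220, 230), "laptop",
                 {"work_on": 0.87, "type_on": 0.41}),
]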

A typical HOI processing pipeline consists of three stages: an object detector, human/object/spatial streams, and an interaction classification layer. The object detector (typically Faster R-CNN with a ResNet-50 backbone) is responsible for detecting objects in the scene, the human and object streams process the features of the detected instances, and the spatial stream infers their relations. The most commonly used datasets are HICO-DET and V-COCO.
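The sketch below illustrates one plausible shape of such a pipeline in PyTorch, assuming an off-the-shelf Faster R-CNN detector from torchvision and small placeholder MLPs standing in for the three streams; the layer sizes, the sum-based fusion, and the way appearance features are fed in are illustrative assumptions rather than a specific published architecture.

import torch
import torch.nn as nn
import torchvision

NUM_VERBS = 117  # number of verb classes in HICO-DET

class HOIHead(nn.Module):
    """Toy interaction head combining human, object and spatial streams."""
    def __init__(self, appearance_dim=1024, num_verbs=NUM_VERBS):
        super().__init__()
        self.human_stream = nn.Sequential(nn.Linear(appearance_dim, 512), nn.ReLU())
        self.object_stream = nn.Sequential(nn.Linear(appearance_dim, 512), nn.ReLU())
        # Spatial stream encodes the relative geometry of the pair
        # (here simply the 8 normalized box coordinates).
        self.spatial_stream = nn.Sequential(nn.Linear(8, 512), nn.ReLU())
        self.classifier = nn.Linear(512, num_verbs)

    def forward(self, human_feat, object_feat, pair_boxes):
        fused = (self.human_stream(human_feat)
                 + self.object_stream(object_feat)
                 + self.spatial_stream(pair_boxes))
        # Sigmoid rather than softmax: several verbs may hold for one pair.
        return torch.sigmoid(self.classifier(fused))

# Stage 1: object detector producing boxes, labels and scores.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# Stages 2-3: interaction head applied to each candidate human-object pair
# (placeholder features shown; in practice they come from the detector's backbone).
head = HOIHead()
human_feat = torch.randn(4, 1024)   # appearance features for 4 candidate pairs
object_feat = torch.randn(4, 1024)
pair_boxes = torch.rand(4, 8)       # normalized [human box | object box] coordinates
verb_probs = head(human_feat, object_feat, pair_boxes)  # shape: (4, NUM_VERBS)

The sigmoid output reflects the multi-label nature of the task described above: each human-object pair receives an independent score for every verb.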

The goal of this research is to infer detailed human behavior in a continuous sequence using detected human-object interactions.