Hierarchical modularized vision system for perception-action loops


Principal Investigators:

Olaf Hellwich

Team Members:

Marah Halawa (Doctoral researcher)
Manuel Wöllhaf (Doctoral researcher)

Understanding the main aspects of complex visual analysis

Research Unit 3, SCIoI Project 29

Visual understanding is a key component of biological and synthetic intelligent systems. Because visual sensors of any kind provide high-dimensional data vectors with structural relationships between vector elements (e.g. multi-channel 2D images), the analysis of visual data is unavoidably a search problem in highly complex spaces. This is especially true if the visual input has a time component, as in the visual system of an acting agent. While video data enables the introduction of additional priors and objectives for unsupervised learning, it also dramatically increases the redundancy in the input space. This is evident from the fact that one of the most powerful priors for two-dimensional vision problems introduced in recent years, the convolutional neural network, could not be applied with the same success to temporal data (e.g. for action classification).
Most state-of-the-art approaches are based on frame-based encoding of the visual input, which is then fed into a recurrent neural network or an approximation of a Bayes filter to reflect the time component of the problem. The goal of this project is therefore to develop a modularized and hierarchical temporal vision system for representation learning as a basis for a closed perception-action loop. The system is intended to leverage the additional information contained in the time domain by considering multiple frames at once, in order to compensate for the low information density in video streams and to allow unsupervised learning of task-relevant representations.
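As a rough illustration of what "considering multiple frames at once" can mean at the input level, the following sketch groups a frame sequence into overlapping clips, so that a downstream encoder sees a short temporal window instead of a single frame. This is not the project's implementation: the function name `sliding_clips` and the parameters `window` and `stride` are hypothetical, chosen only for this example.

```python
def sliding_clips(frames, window, stride=1):
    """Group a frame sequence into overlapping clips of `window` frames.

    `frames` is any sliceable sequence indexed by time (a list here;
    slicing an array along its first axis works the same way).
    All names are illustrative assumptions, not part of the project.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    if window > len(frames):
        raise ValueError("window is longer than the frame sequence")
    # One clip per start index; consecutive clips overlap when stride < window.
    last_start = len(frames) - window
    return [frames[s:s + window] for s in range(0, last_start + 1, stride)]


# Placeholder "frames": strings stand in for image arrays.
frames = [f"frame_{t}" for t in range(6)]
clips = sliding_clips(frames, window=3, stride=1)
print(len(clips))  # 4
print(clips[0])    # ['frame_0', 'frame_1', 'frame_2']
```

With `stride` smaller than `window`, neighboring clips share frames, which makes the redundancy of video input mentioned above directly visible: most of each clip repeats its predecessor.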
From a more general perspective, this project tries to ease the visual understanding bottleneck that SCIoI example behaviors are confronted with by focusing on its complexity: while other research units make use of gold-standard or easily implemented first approaches to vision problems, this project analyzes the principal aspects of complex video analysis, identifies performance deficits, and suggests strategies for improvement by analyzing the perception-action loop.


Related Publications

Halawa, M., Hellwich, O., & Bideau, P. (2022). Action based Contrastive Learning for Trajectory Prediction. European Conference on Computer Vision (ECCV), 143–159. https://doi.org/10.1007/978-3-031-19842-7_9
Halawa, M., Wöllhaf, M., Vellasques, E., Sanchez Sanz, U., & Hellwich, O. (2020). Learning Disentangled Expression Representations from Facial Images. arXiv and WiCV at ECCV 2020. https://doi.org/10.48550/arXiv.2008.07001