Publications

Last modified: November 2020


Uncertainty-sensitive Activity Recognition: A Reliability Benchmark and the CARING Models

Alina Roitberg, Monica Haurilet, Manuel Martinez, Rainer Stiefelhagen
International Conference on Pattern Recognition (ICPR),
Online, October 2020.
[paper] [bibtex]

@inproceedings{Roitberg2020CARING,
author = {Alina Roitberg and Monica Haurilet and Manuel Martinez and Rainer Stiefelhagen},
title = {{Uncertainty-sensitive Activity Recognition: A Reliability Benchmark and the CARING Models}},
year = {2020},
month = {October},
booktitle = {International Conference on Pattern Recognition (ICPR)},
}




Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen
International Conference on Pattern Recognition (ICPR),
Online, October 2020.
[paper] [abstract] [arxiv] [bibtex]

@inproceedings{Kechaou2020,
author = {Amine Kechaou and Manuel Martinez and Monica Haurilet and Rainer Stiefelhagen},
title = {{Detective: An Attentive Recurrent Model for Sparse Object Detection}},
year = {2020},
month = {October},
booktitle = {International Conference on Pattern Recognition (ICPR)},
}
In this work, we present Detective, an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism and then estimates the object's class and bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicates. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has great potential for future developments in applications where the order of the predicted objects is of interest.
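The instance-level matching step described above can be sketched as follows. This is an illustrative toy, not the authors' code: each prediction is paired with at most one ground-truth object by minimizing a combined localization and classification cost. All boxes, scores and weights below are hypothetical, and the tiny problem is solved by brute force where a real implementation would use the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment).

```python
# Toy sketch of one-to-one prediction/ground-truth matching for a
# sparse detector. Boxes are (x1, y1, x2, y2); all values hypothetical.
from itertools import permutations

def box_l1(a, b):
    """L1 distance between two (x1, y1, x2, y2) boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match(pred_boxes, pred_probs, gt_boxes, gt_labels, loc_w=1.0, cls_w=1.0):
    """Return, for each ground-truth object, the index of the matched prediction."""
    n = len(gt_boxes)
    def cost(p, g):  # localization term minus classification score of the true class
        return loc_w * box_l1(pred_boxes[p], gt_boxes[g]) - cls_w * pred_probs[p][gt_labels[g]]
    # Brute-force the minimum-cost assignment (Hungarian algorithm in practice).
    best = min(permutations(range(len(pred_boxes)), n),
               key=lambda perm: sum(cost(perm[g], g) for g in range(n)))
    return list(best)  # best[g] = prediction matched to ground truth g

preds  = [(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9)]
probs  = [(0.8, 0.2), (0.1, 0.9)]  # per-prediction class scores
gts    = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4)]
labels = [1, 0]
assignment = match(preds, probs, gts, labels)
# prediction 1 is matched to ground truth 0, prediction 0 to ground truth 1
```

Intuitively, the one-to-one assignment is what discourages duplicates: a second box on an already-matched object finds no ground truth to pair with and is penalized accordingly.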




Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information

Robin Ruede, Verena Heusser, Lukas Frank, Alina Roitberg, Monica Haurilet, Rainer Stiefelhagen
International Conference on Pattern Recognition (ICPR),
Online, October 2020.
[bibtex]

@inproceedings{Ruede2020,
author = {Robin Ruede and Verena Heusser and Lukas Frank and Alina Roitberg and Monica Haurilet and Rainer Stiefelhagen},
title = {{Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information}},
year = {2020},
month = {October},
booktitle = {International Conference on Pattern Recognition (ICPR)},
}




Deep Classification-driven Domain Adaptation for Cross-Modal Behavior Recognition

Simon Reiß*, Alina Roitberg*, Monica Haurilet, Rainer Stiefelhagen
Intelligent Vehicles Symposium (IV),
Online, October 2020.
[paper] [bibtex]

@inproceedings{Reiss2020,
author = {Simon Rei\ss* and Alina Roitberg* and Monica Haurilet and Rainer Stiefelhagen},
title = {{Deep Classification-driven Domain Adaptation for Cross-Modal Behavior Recognition}},
year = {2020},
month = {October},
booktitle = {Intelligent Vehicles Symposium (IV)},
note = {*equal contribution}
}




Open Set Driver Activity Recognition

Alina Roitberg, Chaoxiang Ma, Monica Haurilet, Rainer Stiefelhagen
Intelligent Vehicles Symposium (IV),
Online, October 2020.
[paper] [bibtex] 2nd Place "Best Student Paper Award"

@inproceedings{Roitberg2020OpenSet,
author = {Alina Roitberg and Chaoxiang Ma and Monica Haurilet and Rainer Stiefelhagen},
title = {{Open Set Driver Activity Recognition}},
year = {2020},
month = {October},
booktitle = {Intelligent Vehicles Symposium (IV)}
}




Bring the Environment to Life: Sonifying Fine-Grained Localized Objects for Persons with Visual Impairments

Angela Constantinescu, Karin Mueller, Monica Haurilet, Vanessa Petrausch, Rainer Stiefelhagen
International Conference on Multimodal Interaction (ICMI),
Online, October 2020.
[paper] [bibtex]

@inproceedings{Constantinescu2020,
author = {Angela Constantinescu and Karin Mueller and Monica Haurilet and Vanessa Petrausch and Rainer Stiefelhagen},
title = {{Bring the Environment to Life: Sonifying Fine-Grained Localized Objects for Persons with Visual Impairments}},
year = {2020},
month = {October},
booktitle = {International Conference on Multimodal Interaction (ICMI)},
}




CNN-based Driver Activity Understanding: Shedding Light on Deep Spatiotemporal Representations

Alina Roitberg, Monica Haurilet, Simon Reiß, Rainer Stiefelhagen
International Conference on Intelligent Transportation Systems (ITSC),
Online, September 2020.
[paper] [bibtex]

@inproceedings{Roitberg2020CNN,
author = {Alina Roitberg and Monica Haurilet and Simon Rei\ss and Rainer Stiefelhagen},
title = {{CNN-based Driver Activity Understanding: Shedding Light on Deep Spatiotemporal Representations}},
year = {2020},
month = {September},
booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},
}




Activity-aware Attributes for Zero-Shot Driver Behavior Recognition

Simon Reiß*, Alina Roitberg*, Monica Haurilet, Rainer Stiefelhagen
CVPRW on Visual Learning with Limited Labels (VL-LL),
Online, June 2020.
[paper] [bibtex]

@inproceedings{Reiss2020ZeroShot,
author = {Simon Rei\ss* and Alina Roitberg* and Monica Haurilet and Rainer Stiefelhagen},
title = {{Activity-aware Attributes for Zero-Shot Driver Behavior Recognition}},
year = {2020},
month = {June},
booktitle = {CVPRW on Visual Learning with Limited Labels (VL-LL)},
note = {*equal contribution}
}




Drive&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles

Manuel Martin*, Alina Roitberg*, Monica Haurilet, Matthias Horne, Simon Reiß,
Michael Voit, Rainer Stiefelhagen

International Conference on Computer Vision (ICCV),
Seoul, South Korea, October 2019.
[paper] [abstract] [website] [bibtex]

@inproceedings{MartinRoitberg2019,
author = {Manuel Martin* and Alina Roitberg* and Monica Haurilet and Matthias Horne and Simon Rei\ss and Michael Voit and Rainer Stiefelhagen},
title = {{Drive\&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles}},
year = {2019},
booktitle = {International Conference on Computer Vision (ICCV)},
publisher = {IEEE},
month = {October},
note = {*equal contribution}
}
We introduce the novel domain-specific Drive&Act benchmark for fine-grained categorization of driver behavior. Our dataset features twelve hours and over 9.6 million frames of people engaged in distractive activities during both manual and automated driving. We capture color, infrared, depth and 3D body pose information from six views and densely label the videos with a hierarchical annotation scheme, resulting in 83 categories. The key challenges of our dataset are: (1) recognition of fine-grained behavior inside the vehicle cabin; (2) multi-modal activity recognition, focusing on diverse data streams; and (3) a cross-view recognition benchmark, where a model handles data from an unfamiliar domain, as sensor type and placement in the cabin can change between vehicles. Finally, we provide challenging benchmarks by adopting prominent methods for video- and body pose-based action recognition.




WiSe - Slide Segmentation in the Wild

Monica Haurilet, Alina Roitberg, Manuel Martinez, Rainer Stiefelhagen
International Conference on Document Analysis and Recognition (ICDAR),
Sydney, Australia, September 2019.
[paper] [poster] [abstract] [bibtex] [website]

@inproceedings{haurilet2019wise,
author = {Monica Haurilet and Alina Roitberg and Manuel Martinez and Rainer Stiefelhagen},
title = {{WiSe - Slide Segmentation in the Wild}},
year = {2019},
month = {September},
booktitle = {International Conference on Document Analysis and Recognition (ICDAR)}
}
We address the task of segmenting presentation slides, where the examined page was captured as a live photo during lectures. Slides are important document types used as visual components accompanying presentations in a variety of fields ranging from education to business. However, automatic analysis of presentation slides has not been researched sufficiently and, so far, only preprocessed images of already digitized slide documents were considered. We aim to introduce the task of analyzing unconstrained photos of slides taken during lectures and present a novel dataset for Page Segmentation with slides captured in the Wild (WiSe). Our dataset covers pixel-wise annotations of 25 classes on 1300 pages, allowing overlapping regions (i.e., multi-class assignments). To evaluate performance, we define multiple benchmark metrics and baseline methods for our dataset. We further implement two different deep neural network approaches previously used for segmenting natural images and adapt them to the task. Our evaluation results demonstrate the effectiveness of the deep learning-based methods, surpassing the baseline methods by over 30%. To foster further research on slide analysis in unconstrained photos, we make the WiSe dataset publicly available to the community.




It’s not about the Journey; It’s about the Destination:
Following Soft Paths under Question-Guidance for Visual Reasoning

Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR),
Long Beach, USA, June 2019.
[paper] [supp.] [poster] [abstract] [download_graphs] [bibtex]

@inproceedings{haurilet2019softpaths,
author = {Monica Haurilet and Alina Roitberg and Rainer Stiefelhagen},
title = {{It's not about the Journey; It's about the Destination: Following Soft Paths under Question-Guidance for Visual Reasoning}},
year = {2019},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
}
Visual Reasoning remains a challenging task, as it has to deal with long-range and multi-step object relationships in the scene. We present a new model for Visual Reasoning, aimed at capturing the interplay among individual objects in the image represented as a scene graph. As not all graph components are relevant for the query, we introduce the concept of a question-based visual guide, which constrains the potential solution space by learning an optimal traversal scheme. The final destination nodes alone are then used to produce the answer. We show that finding relevant semantic structures facilitates generalization to new tasks by introducing a novel problem of knowledge transfer: training on one question type and answering questions from a different domain without any training data. Furthermore, we achieve state-of-the-art results for Visual Reasoning on multiple query types and diverse image and video datasets.
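A minimal sketch of the soft-traversal idea, in an assumed toy form rather than the paper's model: the walker keeps a probability distribution over graph nodes and repeatedly pushes it through a weighted adjacency matrix; in the actual model, the edge weights would be conditioned on the question. The graph and weights below are hypothetical.

```python
# Toy "soft" graph traversal: instead of committing to one neighbor per
# step, propagate a probability distribution over all nodes along
# weighted edges. Edge weights are hypothetical placeholders for the
# question-conditioned attention weights.
def soft_step(dist, adjacency):
    """One soft traversal step: move probability mass along weighted edges."""
    n = len(dist)
    out = [sum(dist[i] * adjacency[i][j] for i in range(n)) for j in range(n)]
    total = sum(out)
    return [x / total for x in out]  # renormalize to a distribution

# Hypothetical 3-node chain 0 -> 1 -> 2 with self-loops.
A = [[0.2, 0.8, 0.0],
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]]
dist = [1.0, 0.0, 0.0]   # walker starts at node 0
for _ in range(2):       # two question-guided steps
    dist = soft_step(dist, A)
# most of the probability mass has now reached node 2, the "destination"
```

Because every step is differentiable, such a traversal can be trained end-to-end, which is what distinguishes soft paths from discrete graph walks.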



End-to-end Prediction of Driver Intention using 3D Convolutional Neural Networks

Patrick Gebert*, Alina Roitberg*, Monica Haurilet, Rainer Stiefelhagen
Intelligent Vehicles Symposium (IV),
IEEE, Paris, France, 2019.
[paper] [bibtex]

@inproceedings{roitbergIV2019,
author = {Patrick Gebert* and Alina Roitberg* and Monica Haurilet and Rainer Stiefelhagen},
title = {{End-to-end Prediction of Driver Intention using 3D Convolutional Neural Networks}},
booktitle = {Intelligent Vehicles Symposium (IV)},
publisher = {IEEE},
year = {2019},
month = {June},
address = {Paris, France},
note = {*equal contribution}
}



Learning Fine-Grained Image Representations for Mathematical Expression Recognition

Sidney Bender*, Monica Haurilet*, Alina Roitberg, Rainer Stiefelhagen
ICDARW on Graphics Recognition (GREC),
Sydney, Australia, 2019.
[paper] [slides] [bibtex]

@inproceedings{bender2019fgfe,
author = {Sidney Bender* and Monica Haurilet* and Alina Roitberg and Rainer Stiefelhagen},
title = {{Learning Fine-Grained Image Representations for Mathematical Expression Recognition}},
year = {2019},
month = {September},
booktitle = {International Conference on Document Analysis and Recognition Workshop on Graphics Recognition (GREC)},
note = {*equal contribution}
}




Analysis of Deep Fusion Strategies for Multi-modal Gesture Recognition

Alina Roitberg*, Tim Pollert*, Monica Haurilet, Manuel Martin, Rainer Stiefelhagen
CVPRW on Analysis and Modeling of Faces and Gestures (AMFG),
IEEE, Long Beach, USA, 2019.
[paper] [bibtex]

@inproceedings{roitbergCVPRW2019DeepFusion,
author = {Alina Roitberg* and Tim Pollert* and Monica Haurilet and Manuel Martin and Rainer Stiefelhagen},
title = {{Analysis of Deep Fusion Strategies for Multi-modal Gesture Recognition}},
year = {2019},
booktitle = {CVPR Workshop on Analysis and Modeling of Faces and Gestures (AMFG)},
month = {June},
note = {*equal contribution}
}



DynGraph: Visual Question Answering via Dynamic Scene Graphs

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
German Conference on Pattern Recognition (GCPR),
Dortmund, Germany, 2019.
[bibtex]

@inproceedings{haurilet2019gcpr,
author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{DynGraph: Visual Question Answering via Dynamic Scene Graphs}},
year = {2019},
booktitle = {German Conference on Pattern Recognition (GCPR)}
}




SPaSe - Multi-Label Page Segmentation for Presentation Slides

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
Winter Conference on Applications of Computer Vision (WACV),
Waikoloa, Hawaii, USA, Jan. 2019.
[paper] [supp.] [abstract] [website] [slides] [bibtex]

@inproceedings{haurilet2019spase,
author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{SPaSe - Multi-Label Page Segmentation for Presentation Slides}},
year = {2019},
booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
month = {January},
}
We introduce the first benchmark dataset for slide-page segmentation. Presentation slides are one of the most prominent document types used to exchange ideas across the web, educational institutes and businesses. This document format is marked by a complex layout which contains a rich variety of graphical (e.g. diagram, logo), textual (e.g. heading, affiliation) and structural components (e.g. enumeration, legend). This vast and popular knowledge source is still unattainable by modern machine learning techniques due to a lack of annotated data. To tackle this issue, we introduce SPaSe (Slide Page Segmentation), a novel dataset containing dense, pixel-wise annotations of 25 classes for 2000 slides. We show that slide segmentation reveals some interesting properties that characterize this task. Unlike the common image segmentation problem, disjoint classes tend to have a high overlap of regions, thus posing this segmentation task as a multi-label problem. Furthermore, many of the frequently encountered classes in slides are location-sensitive (e.g. title, footnote). Hence, we believe our dataset represents a challenging and interesting benchmark for novel segmentation models. Finally, we evaluate state-of-the-art segmentation networks on our dataset and show that they are suitable for developing deep learning models without any need for pre-training. The annotations will be released to the public to foster further research on this interesting task.



MoQA - A Multi-Modal Question Answering Architecture

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
ECCVW on Shortcomings in Vision and Language (SiVL Spotlight),
Munich, Germany, 2018.
[paper] [bibtex] [poster] [slides] Winner of the TQA challenge

@inproceedings{hauriletSiVL2018Moqa,
author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{MoQA - A Multi-Modal Question Answering Architecture}},
booktitle = {ECCV Workshop on Shortcomings in Vision and Language (SiVL)},
publisher = {Springer},
year = {2018},
month = {September},
address = {Munich, Germany}
}



Towards a Fair Evaluation of Zero-Shot Action Recognition using External Data

Alina Roitberg, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen
ECCVW on Shortcomings in Vision and Language (SiVL Spotlight),
Munich, Germany, 2018.
[paper] [bibtex] [poster] [slides]

@inproceedings{roitbergSiVL2018ZSAction,
author = {Alina Roitberg and Manuel Martinez and Monica Haurilet and Rainer Stiefelhagen},
title = {{Towards a Fair Evaluation of Zero-Shot Action Recognition using External Data}},
booktitle = {ECCV Workshop on Shortcomings in Vision and Language (SiVL)},
publisher = {Springer},
year = {2018},
month = {September},
address = {Munich, Germany}
}



DriveAHead - A Large-Scale Driver Head Pose Dataset

Anke Schwarz*, Monica Haurilet*, Manuel Martinez, Rainer Stiefelhagen
CVPRW on Computer Vision in Vehicle Technology (CVVT Oral),
Honolulu, Hawaii, USA, 2017.
[paper] [data] [bibtex] [abstract] [website]

@inproceedings{Schwarz2017,
author = {Anke Schwarz* and Monica-Laura Haurilet* and Manuel Martinez and Rainer Stiefelhagen},
title = {{DriveAHead - A Large-Scale Driver Head Pose Dataset}},
year = {2017},
booktitle = {Computer Vision and Pattern Recognition Workshop (CVPRW) on Computer Vision in Vehicle Technology},
month = {July},
note = {*equal contribution}
}



Marlin: A High Throughput Variable-to-Fixed Codec using Plurally Parsable Dictionaries

Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen and Joan Serra-Sagrista
Data Compression Conference (DCC Oral),
Snowbird, Utah, USA, 2017.
[paper] [bibtex] [abstract]

@inproceedings{Martinez2017,
author = {Manuel Martinez and Monica-Laura Haurilet and Rainer Stiefelhagen and Joan Serra-Sagrista},
title = {{Marlin: A High Throughput Variable-to-Fixed Codec using Plurally Parsable Dictionaries}},
year = {2017},
booktitle = {Data Compression Conference (DCC)},
month = {April}
}
We present Marlin, a variable-to-fixed (VF) codec optimized for decoding speed. Marlin builds upon a novel way of constructing VF dictionaries that maximizes efficiency for a given dictionary size. On a lossless image coding experiment, Marlin achieves a compression ratio of 1.94 at 2494MiB/s. Marlin is as fast as state-of-the-art high-throughput codecs (e.g., Snappy, 1.24 at 2643MiB/s), and its compression ratio is close to the best entropy codecs (e.g., FiniteStateEntropy, 2.06 at 523MiB/s). Therefore, Marlin enables efficient and high-throughput encoding for memoryless sources, which was not possible until now.
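The variable-to-fixed idea can be illustrated with a toy dictionary coder. This is only a sketch, not Marlin's actual construction (which optimizes the dictionary for the source statistics): the input is parsed into variable-length words from a fixed dictionary, and each word is emitted as a fixed-width index.

```python
# Toy variable-to-fixed (VF) codec: parse the input into variable-length
# dictionary words, emit one fixed-size index per word. The 4-entry
# dictionary below is hypothetical; with 4 entries each index fits in 2 bits.
def vf_encode(data, dictionary):
    """Greedy longest-match parse; returns fixed-width dictionary indices.
    Assumes every input symbol appears as a single-character word, so the
    parse always advances."""
    words = sorted(dictionary, key=len, reverse=True)  # try longest words first
    out, i = [], 0
    while i < len(data):
        for w in words:
            if data.startswith(w, i):
                out.append(dictionary.index(w))
                i += len(w)
                break
    return out

def vf_decode(indices, dictionary):
    """Concatenate the dictionary words named by the indices."""
    return "".join(dictionary[i] for i in indices)

dictionary = ["aaa", "aa", "a", "b"]
encoded = vf_encode("aaaaab", dictionary)   # -> [0, 1, 3]
assert vf_decode(encoded, dictionary) == "aaaaab"
```

Decoding is a single table lookup per fixed-width index, which is why VF codecs like Marlin can reach such high decoding throughput.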


Naming TV Characters by Watching and Analyzing Dialogs

Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision (WACV),
Lake Placid, NY, USA, 2016.
[paper] [bibtex] [abstract] [poster] [slides] [data]

@inproceedings{Haurilet2016,
author = {Monica-Laura Haurilet and Makarand Tapaswi and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{Naming TV Characters by Watching and Analyzing Dialogs}},
year = {2016},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {March}
}
Person identification in TV series has been a popular research topic over the last decade. Most works in this area use either manually annotated data or extract character supervision from a combination of subtitles and transcripts. However, manual annotation is expensive and transcripts are often hard to find, making it difficult to scale these methods to all TV series. We investigate the topic of automatically labeling all character appearances in TV series using information obtained solely from subtitles. This task is extremely difficult due to the very sparse and weak supervision that can be obtained from dialogs between characters. We address these challenges by exploiting recent advances in face descriptors and Multiple Instance Learning (MIL) methods, which are well suited to cope with weakly labeled sets of face tracks. We propose a method to create MIL bags, and evaluate and discuss several MIL techniques. Our best methods achieve an average precision above 80% on three diverse TV series. We demonstrate that using only subtitles provides good results for identifying characters in TV series and hope to encourage the community to work on this problem.
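The bag-creation step can be sketched as follows, in a simplified, assumed form rather than the paper's exact procedure: a subtitle line that addresses a character by name yields a weak positive bag containing all face tracks visible around that moment. Track IDs, names and timestamps below are hypothetical.

```python
# Schematic Multiple Instance Learning (MIL) bag creation from subtitles:
# each name mention produces a bag of candidate face tracks, only one of
# which (at most) actually shows the named character.
def build_mil_bags(face_tracks, subtitle_mentions):
    """face_tracks: list of (track_id, start, end) in seconds;
    subtitle_mentions: list of (name, time) extracted from dialogs.
    Returns {name: list of bags}, each bag a list of candidate track_ids."""
    bags = {}
    for name, t in subtitle_mentions:
        # Every track on screen around the mention is a weak candidate.
        candidates = [tid for tid, s, e in face_tracks if s <= t <= e]
        if candidates:
            bags.setdefault(name, []).append(candidates)
    return bags

tracks = [("trk1", 0, 10), ("trk2", 5, 15), ("trk3", 20, 30)]
mentions = [("Sheldon", 7), ("Penny", 25)]
bags = build_mil_bags(tracks, mentions)
# bags == {"Sheldon": [["trk1", "trk2"]], "Penny": [["trk3"]]}
```

A MIL classifier then resolves the ambiguity inside each bag, since the speaker addressing "Sheldon" is usually not Sheldon himself; that second-person cue is part of what makes the supervision so weak.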





Theses


PhD Thesis

High-level Understanding of Visual Content in Learning Materials through Graph Neural Networks

Monica Haurilet
Dissertation, Karlsruhe Institute of Technology, 2020.
[coming soon]



Master Thesis

Completely Unsupervised Person Identification in TV-Series using Subtitles

Monica Haurilet
Master Thesis, Karlsruhe Institute of Technology, 2015.
[thesis]