Publications



driveandact-image

Drive&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles

Manuel Martin*, Alina Roitberg*, Monica Haurilet, Matthias Horne, Simon Reiß,
Michael Voit, Rainer Stiefelhagen

International Conference on Computer Vision (ICCV),
Seoul, South Korea, October 2019.
[paper] [abstract] [website] [bibtex]

@inproceedings{MartinRoitberg2019,
author = {Manuel Martin* and Alina Roitberg* and Monica Haurilet and Matthias Horne and Simon Rei\ss and Michael Voit and Rainer Stiefelhagen},
title = {{Drive\&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles}},
year = {2019},
booktitle = {International Conference on Computer Vision (ICCV)},
publisher = {IEEE},
month = {October},
address = {Seoul, South Korea},
note = {*equal contribution}
}
We introduce the novel domain-specific Drive&Act benchmark for fine-grained categorization of driver behavior. Our dataset features twelve hours and over 9.6 million frames of people engaged in distractive activities during both manual and automated driving. We capture color, infrared, depth and 3D body pose information from six views and densely label the videos with a hierarchical annotation scheme, resulting in 83 categories. The key challenges of our dataset are: (1) recognition of fine-grained behavior inside the vehicle cabin; (2) multi-modal activity recognition, focusing on diverse data streams; and (3) cross-view recognition, where a model must handle data from an unfamiliar domain, as sensor type and placement in the cabin can change between vehicles. Finally, we provide challenging benchmarks by adopting prominent methods for video- and body pose-based action recognition.



wise-image

WiSe - Slide Segmentation in the Wild

Monica Haurilet, Alina Roitberg, Manuel Martinez, Rainer Stiefelhagen
International Conference on Document Analysis and Recognition (ICDAR),
Sydney, Australia, September 2019.
[paper] [abstract] [bibtex]

@inproceedings{haurilet2019wise,
author = {Monica Haurilet and Alina Roitberg and Manuel Martinez and Rainer Stiefelhagen},
title = {{WiSe - Slide Segmentation in the Wild}},
year = {2019},
month = {September},
booktitle = {International Conference on Document Analysis and Recognition (ICDAR)}
}
We address the task of segmenting presentation slides, where the examined page was captured as a live photo during lectures. Slides are important document types used as visual components accompanying presentations in a variety of fields ranging from education to business. However, automatic analysis of presentation slides has not been researched sufficiently, and, so far, only preprocessed images of already digitalized slide documents were considered. We aim to introduce the task of analyzing unconstrained photos of slides taken during lectures and present a novel dataset for Page Segmentation with slides captured in the Wild (WiSe). Our dataset covers pixel-wise annotations of 25 classes on 1300 pages, allowing overlapping regions (i.e., multi-class assignments). To evaluate the performance, we define multiple benchmark metrics and baseline methods for our dataset. We further implement two different deep neural network approaches previously used for segmenting natural images and adapt them to the task. Our evaluation results demonstrate the effectiveness of the deep learning-based methods, surpassing the baseline methods by over 30%. To foster further research of slide analysis in unconstrained photos, we make the WiSe dataset publicly available to the community.



softpaths

It’s not about the Journey; It’s about the Destination:
Following Soft Paths under Question-Guidance for Visual Reasoning

Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR),
Long Beach, USA, June 2019.
[paper] [supp.] [poster] [abstract] [download_graphs] [bibtex]

@inproceedings{haurilet2019softpaths,
author = {Monica Haurilet and Alina Roitberg and Rainer Stiefelhagen},
title = {{It's not about the Journey; It's about the Destination: Following Soft Paths under Question-Guidance for Visual Reasoning}},
year = {2019},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
}
Visual Reasoning remains a challenging task, as it has to deal with long-range and multi-step object relationships in the scene. We present a new model for Visual Reasoning, aimed at capturing the interplay among individual objects in the image represented as a scene graph. As not all graph components are relevant for the query, we introduce the concept of a question-based visual guide, which constrains the potential solution space by learning an optimal traversal scheme. The final destination nodes alone are then used to produce the answer. We show that finding relevant semantic structures facilitates generalization to new tasks by introducing a novel problem of knowledge transfer: training on one question type and answering questions from a different domain without any training data. Furthermore, we achieve state-of-the-art results for Visual Reasoning on multiple query types and diverse image and video datasets.


driver-intention

End-to-end Prediction of Driver Intention using 3D Convolutional Neural Networks

Patrick Gebert*, Alina Roitberg*, Monica Haurilet and Rainer Stiefelhagen
Intelligent Vehicles Symposium (IV),
IEEE, Paris, France, 2019.
[paper] [bibtex]

@inproceedings{roitbergIV2019,
author = {Patrick Gebert* and Alina Roitberg* and Monica Haurilet and Rainer Stiefelhagen},
title = {{End-to-end Prediction of Driver Intention using 3D Convolutional Neural Networks}},
booktitle = {Intelligent Vehicles Symposium (IV)},
publisher = {IEEE},
year = {2019},
month = {June},
address = {Paris, France},
note = {*equal contribution}
}


fgfe-image

Learning Fine-Grained Image Representations for Mathematical Expression Recognition

Sidney Bender*, Monica Haurilet*, Alina Roitberg, Rainer Stiefelhagen
ICDARW on Graphics Recognition (GREC),
Sydney, Australia, 2019.
[paper] [bibtex]

@inproceedings{bender2019fgfe,
author = {Sidney Bender* and Monica Haurilet* and Alina Roitberg and Rainer Stiefelhagen},
title = {{Learning Fine-Grained Image Representations for Mathematical Expression Recognition}},
year = {2019},
month = {September},
booktitle = {International Conference on Document Analysis and Recognition Workshop on Graphics Recognition (GREC)},
note = {*equal contribution}
}



gesture-recognition

Analysis of Deep Fusion Strategies for Multi-modal Gesture Recognition

Alina Roitberg*, Tim Pollert*, Monica Haurilet, Manuel Martin, Rainer Stiefelhagen
CVPRW on Analysis and Modeling of Faces and Gestures (AMFG),
IEEE, Long Beach, USA, 2019.
[paper] [bibtex]

@inproceedings{roitbergCVPRW2019DeepFusion,
author = {Alina Roitberg* and Tim Pollert* and Monica Haurilet and Manuel Martin and Rainer Stiefelhagen},
title = {{Analysis of Deep Fusion Strategies for Multi-modal Gesture Recognition}},
year = {2019},
booktitle = {CVPR Workshop on Analysis and Modeling of Faces and Gestures (AMFG)},
month = {June},
note = {*equal contribution}
}


spase

SPaSe - Multi-Label Page Segmentation for Presentation Slides

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
Winter Conference on Applications of Computer Vision (WACV),
Waikoloa, Hawaii, USA, Jan. 2019.
[paper] [supp.] [abstract] [website] [slides] [bibtex]

@inproceedings{haurilet2019spase,
author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{SPaSe - Multi-Label Page Segmentation for Presentation Slides}},
year = {2019},
booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
month = {January}
}
We introduce the first benchmark dataset for slide-page segmentation. Presentation slides are one of the most prominent document types used to exchange ideas across the web, educational institutes and businesses. This document format is marked with a complex layout which contains a rich variety of graphical (e.g. diagram, logo), textual (e.g. heading, affiliation) and structural components (e.g. enumeration, legend). This vast and popular knowledge source is still unattainable by modern machine learning techniques due to a lack of annotated data. To tackle this issue, we introduce SPaSe (Slide Page Segmentation), a novel dataset containing dense, pixel-wise annotations of 25 classes for 2000 slides. We show that slide segmentation reveals some interesting properties that characterize this task. Unlike the common image segmentation problem, disjoint classes tend to have a high overlap of regions, thus posing this segmentation task as a multi-label problem. Furthermore, many of the frequently encountered classes in slides are location-sensitive (e.g. title, footnote). Hence, we believe our dataset represents a challenging and interesting benchmark for novel segmentation models. Finally, we evaluate state-of-the-art segmentation networks on our dataset and show that they are suitable for developing deep learning models without any need for pre-training. The annotations will be released to the public to foster further research on this interesting task.


Haurilet_2018

MoQA - A Multi-Modal Question Answering Architecture

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
ECCVW on Shortcomings in Vision and Language (SiVL Spotlight),
Munich, Germany, Sep. 2018.
[paper] [bibtex] [poster] [slides]

@inproceedings{hauriletSiVL2018Moqa,
author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{MoQA - A Multi-Modal Question Answering Architecture}},
booktitle = {ECCV Workshop on Shortcomings in Vision and Language (SiVL)},
publisher = {Springer},
year = {2018},
month = {September},
address = {Munich, Germany}
}


Roitberg_2018

Towards a Fair Evaluation of Zero-Shot Action Recognition using External Data

Alina Roitberg, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen
ECCVW on Shortcomings in Vision and Language (SiVL Spotlight),
Munich, Germany, Sep. 2018.
[paper] [bibtex] [poster] [slides]

@inproceedings{roitbergSiVL2018ZSAction,
author = {Alina Roitberg and Manuel Martinez and Monica Haurilet and Rainer Stiefelhagen},
title = {{Towards a Fair Evaluation of Zero-Shot Action Recognition using External Data}},
booktitle = {ECCV Workshop on Shortcomings in Vision and Language (SiVL)},
publisher = {Springer},
year = {2018},
month = {September},
address = {Munich, Germany}
}
coming soon


Schwarz_2017

DriveAHead - A Large-Scale Driver Head Pose Dataset

Anke Schwarz*, Monica Haurilet*, Manuel Martinez, Rainer Stiefelhagen
CVPRW on Computer Vision in Vehicle Technology (CVVT Oral),
Honolulu, Hawaii, USA, Jul. 2017.
[paper] [data] [bibtex] [abstract] [website]

@inproceedings{Schwarz2017,
author = {Anke Schwarz* and Monica-Laura Haurilet* and Manuel Martinez and Rainer Stiefelhagen},
title = {{DriveAHead - A Large-Scale Driver Head Pose Dataset}},
year = {2017},
booktitle = {Computer Vision and Pattern Recognition Workshop (CVPRW) on Computer Vision in Vehicle Technology},
month = {July},
note = {*equal contribution}
}
coming soon


Martinez2017

Marlin: A High Throughput Variable-to-Fixed Codec using Plurally Parsable Dictionaries

Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen and Joan Serra-Sagrista
Data Compression Conference (DCC Oral),
Snowbird, Utah, USA, April 2017.
[paper] [bibtex] [abstract]

@inproceedings{Martinez2017,
author = {Manuel Martinez and Monica-Laura Haurilet and Rainer Stiefelhagen and Joan Serra-Sagrista},
title = {{Marlin: A High Throughput Variable-to-Fixed Codec using Plurally Parsable Dictionaries}},
year = {2017},
booktitle = {Data Compression Conference (DCC)},
month = {April}
}
We present Marlin, a variable-to-fixed (VF) codec optimized for decoding speed. Marlin builds upon a novel way of constructing VF dictionaries that maximizes efficiency for a given dictionary size. On a lossless image coding experiment, Marlin achieves a compression ratio of 1.94 at 2494 MiB/s. Marlin is as fast as state-of-the-art high-throughput codecs (e.g., Snappy, 1.24 at 2643 MiB/s), and its compression ratio is close to the best entropy codecs (e.g., FiniteStateEntropy, 2.06 at 523 MiB/s). Therefore, Marlin enables efficient and high-throughput encoding for memoryless sources, which was not possible until now.

Haurilet2016_WACV

Naming TV Characters by Watching and Analyzing Dialogs

Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision (WACV),
Lake Placid, NY, USA, Mar. 2016.
[paper] [bibtex] [abstract] [poster] [slides] [data]

@inproceedings{Haurilet2016,
author = {Monica-Laura Haurilet and Makarand Tapaswi and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{Naming TV Characters by Watching and Analyzing Dialogs}},
year = {2016},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {March}
}
Person identification in TV series has been a popular research topic over the last decade. Most works in this area use either manually annotated data or extract character supervision from a combination of subtitles and transcripts. However, manual annotation is expensive and transcripts are often hard to find, making it difficult to scale these methods to all TV series. We investigate the task of automatically labeling all character appearances in TV series using information obtained solely from subtitles. This task is extremely difficult due to the very sparse and weak supervision that can be obtained from dialogs between characters. We address these challenges by exploiting recent advances in face descriptors and Multiple Instance Learning (MIL) methods, which are well suited to coping with weakly labeled sets of face tracks. We propose a method to create MIL bags, and evaluate and discuss several MIL techniques. Our best methods achieve an average precision above 80% on three diverse TV series. We demonstrate that using subtitles alone yields good results for identifying characters in TV series, and we wish to encourage the community to pursue this problem.


Haurilet2015_Thesis

Completely Unsupervised Person Identification in TV-Series using Subtitles

Monica-Laura Haurilet
Master's Thesis, Karlsruhe Institute of Technology, 2015.
[thesis]