Jan Hendrik Metzen's website

I am a Senior AI Researcher at Aleph Alpha Research. As part of Aleph Alpha’s Foundation Models team, I focus on LLM pretraining and optimization. In particular, we have developed a new tokenizer-free LLM architecture that allows for efficient pretraining, domain adaptation, and inference. Moreover, we are working on new optimization methods that allow for efficient training of large-efficient models.

I was a Senior Expert at the Bosch Center for Artificial Intelligence (BCAI) until 08/2024. My primary research focused on making AI (specifically computer-vision-based perception) robust, reliable, and safe. For this, we leveraged strong generative models for finding systematic errors of image classifiers on rare subgroups and systematic errors of object detectors. We also identified vulnerabilities of Transformer-based neural networks against adversarial patch/token attacks. To counteract such vulnerabilities, we developed architectures that are certifiably robust against patch attacks for image classifiers as well as for semantic segmentation. Furthermore, we proposed methods for adversarially training neural networks to become robust against universal perturbations and universal adversarial patches. In addition, we provided methods for test-time adaptation of neural networks to improve robustness to domain shifts and studied the role of shape-biased representations in robustness to common image corruptions.

A different strand of my research is automating machine learning (AutoML), specifically Neural Architecture Search. The latter research field is motivated by the vast design space of neural networks and the diversity of inference hardware. Manually tailoring a neural architecture for every type of hardware is cumbersome and does not scale; hardware-aware neural architecture search can vastly improve design efficiency and thus reduce the cost of AI development. See our survey on Neural Architecture Search and a more recent survey on Neural Architecture Search for Dense Prediction Tasks in Computer Vision. Recently, we developed AutoCLIP, a method for auto-tuning zero-shot classifiers for vision-language models that improves zero-shot performance across a broad range of domains.

I also love contributing to machine learning libraries, both open source and proprietary. I am a core contributor of scikit-learn, where I contributed tools for probability calibration of classifiers and for kernel ridge regression. Moreover, I have written a complete redesign of the Gaussian process module for scikit-learn. At BCAI, I was involved as a core developer of frameworks for deep learning training pipelines, neural architecture search, and robustness evaluation.

I am a member of ELLIS and regularly review for scientific conferences and journals such as ICLR, ICML, NeurIPS, and TMLR. I was senior area chair of the AutoML 2022 conference and co-organizer of the workshops NAS@ICLR 2020 and NAS@ICLR 2021. I have been recognized by ICLR as Highlighted Reviewer in 2022 and Outstanding Reviewer in 2021.

See also my curriculum vitae for more information. Feel free to contact me via janmetzen@mailbox.org.

news

Aug 12, 2024

I have built SVGStud.io, an AI-based tool for searching and generating Scalable Vector Graphics (SVG) files. SVGStud.io offers the following core functionalities:

Semantic SVG Search: Find SVG files that match a search term or a sample image as closely as possible, from a library of more than 10,000 SVGs.
AI-based SVG Generator: Generate novel SVGs based on textual descriptions and (optionally) example images.
SVG Gallery: Explore a gallery of all SVGs in our library. Perfect for serendipity!
SVG Bundles: Browse a large variety of free pre-generated SVG bundles.

All SVGs in SVGStud.io are licensed under the CC-BY-SA 4.0 license and can be downloaded at any time.

latest posts

Aug 18, 2024	Building SVGStud.io: Fine-tuning Stable Diffusion XL for Efficient SVG Generation

selected publications

2024

AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Jan Hendrik Metzen, Piyapat Saranrittichai, and Chaithanya Kumar Mummadi

Transactions on Machine Learning Research, 2024

Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates over templates obtained from a large language model to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classify to the class that maximizes cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights to each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has very low computational overhead, and can be easily implemented in a few lines of code. We show that AutoCLIP consistently outperforms baselines across a broad range of vision-language models, datasets, and prompt templates, by up to 3 percentage points in accuracy.
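The baseline and the idea of per-image template weights described in this abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the published AutoCLIP procedure: the function names, the temperature `beta`, and in particular the weighting statistic (a log-sum-exp of each template's similarities over classes) are assumptions chosen for illustration. Embeddings are assumed to come from some vision-language model and are passed in as plain arrays.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def descriptor_similarities(image_emb, descriptor_embs):
    """Cosine similarity of one image to every class descriptor.

    image_emb: shape (D,); descriptor_embs: shape (K templates, C classes, D).
    Returns an array of shape (K, C).
    """
    return l2_normalize(descriptor_embs) @ l2_normalize(image_emb)


def classify(image_emb, descriptor_embs, weights=None):
    """Return (predicted class, per-class scores) for one image.

    With weights=None all K templates count equally, which corresponds to
    the averaged-descriptor baseline described in the abstract.
    """
    k = descriptor_embs.shape[0]
    if weights is None:
        weights = np.full(k, 1.0 / k)
    # Weighted average of the unit-norm descriptors of each class: (C, D).
    queries = np.einsum("k,kcd->cd", weights, l2_normalize(descriptor_embs))
    scores = l2_normalize(queries) @ l2_normalize(image_emb)
    return int(np.argmax(scores)), scores


def per_image_weights(image_emb, descriptor_embs, beta=10.0):
    """Illustrative weighting rule (an assumption, not the paper's update).

    Templates whose descriptors match this image well get larger weights.
    """
    sims = descriptor_similarities(image_emb, descriptor_embs)  # (K, C)
    # Statistic per template: log-sum-exp of its similarities over classes.
    peak = sims.max(axis=1)
    stat = peak + np.log(np.exp(sims - peak[:, None]).sum(axis=1))
    # Softmax over templates, scaled by the hypothetical temperature beta.
    logits = beta * stat
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

Calling `classify(image_emb, descriptor_embs)` gives the uniform baseline, while passing `per_image_weights(image_emb, descriptor_embs)` as `weights` gives the per-image weighted variant. Both are unsupervised in the sense that no labels are used at any point.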
Label-free Neural Semantic Image Synthesis

Jiayi Wang, Kevin Alexander Laube, Yumeng Li, Jan Hendrik Metzen, Shin-I Cheng, Julio Borges, and Anna Khoreva

In European Conference on Computer Vision (ECCV), 2024

Recent work has shown great progress in integrating spatial conditioning to control large, pre-trained text-to-image diffusion models. Despite these advances, existing methods describe the spatial image content using hand-crafted conditioning inputs, which are either semantically ambiguous (e.g., edges) or require expensive manual annotations (e.g., semantic segmentation). To address these limitations, we propose a new label-free way of conditioning diffusion models to enable fine-grained spatial control. We introduce the concept of neural semantic image synthesis, which uses neural layouts extracted from pre-trained foundation models as conditioning. Neural layouts are advantageous as they provide rich descriptions of the desired image, containing both semantics and detailed geometry of the scene. We experimentally show that images synthesized via neural semantic image synthesis achieve similar or superior pixel-level alignment of semantic classes compared to those created using expensive semantic label maps. At the same time, they capture better semantics, instance separation, and object orientation than other label-free conditioning options, such as edges or depth. Moreover, we show that images generated by neural layout conditioning can effectively augment real data for training various perception tasks.

Feature Distillation Improves Zero-Shot Transfer from Synthetic Images

Niclas Popp, Jan Hendrik Metzen, and Matthias Hein

Transactions on Machine Learning Research, 2024

2023

Identification of Systematic Errors of Image Classifiers on Rare Subgroups

Jan Hendrik Metzen, Robin Hutmacher, N Grace Hua, Valentyn Boreiko, and Dan Zhang

In International Conference on Computer Vision, 2023

Despite excellent average-case performance of many image classifiers, their performance can substantially deteriorate on semantically coherent subgroups of the data that were under-represented in the training data. These systematic errors can impact both fairness for demographic minority groups as well as robustness and safety under domain shift. A major challenge is to identify such subgroups with subpar performance when the subgroups are not annotated and their occurrence is very rare. We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance on the prompt-conditioned synthesized data. To tackle the exponentially growing number of subgroups, we employ combinatorial testing. We denote this procedure as PromptAttack as it can be interpreted as an adversarial attack in a prompt space. We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy. Thereupon, we apply PromptAttack to ImageNet classifiers and identify novel systematic errors on rare subgroups.
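To illustrate why combinatorial testing helps with the exponentially growing number of subgroups, the sketch below greedily selects attribute combinations until every pair of attribute values occurs in at least one selected combination. It is a generic pairwise-covering heuristic under assumed names (`pairwise_cover`, `to_prompt`, and the prompt wording are hypothetical), not necessarily the scheme used in PromptAttack.

```python
from itertools import combinations, product


def pairs_of(combo):
    """All pairs of (attribute index, value) entries present in one combination."""
    return set(combinations(enumerate(combo), 2))


def pairwise_cover(attributes):
    """Greedily pick value combinations until every pair of values from two
    different attributes appears in at least one picked combination."""
    names = list(attributes)
    candidates = list(product(*(attributes[n] for n in names)))
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    chosen = []
    while uncovered:
        # Take the combination that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(dict(zip(names, best)))
        uncovered -= pairs_of(best)
    return chosen


def to_prompt(subgroup):
    """Turn one subgroup into a text prompt (hypothetical wording)."""
    return "a photo of a " + " ".join(subgroup.values())
```

For three attributes with three values each, the full product has 27 subgroups, whereas a pairwise cover needs far fewer prompts. Each selected subgroup would then be rendered by a text-to-image model and scored with the classifier under test.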

2022

Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness

Giulio Lovisotto, Nicole Finnie, Mauricio Munoz, Chaithanya Kumar Mummadi, and Jan Hendrik Metzen

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Neural architectures based on attention such as vision transformers are revolutionizing image recognition. Their main benefit is that attention allows reasoning about all parts of a scene jointly. In this paper, we show how the global reasoning of (scaled) dot-product attention can be the source of a major vulnerability when confronted with adversarial patch attacks. We provide a theoretical understanding of this vulnerability and relate it to an adversary’s ability to misdirect the attention of all queries to a single key token under the control of the adversarial patch. We propose novel adversarial objectives for crafting adversarial patches which target this vulnerability explicitly. We show the effectiveness of the proposed patch attacks on popular image classification (ViTs and DeiTs) and object detection models (DETR). We find that adversarial patches occupying 0.5% of the input can lead to robust accuracies as low as 0% for ViT on ImageNet, and reduce the mAP of DETR on MS COCO to less than 3%.
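The mechanism described in this abstract, a single key token drawing the attention of all queries, can be reproduced in a toy setting. The sketch below assumes the adversary can set one key vector directly and that all queries share a common direction; both are simplifications made for illustration. The attack in the paper instead optimizes the pixels of an input patch with dedicated objectives, which this sketch does not attempt to reproduce.

```python
import numpy as np


def attention_weights(Q, K):
    """Scaled dot-product attention weights softmax(Q K^T / sqrt(d)), row-wise."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)


def toy_patch_demo(n_tokens=16, d=8, scale=50.0, seed=0):
    """Compare attention before and after one key is replaced (toy setup).

    Returns (clean, attacked), each of shape (n_tokens, n_tokens).
    """
    rng = np.random.default_rng(seed)
    # Assumption: queries share a common unit direction plus small noise.
    common = np.ones(d) / np.sqrt(d)
    Q = common + 0.1 * rng.standard_normal((n_tokens, d))
    K = rng.standard_normal((n_tokens, d))
    clean = attention_weights(Q, K)
    # The adversary controls key 0 and aligns it with the common query
    # direction; a larger norm means a larger dot product, since dot-product
    # logits are not bounded.
    K_adv = K.copy()
    K_adv[0] = scale * common
    attacked = attention_weights(Q, K_adv)
    return clean, attacked
```

Comparing `clean[:, 0]` with `attacked[:, 0]` shows how much attention mass moves to the single controlled token, even though only one of the `n_tokens` keys was changed.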