Building SVGStud.io: Fine-tuning Stable Diffusion XL for Efficient SVG Generation

SVGStud.io provides AI-based generation of SVGs based on text or image prompts. This blog post explains our approach to SVG generation based on fine-tuning of Stable Diffusion XL (SDXL) to generate raster images optimized for conversion into SVGs. By applying an end-to-end training approach with a reward signal focused on vectorizability, aesthetics, and prompt adherence, we create high-quality, concise SVGs from AI-generated images, overcoming data scarcity and copyright challenges.

Making SDXL SVG-Friendly for SVGStud.io

Motivation

In the world of digital design, Scalable Vector Graphics (SVGs) have become a go-to format for creating crisp, scalable images that look great on any screen. Unlike raster images, which are made up of pixels, SVGs are composed of paths defined by mathematical equations. This makes them resolution-independent and perfect for applications like web design, where images need to look sharp at any size.

However, creating SVGs has traditionally been a challenging task, requiring expertise with vector graphics editors like Adobe Illustrator or Inkscape. The process of manually crafting paths, shapes, and curves can be time-consuming and complex, often limiting SVG creation to those with specialized skills. Recognizing this challenge, we developed SVGStud.io, a platform that democratizes SVG creation by enabling users to generate SVGs effortlessly using simple text prompts or sketches. SVGStudio’s AI-powered SVG Generator streamlines the design process, making vector graphic creation accessible to everyone, regardless of their technical expertise.

Generating SVGs autonomously using AI presents a unique set of challenges. The scarcity of SVG files and the dominance of raster images in existing datasets mean that training a model directly on SVGs is difficult. Moreover, copyright concerns limit the availability of permissively licensed SVGs, further narrowing the potential training set.

To overcome these obstacles, we adopted a creative approach: instead of generating SVGs directly, we harness a pre-trained text-to-image model such as Stable Diffusion XL (SDXL) to first generate raster images. These images are then converted into vector graphics using image tracing, with tools like potracer. This method allows us to leverage the capabilities of SDXL to create high-quality, vectorizable raster images.

But simply generating images isn’t enough. For the raster images to be efficiently converted into clean SVGs, they need to be tailored for vectorization—meaning they should be simple, clear, and structured in a way that they can be traced effectively. This post summarizes how we fine-tuned SDXL for this purpose.
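To make the link between image simplicity and SVG size concrete, here is a minimal sketch. It deliberately swaps in a naive run-length tracer (one rectangular subpath per horizontal run of black pixels) instead of a real tracing algorithm such as the one behind potracer, and the function names and toy bitmaps are hypothetical. It only illustrates that a noisy raster yields far more path data than a clean one.

```python
def trace_runs(bitmap):
    """Naive stand-in for image tracing (NOT potrace's algorithm).

    bitmap: list of rows, each a list of 0/1 values (1 = black).
    Returns one closed rectangular subpath per horizontal run of black pixels.
    """
    width = len(bitmap[0])
    subpaths = []
    for y, row in enumerate(bitmap):
        x = 0
        while x < width:
            if row[x]:
                start = x
                while x < width and row[x]:
                    x += 1
                run = x - start
                # Move to the run's top-left corner, then draw a 1-pixel-high box.
                subpaths.append(f"M{start} {y}h{run}v1h-{run}z")
            else:
                x += 1
    return subpaths


def to_svg(bitmap):
    """Wrap the traced subpaths in a single-path SVG document."""
    height, width = len(bitmap), len(bitmap[0])
    path_data = "".join(trace_runs(bitmap))
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        f'<path d="{path_data}"/></svg>'
    )


# A clean solid bar versus a checkerboard of isolated pixels (toy 8x4 bitmaps).
clean = [[1 if 2 <= x < 6 else 0 for x in range(8)] for _ in range(4)]
noisy = [[(x + y) % 2 for x in range(8)] for y in range(4)]

print(len(trace_runs(clean)), len(to_svg(clean)))
print(len(trace_runs(noisy)), len(to_svg(noisy)))
```

A real tracer fits curves rather than boxes, but the same trend holds: speckle, gradients, and fine texture in the raster image inflate the resulting path data.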

Fine-Tuning SDXL for SVG Generation

The fine-tuning process is essential for adapting SDXL to generate raster images that are particularly well-suited for conversion into concise SVGs. Instead of relying on a large dataset of SVG files—which is impractical—we employ an innovative end-to-end fine-tuning technique that directly optimizes the model’s output for vectorization, without requiring any SVG data (thus avoiding any copyright issues).

Here’s a step-by-step breakdown of how this process works:

Image Generation and Prompting: We start by generating images using the pre-trained SDXL model. These images are the raw material that will eventually be converted into SVGs, and our fine-tuning focuses on adapting their properties so that they are well suited for vectorization. During fine-tuning, we generate SVGs for prompts of the form “{style}, monochrome, {category}”. Here, style encodes different types of SVG appearance such as “silhouette”, “cartoon style”, or “logo”. “Monochrome” is added to the prompt to ensure that black-and-white images are generated, since our focus is on monochrome SVGs. Finally, category encodes the actual content of the SVG; we select it from a large number of predefined single-word categories such as different types of animals or objects. We find that training on simple single-word categories generalizes reasonably well to more complex user prompts.
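As a small illustration of the prompt template, the sketch below samples training prompts. The three styles are the ones named above; the category list and the function names are hypothetical placeholders, not our actual training vocabulary.

```python
import random

# Styles mentioned in the text.
STYLES = ["silhouette", "cartoon style", "logo"]
# Hypothetical placeholder categories; the real list is much larger.
CATEGORIES = ["falcon", "squirrel", "sword", "waterfall"]


def build_prompt(style, category):
    """Fill the "{style}, monochrome, {category}" template."""
    return f"{style}, monochrome, {category}"


def sample_prompt(rng):
    """Draw a random style and category and combine them into one prompt."""
    return build_prompt(rng.choice(STYLES), rng.choice(CATEGORIES))


rng = random.Random(0)  # seeded only to make the sketch reproducible
for _ in range(3):
    print(sample_prompt(rng))
```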
Reward Signal: To ensure that these images are suitable for vectorization, we compute a differentiable reward signal based on a weighted sum of several key terms:
- Aesthetic Appeal: We collect human preference data on pairs of SVGs generated for the same prompt. Based on this, we train a linear reward model on CLIP features extracted from the corresponding images (before tracing and SVG conversion), using a loss function based on the Bradley-Terry model.
- Prompt Adherence: How well the SVG aligns with the user’s input text prompt. For this, we define a reward term based on the cosine similarity between the CLIP image embedding of the generated image and the CLIP text embedding of the user’s prompt.
- Diversity: The variety of the generated SVGs for a given prompt. For each prompt, we generate two images with SDXL and compute a dissimilarity measure on the CLIP image embeddings. We define a reward term that is proportional to this dissimilarity. We find that this reward term helps in avoiding model collapse.
- Binarization: We focus on generating binary (black and white) SVGs. For each generated image, we compute pixel-wise the minimum distance to the RGB encoding of black (0, 0, 0) and white (1, 1, 1). We define a reward term corresponding to the negative average value of this distance across the generated image.
- Negative Prompt Avoidance: We define a set of undesirable properties and corresponding text prompts. We then compute the average cosine similarity between the CLIP image embedding of the generated image and the CLIP text embeddings of the undesirable properties. We add a reward term corresponding to the negative value of this average similarity, so that images resembling the undesirable properties are penalized.
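The reward terms above can be sketched in plain Python as follows. This is an illustration of the formulas only, under several assumptions: embeddings are plain lists of floats standing in for CLIP features, the pixel distance is Euclidean, the diversity dissimilarity is cosine distance, the negative-prompt term penalizes similarity to the undesirable prompts, and all function names and weights are hypothetical. In actual training these terms would need to be computed in an autodiff framework so that gradients can flow back to the model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear_reward(weights, features):
    """Linear aesthetic reward model on (stand-in) CLIP features."""
    return sum(w * f for w, f in zip(weights, features))


def bradley_terry_loss(weights, preferred, rejected):
    """Negative log-likelihood that `preferred` beats `rejected`.

    Bradley-Terry: P(preferred wins) = sigmoid(r_preferred - r_rejected),
    so the loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    """
    margin = linear_reward(weights, preferred) - linear_reward(weights, rejected)
    return math.log1p(math.exp(-margin))


def prompt_adherence_reward(image_emb, text_emb):
    return cosine_similarity(image_emb, text_emb)


def diversity_reward(image_emb_a, image_emb_b):
    # Assumed dissimilarity measure: cosine distance between the two images.
    return 1.0 - cosine_similarity(image_emb_a, image_emb_b)


def binarization_reward(pixels):
    """pixels: list of (r, g, b) tuples in [0, 1]; Euclidean distance assumed."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    total = sum(min(dist(p, black), dist(p, white)) for p in pixels)
    return -total / len(pixels)


def negative_prompt_reward(image_emb, negative_text_embs):
    # Assumption: penalize the mean similarity to the undesirable prompts.
    sims = [cosine_similarity(image_emb, t) for t in negative_text_embs]
    return -sum(sims) / len(sims)


def total_reward(terms, weights):
    """Weighted sum of named reward terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Toy usage: a pure black/white image earns the maximum binarization reward (0).
print(binarization_reward([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]))
print(binarization_reward([(0.5, 0.5, 0.5)]))
```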
Gradient Ascent: The reward signal is then used to guide the training of a low-rank adapter (LoRA) for SDXL. By backpropagating the reward, we adjust the LoRA parameters through gradient ascent, fine-tuning SDXL to produce raster images that can be easily traced into elegant SVGs. For this, we have adapted AlignProp to be compatible with SDXL and our reward function. AlignProp is very sample-efficient, allowing us to adapt SDXL for our purpose within a few GPU-hours. The flip side is that AlignProp performs greedy gradient ascent without constraining the fine-tuned model to stay “close” to the original model. However, we did not observe any model collapse or divergence in practice.
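As a rough formalization of this step, the update can be written as plain gradient ascent on the weighted reward. The notation is introduced here for illustration only and is not taken from AlignProp; it also glosses over the fact that the diversity term depends on a pair of images rather than a single one.

```latex
% Assumed notation: \theta = frozen SDXL weights, \phi = LoRA parameters,
% p = prompt, z = initial noise, G = the full sampling process,
% w_k = reward weights, \eta = learning rate.
R(x, p) = \sum_{k} w_k \, R_k(x, p)
\qquad
\phi \leftarrow \phi + \eta \, \nabla_{\phi} \,
  \mathbb{E}_{p,\, z}\!\left[ R\big( G_{\theta,\phi}(z, p),\; p \big) \right]
```

Only the LoRA parameters are updated; the base weights stay fixed. The objective contains no penalty tying the adapted model to the original one, which is the "greedy" aspect mentioned above.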

Results

Below are 99 uncurated (randomly selected) SVGs generated by SVGStudio:

AI-generated SVG of 'falcon with fedora'

AI-generated SVG of 'sword and snake wrapped'

AI-generated SVG of 'squirrel with acorn'

AI-generated SVG of 'rafting down rapids'

AI-generated SVG of 'running through jungle'

AI-generated SVG of 'valentine's day gift'

AI-generated SVG of 'waterfall in forest'

AI-generated SVG of 'cute raccoon cartoon'

AI-generated SVG of 'hummingbird with harmonica'

AI-generated SVG of 'tropical rainforest trek'

AI-generated SVG of 'alexander graham bell'

AI-generated SVG of 'horse-drawn carriage'

AI-generated SVG of 'jungle canopy bridge'

AI-generated SVG of 'nighttime city exploration'

AI-generated SVG of 'moose with briefcase'

AI-generated SVG of 'romantic beach walk'

Conclusion

The LoRA fine-tuning process of SDXL is crucial because it allows us to circumvent the limitations of direct SVG generation. By focusing on generating raster images that are easy to vectorize, we can produce SVGs that are both visually appealing and structurally concise. This approach not only mitigates the challenges posed by limited SVG data but also ensures that the generated SVGs are of high quality and diverse in style.

The result is a powerful AI-based SVG generator that can create stunning, scalable graphics from scratch, tailored to meet the needs of modern digital design. You can try out our AI-powered SVG Generator for free!