Evaluating Stable Diffusion 3 Medium (2B parameters) with Prompt Challenges
[View all images generated in this test]
In this article I evaluate how well the SD3 Medium model by Stability AI handles different prompt challenges. My results show that SD3 Medium does OK on most challenges, but overall it isn't great.
To conduct the test I used a subset of Parti Prompts (prompts designed by the engineers at Google to evaluate image generation models).
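The article does not say how the subset was chosen. As a minimal sketch of one reproducible option, the snippet below groups prompts by challenge and samples a fixed number from each. The inline TSV is a tiny stand-in for the real file, and the column names `Prompt` and `Challenge` are assumptions about its layout.

```python
import csv
import io
import random
from collections import defaultdict

# Tiny inline stand-in for the Parti Prompts TSV. The column names
# ("Prompt", "Challenge") are an assumption about the file layout.
SAMPLE_TSV = """Prompt\tChallenge
a walnut\tBasic
bond\tBasic
a kitchen\tBasic
ten red apples\tQuantity
two red boxes\tQuantity
three green peppers\tQuantity
"""


def load_by_challenge(tsv_text):
    """Group prompts by their challenge label."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        groups[row["Challenge"]].append(row["Prompt"])
    return dict(groups)


def sample_subset(groups, per_challenge, seed=0):
    """Pick up to `per_challenge` prompts from every challenge, reproducibly."""
    rng = random.Random(seed)
    subset = {}
    for challenge in sorted(groups):
        prompts = groups[challenge]
        k = min(per_challenge, len(prompts))
        subset[challenge] = sorted(rng.sample(prompts, k))
    return subset


subset = sample_subset(load_by_challenge(SAMPLE_TSV), per_challenge=2)
for challenge, prompts in subset.items():
    print(challenge, prompts)
```

Fixing the seed means the same subset comes back on every run, which keeps later comparisons between models or settings on equal footing.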
I tested over 100 prompts, each targeting a specific challenge. I then rated all the images and gathered showcases; the results follow.
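The rating scale is not described in the article, so the following is only a sketch of how per-challenge scores could be summarized, assuming a hypothetical 1 to 5 score per image. The values in `ratings` are placeholders, not results from this test.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (challenge, cfg scale, score from 1 to 5).
# These values are placeholders, not results from the article.
ratings = [
    ("Basic", 4.5, 4),
    ("Basic", 4.5, 2),
    ("Basic", 7.5, 5),
    ("Quantity", 4.5, 2),
    ("Quantity", 7.5, 1),
]


def summarize(rows):
    """Return the mean score per (challenge, cfg) pair."""
    buckets = defaultdict(list)
    for challenge, cfg, score in rows:
        buckets[(challenge, cfg)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


summary = summarize(ratings)
for (challenge, cfg), avg in sorted(summary.items()):
    print(f"{challenge:<10} cfg={cfg}: {avg:.1f}")
```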
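List of challenges: Basic, Simple Detail, Complex, Fine-grained Detail, Imagination, Linguistic Structures, Perspective, Properties & Positioning, Quantity, Style & Format, Writing & Symbols.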
SD3 Medium does a good job on this challenge and understands most basic concepts.
The Basic challenge in Parti Prompts primarily tests how effectively image generation models can interpret and execute simple tasks. These tasks often consist of basic shapes, colors, and compositions. They serve as a litmus test of how well a model understands fundamental visual concepts and can replicate them accurately.
Prompt | CFG 4.5 | CFG 7.5 |
A bowl of Pho | ||
Salvador Dalí | ||
The Starry Night | ||
U.S. 101 | ||
a fall landscape | ||
a kitchen | ||
a shiba inu | ||
a walnut | ||
an F1 | ||
an espresso machine | ||
bond | ||
parallel lines |
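Each table pairs one prompt with two CFG (guidance) scales. The article does not say which tool produced the images; the sketch below assumes the Hugging Face `diffusers` library and its `StableDiffusion3Pipeline`. The model id, step count, and fixed seed are assumptions, and `render` is only defined here, not called, because it needs a GPU and access to the model weights.

```python
from itertools import product

CFG_SCALES = (4.5, 7.5)  # the two columns of the tables in this article


def build_jobs(prompts, cfg_scales=CFG_SCALES, seed=0):
    """One job per (prompt, cfg) pair; a shared seed keeps the columns comparable."""
    return [
        {"prompt": p, "guidance_scale": g, "seed": seed}
        for p, g in product(prompts, cfg_scales)
    ]


def render(jobs, model_id="stabilityai/stable-diffusion-3-medium-diffusers"):
    """Generate one image per job. Needs torch, diffusers, a GPU, and model access."""
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    images = []
    for job in jobs:
        generator = torch.Generator("cpu").manual_seed(job["seed"])
        result = pipe(
            job["prompt"],
            guidance_scale=job["guidance_scale"],
            num_inference_steps=28,  # assumed; the article does not state the step count
            generator=generator,
        )
        images.append(result.images[0])
    return images


jobs = build_jobs(["a walnut", "a shiba inu"])
print(len(jobs))  # 2 prompts x 2 cfg scales
```

Reusing the same seed for both CFG values means the two images of a row differ only in guidance strength, which makes side-by-side comparison easier.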
SD3 Medium does well here too.
The Simple Detail test in Parti Prompts is designed to evaluate how well an image generation model handles tasks that are slightly more intricate than the basics. These tasks typically involve rendering detailed objects, textures, or patterns, or carrying out instructions that have several steps or components. The test gives an idea of the model's ability to handle complexity while maintaining accuracy in image generation.
Prompt | CFG 4.5 | CFG 7.5 |
A bowl of Chicken Pho | ||
A green heart | ||
A living area with a television and a table | ||
A shiny VW van parked on grass. | ||
A van parked on grass | ||
Siberian husky playing the piano. | ||
a baby daikon radish | ||
a farm scene with cows, ducks and a tractor. | ||
a lavender backpack with a triceratops stuffed animal head on top | ||
a team playing baseball |
SD3 Medium does not fail completely here, but it does not follow complex prompts thoroughly.
This test involves tasks that are highly intricate or require a high level of detail. The complexity could be in terms of intricate designs or complex instructions to replicate.
Prompt | CFG 4.5 | CFG 7.5 |
A dignified beaver wearing glasses, a vest, and colorful neck tie. He stands next to a tall stack of books in a library. | ||
A photo of a hamburger fighting a hot dog in a boxing ring. The hot dog is tired and up against the ropes. | ||
A photograph of the inside of a subway train. There are frogs sitting on the seats. One of them is reading a newspaper. The window shows the river in the background. | ||
A robot painted as graffiti on a brick wall. The words "Fly an airplane" are written on the wall. A sidewalk is in front of the wall, and grass is growing out of cracks in the concrete. | ||
A set of 2x2 emoji icons with happy, angry, surprised and sobbing faces. The emoji icons look like dogs. All of the dogs are wearing blue turtlenecks. | ||
A solitary figure shrouded in mists peers up from the cobble stone street at the imposing and dark gothic buildings surrounding it. an old-fashioned lamp shines nearby. oil painting. | ||
A wall in a royal castle. There are two paintings on the wall. The one on the left a detailed oil painting of the royal raccoon king. The one on the right a detailed oil painting of the royal raccoon queen. A cute dog looking at the two paintings, holding a sign saying 'plz conserve' | ||
Greek statue of a man comforting a cat. The cat has a big head. The man looks angry. | ||
Horses pulling a carriage on the moon's surface, with the Statue of Liberty and Great Pyramid in the background. The Planet Earth can be seen in the sky. | ||
a photograph of a fiddle next to a basketball on a ping pong table | ||
a tree reflected in the hood of a blue car |
Overall, SD3 Medium is good at handling fine-grained detail prompts, but there is still a lot of room for improvement.
This test deals with tasks that require the model to generate an image with a high level of detail. This could be in terms of creating very realistic images, replicating detailed textures, or creating images with a lot of elements.
Prompt | CFG 4.5 | CFG 7.5 |
A bare kitchen has wood cabinets and white appliances | ||
A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a stump | ||
A sunken ship becomes the homeland of fish. | ||
A teddy bear wearing a motorcycle helmet and cape is standing in front of Loch Awe with Kilchurn Castle behind him | ||
a baby daikon radish in a tutu walking a dog | ||
a kids' book cover with an illustration of white dog driving a red pickup truck | ||
a photograph of the mona lisa drinking coffee as she has her breakfast. her plate has an omelette and croissant. | ||
a young badger delicately sniffing a yellow rose, richly textured oil painting | ||
fairy cottage with smoke coming up chimney and a squirrel looking from the window | ||
purple lego dollhouse with a pool and a swing |
SD3 Medium isn't very imaginative, so don't expect great results here.
This test involves imaginative tasks where the image generation model needs to create unique, nonexistent concepts based on the instruction given.
Prompt | CFG 4.5 | CFG 7.5 |
A giant cobra snake made from corn | ||
A group of farm animals (cows, sheep, and pigs) made out of cheese and ham, on a wooden board. There is a dog in the background eyeing the board hungrily. | ||
A horse sitting on an astronaut's shoulders. | ||
A large city fountain that has milk instead of water. Several cats are leaning into the fountain. | ||
A shiny robot wearing a race car suit and black visor stands proudly in front of an F1 race car. The sun is setting on a cityscape in the background. comic book illustration. | ||
A television made of water that displays an image of a cityscape at night. | ||
A tornado made of sharks crashing into a skyscraper. painting in the style of Hokusai. | ||
The 1970s logo for a london-area football club called "The Rumbury Wanderers" | ||
The collision of two black holes in the center of a galaxy. | ||
a baby daikon radish in a tutu | ||
a dump truck filled with soccer balls scuba diving in a coral reef. | ||
a small kitchen with a white goat in it |
SD3 Medium clearly does not understand negation such as "no" or "without", so you will still need to use the negative prompt a lot.
This test involves tasks that require understanding of language. The model would need to recognize words, phrases or even sentences and generate relevant images accordingly.
Prompt | CFG 4.5 | CFG 7.5 |
A bird gives an apple to a squirrel | ||
An aerial view of Ha Long Bay without any boats | ||
a bookshelf without any books on it | ||
a concert without any fans | ||
a kitchen without a refrigerator | ||
a plate that has no bananas on it. there is a glass without orange juice next to it. | ||
a street without vehicles | ||
a summer tree without any leaves | ||
supercalifragilisticexpialidocious |
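Since negation inside the prompt is unreliable, one workaround is to move the negated object into the negative prompt. The helper below is a hypothetical, deliberately simple heuristic for "without X" phrasing, not something the article used. It will mishandle longer phrases such as "without any books on it", where the trailing words end up in the negative prompt as well.

```python
import re

# Matches "without any X", "without a X", or "without X" up to the next
# punctuation mark or the end of the prompt. A deliberately simple heuristic.
WITHOUT = re.compile(r"\s*\bwithout\s+(?:any\s+|an?\s+)?([^.,;]+)", re.IGNORECASE)


def split_negation(prompt):
    """Return (positive_prompt, negative_prompt) for 'without X' phrasing."""
    negatives = [m.strip() for m in WITHOUT.findall(prompt)]
    positive = WITHOUT.sub("", prompt).strip()
    return positive, ", ".join(negatives)


for p in ["a street without vehicles", "a kitchen without a refrigerator"]:
    print(split_negation(p))
```

The second element of the returned pair is what would go into the negative prompt field of whatever generation tool is in use.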
There are some big failures and some big successes here, so hope for the best or generate a batch of images to improve your odds.
This test involves tasks that require understanding of perspective and depth. The image generation model would need to generate images that show a clear understanding of dimensions, distance and perspective.
Prompt | CFG 4.5 | CFG 7.5 |
A robot with a black visor and the number 42 on its chest. It stands proudly in front of an F1 race car. The sun is setting on a cityscape in the background. wide-angle view. comic book illustration. | ||
A smiling sloth is wearing a leather jacket, a cowboy hat, a kilt and a bowtie. The sloth is holding a quarterstaff and a big book. The sloth is standing on grass a few feet in front of a shiny VW van with flowers painted on it. wide-angle lens from below. | ||
Saturn rises on the horizon. | ||
Three-quarters front view of a blue 1977 Corvette coming around a curve in a mountain road and looking over a green valley on a cloudy day. | ||
Zoomed out view of a giraffe and a zebra in the middle of a field covered with colorful flowers | ||
a close up of a handpalm with leaves growing from it | ||
a close-up of a bloody mary cocktail | ||
a cross-section view of a walnut | ||
long shards of a broken mirror reflecting the eyes of a great horned owl |
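Generating several images per prompt and keeping the best one can be organized around a reproducible list of seeds. This is a minimal sketch under that assumption: `seed_batch` and `pick_best` are hypothetical helpers, and the scores are placeholders for manual ratings.

```python
import random


def seed_batch(n, master_seed=0):
    """Return n distinct, reproducible seeds for one prompt."""
    rng = random.Random(master_seed)
    return rng.sample(range(2**31 - 1), n)


def pick_best(candidates):
    """candidates: list of (seed, score) pairs rated by hand; returns the top pair."""
    return max(candidates, key=lambda pair: pair[1])


seeds = seed_batch(4)
# Placeholder scores standing in for manual ratings of the four images.
scores = [2, 5, 3, 1]
best_seed, best_score = pick_best(list(zip(seeds, scores)))
print(best_seed == seeds[1], best_score)
```

Recording the winning seed alongside the prompt and CFG scale makes the best image reproducible later.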
SD3 Medium fails here too.
This test focuses on the model's ability to understand the properties of different objects and their positioning in the generated image.
Prompt | CFG 4.5 | CFG 7.5 |
A photo of a Persian Metal Engraving vase sitting to the left of a bunch of orange flowers. | ||
a green pepper to the left of a red pepper | ||
a stack of three red cubes with a blue sphere on the right and two green cones on the left | ||
a white flag with a red circle next to a solid blue flag |
SD3 Medium fails most of the time to produce good images for prompts involving quantities.
This test involves tasks that check the model's ability to count and distinguish between different quantities.
Prompt | CFG 4.5 | CFG 7.5 |
Four dragons surrounding a dinosaur | ||
Times Square with thousands of dogs running around | ||
a bunch of laptops piled on a sofa | ||
ten red apples | ||
the hands of a single person holding a basketball | ||
three airplanes parked in a row at a terminal | ||
three green peppers | ||
two baseballs to the left of three tennis balls | ||
two parallel chemtrails in blue sky | ||
two red boxes |
If your prompt is imaginative, expect bad results; if it is straightforward, you should be fine.
The Style & Format test is designed to check the model's ability to understand and replicate various styles and formats in the generated images.
Prompt | CFG 4.5 | CFG 7.5 |
A photo of a dragonfly made of water. | ||
A photo of a lotus flower made of water. | ||
A rusty spaceship blasts off in the foreground. A city with tall skyscrapers is in the distance, with a mountain and ocean in the background. A dark moon is in the sky. realistic high-contrast anime illustration. | ||
Oil painting generated by artificial intelligence | ||
a painting of a white country home with a wrap-around porch | ||
a painting of the food of china | ||
a satellite image of a costal french city there is a large park on the west side and a mountain to the north. There is a cloud covering part of the image | ||
an abstract oil painting in deep red and black with a thick patches of white | ||
close-up portrait of a smiling businesswoman holding a cell phone, oil painting in the style of Rembrandt |
Results are good when the prompt is not complicated, but expect missing or extra characters here and there.
This test involves interpreting written instructions or generating images that contain some form of writing or symbols.
Prompt | CFG 4.5 | CFG 7.5 |
A glass of red wine tipped over on a couch, with a stain that writes "OOPS" on the couch. | ||
A green sign that says "Very Deep Learning" and is at the edge of the Grand Canyon. | ||
A sign that says Deep Learning | ||
Portrait of a tiger wearing a train conductor's hat and holding a skateboard that has a yin-yang symbol on it. charcoal sketch | ||
Two cups of coffee, one with latte art of a lovely princess. The other has latte art of a frog. | ||
a boat with 'BLUE GROOVE' written on its hull | ||
graffiti spelling BE KIND on white subway tile | ||
the saying "BE EXCELLENT TO EACH OTHER" on a rough wall with a graffiti image of a green alien wearing a tuxedo. |