DALL-E 2 - Unofficial Natural Language Image Editing, Art Critique Survey
Exciting news! I just completed an independent, unofficial survey gauging people’s feedback on DALL-E 2 image generations.
I started the survey because I was curious about DALL-E’s editing features and how they compare to the breadth of natural language editing capabilities creatives and artists actually want. At the moment, DALL-E’s editing works through a process called inpainting, which is limited to adding or removing elements using AI. In this analysis, I was also interested in how people perceive the value of art before and after making high-level edits and changes using text-based natural language.
To conduct this survey, I asked 32 participants to give me feedback on my own DALL-E 2 art. Each participant was asked to assume they had access to a magic computer which could understand natural language text and complete any changes they wanted1.
Each participant was also asked to score the perceived value of the art before and after their editing changes along several dimensions, like beauty and impact2; however, I will not be discussing those scores here. In this analysis, we will examine the qualitative side of the survey: respondents’ feedback on the specific edits and changes they would make to an image using natural language. The survey was divided into 10 primary sections, one per image; you can see an example section in the footnotes below.
Some of the participants’ natural language text requests surprised me a lot. They ranged from simple colour adjustments and camera angle changes to more advanced requests like emotional adjustments to characters’ faces.
DALL-E Art Editing Requests
Below, I have included some of my sample art along with feedback highlights from different participants. I have added bullets within each participant’s feedback quote to make it easier to read:
Image #1 - “Creative soul mates, digital Art”
Remove what looks like a needle & thread from the eye of the pink character
Change the expressions of both characters to be more positive (they both look sad right now)
Add design elements to the body of the green character (the pink one has a design on her body, whereas the green one is "blank" right now)
I would add in some cool clothing designs for each of the "bodies". Some kind of top would look good on them.
The stars in the background could use some extra glitter and attraction. Make it seem like the background is sparkling.
Instead of a solid black background, I would like to add in a brighter background that transitions from dark to light from the top left corner
Image is too flat; create visual separation between over lapping elements.
Image is too symmetrical with no focal point; change one of the characters to stick out more.
Connect the characters line of sight to a focal point. Add depth by making the stars more blurred.
Image #6 - “A Hot Dog in the style of a Renaissance Painting”
Change the mayonaisse topping and replace it with ketchup
Remove the spilled mustard from the plate
Add relish, chopped onions, jalapenos and cream cheese as toppings to make it more overtly decadent
Make the bun slightly burnt (with blackened sides like it was overcooked)
Change the background color from black to red.
Desaturate the colors of the plate.
Crop the picture so that the edges of the plate create negative spaces.
Add a second light source behind the hotdog to remove the harsh shadow it is casting.
Mirror image vertically.
make the brush strokes more interesting and visible
make an interesting background and not just have it plain black
have Greek or Roman Gods dancing around the hot dog or taking bites of it
have a visible light source to add more depth and interest
Image #9 - “A photo of a confused grizzly bear in calculus class”
Give the bear fur that is dirtier and less groomed.
Gives the bear the appearance of being stressed and disorganized.
Add in some erase marks over the chalkboard. This will make it look like the person kept changing their minds.
Would be good to add in a teacher behind them, almost like they are looming over the bear, "demanding" they get the correct answer.
Crop the image around the bear's head while leaving in the chalk board.
Increase the contrast with the chalk board and chalk writings.
Add a second light source bellow the bear's head that points upward to create an unnatural sense of lighting to highlight the bear's sense of angst.
Adjust the image by zoom out to get the full body of the bear.
Add blurred file of letters and mathematics formula PNG's in the foreground.
Apply vignette mode for center of attention to the bear.
Increase the texture of the image.
Add birds flying around bear's head.
Categorizing Natural Language Text Based Editing Requests
To better understand the types of edits respondents were looking for, I tried to group all the different requests into natural language editing categories.
After cleaning up the responses a bit and combining them into a single text-based string, I generated a word cloud of the 10 most frequently used terms:
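The term tally behind a word cloud like this can be sketched in a few lines of Python. The feedback string and stop-word list below are illustrative placeholders, not the survey’s actual data or tooling (the write-up used MonkeyLearn for the real word clouds):

```python
from collections import Counter
import re

# Hypothetical sample of cleaned respondent feedback, combined into one
# text-based string (the real survey used the full response set).
feedback = (
    "change the background color. add a brighter background. "
    "increase the color contrast. make the background sparkle. "
    "desaturate the colors of the plate."
)

# Tiny stop-word list for illustration only; a real analysis would use a
# fuller list or a relevance-scoring tool.
stopwords = {"the", "a", "of", "add", "make", "change", "increase"}

# Tokenize, drop stop words, and count term frequencies.
words = re.findall(r"[a-z]+", feedback.lower())
counts = Counter(w for w in words if w not in stopwords)

# The most common terms would feed the word cloud.
print(counts.most_common(3))
```

With real responses, `counts.most_common(10)` would give the 10 terms behind the first word cloud; weighting by a relevance score instead of raw count would give the sorted variant discussed below.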
Looks like people really want to influence colors, backgrounds, and image clarity/quality!
The top 50 ended up looking like this:
We see new editing/feature categories like center of attention, vignette mode, “warm” colors, negative space, and light sources.
… and of course the top 100 ended up looking like this:
Interesting words like focal point, positioning, characters, “emotional connect”, creative theme, hair colours, and much more appear!
The above word clouds are sorted not just on count, but on MonkeyLearn’s relevance scoring as well. Here’s what the top 100 looked like from their actual word cloud dataset, sorted by count (ignoring relevance):
Discussion - Wordcloud
We can see there is a meaningful need for control over the colour and background across all the images. We can also observe additional changes, rooted in design and art terminology, that users are looking for.
I then hand-tallied all the response feedback, point by point, into various categories of image editing changes that I thought made sense. Here are my findings for the first two images:
Image #1 - “Creative soul mates, digital Art”
Image #2 - “Photo of a red dodgeball resting on the center line of a school gym”
Hand Tally - Discussion
Many of the change requests concern colour, background, and adding/removing elements. Lighting and illumination played a bigger role in the second image, perhaps because it is supposed to be photorealistic. In both cases, respondents were also interested in positional changes, as well as more control over composition, framing, and angles.
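The hand tally above could be approximated programmatically with simple keyword matching. The categories and keywords in this sketch are my own illustrative assumptions, not the actual coding scheme used for the tally (which was done entirely by hand):

```python
# Illustrative keyword-to-category mapping; categories and keywords are
# assumptions for this sketch, not the survey's real coding scheme.
CATEGORY_KEYWORDS = {
    "colour": ["color", "colour", "desaturate", "contrast"],
    "background": ["background", "backdrop"],
    "add/remove elements": ["add", "remove"],
    "lighting": ["light", "shadow", "illumination"],
}

def categorize(point: str) -> list[str]:
    """Return every category whose keywords appear in a feedback point."""
    lowered = point.lower()
    return [cat for cat, kws in CATEGORY_KEYWORDS.items()
            if any(kw in lowered for kw in kws)]

# Two example feedback points drawn from the kinds of requests above.
points = [
    "Change the background color from black to red.",
    "Add a second light source behind the hotdog.",
]

tally = {}
for p in points:
    for cat in categorize(p):
        tally[cat] = tally.get(cat, 0) + 1

print(tally)
```

A keyword approach like this would reduce the subjectivity noted in the limitations section, though it would miss requests phrased without the expected vocabulary.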
Additional Image Editing Categories
Setting frequency aside, by individually reviewing respondent feedback per image we can observe additional kinds of natural language editing capabilities, and more naturally expressive, fluid requests for image changes:
Focal point, depth:
Image is too symmetrical with no focal point; change one of the characters to stick out more.
Add depth by making the stars more blurred.
Color contrast, flat colours, brightness and more:
Should change the color contrast of the stars.
Make the color brightness little lesser,
The art may contain light color in the background, that may look better.
Back drop theme should be better with more number of colors.
We can also observe additional requests around basic creative tool functionality such as cropping and vignettes.
Advanced Categories of Changes:
Besides emotionally expressive revisions, respondents occasionally described changes to the image which required a higher level of multimodal understanding across natural language text and images:
I would add water streaks on the ball, almost like the ball is sweating. This would give the impression that ball is "working hard".
Add 2 opposing players standing over the ball, as if they are about to engage in each other.
have more shadow under the ball to show depth, have more scuffs on the floor to show imperfections, have more objects in the scene to be more dynamic, have the ball be slight deformed to show it has weight and gravity, add particles in the air to show the room is more real
Change the ball's position as bouncing upwards.
The feedback from the “confused grizzly bear” image was particularly instructive. Comments included adding spinning birds around the bear’s head to indicate his confusion, as well as adding claw marks to the chalkboard behind him to indicate his prior frustration.
Perhaps these advanced kinds of changes dealt with the larger “story” or theme behind the image itself, making them more unique but also more challenging. They also appealed more strongly to the audience’s perception, psychology, and emotions.
Finally, as a potential opportunity, support for these kinds of higher level advanced changes could possibly help multimodal AI editing models generalize better in the future.
Limitations of this Survey and Analysis
There are several limitations to this survey:
Only 32 participants were surveyed
Respondents were not given the option to “skip an image” and leave no feedback at all. Every image required feedback and change suggestions.
The hand tally approach for categorizing natural language change requests is highly subjective and leaves room for misinterpretation
Limited theoretical grounding in this survey design in areas such as the value of art, how people perceive art, the role of editing an image, and natural language communication for AI based creative tools
Unclear audience demographics. This survey only asked participants to self-identify as fluent English speakers with at least 5 years of professional or personal art/design experience
Mechanical Turk was not the best participant sourcing partner: over half of the responses received for this survey were omitted from the results for being suspicious, spam, or likely bot-generated content. Although a best effort was made to source both high-quality responses and trustworthy, relevant participants, this cannot be guaranteed.
This survey was written by me! There is a lack of diversity amongst the authors of this analysis, and it has no academic or organizational affiliation.
This write up has omitted the quantitative data which was collected in the survey to evaluate the respondents’ perception of the art before and after their changes.
Lack of diverse images. This survey could have used more examples of AI generated art from different artists, different kinds of art styles, as well as more graphic design oriented work like company logos
Lack of instructional task diversity. Editing feedback will likely change based on the survey instructions given to respondents. This survey mainly asked participants to give overall feedback on art; however, their feedback would likely change if they were asked to critique a DALL-E generated industrial design product like an armchair, graphic-design-heavy website designs, or a specific character design. This is another area to explore in future studies; perhaps more categories and different kinds of changes would emerge.
Finally, asking users to type the editing changes they would make into a textbox in a survey form may not be the best way to collect organic image editing feedback. Perhaps alternative ways of collecting this information, like advanced tracking in existing creative software tools or parsing existing markup on images, could provide higher quality data for a future study.
Further research could combine this analysis with better theoretical grounding of past work and formal studies. Also, it could be helpful to ask respondents in a future study to subjectively rank the value of their changes to better understand which natural language editing categories are the most important. Finally, a study could be conducted asking participants to evaluate images which have already been edited to measure their perceived value improvements along dimensions such as beauty, impact, and originality.
Why is this survey important and exciting?
This analysis attempted to examine the categories of natural language image editing change requests by looking at qualitative responses to DALL-E generated art from designers and creatives. Beyond proposing several categories of creative change requests, this analysis also suggested advanced categories of changes. These could be used as high-level goals while also potentially helping AI editing models generalize better in the future. Through findings from studies like this one, it is possible that future AI models, AI creative alignment initiatives, quality benchmarks, and better multimodal natural-language-editing-specific training/validation datasets could be created.
My Personal Thoughts
This survey was so much fun to conduct and I learned a lot. Some thoughts which crossed my mind after reviewing the results:
I agreed with most of the suggestions people made and felt they would dramatically improve the quality of my work! I’ll admit, as the creator of these images, I grew to not only like them but accept many of their flaws. In a way, I was blinded and biased, rooting for the success of my own creations. After reviewing feedback from respondents, I now realize there is actually so much more room for them to grow and become even better. Through this process, I have found that improvements to text-based natural language editing tools have the potential to make better art.
In fact, now that I am looking back at these images through the lens of the respondents’ editing suggestions, I actually think they are below, but not far from, an acceptable quality level for professional art and design. Besides general AI weirdness, these images are lacking in many key areas like colour, depth, and composition.
DALL-E’s current editing capabilities, mainly inpainting, do not sufficiently cover the breadth of natural language editing commands which I have surfaced here. Many of the feature categories I have outlined are currently not supported by the system. At the same time, in my opinion, this survey demonstrates that greater creative AI alignment is needed between the natural language editing commands entered by users and the results from the model itself.
On the other hand, it is exciting to see that DALL-E currently does support adding/removing elements via inpainting; this was a consistent category of editing change request from virtually all respondents.
It is unclear how much improving the DALL-E 2 model on its own will improve its editing and inpainting capabilities. Besides creative-editing alignment with human feedback approaches, perhaps a new kind of AI editing model is needed.
With a wider survey featuring more diverse art examples, my speculation is that a greater list of features and natural language editing categories could be compiled, leading to more complete AI editing coverage. I also have reason to believe that a wider survey could surface more unique examples like the ones from the “Advanced Categories of Changes” section above, which could lead to greater AI model editing capability generalization.
Responses, Raw Survey Data
You can check out the original Google Form survey here. Please note, it can take up to 2 hours to complete the survey.
You can find the qualitative responses for each image in the Google Doc I’ve compiled here.
You can find the raw data responses, as well as the omitted quantitative data which wasn’t discussed in this write-up, from the survey here at this link to the Google Sheet.