The Impact of DALL·E on Creative Work

OpenAI recently teased a new deep learning system called DALL·E. It can generate images from text with a surprising amount of creativity and variation. Click this link to see example images.

I've received questions from reporters about the technology, and my answers are reposted here for posterity.

Q: Is DALL·E going to take away the jobs of creatives?

The tool does not yet produce the high-quality images expected in professional settings: its outputs are more like 256x256 caricatures that a professional artist would need to remake, either as vector art or in pixel form. As such, you can expect DALL·E to be used mostly for assisted brainstorming and prototyping at first.

Few jobs are at risk from brainstorming tools: ideation is where you need as much input as possible, and having a neural network assist is ideal!

Q: How about the longer-term impact of this technology?

The creative industry is already at risk from the deployment of a variety of algorithms (not just deep learning, but old-school techniques too) that will increasingly disrupt day-to-day work, to the point where organizational structures will need to change. Everyone in agencies is overworked, and if AI does the manual tasks faster, that's going to be quite a mess!

DALL·E's future iterations will likely disrupt creative work incrementally, since OpenAI is likely to give companies access gradually, as they did for GPT-3. It's not clear whether DALL·E will ever reach professional-quality output: GPT-3 has 15x more parameters and still makes many mistakes in its output text (grammatical, factual, etc.).

See "AI and the Death of Creative Industry" and "Generating Promotional Brochures and Mailers" for perspective.

Q: What are ethical issues with deep learning for creativity?

The primary ethical issue is copyright laundering. Such large models are trained on large datasets scraped from the internet with no attribution. This is confirmed by early review versions of the white paper:

"To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (images, text) pairs as part of the construction process that includes one of a set of 500,000 queries."

GPT models have been shown to reproduce their training content verbatim, so the legal situation around copyright infringement and Fair Use remains unclear until it's tested in court!

Q: What about bias in machine learning with DALL·E?

As with natural language produced by AI, image generators can portray bias and racism, and create material inappropriate for public display. In creative agencies, inappropriate ideas come up regularly in brainstorming: to get great ideas you need to remove inhibitions, and anything can come out. Those ideas get filtered out quickly by creative teams though!

For DALL·E, people will need to be in the loop to filter out those inappropriate images. If it's used in a fully automated system, then it's asking for trouble...

Q: How much does DALL·E cost to train? (edited)

It takes 256 GPUs for 2 weeks to train CLIP (another model released alongside DALL·E), which comes to a total of 86,016 GPU hours. Assuming the current price of Amazon's EC2 p3.16xlarge instances (and that the model fits into memory), that would cost $131,604.48, which works out to $1.53 per GPU hour. For a large company that owns servers or rents them as "reserved instances", that could be up to 70% cheaper: around $40k per model, though this treats hardware investments as sunk costs.
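The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the per-GPU hourly rate is an assumption derived from the total quoted above rather than an official AWS price, and the function names are made up for illustration.

```python
# Figures taken from the text: 256 GPUs running for 2 weeks.
NUM_GPUS = 256
TRAINING_DAYS = 14

# Assumption: the per-GPU hourly rate implied by the quoted total
# ($131,604.48 / 86,016 GPU hours). This is not an official AWS price.
USD_PER_GPU_HOUR = 1.53

# Upper bound on the saving for owned or reserved hardware, from the text.
RESERVED_DISCOUNT = 0.70


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU hours for a run that keeps every GPU busy 24 hours a day."""
    return num_gpus * days * 24


def training_cost(hours: int, usd_per_gpu_hour: float, discount: float = 0.0) -> float:
    """Cost in USD for a number of GPU hours, with an optional fractional discount."""
    return hours * usd_per_gpu_hour * (1.0 - discount)


hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
on_demand = training_cost(hours, USD_PER_GPU_HOUR)
reserved = training_cost(hours, USD_PER_GPU_HOUR, RESERVED_DISCOUNT)

print(f"GPU hours:          {hours:,}")
print(f"On-demand estimate: ${on_demand:,.2f}")
print(f"Reserved estimate:  ${reserved:,.2f}")
```

Changing USD_PER_GPU_HOUR or RESERVED_DISCOUNT shows how sensitive the estimate is to the pricing assumption, which is the least certain input here.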

It's unknown how much more expensive DALL·E is to train: likely between 5x and 50x more? The closest prior art is this blog post by NVIDIA about training an 8-billion-parameter model, whereas DALL·E has 12 billion parameters. Unfortunately, the blog post does not indicate how long it took to train GPT2-8B.

The work that OpenAI has done to improve performance and reduce training costs on CLIP is impressive, and these models will likely be accessible to many companies!

Q: How does it work under the hood? (added)

There are two parts to DALL·E's deep learning model:

  1. A language model based on GPT that takes query text and learns to output a binary 32x32x8192 image, where each pixel is one of 8k tokens in a learned visual language.
  2. An image super-resolution "decoder" that takes this 32x32x8192 bitmap and turns it into a 256x256x3 image with RGB channels.
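To make the tensor shapes in these two parts concrete, here is a toy NumPy sketch. It is not OpenAI's implementation: the language model is replaced by random token ids, and the "decoder" is a hypothetical stand-in (a random token-to-colour codebook followed by nearest-neighbour upsampling) chosen only so that the shapes match the description above.

```python
import numpy as np

GRID = 32      # 32x32 grid of visual tokens (from the text)
VOCAB = 8192   # size of the learned visual vocabulary (from the text)
OUT = 256      # output resolution in pixels (from the text)

rng = np.random.default_rng(0)

# Part 1 stand-in: random token ids in place of a real language model's output.
token_ids = rng.integers(0, VOCAB, size=(GRID, GRID))

# The "binary 32x32x8192 image": a one-hot vector over the vocabulary per cell.
one_hot = np.zeros((GRID, GRID, VOCAB), dtype=np.uint8)
rows, cols = np.indices((GRID, GRID))
one_hot[rows, cols, token_ids] = 1

# Part 2 stand-in (hypothetical): map each token to an RGB colour, then
# upsample 8x by repeating pixels. A real decoder is a learned network.
codebook = rng.random((VOCAB, 3))
low_res = codebook[token_ids]                    # shape (32, 32, 3)
scale = OUT // GRID                              # 8
image = low_res.repeat(scale, axis=0).repeat(scale, axis=1)

print(one_hot.shape)   # (32, 32, 8192)
print(image.shape)     # (256, 256, 3)
```

Each of the 1,024 grid cells carries one of 8,192 tokens, so the intermediate representation is far more compact than the 256x256x3 image it expands into.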

Both of these components are trained jointly over the dataset of 400M (image, text) pairs. Text is given to the language model, and the corresponding image is used to correct the predicted output, likely using a simple mean-squared error loss.
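Taking the guess about a mean-squared error loss at face value, the quantity being minimised would look roughly like the following. This is a minimal sketch with made-up data: the function name mse_loss and the noisy "prediction" are illustrative assumptions, and the actual training objective may well differ.

```python
import numpy as np


def mse_loss(predicted, target) -> float:
    """Mean of the squared per-pixel differences between two images."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    return float(np.mean((predicted - target) ** 2))


rng = np.random.default_rng(0)

# Made-up data: a random 256x256 RGB "training image" with values in [0, 1],
# and a noisy copy of it standing in for the model's predicted output.
target = rng.random((256, 256, 3))
predicted = np.clip(target + rng.normal(0.0, 0.1, size=target.shape), 0.0, 1.0)

loss = mse_loss(predicted, target)
print(f"MSE between prediction and training image: {loss:.4f}")
```

A perfect reconstruction gives a loss of zero, and training would adjust the model's weights to push the loss towards that value across the dataset.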

Q: What's the closest prior art? (added)

A number of papers are cited as references. However, conceptually DALL·E seems to derive from Jukebox and its underlying VQVAE2.

The concepts of the architecture are relatively simple, but when training with web-scale datasets of 400M samples, everything becomes significantly harder!

(Cover image Ship with Butterfly Sails by Salvador Dalí.)