New Video: The Essence of Multimodal Creativity [GPT-X, DALL-E, and our Multimodal Future]
What is the major opportunity now available for a "multimodal creative"? What great, new thing can you do now, thanks to multimodal AI, that simply couldn't be done ever before in human history? What can we make, and how will we be able to improve our creative work through advances in AI?
In this video, I’ll walk you through the first critical lesson I foresee creatives needing to know in the future!
YouTube Transcript (SPOILER WARNING):
A typical AI transformer model, like GPT-3, is trained only on raw text data, mostly from the internet. As a result, it can only generate sequences of text. GPT-3 may have read about images and know what they depict, but by design, it has never actually seen an image, let alone something like the Mona Lisa.
Multimodal models, however, are trained on multiple kinds of data at once. They could be trained on images and text, text and video, or even audio and other kinds of sensory data. By looking at the world through multiple mediums (or modalities), these models develop a more holistic understanding of the “real world”.
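If you want to see a multimodal model in action for yourself, here's a minimal sketch using OpenAI's CLIP, a released model trained on image-text pairs, through the Hugging Face transformers library (the image path is a placeholder; GPT-3 and DALL-E themselves aren't downloadable like this):

```python
# A minimal sketch of a text+image multimodal model, using OpenAI's CLIP
# via the Hugging Face transformers library. The image path is a placeholder.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("mona_lisa.jpg")  # any image you have on disk
captions = ["a renaissance portrait of a woman", "a bowl of fruit on a table"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

# Unlike a text-only model, CLIP has actually "seen" the image, so it can
# tell which caption matches it.
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```

Because CLIP was trained on images paired with their descriptions, it can score how well each caption matches the picture, something a text-only model like GPT-3 simply has no way to do.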
The most notable model, OpenAI’s DALL-E, can generate images from just a text description. Given the text “a collection of glasses is sitting on the table”, DALL-E was able to instantly generate these images from scratch.
Notice the diversity and real-world feasibility of its drawings?
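The original DALL-E from the research paper was never released directly, but assuming OpenAI's public images API as a stand-in, generating your own set would look roughly like this (the model name and image size here are just illustrative choices):

```python
# A hedged sketch: the original DALL-E isn't public, so this uses OpenAI's
# images API as a stand-in. Requires OPENAI_API_KEY in your environment.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-2",
    prompt="a collection of glasses is sitting on the table",
    n=4,              # ask for several candidates to see the diversity
    size="512x512",
)
for image in result.data:
    print(image.url)  # each URL points at a freshly generated image
```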
It’s amazing it was able to do all that with just a simple description written in English. A true breakthrough for everyone. However, the real essence of multimodal creativity starts when you combine unique kinds of text and have AI “imagine” them for you as images right away.
For example, DALL-E was given the description “a chair shaped like an avocado”, and here’s what it came up with instantly:
This impressive industrial product design is pretty groundbreaking and represents the “mixing” of different kinds of ideas that multimodal models make possible.
Let’s take this concept of “mixing” and imagine a tool: an easy way to generate logos.
Imagine just giving a description of your business, pointing to other logos you like, and describing the specific influences you want your logo to have, then sitting back and watching AI endlessly generate logo possibilities for you.
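No such tool exists yet, but here's a rough sketch of the idea, again assuming OpenAI's images API as the backend (`build_prompt`, `business`, and `influences` are made-up names for illustration, not a real product):

```python
# A hypothetical sketch of the logo-generator idea above. The prompt
# "mixes" a plain business description with named creative influences.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(business: str, influences: list[str]) -> str:
    """Combine a business description with the influences you want mixed in."""
    return (
        f"A clean, minimalist logo for {business}, "
        f"influenced by {', '.join(influences)}"
    )

prompt = build_prompt(
    business="a small-batch coffee roastery",
    influences=["Bauhaus posters", "Japanese woodblock prints"],
)

# Generate a batch of candidates and sit back.
result = client.images.generate(model="dall-e-2", prompt=prompt, n=4, size="512x512")
for candidate in result.data:
    print(candidate.url)
```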
Imagine being able to give AI your exact story and text and have it generate custom, full-scale fonts for you based on the themes of the story you’re trying to tell, as well as any related artworks.
At a smaller level, you could even sprinkle in little bits of your favourite art or media and add rich textures heavily influenced by other creative works: a dash of Pulp Fiction or the brushwork of Starry Night in just certain areas of your design, or in specific verses of your new poem.
Finally, imagine listening to an album, like my favourite from last year, Kid Cudi’s Man on the Moon 3, feeling moved and inspired by it, and then giving it to an AI model and asking it to generate a new pair of shoes based on the emotions, creative direction, and musical style of that album.
Wow.
The Key Idea
When it comes to GPT-3, DALL-E, and our multimodal future, mixing and texturing will be the name of the game. Learn to mix ideas that have never met before and develop your own personal library and index of your favourite creative works. Use it to your advantage. Add texture, depth, and richness to your own creations, and what you’ll end up with will be something entirely new that the world has never seen before.
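As one hedged sketch of what that "personal library and index" could look like in practice, you could reuse CLIP from the earlier example to embed your favourite works and then search them by the mood or texture you want to mix in (the file names below are placeholders, and the ranking is plain cosine similarity):

```python
# A sketch of a personal creative index: embed your favourite works with
# CLIP, then rank them against a text description of the mood you want.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Your personal library (placeholder file names).
works = ["starry_night.jpg", "pulp_fiction_poster.jpg", "motm3_cover.jpg"]
images = [Image.open(path) for path in works]

with torch.no_grad():
    image_features = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    )
    # Describe the texture or mood you want to mix into your next piece.
    text_features = model.get_text_features(
        **processor(text=["swirling, dreamlike night-sky brushwork"],
                    return_tensors="pt", padding=True)
    )

# Normalise, then rank the library by cosine similarity to the query.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(1)

for path, score in sorted(zip(works, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```

The top-scoring works are the ones whose "texture" best matches what you described, ready to be mixed into your next prompt.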