The year 2022 was an amazing year for the generative AI market, and no one can deny that the release of models such as Midjourney, Stable Diffusion and ChatGPT made this market bigger, better and more competitive. You may also know Mann-E, the model I developed on top of Runway ML's Stable Diffusion 1.5 using Dream Booth. In this article, I provide a report on the development procedure of Mann-E 5, which will be accessible on April 14th, 2023 on the Mann-E platform.
Introduction
The Intention
The main intention behind Mann-E in the first place was a personal exploration of AI art and text-to-image models, but I later found business and commercial opportunities in it. Since I am also an open-source enthusiast, the main intention changed to providing an easy and accessible open-source alternative to Midjourney.
Midjourney is only accessible through Discord, it is expensive compared to most other image generation models, and Iranian users have huge problems paying for the basic or standard plans. That is how the idea of a platform for art generation was born.
The method
For this particular version, I used the self-instruct method that was used for Stanford's Alpaca dataset and model. The tools used for this project were as follows:
- ChatGPT
- Midjourney
- Dream Booth
The Procedure
Using Midjourney
The idea of using Midjourney-generated images in the fine-tuning process came to me from PromptHero's Openjourney project. They used Dream Booth and data from Midjourney version 4.0 at first, then trained on more than 100K images on their own infrastructure.
So Midjourney became a good source of data, because you probably won't face any intellectual property or copyright issues when using images created by their algorithm (the full explanation is available in my previous post).
ChatGPT as a prompt engineer
I've seen people create great prompts for Midjourney using ChatGPT. As large language models, ChatGPT, GPT-3 and GPT-4 can all be great choices for creating prompts. I chose ChatGPT since it has a free interface and also more affordable APIs.
P.S.: There are also other models we could use to generate prompts, but they may need extra setup. They'll be explored in future research.
Dream Booth
The most affordable way of creating your own text-to-image model is Dream Booth. It can be executed on a free Colab notebook, and there are also tons of tools available for doing the job.
For the development of Mann-E 5, I used the code from this repository. Although some modifications were needed, the code as a whole worked perfectly.
Development of Mann-E
Getting needed tools for development
First, I had to open an OpenAI account, which I had already done before. I needed ChatGPT in order to generate prompts.
Then, I got a basic plan for Midjourney, which gives you roughly 4 hours of fast generation so you can generate images without any difficulties. Of course, I ran a few tests (around a hundred images) before getting to work.
The last thing was writing a piece of Python code to resize my images to 512×512 (Midjourney's upscaled outputs are 1024×1024) and also save the respective prompts in text files, as sketched below.
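For reference, here is a minimal sketch of what that resizing script might look like, assuming the upscaled Midjourney outputs sit in one folder; the folder names are hypothetical and the OpenCV-based downscale is just one way to do it:

```python
import os
import cv2  # opencv-python

SRC_DIR = "midjourney_upscales"  # hypothetical folder of 1024x1024 upscales
DST_DIR = "dataset"              # output folder for 512x512 training images
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    img = cv2.imread(os.path.join(SRC_DIR, name))
    # Midjourney upscales are square (1024x1024), so a plain downscale to 512x512 is enough
    small = cv2.resize(img, (512, 512), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(DST_DIR, name), small)
```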
Then, I used the fast Dream Booth code in order to fine-tune Stable Diffusion version 2.1 on the dataset I had just gathered.
Using ChatGPT as a prompt engineer
Currently everyone is using ChatGPT to their advantage, since it was the most revolutionary LLM of 2022 and probably 2023. For the development process of Mann-E 5, I used GPT-3.5 Turbo (the free one), and it generated pretty good prompts.
This is an example of the message I used to set the prompt formula:
Here is a Midjourney prompt formula:
raw photo, [the subject], [mood], [time of the day], [art style], [camera], [lens]
The “mood” can be realistic, cyberpunk, steampunk or futurist.
Do you understand?
As you can see, I just showed it the way I write my prompts. Then it responded:
Yes, I understand the Midjourney prompt formula you have provided. It includes a raw photo as the starting point, followed by information about the subject, mood, time of day, art style, camera, and lens. The mood can be one of four options: realistic, cyberpunk, steampunk, or futurist.
The response shows it was ready to be asked for new prompts. Then I asked it for 5 to 10 prompts per idea, which meant I could gather 20 to 40 prompts per idea, so I was set for Midjourney image generation.
Here is how I asked it about prompts:
Give me five prompts for “ruins of a roman temple”
A striking black and white image of the ruins of a Roman temple, with dramatic shadows and highlights emphasizing the structure’s grandeur and decay, shot at night with a modern digital camera and a wide-angle lens.
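I did all of this through the free chat interface, but the same conversation can also be scripted. Below is a minimal sketch using the GPT-3.5 Turbo API; it assumes the pre-1.0 openai Python package with its ChatCompletion call, and the assistant's confirmation message is shortened:

```python
import openai

openai.api_key = "sk-..."  # your OpenAI API key

FORMULA = (
    "Here is a Midjourney prompt formula:\n"
    "raw photo, [the subject], [mood], [time of the day], [art style], [camera], [lens]\n"
    'The "mood" can be realistic, cyberpunk, steampunk or futurist.\n'
    "Do you understand?"
)

def prompts_for(idea, n=5):
    # Replay the formula message, a short confirmation, then ask for n prompts
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": FORMULA},
            {"role": "assistant", "content": "Yes, I understand the formula."},
            {"role": "user", "content": f'Give me {n} prompts for "{idea}"'},
        ],
    )
    return response.choices[0].message.content

print(prompts_for("ruins of a roman temple"))
```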
Generating images using Midjourney
This was the easy part. The whole process consisted of feeding the ChatGPT-generated prompts to Midjourney, then upscaling and downloading the images.
The result was 464 images from different prompts, covering different moods, styles and genres.
Pre-processing the dataset
Since Stable Diffusion only accepts 512×512 or 768×768 images as input data, I had to write a simple Python script to do the resizing using OpenCV.
There was also an Excel file containing the image file names and the prompt used for each image. I had to add a function to turn each prompt into a text file with the same name as its image file, as in the sketch below.
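A sketch of that helper, assuming the spreadsheet has columns named file_name and prompt (the actual column names and file paths are illustrative); pandas is used to read the Excel file:

```python
import os
import pandas as pd

DATASET_DIR = "dataset"              # folder with the 512x512 images
df = pd.read_excel("prompts.xlsx")   # columns assumed: file_name, prompt

def write_prompt_files(frame, out_dir):
    # Write each prompt into a .txt file named after its image,
    # which is the caption format the Dream Booth code expects here
    for _, row in frame.iterrows():
        base, _ = os.path.splitext(row["file_name"])
        with open(os.path.join(out_dir, base + ".txt"), "w") as f:
            f.write(row["prompt"])

write_prompt_files(df, DATASET_DIR)
```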
Training Stable Diffusion using Dream Booth
Unlike Mann-E 4, Mann-E 5 is based on Stable Diffusion version 2.1 (512px version). The training was done in two different steps.
In the first step, there were 5440 steps of Dream Booth training (calculated by the (number of images * 10) + 800 formula) and 928 steps on the text encoder so it would learn the trigger words.
In the second step, the resulting checkpoints and weights of the first step were tuned for 10880 steps (twice the first step) and another 928 text-encoder steps to get the generated images closer to the dataset.
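To make the step arithmetic explicit, here is the calculation for the 464-image dataset; note that the images * 2 relation for the text-encoder steps is my own reading of the numbers, not a formula stated by the Dream Booth code:

```python
images = 464

# First pass: (number of images * 10) + 800 UNet steps, plus text-encoder steps
unet_steps_first = images * 10 + 800   # 5440
text_encoder_steps = images * 2        # 928 (assumed relation; matches the value used)

# Second pass: twice the first pass, run on the checkpoint from the first pass
unet_steps_second = unet_steps_first * 2  # 10880

print(unet_steps_first, text_encoder_steps, unet_steps_second)
```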
It took a total of 4 hours of training on a shared T4 GPU on Google Colab. Of course, upgrading the Colab plan to Pro or Pro+ can be beneficial in order to get better GPUs and shorter training times.
The Results
Further Study and Research
The new model still has problems with photo-realistic images, but it does a great job on illustration and concept art. So for now, it can be considered an artistic model. In the future, the photo-realistic side must also be fixed.
The next step is trying to tune the base model (whether Stable Diffusion version 2.1 or the Mann-E checkpoints) on a larger dataset with more diverse images in order to get it closer to Midjourney.
Conclusion
Using pre-trained, publicly available AI models such as ChatGPT not only improves people's lives, but also helps AI engineers and developers gather data for their projects and products with fewer concerns.
Also, using Midjourney as a tool for creating royalty-free images is a wise choice, especially when you are trying to create a brand new text-to-image AI model.
In conclusion, I can say I got much better results this time, because I utilized both ChatGPT and Midjourney for my needs. The checkpoints for Mann-E 5 will be available on HuggingFace on Friday, April 14th, 2023, at the same time as the public release of the Mann-E platform.