Re-creating Midjourney with only $10 – Technical Report for Mann-E 5 development

2022 was an amazing year for the generative AI market, and no one can deny that the release of models such as Midjourney, Stable Diffusion and ChatGPT made this market bigger, better and more competitive. You may also know Mann-E, the model I developed on top of Runway ML’s Stable Diffusion 1.5 using Dream Booth. In this article, I report on the development procedure of Mann-E 5, which will be accessible on April 14th, 2023 on the Mann-E platform.

Introduction

The Intention

Mann-E started as a personal exploration of AI art and text-to-image models. Later I found the business/commercial opportunities in it, and since I am also an open-source enthusiast, the main intention changed to providing an easy and accessible open-source alternative to Midjourney.

Midjourney is only accessible through Discord, it’s expensive (compared to most other image generation models) and Iranian users have a huge problem subscribing to the basic or standard plans. That is where the idea of a platform for art generation came from.

The method

For this particular version, I used the self-instruct method, which was also used for Stanford’s Alpaca dataset and model. The tools used for this project were as follows:

  • ChatGPT
  • Midjourney
  • Dream Booth

The Procedure

Using Midjourney

The idea of using Midjourney-generated images in the fine-tuning process came to my mind from PromptHero’s Openjourney project. They first used Dream Booth with data from Midjourney version 4.0, and then trained on more than 100K images on their own infrastructure.

So, Midjourney became a good source of data, because you probably won’t face any intellectual property or copyright issues in the process of using images created by their algorithm (the full explanation is available in my previous post).

ChatGPT as a prompt engineer

I’ve seen people create great prompts for Midjourney using ChatGPT. As large language models, ChatGPT, GPT-3 and GPT-4 can all be great choices for creating prompts. I chose ChatGPT since it has a free interface and a more affordable API.

P.S: There are also other models we could use to generate prompts, but they may need extra setup. They’ll be explained in future research.

Dream Booth

The most affordable way of creating your own text to image model is Dream Booth. It can be executed on a free Colab Notebook and there are also tons of tools available for doing the job.

For the development of Mann-E 5, I used the code from this repository. Some modifications were needed, but overall the code works very well.

Development of Mann-E

Getting needed tools for development

First, I needed an OpenAI account, which I had already opened before. I needed it to use ChatGPT for getting prompts.

Then, I got a basic Midjourney plan, which gives you ~4 hours of fast generation, so you can generate images without any difficulties. Of course, I did a few tests (around a hundred images) before getting to work.

The last thing was writing a piece of Python code to crop my images to 512×512 (Midjourney’s upscaled outputs are 1024×1024) and also save the respective prompts in a text file.

Then, I used fast Dream Booth code in order to fine-tune Stable Diffusion version 2.1 on the dataset I just gathered.

Using ChatGPT as a prompt engineer

Currently everyone is using ChatGPT to their advantage, since it was the most revolutionary LLM of 2022 and probably 2023. For the development of Mann-E 5, I used GPT-3.5 Turbo (which is the free one), and it generated pretty good prompts.

This is an example of the message I used to introduce the prompt formula:

Here is a Midjourney prompt formula:

raw photo, [the subject], [mood], [time of the day], [art style], [camera], [lens]

The “mood” can be realistic, cyberpunk, steampunk or futurist.

Do you understand?

As you can see, I simply showed it the way I write my prompts. Then it responded:

Yes, I understand the Midjourney prompt formula you have provided. It includes a raw photo as the starting point, followed by information about the subject, mood, time of day, art style, camera, and lens. The mood can be one of four options: realistic, cyberpunk, steampunk, or futurist.

The response shows it’s now ready to be asked for new prompts. Then I asked it for 5 to 10 prompts per idea. Since Midjourney returns four images per prompt, that meant 20 to 40 images per idea, so I was set for Midjourney image generation.
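The two messages of this exchange can also be assembled in code. The following is only a minimal sketch: the names `FORMULA`, `MOODS`, `build_setup_message` and `build_prompt_request` are hypothetical and made up for illustration, and the sketch only builds the message strings; sending them to ChatGPT is left out.

```python
# Hypothetical helpers: they only assemble the chat messages as strings.
FORMULA = "raw photo, [the subject], [mood], [time of the day], [art style], [camera], [lens]"
MOODS = ("realistic", "cyberpunk", "steampunk", "futurist")


def build_setup_message(formula: str = FORMULA, moods: tuple = MOODS) -> str:
    """First message: teach the model the prompt formula."""
    mood_text = ", ".join(moods[:-1]) + " or " + moods[-1]
    return (
        "Here is a Midjourney prompt formula:\n\n"
        f"{formula}\n\n"
        f'The "mood" can be {mood_text}.\n\n'
        "Do you understand?"
    )


def build_prompt_request(idea: str, count: int = 5) -> str:
    """Follow-up message: ask for `count` prompts about one idea."""
    if not 5 <= count <= 10:
        raise ValueError("this workflow asks for 5 to 10 prompts per idea")
    return f'Give me {count} prompts for "{idea}"'


if __name__ == "__main__":
    print(build_setup_message())
    print(build_prompt_request("ruins of a roman temple"))
```

Keeping the formula in one constant makes it easy to try a different prompt structure without touching the rest of the messages.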

Here is how I asked it about prompts:

Give me five prompts for “ruins of a roman temple”

And here is the set of prompts it gave me (trimmed for this article):

A striking black and white image of the ruins of a Roman temple, with dramatic shadows and highlights emphasizing the structure’s grandeur and decay, shot at night with a modern digital camera and a wide-angle lens.

If you spend time on Midjourney prompting, you will notice it’s a pretty good prompt, even if it doesn’t follow the formula very well.

Generating images using midjourney

This was the easy part. The whole process was feeding the ChatGPT-generated prompts to Midjourney, then upscaling and downloading the images.

The result was 464 images with different prompts which included different moods, styles and genres.

Pre-processing the dataset

Since the Stable Diffusion 2.1 checkpoints expect 512×512 or 768×768 images as training data, I had to write a simple Python script to do the resizing using OpenCV.

There was also an Excel file listing the image file names and the prompt used for each image. I had to add a function that writes each prompt to a text file with the same name as the corresponding image file.
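The pre-processing step can be sketched roughly as follows. This is not the script used for Mann-E 5: it assumes the Excel sheet was exported to a CSV with hypothetical `file_name` and `prompt` columns, and it only computes the crop box, leaving the actual pixel work to an image library such as OpenCV.

```python
import csv
from pathlib import Path


def center_crop_box(width: int, height: int) -> tuple:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def write_prompt_files(csv_path: Path, out_dir: Path) -> int:
    """Write one .txt file per image, named after the image file.

    Assumes a CSV with the hypothetical columns "file_name" and "prompt".
    Returns the number of text files written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            stem = Path(row["file_name"]).stem
            target = out_dir / f"{stem}.txt"
            target.write_text(row["prompt"].strip(), encoding="utf-8")
            written += 1
    return written
```

The crop box would be applied before resizing the square region to 512×512, so non-square images are not stretched.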

Training Stable Diffusion using Dream Booth

Unlike Mann-E 4, Mann-E 5 is based on Stable Diffusion version 2.1 (512px version). The training was done in two different steps.

In the first stage, there were 5440 steps of Dream Booth training (calculated with the formula (number of images * 10) + 800) and 928 steps on the text encoder so it would understand the trigger words.

In the second stage, the checkpoints and weights resulting from the first stage were tuned for 10880 steps (twice the first stage) and 928 text-encoder steps to get the generated images closer to the dataset.
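The step counts can be checked with a few lines of arithmetic. In this sketch, `dreambooth_steps` is a hypothetical helper name, and the "two steps per image" relation for the text encoder is only inferred from 928 / 464; the article does not state it as a rule.

```python
def dreambooth_steps(num_images: int) -> int:
    """Training steps from the formula (number of images * 10) + 800."""
    return num_images * 10 + 800


NUM_IMAGES = 464  # size of the dataset gathered from Midjourney

first_stage_steps = dreambooth_steps(NUM_IMAGES)   # 464 * 10 + 800
second_stage_steps = 2 * first_stage_steps         # twice the first stage
# Inferred, not stated in the article: 928 happens to equal 2 * 464.
text_encoder_steps = 2 * NUM_IMAGES

if __name__ == "__main__":
    print(first_stage_steps, second_stage_steps, text_encoder_steps)
```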

It took a total of 4 hours of training on a shared T4 GPU on Google Colab. Of course, upgrading the Colab plan to Pro or Pro+ can be beneficial in order to get better GPUs and shorter training times.

The Results



Further Study and Research

The new model still has problems with photo-realistic images, but it does a great job on illustration and concept art. So for now, it can be considered an artistic model. In the future, the photo-realistic side must also be fixed.

The next thing is trying to tune the base model (whether Stable Diffusion version 2.1 or Mann-E checkpoints) on a larger dataset with more diverse images in order to get it closer to Midjourney.

Conclusion

Using pre-trained, readily available AI models such as ChatGPT not only improves people’s lives, but also helps AI engineers and developers get more worry-free data for their projects and products.

Also, using Midjourney as a tool for creating royalty-free images is a wise choice, especially when you are trying to create a brand new text-to-image AI model.

In conclusion, I can say I’ve got much better results this time, because I utilized both ChatGPT and Midjourney for my needs. The checkpoints for Mann-E 5 will be available on Hugging Face on Friday, April 14th, 2023, at the same time as the public release of the Mann-E platform.

You don’t owe money to the brush company if you sell your art

In my previous post, I explained how the future of the content is AI. Also, in an older post, I was talking about how AI generated content can revolutionize the world of interior design/architecture. In this post however, I’m not talking about these topics and I’m going to talk about legal issues and questions about AI generated art, and there will be a twist at the end. Wait for it 😁

AI content creators are concerned about legal stuff

Yes, they are. If they are not, they are making a very very big mistake. When you create any form of content, one of the most important aspects of publishing it is the legal issues.

These legal matters are usually about the rights of content creators over their content and also the rights of the companies who develop the tools for content creation.

In this part of the article, I am talking about what I guess is the important legal topic in this new generation of content creation.

The Ownership

The very first time I posted about my own AI art generator model, Voyage, in a Telegram chat room, one of my friends asked, “Who owns the generated art? You? Or us?” I explained that since you have to run the generator on your own computer, you are the owner of the generated art and you don’t owe me anything.

By the way, most of them gave me huge credits when they posted their artwork on social media or even on the very same chat room.

But I found out most of those proprietary art generators like Midjourney don’t act like that. They make you pay them if you want to own what is your own. Let me make this a little bit clear.

Imagine you are going to buy a nice set of brushes and colors. You paid for it, right? Now you have made a beautiful piece of art with those tools and you want to sell it. Now imagine the brush company asks for a share! Isn’t it hilarious? Of course it is. I believe this must be considered by AI artists who use these proprietary tools to generate content.

Use by and for minors

Another important topic with any new generation of content creation tools is how minors will use them, and it concerns me a lot (especially since Stable Diffusion 2.0 has no NSFW filtering). So what should we do for our younger friends? A lot of content platforms like YouTube, Pinterest, Instagram, DeviantArt, etc. have their own policies and filters for public content distribution.

For example, I’m a big fan of horror movies, and when I search for content about them such as reviews, fan art and even scripts, I usually face age confirmation pages and modals. Now you can see where I am going with this topic.

AI is dumb. It cannot understand what it generates, so we need a little more human observation of the generated content. For example, in Stable Diffusion’s Discord, I remember that reacting to NSFW content with a certain emoji would mark it as potentially harmful, and then they could improve their NSFW filtering system.

Plagiarism

I guess you thought I don’t give a fine F about copyrights, right? No, that’s not true. I believe artists and content creators should be credited well. So let’s talk about another topic which seems very important.

The very first day I started AI content generation, there was only one good free tool (in every sense of the word free), and it was VQGAN+CLIP. It was a great tool for making art, and even today its art has a unique quality compared to other tools.

But even in those days, I had a huge concern: what if I plagiarize another artist’s work? This concern peaked when I figured out that adding the names of well-known artists such as Greg Rutkowski, James Gurney, Thomas Kinkade, Salvador Dali and thousands more can alter the result for us! So as both AI generator developers and artists, we should pay attention to this matter as well!

And last but not the least: Fake Art!

One of my favorite activities is trying new artist names in my prompts. I love to see how their minds would paint what I’m thinking of. But there is a problem: what if I claim the result is an unreleased painting by a well-known artist? This can lead to huge financial fraud.

I never could stop thinking about these matters, and as a person who developed a model and generated tons of content with AI, I never want to be classified as a fraud or scammer or even a person who disrupts the work of other artists.

I guess we talked enough about legal issues, let’s get to the big plot twist of this blog!

Big Twist!

The young blonde woman in the picture is beautiful. Isn’t she? I made it using my model Voyage which I introduced earlier in this blog post. You want to use Voyage and create your own art? Fine. You won’t owe me anything if you do. And if you want to use it in Google Colab, here is the link to the notebook!

Voyage is trained on data crawled from OpenArt and, as you can see, it is a model with a very artistic feel compared to the other available models.

Conclusion

In this blog post, we discussed one of the important aspects of AI content creation/generation: the legal side. We also have to fight for our rights of ownership as content creators. In my personal opinion, it is okay to ask for money for a service. As developers or companies, we pay a lot for infrastructure and computing power, but if we make our users pay us a share of their work, I guess it’s not fair.

On the other hand, we need more and more open source tools for AI content creation. Big tech companies are ruling this market as well, and that is never good.

I hope this article was useful, and if you’d like more content like this, please consider sharing it with your friends 🙂

The future of content is AI

I personally never counted myself as a content creator, but apparently I have always been counted as one. Why, you may ask? The answer is easy. I have a habit of filming my work, writing blog posts (mostly in Persian), posting my work and code on Twitter and so on. All of these are the behaviors of a content creator.

My content, on the other hand, was mostly about me. I never cared about making those types of advertisement reports (where you have to care a lot about SEO, back-links and so on) because creating content wasn’t my job. Now I am thinking about it, but in my own way.

The history of content creation

Before going deep about this, let’s clear something. This part of the article is from my own point of view and it’s not a certain history, but at least, this is how I saw content creation and how it works.

One-way content generation

Let’s go back a lot. I mean A LOT! Maybe in 2006, you opened a URL in your Internet Explorer and found a very ugly static website written in pure HTML. Some of those websites also had some annoying JS functions (we should be grateful for the modern use of JS; there are no figures following the mouse pointer or rain in the background anymore!).

This is an example of a one-way form of content: content you cannot react to as is. You had to find an email address on the Contact Us page or fill in their forms, and usually they never checked the respective inbox. So you couldn’t help them improve their content or right their wrongs.

Here comes the blog

I was almost 12 when I discovered the concept of blogs, and I started writing on a free blogging service (which is very popular among the Iranian community and you can find it here). It was amazing.

The whole greatness of blogging was that it wasn’t “one-way”: people could interact with each other using comments, and at the same time, chatrooms were also pretty popular. So we usually had a good time with our internet pals in those days. And you know what that means?

User generated content (UGC) matters!

It really does. Imagine you want to get a new hair dryer. So what do you do? I guess you go to Amazon and search for hair dryers. A hair dryer is not an object you buy once a week, so you need to know whether the hair dryer in question lasts long enough, how much power it takes, and whether it meets the health guidelines and regulations for a product like that.

You just read the description, specifications and other details provided by the seller on Amazon. It’s good, but not great. You have an idea about the product, but you don’t know what its user experience is like. What can we do about this? Easy, we scroll down to the user reviews, where people rated the product and described their feelings about it.

In the reviews section, you find out this product doesn’t last that long. You may even search other platforms for the very same product and find out what is wrong with it. For me, the second platform is always YouTube. People do a lot of good product reviews on YouTube (even those who are sponsored by the brand we’re looking at are usually helpful as well!) and guess what? YouTube is also a platform for UGC!

But wait, it doesn’t end here. You have read this far but you are still confused about the title. I have to say this is where the actual fun begins!

The future of content is AI!

Now this is the part you were waiting for. In this section, I’m going to talk about how AI can help us create better content, because recently I have been following the AI art trend a lot! I also coded and developed some AI art tools myself! I was also too cheap to get a paid Copilot membership, so I created my own version. See? I officially joined the army of content creators, but in my very own way.

Sentiment Analysis

I guess this one is not really about content creation but more about content moderation. But moderation is as important as creation (if not more) and I had to put it here. Having a sentiment analysis system on our user generated content can help us find out whether a product has poor quality, how toxic our community is, and so on.

To be honest it helps us more than it seems. It helps us make a better community (pretty much by banning suspicious users) and also give feedback to our suppliers who sent us products with poor quality. It doesn’t end here by the way, my example is still about a retail store and not a general website.

In the modern day, you have to watch your tongue more than before. A lot of people stood for their rights and the typical words of your daily speech can be offensive to other people. So in this particular case I believe these analytic tools can help us improve even in our personal lives by having a better community.

We talked enough about content moderation using AI, let’s go to the fun and interesting topic of content generation!

The rise of AI art generators

AI art is basically an empire now. AI art generators such as Dall-E 2 and Midjourney (you probably would like to take a look at my open source version of Midjourney, OpenJourney, just saying) are very popular and, on the other hand, Stable Diffusion (and its forks) are really growing on the open source side as well.

You cannot deny the fact that these are pretty cool tools of content creation. These tools can help us bring our ideas to life in forms of art, 3D design, interior design, UI/UX and a lot more. So we have to talk about these, we have to recognize these images as the new content people create and enjoy!

It does not end here either. There is also a new trend of text-to-music, which means a lot of music creators (me included!) may use AI to create music as well. This is the beauty of AI content creation.

And finally, everyone offers AI these days.

Yes, every company with even a small relation to content creation offers AI! We expect the big names of our industry such as Google or Meta to provide tons of AI tools such as libraries, frameworks, models, datasets and even programming languages. But do you know what amazed me recently?

Notion also provides AI solutions for productivity and ideas! You can basically have some sort of copilot for your content calendar or, even better (for some people, worse), an AI companion for task management, and I think this is great.

Now we have tools to create text, images, videos and sounds, what should be our next step? I guess we have to read minds (and I’ll write an article about that as soon as possible).

Conclusion

Now let’s conclude (I know, I have this section on every blog post and I don’t put anything useful here). We just looked at where the age of digital content creation started. The Internet had a great role in revolutionizing this age and opened new doors of opportunity for us, people who usually couldn’t easily get the chance to write in a magazine or newspaper. These days we write on Twitter (at least until we can’t write without paying Elon Musk for that!) and it needs no privilege. It only requires an internet connection.

So AI can help us improve our content, it can help us write better reviews, and it can help us turn a bunch of photographs into a full report. You just input your photos, the image-to-text pipeline starts and extracts the details of each photo, then you edit them and now you have your report.

In my opinion, AI is there to help us make the world a better place, because it gives us an equal chance of being an author, artist, musician or anything else which required some level of privilege in the past.

Severus does the magic

It has not been long since I told you that I was too cheap to pay $10 a month for GitHub Copilot and came up with the idea for Severus, my own AI pair programmer. It was something that went boom. My blog usually doesn’t have more than 20 or 30 viewers a day (at its best), and for almost a week, I had more than 200 views per day. Since people showed interest in yet another AI pair programmer, I have decided to continue working on Severus more seriously.

Severus code generation
Severus is now capable of being accessed as an API

My plans for Severus

So in this article, I discuss a bunch of problems I may face on the long path of creating Severus and making it available as end-user software. There are some serious concerns. For example, when I talked about the idea of Severus with one of my colleagues, he told me he is concerned about the confidential code he has written.

Almost all of your concerns are valid (except the one who thinks this whole process is handled by the Illuminati) and those are my concerns as well. The next problem I may face is for the scaling, so I perhaps need to hire a well-educated DevOps engineer.

In this section, I explain all of my serious concerns and needs, and I expect some help from you, the kind readers of the article.

The Community

Creating a community around something which is honestly a weekend project doesn’t seem like a good idea. You may say the same thing happened with the Linux kernel. You’re right, but this is a little bit different. There are tons of tools which may work much better than Severus.

Also, it is important to know the place for creating the community. A subreddit? A discord server? A room on Matrix? An internet forum? I have no idea honestly.

So this is the biggest concern for me. The community!

Performance and text-generation glitches

The performance is good, thanks to the Hugging Face Inference API. Actually, knowing that the Hugging Face API exists helped me with the implementation. But I still have some concerns here.

My main concern is that BLOOM sometimes starts generating text which is not, or cannot be classified as, code. I tried different ways to get better results, but I still need some way to verify that the generated result is code and not text which merely includes code. And this is really the hard part, I guess.

For this purpose, I may need some help. Validation must be done on the results in order to get a good AI pair programmer, otherwise it’ll become more like an annoying colleague or an intern who knows something, but can’t gather his/her mind.
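One possible validation step, for Python output only, is to check whether the generated text parses. This is a minimal sketch under that assumption and not what Severus actually does; `is_valid_python` and `longest_code_prefix` are hypothetical helper names. Note the limitation: a single bare word such as `Explanation` is a valid Python expression, so parsing alone cannot reject every piece of prose.

```python
import ast


def is_valid_python(text: str) -> bool:
    """Return True if the whole text parses as Python source."""
    try:
        ast.parse(text)
    except (SyntaxError, ValueError):
        return False
    return True


def longest_code_prefix(text: str) -> str:
    """Keep the longest run of leading lines that still parses as Python.

    Useful when the model writes code first and then drifts into prose.
    Returns an empty string if no non-empty prefix parses.
    """
    lines = text.splitlines()
    for end in range(len(lines), 0, -1):
        candidate = "\n".join(lines[:end])
        if candidate.strip() and is_valid_python(candidate):
            return candidate
    return ""


if __name__ == "__main__":
    generated = (
        "def add(a, b):\n"
        "    return a + b\n"
        "\n"
        "This function adds two numbers and returns the result.\n"
    )
    print(is_valid_python(generated))
    print(longest_code_prefix(generated))
```

A per-language parser or linter would be needed for the other languages the model can generate.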

The Product

The final concern/plan is the product. Currently, I only have a simple application which runs on port 5000 on my laptop. Nothing more. There is no authentication and no user validation system, no monitoring, no scaling, no infrastructure. Basically, it is a MacBook Pro which runs tons of programs daily, and Severus is currently one of them.

I had a VS Code extension in mind. I also thought of a web app as the MVP, where you can easily copy the generated code and then use it in your very own projects (and of course it won’t be the best choice for a confidential piece of code).

Although I have ideas in mind, I still need more brainstorming about how this project should be delivered to you as a product.

Conclusion

I still have a lot to do with this project. There might be some language detection to check whether the generated output is code or not, and there might also be some more code validation to avoid mixing different programming languages.

Overall, this is one of the most difficult and at the same time most fun projects I’ve ever done. I won’t give up on this, even if it seems like a painful and expensive hobby to people around me 🙂

 

I was too cheap to pay $10 a month for copilot, so I made my own

In mid 2021, there was a revolution in coding. As a lazy programmer who always needed a fast and smart assistant, I was really happy to have GitHub Copilot in my arsenal of coding tools. So I was one of the early adopters of the whole idea of an AI pair programmer.

Everything was fine with Copilot. I wrote tens of thousands of lines of code in the last year and I could build a lot of projects which would have been impossible without a good, smart and fast pair programmer. But everything changed last week, when I got an email from GitHub telling me I can’t have free access to Copilot anymore.

It was a sad moment in my life, but I had different ways of adapting and accepting the reality. First, I was thinking of paying $10 a month for a GitHub premium account, but since I won’t use most of GitHub’s premium options, it wasn’t a suitable solution for me. I also checked Tabnine and Kite, and those didn’t work out for me either.

My own copilot!

Say hello to Severus, my new AI pair programmer!

First, let me talk about the name a little bit. I was watching the Harry Potter franchise recently, and my favorite character in the whole franchise is none other than Severus Snape. So I named my AI pair programmer after him. But I know you might be curious about how I made it. So let’s find out!

The language model

First, I needed a language model capable of generating code. At first, I had OpenAI’s GPT-3 in mind, but I remembered that for some reasons, I can’t use it. Then I turned to free language models. I used GPT-J, and although it could understand code, it didn’t seem like a very accurate model to me.

Then, I realized that Meta had released the OPT-175B model, and I put some of its functionality to the test. It is a really good language model, and it works well as the core of a chatbot or a blog-post generator (or maybe a prompt engineering tool for text-to-image models), but it is not a great code generator.

Then, I found my saving angel, built by a lot of open-source engineers and enthusiasts from around the world: it’s none other than BigScience’s BLOOM.

Code tests and inference

Like what most of you may have done, first I tried to complete a love story with the model. It was cool. Then I tried to create a friendly, a helpful, an idiot and an evil chatbot with the model. All worked out perfectly. Back then, I did not have any limitations to Copilot, so I didn’t care about the code generation.

When I found myself in the misery of not having my beloved AI pair programmer, I tried some basic Python code generation with BLOOM. It was fine, so I tested PHP, Ruby and JavaScript as well. I found that it works pretty well, so I decided to write some simple inference code on top of the API.
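To give an idea of what such inference code can look like, here is a minimal sketch using only the standard library. It is not the actual Severus code: the endpoint URL pattern, the `max_new_tokens` and `return_full_text` parameters and the response shape are assumptions based on the hosted Hugging Face Inference API for text generation and may have changed, and `build_completion_request` and `complete` are hypothetical names.

```python
import json
import urllib.request

# Assumption: hosted Inference API endpoint pattern for a model repository.
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"


def build_completion_request(
    prompt: str, token: str, max_new_tokens: int = 64
) -> urllib.request.Request:
    """Build the POST request that asks the model to continue `prompt`."""
    payload = {
        "inputs": prompt,
        # Assumed text-generation parameters; check the current API docs.
        "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def complete(prompt: str, token: str) -> str:
    """Send the request and return the generated text (needs network access)."""
    request = build_completion_request(prompt, token)
    with urllib.request.urlopen(request, timeout=60) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Assumed response shape: a list with one {"generated_text": ...} item.
    return result[0]["generated_text"]
```

Calling `complete("def fibonacci(n):", token)` with a valid access token would return the model's continuation of that function header.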

Code generation may go wrong

Since I didn’t fine-tune the model (and I don’t have resources to) it may glitch sometimes. For example, when you don’t really pay attention to your code formatting, it might generate explanation of the code.

For me, what happened was that it started explaining the code in a tutorial format (and I bet all of the Python code came from the Towards Data Science website, since the writing style was pretty similar).

In general, I may need a solution for this, as well.

Will it be open source?

Yes. At least it’ll be partly open sourced in the near future. But more than being open source, it will be free (as in non-paid), and I guess that may be a pro for the tool. I haven’t paid a single penny for the model, so why should I make you pay for it? By the way, I will be open to donations and technical help from the community.

Future Plans

  • The API
  • VSCode extension
  • A community website (or discord server)

Conclusion

At the end, it seems we have a lot to do with these brand new language models. I found my way to create a free, reliable and smart AI pair programmer and of course I need some help in this way.

I have to warmly thank you for the time you’ve spent to read my article, and I openly accept your comments and ideas.

Revolutionizing Interior Design With Artificial Intelligence

Suppose you have an interior design/interior architecture project (or even a company) and you want to go a bit (or a lot) further in your industry. What will you think of first? If you ask me, I would personally tell you that the answer is augmented reality, and since I’m a co-founder of an AR company (link), that makes the most sense.

But let’s be a little bit more thirsty for pioneering the Interior design industry. We all know these days, AI Art is becoming somehow the new wave of art and you’re probably trying to get access to Dall-E 2 or Midjourney beta programs. They’re cool, but they are not enough. Let’s talk about the model I’ve been working on.

interior design with ai

The idea

The idea of developing an AI which can paint isn’t a new idea for me. I’m a big art enthusiast, and I play guitar and compose music. But I never spent time learning how to paint like the painters I like (e.g. Salvador Dali). So last winter, I decided to put all my computer knowledge into developing models (or to be more accurate, software) that can paint for me.

First, I went after VQGAN and developed on top of that. It was cool and artistic, but to my taste, it was too “machine-y”. You may think that’s the point, but for me, it wasn’t.

Later, I found more and better models and developed much better software. Today, I was very surprised by the results. I was working on something else, but I tried some prompts first, just for fun, and I got the image I posted above! I just asked for an abstract painting as the wallpaper of a living room, and it created this realistic-looking living room for me!

More interesting designs

Well, first I wanted to work on a modern minimal living room so I prompted it, and these two images are my results:

It’s great, isn’t it? It was completely what I’ve been looking for. I never could get better results for interior design.

Now, let’s talk about this! Why does this matter? Why will it revolutionize the industry? Why do you need it alongside AR/VR solutions? So let’s discuss!

The importance of AI generated Interior Design

Fine, this part isn’t as interesting as the other parts of this blog post. Probably because I’m going to talk about things which are not an average midjourney tester’s concerns.

That is okay, you can still go to Midjourney’s channel and see what it would look like if Shrek were the next US president or something like that. Here we’ll discuss the importance of AI in interior design.

  • Aesthetic: This is very important, at least in my opinion. It has two minds involved: the customer’s and the architect’s (or designer’s). For example, I myself am a big fan of Salvador Dali (I’m not joking, I really love Dali) and I look for an interior designer who has a taste for surrealism. But there’s no guarantee that we can agree on a design any time soon. So AI can help us find our desired design much faster and easier.
  • Reducing the cost: Sometimes you may pay different designers/architects to get your desired designs. It can get costly, especially if you want to get help from famous architects. So I guess you would prefer to have an overview of your desired design and give it to the architect.
  • Diversity in choice: These days we have diversity in pretty much everything we want. So why not our interior designs? You can get as many designs as you need then choose one from them. It’s a win-win game!
  • Getting unimaginable designs: Okay, now it’s time for midjourney fans to come back. Have you ever seen AI generated art? Most of them are other-worldly good! And they obviously can be used in different aspects of interior design.

More designs

The designs above are from the prompt “interior of a modern office, with pop art as wallpaper pattern, blue color scheme for the furniture” and, as you can see, they cover almost everything I asked for!

I know, there are some minor problems with the pop art in these pictures, but we should remember that a machine designed this. It means that with a little more training, it can get much better at generating images which can inspire interior designers.

Conclusion

Before going any further, I should say this is purely my personal opinion and I will probably make money by promoting my AI, so if you have any other opinions, they’re 100% welcome in the comment section.

In conclusion, I have to say it’s 2022. We have great AI engines which can generate text, images and even music, and currently most of them are becoming toys for curious teenagers, which is not entirely good.

We can use this potential in different segments of different industries and make our world a much better place. The world won’t become like The Matrix franchise (and ofc, I love that franchise), especially when we learn how to use machines to improve our work.

So the final conclusion is that we obviously need an intelligent solution for interior design, since it can reduce our costs, make our procedures faster and diversify our choices.

Finally, I invite you to read my Persian blog as well (if you speak or understand that language) because I write more frequently there.

Analyzing components of an electric circuit with YOLOv5

In recent weeks, I did a lot with YOLOv5. A few weeks prior to this article, I wrote an article on why I love YOLOv5 and later, I did a project with YOLOv5 which was an attempt at making something like Symbolab or similar software.

I explained that project in detail in my Persian blog (link) and I may write an English article on that project soon. But in this article, I am going to explain a newly finished project of mine!

Electric Circuit component analysis using YOLOv5

Introduction

After making the math equation OCR, I got a few ideas about doing similar projects in other areas of my interest. Believe it or not, I am not really the type of person who sticks to only one thing and I have tried many different things in my life. As my job is making computer software and platforms, I decided to use the knowledge I have in this field to improve my performance in other fields as well.

I have studied Computer Hardware Engineering in the university and I know a thing or two about electronics. I have never been an electrician or an electronics expert but I have made some cool gear using Arduino, Raspberry Pi and even basic electronic components. I also am a big fan of YouTuber electroboom and like what he does a lot!

So this is the reason I started this project. I decided to make a computer vision program which helps us understand the components in a schematic and in this article, I will explain how I did it.

Who’s the audience of this article?

Since I am not a type of content creator or writer who bombards the audience with complex math and physics (or computer science) concepts, I have to say everyone.

But for being more specific, I have to say that everyone who’s enthusiastic about artificial intelligence, computer vision and electronics and is able to read English is my audience. At least in this particular article. Also if you are a newbie who wants to find their own path in the vast universe of computer science, this article will give you an idea about computer vision projects combined with deep learning.

Nikola Tesla

Previously done works

Although I didn’t want this article to be a thesis/research paper, I had to put this section in the article. Honestly, I haven’t searched for what people may have done with YOLOv5 (or other tools) to analyze electric or electronic circuitry.

I’m sure there are other minds out there who had thoughts of this and I appreciate their thoughts and also their efforts.

The research procedure

The problem

We have tons of circuit schematics in books or notes which students or enthusiasts can’t understand very well. Unlike math or physics formulas, there is no application or tool that tells us which symbol represents which component, so we need a tool to understand our circuits better.

The possible ways of implementation

  1. Using OpenCV functions such as contour detection to work out which shape is which.
  2. Using a pre-trained model for electrical components.
  3. Developing a CNN or similar network to detect the components.
  4. Fine-tuning YOLOv5 to our need.

Each of these ways had its own problems. In the following lines, we’ll find out why most of them were inefficient for me.

Using OpenCV functions is the first go-to for most computer vision programmers, but it is really problematic, especially when you get symbols which look very similar to each other. This is an example of my input data:

Example of input data

and as you can see, I have a battery in series with a capacitor and even to human eyes, these two can be mistaken for each other! And remember, OpenCV doesn’t do magic; it is only a great tool for processing images.

The next way was to find a pre-trained deep learning model which had been trained on data of the components. It is a nice idea but it also has its own problems. For example, I had no idea which network was used, which libraries were used, etc. Also, there is no MediaPipe-like library for electric circuitry whose functionality you can rely on in your projects.

The third way was my second favorite by far: developing our own CNN or a similar network for object detection or localization. It is cool and it can be efficient, but the amount of work I would have to put into it was beyond my range of tolerance. Especially since I’m not doing these projects for graduation or money, I did not want to put too much effort into my project.

And last but not least, fine-tuning YOLOv5 for my needs was the best solution I could think of. YOLOv5 is one of the best tools for quickly implementing your computer vision plus deep learning ideas. It is also a very accurate and fast tool. So I went with this one.

Data gathering and preparation

YOLOv5 requires a set of labeled images. It means we need to have images of our topic of interest and nothing more.

Nicholas Renotte explains how to get data or images you need in this video. So if you want to do a similar project, I suggest giving that video a watch. But in my case, things were a little different.

I needed tons of schematics and on the other hand, I didn’t really want to spend a very long time labeling and preparing the data. So I decided to draw a couple of schematics on a piece of A4 paper like this:

Example of my data

and for preparation, I just took photos of these drawings using my phone (Xiaomi Redmi Note 8 Pro) and then moved them to my computer.

For slicing them into small chunks, I just used Adobe Photoshop (I know that might be surprising but I am too lazy to use any other tool) and then saved them into a folder structure acceptable for YOLOv5.

The next part (which I always call the worst part of an A.I/Data project) was cleaning up the data and then labeling it. I used labelImg to label my images since it can save annotations in the YOLO format.
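To make the YOLO labeling format a bit more concrete, here is a minimal sketch of how one pixel bounding box turns into one label line. The YOLO text format stores a class id followed by the box center, width and height, all normalized by the image size. The function name, the class id and the example box are hypothetical and only there for illustration; a labeling tool does this conversion for you.

```python
def to_yolo_label(class_id, box, image_size):
    """Convert a pixel box (xmin, ymin, xmax, ymax) into a YOLO label line."""
    xmin, ymin, xmax, ymax = box
    img_w, img_h = image_size
    # YOLO stores the box center and size, each divided by the image size.
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: class 0 (say, a resistor) in a 416x416 crop.
label_line = to_yolo_label(0, (104, 156, 312, 260), (416, 416))
print(label_line)
```

Each image gets a text file with one such line per component, which is why labeling is the slow part of the job.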

Training YOLOv5

After doing all the hard stuff, the time to train our beloved YOLOv5 arrived. Training YOLOv5 is fairly easy! You just have to follow the guide provided in their GitHub repository to train your own version of YOLOv5.

Since the process of training YOLOv5 is easy and well-documented, I won’t spend much time explaining it here. I will only point out what I did in order to get the best results.

I used an image size of 416×416 (if you’re not familiar with YOLOv5, you should know that their training script resizes the images) and a batch size of 32.
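As a rough illustration of what the resizing means for a phone photo, the sketch below scales an image so its longer side becomes 416 pixels while keeping the aspect ratio. This is only my simplified assumption of the idea (the real training script also pads and augments the images), and the function name and the photo size are hypothetical.

```python
def resized_shape(width, height, target=416):
    """Scale (width, height) so the longer side equals `target` pixels."""
    scale = target / max(width, height)
    # Round to whole pixels; the aspect ratio is preserved up to rounding.
    return round(width * scale), round(height * scale)


# Hypothetical 4000x3000 phone photo scaled down for training.
print(resized_shape(4000, 3000))
```

The practical point is that small details in a large photo shrink a lot, so slicing the photos into smaller chunks first helps the components stay readable.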

At the beginning I used their base weights (trained on the COCO dataset) called yolov5s, which stands for Small YOLOv5 and has 7.2 million parameters (according to this table). It wasn’t really good after almost 200 epochs. So I restarted my training process with yolov5m, which stands for Medium YOLOv5 and has 21.2 million parameters.

To be honest, I know the number of parameters isn’t the only thing that matters, but for the love of God, let’s keep things simple.

Finally, with 416×416 images, a batch size of 32, 500 epochs, the medium model and almost five hours of waiting (since I was doing this process on my MacBook Pro and not in Google Colab), I got my desired results.
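For reference, the settings above map onto the YOLOv5 training script roughly like this. The sketch only builds and prints the command line; the dataset file name circuits.yaml is a hypothetical placeholder, and the exact command I ran may have differed.

```python
import shlex


def build_train_command(data_yaml, weights="yolov5m.pt", img=416, batch=32, epochs=500):
    """Assemble a YOLOv5 train.py command from the settings in this article."""
    return [
        "python", "train.py",
        "--img", str(img),        # training image size
        "--batch", str(batch),    # batch size
        "--epochs", str(epochs),  # number of epochs
        "--data", data_yaml,      # dataset description (hypothetical file name)
        "--weights", weights,     # pretrained medium model as the starting point
    ]


command = build_train_command("circuits.yaml")
print(shlex.join(command))
```

Run from inside the cloned YOLOv5 repository, a command like this starts fine-tuning from the medium weights.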

The result

The final result

As you can see, I got pretty good confidence levels on my components. Unfortunately, the confidence levels for those inductors don’t fit in the picture, so for a better understanding of this resulting photo, I put this table here as well:

Confidence levels and coordinates

Future works

After finishing this project, I’ve got a few ideas in my head. The very first is to generate a netlist for SPICE software. Imagine you could draw a circuit on paper (most of us engineers usually use paper for our initial designs, right?), take a photo of it and boom! You have it in your SPICE software.
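To show what the netlist idea could look like, here is a small sketch that writes a SPICE-style netlist from a list of components. It assumes we somehow already know which nodes each component connects to, which the current detector does not provide, so the node numbers, names and values below are all hypothetical.

```python
def to_netlist(title, components):
    """Build a SPICE-style netlist from (name, node_a, node_b, value) tuples."""
    lines = [title]  # SPICE treats the first line as the title
    for name, node_a, node_b, value in components:
        lines.append(f"{name} {node_a} {node_b} {value}")
    lines.append(".end")
    return "\n".join(lines)


# Hypothetical result of analyzing a photo of a simple RC circuit.
components = [
    ("V1", "1", "0", "DC 9"),
    ("R1", "1", "2", "200k"),
    ("C1", "2", "0", "10u"),
]
netlist = to_netlist("RC circuit from a photo", components)
print(netlist)
```

The hard part that remains is tracing the wires between detected components, since that is what produces the node numbers.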

The second thing that comes to my mind is combining this with OCR software which can understand the numbers and units we’ve used in our electrical circuitry. For example, it would understand that 200K beside a resistor means the resistor has 200 kilo ohms of electrical resistance.
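Once an OCR step has read the text next to a component, turning it into a number is the easy part. The sketch below parses resistor values like 200K into ohms. The suffix table is my own assumption (K for kilo, M for mega, G for giga, R or nothing for plain ohms), and it treats lowercase m as mega because that is how resistor values are usually written by hand.

```python
import re

# Assumed multipliers for resistor suffixes.
MULTIPLIERS = {"": 1.0, "R": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def parse_resistance(text):
    """Parse strings like '200K' or '47' into ohms; return None if unreadable."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([RKMG]?)\s*", text.upper())
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix]


print(parse_resistance("200K"))
```

Returning None for unreadable text matters here, because OCR on handwriting will get things wrong and the next step should be able to notice that.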

Then, we can feed all this data to some calculator which can help us have a better understanding of our designs and give us information about the behavior of our circuit in different situations such as changes in current, voltage or frequency.

Conclusion

In conclusion, I believe every kind of OCR can be helpful in our lives. I remember when I was a child there was some sort of pen-like device which could read verses of Quran and I liked the whole idea.

Later, when I got older, I decided to find out how that magical pen works and whether we can improve it. Yes, the Quran is very important for Muslim people and there is no doubt about that, but that wasn’t enough in my opinion, since that device could also be used by visually impaired people. They could use that pen to understand the Quran and other types of texts as well.

And now, I have the knowledge to make the world a better place and to use technology to people’s advantage. After making a real-time sign language translation program with A.I, I decided to conquer other realms of computer vision as well.

Lastly I have to say there is a very vast world of the unknown we can easily uncover using our knowledge and I try my best to do that.

Regards.

Why I love YOLOv5?

I am a big fan of Nicholas Renotte’s channel on YouTube. I also love computer vision and its combination with deep learning. A few months ago, Nicholas posted this video, which is about YOLOv5. I usually am too lazy to watch videos which are longer than 15 minutes and I watch them in a few episodes. But this video made me sit behind the laptop screen for over an hour and I’m sure I won’t regret it.

So let’s start the article and see where this story begins. As I mentioned earlier, I love computer vision, especially when it’s combined with deep learning. I believe it can help us solve very complex problems in our projects with ease. My journey in the world of YOLO models started almost a year ago, when I wanted to develop a simple object detector for street signs.

At first, I found a lot of tutorials on darknet-based training but I did not manage to get it to work, and since I have a Mac, it could be a very real nightmare. So I guess YOLOv5 was a miracle. In this article, I am going to explain why I love YOLOv5 and why I prefer it to other YOLO versions.

What is YOLOv5?

According to their GitHub repository, YOLOv5 is a family of deep learning models pretrained on Microsoft’s COCO dataset. This makes it a very general-purpose object detection tool which is fine for basic research and fun projects.

But I also needed to have my own models because I wanted to develop some domain-specific object detection software. So I realized they also provide a python script which helps you fine-tune and train your own version of YOLOv5.

So I basically fell in love with this new thing I have discovered. In the next sections, I will explain why I love YOLOv5!

Why I love YOLOv5?

Firstly, I invite you to see this chart, which shows the comparison of YOLOv5 with other commonly used object detection models:

And since there’s been controversy about YOLOv5’s claims regarding training time, inference time, model storage size, etc., I highly recommend you read this article on Roboflow’s blog.

So we can conclude the very first thing which made me happy is the speed and that’s right. The second thing by the way is the fact I am lazy. Yes, I am lazy and I know it.

I kept trying to compile darknet so I could have a YOLOv4 model and build my projects on top of it. But when I saw how hard it can get, and since I have a Mac and didn’t really want to fire up an old computer for these projects, I started looking for something which does everything with a bunch of Python scripts.

Once I discovered YOLOv5, I started working with it, and the very first project I did was this pedestrian detection for a self-driving car.

Then, I started doing a lot of research and asking about what I can do with YOLOv5. I found out I can do pretty much anything I want with ease, as they provide a lot of stuff themselves. Isn’t that good enough? Fine. Let me show you another YouTube video of mine in which I solved my cropping problem with their built-in functions.

If you’re not convinced yet, I have to tell you there is a great method which is called pandas in this family of models.

As the name tells us, it outputs a pandas DataFrame whose data you can easily use. Let me give you a better example. Suppose we want to find out which plants are afflicted and which ones are not in drone footage.

By using this method, we can simply write an algorithm which counts the number of afflicted ones in a single frame, so we can easily find out how many afflicted plants we have in a certain area. The whole point here is that we get statistically sound data for most of our research.
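A minimal sketch of that counting step could look like the following. To keep it runnable without installing anything, it works on a plain list of dictionaries that mirrors the columns of the detection DataFrame (xmin, ymin, xmax, ymax, confidence, class, name). The class names, the sample rows and the confidence threshold are hypothetical.

```python
from collections import Counter


def count_by_name(detections, min_confidence=0.5):
    """Count detections per class name, ignoring low-confidence rows."""
    return Counter(
        row["name"] for row in detections if row["confidence"] >= min_confidence
    )


# Hypothetical detections for one frame of drone footage.
frame_detections = [
    {"xmin": 10, "ymin": 20, "xmax": 60, "ymax": 90, "confidence": 0.91, "class": 0, "name": "afflicted"},
    {"xmin": 80, "ymin": 25, "xmax": 130, "ymax": 95, "confidence": 0.88, "class": 1, "name": "healthy"},
    {"xmin": 150, "ymin": 30, "xmax": 200, "ymax": 99, "confidence": 0.42, "class": 0, "name": "afflicted"},
    {"xmin": 220, "ymin": 22, "xmax": 270, "ymax": 92, "confidence": 0.77, "class": 0, "name": "afflicted"},
]

counts = count_by_name(frame_detections)
print(counts["afflicted"], counts["healthy"])
```

With a real DataFrame, something like `df.to_dict("records")` gives rows in this shape, or the same filter and count can be done directly in pandas.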

The other example is my pedestrian detection system. We can command the car to first get data from the cameras to make sure we’re dealing with pedestrians, and then get data from the distance measurement system (which can be ultrasonic or LiDAR) to decide when it should send the braking command.
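The decision logic described above can be sketched in a few lines. The rows again mirror the detection DataFrame columns, and the distance is assumed to come from a separate sensor reading. The class name pedestrian and both thresholds are hypothetical values for illustration, not tuned numbers from my project.

```python
def should_brake(detections, distance_m, min_confidence=0.6, brake_distance_m=5.0):
    """Brake only if the camera sees a pedestrian AND the obstacle is close."""
    pedestrian_seen = any(
        row["name"] == "pedestrian" and row["confidence"] >= min_confidence
        for row in detections
    )
    return pedestrian_seen and distance_m <= brake_distance_m


# Hypothetical camera output for one frame.
detections = [{"name": "pedestrian", "confidence": 0.83}]

print(should_brake(detections, distance_m=3.2))   # close pedestrian
print(should_brake(detections, distance_m=20.0))  # pedestrian still far away
```

Keeping the vision check and the distance check separate makes it easy to swap the ultrasonic sensor for LiDAR later without touching the detection part.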

Conclusion

Let’s make a conclusion on the whole article. I love YOLOv5 because it made life easier for me, as a computer vision enthusiast. It provided the tools I wanted and honestly, I am really thankful to Ultralytics for this great opportunity they have provided for us.

In general I always prefer easy-to-use tools and YOLOv5 was this for me. I need to focus on the goal I have instead of making a whole object detection algorithm or model from scratch.

I finally can conclude that having a fast, easy-to-use and all-python tool for object detection was what I was always seeking and YOLOv5 was my answer.

I am glad to have you as a reader on my blog and I have to say thank you for the time you’ve spent on my blog reading this article. Stay safe!

A to Z of making an intelligent voice assistant

It was 2011, a sad year for a lot of Apple fans (me included) because Steve Jobs, one of the original co-founders of Apple Computer, died in October that year. It could have been even sadder if there had been no iPhone 4S and its features that year.

A few years prior to the introduction of Siri (which was introduced with the iPhone 4S), a movie called Iron Man came out from Marvel Studios. Unlike in the comic books, Jarvis wasn’t an old man in this movie. Jarvis was an A.I. I’m not sure if the movie inspired companies to add voice assistants to their systems or not, but I’m sure a lot of people bought those phones or tablets just to have their own version of Jarvis!

Long story short, a lot of engineers like me were under the influence of the MCU (Marvel Cinematic Universe) and Apple and wanted to have their voice assistant a little bit differently! Instead of buying an iPhone 4S, we preferred to start making our own voice assistants.

In this article, I’m discussing the basics you need to learn for making your very own version of Siri. I warn you here, there will be no code, at least in this one!

How does a voice assistant work?

In order to make something, we first need to learn how on earth that thing works! So, let’s discuss voice assistants and how they work. They’re much simpler than you think. I guarantee your mind will be blown by their simplicity!

  • Listening: a voice assistant, as the name suggests, needs to listen to voices and detect what is actual human speech. For this, we need speech recognition systems, which will be discussed further below. We can make one ourselves, or we can use one that’s already made.
  • Understanding: In the 2015 movie Avengers: Age of Ultron, Tony Stark (a.k.a Iron Man) says “Jarvis is only a natural language understanding matrix”. Leaving the matrix part aside, the rest of this sentence makes sense to me. Voice assistants need to understand what we tell them. They can have A.I or hard coded answers or a little bit of both.
  • Responding: after processing what we’ve said, the voice assistant needs to provide a response that fits our request. For example, you say “Hey Alexa, play music” and your Alexa device will ask you for the title, you say “Back in Black” and she’ll play the song from Spotify or YouTube Music.

Now we know about the functionality. What about the implementation? It’s a whole other story. The rest of the article is more about the technical side of making an intelligent chatbot…

Implementation of a Voice Assistant

Speech Recognition

Before we start to make our voice assistant, we have to make sure it can hear. So we need to implement a simple speech recognition system.

Although it’s not really hard to implement a speech recognition system, I personally prefer to go with something which is already made, like Python’s speech recognition library (link). This library sends the audio signal to IBM, Microsoft or Google APIs and gives us the transcription of our speech.

On the other hand, we can make our own system with a dataset which has tons of voices and their transcriptions. But as you may know, you need to make your data diverse af. Why? Let me explain it a little bit better.

When you have only your own voice, your dataset doesn’t have enough diversity. If you add your girlfriend, sister, brother, co-workers, etc., you still have no real diversity. The result may be decent, but it is limited to your own voice, or the voices of your family members and friends!

The second problem is that your very own speech recognition can’t understand that much, because your words and sentences might be limited to the movie dialogues or books you like. We need diversity everywhere in our dataset.

Is there any solution to this problem? Yes. You can use something like Mozilla’s dataset (link) for your desired language and make a speech recognition system. This data is provided by people around the world and it’s as diverse as possible.

Natural Language Understanding

As I told you, a voice assistant should process what we tell her. The best way of processing is artificial intelligence but we also can do a hard coded proof-of-concept as well.

What does that mean? Hard coding in programming means that when we want a certain input to have a fixed output, we don’t rely on our logic for that answer; we just write code like “if the input is this, give the user that”, regardless of the logic. In this case, the logic can be A.I, but we tell the machine: if the user says Hi, simply say Hi!

But in real-world applications we can’t go with only A.I. or only hard coded functions. A real voice assistant is usually a combination of both. How? When you ask your voice assistant for the price of bitcoin, it’s a hard coded function.

But when you just talk to your voice assistant, she may come up with answers which have a human feel, and that’s where A.I. comes in.

Responding

Although providing responses can be considered a part of the understanding process, I prefer to talk about the whole thing in a separate section.

A response is usually what the A.I. will tell us, and the question is: how does that A.I. know what we mean? This is an excellent question. Designing the intelligent part of a voice assistant, or of chatbots in general, is the trickiest part.

The main backbone of responses is your intention. What is your chatbot for? Is it a college professor’s assistant or is it just something that will give you a Stark feeling? Is it designed to flirt with lonely people or to help the elderly? There are tons of questions you have to answer before designing your own assistant.

After you’ve asked yourself those questions, you need to classify what people would say to your bot under different categories. These categories are called intents. Let me explain by example.

You go to a cafe, the waiter gives you the menu and you look at it, right? Your intention is now clear. You want some coffee. So, how do you ask for coffee? I would say “Sir, a cup of espresso please”. It’s that simple. In order to answer all coffee-related questions, we need to consider as many different cases as possible. What if the customer asks for a Macchiato? What if they ask for a Mocha? What if they ask for a cookie with their coffee? This is where A.I. can help.

A.I. is nothing other than making predictions using math. A long time ago, I used to write the whole A.I. logic myself. But later a YouTuber called NeuralNine developed a library called neural intents which is made for this purpose! How does this library work?

It’s simple. We give the library a bunch of questions and our desired answers. The model we train can classify questions and then predict which category our sayings belong to. Let me show you an example.

When you say “a cup of espresso please”, the A.I. sees the words cup and espresso. What happens then? She’ll know these words belong to the coffee category, so she’ll give you one of the fixed answers from that category.

Keeping answers fixed, by the way, is not always a good thing. For some purposes, we may need to make a generative chatbot which can also make responses like a human. Those bots are more complex and require more resources, study and time.

Final Thoughts

The world of programming is beautiful and vast. When it comes to A.I. it becomes more fun of course. In this article, I tried to explain how a voice assistant can be constructed but I didn’t actually dig deep into the implementation.

Why so? I guess implementation is good, but in most cases, like every other aspect of programming, it’s just putting together some tools. So learning the concept is much more important in most cases, like this one.

I hope the article was useful for you. If it was, please share it with your friends and leave a comment for me. I’d be super thankful.

How to make video games like movies!

It has been a long time since I wrote anything in this blog. Now, I have decided to write a post about “video games” (as I did in my Persian blog). I was a member of a game development team for about three months and I learned a lot. At least, I learned that the way they were doing the job was “how not to make a video game”. So, when I left the team, I decided to research the game development process. In this post, I explain everything I found (experience and research results!)

When I was in the team…

It was October 2017 when a person sent me a message on Telegram, and the message was like this:

Mr. Haghiri, we need a musical composer for our game, we did a search and we found you. Please come here Wednesday 4:00 PM to talk about the project and your role.

On Wednesday, I went to their office. That guy greeted me very nicely and started talking to me about their project. I found out the game was a horror game (horror games are popular in Iran, but there’s no “good” horror game “made in Iran”) and it made me happy, because it was the first time I had heard of an indie team deciding to make such a great game. They gave me two weeks to research “Sound and Music” in the Unity engine and I did it. I composed two pieces and I also tried to learn some tools for mixing and mastering sounds in the game (but that wasn’t actually my role, I just did it as a sample!).

After two and a half months, they said “Mr. Haghiri, you don’t do what we wanted you to do”. They hadn’t paid me even a “rial” in those days, yet they expected great music composition from me, and they also wanted me to do what wasn’t my actual role. This was not a good experience. But in those two months, I learned the Unity game engine and I also met other “game developers”. And I decided to publish my experience on my blog.

Why movies?

Recently, I read a book called Making Short Films: Complete Guide from Script to Screen, by Clifford Thurlow. In the book, I found great names like “Salvador Dali” or “Charlie Chaplin”, and great movies and books are also mentioned, like “One Flew Over the Cuckoo’s Nest”. Everything was perfect, and it made me think about the process of making a video game! It’s so similar to the process of making a movie!

But I realize games and movies have lots of differences. The biggest difference is that games are interactive and players interact with the environment or other characters, but movies are not. Anyway, the main procedure, I mean “writing”, is 90% the same! So, I decided to mention some movies I watched, then tell you how I would make a video game like them!

Great movies gave me ideas!

An Andalusian Dog

The movie was written by Luis Buñuel and Salvador Dali. I think these names are enough. But after I watched the movie (it’s now available on YouTube and other video-sharing platforms, you can easily find and watch it), I discovered the “product of a melancholic and depressed mind”. And both “Melancholia” and “Depression” are good subjects for a story or game.

Phantom of The Opera (Musical)

The book is about a musician, or better said, a “genius” and smart person who hid himself. Andrew Lloyd Webber turned that sad and creepy story into a romantic story with his music! The first time I watched the movie, I could not understand it well, because it doesn’t follow the book’s story and I’m not a native English speaker! I watched that movie 4 times. I watched the live performance 2 or 3 times and finally, I got the concept.

It has “Misanthropy” and “Romance” at once. I think these two things are also good for people who want to make video games (Please go and check “INSIDE” by Playdead! It’s Misanthropy! Pure Misanthropy!)

School of Rock

It’s a bit different. School of Rock is a comedy, and it’s also attractive for children, because it is about a guy who teaches a bunch of 4th grade children to ROCK! Yes, he teaches them how to play electric guitars, bass, drums and keys. I think the concept of “Music” can be a good idea, too! Especially if you plan a game like “Guitar Hero”.

Only Lovers Left Alive

If you like vampires, please watch this movie. This movie has no “teenager” content, but it still tells the story of two vampires who have been married for centuries. The movie is directed by Jim Jarmusch and he also made the music for it with his rock trio SQÜRL. This movie has two things for game developers. It’s an independent movie, so it can be a good idea bank for indie developers, and it also has “Romance”, “Fear” and “History”. All three of these factors can make a game great!

Ok, I talked a lot about movies! Let’s find “how can I make a game like a movie?”

Finally, let’s make a game like a movie!

The most basic thing you must have for a movie is a “plot”. The plot is the main idea; it is developed and tells people your concept, but it’s not a completed “script”. Writing a plot is the same for both games and movies. To write a good plot, you need to study and read books, scripts and other plots. I prefer printed (and Persian) books in this case.

You will need to research the topic you want to write a plot about. For example, if you want to write about a Middle Eastern civilization (for example: the Sassanid Empire or the Ottoman Empire), you have to read the history of Iran, Turkey, Afghanistan, Iraq, etc. If you want to write a plot about Satan, you have to read about Satan’s role in Judaism, Christianity and Islam, and find which belief is closer to what you want. So, you need to have a background.

After you have written the plot, you have to write the script. But writing the script is different here: you have to clarify where and when a scene is interactive and where and when it is not. The best example is “Bioshock Infinite”. You know why? Because when you’re not interacting with the environment, you can still move the camera and see what happens around you. I’m not a good script writer (but I try to be!) and I will write a post about how to write a script from a plot when I manage to do that.

After you have written the script, you have to build your basic ideas in the engine. Please! Please! Please! Call an experienced game developer before doing that, because the experienced one can help you find experienced character/concept and environment artists and a game-play designer. After that, I can say you’re ready to start making your game! With a good team, you can make a good game!

Finally, I wrote this article but it won’t be the last article about games in my English blog. I will try to continue this, because I couldn’t find good articles about being a “game script writer” or “game director”, even in English! I hope you like my post 🙂