Let’s build Metaverse with AI: We need to talk about 3D

In the previous post about building the metaverse with AI, we discussed different possibilities and the AI models we can access in order to make a virtual world. Although I personally am a big fan of 2D worlds, let’s be honest: a 2D world is basically a perfect choice for a low-budget indie game and nothing more.

In this post, I am going to talk about the different models and methods I found for making 3D objects from text or image inputs. It was a fun experiment, and I guess it’s worth sharing with the outside world in the form of a blog post.

My discoveries

The very first thing I want to discuss is my own discoveries in the field of 3D generation using AI. I always wondered what 3D objects actually are, and I got my answer.

The simplest way of discovering this was to make different 3D files using a 3D creation/editing tool such as Blender and investigate the outputs further. While working with different files, I discovered that OBJ files are just simple text-based descriptions of the vertices and faces forming a shape.
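To make this concrete, here is a minimal sketch in plain Python that writes a one-triangle OBJ file by hand; the “v” lines are vertex positions and the “f” line is a face referencing those vertices by their 1-based index.

```python
# A minimal sketch: write a single-triangle OBJ file by hand.
# "v" lines are vertex positions (x, y, z); "f" lines are faces that
# reference the vertices above by their 1-based index.
obj_text = """v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
f 1 2 3
"""

with open("triangle.obj", "w") as obj_file:
    obj_file.write(obj_text)
```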

Also, I recently found out about a research paper called LLaMA Mesh. To make it short: the authors found out that LLaMA models are capable of generating OBJ files, then they fine-tuned the model further on 3D and OBJ file data in order to make it produce more coherent results when asked about 3D OBJ files.

Well, in order to find the best metaverse base model, I did a bunch of tests on different models, and here I am explaining every single test I’ve done.

Models I’ve tested

ChatGPT

Yes. ChatGPT is always my first go-to for AI, especially when it’s about text. Since OBJ files are basically text files with information about the desired shape, I made a stop at ChatGPT’s website and tested its capabilities in making 3D objects.

I used the GPT-4o mini, GPT-4o and o1 models. They have an understanding of OBJ creation, but this understanding was very basic. The best shape I could get from OpenAI’s flagship models was just a simple cube, which you don’t need any design skill to make in most 3D design programs.
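For reference, my tests were essentially this; a rough sketch using the official openai Python package, where the prompt wording and the output filename are just placeholders of mine:

```python
# A rough sketch of the kind of test I ran: ask a chat model for an OBJ file
# and dump its reply to disk. Assumes the official "openai" package and an
# OPENAI_API_KEY in the environment; prompt and filename are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Generate a Wavefront OBJ file of a simple chair. "
                   "Reply with only the OBJ file contents.",
    }],
)

with open("chair.obj", "w") as obj_file:
    obj_file.write(response.choices[0].message.content)
```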

Claude

Anthropic’s Claude was no better than ChatGPT. I personally got much better code output from this model in the past, and I had it in mind that it would perform better when it comes to code generation.

But I was wrong. I still couldn’t get anything better than basic shapes from this one either: cubes, cylinders or pyramids. These shapes aren’t really complicated, and even without any 3D design knowledge you can make them, since Blender, 3ds Max, Maya, etc. all have them as built-in primitives.

LLaMA

I had read the paper and understood that this whole game of LLaMA Mesh started when the researchers found out LLaMA is capable of generating 3D OBJ files. It wasn’t surprising to me, since LLaMA models are from Meta, and Meta is the company that started the whole metaverse hype.

In this particular section, I’m just talking about LLaMA and not the fine-tune. I used the 8B, 70B, 1B, 3B and 405B models from the 3.1 and 3.2 versions. I can’t say they performed better at generating results, but they showed a better understanding, which was really hopeful for me.

At the end of the day, putting their generations to the test, I again got the same result. These models were great when it comes to basic shapes, but when it gets more complicated, the model seems to understand, yet the results are far from acceptable.

LLaMA Mesh

I found an implementation of LLaMA Mesh on Hugging Face, which can be accessed here. But unfortunately, I couldn’t get it to work. Even on their Space on HF, the model sometimes stops working without any errors.

It seems that, due to the high traffic this model can attract, they limited the number of requests and tokens you can get from the model, and this is the main cause of those strange errors.

The samples on their page seem very promising, and of course we will give this model the benefit of the doubt.

Image to 3D test

Well, as someone who’s interested in image generation using artificial intelligence, I like the image-to-3D approach more than text-to-3D. I also have another reason for this personal preference.

Remember the first blog post of this series, when I mentioned that I was a co-founder at ARmo? One of the most requested features from our customers was this: “we give you a photo of our product and you make it 3D”. Although we had the best 3D design experts on the job, it was still highly human-dependent and not scalable at all.

Okay, I am not part of that team anymore, but that doesn’t mean I don’t care about scalability concerns in the industry. Also, I may end up working in the same space again at some point.

Anyway, in this part of the blog post, I explain the different image generators I used in order to find out which models give the best results.

Disclaimer: I do not put example images here; I just explain the behavior of each model. The image samples will be uploaded in future posts.

Midjourney

When you’re talking about AI image generation, the very first name people usually mention is Midjourney. I personally use it a lot for different purposes, mostly comparing it with my own model.

In this case, with the right prompting and the right parameters, it made pretty good images that look like in-app screenshots of 3D renders, especially my favorite style, “lowpoly”. Although I still need more time and study to make it better.

Dall-E

Not really bad, but it has one big downside: you cannot disable prompt enhancement while using this model. This basically made me put Dall-E out of the picture.

Ideogram

It is amazing. The details and everything else are good, you can turn prompt enhancement off, and you can tune different parameters, but it still has problems understanding background colors. This was the only problem I faced with this model.

Stable Diffusion XL, 3 and 3.5

SD models perform really well, but you need to understand how to use them. Actually, when it comes to XL or 1.5, you must have a big library of LoRA adapters, text embeddings, ControlNets, etc.

I am not that interested in the 3 or 3.5 models, but even without any special additions, they perform well.

Something good about all Stable Diffusion models is that they are famous for being coherent, especially the fine-tunes. So something we may consider for this particular project might be a fine-tune of SD 1.5 or XL as well.

FLUX

FLUX has good results, especially when using the Ultra model. There are a few problems with this model (mostly licensing), and it also sometimes loses its coherency. I don’t know how to explain this; it’s like the times you press the brake pedal and it doesn’t stop your car, even though there’s nothing wrong with the brake system.

Although it has these problems, it seemed to be one of the best options for generating images of 3D renders. It still needs more study.

Mann-E

Well, as the founder and CEO of Mann-E, I can’t leave my own platform behind! But since our models are mostly SDXL-based, I guess the same conclusions apply here. Anyway, I performed the test on all three of our models.

I have to say it is not really any different from FLUX or SD, and the coherency is somewhat stable. What I have in mind is basically a way to fine-tune this model in order to generate better render images of 3D objects.

Converting images to 3D objects

I remember that almost two years ago, we used a technique called photogrammetry in order to make 3D objects from photos. It was a really hard procedure.

I remember we needed at least three cameras at three different angles, a turntable and some sort of constant lighting system. It needed its own room and its own equipment, and it wasn’t really affordable for a lot of companies.

It was one step forward in making our business scalable, but it was also really expensive. Imagine that making a 3D model of a single shoe takes hours of photography with expensive equipment. No, that’s not what I want.

Nowadays, I am using an artificial intelligence system called TripoSR, which can convert one single image to a 3D object. I tested it and I think it has potential. I guess we have one of the ingredients we need in order to make this magical potion of the metaverse.

Now we need to make a way for building the metaverse using AI.

What’s next?

It is important to figure out what comes next. In my opinion, the next step is to find a way to make image models perform better in terms of generating 3D renders. Also, designing a pipeline for image-to-3D is necessary.

Also, for now I am thinking of something like this: you enter a text prompt, it generates images, the images are fed to TripoSR, and then we have the 3D models we need.
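As a sketch of that pipeline’s last step, assuming the TripoSR repository is checked out locally (it ships a run.py script, and the flags below mirror the usage shown in its README), with “chair.png” standing in for an image produced by whichever text-to-image model you prefer:

```python
# A sketch of the image -> 3D step of the pipeline, not a finished tool.
# Assumes the TripoSR repository is the working directory; "chair.png" is a
# placeholder for an image generated earlier from the text prompt.
import subprocess

image_path = "chair.png"  # output of the text-to-image step

# TripoSR reconstructs a mesh from the single image and writes it to --output-dir.
subprocess.run(
    ["python", "run.py", image_path, "--output-dir", "output/"],
    check=True,
)
```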

I guess the next actual step will be exploring the potential of universe/world generation by AI!

Why I love YOLOv5?

I am a big fan of Nicholas Renotte’s channel on YouTube. I also love computer vision and its combination with deep learning. A few months ago, Nicholas posted this video, which is about YOLOv5. I’m usually too lazy to watch videos longer than 15 minutes, and I tend to watch them in a few sittings. But this video made me sit behind the laptop screen for over an hour, and I’m sure I won’t regret it.

So let’s start the article and see where this story begins. As I mentioned earlier, I love computer vision, especially when it’s combined with deep learning. I believe it can help us solve very complex problems in our projects with ease. My journey into the world of YOLO models started almost a year ago, when I wanted to develop a simple object detector for detecting street signs.

At first, I found a lot of tutorials on darknet-based training, but I did not manage to get it to work; especially since I have a Mac, it could become a very real nightmare. So I guess YOLOv5 was a miracle. In this article, I am going to explain why I love YOLOv5 and why I prefer it to other YOLO versions.

What is YOLOv5?

According to their GitHub repository, YOLOv5 is a family of deep learning models which is essentially trained on Microsoft’s COCO dataset. This makes it a very general-purpose object detection tool, which is fine for basic research and fun projects.

But I also needed my own models, because I wanted to develop some domain-specific object detection software. Then I realized they also provide a Python script which helps you fine-tune and train your own version of YOLOv5.
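For context, this is roughly what kicking off a fine-tuning run looks like; a minimal sketch assuming the YOLOv5 repository as the working directory, where “signs.yaml” is a hypothetical dataset config describing my street-sign images and class names:

```python
# A minimal sketch of launching YOLOv5 fine-tuning from Python.
# Assumes the YOLOv5 repository is the working directory; "signs.yaml" is a
# hypothetical dataset config (train/val paths plus class names).
import subprocess

subprocess.run(
    [
        "python", "train.py",       # training script shipped with the repo
        "--img", "640",             # input image size
        "--batch", "16",            # batch size
        "--epochs", "100",          # number of training epochs
        "--data", "signs.yaml",     # hypothetical custom dataset config
        "--weights", "yolov5s.pt",  # start from the pretrained small model
    ],
    check=True,
)
```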

So I basically fell in love with this new thing I have discovered. In the next sections, I will explain why I love YOLOv5!

Why I love YOLOv5?

Firstly, I invite you to see this chart, which shows the comparison of YOLOv5 with other commonly used object detection models:

And since there has been some controversy about YOLOv5’s claims regarding training time, inference time, model storage size, etc., I highly recommend you read this article on Roboflow’s blog.

So the very first thing that made me happy is the speed, and that’s a fact. The second thing, by the way, is that I am lazy. Yes, I am lazy and I know it.

I had always tried to compile darknet in order to get a YOLOv4 model and build my projects on top of YOLOv4, but I saw how hard it can get, and since I have a Mac and I didn’t really want to fire up an old computer for these projects, I was looking for something which does everything with a bunch of Python scripts.

Since I discovered YOLOv5, I started working with it, and the very first project I did was this pedestrian detection system for a self-driving car.

Then I started doing a lot of research and asking around about what I can do with YOLOv5. I found out I can do pretty much anything I want with ease, since they provide a lot of tooling themselves. Isn’t that good enough? Fine. Let me show you another YouTube video of mine, in which I solved my cropping problem with their built-in functions.

If you’re not convinced yet, I have to tell you there is a great method called pandas in this family of models.

As the name tells us, it outputs a pandas dataframe, and you can easily use the data from that dataframe. Let me give you a better example: suppose we want to find out which plants are afflicted and which ones are not in drone footage.

By using this method, we can simply write an algorithm which counts the number of afflicted plants in a single frame, so we can easily find out how many afflicted plants we have in a certain area. The whole point here is that we get statistically sound data for most of our research.
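Here is a rough sketch of what I mean, assuming a custom-trained model; the “best.pt” weights and the “afflicted” class name are hypothetical:

```python
# A rough sketch of counting detections per frame via the pandas() helper.
# "best.pt" is a hypothetical custom-trained checkpoint and "afflicted" is a
# hypothetical class name from that training run.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

results = model("drone_frame.jpg")         # run inference on one frame
detections = results.pandas().xyxy[0]      # detections as a pandas dataframe

afflicted = detections[detections["name"] == "afflicted"]
print(f"Afflicted plants in this frame: {len(afflicted)}")
```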

The other example is my pedestrian detection system. We can have the car first get data from the cameras to make sure we’re dealing with a pedestrian, and second get data from a distance measurement system (which can be an ultrasonic sensor or LiDAR) to decide when it should send the braking command.
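As a toy sketch of that idea (the read_distance_cm() helper and the 300 cm threshold are made up for illustration):

```python
# A toy sketch of combining the two signals: a detection from the camera plus
# a distance reading. read_distance_cm() stands in for a hypothetical
# ultrasonic/LiDAR driver, and the 300 cm threshold is arbitrary.
def should_brake(detections, read_distance_cm) -> bool:
    pedestrian_seen = (detections["name"] == "person").any()
    return pedestrian_seen and read_distance_cm() < 300

# Example with the dataframe from the previous sketch and a fake sensor:
# should_brake(detections, lambda: 250)  # -> True if a person was detected
```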

Conclusion

Let’s wrap up the whole article. I love YOLOv5 because it made life easier for me as a computer vision enthusiast. It provided the tools I wanted, and honestly, I am really thankful to Ultralytics for this great opportunity they have given us.

In general, I always prefer easy-to-use tools, and YOLOv5 was exactly that for me. I need to focus on the goal I have instead of building a whole object detection algorithm or model from scratch.

I can finally conclude that having a fast, easy-to-use, all-Python tool for object detection was what I was always seeking, and YOLOv5 was my answer.

I am glad to have you as a reader of my blog, and I have to say thank you for the time you’ve spent reading this article. Stay safe!

A to Z of making an intelligent voice assistant

It was 2011, a sad year for a lot of Apple fans (me included), because Steve Jobs, one of the original co-founders of Apple Computer, died in October of that year. It would have been even sadder if there had been no iPhone 4S and its features that year.

A few years prior to the first introduction of Siri (which was introduced with the iPhone 4S), a movie called Iron Man came out from Marvel Studios. Unlike in the comic books, Jarvis wasn’t an old man in this movie; Jarvis was an A.I. I’m not sure if the movie inspired companies to add voice assistants to their systems or not, but I’m sure a lot of people bought those phones or tablets just to have their own version of Jarvis!

Long story short, a lot of engineers like me were under the influence of the MCU (Marvel’s Cinematic Universe) and Apple, and wanted to have their own voice assistant, but a little bit differently! Instead of buying an iPhone 4S, we preferred to start making our own voice assistants.

In this article, I’m discussing the basics you need to learn to make your very own version of Siri. I warn you here: there will be barely any code in this one, beyond a few small sketches!

How does a voice assistant work?

In order to make something, we first need to learn how on earth that thing works! So, let’s discuss voice assistants and how they work. They’re much simpler than you think. It’s guaranteed your mind will be blown by their simplicity!

  • Listening: a voice assistant, as the name suggests, needs to listen to sounds and detect what is actually human speech. For this, we need speech recognition systems, which will be discussed further below. We can either make one ourselves, or use one that’s already made.
  • Understanding: in the 2015 movie Avengers: Age of Ultron, Tony Stark (a.k.a. Iron Man) says “Jarvis is only a natural language understanding matrix”. Ignoring the matrix part, the rest of this sentence makes sense to me. Voice assistants need to understand what we tell them. They can have A.I., hard-coded answers, or a little bit of both.
  • Responding: after processing what we’ve said, the voice assistant needs to provide a response that fits our request. For example, you say “Hey Alexa, play music” and your Alexa device will ask you for a title; you say “Back in Black” and she’ll play the song from Spotify or YouTube Music.

Now we know about the functionality. What about the implementation? That’s a whole other story. The rest of the article is more about the technical side of making an intelligent chatbot…

Implementation of a Voice Assistant

Speech Recognition

Before we start to make our voice assistant, we have to make sure it can hear. So we need to implement a simple speech recognition system.

Although it’s not really hard to implement a speech recognition system, I personally prefer to go with something which is already made, like Python’s speech_recognition library (link). This library sends the audio signal directly to IBM, Microsoft or Google APIs and gives us the transcription of our speech.
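A minimal sketch of using it, assuming a working microphone (PyAudio installed) and an internet connection for the Google recognizer:

```python
# A minimal sketch with the speech_recognition package: capture audio from the
# microphone and send it to Google's free web API for transcription.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    print("Say something...")
    audio = recognizer.listen(source)

try:
    print("You said:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
```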

On the other hand, we can make our own system with a dataset which has tons of voices and their transcriptions. But as you may know, you need to make your data really diverse. Why? Let me explain it a little bit better.

When you only have your own voice, your dataset doesn’t have decent diversity. If you add your girlfriend, sister, brother, co-workers, etc., you still have little diversity. The result may be decent, but it limits itself to your own voice, or the voices of your family members and friends!

The second problem is that your very own speech recognition system can’t understand that much, because your words and sentences might be limited to the movie dialogues or books you like. We need diversity everywhere in our dataset.

Is there any solution to this problem? Yes. You can use something like Mozilla’s dataset (link) for your desired language and build a speech recognition system. This data is provided by people around the world, and it’s as diverse as possible.

Natural Language Understanding

As I told you, a voice assistant should process what we tell her. The best way of processing is artificial intelligence, but we can also do a hard-coded proof of concept.

What does that mean? Hard coding in programming means that when we want a certain input to have a fixed output, we don’t rely on any logic for that answer; we just write code along the lines of “if the input is this, give the user that”, with no regard for logic. In this case, the logic could be A.I., but instead we simply tell the machine: if the user said “Hi”, you say “Hi” back!
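In code, the hard-coded version is nothing more than a lookup table; a tiny sketch:

```python
# A tiny sketch of the hard-coded approach: fixed inputs map to fixed outputs,
# with no "understanding" involved at all.
RESPONSES = {
    "hi": "Hi!",
    "how are you": "I'm fine, thanks for asking.",
}

def reply(user_input: str) -> str:
    return RESPONSES.get(user_input.strip().lower(), "I don't understand that yet.")

print(reply("Hi"))  # -> "Hi!"
```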

But in real-world applications, we can’t just go with A.I. or hard-coded functions alone. A real voice assistant is usually a combination of both. How? When you ask your voice assistant for the price of Bitcoin, that’s a hard-coded function.

But when you just talk to your voice assistant, she may come up with answers that have a human feel, and that’s when A.I. comes in.

Responding

Although providing responses can be considered a part of the understanding process, I prefer to talk about the whole thing in a separate section.

A response is usually what the A.I. will tell us, and the question is: how does that A.I. know what we mean? This is an excellent question. Designing the intelligent part of a voice assistant, or chatbots in general, is the trickiest part.

The main backbone of responses is your intention. What is your chatbot for? Is it a college professor’s assistant, or is it just something that will give you a Tony Stark feeling? Is it designed to flirt with lonely people, or is it designed to help the elderly? There are tons of questions you have to answer before designing your own assistant.

After you’ve asked yourself those questions, you need to classify what people would say to your bot into different categories. These categories are called intents. Let me explain with an example.

You go to a café, the waiter gives you the menu and you look at it, right? Your intention is now clear: you want some coffee. So, how do you ask for coffee? I would say “Sir, a cup of espresso please.” It’s that simple. In order to answer all coffee-related questions, we need to consider as many different cases as possible. What if the customer asks for a macchiato? What if they ask for a mocha? What if they ask for a cookie with their coffee? This is where A.I. can help.

A.I. is nothing other than making predictions using math. A long time ago, I used to write the whole A.I. logic myself. But later, a YouTuber called NeuralNine developed a library called neuralintents, and it’s made exactly for this purpose! How does this library work?

It’s simple. We give the library a bunch of questions and our desired answers. The model we train can classify questions and then simply predict which category what we say belongs to. Let me show you an example.

When you say “a cup of espresso please”, the A.I. sees the words cup and espresso. What happens then? She’ll know these words belong to the coffee category, so she’ll give you one of the fixed answers from that category.
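In practice, using that library looks roughly like this; a sketch assuming the GenericAssistant interface from the version of neuralintents I used, with a hypothetical intents.json holding the coffee category:

```python
# A rough sketch of the intents approach with the neuralintents library,
# assuming the GenericAssistant interface from the version I used.
# "intents.json" is a hypothetical file along these lines:
# {"intents": [{"tag": "coffee",
#               "patterns": ["a cup of espresso please", "can I get a mocha"],
#               "responses": ["Coming right up!", "One coffee on the way."]}]}
from neuralintents import GenericAssistant

assistant = GenericAssistant("intents.json")
assistant.train_model()   # fit the small intent classifier on the patterns
print(assistant.request("a cup of espresso please"))
```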

Keeping the answers fixed, by the way, is not always a good thing. For some use cases, we may need to make a generative chatbot which can also compose responses like a human. Those bots are more complex and require more resources, study and time.

Final Thoughts

The world of programming is beautiful and vast. When it comes to A.I., it becomes even more fun, of course. In this article, I tried to explain how a voice assistant can be constructed, but I didn’t really dig deep into the implementation.

Why so? I guess implementation is good, but in most cases, like every other aspect of programming, it’s just putting some tools together. So learning the concept is much more important in most cases, like this one.

I hope the article was useful for you. If it was, please share it with your friends and leave a comment for me. I’d be super thankful.