You only need Python to make AI agents.

In 2022, ChatGPT was released and LLMs became the hot topic of pretty much every technology-related publication, event, YouTube video, etc. It was like finding the secret ingredient of a potion that makes you immortal.

But Meta didn’t let OpenAI become the one and only. They joined the game by releasing their well-named model, Large Language Model Meta AI or LLaMA, which we all know and love. And not only Meta: our friends at Mistral AI weren’t idle either. They released a good bunch of open source models, and their work even motivated me to make my own Persian LLM, Maral.

But nowadays, finding a good LLM is not a big problem. With a quick search on the internet, we can easily find good LLMs: base models and fine-tunes made for generic or specific purposes, models armed with reasoning, models made for programmers, etc.

We have the text output; now we need action. This is what I’m going to discuss in this post, and I’d love to hear back from you as well.

AI Agents add action to LLMs

Well, I remember when the makeshift Android rip-off of the iPod Touch, also known as the Rabbit R1, was introduced, it was advertised as running on a Large Action Model or LAM. I kept wondering: how can we modify one of the open LLMs to take action? Then I got the answer.

The simplest thing we can think of is an LLM tuned to emit JSON for different APIs in different tones. That is what I believe function calling or tool calling is. But it still has a downside.
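To make that concrete, here’s roughly the kind of structured output a tool-calling model is trained to emit. The schema below is purely illustrative (it’s my own made-up example, not any specific vendor’s format):

# A hypothetical tool call a tuned model might emit for
# "book me a ride to the airport". The schema is illustrative only.
tool_call = {
    "tool": "uber.request_ride",
    "arguments": {
        "pickup": "current location",
        "destination": "the airport",
    },
}
# Application code parses this and calls the real API;
# the model itself never executes anything.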

Imagine I train LLaMA 3.2 on APIs from Airbnb, Shopify, Amazon, Uber and Spotify. What happens if you ask for a YouTube video? You won’t even get rick-rolled, and that isn’t a good sign for products such as the Rabbit R1 (or any of its competitors).

Then I got familiar with Crew AI, which is a framework for making agents. But honestly, I never understood these AI frameworks; most of them overcomplicate the process of making a simple application. Still, thanks to Crew AI, I finally understood what an AI agent is.

An AI agent adds actions, in a human-understandable way, to LLMs. When you ask ChatGPT to create a picture, it calls an API running DALL-E and then gives you the image. This is what an agent is! (At least as long as it’s not called Smith.)

Making an AI Agent without the frameworks is possible!

Well, it is possible. You only need Python and probably OpenAI’s library to make an agent. First of all, let’s see what an agent does. An agent simply gets a prompt from you, something like “Send an email to John Doe and explain why I will be late tomorrow.” The AI model has to work out a few steps here.

First, it has to call a function to search your contact list and find John Doe. Then it has to generate a text explaining why you will be late. The last part is to send the email over an email server (which can be a private mail server or a provider like Google’s Gmail).
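To make those steps concrete, here’s a rough sketch of the three actions as plain Python functions. The contact list, the mail server and all the names here are hypothetical stand-ins:

import smtplib
from email.message import EmailMessage

CONTACTS = {"John Doe": "john@example.com"}  # stand-in contact list

def find_contact(name):
    # Step 1: look the recipient up in the contact list
    return CONTACTS[name]

def draft_text(reason):
    # Step 2: in the real agent, the LLM writes this body for you
    return f"Hi, I will be late tomorrow because {reason}."

def send_email(to, body):
    # Step 3: deliver the message over an SMTP server you have access to
    msg = EmailMessage()
    msg["From"] = "me@example.com"
    msg["To"] = to
    msg["Subject"] = "Running late tomorrow"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as server:  # hypothetical mail server
        server.send_message(msg)

send_email(find_contact("John Doe"), draft_text("my car broke down"))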

Also, you can make it one step harder for your agent and ask it to do all of this through a GUI (for that, you basically need a vision model).

Now let’s wire an actual LLM into this in Python. It’s easy, and you’ll understand it better.

Python example

Disclaimer: since I have a full working code example on GitHub, this part of the blog will be just a simple example.

The first step is to find an LLM. I personally think any provider with an OpenAI-compatible API works perfectly, and for this particular project I’m using my own LLM, known as Jabir Project.

Jabir Project is a fine-tune of LLaMA 3.1 405B and has proven itself in many different tasks. If you don’t want to use Jabir LLMs, that’s fine; you may prefer OpenAI, DeepInfra or OpenRouter. You may also want to go local, so why not use Ollama?

Well, assuming you want to use Jabir’s API, you need to set up an OpenAI client like this:

from openai import OpenAI

client = OpenAI(api_key="FAKE", base_url="https://openai.jabirpoject.org/v1")

This is as easy as typing one line of code! You may be wondering why I used “FAKE” as the API key. When I tried to add Ollama’s API to my code, I learned that the OpenAI library requires some value for the API key, even if the server never checks it.

Then, we need to set up a simple agent class:

class Agent:

    def __init__(self, system=""):
        # Keep the conversation history, starting with an optional system prompt
        self.system = system
        self.messages = []
        if self.system:
            self.messages.append({"role": "system", "content": system})

    def __call__(self, message):
        # Append the user message, query the model, and remember its answer
        self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        # Send the whole history to the model; temperature 0 keeps it deterministic
        completion = client.chat.completions.create(
            model="jabir-400b",
            messages=self.messages,
            temperature=0.0,
        )
        return completion.choices[0].message.content

This agent class is what matters most, since it keeps a memory of everything that has happened in the conversation.

You can run the agent like this:

sample_agent = Agent("You are a helpful assistant")
print(sample_agent("What is 1+1?"))

Now the main question is: how can we add actions to this agent?

The Sample Agent with real action

While working on a way to make agents with no frameworks, I came up with the idea of making each action a Python function and then asking the AI to generate output which can later be parsed into inputs for those functions.
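Here’s a minimal sketch of that idea. The system prompt tells the model how to format an action, and a tiny regex parser dispatches it to the matching Python function (the action format and the function are mine, not from any framework):

import re

def get_weather(city):
    # Hypothetical action: swap in a real weather API call here
    return f"It is sunny in {city}."

ACTIONS = {"get_weather": get_weather}

SYSTEM = """You can use actions. To use one, reply with exactly:
Action: <name>: <input>
Available actions: get_weather(city). Otherwise, answer normally."""

agent = Agent(SYSTEM)  # the Agent class from above
reply = agent("What is the weather like in Berlin?")

match = re.match(r"Action: (\w+): (.*)", reply.strip())
if match:
    name, arg = match.groups()
    observation = ACTIONS[name](arg)
    # Feed the result back so the model can compose the final answer
    print(agent(f"Observation: {observation}"))
else:
    print(reply)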

I made it in the form of a Jupyter notebook, available through my GitHub account. You can write agents like this and stay completely framework-independent.

Conclusion

Almost three years ago I wrote a blog post here called I was too cheap to pay $10 a month for GitHub’s Copilot, so I made my own, and it was a good start to my journey into generative AI. Although I abandoned text generation for quite a long time and started Mann-E, I got back to the world of NLP with the Maral models.

Then Maral got abandoned because my personal life was getting a little rough, and I decided to start a personalization platform called Atelier AI, which lets you create your own LoRAs for Mann-E models.

But when I restarted the Jabir Project, I realized an LLM alone is not enough; this model should be the foundation of something bigger. This is why I did a lot of research on AI agents, and now I know exactly what I’m going to do.

I’d love to hear back from readers of my blog about ideas we could implement using LLMs and agents, so I politely ask all of you to participate in the discussion. Let’s build the future together.

Let’s build Metaverse with AI: Building an asset generator

Look at this:

How do you think this apple was made? Excellent question. In the previous post, I said we should put LLMs out of the picture for now, and that we needed to talk about 3D, because it matters for the whole metaverse space, right? Today I just did it: I trained a LoRA on FLUX and then tried to make 3D objects from what the model generated.

The Image Generator

In this part, I specifically talk about the image generation procedure. It’s a good chance to share the experience, and the open source models created in the process are linked in the post as well.

To make an image generator model, we need a base model. Since the whole Generative Metaverse project was a fun project for me and not a serious commercial one, I chose FLUX. However, if I go to the blockchain/crypto side of things (probably on the TON network), I may consider SDXL as the base in order to avoid problems with commercial use.

Anyway, everything here is pretty standard: pretty much the same steps I took to make the early versions of Mann-E. So I guess it’s worth sharing one more time, right?

The Dataset

AI models are just a bunch of boring mathematical functions; they become amazing when they are fed good data. So we needed to create a dataset. As always, the best data generator I could use was Midjourney, so of course I headed over to their website and recharged my account.

I played with a good bunch of prompt combinations to find the one that best fit what I had in mind. After a lot of tweaking, I got this: <subject>, lowpoly, 3d illustration, dark background, isometric camera angle.

Here is a sample of what this prompt formula generated:

After that, I used ChatGPT to generate a list of objects we may use or see every day. Then I built a prompt list, automated the image generation procedure, and got around 800 pictures. Now it was time for training!

The training

At first, I was thinking about using Replicate or fal.ai to train the LoRA. Honestly, they provide easy and affordable ways to train LoRAs on FLUX (and to my knowledge, you may also be able to train SD 1.5 and SDXL LoRAs on Replicate), but there is one big problem.

These websites are usually not suitable for large-scale training, and if they do offer large-scale training systems, you have to negotiate with them. As I said, this is a fun project, not a big OpenAI-scale commercial product!

So I was looking for another way. As you may know, Google Colab’s free tier is also no good for FLUX training. So I used the AI Toolkit template on RunPod to train the LoRA. I used an 80GB A100, and training on 100 pictures took around 3 hours.

The files

If you’re interested in the dataset, I uploaded the whole dataset and pictures here. You’ll see a folder called minimized images, which contains 100 hand-picked images from the original dataset.

And if you’re looking for the LoRA, you can download it and even test it here.

The 3D Generation

Well, after making the image generator, we needed a way of turning single images into 3D files, and of course the 3D format must be acceptable to all devices.

OBJ and FBX are great formats when it comes to game development (especially if you’re using the Unity game engine), but for WebGL and WebXR, the glTF or GLB formats are usually preferred.

The best option for this is fal.ai’s TripoSR API. You upload your image, the model is called, and BOOM: you have a GLB file which can be used in every WebGL or WebXR project you can think of.
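For reference, here’s roughly what that call looks like with fal’s Python client. I’m writing the app id and the argument/response field names from memory, so treat them as assumptions and double-check fal’s docs before using this:

# pip install fal-client; expects the FAL_KEY environment variable.
# The app id and response fields below are assumptions; check fal's docs.
import fal_client

image_url = fal_client.upload_file("apple.png")  # push the local render up

result = fal_client.subscribe(
    "fal-ai/triposr",  # assumed app id for TripoSR
    arguments={"image_url": image_url},
)

# The response should contain a URL pointing at the generated GLB mesh
print(result["model_mesh"]["url"])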

What’s next?

Since I’m personally working on another project with Mann-E’s proprietary models, I may stop this particular project right here. I’ve done almost everything I had in mind.

We still have the important topic of world generation using AI, but I guess it needs more in-depth study and will not be this easy at all. Commercializing the whole thing is also worth thinking about, but for now, I just want to keep the project fun.

Maybe in a few weeks I’ll return with a more commercial approach and some ideas about the whole blockchain/crypto space.

Let’s build Metaverse with AI: LLaMA Mesh is out of the picture

In the previous post I mentioned that I could not get LLaMA Mesh to work, right? Well, now I have, and in this post I am going to explain what happened and why LLaMA Mesh is not a good option at all.

First, I will explain the workflow of the model’s deployment, because I think it is important to know the flow. Then I will tell you what I asked it and why I am very disappointed in this model (although I thought it might be a promising one).

The Flow

In this part, I’m explaining the flows I tried in order to make LLaMA Mesh work. The first flow I chose was an absolute failure, but this morning I went through every place I could host a custom model, and I finally managed to deploy and test the model, and I pretty much got disappointed.

The failed flow

First, I paid a visit to my usual go-to website, RunPod, and tried to use their serverless system to deploy the model with the vLLM package. I explained this in the previous post.

It didn’t work, so I decided to go with a quantized version. That didn’t work either. I know that if I spent a few more hours on their website I’d eventually get the model running, but to be honest, it wasn’t really a priority for me at the moment.

The second failure

This wasn’t quite a failure though. After I couldn’t deploy the model the one way I knew, I headed over to OpenRouter. I guessed they might have the model, but I was wrong.

I didn’t surrender here either. I paid a visit to Replicate as well. While I was there, I noticed there are good models labeled as 3D, but none of them was LLaMA Mesh, my desired one.

The Successful One

Well, after a few unsuccessful tests, I thought of Google Colab. But I remembered that its free tier is not suitable for eight-billion-parameter models which are not quantized.

What other option was there, then? It all came down to an email I received this morning. I was struggling to wake up as usual when I saw my phone vibrating. I picked it up and saw an email from GLHF. They have quite a good bunch of models in their always-on mode, and they also let you run your own models (if hosted on Hugging Face), so I decided to go with them!

The Disappointment

Now it’s time to talk about how disappointed I was when I saw the results. The model is not really different from the other LLMs I covered in the previous post, and it has just one advantage: quantization in the output 3D objects.

The integer quantization, however, is only good for speeding up generation and making the output a little more “lowpoly”. Otherwise, the final results were good only if you asked for basic shapes such as cubes or pyramids.

Should we rely on LLMs for 3D mesh generation at all?

The short answer is no. The long answer is that we need to work more on the procedures, understand the formats better, and then experiment with different formats and ways of generating 3D meshes.

Mesh generation in general is only one problem. We also have problems such as polishing the output 3D object and applying materials to it, which can’t easily be done by a large language model.

What’s next?

Now I’m more confident about the idea I discussed before: take existing image models, fine-tune them on renders of 3D objects, and use an existing image-to-3D model to make the 3D objects we need.

But I have another problem: what happens when we generate items but have no place to put them? So I guess we need a world generator system, which we should start thinking about.

Let’s build Metaverse with AI: We need to talk about 3D

In the previous post about building the metaverse with AI, we discussed different possibilities and the AI models we can access to build the virtual world. Although I’m personally a big fan of 2D worlds, let’s be honest: a 2D world is basically a perfect choice for a low-budget indie game and nothing more.

In this post, I am going to talk about the different models and methods I found for making 3D objects from text or image inputs. It was a fun experiment, and I guess it’s worth sharing with the outside world in the form of a blog post.

My discoveries

The very first thing I want to discuss is my own discoveries in the field of 3D generation using AI. I always wondered: what exactly are 3D objects? And I got my answer.

The simplest way to find out was to make different 3D files using a 3D creation/editing tool such as Blender and investigate the outputs. While working with different files, I discovered that OBJ files are just simple text-based descriptions of the vertices and faces forming a shape.
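To show how simple that is, here’s a minimal sketch that writes a valid OBJ tetrahedron by hand: “v” lines are vertex coordinates and “f” lines are faces referencing those vertices by their 1-based index. That’s really all a basic OBJ file contains:

# Write a tetrahedron as an OBJ file by hand.
obj_text = """v 0 0 0
v 1 0 0
v 0 1 0
v 0 0 1
f 1 2 3
f 1 2 4
f 1 3 4
f 2 3 4
"""

with open("tetrahedron.obj", "w") as f:
    f.write(obj_text)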

Also, I recently found out about a research paper called LLaMA Mesh. To make it short: these people found out that LLaMA models are capable of generating OBJ files, then fine-tuned the model further on 3D and OBJ file data so it produces more coherent results when asked for 3D OBJ files.

Well, in order to find the best metaverse base model, I ran a bunch of tests on different models, and here I am explaining every single test I’ve done.

Models I’ve tested

ChatGPT

Yes, ChatGPT is always my first go-to for AI, especially when it’s about text. Since OBJ files are basically text files describing the desired shape, I stopped by ChatGPT’s website and tested its capabilities in making 3D objects.

I used the GPT-4o mini, GPT-4o and o1 models. They have some understanding of OBJ creation, but it is very basic. The best shape I could get from OpenAI’s flagship models was just a simple cube, which you don’t need any design skill to make in a 3D design program.

Claude

Anthropic’s Claude was no better than ChatGPT. I’ve personally gotten much better code output from this model in the past, so I had it in mind that it would perform better at code generation.

But I was wrong. I still couldn’t get anything better than basic shapes from this one either: cubes, cylinders or pyramids. These shapes aren’t really complicated, and even without any 3D design knowledge you can make them, since Blender, 3ds Max, Maya, etc. all have them as built-in primitives.

LLaMA

I read the paper and understood that this whole LLaMA Mesh game started when the researchers found out LLaMA is capable of generating 3D OBJ files. That wasn’t surprising to me, since LLaMA models are from Meta, and Meta is the company that started the whole metaverse hype.

In this particular section, I’m talking about plain LLaMA and not the fine-tune. I used the 8B, 70B, 1B, 3B and 405B models from the 3.1 and 3.2 versions. I can’t say they generated better results, but they showed a better understanding, which made me hopeful.

At the end of the day, putting their generations to the test, I got the same result again. These models are great when it comes to basic shapes; when things get more complicated, the model seems to understand the request, but the results are far from acceptable.

LLaMA Mesh

I found an implementation of LLaMA Mesh on Hugging Face, which can be accessed here. Unfortunately, I couldn’t get it to work. Even in their Space on HF, the model sometimes stops working without any errors.

It seems that due to the high traffic this model can attract, they limited the number of requests and tokens you can get from the model, and this is the main cause of those strange errors.

The samples on their pages seem very promising, so of course we will give this model the benefit of the doubt.

Image to 3D test

Well, as someone interested in image generation using artificial intelligence, I like the image-to-3D approach more than text-to-3D. I also have another reason for this personal preference.

Remember the first blog post of this series, where I mentioned that I was a cofounder at ARmo? One of the most requested features from our customers was “we give you a photo of our product and you make it 3D”. Although we had the best 3D design experts on the job, it was still highly human-dependent and not scalable at all.

Okay, I am not part of that team anymore, but that doesn’t mean I don’t care about scalability concerns in the industry. Also, I may end up working in the same space again at any time.

Anyway, in this part of the blog post I’ll explain the different image generators I used to find out which models give the best results.

Disclaimer: I am not putting example images here, just explaining the behavior of each model. The image samples will be uploaded in future posts.

Midjourney

When you’re talking about AI image generation, the very first name people usually mention is Midjourney. I personally use it a lot for different purposes, mostly for comparison with my own model.

In this case, with the right prompting and the right parameters, it made pretty good images of 3D renders, especially in my favorite style, “lowpoly”. I still need more time and study to make it better, though.

Dall-E

Not really bad, but it has one big downside: you cannot disable prompt enhancement while using this model. This basically made me put DALL-E out of the picture.

Ideogram

It is amazing. The details and everything else are good, you can turn prompt enhancement off, and you can tune different parameters, but it still has problems understanding background colors. That was the only problem I faced with this model.

Stable Diffusion XL, 3 and 3.5

SD models perform really well, but you need to understand how to use them. When it comes to XL or 1.5, you really want a big library of LoRA adapters, text embeddings, ControlNets, etc.

I am not that interested in the 3 or 3.5 models, but even without any special additions, they perform well.

Something good about all Stable Diffusion models is that they are famous for being coherent, especially the fine-tunes. So a fine-tune of SD 1.5 or XL might be something to consider for this particular project as well.

FLUX

FLUX gives good results, especially when using the Ultra model. There are a few problems with this model (mostly licensing), and it also sometimes loses its coherency. I don’t know how to explain this; it’s like when you press the brake pedal and it doesn’t stop your car, and yet there’s nothing wrong with the brake system.

Despite these problems, it seemed to be one of the best options for generating images of 3D renders. It still needs more study.

Mann-E

Well, as the founder and CEO of Mann-E, I can’t leave my own platform behind! Since our models are mostly SDXL-based, I guess the same conclusions apply here. Anyway, I ran the test on all three of our models.

I have to say it is not really any different from FLUX or SD, and the coherency is fairly stable. What I have in mind is basically a way to fine-tune this model to generate better render images of 3D objects.

Converting images to 3D objects

I remember that almost two years ago, we used a technique called photogrammetry to make 3D objects from photos. It was a really hard procedure.

I remember we needed at least three cameras at three different angles, a turntable, and some sort of constant lighting system. It needed its own room and its own equipment, and it wasn’t really affordable for a lot of companies.

It was one step forward in making our business scalable, but it was also really expensive. Imagine: making a 3D model of a single shoe takes hours of photography with expensive equipment. No, that’s not what I want.

Nowadays, I am using an artificial intelligence system called TripoSR, which can convert a single image into a 3D object. I tested it, and I think it has potential. I guess we have one of the ingredients needed to brew this magical potion of the metaverse.

Now we need to find a way to build the metaverse using AI.

What’s next?

It is important to figure out what comes next. In my opinion, the next step is to find a way to make image models perform better at generating 3D renders. Designing a pipeline for image-to-3D is also necessary.

For now, I am thinking of a simple pipeline: you enter a text prompt, it generates images, the images are fed to TripoSR, and then we have the 3D models we need.

I guess the next actual step will be exploring the potential of universe/world generation by AI!

FrontBricks, my LLM-based weekend project inspired by Vercel’s V0

Since 2022, there has been a hype around generative artificial intelligence, and it has resulted in a bunch of cool projects, although a lot of us remember that GitHub’s Copilot is much older. Back then, I wrote an article about how I was too cheap to pay $10 a month for Copilot, so I made my own!

That was, in a way, the beginning of my interest in the AI field. I’ve spent around four years in this field, and like most of us, I’ve tried different tools and products. In this article, I’m talking about FrontBricks, my newest product, and how it started as a weekend project!

A little bit of history

In 2023, I launched Mann-E, an AI image generator based on its own models (more information is available on the website). A few months ago, I also launched Maral, a 7-billion-parameter LLM specialized for the Persian language (the language I speak).

Around a month ago, I also did some tests with brand new LLMs such as LLaMA 3 in order to make Mann-E Search, which can be seen as an alternative to Perplexity with one small difference (it doesn’t provide a chat interface).

I guess this clarifies how deep I am in the AI space and how much I love generative AI! Now we can talk about FrontBricks!

What is FrontBricks?

You may be familiar with Vercel’s V0, a generative AI tool that helps people generate frontend components. I liked their idea, so I joined their waitlist, and a couple of days later I got access to the platform.

It was a cool experience, and some sparks formed in my head. I found that pretty much all LLMs are really good at code generation, and we can use one to generate the code and another to check whether the code is valid or not.

This was my whole idea, so I sat at my desk and started to code: a basic tool that sends my prompts to OpenAI’s API to generate the code, and another that does the validation using LLaMA 3 70B and GPT-4 (I used OpenAI again).

I also found another bottleneck: JSX code generation. I did a little research and found that it’s not really a big deal; using the power of regex and text manipulation, it’s easy to turn pure HTML into JSX!
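Here’s a rough sketch of that regex idea. Real-world HTML deserves a proper parser, but for LLM-generated markup a few substitutions get you surprisingly far (the patterns below are my own simplification, not the exact ones FrontBricks uses):

import re

def html_to_jsx(html):
    jsx = html
    # class and for are reserved words in JSX
    jsx = re.sub(r'\bclass=', 'className=', jsx)
    jsx = re.sub(r'\bfor=', 'htmlFor=', jsx)
    # void elements like <br> and <img ...> must be self-closing in JSX
    jsx = re.sub(r'<(br|hr)>', r'<\1 />', jsx)
    jsx = re.sub(r'<(img|input)([^>]*[^/])>', r'<\1\2 />', jsx)
    # HTML comments become JSX comment blocks
    jsx = re.sub(r'<!--(.*?)-->', r'{/*\1*/}', jsx, flags=re.S)
    return jsx

print(html_to_jsx('<div class="card"><img src="a.png"><br></div>'))
# <div className="card"><img src="a.png" /><br /></div>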

I wrote pretty much everything, so I switched to my work environment, created a simple Rails app, and connected it to my backend module. Now I have a platform which can be an alternative to Vercel’s V0!

Today I’m announcing FrontBricks, but I have to say that before this post, around 211 people gave me their email addresses to join the list of early adopters, and I gave them access to the platform earlier this week!

My birthday (May 30th) was this week, so I guess it can also be a bit of a surprise for my friends and the community.

How can I access FrontBricks?

Well, it’s easy. You just need to go to frontbricks.com and create an account (sign-up link). Then confirm your email, and boom: you have unlimited access to FrontBricks, completely free of charge!

You can generate a component, then improve it, and every time you feel you need a new component, you can easily create a new code snippet. It’s as easy as drinking a cup of tea.

Future Plans

Since this project isn’t monetized yet, the very first thing on my mind is a way to monetize it (you can still donate in crypto through this link). A good business model can make this project much better.

I’m also thinking of releasing an open source model based on the data gathered on FrontBricks, because one of the reasons I coded this project is that I couldn’t find a model specialized for frontend generation!

These are my concerns for now. If you have any other ideas, I’m open to hearing them.

Conclusion

I have a haystack of ideas in my mind, and when I find enough time, I implement them. Mann-E and FrontBricks are just two of the projects I’ve made, and to be honest, Mann-E, with around 6,000 users and more than 50,000 generated images, is one of my most successful projects.

FrontBricks has potential, but I guess I can’t keep it up alone. I’m open to technical and business ideas as well, so if you have anything in mind, feel free to send me a message; my email is haghiri75@gmail.com 😁

Nucleus is the proof that “Small is the new Big”

No matter what you’ve heard, size matters. Especially in the world of AI models, having a smaller and more affordable model is the key to winning the competition. This is why Microsoft invested time, GPUs and money in the Phi project, which is a Small Language Model, or SLM for short.

In this post, I present Nucleus, my newest language model project, which is based on Mistral (again) and has 1.13 billion parameters. And of course, this post will have a s*it ton of references to HBO’s Silicon Valley 😁

Background

If you know me, you know that I have a good background in messing around with generative AI models such as Stable Diffusion, GPT-2, LLaMA and Mistral. I even tried to do something with BLOOM (here) before, but since the 176B model is too expensive to put in the mix, I left it behind.

Later, I started my own AI image generation platform called Mann-E, and in recent weeks my team delivered Maral, a 7-billion-parameter language model specializing in the Persian language.

After observing the world of smaller but more specific language models (should we call them SMBMLMs now?) like Phi, and after the release of TinyLlama, I started a journey to find out how I could stay loyal to Mistral models but make them smaller.

You know, since the dawn of time, mankind has tried to make things smaller: smaller cars, smaller homes, smaller computers and now smaller AI models!

Basic Research

In my journey, I just wanted to know if anyone had ever bothered to make a smaller version of Mistral, or if we had to go through the whole coding procedure ourselves.

Lucky for us, I found Mistral 1B Untrained on HuggingFace and even asked the author a few questions about the model. As you can see, they’re not fully satisfied with the model, but I saw the potential, so I decided to keep it in my arsenal of small models for research.

Then I searched for datasets, and sparks started flying in my head about how to make the damn thing happen!

The name and branding (and probably Silicon Valley references)

The name Nucleus comes from HBO’s Silicon Valley, which is by far my favorite show of all time. If you remember correctly, Hooli’s CEO Gavin Belson needed something to piss Richard off, right? So he made Nucleus. But his Nucleus was bad. I tried to make mine at least a little better 😁

Since we know it’s time to pay the piper, let’s waste no more time and jump right into the technical details of the project.

Pre-Training

Since the model is claimed to be untrained, we can understand that it barely knows what the language even is, right? Even now, if you run inference on it on HuggingFace or locally, you may get a huge sequence of letters with no meaning at all.

So our first task was to pretrain it. Pretraining the model was quite easy: a 3090 and 40 hours of compute. It was done on the one and only TinyStories dataset.

This dataset is actually great for pre-training and giving base models an idea of the language and its linguistic structure, and it does this pretty well. But since it only has around 2 million rows, you have to expect heavy over-fitting, which can easily be fixed through fine-tuning.
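For the curious, here’s a minimal sketch of what that pre-training setup looks like with the Hugging Face stack. The model id is a placeholder (swap in the actual Mistral 1B Untrained repo), and the hyperparameters are illustrative, not the exact ones I used:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "someone/mistral-1b-untrained"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("roneneldan/TinyStories", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="nucleus-pretrain",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    # mlm=False means plain causal (next-token) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()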

Training on Tiny Textbooks

Well, the whole point of Phi-1 was that textbooks are all you need, and since Microsoft doesn’t like to share their dataset with us, we had to do a lot of research on the available options.

The very first option that came to my mind was using GPT-4 to generate textbooks, but the cost would be astronomical. Minding that we are not funded, spending a few thousand dollars on a dataset? No thanks.

During this research, we discovered the Tiny Textbooks dataset. Apparently Nam Pham did a great thing: they crawled the web and turned it into textbook-style content. So kudos to them for letting us use their awesome dataset.

Okay, fine-tuning called for another 40 hours of training, and it went fine. After fine-tuning for two epochs of 420k steps each, we got the best results we could.

Results

On TinyStories, the model really loved telling stories about Lily, which was no surprise to me at least. But on Tiny Textbooks, the model did a great job. Okay, this is the result when I asked for a pizza recipe:

And as you can see, this is with basically what HuggingFace offers out of the box. With a little tweaking of the settings, you can easily get good results out of this baby!

But sadly, it still sucks at two things (which are basically why you even click on an LLM-related link). The first is question-answering (or instruction following), which is not surprising. The second, which personally made me sad, is coding, since I am a developer and I like a well-made coding assistant.

But in general, I guess it can compete with other well-known models. It all depends on what we train the model on, right?

It still needs more effort and training though, so let’s head to the next section!

License

If you know me from the past, you know I love permissive licenses. So this model is published under the MIT license. You can use it commercially without any permission from us.

Further changes and studies

  • The model does well in English, but what about more languages? My essential mission is to try to make it work with the Persian language.
  • It is good at generating textbooks and apparently loves food recipes and history lessons. But it needs more; maybe code textbooks would be a good addition.
  • The model should be trained on pure code (StableCode style) and also in code-instruct style (I haven’t seen models like that, though maybe that’s because I’m too lazy to check all the models).
  • The model should be trained on a well-crafted instruction-following dataset. For me personally, that would be OpenOrca. What do you suggest?

Links

Donations are appreciated

If you open our GitHub repository, you will find a few crypto wallet addresses. We appreciate donations to the project, because we’re still not funded and we’re waiting for investors’ responses.

These donations help us keep the project up, make content about it, and spread the word about Free/Libre and Open Source Software, or FLOSS!

Conclusion

In a world where people get excited about pretty much every React app wrapped around OpenAI’s chat API and call it a new thing, and companies try to reinvent the iPod with the power of ChatGPT (or make a square-shaped iPod Touch), new models are the key to keeping our business up.

But you know, if models stay huge and you can’t run them locally, it will mean more and more proprietary stuff where you have no control over the data, and you may end up handing your company’s confidential data to a third party.

Open source small language models, or open SLMs, are the key to a better world. You can easily run this model on a 2080 (or an even less powerful GPU), and you know what that means: consumer hardware can have access to good AI stuff.

This is where we are headed in 2024: a new year of awesomeness with open models, regardless of their size.

Re-creating Midjourney with only $10 – Technical Report for Mann-E 5 development

The year 2022 was amazing for the generative AI market, and no one can deny that the release of cool models such as Midjourney, Stable Diffusion and ChatGPT made this market bigger, better and more competitive. You may also know Mann-E, the model I developed on top of Runway ML’s Stable Diffusion 1.5 using DreamBooth. In this article, I provide a report on the development procedure of Mann-E 5, which will be accessible on April 14th, 2023 on the Mann-E platform.

Introduction

The Intention

The main intention of Mann-E in the first place was personal discovery of AI art and text-to-image models, but later I saw the business/commercial opportunities, and since I’m also an open-source enthusiast, the main intention changed to providing an easy and accessible open-source alternative to Midjourney.

Since Midjourney is only accessible through Discord, is expensive (compared to most other image generation models), and is also very hard for Iranian users to pay for on the basic or standard plans, the idea of a platform for art generation was born.

The method

For this particular version, I used the self-instruct method that was used for Stanford’s Alpaca dataset and model. The tools used for this project were as follows:

  • ChatGPT
  • Midjourney
  • DreamBooth

The Procedure

Using Midjourney

The idea of using Midjourney-generated images in the fine-tuning process came to me from PromptHero’s Openjourney project. They used DreamBooth and data from Midjourney version 4.0 at first; later they trained on more than 100K images on their own infrastructure.

So Midjourney became a good source of data, because you probably won’t face any intellectual property or copyright issues in the process of using images created by their algorithm (the full explanation is available in my previous post).

ChatGPT as a prompt engineer

I’ve seen people create great prompts for Midjourney using ChatGPT. As large language models, ChatGPT, GPT-3 and GPT-4 can all be great choices for creating prompts. I chose ChatGPT since it had a free interface and also more affordable APIs.

P.S.: There are also other models we could use to generate prompts, but they may need extra setup. They’ll be covered in future research.

Dream Booth

The most affordable way of creating your own text-to-image model is DreamBooth. It can be executed in a free Colab notebook, and there are also tons of tools available for doing the job.

For the development of Mann-E 5, I used the code from this repository. Some modifications were needed, but overall the code is perfect.

Development of Mann-E

Getting needed tools for development

First, I had to open an OpenAI account, which I had already done before. I needed ChatGPT in order to get prompts.

Then I got a basic subscription for Midjourney, which gives you ~4 hours of fast generation so you can generate images without any difficulty. Of course, I did a few tests (around a hundred images) before getting to work.

The last thing was writing a piece of Python code to crop my images to 512×512 (Midjourney’s upscaled outputs are 1024×1024) and also save the respective prompts in text files.

Then I used the fast DreamBooth code to fine-tune Stable Diffusion version 2.1 on the dataset I had just gathered.

Using ChatGPT as a prompt engineer

Currently everyone is using ChatGPT to their advantage, since it was the most revolutionary LLM of 2022 and probably 2023. For the development of Mann-E 5, I used GPT-3.5 Turbo (the free one), and it generated pretty good prompts.

This is an example of the prompt-formula message:

Here is a Midjourney prompt formula:

raw photo, [the subject], [mood], [time of the day], [art style], [camera], [lens]

The “mood” can be realistic, cyberpunk, steampunk or futurist.

Do you understand?

As you can see, I just showed it the way I prompt. Then it responded:

Yes, I understand the Midjourney prompt formula you have provided. It includes a raw photo as the starting point, followed by information about the subject, mood, time of day, art style, camera, and lens. The mood can be one of four options: realistic, cyberpunk, steampunk, or futurist.

The response shows it’s now ready to be asked for new prompts. Then I asked it for 5 to 10 prompts per idea, which (at four images per Midjourney prompt) means 20 to 40 images per idea, so I was set for Midjourney image generation.

Here is how I asked it about prompts:

Give me five prompts for “ruins of a roman temple”

and here is how it gave me the prompts (trimmed for this article):

A striking black and white image of the ruins of a Roman temple, with dramatic shadows and highlights emphasizing the structure’s grandeur and decay, shot at night with a modern digital camera and a wide-angle lens.

If you spend time on Midjourney prompting, you’ll notice this is a pretty good prompt, even if it doesn’t follow the formula very well.

Generating images using Midjourney

This was the easy part. The whole process was feeding the ChatGPT-generated prompts to Midjourney, then upscaling and downloading the images.

The result was 464 images from different prompts covering different moods, styles and genres.

Pre-processing the dataset

Since Stable Diffusion only accepts 512×512 or 768×768 images as input data, I had to write some simple Python code to do the resizing using OpenCV.

There was also an Excel file containing the image file names and the prompts used for each image. I had to add a function to turn each prompt into a text file with the same name as its image file.
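Here’s a rough sketch of that pre-processing step. I’m assuming a prompts.xlsx with “filename” and “prompt” columns; the column names and folder layout here are hypothetical, not my original script:

import os
import cv2
import pandas as pd

df = pd.read_excel("prompts.xlsx")

for _, row in df.iterrows():
    img = cv2.imread(os.path.join("images", row["filename"]))
    # Downscale the 1024x1024 Midjourney upscales to SD's 512x512 input
    resized = cv2.resize(img, (512, 512), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join("dataset", row["filename"]), resized)
    # Save the caption with the same basename and a .txt extension
    caption_file = os.path.splitext(row["filename"])[0] + ".txt"
    with open(os.path.join("dataset", caption_file), "w") as f:
        f.write(row["prompt"])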

Training Stable Diffusion using Dream Booth

Unlike Mann-E 4, Mann-E 5 is based on Stable Diffusion version 2.1 (the 512px version). The training was done in two steps.

The first step was 5440 steps of DreamBooth training (calculated by the (number of images × 10) + 800 formula, so 464 × 10 + 800 = 5440) and 928 steps on the text encoder to understand the trigger words.

In the second step, the resulting checkpoint and weights from the first step were tuned for 10880 steps (twice the first) plus 928 text-encoder steps, to get the resulting images closer to the dataset.

It took a total of 4 hours of training on a shared T4 GPU on Google Colab. Of course, upgrading the Colab plan to Pro or Pro+ can be beneficial in order to get better GPUs and better training times.

The Results



Further Study and Research

The new model still has problems with photo-realistic images, but does a great job on illustration and concept art. So for now, it can be considered an artistic model. In the future, the photo-realistic side must be fixed as well.

The next thing is to tune the base model (whether Stable Diffusion version 2.1 or the Mann-E checkpoints) on a larger dataset with more diverse images, in order to get it closer to Midjourney.

Conclusion

Using pre-trained, readily available AI models such as ChatGPT not only elevates people’s lives, but also helps AI engineers and developers get more worry-free data for their projects and products.

Also, using Midjourney as a tool for creating royalty-free images is a wise choice, especially when you’re trying to create a brand new text-to-image AI model.

In conclusion, I can say I got much better results this time, because I utilized both ChatGPT and Midjourney for my needs. The checkpoints for Mann-E 5 will be available on HuggingFace on Friday, April 14th, 2023, at the same time as the public release of the Mann-E platform.

You don’t owe money to the brush company if you sell your art

In my previous post, I explained how the future of content is AI. In an older post, I also talked about how AI-generated content can revolutionize the world of interior design and architecture. In this post, however, I’m not covering those topics; I’m going to talk about legal issues and questions around AI-generated art, and there will be a twist at the end. Wait for it 😁

AI content creators are concerned about legal stuff

Yes, they are. And if they are not, they are making a very, very big mistake. When you create any form of content, the legal issues are one of the most important aspects of publishing it.

This legal stuff is usually about the rights of content creators over their content, and also the rights of the companies who develop the tools for content creation.

In this part of the article, I cover what I think are the important legal topics in this new generation of content creation.

The Ownership

The very first time I posted about my own AI art generator model, Voyage, in a Telegram chat room, one of my friends asked: “Who owns the generated art? You? Or us?” I explained that since you have to run the generator on your own computer, you are the owner of the generated art and you don’t owe me anything.

By the way, most of them gave me huge credit when they posted their artwork on social media or even in that very same chat room.

But I found out that most of the proprietary art generators, like Midjourney, don’t act like that. They make you pay them if you want to own what is already yours. Let me make this a little clearer.

Imagine you buy a nice set of brushes and paints. You paid for them, right? Now you make a beautiful piece of art with those tools and want to sell it. Then imagine the brush company asks for a share! Isn’t that hilarious? Of course it is. I believe AI artists who use these proprietary tools to generate content must consider this.

Use by and for minors

Another important topic with any new generation of content creation tools is always: how will minors use it? It concerns me a lot too (especially since Stable Diffusion 2.0 has no NSFW filtering). So what should we do for our younger friends? A lot of content platforms like YouTube, Pinterest, Instagram, DeviantArt, etc. have their own policies and filters for public content distribution.

For example, I’m a big fan of horror movies, and when I search for content about them, such as reviews, fan art and even scripts, I usually face age confirmation pages and modals. Now you can see where I’m going with this topic.

AI is dumb; it cannot understand what it generates, and we need a little more human oversight of the generated content. For example, in Stable Diffusion’s Discord, I remember that reacting to NSFW content with a certain emoji could mark it as potentially harmful, and then they could improve their NSFW filtering system.

Plagiarism

I guess you thought I don’t give a fine F about copyright, right? No, that’s not true. I believe artists and content creators should be credited well. So let’s talk about another topic which seems very important.

On the very first day I started AI content generation, there was only one good free tool (in any sense of the word “free”), and it was VQGAN+CLIP. It was a great tool for making art, and even today its art has a unique quality compared to other tools.

But even in those days, I had a huge concern: what if I plagiarize another artist’s work? This concern was at its peak when I figured out that adding the names of well-known artists such as Greg Rutkowski, James Gurney, Thomas Kinkade, Salvador Dali and thousands more can alter the result! So as AI generator developers and artists alike, we should pay attention to this matter as well.

And last but not least: Fake Art!

One of my favorite activities is trying new artist names in my prompts. I love to see how their minds would paint what I’m thinking of. But there is a problem: what if I claim the result is an unreleased painting by a well-known artist? This could lead to huge financial fraud.

I could never stop thinking about these matters, and as a person who developed a model and generated tons of content with AI, I never want to be classified as a fraud or a scammer, or even as a person who disrupts the work of other artists.

I guess we’ve talked enough about legal issues; let’s get to the big plot twist of this blog!

Big Twist!

The young blonde woman in the picture is beautiful, isn’t she? I made her using my model Voyage, which I introduced earlier in this blog post. You want to use Voyage and create your own art? Fine. You won’t owe me anything if you do. And if you want to use it in Google Colab, here is the link to the notebook!

Voyage is trained on data crawled from OpenArt, and as you can see, it is a model with a very artistic feel compared to other available models.

Conclusion

In this blog post, we discussed one of the important aspects of AI content creation/generation: the legal stuff. We also have to fight for our ownership rights as content creators. In my personal opinion, it is okay to ask for money for a service; as developers or companies, we pay a lot for infrastructure and computing power. But if we make our users pay us shares, I guess that’s not fair.

On the other hand, we need more and more open source tools for AI content creation. Big tech companies rule the market in this world as well, and that is never good.

I hope this article was useful, and if you’d like more content like this, please consider sharing it with your friends 🙂

Severus does the magic

It hasn’t been long since I told you I was too cheap to pay $10 a month for GitHub Copilot and came up with the idea for Severus, my own AI pair programmer. It went boom. My blog usually doesn’t get more than 20 or 30 viewers a day (at its best), and for almost a week I had more than 200 views per day. Since people showed interest in yet another AI pair programmer, I’ve decided to continue working on Severus more seriously.

Severus code generation
Severus is now capable of being accessed as an API

My plans for Severus

So in this article, I discuss a bunch of problems I may face on the long path of creating Severus and making it available as end-user software. There are some serious concerns; for example, when I talked about the idea of Severus with one of my colleagues, he told me he was concerned about the confidential code he has written.

Almost all of your concerns are valid (except the one about this whole process being handled by the Illuminati), and they are my concerns as well. The next problem I may face is scaling, so I may need to hire a well-educated DevOps engineer.

In the following sections, I explain all of my serious concerns and needs, and I hope for some help from you, the kind readers of this article.

The Community

Creating a community around something which is honestly a weekend project doesn’t seem like a good idea. You may say the same thing happened with the Linux kernel. You’re right, but this is a little different: there are tons of tools which may work much better than Severus.

It is also important to pick the right place for the community. A subreddit? A Discord server? A room on Matrix? An internet forum? Honestly, I have no idea.

So this is my biggest concern: the community!

Performance and text-generation glitches

The performance is good, thanks to the Hugging Face Inference API. Actually, knowing that the Hugging Face API exists is what made the implementation possible. But I still have some concerns here.

My main concern is that BLOOM sometimes starts generating text which is not, or cannot be classified as, code. I’ve tried different ways to get better results, but I still need a way to verify that the generated result is code, and not prose which happens to include code. And this is really the hard part, I guess.

I may need some help with this. Validation must be done on the results in order to get a good AI pair programmer; otherwise it’ll become more like an annoying colleague, or an intern who knows something but can’t gather their thoughts.
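As a starting point, here’s a crude heuristic sketch of what such a check could look like. The patterns and the threshold are just my guesses, not a proven validator:

import re

# Very rough signals that a line looks like code rather than prose
CODE_PATTERNS = [
    r"^\s*(def|class|import|from|return|if|for|while)\b",  # Python keywords
    r"[{};]\s*$",             # braces or semicolons ending a line
    r"^\s*[\w.]+\([^)]*\)",   # something that looks like a function call
    r"^\s*#|^\s*//",          # comments
]

def looks_like_code(text, threshold=0.5):
    # Guess that a generation is code if enough lines match a pattern
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    hits = sum(any(re.search(p, line) for p in CODE_PATTERNS)
               for line in lines)
    return hits / len(lines) >= threshold

print(looks_like_code("def add(a, b):\n    return a + b"))            # True
print(looks_like_code("This tutorial explains how to add numbers."))  # False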

The Product

And the final concern/plan is the product. Currently, I only have a simple application running on port 5000 on my laptop. Nothing more. There is no authentication, no user validation system, no monitoring, no scaling, no infrastructure. Basically, it’s a MacBook Pro which runs tons of programs daily, and Severus is currently one of them.

I had a VS Code extension in mind, and I also thought of a web app as the MVP, where you can easily copy the code and use it in your own projects (though of course that won’t be the best choice for confidential code).

Although I have ideas in mind, I still need more brainstorming about how this project should be delivered to you as a product.

Conclusion

I still have a lot to do with this project. There might be some language detection to check whether the generated output is code or not, and there might also be more validation to avoid mixing different programming languages.

Overall, this is one of the most difficult and at the same time most fun projects I’ve ever done. I won’t give up on it, even if it seems like a painful and expensive hobby to the people around me 🙂

 

I was too cheap to pay $10 a month for copilot, so I made my own

In mid-2021, there was a revolution in coding. As a lazy programmer who always wanted a fast and smart assistant, I was really happy to have GitHub Copilot in my arsenal of coding tools. I was one of the early adopters of the whole idea of an AI pair programmer.

Everything was fine with Copilot. I wrote tens of thousands of lines of code over the last year, and I could build a lot of projects which would have been impossible without a good, smart and fast pair programmer. But everything changed last week, when I got an email from GitHub telling me I couldn’t have free access to Copilot anymore.

It was a sad moment in my life, but I had different ways of adapting and accepting the reality. First, I considered paying $10 a month for a GitHub premium account, but since I wouldn’t use most of GitHub’s premium features, it wasn’t a suitable solution for me. I also checked out Tabnine and Kite, and those didn’t work out for me either.

My own copilot!

Say hello to Severus, my new AI pair programmer!

First, let me talk about the name a little. I was watching the Harry Potter franchise recently, and my favorite character in the whole franchise is none other than Severus Snape. So I named my AI pair programmer after him. But I know you might be curious about how I made it, so let’s find out!

The language model

First, I needed a language model capable of generating code. At first, I had OpenAI’s GPT-3 in mind, but I remembered that for various reasons, I can’t use it. So I turned to free language models. I tried GPT-J, and although it could understand code, it didn’t seem like a very high-accuracy model to me.

Then I remembered that Meta had released the OPT-175B model, and I put some of its functionality to the test. It is a really nice language model, but it works best as the core of a chatbot or a blog-post generator (or maybe a prompt engineering tool for text-to-image models); it’s not a great code generator.

Then I found my saving angel, built by a lot of the world’s open-source engineers and enthusiasts: none other than BigScience’s BLOOM.

Code tests and inference

Like most of you may have done, I first tried to complete a love story with the model. It was cool. Then I tried to create a friendly, a helpful, an idiotic and an evil chatbot with the model. All worked out perfectly. Back then, I didn’t have any limitations on Copilot, so I didn’t care about code generation.

When I found myself in the misery of not having my beloved AI pair programmer, I tried some basic Python code generation with BLOOM. It was fine, and then I tested PHP, Ruby and JavaScript as well. I found that it works pretty well, so I decided to write some simple inference code over the API.
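For the record, here’s a minimal sketch of what such inference code can look like, calling BLOOM through the Hugging Face Inference API with plain requests. The token is a placeholder, and the generation parameters are reasonable guesses rather than the exact ones Severus uses:

import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
HEADERS = {"Authorization": "Bearer hf_your_token_here"}  # placeholder token

def complete_code(prompt):
    # Send the code prefix; BLOOM continues it like an autocomplete
    response = requests.post(
        API_URL,
        headers=HEADERS,
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 64, "temperature": 0.2},
        },
    )
    response.raise_for_status()
    return response.json()[0]["generated_text"]

print(complete_code("def fibonacci(n):\n"))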

Code generation may go wrong

Since I didn’t fine-tune the model (and I don’t have the resources to), it may glitch sometimes. For example, when you don’t really pay attention to your code formatting, it might generate an explanation of the code instead.

For me, what happened was that it started explaining the code in a tutorial format (and I bet the Python code came from the Towards Data Science website, since the writing style was pretty similar).

In general, I may need a solution for this as well.

Will it be open source?

Yes. At least it’ll be partly open-sourced in the near future. But more than being open source, it will be free (as in non-paid), which I guess is a pro for the tool. I haven’t paid a single penny for the model, so why should I make you pay for it? That said, I will be open to donations and technical help from the community.

Future Plans

  • The API
  • VSCode extension
  • A community website (or discord server)

Conclusion

In the end, it seems we have a lot we can do with these brand new language models. I found my way to create a free, reliable and smart AI pair programmer, and of course I need some help along the way.

I warmly thank you for the time you’ve spent reading my article, and I openly welcome your comments and ideas.