Maral is here: a 7-billion-parameter bilingual model with support for Persian!

If you’ve read my previous posts, you know how much I like open source AI, I even jokingly titled my BLOOM post “I was too cheap to pay for GitHub’s Copilot”! So making an open source model has always been one of my life goals. Also, in my Persian blog, I pointed out that the dominance of the English language in the current LLM scene is a little concerning (you can read it here).

Today, I am pleased to announce that Maral is here: a 7-billion-parameter bilingual model that can respond to both Persian and English prompts and, at least on the dataset we fed it, produces GPT-3.5 level answers!

Maral 7B alpha 1 and its advantages

Since the release of the GPT-2 and BERT models, there have been efforts in our community to build a Persian text generation model. But to be honest, most of them were abandoned midway.

In last year’s AI revolution, however, people saw the potential of generative AI and started working on models, from RAG pipelines on top of existing models to fine-tuning base models that could somewhat understand the Perso-Arabic alphabet.

But with the release of the Mistral model, everything changed. I personally never thought a 7-billion-parameter model could understand multiple languages this well. In the next section, I explain why Mistral became my number one choice as the base model!

However, the biggest problem was still there: the dataset. Finding a good enough dataset is always a bottleneck. Luckily, an Iranian developer has translated the Alpaca dataset into our beloved Persian language (it’s accessible here).

When you’re in possession of all the ingredients for your potion, I guess it’s time to light up the cauldron and start brewing!

Why Mistral?

As a developer and an enthusiast, I always try new models and tools, especially when it comes to text. Mistral was the new kid on the block, and I had personally seen a lot of positive reviews about it. So I tried these:

  • Loading the model and testing it on the ordinary English tasks it was built for.
  • Testing the model on more complicated tasks such as reasoning or basic math.
  • Testing the model on code generation.

The model passed all of the above tests very well. You probably wouldn’t expect a mid-sized model to perform well on all of these tasks, but this one was a little different. Although it got a little confused on reasoning tasks, I could let that pass (since even GPT-4 has problems with reasoning).

But I always run one more set of tests on these models, because I’m Iranian and I speak Persian/Farsi, and I really want to know how a model performs on my language. So this is what I tested:

  • Generic Persian text generation: the model produced nonsense, but it showed potential; my guess was that it had seen some Persian text before.
  • Asking Persian questions: it tried its best to put words together, but at some point it drifted back into nonsense or even answered completely in English!
  • Translation! Believe it or not, it can be a very good measure of a model’s multilinguality (okay, I made that term up, stay calm). Although the model did well on English to French and Spanish (judging with my very limited knowledge), it didn’t perform well on Persian.

Okay, the tests showed me the potential. So I teamed up with my colleague to make it happen: let’s add support for our mother tongue to this model!

Training procedure and infrastructure

Now let’s talk about the fun stuff. First, we realized we would need very large and, at least for us, unaffordable infrastructure to train Mistral from scratch.

So we did a lot of research on the topic and found these methods:

  • Retrieval-Augmented Generation (RAG)
  • Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT)

To be honest, RAG is cool, but it doesn’t produce a new model. So we went with QLoRA and PEFT. A minimal sketch of the setup is below.
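Here is a minimal QLoRA setup sketch, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the hyperparameters shown are illustrative, not necessarily the exact values we used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "mistralai/Mistral-7B-v0.1"

# Load the base model in 4-bit precision so it fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Prepare the quantized model, then attach small trainable LoRA adapters
# to the attention projections; only these adapters get trained.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 7B weights
```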

The basic training (with extremely inaccurate results) was done on a T4 (Colab’s free tier), and then we decided to go further. So I went to our friends at Jupyto, an Iran-based company where you can rent GPUs by the hour.

They had great offers on powerful GPUs, and we got our hands on a 3090 Ti with 64 GB of RAM. It was a perfect machine for the job, and we trained the better model on this setup.

The QLoRA training took over 10 hours for 5 epochs (each epoch took more than 100 minutes), and the results were out of this world! The model could give us text that is semantically and grammatically correct! A sketch of the training step follows.
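Here is a hedged sketch of that training step, assuming trl’s SFTTrainer (argument names vary a bit between trl versions) and a hypothetical train_dataset holding the translated Alpaca examples as formatted strings in a "text" column; apart from the epoch count, the values are illustrative:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="maral-7b-qlora",
    num_train_epochs=5,                # the run described above used 5 epochs
    per_device_train_batch_size=4,     # illustrative values
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=50,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,                 # the PEFT-wrapped model from the sketch above
    tokenizer=tokenizer,
    train_dataset=train_dataset, # hypothetical: the Persian Alpaca dataset
    dataset_text_field="text",
    max_seq_length=1024,
    args=training_args,
)
trainer.train()

# Only the small LoRA adapter weights are saved, which keeps checkpoints tiny.
trainer.model.save_pretrained("maral-7b-adapter")
```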

Then we merged the adapter into the base model, so the final model takes advantage of the base model’s original knowledge as well. A sketch of the merge is below.
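A minimal sketch of that merge step, using peft’s merge_and_unload (the paths are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in half precision; merging needs unquantized weights.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Attach the trained adapter, then fold its low-rank updates into the base
# weights so the result is a single standalone model.
merged = PeftModel.from_pretrained(base, "maral-7b-adapter").merge_and_unload()
merged.save_pretrained("Maral-7B-alpha-1")
```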

However, I personally faced a set of problems, which I will point out in the next section.

The problems you may face using Maral

Since we’re at the alpha stage, I have to admit you may face these problems while using Maral, especially with the Persian language.

  • The prompt format is based on the Guanaco format, so there are no dedicated tokens for the start and end of a response (see the sketch after this list).
  • The tokenizer is not yet optimized for Persian letters, so the model may be slow on Persian text.
  • The model is really good at hallucinating.
  • Related to the previous item, it also easily produces misinformation, so please be careful with the answers you get from the model.
  • The model likes to repeat itself a lot, so if you get a repetitive answer, don’t worry.
  • Being this large, the model is a little hard to deploy on consumer hardware. However, we’ve provided 8-bit loading instructions on the HuggingFace page as well (also sketched below).
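Here is a hedged inference sketch that puts the first and last points together: 8-bit loading plus a Guanaco-style prompt. The model ID and exact prompt template are my assumptions here; check the HuggingFace page for the canonical instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaralGPT/Maral-7B-alpha-1"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # 8-bit weights via bitsandbytes, for consumer GPUs
    device_map="auto",
)

# Guanaco-style prompt: plain "### Human:"/"### Assistant:" markers instead
# of dedicated start/end-of-response tokens.
prompt = "### Human: سلام، حالت چطوره؟\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.1,  # helps with the repetition issue noted above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```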

Further works

  • Optimizing the tokenizer for the Perso-Arabic alphabet.
  • Providing a better dataset.
  • Adding bos_token and eos_token to the tokenizer, especially for the instruction-following/chat model.
  • Providing GPTQ, GGUF, or GGML models to make it more affordable on consumer hardware.
  • Making much smaller models (say 1B or 2B) with a more focused niche.

Related links