Configuring Custom Models
Warning
- We will now walk through the steps of finding, downloading and configuring a custom model. All these steps are required for it to (possibly) work.
- Jinja2 templates are complicated; this wiki is written for advanced users.
- Models found on Hugging Face or anywhere else are "unsupported"; you should follow this guide before asking for help.
Whether you "Sideload" or "Download" a custom model, you must configure it to work properly.
- We will refer to a "Download" as being any model that you found using the "Add Models" feature.
- A custom model is one that is not provided within the "GPT4All" default models list in the "Explore Models" window. Custom models usually require configuration by the user.
- A "Sideload" is any model you get from somewhere else and then put in the models directory.
Open GPT4All and click on "Find models". In this example, we use "HuggingFace" in the Explore Models window. Searching HuggingFace here will return a list of custom GGUF models. As an example, down below, we type "GPT4All-Community", which will find models from the GPT4All-Community repository.
It is strongly recommended to use custom models from the GPT4All-Community repository. They can be found with the search feature in the Explore Models page, or they can be sideloaded, but be aware that they also have to be configured manually.
- The GGUF model below (GPT4All-Community/DeepSeek-R1-Distill-Llama-8B-GGUF) is an example of a custom model which at the time of this tutorial required rewriting the jinja2 template for minja compatibility.
- You will find that most custom models will require similar work for the jinja2 template.
A GPT4All-Community model may have a compatible minja template. Click "More info can be found here.", which brings you to the Hugging Face model card.
Keep in mind:
- Some repos may not have fully tested the model provided.
- The model authors may not have bothered to change the model configuration files from finetuning to inferencing workflows.
- Even if they show you a template, it may be wrong.
- Each model has its own tokens and its own syntax.
- The models are trained using these tokens, which is why you must use them for the model to work.
- The model uploader may not understand this either and may provide a poor model or a mismatched template.
Here, you find information that you need to configure the model and understand it better. (A model may be outdated, it may have been a failed experiment, it may not yet be compatible with GPT4All, it may be dangerous, it may also be GREAT!)
- You should learn the maximum context for the model.
- You need to know if there are known problems. Check the Community tab on the model page and look around.
Maybe they won't affect you, but it's a good place to find out.
GPT4All uses minja, which is not fully compatible with the Python-based Jinja2 templates included with models.
Using the wrong template will cause problems. You may be lucky and get some output but it could be better. Maybe you will get nothing at all.
Important
The chat templates must be followed on a per model basis. Every model is different.
You can imagine them to be like magic spells.
Your magic won't work if you say the wrong word. It won't work if you say it at the wrong place or time.
You need a clean prompt without any jinja2:
You are a helpful AI assistant.
You could get complicated and write a little JSON that the LLM will interpret to dictate behavior.
{
"talking_guidelines": {
"format": "Communication happens before or after thoughts.",
"description": "All outward communication must be outside of a thought either before or after the think tags."
},
"thinking_guidelines": {
"format": "<think>All my thoughts must happen inside think tags.</think>",
"description": "All internal thoughts of the character MUST be enclosed within these tags. This includes reactions, observations, internal monologues, and any other thought processes. Do not output thoughts outside of these tags. The tags themselves should not be modified. The content within the tags should be relevant."
}
}
The default settings are a good, safe place to start. The defaults provide good output for most models. For instance, you can't blow up your RAM on only 2048 context, and you can always increase it to whatever the model supports.
This is the maximum context that you will use with the model. Context is roughly the sum of the tokens in the system prompt + chat template + user prompts + model responses + tokens added to the model's context via retrieval augmented generation (RAG), which is the LocalDocs feature. You need to keep context length within two safe margins:
- Your system can only use so much memory. Using more than you have will cause severe slowdowns or even crashes.
- Your model is only capable of what it was trained for. Using more than that will give trash answers and gibberish.
Since we are talking about computer terminology here, 1k = 1024, not 1000. So 128k, as advertised by the Phi-3 model, translates to 1024 x 128 = 131072.
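If you want a rough idea of how many tokens a prompt or document will consume before it ever reaches GPT4All, one option is to run the model's original tokenizer through the Hugging Face transformers library. This is only a hedged sketch, not something GPT4All requires; it assumes you downloaded the source repository's tokenizer files into a local folder, here hypothetically called ./model_source.

# Rough sketch: estimate how much context a prompt will consume.
# Assumes the model's original tokenizer files (tokenizer.json,
# tokenizer_config.json, special_tokens_map.json) were downloaded
# into ./model_source (a hypothetical folder name).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model_source")

system_prompt = "You are a helpful AI assistant."
user_prompt = "Summarize the history of the printing press."

token_count = len(tokenizer.encode(system_prompt + "\n" + user_prompt))
print(f"This prompt uses roughly {token_count} tokens of context.")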
I will use 4096, which is 4k of a response. I like allowing for a great response but want to stop the model at that point. (Maybe you want it longer? Try 8192.)
This is one that you need to think about if you have a small GPU or a big model.
This will be set to load all layers on the GPU. You may need to use less to get the model to work for your GPU.
These settings are model independent. They are only for the GPT4All environment. You can play with them all you like.
The rest of these are special settings that need more training and experience to learn. They don't need to be changed most of the time.
The rest of this page will:
- Explain why the model is now configured but still doesn't work.
- Explain the .json files used to make the gguf.
- Explain how the tokens work.
So, the model you got from some stranger on the internet didn't work like you expected it to?
They probably didn't test it. They probably don't know it won't work for everyone else.
Some problems are caused by the settings provided in the config files used to make the gguf.
Perhaps llama.cpp doesn't support that model and GPT4All can't use it.
Sometimes the model is just bad. (maybe an experiment)
You will be lucky if they include the source files used for this exact gguf. (This person did not.)
The model used in the example above only links you to the source of their source. This means you can't tell what they did to it when they made the gguf from that source. After the gguf was made, someone may have changed anything on either side, Microsoft or QuantFactory.
In the following example, I will use a model with a known source. This source has an error, and they can fix it, or you can, like we did. (Expert: make your own gguf by converting and quantizing the source.)
The following relevant files were used in the making of the gguf:
- config.json (Look for "eos_token_id")
- tokenizer_config.json (Look for "eos_token" and "chat_template")
- generation_config.json (Look for "eos_token_id")
- special_tokens_map.json (Look for "eos_token" and "bos_token")
- tokenizer.json (Make sure those match this.)
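The walkthrough below does this cross-check by hand. If you prefer to script it, here is a minimal sketch in Python; it assumes the five JSON files listed above sit in the current directory and follow the usual Hugging Face layout.

# Hedged sketch: print the BOS/EOS settings from the source JSON files side
# by side so mismatches (like the one described below) are easy to spot.
import json

def load(name):
    with open(name, encoding="utf-8") as f:
        return json.load(f)

config = load("config.json")
generation = load("generation_config.json")
tok_config = load("tokenizer_config.json")
special = load("special_tokens_map.json")
tokenizer = load("tokenizer.json")

# tokenizer.json lists the special tokens and their ids under "added_tokens".
id_to_text = {t["id"]: t["content"] for t in tokenizer.get("added_tokens", [])}

def text_of(token):
    # special_tokens_map.json may store a token as a plain string or as a dict.
    return token["content"] if isinstance(token, dict) else token

print("config.json             eos_token_id:", config.get("eos_token_id"),
      "->", id_to_text.get(config.get("eos_token_id")))
print("generation_config.json  eos_token_id:", generation.get("eos_token_id"),
      "->", id_to_text.get(generation.get("eos_token_id")))
print("tokenizer_config.json   eos_token   :", tok_config.get("eos_token"))
print("special_tokens_map.json eos_token   :", text_of(special.get("eos_token")))
print("tokenizer_config.json   bos_token   :", tok_config.get("bos_token"))
print("special_tokens_map.json bos_token   :", text_of(special.get("bos_token")))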
We will begin with tokenizer_config.json; it defines how the model's tokenizer should process input text.
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<\|startoftext\|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<\|endoftext\|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<\|im_end\|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<\|startoftext\|>",
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<\|im_start\|>user\\n' + content + '<\|im_end\|>\\n<\|im_start\|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<\|im_end\|>' + '\\n' }}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<\|im_end\|>",
"legacy": true,
"model_max_length": 16384,
"pad_token": "<unk>",
"padding_side": "right",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"split_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}
Here we want to make sure that the "chat_template" exists. (It exists, good.)
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<\|im_start\|>user\\n' + content + '<\|im_end\|>\\n<\|im_start\|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<\|im_end\|>' + '\\n' }}{% endif %}{% endfor %}",
There is a BOS token and an EOS token. (They exist, excellent!)
"bos_token": "<\|startoftext\|>",
"eos_token": "<\|im_end\|>",
You can also see which token id numbers to expect, very nice.
"7": {
"content": "<\|im_end\|>",
Hopefully all of those tokens match in this file and in the other files as well. (let's see)
Open up the next important file, special_tokens_map.json. This file is special because, when the model is built, the tokenizer will treat the tokens listed here differently from regular vocabulary tokens. For example:
- They may be exempt from subword tokenization; they can never be broken!
- For example, the word "unhappiness" might be tokenized into "un", "happy", and "ness".
- However, special tokens like [EOS], [BOS], are typically treated as single, indivisible units.
- They have specific positions in input sequences, like the BOS and EOS; the model has also learned a special meaning for them.
- BOS (Beginning of Sequence) token:
  - Often represented as "[BOS]" or "<s>".
  - Typically placed at the very start of an input sequence.
  - Signals to the model that a new sequence is beginning.
- EOS (End of Sequence) token:
  - Often represented as "[EOS]" or "</s>".
  - Typically placed at the very end of an input sequence.
  - Signals to the model that the sequence has ended.
  - Crucial for tasks where the model needs to know when to stop generating output.
Let's take a look at special_tokens_map.json:
{
"bos_token": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
As you can imagine, if we are missing any special tokens here that were in tokenizer_config.json, you can end up with gibberish as output. The tokenizer might break those tokens up, and the model would never know it was supposed to stop or start or whatever else may be important to the training of that model.
Next, let's look at the tokenizer.json file. This file contains the entire "vocabulary" of the model. We need to make sure it all matches the other files; it includes all the tokens the model will use and the "mapping" of them. For instance, we know the tokenizer_config.json believes a few things:
"7": {
"content": "<\|im_end\|>",
It must match tokenizer.json to work. In this case, take a close look at the first few of the 64000 tokens.
"added_tokens": [
{
"id": 0,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<|startoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 2,
"content": "<|endoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 6,
"content": "<|im_start|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 7,
"content": "<|im_end|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},..................................
"陨": 63999
},
The id number of the token is 7, and the token itself is <|im_end|>.
This must be true for the model to work; everything in the files must match, so you need to cross-check each file for errors.
Now let's look at generation_config.json.
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"transformers_version": "4.40.0"
}
If something is set here, it is enforced during generation. You may have missed it if you weren't paying attention: this doesn't match our other files!
The other files tell the model to use "eos_token": "<|im_end|>",
but this one is watching for "eos_token_id": 2,
and we know that in this model "id": 2
is "content": "<|endoftext|>",
That isn't going to work. The gguf model you downloaded will have an endless generation loop unless this is corrected.
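If you go the expert route and rebuild the gguf from the source files yourself, the mismatch can be corrected before converting. Below is a minimal sketch; it assumes id 7 (<|im_end|>) is the EOS token that the chat template and special_tokens_map.json agree on. Note that editing this JSON does not repair a gguf that has already been built.

# Hedged sketch: patch generation_config.json so the enforced eos_token_id
# matches the <|im_end|> token (id 7) that the chat template actually uses.
# This only helps if you convert and quantize the source files yourself
# afterwards; it does not change a gguf that was already built.
import json

with open("generation_config.json", encoding="utf-8") as f:
    generation = json.load(f)

generation["eos_token_id"] = 7  # id of <|im_end|> in tokenizer.json

with open("generation_config.json", "w", encoding="utf-8") as f:
    json.dump(generation, f, indent=2)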
Finally, let's look at the config.json file. When a model is loaded, this is what it will know about itself.
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 16384,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 48,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 5000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0",
"use_cache": false,
"vocab_size": 64000
}
Well, here are all the things the model believes to be true. We can see it is also wrong. The model believes "eos_token_id": 2,
will stop the generation, but it was trained to use "eos_token_id": 7,
which is what the chat template is telling us to use. It is also found in special_tokens_map.json, so it will be protected for this purpose.
Now you know why your model won't work. Hopefully you didn't download it yet!