How To Do Stable Diffusion Textual Inversion (TI) / Text Embeddings By Automatic1111 Web UI Tutorial #299
FurkanGozukara
announced in
Tutorials
How To Do Stable Diffusion Textual Inversion (TI) / Text Embeddings By Automatic1111 Web UI Tutorial
Full tutorial: https://www.youtube.com/watch?v=dNOpWt-epdQ
Our Discord : https://discord.gg/HbqgGaZVmr. Grand Master tutorial for Textual Inversion / Text Embeddings. If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰 https://www.patreon.com/SECourses
Playlist of Stable Diffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img:
https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3
In this video, I explain almost every aspect of Stable Diffusion Textual Inversion (TI) / Text Embeddings. I demonstrate a live example of how to train a person's face with all of the best settings, including technical details.
TI Academic Paper: https://arxiv.org/pdf/2208.01618.pdf
Automatic1111 Repo: https://github.com/AUTOMATIC1111/stable-diffusion-webui
Easiest Way to Install & Run Stable Diffusion Web UI on PC
https://youtu.be/AZg6vzWHOTA
How to use Stable Diffusion V2.1 and Different Models in the Web UI
https://youtu.be/aAyvsX-EpG4
Automatic1111 Used Commit : d8f8bcb821fa62e943eb95ee05b8a949317326fe
Git Bash : https://git-scm.com/downloads
Automatic1111 Command Line Arguments List: https://bit.ly/StartArguments
S.D. 1.5 CKPT: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main
Latest Best S.D. VAE File: https://huggingface.co/stabilityai/sd-vae-ft-mse-original/tree/main
VAE File Explanation: https://bit.ly/WhatIsVAE
Cross attention optimizations bug: https://bit.ly/CrosOptBug
Vector Pull Request: AUTOMATIC1111/stable-diffusion-webui#6667
All of the tokens list in Stable Diffusion: https://huggingface.co/openai/clip-vit-large-patch14/tree/main
Example training dataset used in the video:
https://drive.google.com/file/d/1Hom2XbILub0hQc-zmLizRcwFrKwHYGcc/view?usp=sharing
Inspect-Embedding-Training Script repo:
https://github.com/Zyin055/Inspect-Embedding-Training
How to Inject Your Trained Subject: https://youtu.be/s25hcW4zq4M
Comparison of training techniques: https://bit.ly/TechnicComparison
Embedding file name list generator script:
https://jsfiddle.net/MonsterMMORPG/Lg0swc1b/10/
00:00:00 Introduction to #StableDiffusion #TextualInversion Embeddings
00:01:00 Which commit of the #Automatic1111 Web UI we are using and how to checkout / switch to specific commit of any Git project
00:04:07 Used command line arguments of Automatic1111 webui-user.bat file
00:04:35 Automatic1111 command line arguments
00:05:31 How to and where to put Stable Diffusion models and VAE files in Automatic1111 installation
00:06:05 Why we use the latest VAE file and what the VAE file does
00:08:24 Training settings of Automatic1111
00:10:38 All about names of text embeddings
00:11:00 What is initialization text of textual inversion training
00:11:32 Embedding inspector extension of Automatic1111
00:11:52 Technical and detailed explanation of tokens and their numerical weight vectors in Stable Diffusion
00:14:25 How to set the number of vectors per token when doing Textual Inversion training
00:16:00 How prompts get tokenized - turned into tokens - using the tokenizer extension
00:18:58 Setting number of training vectors
00:20:24 Where embedding files are saved in automatic1111 installation
00:20:38 All about preprocess images before TI training
00:23:06 Training tab of textual inversion
00:23:18 What to and how to set embedding learning rate
00:23:40 What are the Batch size and Gradient accumulation steps and how to set them
00:24:40 How to set training learning rate according to Batch size and Gradient accumulation steps
00:26:21 What are prompt templates, what are they used for, how to set and use them in textual inversion training
00:29:06 What are filewords and how they are used in training in automatic1111 web ui
00:29:35 How to edit image captions when doing textual inversion training
00:31:07 From the training images pool, how and why I chose some of them and not all of them
00:31:54 Why I added noise to the backgrounds of some training dataset images
00:32:07 What your training dataset should look like - what makes a good training dataset
00:34:48 Save TI training checkpoints
00:36:31 Which latent sampling method is best
00:36:57 Training started
00:38:08 Overclock GPU to get 10% training speed up
00:38:32 Where to find TI training preview images
00:39:15 Where to see used final prompts during training
00:41:34 How to use inspect_embedding_training script to determine overtraining of textual inversion
00:42:31 What is training loss
00:48:23 Technical difference of Textual Inversion, DreamBooth, LoRA and HyperNetworks training
00:52:17 Over 200 epochs and already got very good sample preview images
00:54:28 How to set newest VAE file as default in the settings of automatic1111 web ui
00:55:06 How to use generated embeddings checkpoint files
00:58:31 How to test different checkpoints via X/Y plot and embedding files name generator script
01:07:27 How to upscale image by using AI
01:08:42 How to use multiple embeddings in a prompt
Video Transcription
00:00:01 Greetings everyone. Welcome to the most comprehensive, technical, detailed and yet
00:00:07 still beginner-friendly Stable Diffusion Text Embeddings, also known as Textual Inversion
00:00:12 training tutorial. In this video I am going to cover all of the topics that you see here and
00:00:18 more. Currently I am hovering my mouse over there. You can pause the video and check them out if you
00:00:25 wish. Also, you can see here the training dataset we used and here the results obtained with the trained textual embedding.
00:00:32 Let's start by quickly introducing what textual inversion is and its officially released academic
00:00:39 paper. If you are interested in reading this article, you can open the link and read it.
00:00:47 I am also going to show some of the important parts of this article when we use
00:00:56 them; I will explain things through the article. So, to do training, we are going to use
00:01:02 Automatic1111 web UI. If you don't know how to install and set up the Automatic1111 web UI,
00:01:10 I already have a video for that on my channel: Easiest Way to Install & Run Stable Diffusion
00:01:16 Web UI. Also, I have another video How to use Stable Diffusion V2.1 and Different Models.
00:01:24 So I am going to use a specific version of the Automatic1111 web UI. It is constantly
00:01:30 getting updated, and therefore it is constantly getting broken, and you keep asking me:
00:01:37 which version did you use? I am going to use this specific version, this commit, because after the bump of
00:01:44 Gradio to 3.16, it has given me a lot of errors. So how am I going to use this specific version?
00:01:55 To use a specific version, I am going to clone it with Git Bash. If you haven't installed
00:02:00 Git Bash yet, you can find it by using Google. Just type Git Bash into Google.
00:02:06 You can download from this website and install it. It is so easy to install.
00:02:10 First I am going to select the folder where I want to clone my Automatic1111 web UI. I am entering my
00:02:18 F drive and in here I am generating a new folder with right click new folder. Let's give it a name
00:02:25 as tutorial web UI. OK, then we will move inside this folder in our Git Bash window. To do that,
00:02:36 type cd F: and now we are in the F drive, then cd, put the folder name in quotation marks like this and hit enter.
00:02:45 Now we are inside this folder. Now we can clone Automatic1111 with git clone and copy the URL from
00:02:55 here like this and paste it into here. Right click, paste and it will clone it. OK, it is
00:03:03 cloned inside this folder. So I will enter there CD "s" tab and it will be automatically completed
00:03:11 like this and hit enter. Now we will check out to certain version from here. Let me show you again.
00:03:21 Click the commits from here, and here I am moving to the commit that I want: enable progress bar
00:03:27 without gallery. This is the commit ID. I will also put this into the description of the video.
00:03:32 Then we are going to do git checkout like this, and right click, paste. Now we are on that commit
00:03:44 and we are using that specific version inside our folder.
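For reference, here is the same sequence of Git Bash commands consolidated into one block (a sketch: the drive letter and folder name are simply the ones used in the video, and the commit hash is the one listed in the description above):

```bash
# Clone the web UI into the tutorial folder and switch to the exact commit used in the video.
cd "F:/tutorial web UI"
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
git checkout d8f8bcb821fa62e943eb95ee05b8a949317326fe
```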
00:03:53 So before starting the setup, I am copy-pasting this webui-user.bat file, because I am going to add my command line arguments to it.
00:04:01 OK, right click, copy, then edit, and let me zoom in and paste. So I am going to use xformers,
00:04:09 no-half and disable-safe-unpickle. So how did I come up with these command line arguments?
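Before going through each flag, this is roughly what the edited webui-user.bat ends up looking like (a sketch based on the stock file in the repository; only the COMMANDLINE_ARGS line is changed):

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
rem The three arguments discussed in the video:
set COMMANDLINE_ARGS=--xformers --no-half --disable-safe-unpickle

call webui.bat
```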
00:04:15 xformers is going to increase your speed significantly and reduce the VRAM usage
00:04:21 of your graphics card. No-half is necessary for xformers to work correctly when you are using
00:04:29 SD 1.5 or 2.1, and then there is disable-safe-unpickle. According to the web UI documentation, you see the
00:04:38 URL here, Command line arguments and settings: it disables checking pytorch models for malicious
00:04:44 code. Why am I using this? Because if you train your model on Google Colab, sometimes it does not
00:04:50 work without it. It is not strictly necessary, but I am just using it, and I am not downloading any model
00:04:54 without knowing what it is. OK, then we save and run, and we are going to get our fresh installation.
00:05:04 OK, like this, it will install all necessary things. And you see, Let me zoom in It is
00:05:12 using Python 3.10.8 version. By the way, you have to have installed Python correctly for
00:05:23 this to install. It is also showing the commit hash that I am using like this.
00:05:30 I also need to put my Stable Diffusion models into the models folder. So let's open it. Open
00:05:36 the Stable-diffusion folder and copy-paste from my previous download. And another thing is that I
00:05:43 am going to use the latest VAE file that I have downloaded from the Internet, which I am going to
00:05:49 show you right now. So where do we put this VAE file? Go to the stable diffusion web UI folder and in
00:05:54 here you will see the VAE folder. It is not generated here; it is inside the models folder, and
00:06:02 inside here: VAE, and this is the VAE file. Why are we using this VAE file? Because it is improving
00:06:10 generation of person images. And now let me show the link. OK, this is the link of the VAE file.
00:06:17 This is the latest version of VAE file. Just click the CKPT file from here and click the download
00:06:24 button. I will also put the link of this into the description. So if you are wondering the technical
00:06:30 description, technical details of the VAE files, there is a long explanation in here in this
00:06:37 thread. I will also put the link of this thread into this description of the video and there is a
00:06:42 shorter description which I liked: each generation is done in a compressed representation, and the VAE
00:06:48 takes the compressed results and turns them into full-sized images. SD comes with a VAE already,
00:06:54 but certain models may supply a custom VAE that works better for that model, and the SD 1.5
00:07:03 model is not using the latest VAE file. Therefore, we are downloading this and putting it into our
00:07:10 folder. The SD 2.1 model is using the latest VAE file. And which SD 1.5 model am I using?
00:07:20 I am using the 1.5 pruned ckpt. And where did I download it? I have downloaded it from this
00:07:29 URL, and we are using the pruned ckpt because it is better for training than the ema-only file which,
00:07:36 you see, is smaller in size. By the way, the things I am going to show in this video can
00:07:43 be applied to any model, such as Protogen or SD 2.1 version. Actually, I have made
00:07:51 experiments on Protogen training as well, and I will show the results of that too to you.
00:07:59 Okay, the fresh installation has been completed. No errors, and these are the messages displayed,
00:08:04 and it has started on this URL, which I have already opened. You can copy and paste
00:08:11 this URL into your browser. So currently it has Protogen selected by default, and I am going
00:08:18 to make this tutorial on version 1.5 pruned, the official version. Okay, before starting training,
00:08:26 I am first going to the settings. I am first going to show you the settings that we need. Go to the
00:08:32 training tab in here and check this checkbox, Move VAE and CLIP to RAM when training. This
00:08:40 requires a lot of RAM actually. I have 64 GB, and if you check this, it will reduce the VRAM usage,
00:08:48 VRAM being the GPU RAM, which is our more limited RAM. Then you can also check this:
00:08:56 Turn on pin_memory for DataLoader. This makes training slightly faster, but increases memory
00:09:01 usage. I think this increases the RAM usage, not the VRAM usage, so you can test this.
00:09:07 In other videos you will see people check this checkbox: Use cross attention optimizations while
00:09:13 training. This will significantly increase your training speed and reduce the VRAM usage. However,
00:09:19 it also significantly reduces your training success. So, if your graphics card can do training
00:09:27 without this checked, do not check it, because it will reduce your training success and
00:09:33 it will effectively reduce your learning rate. How do I know this? According to vladmandic on GitHub,
00:09:41 this is causing a lot of problems. He has opened a bug topic on the Stable
00:09:50 Diffusion web UI issues and he says that this is causing a lot of problems. Let me show you.
00:09:59 He says that when he disabled cross attention for training and rerun exactly the same settings,
00:10:04 the results are perfect and I can verify this. So do not check this if your graphic card can
00:10:11 run it. There is also one more setting that we are going to set: Save a CSV containing the loss
00:10:18 to the log directory every N steps. So I am going to make this 1. Why? Because I will show you how we
00:10:24 are going to use it during the training. Then click Apply settings. Okay. Then Reload UI.
00:10:31 Okay, settings were saved and UI is reloaded. Now go to the train tab. Okay. First of all,
00:10:39 we are going to give a name to our embedding. The name itself is not important at all,
00:10:46 so you can give it any name; it will be used to activate our embedding. Okay,
00:10:54 so I am going to give it a name such as training example. It can be any number of characters;
00:11:00 it won't affect your results or token count. Initialization text: now what does this mean?
00:11:07 For example, you are teaching a face and you want it to be similar to Brad Pitt. Then you
00:11:14 can type Brad Pitt. So what does this mean? Actually, to show you that first we are going
00:11:20 to install an extension, go to the available load from and in here, type embed into your search bar
00:11:29 and you will see embedding inspector. This is an extremely useful extension and let's install it.
00:11:39 Okay, the extension has been installed, so let's just restart with this. Okay, now we can see the
00:11:50 embedding inspector. So everything in Stable Diffusion is composed of tokens. What does that
00:12:00 mean? You can think of tokens as keywords, but not exactly like that. For example, when we type cat
00:12:07 and click inspect, cat is a single token, and it has an embedding ID and it has weights.
00:12:16 So every token has numerical weights, like this. And when we do training with embeddings,
00:12:27 we are actually going to generate a new vector that doesn't exist in Stable Diffusion. We
00:12:34 are going to do training on that. So when you set initialization text like this, by the way,
00:12:42 it is going to generate a vector with the weights of this. However, this is two tokens. How do I
00:12:51 know? Go to the embedding inspector and type Brad. So you see, Brad is a single token.
00:12:57 It has weights. And let's type Pitt, and Pitt is also another token and it also has a vector.
00:13:05 So these weights would be assigned initially to our new vectors. However, we would have to use at least
00:13:14 two vectors, otherwise the two-token initialization would not fit. So if we start our training with Brad
00:13:24 Pitt, our first initial weights will be according to the Brad Pitt and our model will learn upon
00:13:32 that. Is this good? If your face is very similar to Brad Pitt, yes, but if it is not, no. So
00:13:42 Shondoit from the Automatic1111 community has done extensive experimentation and found that
00:13:54 leaving the initialization text empty, so that we start with zeroed vectors, performs
00:14:04 better than starting with, for example, *. Because * is also just another token, and you can see it from
00:14:13 here. Just type * here. It is just some vectors like this. So starting with empty vectors is
00:14:21 better. And now, the number of vectors per token. So every token has a vector in
00:14:30 Stable Diffusion, and you may wonder how many tokens there are. To find that out, we are going to
00:14:38 check out the clip vit large patch. So in here you will see the tokenizer json. Yes, inside this json
00:14:45 file all of the tokens are listed. So you see, let me show, there is word IDs and words themselves,
00:14:56 like here: you see yes. So the list is starting from here. So each one of these are tokens and
00:15:04 it goes to the bottom like this: For example, sickle, whos, lamo, etour, finity. So these are
00:15:12 all of the tokens, all of the embeddings that Stable Diffusion contains. If you wonder how many
00:15:19 there are exactly, there are exactly 49408 tokens and each contains one vector. For SD 1.x versions,
00:15:33 the vector size is 768, and for SD 2.x versions it is 1024. So when we use the embedding inspector,
00:15:45 you see it is showing the vector. So everything is composed of numerical weights,
00:15:50 and they are used by the machine learning algorithms to do inference. So every
00:16:00 prompt we type also gets tokenized, and I will show that tokenization right now.
00:16:06 Before we start to do that, go to the available tab load here and search for token and you will
00:16:13 see there is tokenizer, like tokenizer extension. Just install it, restart the UI and now you will
00:16:22 see tokenizer. So type your prompt here and see how it is getting tokenized. So let's say I am
00:16:29 going to use this kind of prompt. It is showing in the web UI that fifty eight tokens are being
00:16:37 used and we are limited to seventy five tokens. But we are not using fifty eight words here.
00:16:44 If you count the number of words it is not fifty eight. So let's copy this and go to the tokenizer,
00:16:50 paste it and tokenize, and now it is showing all of the tokenization. So face is a single token
00:16:57 with an ID of 1810. Photo is a single token, and let's see: OK, so artstation is two
00:17:06 tokens, art and station. Commas are also single tokens, as you can see, and let's see if anything
00:17:13 else gets tokenized into multiple tokens. Photorealistic? Photorealistic is also two
00:17:21 tokens, and artstation is two tokens. So this is how tokenization works. Each of these tokens has
00:17:29 its own vector, and you can see their weights in the embedding inspector. However, that is not very
00:17:35 useful, because these numbers don't mean anything individually, but in the bigger scheme they
00:17:42 work very well with the machine learning algorithms. Machine learning is all about weights.
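If you want to reproduce this tokenization outside the web UI, here is a minimal sketch using the Hugging Face transformers package and the same openai/clip-vit-large-patch14 tokenizer linked in the description (the prompt is just an example):

```python
from transformers import CLIPTokenizer

# Load the same tokenizer that Stable Diffusion 1.x uses for its text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "photorealistic face photo, artstation"
pieces = tokenizer.tokenize(prompt)            # sub-word pieces, e.g. "photo</w>", "art", "station</w>"
ids = tokenizer.convert_tokens_to_ids(pieces)  # the numeric token IDs

for piece, token_id in zip(pieces, ids):
    print(f"{token_id:>6}  {piece}")
```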
00:17:50 Also, in the official paper of textual inversion, on page four, you see they are showing 'a photo
00:17:58 of S*', where S* stands for our embedding. So you see there is a tokenizer and token IDs, and they have
00:18:07 vectors like this. So it is all about vectors and their weights. OK. Now we can return to the
00:18:14 train tab. Now we have an idea of tokenization. So let's give it the name tutorial training. You
00:18:23 can give this any name; this will be the activation text. Initialization text: I am just leaving it empty to
00:18:30 obtain the best results, so our vectors will start at zero. Let's say you are training
00:18:37 on bulldog images; then you could start with the bulldog weights, and it may make your
00:18:46 training better. However, for faces, since we are training a new face that the model has no idea about,
00:18:54 I think leaving it empty is better. So, number of vectors: now you know that
00:19:01 each token has one vector, which means that when we type Brad Pitt, only two vectors are used for
00:19:10 that. So all of the Brad Pitt images are saved in the stable diffusion model with just two vectors,
00:19:18 which means that two vectors is a good number for our face training or
00:19:29 for our subject training. I have also made a lot of experiments with one vector, two vectors,
00:19:35 three vectors, four vectors, and I have found that two vectors work best. However, this is
00:19:42 based on my training data set. You can also try one, two, three, four, five and you will see that
00:19:49 the quality is decreasing as you increase the number of vectors. Also, in the official papers
00:19:55 the researchers have used up to three vectors. You see 'extended latent spaces'; this is the
00:20:01 vector count that is derived from the official paper, and they have used up to three. You see
00:20:07 it denoted as two words and three words, but it is up to you to do experimentation, and I am going to
00:20:14 use two. If you check Overwrite Old Embedding, it will overwrite an existing embedding with the same name. So
00:20:20 let's click Create embedding, and it is created. So where is it saved? Go to your installation
00:20:30 folder and in here you will see embeddings, and in here we can already see that our embedding has been created.
00:20:37 Then let's go to Preprocess images. So this is a generic tab of the web UI. It lets you crop
00:20:48 images, create flipped copies, split oversized images, auto focal point crop, use BLIP for
00:20:53 captioning, or use deepbooru for captioning. There is a source directory and a destination directory.
00:21:00 So I have a folder like this for experimentation and showing I am copying its address like this
00:21:07 and pasting it in here as source and I am going to give it a destination directory like a1. They
00:21:15 are going to be auto-resized and cropped. So let's check this checkbox: Create flipped
00:21:22 copies. By the way, for faces, I am not suggesting using this; it is not improving quality. You can
00:21:28 also split oversized images, but this doesn't make sense for faces. Auto focal point crop: yes,
00:21:34 let's also click that. Use BLIP for caption: so it will use the BLIP algorithm for
00:21:40 captioning. This is better for real images, and deepbooru is better for, I think, anime images. OK,
00:21:48 and then let's just click preprocess. By the way, why are we doing 512 by 512? Because version 1.5,
00:21:58 Stable Diffusion version 1.5, is based on 512 by 512 pixels. If you use Stable Diffusion version
00:22:07 2.1, then it has both 512-pixel and 768-pixel models. So you need to process images based on
00:22:19 the model's native resolution, based on the model that you are going to train on. In the training
00:22:26 tab it will use the selected model here. So be careful with that. And when the first time
00:22:32 when you do preprocessing, it is downloading the necessary files as usual. OK, the processing has
00:22:38 been finished. Let's open the processed folder, from Pictures, the a1 folder. And now
00:22:45 you see there are flipped copies and they were automatically cropped to 512 by 512 pixels. And
00:22:52 there are also descriptions generated by the BLIP. When you open the descriptions, you will see like
00:22:58 this: a man standing in front of a metal door in a building with a blue shirt on and black pants.
00:23:05 So now we are ready with the preprocess images, we can go to the training tab. In here we are
00:23:13 selecting the embedding that we are going to train, embedding learning rate. There are various,
00:23:20 let's say, discussions on this learning rate, but in the official paper, 0.005 is used. Therefore,
00:23:29 I believe that this is the best learning rate. The gradient clipping is related to the hypernetwork
00:23:36 learning rate, to hypernetwork training, so just don't touch it. Now the batch size and gradient
00:23:42 accumulation steps: this is also explained in the official paper. The batch size and gradient
00:23:49 accumulation steps will just increase your training speed if you have a sufficient amount
00:23:54 of RAM and VRAM memory. However, make sure that the number of training images can be divided by
00:24:00 the multiplication of these two numbers. So let's say you have 10 training images; then you can set
00:24:10 these as a batch size of 2 and gradient accumulation of 5, which is two multiplied by five, equal to 10.
00:24:20 Or let's say you have 40 training images; then you can set them as 20, or 10, or 5,
00:24:28 it is up to you. However, this will significantly increase your VRAM usage. And
00:24:35 let's say the multiplication of these two numbers is equal to 10. Then you should also multiply the
00:24:42 learning rate by 10. Why? Because this requires the learning rate to be increased. How do I know that?
00:24:51 In the official paper, in the implementation details,
00:24:55 they say that they are using two graphics cards with a batch size of four. Then they change
00:25:03 the base learning rate by multiplying it by eight. Why? Because with two graphics cards and a
00:25:08 batch size of four, four multiplied by two is eight, and when you multiply 0.005 by 8, we obtain
00:25:17 0.04. So be careful with that: if you increase the batch size and gradient accumulation steps,
00:25:25 also make sure that you are increasing the learning rate as well. However, for this tutorial
00:25:31 I am going to use a batch size of one and gradient accumulation steps of one. Actually, until you
00:25:37 obtain good initial results, I suggest you don't change them; then you can change them.
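As a quick worked example of the scaling rule described above (a sketch only; the helper function is just for illustration):

```python
def scaled_learning_rate(base_lr: float, effective_batch_multiplier: int) -> float:
    """Scale the base learning rate by the effective batch size multiplier
    (batch size x gradient accumulation steps, or x GPU count in the paper's setup)."""
    return base_lr * effective_batch_multiplier

# Paper example: base 0.005, batch size 4 on 2 GPUs -> multiplier 8 -> 0.04
print(scaled_learning_rate(0.005, 4 * 2))  # 0.04
# This tutorial: batch size 1, gradient accumulation 1 -> multiplier 1 -> keep 0.005
print(scaled_learning_rate(0.005, 1 * 1))  # 0.005
```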
00:25:44 Then you need to set your training data set directory. So let's say I am going to use these images,
00:25:51 then I am going to set them. Also, there is the log directory, so the training logs will be logged
00:25:59 in this directory. Where is it? When we open our installation folder, we will see that there is a
00:26:11 textual_inversion folder. However, since we haven't started yet, it is not generated. The first
00:26:18 time we start, it will be generated. I suggest you not change this. Okay,
00:26:23 prompt template. So what are prompt templates? Why are they used? Actually, there is no clear
00:26:31 explanation of this in the official paper. When you go to the very bottom, you will see the
00:26:38 training prompt templates. So these templates are actually derived from here. From my experience,
00:26:46 I have a theory that these prompts are used like this. So let's say you are teaching a photo of a
00:26:55 person; then the vectors of these tokens are also used to obtain your target image. So
00:27:04 they are helping to reach your target image. This is my theory. So it is using the vector of 'photo',
00:27:14 the vector of 'a', the vector of 'of', or, if you are teaching a style, it is using that. So these templates
00:27:23 are actually these ones. When you open the prompt template folder, which is in here, let's go to the
00:27:32 textual inversion templates folder, and you will see the template files like this. So let's say, when
00:27:38 you open the subject filewords template, you will get a list like this: 'a photo of a [name], [filewords]',
00:27:44 and so on. The [name] is the activation name that we have given. It will be treated specially. It will
00:27:52 not get turned into regular tokens. For example, tutorial training would be tokenized like this if
00:28:00 it was not an embedding name: tutorial training. Let's click tokenize. You see, tutorial training is
00:28:07 actually three tokens. Tutorial is tokenized as tutor and ial, plus training. However, since
00:28:16 it will be our special embedding name, it will instead be treated as a number of special
00:28:25 tokens based on the number of vectors per token we decided on. If we decide to set this
00:28:33 to 10, then it will use 10 tokens of space from our prompt, so it will take 10 spaces in here.
00:28:42 However, it will now take only two instead of three, because it will be specially treated. Okay,
00:28:51 let's go back to the train tab. So this [name] is the name of
00:29:01 our embedding, and then there are the [filewords]. The filewords are the description generated here.
00:29:08 So, basically, the prompt for training will become 'tutorial training' plus the filewords:
00:29:16 let's say it is training on this particular image; it will just take this caption and append it
00:29:22 here, and this will become the final prompt for that image when doing training.
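To make that assembly concrete, here is a minimal illustration of how a template line, the embedding [name] and an image's [filewords] caption combine into the final training prompt (illustration only, not the web UI's actual code; the caption is a shortened version of the BLIP example shown earlier):

```python
template = "a photo of a [name], [filewords]"
name = "tutorial training"
filewords = "a man standing in front of a metal door in a building"

final_prompt = template.replace("[name]", name).replace("[filewords]", filewords)
print(final_prompt)
# a photo of a tutorial training, a man standing in front of a metal door in a building
```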
00:29:30 So, how should we edit this description? You should describe the parts that you don't want the
00:29:39 model to learn. Which parts don't I want the model to learn? I don't want the model to learn this clothing,
00:29:45 these walls, for example, or this item here. So I have to describe them as much as possible.
00:29:53 So if I want the model to learn the glasses, then I need to remove 'glasses' from the caption, okay,
00:29:58 and for example, if I want the model to learn my smile, I should just remove it. Okay, I want the
00:30:06 model to learn my face, therefore I can just remove it, and so on. However,
00:30:15 i am not going to use file words in this training, because i have found that if you
00:30:21 pick your training data set carefully, you don't need to use filewords. So, how am i going to do
00:30:29 training in this case? I am just going to create a new text file here and name it 'my special',
00:30:41 okay, let's just open it. And here I am just going to type [name]. You have to use at least [name],
00:30:47 otherwise it won't work; it will throw an error. And I am not going to use [filewords]. Also,
00:30:55 I am not going to use myself in this training. I am going to use one of my followers. He
00:31:03 had sent me his pictures. Let me show you the original pictures he had sent me. Okay, these are
00:31:10 the images he had sent me. However, I didn't use all of them. You see the images right now. There
00:31:17 are different angles and, different backgrounds. When you are doing textual inversion, you should
00:31:25 only teach one subject at a time, but if you want to combine multiple subjects, then you can train
00:31:32 multiple embeddings and you can combine all of them when generating images.
00:31:38 So which ones i did pick, let me show you. I have picked these ones, okay, and now you will notice
00:31:46 something here. You see, the background is here, is like this. You see green and some noise. Why?
00:31:55 Because i don't want model to learn background. So if multiple images containing same background,
00:32:01 I am just noising out those backgrounds. And why did I not noise out the other backgrounds? Because
00:32:07 other backgrounds are different. So you see, in your training data set, only the subject should
00:32:14 be same and all other things need to be different, like backgrounds, like clothing and other things.
00:32:21 So the training will learn only your subject, in this case the face. It will not learn the
00:32:28 background or the clothing. Okay, so let me show the original one. So in the original one you see
00:32:34 this image, this image, this image and these two images have same backgrounds. So i have edited
00:32:40 those same backgrounds with Paint.NET, which is free editing software. You can also edit with
00:32:47 Paint. How did I edit it? It is actually simple and amateur, you may say. So let's set a brush
00:32:55 size here and just, for example, change the color like this. Then I added some noise: select it
00:33:04 with a selection tool, set the tolerance from here, and go to the effects, adjustments and effects,
00:33:11 and in here you will see Distort and Frosted Glass, and when you click it, it will change the appearance.
00:33:20 You can also try other distortion. By the way, i am providing these images to you for
00:33:27 testing. Let me show you the link. So i have uploaded images into a google drive folder
00:33:33 and i am going to put the link of this into the description so you can download this data
00:33:38 set and do training and see how it performs, and whether you are able to obtain results as good as mine.
00:33:45 Okay, so i am going to change my training data set folder from pictures and i am going to use
00:33:55 example training set folder. I am going to set it in my training here. Okay, and i am going to use
00:34:05 my prompt template. Just refresh it and go to 'my special'. So what was 'my special'? It was
00:34:12 only containing [name]; it does not contain any file descriptions. I have found that this is
00:34:18 working great if you optimize your training data set like me. You can try both of them: you
00:34:26 can try with [filewords] and you can try without [filewords], and you can see how it
00:34:33 is working. Okay, do not resize images, because our images are already 512 pixels. Max steps: now,
00:34:40 this can be set to anything. I will show you a way to understand whether you have started overtraining
00:34:48 or not, so this can stay like this. Every how many steps do we want to save? Okay, this is rather
00:34:56 different from epochs in DreamBooth, if you have watched my DreamBooth videos. Here, each image
00:35:03 is one step and there is no epoch-based saving; it is step-based saving. How many training images do I
00:35:10 have? I have 10 images in total, therefore, okay, for every 10 epochs we need to set this to 100.
00:35:18 So the formula is like this: one epoch equals the number of training images in steps; 10 epochs
00:35:24 for 10 training images is 10 multiplied by 10, which is 100, so it will save every 10 epochs.
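Here is the same step/epoch bookkeeping as a small sketch (batch size 1 and gradient accumulation 1, so one step equals one image):

```python
num_training_images = 10
epochs_between_saves = 10

steps_per_epoch = num_training_images                        # 10 steps = 1 epoch
save_every_n_steps = steps_per_epoch * epochs_between_saves  # 100
print(save_every_n_steps)  # value to enter in the "save every N steps" fields
```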
00:35:31 Save images with embedding in PNG chunks: this will save the embedding info
00:35:41 in the generated preview images; I will show you. Read parameters from the text-to-image tab
00:35:48 when generating preview images. I don't want that, so it will just use the regular prompts, that is,
00:35:57 that we will see in here. Shuffle tags by comma, which means that if you use filewords,
00:36:05 the words in there will be shuffled when doing training. This can be useful; you can test it
00:36:11 out. And Drop out tags when creating prompts: this means that it will randomly drop the
00:36:18 file descriptions, the file captions, that you have used. This is, I think,
00:36:23 percentage based, so if you set it to 0.1, it will randomly drop out 10 percent. And I am not
00:36:30 going to use file words. Therefore, this will have zero effect. Okay, choose latent sampling methods.
00:36:36 I have also researched this. In the official paper, random is used. However, one of the
00:36:43 community developers proposed deterministic, and he found that deterministic works best. So
00:36:49 choose deterministic. And now we are ready, so we can start training. So I am going to click Train
00:36:57 Embedding. Okay, training has started, as you can see. It is displaying the number of
00:37:05 epochs and the number of steps. It is displaying the iteration speed, so currently it is 1.30 seconds
00:37:14 per iteration. Why? Because I am recording and that is already taking a lot of GPU power. I have an
00:37:21 RTX 3060. It has 12 gigabyte of memory. Let me also show you what is taking the memory usage.
00:37:32 You see, OBS studio is already using a lot of gpu memory and also training uses. But since they are
00:37:39 using different parts of the gpu, i think it is working fine. When we open the performance,
00:37:43 we can see that the training is using the 3d part of the gpu and obs is using the video encode part
00:37:52 of the GPU. That is how I am still able to record, but sometimes it is dropping out my voice. I
00:37:59 hope that it is currently recording very well. Okay, and i also did some overclocking to my gpu
00:38:10 by using MSI Afterburner. I have increased the core clock by 175 and i have increased
00:38:18 memory clock by 900, so this boosted my training speed like 10%. You can also do that if you want.
00:38:26 I didn't do any core voltage increasing. So it has already generated two preview images where
00:38:34 we are going to find them. Let me show you. Now it will be inside textual inversion folder. You
00:38:42 see it has just arrived, and when you open it you will see the date of the training, and
00:38:51 you will see the name of the embedding we are training, and in here you will see embeddings. These
00:38:56 are the checkpoints. So you can use any checkpoint to generate images, and these are the images
00:39:04 that it has generated. So this is the first image, and also, in image embeddings, this image
00:39:11 embedding contains the embedding info. Why is this generated? Because we checked this checkbox.
00:39:19 You will see the used prompts here. Since i didn't use any file words and i just used name,
00:39:25 it is only using this name as a prompt. And what does that mean? That means that it is only using
00:39:34 the vectors we have generated in the beginning to learn our subject, to learn the details of
00:39:40 our subject, which is the face, and we have two vectors to learn with, and also Brad Pitt is
00:39:48 based on two vectors, so why not? Our subject can also be taught to the model with two vectors.
00:39:56 Okay, just in the 20th epoch, we are already getting some similarity.
00:40:03 Actually, I already did the same training, so I already have the trained data.
00:40:12 But I am recording while training again, to explain it to you better.
00:40:21 It also shows here the estimated time, for training to be completed. This time is based on
00:40:28 100,000 steps, but we are not going to train that much. Actually, I have found that at around three
00:40:35 thousand steps we are getting very good results with the training data set I have. It will totally
00:40:42 depend on the training data set you have how many steps it takes to teach your subject.
00:40:48 I will show you the way to determine which one is best, which checkpoint is best, which number of
00:40:56 steps is best. Okay, with 30 epochs we already got a very similar image. You see, with just 30
00:41:06 epochs we are starting to get very similar images. It is starting to learn our subject very well
00:41:12 with just 30 epochs, and when we get over 100 epochs, we will get much better quality images.
00:41:20 Okay, it has been over 600 steps and over 60 epochs,
00:41:26 and we got six preview images. Since we are generating preview images and checkpoints,
00:41:32 for every 10 epoch. Now i am going to show you how you can determine whether you are overtraining or
00:41:40 not with a community developed script. So the script name is: inspect embedding training.
00:41:50 It is hosted on github. It's a public project. I will put the link of this project to the
00:41:55 description as well. Everything, every link, will be put to the description. So check out
00:41:59 the video description and in here, just click code and download as zip. Okay, it is downloaded.
00:42:06 When you open it you will see the inspect embedding training files. Extract them into your textual
00:42:14 inversion / tutorial training folder, as I have shown, so you will see these files there. To extract it,
00:42:21 just drag and drop. Why we are extracting it in here? Because we are going to analyze the loss.
00:42:28 And so the loss, what is loss? You are always seeing the loss here. The number value is here:
00:42:37 loss is the penalty for a bad prediction. That is, that is loss is a number indicating how bad
00:42:43 the model's prediction was on a single example. If the model's prediction is perfect, the loss is
00:42:49 zero. Otherwise the loss is greater. In our case we can think that as the model generated image,
00:42:56 how likely, how close to our training subjects, training images. So if you get a zero loss,
00:43:05 that means that model is learning very good, okay. If your loss is too high, that means that
00:43:11 your model is not learning. Now, with this script we have extracted here, we are going to see the
00:43:20 loss. And how are we going to use this script? This script requires torch installation and
00:43:28 the torch is already installed in our web ui folder, inside venv folder, virtual environment,
00:43:36 and inside here scripts. So we are going to use the python exe here to do that. First copy the
00:43:43 path of this. Open a notepad file like this: okay, put quotation marks and just type python exe like
00:43:54 this: okay, then we are going to get the path of the file. Let me show you. The script file
00:44:02 is in this folder. So, with quotation marks, just copy and paste it in here and type the
00:44:12 script file name like this: then open a new cmd window by typing like this:
00:44:19 okay, let me some zoom in, copy and paste the path like this, the code, and just hit enter
00:44:28 and you will see it has generated some info for us: the learning rate at each step,
00:44:34 a loss jpg, a vector jpg and the average vector strength. So let's open our folder in
00:44:42 here and we will see the files. When we open the loss file we are going to see a graph like this:
00:44:50 the average loss is below 0.2, which means it is learning very well. The closer it is to 0,
00:44:58 the better; the closer it is to 1, the worse. So currently we are able to
00:45:04 learn very well. Now I will show you how to determine whether you are overtraining or not.
00:45:13 To do that, we are going to add a parameter here, --folder, and just give the folder of the
00:45:21 embedding files here. Just copy-paste it again, do not forget the quotation marks, and open a new
00:45:28 cmd window. Just copy and paste it, hit enter. It will calculate the average strength of the
00:45:37 vectors, and when this strength is over 0.2, that usually means that you have started overtraining. How
00:45:46 do we know? According to the developer of this script, if the average
00:45:54 strength of all the vectors is greater than 0.2, the embedding starts to become inflexible. That
00:46:02 means overtraining. So you will not be able to stylize your trained subject. So you won't
00:46:14 be able to get good images like this if you overtrain, that is, if the strength
00:46:20 of the vectors becomes too high. And what was the vector strength? It is simple: when we opened
00:46:29 the embedding inspector tab, we were able to see the values of the vectors. So this
00:46:37 strength means the average of these values, and when the average of these values is
00:46:42 over 0.2, that means that you are starting to overtrain. You need to check this
00:46:50 to determine that.
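If you prefer to check a single checkpoint directly, here is a minimal sketch of such a strength check. This is not the Inspect-Embedding-Training script itself; it assumes the usual Automatic1111 embedding .pt layout, where the learned tensor is stored under string_to_param, and it takes the strength to be the mean absolute value of the weights:

```python
import torch

def average_vector_strength(path: str) -> float:
    # Load the embedding checkpoint and pull out the learned vectors.
    data = torch.load(path, map_location="cpu")
    vectors = next(iter(data["string_to_param"].values()))  # shape: [num_vectors, dim]
    return vectors.abs().mean().item()

# Hypothetical checkpoint name from this training run.
print(average_vector_strength("tutorial training-500.pt"))  # above ~0.2 suggests overtraining
```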
00:46:58 By the way, it is said that DreamBooth is best for teaching faces, and in the official paper of textual inversion, the researchers
00:47:04 have used, as you can see, objects like this, or they have trained on a style, let me show you, like
00:47:13 here. However, as I have just demonstrated to you, these textual embeddings are
00:47:22 also very good, very successful, for teaching faces as well, and for objects, of course,
00:47:29 they work very well too. And for styles, I think textual inversion, the text embeddings,
00:47:36 is much better than DreamBooth. So if you want to teach objects or styles, then i suggest you
00:47:44 to use textual inversion. Actually, for faces, I think the textual inversion of Automatic1111 is
00:47:51 working very well too. And with DreamBooth, to obtain very good results you need to merge your
00:47:59 learned subject into a new model, which I have shown in my video. So if you use DreamBooth,
00:48:05 you should inject your trained subject into a good custom model to obtain very good images.
00:48:10 But with textual inversion, you can already obtain very good images. Okay, we are over 170 epochs, and
00:48:20 while training is going on, I will show you the difference between DreamBooth, textual inversion,
00:48:27 LoRA and hypernetworks. One of the community members on Reddit, use_excalidraw, prepared
00:48:36 an infographic like this, and it is very useful. So in DreamBooth, we are modifying the weights of
00:48:43 the model itself. You already know by now that all of the prompt words we use
00:48:52 have vectors, each of them, and these vectors are all getting modified in
00:48:59 DreamBooth. The token we selected for DreamBooth is also getting modified, and in
00:49:06 DreamBooth we are not able to add a new vector; we have to use one of the existing vectors
00:49:15 of the model. Therefore, we are selecting one of the existing tokens in the model,
00:49:20 such as sks or ohwx. So in DreamBooth we are basically modifying, altering, the model itself.
00:49:32 Okay, in Textual Inversion we are adding a new token. Actually, this is displayed incorrectly
00:49:39 because it is generating a unique new vector which does not exist in the model, and we are modifying
00:49:50 the weights of these new vectors. So when we set the vector count as two, it is actually using two
00:49:58 unique new tokens. So it is modifying two vectors. If we set the vector count to 10, it is using 10
00:50:06 unique tokens. It is being specially treated; it is adding 10 new vectors and it is not modifying
00:50:15 any of the existing vectors of the model. So if we set the vector count to 10, actually,
00:50:25 when we generate an image in here, it will use 10 vectors. It will use 10 tokens out
00:50:31 of the 75 tokens we have; we have a 75-token limit. So this is how it works. Also, if you use 10 vectors,
00:50:41 you will see that you are getting very bad results for a face. I have made tests. Okay,
00:50:47 in LoRA, it is very similar to DreamBooth: it is modifying the existing vectors
00:50:55 of the model. I have found that LoRA is inferior to DreamBooth, but it just uses
00:51:05 less VRAM and it is faster; therefore, people are choosing it. However, for quality, DreamBooth is
00:51:12 better, as shown here. And the hypernetworks: hypernetworks don't have an official academic
00:51:21 paper. I think they were built upon leaked code, and this is the least successful method. It
00:51:29 gives the worst quality, so just don't waste time with it; I don't suggest using it.
00:51:38 So in hypernetworks, the original weights, the original vectors of the model, are not modified,
00:51:44 but at inference time - inference means that when you generate an image from text to image,
00:51:50 that is inference - they are just getting swapped in. So you see there are some images which show
00:51:58 training sample, apply noise, compare, and there is the loss. So this is basically how the model
00:52:06 is learning. Of course, there are a lot of details; if you are interested in them,
00:52:11 you can just read the official paper, but it is very hard to understand, complex things.
00:52:19 Okay, we are over 200 epochs, so we have 20 example images, and the last one is extremely
00:52:27 similar to our original training set, as you can see. So let's also check the strength
00:52:34 of the training vectors. So I am just hitting the up arrow on my keyboard, and it is
00:52:43 retyping the last executed command, and I hit enter. Okay, so our strength, the average strength, is 0.13,
00:52:53 actually almost 0.14. We are getting close to 0.2. After 0.2, we can assume we have started overtraining.
00:53:02 Of course, this would depend on your training data set, but it is an indication according
00:53:08 to the experience of this developer. It also makes sense, because as the strength of the vector
00:53:16 increases, it will override the other vectors. You see, since they are all floating point
00:53:27 numbers, the bigger numbers usually make the smaller numbers ineffective.
00:53:36 This is how machine learning usually works, depending on the chosen algorithms. They
00:53:41 are extremely complex stuff, but this is one of the, let's say, common principles in many
00:53:49 of the numerical-weight-based machine learning algorithms. Therefore, it also makes sense.
00:53:57 Okay, we are over 500 epochs at the moment. So let me show you the generated sample images. These
00:54:05 are the sample images; they are already very similar, and the latest one, you see, looks like it is getting
00:54:11 overtrained. So let's check with the script we have. Just hit the up arrow and hit enter,
00:54:20 and you see, we are now over 0.2 strength. Therefore, I am going to cancel the training, and
00:54:28 now I will show you how to use these embeddings. But before doing that, first let's set the newest
00:54:36 VAE file to generate better quality images. To do that, let's go to, let me find it,
00:54:49 okay, the Stable Diffusion tab in the settings, and in here, you see,
00:54:54 SD VAE is set to automatic. I am going to select the one we put there. Let's apply the settings, okay,
00:55:02 and then we will reload the UI. Okay, settings applied and the UI is reloaded. So how are we going to
00:55:11 use these generated embeddings? It is easy. First let's enter to our textual inversion directory and
00:55:21 inside here let's go to the embeddings folder. Let me show you what kind of path it is. I know
00:55:29 that it is looking small, so this is where I have installed my automatic1111. This is the
00:55:40 main folder: textual inversion. This is the date of the training, when it was started. This is the
00:55:46 embedding name that I have given and this is the folder where the embedding checkpoints are saved.
00:55:53 When we analyze the weights, we see how they have changed. So I am going to pick about 20
00:56:01 of them to compare. How am I going to do that? I will pick them 200 steps apart, like this:
00:56:10 okay, I have selected 24. Right click, copy. By the way, for selecting each one of them, I have
00:56:18 used the Ctrl key. You can also select all of them; that is just fine. Then move to the main
00:56:23 installation folder, and in here you will see the embeddings folder. Go there; I'm just going to delete
00:56:28 the original one, and I am pasting in the checkpoint files. So how are we going to use them? Just
00:56:37 type their name like this: this is the equivalent of ohwx in the DreamBooth tutorials that we have
00:56:45 and let's see. Currently it says that it is using seven tokens, but this is not correct,
00:56:53 actually; it should be using just two. Okay, maybe it didn't refresh. Let's do a generation.
00:57:03 Okay, we got our picture. I think it was counting seven because it had not picked up the embedding yet. Okay,
00:57:15 yeah, so okay, now it is fixed. Now you see it is using only two tokens. Why? Because now it has loaded
00:57:25 the embedding by its file name, and our embedding was composed of two vectors. Therefore,
00:57:34 it is using two vectors. However, if this was not our embedding name, if it was just a
00:57:42 regular prompt, if we go to the tokenizer we can see it was going to take, let me show you, one,
00:57:50 two, three, four, five, six, seven, eight tokens. You see each number is a token. This is a token.
00:57:58 So it was going to use eight tokens, but since it is an embedding name and the embedding is only two
00:58:05 vectors, it is using only two tokens, because in the background, in the technical details,
00:58:13 it is composed of two unique tokens, since we set the vector count to 2. So for each vector a
00:58:20 token is generated, and with a textual embedding we are able to insert, to generate,
00:58:28 new tokens, unlike DreamBooth; DreamBooth can only use the existing tokens. Okay, so now we are going
00:58:35 to generate a test case using the X/Y plot. I have tested CFG values and the prompt strength.
00:58:45 By prompt strength I mean the prompt attention emphasis, and it is explained in the wiki
00:58:51 of the Automatic1111 Stable Diffusion web UI. So you see, when you use parentheses like this,
00:58:57 it increases the attention by a factor of 1.1. You can also set the attention directly like
00:59:03 this. So I have tested the prompt attention with embeddings, and it always resulted in bad quality
00:59:12 for me, but you can still test with them. I also played with higher CFG values. They were also
00:59:18 not very good, but now I will show you how to test each one of the embeddings. So instead of
00:59:26 manually typing each one of the names, I have prepared a public JSFiddle script. I will
00:59:34 also share the link of this script so you will be able to use it too. So the starting checkpoint:
00:59:41 the starting checkpoint is 400, so let's set it as 400. Our increment is 200, as we have selected,
00:59:48 and our embedding name is the tutorial training name. Okay, so let's just type it in here. Then just click run,
00:59:58 and you see it has generated all of the names for me. I have names up to 5000. I copy them with Ctrl+
01:00:07 C or copy. Then we are going to paste them in here in the X/Y plot, and in here select Prompt S/R.
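If you would rather not use the JSFiddle, the same comma-separated list can be produced with a few lines of Python (a sketch; the name, start, end and step values are just the ones used in the video):

```python
name = "tutorial training"
start, end, step = 400, 5000, 200

names = [f"{name}-{i}" for i in range(start, end + 1, step)]
print(", ".join(names))
# tutorial training-400, tutorial training-600, ..., tutorial training-5000
```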
01:00:15 Then we need to set a keyword. Okay, let's set a keyword such as kw, or test; it is not important,
01:00:26 you can simply set anything here. Now I will copy and paste some good prompts. To do
01:00:33 that, I will use PNG info: drag and drop. Okay, I have lots of experiments. As you can see,
01:00:42 these experiments are from Protogen training with textual embeddings. It was
01:00:47 also extremely successful for my face. Okay, let's pick from today's experimentation, which is under,
01:00:59 okay, under here. Let's just pick one of them. Okay, now I am going to copy-paste
01:01:06 this into the text-to-image tab. You see, when you use PNG info, it shows all of
01:01:11 the parameters of the selected picture, if it was generated by the web
01:01:19 UI, by default. Okay, so you see: face photo of, let me zoom in like this,
01:01:27 testp2400. This is from my previous embedding, so it is currently 60 tokens. Now I am going to
01:01:34 replace this with my test keyword. It will be replaced with all of the tutorial
01:01:43 training names, which are my embedding names, via Prompt S/R as the parameter. You see,
01:01:51 now it is reduced to 55. You can try CFG values actually, if you want. Or you can try the prompt
01:02:01 strength, the prompt emphasis. To do that, just add another keyword here, as another kw. Okay, and
01:02:11 let's set it as a prompt strength axis, again using Prompt S/R, okay, and
01:02:19 replace it with 1.0 and 1.1, for example. So you will see the results of different prompt emphasis,
01:02:30 attention emphasis, as explained here. You can test them. You can also test the CFG values;
01:02:35 it is totally up to you. Do not check this box, because you want to see the
01:02:42 same-seed images. Actually, since these are different checkpoints, you are not going to
01:02:47 get the same image anyway. By the way, when we use the command line argument here, let me show you: when
01:02:58 we use xformers, even if you use the same seed, you will not get the same image, because
01:03:07 since this does a lot of optimization, it will not allow you to get exactly the same image,
01:03:14 even if you use the same model and the same seed. And also, there is one more thing:
01:03:21 actually there are two more options. If you have low vram. Let me show you. So in command
01:03:29 line arguments page of the wiki, if we search for VRAM, let's see, like this, you will see there
01:03:37 is medvram (medium VRAM) and lowvram (low VRAM). So if you add these parameters to your command line
01:03:45 arguments like this, let me show, okay, medvram or lowvram, it will allow you to run the web
01:03:55 UI on a lower-VRAM GPU, and with lowvram and medvram you can still generate images with a very
01:04:04 low amount of GPU memory. However, when you use lowvram, it will not allow you to do training. So you can
01:04:13 add medvram to your command line arguments, and this will allow you to do training, textual embedding,
01:04:20 textual inversion training, on a GPU with lower VRAM. Okay, okay, now we are ready. I'm not going
01:04:30 to test the strength, so I'm only going to test the different embedding checkpoints. Okay: draw
01:04:39 legend, include separate images, keep minus one for seeds. Okay, we are ready. I'm not going to apply
01:04:45 restore faces or tiling or high resolution fix, okay, so let's just click and see the results.
01:04:56 Oh, by the way, to get a better idea, I am setting the batch size to eight. So in each
01:05:03 generation, it will generate eight images for each one of the embedding checkpoints.
01:05:14 Okay, let me also show you the speed. So it is going to generate 25 grids, because we have
01:05:20 selected 25 checkpoints, and each one will be eight images. Therefore, it will generate 200 images.
01:05:27 Currently it shows the speed as 5.73 seconds per iteration. Actually, one iteration is
01:05:37 currently eight images per step, because we are generating eight images in parallel as a batch.
01:05:46 Therefore, it is actually eight times faster than regular single-image generation.
01:05:54 Okay, since it was going to take one hour, and it's already 3 am and I want to finish this
01:06:00 video today, I am going to show the results of my previous training with exactly the same data set
01:06:07 and exactly the same settings, and you are going to get this kind of output after generating grid
01:06:14 images. It is actually, let me see, 90 megabytes. So you see, these are the different checkpoints,
01:06:24 as you can see, and from these images you need to decide which one looks best. For example, I have
01:06:33 picked, in this example, the testp-2400 step count, which means, from 10 training images, 240 epochs,
01:06:47 and I have generated a lot of images from this checkpoint, and actually they are the ones that I have
01:06:54 shown in the beginning of the video, these ones. So these ones were generated from the testp-2400
01:07:07 step checkpoint, as you can see. Also, the name is written in the image description. Let me
01:07:12 show you one of the examples and see how good it is. It is a 3D rendering of the person we
01:07:22 trained, and you see the quality. This is the raw quality. I didn't upscale it or do anything, and
01:07:28 it is just amazing. Let's just upscale it and see how it looks, in the bigger resolution.
01:07:34 Okay, to do that, let's go to the extras tab and in here i will drag and drop it one moment.
01:07:45 Okay, this image, okay, and then I am going to use R-ESRGAN 4x+. I find this the best one. Actually,
01:07:57 you can also try the anime one for this, and let's just upscale it four times.
01:08:05 Okay, the upscale is done. And look at the quality. It is just amazingly stylized quality, and these
01:08:13 are the original images. You see how good it is. It is exactly the same person and a
01:08:19 hundred percent stylized as we wanted. If you asked some artist to draw this,
01:08:25 I think the artist would only draw it about this well. And I also didn't generate too many images
01:08:32 because i had little time. I have been doing a lot of research, experimentation to explain
01:08:37 to you everything in this video with as much as possible details. Now, how you can
01:08:45 combine multiple embeddings in a single query. Let's say you have trained multiple persons or
01:08:51 multiple objects and you want to use them, or you have trained multiple styles and you want to apply
01:08:57 them in the same query. It is just so easy. All you need to do is just type their names. So
01:09:08 if you add one here like this, and if you add this one, they will both be used;
01:09:16 since these two are using the same tokens, both of their strengths
01:09:25 will be applied, both of their weights and vectors will be applied. And if they were different
01:09:32 embedding files, both of them would also be applied. So this is how you use embeddings
01:09:41 in the text-to-image tab. Hopefully, I plan to work on an experiment on teaching a style and an
01:09:50 object and make another video about them, but the principles are the same. It may just require
01:09:57 preparing a good training data set. You see, this training data set is not even good. The
01:10:04 images are blurry, not high quality. The lighting is not very good. As you can see,
01:10:10 this is a blurry image actually, and this is also a blurry image and you will get the link of this
01:10:16 data set to see on your computer as well. However, even though these are not very good, the results
01:10:24 are just amazing. As you can see, textual embeddings are very strong for teaching faces as well,
01:10:31 and you can do training on the official pruned model, or you can do training
01:10:38 on Protogen, a custom, very good model, or on SD 2.1. And one advantage of
01:10:47 textual inversion over DreamBooth is that, for example, I did DreamBooth training on Protogen and
01:10:53 it was a failure. However, it was a great success with textual inversion. By the way,
01:11:00 the grid images will be saved under the outputs folder, inside the text-to-image grids folder like this, when
01:11:07 you do X/Y plot generation, and regular outputs are saved in the text-to-image images folder like this.
01:11:14 And this is all for today. I hope you have enjoyed it. I have worked a lot on preparing
01:11:24 this tutorial. I have read a lot of technical documents. I have done a lot of research
01:11:30 and experimentation, so please subscribe. If you join and support us, I appreciate it. Like the
01:11:38 video, share it, and if you have any questions, just join our Discord channel. To do that, go to
01:11:44 our About tab and in here you will see the official Discord channel; just click it. And if you support
01:11:49 us on Patreon, I would appreciate that very much. So far, we have 10 patrons and I thank them a
01:11:57 lot. They keep me motivated to prepare more and better videos. Hopefully, see you in another video.