diff --git a/19-slm/README.md b/19-slm/README.md
index 7057ab4ae7..2425e97ef4 100644
--- a/19-slm/README.md
+++ b/19-slm/README.md
@@ -32,10 +32,10 @@ In this lesson, we hope to introduce the knowledge of SLM and combine it with Mi

 By the end of this lesson, you should be able to answer the following questions:

-- What is SLM
-- What is the difference about SLM and LLM
-- What is Microsoft Phi-3/3.5 Family
-- How to inference Microsoft Phi-3/3.5 Family
+- What is SLM?
+- What is the difference between SLM and LLM?
+- What is the Microsoft Phi-3/3.5 Family?
+- How to run inference with the Microsoft Phi-3/3.5 Family?

 Ready? Let's get started.

@@ -74,7 +74,7 @@ The reduced size of SLMs affords them a significant advantage in terms of infere

 In summary, while both LLMs and SLMs share a foundational basis in machine learning, they differ significantly in terms of model size, resource requirements, contextual understanding, susceptibility to bias, and inference speed. These distinctions reflect their respective suitability for different use cases, with LLMs being more versatile but resource-heavy, and SLMs offering more domain-specific efficiency with reduced computational demands.

-***Note:In this chapter, we will introduce SLM using Microsoft Phi-3 / 3.5 as an example.***
+***Note: In this lesson, we will introduce SLM using Microsoft Phi-3 / 3.5 as an example.***

 ## Introduce Phi-3 / Phi-3.5 Family

@@ -86,7 +86,7 @@ Mainly for text generation, chat completion, and content information extraction,

 **Phi-3-mini**

-The 3.8B language model is available on Microsoft Azure AI Studio, Hugging Face, and Ollama. Phi-3 models significantly outperform language models of equal and larger sizes on key benchmarks (see benchmark numbers below, higher numbers are better). Phi-3-mini outperforms models twice its size, while Phi-3-small and Phi-3-medium outperform larger models, including GPT-3.5
+The 3.8B language model is available on Microsoft Azure AI Studio, Hugging Face, and Ollama. Phi-3 models significantly outperform language models of equal and larger sizes on key benchmarks (see benchmark numbers below, higher numbers are better). Phi-3-mini outperforms models twice its size, while Phi-3-small and Phi-3-medium outperform larger models, including GPT-3.5.

 **Phi-3-small & medium**

@@ -96,8 +96,7 @@ The Phi-3-medium with 14B parameters continues this trend and outperforms the Ge

 **Phi-3.5-mini**

-We can think of it as an upgrade of Phi-3-mini. While the parameters remain unchanged, it improves the ability to support multiple languages(
-Support 20+ languages:Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian) and adds stronger support for long context.
+We can think of it as an upgrade of Phi-3-mini. While the parameters remain unchanged, it improves the ability to support multiple languages (support 20+ languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian) and adds stronger support for long context.

 Phi-3.5-mini with 3.8B parameters outperforms language models of the same size and is on par with models twice its size.

@@ -115,14 +114,14 @@ Phi-3-vision, with only 4.2B parameters, continues this trend and outperforms la

 Phi-3.5-Vision is also an upgrade of Phi-3-Vision, adding support for multiple images. You can think of it as an improvement in vision, not only can you see pictures, but also videos.

-Phi-3.5-vision outperforms larger models such as Claude-3.5 Sonnet and Gemini 1.5 Flash across OCR, table and chart understanding tasks and on par on general visual knowledge reasoning tasks.Support multi-frame input, i.e., perform reasoning on multiple input images
+Phi-3.5-vision outperforms larger models such as Claude-3.5 Sonnet and Gemini 1.5 Flash across OCR, table and chart understanding tasks and on par on general visual knowledge reasoning tasks. Support multi-frame input, i.e., perform reasoning on multiple input images

 ### Phi-3.5-MoE

 ***Mixture of Experts(MoE)*** enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.

-Phi-3.5-MoE comprises 16x3.8B expert modules.Phi-3.5-MoE with only 6.6B active parameters achieves a similar level of reasoning, language understanding, and math as much larger models
+Phi-3.5-MoE comprises 16x3.8B expert modules. Phi-3.5-MoE with only 6.6B active parameters achieves a similar level of reasoning, language understanding, and math as much larger models

 We can use the Phi-3/3.5 Family model based on different scenarios. Unlike LLM, you can deploy Phi-3/3.5-mini or Phi-3/3.5-Vision on edge devices.

@@ -133,13 +132,13 @@ We hope to use Phi-3/3.5 in different scenarios. Next, we will use Phi-3/3.5 bas

 ![phi3](./img/phi3.png?WT.mc_id=academic-105485-koreyst)

-### Inference difference Cloud's API
+### Inference via Cloud APIs

 **GitHub Models**

 GitHub Models is the most direct way. You can quickly access the Phi-3/3.5-Instruct model through GitHub Models. Combined with the Azure AI Inference SDK / OpenAI SDK, you can access the API through code to complete the Phi-3/3.5-Instruct call. You can also test different effects through Playground.

-- Demo:Comparison of the effects of Phi-3-mini and Phi-3.5-mini in Chinese scenarios
+- Demo: Comparison of the effects of Phi-3-mini and Phi-3.5-mini in Chinese scenarios

 ![phi3](./img/gh1.png?WT.mc_id=academic-105485-koreyst)

@@ -153,7 +152,7 @@ Or if we want to use the vision and MoE models, you can use Azure AI Studio to c

 **NVIDIA NIM**

-In addition to the cloud-based Model Catalog solutions provided by Azure and GitHub, you can also use [Nivida NIM](https://developer.nvidia.com/nim?WT.mc_id=academic-105485-koreyst) to complete related calls. You can visit NIVIDA NIM to complete the API calls of the Phi-3/3.5 Family. NVIDIA NIM (NVIDIA Inference Microservices) is a set of accelerated inference microservices designed to help developers deploy AI models efficiently across various environments, including clouds, data centers, and workstations.
+In addition to the cloud-based Model Catalog solutions provided by Azure and GitHub, you can also use [NVIDIA NIM](https://developer.nvidia.com/nim?WT.mc_id=academic-105485-koreyst) to complete related calls. You can visit NVIDIA NIM to complete the API calls of the Phi-3/3.5 Family. NVIDIA NIM (NVIDIA Inference Microservices) is a set of accelerated inference microservices designed to help developers deploy AI models efficiently across various environments, including clouds, data centers, and workstations.

 Here are some key features of NVIDIA NIM:

@@ -165,10 +164,10 @@ Here are some key features of NVIDIA NIM:

 NIM is part of NVIDIA AI Enterprise, which aims to simplify the deployment and operationalization of AI models, ensuring they run efficiently on NVIDIA GPUs.

-- Demo: Using Nividia NIM to call Phi-3.5-Vision-API [[Click this link](./python/Phi-3-Vision-Nividia-NIM.ipynb?WT.mc_id=academic-105485-koreyst)]
+- Demo: Using NVIDIA NIM to call Phi-3.5-Vision-API [[Click this link](./python/Phi-3-Vision-Nividia-NIM.ipynb?WT.mc_id=academic-105485-koreyst)]

-### Inference Phi-3/3.5 in local env
+### Running Phi-3/3.5 Locally

 Inference in relation to Phi-3, or any language model like GPT-3, refers to the process of generating responses or predictions based on the input it receives. When you provide a prompt or question to Phi-3, it uses its trained neural network to infer the most likely and relevant response by analyzing patterns and relationships in the data it was trained on.

 **Hugging Face Transformer**
@@ -185,17 +184,17 @@ Hugging Face Transformers is a powerful library designed for natural language pr

 5. **Community and Resources**: Hugging Face has a vibrant community and extensive documentation, tutorials, and guides to help users get started and make the most of the library. [official documentation](https://huggingface.co/docs/transformers/index?WT.mc_id=academic-105485-koreyst) or their [GitHub repository](https://github.com/huggingface/transformers?WT.mc_id=academic-105485-koreyst).

-This is the most commonly used method, but it also requires GPU acceleration. After all, scenes such as Vision and MoE require a lot of calculations, which will be very limited in the CPU if they are not quantized.
+This is the most commonly used method, but it also requires GPU acceleration. After all, scenarios such as Vision and MoE require a lot of calculations, which will be very slow on CPU if they are not quantized.

-- Demo:Using Transformer to call Phi-3.5-Instuct [Click this link](./python/phi35-instruct-demo.ipynb?WT.mc_id=academic-105485-koreyst)
+- Demo: Using Transformer to call Phi-3.5-Instruct [Click this link](./python/phi35-instruct-demo.ipynb?WT.mc_id=academic-105485-koreyst)

-- Demo:Using Transformer to call Phi-3.5-Vision[Click this link](./python/phi35-vision-demo.ipynb?WT.mc_id=academic-105485-koreyst)
+- Demo: Using Transformer to call Phi-3.5-Vision [Click this link](./python/phi35-vision-demo.ipynb?WT.mc_id=academic-105485-koreyst)

-- Demo:Using Transformer to call Phi-3.5-MoE[Click this link](./python/phi35_moe_demo.ipynb?WT.mc_id=academic-105485-koreyst)
+- Demo: Using Transformer to call Phi-3.5-MoE [Click this link](./python/phi35_moe_demo.ipynb?WT.mc_id=academic-105485-koreyst)


 **Ollama**

-[Ollama](https://ollama.com/?WT.mc_id=academic-105485-koreyst) is a platform designed to make it easier to run large language models (LLMs) locally on your machine. It supports various models like Llama 3.1, Phi 3, Mistral, and Gemma 2, among others. The platform simplifies the process by bundling model weights, configuration, and data into a single package, making it more accessible for users to customize and create their own models. Ollama is available for macOS, Linux, and Windows. It’s a great tool if you’re looking to experiment with or deploy LLMs without relying on cloud services. Ollama is the most direct way, you just need to execute the following statement.
+[Ollama](https://ollama.com/?WT.mc_id=academic-105485-koreyst) is a platform designed to make it easier to run large language models (LLMs) locally on your machine. It supports various models like Llama 3.1, Phi 3, Mistral, and Gemma 2, among others. The platform simplifies the process by bundling model weights, configuration, and data into a single package, making it more accessible for users to customize and create their own models. Ollama is available for macOS, Linux, and Windows. It’s a great tool if you’re looking to experiment with or deploy LLMs without relying on cloud services. Ollama is the most direct way, you just need to execute the following command.

 ```bash
@@ -210,7 +209,7 @@ ollama run phi3.5
 [ONNX Runtime](https://github.com/microsoft/onnxruntime-genai?WT.mc_id=academic-105485-koreyst) is a cross-platform inference and training machine-learning accelerator. ONNX Runtime for Generative AI (GENAI) is a powerful tool that helps you run generative AI models efficiently across various platforms.

 ## What is ONNX Runtime?
-ONNX Runtime is an open-source project that enables high-performance inference of machine learning models. It supports models in the Open Neural Network Exchange (ONNX) format, which is a standard for representing machine learning models.ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms
+ONNX Runtime is an open-source project that enables high-performance inference of machine learning models. It supports models in the Open Neural Network Exchange (ONNX) format, which is a standard for representing machine learning models.ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.

 ## What is Generative AI?
 Generative AI refers to AI systems that can generate new content, such as text, images, or music, based on the data they have been trained on. Examples include language models like GPT-3 and image generation models like Stable Diffusion. ONNX Runtime for GenAI library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
@@ -302,7 +301,7 @@ while not generator.is_done():
     new_token = generator.get_next_tokens()[0]
-    code += tokenizer_stream.decode(new_token)
+    output = tokenizer_stream.decode(new_token)
     print(tokenizer_stream.decode(new_token), end='', flush=True)
diff --git a/20-mistral/README.md b/20-mistral/README.md
index 747731437a..eace4247e5 100644
--- a/20-mistral/README.md
+++ b/20-mistral/README.md
@@ -5,14 +5,14 @@ This lesson will cover:
 - Exploring the different Mistral Models
 - Understanding the use-cases and scenarios for each model
-- Code samples show the unique features of each model.
+- Exploring code samples that show the unique features of each model.

 ## The Mistral Models

 In this lesson, we will explore 3 different Mistral models: **Mistral Large**, **Mistral Small** and **Mistral Nemo**.

-Each of these models is available free on the Github Model marketplace. The code in this notebook will be using these models to run the code. Here are more details on using Github Models to [prototype with AI models](https://docs.github.com/en/github-models/prototyping-with-ai-models?WT.mc_id=academic-105485-koreyst).
+Each of these models is available free on the GitHub Model marketplace. The code in this notebook will be using these models to run the code. Here are more details on using GitHub Models to [prototype with AI models](https://docs.github.com/en/github-models/prototyping-with-ai-models?WT.mc_id=academic-105485-koreyst).

 ## Mistral Large 2 (2407)

@@ -92,7 +92,7 @@ d = text_embeddings.shape[1]
 index = faiss.IndexFlatL2(d)
 index.add(text_embeddings)

-question = "저자가 대학에 오기 전에 주로 했던 두 가지 일은 무엇이었나요??"
+question = "저자가 대학에 오기 전에 주로 했던 두 가지 일은 무엇이었나요?"

 question_embedding = embed_client.embed(
     input=[question],
@@ -214,7 +214,7 @@ It is viewed as an upgrade to the earlier open source LLM from Mistral, Mistral

 Some other features of the NeMo model are:

-- *More efficient tokenization:* This model using the Tekken tokenizer over the more commonly used tiktoken. This allows for better performance over more languages and code.
+- *More efficient tokenization:* This model uses the Tekken tokenizer over the more commonly used tiktoken. This allows for better performance over more languages and code.

 - *Finetuning:* The base model is available for finetuning. This allows for more flexibility for use-cases where finetuning may be needed.

@@ -225,7 +225,7 @@ Some other features of the NeMo model are:

 In this sample, we will look at how Mistral NeMo handles tokenization compared to Mistral Large.

-Both samples take the same prompt but you should see that NeMo returns back less tokens vs Mistral Large.
+Both samples take the same prompt but you should see that NeMo returns fewer tokens than Mistral Large.

 ```bash
 pip install mistral-common
@@ -245,7 +245,7 @@ from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

 # Load Mistral tokenizer

-model_name = "open-mistral-nemo "
+model_name = "open-mistral-nemo"

 tokenizer = MistralTokenizer.from_model(model_name)

@@ -267,7 +267,7 @@ tokenized = tokenizer.encode_chat_completion(
         "format": {
             "type": "string",
             "enum": ["celsius", "fahrenheit"],
-            "description": "The temperature unit to use. Infer this from the users location.",
+            "description": "The temperature unit to use. Infer this from the user's location.",
         },
     },
     "required": ["location", "format"],
@@ -323,7 +323,7 @@ tokenized = tokenizer.encode_chat_completion(
         "format": {
             "type": "string",
             "enum": ["celsius", "fahrenheit"],
-            "description": "The temperature unit to use. Infer this from the users location.",
+            "description": "The temperature unit to use. Infer this from the user's location.",
         },
     },
     "required": ["location", "format"],
@@ -343,6 +343,6 @@ tokens, text = tokenized.tokens, tokenized.text
 print(len(tokens))
 ```

-## Learning does not stop here, continue the Journey
+## Learning does not stop here, continue the journey

 After completing this lesson, check out our [Generative AI Learning collection](https://aka.ms/genai-collection?WT.mc_id=academic-105485-koreyst) to continue leveling up your Generative AI knowledge!
diff --git a/21-meta/README.md b/21-meta/README.md
index a7658110a8..3a91551e29 100644
--- a/21-meta/README.md
+++ b/21-meta/README.md
@@ -11,7 +11,7 @@ This lesson will cover:

 ## The Meta Family of Models

-In this lesson, we will explore 2 models from the Meta family or "Llama Herd" - Llama 3.1 and Llama 3.2
+In this lesson, we will explore 2 models from the Meta family or "Llama Herd" - Llama 3.1 and Llama 3.2.

 These models come in different variants and are available on the GitHub Model marketplace. Here are more details on using GitHub Models to [prototype with AI models](https://docs.github.com/en/github-models/prototyping-with-ai-models?WT.mc_id=academic-105485-koreyst).

@@ -27,13 +27,13 @@ Model Variants:

 At 405 Billion Parameters, Llama 3.1 fits into the open source LLM category.

-The mode is an upgrade to the earlier release Llama 3 by offering:
+The model is an upgrade to the earlier release Llama 3 by offering:

 - Larger context window - 128k tokens vs 8k tokens
 - Larger Max Output Tokens - 4096 vs 2048
 - Better Multilingual Support - due to increase in training tokens

-These enables Llama 3.1 to handle more complex use cases when building GenAI applications including:
+These enable Llama 3.1 to handle more complex use cases when building GenAI applications including:
 - Native Function Calling - the ability to call external tools and functions outside of the LLM workflow
 - Better RAG Performance - due to the higher context window
 - Synthetic Data Generation - the ability to create effective data for tasks such as fine-tuning
@@ -45,7 +45,7 @@ Llama 3.1 has been fine-tuned to be more effective at making function or tool ca
 - **Brave Search** - Can be used to get up-to-date information like the weather by performing a web search
 - **Wolfram Alpha** - Can be used for more complex mathematical calculations so writing your own functions is not required.

-You can also create your own custom tools that LLM can call.
+You can also create your own custom tools that the LLM can call.

 In the code example below:

@@ -53,7 +53,7 @@ In the code example below:
 - Send a user prompt that asks about the weather in a certain city.
 - The LLM will respond with a tool call to the Brave Search tool which will look like this `<|python_tag|>brave_search.call(query="Stockholm weather")`

-*Note: This example only makes the tool call, if you would like to get the results, you will need to create a free account on the Brave API page and define the function itself`
+*Note: This example only makes the tool call, if you would like to get the results, you will need to create a free account on the Brave API page and define the function itself.

 ```python
 import os
@@ -95,7 +95,7 @@ print(response.choices[0].message.content)

 ## Llama 3.2

-Despite being an LLM, one limitation that Llama 3.1 has is multimodality. That is, being able to use different types of input such as images as prompts and providing responses. This ability is one of the main features of Llama 3.2. These features also include:
+Despite being an LLM, one limitation of Llama 3.1 is its lack of multimodality. That is, the inability to use different types of input such as images as prompts and provide responses. This ability is one of the main features of Llama 3.2. These features also include:

 - Multimodality - has the ability to evaluate both text and image prompts
 - Small to Medium size variations (11B and 90B) - this provides flexible deployment options,
@@ -151,7 +151,7 @@ response = client.complete(
 print(response.choices[0].message.content)
 ```

-## Learning does not stop here, continue the Journey
+## Learning does not stop here, continue the journey

 After completing this lesson, check out our [Generative AI Learning collection](https://aka.ms/genai-collection?WT.mc_id=academic-105485-koreyst) to continue leveling up your Generative AI knowledge!