NVIDIA-NeMo
diff --git a/.github/workflows/gpu_tests.yml b/.github/workflows/gpu_tests.yml
Lines changed: 3 additions & 0 deletions
diff --git a/docs/basics/inference.md b/docs/basics/inference.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/basics/prompt-format.md b/docs/basics/prompt-format.md
Lines changed: 35 additions & 12 deletions
diff --git a/docs/openmathreasoning1/evaluation.md b/docs/openmathreasoning1/evaluation.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/openmathreasoning1/training.md b/docs/openmathreasoning1/training.md
Lines changed: 6 additions & 2 deletions
diff --git a/nemo_skills/inference/chat_interface/core.py b/nemo_skills/inference/chat_interface/core.py
Lines changed: 2 additions & 1 deletion
diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py
Lines changed: 2 additions & 1 deletion
diff --git a/nemo_skills/inference/server/code_execution_model.py b/nemo_skills/inference/server/code_execution_model.py
Lines changed: 59 additions & 14 deletions
@@ -47,6 +47,7 @@ jobs:
       run: |
         cd ${{ github.run_id }}
         nvidia-smi
+        export DOCKER_CLIENT_TIMEOUT=120
         set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
         ./tests/gpu-tests/run_llama.sh
     - name: Cleanup
@@ -86,6 +87,7 @@ jobs:
       run: |
         cd ${{ github.run_id }}
         nvidia-smi
+        export DOCKER_CLIENT_TIMEOUT=120
         set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
         ./tests/gpu-tests/run_qwen.sh
     - name: Cleanup
@@ -122,6 +124,7 @@ jobs:
       run: |
         cd ${{ github.run_id }}
         nvidia-smi
+        export DOCKER_CLIENT_TIMEOUT=120
         set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
         ./tests/gpu-tests/run_rm.sh
     - name: Cleanup
 
@@ -129,7 +129,7 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
 
     sandbox = get_sandbox()  # localhost by default
     llm = get_code_execution_model(server_type="vllm", sandbox=sandbox)
-    prompt = get_prompt('generic/default', 'llama3-instruct') # (1)!
+    prompt = get_prompt('generic/default', 'llama3-instruct', code_tags='llama3') # (1)!
     prompt.config.system = ( # (2)!
         "Environment: ipython\n\n"
         "Use Python to solve this math problem."
 
@@ -1,11 +1,16 @@
 # Prompt utilities
 
-Our prompts are configured via two input yaml files: prompt template and prompt config.
+Our prompts are configured via three yaml files:
+
+1. **Prompt template** - defines model-specific chat format and special tokens
+2. **Prompt config** - contains the actual prompt text with placeholders  
+3. **Code tags** - specifies code formatting tokens, required for code execution
+
 
 ## Prompt template
 
 The template file defines model-specific special tokens, e.g. bos, turn tokens,
-user/assistant/system message, special tokens for code execution, etc. All of the
+user/assistant/system message, etc. All of the
 templates that we support by default are available in
 [nemo_skills/prompt/template](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/prompt/template)
 folder. Here is an example template for
@@ -34,13 +39,6 @@ assistant_begin: "<|start_header_id|>assistant<|end_header_id|>\n\n"
 assistant_end: "<|eot_id|>"
 
 stop_phrases: ["<|eot_id|>"]
-
-# used to execute code within these tags
-code_begin: '<|python_tag|>'
-code_end: '<|eom_id|>'
-# used to extract the code output
-code_output_begin: '<|start_header_id|>ipython<|end_header_id|>'
-code_output_end: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
 ```
 
 You can specify a particular template with `++prompt_template=...`. If you don't add a .yaml extension (e.g.
@@ -96,22 +94,47 @@ prompt the `gsm8k_standard_few_shot` examples from
 [here](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/prompt/few_shot_examples/gsm8k.py) are used.
 
 
+## Code tags
+
+Code tags define the special tokens that models use to mark executable code blocks and their output. Code tags are required when using code execution.
+All code tags that we support by default are available in
+[nemo_skills/prompt/code_tags](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/prompt/code_tags).
+
+Here is an example code tags file for the [llama3](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/prompt/code_tags/llama3.yaml) family:
+
+```yaml
+# Code tags for llama3 family models
+
+# used to execute code within these tags
+code_begin: "<|python_tag|>"
+code_end: "<|eom_id|>"
+
+# used to extract the code output
+code_output_begin: "<|start_header_id|>ipython<|end_header_id|>"
+code_output_end: "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
+
+# how to post-process the captured output (choices: llama, qwen)
+code_output_format: "llama"
+```
+
 ## Prompt API
 
-If you're running one of the pipeline scripts, you can control the prompt by using
+If you're running one of the pipeline scripts, you can control the prompt by using:
 
 ```bash
 ++prompt_template=...
 ++prompt_config=...
+++code_tags=...
 ++examples_type=...
 ```
 
-If you're implementing a new script, you can use the following code to create a prompt and then use it
+If you're implementing a new script, you can use the following code to create a prompt and then use it:
 
 ```python
 from nemo_skills.prompt.utils import get_prompt
 
-prompt = get_prompt('generic/math', 'llama3-instruct')
+# The third parameter is optional and only needed for code execution
+prompt = get_prompt('generic/math', 'llama3-instruct', code_tags='llama3')
 print(prompt.fill({'problem': "What's 2 + 2?"}))
 ```
 
 
@@ -188,7 +188,8 @@ ns eval \
     --server_gpus=1 \
     --num_jobs=1 \
     --with_sandbox \
-    ++prompt_template=openmath-instruct \
+    ++code_tags=openmath \
+    ++prompt_template=qwen-instruct \
     ++prompt_config=openmath/tir \
     ++inference.tokens_to_generate=32768 \
     ++inference.temperature=0.6 \
@@ -210,7 +211,8 @@ ns eval \
     --server_gpus=1 \
     --num_jobs=1 \
     --with_sandbox \
-    ++prompt_template=openmath-instruct \
+    ++code_tags=openmath \
+    ++prompt_template=qwen-instruct \
     ++prompt_config=generic/math \
     ++inference.tokens_to_generate=32768 \
     ++inference.temperature=0.6 \
 
@@ -33,13 +33,15 @@ for inference_mode in ["cot", "tir", "genselect"]:
     dataset[inference_mode] = dataset[inference_mode].rename_column("problem", "input")
     dataset[inference_mode] = dataset[inference_mode].rename_column("generated_solution", "output")
 
+    code_tags = None
     if inference_mode == 'cot':
         prompt_config = 'generic/math'
     if inference_mode == 'tir':
         prompt_config = 'openmath/tir'
+        code_tags = 'openmath'
     if inference_mode == 'genselect':  # already formatted
         prompt_config = {'user': '{problem}'}
-    prompt = get_prompt(prompt_config, 'qwen-instruct')
+    prompt = get_prompt(prompt_config, 'qwen-instruct', code_tags)
     func = partial(apply_format, prompt=prompt, is_tir=(inference_mode == 'tir'))
     dataset[inference_mode] = dataset[inference_mode].map(func, num_proc=20)
 
@@ -275,15 +277,17 @@ for inference_mode in ["cot", "tir", "genselect"]:
     dataset[inference_mode] = dataset[inference_mode].rename_column("problem", "input")
     dataset[inference_mode] = dataset[inference_mode].rename_column("generated_solution", "output")
 
+    code_tags = None
     if inference_mode == 'cot':
         prompt_config = 'generic/math'
     if inference_mode == 'tir':
         prompt_config = 'openmath/tir'
+        code_tags = 'openmath'
     if inference_mode == 'genselect':  # already formatted
         prompt_config = {'user': '{problem}'}
     func = partial(filter_func, inference_mode=inference_mode)
     dataset[inference_mode] = dataset[inference_mode].filter(func, num_proc=20)
-    prompt = get_prompt(prompt_config, 'qwen-instruct')
+    prompt = get_prompt(prompt_config, 'qwen-instruct', code_tags)
     func = partial(apply_format, prompt=prompt, is_tir=(inference_mode == 'tir'))
     dataset[inference_mode] = dataset[inference_mode].map(func, num_proc=20)
 
 
@@ -50,7 +50,8 @@ class AppConfig:
     # Prompt configuration
     base_prompt_config: str = "generic/math"
     code_prompt_config: str = "openmath/tir"
-    prompt_template: str = "openmath-instruct"
+    prompt_template: str = "qwen-instruct"
+    code_tags: str = "openmath"
 
     # Code-execution related
     initial_code_execution_state: bool = False
 
@@ -55,6 +55,7 @@ class GenerateSolutionsConfig:
     output_file: str  # Where to save the generations
     prompt_config: str  # How to format the data into prompts
     prompt_template: str | None = None  # not required for OpenAI server
+    code_tags: str | None = None  # required when using code execution
     examples_type: str | None = None  # to be able to customize few-shot examples
 
     # Inference server configuration {server_params}
@@ -245,7 +246,7 @@ def setup_llm(self):
         return llm
 
     def setup_prompt(self):
-        prompt = get_prompt(self.cfg.prompt_config, self.cfg.prompt_template, examples_type=self.cfg.examples_type)
+        prompt = get_prompt(self.cfg.prompt_config, self.cfg.prompt_template, self.cfg.code_tags, examples_type=self.cfg.examples_type)
         LOG.info("Prompt used: %s", prompt)
         return prompt
 
 
@@ -61,7 +61,7 @@ def _is_generation_cancelled(self, gen_id):
 
     def _generate_single(
         self,
-        prompt: str,
+        prompt: str | list,
         code_begin: str,
         code_end: str,
         code_output_begin: str,
@@ -81,8 +81,9 @@ def _generate_single(
         max_code_executions: int | None = None,  # if not None, will override self.config.max_code_executions
         stream: bool = False,
     ):
-        if not isinstance(prompt, str):
-            raise NotImplementedError("OpenAI API is not supported yet.")
+        # Handle OpenAI-style prompts (a list of message dicts rather than a string)
+        is_openai_format = not isinstance(prompt, str)
+
         if top_logprobs is not None:  # TODO: add this
             raise NotImplementedError("top_logprobs is not supported yet.")
 
@@ -106,18 +107,20 @@ def _generate_single(
                 max_code_executions=max_code_executions,
             )
 
-        if stop_phrases is None:
-            stop_phrases = []
-
         effective_max_code_executions = self.config.max_code_executions
         if max_code_executions is not None:
             effective_max_code_executions = max_code_executions
 
         # making a copy of prompts to not corrupt original data
-        new_prompt = copy.deepcopy(prompt)
+        if is_openai_format:
+            new_prompt = copy.deepcopy(prompt)
+        else:
+            new_prompt = copy.deepcopy(prompt)
 
         start_time = int(time.time())
 
+        stop_phrases = stop_phrases or []
+
         request = {
             "prompt": new_prompt,
             "tokens_to_generate": tokens_to_generate,
@@ -176,7 +179,19 @@ def _generate_single(
             output, num_generated_tokens = output_dict['generation'], output_dict.get('num_generated_tokens', 0)
             # no need to do anything with this as the code below should just exit, so that's only for logging
             stopped_on_repetition = output_dict.get('stopped_on_repetition', False)
-            request['prompt'] += output
+
+            # The OpenAI API doesn't report which stop phrase was triggered, so we assume
+            # it was `code_end` if there's an unfinished code block
+            if is_openai_format and output_dict.get('finish_reason') == 'stop':
+                if output.count(code_end) + 1 == output.count(code_begin):
+                    output += code_end
+            # Update the prompt based on format
+            if is_openai_format:
+                request['prompt'].append({'role': 'assistant', 'content': output})
+                request['prompt'].append({'role': 'user', 'content': "continue"})
+            else:
+                request['prompt'] += output
+
             # if it's the extra iteration, we don't execute the code block and just finish
 
             if generation_index == effective_max_code_executions:
@@ -204,17 +219,28 @@ def _generate_single(
                 if self.config.add_remaining_code_executions:
                     remaining_code_executions = effective_max_code_executions - generation_index - 1
                 # adding code output to the prompt
-                request['prompt'] += format_code_output(
+                code_output = format_code_output(
                     execution_dict, code_output_begin, code_output_end, code_output_format, remaining_code_executions
                 )
+                
+                if is_openai_format:
+                    request['prompt'][-2]['content'] += code_output
+                else:
+                    request['prompt'] += code_output
+                    
                 code_execution_time += int(time.time() - code_execution_time_start)
                 code_rounds_executed += 1
             else:  # if no code was generated, we need to finish
                 break
 
-        # removing original prompt
+        # removing original prompt and returning the generation
+        if is_openai_format:
+            generation = "\n".join(msg['content'] for msg in request['prompt'] if msg['role'] == 'assistant')
+        else:
+            generation = request['prompt'][len(prompt):]
+            
         return {
-            'generation': request['prompt'][len(prompt) :],
+            'generation': generation,
             'code_rounds_executed': code_rounds_executed,
             'num_generated_tokens': total_num_generated_tokens,
             'generation_time': generation_time,
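The stop-phrase heuristic in the hunk above can be isolated as a small function. This is a sketch for illustration only: `close_unfinished_code_block` is a hypothetical name, and the tag strings are the llama3 values from the docs in this PR rather than whatever tags a given run uses.

```python
def close_unfinished_code_block(
    output: str, finish_reason: str, code_begin: str, code_end: str
) -> str:
    """Re-append code_end when generation stopped with exactly one block left open."""
    if finish_reason != "stop":
        return output
    # One more opening tag than closing tags means the last block was cut off,
    # presumably because the server stripped the stop phrase that ended it.
    if output.count(code_end) + 1 == output.count(code_begin):
        return output + code_end
    return output


BEGIN, END = "<|python_tag|>", "<|eom_id|>"
print(close_unfinished_code_block("x<|python_tag|>print(1)", "stop", BEGIN, END))
print(close_unfinished_code_block("plain text answer", "stop", BEGIN, END))
```

One consequence worth noting: the check cannot tell `code_end` apart from any other stop phrase, so a generation that stops on a different phrase while a code block is still open would also get `code_end` appended.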
@@ -433,6 +459,9 @@ def _stream_single(
         """
         Helper method, that implements streaming generation.
         """
+        # Handle OpenAI-style prompts (a list of message dicts rather than a string)
+        is_openai_format = not isinstance(prompt, str)
+
         effective_max_code_executions = self.config.max_code_executions
         if max_code_executions is not None:
             effective_max_code_executions = max_code_executions
@@ -452,7 +481,7 @@ def _stream_single(
             'stream': True,
         }
 
-        current_full_prompt = prompt
+        current_full_prompt = copy.deepcopy(prompt)
         session_id = None  # For sandbox state continuity
         for generation_index in range(effective_max_code_executions + 1):
             model_token_iterator = self.model._generate_single(prompt=current_full_prompt, **request)
@@ -470,7 +499,18 @@ def _stream_single(
             if not current_output_segment:
                 break
 
-            current_full_prompt += current_output_segment
+            # The OpenAI API doesn't report which stop phrase was triggered, so we assume
+            # it was `code_end` if there's an unfinished code block
+            if is_openai_format and chunk.get('finish_reason') == 'stop':
+                if current_output_segment.count(code_end) + 1 == current_output_segment.count(code_begin):
+                    current_output_segment += code_end
+
+            # Update the prompt based on format
+            if is_openai_format:
+                current_full_prompt.append({'role': 'assistant', 'content': current_output_segment})
+                current_full_prompt.append({'role': 'user', 'content': "continue"})
+            else:
+                current_full_prompt += current_output_segment
 
             if generation_index == effective_max_code_executions:
                 # This was the last iteration, intended for final text generation after all code executions.
@@ -496,7 +536,12 @@ def _stream_single(
                 )
 
                 yield {'generation': formatted_code_output}  # Yield the entire formatted code output as one chunk
-                current_full_prompt += formatted_code_output  # Append executed code's output to the prompt
+                
+                # Append executed code's output to the prompt
+                if is_openai_format:
+                    current_full_prompt[-2]['content'] += formatted_code_output
+                else:
+                    current_full_prompt += formatted_code_output
             else:
                 break