Fix failing llama test #2819


Merged · 11 commits into crewAIInc:main · May 16, 2025

Conversation

Vidit-Ostwal (Contributor)

Fixes #2817

Registered a new interaction.

@Vidit-Ostwal mentioned this pull request on May 13, 2025
@@ -359,7 +359,7 @@ def test_convert_with_instructions():

 @pytest.mark.vcr(filter_headers=["authorization"])
 def test_converter_with_llama3_2_model():
-    llm = LLM(model="ollama/llama3.2:3b", base_url="http://localhost:11434")
+    llm = LLM(model="openrouter/meta-llama/llama-3.2-3b-instruct", api_key='ABC')
Contributor
Why has this change fixed the flaky test?

@Vidit-Ostwal (Contributor Author) · May 13, 2025

Honestly, I am still confused.

Initially I tested the previous cassette file locally on 3.10, 3.11, and 3.12, and it worked completely fine.

Still, the test case kept failing on this PR, so I thought that recording a different interaction might resolve the issue.

I still don't completely understand why the test became flaky in the first place (I couldn't find anything odd).

Also, reading from the logs of another PR:

model = 'llama3.2:3b'
messages = [{'content': 'Name: Alice Llama, Age: 30', 'role': 'user'}, {'content': '', 'role': 'assistant', 'tool_calls': [{'func...\': \'Age\', \'type\': \'integer\'}}, \'required\': [\'age\', \'name\'], \'type\': \'object\'}}}\n', 'role': 'system'}]
timeout = 600.0, temperature = None, top_p = None, n = None, stream = None
stream_options = None, stop = None, max_completion_tokens = None
max_tokens = None, modalities = None, prediction = None, audio = None
presence_penalty = None, frequency_penalty = None, logit_bias = None
user = None, reasoning_effort = None, response_format = None, seed = None
tools = [{'function': {'description': 'Correctly extracted `SimpleModel` with all the required parameters with correct types',...'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['age', 'name'], 'type': 'object'}}, 'type': 'function'}]
tool_choice = {'function': {'name': 'SimpleModel'}, 'type': 'function'}
logprobs = None, top_logprobs = None, parallel_tool_calls = None
deployment_id = None, extra_headers = None, functions = None
function_call = None, base_url = None, api_version = None, api_key = None
model_list = None, thinking = None
kwargs = {'litellm_call_id': 'c1f47fba-ec49-40af-ba33-f12f2ebebea9', 'litellm_logging_obj': <litellm.litellm_core_utils.litellm_logging.Logging object at 0x7fd020dc6810>}
args = {'acompletion': False, 'api_base': 'http://localhost:11434', 'api_key': None, 'api_version': None, ...}
api_base = 'http://localhost:11434', mock_response = None
mock_tool_calls = None, mock_timeout = None, force_timeout = 600
logger_fn = None

I feel that at runtime it's not able to match the cassette file, so it ends up trying to make an actual call.
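One way to make that failure loud rather than silent, assuming the suite uses pytest-recording (the plugin that provides the pytest.mark.vcr marker seen in the diff), is to pin the record mode to "none" in a vcr_config fixture: any request that doesn't match a recorded interaction then raises vcrpy's CannotOverwriteExistingCassetteException instead of going out to the network. A minimal sketch under that assumption:

```python
# conftest.py sketch, assuming pytest-recording; forces replay-only behaviour
# so a cassette mismatch fails fast instead of making a live HTTP call.
import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": ["authorization"],
        # "none" = never record; an unmatched request raises an error
        # rather than silently hitting the real endpoint.
        "record_mode": "none",
        # vcrpy's default matchers, listed explicitly; add "body" to also
        # require the request payload to match what was recorded.
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }
```

With that in place, a flaky run would at least surface as an explicit matching error pointing at the request that diverged from the cassette.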

@Vidit-Ostwal requested a review from lucasgomide on May 14, 2025 at 04:47
@Vidit-Ostwal (Contributor Author)

I ran this test case 50 times on 3.10, 3.11, and 3.12.

[Screenshots of the test runs, 2025-05-16 at 8:08 PM and 8:10 PM]

Successful on the first test run :)

@Vidit-Ostwal requested a review from lucasgomide on May 16, 2025 at 19:10
@lucasgomide (Contributor) left a comment

Thanks @Vidit-Ostwal, I really appreciate your work.

I think you've shown that this interaction fixes the flaky test, so I believe it's okay to go ahead and merge it.
I'm going to be very honest though… it's still unclear to me why the flaky test was happening. I'm not a big fan of this kind of situation: something still feels unclear, even to you. You just changed the model, and the test magically stopped failing.
Another point: I'm not sure why we wrote a test for the converter with a custom LLM model. But since you are using a very similar approach, it might be a safety net(?).

Going to approve it with this consideration... but again, good work!

@lucasgomide merged commit aa6e5b7 into crewAIInc:main on May 16, 2025
6 checks passed
@Vidit-Ostwal (Contributor Author)

> I think you've shown that this interaction fixes the flaky test, so I believe it's okay to go ahead and merge it. I'm going to be very honest though… it's still unclear to me why the flaky test was happening. I'm not a big fan of this kind of situation: something still feels unclear, even to you.

Agreed, I'm not a big fan of this either. Code is deterministic, but this randomness defies explanation.

> You just changed the model, and the test magically stopped failing.

I think the issue was with the .yaml interaction file. I’m not exactly sure what went wrong—especially since all the matchers seemed to be working fine—but my hunch is that re-recording the same HTTP interaction could solve it. It’s a bit like restarting your PC or phone when something’s off and you don’t know why. Not the biggest fan of that approach, but sometimes it just works.
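For what it's worth, a sketch of that re-record, assuming pytest-recording merges vcr marker kwargs into the cassette config the same way it already does for filter_headers: temporarily raise the record mode, run the test once against the live endpoint with a real key, commit the refreshed .yaml cassette, then drop the override so CI goes back to pure replay.

```python
# Hedged sketch of a one-off cassette refresh; record_mode="all" as a marker
# kwarg is an assumption, mirroring how filter_headers is passed today.
import pytest

from crewai import LLM


@pytest.mark.vcr(filter_headers=["authorization"], record_mode="all")
def test_converter_with_llama3_2_model():
    # While re-recording, swap the placeholder for a real OpenRouter key so
    # the live call succeeds; the authorization header is filtered out of the
    # recorded cassette anyway.
    llm = LLM(model="openrouter/meta-llama/llama-3.2-3b-instruct", api_key="ABC")
    ...
```

After one green run the override is removed, and the fresh interaction in the .yaml file is what every subsequent replay matches against.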

> Another point: I'm not sure why we wrote a test for the converter with a custom LLM model. But since you are using a very similar approach, it might be a safety net(?).

Yup, I'm also confused about why the converter was tested specifically with llama3_2.
@lucasgomide, thanks for your input.

@Vidit-Ostwal deleted the fix-fail-llama-test branch on May 16, 2025 at 19:29
Development

Successfully merging this pull request may close these issues.

[BUG] Ollama Test Case Issue (#2817)