Comparing Llama Models for Python Function Generation
Testing two models, Llama 3.1 (8B) and Llama 3.2 (3.2B), on their ability to generate a Python function that computes the height of an object using its image size in pixels and the object’s distance.
Experimental Setup
Both models were run locally on a PC equipped with:
* CPU: Intel i7-8700
* RAM: 16GB
* GPU: Nvidia Quadro P2200
The models operated in generation-only mode, with no feedback or fine-tuning. Each attempt to solve the task involved the following process:
1. Generating Code: The task prompt was structured as a docstring:
```
Create a Python function `object_height` that calculates object height
based on the image height in pixels.
Input parameters:
- focal length `F` in mm, default 35 mm
- image height `H` in pixels, default 1152
- pixel size `p` in μm, default 3.45 μm
- object distance `D`, default 6.5 m
Returns:
The object height in mm as an absolute float value.
Raises:
`RuntimeError` if input parameters are invalid (negative or zero).
```
2. Extracting and Running the Code: The generated response was parsed to isolate the function implementation, and the extracted code was executed locally.
3. Testing the Function: The function was validated with:
* Default parameters.
* Various valid inputs.
* Invalid parameters to test error handling.
The task was considered successfully solved only if the function passed all tests.
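The harness itself isn't shown here, so below is a rough sketch of how such a generate-extract-run-test loop can be wired up. It assumes the models are served locally through the Ollama Python client (`ollama.generate`); the prompt constant, the check values, and the helper names are illustrative, not the original script.

```python
import re
import time

import ollama  # assumes a local Ollama server hosting the models

PROMPT = "..."  # the docstring-style task prompt shown above

def extract_code(response_text):
    """Pull the function source out of a fenced code block, if present."""
    match = re.search(r"`{3}(?:python)?\n(.+?)`{3}", response_text, flags=re.DOTALL)
    return match.group(1) if match else None

def passes_tests(func):
    """Illustrative checks: defaults and invalid inputs."""
    expected_default = (1152 * 3.45 / 1000) / 35 * 6500  # ≈ 738.1 mm, from h * p * d / f
    try:
        if abs(func() - expected_default) > 1e-3:
            return False
        try:
            func(F=-1)  # invalid parameters must raise RuntimeError
            return False
        except RuntimeError:
            pass
        return True
    except Exception:
        return False

def solve(model_name):
    """One task run: repeat generate-extract-run-test until the checks pass."""
    attempts, start = 0, time.monotonic()
    while True:
        attempts += 1
        t0 = time.monotonic()
        reply = ollama.generate(model=model_name, prompt=PROMPT)["response"]
        code = extract_code(reply)
        passed = False
        if code:
            namespace = {}
            try:
                exec(code, namespace)  # run the generated code locally
                passed = passes_tests(namespace["object_height"])
            except Exception:
                passed = False
        print(f"attempt {attempts}: {time.monotonic() - t0:.1f} s, passed={passed}")
        if passed:
            return attempts, time.monotonic() - start  # attempts, total time to solve
```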
Evaluation Metrics
To measure the effectiveness of each model, three key metrics were considered:
* Number of Attempts: how many attempts were required to produce a correct solution.
* Time per Attempt: the time taken to generate and process the function code in each iteration.
* Total Time to Solve the Task: the cumulative time from the start of the process until a correct solution was achieved.
Number of Attempts
Observing the distribution of attempts to solve the task, we can see that model llama3.1 has lower values (median = 33) than model llama3.2 (median = 47). This difference is statistically significant according to the Mann-Whitney U test (used because of the non-normal distributions) at a confidence level of 0.05 (p-value = 0.008). The box plot shows several outliers, with model llama3.2 having extreme values 6 and 10 times larger than its median.
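For anyone who wants to reproduce this kind of check, a two-sample Mann-Whitney U test is a one-liner with SciPy; the attempt counts below are placeholders, not the measured data:

```python
from scipy.stats import mannwhitneyu

# Placeholder attempt counts per run; substitute the real measurements
attempts_llama31 = [28, 33, 35, 40, 31, 36, 30]
attempts_llama32 = [45, 47, 52, 60, 44, 49, 51]

stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # significant at alpha = 0.05 if p < 0.05
```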
Time per Attempt
Looking at the average attempt times, model llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
Total Time to Solve the Task
This is the main metric for users who want to receive working, tested code from the model. It reveals significant differences between the models:
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Shows a strong right-skewed distribution
  - Most efficient results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Shows a wider spread in completion times
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.
#testing
In addition to the llama models, I also tested the script with Alibaba's Qwen2.5 7B model, but encountered a significant performance issue: the first solve time was 1 day, 5:22:03 🤯.
Let's analyze the solution it provided:
```python
def object_height(F=35, H=1152, p=3.45, D=6.5):
    # Constants in appropriate units for calculation
    F_mm = F  # focal length in mm (already in correct unit)
    H_px = H  # image height in pixels (already in correct unit)
    p_um = p * 1000  # pixel size in micrometers
    D_mm = D * 1000  # Convert object distance from meters to millimeters

    # Input validation
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")

    # Calculate image height in millimeters
    H_mm = H_px * p_um / 1000

    # Calculate object height using the lens equation
    h_object = (H_mm / F_mm) * D_mm

    return h_object  # Absolute value not needed in this context
```
While the function passes all tests, I spotted two logical issues:
- The pixel size conversion is oddly documented - it claims to convert from micrometers to micrometers, which makes no sense since they're the same unit!
- The unit handling seems backward - instead of messing with the focal length units, it should focus on converting the distance from meters to millimeters (see the worked numbers below).
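To make the unit flow concrete, here is the calculation with the default parameters, based on the magnification relation the task uses (my own arithmetic, not output from either model):

```python
F = 35      # focal length, mm
H = 1152    # image height, pixels
p = 3.45    # pixel size, µm
D = 6.5     # object distance, m

image_height_mm = H * p / 1000       # 1152 * 3.45 µm = 3974.4 µm = 3.9744 mm
magnification = (D * 1000) / F       # 6500 mm / 35 mm ≈ 185.7
object_height_mm = image_height_mm * magnification
print(object_height_mm)              # ≈ 738.1 mm
```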
These issues become even more apparent when we look at the cleaner solution from the llama3.2 model:

```python
def object_height(F=35, H=1152, p=3.45, D=6.5):
    """
    Calculates the object height in mm from known image height, focal length, pixel size and object distance.

    Args:
        F (float): Focal length in mm, default 35 mm.
        H (int): Image height in pixels, default 1152.
        p (float): Pixel size in mkm, default 3.45 mkm.
        D (float): Object distance in m, default 6.5 m.

    Returns:
        float: Object height in mm.

    Raises:
        RuntimeError: If any input parameter is invalid (negative or equal to zero).
    """
    # Check for invalid input parameters
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")

    # Convert object distance from meters to millimeters
    D_mm = D * 1000

    # Calculate image height in millimeters
    H_mm = H / 1000 * p

    # Calculate object height using the lens equation and magnification formula
    h_object = (H_mm / F) * D_mm

    return abs(h_object)
```

Now this is more like it! Clear documentation, proper unit conversions, and straightforward calculations. The only minor quirk is the unnecessary `abs()` in the return statement - we don't really need it here. Otherwise, it's a solid implementation.
#testing
Writing a Python Function with Code-Specific Models
My goal is to create a local AI agent that understands my projects' context and generates working code efficiently. While general-purpose LLMs are impressive, code-specific models should be better suited for this task.
I tested the following models:
- CodeLlama (13B parameters),
- CodeStral (22B parameters),
- Qwen2.5-Coder (14B parameters),
- DeepSeek-Coder-v2 (16B parameters)
For the testing environment, I spun up a server with a Tesla T4 GPU (16GB VRAM). This hardware constraint helped narrow down my model selection.
I gave each model the same test: generate a function that calculates an object's size using known image parameters and distance, as described earlier. To improve code extraction from the responses, I implemented a more robust parsing pattern:
```python
match = re.search(
    r"(?:`{3}\w*|\[PYTHON\])\n(.+?)(?:`{3}|\[\/PYTHON\])",
    out_str,
    flags=re.DOTALL,
)
```
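As a quick illustration of what this pattern captures, here it is applied to two made-up responses, one using a Markdown code fence and one using the `[PYTHON]...[/PYTHON]` tag style some models emit:

```python
import re

PATTERN = r"(?:`{3}\w*|\[PYTHON\])\n(.+?)(?:`{3}|\[\/PYTHON\])"

# Two made-up model replies; the backticks are assembled with "`" * 3 only so
# this snippet displays cleanly inside a fenced block.
fenced = "Here is the function:\n" + "`" * 3 + "python\ndef object_height():\n    return 1\n" + "`" * 3
tagged = "[PYTHON]\ndef object_height():\n    return 1\n[/PYTHON]"

for out_str in (fenced, tagged):
    match = re.search(PATTERN, out_str, flags=re.DOTALL)
    print(match.group(1))  # prints the code between the delimiters
```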
Number of Attempts
The results show significant variance between the models' performance. CodeStral emerged as the clear leader, requiring a median of only 5 attempts to complete the task. In stark contrast, Alibaba's Qwen2.5 needed around 110 attempts (22 times more!) to achieve the same result.
Comparing these results with the previous tests using general-purpose models, only CodeStral and CodeLlama showed improved performance. This suggests that being a code-specific model doesn't automatically guarantee better efficiency.
Analyzing Qwen2.5's Performance Issues
Upon investigating the poor performance of the Qwen2.5 model, I discovered a persistent error pattern. Out of approximately 1,230 attempts to write the function, 1,036 (84%) failed due to the same mathematical mistake: using ( f * h * p / d ) instead of the correct formula ( h * p * d / f ).
This consistent misapplication of the formula is particularly interesting, as other models showed more variety in their attempts and errors. This behavior suggests that Qwen2.5 might be "stuck" in a local optimum, repeatedly suggesting the same incorrect solution rather than exploring alternative approaches like its competitors.
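Plugging the default parameters into both expressions (everything converted to millimeters first - my choice of units for illustration) shows how far apart the two formulas land:

```python
h = 1152          # image height, pixels
p = 3.45 / 1000   # pixel size, converted from µm to mm
d = 6.5 * 1000    # object distance, converted from m to mm
f = 35            # focal length, mm

correct = h * p * d / f   # ≈ 738.1 mm, the expected object height
wrong = f * h * p / d     # ≈ 0.02 mm, the variant Qwen2.5 kept producing
print(correct, wrong)
```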
Attempt Time
The models showed significant variations in their processing speeds. Here's how they ranked by median response time:
- CodeLlama: 5 seconds (fastest),
- DeepSeek-Coder: 8 seconds,
- CodeStral: 15 seconds,
- Qwen2.5: 50 seconds (significantly slower).
A pairwise Mann-Whitney U test with a confidence level of 0.05 confirms that the differences in attempt times between models are statistically significant. This statistical analysis reinforces that the observed performance differences are not due to random variation but represent genuine differences in model capabilities.
Qwen2.5's combination of slow processing (50-second median response time) and high number of attempts (median 110) makes it particularly inefficient for this task. While other models complete the task in 5-15 seconds with few attempts, Qwen2.5 requires significantly more resources to generate solutions - which are often incorrect.
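Such a pairwise comparison is straightforward to script with SciPy and itertools; the per-attempt times below are placeholders, not the measured samples:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Placeholder per-attempt times in seconds; substitute the measured samples
attempt_times = {
    "codellama": [5, 4, 6, 5, 7],
    "deepseek-coder": [8, 9, 7, 8, 10],
    "codestral": [15, 14, 16, 17, 13],
    "qwen2.5": [50, 48, 55, 60, 47],
}

for (name_a, a), (name_b, b) in combinations(attempt_times.items(), 2):
    _, p_value = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs {name_b}: p = {p_value:.4f}")
```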
Time to Solve the Task
Each model attempted to write the working function 60 times. The results reveal two distinct performance groups:
Group 1 (Statistically Similar):
- CodeLlama: 1.4 minutes median
- CodeStral: 1.6 minutes median
A Mann-Whitney U test (α=0.05) confirms no significant difference between these models.
Group 2:
- DeepSeek-Coder: 5 minutes median
- Qwen2.5: Excluded due to inefficiency
Conclusion
CodeLlama and CodeStral emerge as the most promising candidates for my AI agent development. While other models might improve with tuning, my next step will focus on implementing feedback mechanisms to enhance these two models' performance. I'll explore using chat mode instead of generate mode to make such feedback possible. The key difference is:
- Generate mode: one-shot code generation
- Chat mode: an interactive process where previous responses can guide subsequent attempts (see the sketch below)
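Here is a minimal sketch of that difference, assuming an Ollama-style Python client; the message contents are illustrative:

```python
import ollama

PROMPT = "..."  # the object_height task prompt

# Generate mode: every attempt starts from the same prompt, with no memory
reply = ollama.generate(model="codestral", prompt=PROMPT)["response"]

# Chat mode: the conversation accumulates, so a failure can inform the next attempt
messages = [{"role": "user", "content": PROMPT}]
reply = ollama.chat(model="codestral", messages=messages)["message"]["content"]
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "The tests failed with: <error text>. Please fix the function."})
reply = ollama.chat(model="codestral", messages=messages)["message"]["content"]
```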
Comparing Phi4 with CodeLlama and CodeStral
After seeing the news about Phi4 - Microsoft's 'state-of-the-art open small size model' - I couldn't pass it by and tested it in generate mode against my previous favorites: CodeLlama and CodeStral.
Here are my results:
- smallest number of attempts to generate working code - 3.5 (median);
- however, highest attempt time - 30.5 seconds (median) on Tesla T4 GPU;
- total task solving time - 1.9 minutes (median), slightly higher than others.
All three distributions of total solving time are non-normal. Despite close medians, Kruskal-Wallis test (α=0.05) shows significant differences. Mann-Whitney U (α=0.05) test revealed:
- CodeLlama vs CodeStral: no significant difference;
- CodeLlama vs Phi4: Phi4 has significantly higher values;
- CodeStral vs Phi4: no significant difference.
These results from Phi4, a general-purpose model, are impressive. It's a strong candidate to join CodeLlama and CodeStral in further research.
Failed Attempt to Use Feedback for Improving Performance
My assumptions about using feedback proved wrong. I tried two different approaches:
- creating a chat that collected detailed error descriptions when the checks weren't successful;
- in generate mode, appending the non-working code from the previous attempt to the initial prompt (sketched below).
In both scenarios, the models eventually became "stuck" on certain wrong answers (as the Qwen2.5 model did in the previous test), and the attempt count increased dramatically. I tested both the general-purpose Llama3.2 and the code-specific Codestral - the results were the same. While I could have tried something more sophisticated for the feedback, I decided not to pursue this path further for now.
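For reference, the second approach boils down to growing the prompt with the last failure. This is a rough sketch under the same Ollama-style client assumption, not the exact script I used; the feedback wording and placeholders are illustrative:

```python
import ollama

BASE_PROMPT = "..."  # the object_height task prompt

def build_prompt(previous_code=None, error=None):
    """Append the last failed attempt and its error message to the task prompt."""
    if previous_code is None:
        return BASE_PROMPT
    return (
        BASE_PROMPT
        + "\n\nThe previous attempt below did not pass the tests:\n"
        + previous_code
        + f"\n\nError: {error}\nPlease return a corrected function."
    )

# First attempt: plain prompt; later attempts: prompt plus the failed code
reply = ollama.generate(model="codestral", prompt=build_prompt())["response"]
# ... extract, run and test the code; if it fails, feed it back:
reply = ollama.generate(
    model="codestral",
    prompt=build_prompt(previous_code="<extracted code>", error="<test error>"),
)["response"]
```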
Keep a Cool Head - Tuning the Models' Temperature
The temperature parameter controls how diverse a model's outputs can be. Lower temperature values make the model more deterministic, causing it to focus on the most likely responses for better accuracy. I conducted experiments with this parameter on three selected models (Phi4, Codestral, and Codellama), which revealed some interesting patterns.
Each model was tasked with generating code for the `object_height` function 30 times to pass a specific test. A clear trend emerged across all models: a lower temperature of 0.2 consistently delivered the best performance in generating the test function code. This finding was statistically validated using the Kruskal-Wallis test (alpha = 0.05), run both with and without the results from this temperature group. The conclusion is clear - for code generation, it's best to keep the model's head cool.
The Phi4 model's performance at temperature 0.2 is particularly impressive - generating working code on the first attempt in 17 out of 30 trials (over 55%) - an outstanding result!
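Running such a sweep is straightforward when the models are served through an Ollama-style client, where temperature is passed via the request options; the temperature grid and the glue code here are illustrative:

```python
import ollama

PROMPT = "..."  # the object_height task prompt

for model in ("phi4", "codestral", "codellama"):
    for temperature in (0.2, 0.5, 0.8):
        reply = ollama.generate(
            model=model,
            prompt=PROMPT,
            options={"temperature": temperature},  # overrides the model's default
        )["response"]
        # extract, run and test the generated function as before,
        # logging attempts and time per (model, temperature) pair
```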
The second insight comes from examining the default temperature values across different models, which we can infer from the graph. The general-purpose Phi4 operates with a default temperature around 0.5-0.7, allowing for more creative responses across various scenarios. Interestingly, the code-specific Codestral model has a default temperature of about 0.2 - a setting that aligns well with its specialized purpose. Perhaps surprisingly, the Codellama model runs with a higher default temperature of around 0.5-0.7, despite its code-focused nature.
These findings highlight that tuning the temperature parameter is a crucial step in optimizing code generation performance. The time invested in such experiments is clearly worthwhile, as it can significantly impact the model's effectiveness in generating correct code.