Comparing Llama Models for Python Function Generation
Testing two models, Llama 3.1 (8B) and Llama 3.2 (3.2B), on their ability to generate a Python function that computes the height of an object using its image size in pixels and the object’s distance.
Experimental Setup
Both models were run locally on a PC equipped with:
* CPU: Intel i7-8700
* RAM: 16GB
* GPU: Nvidia Quadro P2200
The models operated in generation-only mode, with no feedback or fine-tuning. Each attempt to solve the task involved the following process:
1. Generating Code: The task prompt was structured as a docstring:
Create a Python function `object_height` that calculates object height
based on the image height in pixels.
Input parameters:
- focal length `F` in mm, default 35 mm
- image height `H` in pixels, default 1152
- pixel size `p` in μm, default 3.45 μm
- object distance `D`, default 6.5 m
Returns:
The object height in mm as an absolute float value.
Raises:
`RuntimeError` if input parameters are invalid (negative or zero).
2. Running the Code: The generated response was parsed to isolate the function implementation, and the extracted function code was executed locally.
3. Testing the Function: The function was validated with:
* Default parameters.
* Various valid inputs.
* Invalid parameters to test error handling.
The task was considered successfully solved only if the function passed all tests; a sketch of one implementation that would satisfy the prompt is shown below.
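For context, here is a minimal sketch of what a passing solution could look like, based only on the prompt above. It assumes the pinhole-camera relation (object height ≈ sensor image height × distance / focal length); the code the models actually produced is not shown in the post.

```python
def object_height(F: float = 35.0, H: float = 1152, p: float = 3.45, D: float = 6.5) -> float:
    """Calculate object height (mm) from image height in pixels.

    F: focal length in mm, H: image height in pixels,
    p: pixel size in μm, D: object distance in m.
    Raises RuntimeError if any parameter is negative or zero.
    """
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Input parameters must be positive")
    sensor_height_mm = H * p / 1000.0   # pixels × μm → mm on the sensor
    distance_mm = D * 1000.0            # m → mm
    return abs(sensor_height_mm * distance_mm / F)
```

With the default parameters this evaluates to roughly 738 mm.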
Evaluation Metrics
To measure the effectiveness of each model, three key metrics were considered (a sketch of a measurement harness follows the list):
* Number of Attempts: how many attempts were required to produce a correct solution.
* Time per Attempt: the time taken to generate and process the function code in each iteration.
* Total Time to Solve the Task: the cumulative time from the start of the process until a correct solution was achieved.
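The harness code itself is not included in the post. As an illustration only, a loop of roughly this shape would record all three metrics, assuming the models are served by a local Ollama instance and queried through its Python client; the extraction and testing helpers are simplified stand-ins, the parameter names follow the prompt, and the expected default result (≈738 mm) comes from the reference sketch above.

```python
import re
import time

import ollama  # assumption: both models are hosted by a local Ollama server

# The docstring prompt shown above (truncated here for brevity).
PROMPT = "Create a Python function `object_height` that calculates object height ..."


def extract_function_code(response: str) -> str:
    """Naively pull the `def object_height(...)` block out of the model's reply."""
    match = re.search(r"def object_height.*?(?=\n\S|\Z)", response, re.S)
    if match is None:
        raise ValueError("no object_height definition found in the response")
    return match.group(0)


def run_tests(code: str) -> bool:
    """Validate default parameters, a valid input, and error handling for invalid input."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        fn = namespace["object_height"]
        assert abs(fn() - 738.1) < 1.0                 # default parameters (value from the sketch above)
        assert fn(F=50.0, H=2000, p=3.45, D=10.0) > 0  # a valid non-default input
        try:
            fn(F=-1.0)                                 # invalid input must raise RuntimeError
            return False
        except RuntimeError:
            return True
    except Exception:
        return False


def run_experiment(model: str, max_attempts: int = 500):
    """Return (number of attempts, per-attempt times, total time) for one task."""
    attempt_times = []
    start = time.time()
    for attempt in range(1, max_attempts + 1):
        t0 = time.time()
        reply = ollama.generate(model=model, prompt=PROMPT)["response"]
        try:
            solved = run_tests(extract_function_code(reply))
        except ValueError:
            solved = False
        attempt_times.append(time.time() - t0)  # time per attempt
        if solved:
            break
    return attempt, attempt_times, time.time() - start
```

For example, `run_experiment("llama3.2")` would return the three metrics for a single run of the smaller model.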
Number of Attempts
Observing the distribution of attempts to solve the task, we can see that llama3.1 has lower values (median = 33) than llama3.2 (median = 47). This difference is statistically significant according to the Mann-Whitney U test (used because of the non-normal distributions) at a significance level of 0.05 (p-value = 0.008). The box plot shows several outliers, with llama3.2 producing extreme values 6 and 10 times larger than its median.
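The analysis code is not shown in the post, but the reported comparison corresponds to a two-sided Mann-Whitney U test, for example with SciPy; the attempt counts below are placeholders, not the experiment's data.

```python
from scipy.stats import mannwhitneyu

# Number of attempts per solved run for each model (placeholder values;
# the per-run counts are not published in the post).
attempts_llama31 = [28, 31, 33, 36, 40, 29, 35]
attempts_llama32 = [44, 47, 51, 62, 45, 49, 58]

stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # the post reports p = 0.008 at the 0.05 level
```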
Time per Attempt
Looking at the average attempt times, llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
Total Time to Solve the Task
This is the main metric for users who want to receive working, tested code from the model. It reveals significant differences between the models:
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Shows a strong right-skewed distribution
  - Most efficient results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Shows a wider spread in completion times
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.
#testing