Comparing Llama Models for Python Function Generation
Testing two models, Llama 3.1 (8B) and Llama 3.2 (3.2B), on their ability to generate a Python function that computes the height of an object using its image size in pixels and the object’s distance.
Experimental Setup
Both models were run locally on a PC equipped with:
* CPU: Intel i7-8700
* RAM: 16GB
* GPU: Nvidia Quadro P2200
The models operated in generation-only mode, with no feedback or fine-tuning. Each attempt to solve the task involved the following process:
1. Generating Code: The task prompt was structured as a docstring:
Create a Python function `object_height` that calculates object height
based on the image height in pixels.
Input parameters:
- focal length `F` in mm, default 35 mm
- image height `H` in pixels, default 1152
- pixel size `p` in μm, default 3.45 μm
- object distance `D`, default 6.5 m
Returns:
The object height in mm as an absolute float value.
Raises:
`RuntimeError` if input parameters are invalid (negative or zero).
2. Running the Code: The generated response was parsed to isolate the function implementation, and the extracted function code was executed locally.
3. Testing the Function: The function was validated with:
* Default parameters.
* Various valid inputs.
* Invalid parameters to test error handling.
The task was considered successfully solved only if the function passed all tests; a sketch of one implementation that would satisfy the prompt is shown below.
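For context, here is a minimal sketch of what a passing solution could look like, based only on the prompt above. It assumes the pinhole-camera relation (object height ≈ sensor image height × distance / focal length); the code the models actually produced is not shown in the post.

```python
def object_height(F: float = 35.0, H: float = 1152, p: float = 3.45, D: float = 6.5) -> float:
    """Calculate object height (mm) from image height in pixels.

    F: focal length in mm, H: image height in pixels,
    p: pixel size in μm, D: object distance in m.
    Raises RuntimeError if any parameter is negative or zero.
    """
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Input parameters must be positive")
    sensor_height_mm = H * p / 1000.0   # pixels × μm → mm on the sensor
    distance_mm = D * 1000.0            # m → mm
    return abs(sensor_height_mm * distance_mm / F)
```

With the default parameters this evaluates to roughly 738 mm.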
Evaluation Metrics
To measure the effectiveness of each model, three key metrics were considered (a sketch of a measurement harness follows the list):
* Number of Attempts: how many attempts were required to produce a correct solution.
* Time per Attempt: the time taken to generate and process the function code in each iteration.
* Total Time to Solve the Task: the cumulative time from the start of the process until a correct solution was achieved.
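The harness code itself is not included in the post. As an illustration only, a loop of roughly this shape would record all three metrics, assuming the models are served by a local Ollama instance and queried through its Python client; the extraction and testing helpers are simplified stand-ins, the parameter names follow the prompt, and the expected default result (≈738 mm) comes from the reference sketch above.

```python
import re
import time

import ollama  # assumption: both models are hosted by a local Ollama server

# The docstring prompt shown above (truncated here for brevity).
PROMPT = "Create a Python function `object_height` that calculates object height ..."


def extract_function_code(response: str) -> str:
    """Naively pull the `def object_height(...)` block out of the model's reply."""
    match = re.search(r"def object_height.*?(?=\n\S|\Z)", response, re.S)
    if match is None:
        raise ValueError("no object_height definition found in the response")
    return match.group(0)


def run_tests(code: str) -> bool:
    """Validate default parameters, a valid input, and error handling for invalid input."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        fn = namespace["object_height"]
        assert abs(fn() - 738.1) < 1.0                 # default parameters (value from the sketch above)
        assert fn(F=50.0, H=2000, p=3.45, D=10.0) > 0  # a valid non-default input
        try:
            fn(F=-1.0)                                 # invalid input must raise RuntimeError
            return False
        except RuntimeError:
            return True
    except Exception:
        return False


def run_experiment(model: str, max_attempts: int = 500):
    """Return (number of attempts, per-attempt times, total time) for one task."""
    attempt_times = []
    start = time.time()
    for attempt in range(1, max_attempts + 1):
        t0 = time.time()
        reply = ollama.generate(model=model, prompt=PROMPT)["response"]
        try:
            solved = run_tests(extract_function_code(reply))
        except ValueError:
            solved = False
        attempt_times.append(time.time() - t0)  # time per attempt
        if solved:
            break
    return attempt, attempt_times, time.time() - start
```

For example, `run_experiment("llama3.2")` would return the three metrics for a single run of the smaller model.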
Number of Attempts
Observing the distribution of attempts to solve the task, we can see that llama3.1 has lower values (median = 33) than llama3.2 (median = 47). This difference is statistically significant according to the Mann-Whitney U test (used because of the non-normal distributions) at a significance level of 0.05 (p-value = 0.008). The box plot shows several outliers, with llama3.2 producing extreme values 6 and 10 times larger than its median.
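The analysis code is not shown in the post, but the reported comparison corresponds to a two-sided Mann-Whitney U test, for example with SciPy; the attempt counts below are placeholders, not the experiment's data.

```python
from scipy.stats import mannwhitneyu

# Number of attempts per solved run for each model (placeholder values;
# the per-run counts are not published in the post).
attempts_llama31 = [28, 31, 33, 36, 40, 29, 35]
attempts_llama32 = [44, 47, 51, 62, 45, 49, 58]

stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # the post reports p = 0.008 at the 0.05 level
```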
Time per Attempt
Looking at the average attempt times, llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
Total Time to Solve the Task
This is the main metric for users who want to receive working, tested code from the model. It reveals significant differences between the models:
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Shows a strong right-skewed distribution
  - Most efficient results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Shows a wider spread in completion times
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.
#testing