Number of Attempts
Observing the distribution of attempts to solve the task, we can see that llama3.1 needs fewer attempts (median = 33) than llama3.2 (median = 47). The difference is statistically significant according to the Mann-Whitney U test (used due to the non-normal distribution) at the 0.05 significance level (p-value = 0.008). The box plot shows several outliers, with llama3.2 having extreme values 6 and 10 times larger than its median.
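For readers who want to reproduce the comparison, here is a minimal sketch of such a test with SciPy (the per-task attempt counts below are placeholders, not the actual benchmark data):

from scipy.stats import mannwhitneyu

# Placeholder per-task attempt counts for each model (not the real benchmark data)
attempts_llama31 = [28, 31, 33, 33, 35, 36, 40]
attempts_llama32 = [44, 45, 47, 47, 52, 55, 60]

# Mann-Whitney U test: non-parametric, suitable for non-normal distributions
stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # difference is significant at the 0.05 level if p < 0.05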
Time per Attempt
Looking at the average attempt times, llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
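For context, outlier filtering of this kind is typically done with Tukey's IQR rule; a minimal sketch of my own (the original notebooks may filter differently):

import numpy as np

def drop_outliers(values, k=1.5):
    # Keep only points within k * IQR of the quartiles (Tukey's rule)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Example: per-attempt durations in seconds (placeholder values)
durations = [11, 12, 13, 12, 14, 12, 95, 12, 11, 13]
print(drop_outliers(durations))  # the 95 s outlier is removed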
Total Time to Solve the Task
This is the main metric for users who want to receive working, tested code from the model, and it reveals significant differences between the models (a sketch of how this metric can be derived from per-attempt logs follows the list):
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Strongly right-skewed distribution
  - Most results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Wider spread in completion times
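A minimal sketch of how this metric can be derived from per-attempt logs (my own illustration, assuming each attempt records a duration and a pass/fail flag; the original notebooks may differ):

def total_solve_time(attempts):
    # attempts: list of (duration_seconds, passed) tuples in chronological order;
    # total time to solve = sum of durations up to and including the first passing attempt
    total = 0.0
    for duration, passed in attempts:
        total += duration
        if passed:
            return total
    return None  # task never solved within the allotted attempts

# Placeholder example: three attempts, the third one passes the tests
print(total_solve_time([(12.0, False), (12.5, False), (13.0, True)]))  # 37.5 seconds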
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.

#testing
In addition to the llama models, I also tested the script with Alibaba's Qwen2.5 7B model, but encountered a significant performance issue: the first solve time was 1 day, 5:22:03 🤯.

Let's analyze the solution it provided:

def object_height(F=35, H=1152, p=3.45, D=6.5):
    # Constants in appropriate units for calculation
    F_mm = F  # focal length in mm (already in correct unit)
    H_px = H  # image height in pixels (already in correct unit)
    p_um = p * 1000  # pixel size in micrometers
    D_mm = D * 1000  # Convert object distance from meters to millimeters
    # Input validation
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")
    # Calculate image height in millimeters
    H_mm = H_px * p_um / 1000
    # Calculate object height using the lens equation
    h_object = (H_mm / F_mm) * D_mm
    return h_object  # Absolute value not needed in this context

While the function passes all tests, I spotted two logical issues:
- The pixel size conversion is oddly documented - it claims to convert from micrometers to micrometers (which makes no sense since they're the same unit!)
- The unit handling seems backward - instead of messing with focal length units, it should focus on converting the distance from meters to millimeters.
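For reference, the intended calculation is just the magnification relation: the image height on the sensor is H x p (1152 x 3.45 um ≈ 3.97 mm), scaled by D/F. A quick check of my own (not part of the original tests) with the default parameters:

# My own sanity check of the expected result (not from the benchmark or either model)
H_mm = 1152 * 3.45 / 1000        # image height on the sensor: pixels * um -> mm
h_mm = H_mm / 35 * (6.5 * 1000)  # object height = image height * (D / F), with D in mm
print(round(h_mm, 1))            # ~738.1 mm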
These issues become even more apparent when we look at the cleaner solution from the llama3.2 model:

def object_height(F=35, H=1152, p=3.45, D=6.5):
"""
Calculates the object height in mm from known image height, focal length, pixel size and object distance.
Args:
F (float): Focal length in mm, default 35 mm.
H (int): Image height in pixels, default 1152.
p (float): Pixel size in mkm, default 3.45 mkm.
D (float): Object distance in m, default 6.5 m.
Returns:
float: Object height in mm.
Raises:
RuntimeError: If any input parameter is invalid (negative or equal to zero).
"""
# Check for invalid input parameters
if F <= 0 or H <= 0 or p <= 0 or D <= 0:
raise RuntimeError("Invalid input parameters. All values must be positive.")
# Convert object distance from meters to millimeters
D_mm = D * 1000
# Calculate image height in millimeters
H_mm = H / 1000 * p
# Calculate object height using the lens equation and magnification formula
h_object = (H_mm / F) * D_mm
return abs(h_object)
Now this is more like it! Clear documentation, proper unit conversions, and straightforward calculations. The only minor quirk is the unnecessary abs() in the return statement - we don't really need it here. Otherwise, it's a solid implementation.

#testing
👨🔬 Testing Results: ROS2 Network Scanner Generation
I want to share the results from my test of the ROS2 Network Scanner generation example.
After running 30 iterations of generating the ROS2 Network Scanner:
• Total test duration: ~6 hours 15 minutes
• Average successful generation time: ~2 minutes per attempt
• Distribution of attempts: Right-skewed (median: 4, mean: 6.7)
This means that, on average, the generator produces working code in about 13 minutes (≈6.7 attempts × ~2 minutes per attempt) - quite reasonable performance for automated code generation in my opinion!
Failure Analysis
Looking at where generation stopped, the distribution clearly demonstrates the generator's stability:
• Over 80% stopped at the testing stage
• ~15% were successful attempts
• Only about 5% failed during the PARSING or GENERATION stages
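A minimal sketch of how these proportions could be tallied (assuming a per-attempt results table with a stop_stage column; the actual log format in the repository may differ):

import pandas as pd

# Hypothetical per-attempt log; stop_stage records where each attempt ended (placeholder data)
df = pd.DataFrame({
    "stop_stage": ["TESTING"] * 81 + ["SUCCESS"] * 14 + ["PARSING"] * 3 + ["GENERATION"] * 2
})

# Share of attempts that stopped at each stage
print(df["stop_stage"].value_counts(normalize=True).round(2))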
Test Coverage Patterns
Examining the test pass rates revealed two distinct patterns:
• Basic functionality (7 tests): Node startup with valid/invalid parameters and handling overlapping scans using the nscan utility
• Advanced scenarios (9 tests): Including handling invalid JSON format in the node <-> nscan interface and managing outdated scan results
This suggests that generating code with specific behavior for edge cases remains challenging.
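To illustrate the kind of edge case involved, handling a malformed JSON payload from the scanner boils down to a guarded parse like the generic sketch below (my own example, not the generated node's actual code):

import json

def parse_scan_output(raw: str):
    # Return the parsed scan results, or None if the nscan output is not valid JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(parse_scan_output('{"hosts": ["10.0.0.1"]}'))  # {'hosts': ['10.0.0.1']}
print(parse_scan_output("not json"))                 # None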
I've included all metrics and analysis notebooks in my project repository, so feel free to explore the data yourself!
#ROS2 #AI #NetworkScanning #Robotics #CodeGenerating #Codestral #testing