Number of Attempts
Observing the distribution of attempts to solve the task, we can see that llama3.1 needs fewer attempts (median = 33) than llama3.2 (median = 47). The difference is statistically significant according to the Mann-Whitney U test (used due to the non-normal distribution) at the 0.05 significance level (p-value = 0.008). The box plot shows several outliers, with llama3.2 having extreme values 6 and 10 times larger than its median.
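For readers who want to reproduce the comparison, here is a minimal sketch of such a test with SciPy (the per-task attempt counts below are placeholders, not the actual benchmark data):

from scipy.stats import mannwhitneyu

# Placeholder per-task attempt counts for each model (not the real benchmark data)
attempts_llama31 = [28, 31, 33, 33, 35, 36, 40]
attempts_llama32 = [44, 45, 47, 47, 52, 55, 60]

# Mann-Whitney U test: non-parametric, suitable for non-normal distributions
stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # difference is significant at the 0.05 level if p < 0.05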
Time per Attempt
Looking at the average attempt times, llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
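For context, outlier filtering of this kind is typically done with Tukey's IQR rule; a minimal sketch of my own (the original notebooks may filter differently):

import numpy as np

def drop_outliers(values, k=1.5):
    # Keep only points within k * IQR of the quartiles (Tukey's rule)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Example: per-attempt durations in seconds (placeholder values)
durations = [11, 12, 13, 12, 14, 12, 95, 12, 11, 13]
print(drop_outliers(durations))  # the 95 s outlier is removed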
Total Time to Solve the Task
This is the main metric for users who want to receive working, tested code from the model, and it reveals significant differences between the models (a sketch of how this metric can be derived from per-attempt logs follows the list):
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Strongly right-skewed distribution
  - Most results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Wider spread in completion times
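A minimal sketch of how this metric can be derived from per-attempt logs (my own illustration, assuming each attempt records a duration and a pass/fail flag; the original notebooks may differ):

def total_solve_time(attempts):
    # attempts: list of (duration_seconds, passed) tuples in chronological order;
    # total time to solve = sum of durations up to and including the first passing attempt
    total = 0.0
    for duration, passed in attempts:
        total += duration
        if passed:
            return total
    return None  # task never solved within the allotted attempts

# Placeholder example: three attempts, the third one passes the tests
print(total_solve_time([(12.0, False), (12.5, False), (13.0, True)]))  # 37.5 seconds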
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.

#testing
In addition to the llama models, I also tested the script with Alibaba's Qwen2.5 7B model, but encountered a significant performance issue: the first solve time was 1 day, 5:22:03 🤯.

Let's analyze the solution it provided:

def object_height(F=35, H=1152, p=3.45, D=6.5):
    # Constants in appropriate units for calculation
    F_mm = F  # focal length in mm (already in correct unit)
    H_px = H  # image height in pixels (already in correct unit)
    p_um = p * 1000  # pixel size in micrometers
    D_mm = D * 1000  # Convert object distance from meters to millimeters
    # Input validation
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")
    # Calculate image height in millimeters
    H_mm = H_px * p_um / 1000
    # Calculate object height using the lens equation
    h_object = (H_mm / F_mm) * D_mm
    return h_object  # Absolute value not needed in this context

While the function passes all tests, I spotted two logical issues:
- The pixel size conversion is oddly documented - it claims to convert from micrometers to micrometers (which makes no sense since they're the same unit!)
- The unit handling seems backward - instead of messing with focal length units, it should focus on converting the distance from meters to millimeters.
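For reference, the intended calculation is just the magnification relation: the image height on the sensor is H x p (1152 x 3.45 um ≈ 3.97 mm), scaled by D/F. A quick check of my own (not part of the original tests) with the default parameters:

# My own sanity check of the expected result (not from the benchmark or either model)
H_mm = 1152 * 3.45 / 1000        # image height on the sensor: pixels * um -> mm
h_mm = H_mm / 35 * (6.5 * 1000)  # object height = image height * (D / F), with D in mm
print(round(h_mm, 1))            # ~738.1 mm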
These issues become even more apparent when we look at the cleaner solution from the llama3.2 model:

def object_height(F=35, H=1152, p=3.45, D=6.5):
"""
Calculates the object height in mm from known image height, focal length, pixel size and object distance.
Args:
F (float): Focal length in mm, default 35 mm.
H (int): Image height in pixels, default 1152.
p (float): Pixel size in mkm, default 3.45 mkm.
D (float): Object distance in m, default 6.5 m.
Returns:
float: Object height in mm.
Raises:
RuntimeError: If any input parameter is invalid (negative or equal to zero).
"""
# Check for invalid input parameters
if F <= 0 or H <= 0 or p <= 0 or D <= 0:
raise RuntimeError("Invalid input parameters. All values must be positive.")
# Convert object distance from meters to millimeters
D_mm = D * 1000
# Calculate image height in millimeters
H_mm = H / 1000 * p
# Calculate object height using the lens equation and magnification formula
h_object = (H_mm / F) * D_mm
return abs(h_object)
Now this is more like it! Clear documentation, proper unit conversions, and straightforward calculations. The only minor quirk is the unnecessary abs() in the return statement - we don't really need it here. Otherwise, it's a solid implementation.

#testing
👨🔬 Testing Results: ROS2 Network Scanner Generation
I want to share the results from my test of the ROS2 Network Scanner generation example.
After running 30 iterations of generating the ROS2 Network Scanner:
• Total test duration: ~6 hours 15 minutes
• Average successful generation time: ~2 minutes per attempt
• Distribution of attempts: Right-skewed (median: 4, mean: 6.7)
This means that, on average, the generator produces working code in about 13 minutes (≈6.7 attempts × ~2 minutes per attempt) - quite reasonable performance for automated code generation in my opinion!
Failure Analysis
Looking at where generation stopped, the distribution clearly demonstrates the generator's stability:
• Over 80% stopped at the testing stage
• ~15% were successful attempts
• Only about 5% failed during the PARSING or GENERATION stages
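A minimal sketch of how these proportions could be tallied (assuming a per-attempt results table with a stop_stage column; the actual log format in the repository may differ):

import pandas as pd

# Hypothetical per-attempt log; stop_stage records where each attempt ended (placeholder data)
df = pd.DataFrame({
    "stop_stage": ["TESTING"] * 81 + ["SUCCESS"] * 14 + ["PARSING"] * 3 + ["GENERATION"] * 2
})

# Share of attempts that stopped at each stage
print(df["stop_stage"].value_counts(normalize=True).round(2))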
Test Coverage Patterns
Examining the test pass rates revealed two distinct patterns:
• Basic functionality (7 tests): Node startup with valid/invalid parameters and handling overlapping scans using the nscan utility
• Advanced scenarios (9 tests): Including handling invalid JSON format in the node <-> nscan interface and managing outdated scan results
This suggests that generating code with specific behavior for edge cases remains challenging.
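To illustrate the kind of edge case involved, handling a malformed JSON payload from the scanner boils down to a guarded parse like the generic sketch below (my own example, not the generated node's actual code):

import json

def parse_scan_output(raw: str):
    # Return the parsed scan results, or None if the nscan output is not valid JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(parse_scan_output('{"hosts": ["10.0.0.1"]}'))  # {'hosts': ['10.0.0.1']}
print(parse_scan_output("not json"))                 # None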
I've included all metrics and analysis notebooks in my project repository, so feel free to explore the data yourself!
#ROS2 #AI #NetworkScanning #Robotics #CodeGenerating #Codestral #testing