Number of Attempts
Observing the distribution of attempts to solve the task, we can see that model llama3.1 has lower values (median = 33) compared to model llama3.2 (median = 47). This difference is statistically significant according to the Mann-Whitney U test (used due to the non-normal distributions) at a confidence level of 0.05 (p-value = 0.008). The box plot shows several outliers, with model llama3.2 having extreme values that are 6 and 10 times larger than its median.
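For readers who want to reproduce this kind of check, here is a minimal sketch of the comparison, assuming the per-run attempt counts for the two models have already been collected into two Python lists (the variable names and data below are placeholders, not values from my runs):

```python
# Compare two non-normal samples of attempt counts with the Mann-Whitney U test.
from scipy.stats import mannwhitneyu

attempts_llama31 = [33, 28, 41, 37, 52, 30, 95]    # placeholder data
attempts_llama32 = [47, 44, 61, 39, 280, 50, 470]  # placeholder data

stat, p_value = mannwhitneyu(attempts_llama31, attempts_llama32,
                             alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
# With p < 0.05 we reject the hypothesis that both samples come from the same distribution.
```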
Time per Attempt
Looking at the average attempt times, model llama3.2 (3.2B parameters) is significantly faster, taking around 12 seconds per attempt compared to 58 seconds for llama3.1 (8B parameters). The histogram shows the distribution with outliers removed for clearer visualization, revealing a distinct speed advantage for the smaller model.
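As an aside, a simple way to drop outliers before plotting such a histogram is an IQR filter; this is only an illustrative sketch with placeholder data, since the post doesn't specify which trimming rule was used:

```python
# Trim outliers with a 1.5 * IQR rule before plotting the attempt-time histogram.
import numpy as np
import matplotlib.pyplot as plt

times_s = np.array([12, 11, 14, 13, 58, 60, 55, 240, 9, 15])  # placeholder seconds
q1, q3 = np.percentile(times_s, [25, 75])
iqr = q3 - q1
keep = (times_s >= q1 - 1.5 * iqr) & (times_s <= q3 + 1.5 * iqr)

plt.hist(times_s[keep], bins=20)
plt.xlabel("Attempt time, s")
plt.ylabel("Count")
plt.show()
```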
Total Time to Solve the Task
This is the main metric for users who want to receive working and tested code from the model. It reveals significant differences between the models:
- llama3.2 (3.2B parameters):
  - Median solution time: 13 minutes
  - More than 50% of tasks completed within 15 minutes
  - Strongly right-skewed distribution
  - Most results clustered in the 5-20 minute range
- llama3.1 (8B parameters):
  - Median solution time: ~50 minutes
  - More uniform distribution across 20-120 minutes
  - Wider spread in completion times
Despite having fewer parameters, the newer llama3.2 model proves significantly more efficient for this task, offering faster completion times.
#testing
In addition to the llama models, I also tested the script with Alibaba's Qwen2.5 7B model, but encountered a significant performance issue: the first solve time was 1 day, 5:22:03.
Let's analyze the solution it provided:
```python
def object_height(F=35, H=1152, p=3.45, D=6.5):
    # Constants in appropriate units for calculation
    F_mm = F  # focal length in mm (already in correct unit)
    H_px = H  # image height in pixels (already in correct unit)
    p_um = p * 1000  # pixel size in micrometers
    D_mm = D * 1000  # Convert object distance from meters to millimeters

    # Input validation
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")

    # Calculate image height in millimeters
    H_mm = H_px * p_um / 1000

    # Calculate object height using the lens equation
    h_object = (H_mm / F_mm) * D_mm

    return h_object  # Absolute value not needed in this context
```

While the function passes all tests, I spotted two logical issues:
- The pixel size conversion is oddly documented - it claims to convert from micrometers to micrometers (which makes no sense since they're the same unit!)
- The unit handling seems backward - instead of messing with focal length units, it should focus on converting the distance from meters to millimeters (see the quick unit check below).
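To make the unit criticism concrete, here is a quick sanity check of the conversions with the default parameters (my own illustration of the expected arithmetic, not part of any model's output):

```python
# Unit check with the defaults: F = 35 mm, H = 1152 px, p = 3.45 um/px, D = 6.5 m.
H_px, p_um, F_mm, D_m = 1152, 3.45, 35, 6.5

image_height_mm = H_px * p_um / 1000  # 1152 * 3.45 um = 3974.4 um = 3.9744 mm
D_mm = D_m * 1000                     # 6.5 m = 6500 mm

# Thin-lens magnification (D >> F): object height = image height * D / F.
object_height_mm = image_height_mm * D_mm / F_mm
print(round(object_height_mm, 1))     # ~738.1 mm
```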
These issues become even more apparent when we look at the cleaner solution from the llama3.2 model:

```python
def object_height(F=35, H=1152, p=3.45, D=6.5):
    """
    Calculates the object height in mm from known image height, focal length, pixel size and object distance.

    Args:
        F (float): Focal length in mm, default 35 mm.
        H (int): Image height in pixels, default 1152.
        p (float): Pixel size in mkm, default 3.45 mkm.
        D (float): Object distance in m, default 6.5 m.

    Returns:
        float: Object height in mm.

    Raises:
        RuntimeError: If any input parameter is invalid (negative or equal to zero).
    """
    # Check for invalid input parameters
    if F <= 0 or H <= 0 or p <= 0 or D <= 0:
        raise RuntimeError("Invalid input parameters. All values must be positive.")

    # Convert object distance from meters to millimeters
    D_mm = D * 1000

    # Calculate image height in millimeters
    H_mm = H / 1000 * p

    # Calculate object height using the lens equation and magnification formula
    h_object = (H_mm / F) * D_mm

    return abs(h_object)
```

Now this is more like it! Clear documentation, proper unit conversions, and straightforward calculations. The only minor quirk is the unnecessary abs() in the return statement - we don't really need it here. Otherwise, it's a solid implementation.
#testing
Writing a Python Function with Code-Specific Models
My goal is to create a local AI agent that understands my projects' context and generates working code efficiently. While general-purpose LLMs are impressive, code-specific models should be better suited for this task.
I tested the following models:
- CodeLlama (13B parameters),
- CodeStral (22B parameters),
- Qwen2.5-Coder (14B parameters),
- DeepSeek-Coder-v2 (16B parameters)
For the testing environment, I spun up a server with a Tesla T4 GPU (16GB VRAM). This hardware constraint helped narrow down my model selection.
I gave each model the same test: generate a function that calculates an object's size using known image parameters and distance, as described earlier. To improve code extraction from the responses, I implemented a more robust parsing pattern:
```python
# Extract the code block from the model's raw response (out_str),
# whether it is wrapped in ``` fences or in [PYTHON]...[/PYTHON] tags.
import re

match = re.search(
    r"(?:`{3}\w*|\[PYTHON\])\n(.+?)(?:`{3}|\[\/PYTHON\])",
    out_str,
    flags=re.DOTALL,
)
```
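As a usage illustration (the sample response string below is invented, not taken from a real model run):

```python
import re

out_str = (
    "Sure, here is the function:\n"
    "[PYTHON]\n"
    "def object_height(F=35, H=1152, p=3.45, D=6.5):\n"
    "    return H * p / 1000 / F * (D * 1000)\n"
    "[/PYTHON]\n"
)

match = re.search(
    r"(?:`{3}\w*|\[PYTHON\])\n(.+?)(?:`{3}|\[\/PYTHON\])",
    out_str,
    flags=re.DOTALL,
)
if match:
    candidate_code = match.group(1)  # the extracted function, ready to exec and test
    print(candidate_code)
```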
Number of Attempts
The results show significant variance between the models' performance. CodeStral emerged as the clear leader, requiring a median of only 5 attempts to complete the task. In stark contrast, Alibaba's Qwen2.5 needed around 110 attempts (22 times more!) to achieve the same result.
Comparing these results with previous tests using general-purpose models, only CodeStral and CodeLlama showed improved performance. This suggests that being a code-specific model doesn't automatically guarantee better efficiency.
Analyzing Qwen2.5's Performance Issues
Upon investigating the poor performance of the Qwen2.5 model, I discovered a persistent error pattern. Out of approximately 1,230 attempts to write the function, 1,036 (84%) failed due to the same mathematical mistake: using (f * h * p / d) instead of the correct formula (h * p * d / f).
This consistent misapplication of the formula is particularly interesting, as other models showed more variety in their attempts and errors. This behavior suggests that Qwen2.5 might be "stuck" in a local optimum, repeatedly suggesting the same incorrect solution rather than exploring alternative approaches like its competitors.
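For scale, here is what that swap does numerically with the default parameters of the earlier function, where f is the focal length in mm, h the image height in pixels, p the pixel size in um, and d the distance in m (my own illustration):

```python
# Qwen2.5's recurring (wrong) formula vs. the correct one.
f, h, p, d = 35, 1152, 3.45, 6.5

wrong = f * h * p / d    # about 21400: a dimensionally meaningless mix of units
correct = h * p * d / f  # about 738.1: the object height in mm
                         # (the um-to-mm and m-to-mm factors of 1000 cancel out)
print(wrong, correct)
```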
Attempt time
The models showed significant variations in their processing speeds. Here's how they ranked by median response time:
- CodeLlama: 5 seconds (fastest),
- DeepSeek-Coder: 8 seconds,
- CodeStral: 15 seconds,
- Qwen2.5: 50 seconds (significantly slower).
A pairwise Mann-Whitney U test with a confidence level of 0.05 confirms that the differences in attempt times between models are statistically significant. This statistical analysis reinforces that the observed performance differences are not due to random variation but represent genuine differences in model capabilities.
Qwen2.5's combination of slow processing (50-second median response time) and high number of attempts (median 110) makes it particularly inefficient for this task. While other models complete the task in 5-15 seconds with few attempts, Qwen2.5 requires significantly more resources to generate solutions - which are often incorrect.
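A minimal sketch of how such pairwise comparisons can be run, assuming the attempt times for each model are kept in a dict of lists (names and numbers below are placeholders):

```python
# Pairwise Mann-Whitney U tests over all model pairs.
from itertools import combinations
from scipy.stats import mannwhitneyu

attempt_times_s = {  # placeholder samples, seconds
    "codellama": [5, 6, 4, 7, 5, 9],
    "deepseek-coder": [8, 9, 7, 10, 8, 12],
    "codestral": [15, 14, 18, 13, 16, 20],
    "qwen2.5": [50, 48, 55, 61, 47, 90],
}

for a, b in combinations(attempt_times_s, 2):
    stat, p = mannwhitneyu(attempt_times_s[a], attempt_times_s[b],
                           alternative="two-sided")
    print(f"{a} vs {b}: U = {stat:.1f}, p = {p:.4f}")
```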
Time to solve task
Each model attempted to write the working function 60 times. The results reveal two distinct performance groups:
Group 1 (Statistically Similar):
- CodeLlama: 1.4 minutes median
- CodeStral: 1.6 minutes median
A Mann-Whitney U test (α=0.05) confirms no significant difference between these two models.
Group 2:
- DeepSeek-Coder: 5 minutes median
- Qwen2.5: Excluded due to inefficiency
Conclusion
CodeLlama and CodeStral emerge as the most promising candidates for my AI agent development. While other models might improve with tuning, my next step will focus on implementing feedback mechanisms to enhance these two models' performance. I'll explore using chat mode instead of generate mode to leverage feedback. The key difference (sketched below) is:
- Generate mode: One-shot code generation
- Chat mode: Interactive process where previous responses can guide subsequent attempts.
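The post series doesn't pin down the serving stack, so purely as an illustration, here is roughly what the two modes look like with an Ollama-style Python client (model name, prompts, and error text are placeholders):

```python
# One-shot "generate" vs. multi-turn "chat" against a local model server.
import ollama  # assumes a local Ollama server; any similar client would do

prompt = "Write a Python function object_height(F, H, p, D) that ..."

# Generate mode: every attempt starts from scratch.
result = ollama.generate(model="codellama", prompt=prompt)
code_attempt = result["response"]

# Chat mode: the history, including failed attempts and test errors,
# can steer the next request.
messages = [{"role": "user", "content": prompt}]
reply = ollama.chat(model="codellama", messages=messages)
messages.append(reply["message"])
messages.append({"role": "user",
                 "content": "The tests failed with: AssertionError ... Please fix the function."})
reply = ollama.chat(model="codellama", messages=messages)
```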
Comparing Phi4 with CodeLlama and CodeStral
After seeing the news about Phi4 - a 'state-of-the-art open small-size model from Microsoft' - I couldn't pass it by and tested it in generate mode, comparing it with my previous favorites: CodeLlama and CodeStral.
Here are my results:
- smallest number of attempts to generate working code - 3.5 (median);
- however, highest attempt time - 30.5 seconds (median) on a Tesla T4 GPU;
- total task solving time - 1.9 minutes (median), slightly higher than the others.
All three distributions of total solving time are non-normal. Despite the close medians, a Kruskal-Wallis test (α=0.05) shows significant differences. Pairwise Mann-Whitney U tests (α=0.05) revealed:
- CodeLlama vs CodeStral: no significant difference;
- CodeLlama vs Phi4: Phi4 has significantly higher values;
- CodeStral vs Phi4: no significant difference.
These results from Phi4, a general-purpose model, are impressive. It's a strong candidate to join CodeLlama and CodeStral in further research.
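A minimal sketch of this two-step check (an omnibus Kruskal-Wallis test followed by pairwise Mann-Whitney U tests), assuming the per-run solving times are collected in three lists with placeholder values:

```python
# Omnibus test across the three models, then pairwise follow-up if it is significant.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

solve_minutes = {  # placeholder samples, minutes
    "codellama": [1.4, 1.1, 2.0, 1.6, 1.3],
    "codestral": [1.6, 1.5, 2.2, 1.4, 1.9],
    "phi4": [1.9, 2.4, 1.7, 2.8, 2.1],
}

h_stat, p_kw = kruskal(*solve_minutes.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")

if p_kw < 0.05:
    for a, b in combinations(solve_minutes, 2):
        _, p = mannwhitneyu(solve_minutes[a], solve_minutes[b], alternative="two-sided")
        print(f"{a} vs {b}: p = {p:.4f}")
```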
Failed Attempt to Use Feedback for Improving Performance
My assumptions about using feedback proved unsuccessful. I tried two different approaches:
- creating a chat that collected detailed error descriptions whenever the checks failed;
- in generate mode, adding the non-working code from the previous attempt to the initial prompt.
In both scenarios, the models eventually became "stuck" on certain wrong answers (as the Qwen2.5 model did in the previous test), and the attempt count increased dramatically. I tested both the general-purpose Llama3.2 and the code-specific Codestral - the results were the same. While I could have tried something more sophisticated for the feedback, I decided not to pursue this path further for now.
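For the second approach, the prompt augmentation looked conceptually like the sketch below (my reconstruction for illustration only; the prompt wording, variable names, and error text are not from the original scripts):

```python
# Feed the previous failing attempt back into the next generate-mode prompt.
base_prompt = "Write a Python function object_height(F, H, p, D) that ..."

def build_prompt(base_prompt, last_code=None, last_error=None):
    """Append the previous failed attempt and its test error to the base prompt."""
    if last_code is None:
        return base_prompt
    return (
        f"{base_prompt}\n\n"
        f"Your previous attempt failed:\n{last_code}\n"
        f"Test output:\n{last_error}\n"
        f"Please return a corrected version."
    )

next_prompt = build_prompt(base_prompt,
                           last_code="def object_height(F, H, p, D): ...",
                           last_error="AssertionError: expected ~738 mm")
# In my runs, the growing prompt tended to anchor the model on the same wrong
# answer instead of helping it explore alternatives.
```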
Keep a Cool Head - Tuning the Models' Temperature
The temperature parameter controls how diverse a model's outputs can be. Lower temperature values make the model more deterministic, causing it to focus on the most likely responses for better accuracy. I conducted experiments with this parameter on three selected models (Phi4, Codestral, and Codellama), which revealed some interesting patterns.
Each model was tasked with generating the code of the object_height function 30 times to pass a specific test. A clear trend emerged across all models: a lower temperature of 0.2 consistently delivered the best performance in generating the test function code. This finding was statistically validated using the Kruskal-Wallis test (alpha = 0.05), both with and without the results from this temperature group. The conclusion is clear - for code generation, it's best to keep the model's head cool.
The Phi4 model's performance at temperature 0.2 is particularly impressive - generating working code on the first attempt in 17 out of 30 trials (over 55%) - an outstanding result!
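Purely as an illustration of the setup (the serving stack isn't named in the post), this is roughly how a temperature sweep can be run with an Ollama-style client; the temperature values are the ones discussed above, everything else is a placeholder:

```python
# Sweep the sampling temperature and count first-attempt successes out of 30 runs.
import ollama  # assumed Ollama-style local client, for illustration only

prompt = "Write a Python function object_height(F, H, p, D) that ..."

def passes_test(code_str):
    # Placeholder for the real harness that exec'd the code and ran the unit test.
    return "def object_height" in code_str

for temperature in (0.2, 0.5, 0.8):
    first_try_ok = 0
    for _ in range(30):
        result = ollama.generate(model="phi4", prompt=prompt,
                                 options={"temperature": temperature})
        if passes_test(result["response"]):
            first_try_ok += 1
    print(f"T={temperature}: {first_try_ok}/30 passed on the first attempt")
```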
The second insight comes from examining the default temperature values across different models, which we can infer from the graph. The general-purpose Phi4 operates with a default temperature around 0.5-0.7, allowing for more creative responses across various scenarios. Interestingly, the code-specific Codestral model has a default temperature of about 0.2 - a setting that aligns well with its specialized purpose. Perhaps surprisingly, the Codellama model runs with a higher default temperature of around 0.5-0.7, despite its code-focused nature.
These findings highlight that tuning the temperature parameter is a crucial step in optimizing code generation performance. The time invested in such experiments is clearly worthwhile, as it can significantly impact the model's effectiveness in generating correct code.
It looks like we're entering the sunset years of traditional software engineering.
GitHub Copilot: Using Your Workspace as Context
Recently I received an email saying that GitHub's AI code assistant Copilot is now free. I've tried a few different AI assistant extensions in VS Code - the last one was Tabnine AI, which is actually pretty good at creating docstrings and live code completion. Since it's always interesting to try something new, I decided to give Copilot a shot.
This extension has two main features - a chat in the side panel and live code completion suggestions, which are standard for these kinds of assistants. It uses two models: GPT-4 by default and Claude 3.5 Sonnet, which is pretty impressive for a free assistant.
My favorite Copilot feature is using your workspace as context for the model. Here's a PyQt utility project I'm working on that handles autofocusing for custom cameras. It runs multiple threads to manage UI interactions, camera motor movements, and image capture. The autofocus process involves several modules with complex data flow between them. If you need to refresh your understanding of the whole system, you can open the Copilot chat and ask questions about your entire workspace. What you get is a high-level description with clickable links to navigate through your project - it's really cool and super convenient. I haven't seen such a helpful feature in other assistants.
Let's try something else. When you ask for code improvements, Copilot provides the updated code of a specific function in a particular module. Since you're in the project context, you can use the "Apply in Editor" option - Copilot automatically inserts the changes right into your code. You can review these changes and decide whether to keep them.
So that's my quick look at Copilot. While it has the usual AI features, some extras make it stand out. Since it's free now, it's worth playing around with it and seeing how it fits your workflow. Thanks for listening! Let's go design!