AI & Robotics Lab
Explore AI code generation, robotics, and ROS with original projects and hands-on guides. Follow along as I share my experience, code samples, and tips for building intelligent systems.
Writing a Python Function with Code-Specific Models

My goal is to create a local AI agent that understands my projects' context and generates working code efficiently. While general-purpose LLMs are impressive, code-specific models should be better suited for this task.

I tested the following models:
- CodeLlama (13B parameters),
- CodeStral (22B parameters),
- Qwen2.5-Coder (14B parameters),
- DeepSeek-Coder-v2 (16B parameters)

For the testing environment, I spun up a server with a Tesla T4 GPU (16GB VRAM). This hardware constraint helped narrow down my model selection.

I gave each model the same test: generate a function that calculates an object's size using known image parameters and distance, as described earlier. To improve code extraction from the responses, I implemented a more robust parsing pattern:

import re

# Pull the generated code out of the model's raw reply
match = re.search(
    r"(?:`{3}\w*|\[PYTHON\])\n(.+?)(?:`{3}|\[\/PYTHON\])",
    out_str,
    flags=re.DOTALL,
)
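The alternation covers both Markdown-style triple-backtick fences and the [PYTHON]...[/PYTHON] tags that CodeLlama tends to emit, and with re.DOTALL the captured group(1) spans the whole multi-line snippet - so a single pattern can handle the different models' reply formats (wrapping it in a small helper such as extract_code(out_str) that returns group(1) or None keeps the test loop tidy).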

Number of Attempts

The results show significant variance between the models' performance. CodeStral emerged as the clear leader, requiring a median of only 5 attempts to complete the task. In stark contrast, Alibaba's Qwen2.5 needed around 110 attempts (22 times more!) to achieve the same result.

Compared with my earlier tests of general-purpose models, only CodeStral and CodeLlama showed improved performance. This suggests that being a code-specific model doesn't automatically guarantee better efficiency.
Analyzing Qwen2.5's Performance Issues

Upon investigating the poor performance of the Qwen2.5 model, I discovered a persistent error pattern. Out of approximately 1,230 attempts to write the function, 1,036 (84%) failed due to the same mathematical mistake: using ( f * h * p / d ) instead of the correct formula ( h * p * d / f ).

This consistent misapplication of the formula is particularly interesting, as other models showed more variety in their attempts and errors. This behavior suggests that Qwen2.5 might be "stuck" in a local optimum, repeatedly suggesting the same incorrect solution rather than exploring alternative approaches like its competitors.
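For reference, here is a minimal sketch of the kind of function the test expects, assuming the usual pinhole-camera reading of the variables (h: object height in pixels, p: physical pixel size, d: distance to the object, f: focal length); the parameter names and signature are my guess, not the original test's:

def object_height(h: float, p: float, d: float, f: float) -> float:
    """Estimate the real-world height of an object from its image.

    h: object height in the image, in pixels
    p: physical size of one pixel on the sensor (same length unit as d and f)
    d: distance from the camera to the object
    f: focal length of the lens
    """
    # Pinhole-camera model: the height on the sensor (h * p) scaled by distance over focal length
    return h * p * d / f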
Attempt time

The models showed significant variations in their processing speeds. Here's how they ranked by median response time:

- CodeLlama: 5 seconds (fastest),
- DeepSeek-Coder: 8 seconds,
- CodeStral: 15 seconds,
- Qwen2.5: 50 seconds (significantly slower).

A pairwise Mann-Whitney U test at a significance level of 0.05 confirms that the differences in attempt times between the models are statistically significant. In other words, the observed differences are not due to random variation but reflect genuine differences in the models' capabilities.
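As a rough illustration, the pairwise comparison can be run with SciPy along these lines (the data layout and function name are my assumption; the post doesn't show its analysis code):

from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_mannwhitney(attempt_times: dict[str, list[float]], alpha: float = 0.05) -> None:
    """Compare every pair of models' attempt-time samples with a two-sided Mann-Whitney U test."""
    for a, b in combinations(attempt_times, 2):
        stat, p = mannwhitneyu(attempt_times[a], attempt_times[b], alternative="two-sided")
        verdict = "significant" if p < alpha else "not significant"
        print(f"{a} vs {b}: U={stat:.1f}, p={p:.4g} -> {verdict}")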

Qwen2.5's combination of slow processing (50-second median response time) and high number of attempts (median 110) makes it particularly inefficient for this task. While other models complete the task in 5-15 seconds with few attempts, Qwen2.5 requires significantly more resources to generate solutions - which are often incorrect.
Time to solve task

Each model attempted to write the working function 60 times. The results reveal two distinct performance groups:

Group 1 (Statistically Similar):
- CodeLlama: 1.4 minutes median
- CodeStral: 1.6 minutes median
A Mann-Whitney U test (α = 0.05) confirms no significant difference between these two models.

Group 2:
- DeepSeek-Coder: 5 minutes median
- Qwen2.5: Excluded due to inefficiency
Conclusion

CodeLlama and CodeStral emerge as the most promising candidates for my AI agent development. While the other models might improve with tuning, my next step will focus on implementing feedback mechanisms to enhance these two models' performance. I'll explore using chat mode instead of generate mode to leverage feedback. The key difference (sketched just after this list) is:
- Generate mode: One-shot code generation
- Chat mode: Interactive process where previous responses can guide subsequent attempts.
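The post doesn't name the serving stack, but the generate/chat wording matches Ollama's API, so a minimal sketch of the difference might look like this (the model name and prompts are illustrative, not the actual ones used):

import ollama  # assumption: the local models are served via Ollama

# Generate mode: one-shot request, no memory of earlier attempts
reply = ollama.generate(model="codestral", prompt="Write object_height(...) in Python")
print(reply["response"])

# Chat mode: the message history carries earlier attempts and feedback forward
messages = [{"role": "user", "content": "Write object_height(...) in Python"}]
answer = ollama.chat(model="codestral", messages=messages)
messages.append({"role": "assistant", "content": answer["message"]["content"]})
messages.append({"role": "user", "content": "That version failed the test, please fix it."})
answer = ollama.chat(model="codestral", messages=messages)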
Comparing Phi4 with CodeLlama and CodeStral

After seeing the news about Phi4 - Microsoft's 'state-of-the-art open small-size model' - I couldn't pass it by and tested it in generate mode against my previous favorites: CodeLlama and CodeStral.

Here are my results:
- the smallest number of attempts to generate working code: 3.5 (median);
- however, the highest attempt time: 30.5 seconds (median) on the Tesla T4 GPU;
- total task-solving time: 1.9 minutes (median), slightly higher than the others.

All three distributions of total solving time are non-normal. Despite the close medians, a Kruskal-Wallis test (α = 0.05) shows significant differences. Pairwise Mann-Whitney U tests (α = 0.05) revealed:
- CodeLlama vs CodeStral: no significant difference;
- CodeLlama vs Phi4: Phi4 has significantly higher values;
- CodeStral vs Phi4: no significant difference.

These results from Phi4, a general-purpose model, are impressive. It's a strong candidate to join CodeLlama and CodeStral in further research.
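For completeness, the normality and omnibus checks described above can be reproduced with SciPy roughly like this (again, the per-model data layout and function name are my assumption):

from scipy.stats import kruskal, shapiro

def compare_solving_times(times: dict[str, list[float]], alpha: float = 0.05) -> None:
    """Shapiro-Wilk per model, then Kruskal-Wallis across all models' solving times."""
    for name, values in times.items():
        _, p_norm = shapiro(values)
        print(f"{name}: Shapiro-Wilk p={p_norm:.4g} ({'non-normal' if p_norm < alpha else 'looks normal'})")
    stat, p = kruskal(*times.values())
    print(f"Kruskal-Wallis: H={stat:.2f}, p={p:.4g} "
          f"({'at least one model differs' if p < alpha else 'no significant difference'})")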
Failed Attempt to Use Feedback for Improving Performance

My assumption that feedback would boost performance proved wrong. I tried two different approaches:
- creating a chat that collected detailed error descriptions whenever the checks failed;
- in generate mode, adding the non-working code from the previous attempt to the initial prompt.

In both scenarios, the models eventually became "stuck" on certain wrong answers (as the Qwen2.5 model did in the previous test), and the attempt count increased dramatically. I tested both the general-purpose Llama3.2 and the code-specific Codestral - the results were the same. While I could have tried something more sophisticated with the feedback, I decided not to pursue this path further for now.
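For the record, the second approach amounted to a loop along these lines - a sketch under the assumption of an Ollama-style API, where extract_code is a wrapper around the parsing regex shown earlier and passes_test stands in for the actual check; both helpers are my placeholders, not code from the post:

import ollama  # assumption: the local models are served via Ollama

def generate_with_feedback(model: str, base_prompt: str, max_attempts: int = 100) -> str | None:
    """Generate-mode loop that feeds the previous failing code back into the prompt."""
    prompt = base_prompt
    for _ in range(max_attempts):
        reply = ollama.generate(model=model, prompt=prompt)["response"]
        code = extract_code(reply)        # regex-based helper sketched earlier
        if code and passes_test(code):    # hypothetical test harness
            return code
        # Feed the non-working attempt back in and ask for a different solution
        prompt = (f"{base_prompt}\n\nThe previous attempt below failed the test, "
                  f"write a corrected version:\n{code}")
    return None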
Keep a Cool Head - Tuning the Models' Temperature

The temperature parameter controls how diverse a model's outputs can be. Lower temperature values make the model more deterministic, causing it to focus on the most likely responses for better accuracy. I conducted experiments with this parameter on three selected models (Phi4, Codestral, and Codellama), which revealed some interesting patterns.

Each model was tasked with generating code for the object_height function 30 times to pass a specific test. A clear trend emerged across all models: a low temperature of 0.2 consistently delivered the best performance in generating the test function's code. This finding was statistically validated using the Kruskal-Wallis test (α = 0.05), run both with and without the results from the 0.2 temperature group. The conclusion is clear - for code generation, it's best to keep the model's head cool.
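For reference, with an Ollama-style API the temperature is just a per-request option; a minimal sketch (the model name and prompt are illustrative):

import ollama  # assumption: the local models are served via Ollama

reply = ollama.generate(
    model="phi4",
    prompt="Write a Python function object_height(h, p, d, f) that passes the test.",
    options={"temperature": 0.2},  # low temperature -> more deterministic, focused output
)
print(reply["response"])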

The Phi4 model's performance at temperature 0.2 is particularly impressive - generating working code on the first attempt in 17 out of 30 trials (over 55%) - an outstanding result!

The second insight comes from examining the default temperature values across different models, which we can infer from the graph. The general-purpose Phi4 operates with a default temperature around 0.5-0.7, allowing for more creative responses across various scenarios. Interestingly, the code-specific Codestral model has a default temperature of about 0.2 - a setting that aligns well with its specialized purpose. Perhaps surprisingly, the Codellama model runs with a higher default temperature of around 0.5-0.7, despite its code-focused nature.

These findings highlight that tuning the temperature parameter is a crucial step in optimizing code generation performance. The time invested in such experiments is clearly worthwhile, as it can significantly impact the model's effectiveness in generating correct code.
It looks like we're entering the sunset years of traditional software engineering 🤔
Forwarded from AI Post — Artificial Intelligence
In 2025, AI will code like mid-level engineers. Eventually, AI engineers will build most of the code and AI in apps, replacing human engineers. You heard it directly from Zuck. AI will replace your job. No denying it anymore.

GitHub Copilot: Using Your Workspace as Context

Recently I received an email saying that GitHub's AI code assistant Copilot is now free. I've tried a few different AI assistant extensions in VS Code - the last one was Tabnine AI, which is actually pretty good at creating docstrings and live code completion. Since it's always interesting to try something new, I decided to give Copilot a shot.

This extension has two main features - a chat in the side panel and live code completion suggestions, which are standard for these kinds of assistants. It uses two models: GPT-4 by default and Claude 3.5 Sonnet, which is pretty impressive for a free assistant.

My favorite Copilot feature is using your workspace as context for the model. Here's a PyQt utility project I'm working on that handles autofocusing for custom cameras. It runs multiple threads to manage UI interactions, camera motor movements, and image capture. The autofocus process involves several modules with complex data flow between them. If you need to refresh your understanding of the whole system, you can open the Copilot chat and ask questions about your entire workspace. What you get is a high-level description with clickable links to navigate through your project - it's really cool and super convenient. I haven't seen such a helpful feature in other assistants.

Let's try something else. When you ask for code improvements, Copilot provides the updated code for a specific function in a particular module. Since you're in the project context, you can use the "Apply in Editor" option - Copilot inserts the changes right into your code. You can review these changes and decide whether to keep them.

So that's my quick look at Copilot. While it has the usual AI features, some extras make it stand out. Since it's free now, it's worth playing around with it and seeing how it fits your workflow. Thanks for listening! Let's go design!
What is this brave new world we are stepping into? Everyone has heard about countless people losing their jobs because of AI: programmers, graphic designers, copywriters and so many others. Another technical revolution is definitely unfolding before our eyes. Like previous ones, it brings possibilities we couldn't imagine before - almost everyone can now have access to an AI assistant that knows literally "everything" and eagerly answers any question. It's an incredible time for creative, curious and open-minded people. And as always, there is a dark side: the internet gave us instructions for printing weapons; AI can now suggest how to make them more dangerous and invisible to scanners... Technology is only a tool, but today it's more powerful than ever before.
Forwarded from Science in telegram
DIY Fusion: How to Build a Nuclear Reactor in Your Kitchen (with AI)

A guy managed to assemble a neutron fusion reactor in his kitchen, using AI as his consultant. 🔬

Technical Specs:
• 30kV/10mA Electrostatic Precipitator
• Vacuum at 3 mTorr (a pressure 253,333 times below atmospheric!)
• Bubble Detector for neutron counting
• Homemade Deuterium extracted from heavy water via electrolysis

The most impressive part? The entire deuterium production process cost just $112:
• $32 for a hydrocar PEM
• $80 for 50g of D₂O (heavy water)

From this, he managed to produce 56 liters of D₂ gas! 🧪

How AI Helped:

The author heavily relied on Claude for:
• Process debugging
• Safety checks
• Following complex instructions

While this isn’t a commercial reactor, as a demonstration of AI-assisted DIY, it’s absolutely mind-blowing. 🔥

The Journey:

The build was live-streamed over 36 hours straight. Remarkably, just months earlier, the same individual assembled a plasma reactor. What’s even more fascinating? He didn’t have deep expertise in nuclear physics—he simply asked Claude the right questions. Independent study would have required thousands (if not tens of thousands) of hours.

The Bigger Picture:

As exciting as this is, it’s also a bit terrifying. If a hobbyist can pull this off in fusion, imagine the possibilities with biology. We might someday look back at bats with nostalgia. 🦇

AI-powered DIY is here, and it’s opening doors to both innovation and ethical challenges.