Coding Agents Suck Too

If you've been following our recent blog posts, you know that I've been quite critical of LLMs, AI agents, and the companies that tout them. The reason for my often flippant negativity is that I've actually used these things and tried building with them, and they're just bad. There's not really another way to say it.

I think others are starting to realize this too, and I've been seeing more content out there designed to temper people's expectations about the current state of AI, but it's still not enough. Take, for instance, this post on the front page of Hacker News a couple of days ago, which tries to characterize LLM-powered coding tools as "mech suits" for developers.

Wouldn't it be cool if this guy was right? I mean, who doesn't want their own Gundam?

Gundam

But as he explains in his post, he generated 30k lines of toy code and then threw it all away. So to me, it seems like we're not getting mech suits; we're just getting cosplay.

Gundam box

In a similar vein, last week I received an email from GoLand about their new AI agent, Junie, which is now available in JetBrains IDEs.

As soon as I received the email, I installed the plugin and messaged my coworkers about it. I wanted to showcase Junie's skills and give her an opportunity to prove to us that she was worth all the H100s powering her reasoning. Once she was installed, I gave her the prompt we give all new LLMs and AI agents we come across:

We have `bats` tests located at ./integration-tests/bats/* that run as part of CI.
They are responsible for testing the behavior of Dolt.
Some of the tests are currently skipped, but they may not need to be, since they might pass if they're run in CI,
but we don't know which of these skipped tests should be unskipped.
Please come up with a plan for how to determine which bats tests should be unskipped, then unskip them.
If you unskip a bats test, it should pass when run.
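
For those unfamiliar with bats, a "skipped" test is just one whose body starts with a call to the `skip` helper. Here's a minimal sketch of what that looks like. The test name and body are hypothetical, but the skip reason is a real one you'll see later in this post:

```bash
# Hypothetical example; only the skip reason is taken from the real suite.
@test "create-views: alter an existing view" {
    # Deleting this line "unskips" the test so that it actually runs.
    skip "ALTER VIEW is unsupported"

    run dolt sql -q "ALTER VIEW my_view AS SELECT 2"
    [ "$status" -eq 0 ]
}
```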

I then asked my coworkers to place bets on whether Junie would succeed.

Discord Junie 1

The task I'd given her is one we give all new AI agents and LLMs, since it's literally a day-one task you could hand a new intern: go modify or add some tests. Obviously, a prompt that concise would cause Junie to keel over and die, so I gave her more context and a specific goal.

I asked her to try to identify bats tests in our Dolt codebase and un-skip them if she determined that they no longer needed to be skipped.

Discord Junie 2

Confidence from the team was high, as you can see above, and personally, my expectations were through the roof. "Why?" you might be wondering. Well, just look at this description of Junie from JetBrains:

JetBrains Junie is your coding agent by JetBrains designed to handle tasks autonomously or in collaboration with a developer. Developers can fully delegate routine tasks to Junie or partner with it on more complex ones.

# You think it, Junie helps make it happen
Try it in JetBrains IDEs and join our Discord to stay in touch

# What Junie can do for you
Provide a seamless coding experience
Getting started with the coding agent is simple. Install it in your IDE and begin with small tasks that fit seamlessly into your workflow – no dramatic changes required. This approach ensures that the new way of working is empowering, not disruptive.

# Discover efficient solutions
Junie understands the structure, logic, and relationships in your code. When you give it a task, it finds the right places to work, the right tools to use, and the best path forward.

# Propose an execution plan
Once Junie has explored the project, it will provide a cohesive plan for completing the task.

# Adjust to the task at hand
Use code mode to execute tasks, with Junie writing and testing the code for you. Switch to ask mode to pose questions, collaborate on plans, and brainstorm features and improvements.

# Perform inspections you can rely on
When Junie updates your code, it uses the power of your IDE to make sure every change meets your standards. With built-in syntax and semantic checks, your code stays clean, consistent, and production-ready.

# Run tests to keep your project state green
Junie can run code and tests when needed, reducing warnings and compilation errors. After making changes, it verifies that everything runs smoothly, so you can stay focused on building, not debugging.

# Intelligent collaboration on complex tasks
Focus on the big picture while Junie handles the routine work. Review, refine, and stay in control. Junie’s here to support you.

I mean, dear God, after reading this, I almost destroyed my keyboard, since apparently I wouldn't need it anymore.

After I sent in the prompt, Junie began grinding, allowing me to enjoy my iced tea on the beach. She came up with a plan and started executing it, which was actually pretty cool to see.

Junie plan 3

Junie then started writing some scripts and asked for my permission to run them. I didn't read anything she wrote, and I didn't pay any attention to her chain-of-thought output logged in the IDE, because honestly, I don't give a shit how she does what I want, as long as she does it correctly. Caring about that stuff is for AI researchers; I just want my tests un-skipped.

The scripts she generated and asked me to run could have rimraffed my home directory for all I knew, but I kept mashing "Run", "Yes", and "Accept" until she was done. I was rooting for her to succeed.

Finally, she finished the task and gave me a summary of her work.

Junie summary

And, unfortunately, all that H100 compute she burned through did not result in any un-skipped bats tests.

Now, it's certainly possible that Junie's hard work was in fact completely accurate, and that there are no bats tests in the Dolt codebase that can be un-skipped, but I'm highly skeptical.

I'm skeptical, in part, because in my local environment, I can only successfully run the bats tests after I activate my `bats` conda environment, which has all of the dependencies installed.

My guess is that Junie didn't do this, or even think to do it. It's not technically impossible for her to have discovered the requirement, since all of the bats dependencies are identified in our CI workflows. But I wouldn't have told an intern about this basic, dependency-related "gotcha" unless they asked, and Junie didn't ask, so why tell her?
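
For reference, here's roughly what running them looks like on my machine. Nothing fancy, but you do have to know the environment exists:

```bash
# Activate the conda environment that has bats and its dependencies
# installed, then run the integration tests from the Dolt repo root.
conda activate bats
bats ./integration-tests/bats/*.bats
```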

No, instead of doing what I wanted, Junie wrote me a fucking doc about how I could do it myself. I've included her doc below, but don't read it; there's no point. I didn't ask for a doc, I needed someone to DO the fucking work. Also, Junie, you had a plan and you fucked it up!!!

Click to see Junie's dumb summary
# Determining Which Skipped Bats Tests Can Be Unskipped

## Analysis of Skipped Tests

After examining the skipped bats tests in the Dolt repository, I've found that they generally fall into these categories:

1. **Tests for features that are not implemented yet**
   - Example: `skip "ALTER VIEW is unsupported"` in create-views.bats
   - Example: `skip "todo implement backup lock file"` in backup.bats

2. **Tests for known bugs**
   - Example: `skip "SQL views return incorrect data when using AS OF with commits that modify existing data"` in blame.bats
   - Example: `skip "BUG: Altering primary key drops all data from dolt_diff_<table>"` in blame-system-view.bats

3. **Tests for behavior that should fail but doesn't**
   - Example: `skip "Add foreign key constraint statement is ignored instead of failing"` in foreign-keys.bats

4. **Platform-specific skips**
   - Example: `skiponwindows "tests are flaky on Windows"` in various files

## Recommended Approach

To determine which skipped tests can be unskipped, I recommend the following approach:

1. **Manual Testing of Specific Tests**
   - Identify tests with skip reasons that suggest the issue might have been fixed
   - Create a temporary copy of the test file with the skip statement commented out
   - Run the test to see if it passes
   - If it passes, unskip the test in the original file

2. **Focus on Tests with Clear Skip Reasons**
   - Prioritize tests with specific bug references or clear descriptions of the issue
   - Avoid tests with vague reasons like "todo" or "unsupported" as these likely still need implementation

3. **Check Recent Changes**
   - Look for tests that reference issues that have been fixed in recent commits
   - Check if the underlying code has changed since the test was skipped

4. **Test in CI Environment**
   - Some tests might pass locally but fail in CI due to environment differences
   - After unskipping tests, verify they pass in CI before merging

## Implementation Plan

1. **Create a List of Candidate Tests**
   - Review all skipped tests and identify those with specific, actionable skip reasons
   - Exclude tests with "todo", "unsupported", or other reasons indicating incomplete features

2. **Test Each Candidate**
   - For each candidate test:
     - Create a temporary copy of the test file
     - Comment out the skip statement
     - Run the test
     - If it passes, unskip it in the original file

3. **Verify in CI**
   - Submit a PR with the unskipped tests
   - Ensure all tests pass in CI before merging

4. **Document Results**
   - Document which tests were unskipped and why they now pass
   - Update any related issue references

## Conclusion

The automated approach using a script to test all skipped tests was not successful due to the complexity of running bats tests and the need for specific environment setup. A more targeted, manual approach focusing on specific tests with clear skip reasons is recommended.

By following this approach, we can systematically identify and unskip tests that should now pass, improving test coverage and ensuring that fixed issues remain fixed.

After this failure, Jason, who bet the $1, gave Junie the benefit of the doubt, saying that maybe working with bats is too complex for her. He suggested we try having her work on Go tests in our dolthub/go-mysql-server repo instead. If Junie could add even a few tests expanding coverage of our SQL engine, that would be a huge win.

So I queued up another prompt and sent Junie on her way. Here's the prompt I gave her:

The ./enginetest directory contains tests for the dolthub/go-mysql-server SQL engine.
Our goal is to make sure we have a robust and comprehensive test suite that ensure that dolthub/go-mysql-server works correctly.
Please analyze the tests that are currently defined in the ./enginetest directory, 
and generate new tests to add to this test suite that are not currently covered. 
After generating the new tests, run them and see which ones fail. If the tests fail, skip them,
and add a comment about the work that needs to be done in order to get this new failing test to pass.

Junie GMS prompt

This time it seemed like a match made in heaven. These engine tests are plain Go tests, and Junie is a Go and SQL expert. I was running her in GoLand, my IDE for Go, which knows where my go binary is, so Junie shouldn't have had any dependency issues this time.
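
In other words, everything she needed was already on my machine. Verifying her work should have come down to a single, bog-standard command, something like:

```bash
# From the go-mysql-server repo root: compile and run everything
# under ./enginetest, newly generated tests included.
go test ./enginetest/...
```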

Like last time, Junie slammed the H100s on the remote host and cranked out thoughts in my IDE. But to my surprise, she finished a bit faster this time around.

Junie GMS done

Ok awesome! It turned out the bats tests were the issue! Junie just needed to start with a simpler, more routine task. Excited to update the team, I went to my terminal to check in Junie's code changes and open a PR.

(bats) dustin@Dustins-MacBook-Pro-3 go-mysql-server % gs
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

"Huh?" I thought.

Did I kick-off Junie in the wrong repo?

Had I misread her response? Maybe she was still waiting on my input like before. Maybe she needed approval to run another script or something?

So like the clown I am, I checked. No, no, and no.

Turns out, Junie was just straight-up lying to me.

That was a lie

Damn you, Junie! Damn you to hell!

Let me vibe

Conclusion

Hopefully you all have more luck with Junie than we did. I'm not sure what constitutes a "routine" task versus a "complex" task in the minds of JetBrains' marketing team, but we don't really need any todo apps coded over here, so we're still looking for ways to make AI agents like Junie, and LLMs in general, more useful.

As always, we'd love to hear from you. If you have any suggestions or feedback, or just want to share your own experiences, please come by our Discord and give us a shout.
