Companies are looking to harness agentic code generators to get software built faster. But for every story of increased developer productivity or greater code base understanding, there’s a story about creating more bugs and the increased likelihood of production outages.
Here at CodeRabbit, we wanted to know whether the problems people have been seeing are real and, if so, how bad they are. We’ve seen data and studies on this same question, but many of them are just qualitative surveys sharing vibes about vibe coding. That doesn’t give us a path to a solution, only a perception.
We wanted something a little more actionable, backed by real data. What specific kinds of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this affecting production environments?
In this article, we’ll talk about the research we did, what it means for you as a developer, and how you can mitigate the mistakes that LLMs make.
To find answers to our questions, we scanned 470 open-access GitHub repos to create our State of AI vs. Human Code Generation Report. We looked for signals indicating that pull requests were AI co-authored or human created, like commit messages or agentic IDE files.
What we found is that there are some bugs humans create more often and some that AI creates more often. For example, humans produce more typos and difficult-to-test code than AI. But overall, AI created 1.7 times as many bugs as humans. Code generation tools promise speed but get tripped up by the errors they introduce. And it’s not just little bugs: AI created 1.3-1.7 times more critical and major issues.
The biggest issues lay in logic and correctness. AI-created PRs had 75% more of these errors, adding up to 194 incidents per hundred PRs. This includes logic errors, dependency and configuration errors, and errors in control flow. Errors like these are the easiest to miss in a code review, because they can look like reasonable code unless you walk through it to understand it.
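To make that concrete, here’s a hypothetical example (ours, not code from the report’s dataset) of the kind of logic error that reads as perfectly reasonable until you trace the control flow:

```python
def discounted_cents(price_cents: int, tier: str) -> int:
    """Apply a loyalty discount. The bug: the truthiness check on `tier`
    runs first, so "gold" customers match it and never reach their branch."""
    if tier:                        # any non-empty tier, including "gold"
        return price_cents * 95 // 100
    elif tier == "gold":            # unreachable
        return price_cents * 80 // 100
    return price_cents


def discounted_cents_fixed(price_cents: int, tier: str) -> int:
    """Same logic, with the specific case checked before the general one."""
    if tier == "gold":
        return price_cents * 80 // 100
    elif tier:                      # any other named tier
        return price_cents * 95 // 100
    return price_cents
```

Both versions look plausible at a glance; only walking through the branches (or a targeted test) reveals that gold members are overcharged in the first one.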
Logic and correctness issues can cause serious problems in production: the kinds of outages you have to report to shareholders. We found that 2025 had a higher level of outages and other incidents, even beyond what we’ve heard about in the news. While we can’t tie all of these outages to AI on a one-to-one basis, this was the year that AI coding went mainstream.
We also found plenty of other issues that, while they may not take your app down, were alarming:
- Security issues: AI introduced bugs like improper password handling and insecure object references at a 1.5-2x higher rate than human coders.
- Performance issues: We didn’t see a lot of these, but the ones we found were heavily AI-created. Excessive I/O operations were ~8x higher in AI code.
- Concurrency and dependency correctness: AI was twice as likely to make these mistakes, which include misuse of concurrency primitives, incorrect ordering, and dependency flow errors.
- Error handling: AI-generated PRs were almost twice as likely to omit checks for errors and exceptions, such as null pointer checks, early returns, and proactive defensive coding practices.
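On the error-handling front, the pattern often looks like this hypothetical sketch (not code from the scanned repos): the happy-path version assumes well-formed input, while the defensive version uses early returns to fail gracefully:

```python
import json
from typing import Optional


def get_user_email(payload: str) -> str:
    # Happy-path version: no handling for malformed JSON or missing keys.
    data = json.loads(payload)
    return data["user"]["email"]


def get_user_email_defensive(payload: str) -> Optional[str]:
    # Defensive version: early returns instead of uncaught exceptions.
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    user = data.get("user") if isinstance(data, dict) else None
    if not isinstance(user, dict):
        return None
    return user.get("email")
```

The first function works fine in a demo and blows up the first time production hands it a malformed payload.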
The single biggest difference between AI and human code was readability: AI had 3x the readability issues of human code. It had 2.66x more formatting problems and 2x more naming inconsistencies. While these aren’t the issues that will take your software offline, they make it harder to debug the ones that will.
Major errors happen largely because these coding agents are primarily trained on next-token prediction over large swaths of training data. That training data includes vast numbers of open-source or otherwise insecure code repositories, but it doesn’t include your code base. That is, any given LLM you use is going to lack the context required to write the right code.
When you try to provide that context as a system prompt or `agents.md` file, it may work, depending on the LLM or agentic harness you’re using. But eventually, the AI tool will need to compact the context or use a sliding-window strategy to manage it efficiently. At the end of the day, though, you’re losing information. If you have a task list where the agent is supposed to create code, review it, and check it off when it’s done, eventually it forgets. It starts forgetting more and more along the way, until the point where you have to stop it and start over.
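A toy sketch of why this happens (characters stand in for tokens here; a real harness is far more sophisticated):

```python
def sliding_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined length fits the budget.
    Anything older is silently dropped, which is exactly how early
    task-list items fall out of an agent's working context."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        if used + len(msg) > budget:
            break                           # budget exhausted; drop the rest
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))             # restore chronological order
```

With a budget big enough for only two of three task-list entries, the oldest entry vanishes, and the agent no longer "knows" it was ever assigned.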
We’re past the days of code completion and cutting and pasting from chat windows. People are using AI agents and running them autonomously now, sometimes for very long stretches of time. Any errors (hallucinations, gaps in context, even slight missteps) compound over the running time of the agent. By the end, those errors are baked into the code.
Agentic coding tools make producing code incredibly easy. To a certain degree, it’s fun to be able to magically drop 500 lines of code in a minute. You’ve got five windows going, five different things being done at the same time. No idea what any of them are building, but they’re all being built right now.
Eventually, though, someone will need to make sure that code works, and ensure that only quality code hits the production servers.
There’s a joke that if you want a lot of comments, make a PR with 10 lines of code. If you want it approved immediately, commit 500 lines of code. This is the law of triviality: small changes get more scrutiny than big changes. With agentic code generators, it becomes very easy to make these very large commits with massive diffs.
Big commits combined with hard-to-read code make it very easy for serious logic and correctness errors to slip through. This is where the readability problem compounds. AI creates more surrounding harness code and little inline comments. There’s just a lot more to read. Unless someone (ideally several someones) is combing through every single line of code in these commits, you could be creating tech debt at a scale not previously imagined.
Think of a code base over the lifetime of a company. Early-stage companies have a move-fast, ship-it mentality, but maintainability, complexity, and readability issues compound over time. They may not cause the outage, but they will make that outage harder to fix. Eventually, that tech debt has to be paid off. Either the company dies or somebody has to rewrite everything because nobody can follow what any of the code is doing.
People want to use agentic coding tools and get the productivity gains. But it’s important to use them in a way that mitigates some of the potential downstream effects and prevents AI-generated errors from affecting your uptime. At every stage in the process, there are things you can do to make the end result better.
Before starting out, do as much pre-planning as you can, and read up on the best practices for these tools. Personally, I like the pattern of spec-driven development. It forces you to have a clearly laid-out plan and fully consider the requirements, design, and functionality of the end software that you want. This crystallizes the context you have about the code into something the code generation agent can use. Add other pieces of context: style guides, documentation about the code base, and more.
While everybody wants to jump to the latest and greatest language models, at CodeRabbit we don’t believe you should let your users choose their own LLMs. Models have become very different from one another, and when you change between LLMs, your prompts may not behave the same. The focus of the model may shift, it may generate more of certain kinds of error, or it may interpret existing prompts differently. Just because you know how to prompt one model doesn’t mean you know how to prompt another. We recommend using a coding tool that benchmarks all the models and assigns the best one to the task you’re working on, or reading benchmarks yourself to better understand which model to use for each task and how to prompt it.
Once you start running the agent, smaller is better. Break tasks into the smallest possible chunks. Actively engage with the agent and ask questions; don’t just let it burn tokens for hours. On the flip side, create small commits that can be easily digested by your reviewers. People should be able to understand the scope of a given PR. The hype around long-running agents is a sales tactic, and engineers using these tools need to be clear-eyed and pragmatic.
When you approach a PR that AI assisted with, go in knowing that there will be more issues in it. Know the kinds of errors that AI produces. You still need to review and understand the code as you would with any human-produced commit. It’s a hard problem because people don’t scale that well, so consider tooling that catches problems in commits or provides summaries.
Your post-commit tools for building, testing, and deploying are going to be more important. If you have QA checklists, follow them closely. If you don’t have a checklist, make one. Sometimes just adding potential issues to the checklist will keep them top of mind. Review your code standards and enforce them in reviews. Instrument unit tests, use static analysis tools, and make sure you have solid observability in place. Or better yet, fight AI with AI by leveraging AI in reviews and testing. These are all good software engineering practices, but companies often neglect these tools in the name of speed. If you’re using AI-generated code, you can’t do that anymore.
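One low-effort way to make a checklist enforceable is to turn each item into a unit test. A minimal sketch (the `parse_port` helper is hypothetical, standing in for any code under review):

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical config helper: parse and validate a TCP port string."""
    port = int(value)  # raises ValueError for non-numeric input
    if not (0 < port < 65536):
        raise ValueError(f"port out of range: {port}")
    return port


class PortChecklistTests(unittest.TestCase):
    # Each test encodes one checklist item, so a review finding becomes
    # a permanent, automated check instead of a one-off comment.
    def test_accepts_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_out_of_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_non_numeric(self):
        with self.assertRaises(ValueError):
            parse_port("http")


if __name__ == "__main__":
    unittest.main()
```

Once an item lives in the test suite, it gets checked on every commit, whether the code was written by a human or an agent.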
2025 saw Google and Microsoft bragging about the percentage of their code bases that was AI-generated. This speed and efficiency was meant to show how productive they were. But lines of code has never been a good metric for human productivity, so why would we think it’s valid for AI?
These metrics are going to look increasingly irrelevant as companies evaluate the downstream effects of their code. You’ll have to account for the holistic costs and savings of AI: not just lines of code per developer, but review time, incidents, and maintenance load.
If 2025 was the year of AI coding speed, 2026 is going to be the year of AI coding quality.
Save your dev team’s sanity this year with better code review tools. Sign up for a 14-day CodeRabbit trial.

