Claude Gets It Wrong 67% of the Time and I Still Can't Stop Using It

I keep detailed data on my Claude Code sessions. Not by choice; Claude Code stores session metadata including friction points. I pulled the numbers from 55 tracked sessions and the results are worth writing about.

The uncomfortable numbers

Out of 55 sessions with tracked outcomes (one session can log more than one friction type, so the counts overlap):

  • 37 had buggy code output
  • 37 involved Claude taking the wrong approach
  • 7 had Claude misunderstand my request
  • 3 involved excessive changes I didn’t ask for

That’s a 67% rate of buggy code and a 67% rate of wrong approaches. Those numbers sound like a tool you’d stop using. I haven’t. I use it more every month.

How that makes sense

The buggy code metric sounds damning if you think of Claude as a code generator. It sounds normal if you think of it as a pair programmer.

When I pair with another engineer, they write buggy code too. Everyone does. The first draft of anything is wrong. The value isn’t in the first draft being perfect. It’s in how fast we get from first draft to working code. Claude’s buggy first pass at a Helm chart or a Home Assistant automation still saved me the hour I’d have spent writing the buggy first pass myself. We both would have had to debug it either way.

The “wrong approach” number is the more interesting one. 37 sessions where Claude headed in a direction I disagreed with. Sometimes it’s architectural (it wanted to add a new database table when a computed column would do). Sometimes it’s stylistic (over-engineering a solution that needed three lines of code). Sometimes it’s plain wrong (suggesting a library that doesn’t exist or an API that changed).

But here’s the thing: I catch those. I catch them because I understand the codebase and the problem. Claude proposes, I evaluate. When the approach is wrong, I redirect. “No, we don’t need a new table for this. Use a computed column.” That takes five seconds. The alternative is me spending 30 minutes arriving at the right approach by myself, which I would have done anyway, but slower.

Where it fails hardest

The failure modes aren’t random. They cluster in predictable ways.

Cross-boundary bugs. Claude handles single-file changes well. When a change touches the Go backend, the API layer, the frontend, and a config file all at once, it’ll get one or two of those boundaries wrong. A field name that’s unitPrice in Go becomes unit_price in the JSON response, but the frontend expects UnitPrice. Claude misses these translation layers because it’s reasoning about one file at a time.

Stateful system debugging. When I paste a stack trace, Claude is good. When I describe emergent behavior (“the thermostat turns itself down to 75 when I set it to 77, but only sometimes”), it struggles. Stateful bugs require holding a mental model of the system over time, and Claude doesn’t have that temporal context. It’ll propose fixes based on the code structure that don’t account for the sequence of events.

Long-session drift. My best sessions are the first 50-80 messages. Past that, Claude starts losing the thread. It’ll re-suggest approaches we already tried and rejected. It’ll introduce inconsistencies with changes made earlier in the session. The context window is large but not infinite, and the quality degrades when the session gets long. I’ve learned to summarize and start fresh instead of pushing through.

Where it works beyond expectations

The flip side is equally predictable.

Boilerplate and scaffolding. Need a new gRPC service with tests, protobuf definitions, and server registration? Claude nails this every time. The code might need tweaks, but the structure is correct and saves 30-60 minutes of mechanical typing.

Multi-file refactoring. Renaming a concept across 15 files, updating all the callers, fixing the tests? Claude does this faster and more reliably than any find-and-replace. It operates on the semantics, not just the surface text.

Exploration and research. “What’s the right way to set up zigbee2mqtt on Talos Linux?” “How do I configure Longhorn storage with PodSecurity restrictions?” Claude gives me a starting point faster than searching through documentation, and I can follow up with “that didn’t work, here’s the error” in the same session.

Debugging with visual context. For frontend work, I’d screenshot UI glitches and paste them into Claude. “The scale indicators aren’t visible unless I’m right on the grid line” with a screenshot. Claude would identify the likely cause (opacity, z-index, contrast against the background) and propose a targeted fix. This loop of screenshot-diagnose-fix-screenshot is faster than any debugging workflow I’ve used.

The real cost-benefit

The way I think about it: Claude’s error rate is high, but the cost of each error is low. A wrong approach costs me 10 seconds to redirect. A bug in generated code costs me 2-5 minutes to spot and fix. The alternative, writing everything from scratch, has a lower error rate but a much higher time cost per unit of output.

Over 55 sessions, my outcomes were: 18 fully achieved, 17 mostly achieved, 17 partially achieved, 1 not achieved (the remaining 2 are the unhelpful sessions covered below). That’s a 95% success rate at the session level even with a 67% bug rate at the code-generation level. The bugs are noise in the process. They don’t define the outcome.

The 2 sessions that were unhelpful

For honesty’s sake: I rated 2 of the 55 sessions unhelpful. Both involved debugging problems where Claude confidently proposed fix after fix, none of which worked, and the session spiraled into a guessing game. In both cases the root cause was something Claude couldn’t have known from the code alone (a hardware issue in one case, a runtime environment mismatch in the other). I should have stepped back sooner instead of letting Claude keep guessing.

The lesson: if Claude’s third attempt doesn’t work, stop. The problem is likely outside the code, and no amount of code-level fixes will help. That’s a skill I’m still building.

So why can’t I stop?

Because those 37 sessions with buggy code also produced:

  • A Home Assistant automation system with multi-zone climate control and Zigbee sensors
  • A Kubernetes deployment pipeline with Helm charts and ArgoCD
  • A Terraform provider feature shipped to open source
  • A floor plan editor web application
  • This blog

I built all of that in evenings and weekends over six weeks. The bugs were real. The output was also real. I’ll take a fast, error-prone collaborator who I can correct on the fly over a slow, perfect process any day.

If you’re waiting for AI code generation to be reliable before you adopt it, you’ll be waiting a long time. The tool is useful now, bugs and all. You need to be good enough to catch the mistakes, and willing to accept that catching mistakes is part of the workflow, not a failure of the tool.

That’s the honest take. 67% bug rate. 95% session success rate. Both numbers are true at the same time.

The question I can’t answer yet

All of this data comes from my personal sessions. I use Claude at work full-time too, running 2-6 sessions in parallel on any given day across infrastructure, CI/CD, and platform work. Those sessions live on a different machine, and I haven’t pulled the numbers. I’d be curious to see how they compare. My gut says the bug rate is similar but the “wrong approach” rate is lower at work, because production codebases have stricter patterns and more guardrails for Claude to follow. But that’s a guess, not data. Maybe a follow-up post once I pull those numbers.
