Reflections on creating ExCrap without looking at the code

I recently published ExCrap, an Elixir library for calculating Change Risk Anti-Patterns scores – or, as they’re commonly called, CRAP scores.

From the README:

CRAP combines function-level cyclomatic complexity with function-level coverage to highlight code that is risky to change because it is both complex and under-tested.

Because the library is small (it basically exposes a single mix crap task), and the idea concrete enough, it made for a perfect experiment: creating the library without looking at the code.

Why the experiment?

There’s so much talk about software factories, but I keep asking myself: can we really verify the code the agents write without looking at it? So, I wanted to treat the library’s code as a black box to force my brain to think of new verification strategies.

inputs -> black box -> outputs

That’s what this post is about.

In it, I share a few things I tried while creating the library (some good, some bad), and I end with some reflections.


Initial research with the three big brains

The initial research itself was interesting. I could have done it myself and given the agents a good prompt or context. But that’s not what I did.

A little while ago, I attended a workshop on building software factories that two old colleagues of mine were starting. In the first session, we talked about bringing an idea and using the “three big brains” for research: GPT, Gemini, and Claude.

So, I used all three models to do deep research on a core idea:

I want to create an Elixir library that calculates C.R.A.P. scores (combining test coverage and cyclomatic complexity). The library should be able to be executed with mix to be run locally and in CI environments. And it should have the ability to give me a score. If the score is beyond a certain level, we should be able to fail CI (same as is possible with tools like Credo).

That resulted in three research files that I committed to the repo, which other agents then used as an index of information.

Those files served as the foundation of the library. For example, that’s where the agents that did the initial implementation learned what CRAP metrics are and how to calculate them.

After that, the rest of the application was built using GPT 5.5 and OpenCode (in case that matters to you).

The first few phases with superpowers

Once I had the research in place, I started implementing the application in full-stack slices. At the time, I was using superpowers, so some of my prompts automatically triggered its brainstorming, writing-plans, and other skills.

I originally committed some of the artifacts, though I have since removed them from the repo. But if you’d like a sample, you can still find a few of them in the commit history, like this plan.

To be honest, planning is the part I liked the least.

LLMs have made me hate markdown. I still love markdown for writing, but reading markdown created by an LLM just makes my eyes bleed. (I now prefer HTML files as output.)

I also prefer a more interactive flow than a giant upfront spec/plan. So, my feature work today is more interactive, where the agent and I define some Gherkin, talk about module/function architecture, and then I let the agent do a slice of work.

But I digress. For the initial implementation, I created plans, and if I’m honest, I probably only skimmed them. So, the agents were guiding the roadmap, and I was mostly flying blind.

How do I verify this is working?

With all software, I think we need two types of verification:

  • The first is about whether or not the product we’re creating is what we desired. This one is external to the black box, and therefore, visible to us. In web apps, you could think of this as the UI you can click around. With ExCrap, it’s mostly about running the mix crap task.

  • The second is whether or not the codebase is healthy and evolving well. That one is internal to the black box, and therefore, invisible to us.

So, for us to fully verify our software product, we need to inspect the output (visible) and we need sensors and signals to give us information about what’s happening inside the black box.

inputs -> gray box -> outputs. The gray box has sensors that shoot out signals.

🧭 External verification

Since the library boils down to that single command, I could easily verify its output manually. I ran mix crap over and over to make sure the library could calculate CRAP metrics for itself.

Output of `mix crap`. A series of green check marks.

That also allowed me to improve the format of the output.

For example, I wanted mix crap’s output to be terse by default and verbose only if we passed a --verbose flag (the LLM’s original output was really verbose). I wanted it to have colors: red on failure, green on success (the LLM’s original output had no colors).

But dog-fooding only gave me one data point. The “users” of mix crap would be other Elixir projects. Would it work in those?

Manually testing external libraries

As a first pass, I cloned some open source repos (like Phoenix and Ecto), added ex_crap as a dependency, and ran mix crap on those libraries.

That immediately flagged issues with ExCrap – it couldn’t even read the Elixir files in those projects correctly. All I got was an {:error, :invalid_source} error.

Unfortunately, the error didn’t even have a stacktrace (it just failed to read the file), so there was no way of knowing what construct in those codebases ExCrap was treating as invalid.

So, I did what anyone leaning on agents will do: I grabbed the error and fed it back to the agent.

The agent had guesses at what might be causing the issue. But after a couple of rounds of me running mix crap on libraries and feeding the errors back to the agent, it was clear we wouldn’t really solve the problem that way.

But I had learned one thing: the errors were coming from ExCrap’s cyclomatic complexity calculation. It seems that ExCrap only accounted for some Elixir constructs, and when it found a file that had constructs it hadn’t accounted for, it just returned {:error, :invalid_source}.

So, I needed a way to test more combinations of Elixir constructs to make sure the library could parse them. That’s when I reached for property-based tests.

Property-based testing

I thought this would be a great opportunity to use property-based tests. I needed many, varied combinations of inputs that should always satisfy a simple property: the files should parse without error.
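To make the shape of that property concrete, here is a minimal, dependency-free sketch. It is not ExCrap's test suite, and it does not use a real property-testing library such as StreamData; it just glues random snippets together and checks that an analyzer returns an `{:ok, _}` tuple. The module name `ParsePropertySketch`, the snippet list, and the default analyzer (Elixir's own parser standing in for the library's analysis) are all assumptions made for illustration.

```elixir
defmodule ParsePropertySketch do
  # Hypothetical snippets standing in for "Elixir constructs". A real
  # generator would produce far more variety than this fixed list.
  @bodies [
    "if x > 0, do: :pos, else: :neg",
    "case x do\n  0 -> :zero\n  _ -> :other\nend",
    "cond do\n  x > 1 -> :big\n  true -> :small\nend",
    "with {:ok, y} <- {:ok, x}, do: y",
    "for y <- [x], y != nil, do: y",
    "try do\n  x\nrescue\n  _ -> :error\nend",
    "x |> to_string() |> String.upcase()",
    "fn\n  0 -> :zero\n  n -> n\nend"
  ]

  # Builds the source of a module with `count` randomly chosen function bodies.
  def random_source(count) do
    functions =
      for i <- 1..count do
        "  def f#{i}(x) do\n#{Enum.random(@bodies)}\n  end\n"
      end

    "defmodule Sample do\n" <> Enum.join(functions, "\n") <> "end\n"
  end

  # Runs the property `runs` times. Returns :ok, or {:error, source} with the
  # first generated source for which the analyzer did not return {:ok, _}.
  def check(runs, analyze \\ &Code.string_to_quoted/1) do
    Enum.find_value(1..runs, :ok, fn _run ->
      source = random_source(Enum.random(1..5))

      case analyze.(source) do
        {:ok, _result} -> nil
        _other -> {:error, source}
      end
    end)
  end
end

IO.inspect(ParsePropertySketch.check(100))
# prints :ok
```

Passing a different function as the second argument swaps in whatever analysis you actually want to exercise, and a failing run hands back the exact source that broke it.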

But once again, not being able to look at the code meant it was hard to know how to best write those properties. The best I could do was to have the agent generate the tests.

That helped, but there were two problems:

  • I didn’t really know what we were testing with those property-based tests. I could read the test descriptions with --trace, but even then, I wasn’t really sure what those tests were actually testing. I only knew what the test description claimed to be testing.

  • The second, bigger problem was that even after creating the property-based tests and fixing the issues they caught, running mix crap on many open-source libraries would still fail.

That’s when I finally got the idea on how to close the loop.

Leaning even more into the search & gather methodology

What I really wanted to do was to somehow automate what I was already doing manually (running mix crap on a bunch of open source libraries), and to do it directly in the ex_crap library.

Bringing all the libraries into the ex_crap project seemed like a bad idea, but perhaps I could do something simpler and achieve the same result. Maybe I could grab pieces of open source libraries and create sample modules that could serve as fixtures for tests.

So, just as with the research at the start of the project, I sent some agents to search and gather.

In this case, I tasked an agent to send subagents to find sample Elixir open source repositories (I didn’t specify which). I told it to get a broad set of concepts to cover all Elixir constructs. I then asked it to consolidate those findings and generate sample Elixir modules that would compile and stand alone (i.e. have no dependencies). Those would be the sample modules that I could use as fixtures.

The agent did a pretty good job of that.

It created a series of sample modules, organized them into separate directories based on what they’re meant to capture, and created tests that run those fixtures.

Here’s the directory structure from the samples README:

elixir_samples/
├── 01_basic_modules/
├── 02_typespecs/
├── 03_structs_protocols/
├── 04_behaviours/
├── 05_macros/
├── 06_pattern_matching/
├── 07_genserver/
├── 08_supervisors_otp/
├── 09_protocols/
├── 10_advanced/
└── _raw_originals/

(A huge thank you to the original libraries the agent used as good examples.)

With those files as our fixtures, the agent wrote some tests that surfaced the same {:error, :invalid_source} error, and it fixed the issues. After that, I could get CRAP scores for all the libraries I was testing!
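A fixture-driven check along those lines could look roughly like the sketch below. It is an assumption-laden illustration rather than the tests the agent wrote: `FixtureSketch` is a made-up name, the directory you pass in is up to you, and Elixir's parser again stands in for the library's own analysis. The part worth noticing is that every failure carries the path of the offending file, which a bare {:error, :invalid_source} did not.

```elixir
defmodule FixtureSketch do
  # Returns [] when every .ex file under `root` parses. Otherwise returns a
  # list of {path, reason} tuples, so a failure names the file that caused it.
  def failures(root) do
    root
    |> Path.join("**/*.ex")
    |> Path.wildcard()
    |> Enum.sort()
    |> Enum.flat_map(fn path ->
      case path |> File.read!() |> Code.string_to_quoted() do
        {:ok, _ast} -> []
        {:error, reason} -> [{path, reason}]
      end
    end)
  end
end
```

In an ExUnit test, asserting that `FixtureSketch.failures/1` returns `[]` for the samples directory would fail loudly, and point at a file, as soon as one sample stops being readable.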

🏥 Internal code health

Throughout the process, I also kept trying to improve the internal verification mechanisms of the library.

First, with all the code I created, I used the superpowers TDD skill. That meant that most of the code that was created had tests. But I couldn’t read or write those tests myself.

So, I needed to figure out what would make for good sensors that would tell me about the code, including the tests themselves!

The first thing I did was to take a quick glance at what the tests claim to be testing by running mix test --trace.

Here’s a small sample:

ExCrap.ScoreTest [test/ex_crap/score_test.exs]
  * test score/2 returns complexity squared plus complexity for 0 percent coverage (1.1ms) [L#9]
  * test score/2 returns complexity unchanged for 100 percent coverage (0.00ms) [L#5]
  * test score/2 preserves fractional scores for intermediate coverage (0.00ms) [L#13]
  * test score/2 rejects invalid complexity (0.00ms) [L#17]
  * test score/2 rejects invalid coverage (0.00ms) [L#22]
  * test score/2 rejects fractional coverage outside the valid range (0.00ms) [L#28]

ExCrap.ElixirSamplesTest [test/ex_crap/elixir_samples_test.exs]
  * test canonical sample fixtures discovers only canonical samples (2.4ms) [L#39]
  * test canonical sample fixtures canonical samples live with test fixtures (0.01ms) [L#35]
  * test canonical sample fixtures all canonical samples analyze successfully (14.7ms) [L#45]
  * test canonical sample fixtures canonical sample aggregate complexity summaries remain stable (7.8ms) [L#57]

# ... more tests ...

Finished in 1.1 seconds (0.4s async, 0.6s sync)
22 properties, 197 tests, 0 failures

That was okay, but it immediately raised the question: are those tests actually testing anything?

I remember a while ago, I had an agent write a test like this (not in this library, but in a different project):

test "pretend to test something" do
  if Application.get_env(:foo, :bar) do
    assert true
  end
end

# in the config
config :foo, bar: false

In other words, the test would always run, do nothing, but show as passing in the output. Just by reading the test description, it seemed like the test was testing something useful. But the code in question was never exercised.

To try to counter that type of scenario, I used two other tools: code coverage and mutation testing.

Code coverage

In Elixir, we get test coverage out of the box by running mix test --cover.

Here are the current results for the ex_crap library:

Generating cover results ...

| Percentage | Module                         |
|------------|--------------------------------|
|     83.78% | ExCrap.Coverage                |
|     88.46% | Mix.Tasks.Crap                 |
|     90.00% | ExCrap.Mix.BoundarySpec        |
|     90.74% | ExCrap.Complexity              |
|     96.30% | ExCrap                         |
|     97.18% | ExCrap.Report                  |
|    100.00% | ExCrap.Mix                     |
|    100.00% | ExCrap.Scanner                 |
|    100.00% | ExCrap.Score                   |
|    100.00% | Mix.Tasks.Boundary.Spec.Accept |
|    100.00% | Mix.Tasks.Boundary.Spec.Check  |
|------------|--------------------------------|
|     92.09% | Total                          |

With that, I can at least see that the 197 tests and 22 property-based tests are covering ~92% of the code. So at least as a whole, the tests are exercising a large part of the codebase.

Are they good tests? Do they only test behavior and not implementation? That, I don’t know. For now, it’s the black box’s domain.

Mutation testing

The next, more exciting tool I wanted to try in this project was mutation testing.

The idea behind mutation testing is that we change some of the code (e.g. swap an and for an or), and run the tests. If they still pass, it means our tests are not actually testing that code path.
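Here is a toy version of that idea, using only the standard library. It says nothing about how muex is implemented; the module name `MutationSketch`, the single and-to-or mutator, and the two "tests" (plain predicates) are all invented for illustration.

```elixir
defmodule MutationSketch do
  # The code under test, kept as an AST so that it can be mutated.
  def original do
    Code.string_to_quoted!("fn a, b -> a and b end")
  end

  # One mutator: swap every `and` for `or` in the AST.
  def mutate(ast) do
    Macro.prewalk(ast, fn
      {:and, meta, args} -> {:or, meta, args}
      node -> node
    end)
  end

  # Evaluates the AST into a function and runs a "test" (a predicate) on it.
  def passes?(ast, test) do
    {fun, _binding} = Code.eval_quoted(ast)
    test.(fun)
  end
end

# A weak test only checks the case where `and` and `or` agree.
weak_test = fn f -> f.(true, true) == true end

# A stronger test also checks a case where they differ.
strong_test = fn f -> f.(true, true) == true and f.(true, false) == false end

mutant = MutationSketch.mutate(MutationSketch.original())

IO.inspect(MutationSketch.passes?(mutant, weak_test))
# prints true: the mutant survives, so the weak test is not protecting that branch

IO.inspect(MutationSketch.passes?(mutant, strong_test))
# prints false: the mutant is killed
```

A surviving mutant is the signal to feed back to the agent: some behavior changed and no test noticed.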

After some brief research, I decided to try muex.

Muex gave me what I was looking for. Over the course of building the library, I ran my trusty mix mutate and mix mutate.fast aliases frequently and found mutants that survived. So, I fed those back into the agent to improve the tests.

I did run into a lot of false positives, but that might have been a somewhat unique scenario related to Boundary. Muex would remove a use Boundary declaration and report that the mutation had survived. But use Boundary is probably one of the few use macros that don’t break your code if you remove them. Nevertheless, it created a lot of noise. So, in the end, I had to trim the set of mutators that muex applied.

If you’re curious, these are the two mutation aliases I currently have:

mutate:
    "muex --mutators arithmetic,boolean,comparison,conditional,function_call,literal,return_value",
"mutate.fast":
    "muex --mutators arithmetic,boolean,comparison,conditional,function_call,literal,return_value --optimize-level aggressive --max-per-function 5"

Using Boundary for module architecture

It’s still unclear if, in a future where the code is a black box, we’ll care about the architecture of the code. Maybe we’ll just let the agents design it as they see fit. But for now, I wanted to have some control over how modules depend on one another.

And for that, I used Boundary.

Boundary lets us mark some modules as boundaries and declare their dependencies and exports. When code violates those boundaries, the compiler emits a warning. It’s the perfect sensor for module interdependencies.

And it comes with a great mix boundary.spec task. This is the current output for ExCrap:

ExCrap
  exports:
  deps:

ExCrap.Complexity
  exports:
  deps:

ExCrap.Coverage
  exports:
  deps:

ExCrap.Mix
  exports:
  deps: ExCrap

ExCrap.Report
  exports:
  deps: ExCrap.Score

ExCrap.Scanner
  exports:
  deps: ExCrap.Complexity

ExCrap.Score
  exports:
  deps:

When I first introduced Boundary into the project, I had already built most of the functionality. The spec it produced showed that we had some undesired coupling. So, I had the agents modify the code to reduce the number of exports and deps to what you see above.

Enforcing boundaries

But here was another problem. In the course of adding a new feature, an agent could easily violate a boundary, get a warning, and then “fix” it by changing the boundary definition — warning gone! But that obviously removes the usefulness of the signal.

To prevent that, I created two additional mix tasks to enforce that the boundary definitions aren’t changed without human approval:

  • The boundary.spec.check task compares the output of mix boundary.spec against output that was previously approved and saved. If the spec differs, then it means the agent has changed the boundaries without approval.

  • To approve a new boundary spec, a human needs to run an interactive mix boundary.spec.accept task.
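The core of that check is a snapshot comparison, which might be sketched as follows. This is not the code behind the two tasks: `SpecCheckSketch`, its return values, and the whitespace normalization are assumptions, and the real accept step is interactive while this one simply writes the file.

```elixir
defmodule SpecCheckSketch do
  # Compares freshly generated spec text with the approved snapshot on disk.
  def check(current_spec, approved_path) do
    case File.read(approved_path) do
      {:ok, approved} ->
        if normalize(approved) == normalize(current_spec) do
          :ok
        else
          {:error, :spec_changed}
        end

      {:error, reason} ->
        {:error, {:no_approved_spec, reason}}
    end
  end

  # Overwrites the snapshot. Meant to be triggered only by a human.
  def accept(current_spec, approved_path) do
    File.write!(approved_path, current_spec)
  end

  # Ignores trailing whitespace so that formatting noise is not reported
  # as a boundary change.
  defp normalize(text) do
    text
    |> String.split("\n")
    |> Enum.map(&String.trim_trailing/1)
    |> Enum.join("\n")
    |> String.trim()
  end
end
```

A mix task wrapping `check/2` would capture the output of mix boundary.spec, pass it in as `current_spec`, and exit non-zero on anything other than `:ok`.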

Of course, an agent can just go ahead and edit the text file that contains the spec’s source of truth. And no amount of COMMANDS suggestions in the AGENTS.md file will stop a determined agent. So, the workflow is far from perfect, but for now, I’m comfortable with it since module architecture is not mission-critical.

CRAP scores

Another signal I used to try to keep code healthy was CRAP scores (that’s why I created the library).

I’ve always found Credo’s cyclomatic complexity check a little annoying because it’ll flag any case statement that has many branches, even if each branch is simple. And case statements are very common in Elixir.

So, before learning about CRAP scores, I was reluctant to treat cyclomatic complexity as a signal that the agent could use to correct itself.

But since CRAP scores combine cyclomatic complexity with test coverage, I actually like the metric a lot more. If a function has good test coverage for all the potential branches in a case statement, the score goes down a lot (technically, a fully covered function’s score is equal to the original cyclomatic complexity score, but CRAP’s default threshold score of 30 is a lot more forgiving than Credo’s 9).
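For reference, the standard CRAP formula is complexity² × (1 − coverage)³ + complexity, which is why full coverage leaves only the complexity term. A minimal sketch of it follows; `CrapSketch` is a hypothetical module, and taking coverage as a percentage from 0 to 100 is an assumption rather than a statement about ExCrap's own score/2.

```elixir
defmodule CrapSketch do
  # complexity: cyclomatic complexity of one function (a positive integer)
  # coverage:   percentage of that function covered by tests, from 0 to 100
  def score(complexity, coverage)
      when is_integer(complexity) and complexity > 0 and
             is_number(coverage) and coverage >= 0 and coverage <= 100 do
    uncovered = 1 - coverage / 100
    complexity * complexity * uncovered * uncovered * uncovered + complexity
  end
end

IO.inspect(CrapSketch.score(10, 0))
# prints 110.0 (complexity squared plus complexity)

IO.inspect(CrapSketch.score(10, 100))
# prints 10.0 (complexity unchanged)

IO.inspect(CrapSketch.score(10, 50))
# prints 22.5
```

Because the uncovered fraction is cubed, even partial coverage pulls the score down quickly: a function with complexity 10 drops from 110 to 22.5 at 50% coverage, which is already under the default threshold of 30.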

So throughout the creation of the library, I ran mix crap (or my other trusty alias mix test.crap) to see that everything was green.

When I wanted more details, I ran mix crap --verbose. And if I wanted to see what modules were getting too close to the threshold, I ran mix crap with a lower --max-score flag (e.g. mix crap --max-score 20).
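Lowering the threshold is just a stricter filter over the same scores. As a rough sketch (the `ThresholdSketch` module, the map shape, the function names in the sample data, and the strictly-greater-than comparison are all assumptions, not ExCrap's behavior):

```elixir
defmodule ThresholdSketch do
  # Returns the entries whose score is strictly above max_score, worst first.
  def offenders(scores, max_score) do
    scores
    |> Enum.filter(fn %{score: score} -> score > max_score end)
    |> Enum.sort_by(& &1.score, :desc)
  end
end

# Made-up scores for three made-up functions.
scores = [
  %{function: "Foo.render/2", score: 24.5},
  %{function: "Foo.parse/1", score: 42.0},
  %{function: "Foo.new/0", score: 1.0}
]

IO.inspect(ThresholdSketch.offenders(scores, 30))
# prints [%{function: "Foo.parse/1", score: 42.0}]

IO.inspect(Enum.map(ThresholdSketch.offenders(scores, 20), & &1.function))
# prints ["Foo.parse/1", "Foo.render/2"]
```

With a limit of 20, Foo.render/2 shows up even though it would pass at 30, which is the "getting too close" signal.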

Adding more static checks

Last but not least, I also reached for static analysis tools in my quest to keep the black box healthy. For ExCrap, I added Credo, Quokka, and Sobelow, though I added them pretty late in the process.

Since it was late in the game, Credo and Quokka (which runs automatically on mix format) didn’t find any glaring issues (that I could tell). Sobelow, on the other hand, caught a few things. The vast majority were minor (which I ignored), but it caught a String.to_atom call that I was very glad to fix.

After all was said and done, I added all three to my precommit alias that is the core of my fast checks:

    "test.crap": [
        "test --cover --export-coverage default",
        "test.coverage",
        "crap"
    ],
    precommit: [
        "format",
        "test.crap",
        "boundary.spec.check",
        "credo --strict",
        "sobelow --skip"
    ],

Final reflections

So, having now published the library, how do I feel about the project?

I have to say, I still feel unsettled. It is very strange to have a project under my name that I’m not entirely sure I understand.

Before I published the library, I opened the black box and looked inside. I wanted to make sure I wasn’t shipping malware to people in the Elixir community. I didn’t change anything because nothing stood out as malicious or bad, and I wanted to keep the experiment somewhat “pure” until I wrote about it.

And that’s one part of the experiment I’m taking with me: without peeking inside the box, I just couldn’t verify that I wasn’t shipping something bad to other Elixir folks (or shake the feeling that I might be). So, at least for now, I want to be able to see inside the box.

So, what was inside the box?

The code isn’t at all how I would’ve written it. That’s not to say it’s necessarily bad. I actually think a lot of it looks interesting. But I didn’t spend hours and hours understanding it and internalizing it, so it looks foreign to me.

Aside from that, the library is small and simple enough that I’m happy to publish it, not least because I want to use it in my own projects.

Meta lessons about working with LLMs

I want to leave you with two (meta) lessons learned.

The first is something I’ve come to realize about working with LLMs (which, honestly, should’ve already been apparent as a software developer). It is best to do things manually first. Discover the rough edges. Fine-tune the flow. Once you’ve done that, and the action starts feeling repetitive, then automate it (or hand it to an agent).

The second lesson is that you sometimes have to lean more into how LLMs work. It was surprising to me at first (though in hindsight it shouldn’t have been) that the best solution to my failing-to-parse-files issue was to go get more data, refine it, and use it for my tests. That’s something that might’ve been too expensive (or annoying) to do without agents. But with agents, it was a nice way to provide the data we needed.

Thank you for reading and going on this journey with me. In this wild west of AI, I encourage you to experiment more, and then share what you learn!
