
Introducing DARUM - DAfny Resource Usage Measurement

Warum? Ach, darum! (German: "Why? Ah, that's why!")

My team at Consensys R&D worked mostly with Dafny, the verification-friendly language. We enjoyed good success with it in our projects during the last few years; our extensive use got us a mention in Rustan Leino's Program Proofs book, and we even got a Best Paper Award in the 2022 edition of the International Conference on Formal Methods for Industrial Critical Systems!

Unfortunately, our team at Consensys is no more. So I finished my tool DARUM for a v1.0 release, and this post aims to gather the loose threads and offer an easier entry point for external users.

This is a gentler introduction to the problem of Dafny's verification brittleness and how DARUM helps diagnose it in your code. It's intended for those who may know some Dafny but are not necessarily acquainted with how it works behind the scenes: Boogie, Z3, logs, mutations, etc.

There's also a shorter, more technical document. And a full walkthrough of using DARUM on a real project.

 

Verifying code with Dafny – great, but...

Dafny is a verification-friendly language and toolchain. It automates away a lot of the manual, step-by-step proving work that one would otherwise do with classic Interactive Theorem Provers (ITPs) like Coq or Isabelle. Dafny is relatively easy to learn, and the resulting code is very readable by non-specialist programmers, which is great for creating executable specifications that other teams can use. For example, one goal for Consensys' DafnyEVM project was to have it used as a mechanically-checked, well-specified EVM by developers of Ethereum clients, like Consensys’ own Besu.

Unfortunately, Dafny also presents some challenges. An important one is that projects typically develop brittleness issues as they grow in size. In a nutshell, the same automation that helps young projects develop rapidly tends to turn into a growing, inscrutable blocker as the project matures.

To understand what brittleness is, we need to look a bit under Dafny’s hood.

How does Dafny verify your code?

The code and specifications that one writes in Dafny are internally translated into many assertions that need to be proven (discharged). In principle, then, your job as a Dafny developer is to ensure that your code contains enough relevant information for Dafny to prove your specifications correct. Dafny uses that information to generate assertions that are sent to the Z3 theorem prover to be discharged. For example: if your code has an expression like x/y, Dafny will generate an assertion requiring that something in the context ensures that y ≠ 0 at that point. If Z3 finds a solution, the assertion is discharged, meaning that there’s a proof that division-by-zero can never happen. No need to check at runtime! Neat!
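To make this concrete, here is a minimal Dafny sketch (the method name and body are made up for illustration, not taken from any real project):

    method Scale(x: int, y: int) returns (r: int)
      requires y != 0  // this precondition is the "something in the context"
    {
      r := x / y;      // Dafny generates an assertion here: y != 0 must hold at this point
    }

Remove the requires clause and Dafny will complain that it cannot rule out a possible division by zero.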

 

[Figure: “Z3 doesn't have enough context to find a solution” vs. “Now Z3 can reach the goal”]

Did you notice that handwavy “something in the context”? Every piece of Dafny code provides context information to help prove whatever assertions might come later. For example, even for our innocent x/y case, Dafny and Z3 need to ensure that x, y and the division operator all make sense together; for this, they use information from their definitions. This context information forms the search space in which Z3 looks for proofs.

For years, the standard advice for helping Dafny verify your code was to keep adding context and to check, for example with assertions, that Dafny “knows” some fact at a strategic point in the code.
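In code, that advice looks something like this (a toy example made up for this post; in real projects the asserted fact is usually a heavier, lemma-sized property rather than something this trivial):

    method Middle(s: seq<int>) returns (m: int)
      requires |s| > 0
    {
      var i := |s| / 2;
      assert 0 <= i < |s|;  // a "stepping stone": check that Dafny knows this fact here,
                            // and keep it explicitly in the context for what follows
      m := s[i];            // the indexing assertion can now reuse that fact
    }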

However, as your Dafny project grows, the quantity of context information generated grows much faster than the code itself. When Dafny then asks Z3 to solve some assertion, Z3 has a huge space in which to search, and so the search starts taking longer and longer. Z3 might go down unproductive rabbit holes, and what initially took seconds now takes minutes or hours – which in practice typically turns into a timeout.

[Figure: big context allows Z3 to find different paths to the solution... for good or bad]

It’s important to note that Z3 is a complex piece of software, applying many heuristics to search for proofs, which causes an inherent degree of randomness: small changes to its input can cause it to use different search paths. And this is only exacerbated by Dafny/Boogie, which inject their own randomness.

To give us at least some insight, when Z3 returns a result it also reports how much it cost to find it. This is the only feedback we get. Dafny can log these costs, and even report averages and covariances across many verification runs.

Slowness is not the real problem

Slowness is not great. Alas, it’s the variability in the verification speed that turns into a real problem. The particularly bad case, the one that hurts developers, happens when verification sometimes runs acceptably fast, but at random times it runs much slower. Part of verification work is a bit of an art, where the Dafny programmer needs to develop a sense of what would help Dafny verify their code. These random slowdowns mess with that intuition of what’s going on.

But also, consider that the slowness we’re talking about can easily be 10x or 100x worse than the fast case. This is in stark contrast to what one would expect in the world of mainstream software development, where a speed difference of 2x or 3x is typically considered big and surprising.

The typical failure mode

Indeed, the typical failure mode that Dafny developers run into is that a file with many members (methods, functions, fields) suddenly turns strangely sensitive to trivial changes. Imagine: you change some minor detail in method M1 of file F – and now the verification of the unrelated method M2 in the same file times out! You undo the change in M1 and maybe the verification now succeeds – or maybe it needs a couple of increasingly desperate re-runs, but eventually it does succeed. Relieved that now everything seems to work, you carefully save everything, curse the hour you wasted on this, and decide to move on to some more pressing issue.

Of course this also happens on CI: your commit broke the build! … or did it? 

It’s hard to overstate how much of a problem this is. Refactoring is supposed to be a basic tool in any developer’s toolbox – but here one finds that trivialities like renaming a variable suddenly seem to break stuff! Or was it that I changed a code comment? That shouldn’t matter! … and indeed now it works?

This kind of situation breaks productivity, breaks morale, and turns into a very real blocker for further development. In fact, this kind of situation is what motivated Z3’s creator Leo de Moura to start developing the Lean ITP.

This phenomenon was initially named “verification instability” or “durability” by the Dafny team. Later, they settled on the term brittleness. Funnily enough, the earliest issue in the Dafny repository tagged for this was created by yours truly back in 2021.

Dealing with brittleness

For the last few versions, the Dafny language team has been adding features to allow the user to better control how much context reaches Z3. Remember our x/y case? Later in that program you might realize that you are no longer using those variables, so you can make Dafny forget about that piece of context after that point. But unfortunately this is more art than science. Z3 is – infamously – a bit of a black box, so it is rarely clear why it is fast in one run or slow in another.

While the Dafny developers work on improving the situation, we considered alternatives, like migrating to some ITP: maybe it'd be better to spend time building manual proofs instead of spending it fighting unhelpful automation?

However, Dafny brought us this far – we had a big investment in code and experience with it. And Dafny keeps adding new functionality that we were looking forward to using to expand our projects into new territories – like automatic, verification-guided test generation. And well, we still loved it! So we tried to improve the situation from our end.

Analyzing verification costs

To recap, Z3 is a black box, particularly from the Dafny point of view. The only things we get back from it are a success/failure/timeout response, plus how many "Resource Units" (RU) the response took (loosely corresponding to time spent). Furthermore, Z3 and Dafny work with a measure of randomness; this means that, even for a semantically unchanged Dafny program, Z3 might report wildly differing costs, even spanning multiple orders of magnitude. Indeed, Dafny can log the RU cost of verifying each code member, and provide some statistics like averages and covariances of these costs.

So we wondered: if we repeat a verification run multiple times and purposefully lean on the randomness of RU costs, is there anything we can learn? What does that randomness actually look like?

Turns out that there's interesting information to catch! After analyzing Dafny's logs and some long-running experiments, we found that the distributions of RU costs are typically multimodal. For example, for a given piece of code some verification runs could take ~7 seconds, while other runs would oscillate around 800 seconds... with hardly anything between those extremes.

One intuitive explanation could be that each mode is caused by a different general search direction explored by Z3: one run went in a direction that found a quick solution, while another run went into some unproductive rabbit hole, causing a big, abrupt difference in the RU cost.

[Figure: a rare but fast search direction vs a frequent but slow one]

But why didn’t we notice this earlier with Dafny’s own statistics? Unfortunately, statistics like the averages and covariances provided by Dafny are useful for unimodal distributions, but they are not intuitive for multimodal ones. Case in point: if 100 verification runs yield 99 runs with an RU cost of 1000 and 1 run with a cost of 10, then the average is 990.1; but this just obscures the interesting fact that most runs go into deep rabbit-hole searches while there’s a very short path that is rarely taken. The rarity of the extremes makes them precious, because they denote infrequent search paths that we want to pursue; smoothing them into an average is counterproductive.

Another complication is that each Dafny member has its own RU verification cost, but Dafny primarily reports file-level RU costs: the simple sum of the RU costs of the members in that particular run. That is, at each verification run some members could finish quickly while others could take a long time, causing a big whole-file RU cost or a timeout, and obscuring the per-member variation. Recall that a sum of random variables is another random variable – so even if there’s some important change happening in one member’s RU cost, it gets obscured by the rest of the members’ costs! This explains the typical failure mode I described earlier: code changes that should be trivial seem to change the verification times across the board. Of course the developer assumes that this was caused by their latest edits. But it's random! You’re just experiencing the multimodal distribution of a sum of random variables. Even if your member’s cost didn’t change, the rest did, and so did the sum.
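In slightly more formal terms, and simplifying heavily (in particular, treating the per-member costs as independent random variables, which is only a convenient approximation):

    \[
      C_{\text{file}} = \sum_{m \in \text{members}} C_m ,
      \qquad
      \operatorname{Var}(C_{\text{file}}) = \sum_{m} \operatorname{Var}(C_m)
    \]

The noise in the file-level cost aggregates the noise of every member, so whatever happens to your member's cost can easily be drowned out by the variability of all the others.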

Amusingly, this situation is similar to Skinner’s “superstitious pigeon” experiments: a developer tries to connect the random slowdowns or speedups with their actions, unsuccessfully – until, just by chance, a couple of trials seem to confirm some personal theory. “Dafny doesn’t like it when I assert this in here – wait, I swear it worked a moment ago!”. And we end up with our own little superstitions and rituals, hoping to control the randomness.

We can still go a bit deeper. Consider that Dafny by default groups the assertions pertaining to a member into an Assertion Batch, aiming to improve performance; the idea is that the assertions in such a batch logically share a lot of context, and therefore might verify faster as a group than if each assertion had to recreate a similar context for verification. But, intriguingly, Dafny also provides a verification mode in which assertions are isolated one by one instead of batched. So, what happens if we compare the performance of both modes: the cost of a member’s standard Assertion Batch vs the sum of the costs of the isolated assertions in that member?

We find that, indeed, in well-behaved cases batched assertions verify about 10x faster than when they are isolated. However, those batched cases sometimes degenerate into the already described case of astronomical costs with high, multimodal variation. In contrast, isolated assertions are typically slower to verify, but very consistent in their RU costs, with some very illuminating exceptions, which hint that any variation present in an isolated assertion is magnified in its corresponding Assertion Batch. And vice versa: when a member’s Assertion Batch shows RU cost variability, this can typically be traced to some of its isolated assertions being more variable than the rest. These observations taken together yield an algorithm to drill down from the top-level code costs to the individual assertions that are slowing down the verification.
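For the curious: how to ask Dafny for this one-by-one isolation depends on the Dafny version. Recent Dafny 4 releases accept a per-member attribute spelled {:isolate_assertions} (older releases used {:vcs_split_on_every_assert}), plus an equivalent command-line option; do check the reference manual of your version, since these names have changed over time. A toy sketch:

    // Ask Dafny/Boogie to verify each assertion of this member in its own batch,
    // instead of one big Assertion Batch for the whole member.
    // (The attribute name is version-dependent; see your Dafny reference manual.)
    method {:isolate_assertions} LotsOfProofWork(x: int, y: int)
      requires 0 < x < y
    {
      assert x + 1 <= y;
      assert 2 * x < 2 * y;
    }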

What can be done?

OK, so we know that we can’t directly force Z3 to take a particular path towards the goal. That’s actually what one would do with an ITP: manually build the proof, piece by piece and step by step, possibly with a bit of automation. It’s a lot of specialist work, but once it’s done, it’s done forever – until you change your code.

But Dafny isolates us from that work by automating most of it: code can be changed and re-verified just like one would re-run a compiler. Assertions are generated from the code, and Z3 finds a solution from scratch each time.

The only mechanism we have in Dafny to control Z3's search path is to reduce the context in which Z3 searches. The cost variations inherent to Z3 will still be there, though hopefully much smaller.

How to do this? By using Dafny's features for limiting context: judiciously use opaque definitions, reduce the length of functions and methods, etc. This translates into making Assertion Batches smaller, which, as we saw when comparing them to isolated assertions, makes verification slower but less brittle: effectively, stabilization by pessimization. This remains a bit of an art: one needs to know how and where to curtail the automation that was supposed to help us.
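As a small sketch of what “limiting context” can look like in code (the function is made up, and the exact mechanisms and syntax vary across Dafny versions, so treat this as one possible flavour):

    // The {:opaque} attribute hides the function body from the verification
    // context by default; callers only see the signature.
    function {:opaque} Spec(x: int): int {
      x * x + 1
    }

    method UsesSpec() {
      // Up to this point Z3 knows nothing about Spec's body: smaller search space.
      reveal Spec();       // expose the body only where it is actually needed
      assert Spec(2) == 5;
    }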

Taken to the extreme, this could make you wonder how this effort compares to doing the corresponding job on an ITP like Isabelle, with its more manual (but explicit) approach.

So, to avoid reaching the point where ITPs become tempting, can we find where in the code it makes sense to surgically limit the context, and leave the rest alone – so that it verifies faster? This would minimize the stabilization-by-pessimization extra work, while keeping the code easier to work with for developers, and easier to read for non-verification specialists.

Introducing DARUM

At Consensys, we turned these observations and heuristics into a tool to ease further exploration: DARUM. It works at two levels:
  • At a member level: it analyzes the cost of verification runs, ranks the members by variability and cost, and highlights the ones that should probably be fixed first.
  • At an assertion level: it compares the cost of verifying a member as an Assertion Batch vs. as the sum of its individual assertions. This helps distinguish whether a member verifies slowly just because it contains a lot of assertions, or whether some of the assertions themselves are causing the variability or cost seen at the higher level. This can typically be traced down to single expressions.

With minimal effort, DARUM guides you towards diagnosing what is making verification unreliable or slow – and hints at which next steps should be prioritized to improve the situation.

Version 1.0 of DARUM is out, together with a practical walkthrough of how to use it on a real project.

The fundamental insight is to analyze the variability of the RU costs that Z3 reports for the input generated by Dafny/Boogie; so this approach could probably also be applied to other environments built on top of Boogie or similar solvers/toolchains. Maybe I’ll explore that in the future!

Happy verification!

