> The excellent book xUnit Test Patterns describes a test smell named Assertion Roulette. It refers to situations where it may be difficult to determine exactly which assertion caused a test failure.
How is that even possible in the first place?
The entire job of an assertion is to wave a flag saying "here! condition failed!". In the programming languages and test frameworks I've worked with, this typically includes providing, at minimum, the expression put in the assertion, verbatim, and the precise coordinates of the assertion - i.e. the name of the source file plus the line number.
I've never seen a case where it would be hard to tell which assertion failed. On the contrary, the most common problem I see is knowing which assertion failed, but not how the code got there, because someone helpfully stuffed it into a helper function that gets called by other helper functions in the test suite, and the testing framework doesn't report the call stack. But it's not that big of a deal anyway; the main problem I have with it is that I can't glean the exact source of failure from CI logs, and have to run the thing myself.
> I've never seen a case where it would be hard to tell which assertion failed.
There is a set of unit testing frameworks that do everything they can to hide test output (JUnit), or vomit multiple screens of binary control-code emoji soup to stdout (Ginkgo), or just hide the actual stdout behind an authwall in a UUID-named S3 object (CodeBuild).
Sadly, the people with the strongest opinions about using a "proper" unit test framework with lots of third party tooling integrations flock to such systems, then stack them.
I once saw a dozen-person team's productivity drop to zero for a quarter because junit broke backwards compatibility.
Instead of porting ~100,000 legacy (spaghetti) tests, I suggested forking + recompiling the old version for the new JDK. This was apparently heresy.
I was a TL on a project and I had two "eng" on the project who would write tests with a single method and then 120 lines of Tasmanian Devil test cases. One of those people liked to write 600 line cron jobs to do critical business functions.
> One of those people liked to write 600 line cron jobs to do critical business functions.
I was a long-time maintainer of Debian's cron, a fork of Vixie cron (all cron implementations I'm aware of are forks of Vixie cron, or its successor, ISC cron).
There are a ton of reasons why I wouldn't do this, the primary one being that cron really just executes jobs, period. It doesn't serialize them, it doesn't check for load, logging is really rudimentary, etc.
A few years ago somebody noticed that the cron daemon could be DoS'ed by a user submitting a huge crontab. I implemented a 1000-line limit on crontabs, thinking "nobody would ever have 1000-line crontabs". I was wrong, and quickly received bug reports.
I then increased it to 10K lines, but as far as I recall, users were hitting even that limit. Crazy.
Hadn't heard of it before, and it appears not to be.
There indeed exist a few non-Vixie-cron-derivative implementations but as far as I'm aware, all major Linux and BSD distributions use a Vixie cron derivative.
Edit: I see now where I caused confusion. In my original post, I should have said all default cron implementations.
I thought Dillon cron was the default cron in Slackware? Hard to be a more major Linux distribution than Slackware, in terms of historical impact if not current popularity.
I just confirmed with a Slackware user today, it still does use Dillon cron. I had a vague memory from before I switched from Slackware to Debian late last millennium.
JUnit is especially bad about this. I often wonder how many of these maxims are from people using substandard Java tools and confusing their workarounds with deeper insights.
Here are a few lessons from mistakes I've seen in other frameworks:
- Make it possible to disable timeouts. Otherwise, people will need a different runner for integration, long running (e.g., find slow leaks), and benchmark tests. At that point, your runner is automatically just tech debt.
- It is probably possible to nest befores and afters, and to have more than one nesting per process, either from multiple suites or due to class inheritance, etc. Now you have a tree of hooks. Document whether it is walked in breadth-first or depth-first order, then never change the decision (or disallow trees of hooks entirely, either by detecting them at runtime, or by picking a hook registration mechanism that makes them inexpressible).
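To make the second point concrete, here's a toy sketch (not any real framework's API; all names are made up) of a runner that commits to one answer: befores run outermost-first, afters innermost-first, and nested suites are walked depth-first.

    class Suite:
        def __init__(self, name, before=None, after=None, tests=(), children=()):
            self.name, self.before, self.after = name, before, after
            self.tests, self.children = list(tests), list(children)

    def run(suite, ancestry=()):
        chain = (*ancestry, suite)            # root-to-leaf chain of suites
        for test in suite.tests:
            for s in chain:                   # outermost "before" first
                if s.before: s.before()
            test()
            for s in reversed(chain):         # innermost "after" first
                if s.after: s.after()
        for child in suite.children:          # depth-first into nested suites
            run(child, chain)

Whichever traversal a framework picks matters less than writing it down and never changing it.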
We recently switched our system to a heartbeat system instead of a timeout system. The testing framework expects to see messages (printf, console.log, etc...) often. So a test testing a bunch of combinations might take 45 seconds to run but for each combination it's printing "PASS: Combination 1, 2, 3" every few ms.
This way the framework can kill the test if it doesn't see one of these messages in a short amount of time.
This fixed our timeout issues. We had tests that took too long, especially in debug builds, and we'd end up having to set too large a timeout. Now, though, we can keep the timeout for the heartbeat really short and our timeout issues have mostly gone away.
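A rough, Unix-only Python sketch of that heartbeat idea (the helper and limits here are made up, not the commenter's framework): the runner kills a test process that goes quiet for too long, instead of enforcing one large end-to-end timeout.

    import select, subprocess, sys

    def run_with_heartbeat(cmd, quiet_limit=5.0):
        """Fail the test if it prints nothing for quiet_limit seconds."""
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        while proc.poll() is None:
            ready, _, _ = select.select([proc.stdout], [], [], quiet_limit)
            if not ready:                     # no heartbeat inside the window
                proc.kill()
                raise TimeoutError(f"no output for {quiet_limit}s: {cmd}")
            sys.stdout.buffer.write(proc.stdout.readline())
        return proc.returncode

The total runtime can then be as long as it needs to be, as long as the test keeps talking.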
Never disable the timeouts. What you want is a way to set the timeouts once for an entire suite. Unit, functional, and integration tests all have a different threshold from each other. But in general, within one kind, your outliers almost always have something wrong with them. They're either written wrong or the code is. And once in a while it's okay to override the timeout on one test while you're busy working on something else.
The problems isn’t with breaking rules. The problem is with promising yourself or others that you will fix it “later” and then breaking that promise.
I'd add one more: clearly document what determines the order in which tests are run.
On the one hand, running tests in any order should produce the same result, and would in any decent test suite.
On the other hand, if the order is random or nondeterministic, it's really annoying when 2% of PRs randomly fail CI, not because of any change in the code, but because CI happened to run unrelated tests in an unexpected order.
Test order should be random, so that the ability to run them in parallel and distribute them across multiple hosts is not lost by missing enforcement of test isolation.
The tests are fatally broken. It means you can't even trust them to properly check new work.
The solution is to use random ordering and print the ordering seed with each run so it can be repeated when it triggers an error. Immediately halt all new work until randomly run tests don't have problems.
This isn't as bad as it sounds; generally a few classes of problems cause the interference, and fixing each one repairs many tests. It's unlikely that the code actually has a 2%+ density of global-variable use, for example.
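A minimal, framework-agnostic sketch of that (all names are made up): shuffle with a seed, print the seed on every run, and accept it back to reproduce a failing order exactly.

    import random, time

    def run_shuffled(tests, seed=None):
        seed = int(time.time()) if seed is None else seed
        print(f"test order seed: {seed}")     # re-run with this seed to reproduce
        order = list(tests)
        random.Random(seed).shuffle(order)
        for test in order:
            test()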
I sometimes run into issues not so much due to order dependencies specifically, but due to tests running in parallel sometimes causing failures due to races. It's almost always been way more work to convert a fully serial test suite into a parallel one than it is to just write it that way from the start, so I think there's some merit in having test frameworks default to non-deterministic ordering (or parallel execution if that's feasible) with the ability to disable that and run things serially. I'm not dogmatic enough to think that fully parallel/random order tests are the right choice for every possible use case, but I think there's value in having people first run into the ordering/race issues they're introducing before deciding to run things fully serially so that they hopefully will consider the potential future work needed if they ever decide to reverse that decision.
I’ll disagree with this. Every time I’ve seen that, the interference between tests was also possible between requests in production. I’d rather my test framework give me a 2% chance of noticing the bug than 0%.
What's annoying is not being able to reproduce the 2% cases so you can't fix it even when you've noticed them. Sensible test tools give you the random seed they used to order the tests so you can reproduce the sequence.
If you have to pick one or the other, then you're breaking the common flow (human debugging code before pushing) so that management can have better reports.
The right solution would be to add an environment variable or CLI parameter that told tap to produce machine-readable output, preferably with a separate tool that could convert the machine-readable junk to whatever TAP currently writes to stdout/stderr.
But unlike TAP, it's fairly Perl-specific as opposed to just being an output format. I imagine you could adapt the ideas in it to Node but it'd be more complex than simply implementing TAP in JS.
And yes, I think the idea of having different output formats makes sense. With Test2, the test _harness_ produces TAP from the underlying machine-readable format, rather than having the test code itself directly produce TAP. The harness is a separate program that executes the tests.
Nothing should have to be parsed. Write test results to SQLite, done. You can generate reports directly off those test databases using anything of your choice.
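A sketch of that idea in Python (the table layout is invented for illustration): the runner appends rows, and any reporting tool can query them with plain SQL instead of parsing logs.

    import sqlite3

    def record_results(db_path, rows):
        # rows: iterable of (suite, test, status, duration_ms, output)
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS results
                       (suite TEXT, test TEXT, status TEXT,
                        duration_ms REAL, output TEXT)""")
        con.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)
        con.commit()
        con.close()

    # later, from any tool:
    #   SELECT test, output FROM results WHERE status = 'fail';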
> There is a set of unit testing frameworks that do everything they can to hide test output (JUnit), or vomit multiple screens of binary control-code emoji soup to stdout (Ginkgo), or just hide the actual stdout behind an authwall in a UUID-named S3 object (CodeBuild).
The test runner in VS2019 does this too, and it's incredibly frustrating. I get to see debug output about DLLs loading and unloading (almost never useful), but not the test's stdout and stderr (always useful). Brilliant. At least their command line tool does it right.
I remember writing a small .NET test library for that exact problem: you could pass in a lambda with a complex condition, and it evaluated every piece of the expression separately and pretty-printed which part of the condition failed.
So essentially you could write
Assert(() => width > 0 && x + width < screenWidth)
And you would get:
Assertion failed:
    x is 1500
    width is 600
    screenWidth is 1920
It used Expression<T> to do the magic. Amazing debug messages. No moralizing required.
This was a huge boon for us as it was a legacy codebase and we ran tens of thousands of automated tests and it was really difficult to figure out why they failed.
Related to this, for anyone not fully up to date on recent C# features there is also the CallerArgumentExpression [1], [2] feature introduced in C# 10. While it is not a pretty printer for an expression, it does allow the full expression passed from the call site as an argument value to be captured and used within the method. This can be useful for custom assert extensions.
For example:
public void CheckIsTrue(bool value, [CallerArgumentExpression("value")] string? expression = null)
{
    if (!value)
    {
        Debug.WriteLine($"Failed: '{expression}'");
    }
}
So if you call it like this: CheckIsTrue(foo != bar && baz == true), then when the value is false it prints "Failed: 'foo != bar && baz == true'".
I love using Unquote[0] in F# for similar reasons; it uses F#'s code quotations. Assuming the variables have been defined with the values you state, the assertion is written as:
This is what the Python test framework Pytest does, among many other similar useful and magical things. I believe that the Python developer ecosystem as a whole would be substantially less productive without it.
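For illustration, the earlier .NET example written as a plain pytest test: pytest's assertion rewriting reports the values of the names and sub-expressions involved (x, width, screen_width, and the intermediate x + width) in the failure message, with no custom assert helper.

    def test_fits_on_screen():
        x, width, screen_width = 1500, 600, 1920
        # on failure, pytest shows the evaluated pieces of this expression
        assert width > 0 and x + width < screen_width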
Not quite the same, but available on NuGet: 'FluentAssertions' gives you something akin to this. I've had decent success with having our juniors use it vs less verbose assertion libraries. I don't know about evaluating individual expressions in a line separately, but it does give you clean syntax and similar error messages that are very readable-
I like to use Jest’s toMatchObject to combine multiple assertions in a single assertion. If the assertion fails, the full object on both sides is shown in logs.
You can easily debug tests that way.
The only way to make it even possible is to do some eval magic or to use a pre-processor like Babel or a TypeScript compiler plugin.
Well, Function.toString() should print the code of a lambda in JavaScript. So I think you could do it without a pre-processor: use Babel as a library to parse the body of the function; run each sub-expression separately and display the results.
You can just document a constraint that these functions must use the purely functional subset of JavaScript, which is enough for the sorts of assertions I typically write. Alternatively, you could constrain it to the subset that has no unbound variables or side effects.
> How is that even possible in the first place? The entire job of an assertion is to wave a flag saying "here! condition failed!".
I envy you for never having seen tests atrocious enough where this is not only possible, but the common case.
Depending on language, framework and obviously usage, assertions might not be informative at all, providing only the basic functionality of failing the test - and that's it.
Now imagine this barebones use of assertions in tests which are entirely too long, not isolating the test cases properly, or even completely irrelevant to what's (supposedly) being tested!
If that's not enough, imagine this nightmare failing not after it has been written, but, let's say, 18 months later, while being part of a massive test suite that has been running for a while. All you have is the name of the test that failed; you look into it to find a 630-line-long test "case" with 22 nondescript assertions along the way. You might know which line failed the test, but not always. And of course debugging the test function line by line doesn't work because the test depends on intricate timing for some reason. The person who wrote this might not be around and now this is your dragon to slay.
I think I should stop here before triggering myself any further. Therapy is expensive.
Even if the framework is fine, you can see something like an elaborate if-else tree, or even a try-catch block, and after it's all done, there's a condition check with `fail()`. So the reported point of failure can be detached from the actual point of failure.
Granted, this is not the way to do things. But it happens anyway.
I mean in your example it’s someone choosing not to use asserts. Which is a problem, don’t get me wrong, but it’s not the problem being talked about here.
The comment thread is about “Assertion Roulette” — having so many assertions you don’t know which went off. Which really seems like a test framework issue more than a test issue.
Not at all. It makes sense in some tests. I addressed the part asking how it's even possible to not know what happened.
As for multiple asserts, that is really meaningless. The test case should test one thing. If it requires several asserts, that's okay. But having a very long test function with a lot of assertions strongly indicates that you're testing more than one thing, and when the test fails it will be harder to know what actually happened.
I guess you might be talking about a different language / environment than I'm used to, but even in the 100 assertion test case you get useful tracebacks in python. Testing lots of things at the same time means strictly speaking you're writing an integration test rather than a unit test, but still I don't see how it's a bad test. It's easy and stops buggy PRs going into production.
The test failures I see that are actually hard to debug are ones where the failures are difficult to reproduce due to random input, tests running in parallel and sharing the same filesystem etc. I don't think I've ever not known what assert was failing (although I guess in theory you could make that happen by catching AssertionError).
> Testing lots of things at the same time means strictly speaking you're writing an integration test rather than a unit test
There's nothing wrong with integration tests, but they're not unit tests. It's fine to have both, but the requirements for a good unit test and those for a good integration test diverge. The title of this post, at least, was specific to unit tests.
A unit test tests one unit, and an integration test covers more than one unit. I think everyone agrees with that, but nobody has defined "unit".
The longer I program the more I am convinced that the larger your unit the better. A unit test is a statement that you will never refactor across this line, and that eliminates a lot of flexibility that I want.
It turns out that debugging failed integration tests is easy: the bug is in the last thing you changed. Sure the test covers hundreds of lines, but you only changed one.
I recently went to the effort of trying to work out where the term unit test came from in some desperate effort to find what a unit was meant to be.
After much googling and buying of ancient textbooks I hit a dead end. At this point I think "unit" is just noise that confuses people into making distinctions that don't exist.
As I recall the TDD mailing list has some background on the use of the word "unit", it goes WAY back, I believe it goes back to the mainframe/ punch card era. Regardless, I think it roughly translates to C's notion of the unit of compilation.
Which is obviously not what people really mean these days, but the phrase stuck. The early Xp'ers even found it an issue back then.
For a while people tried to push the term "micro tests", but that didn't really take off.
I agree with Gerard Meszaros and Martin Fowler and typically follow their (very mainstream) definitions on this stuff. Integration and functional testing have their own ambiguities too; it's definitely a frustrating situation to not have solidly defined foundational terms.
IIRC the "unit" in "unit test" was meant to mean "semantic unit" ("the access module", for example, should be distinct with a well-defined interface that all the tests go through), but very quickly turned into "syntactic units" ("a single function", for example, where the "well-defined interface" ends up just being function arguments/return value) because most people didn't understand what the original proponents meant.
> It turns out that debugging failed integration tests is easy: the bug is in the last thing you changed. Sure the test covers hundreds of lines, but you only changed one.
That’s not true.
A correct change might expose an existing bug which hadn’t been tested or expose flaky behavior which existed but hadn’t been exercised. In both cases the solution is not to revert the correct change, but to fix the buggy behavior.
Watched a bit of this... It's typical test-driven zealotry; the main criticism of integration tests seems to be that they don't force your hand in system design in the way that unit tests do? Which seems very silly, but then, I'm not a person who goes to conferences about testing philosophy.
Did you miss his follow-up? "Integration tests are a scam is a scam". For real. I like J.B., but I think he muddies the water too much and overall understanding suffers.
> A unit test is a statement that you will never refactor across this line, and that eliminates a lot of flexibility that I want.
I certainly don't see it as that. I see it as "this is the smallest thing I _can_ test usefully". Mind you, those do tend to correlate, but they're not the same thing.
> this is the smallest thing I _can_ test usefully
Then you're testing useless things.
Usefulness is when different parts of a program work together as a coherent whole. Testing DB access layer and service layer separately (as units are often defined) has no meaning (but is often enforced).
>> this is the smallest thing I _can_ test usefully
> Then you're testing useless things.
We'll have to agree to disagree then.
> Testing DB access layer and service layer separately (as units are often defined)
Not at all. For me, a unit is a small part of a layer; one method. Testing the various parts in one system/layer is another type of test. Testing that different systems work together is yet another.
I tend to think in terms of the following:
- Unit test = my code works
- Functional test = my design works
- Integration test = my code is using your 3rd party stuff correctly (databases, etc)
- Factory Acceptance Test = my system works
- Site Acceptance Test = your code sucks, this totally isn't what I asked for!?!
The "my code works" part is the smallest piece possible. Think "the sorting function" of a library that can return it's results sorted in a specific order.
That seems like a silly opinion to me. I use unit tests to make sure that individual units work like I expect them to. And I use them to test edge cases that can be tested separately from their caller. If I had to test all the use cases for each function, all combined together, the number of tests would grow by the multiplication of the partitions of each one, N x M x O x P, ... rather than the sum, plus a much smaller set of tests for how they work together (N + M + O + P + N_M + M_O + O_P, etc). It's much simpler to thoroughly test each unit. Then test how they work together.
> If I had to test all the use cases for each function, all combined together, the number of tests would grow by the multiplication of the partitions of each one
Why would they? Do these edge cases not appear when the caller is invoked? Do you not test these edge cases and the behavior when the caller is invoked?
As an example: you tested that your db layer doesn't fail when getting certain data and returns response X (or throws exception Y). But your service layer has no idea what to do with this, and so simply fails or falls back to some generic handler.
Does this represent how the app should behave? No. You have to write a functional or an integration test for that exact same data to test that the response is correct. So why write the same thing twice (or more)?
You can see this with Twitter: the backend always returns a proper error description for any situation (e.g. "File too large", or "Video aspect ratio is incorrect"). However, all you see is "Something went wrong, try again later".
> It's much simpler to thoroughly test each unit. Then test how they work together.
Me, telling you: test how they work together, unit tests are usually useless
You: no, this increases the number of tests. Instead, you have to... write at least double the amount of tests: first for the units, and then test the exact same scenarios for the combination of units.
----
Edit: what I'm writing is especially true for typical microservices. It's harder for monoliths, GUI apps etc. But even there: if you write a test for a unit, but then need to write the exact same test for the exact same scenarios to test a combination of units, then those unit tests are useless.
Unit one - returns a useful error for each type of error condition that can occur (N). Test that, for each type of error condition that can occur. One test for each error condition.
Unit two - calls unit one - test that, if unit one returns an error, it is treated appropriately. One test, covers all error conditions because they're all returned the same way from Unit one.
Unit three - same idea as unit one
If you were to test the behavior of unit one _through_ units 2 and 3, you'd need 2*N tests. If you were to test the behavior of unit one separately, you'd need N+2 tests.
You're missing the point that you don't need to test "the exact same scenarios for the combination of units", because the partitions of <inputs to outputs> is not the same as the partitions for <outputs>. And for each unit, you only need to test how it handles the partitions of <outputs> for the items, it calls; not that of <inputs to outputs>.
> If you were to test the behavior of unit one _through_ units 2 and 3, you'd need 2*N tests.
There are only two possible responses to that:
1. No, there are not 2*N tests because unit 3 does not cover, or need, all of the behavior and cases that flow through those units. Then unit testing unneeded behaviors is unnecessary.
> You're missing the point that you don't need to test "the exact same scenarios for the combination of units", because the partitions of <inputs to outputs>
This makes no sense at all. Yes, you've tested those "inputs/outputs" in isolation. Now, what tests the flow of data? That unit 1 outputs data required by unit 2? That unit 3 outputs data that is correctly propagated by unit 2 back to unit 1?
Once you start testing the actual flow... all your unit tests are immediately entirely unnecessary because you need to test all the same cases, and edge cases to ensure that everything fits together correctly.
So, where I would write a single functional test (and/or, hopefully, an integration test) that shows me how my system actually behaves, you will have multiple tests for each unit, and on top of that you will still need a functional test, at least, for the same scenarios.
> Once you start testing the actual flow... all your unit tests are immediately entirely unnecessary because you need to test all the same cases, and edge cases to ensure that everything fits together correctly.
You don't, but it's clear that I am unable to explain why to you. I apologize for not being better able to express what I mean.
If you don't, then you have no idea if your units fit together properly :)
I've been bitten by this when developing microservices. And as I said in an edit above, it becomes less clear what to test in more monolithic apps and in GUIs, but in general the idea still holds.
Imagine a typical simple microservice. It will have many units working together:
- the controller that accepts an HTTP request
- the service layer that orchestrates data retrieved from various sources
- the wrappers for various external services that let you get data with a single method call
- a db wrapper that also lets you get necessary data with one method call
So you write extensive unit tests for your DB wrapper. You think of and test every single edge case you can think of: invalid calls, incomplete data etc.
Then you write extensive unit tests for your service layer. You think of and test every single edge case you can think of: invalid calls, external services returning invalid data etc.
Then you write extensive unit tests for your controller. Repeat above.
So now you have three layers of extensive tests, and that's just unit tests.
You'll find that most (if not all) of those are unnecessary for one simple reason: you never tested how they actually behave. That is, when the microservice is actually invoked with an actual HTTP request.
And this is where it turns out that:
- those edge cases you so thoroughly tested for the DB layer? Unnecessary because invalid and incomplete data is actually handled at the controller layer, or service layer
- or that errors raised or returned by service wrappers, or the db layer, either don't get propagated up, or are handled by a generic catch-all so that the call returns nonsensical stuff like `HTTP 200: {error: "Server error"}`
- or that those edge cases actually exist, but since you tested them in isolation, and you didn't test the whole flow, the service just fails with a HTTP 500 error on invalid invocation
Or, instead, you can just write a single suite of functional tests that test all of that for the actual controller<->service<->wrappers flow covering the exact same scenarios.
And if the second assert fails, the error message will tell me exactly that, the line, and the value of both function calls if they are printable. What more do you want in order to know "what actually happened"?
> when the test fails it will be harder to know what actually happened
This should not ever be possible in any semi-sane test environment.
One could in theory write a single test function with thousands of asserts for all kinds of conditions and it still should be 100% obvious which one failed when something fails. Not that I'd suggest going to that extreme either, but it illustrates that it'll work fine.
> when the test fails it will be harder to know what actually happened.
Yeah, and if you write one assertion at a time, it will be harder to write the tests. Decreasing #assertions/test decreases the time spent debugging tests while increasing the time spent writing non-production code. It's a tradeoff. Declaring that the optimal number of assertions per test is 1 completely ignores the reality of this tradeoff.
That's true and it boils down to what's acceptable in your team (or just you). I worked in some places where coverage was the only metric and in places where every single function had to have all cases covered, and testing took longer than writing the code.
As for me, I tend to write reasonable tests and cover several cases that guard the intended behavior of each function (if someone decides the function should behave differently in the future, a test should fail). One emerging pattern is that sometimes during testing I realize I need to refactor something, which might have been lost on me if I had skimped on tests. It's both a sanity check and a guardrail for future readers.
The previous commenter was describing a (normal) scenario where a unit test is not precise. No need to follow up with an aggressive "so what you're saying is".
> So because some idiot somewhere wrote a 100 assertion unit test we should ban anyone from writing even 2 assertions in one test?
You're falling prey to slippery slope fallacy, which at best is specious reasoning.
The rationale is easy to understand. Running 100 assertions in a single test renders tests unusable. Running 10 assertions suffers from the same problem. Test sets are user-friendly if they dump a single specific error message for a single specific failed assertion, thus allowing developers to quickly pinpoint root causes by simply glancing through the test logs.
Arguing whether two or three or five assertions should be banned misses the whole point and completely ignores the root cause that led to this guideline.
>Test sets are user-friendly if they dump a single specific error message for a single specific failed assertion, thus allowing developers to quickly pinpoint root causes by simply glancing through the test logs.
As if this actually happens in practice, regardless of multiple or single asserts. Anything that isn't trivial will at most tell you what doesn't work, but it won't tell you why it doesn't work. Maybe allowing an educated guess when multiple tests fail to function.
You want test sets to be user friendly? Start by taking down all this dogmatism and listening to the people as to why they dislike writing tests. We're pushing "guidelines" (really more like rules) while individuals think to themselves "F this, Jake's going to complain about something trivial again, and we know these tests do jack-all because our code is a mess and doing anything beyond this simple algorithm is hell in a handbasket".
These discussions are beyond useless when all people do is talk while doing zero to actually tackle the issues of the majority not willing to write tests. "Laziness" is a cop-out.
> Depending on language, framework and obviously usage, assertions might not be informative at all, providing only the basic functionality of failing the test - and that's it.
Well, avoiding ecosystems where people act dumb is a sure way to improve one's life. For a start, you won't need to do stupid things in reaction of your tools.
Yes, it's not always possible. But the practices you create for surviving it are part of the dumb ecosystem survival kit, not part of any best practices BOK.
> ... to find a 630-line-long test "case" with 22 nondescript assertions along the way.
This is where tech team managers are abdicating their responsibility and job.
It's the job of the organization to set policy standards to outlaw things like this.
It's the job of the developer to cut as many corners of those policies as possible to ship code ASAP.
And it's the job of a tech team manager to set up a detailed but efficient process (code review sign offs!) that paper over the gap between the two in a sane way.
... none of which helps immediately with a legacy codebase that's @$&@'d, though.
> It's the job of the developer to cut as many corners of those policies as possible to ship code ASAP.
I can't tell if this is supposed to be humor, or if you actually believe it. It's certainly not my job as a developer to ship worse code so that I can release it ASAP. Rather, it's my job to push back against ASAP where it conflicts with writing better code.
And furthermore, you are not the developer most non-tech companies want.
Those sorts of companies want to lock the door to the development section, occasionally slide policy from memos under the door, and get software projects delivered on time, without wasting any more thought on how the sausage gets made.
With 22 separate tests you have the possibility of knowing that only a subset of them fail. Knowing which fail and which pass may help you debug.
In Go, in general, failed checks are reported and the test continues rather than stopping early, so you can tell which of those 22 checks failed. Other languages may have the option to do something similar.
xUnit is terrible. It has a horrible culture, the author seems to have a god complex. It is overly complex and opinionated in the worst way possible.
Many times I searched for 'how do I do something with xUnit' and found a github issue with people struggling with the same thing, and the author flat out refusing to incorporate the feature as it was against his principles.
Other times I found that what I needed to do was override some core xUnit class so it would do the thing I wanted it to do - sounds complex, all right, let's see the docs. Oh, there are none; "just read the source" according to the author.
Another thing that bit us in the ass is they refused to support .NET Standard, a common subset of .NET Framework and Core, making migration hell.
NUnit isn't much better. You'd think that they would have good test coverage and therefore high confidence to make changes, especially to fix actual bugs, but I gave up trying to get patches in because the core devs seem so afraid of breaking anything, even when it's an obviously-isolated private, hidden bug fix or performance improvement.
"We can't land this tiny fix because we have a release planned within three months" sort of thing.
Tbh, one of the differences between xUnit and nUnit is the way generated test cases work, like specifying test cases in an XML file.
nUnit has the TestCaseSource attribute for this, while xUnit has the Theory attribute with a data source.
One of the key differences is that when no test cases are generated, nUnit tests just won't run, while xUnit will throw.
Since it's completely legal and sensible for a certain kind of test to have no entries in an XML file, we needed to hack around this quirk. When I (and countless others) have mentioned this on the xUnit GitHub, the author berated us for daring to request this.
So nUnit might be buggy, but xUnit is fundamentally unfixable.
I'm curious what things you're trying to do that requires you to overload xunit classes? We use xunit for everything and haven't found any gaps so far.
If you're using pytest you just parametrize the tests and it tells you the exact failing case. Seems to be a basic feature I would be surprised to know doesn't exist across almost all commonly used frameworks.
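A small sketch of that (the function under test is a stand-in): each case gets its own id in the report, so the failing input is named directly.

    import pytest

    def word_count(text):                 # stand-in for the real code under test
        return len(text.split())

    @pytest.mark.parametrize("text,expected", [
        ("", 0),
        ("one", 1),
        ("hello world", 2),
    ])
    def test_word_count(text, expected):
        assert word_count(text) == expected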
Unless you need the library style in order to drive it, switch to pytest. Seriously.
- assert rewriting is stellar, so much more comfortable than having to find the right assert* method, and tell people they're using the wrong one in reviews
- runner is a lot more practical and flexible: nodeids, marks, -k, --lf, --sw, ...
- extensions further add flexibility e.g. timeouts, maxfail, xdist (though it has a few drawbacks)
- no need for classes (you can have them, mind, but if functions are sufficient, you can use functions)
This seems like a test design issue to me. Best practice is to avoid for-each loops with assertions within tests - using parametrized tests and feeding the looped values as input is almost always a better option. Figuring out which one failed and why is one advantage it gives you in comparison. Another one is that all inputs will always be tested - your example stops on the first one that fails, and does not evaluate the others after that.
not really. one thing this is useful for is extracting out various attributes in an object when you really don't want to compare the entire thing. Or comparing dict attributes, and figuring which one is the incorrect one.
for example,
expected_results = {...}
actual_obj = some_instance.method_call(...)
for key, val in expected_results.items():
    assert getattr(actual_obj, key) == val, f"Mismatch for {key} attribute"
You could shift this off to a parametrized test, but that means you're making N more calls to the method being tested, which can have its own issues with cost of test setup and teardown. With this method, you see which key breaks, and re-run after fixing.
Ok, in this case a parametrized test is not the best approach, I agree. But I would still want to avoid the for-each and "failing fast". One approach would be to gather the required attributes in an array or a struct of some sort, and then do a single assert comparison with an expected value, showing all the differences at once. However, this requires the assertion framework to be able to make such a comparison and return a nicely readable error message, ideally with a diff.
Right, and not many actually do. with python and pytest, you could leverage difflib, but that's an additional thing that adds unnecessary complexity. My approach is simple enough, good enough, and doesn't require additional fudging around with the basics of the language's test libs.
also,
>your example stops on the first one that fails, and does not evaluate the others after that.
I would argue this is desirable behavior. There are soft checks, e.g. https://pypi.org/project/pytest-check/, that basically replace assertions as raised exceptions and do your approach. But I do want my tests to raise errors at the point of failure when a change occurs. If there's a lot of changes occurring, that raises larger questions of "why" and "is the way we're executing this change a good one?"
Hm. I think my main issue there is not the speed, but rather seeing the whole picture at once. You mentioned you use this pattern to test regular expressions; say you modify the regexp in question with some new feature requirement, and now the very first of a dozen test inputs fails. You fix it, but then each one of the following keeps failing, and you can only find an elegant solution that works for all of them after seeing all the failures, having ran the test and modified the code a dozen times. Wouldn't it be nicer to see all fails right away and be able to find a solution to all of them, instead of fixing the inputs one-by-one?
In my experience, from doing some TDD Katas[0] and timing myself, I found coding slower and more difficult when focusing on multiple examples at once.
I usually even comment out all the failing tests but the first one, after translating a bunch of specifications into tests, so I see the "green" when an example starts working.
Maybe it would be easier to grok multiple regex examples than algorithmic ones, but at least for myself, I am skeptical, and I prefer taking them one at a time.
In my own experience, this has often been a good way of going in circles, where I end up undoing and redoing changes as fixing one thing breaks another, until I take a step back to find the proper algorithm by considering multiple inputs.
Of course, ymmv depending on how good your initial intuition is, and how tricky the problem is.
> With pytest you can use the -x flag to stop after the first test failure.
> Even better you can use that in combination with -lf to only run the last failed test.
Fwiw `--sw` is much better for that specific use-case.
`--lf` is more useful to run the entire test suite, then re-run just the failed tests (of the entire suite). IIRC it can have some odd interactions with `-x` or `--maxfail`, because the strange things happen to the cached "selected set".
Though it may also be because I use xdist a fair bit, and the interaction of xdist with early interruptions (x, maxfail, ...) seems less than perfect.
this has downsides if you're comparing attributes with a method result and checking whether said attrs match what you expect. Either you run each test N times for N attr comparisons, accepting the cost of setup/teardown, or do a loop and fire off an assert error with text on which comparison failed.
Since you already have the object right there, why not do the latter approach?
If the setup/teardown is expensive I would do it in reusable fixtures. The reason I wouldn't choose the latter approach is that it would usually be less convenient in the long run. You'd need to replace your asserts with expects to avoid it throwing on the first error (if this isn't what you want), you'll often need to manually add data to the assertion (as GP did) that you would otherwise get for free, and you'll need to look at the assertion error rather than the test case to know what actually failed. This can be quite inconvenient if you e.g. export your test results in a CI/CD pipeline.
Normally in a CI/CD pipeline, you'll see which asserts failed in the log output. GitHub Actions with pytest shows the context of the failed asserts in the log output. TBH, I thought this was standard behavior; do you have experience with a CI pipeline that differs?
All the other points you make as negatives are all positives for me. Biggest thing is, if you're making a change that alters things so drastically, is that really a good approach?
Also, fixtures aren't magic. If you can't scope the fixture to module or session, that means by default it runs in function scope, which would be the same thing as having expensive setup/teardown. And untangling fixtures can be a bigger PITA than untangling unexpected circular imports.
Think about the personality of someone who is so dissatisfied with the lack of verbosity in his test suites, that he needs a side project of writing a book about unit testing. Of course they will advocate testing one assertion per function, and make up nonsense to justify their recommendation.
Secretly, they would have the reader write 32 functions to separately test every bit of a uint32 calculation, only refraining from that advice due to the nagging suspicion that it might be loudly ridiculed.
I think the advantage of having an assertion per test is that it makes sure that all of your assertions are executed. In a lot of test frameworks (that use exceptions for assertions for example) the first assertion fail will stop the test.
That doesn't mean you have to duplicate code; you can deal with it in other ways. In JUnit I like to use @TestFactory [1] where I'll write most of the test in the factory, and then each assertion will be a Test the factory creates, and since they're lambdas they have access to the TestFactory closure.
I vaguely remember a test framework I saw a decade+ ago that had both "assert*" to fail immediately and something else ("expect*" maybe?) to check and continue.
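A framework-agnostic Python sketch of that assert*/expect* split (class and method names are invented): expect() records a failure and keeps going, and the test fails only at the end, listing everything that went wrong.

    class Expectations:
        def __init__(self):
            self.failures = []

        def expect(self, condition, message):
            if not condition:                 # record and continue
                self.failures.append(message)

        def verify(self):                     # fail once, listing everything
            assert not self.failures, \
                "expectations failed:\n  " + "\n  ".join(self.failures)

    # usage inside a test:
    #   e = Expectations()
    #   e.expect(resp.status == 200, f"status was {resp.status}")
    #   e.expect("id" in resp.body, "body missing 'id'")
    #   e.verify()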
I've noticed the opposite in a Java codebase I work in. Tests where the test is assertEquals(toJson(someObject), giantJsonBlobFromADifferentFile). Of course the test runner has no idea about formatting strings that happen to be json, so I end up having to copy these out into an editor, formatting them and eyeballing the difference, or for even larger ones having to save them out to files and diff them. And of course most of the fields in the mock aren't relevant to the class under test, so I'd trade them out for 5-6 targeted asserts for the relevant fields happily.
The problem is, since it's a legacy codebase, there's many fields which are only tested incidentally by this behaviour, by tests that actually aren't intending to test that functionality.
I had a similar case recently, in C++. I ended up spending a few hours writing a simple JSON differ - a bit of code that would parse two strings into DOM object graphs using rapidjson, and then walk down them simultaneously - basically, I implemented an operator== which, instead of terminating early, recorded every mismatch.
Then, I packaged it into a Google Test matcher, and from now on, the problem you describe is gone. I write:
Expected someObject to be structurally equivalent to someBlobFromADifferentFile; it is not;
- #/object/key - missing in expected, found in actual
- #/object/key2 - expected string, actual is integer
- #/object/key3/array1 - array lengths differ; expected: 3, actual: 42
- #/object/key4/array1/0/key3 - expected "foo" [string], actual "bar" [string]
Etc.
It was a rather simple exercise, and the payoff is immense. I think it's really important for programmers to learn to help themselves. If there's something that annoys you repeatedly, you owe it to yourself and others to fix it.
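The commenter's version is C++ on top of rapidjson and a Google Test matcher; a rough Python sketch of the same walk (recording every mismatch with its path instead of stopping at the first) looks something like this:

    def json_diff(expected, actual, path="#", out=None):
        out = [] if out is None else out
        if type(expected) is not type(actual):
            out.append(f"{path} - expected {type(expected).__name__}, "
                       f"actual {type(actual).__name__}")
        elif isinstance(expected, dict):
            for key in sorted(expected.keys() | actual.keys()):
                if key not in actual:
                    out.append(f"{path}/{key} - missing in actual")
                elif key not in expected:
                    out.append(f"{path}/{key} - missing in expected")
                else:
                    json_diff(expected[key], actual[key], f"{path}/{key}", out)
        elif isinstance(expected, list):
            if len(expected) != len(actual):
                out.append(f"{path} - array lengths differ; "
                           f"expected {len(expected)}, actual {len(actual)}")
            for i, (e, a) in enumerate(zip(expected, actual)):
                json_diff(e, a, f"{path}/{i}", out)
        elif expected != actual:
            out.append(f"{path} - expected {expected!r}, actual {actual!r}")
        return out

    # in a test:
    #   mismatches = json_diff(json.loads(expected_blob), json.loads(actual_blob))
    #   assert not mismatches, "\n".join(mismatches)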
> I think it's really important for programmers to learn to help themselves. If there's something that annoys you repeatedly, you owe it to yourself and others to fix it.
It's a cultural problem. _I_ can do that, but my colleagues will just continue to write minimum effort tests against huge json files or database dumps where you have no idea why something failed and why there are a bunch of assertions against undocumented magic numbers in the first place. It's like you're fighting against a hurricane with a leaf blower. A single person can only do so much. I end up looking bad in the daily standup because I take longer to work on my tickets but the code quality doesn't even improve in a measurable way.
From a "legacy code" perspective, you're better off picking an 'easy win' (Hamcrest). Initially, you're not going to convince a team to change their testing habits if it causes them pain. Your goal is to push a testing methodology which moves closer to the 'ideal' which saves them time.
Hamcrest is a drop-in replacement for `assertEquals`, and provides obvious benefits. Politically, it's easy to convince developers onboard once you show them:
* You just need to change the syntax of an assertion - no thought required
* You (Macha) will take responsibility for improving the formatting of the output, and developers have someone to reach out to to improve their assertions.
From this: you'll get a very small subset of missionaries who will understand the direction that you're pushing the test code in, and will support your efforts (by writing their own matchers and evangelising).
The larger subset of the developer population won't particularly care, but will see the improved output from what you're proposing, and will realise that it's a single line of code to change to reap the benefits.
EDIT: I've added a lint rule into a codebase to guide developers away from `assertEquals()`. Obviously this could backfire, and don't burn your political capital on this issue.
That test seems to be testing whether or not the library used to serialize JSON works. I don't think that's valid unless the code base you are working on is Gson or Jackson or the like.
Assuming that's not the case and you're interested in the state of two object graphs, then you just compare those, not the JSON strings they serialize to.
The majority of the testing I've written has been Jest tests and PHPUnit tests, and of the two, PHPUnit is my favourite. It's easy to build up custom assertions, and all of the built-in assertions have the ability to provide an additional failure message during a failure.
Assertions throw an exception and the test runner catches them along with any exceptions thrown by the code in test, marks the test as a failure, and reports the given error message and a full stack trace.
With the appropriate logging of each assertion in there.
Consider the situation of "I've got an object and I want to make sure it comes out in JSON correctly"
The "one assertion" way of doing it is to assertEqual the entire json blob to some predefined string. The test fails, you know it broke, but you don't know where.
The multiple assertions approach would tell you where in there it broke and the test fails.
The point is much more one of "test one thing in a test" but testing one thing can have multiple assertions or components to it.
You don't need to have testFirstNameCorrect() and testLastNameCorrect() and so on. You can do testJSONCorrect() and test one thing that has multiple parts to verify its correctness. This becomes easier when you've got the frameworks that support it such as the assertAll("message", () -> assertSomething, () -> assertSomethingElse(), ...)
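One way to get that shape in Python without a framework feature (the helper name is mine): verify one logical thing per test, but report every mismatched field at once.

    def assert_fields(actual: dict, expected: dict):
        wrong = {k: (v, actual.get(k))
                 for k, v in expected.items() if actual.get(k) != v}
        assert not wrong, f"field mismatches (expected, actual): {wrong}"

    def test_user_json():
        user = {"first_name": "Ada", "last_name": "Lovelace", "active": True}
        assert_fields(user, {"first_name": "Ada",
                             "last_name": "Lovelace",
                             "active": True})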
It's not about it being hard to tell which assertion failed. It's about being hard to tell what the cause of the failure was.
When every test calls ->run_base_tests() before running its own assertion, sometimes things fail before you get to the root cause assertion.
The other problem of stacking assertions is that you'll see the first failure only. There may be more failures that give you a better picture of what's happening.
Having each assertion fail separately gives you a clearer picture of what's going wrong.
Fwiw, the book doesn't suggest what the reader is saying. It says what I've said above more or less.
You don't need to always stick to the rule but it generally does improve things, to the point I now roll my eyes when I come across tests with stacked assertions and lots of test harness code that runs its own assertions. I just know I'm in for a fun time.
In PHPUnit you can send datasets through a test function. If you don't label them you will have a jolly time finding out which one of the sets caused the failure.
I've seen several test frameworks that don't abort a test case after the first failed assertion.
When you get many tests each emitting multiple failures because one basic thing broke, the output gets hard to sort through. It's easier when the failures are all eager.
Which ones? I’ve used at least a dozen at this point, across C++, C#, JavaScript, Rust — and all of them throw (the equivalent of) exceptions on assertion failures.
That seems like it'd easily get confusing when the assertions are dependent, which is often the case - e.g. if the list is empty, testing the properties of the first item makes no sense.
That's why the first check is a hard assertion (returning on error), and the others are soft (continuing on error).
If the list is empty, then the log will contain one error about the length being zero. If the list has one item but it has the wrong properties, the log will contain two errors.
> That's why the first check is a hard assertion (returning on error), and the others are soft (continuing on error).
See that's so completely unclear I utterly missed that there were two different calls there. Doesn't exactly help that the functions are the exact same length, and significantly overlap in naming.
That's just a matter of familiarity, though. And if you make a mistake, you'll discover it the first time the test fails - either you'll see too little output, or you'll see the test throw an exception or crash.
GoogleTest was the one we used. I forgot, but now that you mention it, I remember the expect variations. We had decided against them. It's a confusing feature in my opinion. If that's what people mean by "multiple assertions", then I at least understand where they're coming from.
I haven't heard of the single-assertion thing in at least 10 years, probably 15. In the early 2000s, when I was starting out and doing .NET, it used to be something you'd hear in the community as a very general guideline, more like "there's something to be said about very focused tests, and too many assertions might be a smell." At the time, I got the impression that the practice had come over from Java and converted from a rule to a guideline (hardly the only bad practice that the .NET community adopted from Java, but thankfully they largely did move the needle forward in most cases).
(I wrote Foundations of Programming for any 2000s .NET developer out there!)
To hear this is still a fight people are having...It really makes me appreciate the value of having deep experience in multiple languages/communities/frameworks. Some people are really stuck in the same year of their 10 (or 20, or 30) years of experience.
Like you I was surprised to hear this is a thing or is even controversial. Admittedly I've only been programming for about 10 years, but I haven't heard (or seen) this come up even one time. Every test I've ever seen has usually had multiple mutations and assertions, all of them testing the same premise.
> I wrote Foundations of Programming for any 2000s .NET developer out there!
Holy moly! Think I still have your book somewhere. So thank you for that.
In my last 10+ years of .NET development I haven't heard anything about single-assertion.
> Some people are really stuck in the same year of their 10 (or 20, or 30) years of experience.
I think this has manifested even more with the transition into .net core and now .net 5 and beyond. There are so many things changing all the time (not that I complain), which can make it difficult to pick up what's the current mantra for the language and framework.
What? People really would criticize that code because it has two assertions? How are they ever testing any state changes?
And to the author: Your bubble is significantly different from mine. Pretty much every competent developer I've worked with would laugh at you for the idea that the second test case would not be perfectly fine. (But that first iteration would never pass code review either because it does nothing and thus is a waste of effort.)
I'm convinced if you read Uncle Bob carefully and follow all his suggestions... you'll have completely incapacitated whatever organization you infiltrated.
That is, regardless of the absolute value of bar.value, I expect foo.call() to increment it by 2.
The point of the 1 assertion per test guideline is to end up with tests that are more focused. Given that you did not seem to think of the above technique, I'd say that this guideline might just have helped you discover a way to write better specs ;-)
Guidelines (that is, not rules) are of course allowed to be broken if you have a good reason to do so. But not knowing about common idioms is not a good reason.
You might argue that the above code is just sugar for 2 assertions, but that's beside the point: the test is more focused, there -appears- to be only one assertion, and that's what matters.
OP asked how any state change would be tested with a single 'assertion' and I provided an answer. Absolute rules are stupid, but our codebase has just short of 10k tests, and very few have more than one assertion.
The only reason I can really see to have more than one assertion would be to avoid having to run the setup/teardown multiple times. However, it's usually a desirable goal to write code that requires little setup/teardown to test anyway, because that comes with other benefits. Again, it might not be practical or even possible, but that goes for almost all programming "rules".
one assert per test seems... as you said, indicative of zealotry. if you already have your object there, why not test for the changes you expect?
So you have one test that indicates that a log error is output, then another that tests that property X in the return from the error is what you expect, then another test to determine that property Y in the return is what you expect?
That to me is wasteful, unclear, bloated. About the only useful result I can see is that it allows bragging about how many tests a project has.
Furthermore, if you have a one-assertion rule, some bright spark will realize he can write a single assertion that checks for the conjunction of all the individual postconditions.
That's one way to get dogma-driven assertion roulette, as you will not know which particular error occurred.
The amount of setup and teardown necessary to test something is a property of the system under test. It is not susceptible to one's opinion as to how things should be.
There are usually different ways to design a system. It's often the case that designing the system such that it is easy to test (with little setup/teardown) has other benefits too. E.g. it often indicates low coupling and a more simple design.
That being said, there can of course be other tradeoffs, e.g. performance, and even cases where simple test setups are downright impossible.
Interesting. Our (Rails) codebase is around 25,000 tests and less than half have a single assertion. Personally, there's some calculus in my head when I'm writing a test that determines if/when the scenario I'm testing needs multiple assertions.
rspec or minitest? ;-) Could rspecs 'expect change' idiom be the difference?
I find that reducing assertions per spec where I can is a good guideline. E.g. combining expect(foo['a']).to eq(1) and expect(foo['b']).to eq(2) into expect(foo).to include('a' => 1, 'b' => 2)
yields better error messages.
Please correct me if I'm wrong, but would a precondition not just be the postcondition of the setup?
Invariants would either have to be publicly available and thus easily testable with similar methods, or one would have to use assertions in the implementation.
I try to avoid the latter, as it mixes implementations and 'test/invariants'. Granted, there are situations (usually in code that implements something very 'algorithm'-ish) where inline assertions are so useful that it would be silly to avoid them. (But implementing algos from scratch is rare in commercial code.)
Unit tests should be cheap. Cheap to write, cheap to run, cheap to read, cheap to replace.
Near as I can tell, many people are made uncomfortable by this in practice because these tests feel childish and dare I say demeaning. So they try to do something “sophisticated” instead, which is a slow and lingering death where tests are concerned.
Lacking self-consciousness, you can whack out hundreds of unit tests in a couple of days, and rewrite ten of someone else’s for a feature or a bug fix. That’s fine and good.
But when your test looks like an integration test, rewriting it misses boundary conditions because the test isn't clear about what it’s doing. And then you have silent regressions in code with high coverage. What a mess.
I think you forgot at least one valid assertion and implied another one:
foo.call() might have a return value.
Also, the whole story invocation shouldn't throw an exception, if your language has them. This assertion is often implied (and that's fine), but it's still there.
Finally, the test case is a little bit stupid, because code very seldom has no input that changes the behavior/result. So your assertion would usually involve that input.
If you follow that thought through consistently, you end up with property-based tests very soon. But property-based tests should have as many assertions as possible for a single point of data. Say you test addition. When writing property-based tests you would end up with three specifications: one for a single number, testing the identity element and the relationship to increments; another for two numbers, testing commutativity and inversion via subtraction; and one for three numbers, testing associativity. In every case it would be very weird not to have all the n-ary assertions for the addition operation in the same spot.
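As a sketch of those three specifications in Python using the hypothesis library (the test names are invented):

from hypothesis import given, strategies as st

@given(st.integers())
def test_one_number(a):
    assert a + 0 == a              # identity element
    assert (a + 1) - 1 == a        # relationship to increments

@given(st.integers(), st.integers())
def test_two_numbers(a, b):
    assert a + b == b + a          # commutativity
    assert (a + b) - b == a        # inversion via subtraction

@given(st.integers(), st.integers(), st.integers())
def test_three_numbers(a, b, c):
    assert (a + b) + c == a + (b + c)  # associativity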
When you say I 'forgot' an assertion, are you implying that test should include all possible assertions on the code? That would perhaps cover more surface, but my goal (read zealot ideology) here is to have the tests help document the code:
test "pressing d key makes mario move 2 pixels right" {
I could test the value of the d() function, but I don't, because I don't care what it returns.
Didn't understand the "whole story invocation" and exception part; am I missing some context?
Sure, property-based testing can be invaluable in many situations. The only downside is if the tests become so complex to reason about that bugs become more likely in the tests than in the implementation.
I've sometimes made tests with a manual list of inputs and a list of expected outputs for each. I'd still call those 1-assertion tests (just run multiple times), so my definition of 1 assertion might be too broad.
When you get the suites nested and configured right, and the code decomposed properly to support it, each of these assertions is two lines of code, plus the description of each constraint. So you just write four or five tests covering each one, in descending likelihood of breakage.
An assert message says what went wrong, and on which code line. How on earth does it help to make just one? The arrange part might take seconds for a nontrivial test and that would need to be duplicated both in code and execution time to make two asserts.
If you painstakingly craft a scenario where you create a rectangle of a specific expected size, why wouldn’t it be acceptable to assert both the width and height of the rectangle after you have created it?
assert_equal(20, w, …
assert_equal(10, h, …
A dogmatic rule would just lead to an objectively worse test where you assert an expression containing both width and height in a single assert?
assert_true(w == 20 && h == 10,…)
So I can only assume the rule also prohibits any compound/Boolean expressions in the asserts then? Otherwise you can just combine any number of asserts into one (including mutating state within the expression itself to emulate multiple asserts with mutation between)!
I’ve seen people take a dogmatic approach to this in Ruby without really applying any critical thought, because one assertion per test means your test is ‘clean’.
The part that is glossed over is that the test suite takes several hours to run on your machine, so you delegate it to a CI pipeline and then fork out for parallel execution (pun intended) and complex layers of caching so your suite takes 15 minutes rather than 2 and a half hours. It’s monumentally wasteful and the tests aren’t any easier to follow because of it.
The suite doesn’t have to be that slow, but it’s inevitable when every single assertion requires application state to be rebuilt from scratch, even when no state is expected to change between assertions, especially when you’re just doing assertions like ‘assert http status is 201’ and ‘assert response body is someJson’.
Yes, you got us rubyists there. :-( It's the unfortunate result of trying to avoid premature optimization and strive for clarity instead. Something that's usually sound advice.
Engineering decisions have tradeoffs. When the test suite becomes too slow, it might be time to reconsider those tradeoffs.
Usually though, I find that the road to fast tests is to reduce/remove slow things (almost always some form of IO), not to combine 10 small tests into one big one.
I think it’s a sound strategy more often than not, it’s just that RSpec’s DSL can make those trade-offs unclear, especially if you use Rubocop and follow its default RSpec rules.
It just so happens that your tests become IO bound because those small tests in aggregate hammer your DB and the network purely to set up state. So if you only do it once by being more deliberate with your tests, you’re in a better place.
I'd argue that it's the unfortunate result of Ruby being at the center of the Agile and XP scene back when it first became prominent (the manifesto etc) - because that scene is also where the more cultish varieties of TDD originated.
> I’ve seen people take a dogmatic approach to this in Ruby without really applying any critical thought, because one assertion per test means your test is ‘clean’.
I can't speak for Ruby, but what I would call 'clean' and happily dogmatise is that assertions should come at the end, after setup and exercise.
I don't care how many there are, but they come last. I really hate tests that look like:
I think even with integration tests they should still be treated similarly - at the end of the day you are setting expectations on an output given a certain input, there’s just a lot more going on in between.
There’s no avoiding it though when you want something end-to-end, or a synthetic test. You’re piling up a whole succession of stateful actions and if you tested them in isolation you would fail to capture bugs that depend on state. In that sense, better to run a ‘signup, authenticate and onboard’ flow in one test instead of breaking it down.
In fact, this last version is worse, because if do_other() can fail if state wasn't blah, then what you'll get is the exception from that failure interrupting the test before the assert would have been reported.
Exactly because that's 'without any meaningful difference' is why I don't like that either. I'm not obsessing over purely the 'assert' keyword as you perhaps think I am, it's the structure I don't like.
> So I can only assume the rule also prohibits any compound/Boolean expressions in the asserts then? Otherwise you can just combine any number of asserts into one
That's what's bound to happen under that rule. People just start writing their complex tests in helper functions, and then write a single assertion that calls the helper.
The way it's been explained to me is that because one assert failing blocks the other assertions from running you don't get a "full" picture of what went wrong.
So instead of:
- error W doesn't equal 20
Fix that
Run test again
- error H doesn't equal 10
Fix that
Run test again
It's
- Error Width doesn't equal 20
- Error Height doesn't equal 10
Fix both
Run test
I think the time savings are negligible though. And it makes testing even more tedious, as if people needed any additional reasons to avoid writing tests.
Only a few programming languages have a facility to render that second assertion in a human-readable way (Python surprised me with this). Most C-influenced languages will just present you with “assertion failed” or “expected true to be false”, which means nothing. Test failure messages should be actionable, and that action is not “read the test to see what went wrong”.
That can be considered one logical assertion though. You're asserting the size of the rectangle. You can even extract an assert helper function AssertRectangleSize(20,10)
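A minimal Python sketch of such a helper; the signature is adapted to take the rectangle plus expected values, and the field names are assumed:

def assert_rectangle_size(rect, expected_width, expected_height):
    # one "logical" assertion about the rectangle's size, still reported field by field
    assert rect.width == expected_width, f"width: expected {expected_width}, got {rect.width}"
    assert rect.height == expected_height, f"height: expected {expected_height}, got {rect.height}"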
Exactly. But if I assert N properties of an object, is that then logically one assert for any N? At what point does it stop?
Applying any rule dogmatically is often bad, and this is no exception. The problem is that we don’t like lacking rules. It especially goes to hell when people start adding code analysis to enforce it, and then developers start writing poor code that passes the analysis.
One assert imo isn’t even a good starting point that might need occasional exceptions.
It shouldn't be dogmatic, but I think it should be something to think about when writing the test or reviewing one. A test should be single responsibility too, in order for it to not be brittle.
I disagree; it can be a good starting point for most cases. You should be able to condense your test into 3 steps (arrange, act, assert), each one a single line. Even if you don't end up doing it (because the setup is too complicated, it's not worth it for that single test, you want to assert more things, etc.), I think the mental exercise of asking "can this be made into a 3-line test?" is invaluable in writing maintainable tests.
> is that then one assert logically for any N? At what point does it stop?
This is one of the hard things about good tests, they are a little bit of art too. Maybe you can apply the single responsibility like I said before: the test should change for one reason only. By one reason I mean one "person/role": it should change if the CFO of our clients wants something different, or if Mark from IT wants some change.
I am not stressing or enforcing single asserts too much, I feel like tests allow a little bit of leeway in many ways, as long as the decision enhances expressiveness. If the extra lines are not making the test clearer, if a single assert would be clearer for the story that the test is telling, then it should go into a single assert. If I can break the story into multiple stories that still make sense, then I do that, such that each story has its own strong storyline.
I think two important requirements for good unit tests are that
1) If you look at a particular test, you can easily tell exactly what it's testing and what the expected behavior is, and
2) You are able to easily look at the tests in aggregate and determine whether they're covering all the behavior that you want to test.
To that end, I think a better guideline than "only have one assertion per test" would be "only test one behavior per test". So if you're writing a test for appending an element to a vector, it's probably fine to assert that the size increased by one AND assert that the last element in the vector is now the element that you inserted.
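Something like this, as a small Python sketch of that guideline (using a list as the vector):

def test_append_adds_element_to_end():
    # one behavior under test (append), verified with two related assertions
    v = [1, 2]
    v.append(3)
    assert len(v) == 3   # the size increased by one
    assert v[-1] == 3    # the appended element is now last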
The thing I see people do that's more problematic is to basically pile up assertions in a single test, so that the inputs and outputs for the behavior become unclear, and you have to keep track of the intermediate state of the object being tested in your head (assuming they're testing an object). For instance, they might use the same vector, which starts out empty, test that it's empty; then add an element, then test that its size is one; then remove the element, test that the size is 0 again; then resize it, etc. I think that's the kind of testing that the "one assertion per test" rule was designed to target.
With a vector it's easy enough to track what's going on, but it's much harder to see what the discrete behaviors being tested are. With a more complex object, tracking the internal state as the tests go along can be way more difficult. It's a lot better IMO to have a bunch of different tests with clear names for what they're testing that properly set up the state in a way that's explicit. It's then easier to satisfy the above two requirements I listed.
I want to be able to look at a test and know exactly what it's testing. I don't mind if a little bit of code is repeated - you can make helper functions if you need them for test setup and teardown.
Is it weird that not only have I never heard of the "rule" this post argues against, but I can't even conceive of a code structure where it would make sense?
How would a test suite with one assertion per test work? Do you have all the test logic in a shared fixture and then dozens of single-assertion tests? And does that rule completely rule out the common testing pattern of a "golden checkpoint"?
I tried googling for that rule and just came up with page after page of people arguing against it. Who is for it?
Looking at the Amazon listing and a third-party summary[0] it seems to be the sort of wool-brained code astrology that was popular twenty years ago when people were trying to push "extreme programming" and TDD.
Perhaps the author is better off not having heard of it then, and by implication, not having read "Clean Code" in the first place. The book is full of anti-patterns.
There's plenty of sensible advice in there; it's just that he argues for the sensible stuff and the idiotic stuff with equal levels of conviction, and if you are junior you aren't going to be able to distinguish them.
It would be easier if it were all terrible advice.
Where in that book is the rule stated? I ask because I have heard the author explicitly state that multiple assertions are fine (using essentially the same explanation as
TrianguloY did in this comment: https://news.ycombinator.com/item?id=33480120).
Chapter 9 talks about unit tests and there is a paragraph called 'Single Assert per Test', where he says it is a good concept but that he is not afraid to put more asserts in his tests.
That paragraph is followed by 'Single Concept per Test' where he starts with: 'Perhaps a better rule is that we want to test a single concept per test.'
That maps better with the lectures I've seen of him on YouTube, and I concur with it.
When I first wrote tests years ago, I would try to test everything in one test function. I think juniors have a tendency to do that in functions overall - it's par for the course to see 30-100+ line functions that might be doing just a little too much on their own, and test functions are no different.
It feels as if folks are splitting hairs where a haircut (or at least a half-hearted attempt at grooming) is required. Use good judgment, and do not blindly follow "rules" without understanding their effects.
As MikeDelta reports¹, the book doesn’t actually say that.
I’ve come to learn to completely disregard any non-specific criticism of that book (and its author). There is apparently a large group of people who hate everything he does and also, seemingly, him personally. Everywhere he (or any of his books) is mentioned, the haters come out, with their vague “it’s all bad” and the old standard “I don’t know where to begin”. Serious criticism can be found (if you look for it), and the author himself welcomes it, but the enormous hate parade is scary to see.
It's not unusual to spin up a local dev server in the same environment (or on the same machine, ie at localhost) as the tests. There's an argument to say these aren't "unit" tests but your definition of "unit" may vary.
Where I work, milestone and release builds cannot make HTTP calls other than to our maven repo. It's meant to further reproducible builds, but it also means your tests can't make such calls. I fire up an in-memory database to make my tests self-contained.
Heh, yeah, but it can be used to write tests that check your assumptions about a 3rd-party API. Granted, it'll only fail once the tests are rerun without the cache, but it can still be a valuable technique. It can be valuable to have a test suite that a) helps check assumptions when implementing the connection and b) helps locate what part of it later starts to behave unexpectedly.
First off, I do put more than 1 assertion in a test. But it definitely leads to situations where you have to investigate why a test failed, instead of it just being obvious. Like the article, I test 1 thing per test, but sometimes that means multiple assertions about the outcome of a test.
IMO there's no point in checking that you got a response in 1 test, and then checking the content/result of that response in another test. The useful portion of that test is the response bit.
IMO, the opposite also has to be considered. I've briefly worked with some code bases that absolutely did 1 assert per test. Essentially you'd have a helper method like "doCreateFooWithoutBarAttribute", and 3-4 tests around that - "check that response code is 400", "check that error message exists", and so on. Changes easily caused 4-5 tests to fail all at once, for example because the POST now returned a 404, but the 404 response also doesn't contain the error message and so on.
This also wasted time, because you always had to look at the tests, and eventually realized that they all failed from the same root cause. And sure, you can use test dependencies if your framework has that and do all manner of things... or you just put the asserts in the same test with a good message.
Even with multiple assertions, the test failure reason should be quite clear, as most testing frameworks allow you to specify a message which is then output in the testing summary.
E.g. `assertEqual(actual_return_code, 200, "bad status code")` should lead to output like `FAILED: test_when_delete_user_then_ok (bad status code, expected 200 got 404)`
Note it mentions the actual expression put in the assert. Which makes it almost always uniquely identifiable within the test.
That's the bare minimum I'd expect of a testing framework - if it can't do that, then what's the point of having it? It's probably better to just write your own executable and throw exceptions in conditionals.
What I expect from a testing framework is at least this: that it also identifies the file and the line containing the failing assertion.
If your testing framework doesn't do that, then again, what's even the point of using it? Throwing an exception or calling language's built-in assert() on a conditional will likely provide at least the file+line.
Maybe it’s different in other languages but in JS and .NET the failed assertion fails and you investigate the failed assertion. You wouldn’t ever have a situation that isn’t obvious.
If an assertion says “expected count to be 5 but got 4” you wouldn’t be looking at the not null check assertion getting confused why it’s not null…
> IMO there's no point in checking that you got a response in 1 test, and then checking the content/result of that response in another test. The useful portion of that test is the response bit.
If I understood this part correctly, you are making the dangerous assumption that your tests will run in a particular order.
No, I definitely am not making that assumption. With a bad response, but a good response code, 1 test would fail and the other would succeed, no matter the order. I just don't think that that valid response code is a useful test on its own. It's much better with both assertions in the same test, unless you have some reason to think that response code failure would signify something special on its own.
I wonder how much of this is the journeyman problem (aka, the expert beginner)
I believe writing test code is its own skill. Hence, like a coder learning SRP and dogmatically applying it, so does a person that is forced to write unit tests without deep understanding. (And of course, bad abstractions are worse than code duplication)
I think it's very possible to have a developer with 10 years of experience but effectively only 2 years of experience building automated test suites. (Particularly if they came from a time before automated testing, or if the testing and automated tests were someone else's job.)
I view it more as "only test one operation per unit test". If that needs multiple asserts (status code, response content, response mime type, etc.) to verify, that is fine.
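For example, a sketch along those lines in Python; `client` stands in for whatever HTTP test client your framework provides, and the route and payload are made up:

def test_create_reservation_returns_created_json():
    # one operation (the POST), several assertions about its single outcome
    response = client.post("/reservations", json={"name": "Alice", "seats": 2})
    assert response.status_code == 201
    assert response.headers["Content-Type"].startswith("application/json")
    assert response.json()["name"] == "Alice"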
IIUC, the guideline is so that when a test fails you know what the issue is. Therefore, if you are testing more than one condition (missing parameter, invalid value, negative number, etc.) it is harder to tell which of those conditions is failing, whereas if you have each condition as a separate test it is clear which is causing the failure.
Separate tests also mean that the other tests still run, so you don't have any hidden failures. You will get hidden failures if using multiple assertions for the condition, so you will need to re-run the tests multiple times to pick up and fix all the failures. If you are happy with that (e.g. your build-test cycle is fast) then having multiple assertions is fine.
Ultimately, structure your tests in a way that best conveys what is being tested and the conditions it is being tested under (e.g. partition class, failure condition, or logic/input variant).
I've come to the conclusion that none of this matters for most parts of a system. I worked in the most horrendous code and systems you can imagine but it turned into a multi-billion-dollar company. Then everyone starts talking about code quality and rewrites etc and new features stall as beautiful systems are written and high test coverage is met, and competing companies surpass us and take market share with new and better features. We've gotten religious over code and tests in the software industry and should probably shift back some.
I've worked on shitty code with shitty tests that ran the core of the business. Even while doing that, it was horrible to work with, held important features back and drove talented people away, leaving everything to stagnate in a "this works enough" state. When the wind changed, it was hard to turn the ship around, important people got nervous, and things got into a bad spiral.
None of this is the failure of code and tests alone; but both can be indicative of the structural health and resilience of the wider situation.
I've always operated on the principle that each test should be testing one logical concept. That usually translates into one concrete assertion in the test code itself, but if you need to assert multiple concrete conditions to test the logical concept it's not a bad thing. At the end of the day tests are just there to make you more confident in code changes and they are a tax you have to pay for that comfort. However you arrive at "I feel comfortable making changes", go with it.
Out of everything coming out of testing and the pain of testing old code, this seems like such a trivial thing to discuss. Then again, automated testing seems like a breeding ground for inane discussions wasting more time than picking a less-than-ideal solution and moving on.
I always have multiple assertions in my unit tests. I test around areas of functionality, as opposed to individual functions. It's a bit arbitrary, but there you have it...
I also use Apple's XCTest, which does a lot more than simple assertions.
If an assertion is thrown, I seldom take the assertion's word for it. I debug trace, and figure out what happened. The assertion is just a flag, to tell me where to look.
This is correct - multiple asserts are OK, but there are still good guidelines:
A good unit test has the phases Arrange, Act, Assert, end of unit test.
You can use multiple assert statements in the "assert" phase, to check the specifics of the single logical outcome of the test.
In fact, once I see the same group of asserts used 3 or more times, I usually extract a helper method, e.g. "AssertCacheIsPopulated" or "AssertHttpResponseIsSuccessContainingOrder". These might have method bodies that contain multiple assert statements, but whether this counts as a "single assert" or not is a matter of perspective and not all that important.
The thing to look out for is - does the test both assert that e.g. the response is an order, and that the cache is populated? Those should likely be separate tests as they are logically distinct outcomes.
The test ends after the asserts - You do not follow up the asserts with a second action. That should be a different test.
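A rough Python version of that shape; store_test_order and client are hypothetical helpers, and the assert helper mirrors the idea above rather than any particular library:

def assert_success_containing_order(response, expected_order_id):
    # several assert statements, one logical outcome
    assert response.status_code == 200
    assert response.json()["order"]["id"] == expected_order_id

def test_get_order_returns_the_stored_order():
    # Arrange
    order_id = store_test_order()
    # Act
    response = client.get(f"/orders/{order_id}")
    # Assert
    assert_success_containing_order(response, order_id)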
Multiple asserts are fine, multiple aspects in a single test are not. For example a transaction test that checks if the account balance is ok and if a user gets a notification. It boils down to a single responsibility, so when there are multiple aspects it means that the class/interface should be refactored.
* The primary goal of test automation is to prevent regression. A secondary goal can be performance tuning your product.
* Tests are tech debt, so don’t waste time with any kind of testing that doesn’t immediately save you time in the near term.
* Don’t waste your energy testing code unless you have an extremely good reason. Test the product and let the product prove the quality of your code. Code is better tested with various forms of static analysis.
* The speed with which an entire test campaign executes determines, more than all other factors combined, when and who executes the tests. If the test campaign takes hours nobody will touch it. Too painful. If it takes 10-30 minutes only your QA will touch it. When it takes less than 30 seconds to execute against a major percentage of all your business cases everybody will execute it several times a day.
Tests are not tech debt. You could have bad, brittle tests that you could consider debt but just having tests isn’t debt. Debt implies there is something you could do about it in the future to pay it down, which isn’t the case for a good test suite.
It’s debt. When you can’t add new features quickly because you have nightmarish tests to fix and you spend more time on the tests than the product, I’d say it’s debt. Especially with the insane mocking setups.
Tests can carry tech debt, just like any code. They certainly are not defined by it.
Tests are one of the ways you have to ensure your code is correct. Consequently, they are business-oriented code that exist to support your program usage, and subject to its requirements. How much assurance you need is completely defined by those requirements. (But how you achieve that assurance isn't, and tests are only one of the possible tools for that.)
They are still debt since they don't directly contribute to product value: you can delete all your tests and your software will keep functioning.
It doesn't mean it's a debt worth taking, though IME most companies are either taking way too much or way too little. Not treating tests as debt typically leads to over-testing, and it is way worse than under-testing.
Also, what you're talking about (business-oriented requirements) is more akin to higher level tests (integration/e2e), not unit tests.
You can also delete the source code after compiling it and your software will keep functioning. Does that mean the code doesn't directly contribute to product value?
Yes, those are the bad tests I was referring to. NOT having tests greatly increases the debt burden of your production code because you cannot refactor with any confidence and so you simply won’t.
This is the No True Scotsman issue with testing. When it fails, you just disregard the failure as "bad tests". But any company that has anything that resembles a testing culture will have a good amount of those "bad tests". And this amount is way higher than people are willing to admit.
> you cannot refactor with any confidence
Anecdotally, I've had way more cases where I wouldn't refactor because too many "bad tests" were breaking, not because I lacked confidence due to lack of tests.
There are many things beyond tests that allow you refactor with confidence: simple interfaces, clear dependency hierarchy, modular design, etc. They are way more important than tests.
Tests are often a last resort when all of the above is a disaster. When you're at a place where you need tests to keep your software stable you are probably already fucked, you're just not willing to recognize it.
You shouldn't have zero tests, but tests should be treated as debt. The fewer tests you need to keep your software stable, the better your architecture is. Huge number of tests in a codebase is typically a signal of shitty architecture that crumbles without those crutches.
NOT having tests is debt. When you can’t fearlessly add features quickly because you introduce regressions in another end of the product that you didn’t think of, and later have to spend all your time on firefighting because it got deployed to prod.
If your mocks and tests are in the way when introducing features or refactoring, they are likely not on the right level. Too much unit testing of moving internals, rather than public apis, usually being one of the culprits.
The one assertion per test doesn't mean you need to use only one assertion call but rather that you only need to do one assertion block. Checking everything after a response is considered 1 assertion, no matter how many assert calls you need.
The issue is when you use multiple assertions for multiple logic statements: do > assert > do > assert...
In that example imagine that you were also checking that the reservation was successful. That would be considered bad, you should create a different test that checks for that (testCreate + testDelete) and just have the precondition that the delete test has a valid thing to delete (usually added to the database on the setup).
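A pytest-style sketch of that split; seed_reservation and client are made-up helpers:

class TestReservations:
    def setup_method(self):
        # the delete test's precondition lives in setup, not in the test body
        self.existing_id = seed_reservation()

    def test_create(self):
        response = client.post("/reservations", json={"name": "Alice", "seats": 2})
        assert response.status_code == 201   # one assertion block about the create

    def test_delete(self):
        response = client.delete(f"/reservations/{self.existing_id}")
        assert response.status_code == 204   # one assertion block about the delete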
Sometimes, it takes a lot less code to test a particular piece of code by doing it in a long test with multiple assertions.
Migrate up -> assert ok -> rollback 1 -> assert ok -> rollback 2 -> assert ok
I don’t see much benefit to breaking it up, and you’re testing state changes between each transition, so the entire test is useful and simpler, shorter, and clearer than the alternative.
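In Python that flow might look something like this sketch; Migrator, fresh_database, has_table and the table names are assumptions, not any particular library:

def test_migrations_apply_and_roll_back_cleanly():
    db = fresh_database()
    migrator = Migrator(db)

    migrator.up()
    assert db.has_table("users")
    assert db.has_table("reservations")

    migrator.rollback(1)
    assert not db.has_table("reservations")   # state checked between each transition

    migrator.rollback(1)
    assert not db.has_table("users")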
Sometimes it's also more meaningful to the reader this way.
Imagine an object which sees numbers, tries to pair up identical numbers, and reports the set of unpaired numbers. A good test would be:
Create object
Assert it has no unpaired numbers
Show it 1, 2, and 3
Assert it has 1, 2, and 3 as the unpaired numbers
Show it 1 and 3
Assert it has 2 as the unpaired number
This test directly illustrates how the state changes over time. You could split it into three tests, but then someone reading the tests would have to read all three and infer what is going on. I consider that strictly worse.
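A sketch of that story as a single Python test; UnpairedTracker, see and unpaired are invented names for the object described above:

def test_unpaired_numbers_follow_the_story():
    tracker = UnpairedTracker()
    assert tracker.unpaired() == set()

    for n in (1, 2, 3):
        tracker.see(n)
    assert tracker.unpaired() == {1, 2, 3}

    tracker.see(1)
    tracker.see(3)
    assert tracker.unpaired() == {2}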
That test is asserting a flow, not features. How can you be sure that the second rollback fails because that rollback code is wrong and not because the previous two functions made some unexpected changes? Or rather, how can you be sure that the second rollback is ok if you don't know the state it was run from? Maybe the migration set an unexpected flag that made the rollbacks pass that test without working properly.
This is also the reason why tests should be run in arbitrary order, to avoid unexpected interactions due to order.
Flow tests can be useful in some situations, but they should never replace individual feature tests.
This can be a path where things do go bad. Let's say this test pattern is a success and then is replicated for many tests. Now, the schema or migration changes. A small change there now breaks the entire test suite. At this point the number of failing tests only indicates how many hours you will spend fixing assertions.
Another failure mode is when test scaffolding builds up. Imagine that migrate-up part becoming multiple schemas, or services. When it then fails, finding exactly where to fix the test scaffolding becomes a multi-hit exercise.
I'm not saying the example is bad, but it can put you on a path where, if you constantly build on top of it, it can go bad (e.g., developers that don't care about tests or test code quality, or just want to go home, so they just add a few assertions, add some scaffolding, copy-paste it all and mutate some assertions for a different table, and rinse-wash-repeat across 4 people, 40 hours a week, for 3 years...)
Sorry the example was vague. It was a test for the migration library itself— something I wrote recently when playing around with building a 0-dependency web framework for Bun.
I wouldn’t actually write tests for migrations themselves.
I think the issue is that you’ll always have one of those teammates who see this as an excuse to test the entire happy flow and all its effects in a single test case. I think what you want is reasonable, but how do you agree when it is no longer reasonable?
If you logic depends on that happy path, make a test for it. But as I explained in another comment that test should not justify the lack of individual feature tests, which should not only test the happy path but other corner cases too.
At my company, we developers usually create white-box unit/feature tests (we know how it was implemented, so we check components knowing that). But then we have an independent QA team that creates and runs black-box flow tests (they don't know how it was implemented, only what it should do and how it should interact).
Sounds like a fine approach, and I wasn’t criticizing. Mostly I was pondering out loud why people come up with blanket statements what good tests should look like.
The Go idiom is to use table-driven tests[1]. It's still an evolving practice, so you'll see different variants, but the essence is that you have a slice of your inputs and expected outputs, iterate through the slice, and run the assert(s) on each element.
var flagtests = []struct {
    in  string
    out string
}{
    {"%a", "[%a]"},
    {"%-a", "[%-a]"},
    {"%+a", "[%+a]"},
    // additional cases elided
    {"%-1.2abc", "[%-1.2a]bc"},
}

func TestFlagParser(t *testing.T) {
    var flagprinter flagPrinter
    for _, tt := range flagtests {
        t.Run(tt.in, func(t *testing.T) {
            s := Sprintf(tt.in, &flagprinter)
            if s != tt.out {
                t.Errorf("got %q, want %q", s, tt.out)
            }
        })
    }
}
Sometimes there will be an additional field in the test cases to give each one a name or description, in which case the failure message will include that name as well.
Another evolving practice is to use a map instead of a slice, with the map key being the name or description of the test case. This is nice because in Go, order is not specified in iterating over a map, so each time the test runs the cases will run in a different order, which can reveal any order-dependency in the tests.
This always rubbed me the wrong way. I think a better approach is to ensure the assertions are readable, as simple as can be. The worst case of this being broken was when the assertions were done in a helper method used in a base class of the test; navigating to it required multiple hops, and it took time to build up the context of the test as well.
The only downside of multiple assertions is that when the first one fails we won’t know if the subsequent ones are passing.
Totally agreed, I even think duplication is totally fine in tests if it brings more readability. You should be able to read the test to verify its correctness (after all you don't test the test code), and multiple helper functions and levels of abstraction hinder this goal.
The trouble is that the type of developer that blindly follows this sort of shaming/advice is exactly the type that blindly followed someone's rule of thumb to use one assertion per test.
There are no hard and fast rules.
I agree with keeping the number of assertions low, but it isn't the number that matters. Keeping the number of assertions low helps prevent the 'testItWorks()' syndrome.
Oh, it broke. I guess 'it does not work'; time to read a 2000-line test written 5 years ago.
I think part of the motivation for one assertion per test comes from the fact that you stop getting information from a test as soon as one assertion fails.
I think it was meant as guidance, like the SRP, where you should be testing one thing in each test case. I also think a growing number of assertions might be a sign your unit under test is carrying too many responsibilities.
Maybe it’s better to say “few assertions, all related to testing one thing”
Testing is hard when the code it tests is OOP, has mutations everywhere, and has big functions that do too many things.
It's practically impossible to thoroughly test such code with one assertion per test; it would mean having dozens of tests just for one object method. Correspondingly, the fixtures/factories/setup for tests would balloon in number and complexity as well to be able to setup the exact circumstance being tested.
But the example in TFA is, imo, bad because it is testing two entirely different (and unrelated) layers at once. It is testing that a business logic delete of a thing works correctly, and that a communication level response is correct. Those could be two separate tests, separating the concerns, and resulting in simpler code to reason about and less maintenance effort in the future.
We want to know if the DeleteAsync(address) behaves correctly. Actually we want to know if DeleteReservation() works, irrespective of the async requirement. Testing whether AnythingAsync() works is something that is already done at the library or framework level, and we probably don't need to prove that it works again.
Write a test for DeleteReservation() which tests if a valid reservation gets deleted. Write another related test to ensure that a non-existent or invalid reservation does not get deleted, but rather returns some appropriate error value. That's two, probably quite simple tests.
Now somewhere higher up, write the REST API tests. ApiDelete()... a few tests to establish that if an API delete is called, and the business logic function it calls internally returns a successful result, then does the ApiDelete() return an appropriate response? Likewise if the business logic fails, does the delete respond to the API caller correctly?
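A hedged sketch of that split in Python; FakeReservationRepo, delete_reservation, make_api and the routes are all made-up names used only to show the layering:

def test_delete_reservation_deletes_existing_reservation():
    repo = FakeReservationRepo(existing={"r-1"})
    assert delete_reservation(repo, "r-1") is True
    assert "r-1" not in repo.existing

def test_delete_reservation_rejects_unknown_id():
    repo = FakeReservationRepo(existing=set())
    assert delete_reservation(repo, "missing") is False

def test_api_delete_returns_204_when_business_logic_succeeds():
    api = make_api(delete_reservation=lambda repo, rid: True)   # stubbed business layer
    assert api.delete("/reservations/r-1").status_code == 204

def test_api_delete_returns_404_when_business_logic_fails():
    api = make_api(delete_reservation=lambda repo, rid: False)
    assert api.delete("/reservations/missing").status_code == 404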
In my experience, when code isn’t OOP, that means all static functions with static (i.e. global) data, which isn’t just hard to test, it’s actually impossible, because you can’t mock out the static data.
I didn't downvote you, but I have a hard time either understanding your meaning or imagining the scenario you describe. Can you give an example?
OOP functions are usually harder to test because they expect complete objects as arguments, and that tends to require a lot more mocking or fixtures/factories to setup for the test.
FP functions typically operate on less complex and more open data structures. You just construct the minimum thing necessary to satisfy the function, and the test is comparatively simple. None of this has anything to do with global data. Using global data from within any functions is generally a bad idea and has nothing to do with FP or OOP.
Pure functional, yeah, absolutely - that’s not what I usually see though. I see procedural/iterative static functions that connect to static data that connect to live databases and immediately start caching its contents locally.
At work, I put in a change which allows multiple assertions in a C testing framework. There is no exception handling or anything.
A macro like
EXPECT_ASSERT(whatever(NULL));
will succeed if whatever(NULL) asserts (e.g. that its argument isn't null). If whatever neglects to assert, then EXPECT_ASSERT will itself assert.
Under the hood it works with setjmp and longjmp. The assert handler is temporarily overridden to a function which performs a longjmp which changes some hidden local state to record that the assertion went off.
This will not work with APIs that leave things in a bad state, because there is no unwinding. However, the bulk of the assertions being tested are ones that validate inputs before changing any state.
It's quite convenient to cover half a dozen of these in one function, as a block of six one-liners.
Previously, assertions had to be written as individual tests, because the assert handler was overridden to go to a function which exits the process successfully. The old-style tests are then written to set up this handler, and also indicate failure if the bottom of the function is reached.
> You may be trying to simulate a ‘session’ where a client performs many steps in order to achieve a goal. As Gerard Meszaros writes regarding the test smell, this is appropriate for manual tests, but rarely for automated tests.
Integration tests are typically easier to write / maintain and thus are more valuable than small unit tests. Don’t know why the entire premise argues against that.
This is such a bad example because the level of testing is somewhat between unit and acceptance/functional testing. I can't tell at a glance if "api" is something that will hit a database or not.
The first part where he says you can check in the passing test of code that does nothing makes me twitch, but mainly because I see functional testing as the goal for a bit of functionality and unit tests as a way of verifying the parts of achieving that goal. Unit tests should verify the code without requiring integration and functional tests should confirm that the units integrate properly. I wouldn't recommend checking in a test that claims to verify that a deleted item no longer exists when it doesn't actually verify that.
Deciding on the granularity of actual unit tests is probably something that is best decided through trial and error. I think when you break down the "rules", like one assertion per test, you need to understand the goals. In unit testing an API I might have lots of tiny tests that confirm things like input validation, status codes, debugging information, permissions, etc. I don't want a test that's supposed to check the input validation code to fall because it's also checking the logged-in state of the user and their permission to reach that point in the code.
In unit tests, maybe you want to test the validation of the item key. You can have a "testItemKey" test that checks that the validation confirms that the key is not null, not an empty string, not longer than expected, valid base64, etc. Or you could break those into individual tests in a test case. It's all about the balance of ergonomics and the informativeness and robustness of the test suite.
In functional testing, however, you can certainly pepper the tests with lots of assertions along the way to confirm that the test is progressing and you know at what point it broke. In that case, the user being unable to log in would mean that testing deleting an item would not be worthwhile.
These aren't unit tests, they are integration tests (API calls to an external system). Integration tests have had multiple assertions forever (look at how Cypress works where each test step is an assertion), the idea of single assertions are for unit tests.
The issue with multiple assertions for unit tests is that they hide errors by failing early extending the time to fix (fail at Assert 1, fix, re-run, fail at Assert 2, fix...). If you want multiple assertions you should have a single assertion that can report multiple issues by returning an array of error strings or something. The speed increase of multiple assertions vs multiple tests is usually tiny, but if you do have a long unit test then that's basically the only reason I can think of to use multiple assertions.
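One way to do that "collect problems, assert once" style in Python, as a sketch (client is a hypothetical test client):

def test_delete_reservation_response():
    response = client.delete("/reservations/42")
    problems = []
    if response.status_code != 204:
        problems.append(f"status: expected 204, got {response.status_code}")
    if response.content:
        problems.append(f"body: expected empty, got {response.content!r}")
    # a single assertion that reports every issue at once instead of failing early
    assert not problems, "; ".join(problems)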
I think it depends on the test. If you are checking that e.g. context.currentUser() returns the current user's name, email, id and roles, I would probably write just 2 tests: one for the user attributes and one for checking roles.
jUnit provides a helpful assertAll(...) method that allows us to check multiple assertions without stopping at the first failed one.
In my tests I often use "thick" asserts like assertEqualsWithDiff(dtoA, dtoB) that compare 2 objects as a whole and print the property names and values that do not match. Not everyone likes this approach, but for me it is the best balance between time spent on the test and the benefit that I get from it.
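A Python analogue of that "thick" assert, as a sketch: with dataclasses, a single equality assertion plus pytest's failure report gives a field-by-field diff (ReservationDto and load_reservation are invented names):

from dataclasses import dataclass

@dataclass
class ReservationDto:
    name: str
    seats: int
    date: str

def test_reservation_round_trip():
    expected = ReservationDto(name="Alice", seats=2, date="2023-01-01")
    actual = load_reservation("alice")   # hypothetical function under test
    # one "thick" assertion; the failure output diffs the two objects property by property
    assert actual == expected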
I work to "assert one thing", and the assertions are added to help with debugging. I sometimes even assert starting conditions if they can at all change.
Without an "assert one thing" rule, what tends to happen is there will be cut-and-paste between tests, and the tests overlap in what they assert. This means that completely unrelated tests will have an assertion failure when the code goes wrong.
When you do a refactor, or change some behavior, you have to change _all_ of the tests. Not just the one or two that have the thing you're changing as their focus.
Think of tests that over-assert like screenshot or other diff-based tests, they are brittle.
It is interesting how far purism can go. I wouldn't have thought that some people obsess about having just one assertion in a test. This seems to be a case of being mentally enslaved by your own categories.
I mean, obviously. Obviously I'm not going to assert every header in their own unit test for a single request path in most cases. Assert what makes sense as a logical whole / unit.
Something I find awkward with unit tests (and it might just be me) is that I want to write a test like:
def testLotsOfWaysTofail():
    d = {"Handle Null": (None, foobar),
         "Don't allow under 13 to do it": (11, foobar),
         "or old age pensioner": (77, wobble)}
    for ...
        generate a unit test dynamically here
I have built metaclasses, I have tried many different options.
I am sure there is a neat solution.
@parametrize_cases(
    Case("handle null", age=None, x="foobar"),
    Case("don't allow under 13s", age=11, x="foobar"),
    Case("or old age pension", age=77, x="wobble"),
    ...  # as many as you want
)
def test_lots_of_ways_to_fail(age, x):
    with pytest.raises(ValueError):
        function_under_test(age, x)
Just wanted to pop in and say ckp95 actually mailed me this reply in case I missed it. The extra effort kinda restored my faith in humanity - giving a shit about strangers' problems matters these days. Nice one.
I've seen this pattern often with teams that are out of their normal wheelhouse (i.e. Python shop doing Go or JS shop doing Java) and think they cannot extend a bad test library. The other place you see this a lot is where developer hours are being billed. I can sympathize with devs who are shallow on a test library, but the billing hours one is a dark pattern for optimizing cost plus billing.
It took me a long time to feel okay with all the times I broke the single assertion “rule” after reading Clean Code. In fact I only recently stopped feeling bad at all when I went to reimplement a bunch of abseil’s flat hash map tests in C#. All of the tests assert multiple things. If a project as big as that can have multiple asserts on a basic data structure then I can too
The problem seems to be that assert is implemented as a regular function.
It must be implemented as a macro so that the line number and assertion expressions are printed, making it easy to identify the failed assertion.
If a language doesn't support such macros and has no ad-hoc mechanism for this case, it should not be used, or, if it must be used, the assert function must take a string parameter identifying the assertion.
Some languages allow a stack trace to be obtained in normal code, which enables position reporting without macros. Python and Go are good examples.
If you know the file and line of the assertion, plus the values that are being checked, there's not as much need for a stringified version of the expression.
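For instance, a minimal Python sketch of a check helper that reports the caller's position via the stack, no macros needed (the name check is invented):

import inspect

def check(condition, message="check failed"):
    # report the calling test's file and line when the condition does not hold
    if not condition:
        caller = inspect.stack()[1]
        raise AssertionError(f"{caller.filename}:{caller.lineno}: {message}")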
> If you know the file and line of the assertion, plus the values that are being checked, there's not as much need for a stringified version of the expression.
It does save time. With the actual condition reproduced, half the time I don't even need to check the source of the failed test to know what went bad and where to fix it. Consider the difference between:
FAILED: Expected 1, got 4
In /src/foo/ApiTest.cpp:123
vs.
FAILED: response.status evaluated to 4
expected: 1
In /src/foo/ApiTest.cpp:123
vs.
FAILED: response.status evaluated to Response::invalidArg (4)
expected: Response::noData (1)
In /src/foo/ApiTest.cpp:123
This is also why I insist on adding custom matchers and printers in Google Test for C++. Without it, 90% of the time a failed assertion/expectation just prints "binary objects differ" and spews a couple lines of hexadecimal digits. Adding a custom printer or matcher takes little work, but makes all such failures print meaningful information instead, allowing one to just eyeball the problem from test output (useful particularly with CI logs).
A test should have as many assertions as needed to test an interface. If you find you’re needing a lot of assertions to do that then I suspect either the interface under test is too large or you’re testing multiple stack frames. In my experience, it’s usually the latter; I call it accidental testing. Those delegate calls should have their own tests.
Use mutation testing as well, if it's available for your language. This evaluates the quality of your test suite by automatically switching up your app logic and running the same tests.
Nothing like removing one line, breezing through code review, and bringing down production. "But all tests passed!"
I didn't know you were only supposed to put one assertion per unit test. Some of mine have 5-10 assertions. Why? Because I don't have time to write tests all day, but I recognize there need to be tests. As long as it catches the bugs before it goes to prod and can be modified easily, who cares?
Unfortunately and from experience, far more seem to care about the how than the what when it comes to testing. This comment section alone is pretty telling: people spending their Saturday to write dogmatic rules.
Arrange: whatever you need for the setup.
Act: a single line of code that is under test.
Assert: whatever you want to assert, multiple statements allowed and usually required.
We put the 3A in code as comments as boundaries and that works more than perfect for the whole team.
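A small sketch of that layout in Python; Order, Item and the discount numbers are made up:

def test_discount_applied_to_large_orders():
    # Arrange
    order = Order(items=[Item(price=100)] * 5)
    # Act
    total = order.total_with_discount()
    # Assert
    assert order.discount_rate == 0.10
    assert total == 450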
I’ve seen code guidelines where this article’s asserts would be unacceptable.
If I saw this error in a test output, I’d be pretty miffed that it doesn’t tell me what the expected and actual values are.
Actual status code: NotFound.
Expected: True
Actual: False
also, only testing public interfaces is perfectly fine and may actually be preferable as it leaves you free to refactor internals freely without breaking tests.
tbh unit testing is a balancing act between reaping code quality benefits and bogging yourself down with too much test updating.
This is an interesting point IMO. We tend to focus at the API level far more, and implicitly test inner functionality (that is, the inner functionality must be correct for the outer tests to pass). Sometimes testing inner functionality explicitly is required when the outer tests are not complete, or when behaviour is defined by the inner code. We also as far as possible use defensive techniques and extensive type constraints (which is a joy in Rust).
I'm constantly thinking about where we need to put tests, though, and I'm still not fully convinced I get it right. My rule of thumb is that each test should map to a specification point, and that spec is a necessary documentation line for the test.
It's not that there's anything inherently wrong with multiple assertions in a unit test, it's that it's often a smell you're testing multiple behaviours in a single test - which is the thing you should avoid.
i need to send this to all my coworkers. i wrote some tests kind of similar to this, except i had a bunch of setup code then asserted something was true, changed 1 thing and then verified it became false so that it was super clear that it was the 1 thing that causes it to become false.
they were like no, copy paste the entire setup and change the field to false in the new setup.
im like how are you supposed to tell which of the dozen conditions triggered it now? you have to diff the 2 tests in your head and hope they don't get out of sync? ridiculous
I like the one assertion per unit test case rule. It makes me design cleaner code and write readable test cases. I also use object equality vs asserting on individual fields.
I think my dream test framework would have the following features:
1) Test suites are organized as trees, not as lists:
I found one of the most common reasons to have many assertions in a test was that you want to share some complicated setup/teardown logic - or that one testable action depends on another testable action having happened before. (i.e., adding an item - asserting it's there, then removing it, asserting it's gone).
The disadvantage is that you have to lower the granularity of your tests - if you want to debug a specific action, you still have to rerun the whole test.
I think a better way to solve this would be to organize tests as a tree, maybe something like this:
- A single unit test consists of a setup phase, a teardown phase, 0 or more assertions and 0 or more child tests. Each child test is organized the same way, i.e. it can have child tests of its own, etc.
- When running a test, first the setup phase and assertions are executed, then each child test recursively, then the teardown phase. Success/failure is tracked for each test separately, but child tests are run in the same process/context as the parent test.
- Each test can be started individually, including child tests. When a child test (or grandchild test, etc) is run individually, the test runner will first run the setup phases of all ancestors, then run the test, then run the teardown phases of the ancestors.
- Bonus: In the setup phase, a test can dynamically generate child tests (e.g. as lambdas/closures). Each test must have a unique ID with which it can be tracked across different test runs or started individually. This could be useful for parameterized tests or if you want to test a loop invariant across multiple iterations.
This would allow you to write your test script like one big multi-assert test, but still get fine-grained reports and control as if you'd have put each assert in a separate script.
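pytest's fixtures already approximate part of the tree idea in (1): each test can run on its own, and the runner re-executes the ancestor setup/teardown chain for it. A sketch, with Store as a made-up system under test:

import pytest

@pytest.fixture
def store():
    s = Store()          # parent "setup phase"
    yield s
    s.close()            # parent "teardown phase"

@pytest.fixture
def store_with_item(store):
    store.add("item-1")  # child setup builds on the parent's
    return store

def test_added_item_is_present(store_with_item):
    assert "item-1" in store_with_item

def test_removed_item_is_gone(store_with_item):
    store_with_item.remove("item-1")
    assert "item-1" not in store_with_item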
2) Provide "metrics" and "change detection" as an alternative to assertions:
I think one of the most involved parts of writing tests can often be to verify the results - think which particular state you want to assert, how you can access that state in your script, etc.
A way to make this easier would be to provide a second kind of "output" for the test script: The test script simply outputs a list of key/value pairs without any notion whether or not the value is "correct" or "incorrect". The test runner stores the list and compares the values with the list from a previous test run - e.g. the previous commit. Every value that was changed between the runs is shown to the user and can be marked as "correct" or "incorrect".
This way, you could sort of interactively "learn" which values are correct and which aren't instead of having to figure out all of it beforehand.
The runner could also implement more complex conditions instead of "changed"/"did not change", such as "value may only change in one direction" e.g. for quality measures or "value must stay the same within a certain confidence interval" for flaky tests.
This could also let you track more difficult to manage metrics in a test, such as runtime or memory consumption of particular method calls.
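As a rough sketch of that change-detection idea in Python (the file name and the approval step are assumptions; a real runner would ask you to mark each change as correct or incorrect rather than silently overwrite the snapshot):

import json
from pathlib import Path

def changed_metrics(current: dict, snapshot_file="metrics.json") -> dict:
    # compare this run's key/value pairs against the stored ones and report what moved
    path = Path(snapshot_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    diff = {key: (previous.get(key), value)
            for key, value in current.items()
            if previous.get(key) != value}
    path.write_text(json.dumps(current, indent=2, sort_keys=True))
    return diff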
"A foolish consistency is the hobgoblin of little minds"
- whoever
I definitely write tests with multiple assertions, the rule I try to follow is that the test is testing a single cause/effect. that is, a single set of inputs, run the inputs, then assert as many things as you want to ensure the end state is what's expected. there is no problem working this way.