After so many months, we finally have a green CI – which means we can now treat pull requests and changes that break the CI as “unstable”. This is a huge milestone, and I want to thank both everybody who tracked these failures to check whether they were legitimate, and the people who fixed the legitimate failures and the flaky tests. This has been an amazing experience, and it helps build trust in the project!
If you look at the commits from the Atom project and check which ones have passing tests, you’ll see a horrible situation: they mostly don’t pass. There’s even a recent commit called “disable tests” that passes, and the commits right after it fail again – so Atom was already failing tests when these were disabled (how that is even possible, I don’t dare to think).
So it’s a huge milestone to have a CI that can confidently (or at least we hope!) say whether we broke something new! But in this post, I want to show not only how cool that is, but also the process, and how to avoid flaky tests in the future – both for Pulsar and for other projects.
Don’t mock yourself
It’s a simple rule that we usually forget: if you’re testing `User`, don’t mock or stub `User`’s methods, functions, etc. Doing this basically allows simple bugs to escape, makes the tests reject obviously correct code (especially when the implementation of the mocked functions changes), and is usually just a bad idea overall. Remember, mocking should be done to control things outside the project – not things inside the project.
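To make this concrete, here is a tiny hypothetical example in Jasmine (the `User` class is invented just for illustration): stubbing the very method you are asserting on makes the test meaningless, while calling the real code catches real bugs.

```js
// `User` is a made-up class, only here to illustrate the point.
class User {
  constructor(first, last) {
    this.first = first;
    this.last = last;
  }

  fullName() {
    return `${this.first} ${this.last}`;
  }
}

describe('User', () => {
  it('builds the full name', () => {
    const user = new User('Ada', 'Lovelace');

    // Bad: stubbing the method under test. This passes even if fullName() is broken.
    // spyOn(user, 'fullName').and.returnValue('Ada Lovelace');

    // Good: exercise the real implementation.
    expect(user.fullName()).toBe('Ada Lovelace');
  });
});
```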
Another example is when the project needs to read from a local database. Some people might think it’s a good idea to either use a different, in-memory database like SQLite3 (with a `:memory:` db) or simply mock SQL queries to return what the code wants. Please, don’t – databases are easy to set up. Just pass a single connection to the code, wrap the whole test in a database transaction, and `ROLLBACK` at the end – that’s it. It’s fast (because nothing is actually committed – unless you’re using MySQL, in which case you might have other problems; yes, I know, I am biased) and it is predictable: if you change the DB schema, if the query is wrong, or if some validation needs to happen, the code will fail – and that’s exactly what you want.
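As a rough sketch of this technique, assuming node-postgres (`pg`) as the driver and an existing `users` table (the `withRollback` helper is just an illustrative name), wrapping the test in a transaction and rolling it back keeps the database clean without any mocking:

```js
const { Client } = require('pg');

// Runs `testBody` inside a transaction and always rolls it back,
// so nothing the test writes ever reaches the real tables.
async function withRollback(testBody) {
  const client = new Client(); // connection settings come from the usual PG* env vars
  await client.connect();
  try {
    await client.query('BEGIN');
    await testBody(client);
  } finally {
    await client.query('ROLLBACK');
    await client.end();
  }
}

it('persists a user', () =>
  withRollback(async (db) => {
    await db.query("INSERT INTO users (name) VALUES ('Ada')");
    const { rows } = await db.query("SELECT name FROM users WHERE name = 'Ada'");
    expect(rows.length).toBe(1);
  }));
```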
Don’t mock clocks
Usually, this is a bad idea. Some frameworks have an “internal clock”, some use the system’s clock, some use a library, and some code basically “waits” for something to happen after a couple of seconds. An old practice was to offer some kind of “internal clock”, like a Queue, a Task, or something similar, and manually implement a loop that would go through all the “tasks”, process each one, and hand the result to some callback. That approach complicates the code needlessly, has serious performance problems (especially in languages that already have mechanisms to control background tasks), and it suffers from the same problem as above – it ties the tests to implementation details.
Mocking “core libraries” is also bad – again, because it ties the tests to implementation details. As an example, Pulsar mocks the system clock by default – a practice we inherited from the Atom codebase, which uses the Jasmine framework. Unfortunately, for now, we can’t change this without breaking everything, and we’ll probably never be able to change it in the Jasmine implementation because it might break third-party packages’ tests. Also unfortunate is how much the web has changed over the years, and how much faster computers have become. This also means that some tests that used to pass will now fail, because while the clock is mocked, these newer web technologies don’t use the same “core functions” that are being mocked.
Finally, it’s flaky – suppose we add some new task that runs “before” the thing we want to check. Now the tests will fail, and we’ll need to “manually advance the clock” twice. So, how do we fix this? Simply let asynchronous things be asynchronous, and “wait” for them to happen. It’s as simple as a loop that times out after a couple of seconds and, from time to time, checks if things are correct. In languages like JavaScript, where we can only run a single thread at a time, you might need to use the event loop to solve this, and the easiest way is to create a promise with a combination of `setTimeout` and `setInterval`:
```js
function waitPromise(functionToCheck) {
  return new Promise(resolve => {
    const interval = setInterval(() => {
      if(functionToCheck()) {
        resolve();
        clearInterval(interval);
      }
    }, 50); // checks every 50ms

    setTimeout(() => {
      resolve();
      clearInterval(interval);
    }, 2000); // gives up after 2 seconds
  });
}
```
Now you can call `await waitPromise(() => document.querySelector('div.results'))` to “wait” for a `div` with the class `results` to be present in the UI. Also, please notice that we always resolve the promise, even if the condition was never met. And the reason for that is…
Make readable failure messages
I can’t stress this enough: tests exist to help you know what’s wrong. If your test says `expected true to be false`, you’re doing it wrong! Other examples of bad messages are `Test timed out`, `Failure`, etc.
This is easier said than done – some test frameworks don’t offer good ways of solving this. For example, in Clojure you might need a custom matcher library (I made one called Check because of this exact problem); in Pulsar, I basically hacked a solution on top of the Jasmine library we’re using and called it `toSatisfy` – a matcher that takes two parameters: the value to match, and a function that, if the test fails, describes why it failed. It’s not perfect, but with it I was able to take a failing test whose message was:
Expected { value : 'require', scopes : [ 'source.ruby', 'keyword.other.special-method.ruby' ] } to equal { value : 'require', scopes : [ 'source.ruby', 'meta.require.ruby', 'keyword.other.special-method.ruby' ] }
and make it read as:
Expected to find scope `source.ruby`, `meta.require.ruby`, `keyword.other.special-method.ruby` but found scopes `source.ruby`, `keyword.other.special-method.ruby`
which can be further customized into producing a diff, like saying “it’s missing this specific scope”, for example. It might not sound like much of a change, but as soon as someone reads this second message, they will know that the test refers to scopes, instead of “some value that should be returned from somewhere” – so they will know that it depends on some grammar rules, or things like that.
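Just to illustrate the shape of such an assertion (this is a sketch, not the exact signature of Pulsar’s `toSatisfy`, and `token` is a made-up variable), the second argument is only used when the expectation fails, and it is what builds the readable message:

```js
// Hypothetical usage sketch of a toSatisfy-style matcher; the real signature may differ.
const expected = ['source.ruby', 'meta.require.ruby', 'keyword.other.special-method.ruby'];

expect(token.scopes).toSatisfy(
  expected,
  actual => `Expected to find scopes \`${expected.join('`, `')}\` ` +
            `but found scopes \`${actual.join('`, `')}\``
);
```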
Make it easy to write tests
I’ve lost count of how many people basically throw away good practices like “don’t repeat yourself”, or the idea of creating helper functions, when writing tests. Some people argue that this makes tests “easier to reason about”, because you know, exactly, how the setup works.
I disagree. Strongly.
If you have to write 40 lines of code to set up the “scenario” before writing an assertion, the obvious outcome is that you won’t write a new test, because it’s too hard. Worse, when this setup needs to change, maybe because some requirement changed, you will have a lot of places to change and lots of failures that have nothing to do with the actual test. So, please – write a setup function. And no, don’t use `beforeEach` or its equivalent for that – write a freaking function. It’s way easier to read something like `prepareEditor(); activatePlugin('notifications');` than to track every “beforeAll” up to the root and read a lot of code to understand what’s happening.
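Here is a minimal sketch of what such helpers could look like in a Pulsar package spec. `prepareEditor` and `activatePlugin` are hypothetical names built on top of the real `atom.workspace` and `atom.packages` APIs, and the example assumes a Jasmine version that understands async specs:

```js
// Hypothetical helpers for a Pulsar/Atom package spec. Each test calls them
// explicitly, so the setup is visible right next to the assertions.
async function prepareEditor(text = '') {
  const editor = await atom.workspace.open(); // opens an empty, untitled editor
  editor.setText(text);
  return editor;
}

function activatePlugin(name) {
  return atom.packages.activatePackage(name);
}

it('activates the plugin against a prepared editor', async () => {
  const editor = await prepareEditor('some contents');
  await activatePlugin('notifications');

  expect(editor.getText()).toBe('some contents');
  expect(atom.packages.isPackageActive('notifications')).toBe(true);
});
```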
This also goes for the “magical stuff” that some test libraries have. In the past, I loved the `let(:variable) { create_something }` that Ruby’s RSpec has; nowadays, I find it abhorrent – it is, again, hard to track, and it’s lazy – meaning that a test can fail just because you didn’t reference something, since that “something” will never be created…
Also, if possible, use a library that allows you to write custom matchers. It’s easier to read `expect(something).toBeAGrammarForContents("some editor contents")` than the alternatives; also, your failure message will be more descriptive, like `something is not a grammar for the contents ...`.
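With Jasmine 2+, registering such a matcher takes only a few lines. The sketch below is purely illustrative (it is not Pulsar’s actual `toBeAGrammarForContents`), but it shows where the descriptive failure message comes from:

```js
beforeEach(() => {
  jasmine.addMatchers({
    // Illustrative matcher only; what "being a grammar for these contents" means
    // in your project goes inside `compare`.
    toBeAGrammarForContents: () => ({
      compare(actual, contents) {
        const pass = Boolean(actual && typeof actual.tokenizeLines === 'function');
        return {
          pass,
          message: `${JSON.stringify(actual)} is not a grammar for the contents "${contents}"`
        };
      }
    })
  });
});
```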
Closing thoughts
Writing tests is not an easy task, and Pulsar is far from being helpful here. In newer versions of Pulsar, we intend to modernize this test generation too – both for core packages and for package authors. Heck, even I don’t like how Atom’s tests are written, so I made Chlorine run against a real Atom instance instead, driving it with Selenium. I want to make a system that I’d like to use, and that I feel happy using. Hopefully, we’ll get there!