Completely inspired by this post: https://archiloque.net/blog/a-machine-for-gods-jam/, and my experiences with Pulsar.
The Node.js ecosystem, together with so many others, is broken. Maybe beyond repair.
Let’s review the foundations of good software: good code, automated tests, a server that checks if the software works (usually called a CI server), a server that publishes the software continuously as soon as everything is working (usually called a CD server), and reproducibility – meaning that if something fails with a given set of parameters, it must always fail, in the same place and in the same way; and if it passes, it must always pass under the same conditions.
Now, onto Pulsar
In the beginning, I tried to use CircleCI, because it’s easier to debug than other CIs. The problem: the free tier is too small for us to work on our Windows builds – they take an hour just to build the editor (not test – just build), and do less than our Linux builds, which take less than 30 minutes to run everything, including tests. After about 4 runs, we had already exhausted our free tier…
So, I moved to CirrusCI. It only works with GitHub, but in this case that’s fine, because Pulsar’s code is on GitHub and we don’t plan to move it. Cirrus… has a lot of problems. ZIP files would not download; then it broke the download of binaries; then it broke the UI; then it broke the API; then it broke the Windows image we were using. And those are only the problems with Cirrus…
Apple changed their processors’ architecture to Arm64. Supposedly, that’s all they did. But no – they also got strict about how apps should be packaged/described. Some “entitlements” need to be present if we use WASM, and we need to sign the app. Signing an app requires an Apple developer license, and a Mac.
Writing open-source software is not free – developers basically pay with their time. With Apple, our time alone is not sufficient – we also need to pay USD 99 per year. If we don’t sign, Apple says the app is damaged – not “not signed”, not “insecure”, but straight up damaged. There’s a way to bypass this by running a cryptic command from the command line, but it doesn’t matter – what matters is that most people don’t know this, and it wasn’t documented anywhere that this “new restriction” (together with a WHOLE LOT of others) only applies to binaries that target Arm64. If you try to install the same app – same entitlements, same lack of signing, but compiled for Intel – on an Apple machine running Arm64 (I know the official name is “Silicon”, but I’m quite fed up with Apple inventing new terms just to “be cool”), it simply… works. Sure, it shows a warning, but not “this is broken”.
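For reference, the workaround usually passed around is a one-liner that strips the quarantine attribute macOS attaches to downloaded apps – assuming that this is indeed the command in question, and that the app landed in /Applications (adjust the path as needed):

```shell
# Remove the quarantine attribute macOS adds to downloaded apps, so
# Gatekeeper stops reporting the unsigned app as "damaged".
# The path is an assumption – adjust to wherever the app was installed.
xattr -d com.apple.quarantine /Applications/Pulsar.app
```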
Windows is… well, there’s no way to sugar-coat it – it’s a piece of garbage. Yes, it is. Depending on the version of Visual Studio that you install, Node won’t see it. It comes with a `python` command installed, and when you run it, it sends you to the Windows Store. On Cirrus, if you use Chocolatey to install things, it doesn’t set the PATH, meaning that sometimes `python` would not work even though we had installed it correctly (the Chocolatey installation puts the “real” Python’s PATH before the “fake” one).
And when everything was ready to work… we had random breakages. Literally random – sometimes it was “file not found”, sometimes “file in use”, sometimes “library not found to link”, sometimes it didn’t download something. The worst part? These failures were consistent within a single run – meaning that if we retried, the build would fail in literally the same way, but if we cancelled the build and started a new one, it would give a different failure. We could not debug this, because Cirrus doesn’t attach a terminal to Windows machines…
Finally, we were able to find a combination that works, so we could generate our binaries. Except… that the tests wouldn’t run. We kind of gave up for the time being; but then a bug appeared only on Windows builds, so we’re trying to re-add the tests. Finally, our “portable” app takes A LOT of time to boot, and we found out that’s because it is compressed with 7zip (WHY?), so we wanted to migrate to ZIP…
Electron Builder / Playwright Madness
Electron Builder tries to be clever. That’s usually a horrible sign in software development.
At first, it detected we were on a CI, and it tried to publish the releases to GitHub – even though we didn’t add any token to the CI, didn’t configure it to publish automatically, and weren’t even running the builds on GitHub. I found out that LOTS of tools do this nowadays – detect that you’re running on a CI and set some different parameters.
Please, tell me: what the fuck? Seriously – we want reproducibility, and that’s why we have a CI that runs our tests… differently than how we run them on our local machines? Whose idea was that?
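If anyone else trips on this: electron-builder’s CI auto-detection can be overridden explicitly. A config sketch, assuming a standalone `electron-builder.yml` file (the `--publish never` CLI flag does the same thing per invocation):

```yaml
# electron-builder.yml (fragment)
# Disable auto-publishing, regardless of what electron-builder
# infers from CI environment variables:
publish: null
```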
Electron Builder also swallows logs. So far, I haven’t found a way to make the logs appear, which basically means that if something fails, it gives you the middle finger and says “yeah, it failed. Good luck with that”.
Finally, it removed metadata from our `package.json`, so we had to monkey-patch it to keep it; it also failed to produce binaries on ARM Linux, and now, on Windows, if we set the portable target to use ZIP, it also fails – again, with a middle finger and the very descriptive error `ERR_ELECTRON_BUILDER_CANNOT_EXECUTE` (yes, that’s it – all caps and underscores).
Playwright doesn’t show the logs for Electron. So, when we tried to run the tests by pointing at the compiled code instead of the source (to try to capture the Apple Silicon madness and the Windows madness), we… couldn’t do it. Because it fails to load Pulsar. Why? Who knows – there are logs, but Playwright doesn’t show them to us.
CI servers are always weird things that don’t reflect reality
Most CI servers run… something – I am really not sure what. But I do know that they run some server / architecture / whatever that basically answers the question “what is the most broken machine and operating system you can find that satisfies what this developer wants?”, because everything breaks on the CI all the time, and close to never breaks on local machines.
Binaries built on the CI sometimes simply don’t work. Why? We have no idea. The tests did pass, so maybe some problem in packing? We don’t know, because the binaries were generated and an exit code of `0` was returned. So we have to run tests on a prebuilt binary. Except…
Yes, that’s right – that doesn’t work either. But it doesn’t work only on the CI – locally it runs; even downloading the same Docker image the CI runs and running the same commands, it also works. Only the CI is a problem. These are things that don’t run on a CI, but run locally:
- Building a binary on Linux and running the tests over it
- Building a binary on ARM Linux and running the tests over it
- Building a portable binary on Windows with ZIP compression
- Installing the NSIS binary that was generated by the CI on Windows
- Building a binary on Intel Mac and running the tests over it (I was able to make it work on a local VM – something I supposedly couldn’t even do, considering that Apple offers no support for virtualization)
- SEEING what’s happening – basically, none of the options allow me to debug, except via some crappy terminal that keeps getting disconnected from time to time
How is it acceptable that, on every developer’s machine, things work fine, all the tests run, and we have a failure rate of about 10%, while on the CI our failure rate sometimes gets to 50%? How is it that every minor change breaks the CI? A stupid example: I tried to add a test to see if I could install a package. Tested locally – worked every time. Tested on a virtual machine with limited resources – caught some bugs, made some retries, failure rate was essentially 0%. Tried to run it on Docker – works too.
On the CI? It breaks. On all systems, with any configuration, with a failure rate of 100%. We record videos on the CI. The video shows the freaking screen, but the tests fail with “can’t find this element that you’re clearly seeing on the video, on the screen, right now”. To add insult to injury, the mere existence of this test makes other tests crash too – even though Playwright does close the editor and reopen it after a failure occurs, so that it starts with a blank slate. Again, we accepted that these tests can’t run, and moved on.
The stupid situation with package managers on Node
You know why Pulsar uses `yarn`? Because it’s the only package manager that works.
Yes, that’s right. Not “better”, not “more reliable” – it literally freaking works.
`npm` fails with `Undefined is not a function` while trying to install… something… because it doesn’t show what it is installing.
`pnpm` fails with random things, or installs things that `require` then can’t find; it also shows problems on Windows.
Package managers in Node are a joke – and not even a funny one. Yarn caches a lot of things in my `/tmp` folder that I have no idea about, and if I remove them, I see no difference in new installs, or old ones – it’s literally just megabytes and megabytes of junk. There are a couple of `--ignore-<something>` toggles on Yarn, and most of them need another one, or `yarn` won’t ignore the right amount of things for your package to work. Warnings, on Yarn, appear in bright red on my machine; errors, in a light orange – meaning that warnings are more pronounced than errors. When something fails, it doesn’t tell you when or what failed – it shows you some log that you have to, hopefully, decipher for yourself. Local dependencies on `yarn` are incredibly inconsistent – sometimes they reuse the `node_modules` you have in your local folder, sometimes everything is installed from scratch; sometimes it won’t include my WASM files, sometimes it does; sometimes it does not “nest” the local dependency, sometimes it does. It’s so broken that I found it’s actually easier to push some change to GitHub and add that commit to `yarn` as a dependency instead.
Oh, I hear you say – I can create a local git repo and point Yarn to it, right? So I don’t need to actually push to GitHub? Ha, wrong again – Yarn somehow detects this and applies the same broken rules it uses for local files.
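For the record, the “push a commit and depend on it” workaround looks like this – a `package.json` sketch where the package name, URL, and commit hash are all placeholders, not Pulsar’s real dependencies:

```json
{
  "dependencies": {
    "my-local-package": "https://github.com/some-org/my-local-package.git#abc1234"
  }
}
```

Pinning the exact commit hash at least makes the install reproducible, which is more than the local-folder path gives you.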
Ah, I almost forgot – package installation on Node can run arbitrary commands. Every single package. That means that if a package adds a compilation step and then runs some rootkit on your machine to keylog all your passwords, yep – that’s exactly what it will do.
Finally, Yarn is migrating to a new version. I am almost sure that it’ll also break our dependencies, so I’m afraid to upgrade…
A subpar experience that took more time to be done than everything else
Ever since we started the Pulsar project in June 2022, I’ve been fighting these things. It’s either `yarn`, which insists that a package that’s not installed is installed and needs a hundred toggles to ignore what it thinks is right; or Electron, which crashes without logs or stacktraces or even memory dumps; or Windows, which keeps producing errors that never happen twice in a row; or Apple, which invents errors from scratch (yes, literally – there were errors like “network timeout” when there was no network involved at all, for example); or CI stack upgrades that basically don’t add anything but break all of our builds. From the tiny toggles that we can’t avoid or the CI will break – even though everything literally runs everywhere else just fine – to the “everything was correct, no errors reported” followed by a final binary that’s just broken.
The worst part is that it takes forever to solve! I’ve spent about three weeks trying to solve that “binary doesn’t work but tests report that everything is fine” issue on our CI. As a comparison, it took one week to bootstrap the rewrite of a major part of the editor – tree-sitter syntax highlighting. Considering I had no documentation, the code is complicated (to say the least) to understand, and I had to understand how the tokenizer works, how tree-sitter works, and how the queries work, and write the code, and test everything… which is harder? To make a FREAKING IMPLEMENTATION from scratch with NO DOCUMENTATION AT ALL, or to make something that already works work somewhere else with the same architecture, same operating system, same version?
Finally – why don’t newer CIs show the messages being printed anymore? Travis, Cirrus, GitHub Actions – they all make you wait until the build fails (or times out) to show you the error…
Want to migrate tools? Ha ha ha ha!
Oh, wait, you were being serious? Let me laugh even more!
Let me reiterate: EVERYTHING IS BROKEN. Like, everything. Really. So, you’re replacing something that is broken but works about 70% of the time with something that is also broken but works 0% of the time. All the weird shit you had to do to convince your CI that things are working correctly (thank you, now please catch the real errors if possible), you’ll have to do all over again on this other tool – except that you have no experience at all with this other tool. You don’t even know if it’ll work!
As an example: we moved some of our tests to GitHub. After a while, we gave up on Windows testing on GitHub Actions – it simply won’t work, no matter what we do. So we… disabled everything. Then I tried to migrate these tests to CircleCI. CircleCI’s interface got so slow that I basically could not click anything on the screen, so, again, I gave up on using Circle – even though it runs my tests in a small fraction of the time GitHub Actions takes.
So, we decided to stay with the subpar experience of GitHub Actions. It’s broken, but at least we know how it’s broken.
Is this what programming is about? Making random changes until things work? Losing precious development time to fix something that wasn’t broken because another thing that wasn’t broken was updated so it is now breaking everyone else?
I want to believe it’s not. Unfortunately, my newest experiences keep trying to prove me wrong.