Automated Testing: How We Catch Thunderbird Bugs Before You Do
Since the release of Thunderbird 115, a big focus has been on improving the state of our automated testing. Automated testing increases software quality by minimizing the number of bugs accidentally introduced by changes to the code. For each change made to Thunderbird, our testing machines run a set of tests across Windows, macOS, and Linux to detect mistakes and unintended consequences. For a single change (or a group of changes that land at the same time), 60 to 80 hours of machine time is used running tests.
Our code is going to be under more pressure than ever before – with a bigger team making more changes, and monthly releases reducing the time code spends on testing channels before being released.
We want to find the bugs before our users do.
Why We’re Testing
We’re not writing tests merely to make ourselves feel better. Tests improve Thunderbird by:
- Preventing mistakes: If we test that some code behaves in an expected way, we’ll find out immediately if it no longer behaves that way. This means a shorter feedback loop, and we can fix the problem before it annoys users. (There’s a minimal sketch of such a test after this list.)
- Finding out when somebody upstream breaks us: Thunderbird is built from the Firefox code. The Firefox code, which we are not responsible for, is 30 to 40 times the size of the code we are responsible for. When something inevitably changes in Firefox that affects us, we want to know about it immediately so that we can respond.
- Freeing up human testers: If we use computers to prove that the program does what it’s supposed to do, particularly if we avoid tedious repetition and difficult-to-set-up tasks, then our limited human resources can do more of the things that humans are better at. For example, I’ve recently added tests that check 22 ways to trigger fetching mail, and 10 circumstances in which fetching mail might not work. There’s no way our human testers (great though they are) could cover all of those, but our automated tests can and do, several times a day.
- Thinking through what the code should be doing: Testing forces an engineer to look at the code from a different point of view, which helps in thinking about what the code is supposed to do in more circumstances. It also makes it easier to prove that the code works in obscure circumstances.
- Finding existing bugs: In software terms we’re working with some very old code, and much of it is untested. Testing puts a fresh set of eyes on the code and reveals some of the mistakes of the past, and where the ravages of time have broken things. It also helps the person writing the tests understand what the code does, far better than just reading the code would.
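To make the first of those points concrete, here’s a minimal sketch of what such a test might look like in our xpcshell-style suites. `parseMailboxUri` is a made-up stand-in for whatever unit is under test, not a real Thunderbird function:

```js
// Minimal behaviour-pinning test sketch (xpcshell flavour).
// `parseMailboxUri` is a hypothetical stand-in for the unit under test.
add_task(async function test_parseMailboxUri() {
  const parts = parseMailboxUri("mailbox://user@example.org/INBOX");
  // If a later change breaks this behaviour, the test fails within hours
  // of the change landing, not weeks later in a user's bug report.
  Assert.equal(parts.host, "example.org", "host should be extracted");
  Assert.equal(parts.folder, "INBOX", "folder should be extracted");
});
```

The same shape works as a regression test: when we fix a bug, a test like this pins down the corrected behaviour so the bug can’t quietly return.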
We’re not trying to completely cover a feature or every edge case in tests. We are trying to create a testing framework around the feature so that when we find a bug, as well as fixing it, we can easily write a test preventing the bug from happening again without being noticed. For too much of the code, this has been impossible without a weeks-long detour into tests.
Breaking New Ground
In the past few months we’ve figured out how to make automated tests for things that were previously impossible:
- Communication with mail servers using encrypted channels.
- OAuth2 authentication with mail servers.
- Communication with web servers where a specific address must be used and an unencrypted channel must not be used.
- Servers at any given host name or port. Previously, if we wanted to start a server for automated testing, it had to be on the local machine at a non-standard location. Now we can pretend that the server is anywhere, and using standard ports, which is needed for proper testing of account configuration features. (Actually, this was possible before, but now it’s much easier.)
These new abilities are being used to wrap better testing around account set-up features, ahead of the new Account Hub development, so that we can be sure nothing breaks without being noticed. They’re also helping test that collecting mail works when it should, or gives the error prompts we expect when it doesn’t.
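To give a flavour of what these abilities unlock, here’s a rough sketch of such a test. `createFakeServer` and `fetchMailFrom` are invented helper names for illustration; the real test utilities in our tree differ:

```js
// Rough sketch (xpcshell flavour) of testing mail collection against a fake
// server. `createFakeServer` and `fetchMailFrom` are hypothetical helpers.
add_task(async function test_fetchMailOverTLS() {
  // Pretend a mail server exists at a real-looking host name, on the
  // standard port, behind an encrypted channel.
  const server = await createFakeServer({
    protocol: "imap",
    hostname: "mail.example.org", // any host name, not just localhost
    port: 993, // the standard IMAPS port
    tls: true, // encrypted channel
  });
  registerCleanupFunction(() => server.stop());

  // Fetch mail and check that it actually arrives.
  const messages = await fetchMailFrom(server);
  Assert.ok(messages.length > 0, "messages should arrive over the encrypted channel");
});
```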
Code coverage
We record every line of code that runs during our tests. Collecting all that data tells us what code doesn’t run during our tests. If a block of code doesn’t run during any of our tests, nothing will tell us when it breaks until somebody uses the code and complains.
Our code coverage data can be viewed at coverage.thunderbird.net. You can also look at Firefox’s data at coverage.moz.tools.
Looking at the data, you might notice that our overall number is now lower than it was when we started measuring. This doesn’t mean that our testing got worse; it reflects the fact that we added a lot of code (that isn’t maintained by us) in the third_party directory. For a better reflection of the progress we’ve made, check out the individual directories, especially mail/base, which contains the most important user interface code.
- Just setting up the code coverage tools and looking at the results uncovered several memory leaks. (A memory leak is where memory is allocated for a task but not released when it is no longer needed.) We fixed these leaks, and some more that existed in our test code. We now have very low levels of memory leaking in our test runs, so if we make a mistake it is easy to spot. (There’s a sketch of the kind of test hygiene involved after this list.)
- Code coverage data can also point to code that is no longer used. We’ve removed some big chunks of this dead code, which means we’re not wasting time maintaining it.
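Here’s a hedged sketch of the kind of test hygiene that keeps those leak numbers low; the observer topic is illustrative:

```js
// Anything a test registers globally must be released again, or it (and
// everything it references) outlives the test and shows up as a leak.
// The topic string "example-topic" is illustrative.
add_task(async function test_withObserver() {
  const observer = {
    observe(subject, topic, data) {
      // React to the notification under test.
    },
  };
  Services.obs.addObserver(observer, "example-topic");

  // Without this cleanup, the observer would leak past the end of the test.
  registerCleanupFunction(() =>
    Services.obs.removeObserver(observer, "example-topic")
  );

  // ... exercise the code that fires the notification ...
});
```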
Mozmill no more
Towards the end of last year we finally retired an old test suite known as Mozmill. Those tests were partially migrated to a different test suite (Mochitest) about four years ago, and things were mostly working fine so it wasn’t a priority to finish. These tests now do things in a more conventional way instead of relying on a bunch of clever but weird tricks.
How much of the code is test code?
About 27%. This is a very rough estimate based on the files in our code repository (minus some third-party directories) and whether they are inside a directory with “test” in the name or not. That’s risen from about 19% in the last five years.
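For the curious, here’s roughly how such an estimate could be produced. This is a stand-alone Node.js sketch, not part of Thunderbird, and the skipped directory names are illustrative:

```js
// Back-of-the-envelope estimate: what fraction of files sit inside a
// directory with "test" somewhere in its path?
const fs = require("fs");
const path = require("path");

let total = 0;
let tests = 0;

function walk(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.isDirectory()) {
      // Skip VCS metadata and vendored code we don't maintain.
      if (entry.name === ".hg" || entry.name === "third_party") continue;
      walk(path.join(dir, entry.name));
    } else {
      total++;
      if (dir.split(path.sep).some(part => part.includes("test"))) tests++;
    }
  }
}

walk(".");
console.log(`${tests}/${total} files (${((100 * tests) / total).toFixed(1)}%) look like test code`);
```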
There is no particular goal in mind, but I can imagine a future where there is as much test code as non-test code. If we achieve that, Thunderbird will be in a very healthy place.
Looking ahead, we’ll be asking contributors to add tests to their patches more often. This obviously depends on the circumstance. But if you’re adding or fixing something, that is the best time to ensure it continues to work in the future. As always, feel free to reach out if you need help writing or running tests, either via Matrix or Topicbox mailing lists:
- Matrix: You can join our development chat channel at #maildev:mozilla.org
- Topicbox mailing list: The Thunderbird Developers list is a good place to raise questions about Thunderbird development.
Geoff Lankow, Staff Engineer