| .. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) | 
 | .. See the bottom of this file for additional redistribution information. | 
 |  | 
 | Handling regressions | 
 | ++++++++++++++++++++ | 
 |  | 
 | *We don't cause regressions* -- this document describes what this "first rule of | 
 | Linux kernel development" means in practice for developers. It complements | 
 | Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a | 
 | user's point of view; if you never read that text, go and at least skim over it | 
 | before continuing here. | 
 |  | 
 | The important bits (aka "The TL;DR") | 
 | ==================================== | 
 |  | 
 | #. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_ | 
 |    (regressions@lists.linux.dev) quickly become aware of any new regression | 
 |    report: | 
 |  | 
 |     * When receiving a mailed report that did not CC the list, bring it into the | 
 |       loop by immediately sending at least a brief "Reply-all" with the list | 
 |       CCed. | 
 |  | 
 |     * Forward or bounce any reports submitted in bug trackers to the list. | 
 |  | 
 | #. Make the Linux kernel regression tracking bot "regzbot" track the issue (this | 
 |    is optional, but recommended): | 
 |  | 
 |     * For mailed reports, check if the reporter included a line like ``#regzbot | 
 |       introduced: v5.13..v5.14-rc1``. If not, send a reply (with the regressions | 
 |       list in CC) containing a paragraph like the following, which tells regzbot | 
 |       when the issue started to happen:: | 
 |  | 
 |        #regzbot ^introduced: 1f2e3d4c5b6a | 
 |  | 
 |     * When forwarding reports from a bug tracker to the regressions list (see | 
 |       above), include a paragraph like the following:: | 
 |  | 
 |        #regzbot introduced: v5.13..v5.14-rc1 | 
 |        #regzbot from: Some N. Ice Human <some.human@example.com> | 
 |        #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789 | 
 |  | 
 | #. When submitting fixes for regressions, add "Closes:" tags to the patch | 
 |    description pointing to all places where the issue was reported, as | 
 |    mandated by Documentation/process/submitting-patches.rst and | 
 |    :ref:`Documentation/process/5.Posting.rst <development_posting>`. If you are | 
 |    only fixing part of the issue that caused the regression, you may use | 
 |    "Link:" tags instead. regzbot currently makes no distinction between the | 
 |    two. | 
 |  | 
 | #. Try to fix regressions quickly once the culprit has been identified; fixes | 
 |    for most regressions should be merged within two weeks, but some need to be | 
 |    resolved within two or three days. | 
 |  | 
 |  | 
 | All the details on Linux kernel regressions relevant for developers | 
 | =================================================================== | 
 |  | 
 |  | 
 | The important basics in more detail | 
 | ----------------------------------- | 
 |  | 
 |  | 
 | What to do when receiving regression reports | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Ensure the Linux kernel's regression tracker and others subscribers of the | 
 | `regression mailing list <https://lore.kernel.org/regressions/>`_ | 
 | (regressions@lists.linux.dev) become aware of any newly reported regression: | 
 |  | 
 |  * When you receive a report by mail that did not CC the list, immediately bring | 
 |    it into the loop by sending at least a brief "Reply-all" with the list CCed; | 
 |    try to ensure it gets CCed again in case you reply to a reply that omitted | 
 |    the list. | 
 |  | 
 |  * If a report submitted in a bug tracker hits your Inbox, forward or bounce it | 
 |    to the list. Consider checking the list archives beforehand, if the reporter | 
 |    already forwarded the report as instructed by | 
 |    Documentation/admin-guide/reporting-issues.rst. | 
 |  | 
 | When doing either, consider making the Linux kernel regression tracking bot | 
 | "regzbot" immediately start tracking the issue: | 
 |  | 
 |  * For mailed reports, check if the reporter included a "regzbot command" like | 
 |    ``#regzbot introduced: 1f2e3d4c5b6a``. If not, send a reply (with the | 
 |    regressions list in CC) with a paragraph like the following::: | 
 |  | 
 |        #regzbot ^introduced: v5.13..v5.14-rc1 | 
 |  | 
 |    This tells regzbot the version range in which the issue started to happen; | 
 |    you can specify a range using commit-ids as well or state a single commit-id | 
 |    in case the reporter bisected the culprit. | 
 |  | 
 |    Note the caret (^) before the "introduced": it tells regzbot to treat the | 
 |    parent mail (the one you reply to) as the initial report for the regression | 
 |    you want to see tracked; that's important, as regzbot will later look out | 
 |    for patches with "Closes:" tags pointing to the report in the archives on | 
 |    lore.kernel.org. | 
 |  | 
 |  * When forwarding a regression reported to a bug tracker, include a paragraph | 
 |    with these regzbot commands:: | 
 |  | 
 |        #regzbot introduced: 1f2e3d4c5b6a | 
 |        #regzbot from: Some N. Ice Human <some.human@example.com> | 
 |        #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789 | 
 |  | 
 |    Regzbot will then automatically associate patches with the report that | 
 |    contain "Closes:" tags pointing to your mail or the mentioned ticket. | 
 |  | 
 | What's important when fixing regressions | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | You don't need to do anything special when submitting fixes for regression, just | 
 | remember to do what Documentation/process/submitting-patches.rst, | 
 | :ref:`Documentation/process/5.Posting.rst <development_posting>`, and | 
 | Documentation/process/stable-kernel-rules.rst already explain in more detail: | 
 |  | 
 |  * Point to all places where the issue was reported using "Closes:" tags:: | 
 |  | 
 |        Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/ | 
 |        Closes: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890 | 
 |  | 
 |    If you are only fixing part of the issue, you may use "Link:" instead as | 
 |    described in the first document mentioned above. regzbot currently treats | 
 |    both of these equivalently and considers the linked reports as resolved. | 
 |  | 
 |  * Add a "Fixes:" tag to specify the commit causing the regression. | 
 |  | 
 |  * If the culprit was merged in an earlier development cycle, explicitly mark | 
 |    the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag. | 
 |  | 
 | All this is expected from you and important when it comes to regression, as | 
 | these tags are of great value for everyone (you included) that might be looking | 
 | into the issue weeks, months, or years later. These tags are also crucial for | 
 | tools and scripts used by other kernel developers or Linux distributions; one of | 
 | these tools is regzbot, which heavily relies on the "Closes:" tags to associate | 
 | reports for regression with changes resolving them. | 
 |  | 
 | Expectations and best practices for fixing regressions | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | As a Linux kernel developer, you are expected to give your best to prevent | 
 | situations where a regression caused by a recent change of yours leaves users | 
 | only these options: | 
 |  | 
 |  * Run a kernel with a regression that impacts usage. | 
 |  | 
 |  * Switch to an older or newer kernel series. | 
 |  | 
 |  * Continue running an outdated and thus potentially insecure kernel for more | 
 |    than three weeks after the regression's culprit was identified. Ideally it | 
 |    should be less than two. And it ought to be just a few days, if the issue is | 
 |    severe or affects many users -- either in general or in prevalent | 
 |    environments. | 
 |  | 
 | How to realize that in practice depends on various factors. Use the following | 
 | rules of thumb as a guide. | 
 |  | 
 | In general: | 
 |  | 
 |  * Prioritize work on regressions over all other Linux kernel work, unless the | 
 |    latter concerns a severe issue (e.g. acute security vulnerability, data loss, | 
 |    bricked hardware, ...). | 
 |  | 
 |  * Expedite fixing mainline regressions that recently made it into a proper | 
 |    mainline, stable, or longterm release (either directly or via backport). | 
 |  | 
 |  * Do not consider regressions from the current cycle as something that can wait | 
 |    till the end of the cycle, as the issue might discourage or prevent users and | 
 |    CI systems from testing mainline now or generally. | 
 |  | 
 |  * Work with the required care to avoid additional or bigger damage, even if | 
 |    resolving an issue then might take longer than outlined below. | 
 |  | 
 | On timing once the culprit of a regression is known: | 
 |  | 
 |  * Aim to mainline a fix within two or three days, if the issue is severe or | 
 |    bothering many users -- either in general or in prevalent conditions like a | 
 |    particular hardware environment, distribution, or stable/longterm series. | 
 |  | 
 |  * Aim to mainline a fix by Sunday after the next, if the culprit made it | 
 |    into a recent mainline, stable, or longterm release (either directly or via | 
 |    backport); if the culprit became known early during a week and is simple to | 
 |    resolve, try to mainline the fix within the same week. | 
 |  | 
 |  * For other regressions, aim to mainline fixes before the hindmost Sunday | 
 |    within the next three weeks. One or two Sundays later are acceptable, if the | 
 |    regression is something people can live with easily for a while -- like a | 
 |    mild performance regression. | 
 |  | 
 |  * It's strongly discouraged to delay mainlining regression fixes till the next | 
 |    merge window, except when the fix is extraordinarily risky or when the | 
 |    culprit was mainlined more than a year ago. | 
 |  | 
 | On procedure: | 
 |  | 
 |  * Always consider reverting the culprit, as it's often the quickest and least | 
 |    dangerous way to fix a regression. Don't worry about mainlining a fixed | 
 |    variant later: that should be straight-forward, as most of the code went | 
 |    through review once already. | 
 |  | 
 |  * Try to resolve any regressions introduced in mainline during the past | 
 |    twelve months before the current development cycle ends: Linus wants such | 
 |    regressions to be handled like those from the current cycle, unless fixing | 
 |    bears unusual risks. | 
 |  | 
 |  * Consider CCing Linus on discussions or patch review, if a regression seems | 
 |    tangly. Do the same in precarious or urgent cases -- especially if the | 
 |    subsystem maintainer might be unavailable. Also CC the stable team, when you | 
 |    know such a regression made it into a mainline, stable, or longterm release. | 
 |  | 
 |  * For urgent regressions, consider asking Linus to pick up the fix straight | 
 |    from the mailing list: he is totally fine with that for uncontroversial | 
 |    fixes. Ideally though such requests should happen in accordance with the | 
 |    subsystem maintainers or come directly from them. | 
 |  | 
 |  * In case you are unsure if a fix is worth the risk applying just days before | 
 |    a new mainline release, send Linus a mail with the usual lists and people in | 
 |    CC; in it, summarize the situation while asking him to consider picking up | 
 |    the fix straight from the list. He then himself can make the call and when | 
 |    needed even postpone the release. Such requests again should ideally happen | 
 |    in accordance with the subsystem maintainers or come directly from them. | 
 |  | 
 | Regarding stable and longterm kernels: | 
 |  | 
 |  * You are free to leave regressions to the stable team, if they at no point in | 
 |    time occurred with mainline or were fixed there already. | 
 |  | 
 |  * If a regression made it into a proper mainline release during the past | 
 |    twelve months, ensure to tag the fix with "Cc: stable@vger.kernel.org", as a | 
 |    "Fixes:" tag alone does not guarantee a backport. Please add the same tag, | 
 |    in case you know the culprit was backported to stable or longterm kernels. | 
 |  | 
 |  * When receiving reports about regressions in recent stable or longterm kernel | 
 |    series, please evaluate at least briefly if the issue might happen in current | 
 |    mainline as well -- and if that seems likely, take hold of the report. If in | 
 |    doubt, ask the reporter to check mainline. | 
 |  | 
 |  * Whenever you want to swiftly resolve a regression that recently also made it | 
 |    into a proper mainline, stable, or longterm release, fix it quickly in | 
 |    mainline; when appropriate thus involve Linus to fast-track the fix (see | 
 |    above). That's because the stable team normally does neither revert nor fix | 
 |    any changes that cause the same problems in mainline. | 
 |  | 
 |  * In case of urgent regression fixes you might want to ensure prompt | 
 |    backporting by dropping the stable team a note once the fix was mainlined; | 
 |    this is especially advisable during merge windows and shortly thereafter, as | 
 |    the fix otherwise might land at the end of a huge patch queue. | 
 |  | 
 | On patch flow: | 
 |  | 
 |  * Developers, when trying to reach the time periods mentioned above, remember | 
 |    to account for the time it takes to get fixes tested, reviewed, and merged by | 
 |    Linus, ideally with them being in linux-next at least briefly. Hence, if a | 
 |    fix is urgent, make it obvious to ensure others handle it appropriately. | 
 |  | 
 |  * Reviewers, you are kindly asked to assist developers in reaching the time | 
 |    periods mentioned above by reviewing regression fixes in a timely manner. | 
 |  | 
 |  * Subsystem maintainers, you likewise are encouraged to expedite the handling | 
 |    of regression fixes. Thus evaluate if skipping linux-next is an option for | 
 |    the particular fix. Also consider sending git pull requests more often than | 
 |    usual when needed. And try to avoid holding onto regression fixes over | 
 |    weekends -- especially when the fix is marked for backporting. | 
 |  | 
 |  | 
 | More aspects regarding regressions developers should be aware of | 
 | ---------------------------------------------------------------- | 
 |  | 
 |  | 
 | How to deal with changes where a risk of regression is known | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Evaluate how big the risk of regressions is, for example by performing a code | 
 | search in Linux distributions and Git forges. Also consider asking other | 
 | developers or projects likely to be affected to evaluate or even test the | 
 | proposed change; if problems surface, maybe some solution acceptable for all | 
 | can be found. | 
 |  | 
 | If the risk of regressions in the end seems to be relatively small, go ahead | 
 | with the change, but let all involved parties know about the risk. Hence, make | 
 | sure your patch description makes this aspect obvious. Once the change is | 
 | merged, tell the Linux kernel's regression tracker and the regressions mailing | 
 | list about the risk, so everyone has the change on the radar in case reports | 
 | trickle in. Depending on the risk, you also might want to ask the subsystem | 
 | maintainer to mention the issue in his mainline pull request. | 
 |  | 
 | What else is there to known about regressions? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Check out Documentation/admin-guide/reporting-regressions.rst, it covers a lot | 
 | of other aspects you want might want to be aware of: | 
 |  | 
 |  * the purpose of the "no regressions" rule | 
 |  | 
 |  * what issues actually qualify as regression | 
 |  | 
 |  * who's in charge for finding the root cause of a regression | 
 |  | 
 |  * how to handle tricky situations, e.g. when a regression is caused by a | 
 |    security fix or when fixing a regression might cause another one | 
 |  | 
 | Whom to ask for advice when it comes to regressions | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Send a mail to the regressions mailing list (regressions@lists.linux.dev) while | 
 | CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the | 
 | issue might better be dealt with in private, feel free to omit the list. | 
 |  | 
 |  | 
 | More about regression tracking and regzbot | 
 | ------------------------------------------ | 
 |  | 
 |  | 
 | Why the Linux kernel has a regression tracker, and why is regzbot used? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Rules like "no regressions" need someone to ensure they are followed, otherwise | 
 | they are broken either accidentally or on purpose. History has shown this to be | 
 | true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to | 
 | keep an eye on things as the Linux kernel's regression tracker, who's | 
 | occasionally helped by other people. Neither of them are paid to do this, | 
 | that's why regression tracking is done on a best effort basis. | 
 |  | 
 | Earlier attempts to manually track regressions have shown it's an exhausting and | 
 | frustrating work, which is why they were abandoned after a while. To prevent | 
 | this from happening again, Thorsten developed regzbot to facilitate the work, | 
 | with the long term goal to automate regression tracking as much as possible for | 
 | everyone involved. | 
 |  | 
 | How does regression tracking work with regzbot? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | The bot watches for replies to reports of tracked regressions. Additionally, | 
 | it's looking out for posted or committed patches referencing such reports | 
 | with "Closes:" tags; replies to such patch postings are tracked as well. | 
 | Combined this data provides good insights into the current state of the fixing | 
 | process. | 
 |  | 
 | Regzbot tries to do its job with as little overhead as possible for both | 
 | reporters and developers. In fact, only reporters are burdened with an extra | 
 | duty: they need to tell regzbot about the regression report using the ``#regzbot | 
 | introduced`` command outlined above; if they don't do that, someone else can | 
 | take care of that using ``#regzbot ^introduced``. | 
 |  | 
 | For developers there normally is no extra work involved, they just need to make | 
 | sure to do something that was expected long before regzbot came to light: add | 
 | links to the patch description pointing to all reports about the issue fixed. | 
 |  | 
 | Do I have to use regzbot? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | It's in the interest of everyone if you do, as kernel maintainers like Linus | 
 | Torvalds partly rely on regzbot's tracking in their work -- for example when | 
 | deciding to release a new version or extend the development phase. For this they | 
 | need to be aware of all unfixed regression; to do that, Linus is known to look | 
 | into the weekly reports sent by regzbot. | 
 |  | 
 | Do I have to tell regzbot about every regression I stumble upon? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Ideally yes: we are all humans and easily forget problems when something more | 
 | important unexpectedly comes up -- for example a bigger problem in the Linux | 
 | kernel or something in real life that's keeping us away from keyboards for a | 
 | while. Hence, it's best to tell regzbot about every regression, except when you | 
 | immediately write a fix and commit it to a tree regularly merged to the affected | 
 | kernel series. | 
 |  | 
 | How to see which regressions regzbot tracks currently? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_ | 
 | for the latest info; alternatively, `search for the latest regression report | 
 | <https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_, | 
 | which regzbot normally sends out once a week on Sunday evening (UTC), which is a | 
 | few hours before Linus usually publishes new (pre-)releases. | 
 |  | 
 | What places is regzbot monitoring? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Regzbot is watching the most important Linux mailing lists as well as the git | 
 | repositories of linux-next, mainline, and stable/longterm. | 
 |  | 
 | What kind of issues are supposed to be tracked by regzbot? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | The bot is meant to track regressions, hence please don't involve regzbot for | 
 | regular issues. But it's okay for the Linux kernel's regression tracker if you | 
 | use regzbot to track severe issues, like reports about hangs, corrupted data, | 
 | or internal errors (Panic, Oops, BUG(), warning, ...). | 
 |  | 
 | Can I add regressions found by CI systems to regzbot's tracking? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Feel free to do so, if the particular regression likely has impact on practical | 
 | use cases and thus might be noticed by users; hence, please don't involve | 
 | regzbot for theoretical regressions unlikely to show themselves in real world | 
 | usage. | 
 |  | 
 | How to interact with regzbot? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | By using a 'regzbot command' in a direct or indirect reply to the mail with the | 
 | regression report. These commands need to be in their own paragraph (IOW: they | 
 | need to be separated from the rest of the mail using blank lines). | 
 |  | 
 | One such command is ``#regzbot introduced: <version or commit>``, which makes | 
 | regzbot consider your mail as a regressions report added to the tracking, as | 
 | already described above; ``#regzbot ^introduced: <version or commit>`` is another | 
 | such command, which makes regzbot consider the parent mail as a report for a | 
 | regression which it starts to track. | 
 |  | 
 | Once one of those two commands has been utilized, other regzbot commands can be | 
 | used in direct or indirect replies to the report. You can write them below one | 
 | of the `introduced` commands or in replies to the mail that used one of them | 
 | or itself is a reply to that mail: | 
 |  | 
 |  * Set or update the title:: | 
 |  | 
 |        #regzbot title: foo | 
 |  | 
 |  * Monitor a discussion or bugzilla.kernel.org ticket where additions aspects of | 
 |    the issue or a fix are discussed -- for example the posting of a patch fixing | 
 |    the regression:: | 
 |  | 
 |        #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/ | 
 |  | 
 |    Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot | 
 |    will consider all messages in that thread or ticket as related to the fixing | 
 |    process. | 
 |  | 
 |  * Point to a place with further details of interest, like a mailing list post | 
 |    or a ticket in a bug tracker that are slightly related, but about a different | 
 |    topic:: | 
 |  | 
 |        #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789 | 
 |  | 
 |  * Mark a regression as fixed by a commit that is heading upstream or already | 
 |    landed:: | 
 |  | 
 |        #regzbot fix: 1f2e3d4c5d | 
 |  | 
 |  * Mark a regression as a duplicate of another one already tracked by regzbot:: | 
 |  | 
 |        #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/ | 
 |  | 
 |  * Mark a regression as invalid:: | 
 |  | 
 |        #regzbot invalid: wasn't a regression, problem has always existed | 
 |  | 
 | Is there more to tell about regzbot and its commands? | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | More detailed and up-to-date information about the Linux | 
 | kernel's regression tracking bot can be found on its | 
 | `project page <https://gitlab.com/knurd42/regzbot>`_, which among others | 
 | contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ | 
 | and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_ | 
 | which both cover more details than the above section. | 
 |  | 
 | Quotes from Linus about regression | 
 | ---------------------------------- | 
 |  | 
 | Find below a few real life examples of how Linus Torvalds expects regressions to | 
 | be handled: | 
 |  | 
 |  * From `2017-10-26 (1/2) | 
 |    <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_:: | 
 |  | 
 |        If you break existing user space setups THAT IS A REGRESSION. | 
 |  | 
 |        It's not ok to say "but we'll fix the user space setup". | 
 |  | 
 |        Really. NOT OK. | 
 |  | 
 |        [...] | 
 |  | 
 |        The first rule is: | 
 |  | 
 |         - we don't cause regressions | 
 |  | 
 |        and the corollary is that when regressions *do* occur, we admit to | 
 |        them and fix them, instead of blaming user space. | 
 |  | 
 |        The fact that you have apparently been denying the regression now for | 
 |        three weeks means that I will revert, and I will stop pulling apparmor | 
 |        requests until the people involved understand how kernel development | 
 |        is done. | 
 |  | 
 |  * From `2017-10-26 (2/2) | 
 |    <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_:: | 
 |  | 
 |        People should basically always feel like they can update their kernel | 
 |        and simply not have to worry about it. | 
 |  | 
 |        I refuse to introduce "you can only update the kernel if you also | 
 |        update that other program" kind of limitations. If the kernel used to | 
 |        work for you, the rule is that it continues to work for you. | 
 |  | 
 |        There have been exceptions, but they are few and far between, and they | 
 |        generally have some major and fundamental reasons for having happened, | 
 |        that were basically entirely unavoidable, and people _tried_hard_ to | 
 |        avoid them. Maybe we can't practically support the hardware any more | 
 |        after it is decades old and nobody uses it with modern kernels any | 
 |        more. Maybe there's a serious security issue with how we did things, | 
 |        and people actually depended on that fundamentally broken model. Maybe | 
 |        there was some fundamental other breakage that just _had_ to have a | 
 |        flag day for very core and fundamental reasons. | 
 |  | 
 |        And notice that this is very much about *breaking* peoples environments. | 
 |  | 
 |        Behavioral changes happen, and maybe we don't even support some | 
 |        feature any more. There's a number of fields in /proc/<pid>/stat that | 
 |        are printed out as zeroes, simply because they don't even *exist* in | 
 |        the kernel any more, or because showing them was a mistake (typically | 
 |        an information leak). But the numbers got replaced by zeroes, so that | 
 |        the code that used to parse the fields still works. The user might not | 
 |        see everything they used to see, and so behavior is clearly different, | 
 |        but things still _work_, even if they might no longer show sensitive | 
 |        (or no longer relevant) information. | 
 |  | 
 |        But if something actually breaks, then the change must get fixed or | 
 |        reverted. And it gets fixed in the *kernel*. Not by saying "well, fix | 
 |        your user space then". It was a kernel change that exposed the | 
 |        problem, it needs to be the kernel that corrects for it, because we | 
 |        have a "upgrade in place" model. We don't have a "upgrade with new | 
 |        user space". | 
 |  | 
 |        And I seriously will refuse to take code from people who do not | 
 |        understand and honor this very simple rule. | 
 |  | 
 |        This rule is also not going to change. | 
 |  | 
 |        And yes, I realize that the kernel is "special" in this respect. I'm | 
 |        proud of it. | 
 |  | 
 |        I have seen, and can point to, lots of projects that go "We need to | 
 |        break that use case in order to make progress" or "you relied on | 
 |        undocumented behavior, it sucks to be you" or "there's a better way to | 
 |        do what you want to do, and you have to change to that new better | 
 |        way", and I simply don't think that's acceptable outside of very early | 
 |        alpha releases that have experimental users that know what they signed | 
 |        up for. The kernel hasn't been in that situation for the last two | 
 |        decades. | 
 |  | 
 |        We do API breakage _inside_ the kernel all the time. We will fix | 
 |        internal problems by saying "you now need to do XYZ", but then it's | 
 |        about internal kernel API's, and the people who do that then also | 
 |        obviously have to fix up all the in-kernel users of that API. Nobody | 
 |        can say "I now broke the API you used, and now _you_ need to fix it | 
 |        up". Whoever broke something gets to fix it too. | 
 |  | 
 |        And we simply do not break user space. | 
 |  | 
 |  * From `2020-05-21 | 
 |    <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_:: | 
 |  | 
 |        The rules about regressions have never been about any kind of | 
 |        documented behavior, or where the code lives. | 
 |  | 
 |        The rules about regressions are always about "breaks user workflow". | 
 |  | 
 |        Users are literally the _only_ thing that matters. | 
 |  | 
 |        No amount of "you shouldn't have used this" or "that behavior was | 
 |        undefined, it's your own fault your app broke" or "that used to work | 
 |        simply because of a kernel bug" is at all relevant. | 
 |  | 
 |        Now, reality is never entirely black-and-white. So we've had things | 
 |        like "serious security issue" etc that just forces us to make changes | 
 |        that may break user space. But even then the rule is that we don't | 
 |        really have other options that would allow things to continue. | 
 |  | 
 |        And obviously, if users take years to even notice that something | 
 |        broke, or if we have sane ways to work around the breakage that | 
 |        doesn't make for too much trouble for users (ie "ok, there are a | 
 |        handful of users, and they can use a kernel command line to work | 
 |        around it" kind of things) we've also been a bit less strict. | 
 |  | 
 |        But no, "that was documented to be broken" (whether it's because the | 
 |        code was in staging or because the man-page said something else) is | 
 |        irrelevant. If staging code is so useful that people end up using it, | 
 |        that means that it's basically regular kernel code with a flag saying | 
 |        "please clean this up". | 
 |  | 
 |        The other side of the coin is that people who talk about "API | 
 |        stability" are entirely wrong. API's don't matter either. You can make | 
 |        any changes to an API you like - as long as nobody notices. | 
 |  | 
 |        Again, the regression rule is not about documentation, not about | 
 |        API's, and not about the phase of the moon. | 
 |  | 
 |        It's entirely about "we caused problems for user space that used to work". | 
 |  | 
 |  * From `2017-11-05 | 
 |    <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_:: | 
 |  | 
 |        And our regression rule has never been "behavior doesn't change". | 
 |        That would mean that we could never make any changes at all. | 
 |  | 
 |        For example, we do things like add new error handling etc all the | 
 |        time, which we then sometimes even add tests for in our kselftest | 
 |        directory. | 
 |  | 
 |        So clearly behavior changes all the time and we don't consider that a | 
 |        regression per se. | 
 |  | 
 |        The rule for a regression for the kernel is that some real user | 
 |        workflow breaks. Not some test. Not a "look, I used to be able to do | 
 |        X, now I can't". | 
 |  | 
 |  * From `2018-08-03 | 
 |    <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_:: | 
 |  | 
 |        YOU ARE MISSING THE #1 KERNEL RULE. | 
 |  | 
 |        We do not regress, and we do not regress exactly because your are 100% wrong. | 
 |  | 
 |        And the reason you state for your opinion is in fact exactly *WHY* you | 
 |        are wrong. | 
 |  | 
 |        Your "good reasons" are pure and utter garbage. | 
 |  | 
 |        The whole point of "we do not regress" is so that people can upgrade | 
 |        the kernel and never have to worry about it. | 
 |  | 
 |        > Kernel had a bug which has been fixed | 
 |  | 
 |        That is *ENTIRELY* immaterial. | 
 |  | 
 |        Guys, whether something was buggy or not DOES NOT MATTER. | 
 |  | 
 |        Why? | 
 |  | 
 |        Bugs happen. That's a fact of life. Arguing that "we had to break | 
 |        something because we were fixing a bug" is completely insane. We fix | 
 |        tens of bugs every single day, thinking that "fixing a bug" means that | 
 |        we can break something is simply NOT TRUE. | 
 |  | 
 |        So bugs simply aren't even relevant to the discussion. They happen, | 
 |        they get found, they get fixed, and it has nothing to do with "we | 
 |        break users". | 
 |  | 
 |        Because the only thing that matters IS THE USER. | 
 |  | 
 |        How hard is that to understand? | 
 |  | 
 |        Anybody who uses "but it was buggy" as an argument is entirely missing | 
 |        the point. As far as the USER was concerned, it wasn't buggy - it | 
 |        worked for him/her. | 
 |  | 
 |        Maybe it worked *because* the user had taken the bug into account, | 
 |        maybe it worked because the user didn't notice - again, it doesn't | 
 |        matter. It worked for the user. | 
 |  | 
 |        Breaking a user workflow for a "bug" is absolutely the WORST reason | 
 |        for breakage you can imagine. | 
 |  | 
 |        It's basically saying "I took something that worked, and I broke it, | 
 |        but now it's better". Do you not see how f*cking insane that statement | 
 |        is? | 
 |  | 
 |        And without users, your program is not a program, it's a pointless | 
 |        piece of code that you might as well throw away. | 
 |  | 
 |        Seriously. This is *why* the #1 rule for kernel development is "we | 
 |        don't break users". Because "I fixed a bug" is absolutely NOT AN | 
 |        ARGUMENT if that bug fix broke a user setup. You actually introduced a | 
 |        MUCH BIGGER bug by "fixing" something that the user clearly didn't | 
 |        even care about. | 
 |  | 
 |        And dammit, we upgrade the kernel ALL THE TIME without upgrading any | 
 |        other programs at all. It is absolutely required, because flag-days | 
 |        and dependencies are horribly bad. | 
 |  | 
 |        And it is also required simply because I as a kernel developer do not | 
 |        upgrade random other tools that I don't even care about as I develop | 
 |        the kernel, and I want any of my users to feel safe doing the same | 
 |        time. | 
 |  | 
 |        So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel | 
 |        without upgrading some other random binary, then we have a problem. | 
 |  | 
 |  * From `2021-06-05 | 
 |    <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_:: | 
 |  | 
 |        THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS. | 
 |  | 
 |        Honestly, security people need to understand that "not working" is not | 
 |        a success case of security. It's a failure case. | 
 |  | 
 |        Yes, "not working" may be secure. But security in that case is *pointless*. | 
 |  | 
 |  * From `2011-05-06 (1/3) | 
 |    <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_:: | 
 |  | 
 |        Binary compatibility is more important. | 
 |  | 
 |        And if binaries don't use the interface to parse the format (or just | 
 |        parse it wrongly - see the fairly recent example of adding uuid's to | 
 |        /proc/self/mountinfo), then it's a regression. | 
 |  | 
 |        And regressions get reverted, unless there are security issues or | 
 |        similar that makes us go "Oh Gods, we really have to break things". | 
 |  | 
 |        I don't understand why this simple logic is so hard for some kernel | 
 |        developers to understand. Reality matters. Your personal wishes matter | 
 |        NOT AT ALL. | 
 |  | 
 |        If you made an interface that can be used without parsing the | 
 |        interface description, then we're stuck with the interface. Theory | 
 |        simply doesn't matter. | 
 |  | 
 |        You could help fix the tools, and try to avoid the compatibility | 
 |        issues that way. There aren't that many of them. | 
 |  | 
 |    From `2011-05-06 (2/3) | 
 |    <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_:: | 
 |  | 
 |        it's clearly NOT an internal tracepoint. By definition. It's being | 
 |        used by powertop. | 
 |  | 
 |    From `2011-05-06 (3/3) | 
 |    <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_:: | 
 |  | 
 |        We have programs that use that ABI and thus it's a regression if they break. | 
 |  | 
 |  * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_:: | 
 |  | 
 |        > Now this got me wondering if Debian _unstable_ actually qualifies as a | 
 |        > standard distro userspace. | 
 |  | 
 |        Oh, if the kernel breaks some standard user space, that counts. Tons | 
 |        of people run Debian unstable | 
 |  | 
 |  * From `2019-09-15 | 
 |    <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_:: | 
 |  | 
 |        One _particularly_ last-minute revert is the top-most commit (ignoring | 
 |        the version change itself) done just before the release, and while | 
 |        it's very annoying, it's perhaps also instructive. | 
 |  | 
 |        What's instructive about it is that I reverted a commit that wasn't | 
 |        actually buggy. In fact, it was doing exactly what it set out to do, | 
 |        and did it very well. In fact it did it _so_ well that the much | 
 |        improved IO patterns it caused then ended up revealing a user-visible | 
 |        regression due to a real bug in a completely unrelated area. | 
 |  | 
 |        The actual details of that regression are not the reason I point that | 
 |        revert out as instructive, though. It's more that it's an instructive | 
 |        example of what counts as a regression, and what the whole "no | 
 |        regressions" kernel rule means. The reverted commit didn't change any | 
 |        API's, and it didn't introduce any new bugs. But it ended up exposing | 
 |        another problem, and as such caused a kernel upgrade to fail for a | 
 |        user. So it got reverted. | 
 |  | 
 |        The point here being that we revert based on user-reported _behavior_, | 
 |        not based on some "it changes the ABI" or "it caused a bug" concept. | 
 |        The problem was really pre-existing, and it just didn't happen to | 
 |        trigger before. The better IO patterns introduced by the change just | 
 |        happened to expose an old bug, and people had grown to depend on the | 
 |        previously benign behavior of that old issue. | 
 |  | 
 |        And never fear, we'll re-introduce the fix that improved on the IO | 
 |        patterns once we've decided just how to handle the fact that we had a | 
 |        bad interaction with an interface that people had then just happened | 
 |        to rely on incidental behavior for before. It's just that we'll have | 
 |        to hash through how to do that (there are no less than three different | 
 |        patches by three different developers being discussed, and there might | 
 |        be more coming...). In the meantime, I reverted the thing that exposed | 
 |        the problem to users for this release, even if I hope it will be | 
 |        re-introduced (perhaps even backported as a stable patch) once we have | 
 |        consensus about the issue it exposed. | 
 |  | 
 |        Take-away from the whole thing: it's not about whether you change the | 
 |        kernel-userspace ABI, or fix a bug, or about whether the old code | 
 |        "should never have worked in the first place". It's about whether | 
 |        something breaks existing users' workflow. | 
 |  | 
 |        Anyway, that was my little aside on the whole regression thing.  Since | 
 |        it's that "first rule of kernel programming", I felt it is perhaps | 
 |        worth just bringing it up every once in a while | 
 |  | 
 | .. | 
 |    end-of-content | 
 | .. | 
 |    This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top | 
 |    of the file. If you want to distribute this text under CC-BY-4.0 only, | 
 |    please use "The Linux kernel developers" for author attribution and link | 
 |    this as source: | 
 |    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst | 
 | .. | 
 |    Note: Only the content of this RST file as found in the Linux kernel sources | 
 |    is available under CC-BY-4.0, as versions of this text that were processed | 
 |    (for example by the kernel's build system) might contain content taken from | 
 |    files which use a more restrictive license. |