net: track locally triggered link loss

A comment above carrier_up_count / carrier_down_count in netdevice.h
proudly states:

	/* Stats to monitor link on/off, flapping */

In reality datacenter NIC drivers introduce quite a bit of noise
into those statistics, making them less than ideal for link flap
detection.

There are 3 types of events counted as carrier changes today:
  (1) reconfiguration which requires pausing Tx and Rx but doesn't
      actually result in a link down for the remote end;
  (2) reconfiguration events which do take the link down;
 (3a) real PHY-detected link loss due to remote end's actions;
 (3b) real PHY-detected link loss due to signal integrity issues.

(3a and 3b are indistinguishable to the local end, so they are
counted as one.)

Reconfigurations of type (1) are when drivers call netif_carrier_off()
/ netif_carrier_on() around changes to queues, IRQs, time stamping,
XDP enablement etc. In DC scenarios machine provisioning or
reallocation causes multiple settings to be changed in close
succession. This looks like a spike in link flaps to monitoring
systems.
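
To illustrate the noise from type (1) events, here is a minimal
userspace model, not kernel code. It assumes only that a down / up
counter is bumped on each carrier state transition; the struct and
all model_* names are hypothetical and exist just for this sketch.

```c
#include <stdbool.h>

/* Userspace model only; every model_* name is hypothetical. */
struct model_netdev {
	bool carrier_ok;
	unsigned int carrier_up_count;
	unsigned int carrier_down_count;
};

/* Count a "down" only on an actual on -> off transition. */
static void model_carrier_off(struct model_netdev *dev)
{
	if (!dev->carrier_ok)
		return;
	dev->carrier_ok = false;
	dev->carrier_down_count++;
}

/* Count an "up" only on an actual off -> on transition. */
static void model_carrier_on(struct model_netdev *dev)
{
	if (dev->carrier_ok)
		return;
	dev->carrier_ok = true;
	dev->carrier_up_count++;
}

/* Type (1) reconfiguration: pause Tx/Rx, change a setting, resume. */
static void model_reconfigure(struct model_netdev *dev)
{
	model_carrier_off(dev);
	/* a queue / IRQ / time stamping / XDP change would go here */
	model_carrier_on(dev);
}

/*
 * Provisioning changes several settings in close succession.
 * Returns how many "down" events the monitoring system would see.
 */
static unsigned int model_provision(struct model_netdev *dev,
				    unsigned int nr_changes)
{
	unsigned int before = dev->carrier_down_count;
	unsigned int i;

	for (i = 0; i < nr_changes; i++)
		model_reconfigure(dev);

	return dev->carrier_down_count - before;
}
```

In this model, provisioning with four setting changes reports four
carrier down events even though the remote end never saw the link
drop.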

Suppressing the fake carrier changes while maintaining the Rx/Tx
pause behavior seems hard, and can introduce a divergence in what
the kernel thinks about the device (paused) vs what user space
thinks (running).

Another option would be to expose a link loss statistic which
some devices (FW) already maintain. Unfortunately, such stats
are not very common (unless my grepping skills fail me).

Instead this patch tries to expose a new event count - number
of locally caused link changes. Only "down" events are counted
because the "up" events are not really in our control.
The "real" link flaps can be obtained by subtracting the new
counter from carrier_down_count.

In terms of API - drivers can use netif_carrier_admin_off()
when taking the link down. There's also an API for situations
where the driver requests a link reset but expects the down / up
reporting to come through the usual, async path.
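
The intended accounting can be sketched with another userspace
model, not the patch itself. Assumptions: the new counter is called
carrier_down_local here purely for illustration, and a locally
triggered down event bumps both it and carrier_down_count (which is
what makes the subtraction work). All model_* names are hypothetical,
and the async "expected reset" path is not modeled.

```c
#include <stdbool.h>

/* Userspace model only; field and function names are assumptions. */
struct model_link_stats {
	bool carrier_ok;
	unsigned int carrier_down_count; /* all down events */
	unsigned int carrier_down_local; /* locally triggered subset */
};

/* PHY-reported link loss, type (3a) / (3b). */
static void model_carrier_off(struct model_link_stats *st)
{
	if (!st->carrier_ok)
		return;
	st->carrier_ok = false;
	st->carrier_down_count++;
}

/*
 * Stand-in for a driver calling netif_carrier_admin_off():
 * the event is counted as a down event and also as a local one.
 */
static void model_carrier_admin_off(struct model_link_stats *st)
{
	if (!st->carrier_ok)
		return;
	st->carrier_ok = false;
	st->carrier_down_count++;
	st->carrier_down_local++;
}

/* "Up" events are not attributed to either side. */
static void model_carrier_on(struct model_link_stats *st)
{
	st->carrier_ok = true;
}

/* Down events which were not caused by the local end. */
static unsigned int model_real_link_flaps(const struct model_link_stats *st)
{
	return st->carrier_down_count - st->carrier_down_local;
}
```

With two driver-initiated down events followed by one PHY-detected
loss, the model reports three down events in total, of which one is
a "real" flap.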

It may be worth pointing out that in case of datacenter NICs,
even with the new statistic, we will not be able to distinguish
between events (1) and (2). As a consequence, two Linux boxes
connected back-to-back won't be able to isolate events of type (3b):
the remote end cannot tell whether a local reconfiguration actually
took the link down. I think that to solve that we'd need yet another
counter - carrier_down_local_link_was_really_reset... I think it's
okay to leave that as a future extension.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 files changed