| From fca5e672d3deb79cab6a46cda19bde41eac65bf3 Mon Sep 17 00:00:00 2001 |
| From: Sasha Levin <sashal@kernel.org> |
| Date: Wed, 16 Jun 2021 17:09:51 -0700 |
| Subject: bpf: Do not change gso_size during bpf_skb_change_proto() |
| MIME-Version: 1.0 |
| Content-Type: text/plain; charset=UTF-8 |
| Content-Transfer-Encoding: 8bit |
| |
| From: Maciej Żenczykowski <maze@google.com> |
| |
| [ Upstream commit 364745fbe981a4370f50274475da4675661104df ] |
| |
| This is technically a backwards incompatible change in behaviour, but I'm |
| going to argue that it is very unlikely to break things, and likely to fix |
| *far* more then it breaks. |
| |
| In no particular order, various reasons follow: |
| |
| (a) I've long had a bug assigned to myself to debug a super rare kernel crash |
| on Android Pixel phones which can (per stacktrace) be traced back to BPF clat |
| IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later |
| on during transmit deep in the GSO engine, AFAICT precisely because of this |
| change to gso_size, though I've never been able to manually reproduce it. I |
| believe it may be related to the particular network offload support of attached |
| USB ethernet dongle being used for tethering off of an IPv6-only cellular |
| connection. The reason might be we end up with more segments than max permitted, |
| or with a GSO packet with only one segment... (either way we break some |
| assumption and hit a BUG_ON) |
| |
| (b) There is no check that the gso_size is > 20 when reducing it by 20, so we |
| might end up with a negative (or underflowing) gso_size or a gso_size of 0. |
| This can't possibly be good. Indeed this is probably somehow exploitable (or |
| at least can result in a kernel crash) by delivering crafted packets and perhaps |
| triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) |
| is related to MTU, but not directly derived from it: gso_size/MSS may be |
| significantly smaller then one would get by deriving from local MTU. And on |
| some NICs (which do loose MTU checking on receive, it may even potentially be |
| larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and |
| sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just |
| going from 21 to 1 is potentially problematic because it increases the number |
| of segments by a factor of 21 (think DoS, or some other crash due to too many |
| segments). |
| |
| (c) It's always safe to not increase the gso_size, because it doesn't result in |
| the max packet size increasing. So the skb_increase_gso_size() call was always |
| unnecessary for correctness (and outright undesirable, see later). As such the |
| only part which is potentially dangerous (ie. could cause backwards compatibility |
| issues) is the removal of the skb_decrease_gso_size() call. |
| |
| (d) If the packets are ultimately destined to the local device, then there is |
| absolutely no benefit to playing around with gso_size. It only matters if the |
| packets will egress the device. ie. we're either forwarding, or transmitting |
| from the device. |
| |
| (e) This logic only triggers for packets which are GSO. It does not trigger for |
| skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a |
| GSO packet (and you don't even know what the MTU is, so you can't even fix it). |
| As such your transmit path must *already* be able to handle an MTU 20 bytes |
| larger then your receive path (for IPv4 to IPv6 translation) - and indeed 28 |
| bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() |
| call doesn't actually increase the size of the packets your transmit side must |
| be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/ |
| route MTUs must already be set correctly. Since for example with an IPv4 egress |
| MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, |
| so you need a 1520 byte device MTU. This means if your IPv6 device's egress |
| MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the |
| need to handle fragments). This is to handle normal non-GSO packets. Thus the |
| reduction is simply not needed for GSO packets, because when they're correctly |
| built, they will already be the right size. |
| |
| (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP |
| segments) should not be modified, so that TCP's MSS counting works correctly |
| (this matters for congestion control). If protocol conversion changes the |
| gso_size, then the number of TCP segments may increase or decrease. Packet loss |
| after protocol conversion can result in partial loss of MSS segments that the |
| sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs |
| in the middle of the segments it sent? |
| |
| (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS |
| case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee |
| that gso_size (and thus resulting packet MTU) is changed. ie. you must assume |
| it won't be changed. |
| |
| (h) changing gso_size is outright buggy for UDP GSO packets, where framing |
| matters (I believe that's also the case for SCTP, but it's already excluded |
| by [g]). So the only remaining case is TCP, which also doesn't want it |
| (see [f]). |
| |
| (i) see also the reasoning on the previous attempt at fixing this |
| (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows that the current |
| behaviour causes TCP packet loss: |
| |
| In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the |
| coalesced packet payload can be > MSS, but < MSS + 20. |
| |
| bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload |
| length. After then tcp_gso_segment checks for the payload length if it |
| is <= MSS. The condition is causing the packet to be dropped. |
| |
| tcp_gso_segment(): |
| [...] |
| mss = skb_shinfo(skb)->gso_size; |
| if (unlikely(skb->len <= mss)) goto out; |
| [...] |
| |
| Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary |
| and buggy, and decreasing can go negative. |
| |
| Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") |
| Signed-off-by: Maciej Żenczykowski <maze@google.com> |
| Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
| Cc: Dongseok Yi <dseok.yi@samsung.com> |
| Cc: Willem de Bruijn <willemb@google.com> |
| Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com |
| Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com |
| Signed-off-by: Sasha Levin <sashal@kernel.org> |
| --- |
| net/core/filter.c | 4 ---- |
| 1 file changed, 4 deletions(-) |
| |
| diff --git a/net/core/filter.c b/net/core/filter.c |
| index ef6bdbb63ecb..7ea752af7894 100644 |
| --- a/net/core/filter.c |
| +++ b/net/core/filter.c |
| @@ -3266,8 +3266,6 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb) |
| shinfo->gso_type |= SKB_GSO_TCPV6; |
| } |
| |
| - /* Due to IPv6 header, MSS needs to be downgraded. */ |
| - skb_decrease_gso_size(shinfo, len_diff); |
| /* Header must be checked, and gso_segs recomputed. */ |
| shinfo->gso_type |= SKB_GSO_DODGY; |
| shinfo->gso_segs = 0; |
| @@ -3307,8 +3305,6 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb) |
| shinfo->gso_type |= SKB_GSO_TCPV4; |
| } |
| |
| - /* Due to IPv4 header, MSS can be upgraded. */ |
| - skb_increase_gso_size(shinfo, len_diff); |
| /* Header must be checked, and gso_segs recomputed. */ |
| shinfo->gso_type |= SKB_GSO_DODGY; |
| shinfo->gso_segs = 0; |
| -- |
| 2.30.2 |
| |