From foo@baz Fri Jan 22 01:21:57 PM CET 2021
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 13 Jan 2021 08:18:19 -0800
Subject: net: avoid 32 x truesize under-estimation for tiny skbs

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 3226b158e67cfaa677fd180152bfb28989cb2fac ]

Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head.

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under-estimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page on x86), or even 64 on PowerPC.

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2].

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768.

This patch makes sure that small skb heads are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb().

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page).

I would like to thank Greg Thelen for his precious help on this matter;
analysing crash dumps is always a time-consuming task.

Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/core/skbuff.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -459,13 +459,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
				 gfp_t gfp_mask)
 {
-	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc;
	struct sk_buff *skb;
	void *data;

	len += NET_SKB_PAD + NET_IP_ALIGN;

-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+	/* If requested length is either too small or too big,
+	 * we use kmalloc() for skb->head allocation.
+	 */
+	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
		if (!skb)
@@ -473,6 +477,7 @@ struct sk_buff *__napi_alloc_skb(struct
		goto skb_success;
	}

+	nc = this_cpu_ptr(&napi_alloc_cache);
	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	len = SKB_DATA_ALIGN(len);