releases/4.0.6/net-sched-fix-call_rcu-race-on-classifier-module-unloads.patch (pub/scm/linux/kernel/git/stable/stable-queue)

 From foo@baz Sat Jun 13 09:48:35 PDT 2015
 From: Daniel Borkmann <daniel@iogearbox.net>
 Date: Wed, 20 May 2015 17:13:33 +0200
 Subject: net: sched: fix call_rcu() race on classifier module unloads

 From: Daniel Borkmann <daniel@iogearbox.net>

 [ Upstream commit c78e1746d3ad7d548bdf3fe491898cc453911a49 ]

 Vijay reported that a loop as simple as ...

   while true; do
     tc qdisc add dev foo root handle 1: prio
     tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
     tc qdisc del dev foo root
     rmmod cls_u32
   done

 ... will panic the kernel. Moreover, he bisected the change
 apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
 locking to netlink_lookup() and seq walker").

 The removal of synchronize_net() from the netlink socket
 triggering the qdisc removal seems to have uncovered an RCU
 and module reference count race in the tc API. Given that
 the RCU conversion was done after e341694e3eb5 ("netlink:
 Convert netlink_lookup() to use RCU protected hash table"),
 which originally added the synchronize_net(), hitting the
 bug was less likely before (though not impossible):

 When qdiscs that i) support attaching classifiers and
 ii) have at least one of them attached get deleted, they
 invoke tcf_destroy_chain(), and thus call into the ->destroy()
 handler of a classifier module.

 After the RCU conversion, all classifiers that have an internal
 prio list unlink its entries and initiate freeing via call_rcu()
 deferral.

 Meanwhile, tcf_destroy() already releases its reference to the
 tp->ops->owner module before the queued RCU callback handler
 has been invoked.

 A subsequent rmmod of the classifier module is then not prevented,
 since all module references have already been dropped.

 By the time the kernel invokes the RCU callback handler from
 the module, that function address is invalid.

 One way to fix it would be to add an rcu_barrier() to
 unregister_tcf_proto_ops() to wait for all pending call_rcu()s
 to complete.

 synchronize_rcu() is not appropriate as under heavy RCU
 callback load, registered call_rcu()s could be deferred
 longer than a grace period. In case we don't have any pending
 call_rcu()s, the barrier is allowed to return immediately.
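
 The ordering problem described above can be illustrated with a
 small userspace model. This is only a sketch: every model_* name
 below is hypothetical, the callback queue is a plain array, and
 none of it reflects how the kernel implements call_rcu() or
 rcu_barrier(). It only shows why draining pending callbacks
 before the module goes away avoids the stale call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; all model_* names are hypothetical. */
#define MODEL_MAX_PENDING 16

typedef void (*model_cb_t)(void *arg);

struct model_module {
	bool loaded;      /* false once the modelled rmmod has completed */
	int freed;        /* callbacks that ran while the module was loaded */
	int stale_calls;  /* callbacks that ran after unload (the bug) */
};

struct model_pending {
	model_cb_t func;
	void *arg;
};

static struct model_pending pending[MODEL_MAX_PENDING];
static size_t npending;

/* Stand-in for call_rcu(): queue the callback, do not run it yet. */
static void model_call_rcu(model_cb_t func, void *arg)
{
	assert(npending < MODEL_MAX_PENDING);
	pending[npending].func = func;
	pending[npending].arg = arg;
	npending++;
}

/* Stand-in for rcu_barrier(): run every callback queued so far. */
static void model_rcu_barrier(void)
{
	for (size_t i = 0; i < npending; i++)
		pending[i].func(pending[i].arg);
	npending = 0;
}

/* Models a callback whose code lives in the classifier module. */
static void model_free_cb(void *arg)
{
	struct model_module *m = arg;

	if (m->loaded)
		m->freed++;
	else
		m->stale_calls++; /* in the kernel: jump to an unmapped address */
}

/* Models ->destroy(): unlink, then defer the free via call_rcu(). */
static void model_destroy(struct model_module *m)
{
	model_call_rcu(model_free_cb, m);
}

/* Models unregister_tcf_proto_ops() followed by the module unload. */
static void model_unregister(struct model_module *m, bool use_barrier)
{
	if (use_barrier)
		model_rcu_barrier();
	m->loaded = false;
}

/* Runs one destroy/unload cycle and returns the number of stale calls. */
int model_run(bool use_barrier)
{
	struct model_module m = { .loaded = true };

	npending = 0;
	model_destroy(&m);
	model_unregister(&m, use_barrier);
	model_rcu_barrier(); /* deferred callbacks finally get processed */
	return m.stale_calls;
}
```

 In this model, model_run(false) reports one callback that ran
 after the unload, while model_run(true) reports none, because the
 barrier runs the queued callback while the module is still loaded.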

 Since we came here via unregister_tcf_proto_ops(), there
 are no users of a given classifier anymore. Further nested
 call_rcu()s pointing into the module space are not being
 done anywhere.

 Only cls_bpf_delete_prog() may schedule a work item, to
 unlock pages eventually, but that is not in the range/context
 of cls_bpf anymore.

 Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
 Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
 Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 Cc: John Fastabend <john.r.fastabend@intel.com>
 Cc: Eric Dumazet <edumazet@google.com>
 Cc: Thomas Graf <tgraf@suug.ch>
 Cc: Jamal Hadi Salim <jhs@mojatatu.com>
 Cc: Alexei Starovoitov <ast@plumgrid.com>
 Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
 Acked-by: Alexei Starovoitov <ast@plumgrid.com>
 Acked-by: Eric Dumazet <edumazet@google.com>
 Signed-off-by: David S. Miller <davem@davemloft.net>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 ---
  net/sched/cls_api.c |    5 +++++
  1 file changed, 5 insertions(+)

 --- a/net/sched/cls_api.c
 +++ b/net/sched/cls_api.c
 @@ -81,6 +81,11 @@ int unregister_tcf_proto_ops(struct tcf_
  	struct tcf_proto_ops *t;
  	int rc = -ENOENT;

 +	/* Wait for outstanding call_rcu()s, if any, from a
 +	 * tcf_proto_ops's destroy() handler.
 +	 */
 +	rcu_barrier();
 +
  	write_lock(&cls_mod_lock);
  	list_for_each_entry(t, &tcf_proto_base, head) {
  		if (t == ops) {