| From foo@baz Sat Jun 13 09:48:35 PDT 2015 |
| From: Daniel Borkmann <daniel@iogearbox.net> |
| Date: Wed, 20 May 2015 17:13:33 +0200 |
| Subject: net: sched: fix call_rcu() race on classifier module unloads |
| |
| From: Daniel Borkmann <daniel@iogearbox.net> |
| |
| [ Upstream commit c78e1746d3ad7d548bdf3fe491898cc453911a49 ] |
| |
| Vijay reported that a loop as simple as ... |
| |
| while true; do |
| tc qdisc add dev foo root handle 1: prio |
| tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1 |
| tc qdisc del dev foo root |
| rmmod cls_u32 |
| done |
| |
| ... will panic the kernel. Moreover, he bisected the change |
| apparently introducing it to 78fd1d0ab072 ("netlink: Re-add |
| locking to netlink_lookup() and seq walker"). |
| |
| The removal of synchronize_net() from the netlink socket |
| triggering the qdisc to be removed, seems to have uncovered |
| an RCU resp. module reference count race from the tc API. |
| Given that RCU conversion was done after e341694e3eb5 ("netlink: |
| Convert netlink_lookup() to use RCU protected hash table") |
| which added the synchronize_net() originally, occasion of |
| hitting the bug was less likely (not impossible though): |
| |
| When qdiscs that i) support attaching classifiers and, |
| ii) have at least one of them attached, get deleted, they |
| invoke tcf_destroy_chain(), and thus call into ->destroy() |
| handler from a classifier module. |
| |
| After RCU conversion, all classifier that have an internal |
| prio list, unlink them and initiate freeing via call_rcu() |
| deferral. |
| |
| Meanhile, tcf_destroy() releases already reference to the |
| tp->ops->owner module before the queued RCU callback handler |
| has been invoked. |
| |
| Subsequent rmmod on the classifier module is then not prevented |
| since all module references are already dropped. |
| |
| By the time, the kernel invokes the RCU callback handler from |
| the module, that function address is then invalid. |
| |
| One way to fix it would be to add an rcu_barrier() to |
| unregister_tcf_proto_ops() to wait for all pending call_rcu()s |
| to complete. |
| |
| synchronize_rcu() is not appropriate as under heavy RCU |
| callback load, registered call_rcu()s could be deferred |
| longer than a grace period. In case we don't have any pending |
| call_rcu()s, the barrier is allowed to return immediately. |
| |
| Since we came here via unregister_tcf_proto_ops(), there |
| are no users of a given classifier anymore. Further nested |
| call_rcu()s pointing into the module space are not being |
| done anywhere. |
| |
| Only cls_bpf_delete_prog() may schedule a work item, to |
| unlock pages eventually, but that is not in the range/context |
| of cls_bpf anymore. |
| |
| Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto") |
| Fixes: 9888faefe132 ("net: sched: cls_basic use RCU") |
| Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> |
| Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
| Cc: John Fastabend <john.r.fastabend@intel.com> |
| Cc: Eric Dumazet <edumazet@google.com> |
| Cc: Thomas Graf <tgraf@suug.ch> |
| Cc: Jamal Hadi Salim <jhs@mojatatu.com> |
| Cc: Alexei Starovoitov <ast@plumgrid.com> |
| Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> |
| Acked-by: Alexei Starovoitov <ast@plumgrid.com> |
| Acked-by: Eric Dumazet <edumazet@google.com> |
| Signed-off-by: David S. Miller <davem@davemloft.net> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| --- |
| net/sched/cls_api.c | 5 +++++ |
| 1 file changed, 5 insertions(+) |
| |
| --- a/net/sched/cls_api.c |
| +++ b/net/sched/cls_api.c |
| @@ -81,6 +81,11 @@ int unregister_tcf_proto_ops(struct tcf_ |
| struct tcf_proto_ops *t; |
| int rc = -ENOENT; |
| |
| + /* Wait for outstanding call_rcu()s, if any, from a |
| + * tcf_proto_ops's destroy() handler. |
| + */ |
| + rcu_barrier(); |
| + |
| write_lock(&cls_mod_lock); |
| list_for_each_entry(t, &tcf_proto_base, head) { |
| if (t == ops) { |