| From 28e1581c3b4ea5f98530064a103c6217bedeea73 Mon Sep 17 00:00:00 2001 |
| From: Ilya Dryomov <idryomov@gmail.com> |
| Date: Wed, 7 Oct 2020 20:06:48 +0200 |
| Subject: libceph: clear con->out_msg on Policy::stateful_server faults |
| |
| From: Ilya Dryomov <idryomov@gmail.com> |
| |
| commit 28e1581c3b4ea5f98530064a103c6217bedeea73 upstream. |
| |
| con->out_msg must be cleared on Policy::stateful_server |
| (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the |
| reconnection attempt, because after writing the banner the |
| messenger moves on to writing the data section of that message |
| (either from where it got interrupted by the connection reset or |
| from the beginning) instead of writing struct ceph_msg_connect. |
| This results in a bizarre error message because the server |
| sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct |
| ceph_msg_connect: |
| |
| libceph: mds0 (1)172.21.15.45:6828 socket error on write |
| ceph: mds0 reconnect start |
| libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN) |
| libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32 |
| libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch |
| |
| AFAICT this bug goes back to the dawn of the kernel client. |
| The reason it survived for so long is that only MDS sessions |
| are stateful and only two MDS messages have a data section: |
| CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare) |
| and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved). |
| The connection has to get reset precisely when such message |
| is being sent -- in this case it was the former. |
| |
| Cc: stable@vger.kernel.org |
| Link: https://tracker.ceph.com/issues/47723 |
| Signed-off-by: Ilya Dryomov <idryomov@gmail.com> |
| Reviewed-by: Jeff Layton <jlayton@kernel.org> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| |
| --- |
| net/ceph/messenger.c | 5 +++++ |
| 1 file changed, 5 insertions(+) |
| |
| --- a/net/ceph/messenger.c |
| +++ b/net/ceph/messenger.c |
| @@ -2998,6 +2998,11 @@ static void con_fault(struct ceph_connec |
| ceph_msg_put(con->in_msg); |
| con->in_msg = NULL; |
| } |
| + if (con->out_msg) { |
| + BUG_ON(con->out_msg->con != con); |
| + ceph_msg_put(con->out_msg); |
| + con->out_msg = NULL; |
| + } |
| |
| /* Requeue anything that hasn't been acked */ |
| list_splice_init(&con->out_sent, &con->out_queue); |