summaryrefslogtreecommitdiffstats
path: root/net/ceph/messenger.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* libceph, rbd, ceph: convert to use the new mount APIDavid Howells2019-11-271-2/+0
| | | | | | | | | | | | | | | | Convert the ceph filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. [ Numerous string handling, leak and regression fixes; rbd conversion was particularly broken and had to be redone almost from scratch. ] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: add function that reset client's entity addrYan, Zheng2019-09-161-0/+6
| | | | | | | | | This function also re-open connections to OSD/MON, and re-send in-flight OSD requests after re-opening connections to OSD. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2019-07-181-6/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "Lots of exciting things this time! - support for rbd object-map and fast-diff features (myself). This will speed up reads, discards and things like snap diffs on sparse images. - ceph.snap.btime vxattr to expose snapshot creation time (David Disseldorp). This will be used to integrate with "Restore Previous Versions" feature added in Windows 7 for folks who reexport ceph through SMB. - security xattrs for ceph (Zheng Yan). Only selinux is supported for now due to the limitations of ->dentry_init_security(). - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff Layton). This is actually a single feature bit which was missing because of the filesystem pieces. With this in, the kernel client will finally be reported as "luminous" by "ceph features" -- it is still being reported as "jewel" even though all required Luminous features were implemented in 4.13. - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention with xattrs is to not terminate and this was causing inconsistencies with ceph-fuse. - change filesystem time granularity from 1 us to 1 ns, again fixing an inconsistency with ceph-fuse (Luis Henriques). On top of this there are some additional dentry name handling and cap flushing fixes from Zheng. Finally, Jeff is formally taking over for Zheng as the filesystem maintainer" * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits) ceph: fix end offset in truncate_inode_pages_range call ceph: use generic_delete_inode() for ->drop_inode ceph: use ceph_evict_inode to cleanup inode's resource ceph: initialize superblock s_time_gran to 1 MAINTAINERS: take over for Zheng as CephFS kernel client maintainer rbd: setallochint only if object doesn't exist rbd: support for object-map and fast-diff rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() libceph: export osd_req_op_data() macro libceph: change ceph_osdc_call() to take page vector for response libceph: bump CEPH_MSG_MAX_DATA_LEN (again) rbd: new exclusive lock wait/wake code rbd: quiescing lock should wait for image requests rbd: lock should be quiesced on reacquire rbd: introduce copyup state machine rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() rbd: move OSD request allocation into object request state machines rbd: factor out __rbd_osd_setup_discard_ops() rbd: factor out rbd_osd_setup_copyup() rbd: introduce obj_req->osd_reqs list ...
| * libceph: rename ceph_encode_addr to ceph_encode_banner_addrJeff Layton2019-07-081-3/+3
| | | | | | | | | | | | | | | | | | ...ditto for the decode function. We only use these functions to fix up banner addresses now, so let's name them more appropriately. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONEJeff Layton2019-07-081-2/+5
| | | | | | | | | | | | | | | | | | | | | | Going forward, we'll have different address types so let's use the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE. Also, make ceph_pr_addr print the address type value as well. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: fix sa_family just after reading addressJeff Layton2019-07-081-3/+2
| | | | | | | | | | | | | | | | It doesn't make sense to leave it undecoded until later. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | keys: Pass the network namespace into request_key mechanismDavid Howells2019-06-281-1/+2
|/ | | | | | | | | | | | Create a request_key_net() function and use it to pass the network namespace domain tag into DNS revolver keys and rxrpc/AFS keys so that keys for different domains can coexist in the same keyring. Signed-off-by: David Howells <dhowells@redhat.com> cc: netdev@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-afs@lists.infradead.org
* Merge tag 'afs-fixes-20190516' of ↵Linus Torvalds2019-05-171-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull misc AFS fixes from David Howells: "This fixes a set of miscellaneous issues in the afs filesystem, including: - leak of keys on file close. - broken error handling in xattr functions. - missing locking when updating VL server list. - volume location server DNS lookup whereby preloaded cells may not ever get a lookup and regular DNS lookups to maintain server lists consume power unnecessarily. - incorrect error propagation and handling in the fileserver iteration code causes operations to sometimes apparently succeed. - interruption of server record check/update side op during fileserver iteration causes uninterruptible main operations to fail unexpectedly. - callback promise expiry time miscalculation. - over invalidation of the callback promise on directories. - double locking on callback break waking up file locking waiters. - double increment of the vnode callback break counter. Note that it makes some changes outside of the afs code, including: - an extra parameter to dns_query() to allow the dns_resolver key just accessed to be immediately invalidated. AFS is caching the results itself, so the key can be discarded. - an interruptible version of wait_var_event(). - an rxrpc function to allow the maximum lifespan to be set on a call. - a way for an rxrpc call to be marked as non-interruptible" * tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix double inc of vnode->cb_break afs: Fix lock-wait/callback-break double locking afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set afs: Fix calculation of callback expiry time afs: Make dynamic root population wait uninterruptibly for proc_cells_lock afs: Make some RPC operations non-interruptible rxrpc: Allow the kernel to mark a call as being non-interruptible afs: Fix error propagation from server record check/update afs: Fix the maximum lifespan of VL and probe calls rxrpc: Provide kernel interface to set max lifespan on a call afs: Fix "kAFS: AFS vnode with undefined type 0" afs: Fix cell DNS lookup Add wait_var_event_interruptible() dns_resolver: Allow used keys to be invalidated afs: Fix afs_cell records to always have a VL server list record afs: Fix missing lock when replacing VL server list afs: Fix afs_xattr_get_yfs() to not try freeing an error value afs: Fix incorrect error handling in afs_xattr_get_acl() afs: Fix key leak in afs_release() and afs_evict_inode()
| * dns_resolver: Allow used keys to be invalidatedDavid Howells2019-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Allow used DNS resolver keys to be invalidated after use if the caller is doing its own caching of the results. This reduces the amount of resources required. Fix AFS to invalidate DNS results to kill off permanent failure records that get lodged in the resolver keyring and prevent future lookups from happening. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
* | libceph: make ceph_pr_addr take an struct ceph_entity_addr pointerJeff Layton2019-05-071-24/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC9 is throwing a lot of warnings about unaligned accesses by callers of ceph_pr_addr. All of the current callers are passing a pointer to the sockaddr inside struct ceph_entity_addr. Fix it to take a pointer to a struct ceph_entity_addr instead, and then have the function make a copy of the sockaddr before printing it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | libceph: fix unaligned accesses in ceph_entity_addr handlingJeff Layton2019-05-071-40/+37
|/ | | | | | | | | | | | GCC9 is throwing a lot of warnings about unaligned access. This patch fixes some of them by changing most of the sockaddr handling functions to take a pointer to struct ceph_entity_addr instead of struct sockaddr_storage. The lower functions can then make copies or do unaligned accesses as needed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: fix breakage caused by multipage bvecsIlya Dryomov2019-03-251-2/+6
| | | | | | | | | | | | | | | A bvec can now consist of multiple physically contiguous pages. This means that bvec_iter_advance() can move to a different page while staying in the same bvec (i.e. ->bi_bvec_done != 0). The messenger works in terms of segments which can now be defined as the smaller of a bvec and a page. The "more bytes to process in this segment" condition holds only if bvec_iter_advance() leaves us in the same bvec _and_ in the same page. On next bvec (possibly in the same page) and on next page (possibly in the same bvec) we may need to set ->last_piece. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: handle an empty authorize replyIlya Dryomov2019-02-181-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | The authorize reply can be empty, for example when the ticket used to build the authorizer is too old and TAG_BADAUTHORIZER is returned from the service. Calling ->verify_authorizer_reply() results in an attempt to decrypt and validate (somewhat) random data in au->buf (most likely the signature block from calc_signature()), which fails and ends up in con_fault_finish() with !con->auth_retry. The ticket isn't invalidated and the connection is retried again and again until a new ticket is obtained from the monitor: libceph: osd2 192.168.122.1:6809 bad authorize reply libceph: osd2 192.168.122.1:6809 bad authorize reply libceph: osd2 192.168.122.1:6809 bad authorize reply libceph: osd2 192.168.122.1:6809 bad authorize reply Let TAG_BADAUTHORIZER handler kick in and increment con->auth_retry. Cc: stable@vger.kernel.org Fixes: 5c056fdc5b47 ("libceph: verify authorize reply on connect") Link: https://tracker.ceph.com/issues/20164 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: avoid KEEPALIVE_PENDING races in ceph_con_keepalive()Ilya Dryomov2019-01-211-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | con_fault() can transition the connection into STANDBY right after ceph_con_keepalive() clears STANDBY in clear_standby(): libceph user thread ceph-msgr worker ceph_con_keepalive() mutex_lock(&con->mutex) clear_standby(con) mutex_unlock(&con->mutex) mutex_lock(&con->mutex) con_fault() ... if KEEPALIVE_PENDING isn't set set state to STANDBY ... mutex_unlock(&con->mutex) set KEEPALIVE_PENDING set WRITE_PENDING This triggers warnings in clear_standby() when either ceph_con_send() or ceph_con_keepalive() get to clearing STANDBY next time. I don't see a reason to condition queue_con() call on the previous value of KEEPALIVE_PENDING, so move the setting of KEEPALIVE_PENDING into the critical section -- unlike WRITE_PENDING, KEEPALIVE_PENDING could have been a non-atomic flag. Reported-by: syzbot+acdeb633f6211ccdf886@syzkaller.appspotmail.com Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Myungho Jung <mhjungk@gmail.com>
* libceph: switch more to bool in ceph_tcp_sendmsg()Ilya Dryomov2018-12-261-1/+1
| | | | | | Unlike in ceph_tcp_sendpage(), it's a bool. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: use MSG_SENDPAGE_NOTLAST with ceph_tcp_sendpage()Ilya Dryomov2018-12-261-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent do_tcp_sendpages() from calling tcp_push() (at least) once per page. Instead, arrange for tcp_push() to be called (at least) once per data payload. This results in more MSS-sized packets and fewer packets overall (5-10% reduction in my tests with typical OSD request sizes). See commits 2f5338442425 ("tcp: allow splice() to build full TSO packets"), 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push() once") and ae62ca7b0321 ("tcp: fix MSG_SENDPAGE_NOTLAST logic") for details. Here is an example of a packet size histogram for 128K OSD requests (MSS = 1448, top 5): Before: SIZE COUNT 1448 777700 952 127915 1200 39238 1219 9806 21 5675 After: SIZE COUNT 1448 897280 21 6201 1019 2797 643 2739 376 2479 We could do slightly better by explicitly corking the socket but it's not clear it's worth it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: use sock_no_sendpage() as a fallback in ceph_tcp_sendpage()Ilya Dryomov2018-12-261-26/+6
| | | | | | | | | | sock_no_sendpage() makes the code cleaner. Also, don't set MSG_EOR. sendpage doesn't act on MSG_EOR on its own, it just honors the setting from the preceding sendmsg call by looking at ->eor in tcp_skb_can_collapse_to(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: drop last_piece logic from write_partial_message_data()Ilya Dryomov2018-12-261-5/+3
| | | | | | | | | | | | | | | | | | | | | | last_piece is for the last piece in the current data item, not in the entire data payload of the message. This is harmful for messages with multiple data items. On top of that, we don't need to signal the end of a data payload either because it is always followed by a footer. We used to signal "more" unconditionally, until commit fe38a2b67bc6 ("libceph: start defining message data cursor"). Part of a large series, it introduced cursor->last_piece and also mistakenly inverted the hint by passing last_piece for "more". This was corrected with commit c2cfa1940097 ("libceph: Fix ceph_tcp_sendpage()'s more boolean usage"). As it is, last_piece is not helping at all: because Nagle algorithm is disabled, for a simple message with two 512-byte data items we end up emitting three packets: front + first data item, second data item and footer. Go back to the original pre-fe38a2b67bc6 behavior -- a single packet in most cases. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: fall back to sendmsg for slab pagesIlya Dryomov2018-11-191-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | skb_can_coalesce() allows coalescing neighboring slab objects into a single frag: return page == skb_frag_page(frag) && off == frag->page_offset + skb_frag_size(frag); ceph_tcp_sendpage() can be handed slab pages. One example of this is XFS: it passes down sector sized slab objects for its metadata I/O. If the kernel client is co-located on the OSD node, the skb may go through loopback and pop on the receive side with the exact same set of frags. When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy complains because the size exceeds the object's allocated size: usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes) Although skb_can_coalesce() could be taught to return false if the resulting frag would cross a slab object boundary, we already have a fallback for non-refcounted pages. Utilize it for slab pages too. Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge branch 'work.afs' of ↵Linus Torvalds2018-11-021-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull AFS updates from Al Viro: "AFS series, with some iov_iter bits included" * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) missing bits of "iov_iter: Separate type from direction and use accessor functions" afs: Probe multiple fileservers simultaneously afs: Fix callback handling afs: Eliminate the address pointer from the address list cursor afs: Allow dumping of server cursor on operation failure afs: Implement YFS support in the fs client afs: Expand data structure fields to support YFS afs: Get the target vnode in afs_rmdir() and get a callback on it afs: Calc callback expiry in op reply delivery afs: Fix FS.FetchStatus delivery from updating wrong vnode afs: Implement the YFS cache manager service afs: Remove callback details from afs_callback_break struct afs: Commit the status on a new file/dir/symlink afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS afs: Don't invoke the server to read data beyond EOF afs: Add a couple of tracepoints to log I/O errors afs: Handle EIO from delivery function afs: Fix TTL on VL server and address lists afs: Implement VL server rotation afs: Improve FS server rotation error handling ...
| * iov_iter: Separate type from direction and use accessor functionsDavid Howells2018-10-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>
* | libceph: preallocate message data itemsIlya Dryomov2018-10-221-67/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently message data items are allocated with ceph_msg_data_create() in setup_request_data() inside send_request(). send_request() has never been allowed to fail, so each allocation is followed by a BUG_ON: data = ceph_msg_data_create(...); BUG_ON(!data); It's been this way since support for multiple message data items was added in commit 6644ed7b7e04 ("libceph: make message data be a pointer") in 3.10. There is no reason to delay the allocation of message data items until the last possible moment and we certainly don't need a linked list of them as they are only ever appended to the end and never erased. Make ceph_msg_new2() take max_data_items and adapt the rest of the code. Reported-by: Jerry Lee <leisurelysw24@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()Ilya Dryomov2018-10-221-0/+1
|/ | | | | | | | | | | | | | | | Because send_mds_reconnect() wants to send a message with a pagelist and pass the ownership to the messenger, ceph_msg_data_add_pagelist() consumes a ref which is then put in ceph_msg_data_destroy(). This makes managing pagelists in the OSD client (where they are wrapped in ceph_osd_data) unnecessarily hard because the handoff only happens in ceph_osdc_start_request() instead of when the pagelist is passed to ceph_osd_data_pagelist_init(). I counted several memory leaks on various error paths. Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in ceph_osd_data. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: check authorizer reply/challenge length before readingIlya Dryomov2018-08-021-0/+7
| | | | | | | | Avoid scribbling over memory if the received reply/challenge is larger than the buffer supplied with the authorizer. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: add authorizer challengeIlya Dryomov2018-08-021-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | When a client authenticates with a service, an authorizer is sent with a nonce to the service (ceph_x_authorize_[ab]) and the service responds with a mutation of that nonce (ceph_x_authorize_reply). This lets the client verify the service is who it says it is but it doesn't protect against a replay: someone can trivially capture the exchange and reuse the same authorizer to authenticate themselves. Allow the service to reject an initial authorizer with a random challenge (ceph_x_authorize_challenge). The client then has to respond with an updated authorizer proving they are able to decrypt the service's challenge and that the new authorizer was produced for this specific connection instance. The accepting side requires this challenge and response unconditionally if the client side advertises they have CEPHX_V2 feature bit. This addresses CVE-2018-1128. Link: http://tracker.ceph.com/issues/24836 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: factor out __prepare_write_connect()Ilya Dryomov2018-08-021-9/+12
| | | | | | | | Will be used for sending ceph_msg_connect with an updated authorizer, after the server challenges the initial authorizer. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: store ceph_auth_handshake pointer in ceph_connectionIlya Dryomov2018-08-021-28/+26
| | | | | | | | | | We already copy authorizer_reply_buf and authorizer_reply_buf_len into ceph_connection. Factoring out __prepare_write_connect() requires two more: authorizer_buf and authorizer_buf_len. Store the pointer to the handshake in con->auth rather than piling on. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: use timespec64 in for keepalive2 and ticket validityArnd Bergmann2018-08-021-10/+10
| | | | | | | | | | | | | | | | | | ceph_con_keepalive_expired() is the last user of timespec_add() and some of the last uses of ktime_get_real_ts(). Replacing this with timespec64 based interfaces lets us remove that deprecated API. I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64() here that take timespec64 structures and convert to/from ceph_timespec, which is defined to have an unsigned 32-bit tv_sec member. This extends the range of valid times to year 2106, avoiding the year 2038 overflow. The ceph file system portion still uses the old functions for inode timestamps, this will be done separately after the VFS layer is converted. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: use MSG_TRUNC for discarding received bytesIlya Dryomov2018-06-041-13/+8
| | | | | | Avoid a copy into the "skip buffer". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: get rid of more_kvec in try_write()Ilya Dryomov2018-06-041-7/+3
| | | | | | | | | All gotos to "more" are conditioned on con->state == OPEN, but the only thing "more" does is opening the socket if con->state == PREOPEN. Kill that label and rename "more_kvec" to "more". Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
* libceph: validate con->state at the top of try_write()Ilya Dryomov2018-04-261-0/+7
| | | | | | | | | | | | | | | | | | | | | ceph_con_workfn() validates con->state before calling try_read() and then try_write(). However, try_read() temporarily releases con->mutex, notably in process_message() and ceph_con_in_msg_alloc(), opening the window for ceph_con_close() to sneak in, close the connection and release con->sock. When try_write() is called on the assumption that con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock gets passed to the networking stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: selinux_socket_sendmsg+0x5/0x20 Make sure con->state is valid at the top of try_write() and add an explicit BUG_ON for this, similar to try_read(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/23706 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
* libceph, ceph: add __init attribution to init funcitonsChengguang Xu2018-04-021-3/+1
| | | | | | | | | | Add __init attribution to the functions which are called only once during initiating/registering operations and deleting unnecessary symbol exports. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: handle zero-length data itemsIlya Dryomov2018-04-021-2/+12
| | | | | | | | | rbd needs this for null copyups -- if copyup data is all zeroes, we want to save some I/O and network bandwidth. See rbd_obj_issue_copyup() in the next commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce BVECS data typeIlya Dryomov2018-04-021-0/+75
| | | | | | | | In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for working with bio_vec array data buffers. The wrappers are trivial, but make it look similar to ceph_bio_iter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph, rbd: new bio handling code (aka don't clone bios)Ilya Dryomov2018-04-021-67/+34
| | | | | | | | | | | | | | The reason we clone bios is to be able to give each object request (and consequently each ceph_osd_data/ceph_msg_data item) its own pointer to a (list of) bio(s). The messenger then initializes its cursor with cloned bio's ->bi_iter, so it knows where to start reading from/writing to. That's all the cloned bios are used for: to determine each object request's starting position in the provided data buffer. Introduce ceph_bio_iter to do exactly that -- store position within bio list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: mark expected switch fall-throughsGustavo A. R. Silva2017-11-131-0/+1
| | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> [idryomov@gmail.com: amended "Older OSDs" comment] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman2017-11-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* libceph: don't call ->reencode_message() more than once per messageIlya Dryomov2017-08-011-3/+3
| | | | | | | | | | | | | Reencoding an already reencoded message is a bad idea. This could happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY), such as MDS sessions. This didn't pop up in testing because currently only OSD requests are reencoded and OSD sessions are always lossy. Fixes: 98ad5ebd1505 ("libceph: ceph_connection_operations::reencode_message() method") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
* libceph: potential NULL dereference in ceph_msg_data_create()Dan Carpenter2017-07-171-2/+4
| | | | | | | | | | If kmem_cache_zalloc() returns NULL then the INIT_LIST_HEAD(&data->links); will Oops. The callers aren't really prepared for NULL returns so it doesn't make a lot of difference in real life. Fixes: 5240d9f95dfe ("libceph: replace message data pointer with list") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: ceph_connection_operations::reencode_message() methodIlya Dryomov2017-07-071-2/+5
| | | | | | | | | Give upper layers a chance to reencode the message after the connection is negotiated and ->peer_features is set. OSD client will use this to support both luminous and pre-luminous OSDs (in a single cluster): the former need MOSDOp v8; the latter will continue to be sent MOSDOp v4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: remove ceph_sanitize_features() workaroundIlya Dryomov2017-07-071-2/+1
| | | | | | Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: cleanup old messages according to reconnect seqYan, Zheng2017-05-241-3/+12
| | | | | | | | | | when reopen a connection, use 'reconnect seq' to clean up messages that have already been received by peer. Link: http://tracker.ceph.com/issues/18690 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: make ceph_msg_data_advance() return voidIlya Dryomov2017-05-231-7/+4
| | | | | | | Both callers ignore the returned bool. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* fs: ceph: CURRENT_TIME with ktime_get_real_ts()Deepa Dinamani2017-05-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CURRENT_TIME is not y2038 safe. The macro will be deleted and all the references to it will be replaced by ktime_get_* apis. struct timespec is also not y2038 safe. Retain timespec for timestamp representation here as ceph uses it internally everywhere. These references will be changed to use struct timespec64 in a separate patch. The current_fs_time() api is being changed to use vfs struct inode* as an argument instead of struct super_block*. Set the new mds client request r_stamp field using ktime_get_real_ts() instead of using current_fs_time(). Also, since r_stamp is used as mtime on the server, use timespec_trunc() to truncate the timestamp, using the right granularity from the superblock. This api will be transitioned to be y2038 safe along with vfs. Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> M: Ilya Dryomov <idryomov@gmail.com> M: "Yan, Zheng" <zyan@redhat.com> M: Sage Weil <sage@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* libceph: force GFP_NOIO for socket allocationsIlya Dryomov2017-03-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
* Merge branch 'work.sendmsg' of ↵Linus Torvalds2017-03-031-15/+29
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs sendmsg updates from Al Viro: "More sendmsg work. This is a fairly separate isolated stuff (there's a continuation around lustre, but that one was too late to soak in -next), thus the separate pull request" * 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ncpfs: switch to sock_sendmsg() ncpfs: don't mess with manually advancing iovec on send ncpfs: sendmsg does *not* bugger iovec these days ceph_tcp_sendpage(): use ITER_BVEC sendmsg afs_send_pages(): use ITER_BVEC rds: remove dead code ceph: switch to sock_recvmsg() usbip_recv(): switch to sock_recvmsg() iscsi_target: deal with short writes on the tx side [nbd] pass iov_iter to nbd_xmit() [nbd] switch sock_xmit() to sock_{send,recv}msg() [drbd] use sock_sendmsg()
| * ceph_tcp_sendpage(): use ITER_BVEC sendmsgAl Viro2016-12-271-5/+15
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ceph: switch to sock_recvmsg()Al Viro2016-12-271-10/+14
| | | | | | | | | | | | ... and use ITER_BVEC instead of playing with kmap() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | locking/atomic, kref: Add kref_read()Peter Zijlstra2017-01-141-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | Since we need to change the implementation, stop exposing internals. Provide kref_read() to read the current reference count; typically used for debug messages. Kills two anti-patterns: atomic_read(&kref->refcount) kref->refcount.counter Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* libceph: no need to drop con->mutex for ->get_authorizer()Ilya Dryomov2016-12-121-6/+0
| | | | | | | | | ->get_authorizer(), ->verify_authorizer_reply(), ->sign_message() and ->check_message_signature() shouldn't be doing anything with or on the connection (like closing it or sending messages). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>