summaryrefslogtreecommitdiffstats
path: root/include (follow)
Commit message (Collapse)AuthorAgeFilesLines
* RDMA/core: Add an integrity MR pool supportIsrael Rukshin2019-06-241-1/+1
| | | | | | | | | | This is a preparation for adding new signature API to the rw-API. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/mlx5: Introduce and implement new IB_WR_REG_MR_INTEGRITY work requestMax Gurtovoy2019-06-242-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new WR will be used to perform PI (protection information) handover using the new API. Using the new API, the user will post a single WR that will internally perform all the needed actions to complete PI operation. This new WR will use a memory region that was allocated as IB_MR_TYPE_INTEGRITY and was mapped using ib_map_mr_sg_pi to perform the registration. In the old API, in order to perform a signature handover operation, each ULP should perform the following: 1. Map and register the data buffers. 2. Map and register the protection buffers. 3. Post a special reg WR to configure the signature handover operation layout. 4. Invalidate the signature memory key. 5. Invalidate protection buffers memory key. 6. Invalidate data buffers memory key. In the new API, the mapping of both data and protection buffers is performed using a single call to ib_map_mr_sg_pi function. Also the registration of the buffers and the configuration of the signature operation layout is done by a single new work request called IB_WR_REG_MR_INTEGRITY. This patch implements this operation for mlx5 devices that are capable to offload data integrity generation/validation while performing the actual buffer transfer. This patch will not remove the old signature API that is used by the iSER initiator and target drivers. This will be done in the future. In the internal implementation, for each IB_WR_REG_MR_INTEGRITY work request, we are using a single UMR operation to register both data and protection buffers using KLM's. Afterwards, another UMR operation will describe the strided block format. These will be followed by 2 SET_PSV operations to set the memory/wire domains initial signature parameters passed by the user. In the end of the whole transaction, only the signature memory key (the one that exposed for the RDMA operation) will be invalidated. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/mlx5: Add attr for max number page list length for PI operationMax Gurtovoy2019-06-241-0/+1
| | | | | | | | | | | | | PI offload (protection information) is a feature that each RDMA provider can implement differently. Thus, introduce new device attribute to define the maximal length of the page list for PI fast registration operation. For example, mlx5 driver uses a single internal MR to map both data and protection SGL's, so it's equal to max_fast_reg_page_list_len / 2. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/core: Add signature attrs element for ib_mr structureMax Gurtovoy2019-06-242-1/+3
| | | | | | | | | | | | | This element will describe the needed characteristics for the signature operation per signature enabled memory region (type IB_MR_TYPE_INTEGRITY). Also add meta_length attribute to ib_sig_attrs structure for saving the mapped metadata length (needed for the new API implementation). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/core: Introduce ib_map_mr_sg_pi to map data/protection sgl'sMax Gurtovoy2019-06-241-0/+9
| | | | | | | | | | | | | This function will map the previously dma mapped SG lists for PI (protection information) and data to an appropriate memory region for future registration. The given MR must be allocated as IB_MR_TYPE_INTEGRITY. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/core: Introduce IB_MR_TYPE_INTEGRITY and ib_alloc_mr_integrity APIIsrael Rukshin2019-06-241-0/+10
| | | | | | | | | | | | | This is a preparation for signature verbs API re-design. In the new design a single MR with IB_MR_TYPE_INTEGRITY type will be used to perform the needed mapping for data integrity operations. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/core: Save the MR type in the ib_mr structureMax Gurtovoy2019-06-241-0/+10
| | | | | | | | | | | | | | | | This is a preparation for the signature verbs API change. This change is needed since the MR type will define, in the upcoming patches, the need for allocating internal resources in LLD for signature handover related operations. It will also help to make sure that signature related functions are called with an appropriate MR type and fail otherwise. Also introduce new mr types IB_MR_TYPE_USER, IB_MR_TYPE_DMA and IB_MR_TYPE_DM for correctness. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* RDMA/core: Introduce new header file for signature operationsMax Gurtovoy2019-06-242-111/+121
| | | | | | | | | | | Ease the exhausted ib_verbs.h file and make the code more readable. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* IB/{rdmavt, qib, hfi1}: Convert to new completion APIMike Marciniszyn2019-06-211-36/+0
| | | | | | | | | | | | | | | | | | | | Convert all completions to use the new completion routine that fixes a race between post send and completion where fields from a SWQE can be read after SWQE has been freed. This patch also addresses issues reported in https://marc.info/?l=linux-kernel&m=155656897409107&w=2. The reserved operation path has no need for any barrier. The barrier for the other path is addressed by the smp_load_acquire() barrier. Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* IB/rdmavt: Add new completion inlineMike Marciniszyn2019-06-211-0/+72
| | | | | | | | | | | | | | | | | | | | There is opencoded send completion logic all over all the drivers. We need to convert to this routine to enforce ordering issues for completions. This routine fixes an ordering issue where the read of the SWQE fields necessary for creating the completion can race with a post send if the post send catches a send queue at the edge of being full. Is is possible in that situation to read SWQE fields that are being written. This new routine insures that SWQE fields are read prior to advancing the index that post send uses to determine queue fullness. Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* RDMA: Convert destroy_wq to be voidLeon Romanovsky2019-06-201-1/+1
| | | | | | | | | All callers of destroy WQ are always success and there is no need to check their return value, so convert destroy_wq to be void. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* Merge remote-tracking branch 'mlx5-next/mlx5-next' into HEADDoug Ledford2019-06-197-41/+69
|\ | | | | | | | | | | Take mlx5-next so we can take a dependent two patch series next. Signed-off-by: Doug Ledford <dledford@redhat.com>
| * net/mlx5: Expose eswitch encap modeMaor Gottlieb2019-06-161-0/+12
| | | | | | | | | | | | | | | | | | | | | | Add API to get the current Eswitch encap mode. It will be used in downstream patches to check if flow table can be created with encap support or not. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com>
| * net/mlx5: Declare more strictly devlink encap modeLeon Romanovsky2019-06-161-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Devlink has UAPI declaration for encap mode, so there is no need to be loose on the data get/set by drivers. Update call sites to use enum devlink_eswitch_encap_mode instead of plain u8. Suggested-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz>
| * net/mlx5: Add EQ enable/disable APIYuval Avnery2019-06-131-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | Previously, EQ joined the chain notifier on creation. This forced the caller to be ready to handle events before creating the EQ through eq_create_generic interface. To help the caller control when the created EQ will be attached to the IRQ, add enable/disable API. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Use a single IRQ for all async EQsAriel Levkovich2019-06-131-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch modifies the IRQ allocation so that all async EQs are assigned to the same IRQ resulting in more available IRQs for completion EQs. The changes are using the support for IRQ sharing and EQ polling budget that was introduced in previous patches so when the shared interrupt is triggered, the kernel will serially call the handler of each of the sharing EQs with a certain budget of EQEs to poll in order to prevent starvation. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Separate IRQ data from EQ table dataYuval Avnery2019-06-131-0/+3
| | | | | | | | | | | | | | | | | | | | IRQ table should only exist for mlx5_core_dev for PF and VF only. EQ table of mediated devices should hold a pointer to the IRQ table of the parent PCI device. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Separate IRQ request/free from EQ life cycleYuval Avnery2019-06-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | Instead of requesting IRQ with eq creation, IRQs will be requested before EQ table creation. Instead of freeing the IRQs after EQ destroy, free IRQs after eq table destroy. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Change interrupt handler to call chain notifierYuval Avnery2019-06-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple EQs may share the same IRQ in subsequent patches. Instead of calling the IRQ handler directly, the EQ will register to an atomic chain notfier. The Linux built-in shared IRQ is not used because it forces the caller to disable the IRQ and clear affinity before free_irq() can be called. This patch is the first step in the separation of IRQ and EQ logic. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Support querying max VFs from deviceBodong Wang2019-06-132-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ECPF with eswitch manager privilege, query the host max VF count by querying the device using query_functions command. With this enhancement: 1. flow steering entries are created only for valid vports based on the max VF count of the PF. 2. Driver only queries cap of valid vport. Eswitch requires the max VFs when doing initialization, so do sr-iov init before eswitch init. Signed-off-by: Bodong Wang <bodong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * {IB,net}/mlx5: Constify rep ops functions pointersParav Pandit2019-05-311-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently for every representor type and for every single vport, representer function pointers copy is stored even though they don't change from one to other vport. Additionally priv data entry for the rep is not passed during registration, but its copied. It is used (set and cleared) by the user of the reps. As we want to scale vports, to simplify and also to split constants from data, 1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops as to match _ops prefix with other standard netdev, ibdev ops. 2. Constify the IB and Ethernet rep ops structure. 3. Instead of storing copy of all rep function pointers, store copy per eswitch rep type. 4. Split data and function pointers to mlx5_eswitch_rep_ops and mlx5_eswitch_rep_data. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: E-Switch, Honor eswitch functions changed event capVu Pham2019-05-311-1/+3
| | | | | | | | | | | | | | | | | | Whenever device supports eswitch functions changed event, honor such device setting. Do not limit it to ECPF. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: E-Switch, Replace host_params event with functions_changed eventVu Pham2019-05-312-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support sriov on a E-Switch manager, num_vfs are queried to the firmware whenever E-Switch manager is notified by esw_functions_changed event. Replace host_params event with esw_functions_changed event that reflects more appropriate naming. While at it, also correct num_vfs type from int to u16 as expected by the function mlx5_esw_query_functions(). Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Introduce termination table bitsEli Britstein2019-05-312-2/+5
| | | | | | | | | | | | | | | | | | | | | | Termination table is a flow table with a termination flag. The flag allows the firmware to assume that the the specified actions are the last actions list. This assumption allows the FW to safely perform potential looping logic (e.g. hairpin). Introduce the bits for this attribute. Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Add core dump register access HW bitsMoshe Shemesh2019-05-312-1/+17
| | | | | | | | | | | | | | | | Add Firmware core dump registers and HW definitions. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* | RDMA: Report available cdevs through RDMA_NLDEV_CMD_GET_CHARDEVJason Gunthorpe2019-06-192-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Update the struct ib_client for all modules exporting cdevs related to the ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now autoloadable and discoverable by userspace over netlink instead of relying on sysfs. uverbs also exposes the DRIVER_ID for drivers that are able to support driver id binding in rdma-core. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | RDMA: Add NLDEV_GET_CHARDEV to allow char dev discovery and autoloadJason Gunthorpe2019-06-193-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow userspace to issue a netlink query against the ib_device for something like "uverbs" and get back the char dev name, inode major/minor, and interface ABI information for "uverbs0". Since we are now in netlink this can also trigger a module autoload to make the uverbs device come into existence. Largely this will let us replace searching and reading inside sysfs to setup devices, and provides an alternative (using driver_id) to device name based provider binding for things like rxe. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | RDMA: Move rdma_node_type to uapi/Jason Gunthorpe2019-06-172-12/+13
| | | | | | | | | | | | | | | | This enum is exposed over the sysfs file 'node_type' and over netlink via RDMA_NLDEV_ATTR_DEV_NODE_TYPE, so declare it in the uapi headers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | RDMA: Convert CQ allocations to be under core responsibilityLeon Romanovsky2019-06-111-3/+3
| | | | | | | | | | | | | | | | | | | | Ensure that CQ is allocated and freed by IB/core and not by drivers. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | RDMA: Clean destroy CQ in drivers do not return errorsLeon Romanovsky2019-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Like all other destroy commands, .destroy_cq() call is not supposed to fail. In all flows, the attempt to return earlier caused to memory leaks. This patch converts .destroy_cq() to do not return any errors. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | RDMA: Move owner into struct ib_device_opsJason Gunthorpe2019-06-101-1/+1
| | | | | | | | | | | | | | | | | | This more closely follows how other subsytems work, with owner being a member of the structure containing the function pointers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | RDMA: Move uverbs_abi_ver into struct ib_device_opsJason Gunthorpe2019-06-101-1/+1
| | | | | | | | | | | | | | | | | | No reason for every driver to emit code to set this, just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | RDMA: Move driver_id into struct ib_device_opsJason Gunthorpe2019-06-102-2/+3
| | | | | | | | | | | | | | | | | | No reason for every driver to emit code to set this, just make it part of the driver's existing static const ops structure. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | rdma: Delete the ib_ucm moduleJason Gunthorpe2019-06-101-326/+0
| | | | | | | | | | | | | | | | | | This has been marked CONFIG_BROKEN for over a year now with no complaints. Delete the whole thing for good. The module provided the /dev/infiniband/ucmX interface. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | RDMA/core: Make ib_destroy_cq() voidLeon Romanovsky2019-05-211-2/+2
| | | | | | | | | | | | | | | | Kernel destroy CQ flows can't fail and the returned value of ib_destroy_cq() is not interested in those flows. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* | RDMA/umem: Move page_shift from ib_umem to ib_odp_umemJason Gunthorpe2019-05-212-15/+24
|/ | | | | | | | | | | This value has always been set to PAGE_SHIFT in the core code, the only thing that does differently was the ODP path. Move the value into the ODP struct and still use it for ODP, but change all the non-ODP things to just use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly. Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-05-192-2/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge yet more updates from Andrew Morton: "A few final bits: - large changes to vmalloc, yielding large performance benefits - tweak the console-flush-on-panic code - a few fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: panic: add an option to replay all the printk message in buffer initramfs: don't free a non-existent initrd fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro mm/vmalloc.c: keep track of free blocks for vmap allocation
| * panic: add an option to replay all the printk message in bufferFeng Tang2019-05-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently on panic, kernel will lower the loglevel and print out pending printk msg only with console_flush_on_panic(). Add an option for users to configure the "panic_print" to replay all dmesg in buffer, some of which they may have never seen due to the loglevel setting, which will help panic debugging . [feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()] Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com [feng.tang@intel.com: use logbuf lock to protect the console log index] Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Aaro Koskinen <aaro.koskinen@nokia.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/vmalloc.c: keep track of free blocks for vmap allocationUladzislau Rezki (Sony)2019-05-191-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "improve vmap allocation", v3. Objective --------- Please have a look for the description at: https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here as well. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation time. When i say "long" i mean milliseconds. Description ----------- This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free areas lookups, instead of finding a hole between two busy blocks. It allows to have lower number of objects which represent the free space, therefore to have less fragmented memory allocator. Because free blocks are always as large as possible. It uses the augment tree where all free areas are sorted in ascending order of va->va_start address in pair with linked list that provides O(1) access to prev/next elements. Since the tree is augment, we also maintain the "subtree_max_size" of VA that reflects a maximum available free block in its left or right sub-tree. Knowing that, we can easily traversal toward the lowest (left most path) free area. Allocation: ~O(log(N)) complexity. It is sequential allocation method therefore tends to maximize locality. The search is done until a first suitable block is large enough to encompass the requested parameters. Bigger areas are split. I copy paste here the description of how the area is split, since i described it in https://lkml.org/lkml/2018/10/19/786 <snip> A free block can be split by three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested size and alignment fit to a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Comparing with current design there is an extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is just cutting a free block. It is as fast as a current design. Most of the vmalloc allocations just end up with this case, because the edge is always aligned to 1. NE_FIT_TYPE - Is much less common case. Basically it happens when requested size and alignment does not fit left nor right edges, i.e. it is between them. In this case during splitting we have to build a remaining left free area and place it back to the free list/tree. Comparing with current design there are two extra steps. First one is we have to allocate a new vmap_area structure. Second one we have to insert that remaining free block to the address sorted list/tree. In order to optimize a first case there is a cache with free_vmap objects. Instead of allocating from slab we just take an object from the cache and reuse it. Second one is pretty optimized. Since we know a start point in the tree we do not do a search from the top. Instead a traversal begins from a rb-tree node we split. <snip> De-allocation. ~O(log(N)) complexity. An area is not inserted straight away to the tree/list, instead we identify the spot first, checking if it can be merged around neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check it. Summarizing. If merged then large coalesced areas are created, if not the area is just linked making more fragments. There is one more thing that i should mention here. After modification of VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. Apart of that it can also be populated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and the description. Tests and stressing ------------------- I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding last one, i do not have any physical access to NUMA system, therefore i emulated it. The time of stressing is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, i have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good. Performance analysis -------------------- I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1 as well, but the results have not been ready by time i an writing this. Currently it consist of 8 tests. There are three of them which correspond to different types of splitting(to compare with default). We have 3 ones(see above). Another 5 do allocations in different conditions. a) sudo ./test_vmalloc.sh performance When the test driver is run in "performance" mode, it runs all available tests pinned to first online CPU with sequential execution test order. We do it in order to get stable and repeatable results. Take a look at time difference in "long_busy_list_alloc_test". It is not surprising because the worst case is O(N). # i5-3320M How many cycles all tests took: CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # Hikey960 8x CPUs How many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt b) time sudo ./test_vmalloc.sh test_repeat_count=1 With this configuration, all tests are run on all available online CPUs. Before running each CPU shuffles its tests execution order. It gives random allocation behaviour. So it is rough comparison, but it puts in the picture for sure. # i5-3320M <default> vs <patched> real 101m22.813s real 0m56.805s user 0m0.011s user 0m0.015s sys 0m5.076s sys 0m0.023s # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # Hikey960 8x CPUs <default> vs <patched> real unknown real 4m25.214s user unknown user 0m0.011s sys unknown sys 0m0.670s I did not manage to complete this test on "default Hikey960" kernel version. After 24 hours it was still running, therefore i had to cancel it. That is why real/user/sys are "unknown". This patch (of 3): Currently an allocation of the new vmap area is done over busy list iteration(complexity O(n)) until a suitable hole is found between two busy areas. Therefore each new allocation causes the list being grown. Due to over fragmented list and different permissive parameters an allocation can take a long time. For example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks sorted by their offsets in pair with linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in its left or right sub-tree. Thus, that allows to take a decision and traversal toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest(left most) block is large enough to encompass: the requested size, alignment and vstart point. If the block is bigger than requested size - it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted to the tree. Red-black tree allows efficiently find a spot whereas a linked list provides a constant-time access to previous and next blocks to check if merging can be done. In case of merging of de-allocated memory chunk a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'i2c/for-next' of ↵Linus Torvalds2019-05-191-0/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "Some I2C core API additions which are kind of simple but enhance error checking for users a lot, especially by returning errno now. There are wrappers to still support the old API but it will be removed once all users are converted" * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: core: add device-managed version of i2c_new_dummy i2c: core: improve return value handling of i2c_new_device and i2c_new_dummy
| * | i2c: core: add device-managed version of i2c_new_dummyHeiner Kallweit2019-05-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i2c_new_dummy is typically called from the probe function of the driver for the primary i2c client. It requires calls to i2c_unregister_device in the error path of the probe function and in the remove function. This can be simplified by introducing a device-managed version. Note the changed error case return value type: i2c_new_dummy returns NULL whilst devm_i2c_new_dummy_device returns an ERR_PTR. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> [wsa: rename new functions and fix minor kdoc issues] Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Peter Rosin <peda@axentia.se> Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
* | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-05-191-3/+5
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some bug fixes, and an update to the URL's for the final version of Unicode 12.1.0" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid panic during forced reboot due to aborted journal ext4: fix block validity checks for journal inodes using indirect blocks unicode: update to Unicode 12.1.0 final unicode: add missing check for an error return from utf8lookup() ext4: fix miscellaneous sparse warnings ext4: unsigned int compared against zero ext4: fix use-after-free in dx_release() ext4: fix data corruption caused by overlapping unaligned and aligned IO jbd2: fix potential double free ext4: zero out the unused memory region in the extent tree block
| * | | jbd2: fix potential double freeChengguang Xu2019-05-111-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When failing from creating cache jbd2_inode_cache, we will destroy the previously created cache jbd2_handle_cache twice. This patch fixes this by moving each cache initialization/destruction to its own separate, individual function. Signed-off-by: Chengguang Xu <cgxu519@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2019-05-191-2/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Ingo Molnar: "Misc clocksource/clockevent driver updates that came in a bit late but are ready for v5.2" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: misc: atmel_tclib: Do not probe already used TCBs clocksource/drivers/timer-atmel-tcb: Convert tc_clksrc_suspend|resume() to static clocksource/drivers/tcb_clksrc: Rename the file for consistency clocksource/drivers/timer-atmel-pit: Rework Kconfig option clocksource/drivers/tcb_clksrc: Move Kconfig option ARM: at91: Implement clocksource selection clocksource/drivers/tcb_clksrc: Use tcb as sched_clock clocksource/drivers/tcb_clksrc: Stop depending on atmel_tclib ARM: at91: move SoC specific definitions to SoC folder clocksource/drivers/timer-milbeaut: Cleanup common register accesses clocksource/drivers/timer-milbeaut: Add shutdown function clocksource/drivers/timer-milbeaut: Fix to enable one-shot timer clocksource/drivers/tegra: Rework for compensation of suspend time clocksource/drivers/sp804: Add COMPILE_TEST to CONFIG_ARM_TIMER_SP804 clocksource/drivers/sun4i: Add a compatible for suniv dt-bindings: timer: Add Allwinner suniv timer
| * | | | ARM: at91: move SoC specific definitions to SoC folderAlexandre Belloni2019-05-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move linux/atmel_tc.h to the SoC specific folder include/soc/at91. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Thierry Reding <thierry.reding@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
* | | | | Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2019-05-197-8/+214
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull IRQ chip updates from Ingo Molnar: "A late irqchips update: - New TI INTR/INTA set of drivers - Rewrite of the stm32mp1-exti driver as a platform driver - Update the IOMMU MSI mapping API to be RT friendly - A number of cleanups and other low impact fixes" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) iommu/dma-iommu: Remove iommu_dma_map_msi_msg() irqchip/gic-v3-mbi: Don't map the MSI page in mbi_compose_m{b, s}i_msg() irqchip/ls-scfg-msi: Don't map the MSI page in ls_scfg_msi_compose_msg() irqchip/gic-v3-its: Don't map the MSI page in its_irq_compose_msi_msg() irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg() iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two parts genirq/msi: Add a new field in msi_desc to store an IOMMU cookie arm64: arch_k3: Enable interrupt controller drivers irqchip/ti-sci-inta: Add msi domain support soc: ti: Add MSI domain bus support for Interrupt Aggregator irqchip/ti-sci-inta: Add support for Interrupt Aggregator driver dt-bindings: irqchip: Introduce TISCI Interrupt Aggregator bindings irqchip/ti-sci-intr: Add support for Interrupt Router driver dt-bindings: irqchip: Introduce TISCI Interrupt router bindings gpio: thunderx: Use the default parent apis for {request,release}_resources genirq: Introduce irq_chip_{request,release}_resource_parent() apis firmware: ti_sci: Add helper apis to manage resources firmware: ti_sci: Add RM mapping table for am654 firmware: ti_sci: Add support for IRQ management firmware: ti_sci: Add support for RM core ops ...
| * | | | | iommu/dma-iommu: Remove iommu_dma_map_msi_msg()Julien Grall2019-05-031-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent change split iommu_dma_map_msi_msg() in two new functions. The function was still implemented to avoid modifying all the callers at once. Now that all the callers have been reworked, iommu_dma_map_msi_msg() can be removed. Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | | | iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two partsJulien Grall2019-05-031-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On RT, iommu_dma_map_msi_msg() may be called from non-preemptible context. This will lead to a splat with CONFIG_DEBUG_ATOMIC_SLEEP as the function is using spin_lock (they can sleep on RT). iommu_dma_map_msi_msg() is used to map the MSI page in the IOMMU PT and update the MSI message with the IOVA. Only the part to lookup for the MSI page requires to be called in preemptible context. As the MSI page cannot change over the lifecycle of the MSI interrupt, the lookup can be cached and re-used later on. iomma_dma_map_msi_msg() is now split in two functions: - iommu_dma_prepare_msi(): This function will prepare the mapping in the IOMMU and store the cookie in the structure msi_desc. This function should be called in preemptible context. - iommu_dma_compose_msi_msg(): This function will update the MSI message with the IOVA when the device is behind an IOMMU. Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | | | genirq/msi: Add a new field in msi_desc to store an IOMMU cookieJulien Grall2019-05-031-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an MSI doorbell is located downstream of an IOMMU, it is required to swizzle the physical address with an appropriately-mapped IOVA for any device attached to one of our DMA ops domain. At the moment, the allocation of the mapping may be done when composing the message. However, the composing may be done in non-preemtible context while the allocation requires to be called from preemptible context. A follow-up change will split the current logic in two functions requiring to keep an IOMMU cookie per MSI. A new field is introduced in msi_desc to store an IOMMU cookie. As the cookie may not be required in some configuration, the field is protected under a new config CONFIG_IRQ_MSI_IOMMU. A pair of helpers has also been introduced to access the field. Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | | | soc: ti: Add MSI domain bus support for Interrupt AggregatorLokesh Vutla2019-05-013-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the system coprocessor managing the range allocation of the inputs to Interrupt Aggregator, it is difficult to represent the device IRQs from DT. The suggestion is to use MSI in such cases where devices wants to allocate and group interrupts dynamically. Create a MSI domain bus layer that allocates and frees MSIs for a device. APIs that are implemented: - ti_sci_inta_msi_create_irq_domain() that creates a MSI domain - ti_sci_inta_msi_domain_alloc_irqs() that creates MSIs for the specified device and resource. - ti_sci_inta_msi_domain_free_irqs() frees the irqs attached to the device. - ti_sci_inta_msi_get_virq() for getting the virq attached to a specific event. Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>