summaryrefslogtreecommitdiffstats
path: root/drivers/misc/habanalabs/common/device.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* habanalabs: make print of engines idle mask more readableTomer Tayar2022-11-231-6/+21
| | | | | | | | | | | The engines idle mask was increased to be an array of 4 u64 entries. To make the print of this mask more readable, remove the "0x" prefix, and zero-pad each u64 to 16 bytes if either it isn't zero or if any of the higher-order u64's is not zero. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: print context refcount value if hard reset failsTomer Tayar2022-11-231-3/+15
| | | | | | | | | | | Failing to kill a user process during a hard reset can be due to a reference to the user context which isn't released. To make it easier to understand if this the reason for the failure and not something else, add a print of the context refcount value. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: increase the size of busy engines maskTomer Tayar2022-11-231-4/+5
| | | | | | | | | Increase the size of the busy engines mask in 'struct hl_info_hw_idle', for future ASICs with more than 128 engines. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: extend process wait timeout in device fineOded Gabbay2022-11-231-2/+4
| | | | | | | | | | | | | | | Processes that use our device are likely to use at the same time other devices such as remote storage. In case our device is removed and a user process is still using the device, we need to kill the user process. However, if that process has a thread waiting for i/o to complete on remote storage, for example, the process won't terminate. Let's give it enough time to terminate before giving up. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
* habanalabs: check schedule_hard_reset correctlyOded Gabbay2022-11-231-12/+13
| | | | | | | | schedule_hard_reset can be true only if we didn't do hard-reset. Therefore, no point of checking it in case hard_reset is true. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
* habanalabs: reset device if still in use when releasedTomer Tayar2022-11-231-3/+4
| | | | | | | | | | | | | | | | | | | If the device file is released while a context is still held, it won't be possible to reopen it until the context is eventually released. If that doesn't happen, only a device reset will revert it back to an operational state, i.e. need to wait for a CS timeout or an error, or to wait for an external intervention of injecting a reset via sysfs. At this stage, after the device was released by user, context is held either because of CS which were left running on the device and are not relevant anymore, or due to missing cleanup steps from user side. All of this is in any case handled in the device reset flow, so initiate the reset at this point instead of waiting for it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi2: add razwi notify eventDani Liberman2022-11-231-1/+3
| | | | | | | | | Each time razwi (read-only zero, write ignored) event happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: add page fault notify eventDani Liberman2022-11-231-0/+9
| | | | | | | | | Each time page fault happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: use single threaded WQ for event handlingDani Liberman2022-11-231-1/+1
| | | | | | | | | | | | | Creating event queue workqueue using alloc_workqueue made it run in multi threaded mode, which caused parallel dumping of events as well as parallel events notifying to user, causing logs with multiple events to be out of order. Fixed by creating event queue workqueue as single threaded work queue. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: add razwi notify eventDani Liberman2022-11-231-0/+8
| | | | | | | | | Each time razwi (read-only zero, write ignore) happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi2: add PCI revision 2 supportOfir Bitton2022-11-231-0/+4
| | | | | | | | | | Add support for Gaudi2 Device with PCI revision 2. Functionality is exactly the same as revision 1, the only difference is device name exposed to user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove redundant gaudi2_sec asic typeOfir Bitton2022-11-231-3/+0
| | | | | | | | As Gaudi2 has a single PCI id, the secured asic type is redundant. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix PCIe access to SRAM via debugfsTomer Tayar2022-11-231-3/+3
| | | | | | | | | | | | | hl_access_sram_dram_region() uses a region base which is set within the hl_set_dram_bar() function. However, for SRAM access this function is not called, and we end up with a wrong value of region base and with a bad calculated address. Fix it by initializing the region base value independently of whether hl_set_dram_bar() is called or not. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: no consecutive err when user context is enabledTal Cohen2022-11-231-0/+4
| | | | | | | | | | | | | | Consecutive error protects a device reset loop from being triggered due to h/w issues and enters the device into an unavailable state. When user may cause the error, an unavailable state will prevent the user from running its workloads. The commit prevents entering consecutive state when a user context is enabled. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add support for graceful hard resetTomer Tayar2022-11-231-13/+128
| | | | | | | | | | | | | | | | | | | | | | | Calling hl_device_reset() for a hard reset will lead to a quite immediate device reset and to killing user process. For resets that follow errors, it disables the option to debug the errors on both the device side and the user application side. This patch adds a 'graceful hard reset' option and a new hl_device_cond_reset() function. Under some conditions, mainly if there is no user process or if he is not registered to driver notifications, this function will execute hard reset as usual. Otherwise, the reset will be postponed and a notification will be sent to user, to let him perform post-error actions and then to release the device, after which reset will take place. If device is not released by user in some defined time, a watchdog work will execute the reset in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: avoid divide by zero in device utilizationOhad Sharabi2022-11-231-2/+7
| | | | | | | | Currently there is no verification whether the divisor is legal. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix user mappings calculation in case of page faultDani Liberman2022-11-231-2/+7
| | | | | | | | | As there are 2 types of user mappings, pmmu and hmmu, calculate only the relevant mappings for the requested type. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: allow setting HBM BAR to other regionsOhad Sharabi2022-11-231-13/+16
| | | | | | | | | | | | | | Up until now the use-case in the driver was that the HBM is accessed using the HBM BAR, yet the BAR sometimes cannot cover the whole HBM and so we needed to set the BAR to other HBM offset. Now we are facing the need to access other PCI memory regions that can be covered by the HBM BAR. To answer that we are allowing the caller to determine if the HBM BAR need to be set or not regardless of the PCI memory region. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix using freed pointerOhad Sharabi2022-11-231-1/+4
| | | | | | | | | | The code uses the pointer for trace purpose (without actually dereference it) but still get static analysis warning. This patch eliminate the warning. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: verify no zero event is sentTal Cohen2022-11-231-0/+5
| | | | | | | | | The event notifier mechanism should not raise an empty event (event equals zero). Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: handle HBM MMU when capturing page fault dataDani Liberman2022-11-231-8/+21
| | | | | | | | In case of HBM MMU page fault, capture its relevant mappings. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: move reset workqueue to be under hl_deviceTomer Tayar2022-11-231-9/+6
| | | | | | | | | | | | | | 'struct hl_device_reset_work' is used as a wrapper for the reset work and its parameters, including the reset workqueue on which it runs. In a future commit, another reset related work with similar parameters is going to be added, but it won't use the reset workqueue. As in any case there is a single reset workqueue, and to allow the resue of this structure, move the reset workqueue to 'struct hl_device'. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: skip idle status check if reset on device releaseTomer Tayar2022-11-231-9/+7
| | | | | | | | | | If reset upon device release is enabled, there is no need to check the device idle status in hpriv_release(), because device is going to be reset in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: replace 'pf' to 'prefetch'Dafna Hirschfeld2022-11-231-7/+7
| | | | | | | | | pf was an abbreviation for prefetch but because pf already stands for 'physical function', we decided to change it to 'prefetch'. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add page fault info uapiDani Liberman2022-11-231-0/+58
| | | | | | | | | | | Only the first page fault will be saved. Besides the address which caused the page fault, the driver captures all of the mmu user mappings. User can retrieve this data via the new uapi (new opcode in INFO ioctl). Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: refactor razwi event notificationDani Liberman2022-11-231-0/+22
| | | | | | | | | | | | | | | | | This event notification was compatible only with gaudi, where razwi and page fault happens together. To make it compatible with all ASICs, this refactor contains: 1. Razwi notification will only notify about razwi info. New notification will be added in future patch, to retrieve data about page fault error. 2. Changed razwi info structure to support all ASICs. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: allow control device open during resetOfir Bitton2022-11-231-0/+22
| | | | | | | | | | Monitoring apps would like to query device state at any time so we should allow it also during reset because it doesn't involve accessing the h/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: eliminate aggregate use warningOded Gabbay2022-09-201-2/+2
| | | | | | | | | When doing sizeof() and giving as argument a dereference of a pointer-to-a-pointer object, clang will issue a warning. Eliminate the warning by passing struct <name>* Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix resetting the DRAM BAROhad Sharabi2022-09-191-19/+22
| | | | | | | | | Current code does not takes into account the new DRAM region base and so calculated address is wrong and can lead to crush. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi2: read F/W security indication after hard resetTomer Tayar2022-09-191-0/+7
| | | | | | | | | | | | | | | F/W security status might change after every reset. Add the reading of the preboot status to the hard reset sequence, which among others reads this security indication. As this preboot status reading includes the waiting for the preboot to be ready, it can be removed from the CPU init which is done in a later stage. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: send device activity in a proper contextOfir Bitton2022-09-191-2/+2
| | | | | | | | | | | 'Device activity open packet' should be sent outside of mutex as there is no real necessity for a lock. In addition 'device activity close packet' should be sent upon an actual release of the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: send device active message to f/wfarah kassabri2022-09-191-0/+2
| | | | | | | | | As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: trace DMA allocationsOhad Sharabi2022-09-181-18/+31
| | | | | | | | | | | | | | | | | | This patch add tracepoints in the code for DMA allocation. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. To get better understanding of what happened in the DMA allocations the real allocating function is added to the trace as well. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix calculation of DRAM base address in PCIe BARTomer Tayar2022-09-181-1/+5
| | | | | | | | | | | | | The calculation of the device DRAM base address before setting the relevant PCIe BAR to point at it, has an assumption that this BAR is used to access only the DRAM, and thus the covered DRAM size is a power of 2. In future ASICs it is not necessarily true, so need to update the calculation to support also a non-power-of-2 size. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add cdev index data memberOmer Shpigelman2022-09-181-4/+6
| | | | | | | | | | | Instead of recalculating the cdev index, store it in a dedicated data member. This data member is intended to be passed to other drivers using the auxiliary bus infra and hence this new data member is necessary in case that the calculation is changed in the future. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: move common function out of debugfs.cOded Gabbay2022-09-181-0/+24
| | | | | | | | | | | A common function that is called from multiple places can't be located in degugfs.c because that file is only compiled if debugfs is enabled in the kernel config file. This can lead to undefined symbol compilation error. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add a missing lock for in_reset indicationTomer Tayar2022-09-181-0/+2
| | | | | | | | | Add a missing lock in hl_device_resume() when it assigns a value to the 'in_reset' indication. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: rename non_hard_reset to compute_resetOfir Bitton2022-09-181-1/+1
| | | | | | | | | | In order to be more explicit we should use the term compute_reset for describing the reset in which only the compute engines gets reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: rename soft reset to compute resetOded Gabbay2022-07-121-14/+14
| | | | | | | | | | | | | Doing compute reset can be the traditional inference soft reset that is supported only in Goya. Or it can be the new reset upon device release, which is supported in Gaudi2 and above. Therefore, wherever suitable, use the terminology of compute reset instead of soft reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add status of reset after device releaseOded Gabbay2022-07-121-6/+11
| | | | | | | The user might want to know the device is in reset after device release, which is not an erroneous event as a regular reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: fix update of is_in_soft_resetOded Gabbay2022-07-121-5/+15
| | | | | | | | | | | | | | | | reset_info.is_in_soft_reset should be updated both before in_reset and inside the spin lock of the reset info structure. The reasons are: - When we are inside soft reset, it implies we are in reset. Therefore, if someone checks if we are in soft reset, he can deduce we are in reset, while the opposite is not correct and might be misleading. - Both these flags are changed together so they must be changed inside the reset info spinlock. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add a value field to hl_fw_send_pci_access_msg()Tomer Tayar2022-07-121-3/+2
| | | | | | | | | | | For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: initialize variable explicitlyOded Gabbay2022-07-121-1/+1
| | | | | | | Fix warning of "warning: ‘old_base’ may be used uninitialized in this function" Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: do not set max power on a secured deviceOfir Bitton2022-07-121-2/+4
| | | | | | | | | Max power API is not supported in secured devices. Hence, we should skip setting it during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: enable gaudi2 code in driverOded Gabbay2022-07-121-0/+8
| | | | | | | | | | | | Enable the Gaudi2 ASIC code in the pci probe callback of the driver so the driver will handle Gaudi2 ASICs. Add the PCI ID to the PCI table and add the ASIC enum value to all relevant places. Fixup the device parameters initialization for Gaudi2. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add gaudi2 wait-for-CS supportOded Gabbay2022-07-121-14/+34
| | | | | | | | | | | | | | | | In Gaudi2 we moved to a different wait for command submission completion model. Instead of receiving interrupt only on external queues, we use the device's sync manager to notify us when the entire command submission finishes. This enables us to remove the categorization of queues to external and internal, and treat each queue equally, without the need to parse and patch any command buffer. This change also requires refactoring to the IRQ handling of CS completions. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove obsolete device variables used for testingOded Gabbay2022-07-121-2/+1
| | | | | | | | | | There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add gaudi2 asic-specific codeOded Gabbay2022-07-121-0/+10
| | | | | | | | | | | | | | | Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains new debugfs entry to dump razwi events. razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove redundant argument in access_dev_mem APIsOfir Bitton2022-07-121-3/+2
| | | | | | | | | Region structure is derived from region type, hence no need to pass it as an argument. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: refactor dma asic-specific functionsOhad Sharabi2022-07-121-0/+75
| | | | | | | | | | | | | | | | This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>