summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* chardev: fix error handling in cdev_device_add()Yang Yingliang2022-12-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | While doing fault injection test, I got the following report: ------------[ cut here ]------------ kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called. WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0 CPU: 3 PID: 6306 Comm: 283 Tainted: G W 6.1.0-rc2-00005-g307c1086d7c9 #1253 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:kobject_put+0x23d/0x4e0 Call Trace: <TASK> cdev_device_add+0x15e/0x1b0 __iio_device_register+0x13b4/0x1af0 [industrialio] __devm_iio_device_register+0x22/0x90 [industrialio] max517_probe+0x3d8/0x6b4 [max517] i2c_device_probe+0xa81/0xc00 When device_add() is injected fault and returns error, if dev->devt is not set, cdev_add() is not called, cdev_del() is not needed. Fix this by checking dev->devt in error path. Fixes: 233ed09d7fda ("chardev: add helper function to register char devs with a struct device") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mcb: mcb-parse: fix error handing in chameleon_parse_gdd()Yang Yingliang2022-12-021-1/+1
| | | | | | | | | | | | | If mcb_device_register() returns error in chameleon_parse_gdd(), the refcount of bus and device name are leaked. Fix this by calling put_device() to give up the reference, so they can be released in mcb_release_dev() and kobject_cleanup(). Fixes: 3764e82e5150 ("drivers: Introduce MEN Chameleon Bus") Reviewed-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Johannes Thumshirn <jth@kernel.org> Link: https://lore.kernel.org/r/ebfb06e39b19272f0197fa9136b5e4b6f34ad732.1669624063.git.johannes.thumshirn@wdc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* drivers: mcb: fix resource leak in mcb_probe()Zhengchao Shao2022-12-021-1/+3
| | | | | | | | | | | When probe hook function failed in mcb_probe(), it doesn't put the device. Compiled test only. Fixes: 7bc364097a89 ("mcb: Acquire reference to device in probe") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Johannes Thumshirn <jth@kernel.org> Link: https://lore.kernel.org/r/9f87de36bfb85158b506cb78c6fc9db3f6a3bad1.1669624063.git.johannes.thumshirn@wdc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'icc-6.2-rc1' of ↵Greg Kroah-Hartman2022-11-306-107/+61
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next Georgi writes: interconnect changes for 6.2 These are the interconnect changes for the 6.2-rc1 merge window consisting of new drivers to enable both L3 and DDR scaling on sc8280xp platforms. There are also a few miscellaneous fixes. New osm-l3 driver: - interconnect: qcom: osm-l3: Use platform-independent node ids - interconnect: qcom: osm-l3: Squash common descriptors - interconnect: qcom: osm-l3: Add per-core EPSS L3 support - interconnect: qcom: osm-l3: Simplify osm_l3_set() - dt-bindings: interconnect: Add sm8350, sc8280xp and generic OSM L3 compatibles - dt-bindings: interconnect: qcom,msm8998-bwmon: Add sc8280xp bwmon instances Fixes: - interconnect: qcom: icc-rpm: Remove redundant dev_err call - interconnect: qcom: sc7180: fix dropped const of qcom_icc_bcm - interconnect: qcom: sc7180: drop double space - interconnect: qcom: sc8180x: constify pointer to qcom_icc_node Signed-off-by: Georgi Djakov <djakov@kernel.org> * tag 'icc-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc: interconnect: qcom: sc8180x: constify pointer to qcom_icc_node interconnect: qcom: sc7180: drop double space interconnect: qcom: sc7180: fix dropped const of qcom_icc_bcm interconnect: qcom: icc-rpm: Remove redundant dev_err call dt-bindings: interconnect: qcom,msm8998-bwmon: Add sc8280xp bwmon instances dt-bindings: interconnect: Add sm8350, sc8280xp and generic OSM L3 compatibles interconnect: qcom: osm-l3: Simplify osm_l3_set() interconnect: qcom: osm-l3: Add per-core EPSS L3 support interconnect: qcom: osm-l3: Squash common descriptors interconnect: qcom: osm-l3: Use platform-independent node ids dt-bindings: interconnect: qcom,msm8998-bwmon: Correct SC7280 CPU compatible
| * Merge branch 'icc-sc8280xp-l3' into icc-nextGeorgi Djakov2022-11-173-100/+57
| |\ | | | | | | | | | | | | | | | | | | | | | | | | The SC8280XP currently shows depressing results in memory benchmarks. Fix this by introducing support for the platform in the OSM (and EPSS) L3 driver and support for the platform in the bwmon binding. Link: https://lore.kernel.org/r/20221111032515.3460-1-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * dt-bindings: interconnect: qcom,msm8998-bwmon: Add sc8280xp bwmon instancesBjorn Andersson2022-11-171-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sc8280xp platform has two BWMON instances, one v4 and one v5. Extend the existing qcom,msm8998-bwmon and qcom,sc7280-llcc-bwmon to describe these. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-10-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * dt-bindings: interconnect: Add sm8350, sc8280xp and generic OSM L3 compatiblesBjorn Andersson2022-11-171-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add EPSS L3 compatibles for sm8350 and sc8280xp, but while at it also introduce generic compatible for both qcom,osm-l3 and qcom,epss-l3. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-6-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * interconnect: qcom: osm-l3: Simplify osm_l3_set()Bjorn Andersson2022-11-141-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The aggregation over votes for all nodes in the provider will always only find the bandwidth votes for the destination side of the path. Further more, the average kBps value will always be 0. Simplify the logic by directly looking at the destination node's peak bandwidth request. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-5-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * interconnect: qcom: osm-l3: Add per-core EPSS L3 supportBjorn Andersson2022-11-141-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The EPSS instance in e.g. SM8350 and SC8280XP has per-core L3 voting enabled. In this configuration, the "shared" vote is done using the REG_L3_VOTE register instead of PERF_STATE. Rename epss_l3 to clarify that it's affecting the PERF_STATE register and add a new L3_VOTE description. Given platform lineage it's assumed that the L3_VOTE-based case will be the predominant one, so use this for a new generic qcom,epss-l3 compatible. While adding the EPSS generic, also add qcom,osm-l3. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-4-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * interconnect: qcom: osm-l3: Squash common descriptorsBjorn Andersson2022-11-141-40/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each platform defines their own OSM L3 descriptor, but in practice there's only two: one for OSM and one for EPSS. Remove the duplicated definitions. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-3-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * interconnect: qcom: osm-l3: Use platform-independent node idsBjorn Andersson2022-11-141-57/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The identifiers used for nodes needs to be unique in the running system, but defining them per platform results in a lot of duplicated definitions and prevents us from using generic compatibles. As these identifiers are not exposed outside the kernel, change to use driver-local numbers, picked completely at random. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Tested-by: Steev Klimaszewski <steev@kali.org> Reviewed-by: Sibi Sankar <quic_sibis@quicinc.com> Link: https://lore.kernel.org/r/20221111032515.3460-2-quic_bjorande@quicinc.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
| | * dt-bindings: interconnect: qcom,msm8998-bwmon: Correct SC7280 CPU compatibleKrzysztof Kozlowski2022-10-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two different compatibles for SC7280 CPU BWMON instance were used in DTS and bindings. Correct the bindings to use the same one as in DTS, because it is more specific. Fixes: b7c84ae757c2 ("dt-bindings: interconnect: qcom,msm8998-bwmon: Add support for sc7280 BWMONs") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Rajendra Nayak <quic_rjendra@quicinc.com> Acked-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20221011140744.29829-1-krzysztof.kozlowski@linaro.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
| * | interconnect: qcom: sc8180x: constify pointer to qcom_icc_nodeKrzysztof Kozlowski2022-11-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Pointers to struct qcom_icc_node are const. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221027154848.293523-3-krzysztof.kozlowski@linaro.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
| * | interconnect: qcom: sc7180: drop double spaceKrzysztof Kozlowski2022-11-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Drop double white-space. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221027154848.293523-2-krzysztof.kozlowski@linaro.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
| * | interconnect: qcom: sc7180: fix dropped const of qcom_icc_bcmKrzysztof Kozlowski2022-11-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pointers to struct qcom_icc_bcm are const, but the change was dropped during merge. Fixes: 016fca59f95f ("Merge branch 'icc-const' into icc-next") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221027154848.293523-1-krzysztof.kozlowski@linaro.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
| * | interconnect: qcom: icc-rpm: Remove redundant dev_err callShang XiaoJing2022-11-171-4/+1
| |/ | | | | | | | | | | | | | | | | | | devm_ioremap_resource() prints error message in itself. Remove the dev_err call to avoid redundant error message. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220924015043.25130-1-shangxiaojing@huawei.com Signed-off-by: Georgi Djakov <djakov@kernel.org>
* | Merge tag 'coresight-next-v6.2' of ↵Greg Kroah-Hartman2022-11-293-42/+116
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux into char-misc-next Suzuki writes: coresight: Update for v6.2 CoreSight updatesfor v6.2 includes : - Support for ETMv4 probing on hotplugged CPUs - Fix TRBE driver for cpuhp state refcounting - Fix CTI driver NULL pointer dereferencing - Fix comment for repeated word Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> * tag 'coresight-next-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux: coresight: etm4x: fix repeated words in comments coresight: cti: Fix null pointer error on CTI init before ETM coresight: trbe: remove cpuhp instance node before remove cpuhp state coresight: etm4x: add CPU hotplug support for probing
| * | coresight: etm4x: fix repeated words in commentsJilin Yuan2022-11-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Delete the redundant word 'the'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20221019124953.45885-1-yuanjilin@cdjrlc.com
| * | coresight: cti: Fix null pointer error on CTI init before ETMMike Leach2022-11-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CTI is discovered first then the function coresight_set_assoc_ectdev_mutex() is called to set the association between CTI and ETM device. Recent lockdep fix passes a null pointer. This patch passes the correct pointer. Before patch: log of boot oops sequence with CTI discovered first: [ 12.424091] cs_system_cfg: CoreSight Configuration manager initialised [ 12.483474] coresight cti_sys0: CTI initialized [ 12.488109] coresight cti_sys1: CTI initialized [ 12.503594] coresight cti_cpu0: CTI initialized [ 12.517877] coresight-cpu-debug 850000.debug: Coresight debug-CPU0 initialized [ 12.523479] coresight-cpu-debug 852000.debug: Coresight debug-CPU1 initialized [ 12.529926] coresight-cpu-debug 854000.debug: Coresight debug-CPU2 initialized [ 12.541808] coresight stm0: STM32 initialized [ 12.544421] coresight-cpu-debug 856000.debug: Coresight debug-CPU3 initialized [ 12.585639] coresight cti_cpu1: CTI initialized [ 12.614028] coresight cti_cpu2: CTI initialized [ 12.631679] CSCFG registered etm0 [ 12.633920] coresight etm0: CPU0: etm v4.0 initialized [ 12.656392] coresight cti_cpu3: CTI initialized ... [ 12.708383] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000348 ... [ 12.755094] Internal error: Oops: 0000000096000044 [#1] SMP [ 12.761817] Modules linked in: coresight_etm4x(+) coresight_tmc coresight_cpu_debug coresight_replicator coresight_funnel coresight_cti coresight_tpiu coresight_stm coresight [ 12.767210] CPU: 3 PID: 1346 Comm: systemd-udevd Not tainted 6.1.0-rc3tid-v6tid-v6-235166-gf7f7d7a2204a-dirty #498 [ 12.782827] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 12.793154] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 12.800010] pc : coresight_set_assoc_ectdev_mutex+0x30/0x50 [coresight] [ 12.806694] lr : coresight_set_assoc_ectdev_mutex+0x30/0x50 [coresight] ... [ 12.885064] Call trace: [ 12.892352] coresight_set_assoc_ectdev_mutex+0x30/0x50 [coresight] [ 12.894693] cti_add_assoc_to_csdev+0x144/0x1b0 [coresight_cti] [ 12.900943] coresight_register+0x2c8/0x320 [coresight] [ 12.906844] etm4_add_coresight_dev.isra.27+0x148/0x280 [coresight_etm4x] [ 12.912056] etm4_probe+0x144/0x1c0 [coresight_etm4x] [ 12.918998] etm4_probe_amba+0x40/0x78 [coresight_etm4x] [ 12.924032] amba_probe+0x11c/0x1f0 After patch: similar log [ 12.444467] cs_system_cfg: CoreSight Configuration manager initialised [ 12.456329] coresight-cpu-debug 850000.debug: Coresight debug-CPU0 initialized [ 12.456754] coresight-cpu-debug 852000.debug: Coresight debug-CPU1 initialized [ 12.469672] coresight-cpu-debug 854000.debug: Coresight debug-CPU2 initialized [ 12.476098] coresight-cpu-debug 856000.debug: Coresight debug-CPU3 initialized [ 12.532409] coresight stm0: STM32 initialized [ 12.533708] coresight cti_sys0: CTI initialized [ 12.539478] coresight cti_sys1: CTI initialized [ 12.550106] coresight cti_cpu0: CTI initialized [ 12.633931] coresight cti_cpu1: CTI initialized [ 12.634664] coresight cti_cpu2: CTI initialized [ 12.638090] coresight cti_cpu3: CTI initialized [ 12.721136] CSCFG registered etm0 ... [ 12.762643] CSCFG registered etm1 [ 12.762666] coresight etm1: CPU1: etm v4.0 initialized [ 12.776258] CSCFG registered etm2 [ 12.776282] coresight etm2: CPU2: etm v4.0 initialized [ 12.784357] CSCFG registered etm3 [ 12.785455] coresight etm3: CPU3: etm v4.0 initialized Error can also be triggered by manually starting the modules using modprobe in the following order: root@linaro-developer:/home/linaro/cs-mods# modprobe coresight root@linaro-developer:/home/linaro/cs-mods# modprobe coresight-cti root@linaro-developer:/home/linaro/cs-mods# modprobe coresight-etm4x Tested on Dragonboard DB410c Applies to coresight/next Fixes: 23722fb46725 ("coresight: Fix possible deadlock with lock dependency") Signed-off-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20221123193818.6253-1-mike.leach@linaro.org
| * | coresight: trbe: remove cpuhp instance node before remove cpuhp stateYang Shen2022-11-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpuhp_state_add_instance() and cpuhp_state_remove_instance() should be used in pairs. Or there will lead to the warn on cpuhp_remove_multi_state() since the cpuhp_step list is not empty. The following is the error log with 'rmmod coresight-trbe': Error: Removing state 215 which has instances left. Call trace: __cpuhp_remove_state_cpuslocked+0x144/0x160 __cpuhp_remove_state+0xac/0x100 arm_trbe_device_remove+0x2c/0x60 [coresight_trbe] platform_remove+0x34/0x70 device_remove+0x54/0x90 device_release_driver_internal+0x1e4/0x250 driver_detach+0x5c/0xb0 bus_remove_driver+0x64/0xc0 driver_unregister+0x3c/0x70 platform_driver_unregister+0x20/0x30 arm_trbe_exit+0x1c/0x658 [coresight_trbe] __arm64_sys_delete_module+0x1ac/0x24c invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0x58/0x1a0 do_el0_svc+0x38/0xd0 el0_svc+0x2c/0xc0 el0t_64_sync_handler+0x1ac/0x1b0 el0t_64_sync+0x19c/0x1a0 ---[ end trace 0000000000000000 ]--- Fixes: 3fbf7f011f24 ("coresight: sink: Add TRBE driver") Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Yang Shen <shenyang39@huawei.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20221122090355.23533-1-shenyang39@huawei.com
| * | coresight: etm4x: add CPU hotplug support for probingTamas Zsoldos2022-10-311-40/+113
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | etm4x devices cannot be successfully probed when their CPU is offline. For example, when booting with maxcpus=n, ETM probing will fail on CPUs >n, and the probing won't be reattempted once the CPUs come online. This will leave those CPUs unable to make use of ETM. This change adds a mechanism to delay the probing if the corresponding CPU is offline, and to try it again when the CPU comes online. Signed-off-by: Tamas Zsoldos <tamas.zsoldos@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220705145935.24679-1-tamas.zsoldos@arm.com
* | | Merge tag 'misc-habanalabs-next-2022-11-23' of ↵Greg Kroah-Hartman2022-11-2921-529/+1267
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into char-misc-next Oded writes: This tag contains habanalabs driver changes for v6.2: - New feature of graceful hard-reset. Instead of immediately killing the user-process when a command submission times out, we wait a bit and give the user-process notification and let it try to close things gracefully, with the ability to retrieve debug information. - Enhance the EventFD mechanism. Add new events such as access to illegal address (RAZWI), page fault, device unavailable. In addition, change the event workqueue to be handled in a single-threaded workqueue. - Allow the control device to work during reset of the ASIC, to enable monitoring applications to continue getting the data. - Add handling for Gaudi2 with PCI revision 2. - Reduce severity of prints due to power/thermal events. - Change how we use the h/w to perform memory scrubbing in Gaudi2. - Multiple bug fixes, refactors and renames. * tag 'misc-habanalabs-next-2022-11-23' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux: (63 commits) habanalabs: fix VA range calculation habanalabs: fail driver load if EEPROM errors detected habanalabs: make print of engines idle mask more readable habanalabs: clear non-released encapsulated signals habanalabs: don't put context in hl_encaps_handle_do_release_sob() habanalabs: print context refcount value if hard reset fails habanalabs: add RMWREG32_SHIFTED to set a val within a mask habanalabs: fix rc when new CPUCP opcodes are not supported habanalabs/gaudi2: added memset for the cq_size register habanalabs: added return value check for hl_fw_dynamic_send_clear_cmd() habanalabs: increase the size of busy engines mask habanalabs/gaudi2: change memory scrub mechanism habanalabs: extend process wait timeout in device fine habanalabs: check schedule_hard_reset correctly habanalabs: reset device if still in use when released habanalabs/gaudi2: return to reset upon SM SEI BRESP error habanalabs/gaudi2: don't enable entries in the MSIX_GW table habanalabs/gaudi2: remove redundant firmware version check habanalabs/gaudi: fix print for firmware-alive event habanalabs: fix print for out-of-sync and pkt-failure events ...
| * | | habanalabs: fix VA range calculationOhad Sharabi2022-11-231-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current implementation is fixing the page size to PAGE_SIZE whereas the input page size may be different. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: fail driver load if EEPROM errors detectedOfir Bitton2022-11-231-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case EEPROM is not burned, firmware sets default EEPROM values. As this is not valid in production, driver should fail load upon any EEPROM error reported by firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: make print of engines idle mask more readableTomer Tayar2022-11-231-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The engines idle mask was increased to be an array of 4 u64 entries. To make the print of this mask more readable, remove the "0x" prefix, and zero-pad each u64 to 16 bytes if either it isn't zero or if any of the higher-order u64's is not zero. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: clear non-released encapsulated signalsTomer Tayar2022-11-233-31/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reserved encapsulated signals which were not released hold the context refcount, leading to a failure when killing the user process on device reset or device fini. Add the release of these left signals in the CS roll-back process. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: don't put context in hl_encaps_handle_do_release_sob()Tomer Tayar2022-11-231-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hl_encaps_handle_do_release_sob() can be called only when the last reference to the context object is released and hl_ctx_do_release() is initiated, and therefore it shouldn't call hl_ctx_put(). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: print context refcount value if hard reset failsTomer Tayar2022-11-231-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Failing to kill a user process during a hard reset can be due to a reference to the user context which isn't released. To make it easier to understand if this the reason for the failure and not something else, add a print of the context refcount value. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: add RMWREG32_SHIFTED to set a val within a maskDafna Hirschfeld2022-11-232-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is similar to RMWREG32, but the given 'val' is already shifted according to the mask. This allows several 'ORed' vals and masks to be set at once The patch also fixes wrong usage of RMWREG32 by replacing it with RMWREG32_SHIFTED Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: fix rc when new CPUCP opcodes are not supportedTomer Tayar2022-11-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the new CPUCP opcodes are not supported and a CPUCP packet fails, the return value is the F/W error resposone which is a positive value. If this packet is sent from IOCTL and the positive value is used, the ICOTL will not be considered as unsuccessful. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: added memset for the cq_size registerMarco Pagani2022-11-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The clang-analyzer reported a warning: "Value stored to 'cq_size_addr' is never read". The cq_size register of dcore0 is not being zeroed using gaudi2_memset_device_lbw(), along with the other cq_* registers, even though the corresponding cq_size_addr variable is set. Signed-off-by: Marco Pagani <marpagan@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: added return value check for hl_fw_dynamic_send_clear_cmd()Marco Pagani2022-11-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The clang-analyzer reported a warning: "Value stored to 'rc' is never read". The return value check for the first hl_fw_dynamic_send_clear_cmd() call in hl_fw_dynamic_send_protocol_cmd() appears to be missing. Signed-off-by: Marco Pagani <marpagan@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: increase the size of busy engines maskTomer Tayar2022-11-232-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increase the size of the busy engines mask in 'struct hl_info_hw_idle', for future ASICs with more than 128 engines. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: change memory scrub mechanismfarah kassabri2022-11-231-46/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the scrubbing mechanism used the EDMA engines by directly setting the engine core registers to scrub a chunk of memory. Due to a sporadic failure with this mechanism, it was decided to initiate the engines via its QMAN using LIN-DMA packets. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: extend process wait timeout in device fineOded Gabbay2022-11-232-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Processes that use our device are likely to use at the same time other devices such as remote storage. In case our device is removed and a user process is still using the device, we need to kill the user process. However, if that process has a thread waiting for i/o to complete on remote storage, for example, the process won't terminate. Let's give it enough time to terminate before giving up. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
| * | | habanalabs: check schedule_hard_reset correctlyOded Gabbay2022-11-231-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | schedule_hard_reset can be true only if we didn't do hard-reset. Therefore, no point of checking it in case hard_reset is true. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
| * | | habanalabs: reset device if still in use when releasedTomer Tayar2022-11-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the device file is released while a context is still held, it won't be possible to reopen it until the context is eventually released. If that doesn't happen, only a device reset will revert it back to an operational state, i.e. need to wait for a CS timeout or an error, or to wait for an external intervention of injecting a reset via sysfs. At this stage, after the device was released by user, context is held either because of CS which were left running on the device and are not relevant anymore, or due to missing cleanup steps from user side. All of this is in any case handled in the device reset flow, so initiate the reset at this point instead of waiting for it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: return to reset upon SM SEI BRESP errorTomer Tayar2022-11-231-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to a H/W issue in the LBW path to the PCIE_DBI MSI-X doorbell, there were false sporadic error responses in SM when it was configured to write to there, and hence no reset was done as part of handling the relevant event. Now that the virtual MSI-X doorbell is used, such errors in SM are not expected and reset shouldn't be skipped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: don't enable entries in the MSIX_GW tableTomer Tayar2022-11-231-26/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | User should use the virtual MSI-X doorbell to generate interrupts from the device, so there is no need to enable entries in the MSIX_GW table. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: remove redundant firmware version checkfarah kassabri2022-11-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Firmware 1.7 is the first official firmware, so no need to check if we are running a version below it. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi: fix print for firmware-alive eventTomer Tayar2022-11-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing le{32,64}_to_cpu conversions. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: fix print for out-of-sync and pkt-failure eventsTomer Tayar2022-11-233-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing le32_to_cpu() conversions, and use %d for the value returned from atomic_read(). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: add page fault notify eventDani Liberman2022-11-231-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each time page fault happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: classify power/thermal events as infoOfir Bitton2022-11-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As power and thermal envelope events are pure informative and not indicating an error, we reduce the print level to info only. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: skip events info ioctl if not supportedOhad Sharabi2022-11-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some ASICs haven't yet implemented this functionality and so the ioctl call should fail and the user should be notified of the reason. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: fix firmware descriptor copy operationfarah kassabri2022-11-231-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is needed to allow adding more data to the lkd_fw_comms_desc structure. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: add razwi notify eventDani Liberman2022-11-232-62/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each time razwi (read-only zero, write ignored) event happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi2: implement fp32 not supported eventOfir Bitton2022-11-233-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to binning, Gaudi2 does not always support fp32. We add support for such an event in case fp32 is used by the user in such a device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs/gaudi: add page fault notify eventDani Liberman2022-11-234-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each time page fault happens, besides capturing its data, also notify the user about it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| * | | habanalabs: use single threaded WQ for event handlingDani Liberman2022-11-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Creating event queue workqueue using alloc_workqueue made it run in multi threaded mode, which caused parallel dumping of events as well as parallel events notifying to user, causing logs with multiple events to be out of order. Fixed by creating event queue workqueue as single threaded work queue. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>