| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
phydev is assigned a value right away, no need to initialize it.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20230329064414.25028-1-wsa+renesas@sang-engineering.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Eric Dumazet says:
====================
net: rps/rfs improvements
Jason Xing attempted to optimize napi_schedule_rps() by avoiding
unneeded NET_RX_SOFTIRQ raises: [1], [2]
This is quite complex to implement properly. I chose to implement
the idea, and added a similar optimization in ____napi_schedule()
Overall, in an intensive RPC workload, with 32 TX/RX queues with RFS
I was able to observe a ~10% reduction of NET_RX_SOFTIRQ
invocations.
While this had no impact on throughput or cpu costs on this synthetic
benchmark, we know that firing NET_RX_SOFTIRQ from softirq handler
can force __do_softirq() to wakeup ksoftirqd when need_resched() is true.
This can have a latency impact on stressed hosts.
[1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
[2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
====================
Link: https://lore.kernel.org/r/20230328235021.1048163-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
____napi_schedule() adds a napi into current cpu softnet_data poll_list,
then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
from net_rx_action(), because we can process poll_list from this point,
without going to full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Based on initial patch from Jason Xing.
Idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we queued a packet into another cpu backlog.
We can do this only in the context of us being called indirectly
from net_rx_action(), to have the guarantee our rps_ipi_list
will be processed before we exit from net_rx_action().
Link: https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require to know if these helpers are
called from net_rx_action(), instead of being called from
other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|/
|
|
|
|
|
|
|
|
|
| |
napi_schedule_rps() return value is ignored, remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-28
Dragos Tatulea says:
====================
net/mlx5e: RX, Drop page_cache and fully use page_pool
For page allocation on the rx path, the mlx5e driver has been using an
internal page cache in tandem with the page pool. The internal page
cache uses a queue for page recycling which has the issue of head of
queue blocking.
This patch series drops the internal page_cache altogether and uses the
page_pool to implement everything that was done by the page_cache
before:
* Let the page_pool handle dma mapping and unmapping.
* Use fragmented pages with fragment counter instead of tracking via
page ref.
* Enable skb recycling.
The patch series has the following effects on the rx path:
* Improved performance for the cases when there was low page recycling
due to head of queue blocking in the internal page_cache. The test
for this was running a single iperf TCP stream to a rx queue
which is bound on the same cpu as the application.
|-------------+--------+--------+------+---------|
| rq type | before | after | unit | diff |
|-------------+--------+--------+------+---------|
| striding rq | 30.1 | 31.4 | Gbps | 4.14 % |
| legacy rq | 30.2 | 33.0 | Gbps | 8.48 % |
|-------------+--------+--------+------+---------|
* Small XDP performance degradation. The test was is XDP drop
program running on a single rx queue with small packets incoming
it looks like this:
|-------------+----------+----------+------+---------|
| rq type | before | after | unit | diff |
|-------------+----------+----------+------+---------|
| striding rq | 19725449 | 18544617 | pps | -6.37 % |
| legacy rq | 19879931 | 18631841 | pps | -6.70 % |
|-------------+----------+----------+------+---------|
This will be handled in a different patch series by adding support for
multi-packet per page.
* For other cases the performance is roughly the same.
The above numbers were obtained on the following system:
24 core Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
32 GB RAM
ConnectX-7 single port
The breakdown on the patch series is the following:
* Preparations for introducing the mlx5e_frag_page struct.
* Delete the mlx5e_page_cache struct.
* Enable dma mapping from page_pool.
* Enable skb recycling and fragment counting.
* Do deferred release of pages (just before alloc) to ensure better
page_pool cache utilization.
====================
* tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
net/mlx5e: RX, Break the wqe bulk refill in smaller chunks
net/mlx5e: RX, Increase WQE bulk size for legacy rq
net/mlx5e: RX, Split off release path for xsk buffers for legacy rq
net/mlx5e: RX, Defer page release in legacy rq for better recycling
net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags
net/mlx5e: RX, Defer page release in striding rq for better recycling
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
net/mlx5e: RX, Enable skb page recycling through the page_pool
net/mlx5e: RX, Enable dma map and sync from page_pool allocator
net/mlx5e: RX, Remove internal page_cache
net/mlx5e: RX, Store SHAMPO header pages in array
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation
====================
Link: https://lore.kernel.org/r/20230328205623.142075-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The recycle parameter used during page release is no longer
necessary: the page pool can detect when the page cannot be
recycled to the cache or ring without any outside hint.
The page pool will also take care of cleaning up after itself
once all the inflight pages have been released. So no need to
explicitly release pages to the system.
Remove the internal page_cache stats as the mlx5e_page_cache
struct no longer exists.
Delete the documentation entries along with the stats.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
To avoid overflowing the page pool's cache, don't release the
whole bulk which is usually larger than the cache refill size.
Group release+alloc instead into cache refill units that
allow releasing to the cache and then allocating from the cache.
A refill_unit variable is added as a iteration unit over the
wqe_bulk when doing release+alloc.
For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 0% to 52%:
+---------------------------------------------+
| Page Pool stats (/sec) | Before | After |
+-------------------------+---------+---------+
|rx_pp_alloc_fast | 2145422 | 2193802 |
|rx_pp_alloc_slow | 2 | 0 |
|rx_pp_alloc_empty | 2 | 0 |
|rx_pp_alloc_refill | 34059 | 16634 |
|rx_pp_alloc_waive | 0 | 0 |
|rx_pp_recycle_cached | 0 | 1145818 |
|rx_pp_recycle_cache_full | 0 | 0 |
|rx_pp_recycle_ring | 2179361 | 1064616 |
|rx_pp_recycle_ring_full | 121 | 0 |
+---------------------------------------------+
With this patch, the performance for legacy rq for the above test is
back to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Deferred page release was added to legacy rq but its desired effect
(driver releases last fragment to page pool cache) is not yet visible
due to the WQE bulks being too small.
This patch increases the WQE bulk size to span 512 KB and clip it to
one quarter of the rx queue size.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Don't mix xsk buffer releases with page releases anymore. This is
needed for handling of deferred page release.
Add a new bulk free function for xsk buffers from wqe frags.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, fragmented pages from the page pool can be released
in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the
associated skb fragments have been released. This path allows
recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which
will always release pages to the page pool ring
(allow_direct == false).
Whichever is releasing the last fragment will be decisive on
where the page gets released: the cache or the ring. So we
obviously want to maximize for doing the release from 1.
This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. A flag is
added to make sure that there's no release before first alloc
and that XDP_TX fragments are not released prematurely.
This is a preparation patch that doesn't unlock the performance
improvements yet. A followup patch will do that.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Change the bool flag to a bitfield as we'll use it in a downstream patch
in the series to add signaling about skipping a fragment release.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, for striding RQ, fragmented pages from the page pool can
get released in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the
associated skb fragments have been released. This path allows
recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which
will always release pages to the page pool ring
(allow_direct == false).
Whichever is releasing the last fragment will be decisive on
where the page gets released: the cache or the ring. So we
obviously want to maximize for doing the release from 1.
This patch does that by deferring the release of page fragments
right before requesting new ones from the page pool. Extra care
needs to be taken for the corner cases:
* On first call, make sure that release is not called. The
skip_release_bitmap is used for this purpose.
* On rq shutdown, make sure that all wqes that were not
in the linked list are released.
For a single ring, single core, default MTU (1500) TCP stream
test the number of pages allocated from the cache directly
(rx_pp_recycle_cached) increases from 31 % to 98 %:
+----------------------------------------------+
| Page Pool stats (/sec) | Before | After |
+-------------------------+---------+----------+
|rx_pp_alloc_fast | 2137754 | 2261033 |
|rx_pp_alloc_slow | 47 | 9 |
|rx_pp_alloc_empty | 47 | 9 |
|rx_pp_alloc_refill | 23230 | 819 |
|rx_pp_alloc_waive | 0 | 0 |
|rx_pp_recycle_cached | 672182 | 2209015 |
|rx_pp_recycle_cache_full | 1789 | 0 |
|rx_pp_recycle_ring | 1485848 | 52259 |
|rx_pp_recycle_ring_full | 3003 | 584 |
+----------------------------------------------+
With this patch, the performance in striding rq for the above test is
back to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The xdp_xmit_bitmap currently serves only one purpose: to avoid
releasing pages that are still in use due to XDP TX.
A following patch will use this bitmap in a slightly different context
but for the same purpose. So rename the bitmap to a more generic name
that reflects the purpose not the context.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Start using the page_pool skb recycling api to recycle all pages back to
the page pool and stop using atomic page reference counting.
The mlx5e driver used to manage in-flight pages using page refcounting:
for each fragment there were 2 atomic write operations happening (one
for building the skb and one on skb release).
The page_pool api introduced a method to track page fragments more
optimally:
* The page's pp_fragment_count is set to a large bias on page alloc
(1 x atomic write operation).
* The driver tracks the actual page fragments in a non atomic variable.
* When the skb is recycled, pp_fragment_count is decremented
(atomic write operation).
* When page is released in the driver, the unused number of fragments
(relative to the bias) is deducted from pp_fragment_count (atomic
write operation).
* Last page defragmentation will only be an atomic read.
So in total there are `number of fragments + 1` atomic write ops. As
opposed to previously: `2 * frags` atomic writes ops.
Pages are wrapped in a mlx5e_frag_page structure which also contains the
number of fragments. This makes it easy to count the fragments in the
driver.
This change brings performance improvements for the case when the old rx
page_cache had low recycling rates due to head of queue blocking. For a
iperf3 TCP test with a single stream, on a single core (iperf and receive
queue running on same core), the following improvements can be noticed:
* Striding rq:
- before (net-next baseline): bitrate = 30.1 Gbits/sec
- after : bitrate = 31.4 Gbits/sec (diff: 4.14 %)
* Legacy rq:
- before (net-next baseline): bitrate = 30.2 Gbits/sec
- after : bitrate = 33.0 Gbits/sec (diff: 8.48 %)
There are 2 temporary performance degradations introduced:
1) TCP streams that had a good recycling rate with the old page_cache
have a degradation for both striding and linear rq. This is due to
very low page pool cache recycling: the pages are released during skb
recycle which will release pages to the page pool ring for safety.
The following patches in this series will tackle this problem by
deferring the page release in the driver to increase the
chance of having pages recycled to the cache.
2) XDP performance is now lower (4-5 %) due to the higher number of
atomic operations used for fragment management. But this opens the
door for supporting multiple packets per page in XDP, which will
bring a big gain.
Otherwise, performance is similar to baseline.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Remove driver dma mapping and unmapping of pages. Let the
page_pool api do it.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch removes the internal rx page_cache and uses the generic
page_pool api only. It used to be that the page_pool couldn't handle all
the mlx5 driver usecases, but with the introduction of skb recycling and
page fragmentaton in the page_pool full switch can now be made. Some
benfits of this transition:
* Better page recycling in the cases when the page_cache was suffering
from head of queue blocking. The page_pool doesn't have this issue.
* DMA mapping/unmapping can be managed by the page_pool.
* mlx5e_rq size reduced by more than 50% due to the page_cache array
being deleted.
This patch only removes the page_cache. Downstream patches will enable
the required page_pool features and will add further fine-tuning.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Save allocated SHAMPO header pages to an array to which the
mlx5e_dma_info page will point to.
This change is a preparation for introducing mlx5e_frag_page structure
in a downstream patch. There's no new functionality introduced.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This change removes the usage of mlx5e_alloc_unit union for
striding rq. The change is more straightforward than legacy rq as
the alloc units union is already in place.
This patch only moves things around: instead of an array of unions make
it a union of arrays. This has the effect that each mlx5e_mpw_info will
allocate the largest possible size of the array member. For xsk this
means that the array of xdp_buff pointers for the wqe will still be
contiguous, but there will be some extra unused bytes at the end of the
array.
As further patch in the series will add the mlx5e_frag_page struct for
which the described size constraint will no longer hold.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The mlx5e_alloc_unit union is conveniently used to store arrays of
pointers to struct page or struct xdp_buff (for xsk). The union is
currently expected to have the size of a pointer for xsk batch
allocations to work. This is conveniet for the current state of the
code but makes it impossible to add a structure of a different size
to the alloc unit.
A further patch in the series will add the mlx5e_frag_page struct for
which the described size constraint will no longer hold.
This change removes the usage of mlx5e_alloc_unit union for legacy rq:
- A union of arrays is introduced (mlx5e_alloc_units) to replace the
array of unions to allow structures of different sizes.
- Each fragment has a pointer to a unit in the mlx5e_alloc_units array.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Change internal page cache and page pool api to use a struct page **
instead of a mlx5e_alloc_unit *.
This is the first change in a series which is meant to remove the
mlx5e_alloc_unit altogether.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
clang 16.0.0 with W=1 reports:
drivers/net/ethernet/amazon/ena/ena_netdev.c:1901:6: error: variable 'tx_bytes' set but not used [-Werror,-Wunused-but-set-variable]
u32 tx_bytes = 0;
The variable is not used so remove it.
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/r/20230328151958.410687-1-horms@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The h and the f letters are swapped so it unlocks the wrong lock.
Fixes: 577f0d1b1c5f ("octeon_ep: add separate mailbox command and response queues")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/251aa2a2-913e-4868-aac9-0a90fc3eeeda@kili.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Adding a DFL (Device Feature List) device driver of ToD device for
Intel FPGA cards.
The Intel FPGA Time of Day(ToD) IP within the FPGA DFL bus is exposed
as PTP Hardware clock(PHC) device to the Linux PTP stack to synchronize
the system clock to its ToD information using phc2sys utility of the
Linux PTP stack. The DFL is a hardware List within FPGA, which defines
a linked list of feature headers within the device MMIO space to provide
an extensible way of adding subdevice features.
Signed-off-by: Raghavendra Khadatare <raghavendrax.anand.khadatare@intel.com>
Signed-off-by: Tianfei Zhang <tianfei.zhang@intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20230328142455.481146-1-tianfei.zhang@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The HNS3 driver supports Wake-on-LAN, which can wake up
the server from power off state to power on state by magic
packet or magic security packet.
ChangeLog:
v1->v2:
Deleted the debugfs function that overlaps with the ethtool function
from suggestion of Andrew Lunn.
v2->v3:
Return the wol configuration stored in driver,
suggested by Alexander H Duyck.
v3->v4:
Add a helper to go from netdev to the local struct,
suggested by Simon Horman and Jakub Kicinski.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Edward Cree says:
====================
sfc: support TC decap rules
This series adds support for offloading tunnel decapsulation TC rules to
ef100 NICs, allowing matching encapsulated packets to be decapsulated in
hardware and redirected to VFs.
For now an encap match must be on precisely the following fields:
ethertype (IPv4 or IPv6), source IP, destination IP, ipproto UDP,
UDP destination port. This simplifies checking for overlaps in the
driver; the hardware supports a wider range of match fields which
future driver work may expose.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
A 'foreign' rule is one for which the net_dev is not the sfc netdevice
or any of its representors. The driver registers indirect flow blocks
for tunnel netdevs so that it can offload decap rules. For example:
tc filter add dev vxlan0 parent ffff: protocol ipv4 flower \
enc_src_ip 10.1.0.2 enc_dst_ip 10.1.0.1 \
enc_key_id 1000 enc_dst_port 4789 \
action tunnel_key unset \
action mirred egress redirect dev $REPRESENTOR
When notified of a rule like this, register an encap match on the IP
and dport tuple (creating an Outer Rule table entry) and insert an MAE
action rule to perform the decapsulation and deliver to the representee.
Moved efx_tc_delete_rule() below efx_tc_flower_release_encap_match() to
avoid the need for a forward declaration.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a hashtable to detect duplicate and conflicting matches. If match
is not a duplicate, call MAE functions to add/remove it from OR table.
Calling code not added yet, so mark the new functions as unused.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
An encap match corresponds to an entry in the exact-match Outer Rule
table; the lookup response includes the encap type (protocol) allowing
the hardware to continue parsing into the inner headers.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Translate the fields from flow dissector into struct efx_tc_match.
In efx_tc_flower_replace(), reject filters that match on them, because
only 'foreign' filters (i.e. those for which the ingress dev is not
the sfc netdev or any of its representors, e.g. a tunnel netdev) can
use them.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Extend the MAE caps check to validate that the hardware supports these
outer-header matches where used by the driver.
Extend efx_mae_populate_match_criteria() to fill in the outer rule ID
and VNI match fields.
Nothing yet populates these match fields, nor creates outer rules.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| | |
Includes an explanation of the lifetime of the 'cursor' action-set `act`.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Herbert Xu says:
====================
macvlan: Allow some packets to bypass broadcast queue
This patch series allows some packets to bypass the broadcast
queue on receive. Currently all multicast packets are queued
on receive and then processed in a work queue. This is to avoid
an unbounded amount of work occurring in the receive path, as
one broadcast packet could easily translate into 4,000 packets.
However, for multicast packets with just one receiver (possible
for IPv6 ND), this introduces unnecessary latency as the packet
will go to exactly one device.
This series allows such multicast packets to be processed inline.
It also adds a toggle which lets the admin control what threshold
to set between queueing and not queueing. A follow-up patch for
iproute will be posted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Make the broadcast cutoff configurable through netlink. Note
that macvlan is weird because there is no central device for
us to configure (the lowerdev could be anything). So all the
options are duplicated over what could be thousands of child
devices.
IFLA_MACVLAN_BC_QUEUE_LEN took the approach of taking the maximum
of all child device settings. This is unnecessary as we could
simply store the option in the port device and take the last
child device that gets updated as the value to use.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
As it stands all broadcast and multicast packets are queued and
processed in a work queue. This is so that we don't overwhelm
the receive softirq path by generating thousands of packets or
more (see commit 412ca1550cbe "macvlan: Move broadcasts into a
work queue").
As such all multicast packets will be delayed, even if they will
be received by a single macvlan device. As using a workqueue
is not free in terms of latency, we should avoid this where possible.
This patch adds a new filter to determine which addresses should
be delayed and which ones won't. This is done using a crude
counter of how many times an address has been added to the macvlan
port (ha->synced). For now if an address has been added more than
once, then it will be considered to be broadcast. This could be
tuned further by making this threshold configurable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Matthieu Baerts says:
====================
mptcp: a couple of cleanups and improvements
Patch 1 removes an unneeded address copy in subflow_syn_recv_sock().
Patch 2 simplifies subflow_syn_recv_sock() to postpone some actions and
to avoid a bunch of conditionals.
Patch 3 stops reporting limits that are not taken into account when the
userspace PM is used.
Patch 4 adds a new test to validate that the 'subflows' field reported
by the kernel is correct. Such info can be retrieved via Netlink (e.g.
with ss) or getsockopt(SOL_MPTCP, MPTCP_INFO).
---
Changes in v2:
- Patch 3/4's commit message has been updated to use the correct SHA
- Rebased on latest net-next
- Link to v1: https://lore.kernel.org/r/20230324-upstream-net-next-20230324-misc-features-v1-0-5a29154592bd@tessares.net
====================
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch adds the mptcp_info fields tests in endpoint_tests(). Add a
new function chk_mptcp_info() to check the given number of the given
mptcp_info field.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/330
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Only the in-kernel PM uses the number of address and subflow limits
allowed per connection.
It then makes more sense not to display such info when other PMs are
used not to confuse the userspace by showing limits not being used.
While at it, we can get rid of the "val" variable and add indentations
instead.
It would have been good to have done this modification directly in
commit 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
but as we change a bit the behaviour, it is fine not to backport it to
stable.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Postpone the msk cloning to the child process creation
so that we can avoid a bunch of conditionals.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/61
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In the syn_recv fallback path, the msk is unused. We can skip
setting the socket address.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Kuniyuki Iwashima says:
====================
ipv6: Random cleanup for in6addr_any.
The first patch removes in6addr_any alternatives and the second
removes redundant initialisation of a local variable.
Changes:
v2: Use ipv6_addr_any() in patch 1. (David Ahern)
v1: https://lore.kernel.org/netdev/20230322012204.33157-1-kuniyu@amazon.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We'll call memset(&tmp, 0, sizeof(tmp)) later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some code defines the IPv6 wildcard address as a local variable and
use it with memcmp() or ipv6_addr_equal().
Let's use in6addr_any and ipv6_addr_any() instead.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Bobby Eshleman says:
====================
Add support for sockmap to vsock.
We're testing usage of vsock as a way to redirect guest-local UDS
requests to the host and this patch series greatly improves the
performance of such a setup.
Compared to copying packets via userspace, this improves throughput by
121% in basic testing.
Tested as follows.
Setup: guest unix dgram sender -> guest vsock redirector -> host vsock
server
Threads: 1
Payload: 64k
No sockmap:
- 76.3 MB/s
- The guest vsock redirector was
"socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
Using sockmap (this patch):
- 168.8 MB/s (+121%)
- The guest redirector was a simple sockmap echo server,
redirecting unix ingress to vsock 2:1234 egress.
- Same sender and server programs
*Note: these numbers are from RFC v1
Only the virtio transport has been tested. The loopback transport was
used in writing bpf/selftests, but not thoroughly tested otherwise.
This series requires the skb patch.
Changes in v4:
- af_vsock: fix parameter alignment in vsock_dgram_recvmsg()
- af_vsock: add TCP_ESTABLISHED comment in vsock_dgram_connect()
- vsock/bpf: change ret type to bool
Changes in v3:
- vsock/bpf: Refactor wait logic in vsock_bpf_recvmsg() to avoid
backwards goto
- vsock/bpf: Check psock before acquiring slock
- vsock/bpf: Return bool instead of int of 0 or 1
- vsock/bpf: Wrap macro args __sk/__psock in parens
- vsock/bpf: Place comment trailer */ on separate line
Changes in v2:
- vsock/bpf: rename vsock_dgram_* -> vsock_*
- vsock/bpf: change sk_psock_{get,put} and {lock,release}_sock() order
to minimize slock hold time
- vsock/bpf: use "new style" wait
- vsock/bpf: fix bug in wait log
- vsock/bpf: add check that recvmsg sk_type is one dgram, seqpacket, or
stream. Return error if not one of the three.
- virtio/vsock: comment __skb_recv_datagram() usage
- virtio/vsock: do not init copied in read_skb()
- vsock/bpf: add ifdef guard around struct proto in dgram_recvmsg()
- selftests/bpf: add vsock loopback config for aarch64
- selftests/bpf: add vsock loopback config for s390x
- selftests/bpf: remove vsock device from vmtest.sh qemu machine
- selftests/bpf: remove CONFIG_VIRTIO_VSOCKETS=y from config.x86_64
- vsock/bpf: move transport-related (e.g., if (!vsk->transport)) checks
out of fast path
====================
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a test case testing the redirection from connectible AF_VSOCK
sockets to connectible AF_UNIX sockets.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add vsock loopback to the test kernel.
This allows sockmap for vsock to be tested.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch adds sockmap support for vsock sockets. It is intended to be
usable by all transports, but only the virtio and loopback transports
are implemented.
SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This adds the vsock_perf binary to the gitignore file.
Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility")
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Reviewed-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230327-vsock-add-vsock-perf-to-ignore-v1-1-f28a84f3606b@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Donald Hunter says:
====================
ynl: add support for user headers and struct attrs
Add support for user headers and struct attrs to YNL. This patchset adds
features to ynl and add a partial spec for openvswitch that demonstrates
use of the features.
Patch 1-4 add features to ynl
Patch 5 adds partial openvswitch specs that demonstrate the new features
Patch 6-7 add documentation for legacy structs and for sub-type
====================
Link: https://lore.kernel.org/r/20230327083138.96044-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|