Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-1-2'
Tariq Toukan says:
====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2
This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).
Design
Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:
- MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
behavior, used by bonding, FW LAG commands, v2p_map)
- MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
(used by MPESW shared FDB across all devices)
- specific group_id: iterate only devices in that SD group (used by
per-group SD shared FDB operations)
Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
Lifecycle and ownership
The SD LAG lifecycle is tied to the SD group, not to bonding events:
1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
(priv.lag) for each LAG-capable PF. e.g.: SD primary devices
2. During mlx5_sd_init(), after the SD group is fully formed (primary
and secondaries paired), sd_lag_init() registers the secondary
devices into the primary's existing priv.lag by calling
mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
also gets its group_id set. No separate LAG instance is created.
3. After all the devices in SD group transition to switchdev,
mlx5_lag_shared_fdb_create() is invoked with the group_id to create
a software-only shared FDB scoped to that SD group. This sets
sd_fdb_active on all lag_func entries in the group. No FW LAG
commands are issued since SD devices share the same physical port.
4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
per-group SD shared FDB is torn down first, then MPESW shared FDB is
created spanning all devices (ports + SD secondaries) using
MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
restored.
5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
removes secondaries from priv.lag and clears the primary's group_id.
The LAG structure itself is not destroyed.
The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.
SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.
Patches
Infrastructure (patches 1, 5-6):
- Factor out shared FDB code into a dedicated file
- Extend lag_func with group_id and sd_fdb_active fields;
add XA_MARK_PORT and unified iterator with group_id filter
- Extend shared FDB API with group_id parameter
E-Switch preparation (patches 2-3):
- Align eswitch disable sequence ordering
- Move devcom init from TC to eswitch layer
SD group management (patches 4, 7-9):
- Replace peer count check with direct peer lookup
- Register SD secondaries in the existing LAG at SD init time
- Block RoCE and VF LAG for SD devices
- Block multipath LAG for SD devices
Switchdev integration (patch 10):
- Keep netdev resources local in switchdev mode
Steering (patches 11-12):
- Track peer flow slots with bitmap for selective peer flow deletion
- Enable TC flow steering for SD LAG
Enablement (patch 13):
- Verify unique vhca_id count for cross-VHCA RQT
====================
Link: https://patch.msgid.link/20260531113954.395443-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>