linux-stable.git/drivers/scsi, branch v5.13.2

scsi: core: Retry I/O for Notify (Enable Spinup) Required error

2021-07-14T15:07:51+00:00

commit 104739aca4488909175e9e31d5cd7d75b82a2046 upstream.

If the device is power-cycled, it takes time for the initiator to transmit
the periodic NOTIFY (ENABLE SPINUP) SAS primitive, and for the device to
respond to the primitive to become ACTIVE. Retry the I/O request to allow
the device time to become ACTIVE.
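
As a rough illustration of the retry decision described above, the sketch
below maps a reported sense triple to an action. It assumes the condition is
reported as sense key NOT READY (0x02) with ASC 0x04 / ASCQ 0x11, which is
how the SCSI additional sense code list names "NOTIFY (ENABLE SPINUP)
REQUIRED". The enum and function names are hypothetical and are not the
identifiers used in the SCSI core.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; the SCSI core has its own enums. */
enum retry_action_sketch {
	SKETCH_ACTION_FAIL,
	SKETCH_ACTION_DELAYED_RETRY,
};

#define SKETCH_SENSE_KEY_NOT_READY 0x02	/* sense key NOT READY */

/*
 * Decide whether a failed command should be retried after a delay.
 * Only the ASC 0x04 / ASCQ 0x11 case from this commit is modelled;
 * every other combination is treated as a failure in this sketch.
 */
static enum retry_action_sketch
classify_not_ready_sketch(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
	if (sense_key == SKETCH_SENSE_KEY_NOT_READY &&
	    asc == 0x04 && ascq == 0x11)
		return SKETCH_ACTION_DELAYED_RETRY;

	return SKETCH_ACTION_FAIL;
}
```

A delayed retry, rather than an immediate one, gives the initiator time to
send the primitive and the device time to spin up.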

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210629155826.48441-1-quat.le@oracle.com
Reviewed-by: Bart Van Assche 
Signed-off-by: Quat Le 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

scsi: libfc: Correct the condition check and invalid argument passed

2021-07-14T15:07:49+00:00

commit 8f70328c068f9f5c5db82848724cb276f657b9cd upstream.

An incorrect condition check was leading to data corruption.

Link: https://lore.kernel.org/r/20210603101404.7841-3-jhasan@marvell.com
Fixes: 8fd9efca86d0 ("scsi: libfc: Work around -Warray-bounds warning")
CC: stable@vger.kernel.org
Reviewed-by: Himanshu Madhani 
Signed-off-by: Javed Hasan 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

scsi: lpfc: Fix Node recovery when driver is handling simultaneous PLOGIs

2021-07-14T15:07:49+00:00

commit 4012baeab6ca22b7f7beb121b6d0da0a62942fdd upstream.

When lpfc is handling a solicited and unsolicited PLOGI with another
initiator, the remote initiator is never recovered. The node for the
initiator is erroneously removed and all resources released.

In lpfc_cmpl_els_plogi(), when lpfc_els_retry() returns a failure code, the
driver is calling the state machine with a device remove event because the
remote port is not currently registered with the SCSI or NVMe
transports. The issue is that on a PLOGI "collision" the driver correctly
aborts the solicited PLOGI and allows the unsolicited PLOGI to complete the
process, but this process is interrupted with a device_rm event.

Introduce logic in the PLOGI completion to capture the PLOGI collision
event and jump out of the routine.  This will avoid removal of the node.
If there is no collision, the normal node removal will occur.
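
The control flow described above can be sketched as a small decision
function. This is only a model of the prose: the struct, its fields, and
the function name are hypothetical, and how the driver actually detects a
collision is not shown in this log.

```c
#include <stdbool.h>

enum plogi_outcome_sketch {
	SKETCH_KEEP_NODE,
	SKETCH_REMOVE_NODE,
};

/* Hypothetical snapshot of the state seen at PLOGI completion. */
struct plogi_state_sketch {
	bool retry_failed;	/* the retry routine returned a failure code */
	bool registered;	/* registered with the SCSI or NVMe transport */
	bool collision;		/* an unsolicited PLOGI is completing the login */
};

static enum plogi_outcome_sketch
plogi_cmpl_sketch(const struct plogi_state_sketch *s)
{
	if (!s->retry_failed)
		return SKETCH_KEEP_NODE;

	/* New step: on a collision, jump out before the removal event. */
	if (s->collision)
		return SKETCH_KEEP_NODE;

	/* No collision: the normal removal of an unregistered node. */
	if (!s->registered)
		return SKETCH_REMOVE_NODE;

	return SKETCH_KEEP_NODE;
}
```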

Fixes: 52edb2caf675 ("scsi: lpfc: Remove ndlp when a PLOGI/ADISC/PRLI/REG_RPI ultimately fails")
Cc:  # v5.11+
Link: https://lore.kernel.org/r/20210514195559.119853-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee 
Signed-off-by: Justin Tee 
Signed-off-by: James Smart 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

scsi: lpfc: Fix unreleased RPIs when NPIV ports are created

2021-07-14T15:07:49+00:00

commit 01131e7aae5d30e23e3cdd1eebe51bbc5489ae8f upstream.

While testing NPIV and watching logins and used RPI levels, it was seen that
the used RPI count was much higher than the number of remote ports
discovered.

Code inspection showed that remote port removals on any NPIV instance are
releasing the RPI, but not performing an UNREG_RPI with the adapter, thus
the reference counting never fully drops and the RPI is never fully
released. This was happening on NPIV nodes due to a lot of fabric ELS's to
fabric addresses. This lack of UNREG_RPI was introduced by a prior node
rework patch that performed the UNREG_RPI as part of node cleanup.

To resolve the issue, do the following:

 - Restore the RPI release code, but move its location so that it is in
   line with the new node cleanup design.

 - NPIV ports now release the RPI and drop the node when the caller sets
   the NLP_RELEASE_RPI flag.

 - Set the NLP_RELEASE_RPI flag in node cleanup, which will trigger a
   release of the RPI to the free pool.

 - Ensure there's an UNREG_RPI at LOGO completion so that RPI release is
   completed.

 - Stop offline_prep from skipping nodes that are UNUSED. The RPI may
   not have been released.

 - Stop the default RPI handling in lpfc_cmpl_els_rsp() for SLI4.

 - Fix up debugfs RPI displays for better debugging.

Fixes: a70e63eee1c1 ("scsi: lpfc: Fix NPIV Fabric Node reference counting")
Link: https://lore.kernel.org/r/20210514195559.119853-2-jsmart2021@gmail.com
Cc:  # v5.11+
Co-developed-by: Justin Tee 
Signed-off-by: Justin Tee 
Signed-off-by: James Smart 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

scsi: megaraid_sas: Send all non-RW I/Os for TYPE_ENCLOSURE device through firmware

2021-07-14T15:07:49+00:00

commit 79db830162b733f5f3ee80f0673eeeb0245fe38b upstream.

The driver issues all non-ReadWrite I/Os for TYPE_ENCLOSURE devices through
the fast path with an invalid dev handle. The fast path in turn directs all
the I/Os to the firmware. As firmware stopped handling those I/Os from the
SAS3.5 generation of controllers (Ventura generation and onwards), this
will lead to I/O failures.

Switch the driver to issue all the non-ReadWrite I/Os for TYPE_ENCLOSURE
devices directly to firmware for SAS3.5 generation of controllers and
later.
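
The routing rule above amounts to a three-way condition. The sketch below
models it with hypothetical types and names; it does not reproduce the
driver's data structures, and "unchanged" simply stands for whatever path
the driver would otherwise have chosen.

```c
#include <stdbool.h>

enum io_path_sketch {
	SKETCH_PATH_UNCHANGED,	/* whatever the driver chose before */
	SKETCH_PATH_FIRMWARE,	/* issue the I/O directly to firmware */
};

/* Hypothetical summary of the inputs the decision depends on. */
struct io_desc_sketch {
	bool is_enclosure;	/* target is a TYPE_ENCLOSURE device */
	bool is_read_write;	/* command is a read or a write */
	bool sas35_or_later;	/* Ventura-generation controller or newer */
};

static enum io_path_sketch choose_path_sketch(const struct io_desc_sketch *io)
{
	/* Only non-RW enclosure I/O on SAS3.5+ controllers is rerouted. */
	if (io->is_enclosure && !io->is_read_write && io->sas35_or_later)
		return SKETCH_PATH_FIRMWARE;

	return SKETCH_PATH_UNCHANGED;
}
```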

Link: https://lore.kernel.org/r/20210528131307.25683-2-chandrakanth.patil@broadcom.com
Cc:  # v5.11+
Signed-off-by: Chandrakanth Patil 
Signed-off-by: Sumit Saxena 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman

scsi: mpt3sas: Fix error return value in _scsih_expander_add()

2021-07-14T15:07:42+00:00

[ Upstream commit d6c2ce435ffe23ef7f395ae76ec747414589db46 ]

When an expander does not contain any 'phys', an appropriate error code -1
should be returned, as done elsewhere in this function. However, we
currently do not explicitly assign this error code to 'rc'. As a result, 0
was incorrectly returned.
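
The bug class is easy to reproduce in miniature: a function initializes
'rc' to 0 and jumps to a shared error label without assigning it. The
sketch below shows the corrected shape with a hypothetical struct and
function name; it is not the driver function itself.

```c
/* Hypothetical stand-in for the expander being added. */
struct expander_sketch {
	int num_phys;
};

static int expander_add_sketch(const struct expander_sketch *ex)
{
	int rc = 0;

	if (!ex->num_phys) {
		rc = -1;	/* without this assignment, 0 would be returned */
		goto out_fail;
	}

	/* ... the rest of the setup would go here ... */
	return 0;

out_fail:
	/* cleanup shared by every error path would go here */
	return rc;
}
```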

Link: https://lore.kernel.org/r/20210514081300.6650-1-thunder.leizhen@huawei.com
Fixes: f92363d12359 ("[SCSI] mpt3sas: add new driver supporting 12GB SAS")
Reported-by: Hulk Robot 
Signed-off-by: Zhen Lei 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Sasha Levin

scsi: iscsi: Flush block work before unblock

2021-07-14T15:07:34+00:00

[ Upstream commit 7ce9fc5ecde0d8bd64c29baee6c5e3ce7074ec9a ]

We set the max_active iSCSI EH works to 1, so all work is going to execute
in order by default. However, userspace can now override this in sysfs. If
max_active > 1, we can end up with the block_work on CPU1 and
iscsi_unblock_session running the unblock_work on CPU2 and the session and
target/device state will end up out of sync with each other.

This adds a flush of the block_work in iscsi_unblock_session.

Link: https://lore.kernel.org/r/20210525181821.7617-17-michael.christie@oracle.com
Fixes: 1d726aa6ef57 ("scsi: iscsi: Optimize work queue flush use")
Reviewed-by: Lee Duncan 
Signed-off-by: Mike Christie 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Sasha Levin

scsi: iscsi: Fix in-kernel conn failure handling

2021-07-14T15:07:33+00:00

[ Upstream commit 23d6fefbb3f6b1cc29794427588b470ed06ff64e ]

Commit 0ab710458da1 ("scsi: iscsi: Perform connection failure entirely in
kernel space") has the following regressions/bugs that this patch fixes:

1. It can return cmds to upper layers like dm-multipath, which can then
retry them. After the retries succeed, the fs/app can send new I/O to the
same sectors, but we've left the cmds running in FW or in the net layer.
We need to be calling ep_disconnect if userspace is not up.

This patch only fixes the issue for offload drivers. iscsi_tcp will be
fixed in a separate commit because it doesn't have an ep_disconnect call.

2. The drivers that implement ep_disconnect expect that it's called before
conn_stop. Besides crashes, if the cleanup_task callout is called before
ep_disconnect, it might free up driver/card resources for session1, which
could then be allocated for session2. But because the driver's
ep_disconnect has not been called, it has not cleaned up the firmware, so
the card is still using the resources for the original cmd.

3. The stop_conn_work_fn can run after userspace has done its recovery and
we are happily using the session. We will then end up with various bugs
depending on what is going on at the time.

We may also run stop_conn_work_fn late after userspace has called stop_conn
and ep_disconnect and is now going to call start/bind conn. If
stop_conn_work_fn runs after bind but before start, we would leave the conn
in an unbound but sort of started state where I/O might be allowed even
though the drivers have been set in a state where they no longer expect
I/O.

4. Returning -EAGAIN in iscsi_if_destroy_conn if we haven't yet run the
in-kernel stop_conn function is breaking userspace. We should have been
doing this for the caller.

Link: https://lore.kernel.org/r/20210525181821.7617-8-michael.christie@oracle.com
Fixes: 0ab710458da1 ("scsi: iscsi: Perform connection failure entirely in kernel space")
Reviewed-by: Lee Duncan 
Signed-off-by: Mike Christie 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Sasha Levin

scsi: iscsi: Rel ref after iscsi_lookup_endpoint()

2021-07-14T15:07:33+00:00

[ Upstream commit 9e5fe1700896c85040943fdc0d3fee0dd3e0d36f ]

Subsequent commits allow the kernel to do ep_disconnect. In that case we
will have to get a proper refcount on the ep so one thread does not delete
it from under another.
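
The get/put discipline this implies can be illustrated with a plain counter.
The sketch below is a single-threaded model with hypothetical names; real
kernel code would use a proper refcount primitive and locking, which are
omitted here.

```c
#include <stddef.h>

/* Hypothetical endpoint record; refcount 0 means the slot is dead. */
struct endpoint_sketch {
	int id;
	int refcount;
};

/* Look up an endpoint by id and take a reference on success. */
static struct endpoint_sketch *
lookup_endpoint_sketch(struct endpoint_sketch *tbl, size_t n, int id)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].id == id && tbl[i].refcount > 0) {
			tbl[i].refcount++;
			return &tbl[i];
		}
	}
	return NULL;
}

/* Drop the reference taken by a successful lookup. */
static void put_endpoint_sketch(struct endpoint_sketch *ep)
{
	ep->refcount--;
}
```

Every successful lookup is paired with exactly one put once the caller is
done with the endpoint, so a concurrent teardown cannot free it while it is
still in use.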

Link: https://lore.kernel.org/r/20210525181821.7617-7-michael.christie@oracle.com
Reviewed-by: Lee Duncan 
Signed-off-by: Mike Christie 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Sasha Levin

scsi: iscsi: Use system_unbound_wq for destroy_work

2021-07-14T15:07:33+00:00

[ Upstream commit b25b957d2db1585602c2c70fdf4261a5641fe6b7 ]

Use the system_unbound_wq for async session destruction. We don't need a
dedicated workqueue for async session destruction because:

 1. perf does not seem to be an issue since we only allow 1 active work.

 2. it does not have deps with other system works and we can run them in
    parallel with each other.

Link: https://lore.kernel.org/r/20210525181821.7617-6-michael.christie@oracle.com
Reviewed-by: Lee Duncan 
Signed-off-by: Mike Christie 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Sasha Levin