So for this New Year we all made resolutions, right? This year I have set a goal to write at least one article a week. With that in mind, let's get to it, shall we?
Troubleshooting iSCSI connectivity has been one of the most curious pieces of my job for the last few years, and it seems to be one of the most misunderstood.
First of all, let's understand the event logs and where to get some help in understanding them.
iSCSI Initiator Users Guide for Windows 7 and Windows Server 2008 R2
The Event IDs start at approximately page 105, each with a good general description of what the error is; however, that is all you get. It is generic.
Let's examine this event ID:
Event ID 20
Connection to the target was lost. The initiator will attempt to retry the connection.
This event is logged when the initiator loses connection to the target when the connection was in iSCSI Full Feature Phase. This event typically happens when there are network problems, network cable is removed, network switch is shutdown, or target resets the connection. In all cases initiator will attempt to reestablish the TCP connection.
So what does this really mean? Where is my issue? The key is that this is NOT an iSCSI initiator issue; however, everything on the network chain between your host and your storage array is suspect. So when did the issue start? What changed in this environment? Nothing? OS updates? Has your utilization and general usage of the attached volume grown?
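To answer "when did the issue start?", it can help to count how many Event ID 20 entries land in each hour and line that up against your change log. Below is a minimal sketch, assuming you have already copied the event timestamps out of the System log as ISO-style strings; the function name and the sample timestamps are hypothetical, not something from the guide.

```python
from collections import Counter
from datetime import datetime


def disconnects_per_hour(timestamps):
    """Count Event ID 20 occurrences per clock hour, oldest hour first."""
    buckets = Counter()
    for ts in timestamps:
        moment = datetime.fromisoformat(ts)
        # Truncate to the start of the hour so events in the same hour share a key.
        buckets[moment.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(buckets.items()))


# Hypothetical timestamps, as if copied from Event ID 20 entries.
sample = [
    "2014-01-06T02:01:13",
    "2014-01-06T02:14:45",
    "2014-01-06T02:47:02",
    "2014-01-06T09:30:10",
]

hourly = disconnects_per_hour(sample)
for hour, count in hourly.items():
    print(hour.isoformat(), count)
```

A cluster in one window (a backup job, a patch window, a nightly batch) is a strong hint about what changed or what is driving the load.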
So let's start with the lowest-hanging, easiest pieces to resolve. Review your network drivers and consider updating them if possible. NIC firmware and switch firmware are worth considering as well, but the biggest thing to consider is the array firmware. What updates and fixes have been released that could potentially resolve the issue?
In your OS, run a performance trace on the error output of the NICs. Each OS has a slightly different way to do this, and Windows is fairly easy. You could start a trace on the MS iSCSI drivers. It is good, but typically it will show us that the issue is not the service. We can then set up a Data Collector Set for the error output of the NICs to confirm whether this is a hardware error. Focus on the physical network adapter and not the interface - the following counters make it easier to focus in on the error:
Network Interface(*)\Bytes Received/sec
Network Interface(*)\Bytes Sent/sec
Network Interface(*)\Current Bandwidth
Network Interface(*)\Output Queue Length
Network Interface(*)\Packets Outbound Errors
Network Interface(*)\Packets Received Errors
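Once the counters have been logged to CSV, the error columns can be checked with a few lines of Python. This is only a sketch under some assumptions: the log is in the comma-separated layout that Performance Monitor tools write (first column is the timestamp, one column per counter), the error counters are running totals so that any increase during the capture is what matters, and the host name, NIC name, and sample values below are made up for illustration.

```python
import csv
import io


def error_counter_increases(csv_text):
    """Return {counter column: increase} for error counters that grew during the capture."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, samples = rows[0], rows[1:]
    increases = {}
    for col, name in enumerate(header):
        if col == 0 or "Errors" not in name:
            continue  # column 0 is the timestamp; skip non-error counters
        values = [float(row[col]) for row in samples if row[col].strip()]
        if len(values) >= 2 and values[-1] > values[0]:
            increases[name] = values[-1] - values[0]
    return increases


# Hypothetical capture: one NIC, three samples taken 15 seconds apart.
sample = r'''"(PDH-CSV 4.0)","\\HOST01\Network Interface(NIC1)\Packets Outbound Errors","\\HOST01\Network Interface(NIC1)\Packets Received Errors","\\HOST01\Network Interface(NIC1)\Output Queue Length"
"01/06/2014 02:00:00.000","0","12","0"
"01/06/2014 02:00:15.000","0","12","3"
"01/06/2014 02:00:30.000","0","57","1"
'''

growing = error_counter_increases(sample)
for counter, delta in growing.items():
    print(counter, "grew by", delta)
```

An empty result over a window that includes an Event ID 20 is a reasonable sign that the NIC hardware is not where the problem lives.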
No errors reported? Then review the switch logs. Do you have pause frames on the ports that iSCSI is utilizing? If so, on which ones? Typically I see them on the array side, and then I have to deep dive into our array's performance numbers. Often I see the issue is the design of the RAID Group and the number of drives in it.
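To see which ports are actually accumulating pause frames, one simple approach is to record the per-port pause counters twice, some time apart, and compare them. The sketch below assumes you have typed the counts into two dictionaries by hand from your switch's port statistics; the port labels and numbers are hypothetical, and how you read the counters depends entirely on your switch vendor.

```python
def pause_frame_growth(before, after):
    """Return (port, increase) pairs for ports whose pause count grew, largest first."""
    growth = {port: after[port] - before.get(port, 0) for port in after}
    busy = {port: delta for port, delta in growth.items() if delta > 0}
    return sorted(busy.items(), key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken an hour apart on the iSCSI switch.
before = {"Gi1/0/1 (host)": 10, "Gi1/0/2 (host)": 0, "Gi1/0/23 (array)": 5400}
after = {"Gi1/0/1 (host)": 10, "Gi1/0/2 (host)": 4, "Gi1/0/23 (array)": 9800}

report = pause_frame_growth(before, after)
for port, delta in report:
    print(port, "+", delta)
```

If the array-facing ports dominate the list, that points back at the array's ability to keep up rather than at the hosts.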
So, bottom line. Is it an OS problem? Could be, but often no. Is it a switch issue? Unless there are known issues with the firmware, the answer is - No. Is it an array issue? Just as with the switch, firmware fixes can resolve issues, but often the answer here is - not really. Consider this an indicator that your data needs have grown and it is time to reconsider the design and throughput of the environment you have now. Look at the simple things first, which often seem to reduce the issue and allow you to regroup and consider:
>Isolate iSCSI networking to its own physical switches - just make sure each has enough resources to handle the burst traffic that iSCSI is infamous for. Your array vendor should have a matrix of the switches that are tested and validated. You can use those as references.
>Isolate the iSCSI network to its own NICs where possible. If the OS is a guest VM this is really important. When you present a NIC to the hypervisor that is not utilizing SR-IOV, the host has to manage the virtual traffic, and sharing it with the host's iSCSI network can often add extra congestion.
>Consider your array's design. Whether you utilize Disk Pools, RAID Groups or something else - each has its limits in hardware and software. Ensure you are not overrunning these.