Monday, January 5, 2015

Troubleshooting iSCSI Networking

So for this New Year we all made resolutions, right? Mine is to write at least one article a week. With that in mind, let's get to it.

Troubleshooting iSCSI connectivity has been one of the most curious pieces of my job for the last few years, and it seems to be one of the most misunderstood.

So first of all let's understand the event logs and where to get some help in understanding them.

iSCSI Initiator Users Guide for Windows 7 and Windows Server 2008 R2

The Event IDs start at approximately page 105, each with a good general description of what the error is; however, that is all you get. It is generic.

Let's examine this event ID:

Event ID 20
Connection to the target was lost. The initiator will attempt to retry the connection

This event is logged when the initiator loses connection to the target when the connection was in iSCSI Full Feature Phase. This event typically happens when there are network problems, network cable is removed, network switch is shutdown, or target resets the connection. In all cases initiator will attempt to reestablish the TCP connection.

So what does this really mean?  

Where is my issue? The key is that this is NOT an iSCSI initiator issue; however, everything in the network chain, up to and including your storage array, is suspect. So when did the issue start? What changed in this environment? Nothing? OS updates? Has your utilization and general usage of the attached volume grown?

So let's start with the lowest-hanging, easiest pieces to resolve. Review your network drivers and consider updating them if possible. NIC firmware and switch firmware are worth considering as well, but the biggest thing to consider is the array firmware. What updates and fixes have been released that could potentially resolve it?

In your OS, run a performance trace on the error output of the NICs. Each OS has a slightly different way to do this, and Windows is fairly easy. You could start a trace on the Microsoft iSCSI drivers. It is good, but typically it will show us that the issue is not the service. We can then set up a Data Collector Set for the error output of the NICs to confirm whether this is a hardware error. Focus on the physical network adapter and not the interface - the following counters make it easier to focus in on the error.

Network Interface(*)\Bytes Received/sec
Network Interface(*)\Bytes Sent/sec
Network Interface(*)\Current Bandwidth
Network Interface(*)\Output Queue Length
Network Interface(*)\Packets Outbound Errors
Network Interface(*)\Packets Received Errors
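
Once the Data Collector Set has run for a while, the error counters are the ones to check first. Here is a minimal Python sketch of that check. It assumes the log was saved as CSV with the timestamp in the first column and one column per counter instance, and that the error counters are running totals. The host name, NIC names and sample values are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: HOST1, NIC1 and NIC2 are placeholder names.
SAMPLE_CSV = r'''"(PDH-CSV 4.0) (Central Standard Time)(360)","\\HOST1\Network Interface(NIC1)\Packets Outbound Errors","\\HOST1\Network Interface(NIC1)\Packets Received Errors","\\HOST1\Network Interface(NIC2)\Packets Received Errors"
"01/05/2015 10:00:00.000","0","0","0"
"01/05/2015 10:00:15.000","0","3","0"
"01/05/2015 10:00:30.000","0","7","0"
'''


def error_growth(csv_text):
    """Return {counter name: last - first} for error counters that increased."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    growth = {}
    for col, name in enumerate(header):
        # Column 0 is the timestamp; skip everything that is not an error counter.
        if col == 0 or "Errors" not in name:
            continue
        values = []
        for row in samples:
            try:
                values.append(float(row[col]))
            except (ValueError, IndexError):
                continue  # blank or missing sample, ignore it
        # Assumes a cumulative counter: growth between first and last sample
        # means new errors occurred during the capture window.
        if len(values) >= 2 and values[-1] > values[0]:
            growth[name] = values[-1] - values[0]
    return growth


for counter, delta in error_growth(SAMPLE_CSV).items():
    print(f"{counter}: +{delta:.0f} errors during capture")
```

Any counter this reports is a good candidate for a driver, firmware, cable or port check on that specific adapter.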

No errors reported? Then review the switch logs. Do you have pause frames on the ports that iSCSI is using? If so, on which ports? Typically I see them on the array side, and then I have to deep dive into our array's performance numbers. Often I see that the issue is the design of the RAID Group and the number of drives in it.
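
To answer the "which ports" question, it helps to take two snapshots of the per-port pause frame counters a few minutes apart and see which ones moved. How you pull those counters depends entirely on your switch, so this rough Python sketch starts from two plain dictionaries. The port names and counts are hypothetical.

```python
def ports_with_new_pause_frames(before, after):
    """Return {port: increase} for ports whose pause frame count rose."""
    deltas = {}
    for port, count in after.items():
        # A port missing from the first snapshot is treated as starting at 0.
        delta = count - before.get(port, 0)
        if delta > 0:
            deltas[port] = delta
    return deltas


# Hypothetical snapshots taken a few minutes apart.
before = {"port1-host": 0, "port2-host": 0, "port9-array": 1200}
after = {"port1-host": 0, "port2-host": 0, "port9-array": 4850}

for port, delta in ports_with_new_pause_frames(before, after).items():
    print(f"{port}: {delta} new pause frames")
```

If only the array-facing ports show growth, that points at the array struggling to keep up rather than at the hosts.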

So bottom line.  

Is it an OS problem? Could be, but often no. Is it a switch issue? Unless there are known issues with the firmware, the answer is - No. Is it an array issue? Just as with the switch, firmware fixes can resolve issues, but often the answer here is - not really. Consider this an indicator that your data needs have grown and it is time to revisit the design and throughput of the environment you have now. Look at the simple things first, which often reduce the issue and allow you to regroup and consider:

>Isolate iSCSI networking to its own physical switches - just make sure each has enough resources to handle the burst traffic that iSCSI is infamous for. Your array vendor should have a matrix of which switches are tested and validated. You can use those as references.

>Isolate the iSCSI network to its own NICs where possible. If the OS is a guest VM this is really important. When you present a NIC to the hypervisor that is not utilizing SR-IOV, the host has to manage the virtual traffic, and sharing it with the host's iSCSI network can often add extra congestion.

>Consider your array's design. Whether you utilize Disk Pools, RAID Groups or something else - each has its limits in hardware and software. Ensure you are not overrunning these.
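
For the array design point, a back-of-the-envelope check can show whether a RAID Group is being overrun. This Python sketch uses the common rule-of-thumb write penalties (2 for RAID 10, 4 for RAID 5, 6 for RAID 6) and ignores cache, so treat it as a rough estimate only. The drive count, per-drive IOPS and workload numbers are assumptions for illustration; use your array vendor's figures for real sizing.

```python
# Rule-of-thumb back-end I/Os generated per front-end write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def backend_iops(frontend_iops, write_percent, raid_level):
    """Estimate the disk I/Os per second needed to serve a host workload."""
    reads = frontend_iops * (100 - write_percent) / 100
    writes = frontend_iops * write_percent / 100
    return reads + writes * WRITE_PENALTY[raid_level]


def group_capacity(drive_count, iops_per_drive):
    """Estimate the raw I/Os per second a RAID Group can deliver."""
    return drive_count * iops_per_drive


# Hypothetical example: 8 drives at an assumed 150 IOPS each, RAID 5,
# serving 1000 host IOPS of which 30% are writes.
needed = backend_iops(1000, 30, "raid5")
available = group_capacity(8, 150)
print(f"needed: {needed:.0f}, available: {available}")
if needed > available:
    print("The RAID Group is likely being overrun")
```

When the needed figure is well above what the drives can deliver, firmware updates will not fix it; the group needs more drives or a different layout.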
