Friday, March 22, 2013

Storage migrations

I spent much of today setting up a pair of Linux servers to migrate data off of a 2005-vintage Intransa storage array. The Intransa array still works fine, but the end is clearly in sight -- I have a limited supply of spares, and three drives died within the past two months alone. So I set up a 10Gb fiber connection between the iSCSI switch for the Intransa array and the iSCSI switch for my new (old) commodity Linux servers (a pair of previous-generation Supermicro 12-disk servers with a 12-disk JBOD apiece), exported iSCSI volumes via LIO, told Windows to mirror its various volumes to the new volumes, and let 'er rip. Note that I did traditional RAID here because I don't have the cycles or the CPUs to implement an internal cloud for a small company, and these storage servers are also providing regular file shares via NFS and Samba (CIFS). I deliberately kept things as simple as possible in order to make them more easily manageable.

In the process, some clear issues with the current Linux storage stack became apparent. Thumbnail summary: the Linux storage stack is to professional storage stacks such as the old Intransa stack (or modern-day HDC or HP stacks) as Soviet toilet paper was to Charmin. Soviet toilet paper could serve as sandpaper -- it was rough, it was annoying, and it did the job, but you certainly didn't like it. Same deal with the Linux storage stack, with the additional caveat that there are some things the antique Intransa gear would do that are pretty much impossible with the current Linux storage stack.
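For reference, the LIO export step above boils down to a handful of targetcli commands. Here's a minimal sketch, driven from Python just so it's repeatable; the volume path, target IQN, and initiator IQN are all made up, so substitute your own (and depending on your targetcli version you may also need a portals create step):

    #!/usr/bin/env python
    # Minimal sketch: export an existing LVM volume as an iSCSI LUN via LIO's
    # targetcli. The volume path and both IQNs below are hypothetical.
    import subprocess

    VOL = "/dev/vg_data/lv_mirror0"                  # LVM volume to export
    TARGET = "iqn.2013-03.local.storage1:mirror0"    # target IQN (made up)
    INITIATOR = "iqn.1991-05.com.microsoft:winbox"   # Windows initiator IQN (made up)

    def targetcli(command):
        # targetcli accepts a full command line as arguments
        subprocess.check_call(["targetcli"] + command.split())

    # 1. Wrap the block device in a LIO block backstore.
    targetcli("/backstores/block create name=mirror0 dev=%s" % VOL)
    # 2. Create the iSCSI target (a default portal group, tpg1, comes with it).
    targetcli("/iscsi create %s" % TARGET)
    # 3. Map the backstore to a LUN under the target's portal group.
    targetcli("/iscsi/%s/tpg1/luns create /backstores/block/mirror0" % TARGET)
    # 4. Allow the Windows initiator to log in.
    targetcli("/iscsi/%s/tpg1/acls create %s" % (TARGET, INITIATOR))
    # 5. Save the running configuration.
    targetcli("saveconfig")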

Rather than go off on a long rant, here are some things to think about:

  1. The Intransa unit integrated the process of creating a volume, assigning it to a volume group (RAID array) that fit your desired policy (creating one if necessary -- this Intransa installation has six storage cabinets with 16 drives apiece, so individually adding 96 drives to RAID arrays and then managing which arrays your volumes landed on would have been nightmarish), and then exporting the resulting volume as an iSCSI volume. All of this is a multi-step manual process on Linux.
  2. You can create replications to replicate a volume to another Intransa realm (either local or geographically remote) at any point in time after the volume has been created, without taking the volume offline. On Linux, you have to take the volume offline, unexport it from iSCSI and/or NFS, layer drbd on top of it, then tell everybody (iSCSI, NFS, fstab) to access the volume at its drbd device name rather than at the old LVM volume name (see the sketch after this list). Hint: taking volumes offline to do something this simple is *not* acceptable in a production environment!
  3. Scaling out storage by adding storage cabinets is non-trivial. I had to bring up my storage cabinets one at a time so I could RAID6 or RAID10 each cabinet (depending upon whether the share was capacity-oriented or performance-oriented) without spanning cabinets with my RAID groups, because spanning cabinets with SAS is a Bad Idea for a number of reasons. Policy-based storage management -- it's a great idea.
  4. Better hope that your Linux server doesn't kernel panic, because there's no battery-backed RAM cache to keep unwritten data logged. It still mystifies me that nobody has implemented this idea for the Linux software RAID layer. Well, except for Agami back in 2004, and Intransa back in 2004, neither of which is around anymore, and the hardware that implemented the idea is no longer available even if they were. And Agami actually did it at the disk controller level, while Intransa did it by the simple expedient of entirely bypassing the Linux block layer. These first-generation disk cabinets have each 4-disk shelf IP-connected to a gigabit switch pair that uplinks via a 10Gb link to the controllers. iSCSI requests flow in via 10Gb from the iSCSI switch pair to the controllers, get processed internally into volume and RAID requests, which then become disk-shelf read/write requests that flow out the network on the other end of the stack -- and nowhere in any of this does the Linux block layer come into play. That's why it was so easy to add the battery-backed cache: there was no Linux block layer to get in the way.
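On point 2: here's roughly what layering drbd onto an existing LVM volume looks like. The hostnames, IP addresses, and resource name are hypothetical, the drbdadm syntax is the 8.4 flavor, and the volume has to be unmounted and unexported before any of this runs (internal metadata also wants free space at the end of the backing device) -- which is precisely the offline dance I'm complaining about:

    #!/usr/bin/env python
    # Rough sketch of layering drbd on top of an existing LVM volume.
    # Hostnames, addresses, and the resource name are hypothetical, and the
    # volume must already be unmounted/unexported before any of this runs.
    import subprocess

    RESOURCE = "share0"
    CONFIG = """\
    resource share0 {
      device    /dev/drbd0;
      disk      /dev/vg_data/lv_share0;
      meta-disk internal;          # needs free space at the end of the LV
      on storage1 { address 10.0.0.1:7789; }
      on storage2 { address 10.0.0.2:7789; }
    }
    """

    def run(*args):
        subprocess.check_call(list(args))

    # The same resource file goes on both nodes.
    with open("/etc/drbd.d/%s.res" % RESOURCE, "w") as f:
        f.write(CONFIG)

    run("drbdadm", "create-md", RESOURCE)            # write drbd metadata
    run("drbdadm", "up", RESOURCE)                   # attach disk, start replication link
    run("drbdadm", "primary", "--force", RESOURCE)   # only on the node that has the data

    # From here on, LIO backstores, NFS exports, and fstab all have to point
    # at /dev/drbd0 instead of /dev/vg_data/lv_share0 -- hence the downtime.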
That last point brings the role of the Linux block layer to the forefront. The Linux block layer is a primitive thing that was created back in the IDE disk days and hasn't advanced much since. There have been attempts, via write barriers and other mechanisms, to make it work in a more reliable way that doesn't lose filesystems so often, and those efforts have worked to a certain extent. But the reality is that you have LVM and dm and drbd and the various RAID layers and iSCSI and then filesystems all layered on top like a cake, and making sure that data that comes in at the top of the cake makes it to the disk drives at the bottom without a confectionery disaster in between... well, it's not simple. Just ask the BTRFS team. In private. Out of earshot of young children. Because some things are just too horrible for young ears to hear.
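For what it's worth, here's the application's side of that cake -- a generic sketch (nothing specific to this setup, and the path is made up) of what it takes to be reasonably sure a write has actually reached stable storage, assuming every layer underneath honors the flush:

    #!/usr/bin/env python
    # Generic sketch: write a file so that, assuming every layer below honors
    # flushes, both the data and the name are durable when this returns.
    import os

    def durable_write(path, data):
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)           # push file data and metadata down the stack
        finally:
            os.close(fd)
        os.rename(tmp, path)       # atomic replace within one filesystem
        dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dfd)          # make the rename itself durable
        finally:
            os.close(dfd)

    durable_write("/srv/share/example.dat", b"data that had better survive a panic\n")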

And I said I wasn't going to go off on a long rant. Oh well. So anyhow, the next thing I'll do is talk about why I went with traditional RAID cabinets rather than creating a storage cloud, the latter of which would have taken care of some of my storage management issues by making them, well, cloudy. But that is a discussion for another day.

-ELG
