Thursday, October 24, 2019

Red Hat Software has phone company syndrome

Specifically: They don't care, because they don't have to.

Google "jboss sucks." You get about 946,000 results. Google "tomcat sucks". You get about 260,000 results.

So which one did Red Hat Software omit from Red Hat Enterprise Linux 8, after it had been in the distribution for close to two decades?

Hint: It's the one that sucks less.

Yeah, Red Hat kept the one that sucks more (JBoss) and dumped the one that sucks less (Tomcat). Because they want to push their bloated, buggy application server that nobody really likes (JBoss), and if that requires breaking everybody's migration path from RHEL7 to RHEL8, well. They're Red Hat Software / IBM. They don't care. They don't have to.

Of course, we all know what happened the last time that IBM tried to shove something down people's throats, when they introduced the IBM PS/2 and OS/2. It turns out that people do *not* willingly abandon their old stuff just because the incumbent near-monopoly decides to push their own proprietary junk. Competitors soon outcompeted IBM, and now they no longer make personal computers. Because it turns out that customers will go to the competition if you do customer-hostile things.

Like Ubuntu.

Which is likely going to be what I standardize on going forward, because really, if Red Hat is going to break the world every time they do a new release, why bother?

- ELG

Wednesday, May 15, 2019

"But how can I be waiting on a lock when I'm writing an entirely different table?"

Anybody who has run long transactions on PostgreSQL has run into situations where an update on one table, let's call it sphere_config, is stuck waiting for a transaction that's busy writing to a completely different table, let's call it component_state.

But how, you ask? How can that be? Those are two different tables, there are no rows in common between the two writes, thus no lock to wait upon!

Well, here's the deal: The transaction that's writing component_state *earlier* wrote to sphere_config. But it's still chugging away doing writes to component_state.

But wait, you say, yes, that transaction wrote to sphere_config, but it wrote to records that have nothing in common with my transaction, so what gives? Well, what gives is this: uniqueness constraints. Transaction A grabbed a share lock on sphere_config when it wrote to sphere_config, because there's a uniqueness constraint on that table. Until transaction A, the one that's still chugging away at component_state, finishes doing its things and COMMITs its transaction, its writes to sphere_config aren't visible to transaction B, your transaction that's trying to write to sphere_config. After all, it might ROLLBACK instead because of an error. Thus your transaction B can't update the row it's trying to update, because it doesn't yet know whether it would be violating a uniqueness constraint, because it doesn't yet know whether transaction A wrote that same value (because transaction A hasn't committed or rolled back yet).

Now I hear you saying, "but transaction A is operating on a completely different subset of the table, there's no way that anything it's writing can violate uniqueness for transaction B." Well, Postgres doesn't know that because of the way Postgres time travels. So when transaction B goes to write to that table, it tries to grab a share lock on it... but can't, because transaction A still has it. Until transaction A finishes, transaction B is stuck.
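
If you want to watch this happen, here's a minimal sketch using Python and psycopg2 (the table name, column names, and connection string are all made up for illustration). In the simplest reproduction the two sessions write the same key value, so the wait is unambiguous: session B cannot know whether its row is a duplicate until session A commits or rolls back, so it just sits there.

import threading
import time

import psycopg2

DSN = "dbname=test"   # hypothetical connection string; adjust for your environment

setup = psycopg2.connect(DSN)
with setup, setup.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sphere_config (
            name  text PRIMARY KEY,   -- the uniqueness constraint that forces the wait
            value text
        )""")
setup.close()

def session_a():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("INSERT INTO sphere_config (name, value) VALUES ('probe-7', 'a')")
    time.sleep(10)      # ...off doing lots of other work (all those component_state writes)
    conn.rollback()     # only now does anyone else learn whether a duplicate exists
    conn.close()

def session_b():
    time.sleep(1)       # let session A get its uncommitted row in first
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    start = time.time()
    # This statement blocks until session A commits or rolls back, because Postgres
    # cannot decide whether it violates the PRIMARY KEY until A's fate is known.
    cur.execute("INSERT INTO sphere_config (name, value) VALUES ('probe-7', 'b')")
    conn.commit()
    print("session B waited %.1f seconds" % (time.time() - start))
    conn.close()

a = threading.Thread(target=session_a)
b = threading.Thread(target=session_b)
a.start(); b.start(); a.join(); b.join()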

So, how can you deal with this? Well:

  1. If you know you're never going to violate the uniqueness constraint, remove it.
  2. If you need the uniqueness constraint, try to break up your big long transactions into smaller transactions. This will clear those locks faster.
  3. Make sure your transactions get terminated by either a COMMIT or a ROLLBACK. Unfortunately, some programs in languages that do exceptions have very bad error handling and, when an exception terminates execution of the database transaction, don't clean up behind themselves by issuing a ROLLBACK to the database session. Note that most people are using a connection pool between themselves and Postgres, and that connection pool may not automatically drop a connection when the program throws an exception. Instead, the still-open connection (still unterminated by a ROLLBACK) may simply be put back into the pool for re-use, keeping that lock open. So: Try your darndest to use scaffolding like Spring's Service scaffolding that will automatically roll back transactions when an uncaught exception terminates execution of a transaction (there's a sketch of the bare-bones pattern after this list). And do a lot of praying.
  4. If you are absolutely, positively sure you will never have a transaction that legitimately sits idle for a long time, you can set the idle_in_transaction_session_timeout parameter in your postgresql.conf (or per session, as in the sketch below). Be very careful here, though: if your connection pool doesn't check for dropped connections, you can get some *very* ugly behavior for your users!
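
To make points 3 and 4 concrete, here's a sketch of the pattern in Python with psycopg2 (the table and key names are the same made-up ones as above, and the timeout value is arbitrary): always resolve the transaction in an except/finally path, and optionally ask the server to kill sessions that sit idle inside a transaction for too long.

import psycopg2

def do_unit_of_work(conn):
    cur = conn.cursor()
    try:
        # Point 4, per-session flavor: have the server terminate this session if it
        # ever sits idle inside an open transaction for more than a minute.
        cur.execute("SET idle_in_transaction_session_timeout = '60s'")
        cur.execute("UPDATE sphere_config SET value = 'x' WHERE name = 'probe-7'")
        # ... lots more work on component_state would happen here ...
        conn.commit()
    except Exception:
        # Point 3: if anything blows up, release our locks *before* this connection
        # goes back into the pool, instead of leaving the transaction dangling.
        conn.rollback()
        raise
    finally:
        cur.close()

conn = psycopg2.connect("dbname=test")   # or a connection checked out of your pool
do_unit_of_work(conn)
conn.close()
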
That's pretty much it. Uniqueness checking in a time-traveling database like Postgres is genuinely hard to do. In MySQL, if you try to write a duplicate into the index, it collides with the record already there and your transaction fails at that point. Postgres, on the other hand, never actually updates records in place; it just adds a new record with a higher transaction number and lets the record with the lower transaction number eventually get garbage-collected once all transactions referring to it have finished executing. This allows multiple transactions to each have their own view of the record as of the time they started, thus ensuring chronological integrity (as well as adding a potential for deadlock, but that's another story), but it definitely makes uniqueness harder to accomplish.

So now you know how your transaction can be waiting on a lock when you're writing an entirely different table -- uniqueness checking. How... unique. And now that you know, you also know how to fix it.

-ELG

Friday, February 15, 2019

Implementing "df" as a Powershell script

Note that I first have 'DiskFree', which creates an object for each disk, so that I can use it in PowerShell scripts that need disk usage info in an easily digested fashion, then use the list of objects in 'df' to display it in a formatted manner. This is all in my startup profile for PowerShell so I have it handy when I'm at the command line. This is an example of writing Perl-esque PowerShell on the latest Windows 10. People who were weaned on PowerShell back in the Windows XP days are likely going to be rather appalled because "it doesn't look like PowerShell!" Well, Perl people don't like how my Perl looks either -- "it doesn't look like Perl!" -- I dunno, people who cling to inscrutable syntax just because are, well, strange. I'm more into clarity and simplicity of reading when it comes to my code.

function DiskFree($dir = "") {
   $wmiq = 'SELECT * FROM Win32_LogicalDisk WHERE Size != Null AND DriveType >= 2'
   $disks = Get-WmiObject -Query $wmiq
   [System.Collections.ArrayList]$newary = @()

   foreach ($disk in $disks) {
        # Write-Host "Processing " $disk.DeviceID
        $used = ( $disk.Size - $disk.FreeSpace )
        $percent = [math]::Round($used / $disk.Size * 100, 1)
        $freePercent = [math]::Round($disk.FreeSpace / $disk.Size * 100, 1)
        if ( ($dir -ne "" -and $dir -eq $disk.DeviceID) -or ($dir -eq "") ) {
           $dfObject = New-Object -TypeName psobject
           $dfObject | Add-Member -MemberType NoteProperty -Name Device -Value $disk.DeviceID
           $dfObject | Add-Member -MemberType NoteProperty -Name Used -Value $used
           $dfObject | Add-Member -MemberType NoteProperty -Name UsedPercent -Value $percent
           $dfObject | Add-Member -MemberType NoteProperty -Name Free -Value $disk.FreeSpace
           $dfObject | Add-Member -MemberType NoteProperty -Name FreePercent -Value $freePercent
           $dfObject | Add-Member -MemberType NoteProperty -Name Total -Value $disk.Size
           $res = $newary.Add($dfObject)
        }
        # Write-Host "Processed" $disk.DeviceID
    }
    $newary
}


function df($dir="") {
   DiskFree($dir) | Format-Table -Property Device,Used,UsedPercent,Free,FreePercent,Total -AutoSize
}

Thursday, October 26, 2017

Amazon Aurora Postgres: First thoughts

Well, I have to say that this was a bit frustrating. I never actually got my database installed into Aurora Postgres because of some serious limits in Amazon's implementation. Once I found those limits, it was clear they restricted my operational flexibility to the point where, for my workload, it simply doesn't work.

The biggest limits are based on the fact that Aurora Postgres doesn't use a filesystem. Rather, Amazon has created a block-based back end for Postgres that allows clustered access to the data store. The data store itself, like EBS, is replicated for performance and redundancy. This has some interesting side effects. Postgres was built around the assumption that the filesystem cache was the primary block cache. You allocate a fairly limited amount of memory to the internal Postgres shared memory pool and leave the rest to be used by the filesystem block cache. Aurora Postgres, on the other hand, must assign that memory to the internal Postgres shared memory pool in order to serve as cache since there is no filesystem and thus no filesystem block cache. Unlike the filesystem block cache pool, Postgres jobs cannot take memory away from the internal shared memory pool in order to accomplish whatever task they are doing. The end result is that internal jobs that require a lot of memory can die with out of memory errors since there's not enough memory outside the Postgres shared memory pool to allocate for that job.

The other big limitation is that Aurora Postgres has limited space for handling large sorts or indexing operations. Regular Postgres uses a directory, pgsql_tmp, in a tablespace to store temporary heap results for sorts and indexes too big to fit in work_mem (which defaults to a mere 4MB, though it's typically raised much higher for this kind of workload). This can be as big as your filesystem allows. If, for example, I have 500gb free in my tablespace, I have no trouble sorting an entire 150gb table into an arbitrary order then exporting it to an external consumer.
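
You can see the spill behavior for yourself with a quick sketch like this (psycopg2, made-up table and column names). On stock Postgres those "external merge" temp files land in pgsql_tmp on the tablespace's filesystem; where they land on Aurora is the problem described next.

import psycopg2

conn = psycopg2.connect("dbname=test")   # hypothetical connection string
cur = conn.cursor()
cur.execute("SET work_mem = '4MB'")      # keep work_mem small so the sort spills
cur.execute("EXPLAIN (ANALYZE, BUFFERS) "
            "SELECT * FROM component_state ORDER BY recorded_at")
for (line,) in cur.fetchall():
    print(line)    # look for: Sort Method: external merge  Disk: NNNNN kB
conn.rollback()
conn.close()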

But remember, Aurora Postgres doesn't have a filesystem for its tablespace. It has a block store. Instead, Aurora Postgres instances that are doing large sorts or indexing large files use local storage, which is currently 2x the size of memory. That is, if an Aurora database instance has 72gb of memory, you only have 144gb of temporary space. Good luck sorting that 150gb table.

What this means for me is that Aurora Postgres has some interesting scalability limits when dealing with very large data sets. I'm currently managing about 2 billion rows in Postgres. Needless to say, this requires a lot of very large indexes in order to segment this data space into usable, consumable subsets. Creating these indexes is slow and tedious on Aurora Postgres because you have to build them one at a time rather than in parallel, due to the lack of temporary space for the sort heaps. And if I'm querying and sorting significant subsets of this database, again Aurora Postgres has some serious limits due to the inability to expand pgsql_tmp.

My guess is that people who are dealing with far fewer rows, but who are querying those rows with much greater frequency, will have a more successful experience with Aurora Postgres. But then they'll run into the IOPS costs. Basically, to get the same IOPS that would cost me $1,200/month on EBS, I'd end up paying around $4,000/month on Aurora.

So: What's the point of Aurora? Aurora does have a couple of positives. You can create additional read replicas virtually instantly, since they're just pointed at the same shared block storage. Failover simply happens, and happens almost instantly. And from a management point of view, Aurora makes the database administrator's job far simpler since you no longer have to closely monitor your tablespaces and expand your block storage as needed (and reallocate tables and indexes across multiple tablespaces using pg_repack) in order to handle growing your dataset. Still, in its current state of development, given its limitations and high costs compared to running your own cluster, I really cannot recommend Aurora Postgres.

-ELG

Saturday, May 20, 2017

Yes, Virginia, there is a Cloud

So a pundit, attempting to be clever, said "there is no `Cloud,' it’s just a computer sitting in a rack somewhere that you can’t see."

Except that's not true. There is a cloud, and it has nothing to do with that computer sitting in a rack somewhere that you can't see. Rather, it has to do with manageability and services that allow you to ignore the reality of that computer sitting in a rack somewhere and treat infrastructure as a service rather than as a physical piece of hardware.

Look: There's been dedicated and shared hosting for literally decades now where you could rent time on somebody else's computer that was somewhere out on the Internet. But nobody who had any sense used those for production environments once they got more than a few dozen users, because it made far more sense to host your own hardware at a data center where you could go hands-on in order to manage it. You could make sure your hardware met your durability and performance requirements, you could reconfigure your hardware as needed to add additional capability, and so forth.

Thing is, all of that is a royal pain in the butt to deal with. Been there, done that, got the four racks of gear in the back room of our shop to prove it. What AWS and other cloud services give us is usable infrastructure as a service, reconfigurable via an easy-to-use web console to meet whatever performance requirements we have. I have constellations of computers on two sides of the continent now, provisioned with whatever combination of CPU and disk space that I need to fulfill my workloads, all done via point and click from my desk in Mountain View, California. I didn't have to go out and spec hardware and purchase it. I didn't have to rack hardware. When I need to burst hardware to process some additional data, I don't need to go out and buy more hardware, then decommission it until the next time I need it, at which point it's just sitting around doing nothing. When I need infrastructure, I provision it. When I don't need it, or I want to upgrade to more performant infrastructure, I de-provision it and provision new infrastructure as needed. And all of this is happening in data centers that are put together with far more redundancy than anything I could afford to put together myself.

That's what cloud means to me. Yes, it's computers I can't see sitting in racks somewhere, but that's not the part that makes it cloud. It's the infrastructure as a service that makes it cloud. For that matter, it's Internet-connected services, period, that makes it cloud. If it's a service sitting out on the Internet somewhere out of sight of me where I don't have to manually configure hardware and can easily scale as needed, it's cloud. Claiming "it's just computers, dude!" overlooks the point entirely.

-ELG

Friday, September 30, 2016

"So do that with your smartphone, nerd boy!"

That was the challenge from someone who'd read the story about an auto shop in Poland still using a Commodore 64 to balance drive shafts.

The Commodore 64 had a port on the back where you got direct digital signals from a parallel I/O chip (the 6526 CIA). So it was used in a number of embedded applications back in the day, in situations where customers wouldn't actually see that critical tasks were being done by a $150 home microcomputer with a whole 64K of RAM and a 1MHz processor. When I was in school, I got a couple of contracts to do embedded stuff using the Commodore 64. The one I found most interesting was the temperature characterization of a directional drilling probe.

Directional drilling probes tell you where the drill bit is when you're doing horizontal drilling of oil or gas wells. We calibrated the probe by mounting it in a testbed that allowed moving it into various positions, and monitoring it with a Commodore 64 bit-banging a two-wire interface. This testbed was in a magnetically calibrated chamber that could be heated or cooled on demand. The probe itself had seven sensors -- three gravitic sensors in three different orientations (x, y, z), three magnetic sensors (which aligned with the Earth's magnetic field to point towards magnetic north), also in three orientations (x, y, z), and a temperature sensor. These sensors fed A/D converters on the drilling probe itself, and were read out via a two-wire protocol (there were four wires total that went to the probe -- +/- power, and the CLK/DATA lines -- because running wires down a drill string is a PITA and they wanted to run as few wires as possible). The problem is that everything was heat sensitive -- "true north" (or "true up and down") returned a different result from the A/D converters depending upon the temperature. And the further you go down into the Earth, the hotter it gets. You didn't want your directional drill heading off onto someone else's plot of ground just because it got hot; that could be a legal mess!

So basically, what we did was bake the probe, then watch the signals as it cooled off. A test run consisted of taking the probe up to its maximum operating temperature, pressing the ENTER key on the Commodore 64, and then turning off the oven and letting it cool down. As it cooled down, the Commodore bit-banged the values in from the probe, built a table in memory, and graphed the readings on the console. This was done in each of the six orientations of the probe. At the end of the test run, the table was printed out onto a piece of paper to be entered into the calibration software that went with the probe (calibration software that did *not* run on the Commodore 64; it ran on a standard PC under MS-DOS, and yes, I wrote that software too, based on equations I was given by their chief scientist).

So do this with a smartphone? Okay, challenge accepted! Some of the things being done with APRS and Android on ham radio would work here; that's another instance where you're interfacing a smartphone with an analog system. I would use a $25 Arduino board (https://www.arduino.cc) to bit-bang the signals. I would use an $8 Bluetooth adapter for the Arduino that presented itself as a Bluetooth UART adapter. Then I would use the Bluetooth Serial profile on the Android phone to actually retrieve the streams of data from the Arduino, process them, display them as pretty graphs on the phone's display and, since this is now the 21st century, send them to a server on the Internet where they're stuck in a database under the particular directional drilling probe's serial number.

Of course, it'd be just as easy to have the Arduino do that part too, if you choose an Arduino that has a WiFi adapter, and use the phone only to prompt the Arduino to start a test run and to display the pretty graphs being generated on the Internet server. It'd be even easier to use a laptop with built-in Bluetooth. But hey, you challenged me to do it with my phone, so there. :P
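
For what it's worth, the receiving end of the Bluetooth-serial version is nearly trivial. Here's a sketch in Python with pyserial, under the assumptions that the Arduino streams one comma-separated reading per line (temperature plus the six gravity/magnetic channels, with names I've invented here) and that the Bluetooth UART has been bound to /dev/rfcomm0:

import csv
import serial   # pyserial

port = serial.Serial("/dev/rfcomm0", 9600, timeout=5)
with open("cooldown_run.csv", "w", newline="") as f:
    log = csv.writer(f)
    log.writerow(["temp", "gx", "gy", "gz", "mx", "my", "mz"])   # hypothetical channel names
    while True:                      # Ctrl-C when the probe has cooled down
        line = port.readline().decode("ascii", errors="ignore").strip()
        if not line:
            continue                 # read timeout with no data; keep waiting
        fields = line.split(",")
        if len(fields) == 7:
            log.writerow(fields)     # one row per sample as the probe cools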

-ELG

Wednesday, August 10, 2016

Security fail: FedEx.com

One of the first rules of security on the Internet for avoiding phishing attacks is that you never, ever, enter your user credentials into a web site unless you know that you're talking to that web site, and that web site alone -- not some spoofed web site. The way we do this in the modern era is with SSL encryption. https provides not only encryption of the content being transmitted between you and the web site, it also provides authentication. Only one server (well, tightly controlled constellation of servers for bigger web sites) has the private key whose public key is being served to you (and which can be validated against the global public key infrastructure). And that's the server that you think you're talking to. If you type, say, https://www.google.com, you will be on the Google.com homepage. Up in your browser URL bar will be a green lock. Click on that lock, and with some clicking around (depends on browser) you will be able to view the certificate and verify that you are, in fact, connected to the one and only Google.com home page owned by the one and only Google. Then, and only then, is it okay to hit the 'Login' link up at the top right of the page and log in.

So, this is how any of us who are concerned about security operate. We don't put in a user name and password unless we see that green lock, click on it, and it says we're talking to who we think we're talking to. This is because of a hacker technique called DNS poisoning, where hackers can manage to convince your local name servers that their server, not the real Google.com's web server, is where you should go to get to https://www.google.com. They then intercept the user name and password that you enter. Well, they can convince your DNS to give the wrong address, but they don't have Google's private key, so they can't impersonate Google. So you won't get that green lock. But they hope you won't notice. You should.

This attack is called phishing, and is used to filch your user name and password, which are then used in an automated fashion on other web sites. Usually after your DNS is poisoned, you get an email telling you to go to http://some.web.site.com because your password is expiring and your account will be deleted. So I got what looked like a phishing email from FedEx.com saying that, because I hadn't logged in for over a year, my account was going to be deleted unless I logged in within the next two weeks. This actually would be normal for FedEx -- the only reason to ever log in to their web site is to set up your notification preferences that tell you that a package is on the way, has been delivered, and so forth. So with the possibility that it might actually be a valid email, I manually typed https://www.fedex.com into my browser's URL bar (*never* click on a URL in email! Never!), hit the ENTER key... and immediately got kicked out to a non-encrypted site.

At which point my reaction was, "WTF? Have hackers hacked the FedEx web site and are they grabbing user credentials?" But it appears that's not the case. Using host and whois to resolve the IP address, it turns out the site goes through Akamai's site acceleration service. Instead, it looks like pure rank incompetence. FedEx is deliberately putting their customers' user names and passwords at risk because... why? Well, because they're too stupid to know how to implement SSL in an Akamai-distributed architecture, apparently. Despite the fact that Akamai has explicitly supported SSL for years.

So anyhow, I use LastPass so my password was random gibberish in the first place, so after examining the source code of the web page to see if there were obvious problems, I logged in. At that point the web site did put me into a proper SSL-encrypted web page. But the point... the point... I should have never had to enter my user name and password into a plain text unencrypted web page in the first place. There's no -- zero -- way to authenticate that you are actually talking to the site you thought you were talking to, if you're talking to a web site that's not https. DNS poisoning attacks are ridiculously easy and could have sent me *anywhere*. The only reason I felt even halfway safe talking to this web site was because LastPass had generated me a random gibberish password for this web site a year ago, so if they *did* steal my FedEx credentials, at least they could only be used to hack my FedEx account, which would be no big deal (I don't have credit card information or billing information associated with the account, it's strictly an informational account). But still. Bad FedEx. Bad, bad FedEx. Bad DevOps team, no cookie, go to your room!

The takeaway from this:

  1. Check those green locks. They're important. They tell you a) whether you're talking to the web site you think you're talking to (if the lock is there, you know you are; if it's not, you don't know), and b) that any user names and passwords you enter will go to the web site over an encrypted connection.
  2. Any web site that your company provides should only ask for user names and passwords on an SSL-encrypted https page. If someone tries to go there with a plain http: url, it should immediately be forwarded to the SSL site (see the sketch after this list).
  3. If you don't follow that last rule, you will be publicly shamed, if not by me, by someone. And a public shaming is never good for your brand.
  4. Furthermore, if you don't follow that last rule, there are many potential customers who will simply refuse to use your web site. This perhaps is not a big deal for FedEx, since most of their customers have no choice but to use their web site to schedule package pickups, but if you're providing a web service to the general public? Dude. You are leaving money on the table if you do something stupid like ask for a username and password on an unencrypted page.
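
Rule 2, by the way, is about five lines of code in any modern web framework. Here's a sketch in Python with Flask (the framework choice is mine, not FedEx's; behind a load balancer you'd also want to trust X-Forwarded-Proto rather than the raw socket):

from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def force_https():
    # Anything that arrives over plain http gets bounced to the https equivalent
    # before any login form, cookie, or credential ever gets involved.
    forwarded = request.headers.get("X-Forwarded-Proto", "")
    if not request.is_secure and forwarded != "https":
        return redirect(request.url.replace("http://", "https://", 1), code=301)

@app.route("/login")
def login():
    return "This login form is only ever served over https."
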
This isn't rocket science, people. This is just basic Web Services 101. Get it right, or choose a different profession. I understand that McDonalds is hiring. Just sayin'.

-ELG

Sunday, September 27, 2015

SSD: This changes everything

So someone commented on my last post where I predicted that providing block storage to VM's and object storage for apps was going to be the future of storage, and he pointed out some of the other ramifications of SSD. To wit: Because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where CPU in many cases is the restriction. This is especially true since Moore's Law has seemingly gone AWOL. The Westmere Xeon processors in my NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeon processors. The slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes with replacing rotational storage with SSD's.

I have seen that myself, watching a Grails application max out eight CPU cores while not budging the iometer on a database server running off of SSD's. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are numbered. In the future, efficient algorithms and languages are going to come back into fashion to make use of all this fast storage that is taking over the world.

But that's not what excites me about SSD's. That's just a shuffling of priorities. What excites me about SSD's is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from bottom to top, then from top to bottom, in order to optimize reads. This in turn puts some severe restrictions on how we lay out block storage -- the data must be stored contiguously so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery-backed cache -- we can't do COW RAID stripe block replacement (that is, write the altered blocks of a RAID stripe at some new location on the device, then alter a stripe map table to point at those new locations and add the old locations to a free list) because a filesystem on top of the block device would not be able to schedule the elevator properly. The performance of the block storage system would fall over. That's why traditional iSCSI/Fibre Channel vendors present contiguous LUNs to their clients.
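
To make that parenthetical concrete, here's the COW stripe replacement idea reduced to a toy in Python (pure illustration, nothing like a real RAID implementation): logical stripes map to physical slots, a rewrite goes to a fresh slot, and the old slot is freed only after the map has been repointed, so a crash mid-write never leaves a half-overwritten stripe.

class CowStripeStore:
    def __init__(self, num_slots):
        self.free = list(range(num_slots))    # physical slots not referenced by the map
        self.stripe_map = {}                  # logical stripe number -> physical slot
        self.slots = {}                       # physical slot -> stripe contents

    def write_stripe(self, stripe_no, data):
        new_slot = self.free.pop()            # 1. pick a fresh physical location
        self.slots[new_slot] = data           # 2. write the whole stripe there
        old_slot = self.stripe_map.get(stripe_no)
        self.stripe_map[stripe_no] = new_slot # 3. repoint the map (the atomic step)
        if old_slot is not None:
            del self.slots[old_slot]
            self.free.append(old_slot)        # 4. the old copy becomes free space

    def read_stripe(self, stripe_no):
        return self.slots[self.stripe_map[stripe_no]]

store = CowStripeStore(num_slots=8)
store.write_stripe(0, b"version 1 of stripe 0")
store.write_stripe(0, b"version 2 of stripe 0")   # lands in a new slot; map now points there
print(store.read_stripe(0))                        # b'version 2 of stripe 0'

The catch is exactly step 3: once stripes can move around like this, a filesystem sitting on top that still believes in contiguous block addresses can no longer schedule its elevator sensibly, which is why this only becomes attractive once the elevator stops mattering.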

As a result, when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could properly schedule the elevator. Thus ZFS and BTRFS. They manage their own redundancy rather than relying on RAID at the block layer, and ideally want to directly manage the block devices. Unfortunately that really doesn't map well to a block storage back end that is based on LUNs, and furthermore, doesn't map well to virtual machine block devices represented as files on the LUN -- virtual machines all have their own elevators doing what they think are sequential, ordered writes, but the COW filesystems are writing at random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, an extent-based clustered filesystem that, again, due to the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file largely contiguous on the underlying block storage so that the individual virtual machines' elevators can schedule properly.

So VMFS talking to clustered block storage is one way of handling things, but then you run into the limits on the number of servers that can talk to a single LUN, which in turn makes things difficult to manage: you end up with hundreds of LUN's for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that have virtual machines on that specific LUN (in order to avoid hitting the limit on the number of servers allowed to access a single LUN). What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, and to have the same capabilities on that back end that VMFS gives us on a single LUN -- the ability to do snapshots, the ability to do sparse LUN's, the ability to copy snapshots as new volumes, and so forth. And have it all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage because we were slaves of the elevator, because we had to make sure that all this storage ended up contiguous. But now we don't -- the writes still have to be contiguous, due to the limitations of SSD, but reads don't. And it's the reads that forced the elevator -- scheduling contiguous streams of writes (from multiple virtual machines / multiple files on those virtual machines) has always been easy.

I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it to ESXi boxes as an NFS datastore rather than as block storage LUN's. NFS doesn't have the limits on the number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. vSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily on Amazon's cloud, and they want a degree of flexibility in spinning virtual machines up and down that makes life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do this on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS data stores but already has the ability to present an iSCSI LUN to a virtual machine as a block device. Add in some local SSD caching at the hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.

Finally, there is a question as to whether storage and compute should be separate things altogether. Why not have compute in the same box as your storage? There are two problems with that, though: a) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and b) the density of compute servers is much higher than the density of data servers, i.e., you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power is now going to be the limiting factor for many applications, not IOPS. On top of that, you want the operational capability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory just keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to continue to have to scale compute servers far more often than I have to scale storage servers.

So this has gone on much too long but the last thing to cover is this: Will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph on top of large numbers of standard Linux storage servers serving individual disks as JBOD's? It's possible, I suppose -- but it seems unlikely due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high end big iron database business in the end because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than purchase it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware in the end because pure software defined solutions are unable to do rudimentary things like, e.g., use SES to blink the LED of a disk bay whose SSD has failed. In the end providing an iSCSI LUN directly to a virtual machine requires both a software support side that is clearly software defined, and a hardware support side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that will incorporate increasingly large amounts of software that runs on infrastructure servers to define important functions like, e.g., spinning up a virtual machine that has a volume attached of a given size and IOPs guarantee.

-ELG

Tuesday, August 25, 2015

Where does the future of enterprise storage lie?

I've talked about how traditional block and NAS storage isn't going away for small businesses. So what about enterprise storage? In the past few years, we've seen the death of multiple vendors of scale-out block storage, two of which, Coraid and Intransa, were of particular interest to me; both allowed chaining together large numbers of Ethernet-connected nodes to scale storage out across a very large array (the biggest cluster we built at Intransa had 16 nodes and a total of 1.5 petabytes of storage, but the theoretical limits of the technology were significantly higher). The reality is that they had been on life support for years, because the 1990's and 2000's were the decades of NAS, not of block storage. Oh, EMC was still heaving lots of big iron block storage over the wall to power big databases, but most applications of storage other than those big corporate data marts were NAS applications, whether it was Windows and Linux NAS servers at the low end or NetApp NAS servers at the high end.

NAS was pretty much a necessity back in the era of desktops and individual servers. You could mount people's home directories on a CIFS or NFS share (depending on their OS). People could share their files with each other by simply copying them to a shared directory. You saw block storage devices being exported to these desktops via iSCSI sometimes, but usually block storage devices were attached to physical servers in the back room on dedicated storage networks that were much faster than floor networks. The floor networks were fast enough to carry CIFS, but CIFS at its core is just putting and getting objects, not blocks, and can operate much more asynchronously than a block device and thus wasn't killed by latency the way iSCSI is.

But there are problems too. For one thing, every single device has to be part of a single login realm or domain of some sort, because that's how you secure connections to the NAS. Furthermore, people have to be put into groups, and access set on portions of the overall NAS cloud based on what groups a person belongs to. That was difficult enough in the days when you just had to worry about Linux servers and Windows desktops. But now you have all these other devices in the mix.

Which brings up the second issue with NAS -- it simply doesn't fit into a device-oriented world. Devices typically operate in a cloud world. They know how to push and pull objects via http, but they don't speak CIFS or NFS, and never will. What we are seeing is that increasingly we are operating in a world that isn't file based, it's object based. When you go into Google Docs to edit a spreadsheet, you aren't reading and writing a file. You're reading and writing an object. When you are running an internal business application, you are no longer loading a physical program and reading and writing files. You're going to a URL for a web app that most likely is talking to a back end database of some kind to load and store objects.

Now, finally, add in what has happened in the server room. You'll still see the big physical iron for things like database servers. But by and large the remainder of the server room has gone away, replaced by a private cloud, or pushed into a public cloud like Amazon's cloud. Now when people want to put up a server to run some service they don't call IT and work up a budget and wait months for an actual server to be procured etc., they work at the speed of the cloud -- they spin up a virtual machine, they attach block storage to it for the base image and for any database they need beyond object storage, and they implement whatever app they need to implement.

What this means is that block storage and object storage integrated with cloud management systems like OpenStack are the future of enterprise storage, a future that alas did not arrive soon enough for the vendors of scale-out block storage that survived the previous decade, who ended up without enough capital to enter this brave new world. NAS won't go away entirely, but it will increasingly be a departmental thing feeding desktops on the floor, not something that anything in the server room uses. And that is, in fact, what you see happening in the marketplace today. You see traditional Big Iron vendors like HDS increasingly pushing object storage, and the new solid-state storage vendors such as Pure Storage and Solidfire are predominantly block storage vendors selling into cloud environments.

So what does the future hold? For one thing, lower latencies via hypervisor integration. Exporting a LUN via iSCSI and then mounting it via the hypervisor has all of the usual latency issues of iSCSI. Even with 10 gigabit networking now hitting affordability and 25 to 100 gigabit Ethernet in the future, latency is a killer if you're expecting a full round trip. What if writes were cached on a local SSD array, in order, and applied in order? For 99% of the applications out there this provides all the write consistency that you need. The cache will have to be turned off prior to migrating the virtual machine to a different box, of course -- thus the need for hypervisor integration -- but short of a catastrophic failure (where the virtual machine will go lights-out too and thus not have inconsistent data when it is restarted on another node) you will, at worst, have some minor data loss -- much better than inconsistent data.
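
The "cache locally, apply in order" idea is simple enough to sketch in a few lines of Python (class names are made up, and there's no real iSCSI anywhere): writes are acknowledged once they hit the local journal, and a drain loop replays them to the backing volume strictly in arrival order, so a crash costs you at most the un-drained tail, never reordered or inconsistent data.

from collections import deque

class OrderedWriteCache:
    def __init__(self, remote_volume):
        self.remote = remote_volume
        self.journal = deque()               # stand-in for the local SSD write log

    def write(self, offset, data):
        self.journal.append((offset, data))  # acknowledged as soon as it's logged locally

    def drain(self):
        # Apply cached writes to the backing volume in the exact order they arrived.
        while self.journal:
            offset, data = self.journal.popleft()
            self.remote.write_at(offset, data)

class FakeRemoteVolume:
    def __init__(self, size):
        self.blocks = bytearray(size)
    def write_at(self, offset, data):
        self.blocks[offset:offset + len(data)] = data

vol = FakeRemoteVolume(1024)
cache = OrderedWriteCache(vol)
cache.write(0, b"superblock update")
cache.write(512, b"journal commit record")
cache.drain()                                # e.g. flush before migrating the VM elsewhere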

So: Block storage with hypervisor and cloud management integration, and object storage. The question then becomes: Is there a place for the traditional dedicated storage device (or cluster of devices) in this brave new world? Maybe I'll talk about that next, because it's an interesting question, with issues of data density, storage usage, power consumption, and then what about that new buzzword, "software defined storage"? Is storage really going to be a commodity in the future where everybody's machine room has a bunch of generic server boxes loaded with someone's software? And what impact, exactly, is solid state storage having? Interesting things to think about there...

-ELG

Saturday, August 1, 2015

The quest for an integrated storage stack

In prior posts I've mentioned the multitude of problems with the standard Linux storage stack. It's inflexible -- once you've set up a stack (usually LV->VG->PV->MD->BLOCK) and opened a filesystem on it, you cannot modify it to, e.g., add a replication layer to the stack. It lacks the ability to do geographic replication in any reasonable fashion. The RAID layer in particular lacks the ability to write to (and replay) a battery-backed RAM cache to deal with the RAID 5 write hole (which, despite its name, also applies to other RAID levels and results in silently corrupted data). Throw iSCSI into this equation to provide block devices to virtual machines and, potentially, to do replication to block devices on other physical machines, and things get even more complex.

One method that has been proposed to deal with these issues is to simply not use a storage stack at all. Thus we have ZFS and BTRFS, which attempt to move the RAID layer and logical volume layers into the filesystem. This certainly solves the problem of corrupted data, but at a significant penalty in terms of performance, especially on magnetic media where the filesystem swiftly becomes fragmented. As a result running virtual machines using "block devices" that are actually files on a BTRFS filesystem results in extremely poor "disk" performance on the virtual machines. A file on a log-based subsystem is simply a poor substitute for an extent on a block device. Furthermore, use of these filesystems for databases has proven to be woefully slow compared to using a normal filesystem like XFS on top of a RAID-10 layer.

The other method that has been proposed is to abandon the Linux storage stack except as a provider of individual block devices, and instead layer a distributed system like Ceph on top of it. My tests with Ceph have not been particularly promising. Performance of Ceph block devices at the individual virtual machine level was abysmal. There appear to be three reasons for this: 1) overly pessimistic assumptions about writes on the part of Ceph, 2) the inherent latencies involved in a distributed storage stack, and 3) the fact that Ceph reads and writes via XFS filesystems layered on top of block devices, rather than to extents on raw block devices. For the latter, in my experience you will see *at least* a 10% degradation in virtual machine block device performance if the block device is implemented as a file on top of XFS rather than directly as an LVM extent.

In both cases, I wonder if we are throwing out the cart because the horse has asthma. I've worked as a software engineer for two of the pioneers of Linux-based storage -- Agami Systems, which did a NAS device with an integrated storage system, and Intransa Inc., which did scalable iSCSI storage systems with an integrated block storage subsystem. Both suffered the usual fate of pioneers -- i.e., face down dead with arrows in the back, though it took longer with Intransa than with Agami. Both wrote storage stacks for Linux which solved most of the problems of the current Linux storage stack, though each solved a different subset of those problems. There are still a significant number of businesses which do not need the expense and complexity of a full OpenStack data center in order to solve their problems, but which do need things like, e.g., logged geographic replication to replicate their data to an offsite location, something which Intransa solved ten years ago (but which, alas, died with Intransa), or real-time snapshots of virtual machine block devices at the host device level, or ...

In short: Despite the creation of distributed systems like Ceph and integrated storage management filesystems like BTRFS, there is a significant need for an integrated storage stack for Linux -- one that allows flexibility in configuring both block devices and network filesystems, that allows for easy scalability and management, that has modern features such as logged geographic replication and battery-backed RAM cache support (or at least fast SSD log device support at the MD layer), and that allows dynamic insertion of components into the software stack, much as you could create a replication layer in the Intransa StorStac and have it sync and then replicate to a remote device without ever unmounting any filesystem or making the iSCSI target inaccessible. There is simply a large number of businesses which just don't need the expense and complexity of a full OpenStack data center, which indeed don't need more than a pair of iSCSI / NAS storage appliances (a pair in order to handle replication and snapshotting), and the current Linux storage stack lacks fundamental functionality that was implemented over a decade ago but never integrated into Linux itself. It may not be possible to bring all the concepts that Agami and Intransa created into Linux (though I'll point out that all of Intransa's patents are now owned by a patent entity that allows free use for Open Source software), but we should attempt to bring as many of them as possible into the standard Linux storage stack -- because the cloud is the cloud, but most smaller businesses have no need for the cloud; they just need reliable local storage for their local physical and virtual machines.

-ELG