Friday, April 2, 2021

Eating the seed corn: Red Hat Software, CentOS, and the end of an era

Or: Why we are migrating to Ubuntu Linux.

So, why did Linux win the server wars? Why is virtually every cloud server running Linux now, rather than some variety of Windows Server? Why are probably the majority of corporate web servers running Linux now, and a large variety of other internal application servers are running Linux?

Basically, Linux won the server wars because Linux won the poor college student market. Poor college students couldn't afford a copy of Windows Server to run on their own computer. They could afford a copy of Linux to run on their own computer. And twenty years later, they are the decision makers in IT departments who decide what technology to use.

So anyhow: After being one of those poor college students some years ago I'm now one of those decision makers. For example, I spec'ed and got quotes for $16,000 worth of server equipment last week as well as $1500 worth of software, and over the next few years will be doing the same for a six-figure amount of equipment and software. Granted, this isn't huge by IT terms, but multiply me by thousands, and it adds up quickly.

And one of the things I spec'ed was the embedded OS in our appliance. In 2010 I chose to use CentOS Linux for the embedded distribution in our appliance for a number of reasons:

  1. Licensing costs. The licensing terms for Red Hat Enterprise Linux really are oriented around servers. Their per-server costs would make the appliance unprofitable. Furthermore, we're using an extremely stripped down Linux distribution for our appliance. There really is no value we would get from Red Hat support. So we went with a community distribution (CentOS) rather than a commercial distribution(RHEL or SUSE) to reflect the value that we get from the support organization around the distribution.
  2. Stream of security patches. We don't have a team to watch CVE's and figure out whether they apply to our stripped-down Linux distribution. Instead, we watch the stream of RPM's coming out of the distribution and apply the ones applicable to our distribution on a regular basis.
  3. Length of support. Ten years is a long time. It means we don't have to re-base our software that runs on top of the embedded OS to a new Linux distribution very often. This means less work and more play. Playing with new technologies to add to our product is a lot more fun than the grunt work of porting our software to a new Linux distribution.
  4. Continuity with earlier versions of Red Hat Enterprise Linux. If you knew how to set up advanced networking on EL5, you know how to set it up on EL7 (with some caveats such as having to disable NetworkManager first). This made the job of re-basing our software considerably easier than with distributions that moved things around in major ways with each release.
  5. A commitment in writing and in all major press outlets by Red Hat Software, when they acquired the CentOS project in 2014, to continue the mission of the CentOS project unchanged, which led to transitioning from Centos 6 to Centos 7 when it arrive and stabilized.
Fundamentally, Red Hat Software has broken all of the above with recent decisions such as the decision to discontinue CentOS 8 as of the end of 2021.
  1. CentOS Stream is not a community distribution. It is a beta test stream for RHEL. As such, it is not an appropriate base for an embedded appliance distribution. So we will have to move to another community distribution, and Ubuntu is the community distribution that has the highest mindshare right now.
  2. One reason why CentOS Stream is not a community distribution is that CentOS Stream will not be a reliable source of security patches. Yes, I understand that Facebook has their own internal distribution based on CentOS Stream. We are not Facebook. We do not have infinite money like Facebook. We do not have a team to watch the list of RPM's and apply only the finalized tested versions to our appliance distribution. Yet another reason why CentOS Stream is not appropriate for us.
  3. One thing that kept us on CentOS rather than on Ubuntu LTS was that CentOS releases were supported for twice as long as Ubuntu LTS releases. Since that's no longer true, there's no reason to stay with a RHEL-derived distribution.
  4. Red Hat Software broke significant continuity with earlier versions of RHEL with Enterprise Linux 8. For example, they removed all Java applet containers from the distribution other than their own JBOSS applet container. You use Tomcat? Tough. You'll have to install it from scratch and keep it updated yourself when security patches happen. And those who use the BTRFS filesystem are completely left out.
  5. And finally, Red Hat Software has proven that they will break their commitments. So.
And Red Hat's explanation? "It was putting price pressure on Red Hat’s ability to capture some of the value that we create."

Uhm, no. Because the users of CentOS, by and large, are not going to buy Red Hat Enterprise Linux. Buying RHEL does not comport with their business model if they're a business, or with their budget if they're a poor college student. They are going to move to another community distribution, either another rebuild of RHEL such as Oracle Linux or Rocky Linux, or to a different distribution altogether such as Debian or Ubuntu.

So why are we choosing Ubuntu LTS rather than some other distribution?

  1. Oracle Linux: I've evaluated Oracle Linux. They've made sufficient changes to make it difficult to transition from CentOS, plus have broken a few things along the way. And they don't (can't) fix the fact that Enterprise Linux 8 broke significant continuity with earlier versions of RHEL. They can't do that and maintain compatibility with RHEL 8.
  2. Then there's the general problem with all of the RHEL rebuilds: mindshare and market share is declining drastically. CentOS is really bad as a desktop operating system, so the students we get in as interns, if they know Linux at all, usually know Ubuntu Linux or one of its derivatives. Combined with its derivatives, that makes Ubuntu the most popular desktop Linux right now. For desktop Linux use, WSL has made Ubuntu the most common Linux on the desktop in our office despite the fact that we work with CentOS or CentOS-derived distributions as our target.
  3. Debian Linux is quite stable, but if security fixes require effort to backport to the previous stable release, they usually aren't backported. This makes it hard for us to keep our product secure without transitioning to a newer distribution.
  4. Fedora -- it moves too fast. Decent desktop OS, but not a good fit for an embedded appliance.
  5. The ubiquity of Ubuntu. It is available as a supported option, and fully cloud-aware, on all major cloud platforms -- Google, Microsoft, and Amazon. To create a virtual machine with many of the other Linux distributions you must first locate a third-party image, then install that possibly untrusted third-party image. And often the third-party image is not cloud aware -- e.g. it doesn't have cloud-init installed or the native cloud tools installed, making it hard to bring up new virtual machines in an automated fashion. Ubuntu is a Tier 1 supported distribution everywhere that matters.
Thus Ubuntu.

In general, the mindshare issue -- where Red Hat's mindshare amongst potential employees is declining rapidly while Ubuntu's mindshare is becoming near ubiquitous -- makes it almost a no-brainer for us. Instead of attempting to port our software stack from Centos 7 to a RHEL 8 clone as Centos 7 approaches end-of-life, we will instead port it to Ubuntu 20.04 LTS. We use a configuration management tool to actually create our appliance (we plop the deliverables onto the CM server and it configures a copy of the base OS with our changes), and the CM tool we use hides most distribution details from us, so the job of porting is mostly going to be one of adjusting package names and occasional file locations in the manifest used to create the appliance. Furthermore, we avoid the Tomcat issue. Ubuntu ships with Tomcat, while RHEL 8 does not. Granted, that's a fairly minor issue (we already have things in our manifest that manually install various software) but why make work for ourselves?

And our student interns? The ones that aren't already Ubuntu users will become Ubuntu users. And ten years from now, they will be specifying Ubuntu Linux rather than Red Hat Enterprise Linux for their corporation's deployments. Because that's how it works. If you seed your field, crops will rise. If you instead eat your seed corn, as Red Hat's current IBM pointy headed management is doing ... well, you eat well for this quarter. But good luck reaping a crop at the end of the growing season.


Thursday, October 24, 2019

Red Hat Software has phone company syndrome

Specifically: They don't care, because they don't have to.

Google "jboss sucks." You get about 946,000 results. Google "tomcat sucks". You get about 260,000 results.

So which one did Red Hat Software omit from Red Hat Enterprise Linux 7, after it had been in the distribution for close to two decades?

Hint: It's the one that sucks less.

Yeah, Red Hat kept the one that sucks more (jboss) and dumped the one that sucks less (tomcat). Because they want to push their bloated buggy application server that nobody really likes (JBoss) and if that requires breaking everybody's migration path from RHEL7 to RHEL8, well.They're Red Hat Software / IBM. They don't care. They don't have to.

Of course, we all know what happened the last time that IBM tried to shove something down people's throats, when they introduced the IBM PS/2 and OS/2. It turns out that people do *not* willingly abandon their old stuff just because the incumbent near-monopoly decides to push their own proprietary junk. Competitors soon outcompeted IBM, and now they no longer make personal computers. Because it turns out that customers will go to the competition if you do customer-hostile things.

Like Ubuntu.

Which is likely going to be what I standardize on going forward, because really, if Red Hat is going to break the world every time they do a new release, why bother?


Wednesday, May 15, 2019

"But how can I be waiting on a lock when I'm writing an entirely different table?"

Anybody who has run long transactions on PostgreSQL has run into situations where an update on one table, let's call it sphere_config, is waiting on a write from another table, let's call it component_state, to complete.

But how, you ask? How can that be? Those are two different tables, there are no rows in common between the two writes, thus no lock to wait upon!

Well, here's the deal: The transaction that's writing component_state *earlier* wrote to sphere_config. But it's still chugging away doing writes to component_state.

But wait, you say, yes, that transaction wrote to sphere_config, but wrote to records that are not in common with my transaction, so what gives? Well, what gives is this: Uniqueness constraints. Transaction A grabbed a share lock on sphere_state when it wrote to sphere_state because there's a uniqueness constraint. Until transaction A, that's writing to component_state, finishes chugging away doing its things and COMMITs its transaction, its writes to sphere_config aren't visible to transaction B, your transaction that's trying to write to sphere_config. After all, it might ROLLBACK instead because of an error. Thus your transaction B can't update the row it's trying to update because it doesn't know yet whether it would be violating a uniqueness constraint because it doesn't know yet whether transaction A wrote that same value (because transaction A hasn't committed or rolled back yet).

Now I hear you saying, "but transaction A is operating on a completely different subset of the table, there's no way that anything it's writing can violate uniqueness for transaction B." Well, Postgres doesn't know that because of the way Postgres time travels. So when transaction B goes to write to that table, it tries to grab a share lock on it... but can't, because transaction A still has it. Until transaction A finishes, transaction B is stuck.

So, how can you deal with this? Well:

  1. If you know you're never going to violate the uniqueness constraint, remove it.
  2. If you need the uniqueness constraint, try to break up your big long transactions into smaller transactions. This will clear those locks faster.
  3. Make sure your transactions get terminated by either a COMMIT or a ROLLBACK. Unfortunately some programs in languages that do exceptions have very bad error handling and, when an exception occurs that terminates execution of the database trasnaction, don't clean up the database connection behind them by issuing a ROLLBACK to the database session. Note that most people are using a connection pool between themselves and Postgres, and that connection pool may not automatically drop a connection when the program exceptions. Instead, the still-open connection (still unterminated by a ROLLBACK) may be simply put back into the pool for re-use, keeping that lock open on the . So: Try your darndest to use scaffolding like Spring's Service scaffolding that will automatically roll back transactions when there is an uncaught exception that terminates execution of a transaction. And do a lot of praying.
  4. If you are absolutely positively sure you will never have a long-running transaction, you can add a idle_in_transaction_session_timeout (integer) parameter to your postgresql.conf . Be very careful here though, if your connection pool doesn't do checking for dropped connections you can get *very* ugly user interaction with this!
That's pretty much it. Uniqueness checking in a time-traveling database like Postgres is pretty hard to do. In MySQL, if you try to write a duplicate into the index, it will collide with the record already there and your transaction will fail at that point, but Postgres doesn't ever actually update records, it just adds a new record with a higher transaction number and lets the record with a lower transaction number eventually get garbage-collected once all transactions referring to it are finished executing. This allows multiple transactions to have multiple views of the record as of the time that each transaction started, thus insuring chronological integrity (as well as adding a potential for deadlock but that's another story), but definitely makes uniqueness harder to accomplish.

So now you know how your transaction can be waiting on a lock when you're writing an entirely different table -- uniqueness checking. How... unique. And now that you know, now you know how to fix it.


Friday, February 15, 2019

Implementing "df" as a Powershell script

Note that I first have 'DiskFree' which creates an object for each line, so that I can use it in PowerShell applets that need disk usage info in an easily digested fashion, then use the list of objects in 'df' to display in a formatted manner. This is all in my startup profile for PowerShell so I have it handy when I'm at the command line. This is an example of writing Perl-esque Powershell on the latest Windows 10. People who were weaned on Powershell back in the Windows XP days likely are going to be rather appalled because "it doesn't look like Powershell!". Well, Perl people don't like how my Perl looks either -- "it doesn't look like Perl!" -- I dunno, people who cling to inscrutable syntax just because are, well, strange. I'm more into clarity and simplicity of reading when it comes to my code.
function DiskFree($dir = "") {
   $wmiq = 'SELECT * FROM Win32_LogicalDisk WHERE Size != Null AND DriveType >= 2'
   $disks = Get-WmiObject -Query $wmiq
   [System.Collections.ArrayList]$newary = @()

   foreach ($disk in $disks) {
        # Write-Host "Processing " $disk.deviceId
        $used = ( $disk.Size - $disk.FreeSpace )
        $percent = [math]::Round($used / $disk.size * 100,1)
        $freePercent = [math]::Round($disk.FreeSpace / $disk.size * 100,1)
        if ( ($dir -ne "" -and $dir -eq $disk.DeviceID) -or ($dir -eq "")  ) {
           $dfObject = New-Object -TypeName psobject  
           $dfObject | Add-Member -MemberType NoteProperty -Name Device -Value $disk.deviceId
           $dfObject | Add-Member -MemberType NoteProperty -Name Used -Value $used
           $dfObject | Add-Member -MemberType NoteProperty -Name UsedPercent -Value $percent
           $dfObject | Add-Member -MemberType NoteProperty -Name Free -Value $disk.FreeSpace
           $dfObject | Add-Member -MemberTYpe NoteProperty -Name FreePercent -Value $freePercent
           $dfObject | Add-Member -MemberTYpe NoteProperty -Name Total -Value $disk.size
           $res = $newary.Add($dfObject)
        # Write-Host "Processed" $disk.DeviceId

function df($dir="") {
   DiskFree($dir) | Format-Table -Property Device,Used,UsedPercent,Free,FreePercent,Total -AutoSize

Thursday, October 26, 2017

Amazon Aurora Postgres: First thoughts

Well, I have to say that this was a bit frustrating. I never actually got my database installed into Aurora Postgres because of some serious limits of Amazon's implementation. Once I found those limits, I found that they limited my operational flexibility to the point where, for my workload, it simply doesn't work.

The biggest limits are based on the fact that Aurora Postgres doesn't use a filesystem. Rather, Amazon has created a block-based back end for Postgres that allows clustered access to the data store. The data store itself, like EBS, is replicated for performance and redundancy. This has some interesting side effects. Postgres was built around the assumption that the filesystem cache was the primary block cache. You allocate a fairly limited amount of memory to the internal Postgres shared memory pool and leave the rest to be used by the filesystem block cache. Aurora Postgres, on the other hand, must assign that memory to the internal Postgres shared memory pool in order to serve as cache since there is no filesystem and thus no filesystem block cache. Unlike the filesystem block cache pool, Postgres jobs cannot take memory away from the internal shared memory pool in order to accomplish whatever task they are doing. The end result is that internal jobs that require a lot of memory can die with out of memory errors since there's not enough memory outside the Postgres shared memory pool to allocate for that job.

The other big limitation is that Aurora Postgres has limited space for handling large sorts or indexing operations. Regular Postgres uses a directory, pgsql_tmp, in a tablespace to store temporary heap results for sorts and indexes too big to fit in work_mem (which by default is 2gb). This can be as big as your filesystem allows. If, for example, I have 500gb free in my tablespace, I have no trouble sorting an entire 150gb table into an arbitrary order then exporting it to an external consumer.

But remember, Aurora Postgres doesn't have a filesystem for its tablespace. It has a block store. Instead, Aurora Postgres instances that are doing large sorts or indexing large files use local storage, which is currently 2x the size of memory. That is, if an Aurora database instance has 72gb of memory, you only have 144gb of temporary space. Good luck sorting that 150gb table.

What this means for me is that Aurora Postgres has some interesting scalability limits when dealing with very large data sets. I'm currently managing about 2 billion rows in Postgres. Needless to say, this requires a lot of very large indexes in order to segment this data space into usable consumable subsets. Creating these indexes is a slow and tedious problem on Aurora Postgres because you have to do them one at a time, you can't do them in parallel, due to the lack of temporary space to use for the sort heaps. And if I'm querying and sorting significant subsets of this database, again Aurora Postgres has some serious limits due to the inability to expand pgsql_tmp.

My guess is that people who are dealing with much fewer rows, but who are querying those rows with much greater frequency, will have a more successful experience with Aurora Postgres. But then they'll run into the IOPS costs. Basically, to get the same IOPS that would cost me $1200/month on EBS, I'd end up paying around $4,000/month on Aurora.

So: What's the point of Aurora? Aurora does have a couple of positives. You can create additional read replicas virtually instantly, since they're just pointed at the same shared block storage. Failover simply happens, and happens almost instantly. And from a management point of view, Aurora makes the database administrator's job far simpler since you no longer have to closely monitor your tablespaces and expand your block storage as needed (and reallocate tables and indexes across multiple tablespaces using pg_repack) in order to handle growing your dataset. Still, in its current state of development, given its limitations and high costs compared to running your own cluster, I really cannot recommend Aurora Postgres.


Saturday, May 20, 2017

Yes, Virginia, there is a Cloud

So a pundit, attempting to be clever, said "there is no `Cloud,' it’s just a computer sitting in a rack somewhere that you can’t see."

Except that's not true. There is a cloud, and it has nothing to do with that computer sitting in a rack somewhere that you can't see. Rather, it has to do with manageability and services that allow you to ignore the reality of that computer sitting in a rack somewhere and treat infrastructure as a service rather than as a physical piece of hardware.

Look: There's been dedicated and shared hosting for literally decades now where you could rent time on somebody else's computer that was somewhere out on the Internet. But nobody who had any sense used those for production environments once they got more than a few dozen users, because it made far more sense to host your own hardware at a data center where you could go hands-on in order to manage it. You could make sure your hardware met your durability and performance requirements, you could reconfigure your hardware as needed to add additional capability, and so forth.

Thing is, all of that is a royal pain in the butt to deal with. Been there, done that, got the four racks of gear in the back room of our shop to prove it. What AWS and other cloud services give us is usable infrastructure as a service, reconfigurable via an easy-to-use web console to meet whatever performance requirements we have. I have constellations of computers on two sides of the continent now, provisioned with whatever combination of CPU and disk space that I need to fulfill my workloads, all done via point and click from my desk in Mountain View, California. I didn't have to go out and spec hardware and purchase it. I didn't have to rack hardware. When I need to burst hardware to process some additional data, I don't need to go out and buy more hardware, then decommission it until the next time I need it, at which point it's just sitting around doing nothing. When I need infrastructure, I provision it. When I don't need it, or I want to upgrade to more performant infrastructure, I de-provision it and provision new infrastructure as needed. And all of this is happening in data centers that are put together with far more redundancy than anything I could afford to put together myself.

That's what cloud means to me. Yes, it's computers I can't see sitting in racks somewhere, but that's not the part that makes it cloud. It's the infrastructure as a service that makes it cloud. For that matter, it's Internet-connected services, period, that makes it cloud. If it's a service sitting out on the Internet somewhere out of sight of me where I don't have to manually configure hardware and can easily scale as needed, it's cloud. Claiming "it's just computers, dude!" overlooks the point entirely.


Friday, September 30, 2016

"So do that with your smartphone, nerd boy!"

That was the challenge from someone who'd read the story about an auto shop in Poland still using a Commodore 64 to balance drive shafts.

The Commodore 64 had a port on the back where you got direct digital signals from a parallel I/O chip (the 6526 CIA). So it was used in a number of embedded applications back in the day in situations where customers wouldn't actually see that critical tasks were being done by a $150 home microcomputer with a whole 64k of RAM and a 1mhz processor. When I was in school, I got a couple of contracts to do embedded stuff using the Commodore 64. The one I found most interesting was the temperature characterization of a directional drilling probe.

Directional drilling probes are used to know where the drill bit is when you're doing horizontal drilling of oil or gas wells. We calibrated the probe by mounting it in a testbed that allowed moving it into various positions, and monitoring it with a Commodore 64 bit-banging a two-wire interface. This testbed was in a magnetically calibrated chamber that could be heated or cooled upon demand. The probe itself had seven sensors -- three gravitic sensors (x, y, z), and three magnetic sensors (that aligned with the Earth's magnetic field to point towards magnetic north), in three different orientations (x, y, z), as well as a temperature sensor. These sensors went into A/D converters on the drilling probe itself, and were read out via a two-wire protocol (there were four wires total that went to the probe -- +/- power, and the CLK/DATA lines -- because running wires down a drill string is a PITA and they wanted to run as few wires as possible). The problem is that everything was heat sensitive -- "true north" ( or "true up and down" ) returned a different result from the A/D converters depending upon the temperature. And the further you go down into the Earth, the hotter it gets. You didn't want your directional drill heading off onto someone else's plot of ground just because it got hot, that could be a legal mess!

So basically, what we did was bake the probe, then watch the signals as it cooled off. A test run consisted of taking the probe up to its maximum operating temperature, pressing the ENTER key on the Commodore 64, and then turning off the oven and letting it cool down. As it cooled down, the Commodore bit-banged the values in from the probe and created a table in memory as well as graphed on the console. This was done in each of the six orientations of the probe. At the end of the test run, the table was printed out onto a piece of paper to be entered into the calibrated software that went with the probe (calibrated software that did *not* run on the Commodore 64, it ran on a standard PC under MS-DOS, and yes, I wrote that software too, based on equations I was given by their chief scientist).

So do this with a smartphone? Okay, challenge accepted! Some of the things being done with APRS and Android on ham radio would work here, that's another instance where you're interfacing a smartphone with an analog system. I would use a $25 Arduino board ( ) to bitbang the signals. I would use an $8 Bluetooth adapter for the Arduino that presented itself as a Bluetooth UART adapter. Then I would use the Bluetooth Serial profile on the Android phone to actually retrieve the streams of data from the Arduino, process them, display them as pretty graphs on the phone's display and, since this is now the 21st century, send them to a server on the Internet where they're stuck in a database under the particular directional drilling probe's serial number.

Of course, it's be just as easy to have the Arduino do that part too, if you choose an Arduino that has a WiFi adapter, and use the phone only to prompt the Arduino to start a test run and to display the pretty graphs being generated on the Internet server. It'd be even easier to use a laptop with built-in Bluetooth. But hey, you challenged me to do it with my phone, so there. :P .


Wednesday, August 10, 2016

Security fail:

One of the first rules of security on the Internet for avoiding phishing attacks is that you never, ever, enter your user credentials into a web site unless you know that you're talking to that web site, and that web site alone -- not some spoofed web site. The way we do this in the modern era is with SSL encryption. https provides not only encryption of the content being transmitted between you and the web site, it also provides authentication. Only one server (well, tightly controlled constellation of servers for bigger web sites) has the private key whose public key is being served to you (and which can be validated against the global public key infrastructure). And that's the server that you think you're talking to. If you type, say,, you will be on the homepage. Up in your browser URL bar will be a green lock. Click on that lock, and with some clicking around (depends on browser) you will be able to view the certificate and verify that you are, in fact, connected to the one and only home page owned by the one and only Google. Then, and only then, is it okay to hit the 'Login' link up at the top right of the page and log in.

So, this is how any of us who are concerned about security operate. We don't put in a user name and password unless we see that green lock, click on it, and it says we're talking to who we think we're talking to. This is because of a hacker technique called DNS poisoning, where hackers can manage to convince your local name servers that their server, not the real's web server, is where you should go to get to They then intercept the user name and ID that you enter. Well, they can convince your DNS to give the wrong address, but they don't have Google's private key, so they can't impersonate Google. So you won't get that green lock. But they hope you won't notice. You should.

This attack is called phishing, and is used to filch your user name and password, which are then used in an automated fashion on other web sites. Usually after your DNS is poisoned, you get an email telling you to go to because your password is expiring and your account will be deleted. So I got what looked like a phishing email from saying that my account was going to be deleted because I hadn't logged in for over a year, unless I logged in within the next two weeks. This actually would be normal for FedEx -- the only reason to ever log in to their web site is to set up your notification preferences that tell you that a package is on the way, has been delivered, and so forth. So with the possibility that it might actually be a valid email, I manually type into my browser's URL bar (*never* click on a URL in email! Never!), hit the ENTER key... and immediately got kicked out to a non-encrypted site.

At which point my reaction was, "WTF? Have hackers hacked the FedEx web site and are grabbing user credentials?" But it appears that's not the case. Using the host and whois systems to resolve the IP address, it goes into Akamai's site acceleration service. Instead, it looks like pure rank incompetence. FedEx is deliberately putting their customers' user names and passwords at risk because... why? Well, because they're too stupid to know how to implement SSL in an Akamai-distributed architecture, apparently. Despite the fact that Akamai has explicitly supported SSL for years.

So anyhow, I use LastPass so my password was random gibberish in the first place, so after examining the source code of the web page to see if there were obvious problems, I logged in. At that point the web site did put me into a proper SSL-encrypted web page. But the point... the point... I should have never had to enter my user name and password into a plain text unencrypted web page in the first place. There's no -- zero -- way to authenticate that you are actually talking to the site you thought you were talking to, if you're talking to a web site that's not https. DNS poisoning attacks are ridiculously easy and could have sent me *anywhere*. The only reason I felt even halfway safe talking to this web site was because LastPass had generated me a random gibberish password for this web site a year ago, so if they *did* steal my FedEx credentials, at least they could only be used to hack my FedEx account, which would be no big deal (I don't have credit card information or billing information associated with the account, it's strictly an informational account). But still. Bad FedEx. Bad, bad FedEx. Bad DevOps team, no cookie, go to your room!

The takeaway from this:

  1. Check those green locks. They're important. It tells you a) whether you're talking to the web site you think you're talking to or not (if it's there, you know you are, if it's not, you don't know), and b) tells you that any user names and passwords that you enter will go to the web site via an encrypted connection.
  2. Any web site that your company provides should only ask for user names and passwords on an SSL-encrypted https page. If someone tries to go there with a plain http: url, it should immediately be forwarded to the SSL site.
  3. If you don't follow that last rule, you will be publicly shamed, if not by me, by someone. And a public shaming is never good for your brand.
  4. Furthermore, if you don't follow that last rule, there's many potential customers who will simply refuse to use your web site. This perhaps is not a big deal for FedEx, since most of their customers have no choice but to use their web site to schedule package pickups, but if you're providing a web service to the general public? Dude. You are leaving money on the table if you do something stupid like ask for a username and password on an unencrypted page.
This isn't rocket science, people. This is just basic Web Services 101. Get it right, or choose a different profession. I understand that McDonalds is hiring. Just sayin'.


Sunday, September 27, 2015

SSD: This changes everything

So someone commented on my last post where I predicted that providing block storage to VM's and object storage for apps was going to be the future of storage, and he pointed out some of the other ramifications of SSD. To whit: Because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where CPU in many cases is the restriction. This is especially true since Moore's Law has seemingly gone AWOL. The Westmere Xeon processors in my NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeon processors. The slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes with replacing rotational storage with SSD's.

I have seen that personally, myself, in watching a Grails application max out eight CPU cores while not budging the iometer on a database server running off of SSD's. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are limited. In the future efficient algorithms and languages are going to come back in fashion to use all this fast storage that is taking over the world.

But that's not what excites me about SSD's. That's just a shuffling of priorities. What excites me about SSD's is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from bottom to top, then from top to bottom, in order to optimize reads. This in turn puts some severe restrictions on how we lay out storage for block storage -- the storage must be stored contiguously so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery backed cache -- we can't do COW RAID stripe block replacement (that is, write altered blocks of a RAID stripe at some new location on the device then alter a stripe map table to point at those new locations and add the old locations to a free list) because a filesystem on top of the block device would not be able to schedule the elevator properly. The performance of the block storage system would fall over. Thus why traditional iSCSI/Fiber Channel vendors present contiguous LUNs to their clients.

As a result when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could properly schedule the elevator. Thus ZFS and BTRFS. They manage their own redundancy rather than using RAID at the block layer to handle their redundancy, and ideally want to directly manage the block devices. Unfortunately that really doesn't map well to a block storage back end that is based on LUNs, and furthermore, doesn't map well to virtual machine block devices represented as files on the LUN -- virtual machines all have their own elevators doing what they think are sequential ordered writes, but the COW filesystems are writing at random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, which is an extent-based clustered filesystem that, again, due to the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file located largely contiguously on the underlying block storage so that the individual virtual machines' elevators can schedule properly.

So VMFS talking to clustered block storage is one way of handling things, but then you run into limits on the number of servers that can talk to a single LUN that in turn makes it difficult to manage because you end up with hundreds of LUN's for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that have virtual machines on that specific LUN (in order to avoid hitting the limits on number of servers allowed to access a single LUN). What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, and have the same capabilities on that back end that VMFS gives us on a single LUN -- the ability to do snapshots, the ability to do sparse LUN's, the ability to copy snapshots as new volumes, and so forth. And have it all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage because we were slaves of the elevator, because we had to make sure that all this storage ended up contiguous. But now we don't -- the writes have to be contiguous, due to the limitations of SSD, but reads don't. And it's the reads that forced the elevator -- scheduling contiguous streams of writes (from multiple virtual machines / multiple files on those virtual machines) has always been easy.

I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it as a NFS datastore to ESXi boxes, rather than as block storage LUN's. NFS doesn't have the limits on number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. VSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily on Amazon's cloud, and they want a degree of flexibility to spin virtual machines up and down that make life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do this on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS data stores but they already have the ability to present an iSCSI LUN to a virtual machine as a block device. Add in some local SSD caching at the local hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.

Finally, there is a question as to whether storage and compute should be separate things altogether. Why not have compute in the same box as your storage? There's two problems with that though: 1) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and b) density of compute servers is much higher than density of data servers, i.e., you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power is now going to be the limiting factor for many applications, not IOPs. Finally, you want the operational capability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory just keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to continue to have to scale compute servers far more often than I have to scale storage servers.

So this has gone on much too long but the last thing to cover is this: Will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph on top of large numbers of standard Linux storage servers serving individual disks as JBOD's? It's possible, I suppose -- but it seems unlikely due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high end big iron database business in the end because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than purchase it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware in the end because pure software defined solutions are unable to do rudimentary things like, e.g., use SES to blink the LED of a disk bay whose SSD has failed. In the end providing an iSCSI LUN directly to a virtual machine requires both a software support side that is clearly software defined, and a hardware support side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that will incorporate increasingly large amounts of software that runs on infrastructure servers to define important functions like, e.g., spinning up a virtual machine that has a volume attached of a given size and IOPs guarantee.


Tuesday, August 25, 2015

Where does the future of enterprise storage lie?

I've talked about how traditional block and NAS storage isn't going away for small businesses. So what about enterprise storage? In the past few years, we've seen the death of multiple vendors of scale-out block storage, two of which were of interest to me being Coraid and Intransa, both of which allowed chaining together large numbers of Ethernet-connected nodes to scale out storage across a very large array (the biggest cluster we built at Intransa had 16 nodes and a total of 1.5 petabytes of storage but the theoretical limits of the technology were significantly higher). Reality is that they had been on life support for years because the 1990's and 2000's were the decades of NAS, not of block storage. Oh, EMC was still heaving lots of big iron block storage over the wall to power big databases, but most applications of storage other than those big corporate data marts were NAS applications, whether it was Windows and Linux NAS servers at the low end or NetApp NAS servers at the high end.

NAS was pretty much a necessity back in the era of desktops and individual servers. You could mount people's home directories on a CIFS or NFS share (depending on their OS). People could share their files with each other by simply copying them to a shared directory. You saw block storage devices being exported to these desktops via iSCSI sometimes, but usually block storage devices were attached to physical servers in the back room on dedicated storage networks that were much faster than floor networks. The floor networks were fast enough to carry CIFS, but CIFS at its core is just putting and getting objects, not blocks, and can operate much more asynchronously than a block device and thus wasn't killed by latency the way iSCSI is.

But there's problems too. For one thing, every single device has to be part of a single login realm or domain of some sort, because that's how you secure connections to the NAS. Furthermore, people have to be put into groups, and access set on portions of the overall NAS cloud based on what groups a person belongs to. That was difficult enough in the days when you just had to worry about Linux servers and Windows desktops. But now you have all these devices

Which brings up the second issue with NAS -- it simply doesn't fit into a device-oriented world. Devices typically operate in a cloud world. They know how to push and pull objects via http, but they don't speak CIFS or NFS, and never will. What we are seeing is that increasingly we are operating in a world that isn't file based, it's object based. When you go into Google Docs to edit a spreadsheet, you aren't reading and writing a file. You're reading and writing an object. When you are running an internal business application, you are no longer loading a physical program and reading and writing files. You're going to a URL for a web app that most likely is talking to a back end database of some kind to load and store objects.

Now, finally, add in what has happened in the server room. You'll still see the big physical iron for things like database servers. But by and large the remainder of the server room has gone away, replaced by a private cloud, or pushed into a public cloud like Amazon's cloud. Now when people want to put up a server to run some service they don't call IT and work up a budget and wait months for an actual server to be procured etc., they work at the speed of the cloud -- they spin up a virtual machine, they attach block storage to it for the base image and for any database they need beyond object storage, and they implement whatever app they need to implement.

What this means is that block storage and object storage integrated with cloud management systems like OpenStack are the future of enterprise storage, a future that alas did not arrive soon enough for the vendors of scale-out block storage that survived the previous decade, who ended up without enough capital to enter this brave new world. NAS won't go away entirely, but it will increasingly be a departmental thing feeding desktops on the floor, not something that anything in the server room uses. And that is, in fact, what you see happening in the marketplace today. You see traditional Big Iron vendors like HDS increasingly pushing object storage, and the new solid-state storage vendors such as Pure Storage and Solidfire are predominantly block storage vendors selling into cloud environments.

So what does the future hold? For one thing, lower latencies via hypervisor integration. Exporting a share via iSCSI then mounting it via the hypervisor has all of the usual latency issues of iSCSI. Even with 10 gigabit networking now hitting affordability and 25 to 100 gigabit Ethernet in the future, latency is a killer if you're expecting a full round trip. What if writes were cached on a local SSD array, in order, and applied in order? For 99% of the applications out there this provides all the write consistency that you need. The cache will have to be turned off prior to migrating the virtual machine to a different box, of course -- thus the need for hypervisor integration -- but other than a catastrophic failure (where the virtual machine will go lights out also and thus not have inconsistent data when it is restarted on another node) you will, at best, have some minor data loss -- much better than inconsistent data.

So: Block storage with hypervisor and cloud management integration, and object storage. The question then becomes: Is there a place for the traditional dedicated storage device (or cluster of devices) in this brave new world? Maybe I'll talk about that next, because it's an interesting question, with issues of data density, storage usage, power consumption, and then what about that new buzzword, "software defined storage"? Is storage really going to be a commodity in the future where everybody's machine room has a bunch of generic server boxes loaded with someone's software? And what impact, exactly, is solid state storage having? Interesting things to think about there...