Saturday, August 25, 2012

Windows 8 and taste

One thing I've noticed about most geeks: they have horrible taste. Look at their homes, their clothes, their cars, the trinkets scattered about their cubicles: it's all a horrible mishmash of ugly. The way Apple addressed this was via the Stalinesque concept of the Chief Designer. You may laugh at that description, but the Soviet-era Soyuz rocket and space capsule, designed under the supervision of Chief Designer Sergei Korolev, are still flying today, fifty years after their design process began, because he had exactly the same kind of qualifications as Apple's Chief Designer -- good engineering taste that balanced simplicity, cost, performance, and capability into a pleasing whole.

What brings this to mind is Windows 8, which I'm using to type this while eval'ing the RTM product. I'm not disclosing any NDA stuff here; it's pretty much the same product you downloaded earlier as the "Consumer Preview," with a few pieces of missing functionality filled in (and undoubtedly many bugs fixed). Windows 8 is Microsoft's attempt to re-invent the user interface, but it fails primarily for two reasons: a lack of courage, and the lack of that chief designer.

The lack of courage shows in how Microsoft flinched at the notion of completely re-inventing the desktop. As a result, the "Classic" desktop is still available by hitting a button on the "Modern" desktop. The end result is a bizarre mishmash of two desktop environments in one, and twice the amount of stuff to learn if you're a user, because the "Classic" desktop doesn't work *exactly* the same as the well-known Windows 7 desktop, while the "Modern" desktop... well, it's entirely different, period. Twice the amount for end users to learn is user environment fail, period.

The lack of that chief designer, however, shows even more in the design of the "Modern" desktop. A good design is clean, looks simple (even if it isn't), everything's laid out in an obvious manner, there's a limited number of things for end users to learn in order to be productive, and, for lack of a better word, it is tasteful. It doesn't look like a mishmash of unrelated ideas from multiple independent teams all smashed together into one product.

That, however, doesn't describe the "Modern" desktop at all. One of the things I noted about Gnome 3 was that you basically had to know one gesture -- moving your mouse pointer to the top left of the screen (or touching the top left of the screen on a touchscreen) -- to make it useful to you. Everything else is pretty obvious: touch an icon, or touch and drag an icon (or click and click-and-drag with a mouse), or scroll up and down with the mouse wheel or two fingers. With the "Modern" desktop, every single corner of the screen does something -- and does something *different* (with the exception of the right-hand corners, which both do the *same* thing). Furthermore, moving to a corner, waiting for the hover timeout, and then moving your mouse up and down does something *different* again. And right-clicking does something *more* different still. The confusing number of things you can do (indeed, need to know how to do to make the environment useful) is well past the three things you need to know to use Gnome 3.

In essence, it's as if a bunch of geeks got together and decided to take every idea from every touchscreen environment ever created anywhere, and put them all into the same user interface. It's as if every geek critic of Gnome 3's tasteful design got together and designed their perfect touchscreen environment with every feature they could think of. It's as if Larry Wall designed the thing. Folks, Perl is many things, but clean and easy to use are not among them -- it's an ugly, nasty piece of work that will spit in your eye if you look at it wrong, just like the camel on the cover of the definitive book on the language. Like said camel, it also happens to be very useful (which is why I wrote the virtualization management infrastructure for our virtualized product line in Perl: it was the most reasonable way to parse the output of the low-level utilities that the various virtualization systems use for their management), but nobody has ever suggested that end users be given Perl as the user interface to their computers.

So the question is, will Windows 8 succeed? Well, define "success". The majority of personal computers in the world next year will ship with Windows 8 pre-installed. And because everything in post-Gates Microsoft is an API, and Microsoft is quite open with their APIs (Apple, not Microsoft, is the "Do Evil" company in the post-Gates era), sooner or later someone is going to come up with a means to tame this mess. But I have to say that Windows 8 is, in the end, a disappointment to me. Microsoft had an opportunity to re-define how personal computers work, and they have all the pieces needed in Windows 8 to do so. They just needed a tasteful Chief Designer with the power to impose order and taste upon this mess -- and, alas, it appears they have no Jony Ive or Sergei Korolev to do so.

-ELG

Linux block layer, BTRFS, and ZFS On Linux

Long time no blog. Lately I've been stuck way down in the 2.6.32 kernel's block device midlayer, both initiating I/O to block devices via the submit_bio interface and setting up a midlayer driver.

What I'm finding is that things are a bit of a mess in the 2.6.32 kernel when it comes to device pulls and removals. When I chug down into the SCSI midlayer, I see that it's supposed to be completing all bios with -EIO, but there are still situations where, when I yank a drive out of the chassis, I don't get all of my endios back with errors, because of races in the kernel between device removal and device teardown. The net result is that I have no idea what actually made it to disk. Note that you will NOT see this racy behavior on a normal system, where the completion (almost) always wins the race; I was generating thousands of I/Os per second to 48 disks with CPU usage pretty much maxed out, as part of load testing to see how things worked at the limits.
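
To make that concrete, here is a minimal sketch of a write submitted through the 2.6.32-era submit_bio interface (the names my_io_ctx, my_write_done, and my_write_page are made up for the example, not taken from my actual driver). The bi_end_io callback is the only completion signal the submitter ever gets; if the removal race swallows it, there is no way to know whether the write reached the platter.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/errno.h>
#include <linux/completion.h>

struct my_io_ctx {                    /* made-up per-I/O bookkeeping */
	struct completion done;
	int error;
};

/* The endio callback is the ONLY completion signal the submitter gets. */
static void my_write_done(struct bio *bio, int error)
{
	struct my_io_ctx *ctx = bio->bi_private;

	ctx->error = error;           /* -EIO on failure, 0 on success */
	complete(&ctx->done);         /* never happens if the endio is lost */
	bio_put(bio);
}

static int my_write_page(struct block_device *bdev, sector_t sector,
			 struct page *page, struct my_io_ctx *ctx)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	if (!bio)
		return -ENOMEM;

	bio->bi_bdev    = bdev;
	bio->bi_sector  = sector;     /* 2.6.32 field; newer kernels moved this */
	bio->bi_end_io  = my_write_done;
	bio->bi_private = ctx;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	init_completion(&ctx->done);
	submit_bio(WRITE, bio);       /* 2.6.32 signature: submit_bio(rw, bio) */
	return 0;
}

If the drive is yanked after submit_bio() returns but before the midlayer errors the request out, my_write_done() may simply never be called -- which is exactly the race described above.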

Now, that's no problem for my particular application, which is somewhat RAID-like, or for the Linux MD layer, or for single standalone block device filesystems for that matter. What's on the disk is on the disk, and when a standalone filesystem is remounted it'll know its state at that point by looking at the log. For the RAID-type stacking drivers, when the disk comes back the RAID layer will note that its RAID superblock is out of date and rebuild the disk via mirror or ECC recovery, a somewhat slow process, but the end result is a disk in a known state. So when I get the disk removal event, I mark the device as failed, quit accepting I/O for it, and mark all the pending work for that device that hasn't already endio'ed as errored; if an endio sneaks in afterwards and tries to mark that work item as errored again, no big deal (although I put a check in the endio so that it simply noops if the work was already completed by the disk removal event). This means I have to keep a work pool around, but that's a good idea anyhow, since otherwise I'd be thrashing kmalloc/kfree, and if the drive comes back I'll re-use that pool.
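
That bookkeeping boils down to something like the following sketch (again, made-up names rather than my actual driver): each in-flight I/O gets a pre-allocated work item, the removal event errors out everything still pending, and a late endio that races in afterwards just noops.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

struct my_work {
	struct list_head list;
	int done;                     /* guarded by my_pool.lock */
	int error;
};

struct my_pool {
	spinlock_t lock;
	struct list_head pending;     /* work items still waiting for an endio */
	struct list_head free;        /* recycled instead of kmalloc/kfree thrash */
};

/* Endio path: noop if the removal event already finished this work item. */
static void my_complete_work(struct my_pool *pool, struct my_work *w, int error)
{
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	if (!w->done) {
		w->done  = 1;
		w->error = error;
		list_move(&w->list, &pool->free);
	}
	spin_unlock_irqrestore(&pool->lock, flags);
}

/* Disk removal event: fail every work item that hasn't endio'ed yet. */
static void my_fail_all_pending(struct my_pool *pool)
{
	struct my_work *w, *tmp;
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	list_for_each_entry_safe(w, tmp, &pool->pending, list) {
		w->done  = 1;
		w->error = -EIO;
		list_move(&w->list, &pool->free);
	}
	spin_unlock_irqrestore(&pool->lock, flags);
}

Either path can win the race; whoever gets there second sees done already set and leaves the work item alone.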

So traditional RAID and standalone devices don't have a problem here. Where a problem exists is with filesystems like btrfs and zfs that do replication at a per-file-block level rather than at a per-block-device-block level. If they can't log whether a block made it or not, because they never got an endio, they can get confused. btrfs appears to err on the side of caution (i.e., it assumes the block didn't get replicated and replicates it elsewhere if possible), but when the missing volume comes back and has that additional replica on it, strange things may happen. ZFSonLinux is even worse, since its SPL (Solaris Porting Layer) appears to assume that bios always complete, and deadlocks waiting for them rather than properly handling disk removal events. (Note that I haven't really gone digging into the SPL; I'm basing this solely on observed behavior.)
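
The deadlock is the difference between waiting forever on a completion that may never arrive and bounding the wait. Roughly (this illustrates the failure mode, it is not the actual SPL code, and it reuses the made-up my_io_ctx from the earlier sketch):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

static int my_wait_for_write(struct my_io_ctx *ctx)
{
	/*
	 * Deadlock-prone pattern: if the endio was swallowed by a device
	 * removal race, this sleeps forever.
	 *
	 *     wait_for_completion(&ctx->done);
	 */

	/* Bounded wait: give up eventually and treat the I/O as lost.
	 * (30 seconds is an arbitrary example timeout.) */
	if (!wait_for_completion_timeout(&ctx->done, 30 * HZ))
		return -ETIMEDOUT;

	return ctx->error;
}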

The good news: The popularity of btrfs among kernel developers appears to be motivating the Linux kernel team to fix this situation in newer kernels. I was chugging through the block subsystem in 3.5 to see if there was something that could be backported to make 2.6.32 behave a bit better here, and noticed some significant new functionality to make the block subsystem more flexible and robust. I didn't backport any of it because it was easier to just modify my kernel module to behave well with the default 2.6.32 behavior (I'm talking *extensive* changes in the block layer in recent kernels), but it appears that the end result is that btrfs on the 3.5 kernel should be *significantly* more reliable than the backported version of btrfs that Red Hat has put into their 2.6.32 kernel on RHEL 6.3.

So that's my recommendation: if you want to run btrfs right now, go with the latest kernel, whether it's on RHEL 6.3 or Fedora 17 or whatever. And now you know the reason for my recommendation. Red Hat has *not* backported any of those block layer changes to RHEL 6.3, so what you have with btrfs on their stock 2.6.32 kernel is the equivalent of a knight in shining armor that's missing the breastplate. Sort of renders the whole exercise useless, in the end.

-ELG