The reason I mention that is because I was talking to the Director of Software Development of a network security appliance company and mentioned that I'd spent some time recently modifying Intransa's kernel block drivers to work around bugs in the Linux 2.6.32 kernel, specifically, to work around some races under heavy load that had been introduced into block device teardown that would either OOPS you or cause hung I/O. He asked me, puzzled, "why didn't you just fix the kernel bugs?"
Well, it was a fair question. The fact that the races I ran into are a result of the removal of the Big Kernel Lock and would require significant re-factoring of the kernel locks to make them go away, and furthermore that they're hard to reproduce and debug, is one issue. I looked at later kernel versions to see how that played out, and the changes to the block layer were deep and intrusive. It was much simpler to simply modify our software to cope with the misbehavior. But that's not what I answered him with. I answered him "because if we start hacking on the Linux kernel that opens us up to a world of hurt from a maintainability point of view when we want to move to a new kernel version, from a licensing point of view with the GPL... it just isn't the right thing to do."
I was explicitly thinking about Agami's 2.6.7 kernel there. Agami patched 2.6.7 to a fare-thee-well. I'm not sure how many patches in toto they were applying to their kernel, but it was at least in the hundreds. By contrast, the last two appliance companies I've worked for -- Resilience and Intransa -- patched the kernel only if it was completely and utterly unavoidable. When I ported the Resilience kernel patches from RHEL3 to RHEL5, there was less than a dozen patches, and they were all to fix driver issues with specific network drivers for obsolete hardware that, alas, we still needed to support. We'd submitted those patches upstream and it turned out that I ended up only needing to apply five patches total, the rest had already been applied upstream. The situation with Intransa's kernel is even simpler. There is a .config file, and one(1) patch that basically exports a kernel API that we need for one of our driver modules. I also apply an upstream LSI mpt2sas driver and an upstream networking driver needed for the specific hardware in our current and next-generation servers, but those are compiled separately as part of our software compile, not as part of the kernel compile (i.e., they're disabled in the kernel compile and isolated in a "3rdparty" directory where they're easy to remove once we transition to a kernel that has the required drivers in it). Everything else is self-contained in our own modules.
The result is that while it was a pain to change kernel versions from the kernel shipped with Intransa StorStac 7.12 to the kernel shipped with Intransa StorStac 7.20, it was a manageable pain -- it took me roughly two weeks of work to figure out what was happening with the block layer locking and around two weeks to create and debug work-arounds in our kernel drivers in the three places that were running into race issues, and the other issues were just a matter of some of the include files moving around. Meanwhile, Agami had painted themself into such a corner with their heavily hacked kernel (amongst other issues) that it proved exceedingly difficult for them to move off of 2.6.7 even though 2.6.7 had severe stability issues under heavy load -- I had to scale back many of the hardware tests that I wrote for the manufacturing line because they were making the kernel fall over under heavy load, and I was trying to test that the machine was assembled correctly and that the hardware was working correctly, I wasn't trying to test for bugs in the kernel.
These issues come into play mostly when there are quantum shifts in underlying hardware architectures requiring either a new kernel version or significant back-porting of new drivers and kernel features. We had to switch kernel versions between StorStac 7.2 and StorStac 7.11 because of the introduction of Nehalem-based server hardware. We had to switch kernel versions between StorStac 7.12 and StorStac 7.20 because of the introduction of Sandy Bridge based server hardware. In both cases backporting the architecture and driver support back to an earlier kernel would have been *much* more difficult than simply porting the StorStac kernel drivers to the new kernel. And it was all because of the decision to keep our kernel code as independent of the kernel as possible -- if there had been extensive Intransa modifications to the kernel, we could have never done it within the short amount of time that it was done (the 7.12 to 7.20 development cycle was four months -- *including* re-basing to a new Linux distribution).
So I suppose the takeaway from all this is:
- If you are applying a lot of patches to the kernel, stop and think of different ways of handling things. The Linux kernel will have bugs. Always. Think of workarounds that will work with those bugs, and continue working once those bugs are fixed. Bonus brownie points if the workarounds also improve performance by, e.g., adding bio pending and free lists that mostly mitigate against needing to kmalloc bio's once the system has been under load for a while.
- Keep your own stuff in your own .ko modules, don't go patching mainline kernel code to add your own functionality unless there's just no alternative (such as one NIC driver that did not implement the ability to set the MAC address in the NIC chip, which we needed for cluster failover of the Resilience appliance).
- If you need to really hack on a kernel subsystem, either a) create a new module by a different name, or b) consider different solutions.
- And above all: Consider maintainability from the beginning. Because if you don't, you, too, can go out of business within two years of delivering your first (and last) product...
No comments:
Post a Comment