Sunday, August 8, 2010

Architectural decisions

Let's look at two products. The first is a small 1U rackmount firewall device with a low-power Celeron processor and 256 megabytes of memory. It can optionally be paired into a high-availability cluster so that if one module fails, the other takes over. Storage is provided by either a 120GB hard drive or a 64GB SSD. The second is a large NAS file server with a minimum configuration of 4 gigabytes of memory and a minimum hard drive configuration of 3.8 terabytes. The file system on this server is inherently capable of propagating transactions due to its underlying design.

So: how are we going to handle failover on these two devices? That's where your architectural decisions come into play, and those decisions will in large part determine how things get done.

The first thing to influence our decisions is how much memory and CPU we have to play with. This directly influences our language choices, because the smaller and more limited the device, the lower level we have to go in order to a) fit the software into the device, and b) get acceptable performance. So for the firewall, we chose "C". The architect of the NAS system also chose "C". As an exercise for the reader: why do I believe the architect of the NAS system was wrong here?

To get acceptable performance on the small module, we chose a multi-threaded architecture where monitor threads were associated with XML entries describing what to monitor, and faults and alerts were passed through a central event queue handler. That handler used the same XML policy database to determine which handler module (mechanism) to execute for a given fault or alert event. Nothing was hard-wired; everything could be reconfigured simply by changing the XML. The architect of the NAS system, by contrast, had an external process sending faults and alerts to the main system manager process over a socket using a proprietary protocol, and the main system manager process then spawned agent threads to perform whatever tasks were necessary -- but the main system manager process had no XML database or any other configurable way to associate mechanism with policy. Policy for handling faults and alerts was hard-wired. Is hard-wiring policy into software wise, or even necessary, if there is an alternative?
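
To make that policy-versus-mechanism split concrete, here's a minimal sketch of the idea in Python rather than "C" (purely for brevity); the XML schema, the event names, and the handler registry are made up for illustration, not the firewall's actual format:

    import queue
    import xml.etree.ElementTree as ET

    # Hypothetical policy file: each entry maps an event type to a handler mechanism.
    POLICY_XML = """
    <policy>
      <event type="link_down" handler="failover_to_peer"/>
      <event type="disk_full" handler="rotate_logs"/>
      <event type="peer_lost" handler="promote_to_master"/>
    </policy>
    """

    # Mechanisms live in a registry; the policy only names them, it never calls them.
    HANDLERS = {
        "failover_to_peer":  lambda ev: print("asking peer module to take over:", ev),
        "rotate_logs":       lambda ev: print("rotating logs:", ev),
        "promote_to_master": lambda ev: print("promoting self to master:", ev),
    }

    def load_policy(xml_text):
        """Build an event-type -> handler-name table from the XML policy."""
        root = ET.fromstring(xml_text)
        return {e.get("type"): e.get("handler") for e in root.findall("event")}

    def event_loop(events, policy):
        """Central event queue handler: monitor threads put fault/alert events here."""
        while True:
            ev = events.get()
            if ev is None:                      # shutdown sentinel
                break
            handler = HANDLERS.get(policy.get(ev["type"]))
            if handler:
                handler(ev)                     # dispatch chosen entirely by the XML

    q = queue.Queue()
    q.put({"type": "link_down", "detail": "eth1"})
    q.put(None)
    event_loop(q, load_policy(POLICY_XML))

The point is that changing which mechanism fires for which fault is an edit to the XML, not a recompile.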

The next question is: what problem are we going to solve? For the firewall system, it's simple -- we monitor various aspects of the system and execute the appropriate mechanism, specified by XML-configured policies, when various events happen, with the goal of maintaining service as much as possible. One possible mechanism is to ask the slave module to take over. Tweaking policy so that this happens only when there is no possibility of recovery on the active module is decidedly a goal, because there is a brief blink of service outage as the upstream and downstream switches get GARP'ed to redirect gateway traffic to a different network port, and service outages are bad. We don't have to worry about resyncing when we come back up -- we just resync from the other system at that point. If we had any unsynced firewall rules or configuration items that weren't on the other system when we went down, well, so what; it's no big deal to manually re-enter those rules. And in the unlikely event that we manage to dual-head (not very likely, because we have a hardwired interconnect and differential backoffs where the current master wins and does a remote power-down of the slave before the slave can do a remote power-down of the master), no data gets lost, because we're a firewall. We're just passing data, not serving it ourselves. All that happens if we dual-head is that service is going to be problematic (to say the least!) until one of the modules gets shut down manually.
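
For the curious, here's roughly what I mean by differential backoffs, sketched in Python for readability; the delay values, the fencing call, and the function names are assumptions for illustration, not the firewall's actual code:

    import time

    # Assumed delays: the current master reacts faster than the slave, so if both
    # sides lose the heartbeat at once, the master fences the slave first and wins.
    MASTER_BACKOFF_SECONDS = 1
    SLAVE_BACKOFF_SECONDS = 5

    def heartbeat_lost(i_am_master, peer_power_off, peer_is_back):
        """Called when heartbeats over the hardwired interconnect stop.

        peer_power_off -- callable that does a remote power-down of the peer
        peer_is_back   -- callable returning True if heartbeats have resumed
        """
        time.sleep(MASTER_BACKOFF_SECONDS if i_am_master else SLAVE_BACKOFF_SECONDS)
        if peer_is_back():
            return "peer recovered, no action taken"
        peer_power_off()              # fence the peer before taking over its role
        return "powered peer down, taking over as master"

    # Example: the master, waking first, fences the (unresponsive) slave.
    print(heartbeat_lost(True, lambda: print("remote power-down of peer"), lambda: False))

Because the slave waits longer, the master has normally already powered it off by the time the slave would act, which is why dual-heading is so unlikely on that box.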

For the NAS system, it's quite a bit harder. Data integrity is a must. Dual-heading -- both systems believing they are the master -- requires either advanced transaction-merge semantics for when the partition is resolved (merge semantics that are wickedly hard to prove free of data corruption), or it must be avoided at all costs: every system associated with a filesystem must immediately cease providing services if it has not received an "I'm going down" from the missing peer(s), has no ability to force the missing peer to shut down (via IPMI or other controllable power), and has no way of assuring (via voting or other mechanisms) that the missing peers are in fact going down. Still, we're talking about the same basic principle, with one caveat -- dual-heading is a disaster, and it is better to serve nothing at all than to risk dual-heading.

For the NAS system, the architectural team chose not to incorporate programmable power (such as IPMI) to allow differential backoffs to assure that dual-heading couldn't happen. Instead, they chose to require a caucus device. If you could not reach the caucus device, you failed. If you reached the caucus device but saw no update ticks on it from your peer(s), you provided services. This approach is workable, but a) it requires another device, and b) it introduces a single point of failure. If you provide *multiple* caucus devices, you still have the potential for a single point of failure in the event of a network partition: when a partition happens (i.e., you start missing ticks from your peers), if you cannot reach *all* of the caucus devices, you cannot guarantee that the missing peers are not themselves updating the caucus devices you can't reach and concluding that *you* are the down system. How did the NAS architectural team handle that problem? Well, they didn't. They used a single caucus device, and if anybody couldn't talk to it, they simply quit serving data in order to prevent dual-heading, and lived with the single point of failure. I have a solution that would allow multiple caucus devices while guaranteeing no dual-heading, based on voting (possibly weighted in case of a tie), but I'll leave that as an exercise for the reader.
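
Purely as food for thought, here is a minimal sketch of the conventional majority-vote answer to multiple caucus devices -- one possible approach, not necessarily the weighted-voting scheme I'm alluding to above, and the rule and numbers are assumptions:

    def may_serve(ticks_from_peers, caucus_votes, total_caucus_devices=3):
        """Decide whether this node may keep serving after losing peer ticks.

        caucus_votes -- one bool per caucus device we could reach; True means that
                        device shows no recent ticks from the missing peer(s).
        Assumed rule: keep serving only if we reached a strict majority of all
        caucus devices AND none of the reached devices saw the peer recently.
        An odd device count avoids ties (a weighted-vote variant handles ties).
        """
        if ticks_from_peers:
            return True                      # no partition: business as usual
        if len(caucus_votes) * 2 <= total_caucus_devices:
            return False                     # can't rule out the peer holding the majority
        return all(caucus_votes)             # any device that saw the peer vetoes us

    # Example: partition, we reach 2 of 3 caucus devices, neither has seen the peer.
    print(may_serve(False, [True, True]))    # True  -- safe to keep serving
    print(may_serve(False, [True]))          # False -- only 1 of 3 reachable

Since both sides of a partition cannot each reach a strict majority of the caucus devices, at most one side keeps serving, which removes the single point of failure without risking dual-heading.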

So... architectural decisions: 1) Remember your goals. 2) Make things flexible. 3) Use as high-level an architecture as possible on your given hardware to ensure that #2 isn't a fib -- i.e., if what you're doing is doable in a higher-level language like Java or Python, for heaven's sake don't do it in "C"! 4) Separate policy from mechanism -- okay, so this is the same as #2, but it's worth repeating. 5) Document, document, document! I don't care whether it's UML, or freehand sketches, or whatever, but your use cases and data flows through the system *must* be clear to everybody on your team at the time you do the actual design, or else you'll get garbage. 6) Have good taste.

Have good taste? What does that mean?! Well, I can't explain it. It's like art: I know it when I see it. And that, unfortunately, is the rarest thing of all. I recently looked at some code I had written in college, implementing one of the early forum boards. I was a bit astonished that even all these years later, the code was clearly well structured and showed a clean, well-conceived architecture. It wasn't because I had a lot of skill and experience -- look, I was a college kid. I guess I just had good taste, a clear idea of what a well-conceived system is supposed to look like, and I don't know how that can be taught.

At which point I'm rambling, so I'm going to head off to read a book. BTW, note that the above NAS and firewall systems are, to a certain extent, hypothetical. Some details match systems I've actually worked on, some do not. If you worked with me at one of those companies, you know which is which. If you didn't, well, try not to read exact details as gospel of how a certain system works, because you'll be wrong :).

-ELG
