Confessions of a Linux Penguin: August 2010

Tuesday, August 24, 2010

Droid or iPhone?

On the left side, weighing in at 4.3 inches, the Motorola Droid X. This massive hunka hunka burnin' cell phone love is running a 1ghz TI OLED processor and has 8gb of built-in memory for programs and 16gb of Microsd memory for data, as well as an 8 megapixel camera.

On the right side, weighing in at 3.8 inches, the Apple iPhone 4. This sleek little beauty is running an 800mhz Samsung A4 with a 5 megapixel camera whose sensor is the same size as the Droid's (i.e., fewer, but more sensitive, pixels).

So who is the winner? The iPhone 4's camera, despite fewer pixels, is definitely better than the Droid X camera. As in, ridiculously good. On the other hand, the iPhone 4 is more tightly-walled than any previous iPhone. As in, jailbreaking it to run "unauthorized" software is ridiculously difficult. Given the way AT&T and Apple cripple the thing, that's a major problem. The iPhone's screen is too teensy for my not-as-young-as-they-used-to-be eyes. On the other hand, it also has the iPod ecosphere with it, and tight integration with my Macbook Pro, and all my current software will continue working with it.

The Droid X, on the other hand, is easily "rooted". Motorola has created their own ecosphere of sorts, with a car dock and charger, home docks, etc. so that you can do pretty much the same things as with the Apple ecosphere. It doesn't by default have any integration with my Macbook Pro, but a third-party program called The Missing Sync will do most of that. The big screen is nice for older eyes. But Android itself is ugly and clunky, though serviceable.

So who's the winner? Call it a draw -- for now. Which presents a problem, because my aging iPhone 3G really is not liking iOS 4.0, everything runs really slow and clunky. Maybe I ought to flip a coin... or maybe if I wait a few more months, the horse will sing. Hmm...

-ELG

Migration concluded

The penguin has now landed at a new employer. I'll update my LinkedIn profile with that information after I've had a few days to de-stress and relax... the past couple of weeks have been a wild, wild ride, reminding me a bit of the last couple of weeks before the deadline for a major new product. But having too much interest in your skills definitely beats the alternative :).

-- ELG

Monday, August 9, 2010

Action items

I had joined the company a few weeks earlier and was sitting in yet another raucous meeting. The latest attempt at a new product had failed, and the blame-casting and finger-pointing were at full tilt. Finally I sighed, and added my own say. "Look. I'm new here and I don't know what all has gone on, and really don't care who's to blame for what, blame isn't going to get anything done. What I want to know is, what do we need to do now?"

Person 1: "Well, we failed because we weren't using software engineering system X" (where X is some software engineering scheme that was popular at the time).

"Okay, so we'll use software engineering system X, I have no objection to using any particular system, as long as we use one. What's the first thing we need to do, in that system?"

Person 2: "We need to figure out what we want the product to do."

"Okay, let's do that. What is the product supposed to do?"

We discussed it for a while, then from there the meeting devolved into a list of action items, and eventually broke up with another meeting scheduled to work on the detailed functional requirements. But on the whiteboard before we left, I had already sketched out the basics of "what we want it to do", and eventually that turned into an architecture and then a product that is still being sold today, many years later.

So what's my point? Simple: Meetings must be constructive. One of the things my teacher supervisors told me, when I first entered the classroom, was to always ask myself, what do I want the students to be doing? And then communicate it. A classroom where every student knows what he's supposed to be doing at any given time is a happy classroom. Idle hands being the devil's workshop and all that. The same applies to meetings. Unless it's intended to be an informational meeting, meetings should always be about, "what do we want to do". And meetings should never be about blame-casting, finger-pointing, or any of the other negative things that waste time at meetings. No product ever got shipped because people pointed fingers at each other.

Everybody should have a takeaway from a development meeting -- "this is what I am supposed to be doing." Otherwise you're simply wasting time. So now you know why one of my favorite questions, when a meeting has gone on and on and on and is now drawing to a close but without any firm conclusion, is "what do we need to be doing? What are our action items?" We all need to know that we're on the same page and that we all know what we're supposed to be doing. That way there are no surprises, there are no excuses like "but I thought Doug was supposed to do that task!" when the meeting minutes show quite well that Doug was *not* assigned that action item, and things simply get done. Which is the point, after all: Get the product done, and out the door.

--ELG

* Usual disclaimer: The above is at least slightly fictionalized to protect the innocent. If you were there, you know what really happened. If you weren't... well, you got my takeaway, anyhow.

Sunday, August 8, 2010

Architectural decisions

Let's look at two products. The first product is a small 1U rackmount firewall device with a low-power Celeron processor and 256 megabytes of memory. It can be optionally clustered into a high availability cluster so that if one module fails, the other module takes over. Hard drive capacity is provided by a 120gb hard drive or a 64GB SSD. The second is a large NAS file server with a minimum configuration of 4 gigabytes of memory and with a minimum hard drive configuration of 3.8 terabytes. The file system on this file server is inherently capable of propagating transactions due to its underlying design.

So: How are we going to handle failover on these two devices? That's where your architectural decisions come into play, and your architectural decisions are going to in large part influence how things are going to be done.

The first thing to influence our decisions is going to be how much memory and CPU we have to play with. This directly influences our language choices, because the smaller and more limited the device, the lower level we have to go in order to a) fit the software into the device, and b) get acceptable performance. So for the firewall, we chose "C". The architect of the NAS system also chose "C". As an exercise for the reader, why do you think I believe the architect of the NAS system was wrong here? In order to get acceptable performance with the small module, we chose a multi-threaded architecture where monitor threads were associated with XML entries of what to monitor, and faults and alerts were passed through a central event queue handler which used that same XML policy database to determine which handler module (mechanism) to execute for a given fault or alert event, nothing was hard-wired, everything could be reconfigured simply by changing the XML. The architect of the NAS system had an external process sending faults and alerts to the main system manager process via a socket interface using a proprietary interface, and the main system manager process then spawned off agent threads to perform whatever tasks were necessary -- but the main system manager process had no XML database or any other configurable way to associate mechanism with policy. Rather, policy for handling faults and alerts was hard-wired. Is hard-wiring policy into software wise or necessary if there is an alternative?

The next question is, what problem are we going to solve? For the firewall system, it's simple -- we monitor various aspects of the system, and execute the appropriate mechanism specified by XML-configured policies when various events happen with the goal of maintaining service as much as possible. One possible mechanism could be to ask the slave module to take over. Tweaking policy so that this only happens when there's no possibility of recovery on the active module is decidedly a goal because there is a brief blink of service outage as the upstream and downstream switches get GARP'ed to redirect gateway traffic to a different network port, and service outages are bad. We don't have to worry about resyncing when we come back up -- we just resync from the other system at that point, if we had any unsynced firewall rules or configuration items that weren't on the other system at the point we went down, well, so what. It's no big deal to manually re-enter those rules again. And in the unlikely event that we manage to dual-head (not very likely because we have a hardwired interconnect and differential backoffs where the current master wins and does a remote power-down of the slave before the slave can do a remote power-down of the master), no data gets lost because we're a firewall. We're just passing data, we're not serving it ourselves. All that happens if we dual-head is that service is going to be problematic (to say the least!) until one of the modules gets shut down manually.

For the NAS system, it's quite a bit harder. Data integrity is a must. Dual-heading -- both systems believing they are the master -- requires either advanced transaction merge semantics when partitioning is resolved (transaction merge semantics which are wicked hard to prove do not lead to data corruption), or must be avoided at all costs by having all systems associated with a filesystem immediately cease providing services if they've not received an "I'm going down" from the missing peer(s), have no ability to force the missing peer to shut down (via IPMI or other controllable power), and no way of assuring (via voting, or other mechanisms) that the missing peers are going down. Still, we're talking about the same basic principle, with one caveat -- dual-heading is a disaster and it is better to serve nothing at all than risk dual-heading.

For the NAS system, the architectural team chose not to incorporate programmable power (such as IPMI) to allow differential backoffs to assure that dual-heading couldn't happen. Rather, they chose to require a caucus device. If you could not reach the caucus device, you failed. If you reached the caucus device but there were no update ticks on the caucus device from your peer(s), you provided services. This approach is workable, but a) requires another device, and b) provides a single point of failure. If you provide *multiple* caucus devices, then you still have the potential for a single point of failure in the event of a network partition. That is because when partition happens (i.e. you start missing ticks from your peers), if you cannot reach *all* caucus devices, you cannot guarantee that the missing peers are not themselves updating the missing caucus device and thinking *you* are the down system. How did the NAS system architectural team handle that problem? Well, they didn't. They just had a single caucus device, and if anybody couldn't talk to the caucus device, they simply quit serving data in order to prevent dual-heading, and lived with the single point of failure. I have a solution that would allow multiple caucus devices while guaranteeing no dual-heading, based on voting (possibly weighted in case of a tie), but I'll leave that as an exercise to the reader.

So... architectural decisions: 1) Remember your goals. 2) Make things flexible. 3) Use as high-level an architecture as possible on your given hardware to ensure that #2 isn't a fib, i.e., if what you're doing is doable in a higher-level language like Java or Python, for heaven's sake don't do it in "C"!. 4) Separate policy from mechanism -- okay, so this is same as #2, but worth repeating. 5) Document, document, document! I don't care whether it's UML, or freehand sketches, or whatever, but your use cases and data flows through the system *must* be clear to everybody in your team at the time you do the actual design or else you'll get garbage, 6) Have good taste.

Have good taste? What does that mean?! Well, I can't explain. It's like art. I know it when I see it. And that, unfortunately, is the rarest thing of all. I recently looked at some code that I had written when I was in college, that implemented one of the early forum boards. I was surprised and a bit astonished that even this many years later, the code was clearly well structured and showed a clean and well-conceived architecture to it. It wasn't because I had a lot of skill and experience, because, look, I was a college kid. I guess I just had good taste, a clear idea of what a well-conceived system is supposed to look like, and I don't know how that can be taught.

At which point I'm rambling, so I'm going to head off to read a book. BTW, note that the above NAS and firewall systems are, to a certain extent, hypothetical. Some details match systems I've actually worked on, some do not. If you worked with me at one of those companies, you know which is which. If you didn't, well, try not to read exact details as gospel of how a certain system works, because you'll be wrong :).

-ELG

Sunday, August 1, 2010

The migration of the penguin

I have added a link to my resume in the left margin, in case someone is interested in hiring a long-time Linux guy who knows where the skeletons are buried and, if you need something Linux done, probably has already done it at least once...

-ELG

Confessions of a Linux Penguin

Tuesday, August 24, 2010

Droid or iPhone?

Migration concluded

Monday, August 9, 2010

Action items

Sunday, August 8, 2010

Architectural decisions

Sunday, August 1, 2010

The migration of the penguin

About Me

Pages

My Links

Blog Archive

Geek Links

Followers