Archive for February, 2016

You will never be bug free

Recently I was on a call with a miffed client dealing with issues in a software product. Understandably they were upset that what they were paying for had an issue that impacted them. All of that is pretty SOP but then an IT guy pipes up:

Do you guys have any testing? How did this make it into production? You should be catching all bugs during development!”.

bug-featureComing from a business person this is somewhat of an expected statement, but from someone in the technology field this is borderline ignorant. Here is the cold hard fact, you will never, ever be bug free. If you think you are 100% bug free, your not, they just haven’t been exposed or reported yet.

This has nothing to do with your methodology (Agile/Waterfall/Kanban), your delivery (SaaS, Mobile App, Desktop App) or your audience (Consumer,  Business, Gov). This is just the reality of software development, but it’s not limited to just software.

Mariner 1

On July 22 1962 Mariner 1 was launched on it’s way to Venus. A few minutes after launch Mariner 1 began to fly off course and the guidance system failed to correct it. As the spacecraft started to veer toward North Atlantic shipping lanes it was destroyed by the safety officer.

So what caused the issue? It was a typo.

Pentium

The year, 1994 and Intel launched their brand new Pentium check to the masses. After many years in development this was the first new chip to usher in a new era after the 486. How could this go wrong? Well it  did, the chips made some mistakes during floating point division.

What’s to blame? A faulty division table.

Mars Climate Orbiter

1962 too old? They didn’t have QA, or automated testing back then you say. Plus Pentium is a hardware issue! Well on December 11st, 1998 the Mars Climate Orbiter launch and was on it’s way to Mars. On September 23rd, 1999 NASA lost contact with the orbiter as the craft started to enter orbit.

What caused this issue? One team used English units and another team used Metric units.

F22 Raptor

I have first hand knowledge how much dealing with dates, times and time zones sucks. But thankfully I’ve never encountered an issue quite like this one. In 2007 during a 15 hour flight from Hawaii to Okinawa, Japan multiple on-board computers crashed when the planes crossed the International Date Line. The F-22 Raptor with a simulated KD Ratio of 241-2 was grounded due to a software issue dealing with the International Date Line.

How many lines a code does it take to cripple an advanced fighter? Just a couple.

Flash Crash 2010

It was calm and sunny day in NYC. The date, May 6th 2010, the time, 2:32 PM. The next 36 minutes will go down in history for the for the NYSE. A trading algorithm ran amok by some accounts or worked too well by others. Causing a massive and almost instantaneous drop of the DJIA by 9.2%, an intra-day swing of 1,010.14 points.

How much damage can a guy in his basement do? A lot apparently.

Knight Capital

August 2nd 2012 between 9:30AM and 10:00AM Knight Capital’s automated market making trading platform started generating erratic trades. We all know to buy low and sell high, but apparently in this case the the system didn’t get that memo and started buying high and selling low.

The cost of some bad logic? $440 Million and a company.

The above stories are just a super small sampling of software bugs that made it past, in some cases, some very rigorous testing and QA. You think you have it tough running your software changes through QA, talk to anyone who’s worked in the aircraft or defense industry.

We should do everything possible to catch, fix or remediate bugs or issues before they make it into production. But no matter how rigorous your testing, QA or automation process is, they will always be bugs in production. Once you accept this as a fact of life you can look into minimizing the impact those bugs have.

First monitor production system from every angle you can. Hardware, software, traffic, errors, analytics, etc. When you develop a baseline of your system running normally you can use that profile to determine when it isn’t running properly.

Second, pay attention to production logs and automation notification of actual errors. Don’t put a bunch of Exception logging code all over the place and have a system notify you every time it’s tripped, that will just become noise and you’ll ignore it.

Third, make it stupid simple for users to report errors and issues. Your users, for the most part, will not bother to inform you of most errors they encounter. They just don’t care about your software or service that much. It’s when it truly impacts their life or business that you will hear about it and by then it’s too late.

Finally, jump on customer impacting production bugs right away. You ideally want to fix these issues within hours, not days. I’ve pulled many all nighters fixing production issues and it’s not fun, but your users appreciate it. Keep them informed and over communicate with a status/service page, on social media and via email.

Bugs in production happen, you are just fooling yourself if you think otherwise. It’s how quickly you address those bugs and how you handle the affected customers that really matter. Don’t put all your eggs in the “catch all bugs during development” basket, save some for the production and customer service side.

If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.

Tech Debt is Anything But

Recently I’ve been involved in a lot of discussions regarding Technical Debt. A lot of the discussions boil down to “we have a mountain of tech debt, how can we pay it off?”. To me this is fundamentally the incorrect way of looking at the problem and the terminology is partly to blame. When we think of debt it’s something that you can pay off and if you don’t go into debt again you never have to worry about it. But there isn’t a system out there that doesn’t have some Technical Debt.

debtThe most pristinely architected, well written, accurately commented, OOP, SOLID, unit tested code now, will in the future be some form of tech debt, either because the code was written with incorrect assumptions, wasn’t maintained, or the knowledge about it was lost and people became afraid to touch it.

Technical Debt is the wrong term, we should be talking and thinking about Preventative Code Maintenance.

If you think about it the products your company is creating is somewhat similar to National Infrastructure. The ‘sexy and sizzle’ parts of both Infrastructure and Code is the creation of new: new features, new products, etc. What doesn’t generate sizzle? Routine maintenance.

Take a look at this video from John Oliver’s Last Week Tonight about Infrastructure (somewhat NSFW).

The issue is you can’t take an “all or nothing” approach to Preventative Code Maintenance. For example you can’t just do one off, large projects to fix issues “the Technical Debt style” nor can you do all small incremental updates, a “Scout Coding approach”.

Preventative Code Maintenance can help catch and correct issues before they get too big. But teams need to be empowered to tackle these issues when they are small, and manageable. Far too often an issue is just tagged as a ‘Tech Debt’ item and pushed to the backlog, where it is constantly push down and deprioritized until the issue is so large the business is unwilling or unable to take the issue on, until that is the issue starts impacting the business.

But before an issue ever starts impacting production there is a time when it can be mitigated. But at the time the issue could be deemed too small or too unimportant to take care of.

Preventative Code Maintenance “PCM” should be the #3 priority permanently on your board

Why #3? Well we do need features and enhancements, it’s important to our customers and to sales. There usually is an important bug or other non-preventative issue that needs taken care of. But always plan and prioritize taking care of preventative issues when they are small, unassuming and seemingly unimportant is far more cost effective then waiting till they are costing you business. In your scheme you should block enough weight for the Preventative Code Maintenance task as well. Use Tee-shirt sizes? PCM is always an L, Fibonacci? Put it at the higher end of your scale, (5 for the places I’ve worked). Doing 2 week sprints and just use days? Give it 2 days.

At first you may only be able to get one PCM done or maybe partially done. But as time goes on you will be able to get more and more done in that time frame. Given time and effort it will end up being truly preventative. You can obviously do more Code Maintenance then just the ones you can fit into the one item, but the goal is to always be working on some PCM.

Don’t delay maintenance until it’s costing you business or customers and reached the point where it most likely is expensive and painful to tackle. Be proactive and tackle the issues when they are small, cheap and easy to fix.

If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.

Go to Top