Recently I was on a call with a miffed client dealing with issues in a software product. Understandably they were upset that what they were paying for had an issue that impacted them. All of that is pretty SOP but then an IT guy pipes up:
“Do you guys have any testing? How did this make it into production? You should be catching all bugs during development!”.
Coming from a business person this is somewhat of an expected statement, but from someone in the technology field this is borderline ignorant. Here is the cold hard fact, you will never, ever be bug free. If you think you are 100% bug free, your not, they just haven’t been exposed or reported yet.
This has nothing to do with your methodology (Agile/Waterfall/Kanban), your delivery (SaaS, Mobile App, Desktop App) or your audience (Consumer, Business, Gov). This is just the reality of software development, but it’s not limited to just software.
On July 22 1962 Mariner 1 was launched on it’s way to Venus. A few minutes after launch Mariner 1 began to fly off course and the guidance system failed to correct it. As the spacecraft started to veer toward North Atlantic shipping lanes it was destroyed by the safety officer.
The year, 1994 and Intel launched their brand new Pentium check to the masses. After many years in development this was the first new chip to usher in a new era after the 486. How could this go wrong? Well it did, the chips made some mistakes during floating point division.
Mars Climate Orbiter
1962 too old? They didn’t have QA, or automated testing back then you say. Plus Pentium is a hardware issue! Well on December 11st, 1998 the Mars Climate Orbiter launch and was on it’s way to Mars. On September 23rd, 1999 NASA lost contact with the orbiter as the craft started to enter orbit.
I have first hand knowledge how much dealing with dates, times and time zones sucks. But thankfully I’ve never encountered an issue quite like this one. In 2007 during a 15 hour flight from Hawaii to Okinawa, Japan multiple on-board computers crashed when the planes crossed the International Date Line. The F-22 Raptor with a simulated KD Ratio of 241-2 was grounded due to a software issue dealing with the International Date Line.
Flash Crash 2010
It was calm and sunny day in NYC. The date, May 6th 2010, the time, 2:32 PM. The next 36 minutes will go down in history for the for the NYSE. A trading algorithm ran amok by some accounts or worked too well by others. Causing a massive and almost instantaneous drop of the DJIA by 9.2%, an intra-day swing of 1,010.14 points.
August 2nd 2012 between 9:30AM and 10:00AM Knight Capital’s automated market making trading platform started generating erratic trades. We all know to buy low and sell high, but apparently in this case the the system didn’t get that memo and started buying high and selling low.
The above stories are just a super small sampling of software bugs that made it past, in some cases, some very rigorous testing and QA. You think you have it tough running your software changes through QA, talk to anyone who’s worked in the aircraft or defense industry.
We should do everything possible to catch, fix or remediate bugs or issues before they make it into production. But no matter how rigorous your testing, QA or automation process is, they will always be bugs in production. Once you accept this as a fact of life you can look into minimizing the impact those bugs have.
First monitor production system from every angle you can. Hardware, software, traffic, errors, analytics, etc. When you develop a baseline of your system running normally you can use that profile to determine when it isn’t running properly.
Second, pay attention to production logs and automation notification of actual errors. Don’t put a bunch of Exception logging code all over the place and have a system notify you every time it’s tripped, that will just become noise and you’ll ignore it.
Third, make it stupid simple for users to report errors and issues. Your users, for the most part, will not bother to inform you of most errors they encounter. They just don’t care about your software or service that much. It’s when it truly impacts their life or business that you will hear about it and by then it’s too late.
Finally, jump on customer impacting production bugs right away. You ideally want to fix these issues within hours, not days. I’ve pulled many all nighters fixing production issues and it’s not fun, but your users appreciate it. Keep them informed and over communicate with a status/service page, on social media and via email.
Bugs in production happen, you are just fooling yourself if you think otherwise. It’s how quickly you address those bugs and how you handle the affected customers that really matter. Don’t put all your eggs in the “catch all bugs during development” basket, save some for the production and customer service side.
If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.