Posts tagged Development

Kicking the can down the road

Being a developer is an interesting profession. On one hand there are an engineering feel to it but on the other hand it’s far more like art. For me calling myself a “Software Engineer” is more for marketing then anything else. But lets face it, little of what developers/programmers do is actual engineering.

kicking-the-canEngineers are certified go through an apprenticeship process then design and create things in the real world that, for the most part, have to stand the test of time. To me Electrical, Mechanical, Structural, etc are the real engineers and us developers haven’t yet earned that right to use that title in the way we do.

How often in a development meeting do we sit there, talk about a design/architecture/code flaw that could cause issues down the road but decide to not even begin to address it? It happens so often we have an industry term for it “Tech Debt”. Can you imagine some structural engineer’s having the same discussion?

Tom: “So I cobbled together this bridge design from a bunch of other actual bridge designs and sample designs and we look ready to go.”

Mike: “Nice, but looking at these designs if we have the amount of traffic we estimate, and we fully expect the bridge to be popular in the near future, it will start leaning to one side”

Tom: “Yea, but we have a deadline, we can always go back and add more supports latter”

Structural Engineers can’t go back and refactor the bridge after it’s been built as easily as developers can with code. There are also lots of other differences, but it really boils down to is that engineers can’t kick the can down the road as easily as developers can, and so we do.

We talk about “performance as a feature”, but what about “ease of maintenance as a feature”, “scalable architecture as a feature”, “security as a feature” or “testing as a feature”? Every time we kick something down the road, label it as “Tech Debt” and put it in the backlog if were being truthful with ourselves we know full well that unless it catches fire we will almost never be back to address it.

Eventually all that Tech Debt will catch up to you and when it does, it can cost you business, alienate customers, hurt your image or even cause your company to fail. I’ve talked about this problem before and at Resgrid I try and follow a 60/40 approach. 60% new features/customer facing bugs, 40% tech debt, testing, automation and tooling.

Here are some guidelines I feel will help stop us from kicking cans down the road and start turning developers into engineers.

  1. Design/Code for the developers around you.
    One thing I’ve always admired about the military is the “Do it for the person next to you” attitude most service members have. The same can be said for the fire service. Sure you start off wanting a thrill, but after the first few times your going in because it’s your duty and because your doing it for your crew. Developers need to have the same mentality, don’t code for yourself or for your company, craft code for the developers on your team. Do it for the person in the cube next to you.
  2. Balance Architecture & Implementation
    I’ve seen my fair share of architecture astronauts in my day and they can do more harm then good. The architecture implemented needs to match the problem domain and how it’s going to be implemented. The architecture you use for a internal LoB app will be different then one that will be deployed on the cloud. No architecture is perfect and it never will be. Ideally you need to design your Architecture for the worst case scenario for 5 years out? How many users do expect in 5 years? Double that, then how are you going to handle that? How many web servers, databases will you need? Is that the scale you should have been using eventing, CQRS, etc?
  3. Code for now and for the future
    If your response to something new is “well that’s the way we’ve always done it” or some variation of it your coding in the past. Yes it’s painful to constantly keep up to date with the latest trends and best practices, but your hurting yourself and the entity you working for by not incorporating the current best practices. You will find it increasingly hard to find developers that want to work, or even know how to work, in your ‘brand new’ code base. I’ve been interviewing a lot of developers lately and almost none of them with less then 7 years experience have ever worked on an WebForms app. Starting a new project? Using WebForms? You will find it very difficult to find developers that know that technology in the near future. This same principal goes for patterns & practices.
  4. Don’t Investing in Tooling/Automation too Early
    When you first starting out a project get the minimum amount of tooling/automation you need. This is a train of thought used by startups. Time spent on setting up elaborate tooling, complex automation is time not spent properly architecting your project, implementing features or fixing bugs. Because tooling/automation lives out-of-band of your core project you can cycle on it more quickly and once you’ve done things by hand a while you know exactly where the pain points are. When you first start off, you’ll just be guessing. A lot of tooling/automation is coupled to your environment as well, so you may start off deploying locally, but then move to Azure, your tooling will have to change at that point. Start working on automation/tooling once your project is established, is maintaining good velocity and your target environments are well known.
  5. 60/40 Every Sprint, In Every SDLC Phase
    Whether your just starting your app or are in maintenance mode balance the work between features/bugs (the 60%) and tech debt, testing, automation and tooling (the 40%). Break down complex technical debt items into smaller pieces and work on them every sprint. I call the 40% bucket “Preventative Code Maintenance” and it should be the #3 priority on your backlog at all times.
  6. Document your culture and live it
    Have coding contracts and guidelines before you start any project. Your codebase, especially early on, should be unified and feel cohesive. If I got into Mikes code it should feel like Sally’s. Utilize tools like Style/FX Cop and ReSharper to nudge people in the right direction. Utilize Pull Requests, Peer Reviews or Pair Programming to keep your codebase like a well run HOA. No one developer should ‘own’ code, go in clean up code and fix broken windows. Practice Scout Coding at all times. Not everyone on the team needs to be in complete agreement on something, but once the team commits to it everyone needs to be on board. You can either succeed as a team or fail individually.
  7. “Whatever” as a feature
    When your documenting what your application or system should do. Right there should be “It should be performant, it should scale, it should be maintainable and it should be secure”. Even for internal only applications if your app goes down or is slow, it’s costing you money in wasted employee time. You shouldn’t, on a whim, sacrifice performance or security to push out a new feature without the business knowing the full extent of the tradeoff.
  8. Automate Deployments with Smoke Tests
    At first glance this may seem contradictory to #4, but that item is based on timing. When you first start working on the project you won’t be deploying to a production or pre-prod environment. Much latter down the road (hence the timing aspect) you should completely automate your deployments. In the day of Containers, Slot Deployments, CI servers you should never be manually modifying your production environment. One day you will mess up and it could cost you. Yes, the “rm –rf” guy was a hoax, but it’s also a cautionary tale.

We have to balance getting features out or getting the product out with all of the ‘back of the house’ concerns. But we have to remember that we spend our days in the ‘back of the house’. If the architecture it’s good, code isn’t well formed or meets standards, patterns and practices aren’t being followed it’s only going to slow development down, cause bugs and arguments.

If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.

You will never be bug free

Recently I was on a call with a miffed client dealing with issues in a software product. Understandably they were upset that what they were paying for had an issue that impacted them. All of that is pretty SOP but then an IT guy pipes up:

Do you guys have any testing? How did this make it into production? You should be catching all bugs during development!”.

bug-featureComing from a business person this is somewhat of an expected statement, but from someone in the technology field this is borderline ignorant. Here is the cold hard fact, you will never, ever be bug free. If you think you are 100% bug free, your not, they just haven’t been exposed or reported yet.

This has nothing to do with your methodology (Agile/Waterfall/Kanban), your delivery (SaaS, Mobile App, Desktop App) or your audience (Consumer,  Business, Gov). This is just the reality of software development, but it’s not limited to just software.

Mariner 1

On July 22 1962 Mariner 1 was launched on it’s way to Venus. A few minutes after launch Mariner 1 began to fly off course and the guidance system failed to correct it. As the spacecraft started to veer toward North Atlantic shipping lanes it was destroyed by the safety officer.

So what caused the issue? It was a typo.

Pentium

The year, 1994 and Intel launched their brand new Pentium check to the masses. After many years in development this was the first new chip to usher in a new era after the 486. How could this go wrong? Well it  did, the chips made some mistakes during floating point division.

What’s to blame? A faulty division table.

Mars Climate Orbiter

1962 too old? They didn’t have QA, or automated testing back then you say. Plus Pentium is a hardware issue! Well on December 11st, 1998 the Mars Climate Orbiter launch and was on it’s way to Mars. On September 23rd, 1999 NASA lost contact with the orbiter as the craft started to enter orbit.

What caused this issue? One team used English units and another team used Metric units.

F22 Raptor

I have first hand knowledge how much dealing with dates, times and time zones sucks. But thankfully I’ve never encountered an issue quite like this one. In 2007 during a 15 hour flight from Hawaii to Okinawa, Japan multiple on-board computers crashed when the planes crossed the International Date Line. The F-22 Raptor with a simulated KD Ratio of 241-2 was grounded due to a software issue dealing with the International Date Line.

How many lines a code does it take to cripple an advanced fighter? Just a couple.

Flash Crash 2010

It was calm and sunny day in NYC. The date, May 6th 2010, the time, 2:32 PM. The next 36 minutes will go down in history for the for the NYSE. A trading algorithm ran amok by some accounts or worked too well by others. Causing a massive and almost instantaneous drop of the DJIA by 9.2%, an intra-day swing of 1,010.14 points.

How much damage can a guy in his basement do? A lot apparently.

Knight Capital

August 2nd 2012 between 9:30AM and 10:00AM Knight Capital’s automated market making trading platform started generating erratic trades. We all know to buy low and sell high, but apparently in this case the the system didn’t get that memo and started buying high and selling low.

The cost of some bad logic? $440 Million and a company.

The above stories are just a super small sampling of software bugs that made it past, in some cases, some very rigorous testing and QA. You think you have it tough running your software changes through QA, talk to anyone who’s worked in the aircraft or defense industry.

We should do everything possible to catch, fix or remediate bugs or issues before they make it into production. But no matter how rigorous your testing, QA or automation process is, they will always be bugs in production. Once you accept this as a fact of life you can look into minimizing the impact those bugs have.

First monitor production system from every angle you can. Hardware, software, traffic, errors, analytics, etc. When you develop a baseline of your system running normally you can use that profile to determine when it isn’t running properly.

Second, pay attention to production logs and automation notification of actual errors. Don’t put a bunch of Exception logging code all over the place and have a system notify you every time it’s tripped, that will just become noise and you’ll ignore it.

Third, make it stupid simple for users to report errors and issues. Your users, for the most part, will not bother to inform you of most errors they encounter. They just don’t care about your software or service that much. It’s when it truly impacts their life or business that you will hear about it and by then it’s too late.

Finally, jump on customer impacting production bugs right away. You ideally want to fix these issues within hours, not days. I’ve pulled many all nighters fixing production issues and it’s not fun, but your users appreciate it. Keep them informed and over communicate with a status/service page, on social media and via email.

Bugs in production happen, you are just fooling yourself if you think otherwise. It’s how quickly you address those bugs and how you handle the affected customers that really matter. Don’t put all your eggs in the “catch all bugs during development” basket, save some for the production and customer service side.

If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.

Tech Debt is Anything But

Recently I’ve been involved in a lot of discussions regarding Technical Debt. A lot of the discussions boil down to “we have a mountain of tech debt, how can we pay it off?”. To me this is fundamentally the incorrect way of looking at the problem and the terminology is partly to blame. When we think of debt it’s something that you can pay off and if you don’t go into debt again you never have to worry about it. But there isn’t a system out there that doesn’t have some Technical Debt.

debtThe most pristinely architected, well written, accurately commented, OOP, SOLID, unit tested code now, will in the future be some form of tech debt, either because the code was written with incorrect assumptions, wasn’t maintained, or the knowledge about it was lost and people became afraid to touch it.

Technical Debt is the wrong term, we should be talking and thinking about Preventative Code Maintenance.

If you think about it the products your company is creating is somewhat similar to National Infrastructure. The ‘sexy and sizzle’ parts of both Infrastructure and Code is the creation of new: new features, new products, etc. What doesn’t generate sizzle? Routine maintenance.

Take a look at this video from John Oliver’s Last Week Tonight about Infrastructure (somewhat NSFW).

The issue is you can’t take an “all or nothing” approach to Preventative Code Maintenance. For example you can’t just do one off, large projects to fix issues “the Technical Debt style” nor can you do all small incremental updates, a “Scout Coding approach”.

Preventative Code Maintenance can help catch and correct issues before they get too big. But teams need to be empowered to tackle these issues when they are small, and manageable. Far too often an issue is just tagged as a ‘Tech Debt’ item and pushed to the backlog, where it is constantly push down and deprioritized until the issue is so large the business is unwilling or unable to take the issue on, until that is the issue starts impacting the business.

But before an issue ever starts impacting production there is a time when it can be mitigated. But at the time the issue could be deemed too small or too unimportant to take care of.

Preventative Code Maintenance “PCM” should be the #3 priority permanently on your board

Why #3? Well we do need features and enhancements, it’s important to our customers and to sales. There usually is an important bug or other non-preventative issue that needs taken care of. But always plan and prioritize taking care of preventative issues when they are small, unassuming and seemingly unimportant is far more cost effective then waiting till they are costing you business. In your scheme you should block enough weight for the Preventative Code Maintenance task as well. Use Tee-shirt sizes? PCM is always an L, Fibonacci? Put it at the higher end of your scale, (5 for the places I’ve worked). Doing 2 week sprints and just use days? Give it 2 days.

At first you may only be able to get one PCM done or maybe partially done. But as time goes on you will be able to get more and more done in that time frame. Given time and effort it will end up being truly preventative. You can obviously do more Code Maintenance then just the ones you can fit into the one item, but the goal is to always be working on some PCM.

Don’t delay maintenance until it’s costing you business or customers and reached the point where it most likely is expensive and painful to tackle. Be proactive and tackle the issues when they are small, cheap and easy to fix.

If you’re a First Responder or know one check out Resgrid which is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations.

Go to Top