Complex Software Architecture and your Startup


Every day I get exposed to different architectures and designs for systems. It’s a great contrast to see how complex the software architectures at a large company tend to be compared to software I’ve designed outside the large company environment. There are many reasons for this. Some of it is developers trying to show off, justify their jobs, create job security, be too clever or be too helpful. Maybe the use cases or requirements were over-engineered, planning for every “What if” scenario under the sun.

The vast majority of the time it’s not malicious and doesn’t have even a hint of bad intent. But most of the time, on the developer side, it’s science experiments that turned into production code, or developers trying to make something too “pit of success”-y without contemplating the drawbacks.

Complexity carries a heavy burden that most developers don’t want to think about; instead they look at architecture as the solution to every problem. But too often, over-engineering a simple problem causes more problems down the line. For a large company this is OK: they can take the extra time, throw extra money at it or add manpower. For a startup, especially a bootstrapped startup, our best feature is agility, and if our software is too complex to be agile we are trying to win a marathon with our legs tied together.

What are some problems that arise from the dark art of architecture and complex system design and why are they bad for startups?

1. Complexity makes debugging more difficult and tracing issues more problematic. The more complicated your system is and the more layers it has, the more you will have to debug at 3:42 AM on a Tuesday morning when the production system seizes up. When you’re in a startup environment there usually isn’t a DevOps team to deal with those issues for you at odd hours.

2. Complexity creates friction to change. The more complexity, the more you need to refactor or recode when you change direction. It’s also more time and effort you lose when you throw away that system. You will be throwing away code, and lots of it, over the life of your startup. When you go in to add a new feature or recode something, you can refactor then.

3. Complex systems take more time to code. In a startup environment the easiest, quickest, hackiest, ugliest solution is almost always the right choice. The less time you spend on things, the quicker you get stuff done and in front of your customers for feedback. Delivering is the most important thing. This isn’t to say you should push out totally buggy, unusable systems; they need to work.

4. Architecture is not a deliverable. In the vast majority of cases this is true, though there are cases where it may be false. If you’re designing a system to, say, show people how to cook, your users don’t care how many layers you have, or that you’ve got CQRS and async full stack with DI and 100% coverage. Are those cool things? Yeah, but they aren’t as important as we developers like to think they are.

5. It’s hard to bring new developers into a complex system. In a startup environment development is probably one of the least valuable things you can do. As a founder you probably need to spend more time on business development, customer acquisition, planning features, marketing, etc. So when you’ve reached that point and you want to bring in a technical founder, consultant or employee, the more complex your architecture, the harder it will be for the new people to learn and the longer it will be before they are up and running efficiently.

I’m not ‘anti-architecture’ or ‘anti-patterns’. Instead, I’m of the mindset that there is a time and a place for science experiments, architecture and complexity, but they are rarely good for your early-stage startup. I’m still trying to fight my inner architect when I develop for my startup. It’s difficult, but next time I’m up at 3:42 AM debugging an issue I’ll be a little happier.

Plan Exit/Migration Strategies for your 3rd Party Integrations

In this highly interconnected world our products or services rarely operate in complete isolation. We rely on 3rd or external parties for services or logic that we’d rather not do ourselves, well, at least most of us. When I first started in software development I always did it myself. “My solution will be better, it will be exactly what I need, my implementation will be more flexible, etc, etc, etc”: those thoughts would always come up when I needed something outside the domain of whatever project I was working on.

Nowadays, I look for external products or 3rd party solutions that meet the 80% rule. “Will this product/solution meet 80% of my core needs?” If it does, I jump on it. I’d rather spend my time taking the product or service the last 20% (which probably has more value to my business) than coding the entire thing. You might be able to build a better wheel, but should you try? When you’re trying to build a whole new vehicle, should you spend your finite amount of time and energy on the wheel?

Because of this my products usually have a good number of external integrations, but that brings risk along with it. What if the service shuts down? The API changes? They pivot into something completely different? What happens when they stop supporting the product or code base? So what do you do? As my friend Dale would say, “That’s an uptown problem”, and for the most part I agree with him. You can’t see into the future, so how can you really mitigate that risk without wasting a whole lot of time?

Although it is an “uptown problem”, it can very quickly and without notice become an “it’s in your house” problem, as it did for us at Resgrid. Resgrid is a SaaS deployed on Microsoft Azure, providing logistics and management tools to first responder organizations like volunteer fire, career fire, EMS, search and rescue, public safety, disaster relief organizations, etc. It was founded in late 2012 by myself and Jason Jarrett (staxmanade).

So what should you do beforehand? Here is our list of data you should collect when you integrate a 3rd party solution into your product.

  • Research alternative services/products and compare and contrast them with your current solution. This will help you identify other potential 3rd parties to migrate to. Keep the list short, at most the top 3 other solutions, but select one as the ‘Plan B’ option and collect some additional information on it compared to the others.
  • Spend an hour determining whether it could be done “in house” and how much time and effort would be required. Determine the high level work that would need to be done and how it may be composed.
  • Is it possible for your product to work without the service? If so, can you turn the feature off easily to hide the issue? Do some brainstorming and have a plan of action if this is the case.
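As a rough illustration of the list above, here is a minimal Python sketch of an integration wrapped so that a ‘Plan B’ provider and an off switch are already in place. Every name in it (TextMessenger, ProviderError, the two stand-in provider functions) is hypothetical and not taken from Resgrid’s code; real providers would call the actual services.

```python
from typing import Callable, List, Optional, Tuple


class ProviderError(Exception):
    """Raised by a provider function when it cannot deliver a message."""


# A provider is a (name, send function) pair. The send function takes
# (phone_number, body) and raises ProviderError on failure.
Provider = Tuple[str, Callable[[str, str], None]]


class TextMessenger:
    """Wraps a primary provider, a 'Plan B' provider and an off switch."""

    def __init__(self, providers: List[Provider], enabled: bool = True) -> None:
        self.providers = providers      # tried in order: primary first
        self.enabled = enabled          # flip to False to hide the feature
        self.failures: List[str] = []   # kept so the team can see what broke

    def send(self, phone_number: str, body: str) -> Optional[str]:
        """Return the name of the provider that delivered, or None."""
        if not self.enabled:
            return None
        for name, send_fn in self.providers:
            try:
                send_fn(phone_number, body)
                return name
            except ProviderError as exc:
                self.failures.append(f"{name}: {exc}")
        return None


# Stand-in providers for the demo; neither talks to a real service.
def broken_gateway(phone_number: str, body: str) -> None:
    raise ProviderError("gateway rejected the message")


def working_backup(phone_number: str, body: str) -> None:
    pass  # pretend the message was delivered


messenger = TextMessenger(
    [("email-gateway", broken_gateway), ("plan-b", working_backup)]
)
print(messenger.send("555-0100", "Test message"))  # prints plan-b
```

The point of the design is only that the rest of the product calls one small wrapper, so swapping the provider or turning the feature off is a one-line change rather than a hunt through the code base.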

Share your findings with the team to build up some institutional knowledge, then store it in a wiki or central documentation location. You may need this knowledge without notice. For us at Resgrid, we discovered an issue sending SMS gateway emails to AT&T customers during our standard Friday deployment. Customers not being able to receive text messages from Resgrid is a big deal, so we investigated and decided that we needed to implement Twilio. A day later we had a working solution in production and our AT&T customers hopefully didn’t notice any issues. Because we had been investigating Twilio already, we had some institutional knowledge built up and could implement it quickly.

If we hadn’t had the knowledge built up, we could have wasted a lot of time researching alternatives, putting hacks in place and possibly making it worse. Just because something may be an “uptown problem” doesn’t mean you shouldn’t invest some time now to help ease pain in the future.

Taking Google’s Services for Granted

Recently I was onboarding a large department into Resgrid, which is a cloud service company, a SaaS deployed on Microsoft Azure, providing logistics and management tools to first responder organizations like volunteer fire, career fire, EMS, search and rescue, public safety, disaster relief organizations, etc. It was founded in late 2012 by myself and Jason Jarrett (staxmanade).

I was in a Join.me with them for about an hour. As I was going through it I noticed that the BigBoard, which is used to display a snapshot of data, usually on a monitor mounted on a wall, wasn’t setting the correct address, so it was defaulting to Carson City (nice feature for a department in another country). I was completely baffled as to why the map on the BigBoard wouldn’t center correctly. I knew it was working at one point and the code was tested and validated, but I went along and continued to onboard them, making a note to check this out later.

Later in the day I pulled down the latest DB from Azure, fired it up, logged into the department, pulled up the BigBoard and bam, it was working. I was seeing a map not of the desert of Carson City, but of their location. No data changes, everything the same as prod. I double-checked all the data and reviewed all my git commit history to ensure I didn’t change anything between prod’s deployment and master: nothing. So I moved along, nothing to see here, and worked on some other issues they reported. I left the BigBoard open on a tab while I worked, and a couple of hours later I looked at it and it was back to Carson City. So began the process of stepping through the code to see where in the hell the bug was. After a couple of hours I discovered that the geocoders had stopped working.

Currently Resgrid uses 2 geocoders, Bing and Google. Yahoo shut theirs down and I haven’t gotten around to enabling MapQuest or Nokia HERE yet. We use multiple because sometimes one or another falls down on geocoding an address; Yahoo was actually pretty good at UK and Australia addresses compared to Google. I dove into the Google geocoder to see what was up and saw the request was denied with an OVER_QUOTA error, oh snap. We had been using the free/public version of Google’s geocoder, which only gives 2,000 geocoding requests per day. Every time the geocoder was called it would call out to Google, which means that a single BigBoard left open would generate 1,440 geocode requests by itself, every day!
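To make the arithmetic concrete, here is a small Python sketch. It assumes (the post does not say so explicitly) that the 1,440 figure comes from one BigBoard refresh per minute, each triggering one uncached geocode call; the 2,000 quota is the number quoted above, and the function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60   # 1440, assuming one refresh per minute
DAILY_QUOTA = 2000          # free-tier figure quoted in the post


def daily_requests(open_boards: int,
                   refreshes_per_board: int = MINUTES_PER_DAY) -> int:
    """Geocode requests per day if every refresh hits the geocoder."""
    return open_boards * refreshes_per_board


def boards_to_exceed_quota(quota: int = DAILY_QUOTA,
                           refreshes_per_board: int = MINUTES_PER_DAY) -> int:
    """Smallest number of always-open boards that goes over the quota."""
    return quota // refreshes_per_board + 1


print(daily_requests(1))           # prints 1440
print(daily_requests(2))           # prints 2880
print(boards_to_exceed_quota())    # prints 2
```

Under those assumptions, a second board left open anywhere is enough to blow through the daily quota, before counting any other geocoding the application does.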

So began the mad dash, a low-carb Monster-fueled coding binge. I reordered the geocode code, signed up for a Google API key so we could track request counts and, most importantly, enabled our Azure Redis cache to cache the result of a geocode request, both reverse and forward. The Bing geocoder returned nothing, so I think they banned that API key; I signed up for a new one as well.

As my friend Dale would say, this was one of the first of many ‘uptown problems’ Resgrid will encounter as we grow. But the more I look at the problem, the more I think it should never have been an ‘uptown’ problem. For so long I’ve been doing the same thing, just relying on Google’s services to work all the time for free. I didn’t bother caching the result in case it was requested again. This was just being a poor net citizen. This data isn’t in flux: forward and reverse geocoding results are what they are; the latitude and longitude of my house still equal my house and vice versa. This is a clear-cut, golden case for caching the results.
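A minimal sketch of that caching idea in Python, with a plain dictionary standing in for the Redis cache. The class name, the key format and the rounding of coordinates to five decimal places are all assumptions made for illustration, not Resgrid’s implementation, and the lookup functions are fakes with placeholder coordinates.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]


class CachingGeocoder:
    """Caches forward and reverse geocoding results in front of a provider."""

    def __init__(
        self,
        forward_lookup: Callable[[str], Optional[Coordinates]],
        reverse_lookup: Callable[[float, float], Optional[str]],
        cache: Optional[Dict[str, object]] = None,
    ) -> None:
        self._forward = forward_lookup
        self._reverse = reverse_lookup
        # A dict stands in for a shared cache in this sketch.
        self._cache: Dict[str, object] = {} if cache is None else cache

    @staticmethod
    def _address_key(address: str) -> str:
        # Normalize case and whitespace so trivial variations share an entry.
        return "fwd:" + " ".join(address.lower().split())

    @staticmethod
    def _coords_key(lat: float, lon: float) -> str:
        # Rounding is an assumption; pick a precision that suits your data.
        return f"rev:{lat:.5f},{lon:.5f}"

    def geocode(self, address: str) -> Optional[Coordinates]:
        key = self._address_key(address)
        if key not in self._cache:
            result = self._forward(address)
            if result is None:
                return None  # do not cache failures, so they can be retried
            self._cache[key] = result
        return self._cache[key]  # type: ignore[return-value]

    def reverse_geocode(self, lat: float, lon: float) -> Optional[str]:
        key = self._coords_key(lat, lon)
        if key not in self._cache:
            result = self._reverse(lat, lon)
            if result is None:
                return None
            self._cache[key] = result
        return self._cache[key]  # type: ignore[return-value]


calls = {"forward": 0, "reverse": 0}


def fake_forward(address: str) -> Optional[Coordinates]:
    calls["forward"] += 1
    return (10.0, 20.0)  # placeholder coordinates, not a real location


def fake_reverse(lat: float, lon: float) -> Optional[str]:
    calls["reverse"] += 1
    return "1 Example Street"


geocoder = CachingGeocoder(fake_forward, fake_reverse)
geocoder.geocode("1 Example Street")
geocoder.geocode("1  example street")   # same key after normalization
print(calls["forward"])                 # prints 1
```

With something like this in front of the provider, a board that refreshes all day costs one request per distinct address rather than one per refresh.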

The things I can no longer take for granted are growing in number every day. Developing a global SaaS app has ruined DateTime handling, TimeZones, EF CodeFirst and services like geocoding for me. I think I developed a few more gray hairs with just this issue alone, and the fact is that it could have been avoided.
