Azure CPU Linear Growth/Ramp Up Issue

Recently Resgrid experienced an issue with our backend cloud service web role instances. After a deployment, CPU on all instances would ramp up from a normal level of around 10% to over 90%, stay there for a while, then ‘reset’ back to 10% and begin the process again.

2015-05-08_13-50-47

Above is a screenshot of the monitoring tab in the Azure portal for the visually inclined. Sometimes it would peak at over 90% and stay there, dramatically affecting our API performance for an extended period of time. This is a huge deal, as most Resgrid users interact with it via the API at some level (calls going out, emails being imported and of course our mobile apps).

We were at a complete loss as to why this was happening. It seemed to have started a while ago, but was exacerbated by new work. It only affected our API stack; our Web stack was following a normal CPU pattern, hovering around 20%. Both projects shared the exact same service, database, caching and other provider code, and the API layer itself was pretty thin. So what in the world was going on?

After a lot of debugging and redeployments we were still at a loss. Time to call in the big guns: Microsoft. The good thing about the Azure Support plans is that if you sign up and pay the $29 you get support, so there’s no need to be paying for it all the time.

We sent around 10 GB worth of dumps and PerfView traces to MS to review. So what did they discover?

From majority of the callstack, we see StackExchange_Redis!StackExchange.Redis.SocketManager calls either for reading or writing to the queues. Now there are 2153 active threads in the process! This seems to be too high and it would be interesting to see if you are running into connection problems mentioned here: https://social.msdn.microsoft.com/Forums/en-US/5e075053-802a-4a46-9fea-a0e859e9a7a9/redis-cache-sudden-100-cpu-and-crash?forum=azurecache

Now there are 1065 StackExchange.Redis.ConnectionMultiplexer objects in the managed heap! The dump shows that most of these ConnectionMultiplexer objects has connection failure message “UnableToResolvePhysicalConnection on PING”. So it seems there are lot of connection issues happening.

I would highly recommend customer to update StackExchange.Redis from v1.0.333 to v1.0.450 (https://www.nuget.org/packages/StackExchange.Redis/1.0.450 ). The older version might have had such 100% high CPU issues. Also are you creating multiple multiplexer object? Redis cache recommend to have one object and reuse it.

We started using Azure’s Redis cache a while back to cache our geocoding results. We’ve been doing more and more work with that recently as it’s critical information to first responders. Getting a cached value from Redis is far faster and more cost effective than contacting Google, Yahoo or Bing.

When we set it up we installed version 1.0.333, which was supposed to fix the CPU issue and may have in some cases, but not ours. We use Ninject to control the lifecycle of our objects and had our RedisProvider in a singleton scope, but that may have been part of the issue as well.

We upgraded to the latest StackExchange.Redis (v1.0.450) and marked our ConnectionMultiplexer as static, and that fixed the issue. So if you’re seeing a CPU ramp up and using Redis, check your packages/DLLs and ensure your ConnectionMultiplexer is static.
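The "one shared connection object" idea can be sketched in a language-neutral way. The Python below is only an illustration of the pattern, not the StackExchange.Redis API: `FakeMultiplexer`, `get_shared_multiplexer` and the endpoint string are hypothetical names made up for this example. In C# the equivalent is simply holding the ConnectionMultiplexer in a static field, as described above.

```python
import threading


class FakeMultiplexer:
    """Hypothetical stand-in for an expensive, thread-safe connection object."""

    instances_created = 0  # counts how many connections were ever opened

    def __init__(self, endpoint):
        FakeMultiplexer.instances_created += 1
        self.endpoint = endpoint


_lock = threading.Lock()
_shared = None


def get_shared_multiplexer(endpoint="cache.example.invalid:6380"):
    """Return the single process-wide connection, creating it on first use."""
    global _shared
    if _shared is None:
        with _lock:
            # Re-check inside the lock so two racing threads cannot both create one.
            if _shared is None:
                _shared = FakeMultiplexer(endpoint)
    return _shared


def connect_per_call(endpoint="cache.example.invalid:6380"):
    """Anti-pattern: every call opens a brand new connection object."""
    return FakeMultiplexer(endpoint)
```

Every caller that goes through `get_shared_multiplexer()` gets the same object back, so the number of live connections stays at one no matter how many requests are served. Calling something like `connect_per_call()` on every request is the kind of usage that could plausibly lead to the pile of multiplexer objects Microsoft found in the dump.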

Lessons Learned:

  • Always enable Remote Desktop for the roles, web or worker. This was amazingly helpful when Microsoft needed us to install software on the machine.
  • Pick an instance to let fail and cycle the other instances. This keeps your service up and running while allowing you to test. The Azure load balancer seems to be round robin, so your high CPU instance will still get traffic.
  • In Cloud Service deployments, turning off Update Deployments does not issue you a fresh VM. Anything you install on the VM is safe if you deploy without an Update Deployment method (Incremental or Simultaneous) selected.
  • To get a fresh VM you need to “Reimage” from the Azure Management Portal, Instances section. Deployments and Reimages will keep the same machine name in case you have something else that keys off machine name.
  • Azure CPU metrics are based on averages over a 5 minute period. Just because Azure is reporting 90% CPU utilization doesn’t mean you’ll see the CPU pegged at 90% when you log into the VM.
  • Have another tool to monitor performance. I’ll be reviewing NewRelic in a later blog post.
  • Do not rely on Profiling, IntelliTrace or Remote Debugging. In both VS2013 and VS2015RC we were unable to get those to work correctly.
  • When using the Debug Diagnostics Collection tool, the HTTP Response time trigger did nothing. Although the application was slow, the way the IIS Server was determining if it was ‘slow’ didn’t work. Performance Counters worked best.
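To make the 5 minute averaging point from the list above concrete, here is a small arithmetic sketch. The per-minute readings are invented for illustration and `window_averages` is a hypothetical helper, not anything Azure exposes.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical per-minute CPU readings (%) over ten minutes.
per_minute = [100, 100, 100, 100, 50, 10, 10, 10, 10, 10]

# The first five minutes average out to 90, the next five to 10.
five_minute = window_averages(per_minute, 5)
```

In this made-up series the chart would report 90% for the first window even though the live reading at the end of it was already down to 50%, so what you see when you log into the VM can differ a lot from the portal figure.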

Resgrid is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations, etc. It was founded in late 2012 by myself and Jason Jarrett (staxmanade).

Azure Dashboards: They need to get better

I’ve never really taken the Azure Dashboards seriously, or the Metrics page really. For the uninitiated the Dashboard and Metrics pages inside Azure are views into your Cloud Service, VM, Web App, etc. They give you some key metrics about the resource utilization or performance of your service.

2015-03-30_14-34-50

Great, right? Well, personally I’ve always felt it gives an incomplete picture. For example, on Cloud Services I can’t get Memory Utilization. Really, Azure, no memory utilization? The chart is also relative to all the items in it, so a metric at 5% may sit toward the top if none of the other elements push the max value much higher. This is configurable: instead of Relative you can use Absolute. But that’s pretty useless if you’re mixing metrics. For example, if you have Disk Read Bytes and it’s 500, it’ll push the chart to 500 and your 50% CPU utilization will be at the bottom.

But one important metric, CPU utilization, is something I needed to pay more attention to. You can’t track history out more than 7 days, which is rough. But if you can eyeball it you can get a general feel. For example, Resgrid has a Cloud Service Worker Role; if I had to extrapolate its CPU graph over 2 years it’d look like this:

chart1

If you have resource utilization increasing over time in a linear fashion like this, it’s your metrics shouting “Houston, you may have a problem”.
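If you record CPU averages yourself (the portal only keeps 7 days), a linear ramp like this can be spotted with a simple least-squares slope rather than by eye. This is a minimal sketch under that assumption; the monthly figures are invented and `least_squares_slope` is a hypothetical helper.

```python
def least_squares_slope(values):
    """Slope of the best-fit line through (0, v0), (1, v1), ... in units per step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical monthly average CPU utilization (%) for a worker role.
monthly_cpu = [10, 12, 14, 16, 18, 20]
growth_per_month = least_squares_slope(monthly_cpu)
```

A slope that stays clearly positive month after month (here about 2 percentage points per month) is the same signal as the chart above: usage is growing with time or data volume rather than with load spikes.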

In our case there were some data points that could have been causing the issues. As our customers use the system our data footprint grows, new calls, new actions, new staffing levels, etc.

Every month our worker process would utilize a little more CPU. After a little bit of work and a little RedGate ANTS profiling, we narrowed it down: when we were auto-closing calls we were pulling all calls (Closed, Cancelled, Unfounded and Active calls) instead of just active ones.
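The shape of that fix can be sketched as follows. This is not Resgrid's code: `Call`, `CallState` and `calls_to_auto_close` are hypothetical names, and in a real system the filter would presumably be pushed down into the database query instead of being applied in memory.

```python
from dataclasses import dataclass
from enum import Enum


class CallState(Enum):
    ACTIVE = "Active"
    CLOSED = "Closed"
    CANCELLED = "Cancelled"
    UNFOUNDED = "Unfounded"


@dataclass
class Call:
    call_id: int
    state: CallState


def calls_to_auto_close(calls):
    """Only active calls are candidates for auto-closing; skip everything else."""
    return [call for call in calls if call.state is CallState.ACTIVE]


# Hypothetical data: the finished calls keep piling up, the active ones do not.
history = [
    Call(1, CallState.CLOSED),
    Call(2, CallState.CANCELLED),
    Call(3, CallState.UNFOUNDED),
    Call(4, CallState.ACTIVE),
]
candidates = calls_to_auto_close(history)
```

The set of closed, cancelled and unfounded calls only ever grows, so a job that walks all of them does a little more work every month, while a job restricted to active calls stays roughly constant.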

So with some slight tweaking we got to here:

2015-03-30_8-03-10

This is what success looks like: from ~47% CPU utilization to around 15%. PROTIP: for Worker Roles, don’t let them get past 50% utilization; Azure will just assume they are failing and will constantly restart them.

The Azure Dashboard and Metrics screens need to give you more than just 7 days; 7 days isn’t enough to establish trends. They also need to give you memory utilization. Hopefully the new Azure Portal will help with some of this, and hopefully Microsoft will give Cloud Services some love.

What is Resgrid and what have we learned

Resgrid has been around for just over a year now. We started out as a minimum viable product beta release in late 2012 and launched as a paid service in late 2013. In that year Resgrid has changed so much it’s a completely different application. Most of the change has been customer driven. We pride ourselves on listening to our customers and delivering what they need in a way that’s usable for everyone on the platform, with a very short turnaround time.

So what is Resgrid?

ResgridCircle

Resgrid is a system designed to provide logistics and management capabilities for first responder organizations like volunteer and career fire departments, EMS, public safety, search and rescue, HAZMAT and more.

Resgrid was founded by myself and a partner staxmanade. It started as a simple website and a couple of mobile apps with one page and some big buttons. A year later it’s now a complete end to end management and logistics system.

Almost half of what we worked on came directly from our customers and potential customers who found and signed up for Resgrid. Our minimum viable product was far from what we initially released, but we got something out quickly to our potential customer base and listened to the comments and feedback that came in.

Today in Resgrid you can:

  • Manage personnel, their schedules, certifications and staffing level
  • Manage units, availability and unit logs
  • Manage calls, responding, accountability and logging
  • Post documents, messages, logs or notes to share with personnel
  • Import calls, messages and distribute emails to personnel
  • Manage stations, calendars and generate reports

The feedback we get from our customers has been amazing and has helped guide where we spend time and energy developing.

So what have we learned over that year?

Release your product/service as early as possible

Release your product/service with its most useful and valuable feature and ensure that feature is solid. Then start getting feedback from your customers. There’s an old adage “plans never survive first contact with the enemy”. This is true for products and services as well. Resgrid is not the system that I originally envisioned, but it’s way better and provides more useful functionality. So the adage can be adapted to be “product designs never survive first use by your customers”.

Be flexible and don’t be married to your ideas/plans

Jason (staxmanade) has a saying he uses from time to time: “strong beliefs, loosely held”. This is great advice for people and for products/services. If you resist what your customers want, they will go somewhere else. The Soup Nazi approach only works for so long; when the novelty wears off you stop growing and start shrinking.

Get on the phone, or face to face, with your customers

A lot can be lost when communicating over email, chat, IM or a support system. Some of the more valuable feedback we’ve gotten recently has come from face to face meetings or phone conversations. It’s also been valuable to have at least one dev or other team member there with you, observing and asking questions from a different point of view. It’s good to have 2 viewpoints when trying to figure out your customers’ needs or issues, and at least one of those should be a developer’s.

Bend over backwards for your customers

I’m a huge fan of customer service. No matter what problem your customers are having, if you show them you care and work hard on understanding and solving their problem, that buys a lot of loyalty. I’ve talked with people who have stopped using a product/service for one reason or another but still rave about it to others. Why? Because they had a great experience even if it wasn’t the best fit.

Build a team

You can go it alone, but it’s best if you have at least one other person to back you up. Having another developer (even if they can’t contribute a lot of code) to bounce ideas off of, work out the best implementation with and check your code is a huge help.

Balance architecture, features, UX and bugs

It’s hard, but the best architecture/implementation won’t sell your product. Completely neglecting it, though, means it’s harder and more expensive for you to develop new features. Features help sell your product; lots of customers look for ‘bang for the buck’, which means they want lots of features for a good price. Other customers aren’t afraid to take a ‘best of breed’ approach, where they buy systems because they are the best implementation of the feature they want. Your User Experience is also very important: your system needs to be easy to use, good looking and well laid out. Bugs will drive your existing customers mad, but because fixing them doesn’t ‘help sell’ they can be put on the back burner. Don’t neglect your bugs for too long.

So those are some of the lessons learned over the last year. We still have a lot of learning to do as we work on making Resgrid the premier logistics and management system for first responders.
