High Availability System Maintenance: The 7th and Final Key to Developing a DevOps Culture

DevOps Final Key

 

Welcome back to the vRad technology quest as we unlock the 7th and final key!

As always, I recommend you check out the first post in the series if you’re interested in the topics I’ve covered thus far.

Today we’ll discuss how we perform scheduled maintenance on our highly available (24/7/365) platform and our process for minimizing or eliminating disruptions when short notice changes are necessary.

Let’s get to it.

 

The Usual: Regular Monthly Maintenance

The vRad technology platform is designed to be highly available.

Definition High Availability System: application support server(s) that can be consistently accessed with minimal downtime.

But even in a highly available platform, some changes require scheduled downtime. A significant amount of preparation and planning goes into all of our platform changes, but specifically ones that require system downtime. Because our platform is our link to clients and the patients we collectively serve, minimizing downtime is paramount.

Each month vRad deploys a new major version of our platform. Each of our 3 subsystems (Biz, PACS and RIS) deploys on a separate day. RIS is the only part of the platform that requires us to bring the system down for a short time (15 minutes). We utilize this planned maintenance window each month to also perform miscellaneous additional updates or enhancements.

 

Preparing and Communication Maintenance Windows

As you might imagine, we spend quite a bit of time planning and preparing for these changes. We generally start planning 6 months in advance of any changes that require downtime. We work carefully to coincide these changes with our RIS releases to limit our downtime to a single 15-minute maintenance period each month.

We typically perform our RIS scheduled maintenance during a daytime window to lessen the impact to our clients and the patients we serve.  As a teleradiology provider, we are primarily an “after-hours” solution which means our busiest times are in the middle of night, weekends and holidays. But nevertheless, a number of our clients our affected during that daytime window. We reduce disruption to those clients by:

  1. Limiting our downtime to 15 minutes whenever possible;
  2. Scheduling maintenance during consistent timeframes to better set client expectations; and
  3. Communicating our planned maintenance date and time to our clients early, and often: two weeks in advance, the day before and at the point of service (within our OMS [Order Management System]).

By carefully planning, scheduling and communicating our downtime windows, we ensure consistent expectations for our team and our clients.

 

The Unusual: Short Notice Changes

Usually, changes fall into existing maintenance windows and processes; since we plan 6 months out, there is plenty of time for adequate discussion and risk management in our weekly change management meetings. However, sometimes we don’t have the luxury of advanced planning.

Late last year, a network maintenance scenario arose that serves as an excellent example of how this process works at vRad, and demonstrates how seriously we take keeping our platform healthy and online.

Before we dive into the story, I thought I’d share our short notice change playbook:

High Availability Maintenance Process

 

The change in our example started when Mark, one of our network engineers, stopped by my office. “I need to make a network change that has a small chance of a network blip,” Mark told me.

“What sort of change?”

“I need to connect the two firewalls to talk to each other about English Beans so that we can use them for coffee stirring later next year.” He said. (Ok, ok, he might have said something else, but the specifics of the change aren’t relevant for our purposes today.)

Mark and I discussed the change and the impact on the environment – we mapped everything out on my office whiteboard and scoured the downtime events calendar to sketch out a preliminary plan.

 

Bringing the Plan to the Team

The possibility of a network blip constitutes a change we would ordinarily plan out months in advance. In this case, we did not have the luxury of knowing about the needed until Mark made the discovery. While we try to avoid these situations, the nature of technology’s rapid changes means they bubble up from time to time – and we must have the agility to address them. Mark and I both felt that we needed to push through this change as soon as we could.

The next step involved communicating and brainstorming. I emailed Shannon, then our CIO, Wade, our Sr. Director of Engineering, team members from Support, team members from Sales and Client Services, team members from Operations and Jodi, our Sr. Director of Marketing.

Did I mention we take these changes seriously?

“We’re going to be making a change to our networking infrastructure in our October RIS release,” I wrote, and proceeded to explain details and risk scenarios. “We can discuss this at our upcoming change management and engineering meetings this week.”

I hit Send and then forwarded the email, including a few more technical details to our engineering team to make sure they were informed.

That Thursday, our Change Management team – a cross-section of the business that discusses all changes to the platform and approves them – met and Mark and I presented the change and our preliminary plan.

These discussions are ripe for probing questions:

“How have we tested this?”

“How long will the platform be down?”

“If this goes wrong, what’s the worst case scenario?”

“What is your projected confidence?”

“Will this impact WebEx or telephones?”

“Could these cause interruptions in our VPN tunnels?”

“Will image ingestion be impacted?”

 

Verifying Answers to Tough Questions

For each change to our platform, we make sure to ask each other tough questions. Networking changes can be particularly impactful if they don’t work as expected; however, after discussion, our networking team was confident that these changes were low risk and that the production impact would be nothing or at most, a very short blip.

We take a lot of pride in our up-time – many of us have had brothers, sisters, parents or friends who have been in medical emergency situations and we often think of them and know how important it was to have timely medical care. Interruptions to our platform are not simply disruption to business, but possibly disruption to patients in need of care. That matters to us, at a deep and personal level.

We planned for a worst case scenario with the change in question: 30 seconds without network connectivity; the risk profile was larger because these changes were to components that service a significant part of the platform.

 

Finalizing the Change Plan

We decided to include people beyond our Change Management team to finalize our plan – the same group that I had emailed notice to earlier. We discussed our plan, the current questions and our answers. We were going to be impacting remote team members with this change, as well as tools like WebEx used by Sales and Client Management team members, and potentially our VPNs, so communication was particularly important.

The group discussed and came to alignment:

We would send internal notifications 2 weeks before the release, a week before the release, and the day of the release. We would make the change alongside our normal comprehensive deployment plan for the RIS, adding 20 or so discrete steps required to perform the additional changes. Finally, Jodi helped us add notification information into our client notice for those impacted by the RIS release – letting them know we were doing additional changes that had the possibility of short interruptions, but the expected impact was minimal.

 

Go Time

As the date of these changes approached, our engineering and project management teams ensured that all communications that were discussed went out according to plan. We re-tested the changes and reminded our engineering teams about the change – fielding additional questions and ensuring our plan was solid.

Wrapping up the process – and our story – the day of the change came. Our network team carefully orchestrated the updates and almost no connectivity downtime occurred; the entire process took less than 5 minutes from start to finish and our outage-time was immeasurably short. Another solid success.

While the finale of our story may seem anti-climactic, that’s intentional!

All the time spent planning is expressly for the purpose of ensuring business will continue as usual – sometimes an ordinary day can be an extraordinary feat.

 

Thank You for Coming Along on the vRad Technology Quest

It’s been a pleasure sharing with you how vRad does DevOps, from agile development, test automation – and everything in between. I hope you’ve enjoyed coming along for the ride.

If you’ve missed any of the 7 keys, I encourage you to check the vRad Technology Quest Log for some more interesting nuggets about vRad DevOps.

Thank you for taking the time to learn about DevOps at vRad. Our technology team is passionate about our platform – I welcome you to message me on LinkedIn if you have any questions or other topics that might interest you!

 

Farewell for now,

Brian (Bobby) Baker

 

About the Author

Brian (Bobby) BakerBrian leads the DevOps team at vRad. He has a passion for building highly scalable systems and finding ways to make software engineers lives easier through tools and processes.  Brian loves working with many different parts of the business, finding ways to improve products one hallway conversation at a time. Brian and his wife Kelly live in Minnesota with a dog and a cat; he enjoys fishing, boating, and reading.

Find him on LinkedIn: https://www.linkedin.com/in/bakbri/

Share this post