Friday, March 13, 2015

A recent encounter with a customer resulted in a couple of good questions about the workflow one would use to deploy apps to Cloud Foundry while aiming for 100% uptime. I thought I would share the questions and answers here.

Question #1 - How do you push updates to your application without downtime?

Currently when you push, or restart for that matter, an application running on Cloud Foundry, the change is applied in a series of steps that go roughly like this.
  • New app bits are uploaded
  • The current version of the app is stopped
  • Staging for the new app occurs (i.e. the build pack runs)
  • The new app is started
What’s important to understand about this process is that when the app is stopped, all instances of your application are stopped. Thus there will be a small amount of downtime while your new app stages and starts.
The typical suggestion for working around this is to do what are called blue / green deployments, which work by running both the current and the new version of the application at the same time. Since both apps are running, you can switch to the new app in a controlled fashion by simply manipulating the routes, something that happens instantly and does not require downtime.
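
The route manipulation described above can be sketched as an ordered list of cf CLI commands. This is only a sketch under stated assumptions: the app names, domain, hostname and the "-temp" suffix are hypothetical, and the -n hostname flag on cf push assumes a v6-style cf CLI (newer CLI versions configure routes differently).

```python
def blue_green_plan(blue, green, domain, hostname, path="."):
    """Return the cf CLI commands for one blue / green switch, in order.

    All names passed in are hypothetical placeholders, not values from
    any real deployment.
    """
    temp_host = hostname + "-temp"  # assumed naming convention
    return [
        # Push the new version under its own name, on a temporary route.
        ["cf", "push", green, "-p", path, "-n", temp_host],
        # Map the production route to the new version; both apps now
        # receive production traffic.
        ["cf", "map-route", green, domain, "-n", hostname],
        # Unmap the production route from the old version; only the new
        # version receives production traffic from here on.
        ["cf", "unmap-route", blue, domain, "-n", hostname],
        # Remove the temporary route from the new version.
        ["cf", "unmap-route", green, domain, "-n", temp_host],
    ]


if __name__ == "__main__":
    plan = blue_green_plan("myapp-blue", "myapp-green", "example.com", "myapp")
    for cmd in plan:
        print(" ".join(cmd))
```

The old version keeps running after the switch, so it can be stopped or deleted once you are satisfied with the new one, or have the production route mapped back to it if something goes wrong.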

Question #2 - How do you push updates to your application if it’s not taking web requests but still needs to maintain high availability?

If you have an application that is not taking HTTP requests, like a background worker, the typical blue / green deployment scenario may or may not work for you. If you’re running a background worker and need to keep it highly available, here are some things to consider.
  1. If you have a background worker style task, it may be as simple as starting a second instance of the application that is running the new code and then stopping the old instance. The key to making this work on CF is to use different application names (both bound to the same service, if the worker is using services). This will enable both to run at the same time and allow you to shut down the old worker instance when you’re satisfied that the new code is working properly.

    Before doing this, though, keep in mind that there will be a window of time where two versions of your application are running. Before adopting this approach, consider what will happen while both versions of your worker run at the same time. Will they play nicely together or will they compete for the work, and will they both be compatible (i.e. did the database schema change, did message formats change, etc.)?
  2. Another solution to this problem is to simply ignore it. Depending on the architecture of your application, you may be able to just push your new changes and ignore the fact that the application will be down for a small window of time. This will generally be the case for background worker tasks that are simply pulling jobs from a queue or database. This flexibility comes from the fact that, by its nature, the database or queue will hold the jobs while your application is not running. Given this, all you need to do is push the new change and wait for the app to catch up on its work.

    Before going with this approach, there are some important things that you should consider. First, you should have a good understanding of how long it will take for your new version of the application to stage, start up and begin doing work. This is critical and leads us to the second point. You need a good estimate of how much work will be queued up while the application is restarting, and of whether your service is capable of storing that much data. This is key to not losing any work while the new version of your application is starting up.

    Lastly, you want to consider how long it will take your application to recover from being down. While the app is down, jobs will be queuing up in the database or messaging system. You'll want to consider how long it will take for the new application to catch up with the queued jobs. If the time it takes to recover from being down is too long, you may want to look at temporarily increasing the number of instances of your application. If your application supports this, it will allow you to catch up more quickly. Then, after things are caught up, you can scale down with cf scale to your usual level.
  3. With a blue / green deployment you have the luxury of running two versions of the application at once, but your end-users are only using one at any given time. This is accomplished by manipulating the application mappings such that your users get directed to the version that you want them to see. With a background task, there is no such external control or switch. As soon as you start the second version of your application, it’ll begin working.

    One way around the lack of an external switch would be to build an internal one into your application. This could be something like an “admin” console (or REST endpoint) that allows you to enable or disable processing, flipping a record in a database or even sending a special message to control the application. Exactly how it’s implemented will largely depend on the application and what fits best for its workflow, but in the end what you have is an internal switch to turn processing on or off for the application.

    This switch can then be used in conjunction with the first or second approaches listed above to give some additional control over the application and your deployment workflow.
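
As a rough illustration of the internal switch from the third approach, here is a minimal sketch of a worker that checks a flag before taking any work. Everything in it is hypothetical: ProcessingSwitch stands in for whatever holds the flag (a database record, an admin endpoint, a control message), and an in-memory queue stands in for the real job source.

```python
import queue


class ProcessingSwitch:
    """Hypothetical on/off flag; a real one might be a database record."""

    def __init__(self, enabled=False):
        self._enabled = enabled

    def enable(self):
        self._enabled = True

    def disable(self):
        self._enabled = False

    def is_enabled(self):
        return self._enabled


def run_worker(jobs, switch, handle, max_polls):
    """Poll up to max_polls times, handling at most one job per poll.

    While the switch is off, jobs are left untouched in the queue.
    Returns the number of jobs handled.
    """
    handled = 0
    for _ in range(max_polls):
        if not switch.is_enabled():
            continue  # switched off: do not take any work
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            continue  # nothing to do on this poll
        handle(job)
        handled += 1
    return handled


if __name__ == "__main__":
    jobs = queue.Queue()
    for name in ("job-1", "job-2"):
        jobs.put(name)
    switch = ProcessingSwitch(enabled=False)
    print(run_worker(jobs, switch, print, max_polls=3))  # handles nothing
    switch.enable()
    print(run_worker(jobs, switch, print, max_polls=3))  # drains the queue
```

A new version of the worker could be pushed with the switch off and enabled only after the old version has been disabled, which avoids the window where both versions compete for work. A real worker would also sleep between polls rather than loop a fixed number of times.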

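For the second approach (push and let the worker catch up), the backlog and recovery time can be estimated with some simple arithmetic. The sketch below assumes steady job arrival and processing rates, which real workloads rarely have, and all the numbers in it are made up for illustration.

```python
def backlog_after_downtime(arrival_rate, downtime_s):
    """Jobs queued while the app is down (jobs/sec * seconds)."""
    return arrival_rate * downtime_s


def catch_up_seconds(backlog, arrival_rate, per_instance_rate, instances):
    """Seconds needed to drain the backlog while new jobs keep arriving.

    Returns None when the instances cannot process jobs faster than
    they arrive, in which case the backlog never shrinks.
    """
    drain_rate = per_instance_rate * instances - arrival_rate
    if drain_rate <= 0:
        return None
    return backlog / drain_rate


if __name__ == "__main__":
    # Made-up numbers: 10 jobs/sec arriving, app down for 120 seconds,
    # each instance able to process 8 jobs/sec.
    backlog = backlog_after_downtime(10, 120)
    for n in (1, 2, 4):
        print(n, "instance(s):", catch_up_seconds(backlog, 10, 8, n))
```

The backlog figure is what your queue or database has to be able to hold, and the catch-up time shows how much a temporary scale-up would help before scaling back down.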