DevOps and Feature Flags (IV)

This is the last post of the four related to feature flags. In this post, we will define a strategy for using feature flags to contribute to our DevOps Journey, applying concepts described in the post series about feature flags. Of course, we have to combine it with other actions like implementation of pipelines or test automation.

Objectives

In our journey to implement DevOps, one of the milestones that we want to achieve is Continuous Deployment. We are going to embrace Continuous Deployment by applying the Drunken Hamster Approach to DevOps:

«You shall always conduct development as if there is always a possibility that a drunken hamster suddenly deploys your code to production»
Karl Tillström. Team Lambda Collector Bank

In order to release fast and safe, we need to implement a mechanism to reduce the risk associated with deploying and releasing changes into production.

In a trunk-based development environment, changes to the code base are consistently merged into the main branch and pushed to production frequently and in a systematic way. Consequently, unfinished features (pending implementation or testing) will be pushed to the main branch, and these changes could be deployed into production.

For this objective, we propose using feature flags. Feature flags allow committing to the main branch and deploying unfinished features safely, without affecting the production environment. Feature flags are essential to maintain the integrity and stability of code deployment.

By operating with feature flags, we expect to go one step further in the Drunken Hamster Approach to DevOps, allowing our friends to also commit our local changes at any time, triggering the deployment of the committed change into production.

Software Development Life Cycle

Our plan is to implement Continuous Deployment in a trunk-based development environment with feature flags. In this scenario we are going to commit changes continuously to the main branch and these changes can be deployed into production at any time; deployed, not released, as they will be hidden behind a feature flag.

Therefore, we have to provide secure deployment for any change committed to the main branch and safe release for any completed feature.

Deployments will be secured by verifying each change. This verification consists of deploying the change in our staging (or preproduction) environment and certifying that it is compatible with our production release, by running automated tests using the Production State strategy: only features released in production will return their toggle as ON; the rest will be OFF.

The release of features will be kept safe by certifying the feature in staging, enabling testing in production, and providing a rollout that allows us to expose the new feature in a controlled way. In a little more detail:

The feature will be certified in staging by running automated tests using the Next Release strategy (completed features, released in production or not, will be ON; the rest will be OFF).
Testing in production will be available automatically as soon as the feature is completed, verified and deployed.
The rollout process will be managed manually, requiring the approval of key roles; the number of steps will be adjusted depending on the urgency and criticality of the feature.

Finally, a cleanup mechanism will be implemented in order to keep our code clean.

Our DevOps proposal assumes that most features will be implemented under a feature flag, and it is based on connecting three elements: status flow, feature flag strategies and pipelines.

Status Flow

Status flow starts once development team commits to implement a Feature (status is set to To Do). Any definition, action or prioritization of the feature before this point is out of the scope.

To Do: Feature is ready to be worked on.
Doing: Feature is being implemented, required tests are being automated and affected tests are being adapted.
Staging: Feature is completed (implemented and tests automated). This status indicates that feature has to pass staging certification; i.e. it fits properly with rest of completed features.
Done: Completed feature has been certified in staging.
Finished: Feature deployed and activated in production.
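
The status flow above can be sketched as a small state machine. This is only an illustration, not our actual tooling: the `Status` enum and the `advance` helper are hypothetical names, and the sketch assumes the flow is strictly linear (a feature never moves backwards).

```python
from enum import Enum


class Status(Enum):
    TO_DO = "To Do"
    DOING = "Doing"
    STAGING = "Staging"
    DONE = "Done"
    FINISHED = "Finished"


# Assumption: the flow is linear, so each status has exactly one successor.
ORDER = list(Status)


def advance(current: Status) -> Status:
    """Return the next status in the flow; Finished is terminal."""
    if current is Status.FINISHED:
        raise ValueError("Finished is the last status of the flow")
    return ORDER[ORDER.index(current) + 1]
```

For example, `advance(Status.STAGING)` returns `Status.DONE`, which is the transition performed once staging has been certified.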

Status flow is detached from deployment. Any change will be automatically deployed in our development environment by our CI/CD pipeline. Deployment to production will also be managed by a pipeline, triggered manually until we implement continuous deployment.

Flag Strategies

Before the introduction of feature flags, we would be talking about environments instead of flag strategies. With feature flag strategies, we obtain better behavior than with environments, while reducing the number of components required.

Strategies are set up using some runtime context (by user, by remote address, etc.).

There will be two main toggle systems: development and production. Each system will define its own strategies:

Development Toggle System Strategies

Flush Out Strategy: All toggles activated. Used by developers and testers to implement new features without worrying about activating the flag.
Production State Strategy: Only released features in production (Finished) will return their toggle as ON. This strategy is to make sure that last committed change, normally hidden behind a feature flag, can be deployed into production safely.
Next Release Strategy: Only toggles of completed features (Staging, Done or Finished) will be returned as ON. It will verify that a completed feature can be activated in production and is compatible with the rest of the completed features.
Custom Strategy: At any time, we will be able to configure a custom strategy to check compatibility/testability/… of two or more features.
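
The three fixed development strategies can be expressed as a function of the feature status. The following is a minimal sketch under our own assumptions: the `is_on` function and the strategy identifiers are hypothetical, and the Custom strategy is left out because it is configured case by case.

```python
# Status names are taken from the status flow described earlier.
COMPLETED = {"Staging", "Done", "Finished"}


def is_on(strategy: str, status: str) -> bool:
    """Return the toggle state for a feature in `status` under `strategy`."""
    if strategy == "flush_out":
        return True                    # all toggles activated
    if strategy == "production_state":
        return status == "Finished"    # only features released in production
    if strategy == "next_release":
        return status in COMPLETED     # completed features, released or not
    raise ValueError(f"unknown strategy: {strategy}")
```

With this sketch, a feature in Done is OFF under Production State (`is_on("production_state", "Done")` is `False`) and ON under Next Release (`is_on("next_release", "Done")` is `True`).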

Production Toggle System Strategies

Beta Strategy: Done features (implemented, tested and certified with Next Release Strategy) that have been deployed will be available for a selected set of users (including QA), that will have the opportunity to test the feature in production, before it is rolled out to everyone.
Production Strategy: Controls who has access to the feature. It is the only main strategy that is activated manually (rollout).
Custom Production Strategy: Similar to the Custom Strategy in development, we will be able to set up any toggle combination in production.
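
As an illustration of how the Beta and Production strategies could combine in the production toggle system, here is a hedged sketch. The `ProductionToggle` class, its fields and the `is_on` function are hypothetical names introduced for this example; a real toggle service will expose its own model.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ProductionToggle:
    status: str                     # feature status from the status flow
    deployed: bool = False          # feature is in the artifact running in production
    released_to_all: bool = False   # set manually at the end of the rollout
    rollout_users: Set[str] = field(default_factory=set)  # set manually during rollout


def is_on(toggle: ProductionToggle, user: str, beta_users: Set[str]) -> bool:
    # Production strategy: access is granted manually during the rollout.
    if toggle.released_to_all or user in toggle.rollout_users:
        return True
    # Beta strategy: Done and deployed features are visible to the selected users.
    return toggle.status == "Done" and toggle.deployed and user in beta_users
```

In this sketch a QA user in the beta group sees a Done feature as soon as it is deployed, while any other user only sees it once the rollout reaches them.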

Pipelines

Our CI/CD pipelines will provide steps to automate the process, from default CI/CD steps (build, pass unit tests, code analysis and deploy) to executing automated tests against different strategies. Not all the steps will belong to same pipeline.

DevOps Proposal

By default, any feature will be linked to a toggle. Toggle behavior will be handled automatically in Development and in Production (except for the Production strategy that will be managed manually in the rollout). This behavior will depend on the status of the feature (status flow) and the execution of pipelines.

On the other hand, status of the feature will change automatically in most of transitions, except for the two commitment points that are done manually: To Do (to control WIP) and Finished (once feature has been rolled out in production for everybody).

Pipelines will be triggered by a commit to the main branch, by a deployment, by a change of status or manually.

Following schema shows the feature flag workflow combined with status flow and pipeline execution. This schema is explained in next sections:

[Figure: Feature Flag Workflow]

Team Commits To Implement A New Feature

Feature is set to To Do status, making it visible for the team (definition, refinement and prioritization happen before this point). In Kanban this is known as the first commitment point.

Toggle is created and Flush Out strategy is activated for this flag.

Creating Feature Flag

As explained in previous posts, it is important to define a naming convention and a clear owner. In our proposal, we have also included the stakeholder:

<stakeholder>.<team>.<user story id>

First part is the stakeholder, normally whoever has defined the feature.
Second, the team, is the owner.
Finally, last part is the feature id, making toggle name unique.

A release feature flag would look like: DeptA.TeamB.66623
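
A naming convention is easier to keep when it is checked automatically. The sketch below parses a flag name into its three parts; `FlagName` and `parse_flag_name` are hypothetical helpers, and the sketch assumes that stakeholder and team are alphanumeric and that the user story id is numeric, as in the example above.

```python
import re
from typing import NamedTuple


class FlagName(NamedTuple):
    stakeholder: str
    team: str
    story_id: str


# Assumption: alphanumeric stakeholder and team, numeric user story id.
_PATTERN = re.compile(r"^([A-Za-z0-9]+)\.([A-Za-z0-9]+)\.(\d+)$")


def parse_flag_name(name: str) -> FlagName:
    """Split a flag name into its parts or reject it if it breaks the convention."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"flag name does not follow the convention: {name!r}")
    return FlagName(*match.groups())
```

For instance, `parse_flag_name("DeptA.TeamB.66623").team` gives `"TeamB"`, the owner of the flag.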

When To Not Create A Feature Flag

These are situations where we might not create a toggle:

Super urgent fix: When we need to intervene in production immediately and cannot spend a second trying to keep the code we want to fix in an if/else statement.
Tiny and straightforward fix with low risk that is not worth the effort of creating a toggle.
New functionality, not yet in use. Its deployment will not have any impact.
Cleaning features.
In general, any case where creating a flag adds too much complexity (for example, if the new feature breaks backward compatibility). In this case, commits to the main branch have to be performed very carefully.

Feature is set to To Do
Feature Flag is created
Flush Out strategy activated

Implementing And Testing A Feature

Once team starts working on it, feature is set to Doing.

Most of the work is done during this stage (more details in the second post of this series: Creating, Implementing and Testing Feature Flags):

Feature is implemented. Unit test, covering toggle ON and OFF, is part of the implementation.
Implement automated tests for the new feature.
Adapt the test cases affected by the new feature. Affected tests can be identified using the Flush Out strategy.
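
A unit test covering both toggle states can be as simple as the following sketch. The `checkout_total` function and its discount are an invented example feature, and the toggle state is passed in as a plain dictionary rather than read from a real toggle service.

```python
import unittest

FLAG = "DeptA.TeamB.66623"


def checkout_total(amount: float, toggles: dict) -> float:
    """Hypothetical feature: apply a 10% discount only when the flag is ON."""
    if toggles.get(FLAG, False):
        return round(amount * 0.9, 2)
    return amount


class CheckoutTotalTest(unittest.TestCase):
    def test_toggle_off_keeps_current_behavior(self):
        self.assertEqual(checkout_total(100.0, {FLAG: False}), 100.0)

    def test_toggle_on_applies_new_behavior(self):
        self.assertEqual(checkout_total(100.0, {FLAG: True}), 90.0)
```

Defaulting to OFF when the flag is missing keeps the current behavior if the toggle service does not know the flag yet.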

Usually all this work will end up in a commit to the main branch. Any commit will trigger our CI/CD pipeline which, among other steps, will have three main ones: building a deployable artifact, certifying that this artifact can be deployed into production, and certifying staging (described in the next section: Completing The Feature).

Build Artifact Step

Typical build action. A merged change into the main branch will generate a new artifact of the module. This artifact will be used in any deploy in any environment (artifacts are not re-generated depending on the environment).

We have to be able to obtain each artifact’s generation timestamp. This timestamp gives us a picture of the Staging and Done features included in it. These features will be deployed into production when the created artifact is delivered to production, and they will be the next candidates to be rolled out.
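
One possible way to derive that picture from the timestamp is sketched below. It assumes we can record, for each feature, its status and the timestamp of its last commit to the main branch; the `features_in_artifact` function and this data layout are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def features_in_artifact(built_at: datetime,
                         features: Dict[str, Tuple[str, datetime]]) -> List[str]:
    """Return the Staging and Done features whose last commit predates the artifact.

    `features` maps a flag name to (status, timestamp of its last commit).
    """
    return sorted(
        name
        for name, (status, last_commit) in features.items()
        if status in {"Staging", "Done"} and last_commit <= built_at
    )
```

The result is the list of rollout candidates that travel with the artifact when it is delivered to production.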

Certify Production Step

Every time that a change is merged to main branch, even though change is hidden behind a feature flag, we must ensure that it can be deployed into production.

Certify production will launch all automated tests (API, integration, end to end) against the Production State strategy that will return as active only toggles that have been rolled out in production.

If this step is passed, artifact generated in the build step will be labelled as deployable.

Feature is set to Doing
Feature is implemented (coded), required tests are automated and affected tests are adapted
Artifact is created

Completing The Feature

Feature will be set to Staging once all its tasks are finished.

Certify Staging Step

This action will execute all automated tests against the Next Release strategy. This strategy returns ON for all features that have been rolled out in production (Finished), those that are pending rollout (Done; their deployment in production may also be pending) and those that are implemented, tested and deployed in development environments and still have to be certified (Staging). In other words, it checks that completed features, whether already deployed in production or not, are compatible with each other.

If this step passes, then any feature in Staging will be set to Done.

Feature is set to Staging
Staging features are set to Done once certified

Broken Pipeline

A pipeline can be broken before or after creating the artifact. There are three situations where a pipeline may break:

Build process fails: no artifact generated.
Certify Production step fails: artifact is not labelled as deployable.
Certify Staging step fails: artifact has been labelled as deployable. Artifact can still be deployed in production, however no rollouts should be performed.
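
The three outcomes above can be summarized in a short sketch. The `run_pipeline` function and the callables it receives are hypothetical stand-ins for the real pipeline steps, and the sketch assumes the steps run strictly in order (Certify Staging is only executed when Certify Production passes).

```python
from typing import Callable, Optional


def run_pipeline(build: Callable[[], Optional[str]],
                 certify_production: Callable[[str], bool],
                 certify_staging: Callable[[str], bool]) -> dict:
    """Run the three steps in order and report the resulting artifact state."""
    result = {"artifact": None, "deployable": False, "rollout_allowed": False}
    artifact = build()
    if artifact is None:
        return result                  # build failed: no artifact generated
    result["artifact"] = artifact
    if not certify_production(artifact):
        return result                  # artifact is not labelled as deployable
    result["deployable"] = True
    # A staging failure keeps the artifact deployable but blocks rollouts.
    result["rollout_allowed"] = certify_staging(artifact)
    return result
```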

Failure could be due to an incorrect code implementation of the feature or to a wrong implementation of the automated tests.

If a code fix is required, a new artifact will be generated and the certification of production and staging will be executed again.

If the reason for the failure is in the tests, fixing it does not require executing the whole pipeline; it is enough to execute the two certification steps (Production and Staging) on the last generated artifact.

Deployment

In the future, we expect to include deployment in our CI/CD pipeline and land in the Continuous Deployment world. Meanwhile, deployment will be manual and can be performed at any time.

Deployment consists of installing the latest version of a module’s artifact that is labelled as deployable.

Deployment has no impact on the feature flag configuration for production.

However, a consequence of a new deployment is that features that are in Done and are delivered to production (they are included in the last deployment) will be activated in the Beta strategy. Therefore, product owners, QA and key users will have the option of testing new features in production, before their rollout.

Canary Deployment

Combining canary deployments with feature flags can get very complicated, as we would have to check two paths: the Production and Beta strategies.

Therefore, we are going to detach deployment from the feature flag strategies. We will implement canary deployments using the Production strategy, i.e. we ensure that we continue offering the same level of service, leaving the analysis of the impact of activating a toggle to the rollout phase.

Deploy is not release
Done features deployed are activated in Beta strategy
Canary deployment of artifacts against Production strategy

Rollout

Rollout is the process of activating a toggle for all users and all traffic in production. Rollout can be simply activating the flag for everyone or it can go through a set of rollout stages (check previous post for more details).

This operation starts when Product requests the activation of a feature in production.

Telemetry will protect the rollout. Metrics should indicate that everything continues working properly; otherwise we have to cancel the rollout and set the toggle to OFF.

As explained in the previous post, it is key to identify in our metrics when a feature flag has been activated.
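
A staged rollout guarded by metrics could look like the sketch below. Everything in it is an assumption made for illustration: the percentage-based stages, the hash-based user bucketing in `in_rollout`, and the `healthy` callback that stands in for our telemetry check.

```python
import hashlib
from typing import Callable, Sequence


def in_rollout(user_id: str, flag: str, percentage: int) -> bool:
    """Place a user in a stable bucket (0-99) and compare it with the current stage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage


def roll_out(stages: Sequence[int], healthy: Callable[[int], bool]) -> int:
    """Walk through the stages; return the final percentage (0 means rolled back)."""
    current = 0
    for percentage in stages:
        current = percentage
        if not healthy(current):
            return 0    # metrics degraded: cancel the rollout, toggle goes OFF
    return current
```

Because the bucket depends only on the flag and the user, a user who sees the feature at one stage keeps seeing it at every later stage. For example, `roll_out([5, 25, 100], healthy)` ends at 100 when every check passes and at 0 as soon as one fails.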

Finish Feature

If the rollout succeeds, the feature will be set to Finished, the Production Strategy will be updated, anyone interested in this feature should be informed, and the cleanup process will be scheduled.

Again, deploy is not release
Rollout is guarded by metrics
Feature is set to Finished

Cleanup

Before removing the flag, we are going to establish a safety period. The period starts after the rollout. During this period, the feature flag will remain available, keeping the option of deactivating it at any time.

After this period, a cleaning feature will be created and properly prioritized. This feature cannot have a flag.

Cleanup has to pass the CI/CD pipeline, i.e. cleanup will create a new artifact that will be certified.

Once cleanup is deployed into production, it has to be checked that flag has no activity (toggle service does not receive status requests for the feature); then, flag can be removed.
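
The "no activity" check could be automated along these lines. This is a sketch under assumptions: it supposes the toggle service can tell us when it last received a status request for the flag, and it adds an observation window (`quiet_period`, seven days by default) that is our own choice and not part of the proposal above.

```python
from datetime import datetime, timedelta
from typing import Optional


def can_remove_flag(last_status_request: Optional[datetime],
                    cleanup_deployed_at: datetime,
                    now: datetime,
                    quiet_period: timedelta = timedelta(days=7)) -> bool:
    """A flag can go once nothing has asked for it since the cleanup was deployed."""
    if now - cleanup_deployed_at < quiet_period:
        return False    # not enough observation time yet
    return last_status_request is None or last_status_request < cleanup_deployed_at
```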

Cleanup has to pass the CI/CD pipeline

Key Takeaways

The Drunken Hamster approach to DevOps
Merge changes often to main branch
Limit the number of flags
Features should be small
Deploy is not Release
We need metrics

References

The Split Blog
Feature Toggles (aka Feature Flags)
A Practical Guide to Testing in DevOps, by Katrina Clokie
Feature Flag Best Practices, by Pete Hodgson & Patricio Echagüe
Managing Feature Flags, by Adil Aijaz and Patricio Echagüe
