A Checklist for Building a DevOps Organization
Without automation built into production, a newly built DevOps team will find it hard to support the business. Here’s how to assemble the ideal organization.
Without the core value of automation built into production engineering processes, a newly built team will find it hard to support the business when it is ready to scale up. This is especially true at a time when development groups want to focus on new product features and will be eager to hand off responsibilities that are not essentially part of building and fine-tuning those features.
An automated operations environment helps businesses make quick changes with minimum defects and downtime. Typically, DevOps teams will be responsible for tasks of the following kind in an organization.
In virtualized, cloud-based environments, computing resources can be provisioned as needed. When computing resources are allocated on demand and elastically, such environments cannot realistically be built manually. Your team should know how to automate the build and integrate those steps with provisioning tools and configuration management systems.
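As a minimal sketch of what such automation looks like, the snippet below maps a hypothetical node "role" to the parameters a provisioning API would need. The role names, image IDs, instance types, and tag scheme are invented for illustration; a real setup would feed these into a provisioning tool or cloud SDK.

```python
# Sketch: turning a node "role" into provisioning parameters so environments
# can be stood up on demand instead of by hand. All role specs below are
# hypothetical placeholders.

ROLE_SPECS = {
    "web": {"image_id": "ami-web-baseline", "instance_type": "t3.medium", "count": 2},
    "db":  {"image_id": "ami-db-baseline",  "instance_type": "r5.large",  "count": 1},
}

def instance_params(role: str) -> dict:
    """Build the keyword arguments for a cloud API call (for example,
    boto3's ec2.run_instances) from a role definition."""
    spec = ROLE_SPECS[role]
    return {
        "ImageId": spec["image_id"],
        "InstanceType": spec["instance_type"],
        "MinCount": spec["count"],
        "MaxCount": spec["count"],
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": role}],
        }],
    }

# A provisioning script would then call, for example:
#   boto3.client("ec2").run_instances(**instance_params("web"))
```

Keeping the role specifications in data rather than in scripts is what lets the same automation serve every environment.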
Identify software roles in an application stack and automate the steps to build them. Using those as building blocks, large-scale application environments can be stood up with the help of configuration management tools. That is the only way to scale up operations if the consumer app you support, the internal storage service you manage, or the company's newly released SaaS offering becomes an instant hit. If you ask your prospective customers to wait, you will lose them to the competition or lose your credibility as an internal infrastructure service provider, depending on what you have been supporting.
A well-tested feature should be deployed in production with minimum delay. Continuous integration is an effective method to both test and deploy code, but in real life, a company's practice will fall somewhere between manual code pushes and a fully automated deployment pipeline.
Don't settle for the out-of-the-box features of your favorite monitoring tool. To effectively monitor application stacks, custom plugins have to be developed, and the team should have the skills to do it. The advent of log aggregation tools such as Logstash and Splunk has made it possible to dig errors and insights out of server logs. Again, to make these tools more useful, code has to be written to instrument, mine, and present operational data.
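As an illustration of what a custom plugin involves, here is a minimal Nagios-style check in Python. The metric source (a queue-depth file) is a hypothetical stand-in for whatever your application exposes; the exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) follow the standard plugin convention.

```python
# Sketch: a minimal Nagios-style check plugin. The metric file path and the
# thresholds are hypothetical; the exit-code contract is the standard one.

def check_threshold(value, warn, crit):
    """Map a metric value to a Nagios status code and message."""
    if value >= crit:
        return 2, f"CRITICAL - queue depth {value} >= {crit}"
    if value >= warn:
        return 1, f"WARNING - queue depth {value} >= {warn}"
    return 0, f"OK - queue depth {value}"

def main(path="/var/run/myapp/queue_depth"):  # hypothetical metric file
    try:
        value = float(open(path).read().strip())
    except (OSError, ValueError) as exc:
        print(f"UNKNOWN - cannot read metric: {exc}")
        return 3
    code, message = check_threshold(value, warn=100, crit=500)
    print(message)
    return code

# Nagios would run this as a command and read the exit code:
#   sys.exit(main())
```

The same shape (read a metric, compare against thresholds, print a one-line status, exit with a conventional code) works for almost any application-specific check.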
The production engineering team in any company will have a long list of items to carry out periodically that tend to defy classification. Some of the things I have done or been responsible for in this category include:
Weekly reports containing aggregates from business systems.
Operational performance and computational usage data.
Data extracts generated for both internal and external customers.
Updating of various metadata used by applications.
Reprocessing of data to correct issues with aggregation done earlier.
Security audits, both internal and as required by regulations such as SOX.
These chores, often lengthy procedures, are normally handed over to the production engineering team. They should be automated as much as possible, both to avoid burdening team members with rote work and to avoid the mistakes a bored worker can make when nothing about a routine task holds their attention.
Tools and processes alone will not solve any problem. You will always need talented people on the team to get things done in line with the larger goals of the company using minimum resources. Complex tools in the hands of incompetent people will only create more chaos. With that general warning, let’s see what we need in this area.
I have already stressed the importance of automation a few times, and automation normally means writing code. Tool vendors will always argue that you can do everything from the dashboards of the products they peddle. However, a production engineering team with good coding skills can extend third-party tools, build custom tools if the situation demands, and collaborate well with development teams on features that improve the operability of the applications.
A traditional operations team during the data center era consisted of system and network admins, database and third-party application admins, and application support engineers. In the last such company where I worked, we had system admins, Oracle and MySQL admins, Tableau and Microstrategy admins, and application support engineers. Very few new companies can afford to have such a division of labor, but they might still need to have resources to cover similar job responsibilities.
The important thing is to find people who are not married to particular technologies and products, who are open to learning new ones, and who are comfortable treating the ability to code as one more tool for solving problems.
It is important to have both system administration and coding skills available in a production engineering team. If the company can afford only a few people in the production engineering team, its members need to be versatile; a large team can still afford specialists. The exact composition of a team will ultimately depend on the specific requirements of the business, but a team without substantial coding skills as a group will not get much automation done, and failure to automate operational tasks will become a bottleneck as the company grows and the need to scale up becomes essential.
An operations infrastructure must be in place to roll out the related processes in a new organization. Some components will be there already even before a production engineering group is formally set up because those things (such as a ticketing system) are essential to running a high-tech company. Parts of that infrastructure will be shared with other groups also, mainly development, as part of collaboration. Though it is hard to generalize, a production engineering organization would require some form of the tools and applications from the following list.
The infrastructure can be divided into two broad categories. First is the set of processes that need to be rolled out and owned by the production engineering group. Examples are the release process, incident management, and on-call. The tools needed for rolling out automation projects and the production engineering process are the second category of items.
My objective here is only to list the kind of tools needed for a production engineering team for them to be effective. In each category, you can easily find multiple competing products. If I mention any product specifically, that only indicates my familiarity with that product. The important thing is to have some tools, including home-grown tools, available when the need arises. It is also important to avoid using multiple tools in the same category unless there is a compelling reason to do so.
A documentation platform that can be used by both development and operations teams is an essential component for collaboration. Wiki-based solutions like MediaWiki are the most popular, but even Google Docs will do the job.
Any documentation solution can easily degenerate into a dumping ground of assorted documents. To avoid a free-for-all, it is important to set a structure for organizing the documents right from the beginning. One effective method is to organize documents around applications and use templates for creating standard documents.
Configuration management is a very generic term. In DevOps circles, it usually refers to a Puppet- or Chef-like tool that manages system-level configurations and baseline software installations on a computing node. There are at least three different configuration management needs, and if we include the configuration requirements for automated deployments, the list grows to four. However, the subject of deployment automation is better discussed in the larger context of Continuous Integration (CI).
When a user joins or leaves the company, the various accesses (at both the system and application level) defined for that user's role should be propagated to the various systems automatically. Tools like Puppet and Chef may be the most popular, but plenty of alternatives are available.
Just as a new user is provisioned or a departing user is deleted from the systems, when a new computing node is provisioned, the baseline software bits needed for that node can be installed and configured using the same tools. System-level configuration and software deployment are done on a computing node based on its "role" in a larger software system.
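A configuration management run can be thought of as converging a node to its role's baseline. The sketch below is a toy version of that idea, assuming hypothetical role names and package lists; real tools like Puppet and Chef also manage files, users, and services, and apply everything idempotently.

```python
# Sketch: role-based baseline configuration in the spirit of Puppet/Chef.
# The roles, packages, and services below are hypothetical placeholders.

BASELINES = {
    "web": {"packages": ["nginx", "php-fpm"], "services": ["nginx"]},
    "db":  {"packages": ["mysql-server"],     "services": ["mysqld"]},
}

def plan(role, installed):
    """Return the actions needed to converge a node to its role's baseline.
    Idempotent in spirit: already-installed packages produce no action."""
    baseline = BASELINES[role]
    actions = [f"install {p}" for p in baseline["packages"] if p not in installed]
    actions += [f"ensure-running {s}" for s in baseline["services"]]
    return actions
```

Because the plan is computed from the role definition rather than hard-coded, the hundredth node of a role costs no more effort than the first.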
Managing the configurations of application stacks and environments is the next requirement. A full-fledged CMDB may not be warranted, but at least a custom solution will eventually be needed, because without a single source of truth for such configurations, rolling out serious automation projects is hard.
For example, keeping track of which system settings and software bits go into a software role is immensely useful if we want to stand up application stacks in a totally automated fashion. I still remember the joy of building large object storage farms from a few command lines, a job that used to be an excruciating, week-long effort of cutting and pasting scores of commands and running manual steps.
Traditionally, the use of an SCM tool like Subversion or Git is limited within production engineering teams. Scripts used for ad hoc automation efforts live in somebody's home directory, and when Brian leaves the company, all hell breaks loose in the application area he had been supporting smoothly until then.
An SCM system is not only for application code development. Any piece of code or configuration data needed to replicate an application environment has to be managed in the SCM system. Code is not only for defining product features; in a highly automated environment, code is also needed for maintaining it.
Make sure that members of the production engineering team are skilled in using the company's SCM system. If there are multiple SCM tools in use, take leadership in standardizing on one. The existence of multiple tools is a clear indication that product development teams work in silos, which normally creates nightmarish scenarios for the production engineering team: when issues happen, you will come across development teams more inclined to cover their bases than to resolve the issues.
Ops code should also go through peer review and be part of the release process so there is visibility into what is deployed in production. Including ops code in the release process is becoming more important as the concepts of infrastructure and platform as code take hold, and writing such code is not very different from writing code that implements product features.
CI automates the code deployment process, beginning with developers checking code into the SCM system. On the CI platform, code changes are built, packaged for deployment, and deployed in a staging environment where they are tested.
Developers get immediate feedback on the quality of their code, which helps get bugs fixed immediately. Integration of the code is incremental and continuous, and incompatibilities are ironed out early on, reducing the need for a separate big-bang integration phase.
Jenkins is a popular CI platform available for rolling out CI processes. The CI processes are integrated with CM, CMDB, and SCM systems.
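The flow described above (check-in, build, package, deploy to staging, test) can be sketched as an ordered pipeline that stops at the first failing stage, which is what gives developers fast feedback. The stage functions here are hypothetical placeholders for the real build and deploy commands a Jenkins job would run.

```python
# Sketch: a CI flow as an ordered pipeline. Each stage is a callable that
# receives the commit id and reports success; a failure halts the pipeline
# so the offending change is flagged immediately.

def run_pipeline(commit, stages):
    """Run stages in order; stop at the first failure."""
    results = {}
    for name, stage in stages:
        ok = stage(commit)
        results[name] = "passed" if ok else "failed"
        if not ok:
            break
    return results

# Hypothetical stages; real ones would shell out to build/deploy tooling.
STAGES = [
    ("build",   lambda c: True),   # e.g. compile, resolve dependencies
    ("package", lambda c: True),   # e.g. produce a deployable artifact
    ("deploy",  lambda c: True),   # e.g. push the artifact to staging
    ("test",    lambda c: True),   # e.g. run the automated test suite
]
```

A CI server is essentially this loop plus triggers (SCM hooks), workspaces, and reporting.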
Like an SCM system, a bug tracking system is primarily rolled out for the use of development organizations. It is important that the production engineering team gets visibility into the projects and issues tracked in that system. The team should also have privileges to create its own projects and queues to manage code related to DevOps areas that we discussed in the beginning.
Bug tracking applications are typically part of generic ticketing systems. There is no dearth of open-source and licensed software in this area; Bugzilla and Jira are among the well-known products.
A mature monitoring infrastructure will have checks implemented at different levels: infrastructure, network, system, and application monitoring using industry-standard tools such as Nagios, Zenoss, etc.
If you have to monitor a consumer web app or SaaS application, testing has to be done from the Internet, outside your corporate network. There are many service providers in that space, such as Apica and Catchpoint.
At the very basic level, log aggregation tools gather the system and application logs at one place and index them for search. Looking through the logs for error patterns and setting up alerts on their occurrences can help with catching issues that dedicated monitoring might miss. There are both open-source and licensed products in the market; Loggly, Logstash, and Splunk are some of the popular products.
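As a rough illustration of the idea, the sketch below scans log lines for error patterns and flags those that cross a threshold. The patterns and the threshold are invented for the example; in practice, Logstash or Splunk would index the lines and a saved search would raise the alert.

```python
# Sketch: error-pattern alerting over log lines. Patterns and threshold
# are hypothetical; a log aggregation tool does this at scale over
# indexed logs.
import re
from collections import Counter

ERROR_PATTERNS = {
    "db_timeout": re.compile(r"database .*timed out", re.I),
    "oom":        re.compile(r"out of memory", re.I),
}

def scan(lines, threshold=3):
    """Count pattern hits across log lines; return patterns at or over
    the alerting threshold."""
    counts = Counter()
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return {name: n for name, n in counts.items() if n >= threshold}
```

The threshold matters: alerting on every single match trains people to ignore alerts, while a count over a window catches genuine error bursts.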
Many third-party tools that are used to build the applications might come with their own admin tools. There will be some monitoring features available with those. While dashboards can be used out of the box, monitoring-related APIs that provide status on the underlying components could be used to build monitoring checks on the main monitoring platform.
The leadership team will be interested in various summary data, such as the utilization of computing resources, the uptime of applications, and performance indexes such as the percentage of SLAs (Service Level Agreements) met. Core monitoring systems provide the basic information for such reporting, but further aggregation and presentation will be required. Custom batch jobs that collect and aggregate operational data will have to be designed and implemented. Presentation layers can be custom dashboards built with popular PHP or Node.js frameworks, or standard reporting tools such as Actuate, Tableau, or Microstrategy. Once the operational data of interest is being collected, insights can be drawn from it using any BI tool; such tools might already be in use by business groups.
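As a minimal example of such aggregation, the sketch below turns raw up/down polling samples into an uptime percentage and an SLA verdict. The sample format (one boolean per polling interval) and the 99.9% target are assumptions for illustration.

```python
# Sketch: aggregating raw monitoring samples into the uptime and SLA
# figures leadership asks for. Sample format and SLA target are assumed.

def uptime_percent(samples):
    """samples: iterable of booleans, one per polling interval (True=up)."""
    samples = list(samples)
    if not samples:
        return 100.0
    return 100.0 * sum(samples) / len(samples)

def sla_report(samples, target=99.9):
    """Summarize a reporting period against an assumed SLA target."""
    pct = uptime_percent(samples)
    return {"uptime_pct": round(pct, 3), "sla_met": pct >= target}
```

A real batch job would pull the samples from the monitoring system's history, roll them up per application and per period, and hand the results to the dashboard or BI layer.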
Popular log aggregation tools such as Logstash and Splunk provide another set of operational intelligence data by indexing the logs. In addition to mining the standard log files, operational data can be generated on computational nodes and these tools can be used to aggregate and index custom operational metrics for analysis.
There are products available in the market to help with this, but largely home-grown solutions tend to be the norm in this category with the support of reporting applications.
The tools discussed above help roll out the standard production engineering processes that are essential to a mature organization. However, when such processes are implemented in a new organization, care must be taken to ensure that a new process adds value and does not slow things down as a result of its implementation.
The release process normally refers to code deployment in production, and change management refers to any change that would have an impact on systems in production. By definition, the change management process covers application releases. It also keeps track of changes in infrastructure, OS and third-party software upgrades, database changes, and even one-off jobs that may have an impact on the computing resources.
The main objectives of a change management process should be tracking changes done in production and documenting and socializing the changes for better visibility within the company.
It is important that the proposed changes are reviewed and approved by a dedicated team, and that stakeholders and business owners are notified of the changes before and after those are implemented.
This is something built on top of the documentation platform. In a new company, product documentation will be non-existent, and documentation efforts will be ongoing as the applications are enhanced in every release. It is important to create operational run books for the applications. Set up a process to maintain them and tie that to release management. One standard question to ask in a release review meeting is what changes are needed in the operational run book.
Document the application errors that monitoring systems and log aggregation tools will distribute as alerts. Even though a self-healing production environment is the ideal, some manual interventions may be needed.
Document routine maintenance tasks: generating reports for both internal and external customers, metadata updates, backups, and purging. There could be several application-specific chores you need to perform routinely. Though these tasks are typically automated, some manual steps will be needed to deliver the services to end customers.
Make sure that run books are not excuses for not automating repetitive tasks. There is a tendency on the line-management side to throw manpower at maintenance tasks and address them manually. As indicated earlier, such an expensive strategy will never scale in the long run, and it could drive away staff who do not want to perform rote work. If you have team members who are happy to do routine tasks and resistant to automating them, you will soon notice your DevOps efforts getting stuck on their inability or lack of motivation to implement automation.
Applications are expected to be available always. Even an application with internal users may have a userbase spanning multiple geographies. The downtime of consumer web or SaaS applications should be minimal, if any; businesses can't afford more. Outages and other incidents can happen in production in the most unexpected ways, and the response to such incidents should be quick.
To have a smooth on-call process, the following things have to be in place:
Contact information of members of both development and operations teams.
An on-call calendar that clearly indicates who is responsible for responding to critical alerts and incidents at a given point in time.
Escalation procedures specific to applications. Normally, the on-call person has to contact a point-of-contact (POC) in the development group as the first escalation step.
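A first version of such a calendar can be as simple as a rotation table. The sketch below answers "who is on call now?" and builds the first-step escalation chain; the names, the weekly rotation, the rotation start date, and the development POC mapping are all hypothetical.

```python
# Sketch: an on-call calendar as a weekly rotation table. Everything here
# (names, POCs, start date) is a hypothetical placeholder; paging tools
# implement the same lookup with overrides and hand-off rules.
from datetime import datetime, timezone

ROTATION = ["alice", "bob", "carol"]                    # weekly hand-off order
ESCALATION = {"alice": "dev-poc-web", "bob": "dev-poc-db", "carol": "dev-poc-web"}
EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)       # rotation start (a Monday)

def on_call(now):
    """Return who is responsible for critical alerts at the given time."""
    weeks = (now - EPOCH).days // 7
    return ROTATION[weeks % len(ROTATION)]

def escalation_chain(now):
    """Primary on-call first, then the development POC for that person."""
    primary = on_call(now)
    return [primary, ESCALATION[primary]]
```

Even this toy version makes the two essentials unambiguous: exactly one person owns the pager at any moment, and the next escalation step is written down, not tribal knowledge.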
A BCP plan essentially addresses the non-availability of a primary production environment. The non-availability could be a result of a natural disaster (hence the popular term Disaster Recovery planning) and sometimes BCP and DR are used interchangeably. However, DR planning is a part of the larger BCP strategy.
As part of BCP, the following items are addressed:
Backup and replication strategies to support the overall BCP strategy.
Building production-quality standby environments or running application environments in multiple geographical regions. The latter configuration makes the application Highly Available (HA).
I've seen production operations teams dragged into a company's or engineering department's drive to roll out the agile process. Though agile has proved a useful methodology for product development groups, it can be clumsy and forced in an operations environment, mainly because operations teams don't have full control over their own time. Issues happen in production and priorities change, but keeping the systems up and running is the primary responsibility. Getting projects done in a fixed time frame may not always be possible.
However, projects both small and big have to be tracked formally, and they have to be completed. If an agile methodology has to be adopted, the production operations team has to be realistic and assertive about its involvement:
Be part of development Scrum teams. Engineering projects are not just product development. The infrastructure to run the application and its monitoring requirements have to be planned right from the beginning. Embedding an operations engineer in the application development agile teams is a great idea as opposed to tossing out tasks to the operations team without context.
Roll out a Kanban-like process within the production engineering team to manage projects. Regardless of the adoption of methodologies, managing the backlog of projects and tasks and their prioritization should happen.
Issues happen all the time in production environments. However, if an incident causes a considerably negative impact on the end-user experience or a loss of revenue, a quick fix will be needed. That is normally called a hotfix, and the process followed is different from the standard code deployment procedure, with a focus on resolving the issue as quickly as possible.
The incident management process should also ensure that both users and stakeholders are informed of an ongoing issue. If an end user has to escalate a system-wide issue (don't confuse this with the reporting of product bugs), the company has a serious problem running its business. The production operations group can avoid such embarrassment by alerting on an issue before users notice it and, later, by taking leadership in analyzing the root cause of the production issue.
In a new organization, the following processes and info have to be in place to deal with incidents in production that would have some business impact.
Prepare a comprehensive list of contacts and set up a process to maintain it. The contact info should include operations and development POCs for products. There could be multiple operations contacts, covering core infrastructure, network operations, databases, and application support. The list should also identify the product owners, normally product managers, who would manage the communication with end users if an issue happens.
Set up the group communication infrastructure. When an incident happens, multiple people could end up triaging the issue. Chat, voice, and desktop sharing are the most common modes of communication that will be used during a crisis. The employees should have access to communication tools such as telephone conferencing, IRC, Webex, etc.
Implement a root cause analysis process to review major outages in production. The focus of such reviews must be identifying and resolving the underlying cause so the same incident does not repeat.
Software applications run on hardware infrastructure and software platforms that need upgrades. Old hardware has to be replaced or upgraded, the OS has to be upgraded to the latest stable version, and third-party software components will also require upgrades, as older versions could go out of official support if you hang onto them for long.
In environments built using open-source products, automatic upgrades are very common, and they will largely have no impact. In general, though, changing any component in production without adequate testing is not advisable. The company should have a plan for rolling out upgrades in production environments.
In a data center or private cloud environment, the production operations team has to plan for retiring and replacing old hardware. Such efforts are called rewiring and considerable resources are needed to set up a new computing environment where an application stack will be redeployed so the existing environment can be retired.
Vulnerabilities in the security strategy put both the business and its customers at risk. New companies rarely recover from a serious security breach, as they lose customer trust and reputation.
The subject of securing cloud-based applications and the platforms they run on can be discussed at length. However, the basic precautions listed below have to be taken, adapted to your specific environment. These efforts are also steps in the right direction toward the requirements of ISO/IEC 27001 certification or SOX compliance, which will be needed as the company grows.
Often, unit-tested software is neither password-protected nor encrypted in its communications. It is important that applications in production run with at least such basic protections enabled. That means implementing SSL/TLS and custom or industry-standard authentication protocols like OAuth.
Don't allow code with user credentials to be checked into the SCM. Such info must be externalized from the code and moved to config files that can be set up as part of the deployment process.
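A minimal sketch of that externalization, assuming hypothetical variable names: the deployment process injects the secret (here via environment variables; a config file outside the repository works the same way), and the checked-in code only reads it.

```python
# Sketch: keeping credentials out of the code in the SCM. Variable names
# are hypothetical; the point is that deployment, not the repository,
# supplies the secret.
import os

def db_credentials():
    """Read credentials injected by the deployment process."""
    user = os.environ.get("APP_DB_USER")
    password = os.environ.get("APP_DB_PASSWORD")
    if not user or not password:
        raise RuntimeError("database credentials not provided by deployment")
    return {"user": user, "password": password}
```

Failing loudly when the credentials are missing is deliberate: a deployment gap surfaces immediately instead of as a confusing authentication error later.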
Roll out a process to manage passwords. Such efforts will be useful later for compliance with security audits such as ISO/IEC 27001 certification or SOX.
Run industry-standard tests such as penetration (pen) tests periodically, and harden the environments quickly based on the results.
Automate the process of granting and revoking user access, at both the OS and application level. Generate logs of user additions and removals.
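A sketch of the converge-and-audit idea behind such automation: compare the desired access (derived from the user's role) with the actual access, and emit the grant/revoke actions, which double as audit-log lines. The access names are invented; a real system would call the OS and application provisioning APIs for each action.

```python
# Sketch: converging a user's actual access toward the desired state and
# producing audit-log lines for every change. Access names are hypothetical.

def access_changes(desired, actual):
    """Return the grant/revoke actions needed, as sorted audit-log lines."""
    grants = [f"GRANT {a}" for a in sorted(set(desired) - set(actual))]
    revokes = [f"REVOKE {a}" for a in sorted(set(actual) - set(desired))]
    return grants + revokes
```

Running the comparison on a schedule, not just at join/leave events, also catches drift, such as access granted by hand and never recorded.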
Have a process in place for the production operations team to be informed of the latest security patches and advisories from the cloud provider or third-party tool vendors.
Include security review as part of planning major releases.
It is very tempting to roll out popular tools and implement fancy-sounding processes company-wide as part of setting up production engineering infrastructure. But tools and processes are only as good as the people using them. It is very important to build a versatile and competent team first and then empower it to choose or build the right tools of the trade. A new tool or process should be implemented to solve an existing problem or improve productivity; if no such need exists, it is better to wait, as real-life requirements will define the processes better and help you choose the right supporting tools.