By Jonty Sidney, Synthesis Senior Cloud & DevOps Engineer
As DevOps teams become more comfortable with the ability to use monitoring tools to react to problems and issues, they can expand what they do with these tools.
Why do operations teams spend so much time building and maintaining complex systems for monitoring and logging? Is there a point? Does anyone actually look at them?
These questions will sound very familiar to many IT professionals. Often operations teams spend hours poring over logs and guessing blindly in the dark when outages occur. They do not look at the metrics being gathered by secure, centralised event log management systems that cost hundreds of thousands of rands.
Often these systems do not provide the insight required to solve a problem. They just seem to give pretty pictures and graphs that appear to add the same value as the fancy-looking painting in the entrance of the office.
This might seem a little harsh. However, it would seem as though monitoring and alerting solutions are afterthoughts that are tacked on to a system to meet some (apparently) arbitrary requirement of auditors and management.
But what if monitoring could unlock the untapped potential of the system? What if it could take performance to a new level, or increase the level of experimentation and data-driven decision-making in a project? This is not just a good sales pitch for a new fancy dashboarding tool.
In a previous Industry Insight in this series, the first way of DevOps was introduced. It described the processes and tools that a team can use to create fast flow from left to right (in other words, getting features and value to customers as efficiently as possible). However, with any increase in speed, the chance of breaking the system also increases.
The critical question that the second way of DevOps seeks to answer is one of balance: how to allow fast flow while still maintaining a high level of quality.
Fundamentally, the second way of DevOps is very similar to Eric Ries’s Lean Start-up Model. In this world-famous work, Ries argues that quick and efficient feedback loops are the method for creating the products and services that consumers and end-users want. This is accomplished by continually testing assumptions against data received from interactions with a system (often a minimum viable product, or MVP).
While this is not the forum to go into the details of the Lean Framework, the diagram alongside should suffice. The key takeaway, though, is that the Lean Start-up method creates frequent and tight feedback loops. These feedback loops are what allow companies to innovate at the pace that has been seen over the past few years.
When building software, DevOps encourages several core processes. Each of these processes assists in creating these feedback loops.
Pushing quality closer to source
It is important to establish who is responsible for quality in the project. Is it the developers, operations teams, testers, project managers or security team? The answer is that everyone is responsible for quality.
In other words, whenever any team member contributes something to the project, it is their responsibility to add the highest level of quality assurance that they are capable of.
This does not remove the need for dedicated testers and quality assurance staff. However, it should not be their sole responsibility to monitor the application.
For example, as code is added to the codebase, each developer is responsible for ensuring the required automated tests are written, and their code is logging adequately. The underlying theory behind this practice is to ensure quality is built into the project from its very inception.
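As a rough illustration of what this can look like in practice, the sketch below shows a small function that logs what it does, together with an automated test delivered in the same change. The function `apply_discount`, the logger name "orders" and the test are hypothetical names invented for this example; they are not taken from any particular project.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("orders")


def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount, logging what happened."""
    if not 0 <= percent <= 100:
        # Log the bad input before failing so the problem is visible in the logs.
        logger.error("Rejected discount of %s%% on price %s", percent, price)
        raise ValueError("percent must be between 0 and 100")
    discounted = round(price * (1 - percent / 100), 2)
    logger.info("Applied %s%% discount: %s -> %s", percent, price, discounted)
    return discounted


def test_apply_discount() -> None:
    """Automated test written alongside the code it covers."""
    assert apply_discount(200.0, 25) == 150.0
    assert apply_discount(80.0, 0) == 80.0
    try:
        apply_discount(100.0, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an out-of-range discount")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    test_apply_discount()
```

The point of the sketch is that the developer who adds the behaviour also adds the test and the log lines, rather than leaving either to a later team.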
In the case of security, this is extremely critical. Instead of waiting for a security expert to verify an entire system at the very end of the project, it should be built into the system from day one.
In a sentence: Quality is not the finishing touches we put on a product – it is the product.
Swarming and solving problems to build new knowledge
In Toyota’s manufacturing plants, there is a very interesting phenomenon: the Andon Cord. The cord is there for any single employee on the plant floor to pull. If they notice something amiss or wrong, they are responsible for pulling the cord and notifying everyone about the problem.
This could end up with the entire factory floor pausing its work to gather around and solve the newly found problem.
This approach has resulted in fewer major manufacturing issues and even a marked decrease in the cost of errors and mistakes. There are several reasons for this, but, for the purpose of this discussion, two stand out.
Firstly, problems or mistakes are corrected at the earliest stage, before they become hidden and difficult to fix.
Secondly, since the entire factory floor participates in fixing the problem – commonly known as swarming – the information and lessons from any mistake are spread across the entire workforce. This allows more and more team members to notice issues before they even become mistakes or errors – preventing future issues.
In software, when an issue is discovered – be it a bug or an outage – the entire team is notified. Since everyone is responsible for quality, it is everyone’s responsibility to look for any untoward issues. When the team then swarms the problem, more people learn how to fix similar bugs in the future.
Optimising for downstream work centres
In the old way of doing things, teams often would “throw” their work over a proverbial wall to the next team to do their job.
The best example of this is how a development team would simply hand over a “completed” artefact for the operations team to install and run.
This notion of ignoring what happens to your work further on down the work stream is the complete antithesis of a DevOps culture.
In the second way, teams take the time to learn what the downstream team members will need. They then build this into their work.
For example, the developers ensure the code they write can be deployed and run in the same way that the operations team will run it. Furthermore, they take the time to build the monitoring, logging and other instrumentation that operations requires into their code.
This goes back to “pushing quality closer to source”. The entire team takes everyone’s requirements and concerns seriously. It is not just the operations team that needs to care how much memory is used by the code. The developers need to know this as well.
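To make the idea of building instrumentation into the code more concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical `instrumented` decorator and an in-memory `METRICS` list standing in for whatever metrics system the operations team actually runs; `build_report` is an invented workload. It records how long each call takes and how much memory it used at its peak.

```python
import functools
import logging
import time
import tracemalloc

logger = logging.getLogger("instrumentation")

# In-memory list standing in for a real metrics backend (an assumption).
METRICS = []


def instrumented(func):
    """Record the duration and peak traced memory of each call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Only start and stop tracing if nobody else has already started it.
        # If tracing was already on, the peak covers that whole tracing session.
        started_tracing = not tracemalloc.is_tracing()
        if started_tracing:
            tracemalloc.start()
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - started
            _, peak_bytes = tracemalloc.get_traced_memory()
            if started_tracing:
                tracemalloc.stop()
            record = {
                "name": func.__name__,
                "duration_s": duration,
                "peak_bytes": peak_bytes,
            }
            METRICS.append(record)
            logger.info("call metrics: %s", record)

    return wrapper


@instrumented
def build_report(rows: int) -> int:
    """Hypothetical workload: allocate a list and sum it."""
    data = list(range(rows))
    return sum(data)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    build_report(100_000)
```

Because the measurements are emitted by the code itself, developers and operations staff see the same numbers, which is the shared visibility this section argues for. Tracing memory adds overhead, so a real system would likely sample calls rather than trace every one.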
When the first two ways of DevOps combine, something impressive happens. The team delivers small batches of work quickly and efficiently (the first way) and obtains feedback just as quickly (the second way).
Since the team has built quality into the project from the ground up, issues are smaller and easier to fix.
However, this is only part of the benefit. As a team becomes more mature and more comfortable using its monitoring tools to react to problems and issues, it can start to expand what it does with those tools. A/B tests, hypothesis testing and true data-driven decisions become part of the development process.
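As one possible illustration of an A/B test driven by monitoring data, the sketch below applies a standard two-proportion z-test to conversion counts for two variants. The function name `two_proportion_z_test` and the visitor and conversion figures are made up for the example; a real team would feed in the counts from its own monitoring system and might prefer a dedicated statistics library.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one visitor")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all in the data, so there is no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up figures: variant A converts 120 of 2400 visitors, B converts 156 of 2400.
    z, p = two_proportion_z_test(120, 2400, 156, 2400)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be chance alone, which gives the team a data-driven basis for keeping or discarding the change.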
Instead of waiting to react to issues, software teams can pre-empt them – experimenting with solutions and implementing fixes more quickly than ever before. With fast flow and even faster feedback, software teams can unleash the true purpose of their systems and products: delivering value to their customers.