I’m going to tell you a story about how the metrics raised their voice and made things happen. More than a year ago, someone on my team said: “Something is hitting our service every 15 minutes.” The other members replied: “Really? What can it be? Customer requests? An attempted attack? Does it happen every day, and for hours?” So many unanswered questions!
The team decided to go after this and understand more, starting by checking whether we had metrics and data that could show the patterns or problems. After some time, a chart turned up in which the spikes of requests were visible every 15 minutes.
Report shows the increase of requests every 15 minutes
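To make the idea behind a chart like this concrete, here is a minimal sketch of one way to surface a 15-minute rhythm in request data: bucket each request by its minute offset inside a 15-minute window and see whether one offset dominates. This is not the report we actually used; the function names and the synthetic timestamps are assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW_MINUTES = 15  # the period observed in the story


def offset_histogram(timestamps, window_minutes=WINDOW_MINUTES):
    """Count requests by their minute offset inside each window."""
    counts = Counter()
    for ts in timestamps:
        minutes_since_midnight = ts.hour * 60 + ts.minute
        counts[minutes_since_midnight % window_minutes] += 1
    return counts


def spike_share(timestamps, window_minutes=WINDOW_MINUTES):
    """Return (offset, share of all requests) for the busiest offset."""
    counts = offset_histogram(timestamps, window_minutes)
    offset, hits = counts.most_common(1)[0]
    return offset, hits / len(timestamps)


# Synthetic data (an assumption, not real MABS traffic):
# one background request per minute for an hour...
start = datetime(2024, 1, 1, 9, 0)
background = [start + timedelta(minutes=m) for m in range(60)]
# ...plus a burst of 10 requests right after every quarter hour.
bursts = [
    start + timedelta(minutes=m, seconds=s)
    for m in (0, 15, 30, 45)
    for s in range(1, 11)
]

offset, share = spike_share(background + bursts)
print(offset, round(share, 2))  # 0 0.44
```

With evenly spread traffic each offset would hold roughly 1/15 of the requests, so a single offset holding 44% of them is a strong hint that something periodic is going on.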
The chart shows the builds requested by MABS consumers and highlights the huge impact on MABS (Mobile Apps Build Service). What is MABS? A fast, reliable and secure service for building native mobile apps using the OutSystems product, allowing our customers to generate mobile application packages for iOS and Android without having to install any mobile platform SDK.
With this evidence in hand, the team reached out to other product teams, and the cause was identified: a queuing mechanism that processes the requests every 15 minutes.
One impact of this issue is that it seriously degrades the performance of the build service. A successful application build takes around 4 minutes on average; with this issue, it could take up to 4+15 minutes.
The issue was identified, added to the backlog, and prioritized.
However, as time went by and the issue remained unaddressed, the team kept the feeling that there was more to it than this one issue. We saw the same customers repeatedly requesting the same build and getting the same error messages more than once. Why?
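The arithmetic can be sketched in a few lines. The sketch below assumes the queue is drained only at fixed 15-minute marks, which is a simplification for illustration and not a description of the real mechanism; the constants come from the numbers above, while the function names are hypothetical.

```python
BUILD_MINUTES = 4          # average successful build time from the article
QUEUE_PERIOD_MINUTES = 15  # period of the queuing mechanism


def wait_until_next_tick(arrival_minute, period=QUEUE_PERIOD_MINUTES):
    """Minutes a request waits if the queue is drained only at multiples of `period`."""
    return (-arrival_minute) % period


def turnaround(arrival_minute, build=BUILD_MINUTES, period=QUEUE_PERIOD_MINUTES):
    """Total minutes from request to finished build under that assumption."""
    return wait_until_next_tick(arrival_minute, period) + build


print(turnaround(0))   # 4  -> arrives exactly on a tick, no wait
print(turnaround(1))   # 18 -> just missed the tick, waits 14 minutes
print(turnaround(14))  # 5  -> arrives one minute before the next tick
```

A request that arrives just after a tick waits almost the full 15 minutes, which is where the "4+15" worst case comes from.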
One of the quality practices we have implemented is the definition of Quality Goals for the team's assets/product. A Product Quality Goal helps us answer “What does quality mean for our product and what does the team need to do to accomplish that?”
During the Quality Goals sessions, we set a goal: understand why our customers are repeatedly requesting the same build and getting the same errors, and how we can resolve this. We needed to drill down, break the problem into parts, add monitoring, observe, and measure.
And so we did. We started by defining what “MABS repeated build requests” means: what the variables are, what the use cases are, and what messages the customers are getting. With those values defined, the data being monitored, and the use cases identified, we had a set of metrics to observe and iterate on if needed.
Report shows the number of repeated builds & queuing mechanism effect
The metrics we defined show us that, besides the queuing mechanism (the spikes visible in the blue and green columns every 15 minutes), there are other use cases that need to be addressed: scenarios of failed build requests. The use cases shown in the chart happen when customers have requested a mobile app package and, for some reason, the build failed, causing the build process to be repeated automatically when a certain action was taken. That’s what the metrics in the chart above show us: the number of MABS builds requested after a previous build failed.
What is the difference between the blue and green colors?
- The green columns show the number of repeated builds of the same code that were explicitly requested by the MABS client.
- The blue columns show the number of repeated builds triggered after some changes to the code, without an explicit request to generate a mobile app.
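The distinction between the two colors can be sketched as a small classifier over a stream of build requests. This is only an illustration under stated assumptions: the `BuildRequest` fields (`code_hash`, `explicit`, `succeeded`) and the `classify_repeats` function are hypothetical names, not the actual MABS data model or monitoring code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRequest:
    app_id: str
    code_hash: str    # hypothetical fingerprint of the app's code
    explicit: bool    # True when the client explicitly asked for a build
    succeeded: bool


def classify_repeats(requests):
    """Count builds that follow a failed build of the same app, by color."""
    last_by_app = {}
    counts = Counter()
    for req in requests:
        prev = last_by_app.get(req.app_id)
        if prev is not None and not prev.succeeded:
            same_code = prev.code_hash == req.code_hash
            if same_code and req.explicit:
                counts["green"] += 1   # same code, explicitly requested
            elif not same_code and not req.explicit:
                counts["blue"] += 1    # code changed, build not explicitly requested
            else:
                counts["other"] += 1   # combinations the chart does not describe
        last_by_app[req.app_id] = req
    return counts


requests = [
    BuildRequest("app1", "h1", explicit=True, succeeded=False),   # first build, fails
    BuildRequest("app1", "h1", explicit=True, succeeded=False),   # green
    BuildRequest("app1", "h2", explicit=False, succeeded=False),  # blue
    BuildRequest("app1", "h2", explicit=True, succeeded=True),    # green
    BuildRequest("app1", "h2", explicit=True, succeeded=True),    # previous build succeeded
    BuildRequest("app2", "x1", explicit=True, succeeded=True),    # first build of another app
]

counts = classify_repeats(requests)
print(counts["green"], counts["blue"])  # 2 1
```

Keeping an explicit "other" bucket is a deliberate choice in this sketch: it makes any combination the two definitions do not cover visible instead of silently dropping it.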
Using Metrics to Understand Customers and Product Issues
The metrics show us visually the impact on the MABS service, but also on the customer’s experience; so the metrics have spoken! They allow us to identify three issues: the 15-minute queuing mechanism, the repeated build requests without code changes, and the repeated build requests triggered automatically, without explicit action, after specific changes in the code. All three affect how customers perceive our product, and they also hurt MABS performance and serviceability, because MABS ends up doing work that is neither needed nor used by anyone.
This isn't only a MABS problem. These are experience issues and business use case issues and, in the end, product issues. MABS is a service used mostly internally by product components, and the use cases shown by the metrics are actionable by MABS clients (internal product components). The product areas/teams need to work together to fix them.
Metrics, monitoring, and observability are important tools for understanding customers, product behaviors, and unknown problems. Using them wisely allows us to manage issues proactively and faster, sometimes before customers even notice them.
Today, the issues pinpointed by the metrics are being addressed and the fixes will be delivered soon. The metrics will stay for now; they will allow us to keep monitoring and observing the behaviors and patterns identified. Then, when our customers adopt the release with the fixes, the metrics will speak again, and they will evolve or disappear to make way for new ones.