Practical Site Reliability Engineering

Platform metrics

Platform metrics give you insight into an application's infrastructure: the average execution time of the top database queries, the queries that consume the most DTU/CPU, resource consumption per application, the average response time of each service endpoint, and each service's success/failure ratio. We should set up high-priority alerts on these metrics, because problems here can directly affect the user experience. The goal is to be proactive and catch issues and outages before customers notice them. For example, we can set up automation that scales system resources automatically during peak hours. This monitoring helps us understand the platform's performance.
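As a rough illustration of how two of these metrics might feed an alert, the sketch below computes the average response time and failure ratio per endpoint from a list of request records and flags any endpoint that crosses a threshold. The `Request` record, the function names, and the threshold values (500 ms and 5%) are assumptions made for this example, not part of any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    endpoint: str
    duration_ms: float
    success: bool


def summarize(requests):
    """Return {endpoint: (average_ms, failure_ratio)} for the given requests."""
    grouped = defaultdict(list)
    for r in requests:
        grouped[r.endpoint].append(r)

    summary = {}
    for endpoint, items in grouped.items():
        avg_ms = sum(r.duration_ms for r in items) / len(items)
        failure_ratio = sum(1 for r in items if not r.success) / len(items)
        summary[endpoint] = (avg_ms, failure_ratio)
    return summary


def find_alerts(summary, max_avg_ms=500.0, max_failure_ratio=0.05):
    """Return (endpoint, reason) pairs for endpoints over either threshold.

    The default thresholds are illustrative assumptions only.
    """
    alerts = []
    for endpoint, (avg_ms, failure_ratio) in sorted(summary.items()):
        if avg_ms > max_avg_ms:
            alerts.append((endpoint, "slow"))
        if failure_ratio > max_failure_ratio:
            alerts.append((endpoint, "failing"))
    return alerts
```

In practice the request records would come from application logs or a metrics store, and the alerts would be routed to an on-call channel or used to trigger scaling rather than returned as a list.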