Systems are engineered for use in Production from the start – they are scalable, observable and tolerant of failure.
Building more, smaller systems and utilising cloud environments means that understanding and adapting to change, and handling failure, are more important than ever. Ideally, a system is self-healing or produces only relevant, actionable alerts.
- Release It! by Michael Nygard will be used as the starting point for defining what Production Ready means.
- The system should have sufficient monitoring, logging and alerting to be able to understand and report on its health, and the health of its dependencies, in a consistent way.
- Anyone in JLP should be able to access our telemetry tools.
- Systems should gracefully degrade on disruption – for example, by using circuit breakers, bulkheads, caching or partial responses.
- Teams and business owners will need to work together to identify relevant business and technical KPIs and SLOs for their product.
- Teams should understand and improve the resilience of their service through failure mode & effects analysis, and by experimenting with intentional faults.
- Dashboards need to be implemented for the most important SLOs.
- Not all risks to production readiness can be analysed in advance, so exploratory testing, production observability, fault injection, post-incident reviews and operational monitoring should be used to expose new information about software behaviour.
- Applications should be built with a diversity of stakeholders in mind. Operability and supportability are important in most contexts, but see the [list of system quality attributes](https://en.wikipedia.org/wiki/List_of_system_quality_attributes) for other software 'ilities' that may need particular consideration in your context.
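
To make the graceful degradation principle above more concrete, here is a minimal sketch of a circuit breaker with a fallback, written in Python. It is illustrative only and not a prescribed implementation: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the injectable `clock` are all hypothetical names chosen for this example. In practice a team would more likely adopt an established resilience library for its platform.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; names and defaults are illustrative."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should be short-circuited to the fallback."""
        if self.opened_at is None:
            return False
        # Once the reset window has passed, allow one trial call ("half-open").
        return self.clock() - self.opened_at < self.reset_after

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run operation, degrading to fallback when it fails or the circuit is open."""
        if self.is_open():
            return fallback()
        try:
            result = operation()
        except Exception:
            # Assumption: any exception counts as a dependency failure.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a dependency call as `breaker.call(fetch_live_prices, read_cached_prices)`, where both function names are hypothetical. While the dependency is failing, users receive the cached (partial) response instead of an error, and the failing dependency is given time to recover before it is retried.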