Modern Engineering at Microsoft: Transforming for the Future
Our Microsoft Digital team is in the midst of a significant transformation, driven by our modern engineering vision. This initiative is designed to create a culture, along with the necessary tools and practices, that prioritizes the development of high-quality, secure, and feature-rich services. The goal is to empower digital transformation across the company. This approach has already yielded significant improvements, including a sharper customer focus, accelerated delivery of new capabilities, and heightened engineering productivity.
The Journey to Modernization
Moving to the cloud was a crucial step, allowing us to increase the agility of our development processes and speed up value delivery. The move affected approximately 600 services, composed of around 1,400 components, and adopting new cloud technologies gave us quicker access to additional infrastructure. Engineers could now deploy environments and resources on demand and respond efficiently to evolving business needs.
However, we recognized the need to address deeper structural issues. These included inconsistencies among teams concerning fundamental engineering practices such as coding standards, automated testing, security scans, compliance, release methodologies, gated builds, and releases. A lack of a centralized, unified engineering system and associated practices hampered our progress. Recognizing the limitations of a federated approach, we invested in a central team. This team was tasked with establishing a common engineering system based on Microsoft Azure DevOps, while also driving consistency across the organization in the design, coding, instrumentation, testing, building, and deployment of services. This shift brought a product engineering mindset, helping us define a clear vision and establish priorities based on objectives and key results (OKRs), which are tracked and reported using Viva Goals.
These improved engineering processes have fostered increased business alignment, more efficient developer workflows, and enhanced cross-team mobility. We incorporated industry-leading development practices for accessibility, security, and compliance. Achieving compliance proved challenging, requiring us to transition from legacy processes and tooling and actively address our technical debt in these areas. We also improved our telemetry and monitoring, giving us key insights into service health, feature performance, customer experience, and usage patterns. Embracing a Live Site culture has helped us drive continuous improvements in service quality.
Our Modern Engineering Vision
Our vision for modern engineering is rooted in the imperative to deliver high-quality capabilities and solutions faster, with improved reliability and security. To achieve this, we are modernizing the build, deployment, and management of our services to swiftly bring new functionality to our users. We are re-evaluating every aspect of our engineering process, adopting modern engineering practices, as summarized by Satya Nadella, our Chief Executive Officer:
“In order to deliver the experiences our customers need for the mobile-first, cloud-first world, we will modernize our engineering processes to be customer-obsessed, data-driven, speed-oriented and quality focused.”
Our ongoing investments build on the foundation we have already established, supporting our vision and cultural changes. We are focusing on three key pillars, with AI integration wherever appropriate:
- Customer Obsession
- Engineering Productivity
- Rapid Delivery
Customer Obsession
To ensure our engineers are laser-focused on the customer, we collect feedback that gives us a deep understanding of the customer experience. Our service monitoring has allowed us to identify and fix problems before customers are even aware of them. As the first customers of Microsoft’s commercial offerings, we identify and address the engineering needs of enterprises operating in a cloud-centric architecture. We constantly collaborate with our product engineering groups, creating a virtuous cycle that improves the enterprise readiness of products such as Azure DevOps and Azure services.
Using Customer Feedback to Drive Development
Customer experience is at the center of our engineering process. Feedback loops are critical for driving hypothesis-driven product improvements. We are working to make feedback submission as easy as possible, using the same tools as the Microsoft Office product suite. The “Send a Smile” feature gathers feedback across multiple channels and key user touchpoints. This tool serves as a centralized data system where we store, triage, and analyze feedback to create actionable insights. This process also encourages adoption of feedback loops and experimentation methods to measure the impact of product changes. Next, we correlate this feedback data with related telemetry to better understand usability issues. Controlled rollouts reduce the need for user acceptance testing (UAT) environments, thus accelerating delivery.
Telemetry
By building on Azure Monitor, we have unified all the telemetry from disparate systems. This facilitates continuous improvements in our service quality. Azure Monitor integrates with various data sources to collect, process, and publish data from applications, infrastructure, and business processes. This provides end-to-end views and generates actionable insights about service management.
We are working toward delivering highly connected insights that aggregate the health of component services, customer experience, and business processes. These insights provide contextual data that identifies events and their root causes, along with recommended next steps. We use business process monitoring (BPM) to monitor availability and performance by tracking successful transactions and customer impact. To achieve sustained quality, we are leveraging synthetic monitoring for all critical services.
Data-enhanced incident tickets provide a view of issues prioritized by business impact, supplemented with potential causes identified through machine learning. These tickets allow teams to focus on the most critical issues and reduce mitigation time. We are investing in AI technologies to proactively detect anomalies and automate their remediation. This intelligent response reduces support costs and improves service reliability and the overall user experience.
Service Health
We’ve intensified our focus on effective service and live site incident management. We implemented a standard incident management process and continuously improved key metrics. We monitor service health metrics and key performance indicators (KPIs) across the organization to understand customer sentiment and ensure services are reliable and performing well. Consistent standards enable us to aggregate and compare data at any level. An integrated experience, built on Azure Monitor, is enriched with contextual data from our unified telemetry platform. This allows us to define a set of service health measures, track events that affect service reliability, and build a tool that reports service health for individual services.
We knew we needed to connect service health to business process health. The experience we’re building enables visualization of end-to-end business process health and the health of the underlying services by analyzing their telemetry. We also simplified the flow of service health and engineering fundamentals data to the engineer and reduced the number of dashboards and tools they use. An internal tool is now the key repository for all service owners to view service health and other relevant KPIs. The tool’s integrated notification workflow informs service owners when a service reaches a defined threshold, making it more convenient to prioritize any needed remediation into their backlogs.
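As a rough illustration of a threshold-based notification check like the one described above, here is a minimal Python sketch. The names `HealthReading` and `services_needing_attention`, the availability KPI, and the 99.9 percent threshold are all hypothetical and are not taken from the internal tool.

```python
from dataclasses import dataclass


@dataclass
class HealthReading:
    """One KPI sample for a service (hypothetical shape)."""
    service: str
    availability_pct: float


def services_needing_attention(readings, threshold_pct=99.9):
    """Return the names of services whose availability is below the threshold."""
    return sorted(r.service for r in readings if r.availability_pct < threshold_pct)


# Illustrative sample data, not real measurements.
readings = [
    HealthReading("billing", 99.95),
    HealthReading("search", 99.2),
    HealthReading("auth", 99.9),
]

alerts = services_needing_attention(readings)
print(alerts)
```

In this sketch only readings strictly below the threshold are flagged, so a service sitting exactly at 99.9 percent does not trigger a notification. A real workflow would also need to decide who is notified and how the item reaches the backlog.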
Embracing a Live Site Culture
Focusing on customer experiences has been essential for increased scale and agility within our services and processes. We’re establishing a Live Site culture, pursuing excellence through customer-obsessed, data-driven, multidisciplinary teams. These teams embrace potential failure with honest observation, continuous learning, and measurable improvement targets. We host an organization-wide Live Site review, including postmortem reviews on incidents and long-term remediation plans. These reviews are based on reports that contain leading indicators for outages or failures based on the analysis of telemetry, synthetic monitoring, and other data.
Engineering Productivity
We’re providing our engineers with best-in-class unified standards and practices in a common engineering system. A consistent development environment allows engineers to transition smoothly between projects and teams. We have improved automation so that engineers can focus on their core role of development, which also reduces onboarding time and makes our engineers more flexible across projects.
Integrating Developer Tooling
We made organizationally mandated code analysis and compliance tools accessible within the development environment. Self-service capabilities were built to manage access, set policies, and make changes to Azure DevOps artifacts. This has made it easy for engineers to manage services and components, minimizing the time spent managing such resources. We want to extend our shift-left goal to the design of our Azure services by surfacing configuration recommendations early in the deployment cycle, so that we can rightsize our configurations and avoid unnecessary Azure costs.
Enabling Code Reuse
We support a few applications that use on-premises servers, which results in ongoing effort to patch servers, upgrade software, and perform basic infrastructure maintenance tasks. We’ve transformed these applications to Microsoft Azure platform-as-a-service (PaaS) and software-as-a-service (SaaS) based solutions. We provide architectural guidance and tools to migrate data, refactor existing functionality as APIs, and build lightweight applications by reusing APIs that others have already published. Promoting data and code reuse to build solutions more rapidly and align with a service-oriented architecture requires that developers be able to publish and discover APIs easily.
Workforce Strategies
We implemented a new workforce strategy, hiring full-time employees and bringing more work in-house. This strategy makes it imperative that there is full-time employee oversight of any supplier deliveries. We implemented a common bar for hiring across all teams and a common onboarding program to ensure all new hires receive a consistent level of training. We are investing in re-skilling and training initiatives to expand the engineering capacity available to work on AI-related projects.
Universal Design System
Every product should meet the quality expectations of today’s consumers, meaning that every piece of the user interface (UI) and user experience (UX) should be engineered with accessibility, responsiveness, and familiar behaviors, states, motion, and visual styling. Adopting a universal design system considerably reduces engineering time.
Rapid Delivery
To be customer-obsessed, we track delivery metrics. We are helping engineers achieve this objective by checking for issues earlier in the pipeline. We apply feedback-loop mechanisms for a clear understanding of the user experience as new functionality gets deployed. We perform automated rollbacks if customer reaction or service-health signals are less favorable than we anticipated.
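To make the automated rollback idea concrete, the following Python sketch shows one way such a decision could be expressed. It is an illustration under stated assumptions, not our pipeline's actual logic: the function name `should_roll_back`, the two signals (error rate and share of negative feedback), and both default limits are hypothetical.

```python
def should_roll_back(error_rate, baseline_error_rate, negative_feedback_ratio,
                     max_error_increase=0.01, max_negative_feedback=0.2):
    """Decide whether a release should be rolled back.

    All rates are fractions between 0 and 1. The limits are assumed
    values chosen for illustration only.
    """
    # Service-health signal: errors rose more than the allowed margin.
    health_regressed = (error_rate - baseline_error_rate) > max_error_increase
    # Customer-reaction signal: too large a share of feedback is negative.
    sentiment_regressed = negative_feedback_ratio > max_negative_feedback
    return health_regressed or sentiment_regressed


# Errors jumped from 2% to 5% after the release, so this returns True.
decision = should_roll_back(error_rate=0.05, baseline_error_rate=0.02,
                            negative_feedback_ratio=0.05)
print(decision)
```

Either signal alone is enough to trigger a rollback in this sketch, which matches the "customer reaction or service-health signals" wording above. A production system would likely also require a minimum sample size before acting.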
Integrating Security, Accessibility, and Fundamentals
We moved to a shift-left process, so our engineers check for issues earlier in the pipeline. We implemented gates in the developer workflow and auto-onboarding services to ensure continuous compliance. We scan code for security issues and log bugs in Azure DevOps. We adopted accessibility insights tooling and now expose accessibility-related bugs as part of the pipeline workflow. We adopted AI technologies for providing accessibility guidance and conducting accessibility assessments. We are implementing continuous integration practices.
Deploying Safely to Customers
We created an environment where teams test ideas and prototypes. Progressive exposure and feature flags are key in deploying new capabilities to users via controlled rollouts. We implemented checks and balances in the process. Implementing safe deployment practices, combined with a well-managed pipeline, is key for achieving a continuous integration, continuous deployment (CI/CD) model.
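Progressive exposure with feature flags is often implemented by assigning each user to a stable bucket and enabling the feature for a growing percentage of buckets. The Python sketch below shows that general technique. It is an assumption about how such a flag could work, not a description of our tooling, and the names `bucket_for`, `is_enabled`, and the flag `new-dashboard` are hypothetical.

```python
import hashlib


def bucket_for(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, flag_name: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return bucket_for(user_id, flag_name) < rollout_percent


# Widen exposure in stages; a user who is enabled at one stage stays
# enabled at every later stage because the bucket never changes.
users = [f"user-{n}" for n in range(1000)]
for percent in (1, 10, 50, 100):
    exposed = sum(is_enabled(u, "new-dashboard", percent) for u in users)
    print(f"{percent}% rollout -> {exposed} of {len(users)} users exposed")
```

Hashing the flag name together with the user ID keeps each user's experience consistent across sessions while giving different flags independent audiences. Setting the percentage back to 0 acts as an immediate kill switch.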
Reliability and Efficiency
We are enhancing our DevOps engineering pipeline by identifying and removing bottlenecks. We’ll use DevOps Research and Assessment (DORA) metrics to measure our execution. We’re making our vision for modern engineering a reality by promoting a Live Site first culture. The Live Site first culture and the tools and ceremonies that support it have increased visibility into engineering processes.
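The four commonly cited DORA metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service. The Python sketch below shows how they could be computed from a list of deployment records. The `Deployment` record shape, the `dora_summary` function, and the sample data are hypothetical and serve only to illustrate the calculations.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_summary(deployments, period_days):
    """Compute the four DORA metrics for a non-empty list of deployments."""
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_hours = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                     for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "median_restore_hours": median(restore_hours) if restore_hours else None,
    }


# Illustrative sample: four deployments over two days, one of which failed.
sample = [
    Deployment(datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 12)),
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15),
               failed=True, restored_at=datetime(2024, 1, 1, 16)),
    Deployment(datetime(2024, 1, 2, 8), datetime(2024, 1, 2, 10)),
    Deployment(datetime(2024, 1, 2, 10), datetime(2024, 1, 2, 18)),
]

summary = dora_summary(sample, period_days=2)
print(summary)
```

Medians are used here rather than means so that a single unusually slow change or long outage does not dominate the result. That is a design choice of this sketch, and teams may prefer percentiles instead.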