Internal documents obtained by Business Insider this week reveal that Amazon has been quietly managing a serious wave of software outages tied directly to code changes made by its engineering teams. The documents describe millions of lost orders, cascading website errors, and what Amazon internally characterizes as "sharp edges" in its deployment pipeline. Amazon's response has been a crackdown on the pace and scope of code changes. This is not a story about technical failure. It is a story about how large organizations mistake procedural acceleration for structural competence, and then discover the difference only when systems break.
The Procedural Debt Problem
Amazon's situation illustrates a dynamic that organizational theorists have not adequately named, even if they have gestured at it. When engineering organizations scale rapidly, they tend to codify successful past behaviors into deployment procedures. This is rational in the short term. Procedures reduce coordination costs and allow new engineers to contribute quickly. But procedures encode the context in which they were written. When that context changes, which it does constantly in a live e-commerce environment, the procedures become liabilities rather than assets. Hatano and Inagaki (1986) drew exactly this distinction between routine expertise, which is fast and accurate within familiar parameters, and adaptive expertise, which can respond when the parameters shift. Amazon's outages suggest an organization that has accumulated enormous routine expertise in deployment while allowing adaptive expertise to atrophy.
What the "Sharp Edges" Metaphor Actually Reveals
The internal language Amazon uses is analytically interesting. Calling deployment problems "sharp edges" is a topographic metaphor. It describes the surface of a system as terrain to be navigated carefully so that no one gets cut. This framing is revealing precisely because of what it omits. A topographic description tells engineers where the dangerous spots are. It does not tell them why the danger is structured the way it is, or what structural changes would reduce the danger systemically. This maps directly onto a distinction my dissertation research draws between topography and topology. Topographic knowledge is practical and local. Topological knowledge is structural and transferable. Amazon's crackdown on code changes is a topographic response to what may be a topological problem. Slowing down engineers so they navigate sharp edges more carefully does not eliminate the sharp edges.
The Awareness-Capability Gap at Organizational Scale
There is a parallel here to what Kellogg, Valentine, and Christin (2020) describe in their review of algorithmic work systems: workers frequently develop awareness of system constraints without developing the structural understanding needed to respond effectively to those constraints. Amazon's engineers almost certainly know the deployment pipeline is fragile. The organization clearly knows it, given the documented internal response. But awareness of fragility is not the same as understanding what generates fragility. Organizations, like individual platform workers, can become expert at working around a structural problem without ever resolving it. The crackdown on code changes is, in this reading, a large-scale workaround rather than a structural intervention.
What Organizational Theory Predicts Here
The broader organizational literature on failure is instructive. The classical account from Weick (1990) and subsequent high-reliability organization research argues that catastrophic failures in complex systems are rarely attributable to a single cause. They emerge from the interaction of multiple small deviations that individually fall within tolerable limits. Amazon's framing of the problem as a series of discrete code changes that caused specific outages may itself be part of the problem. If the organization treats each incident as a local event to be procedurally prevented, it will likely generate more incidents elsewhere as the underlying structural tensions find new release points. Rahman (2021) makes a related argument about platform firms more broadly: the invisible constraints that govern system behavior are often opaque not just to workers but to the organizations that designed them.
The Governance Implication
What Amazon's situation demands is not a better checklist for deploying code. It demands schema-level understanding of why the system produces sharp edges in the first place. Gentner's (1983) structure-mapping framework suggests that genuine transfer of understanding requires identifying the relational structure of a problem, not just its surface features. For Amazon's engineering leadership, this means asking a harder question than "which code changes caused outages?" The harder question is: what structural features of our deployment architecture make certain classes of changes reliably dangerous, and how do we redesign those features rather than restrict behavior around them? Procedural constraints are not a substitute for that analysis. They are, at best, a delay while the analysis gets done.
References
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2), 155-170.
Hatano, G., & Inagaki, K. (1986). Two courses of expertise. In H. Stevenson, H. Azuma, & K. Hakuta (Eds.), Child development and education in Japan (pp. 262-272). Freeman.
Kellogg, K. C., Valentine, M. A., & Christin, A. (2020). Algorithms at work: The new contested terrain of control. Academy of Management Annals, 14(1), 366-410.
Rahman, H. A. (2021). The invisible cage: Workers' reactivity to opaque algorithmic evaluations. Administrative Science Quarterly, 66(4), 945-988.
Roger Hunt