Look past the technology.
In this episode, we talk about human factors in engineering and learning from incidents with Nick Stenning, site reliability engineer at Microsoft, working on Azure, and Laura Maguire, researcher at Jeli, an incident analysis platform.
Ben Halpern is co-founder and webmaster of DEV/Forem.
Molly Struve is senior site reliability engineer at Netflix and former head of engineering at Forem. During her time working in the software industry, she has had the opportunity to work on some challenging problems and thrives on creating reliable, performant infrastructure that can grow as fast as a booming business. When not making systems run faster, she can be found fulfilling her need for speed by riding and jumping her show horses.
Laura Maguire is experienced in a wide range of systems design methods and techniques to support human performance in high risk/high consequence work environments. She has led project teams across a variety of domains in the identification and development of systems improvement initiatives. Her doctoral work focused on resilience engineering helping organizations cope with complexity, adapt at the pace of change and improve industrial systems performance.
Nick Stenning is a software engineer with interests in resilience, human factors in reliability engineering, and most things infrastructure. He's currently working at Microsoft. He was at Travis CI, Hypothesis, and the UK Government Digital Service.
[00:00:00] NS: And software is in, from my perspective, a similar place right now, where we in many cases are looking quite narrowly at the sources of either failure or success.
[00:00:22] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all of our lives as developers. I’m Ben Halpern, a Co-Founder of Forem.
[00:00:29] MS: And I’m Molly Struve, Head of Engineering at Forem. Today, we’re talking about human factors in engineering and learning from incidents with Nick Stenning, Site Reliability Engineer at Microsoft working on Azure, and Laura Maguire, Researcher at Jeli, an incident analysis platform. Thank you both for joining us.
[00:00:48] LM: Thanks for having us.
[00:00:48] NS: Thank you. It’s great to be here.
[00:00:50] BH: Before we get started, can we hear a little bit about your background? So Nick, can you start us off with your background as a developer?
[00:00:56] NS: So I’m currently an engineering manager in SRE, in Site Reliability Engineering, at Microsoft. And my career as a whole has mainly revolved around infrastructure, production systems, occasionally putting out fires, and understanding what it takes to build teams that can do that well. Before I was at Microsoft, I was at some smaller companies. Many of your listeners might know Travis CI. And before that, I worked for the UK government for a bit, as well as a string of nonprofits.
[00:01:28] MS: Can you talk a little bit maybe about some projects you’re currently working on at Microsoft?
[00:01:32] NS: What I and my team work on at Microsoft is a service called Azure Resource Manager, which is, if you like, the API gateway. It’s the front door to all of the Azure control planes. So we’re a very fancy reverse proxy service: you send a request to create a virtual machine, or create a storage account, or update a virtual machine to be a bigger virtual machine, and we route those requests to the things behind the scenes that actually make that happen.
[00:01:56] BH: And Laura, can you tell us about your background?
[00:01:59] LM: I’m a relative newcomer to the field of software engineering. I spent almost the last 20 years working in high risk, high consequence industries in the physical world. So places where, when there are safety issues or failures, things explode, they crash, they have high-speed impacts with equipment or with people. There are catastrophic failures in those worlds. And as I was working in that domain, I started to realize that there was a whole other body of work, a body of research that was looking at safety and risk management from a whole new perspective. And that led me to go back to school, to finish my PhD under the advising of Dr. David Woods, who’s done a lot of work in looking at sort of patterns of how people cope with complexity, how people manage risk, that’s applicable across a lot of domains. So that’s really what led me to software engineering and to looking at how it is that software engineers can detect, diagnose, and repair failures, much like a pilot might behind the controls of a cockpit.
[00:03:12] MS: Can you tell us a little bit more about Jeli and what your role there is?
[00:03:17] LM: Yeah. So at Jeli, we are building an incident analysis platform. So we’re looking at how do we support deeper, richer understandings that go beyond just what component broke, how do we repair that, how do we fix that and prevent that from happening, to look more broadly at how do people within that system of work manage and handle these incidents. So how do they coordinate with others? How do they recognize when their mental models of the system are limited in some ways and how do they work collaboratively across a variety of roles, across a variety of functions within the organization to reduce the amount of time it takes to resolve software outages?
[00:04:02] BH: So today, we’re talking about human factors in engineering, which is a phrase or a term that if used generically could refer to practically anything we do in our field. But today, we’re going to break down this discussion with more specificity. So Laura, can you sort of introduce us to some of the topics we’re going to be covering today?
[00:04:22] LM: So Nick and I were talking about what the key ideas are that we want to get across. If someone’s never heard of human factors in software engineering, what are the most important takeaways that we want them to have? And it kind of goes back to talking a little bit about the origins of where human factors started, how it began to get applied in software engineering, what some of the key concepts are around mental models of individual engineers, the need to support cognitive work within software engineering, how people handle and cope with complexity. And we wanted to share some of the research that’s going on already within the field, and some of the practical examples that we see from our work, both within our own organizations and within the learning from incidents community more broadly.
[00:05:14] NS: The interesting thing to me is, at least from my reading, and I should emphasize in all of this that Dr. Laura Maguire is the expert here and I’m a practitioner who’s trying to learn how to apply this stuff. The interesting thing for me is that we are in a situation, very different in some ways, but in some ways quite similar to the situation that some of these other industries, these high consequence industries that Laura is talking about, either are still in or, in some cases, were in some time ago, where those who worked in those industries and those who studied those industries, in particular the fallout from high consequence events, accidents often, began to notice that the tools and the models of thinking and the ways of seeing that they were applying to those things weren’t actually resulting in improved outcomes for the people and the societies around those systems. They began to notice that focusing exclusively on technical failure, for example, didn’t lead them to understand how those systems failed. And that’s where the human factors term comes in: it’s thinking about the factors in those accidents that were human or related to humans. And software is in, from my perspective, a similar place right now, where we in many cases are looking quite narrowly at the sources of either failure or success in our industry, and human factors, or more broadly systems safety, resilience engineering, is about understanding in a much more holistic sense what it takes to be successful, what it takes to create safety in those systems.
[00:06:47] LM: Also, there’s kind of like this safety nerd street fight that’s going on around the concept of human factors. So I would say if you went and polled a bunch of human factors researchers and said, “Is it what Nick just described? Is this what human factors is?” you probably would get a lot of disagreement because, I’ll use an oversimplification here, there’s sort of like two ends of the spectrum, and one end of that spectrum says, “Human factors is about ergonomics. It’s about tool design. It’s about the setup of your desk to reduce musculoskeletal injuries.” There might be some cognitive ergonomics that goes on in there, which is mostly around restricting and limiting humans from being fallible, from making mistakes, from doing their human thing, which is largely to disrupt these perfect systems that we design for safety and risk management. And there’s another end of that spectrum that says, “Actually, those systems are always operating in degraded modes.” Those systems are never going to be able to fully capture the dynamic nature of the real world. The tools that you give to practitioners in any field of practice are always going to be limited in some way. So what actually makes these systems safe, what makes people able to manage risk effectively, is their ability to adapt and to adjust and to be able to meet the conditions as they are when the world is underspecified and when the world changes in ways that are very difficult to anticipate.
[00:08:23] BH: Can we get a bit of history on human factors in engineering? So when did this field have its origins and what were some key moments of evolution?
[00:08:35] NS: My understanding is that it grew out of the worlds of operational research and ergonomics and human factors during and after the Second World War. So one of the stories that I’ve told people to try and explain to them what some of this field is about is the story of B-17 accidents during the Second World War, where these B-17s would keep crashing on landing or shortly after landing. And to cut a long story short, it turned out that what was happening is that the pilots were confusing two switches in the cockpit, one for another. They were retracting the landing gear when they meant to retract the flaps. And it took a very long time. It took hundreds of incidents before somebody thought that maybe they should go and have a look at the cockpits and interview some pilots and understand what was going wrong. But I think the more recent history and the more relevant history of the field is the Three Mile Island incident. In the aftermath, the people who worried about safety in nuclear systems were faced with this problem that there wasn’t an easy way to explain what had happened at Three Mile Island that didn’t involve relating the organizational context of the people who worked there, the history that those people had had working in, for example, the nuclear navy, which gave them a bunch of heuristics that they applied to a reactor for which they made no sense at the time. And that opened a conversation which included such researchers as Jens Rasmussen. I suppose it would be fair to call Jens Rasmussen the father of the field of systems safety.
[00:10:16] LM: I think your example there from like the early sort of cockpit design, that’s a good basis to build off of, because if we think about aviation, the change from flying a plane sort of stick and rudder style, where you’re getting a lot of feedback from the plane, you’re feeling different aspects of pitch and yaw, the orientation of the plane, to modern aircraft, where you are flying by banks of sensor systems and computer screens, is a good kind of representation of the ways in which automation and digitalization have transformed the nature of a lot of work in these high hazard industries. And they’ve created a lot of safety for us in a lot of different ways, but they’ve also created complexity. They’ve created a lot of hidden interdependencies. So this kind of complexity changes the way in which the human operators are actually taking in information about the world. So they’re no longer getting haptic feedback. You might not be able to see the process that you’re trying to manage directly. You can’t feel the heat from the engine or the boiler or whatever it is you’re managing. So instead, you’re relying on your sort of automated components to be your eyes and ears and sensor systems. And then all this perceptual data is now getting represented as pixels on a screen. So that’s changed the nature of how we design the work, of how we think about supporting the ability of the human operator to be able to interpret change and events, to be able to understand the meaning of the trajectory of an increasing temperature or a changing slope angle or whatever that kind of sensor reading is, to be able to interpret that in real time. While you are getting the benefits of having these highly automated systems, things are degrading faster and they are degrading in ways that require the human operator to be in the loop at all times and to be able to sort of understand what’s happening very quickly.
So these kinds of additional cognitive demands are really, I guess, at the core of how we think about this sort of new view of safety. And Nick, as you did say, I think Jens built a lot of that kind of early work, and Annelise Mark Pejtersen, David Woods, Erik Hollnagel picked it up. There’s a lot of really great work that’s being done in a number of different domains right now.
[00:12:53] NS: You’ve alluded to an area, or a theme maybe I should say, running through this research. There’s an idea that persists to this day in the software industry that when things go wrong and there are humans in the loop, it’s the humans in the loop that are the problem, and you will solve your problems by automating the humans away. And Lisanne Bainbridge wrote a famous paper called “Ironies of Automation,” I think, in the early to mid-eighties, which talks about how you can automate systems, but the systems where automation removes cognitive requirements from humans in a way that is guaranteed to be safe are incredibly rare, if they exist at all. And actually, what you end up doing is moving the cognitive requirements of the humans around in time and space, and sometimes in ways that actually make it harder for them to do their jobs.
[00:13:42] MS: It’s interesting you bring up that concept of removing the cognitive load from humans. I know there’s been a lot of debate in the airline industry that automating a lot of the systems has taken away that need for pilots to really understand on a physical level what the aircraft is doing. And that in some instances, that has led to accidents because the pilots were not able to properly assess the situation. Can you speak to that at all in terms of software and how you find that balance between automation and then also ensuring that there’s still some cognitive load for the humans? Because, obviously, at this point, there’s nothing that can replace that kind of decision making.
[00:14:26] LM: Yeah. So that’s a really good question, Molly. It’s not about sort of either or, it’s about how do we share, right? So a really good example of that is your automation is now a copilot. And if your copilot starts to do something but doesn’t tell you, doesn’t show you, and you’re not able to look over at them and see what they’re doing and what they’re going to do next and see physically how they’re moving about the cockpit and what they’re looking at, you have very little ability to anticipate what they’re doing and why. So you become a supervisor of this system, which is effectively what that kind of perspective does. It says, “Let’s let the machines do these tasks and activities, and then the human supervisor kind of jumps in and corrects when the computer fails to understand the context or is getting a faulty reading and may not understand it.” That kind of change to a supervisory role can be a sort of a difficult one because there’s ongoing sort of information that you need to be able to keep track of, what’s happening at what points in time, so that you’re able to make sense of it when you are forced to make a decision or when you’re forced to jump in and take over control from the automation.
[00:15:44] NS: Outside of software, perhaps one of the most visceral examples I can think of, of this, is Air France Flight 447, which I think anybody who is interested in accidents probably has heard some of the story of before. I’m going to massively oversimplify it for the purposes of this discussion and say that what happened in Air France 447 was that automation behaved in a way that confused the pilots. So this aircraft was flying from Rio and it flew through a storm and the pitot tube, which is designed to measure air speed, froze over. And because it froze over, the autopilot put itself into a different mode from the mode that it was usually in. It is believed by the people who studied that accident that the fact that the autopilot was then in that alternate law affected one of the co-pilots’ mindsets, or his belief about what he could and could not do to the plane, in a way that led to him putting the plane into a stall for a period of multiple minutes until they eventually crashed into the ocean. And without wanting to get too much into that specific example, I see things that look like that, but which are hopefully significantly less fatal, all the time in software, where a system is believed to behave in a certain way and then some smarts in that system react to something else that’s happened and put it into a mode where it’s behaving in a different way, and then an operator will interact with that system, assuming that it’s behaving how it normally behaves, and everything will go wrong. Or to be even simpler about it, there are cases where we build automation in order to help ourselves that then runs away and destroys everything. I’ll give a concrete example. Just the other day, I was involved in an incident where the same tool was responsible for changing a bunch of secrets across a bunch of different regions very quickly all at once. And then that same tool was used to change them back or to roll them over again.
And in the first case, that was a tool that was sharp edged and allowed us to break things in a way that we wouldn’t normally want to break things. But the fact that we had that tool allowed us to then go and fix the problem. So on the one hand, it’s behaved in a way that has surprised us that resulted in a bad outcome. But then if we were in that position to start with and we didn’t have automation to help us fix it, we would have been worse off than we were.
[00:18:40] MS: It feels like kind of a common theme is automation, but also like with clarity. So having an understanding of how things are automated so that when that automation clicks on, there’s a clear understanding of what has changed within the system as opposed to blind automation where things change and we’re still trying to work with stale data.
[00:19:02] NS: And for what it’s worth, I think that is exactly why people like Charity Majors, to cross the streams a little here, are so right to harp on about observability being the thing you should care about. Being able to see into your systems and understand your system’s state is far more important than any kind of common heuristic monitoring.
[00:19:22] LM: That’s why understanding human performance in these systems is so important, because you are going to have unexpected events, you’re going to have unanticipated surprises. And when the system is always changing, your mental model of how it works, what mode it’s in, what it’s going to do next is always going to be partial and incomplete in some ways. And that’s not necessarily a bad thing. What we want to be able to do is help engineers recognize very quickly when their model of the world is outdated or is in need of updating, or when they need to recruit other folks who have expertise or experience or recency working in a part of the system that might be failing or might be sort of contributing, and very quickly be able to bring those people into the event, bring them up to speed, and make their knowledge and expertise relevant as quickly as possible. So this is where we start to say, “Yeah, the coordination matters. The collaboration matters.” The ability to sort of see what other people are doing and how they’re doing it, and make sense of that in real time, is what is going to help reduce mean time to response. It’s going to help improve reliability and performance in these systems and ultimately improve the ability to adapt in real time as new challenges and surprises threaten the capabilities of that system.
[00:20:52] BH: Yeah. I’m sure we could probably go into a lot of specific details. We could reference pretty much any incident and we probably don’t need to go down too many rabbit holes. But as we were talking, I kind of wanted to bring in one that came to mind for me. Basically, I was driving home from a concert with some friends who were in the front seat driving. It was about 11:30 at night and they had Waze on to kind of tell them how long it was going to take to get home. We’re about 40 minutes from where we need to be and then there starts to be this traffic backup on the highway. And eventually, we come to a complete stop. And there was an accident a few miles up the road. And that was obviously like outside of the norm, but the way that Waze started acting was that it gradually just kept adding to the ETA to get home. And eventually, it got to like six hours to get home. And from my perspective, just like thinking through the situation, my interpretation is that Waze has lost all confidence. It has no idea and it’s handling the situation by just continuously incrementing its ETA, because all it knows is it’s in a standstill. It has enough data to tell you it’s going to be awhile. But from my perspective, understanding that Waze is probably just out of information at this point. It has no idea if it’s going to be five minutes or an hour before this accident gets cleaned up. My interpretation was that it just didn’t know. And obviously, it’s not going to be six hours for us to get going on the highway. But the people I was with really were just starting to freak out, like, “Are we going to be here till tomorrow morning sitting in the car?” They just sort of trusted Waze enough to think that it was accurate. The best way to handle it is probably to have a scenario where it informs you that it just doesn’t know. It’s hard to determine how long it’s going to be at this point. It is no longer accurate. 
It’s generally pretty accurate, people trust it, and then when it’s not accurate, maybe it can just tell you it’s not accurate instead of guessing until it’s completely off.
[00:23:04] NS: You know what was interesting about that to me is the fact that you as a software engineer looked at it and went, “Yeah, it just doesn’t know.” And people perhaps with a different experience thought, “Well, it knows and it’s going to be six hours.” And to me, it says a bunch of things, but one of the things it says is that we have not in any way normalized the idea that software systems might be able to tell us how confident they are about stuff. And perhaps that might be a good marker when we’re talking about automation and how automation can play well with human operators and build effective joint cognitive systems. One of the ways they can do that is they can recognize when the situation they’re in is one that they weren’t ever really built to handle. And they can say, “Nope, no idea. Human needs to look at this one.”
[00:23:53] LM: Yeah. And I think to go back to the question that Molly asked earlier about kind of the teaming, it’s about, does your software just say, “Bah! I don’t know,” and give up and drop control on the human operator at a point in time where it’s very hard for them to recover, to jump in on it? Or does it start to signal, “Oh, we’re slowing down here. Have you decided to push the car home? Is that why it’s going to take us six hours?” It lacks the context. And so it’s about designing these interactions and designing these sort of capabilities for human and machine to work more effectively together. But I think it also points to the implicit models that software engineers build into technology about the state of the world and about how the world changes and about the kinds of challenges that the product is going to be expected to handle. And when the boundaries of what it was designed for are reached, it becomes quite brittle. That’s a very nice example of how you reach the edge of the operating envelope. It no longer has context for what’s going on in traffic and it just kind of falls apart and becomes quite useless for you as a user.
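[Editor's sketch] The idea discussed above, software that reports when it has left its operating envelope instead of guessing, can be illustrated with a small Python example. This is a minimal sketch under stated assumptions, not a description of how Waze or any real navigation product works: `EtaEstimate`, `estimate_eta`, and the threshold parameters are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EtaEstimate:
    """Hypothetical result type: an ETA plus an explicit confidence flag."""
    minutes: Optional[float]  # None means "no trustworthy estimate"
    confident: bool
    message: str


def estimate_eta(remaining_km: float,
                 recent_speeds_kmh: Sequence[float],
                 min_speed_kmh: float = 5.0,
                 min_samples: int = 3) -> EtaEstimate:
    """Estimate arrival time, but say so when the inputs fall outside
    the range the estimator was designed for (all thresholds are assumed)."""
    if len(recent_speeds_kmh) < min_samples:
        return EtaEstimate(None, False,
                           "Not enough data to estimate arrival time.")

    avg_speed = sum(recent_speeds_kmh) / len(recent_speeds_kmh)
    if avg_speed < min_speed_kmh:
        # Dividing by a near-zero speed would produce a huge, meaningless
        # ETA, so report "unknown" instead of an ever-growing number.
        return EtaEstimate(None, False,
                           "Traffic is stopped; arrival time is unknown.")

    minutes = remaining_km / avg_speed * 60
    return EtaEstimate(round(minutes, 1), True,
                       f"About {minutes:.0f} minutes remaining.")
```

In this sketch, a caller would check `confident` before showing a number, and show `message` to the person in the car when the estimator declines to guess.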
[00:25:10] NS: I’m going to give a concrete example of that from a recent incident, and this is something that happens all the time. If you have a system that for whatever reason has overloaded itself and is pegged on CPU, one of the things you will lose from that system is any kind of telemetry. And usually, that situation, “my system is burning CPU and I have no idea why,” is one that the computer cannot resolve. That’s a situation that requires a human to be able to understand what’s going on, probably, and the systems that we build today usually, not always, but usually, are going to lose telemetry at that point. It’s going to be a thing where the one moment where you really need that information in order to work out what’s going on is the moment that you don’t have it.
[00:25:56] LM: One of the main reasons we wanted to talk about this is because there are a lot of possibilities. There’s a lot of potential to be thinking about sort of human performance in software engineering in a very different way. And I think maybe some of the pessimism kind of stems from this idea that we may not be able to take full advantage of this body of research, of kind of new ways, new paradigms of thinking about things. And I’m sort of interested in why you think that’s the case within software engineering. What do you think some of the barriers to actually moving forward with this are?
[00:26:36] NS: I think software in general has an interesting relationship with risk, and broadly in our industry there is this idea that big bets are rewarded and that taking risks is a good thing. And as a result of that, the places that we put software in the world are now almost unlimited. And I don’t know that there’s a great deal of reflection on what needs to be true for us to do that in a way that is safe and responsible. I guess my fundamental concern is that those incentives are not aligned with the desire of people who care about what it takes to do this safely, to learn from the research and to apply that in order to do what we do without potentially creating horrible accidents. It’s less about the field specifically than it is about the incentives that surround that field and the broader push for growth and centralization and more complexity.
[00:27:43] MS: I think you also have this aspect that, as an engineer, you’re just sitting in front of a computer screen. It’s not quite as tangible how much infrastructure you’re necessarily reaching, how many lives you’re actually impacting with that software, when all you’ve got in front of you is just a terminal. But with that terminal, the amount of infrastructure that you can hit across the globe these days is just incredible. And that kind of leads me to another question I have for both of you, which is why do you think this field is suddenly new to the software industry? Do you think it’s because we haven’t really lost lives yet? Or how is it that it’s just newer now?
[00:28:26] LM: From my perspective, I think the consequences are increasing, to your point, in that more and more critical infrastructure in society is becoming digital or moving to the cloud and sort of engendering these new kinds of threats to reliability, to stability of the systems. And because we’re operating at scale, the consequences of those outages are much broader. I think there’s a number of kind of different factors that feed into that as well. One of which, and it’s not inconsequential, is that some of this literature has started to permeate into the field. So I think it’s what’s happening sort of in the world around us: the technology has enabled greater reach, complexity is increasing, and there’s interest in being able to design these systems and to use methods that have been applied to other industries for the last 40 years to understand kind of how do we better support cognitive work, how do we better support coping with complexity, as opposed to how do we just get engineers to follow the playbook or how do we just keep the documentation from getting stale. So there’s different kinds of thinking about how do we solve the problems as they get harder and harder.
[00:29:48] NS: I think that there are two answers to that question. There is, “Why is software looking for these things?” and then, “Why are researchers interested in software?” So I’m not going to answer the second question because I think there’s somebody more qualified who can answer that one. But why is software looking for this? I think it’s simply that the complexity and scale of the software systems that we are working on now is reaching a point where it’s really hard to ignore the fact that the narrow tools and approaches that we’ve used up till this point are not really working. I work on a cloud, as it were, and there is barely a day goes by when we don’t have some significant complex failure that involves multiple teams interacting in order to get things working again. And very few of those incidents, that I get to see at least, are things that can be meaningfully understood by just looking narrowly at technical failure. I think there are more proximate explanations for how this came to be a thing that people in software are talking about, including some of our colleagues. John Allspaw, for example, has had no small part to play in exposing this research to people within software. But when I think about the broader picture of why software is worrying about this stuff, I think it’s because we’re recognizing that the systems that we work on have some of the same properties as these different high hazard fields, which means we need to be thinking in the same ways.
[00:31:15] LM: I agree with that, Nick. And I think from a researcher’s perspective, there’s some characteristics of the domain that we just don’t see in other worlds. The first, I guess, is like you’re operating at speeds and at scales that far exceed a lot of what we do in sort of physical infrastructure or physical kind of safety worlds. And so this, obviously, extends the problem. It helps us think more broadly about how can people cope with complexity, what are the limits of human operators in these environments. A large part of how we’ve dealt with scope and scale and speed in physical worlds has been adding automation, being able to kind of work at a distance from these high hazard processes. So that has kind of created this whole body of research and interest area around how do we create effective teams between humans and automation. So the fact that software engineering is working at speed and scale, working in highly automated systems, and is just so prevalent in terms of what kinds of critical digital infrastructure you are managing within society, or increasingly managing within society, makes it a very interesting domain to study. The other aspect about software engineering that’s super interesting from a research perspective is there’s a lot of built-in traceability to what you do. If we’re looking at, say, trying to understand coordination and cognition within, say, an emergency room, when a patient comes in and they’re bleeding out or there’s an undiagnosed illness, you have a team of experts that are rapidly bringing their skills and knowledge to bear in a very unstructured kind of problem space. But unless you, as a researcher, are standing there with a video recording or you’ve got audio recordings, which obviously have a lot of implications around privacy and sort of data sensitivity in that world, a lot of that coordination and collaboration is ephemeral.
So you’re going to lose a lot of the data that you have about these kinds of subtleties, of how people interpret the world, how they interpret others around them and their actions or inactions. So we have a lot more traceability. We have log files. We have Slack transcripts. We have Zoom recordings. We have dashboards. We have monitoring. So there’s a lot of data that’s available to us as researchers to be able to lay out some of the cues and some of the coordination mechanisms that are being used to manage these really high consequence, high impact service outages.
[00:33:54] BH: Do you find that the work you get to do in software development for these purposes with logs, with a lot of written communication, et cetera, is that helping the overall human factors space in general? Is it learning more from software development that can be applied elsewhere?
[00:34:10] LM: From my perspective, we are in very early stages of how do we use this data and how do we actually help organizations and ultimately individual engineers to be able to be better at coping with complexity, to be able to deal better with handling the challenges of rapidly changing, often rapidly decompensating situations where there’s cascading failures, there’s interactive failures. There’s well-meaning actions that are being taken that can sometimes have unanticipated effects on the system. So we’re in really early days at how to use this data.
[00:34:52] NS: I think the big question is whether or not the research, and in particular, your application of the research, is going to actually penetrate into software fast enough to have a positive impact before we have the software equivalent of our Three Mile Island incident or something like that. Working software engineers all kind of know this at the back of their heads, but what Laura’s talking about when she talks about cascading failures and the scope and speed with which impacts can propagate in software systems, it’s actually really hard to grasp how big a change that is from some of these other systems that we work on. Even the worst safety accidents in history were, for the most part, regional, with the exception of a couple of major radioactive releases. But actually some of these systems that we work on, there are 999 dispatch systems and hospitals around the world that are built on cloud technology. And as much as I would love to sit here and say that a global outage of a cloud provider is an unlikely event, I think in the fullness of time, it’s not an unlikely event. There’s a security researcher who I have some interactions with at the University of Cambridge, Professor Ross Anderson, who spent a lot of his time advising governments and companies on the security of digitization systems for records and in particular patient health records systems. And one of the points he would always make is if you want to steal a bunch of patient healthcare records, if they’re on paper, you need a truck and an incredibly sophisticated logistics plan to travel around the country to all of the general practitioners’ offices and steal all of the medical records. But once you digitize medical records, sure, that brings you all kinds of great benefits, but the sword has two edges. And if you build your system in such a way that it’s possible, and this is a very easy mistake for somebody to make, you can just steal all the records at once.
I think a lot of the software systems that we build, we haven’t recognized that all of these incredible benefits that building things out of software gives us, they can also come back to bite us in a really horrible place.
[00:37:19] BH: Can we talk a little bit about how we solve for the impulse to not look at some of these things as human factors in the first place? And then we can maybe touch more deeply on how this research practically reaches the teams and almost the broader society as needed. But let’s talk about the impulse to ignore the human factor.
[00:37:48] NS: Oh, I can talk about that all day. For me, I think one of the strongest imperatives in any business after an accident is the imperative to get back to business as usual. And yes, everybody wants to be seen to reflect and analyze and understand what happened after an incident or an accident. But there is also a very strong incentive for the business to get back to what it is, and I’m using scare quotes here, “supposed to be doing.” And one of the challenging aspects of the research is that it says, to quote a T-shirt that Ben Goldacre once wore on stage, “I think you’ll find that it’s a little more complicated than that.” It says that most of the stories that we tell ourselves about how incidents play out are oversimplifications, and oversimplifications are not helpful if what you want to do is understand, but they are helpful if what you want to do is get your analysis over and done with and get back to work.
[00:38:49] LM: Yeah, I would agree with that. I think what I have seen from working with a lot of software organizations is that it’s a fundamental threat to recognize that, as Richard Cook has said, “It’s not a surprise that the system went down. It’s a surprise that the system stayed up at all, that the system is up at all.” And so recognizing the sort of fragility and the brittleness of the systems in relation to the demands that are placed on them, the need for near-perfect reliability when you’re talking about doctors being able to access patient records, when you’re talking about financial markets, when you’re talking about communication systems during a hurricane response, any of these kinds of critical infrastructure, it is inherently threatening to think about their fragility. A really key aspect to thinking about resilience is recognizing the adaptability and the capability of humans to be able to recognize the sort of bounds of their knowledge and capabilities and to be able to adjust, to be able to recruit new resources, to be able to match the changing demands and conditions as those evolve over time. We’ve talked at length about the brittleness and the difficulties of automation in doing this. So when you start to look closer at software incidents, it’s not necessarily all about the component failure. It’s about how we enhance the adaptive capabilities of the people with their fingers on the keyboards to be able to recognize what’s happening and to be able to adjust performance as those demands change.
[00:40:41] NS: I think there’s another aspect and I’m going to add another layer on to this, which is I’ve given us a cynical take, which is also probably I think a realistic take, which is that businesses want to get back to business as usual. But there’s another aspect of this, which is that humans are not rational analysts. Humans are deeply heuristic creatures. And I can tell you from my personal experience that when I am investigating after an incident, it is not that I do not think, “Why on earth did that person do that? That was such a stupid thing for them to do.” I do think that, and then I hold my tongue and I reserve my judgment and I ask the question, “Tell me what you were seeing when you did that thing.” And if I hold my tongue for long enough, then I discover, “Okay, well, actually it makes perfect sense because they did this last week and it worked just fine,” or whatever. Right? But I think one of the other reasons why it’s hard for people to apply the ideas from engineering research in practice is that that’s just not how our brains work. We see intent where there is none. We pattern match. We often jump to conclusions that are not even remotely true, and we haven’t practiced and developed the abilities to reserve judgment and just ask the next open-ended question in order to discover that our assessment was wrong.
[00:42:06] MS: When you both are really looking at kind of incidents and breaking them down, what sort of things are you looking at? What sort of data are you collecting to help you kind of assess the whole situation and what happened?
[00:42:19] LM: So typically, when ChatOps is used to manage incidents, the Slack or Teams transcript is a really good place to start because it is a kind of externalization of what people are thinking or doing. So we talked about that traceability a little bit earlier, so it kind of lays the foundation. We can look at log files. We can look at GitHub. We can look at dashboards. We can look at any kinds of queries that were run at different points in time. And really what we’re trying to do there at a high level is a practice called process tracing, which is basically to pull out all of these hidden aspects of a very dynamic, multi-layered, multi-activity event where things are happening rapidly and they’re happening concurrently, and we’re trying to lay this out to say, “How was people’s attention directed at different points in time? Who was coming in and out of the event over time and whose knowledge and experience became relevant to help with different kinds of problems as this event progressed?” So at Jeli, we’re working on kind of like pulling together a lot of different disparate sources to be able to help that process, to kind of lay that out as an engineer who is investigating these kinds of things. You start to immediately see the complexity and you start to immediately see the ways in which relationships matter and the ways in which one person will say something, “I think it’s X,” and someone else interprets and recognizes that and sort of brings the thinking forward to the next point. And he said some things that I find a little contentious, but that’s another podcast. There are points where our thinking does fail us, where we over-index on heuristics, where we kind of go down a garden path, where there are red herrings that distract us. But when we start to lay these things out, and again, this is retrospective, so we’re doing it not with hindsight, but rather with curiosity, we can start to see how these things made sense.
So then it helps us think about our systems and say, “How do we minimize these potential pitfalls? How do we maximize the ability for people to cross-check each other in real time, for people to interject with minimal amounts of cognitive overhead for the others, for us to raise the amount of signal that comes out of these kinds of very noisy, frantic events or Slack rooms where there may be multiple competing priorities going on? And how do we help focus the efforts and attention moving forward in ways that help people adapt in real time?”
[00:45:07] NS: In an attempt to drag myself perhaps partly out of the doghouse, because I think I know what Laura is referring to when I was talking about heuristic things: I’m not really talking about people in the context of incidents, because I think there’s a lot of application of expert knowledge going on there that I’m not trying to talk about. I’m saying that as an investigator, it’s not that I am somehow a perfect satisficer or minimizer when I’m understanding what other people have done. And Molly, you were asking, like, “What do you do? What are you looking for?” My goal, first of all, is to try and understand what happened. And that typically starts with getting a sense for the timeline of the incident from any of the available data that doesn’t involve talking to people, which I think is kind of objective and clear, like marks on some incident logs. And then I will go and talk to people and ask them what their experiences of the incident were. And I will not assume that because somebody told me that it happened in such a way, it happened in such a way. I will ask other people who were there for the same event how the same thing happened and see if they have a different perspective, because that may reveal something interesting about what’s going on. And hopefully between the data that we have available from the technical systems and the audit systems and the interviews with those people, you then have enough of an idea of what’s going on to get all of those people in a room to then talk about it and try and understand whether or not they have the same perspectives on the incident or different perspectives on the incident, talk about what was hard, what was easy, what went well, what didn’t go so well.
But the process tracing aspect of it that Laura is talking about and that Jeli is really building on seems to me, even though I’m not in any way an expert on process tracing, to be one of the most important things. If you can describe what was happening when, potentially in ways that are not objective but are just representing different people’s views of what was happening when, you start to see why people behaved in the way that they did. You start to make sense of things that perhaps on a first pass didn’t make sense. And I think it’s hard for somebody who hasn’t been through that process to understand how true that can be, and that actually sometimes just doing the work to write down concretely what happened in what order, who was seeing what, when, can reveal a bunch of interesting things about an incident in ways that you would not expect if I just told you that.
[00:47:30] LM: Yeah. And I want to answer a question that you didn’t quite ask, Molly, but that I think you’re kind of pointing to, which is like, “Well, what do we do with this?” Right? So what we’re saying is that incidents can help us learn, that engineers’ individual mental models are going to be partial and incomplete in some way. So everyone has a little bit of a different perspective or lens on the problem, but how do we get around that? Do we just train everybody? No. We can take that data and we can say, “Are our teams structured in the right way? Do we have the right mix of junior and senior engineers, of cross-functional roles that have enough common ground and common language across their different functions to be able to work really effectively together, to recognize what kinds of information are going to be relevant to help solve the problem and what kinds of information are just going to be distracting or disruptive to an event?” It helps us recognize whether we have over-relied on a specific team or on a specific individual because of their really deep knowledge about a recent change or about a legacy system. It helps us to kind of surface where we have these knowledge islands within our organization and to distribute that a little bit more broadly. It can help us with resourcing. If we have a planned initiative coming up that’s going to really draw on a specific team, do we have enough kind of backfill and buffer to be able to cope with problems that are inevitably going to come up? Can we distribute some of the workload? We’re always working with trade-offs. There’s no organization that has unlimited time and unlimited amounts of money to fix their code and design their system. So how do we think about those trade-offs and handle those trade-offs as best we can so that we don’t have to do that in the heat of the moment?
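The timeline-building step described in the turns above (pulling chat transcripts, alerts, and deploy logs into one ordered view, then seeing who entered the event when) can be sketched roughly as follows. This is only a minimal illustration, not how Jeli or either guest actually does it: the `Event` fields, the source labels, the helper names, and all of the sample data are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime  # when the event was recorded
    source: str   # hypothetical source label: "chat", "alerts", "deploys"
    actor: str    # person or system that produced the event
    text: str     # what was said or logged


def build_timeline(*sources):
    """Merge several per-source event lists into one list ordered by time."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep source order.
    return sorted(merged, key=lambda event: event.at)


def participants_over_time(timeline):
    """Return (actor, first_seen) pairs in order of first appearance."""
    first_seen = {}
    for event in timeline:
        first_seen.setdefault(event.actor, event.at)
    return list(first_seen.items())


def ts(hhmm):
    """Build a timestamp on an arbitrary, made-up incident day."""
    return datetime.fromisoformat(f"2021-01-01T{hhmm}:00")


# Invented sample data standing in for real exports.
chat = [
    Event(ts("10:07"), "chat", "alice", "seeing 500s on checkout"),
    Event(ts("10:12"), "chat", "bob", "I think it's the config change"),
]
alerts = [Event(ts("10:05"), "alerts", "pager", "error rate above threshold")]
deploys = [Event(ts("10:02"), "deploys", "deploy-bot", "config change rolled out")]

timeline = build_timeline(chat, alerts, deploys)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.actor}: {event.text}")
```

A merged view like this is only the starting point. As both guests stress, the interviews and the differing perspectives of the people involved are what give the timeline its meaning.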
[00:49:30] MS: Before we wrap up, are there any resources that you all would recommend for folks that might want to learn more about human factors in engineering?
[00:49:38] NS: I think we absolutely have to start with the community writing blog posts to introduce specifically people in the software world to this: learningfromincidents.io, run by Nora, who is a colleague of Laura’s. There are blog posts written by a whole raft of people who are at the intersection of software and safety systems. There are links in those blog posts to all kinds of additional resources. So it would be a great jumping-off point to understand what we know and what we don’t know.
[00:50:10] LM: Yeah. I think Lorin Hochstein has done an excellent job of curating a resilience engineering repo. He’s done a lot of reading himself. There’s a number of people who have contributed to that, which kind of lays out some of the central canonical pieces from the field. What I always come back to, and this is kind of independent of whatever domain I’m working in, is The Field Guide to Understanding Human Error by Sidney Dekker. And that is a really good entry point for anyone who is kind of like, “I don’t know if I fully buy into the idea that there is no human error. What’s so wrong with root cause? And okay, I agree. What do I do next?” It’s a really nice, compact, easy read. And then Behind Human Error by Woods and the whole collection of folks at Ohio State kind of builds off that and really goes a little bit deeper into some of the things that we were talking about like cognition, complexity, automation, and automation surprises. And I think those three or four resources are really good entry points for anyone interested in this.
[00:51:18] NS: In the realm of practical analyses of real-world events, there are two that I would recommend. I mean, they’re very, very different, but they have the same flavor, which is that each presents a thing that is seemingly unexplainable and then over the course of a few hundred pages explains it. The first is Friendly Fire by Scott Snook, which talks about the friendly fire shoot-down of some Black Hawk helicopters. And when you first read the story, you think that is absolutely ridiculous. There’s no way that could possibly make any sense. And Scott explains it. The other one, which I’m a little hesitant to suggest because it’s a bit more of a tome, is The Challenger Launch Decision by Diane Vaughan, the book that really got me into this field. John Allspaw has some blame for that, but that book, particularly if you’re interested in space and the shuttle, is an absolutely incredible book that also has the same format, which is that at the start of the book Vaughan describes what happened the night before Challenger launched, and it seems just as unexplainable or frustrating. And then over the course of several hundred pages, in Vaughan’s case, it analyzes the culture and the organizational and human systems that allowed that to happen in that way, to the extent that when you read the summary at the end of the book, you’re left thinking, “Well, how could it have happened any other way?” Which I personally found enthralling and fascinating.
[00:52:48] LM: Well, if you’re adding on to that, then I have to add on to that. And I would say books by Gary Klein, Seeing What Others Don’t and Sources of Power, are again very accessible reads, and he’s done a lot of work in laying out very practical examples in high-hazard domains.
[00:53:04] MS: Excellent. I think that’s a great list and we’ll link some of those things in the show notes so folks can easily find and access them. Thank you both for joining us today.
[00:53:15] NS: Thank you very much for having us.
[00:53:16] LM: Yeah, this has been great. Thanks a lot.
[00:53:27] BH: This show is produced and mixed by Levi Sharpe. Editorial oversight by Jess Lee, Peter Frank, and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, email email@example.com and make sure to join us for our DevDiscuss Twitter chats every Tuesday at 9:00 PM US Eastern Time. Or if you want to start your own discussion, write a post on DEV using the #discuss. Please rate and subscribe to this show on Apple Podcasts.