Please don't send your SRE a screenshot of an error.
In this episode, we're talking SRE with Logan McDonald, senior site reliability engineer at BuzzFeed, and Molly Struve, lead site reliability engineer at Forem. We get into what site reliability is, the history, some SRE horror, what developers can do to make an SRE's job easier, and more.
Jess Lee is co-founder of DEV.
Ben Halpern is co-founder and webmaster of DEV/Forem.
Logan is a releng-focused senior site reliability engineer at BuzzFeed. If Logan has a personal brand, she hopes it is "Friendly Neighborhood Operations Engineer."
Molly Struve is senior site reliability engineer at Netflix and former head of engineering at Forem. During her time working in the software industry, she has had the opportunity to work on some challenging problems and thrives on creating reliable, performant infrastructure that can grow as fast as a booming business. When not making systems run faster, she can be found fulfilling her need for speed by riding and jumping her show horses.
[00:00:01] JL: This season of DevDiscuss is sponsored by Heroku. Heroku is a platform that enables developers to build, run, and operate applications entirely in the cloud. It streamlines development, allowing you to focus on your code, not your infrastructure. Also, you’re not locked into the service. So why not start building your apps today with Heroku?
[00:00:18] BH: Fastly is today’s leading edge cloud platform. It’s way more than a content delivery network. It provides image optimization, cloud security features, and low latency video streaming. Test run the platform for free today.
[00:00:31] JL: Triplebyte is a job search platform that allows you to take a coding quiz for a variety of tracks to identify your strengths, improve your skills, and help you find your dream job. The service is free for engineers and Triplebyte will even cover your flights and hotels for final interviews. So go to triplebyte.com/devdiscuss today.
[00:00:48] BH: Smallstep is an open source security company offering single sign-on SSH. Want to learn more about Zero Trust Security and perimeterless user access? Head over to the Smallstep Organization page on Dev to read a series about Zero Trust SSH access using certificates. Learn more and follow Smallstep on dev.to/smallstep.
[00:01:12] LM: And it looked like he had posted some content about something that happened at the conference, but then the image that was placed with the post was like a man shirtless with an anonymous mask on, like taking a mirror selfie.
[00:01:43] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all our lives as developers. I’m Ben Halpern, a co-founder of Forem.
[00:01:50] JL: And I’m Jess Lee, also a co-founder of Forem. Today, we’re talking about SRE with Logan McDonald, Senior Site Reliability Engineer at BuzzFeed, and Molly Struve, Lead Site Reliability Engineer at Forem.
[00:02:01] BH: So you might’ve noticed that we introduced ourselves as cofounders of Forem, where we used to say cofounders at Dev. And that’s not to say that we’ve changed organizations or anything. Dev is still alive and well, but we announced recently that the underlying software that powers Dev is called Forem, that’s spelled F-O-R-E-M, and it stands for the phrase “for empowering community,” and we are in the transition of migrating our organization identity to be Forem instead of Dev, with Dev continuing to be a central, critical, important thing. Our relationship with the whole Dev community is really fundamental to building Forem. But Forem is the software that’s going to power Dev. It’s going to power other communities. It already is, to some extent. It’s open source and we’ve been working on it for years as the underlying technology of Dev. And so yeah, we have to make that transition eventually because we see our fundamental work as building Forem. So yeah, we’re going to be introducing ourselves as cofounders of Forem now.
[00:03:03] JL: And before we get into our round table discussion, let’s talk a little bit about why we chose to do this episode topic.
[00:03:08] BH: Yeah. So one of our guests today is our own Molly Struve and she is our Lead Site Reliability Engineer. And to me, site reliability is the stuff I no longer have to stress about a hundred percent of the time. And that’s really what having an SRE team is about as a founder from my perspective. I feel like this is the stuff that used to really, really be on my mind all the time and not that it’s been thrown over the fence, but I think that’s what SREs are there to do to support the organizational needs and really be those engineers who drive a more enjoyable, focused, reliable software team.
[00:03:55] JL: At what point did you feel like we needed to bring on Molly?
[00:03:59] BH: I think there was a time where there was a shift from problems that our team knew exactly how to handle and maybe they were difficult or we had prioritization issues or anything. Sometimes there’s known problems that you can’t immediately solve, but then there’s sometimes when you get to the point where a little bit more expertise, a little bit more having been there, done that is really what you need to unlock productivity and growth and sanity of the organization. And Molly brought the right amount of all of that stuff and essentially was there to also teach us what site reliability meant. The first developers on any team are typically going to be generalists and that’s what we were and that’s what we still mostly are. But we hit a point in time where there was only so much we could gain from our total generalist mentality.
[00:04:53] JL: Yeah. And I’m really excited about this conversation because Molly is a one woman SRE team here at Forem, whereas Logan works on an entire team with fellow SREs for a really big company, BuzzFeed. So I’m curious how their experiences differ and what sort of advice they’ll have for us.
[00:05:13] BH: Yeah, absolutely.
[00:05:23] JL: And now here with us is Logan McDonald and Molly Struve. Thank you both so much for joining us.
[00:05:27] MS: Thanks for having us.
[00:05:28] LM: Yeah. Thanks for having us.
[00:05:29] BH: Before we get into SRE, can you tell us a little bit about your background as engineers? Logan, let’s have you start.
[00:05:37] LM: I started out on my engineering journey right after college. In college, I actually didn’t study computer science and I hopped around a lot, but ended up studying economics. Sort of in my last year, I really got into like the statistical side of programming in my economics degree, and I really wanted to learn to program. So I literally just Googled, like, “How do you learn to program?” and “programming groups in Seattle,” where I was at the time. I was particularly interested in like gender diverse groups and groups that focused on women in engineering. And I found this program that is like exactly that called Ada Developers Academy, which is a one-year program that has an apprenticeship built into it for seven months, where it’s just like an educational program to learn to code and then you try to specialize in your apprenticeship in some area. And so I was really lucky and got into that program and learned how to code for the first six months and then did my apprenticeship at a company called Chef Software, which makes different software for configuration management and like continuous delivery. And after that, I became an operations engineer. I worked for a couple of years at Kickstarter and then I went on to become a site reliability engineer where I currently work, which is at BuzzFeed. So a little bit of a winding road, and I can talk a little bit more in the discussion about kind of how I landed on operations and SRE coming from only like a coding background. But yeah, that’s a little bit of my journey.
[00:07:18] JL: Tell us more about your current role at BuzzFeed.
[00:07:20] LM: Yeah. So I am a site reliability engineer. There are a lot of definitions of what that actually means, but to me it means I’m a software engineer that’s focused on reliability. So I focus on limiting the toil and struggle of our developers and I do that mainly through release engineering, but there’s a mix of other parts of the role, like some InfoSec, some like developer productivity, some production engineering, but overall it kind of fits under the umbrella of reliability. Yeah. I work at BuzzFeed. So BuzzFeed is a media company that covers quite a few different verticals. We have BuzzFeed News, BuzzFeed.com, Tasty, Nifty. We have tons of sites and it’s a giant microservice architecture underneath, of almost 600 different microservices. So our team, which is core infrastructure, manages the platform that all of those services sit on top of, from development all the way to production.
[00:08:26] BH: And Molly, why don’t you tell us a bit about your background?
[00:08:29] MS: Similar to Logan, I did not study computer science when I was in college. I actually studied aerospace engineering because at the time planes seemed really cool to me. So I ended up getting a degree in that and then actually went into options trading for a couple of years. And during that time, the market was pretty much in turmoil, and at the same time, Silicon Valley was really kind of hitting its stride and you had all these exciting companies like Instagram and Facebook really just starting to take hold and start to shape the landscape of the internet. And me being in options trading and just not really feeling it kind of looked at that and said, “Huh! I think it’d be really cool to kind of get back to my engineering roots and go back to building stuff.” So I quit my job as a trader in 2013, and similar to Logan, literally Googled, “How to start web development and how to start learning to code?” I had coded in college, but it was kind of backend Java stuff. So nothing to really do with web development. And in my searches, I found the Michael Hartl tutorial on how to build basically a Twitter app. And I taught myself Ruby on Rails through going through that tutorial and ended up landing a job as an intern at a tiny little startup and kind of the rest was history after that. In terms of getting into SRE work, I, from the beginning, was kind of a full stack engineer, learned the front end, the backend. But while I was at one of the companies that I worked for, a large security firm, I kind of took ownership over one of our large data stores, Elasticsearch. And as I kind of took ownership over that, I really kind of honed my skills and became an expert in it. And so when the company started growing, it needed to split up the developer teams. It was kind of a natural fit that I moved into site reliability land from there, since I was already so involved in our data stores and their resiliency and reliability.
[00:10:41] JL: And now that you’re at Forem, can you share what you’re doing for us as a site reliability engineer?
[00:10:47] MS: Yeah. So since joining Forem, I’ve done a lot of work to really take us off of kind of paid niche services and onto more open source platforms and data stores such as Elasticsearch and Redis, and it’s just kind of another way that our platform continues to maintain its open source values in the fact that we’re using these kind of data stores that are run by code that is also open source. So I’ve been doing a lot of work on that, and then most recently, as we have started transitioning to this Forem brand, our ideals have really shifted to not just supporting this single Dev Community, but supporting many communities all over the place and for all different types of communities. And so now my job has shifted from just making sure that Dev is reliable to instead ensuring the reliability of all of these new Forems that we’re building. And what that really starts with is designing a system and an infrastructure that is going to be not only scalable, but also easily reproducible many times so that we can spin up many, many communities and be able to help run them and ensure that they’re all as reliable as the next one.
[00:12:12] JL: So Logan kind of touched on this earlier that SRE can mean many different things to different people. So I’m curious what your definition of SRE is.
[00:12:20] MS: My definition, actually, a hundred percent lines up with Logan’s. I totally agree that site reliability engineers are software engineers, but with a focus on reliability. And another kind of attribute that I think a lot of site reliability engineers have is they have this innate ability to kind of step back from just the code and look at the overall system and think about how a single piece of code affects that overall system from everything, from the front end to the backend data stores, to the servers that the code is running on. So I think that’s another kind of key piece in being an SRE is being able to really grasp the large picture and make sure that everything, whether it’s new code, whether it’s a new data store, kind of fits nicely and securely into that big picture.
[00:13:16] LM: I completely agree with that. And it’s part of what really drew me to this particular field. I think both of us come from an interdisciplinary background and SRE is inherently so interdisciplinary. It covers so many different types of engineering and really requires people who are systems thinkers. Yeah. So I just want to say, I completely agree with that.
[00:13:41] BH: Yeah. And site reliability has to enter the picture at really varying levels of the stack. So how do you ultimately communicate with the different stakeholders in the organization being the system side or the front end engineers or the architects? What are some of the sort of key concerns that go into ensuring folks are working collaboratively in order to reach a certain state of adequate reliability? Logan, do you want to talk about that?
[00:14:11] LM: Yeah. I can talk a little bit about it in the context of BuzzFeed now, which has a model that I really like. So I mentioned at the beginning that we have a pretty large microservice architecture. So like 600 services that are split across a team of probably around a hundred engineers. And so organizing those services matters, and understanding that one service may be some Slack bot job that some engineer created that, if it goes offline, is not that important. The other is like our Buzz API, which powers all buzzes across the site, what we kind of call like “article components”. And if the Buzz API goes down, it’s a way bigger deal than if there are issues with that tiny Slack bot. So we have a pretty powerful platform as a service that our platform and infrastructure teams manage. We call it “rig”. And it is a sort of automated release system for all of these services that has configuration available to tell whether a service is in a top tier or maybe a lower tier that needs less assistance, but that’s only one sort of guardrail around this platform that we put in place. The way that we develop these services and how we configure them involves tons of guardrails, from like how many resources can be provisioned for a given service, to who owns that service, to, if it’s an HTTP service, what port that service is being made available on. And all of this is very standardized and configured. And a big part of our job is operating that platform and watching out for possible things that developers might want to change, to make it like flexible enough that they can adequately do their jobs and don’t have to come to us every time they need a new database, they need to change some bit of the configuration, they need an authentication proxy, but they also don’t do things that will accidentally hurt themselves and hurt the ecosystem as a whole. So it’s like kind of always that balance.
You don’t want to be a gatekeeper, but you do want to have guardrails in place so that the system remains safe enough that you feel really good about being on call for it and supporting it.
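A minimal sketch of the kind of guardrails Logan describes, where every service declares an owner, a tier, a port, and its resource needs, and the platform rejects anything outside agreed limits. The episode does not describe rig's actual configuration format, so the field names, tier numbers, and limits below are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical limits and tiers, chosen only for illustration.
MAX_CPU_CORES = 4.0
MAX_MEMORY_MB = 8192
ALLOWED_TIERS = {1, 2, 3}  # 1 = must always be online; higher = less support


@dataclass
class ServiceConfig:
    name: str
    owner: str        # team that owns the service
    tier: int         # support tier
    port: int         # port an HTTP service is exposed on
    cpu_cores: float  # requested CPU
    memory_mb: int    # requested memory


def validate(config: ServiceConfig) -> list[str]:
    """Return guardrail violations; an empty list means the config is accepted."""
    problems = []
    if not config.owner.strip():
        problems.append("service must declare an owner")
    if config.tier not in ALLOWED_TIERS:
        problems.append(f"tier must be one of {sorted(ALLOWED_TIERS)}")
    if not 1024 <= config.port <= 65535:
        problems.append("port must be in the range 1024-65535")
    if config.cpu_cores > MAX_CPU_CORES:
        problems.append(f"cpu_cores exceeds the limit of {MAX_CPU_CORES}")
    if config.memory_mb > MAX_MEMORY_MB:
        problems.append(f"memory_mb exceeds the limit of {MAX_MEMORY_MB}")
    return problems


buzz_api = ServiceConfig("buzz-api", "content-team", 1, 8080, 2.0, 4096)
slack_bot = ServiceConfig("slack-bot", "", 3, 8081, 0.5, 16384)
print(validate(buzz_api))   # no violations
print(validate(slack_bot))  # missing owner, too much memory
```

The point of returning a list of problems rather than failing on the first one is the "guardrail, not gatekeeper" balance: a developer sees everything that needs changing at once and can fix it without asking the platform team.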
[00:16:42] JL: How was this different from when you were working at a smaller startup?
[00:16:46] LM: That’s a really good question. I wanted to speak a little bit to the SRE definition. I think a big contributing factor when you’re thinking about SRE is scale. And there are some people out there who will be like, “If you’re not working at Google or a similarly sized company, you’re not really an SRE,” which is like very debatable. But yeah, I think that the scale is really different. Once you reach a certain like level of system complexity and like incident management really, there’s so much toil in that system that needs to be handled and automated. And I think an SRE is a really good role to place in there, but like at my previous job, with a team of like four for the platform and operations team in total, we were doing lots of stuff that probably could be classified at certain companies as backend engineering, less platform-concentrated work. A lot of the time we were like directly on production boxes, which is not something I really do in my job now because we’ve reached like a level of abstraction and a level of tooling that makes that not necessary. And then all the way down, like I’ve talked to SREs at companies that are really small and just starting out that would maybe even classify some of the work as like IT, like they’re setting up certain systems that are kind of like purely IT systems. So it’s really a range of things. And even in my current role, a lot of the time, the stuff I’m doing, I wouldn’t really say is SRE work. But it goes back to our conversation: it’s very interdisciplinary and you wear a lot of hats in the role.
[00:18:27] MS: I’ve definitely heard people say, “Oh, you’re not really an SRE unless you’re at a giant company like a Google.” And I think because kind of Google wrote the handbook on being an SRE, there’s this kind of stigma that you have to be a certain size, have X number of engineers, be running X servers before you can have an SRE team. And that is something I personally disagree with. It’s never too early to start thinking like an SRE. And in the beginning, like Logan said, you might not just be doing SRE things, you might be a backend developer or a full stack developer, but you have that SRE mindset so that you can kind of look into the future and determine what your decisions you’re making now, how they’re going to affect things in the future. And I think it’s never too early at a company to have that mindset so that you can kind of get ahead of those scaling issues before they pop up and that’s going to save you a lot of time and energy down the road because no one likes hitting those scaling issues and then you’re running around, putting out all of these fires. If you can kind of get ahead of those early on, I think that’s a huge win, no matter how small a company is.
[00:19:41] LM: I think that like lots of companies who don’t do that risk burning out their engineers. What they would rather do is just let everyone burn out rather than pay to have an SRE do it or even just allowing an SRE or an SRE team to build safer systems in this way.
[00:20:22] JL: Over nine million apps have been created and run on Heroku’s cloud service. It scales and grows with you from free apps to enterprise apps, supporting things at enterprise scale. It also manages over two million data stores and makes over 175 add-on services available. Not only that. It allows you to use the most popular open source languages to build web apps. And while you’re checking out their services, make sure to check out their podcast, Code[ish], which explores code, technology, tools, tips, and the life of the developer. Find it at heroku.com/podcasts.
[00:20:55] BH: Empower your developers, connect with your customers, and grow your business with Fastly. We use Fastly over here at Dev and have been super happy with the service. It’s the most performant and reliable edge cloud platform. Fastly CDN moves content, data, and applications closer to your users at the edge of the network to help your websites and apps perform faster, safer, and at a global scale. Test run your platform for free today.
[00:21:22] JL: So Molly, you mentioned that Google like more or less invented the SRE role, and Logan, you mentioned that depending on the size of the company, some SREs can end up doing some more IT work or more backend work. Do either of you know what the history of site reliability is? How did this role come to be?
[00:21:39] LM: Basically at Google, they have these huge server farms and they were having tons of incidents and the incidents were taking up a huge amount of time for like the engineering teams whose software was running on these servers. And so they started investing in teams that would look at what was causing this high level of incidents and this like real toil and labor from these teams to have to be on call all the time for the stuff. And they took software engineers working in the org and put them on this team and they started to automate away some of the incidents. Like you find patterns of like an incident happening 10 times the same way every day. They started to automate away those incidents and it became fewer and fewer incidents and more outliers, and then the outliers sort of became what we kind of think of now as incident management, as like how do you deal with unknown unknowns and like freak black swan sort of situations. And so they started to develop this hierarchy, which is in the book. It was developed by Mikey Dickerson, who was a site reliability engineer at Google, of like all the way down to the code. So at the very top would be like the code that your engineers are writing and then it sits in this pyramid of all of these principles of site reliability engineering, which are monitoring, root cause analysis, capacity planning, and I think that there are a couple more. From that, they built out what is the book now and this kind of way of thinking. But because these people were originally software engineers, they approached it with a very like “automate my problems away” mindset, and that I think became like a big tenet of what we think of as SRE today. You’re always trying to find patterns. And when you find patterns and problems, you automate them away. You’re trying to make things more observable. You’re trying to have fewer incidents and you’re trying to keep your developers happy.
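As a rough illustration of "find patterns, then automate them away," here is a small sketch that groups past incidents by a signature and surfaces the ones that keep recurring. The incident records, the choice of (service, symptom) as the signature, and the threshold of three repeats are assumptions made for the example, not anything described in the episode.

```python
from collections import Counter


def automation_candidates(incidents, min_repeats=3):
    """Return (signature, count) pairs seen at least min_repeats times, most frequent first."""
    # Assumption: two incidents are "the same" if service and symptom match.
    counts = Counter((i["service"], i["symptom"]) for i in incidents)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_repeats]


# Hypothetical incident log.
incidents = [
    {"service": "image-store", "symptom": "disk full"},
    {"service": "image-store", "symptom": "disk full"},
    {"service": "api", "symptom": "timeout"},
    {"service": "image-store", "symptom": "disk full"},
    {"service": "cms", "symptom": "deploy failed"},
]
print(automation_candidates(incidents))
```

Whatever falls below the threshold is the long tail of outliers, which is the part Logan says ends up being handled through incident management rather than automation.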
[00:23:45] JL: What is a black swan incident?
[00:23:47] LM: So in black swan theory, a black swan event is a metaphor that describes an event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. So there’s this really great talk on this, What Breaks Our Systems: A Taxonomy of Black Swan Events, by Laura Nolan. Some examples that she talks about in this talk are like you have a highly automated system that has triggered events, right? Like something happens that triggers something else, but it can cascade down and something completely unforeseen happens and it starts triggering all of these failovers and events. Like if this database goes down in this region, that’s going to trigger all of these deployments, but those deploy events depend on this other part of the system working and it doesn’t actually work. And then it’s just a catastrophe. It’s like one of those 24-hour-long incidents, but you just never saw it coming. Right? So a lot of SRE is, once you’ve like figured out all the things that are probably going to go wrong and you automate checks for all of that stuff, you still have this possibility of these black swan events. And so how you handle and react to them is a big part of the job as well.
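The cascade Logan describes, where one event triggers others that in turn trigger more, can be pictured as a walk over a graph of trigger rules. The sketch below lists everything a single failure would set off; the event names and the trigger table are invented for illustration. It only covers rules you already know about, so it cannot predict a true black swan, but it can show how far a known trigger chain reaches.

```python
from collections import deque


def cascade(triggers, start):
    """Return every event reachable from `start` via trigger rules, in discovery order."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        event = queue.popleft()
        for nxt in triggers.get(event, []):
            if nxt not in seen:  # avoid looping forever on cyclic trigger rules
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical trigger rules: event -> events it automatically sets off.
triggers = {
    "db-down:region-a": ["failover:region-b", "redeploy:api"],
    "failover:region-b": ["redeploy:api"],
    "redeploy:api": ["run:deploy-pipeline"],
    "run:deploy-pipeline": ["pull:artifact-store"],
}
print(cascade(triggers, "db-down:region-a"))
```

In this made-up table, a database outage ends up depending on the artifact store being healthy, which is the sort of hidden dependency that turns an automated failover into a long incident.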
[00:25:08] JL: So SRE seems like one of the more inherently stressful development positions, from lots of preparation to when you’re stuck in the middle of a black swan event and you need to just be executing and fixing things. How do you manage the pressure?
[00:25:22] MS: So I personally am someone who kind of thrives on pressure. I’m a very competitive person. And when my adrenaline gets going, that’s kind of when I perform my best and that makes it exciting for me. So obviously, you don’t want those events to happen too often because that is what leads to burnout. But I think part of what helps me enjoy the SRE position is having some of those events come up and you kind of get a little bit of excitement. It kind of mixes up your day as opposed to you start work, you develop some things, you write some code, you commit the code and you go home at the end of the day. As an SRE, you have this kind of unknown every single day that you kind of have to watch out for. And I think that’s what makes, at least for me, the job exciting. And I really love just knowing, after being in this position for a while, that at the end of an incident, at the end of some outage, we’re going to learn something and then we’re going to make our system better. That is something that has kept me really kind of hooked on SRE: the constant learning, the constant ability to continue to make systems better and then seeing those changes really pay off and then protect the system’s reliability in the future. That feedback loop is really kind of what keeps me hooked on SRE.
[00:26:50] LM: To speak to the adrenaline thing, I think that when I see an incident happening and I don’t immediately know why, it kind of like makes me excited rather than scared because I get to like investigate something. Maybe annoyance, but never fear. But I think three things really helped me to not get burnt out. The first is psychological safety. I work on a team, and I would always choose to work on a team, where I 100% know that if I’m the primary on call and I get paged and I let it slip to secondary because I don’t know what’s going on, then my secondary is not going to like wake up and be like, “Logan, why did you wake me up? This is awful.” Like they’re going to be in it with me and they’re not going to be mad that I didn’t know what was going on and I needed help. So as a team, you just have to hire people and create a culture that is super safe and in which people are willing to admit when they don’t know what’s going on. The second thing is, and this is basically what Molly was describing, it’s blamelessness, which is kind of a cultural practice in DevOps and SRE land, which does not mean, I mean, it means no blame, that blame is unhelpful, but it does not mean no accountability. Like if part of the problem was that I was super stressed out and didn’t get a good night’s sleep and slept through a page, like that’s something we should know, but don’t blame me for not waking up, blame like the system for not working. Like why was I burnt out? You don’t believe that there’s any use in pointing fingers at someone or one thing in particular in a complicated system. And finally, my organization made a really great choice about a year ago that if you got paged out of hours when you were primary on-call, you got a half day off and you could use it whenever you wanted. And then anyone who’s primary on call for a week gets a day of paid time off whenever they want to take it.
And so really trying to mitigate burnout and leave time to recover because even if you get the adrenaline and you really like it, it’s still adrenaline and that chemical is not great for your body long-term, especially if it keeps happening. So I think that really incentivizes the organization management to deal with repeat incidents because if someone is getting paged every single night, then they’re basically not working for half of the next week because they have all the paid time off. And it also allows you to recover and rest, if you got paged at 3:00 AM and you’re all out of sorts.
[00:29:31] BH: So here’s a scenario. Let’s say the site’s architect, maybe someone who doesn’t work specifically in SRE or other parts of software development, has the notion that in order to relieve stress on the backend, we are going to move more logic to the front end, like more stuff going on on the client, in order to just have a reduction of total computing happening on the servers. That must be a scenario where the SRE maybe is relieved in that the server load will be reduced, but maybe stressed in that the amount of work going on in the client will lead to more logical problems that are harder to observe. What would go through your head in that scenario, Molly, just like outlining those different trade-offs and what you might do to prepare for such a change?
[00:30:27] MS: I definitely think in a situation such as rebalancing the load to different areas of the app, that is when I would definitely be reaching out to those front end people, those people who are really versed in the client side code and really get on the same page as, “Okay, how is this new load on the client going to affect the logic and how the client is performing? Do we need to worry about types of internet providers in that case? Are there other things we need to worry about?” And that’s something where I really kind of find the experts in that client side domain. And I really make sure that I work with them to understand how things are going to change and then figure out my strategy for staying on top of monitoring that code.
[00:31:19] LM: In some ways, I don’t know how involved we would be. We may not even ever hear about it, but the only thing that we might hear about it from is like making sure that all of our tooling is completely available to them, and it’s something that we’re kind of working on now because we’re writing more services in Node. I’m sure some teams are making this transition and we need to make sure that we have parity in Node for like step-through debugging, for our logging packages, and for metrics. And then every service kind of sets up for itself who they say is on call for it and what level of support it needs. And then every service that is tier one, if it has to be online all the time and it has ops support, has a runbook, and we may help them with the construction of that runbook and making sure whoever’s on call, if they get paged because of some latency or some issue with it, knows exactly what to do and where to look. And that again comes back to the idea that Molly spoke to of observability and making sure that all of the packages are lined up so they can do what they need to do safely.
[00:32:54] JL: Join over 200,000 top engineers who have used Triplebyte to find their dream job. Triplebyte shows your potential based on proven technical skills by having you take a coding quiz from a variety of tracks and helping you identify high growth opportunities and getting your foot in the door with their recommendation. It’s also free for engineers, since companies pay Triplebyte to make their hiring process more efficient. Go to triplebyte.com/devdiscuss to sign up today.
[00:33:18] BH: Another message from our friends at Smallstep. They’ve published a series called, “If you’re not using SSH certificates, you’re doing SSH wrong,” on their dev organization page. It’s a great overview of Zero Trust SSH user access and includes a number of links and examples to get started right away. Head over to dev.to/smallstep to follow their posts and learn more.
[00:33:44] BH: How about some horror stories as we get into the end of the show? Logan, can you let us know about an incident that was particularly painful and some of the aftermath, what you learned, et cetera?
[00:33:58] LM: Yeah. This one is not like specifically classic infrastructure capacity planning, but it’s something that sort of changed, like, meaning halfway through. So I was maybe three months into being at BuzzFeed and I was called on to an incident we had. This was with our news team. We had some reporters who were at an Apple conference and they were live blogging new product releases and stuff and they had kind of shoddy Wi-Fi because they were in this big conference center with a bunch of other reporters trying to say this iPhone launched or whatever. And one of them constructed a post in our CMS, went to publish, lost his Wi-Fi, then it came back online and the request had processed, and then he kind of ignored it for a little bit, but then started getting pings on the page because there was something really weird going on with the post. And it looked like he had posted some content about something that happened at the conference, but then the image that was placed with the post was like a man shirtless with an anonymous mask on, like taking a mirror selfie. And so he immediately starts freaking out and he’s like, “We’ve been hacked. Someone has like found some vulnerability where they can like change our images in our CMS.” It’s a very complicated system. And so he pinged the head of security for news, who then pinged us, and we opened an incident channel. And so we’re going through all these things of like, “Where is some cross-site scripting bug that’s allowing someone to insert these image links?” We’ve got so many people involved, like all the VPs, because if we did have like a hacking situation, that’s really bad. And so one engineer who was like poking around grabbed the image link and realized that it was being served from our own image store, like it was at our own S3 link. And so they reverse image searched it in Google and found that it was attached to a BuzzFeed post from like 2011.
It was like “15 Sexy Anonymous Members” or something like that. We were looking and found that in our CMS, there was this bug that if you lost connectivity and it couldn’t upload the photo, it took a random buzz ID from our database and populated it. And so we tried it a couple of times to reproduce it and we were getting like images of flowers and like random stuff. It just happened that everything aligned with that random ID, which made us all go in the absolute wrong direction, and we probably wasted like 30 minutes getting way too many people involved because we thought we were hacked and made all these assumptions about what was happening. So technically, this isn’t like super interesting and it was just a minor bug, but it kind of shows you that in like the heat of the moment, when you’re just seeing symptoms, it could be something you totally don’t think that it is.
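To make the failure mode in this story concrete, here is a minimal Python sketch. It is not BuzzFeed’s CMS code, which the episode never shows; every name in it (`EXISTING_IMAGE_IDS`, `buggy_image_id`, `strict_image_id`, `UploadFailed`) is hypothetical. It only contrasts a silent fallback to a random existing ID with failing loudly when the upload did not complete.

```python
import random
from typing import Optional


class UploadFailed(Exception):
    """Raised when an image upload did not complete (hypothetical)."""


# Hypothetical stand-in for image IDs that already exist in the database.
EXISTING_IMAGE_IDS = [101, 102, 103]


def buggy_image_id(uploaded_id: Optional[int]) -> int:
    # Roughly the behavior described in the story: when the upload fails,
    # some unrelated existing ID is silently used instead.
    if uploaded_id is None:
        return random.choice(EXISTING_IMAGE_IDS)
    return uploaded_id


def strict_image_id(uploaded_id: Optional[int]) -> int:
    # Safer alternative: surface the failure instead of guessing an ID.
    if uploaded_id is None:
        raise UploadFailed("image upload did not complete; refusing to publish")
    return uploaded_id
```

With the strict version, the lost Wi-Fi connection would have shown up as an upload error rather than as a mysterious image that looked like a hack.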
[00:37:18] JL: Molly, any horror stories from your work?
[00:37:21] MS: Oh my! So many good ones. I think the one that sticks with me the most because it was kind of the biggest and longest running horror story I’ve ever had was at a previous company and I kind of alluded to this incident earlier. We had our Elasticsearch data store, which was really the cornerstone of the entire platform. We were looking to do an upgrade on it. And the last upgrade we had done for Elasticsearch went swimmingly. It was great. We saw all these huge gains, like memory usage went down, request speeds went up. It was fantastic. So preparing for this next upgrade, we assumed it was going to be just as great. So we got all the code working, we got everything ready. And when we flipped the switch and did the upgrade, it was a disaster and our Elasticsearch became overloaded and we had all of these problems and kind of long story short, our entire app was basically down for a week as we were trying to figure out what was wrong. And it got so bad. We actually pulled in an Elasticsearch engineer who, lo and behold, in the end found out that our problem wasn’t on our end, but it was actually a bug within Elasticsearch. And so they actually had to push code to fix the bug, and they gave us a workaround in the meantime until they released that. And after about a week of just 12, 15-hour plus days of working nonstop, we finally got this fixed and it was one of those incidents that, one, really bonded our team together because after we kind of survived that, we realized, “Okay, we can really do anything.” And I think it also opened up our eyes to the fact that no matter how widely used software is, all software can have bugs, even massively used data store software, like Elasticsearch. It can have bugs in it.
And I think that really kind of opened up my eyes to, “Okay, no matter what kind of software you’re using, you really have to load test it and capacity test it and make sure that it can handle your use case because no matter how widely used it is, all software has bugs. And whether that bug might affect you or it might not affect you is something that you’re kind of really responsible for in how you use that software.” So in the heat of the moment, it was not the most glamorous incident, but we learned so much and I felt like that was really one of the defining incidents for me in my SRE career, in that I learned a lot and also kind of in a backwards way gained a lot of confidence from that incident. “Okay, if I can survive this, I can kind of survive anything.”
[00:40:14] LM: I think that Molly spoke to something that is like probably the most stressful thing for me, not like the incident itself, but making the call of whether to go backwards or keep going through it is really hard. And I’ve been a part of upgrades like that, where first, like having a solid plan of how to go back is hard. And sometimes you just can’t. You’re just like, “We’re going to do it.” It’s happening. And then when you’re in it and you’re like, “We can get through this,” but everything is on fire and people are like, “Should we go back? Should we go back?” And you’re like, “No, we’re going to make it through.” It’s hard.
[00:40:56] JL: What is something that you wish that engineers knew about SRE that would help make your jobs easier?
[00:41:04] MS: I wish kind of just engineers knew this in general, which is that we don’t have all the answers as SREs. And our job is a constant iteration of seeing something break and then fixing the system to make it more resilient. Sometimes SREs can be kind of elevated to the status of like this hero culture where you see an incident happen and you see the SRE team step in and they fix it and they solve it. And so sometimes that can kind of create this idea that, “Oh, they just know how to deal with anything.” And I love trying to debunk that myth and being like, “We don’t have all the answers. Half the time we’re in there and we’re sifting through logs trying to figure out what’s going on. We’re just as clueless a lot of the time as anyone else.” And we just happen to go through the logs and break down the problem and use the tools we have in order to kind of get to the solution. But I think one thing devs can kind of do to help their SRE teams is have that little SRE mindset in the back of your head. It can be as simple as you’re changing code that talks to the database and maybe pinging an SRE for a code review, just because, “Okay, I think this code is going to execute a lot. Let me just check in with the SRE team about this.” I think just being thoughtful when you’re committing code to make sure you’re pulling in the right people, even if there’s just a small chance that that code might affect systems in other ways.
[00:42:46] LM: I have just a super specific request, which is when you have a problem that you see in the logs or in your local environment, don’t screenshot it, like actually copy the full error and format it like a code block.
[00:43:00] MS: Oh my God! Yes!
[00:43:03] LM: Because my number one debugging tool is searching Slack for when this has previously happened and I can’t search images. I can only search the text. Actually, I want to write a post about this for DEV, but like Slack dorking is like the number one skill I think you could learn if you want to be an SRE or just a better software engineer because it’s like super powerful, their advanced search, and you can find all sorts of problems that have occurred in the past, but I can’t find them if you screenshot them.
[00:43:36] MS: Oh my! I’ve had this issue so many times, but not quite enough to like really stop and think about it. But now that you mention it, if you write a post on that, let me know. I’d be happy to contribute to it because that is also when I get a screenshot and then I think to myself, “Oh my Lord! I have to write out everything that’s in this error into Slack or into Google now.”
[00:43:58] BH: Helping people who are trying to help you is probably one of the more fundamental things you can do, like make people’s jobs as easy as possible, give them the context, give them the text, help the universe by putting out text into the world that’s searchable in the future.
[00:44:15] LM: Yeah. Or if you just get a screenshot and it just says “error” in like big round... actually, I don’t know what this is.
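The request above (share the full error as searchable text in a code block, not as a screenshot) can be sketched in a few lines of Python. This is an illustrative helper, not a tool mentioned in the episode: `format_error_for_chat` and `risky_query` are hypothetical names, and the only assumption about the chat tool is that it renders triple-backtick fences as a code block, as Slack does.

```python
import traceback


def format_error_for_chat(exc: BaseException) -> str:
    """Return the full traceback of exc wrapped in a chat-style code block."""
    fence = "`" * 3  # triple backtick, rendered as a code block in Slack
    text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return f"{fence}\n{text.rstrip()}\n{fence}"


def risky_query() -> None:
    # Hypothetical failure standing in for whatever broke locally.
    raise TimeoutError("db query exceeded 30s")


try:
    risky_query()
except TimeoutError as err:
    # Paste this text into the channel instead of a screenshot.
    message = format_error_for_chat(err)
    print(message)
```

Because the message is plain text, the exception type and message stay findable later through the chat tool’s search.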
[00:44:23] JL: Logan, Molly, thank you both so much for joining us today.
[00:44:25] MS: Thanks for having us.
[00:44:26] LM: Thank you. This was fun.
[00:44:36] JL: I want to thank everyone who sent in responses. For all of you listening, please be on the lookout for our next question. We’d especially love it if you would dial into our Google Voice. The number is +1 (929) 500-1513 or you can email us a voice memo so we can hear your responses in your own beautiful voices. This show is produced and mixed by Levi Sharpe. Editorial oversight by Peter Frank and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, please email firstname.lastname@example.org and make sure to join our DevDiscuss Twitter chats on Tuesdays at 9:00 PM Eastern, or if you want to start your own discussion, write a post on Dev using the #discuss. Please rate and subscribe to this show on Apple Podcasts.
[00:45:32] SY: Hi there. I’m Saron Yitbarek, founder of CodeNewbie, and I’m here with my two cohosts, Senior Engineers at Dev, Josh Puetz.
[00:45:40] JP: Hello.
[00:45:40] SY: And Vaidehi Joshi.
[00:45:41] VJ: Hi everyone.
[00:45:42] SY: We’re bringing you DevNews. The new show for developers by developers.
[00:45:47] JP: Each season, we’ll cover the latest in the world of tech and speak with diverse guests from a variety of backgrounds to dig deeper into meaty topics, like security.
[00:45:54] WOMAN: Actually, no. I don’t want Google to have this information. Why should they have information on me or my friends or family members, right? That information could be confidential.
[00:46:02] VJ: Or the pros and cons of outsourcing your site’s authentication.
[00:46:06] BH: Really, we need to offer a lot of solutions that users expect while hopefully simplifying the mental models.
[00:46:13] SY: Or the latest bugs and hacks.
[00:46:16] VJ: So if listening to us nerd out about the tech news that’s blowing up our Slack channels sounds up your alley, check us out.
[00:46:22] JP: Find us wherever you get your podcasts.
[00:46:24] SY: Please rate and subscribe. Hope you enjoy the show.