Don't just test it, prove it.
In this episode, we talk about proof engineering with Talia Ringer, researcher and incoming assistant professor at the University of Illinois at Urbana-Champaign.
Ben Halpern is co-founder and webmaster of DEV/Forem.
Molly Struve is senior site reliability engineer at Netflix and former head of engineering at Forem. During her time working in the software industry, she has had the opportunity to work on some challenging problems and thrives on creating reliable, performant infrastructure that can grow as fast as a booming business. When not making systems run faster, she can be found fulfilling her need for speed by riding and jumping her show horses
Talia Ringer is an assistant professor with the PL/FM/SE group at Illinois. She likes to build proof engineering technologies to make that world a reality. In so doing, she loves to use the whole toolbox---everything from dependent type theory to program transformations to neural proof synthesis---all in service of real humans.
[00:00:00] TR: Rather than just test your programs on some inputs and outputs, you could state formerly, like for any input I could ever see, it’s actually going to do what I expect it to do, and then you can write a formal proof that shows that’s true.
[00:00:24] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all of our lives as developers. I’m Ben Halpern, a Co-Founder of Forem.
[00:00:31] MS: And I’m Molly Struve, Site Reliability Engineer, and Former Head of Engineering at Forem. Today, we’re talking about proof engineering with Talia Ringer, Researcher and Incoming Assistant Professor at the University of Illinois at Urbana-Champaign. Thank you for joining us.
[00:00:48] TR: Thank you.
[00:00:49] BH: So we invited you on the show because of the work you do and we sort of know about some of these ideas because of a Twitter thread that you made about writing tests that caught our eye and was all very interesting. Before we get into the future of testing and software development, can you get into a little bit about your background and why you do research in this area in the first place?
[00:01:15] TR: I got into computer science in undergrad. I never wrote a program before that. I was like a math and computer science double major. And I went to work at Amazon as a software engineer for a few years. I kind of knew I wanted to do research eventually. I had originally wanted to go to grad school for like more theory kind of thing. And when I was working as a software engineer, I felt like a lot of the things I’d learned in programming languages classes really, really could help make this job much easier. So I kind of thought about all the things that I struggled with on a day-to-day basis. And I was like, “Rather than fight this kind of at the object level on a day-to-day basis, I feel like I’d rather be building a future where these problems don’t exist.” So yeah, that kind of drew me back into research. And then the work with proofs in particular started kind of at the beginning of grad school. I took a grad class using this tool, these tools called proof assistants, where you can actually, like rather than just test your programs on some inputs and outputs, you could state formerly, like for any input I could ever see, it’s actually going to do what I expect it to do, and then you can write a formal proof that shows that’s true. And it sounded like too good to be true. And in some sense, it was, only insofar as like it’s still too hard, but I don’t think it should be hard. And so I got kind of obsessed with that and that drove most of my PhD work, how can I make this much easier.
[00:02:49] BH: Can you tell us a little bit more about what proof engineering is exactly and how it relates to testing and software development at all?
[00:02:58] TR: So proof engineering refers to the technologies that make it easier to develop and maintain verified systems. So verified systems here are systems that are proven correct using these tools called proof assistants that I mentioned earlier. So usually the way this works, to use a proof assistant, you’ll rate your program in a special language right now inside of the proof assistant and then you’ll specify something that you want to prove about your program. And then you just write a proof that shows that the program actually satisfies that specification so it does what you said it does. And the proof assistant will actually check this and let you know if your proof is correct. So there’s a lot of automation involved in writing the proof. There’s sort of an interactive process where you’re almost like having a conversation with the tool, which is why it’s like a proof assistant and not just a proof checker. So you’ll kind of send it like a strategy for writing your proof. It’ll proceed a little bit. And then it’ll come back with like one particular case of the proof, asks you to prove that, but it kind of goes back and forth until you’ve written the entire proof.
[00:04:07] BH: How does a proof assistant work?
[00:04:10] TR: So there are like different proof assistants that work a little differently, but one core part of it is the same, which is that usually you’re writing code in a really rich language, really like a logic, and there’s this core part of the proof assistant called the kernel, which can actually check that you’ve proceeded along these logical rules. It’s actually your proof is correct. It proves the thing that you stated. So basically this is the trusted part of the proof assistant, because in anything you do, there’s going to be something that you’re trusting. So the kind of philosophy behind these proof assistants is the thing that you’re trusting should be small and easy to interpret as a person. It’s debatable how much that applies in practice. It depends on the proof assistant, but that’s the core philosophy. You have this little kernel. It knows one logical system. It can check it. So then that’s at the very core. If you wanted, you could write your programs in this logical system, actually, and then you could also write your proofs in just like raw logic. This turns out to just be incredibly difficult, even for experts. So almost all of these proof assistants, you don’t actually write your proofs this way. There’s a layer of automation, like another language built on top of it where you’re kind of sending proof strategies, which might be something like do induction or try this decision procedure that I know works for all these numbers that have this property or, I don’t know, like try some rewriting rules. Then the strategies, you can really think of them as kind of like compiling down to that core logical language, but it stops the user from having to write raw logical proofs.
[00:06:01] BH: Would you say this is research which ultimately is going to touch all of software engineering to some extent once these ideas become more practical? Or are the outcomes kind of like specialized in a domain where a verified system is like needed in any capacity? What do you expect the scope to be in terms of like lessons or research at all?
[00:06:29] TR: Yeah. So I think in terms of practical industry outcomes, there’s just started to become some industry adoption in the last few years. And what I’ve seen is that in the research world, we kind of assume the entire system is going to be verified, which is like reasonable for some domains, domains where security correctness really matters. But for industry adoption, it seems like the most common use case is that there is a core critical property of your system that you would like fully verified. We would really like strong guarantee that this is always true, like no use case that you could ever think of, that you could throw at it, no test input that you could ever think of is going to cause unexpected behavior there. But then there are other parts of the system where you don’t want that at all. And it might be too expensive, like completely impractical. You might not even be able to specify what correctness is. And so I think the future there is going to be integrating this sort of into the rest of the development workflows so that you can test parts of the systems where you don’t need those kinds of guarantees. You can write proofs about the parts of assistants where you really do want those guarantees or where it’s easy to state what correct this means and then have some kind of smooth interaction between these guarantees about what the composite of all this actually means.
[00:07:50] MS: That’s super interesting. So my next question is how is it that we can use these proofs to ensure code is more secure?
[00:08:00] TR: Yeah. So I think it depends. It’s easiest when there’s some well-known security property that you can know and state. So one of my favorite examples of this is there’s this seL4, like is this the name of this verified operating system microkernel? And they’re just well-known properties. I think it was like confidentiality and integrity, but you could actually define and state formally and they actually proved that these properties hold formally for this operating system, microkernel. And this actually ends up not just being a theoretical guarantee. They build this into an autonomous helicopter. This is like a DARPA project. So DARPA first would hire a team of professional hackers to try to hijack the helicopter when it was running a normal operating system, microkernel, and they actually succeeded and then they built in this like seL4, a verified operating system microkernel, and had them try again and then this time they actually couldn’t hijack the helicopter. So you can see like for real critical software, it’s really impactful I think already, although there are some challenges where the proof engineering work comes in, like maybe you’ve proven this operating system for a particular hardware architecture, and then you want to adapt it to some other hardware architecture. That’s really hard. Now you’d have not just your program and stuff to adapt, but you also have all your proofs.
[00:09:26] BH: Is it most useful then for proving in the process of implementing or re-implementing a known existing spec? So like I have an implementation that’s sort of tried and tested, but I want to improve on it from like a performance perspective, but it’s extremely important that it is as verifiably correct as the old one was, is a reimplementation of existing spec one of the things that I might do with this still?
[00:10:01] TR: Yeah. Yeah. I think so. At a super high level, I mean, it’s hard to specify exactly what this means, but compilers is one place where you’ve seen this already where you want your C compiler to preserve the behavior of your actual program and that’s a property that you can state. There’s a well-known way to state this. There’s a verified C compiler called Concert where this property is actually proven. And then of course you have GCC and all the other C compilers. You can compare against them. There’s a super cool paper that actually looks at how many bugs are in GCC versus Concert. Do you actually see the returns here? Are there bugs in your normalcy compiler? And the answer is yes. There were bugs in the normalcy compiler like in GCC that just don’t exist in this verified version. It’s powerful. It’s cool. It does get harder when you have to do a novel kind of specification yourself, because specifications are hard. And one of the big challenges of the future is going to be making it easier for software engineers to specify things.
[00:11:15] MS: That’s not verified, has more bugs. Is that a trend you see? Have you been able to correlate the trait of a software being verified with it also having less bugs?
[00:11:29] TR: Yes. I mean, I don’t know how many use cases there are so far where there’s an unverified and verified version that are like really similar. There are some. So the operating system and the Concert, C Compiler, both of these I think are examples where this really did come up in practice. So yeah, I wrote a big survey paper on this. It’s called like QED at Large. And I wrote about a whole bunch of different use cases where you really did see practical benefits to this. Although one of the papers I read about this kind of suggested that, I believe this, a lot of the benefit comes from even like thinking hard enough about your code to write proofs to begin with. So yeah, I don’t know. It’s interesting
[00:12:13] BH: What does the research look like on a day-to-day basis in terms of the types of experiments you do and even the types of tools you use to conduct your work?
[00:12:23] TR: So my work so far has mostly been on building automation to help with common proof engineering tasks. So the focus of my PhD work, for example, it was on this thing called proof repair, which is you have written a whole bunch of proofs about your system and then you change something about your system because it’s like software engineering and your program is changing all the time, or sometimes the thing you want to prove is changing. And so it was on like, “How do I build automation that automatically adapts your proofs to those changes? Historically, I would write a lot of code in OCaml, which is a nice functional programming language that’s used a lot for compile areas and really I’m building like plugins for these proof assistants that will kind of operate over proofs, transform them, manipulate them and produce new proofs. And so there’s like a two-step thing. It’s doing this like a different thing, like comparison of old and new versions of your programs, your specifications and your proofs. And then it’s actually transforming like a representation of your proof in a way that corresponds to that change. To me, it feels very similar to compiler engineering, except that the source and destination language are the same. It’s changing a bit because I’m integrating some machine learning tools now. So now there’s a little bit of Python integration and some playing with like PyTorch. But I think it feels pretty similar because there’s still a lot of, like, I’m looking at the structure of these proofs, trying to figure out what information I can use to build this automation. Historically, it was always me writing the automation. Now it’s like, “What can a neural net do with this too?”
[00:14:31] MS: A lot of this sounds kind of like a subset of like QA and testing in general. Is there a way we can kind of compare and contrast proof engineering to these other types of things that people might be more familiar with?
[00:14:43] TR: Yeah. So I think the big thing with like traditional testing versus like proof engineering is like testing, say, obviously it’s going to simplify things a lot, but let’s say you’re writing a program that takes like numbers as inputs. I mean, it can take any number. If you’re writing tests, there’s no way that you’re ever going to be able to write infinitely many tests to test every possible number ever. So probably what you’re going to do is think hard about like some representative numbers that you could pass in, that might trigger different kinds of behavior, maybe some edge cases and so on. And you’ll write some tests for each of those. With this proof engineering, you end up in this world where you could actually say, like for any number I could ever think of, this’ll be correct, but then you have to prove it. You don’t just run your tests and call it good. You have to explain in some way to this tool why you think that’s true and that reasoning has to be valid or your proof is not going to actually check out in the end. I think in the future, there’ll be kind of a continuum from testing to these formal proofs. It’d be cool if you could start off by writing a bunch of these tests and then have the tool help you figure out something more general that you might want to prove about your code automatically, like help you generalize that, and then from that kind of ask you the questions that you would need to answer to help it finish like a formal proof about this. I’m hoping this distinction will be less of a distinction and more of like a synthesis.
[00:16:16] BH: Can you speak to like the state of proof engineering outside of academia and which maybe gigantic companies are adopting it for their purposes? And then otherwise, is anybody directly building proof engineering into industry? Is there a Travis CI for proof engineering?
[00:16:36] TR: Yeah. So I think the first industry that I saw here was a lot in like kind of smaller government contractors who are doing really security critical things. So this company Galois that I often would work with, that I partnered with doing my PhD work, they’re in Oregon. And there were like a couple of there like specific startups almost that really only do this kind of work and they’ll contract out to companies. So I think BedRock Systems, for example, they’re all about this. It started to reach some of the bigger companies too, and it’s blended with a whole bunch of other technologies that are kind of related to it. So it’s a really big space of like what we call formal methods. But yeah, basically guarantees that you can get that are stronger than just testing. So Amazon, for example, has this automated reasoning group. It’s kind of sprung up in the last two years and they do all different kinds of like stronger guarantees. I think it’s very important for AWS, especially, again a space where correctness is really going to matter, like if something is incorrect, then the whole internet might go down. I know there’s a little bit of work on it, at Google. I don’t know how much. I have like one friend at Google, more on the research end, who does some cool hardware synthesis using these proof assistants with strong guarantees, and it’s pretty cool stuff. I don’t know. I would say, right now, it’s starting to make its way into industry, but it’s mostly like they’re hiring PhDs to do this kind of work, either further research work or really engineering, like hiring proof engineers, but they’re still PhDs. It’s not yet at the point where your average software engineer was able to pick this up in the same way that they can pick up a testing framework.
[00:18:23] BH: There’s no proof engineering course in coding bootcamps yet.
[00:18:27] TR: No. Maybe one day.
[00:18:31] MS: So you just finished your thesis research. Can you talk about like what your thesis involved and what some of your findings were?
[00:18:39] TR: Yeah. Yeah. So the thesis was on proof repair, which is like automation to keep your proofs up to date, like automatically adapt them to changes in your programs or in the things that you want to prove. But the kind of key finding there was that changes that you need to make in your proofs themselves in really practical use cases are actually carried inside of other changes that you make. So if you change the data type in some way, then your proofs are going to change in a way that’s like pretty predictable and tied to this change in the data type. Or if you change something and then you change just one of your proofs. Changes in all your other proofs might kind of mirror that change. These are usually fairly predictable and repetitive changes that you made that really correspond to some deep structural way of representing that change. So basically I found that I could build a tool that would then go in, in a lot of practical cases like identify what has changed and then use that to adapt other proofs for you sometimes fully automatically. Sometimes you would need to give it like a little bit of extra information.
[00:19:46] MS: I feel like a lot of software engineers are hearing the ability to not have to update their tests possibly in the future. I am sure many are getting very excited at that, that possible possibility down the road.
[00:19:58] TR: Yeah, that’s true. Actually, I haven’t thought that much about kind of bringing that back into updating tests. The hard part there would be, with these proofs, it’s really nice because once you update them, you can check their results. You know that it really does this, and it does. But with tests, sometimes it’s hard. If you update a test, how do you know that you have updated it correctly? I think you can, if you have a human in the loop or some proof technology on the backend, maybe that feature could also happen. It’d be cool.
[00:20:28] BH: Yeah. It seems like the things that we know about our systems with normal tests are pretty specific to the tests themselves sometimes. You might know test coverage and you might collectively have a sense of how correct the tests are being, but there’s certainly like nothing close to proof or anything like that. As every day developers, I think the gears have to turn about just like how does this trickle down to my work in any capacity. And I think that sort of brings us into the topic of machine learning. I think you mentioned neural nets a little bit ago. And when I think about this topic, I think, of course, maybe this is hyper-relevant due to the nature of not having to write every test oneself or anything like that, like the interaction between, these two ideas. But I also think this seems like a domain, which is maybe known at least to folks who are not super engaged in it as an area where people don’t have a lot of verifiability. When I hear a neural net’s output, there’s a bit of a black box nature, and this is just sort of thrown out my curiosity, like, what is the intersection between that, proof engineering and all things machine learning in this field?
[00:21:57] TR: Yeah. I mean, there are two directions there, which are like very different and also both very interesting, which is verification of neural networks. And then separately from that, neural networks for making verification easier. In the last few years, I think we’ve started to see people put together machine learning tools that will help you write your proofs. Usually, there’ll be suggesting these strategies, these things called tactics. So like strategies, things like induction or something like this. So yeah, those started in the last few years. I think there’s a lot of work to do before they’re really useful. So I just got into this space because it ended up being relevant for like a lot of the tools that I write that are say adapting proofs to changes they’re working on this low logical representation, which is super structured, but really not human friendly. But the thing that I want to give back to the human in the end are these tactics, these strategies, which are really not very structured at all, but they’re, they’re really going to help the human. And in my view, the machine learning is going to be, I think, most useful for dealing with that kind of unstructured human-friendly tactic, giving you something that stylistically is nice. The big challenge I see is how do you get that nice human looking thing while also really integrating all of that structural information that’s kind of hidden underneath it at this low-level language? I think that will improve the machine learning tools a lot when they start doing that well. This is less of a problem in this world, but you want to make sure that you’re actually getting something correct and not just something that looks pretty. So there’s usually some kind of checking step, but you don’t want to just naively generate a bunch of these strategies and then check them at the end. So the big thing I’ve been looking at is how to take that structural information, which if you watch experts write their proofs, they’re using this information all the time, often subconsciously, but I don’t know. An example is a big way to write an addition function for natural numbers is by recursion, but you have a choice of whether you recurse over the first argument or the second argument. And actually, depending on which of these you choose, when you’re writing your proofs, you’re going to proceed differently. If you do induction over the number, it might be the first one, if you recurse over the first one or it’ll be the second one, if you recurse over the second one. And these kinds of things are everywhere. Good proof engineers are always tracking all this structure information, this meaning, these semantics of what’s actually happening in their heads. But if you look at some of the machine learning we’re at right now, it’s a racing, all of that information. I just think it’s a loss. So yeah, I mean, even actually this last week, I think even the hour before this conversation, I was playing around with… I took a really popular machine learning tool, looked at their presentation they actually had and then I walked through the proof by hand and I was like, “What information would I use that is not there in any way?”
[00:25:00] MS: Yeah.
[00:25:00] TR: Like not even a proxy for it. And there was a lot. So I think it’s just like right now, these proof synthesis tools, their performance is not super great, but I feel it can be really great. It’s just going to be about how you integrate this information back.
[00:25:35] BH: And then on the use of proof engineering in any sort of machine learning domain, if my car is learning not to hit pedestrians or run stop signs, at what point is any of that verified? If it’s all statistical regressions that are maybe leading to its choices, this seems like a domain I want some verifiability, but it also seems like inherently difficult to verify. How does proof engineering play in this seemingly critical area that intersects with machine learning and things like that?
[00:26:13] TR: I think this is much harder because I’ve spoken to some people about this, but I thought it’s going to be a lot more preliminary than usual. So one split immediately is there’s a difference between you can prove things about your algorithm or you can prove things about your model. And I think proving things about your algorithm, if there are any properties that your algorithm should implement, you want to make sure you’re actually implementing them, that is something that you should be able to do without much additional work here. And I think there’s a little bit of preliminary work on this that’s already out. But I think where things become harder is dealing with your model because these are really difficult to even make sense of and also they’re not correct. This is one of the interesting things about them. They basically all have these robustness issues, if you look hard enough. And so I think there are a couple of things there, like what does correctness actually mean. I think you no longer want that, or I don’t know, maybe you want that it’ll work for anyone. But in practice, what you’re going to get is up to some probability it’s going to work. You want to loosen the actual property that you prove a little bit. And then I think reasoning about it is really difficult. So usually the way that people will approve things about their models is by some kind of sampling. I think this ends up being kind of slow and intractable for like really large models. So one thing that I’ve spoken to some people about is, “How do you take a model that’s kind of hard to understand, like not really interpretable and kind of refactor it into something that’s easier to reason about?” Then you can actually prove something about that representation, but the thing is that refactoring, so that it’s actually not going to behave exactly the same way. So there’s some information that you lose in that process. I think one of the really, really cool things I would like to see going forward is I think often when you have, neural nets are best for these things where you don’t have a formal property that you could state about everything, but I think often when that’s true, there are smaller properties that you could state along the way and then there’s some unknown. So you might know a little bit about how you expect a self-driving car to behave under certain situations, but you’re never going to be able to state the full formal correctness criteria of a self-driving car. It just won’t have any. You’d have to model the entire world formally pretty much. But it’d be really cool if you could find a way of kind of separating out that information, like what are the things you want to formerly hold for your car that you could state that way and then what are the things where it’s kind of fuzzy, I don’t know how I would even say it. It involves modeling the whole world. I just want to make sure that no one dies. And then you could get some formal guarantees about that part. I suspect that for a lot of neural nets, you would find that those formal guarantees don’t hold right now because they’re not doing true generalization or there will be robustness problems there. So I think you would actually want this to be hooked into something where you’re able to make changes to address it. And yeah, I don’t know. I think that a lot of the power here is going to come from what people are calling like neural-symbolic learning or neural-symbolic programming, bringing these strong symbolic guarantees, this is like logical world kind of back in with this machine learning is like neural nets in a way that really brings them together nicely.
[00:29:55] MS: So you kind of just mentioned casually in your Twitter thread, the year is 2030, and then you kind of started on your kind of ideal world where possibly an IDE helps us actually kind of prove these things. How far out do you think really kind of implementing this technology in a usable way by software engineers is if you were just to kind of just hazard a guess?
[00:30:20] TR: Yeah, that’s interesting. I think 2030 is an attainable goal. If I have enough people interested in working on it, I think right now a lot of academics are working on it, and then kind of separately from this, there are a lot of people in the industry who are interested in building really awesome automation to make programming easier. And a few people who are working on making these proofs easier in industry, but not, not very many. I think if the industry interests there emerges nicely with the academics, if there’s like enough exchange of ideas, enough collaboration between people, then I think that 2030 is a doable thing. It could take a little longer if people don’t talk to each other as much. It’s kind of hard because there’s an incentive in industry I think to keep things secret. Like Copilot, I had no idea that Copilot was happening, it’s like this GitHub tool, until it was released. And then the safety researchers were talking to me about how my work was relevant to it. They write a great paper about the safety concerns is awesome, but I wish I could have had that conversation with them months ago. With the amount of secrecy that often surrounds these projects and industry, it might take a little bit longer just because it’s much harder for academics to contribute ideas back that way. And I think academics really want to, because we really like seeing impact, it’s great, especially in this area. Success for us is actually used by engineers to make things better because it’s much harder to define other criteria of success in this area. So of course, we want to work with people.
[00:32:05] MS: So what’s next for you? You’ve finished your thesis. You’re going to be an assistant professor. What’s kind of next on your list of things you want to accomplish with this research and science?
[00:32:16] TR: It’s interesting because so much of it ends up being student driven. So I get really excited about all the things I want to do. And if I actually tried to do them, I think I need to clone myself like 50 times if I wanted to do all of them. So in practice, I have different sets of things that I’d really like to work on and then I kind of propose them to students and we’ll kind of come to an agreement of like here’s a place where they’re really interested and that’ll drive the direction and hopefully they’ll push me a little bit too and propose things I never thought of. But yeah, this summer has been mostly focused on improving these machine learning tools for proofs, integrating this structural information that usually these neural networks are not using back into those tools to try to make that easier. And then I’m pretty enthusiastic about bringing together that kind of technology with the more traditional automation with building better user interfaces to kind of hook it into the standard development processes. So yeah, I don’t know. My big philosophy is always meet developers where they are. Having been a software engineer, I know that if you throw a new technology at someone, they’d probably be like, “That’s a cool technology, but I don’t want to use it because it means changing what I’m doing.”
[00:33:33] MS: You’ve been there. You know?
[00:33:35] TR: Yeah. So it’s always like, “Okay, how do you trick people into slowly using this thing just by building it into their normal workflow?” So I’m pretty enthusiastic if I can find people who are interested, it’s like students who are interested in really helping with this and building interfaces for writing proofs that kind of, for me, the natural integration point is writing tests, which I think engineers are already pretty used to doing, how do I kind of go from that to writing proofs with like a really smooth experience or it doesn’t feel like you’re doing that much more than writing tests and answering a few questions.
[00:34:12] MS: Thank you so much for joining us today.
[00:34:14] BH: Yeah. Thanks a lot.
[00:34:15] TR: Thanks for hosting me. It was fun.
[00:34:26] BH: This show is produced and mixed by Levi Sharpe. Editorial oversight by Jess Lee, Peter Frank, and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, email firstname.lastname@example.org and make sure to join us for our DevDiscuss Twitter chats every Tuesday at 9:00 PM US Eastern Time. Or if you want to start your own discussion, write a post on DEV using the #discuss. Please rate and subscribe to this show on Apple Podcasts.