[ { "speaker": "Interviewer", "text": "So like you scaled a lot from 2010 to 2020." }, { "speaker": "Interviewer", "text": "What were the great sevs in Shopify history?" }, { "speaker": "Simon", "text": "One of the funniest ones was we had this problem where about every hour the primary or the writer of the MySQL clusters would stall for about 30 seconds." }, { "speaker": "Simon", "text": "And we couldn't figure out why." }, { "speaker": "Simon", "text": "And we were debugging this endlessly, could not figure out what was going on." }, { "speaker": "Simon", "text": "And someone figured out when this was going on, there was an lsof running on these machines." }, { "speaker": "Simon", "text": "I was like, \"Okay, why is this lsof running?\"" }, { "speaker": "Simon", "text": "And someone was like, you know, tracing the kernel, figuring out, \"Okay, this is like causing a soft lockup in the kernel." }, { "speaker": "Simon", "text": "Where is this lsof coming from?\"" }, { "speaker": "Simon", "text": "It turns out that some of the Percona utilities, which are some of the like, you know, Perl scripts you use to manage MySQL, uh drew in PHP as a dependency." }, { "speaker": "Simon", "text": "And PHP as a dependency has some standard cron job that every hour goes and does an lsof to figure out which files, which sessions are open, and then removes the files for sessions that are no longer actively opened by the PHP process." }, { "speaker": "Simon", "text": "And this is running every hour on all the MySQL instances." }, { "speaker": "Interviewer", "text": "What were some of like the great systems that Shopify was running which were at a scale that no one had run before?" 
}, { "speaker": "Simon", "text": "I think Facebook had probably taken MySQL to really, really big heights before Shopify, but MySQL was certainly one of the systems where a lot of companies were just really compounding on the MySQL clusters through the 2010s." }, { "speaker": "Simon", "text": "Um GitHub was in a similar situation." }, { "speaker": "Simon", "text": "Um and so I think we just spent a lot of time scaling all the layers on top of MySQL." }, { "speaker": "Simon", "text": "One of the big things that was, uh, not novel, but common to a lot of the SaaS apps in the 2010s that were written in something like Ruby or Python or whatever, was that you just had so many processes that couldn't do that many QPS per process, maybe in the tens, hundreds if you were lucky." }, { "speaker": "Simon", "text": "And all of those processes had to have an individual connection to MySQL or, um, Postgres." }, { "speaker": "Simon", "text": "Postgres is actually worse at this out of the box." }, { "speaker": "Simon", "text": "Um, and so with MySQL we had like 30, 40,000 connections open." }, { "speaker": "Simon", "text": "And so you're just spending so much time, like epolling through all these connections to find which ones you can operate on at any point in time." }, { "speaker": "Simon", "text": "So we had this problem with Memcached, we had this problem with Redis, we had this problem with MySQL." }, { "speaker": "Simon", "text": "Today there are lots of open source proxies in front of all of these systems, but at the time there weren't really, so we had to play a lot of tricks to reduce the connection count as much as possible." 
}, { "speaker": "Interviewer", "text": "So how was the like infrastructure team constructed so that you could handle like sev after sev after sev and like as you're scaling through this, like what were the like really interesting people that you know, gelled together to make all of this possible?" }, { "speaker": "Simon", "text": "When I joined in 2013, it was still a very traditional structure where, um, it was a pure ops team, right?" }, { "speaker": "Simon", "text": "People who were just incredibly good at operating Linux systems and SSHing into all of them, like sort of the, you know, servers as cattle era." }, { "speaker": "Simon", "text": "The sevs that we had were not so much just the systems falling over." }, { "speaker": "Simon", "text": "A lot of the sevs we had really came out of large flash sales driving enormous amounts of traffic into Shopify." }, { "speaker": "Interviewer", "text": "What's the worst flash sale ever?" }, { "speaker": "Simon", "text": "I might not have been around for the one that overwhelmed the system the most, um, but I remember Kylie Jenner's flash sales as, uh, particularly challenging." }, { "speaker": "Simon", "text": "I think she drove a lot of traffic even on trial accounts, like just showing up selling stuff." }, { "speaker": "Interviewer", "text": "What's like a normal flash sale?" }, { "speaker": "Interviewer", "text": "Like there's a flash sale, how does it happen?" }, { "speaker": "Interviewer", "text": "What happens to the systems when a flash sale happens?" }, { "speaker": "Simon", "text": "So a flash sale is generally someone who has a very large following, and through the 2010s we saw this: they would post on like Instagram or something." }, { "speaker": "Interviewer", "text": "Yeah, something like, you know, they might have like 10 million followers, 100 million." }, { "speaker": "Interviewer", "text": "I don't know what's a lot of followers on Instagram." 
}, { "speaker": "Interviewer", "text": "10 million?" }, { "speaker": "Interviewer", "text": "Tens of millions?" }, { "speaker": "Simon", "text": "And they have some new product they release." }, { "speaker": "Simon", "text": "And so, you drop the product and suddenly, you know, millions of people or like hundreds of thousands of people are trying to buy the same SKU at the same time." }, { "speaker": "Simon", "text": "And so, that turns into an enormous amount of inventory lock contention on MySQL on that inventory row." }, { "speaker": "Simon", "text": "And um that was the kind of thing that drove a lot of outages." }, { "speaker": "Simon", "text": "So, we had to do a lot of things to make sure that both the inventory reservation and everything would work cuz often, you have hundreds of thousands of people fighting for maybe 10,000 SKUs." }, { "speaker": "Simon", "text": "That was pretty much all the SEVs, right?" }, { "speaker": "Simon", "text": "And so, they would drive these sales and then they would just keep dropping things, right?" }, { "speaker": "Simon", "text": "Like Kanye would have a new sneaker, put it on Shopify, and then it would drive an enormous amount of traffic." }, { "speaker": "Interviewer", "text": "So, how are you guys preparing for this?" }, { "speaker": "Interviewer", "text": "So, you knew that there would be flash sales." }, { "speaker": "Interviewer", "text": "You knew that like the flash sales are getting bigger and bigger." }, { "speaker": "Interviewer", "text": "How is the planning happening?" }, { "speaker": "Interviewer", "text": "How is the team like getting ready for the next big horrible event?" }, { "speaker": "Interviewer", "text": "And like the horrible event can happen at any point." }, { "speaker": "Interviewer", "text": "It's not like a sort of, you know, for Stripe or something, there's like Cyber Monday or Black Friday, you know it's going to happen on Black Friday." 
}, { "speaker": "Interviewer", "text": "But these were like happening at random points." }, { "speaker": "Simon", "text": "Exactly." }, { "speaker": "Simon", "text": "So, these were basically random events that would take us from, let's say, 1,000 requests per second to 100,000, right?" }, { "speaker": "Simon", "text": "So, massive scaling events." }, { "speaker": "Simon", "text": "And you're right, they could come at any moment." }, { "speaker": "Simon", "text": "Of course, we also had to scale for Black Friday, Cyber Monday, but it was a little bit more predictable." }, { "speaker": "Simon", "text": "The flash sales were sort of random tests." }, { "speaker": "Simon", "text": "And we didn't know, sometimes they came from a trial account." }, { "speaker": "Simon", "text": "So, the preparation was writing load testing tools." }, { "speaker": "Simon", "text": "So, we had load testing tools internally that tried to mimic what the users were doing." }, { "speaker": "Simon", "text": "They were not particularly sophisticated, they were just little Ruby scripts that we ran on a lot of servers that were trying to reserve inventory and contend on it, and then we would try to figure out what the bottlenecks were." }, { "speaker": "Simon", "text": "Um that was the majority of the preparation we were doing." }, { "speaker": "Simon", "text": "But the first flash sales that I was a part of were while we were still in data centers." }, { "speaker": "Simon", "text": "So, we had a team that was racking servers ourselves." }, { "speaker": "Simon", "text": "It was before the cloud." }, { "speaker": "Simon", "text": "And so, it was very difficult to manage how many servers we needed." }, { "speaker": "Simon", "text": "Um it's still difficult, and I'm sure you're also now running into this at your scale because you actually have to go to the clouds." }, { "speaker": "Simon", "text": "It's not infinite, and you have to tell them how much you're expecting to use." 
}, { "speaker": "Simon", "text": "Um and uh with GPUs it's probably even more finite than in the CPU realm that I'm in now and that we were in then." }, { "speaker": "Simon", "text": "Um but we were just trying to pad with as much capacity as possible." }, { "speaker": "Simon", "text": "So, I mean at the end of the day, you don't have that many choices when you have to scale very quickly." }, { "speaker": "Simon", "text": "You can scale up." }, { "speaker": "Simon", "text": "It's too slow, and on prem you can't really scale up much." }, { "speaker": "Simon", "text": "Um the second thing is to cache harder." }, { "speaker": "Simon", "text": "So, larger TTLs and trying to move the caching up." }, { "speaker": "Simon", "text": "So, in the beginning at Shopify we were doing a lot of the caching in Ruby, but over time we moved some of that caching into NGINX directly and serving it from NGINX Lua to try to just move more and more load off the servers." }, { "speaker": "Simon", "text": "We would also do things like shedding load." }, { "speaker": "Simon", "text": "So, if someone had a massive flash sale, we would try to prioritize requests." }, { "speaker": "Simon", "text": "Like if you had a cart or you had a session, we would prioritize that request over someone just coming to the site." }, { "speaker": "Simon", "text": "So, load shedding was another mechanism that we would use." }, { "speaker": "Simon", "text": "Um you don't have that many other options other than that." }, { "speaker": "Simon", "text": "And load shedding is just a way of trying to fail gracefully and failing fairly." }, { "speaker": "Simon", "text": "Those were really the main levers that we had." }, { "speaker": "Simon", "text": "And um yeah." }, { "speaker": "Interviewer", "text": "How was the Shopify infrastructure organization organized?" }, { "speaker": "Interviewer", "text": "Like how was it in 2013?" 
}, { "speaker": "Interviewer", "text": "Like how does it go from like tens of people to like hundreds of people or thousands of people now?" }, { "speaker": "Simon", "text": "There were probably five to 10 people in 2013." }, { "speaker": "Simon", "text": "Like maybe a few thousand requests per second." }, { "speaker": "Simon", "text": "And then we had a team of um five to 10 people that were engineers without an ops background." }, { "speaker": "Simon", "text": "This is like when people just started talking about DevOps, right?" }, { "speaker": "Simon", "text": "Where DevOps was oh, maybe someone can both SSH in and run the commands and also write software." }, { "speaker": "Simon", "text": "That was almost a novel idea back in 2013." }, { "speaker": "Simon", "text": "And so that's sort of what happened at Shopify." }, { "speaker": "Simon", "text": "You had a bunch of people who were just doing performance engineering stuff in the application, and a bunch of people doing servers, and those teams ended up merging." }, { "speaker": "Simon", "text": "Um, and back then it was really scary." }, { "speaker": "Simon", "text": "It's like, \"Whoa, you're going to have a developer writing Chef, right, to configure the infrastructure?\"" }, { "speaker": "Simon", "text": "Um, and that's what we were figuring out then." }, { "speaker": "Simon", "text": "So, those teams merged, and we ended up calling that the production engineering team." }, { "speaker": "Simon", "text": "I think Facebook did a good job pioneering this pattern." }, { "speaker": "Simon", "text": "Um, and then it just started breaking into different teams." }, { "speaker": "Interviewer", "text": "So, this is actually really interesting." }, { "speaker": "Interviewer", "text": "One of the things that I've learned is that it actually wasn't just like, you know, there was a bunch of great companies scaling in 2010." 
}, { "speaker": "Interviewer", "text": "There's like, you know, early Stripe, and early GitHub, and early, you know, Shopify, and dot dot dot, there's like tons and tons of different companies, and they all collaborated with each other." }, { "speaker": "Interviewer", "text": "Like, how were the different teams, you know, infrastructure teams collaborating across the organizations to build tools? Like, what are the great tools that came out of it?" }, { "speaker": "Interviewer", "text": "What are the great open source libraries that people use now that have like stories in the infra wars of the early 2010s?" }, { "speaker": "Simon", "text": "I was just talking to um, we probably both know Sam Lambert, who um, who runs PlanetScale." }, { "speaker": "Simon", "text": "And we were talking about this, how in the 2010s, a lot of these lessons, and it's probably still the case, are not written." }, { "speaker": "Simon", "text": "They're shared on phone calls." }, { "speaker": "Simon", "text": "Like, you and I had some of those phone calls in the early days of Cursor." }, { "speaker": "Simon", "text": "It's like, \"How do you do this?\"" }, { "speaker": "Simon", "text": "and all of that." }, { "speaker": "Simon", "text": "Um, and there's a lot of wisdom that honestly the models can't really train on, cuz most of this is just in a bunch of people's heads." }, { "speaker": "Simon", "text": "Um, and in the 2010s, there was a bit of a collaborative intelligence between these companies." }, { "speaker": "Interviewer", "text": "Walk me through some of the people you talked to." }, { "speaker": "Simon", "text": "So, um, we talked to Zendesk." }, { "speaker": "Simon", "text": "They were also scaling Ruby and MySQL." }, { "speaker": "Simon", "text": "Um, Intercom was also using a bunch of the libraries that we built for caching, and also Rails and MySQL." }, { "speaker": "Simon", "text": "GitHub, of course." 
}, { "speaker": "Simon", "text": "GitHub built things like um, this open source library called gh-ost, which is something that uses the MySQL binlog to run migrations of the schema." }, { "speaker": "Simon", "text": "Um, SoundCloud built this thing called LHM, which was a way to use triggers in MySQL to do schema migrations." }, { "speaker": "Simon", "text": "At Shopify we wrote, for example, a project called Toxiproxy." }, { "speaker": "Simon", "text": "Have you ever heard of this project?" }, { "speaker": "Interviewer", "text": "Yes." }, { "speaker": "Simon", "text": "It was a project I started because everything just started failing with all these different services that we had and we needed a proxy where we could guarantee that if we took down all these different services that things wouldn't fail." }, { "speaker": "Simon", "text": "So for example, we had a database that was managing all the sessions and carts and we needed to ensure that if that database failed all of Shopify stayed up, so we could write tests against that, and I think a bunch of those companies used it." }, { "speaker": "Simon", "text": "I think Sam and I, like at GitHub and Shopify, we were on the phone together and we were like, \"Oh, what are you doing about this with Ruby and how are you scaling MySQL and is this proxy good and what are you doing about connections?\"" }, { "speaker": "Simon", "text": "And you know, we would send like patch files to each other to get Redis to scale better and there was just a lot of this collective wisdom among a lot of these like Ruby on Rails and MySQL shops in the 2010s." }, { "speaker": "Interviewer", "text": "What's the story of Logrus?" }, { "speaker": "Interviewer", "text": "Logrus is probably your most popular library?" }, { "speaker": "Simon", "text": "Yeah." 
}, { "speaker": "Simon", "text": "Um, yeah, Logrus is a Go library and I created it because I was in so many incidents where I was just so mad at myself from 6 weeks ago for not more intentionally thinking about what output I wanted to see from the system at a particular point in time." }, { "speaker": "Simon", "text": "And so I wanted an API that just forced me to sit down and think about what information I want to dump out." }, { "speaker": "Simon", "text": "So the Logrus API is like logrus.WithFields and then you have to type them all out." }, { "speaker": "Simon", "text": "Um, now that's really the only thing that's well designed about Logrus." }, { "speaker": "Simon", "text": "It does way too many allocations and I haven't had time to actively maintain it very much, but it took off." }, { "speaker": "Simon", "text": "Yeah, it took off." }, { "speaker": "Interviewer", "text": "What does it look like when a library takes off?" }, { "speaker": "Simon", "text": "I think it has like 25,000 stars and people just kept telling me, \"Oh, I'm using Logrus." }, { "speaker": "Simon", "text": "I really like it.\"" }, { "speaker": "Simon", "text": "Um and so people got the idea of just the structured logging. This was before OpenTelemetry and all of that." }, { "speaker": "Simon", "text": "So, I think it just clicked for people." }, { "speaker": "Interviewer", "text": "Uh so, one very interesting part that you mentioned in the Logrus story is, you know, you want to write things intentionally, be very careful, uh be thoughtful like in advance of when you'll actually need to be thoughtful." }, { "speaker": "Interviewer", "text": "What are other great engineering principles?" }, { "speaker": "Interviewer", "text": "What are the Simon engineering principles?" }, { "speaker": "Interviewer", "text": "What would one write in an agent.md?" 
}, { "speaker": "Simon", "text": "The way I think about software in my head is that over time the software has to age well." }, { "speaker": "Simon", "text": "And over time the software is under strain from time, patterns changing, the language changing, scale, and lots of people working on the same thing." }, { "speaker": "Simon", "text": "And the only thing that I just keep coming back to, which is a major inspiration also for turbopuffer, is just to make it as simple as possible." }, { "speaker": "Simon", "text": "I worked at Shopify for almost a decade, and it's rare these days to have worked in one code base, one company for that long." }, { "speaker": "Simon", "text": "And what it taught me was how software ages, because I saw so many projects where people would spend a long time writing an RFC and doing a big project on something, and that was not a predictor of that software aging well." }, { "speaker": "Simon", "text": "And then I saw sometimes where on the infrastructure team we just had to hold something together with spit and bubble gum, as my boss used to say, and it would age phenomenally." }, { "speaker": "Simon", "text": "Like that spit and bubble gum would be perfectly in place and holding water five years later." }, { "speaker": "Simon", "text": "And so I think the biggest thing that I learned is just to let simplicity surprise you, and complexity has to be deserved." }, { "speaker": "Simon", "text": "That's the underlying principle." }, { "speaker": "Simon", "text": "I think a lot of things follow from that." }, { "speaker": "Simon", "text": "Um I spent a lot of time thinking about how different things are going to fail if you 10x or 100x the scale." }, { "speaker": "Simon", "text": "Like if you're doing any infrastructure change, it has to be able to scale 100x." }, { "speaker": "Simon", "text": "Otherwise, there's no reason to make it." }, { "speaker": "Simon", "text": "You're just playing musical chairs." 
}, { "speaker": "Simon", "text": "Um but the aging well and letting simplicity start and complexity be deserved is extremely important and it's fundamental to how we've designed turbopuffer." }, { "speaker": "Simon", "text": "The other thing is that being on call on the last resort pager of a piece of software where real people around the world are losing millions of dollars per minute of downtime is an enormous responsibility." }, { "speaker": "Simon", "text": "We were maybe six to eight people on that pager and if you got paged, like you knew it was on you to bring the site back up." }, { "speaker": "Simon", "text": "And I think that that taught me how to write software in a way that like it changes you for better or worse, right?" }, { "speaker": "Simon", "text": "And so um with turbopuffer, for example, it's just like what's a piece of software that I'm willing to go on call for?" }, { "speaker": "Simon", "text": "And it has to be extremely simple and very easy to debug." }, { "speaker": "Simon", "text": "And back to Logrus, right?" }, { "speaker": "Simon", "text": "It's how do we make this as easy to debug as possible because every line of code that I write is a liability for someone to be on call for at 3:00 a.m." }, { "speaker": "Interviewer", "text": "Well, this is super interesting." }, { "speaker": "Interviewer", "text": "I guess one way to frame the question is, you know, as you RL the models, you kind of are writing a constitution that the models have to follow cuz not only are they trying to, you know, get better and better at passing a certain suite of tests, but also uh we want them to write good quality code." }, { "speaker": "Interviewer", "text": "Uh you know, in the same way that like you know, humans have this constitution that like you know, it's very simple, it's not very complicated, uh but it's like very widely applicable." 
}, { "speaker": "Interviewer", "text": "Is there like a software engineering constitution that we should be using to RL the models such that like the code that comes out is like good quality, you know, stands the test of time?" }, { "speaker": "Interviewer", "text": "I think a lot of people generally are worried about slop that the models produce." }, { "speaker": "Interviewer", "text": "We kind of want to train them to not produce slop." }, { "speaker": "Interviewer", "text": "What's the constitution so they don't produce slop?" }, { "speaker": "Simon", "text": "When I talk to the models about designing software, it feels like they want to design software like an eager undergrad who's read way too much Hacker News." }, { "speaker": "Simon", "text": "I would ask you how you would RL a model to encode principles like that of just simplicity over everything because it's a certain set of trade-offs, right?" }, { "speaker": "Simon", "text": "Like in turbopuffer, it's just like we keep everything on object storage." }, { "speaker": "Simon", "text": "There's no state anywhere else." }, { "speaker": "Simon", "text": "And it's like, yeah, if you want low latency for writes, like you've got to look somewhere else cuz we're not going to give you that cuz I'm not willing to accept the trade-offs." }, { "speaker": "Simon", "text": "I don't know how good the models are at navigating trade-offs like that." }, { "speaker": "Simon", "text": "It feels like they're very eager to design a very like perfect system and not that eager to try to design something that's like very simple and will age well under those kinds of pressures." }, { "speaker": "Simon", "text": "But how can you RL that?" }, { "speaker": "Interviewer", "text": "I think one can come up with various ideas, right?" }, { "speaker": "Interviewer", "text": "Like by default, even just the thing of like: write something as simple as possible." 
}, { "speaker": "Interviewer", "text": "And like when you're judging between two correct solutions, prefer the solution that's really simple." }, { "speaker": "Interviewer", "text": "And you know, another one that you could try to encode for is make sure it's short." }, { "speaker": "Interviewer", "text": "Yeah." }, { "speaker": "Interviewer", "text": "You know, if you enforce that the model produces short things, um minus the code golfing aspect, it's actually, you know, shorter things are generally simpler." }, { "speaker": "Interviewer", "text": "Uh don't write something in 100 lines that you could write in 10." }, { "speaker": "Simon", "text": "But I don't think it's always about the lines of code, right?" }, { "speaker": "Simon", "text": "Like, you know, something like turbopuffer might accept all of its writes into Kafka or something like that." }, { "speaker": "Simon", "text": "But now you're operating this whole other system." }, { "speaker": "Simon", "text": "That might be fewer lines of code, but the system complexity is a lot higher." }, { "speaker": "Simon", "text": "Do you give the models like a constitution?" }, { "speaker": "Interviewer", "text": "Yeah, we actually write down like what we think of as great code." }, { "speaker": "Interviewer", "text": "And that's like important because if you don't write such a thing, like the models will write all sorts of crazy stuff that like is unbelievable." }, { "speaker": "Interviewer", "text": "Yeah." }, { "speaker": "Simon", "text": "I think also just thinking about all the graceful failure modes, right?" }, { "speaker": "Simon", "text": "And just the recovery of the software is also something: we've always tried to preserve the property that you can shut down every single server and you lose nothing." 
}, { "speaker": "Simon", "text": "Like there's no lost data and it's surprisingly difficult an invariant to uphold." }, { "speaker": "Simon", "text": "And so I think the other thing I think about with software aging is just what are the invariants in the system that I care about, right?" }, { "speaker": "Simon", "text": "Like we've both done competitive programming, right?" }, { "speaker": "Simon", "text": "And you're always thinking about what are the invariants in the system that have to hold under every condition because otherwise you know that it's going to fail some test at some point." }, { "speaker": "Interviewer", "text": "Um, you write a famous blog." }, { "speaker": "Interviewer", "text": "Famous-ish blog." }, { "speaker": "Interviewer", "text": "It I stalled famous-ish blog maybe." }, { "speaker": "Interviewer", "text": "Uh, what's the story of you starting the blog?" }, { "speaker": "Interviewer", "text": "Uh, it was very, you know, inspirational for me when I read it back in the day where it's like, uh, for people who don't know, the blog basically walked through, you know, for seemingly complex engineering problems, how can you break it down into uh, Fermi estimates and actually get a, you know, fairly reliable and good estimate of like how the system will perform under weird, uh, conditions that are like otherwise really hard to estimate." }, { "speaker": "Simon", "text": "I had no idea that you read it." }, { "speaker": "Interviewer", "text": "I did." }, { "speaker": "Simon", "text": "Did you read it before we got to know each other?" }, { "speaker": "Interviewer", "text": "Mhm." }, { "speaker": "Interviewer", "text": "So what you're talking about is napkin math?" }, { "speaker": "Interviewer", "text": "Arvit, yeah." }, { "speaker": "Interviewer", "text": "Arvit introduced me." 
}, { "speaker": "Simon", "text": "Napkin math came out of um, my role at Shopify was I was a principal engineer and one of the things you do as a principal engineer is you're reviewing a lot of designs." }, { "speaker": "Simon", "text": "So you're reviewing designs of I want to build this product, it's going to use the database in this way." }, { "speaker": "Simon", "text": "And something that I found myself repeating a lot was that people would come to me with a benchmark and say, this is how I expect this database to perform." }, { "speaker": "Simon", "text": "Um, and then they would make decisions based on these benchmarks." }, { "speaker": "Simon", "text": "And I started to just completely distrust benchmarks." }, { "speaker": "Simon", "text": "Like I still really don't like benchmarks very much, especially not when you're trying to make a technical argument because benchmarks are like a point in time." }, { "speaker": "Simon", "text": "They don't tell me anything about what the fundamental properties of the system are." }, { "speaker": "Simon", "text": "And so the lesson that I felt like I was preaching in so many of these reviews was like, okay, this might be the benchmark but the fundamental properties of the system are that, you know, if you're trying to do a search query, for example, like you can just do a little bit of math and figure out like how many gigabytes of data you have to move to serve the query." }, { "speaker": "Simon", "text": "You see DRAM, okay, to serve this query, I um I have to move around a gigabyte of memory." }, { "speaker": "Simon", "text": "The DRAM can maybe do about 10 to 100 GB per second, so this should take somewhere between like 10 and 100 ms." 
}, { "speaker": "Simon", "text": "And they would come back with a benchmark and they're like, \"Well, it takes 5 seconds, so we can't use this database.\"" }, { "speaker": "Simon", "text": "And it's just like an unacceptable explanation to me because like there's a gap here between my first-principles understanding of the system, my like dumb high school multiplication math, and how the system is performing." }, { "speaker": "Simon", "text": "So, that gap between the first-principles understanding of the system and how the system is actually performing in the benchmark, we have to close that gap before we can conclude anything." }, { "speaker": "Simon", "text": "And that gap is either like my stupidity, like this math is wrong, or it's that this system is not performing." }, { "speaker": "Simon", "text": "But like one of us is wrong." }, { "speaker": "Simon", "text": "Um and unless someone could close that gap, it was just like you're not making a compelling argument." }, { "speaker": "Simon", "text": "The benchmark is not persuasive unless you can explain that." }, { "speaker": "Simon", "text": "The benchmark only tells you how close you are to the theoretical floor." }, { "speaker": "Simon", "text": "And so, I just started a blog to try to explain this and to give myself a bit of a cadence of like, \"Okay, you know, you're trying to do this join." }, { "speaker": "Simon", "text": "How long might that take?\"" }, { "speaker": "Simon", "text": "And then just doing a bunch of math, running the actual test, and seeing what the difference is, and then um trying to reconcile that gap, right?" }, { "speaker": "Interviewer", "text": "What's your favorite blog post in like the series of blog posts you wrote?" }, { "speaker": "Simon", "text": "I really like the one about uh TCP windows." }, { "speaker": "Simon", "text": "Have you read that one?" }, { "speaker": "Interviewer", "text": "Yes." 
}, { "speaker": "Interviewer", "text": "This one, it eats away at every single, you know, scaling system." }, { "speaker": "Simon", "text": "Yes." }, { "speaker": "Simon", "text": "It does." }, { "speaker": "Simon", "text": "And it became very important to winning a very important deal at turbopuffer actually at some point." }, { "speaker": "Simon", "text": "This blog post." }, { "speaker": "Simon", "text": "So, this blog post is basically um this was a problem that came to me at Shopify where someone was like, \"Okay, why does it take 3 seconds to load a page in Australia?\" Right?" }, { "speaker": "Simon", "text": "So, you're going from Australia to US East." }, { "speaker": "Simon", "text": "I'm guessing that round trip is probably 250 milliseconds, right?" }, { "speaker": "Simon", "text": "Like there's probably a pretty good undersea cable over the Pacific and then you're running the 60 milliseconds cross-continent." }, { "speaker": "Simon", "text": "Probably around 200 to 250 milliseconds, right?" }, { "speaker": "Simon", "text": "Um but the page load was taking 3 seconds on like a vanilla Shopify store." }, { "speaker": "Simon", "text": "You just spin it up, and I know that the Ruby and stuff is taking like 10 milliseconds." }, { "speaker": "Simon", "text": "Makes no sense." }, { "speaker": "Simon", "text": "And then I went to visit the site and I would refresh and it would take less time, but it was not a cache hit." }, { "speaker": "Simon", "text": "I would skip the caches and then it would take 260 milliseconds." }, { "speaker": "Simon", "text": "I was like, \"Oh, what is going on here, right?" }, { "speaker": "Simon", "text": "There's like 3 seconds versus 260 milliseconds in my understanding." }, { "speaker": "Simon", "text": "Like what is this gap?"
}, { "speaker": "Simon", "text": "Like is it my stupidity or is something not working here?\"" }, { "speaker": "Simon", "text": "And so I dug into it and spent a bunch of time in Wireshark looking at it." }, { "speaker": "Simon", "text": "I'm like, \"Why is it going back and forth over the Pacific so many times?\"" }, { "speaker": "Simon", "text": "Um and it turns out that in TCP, when you open a connection, so you know, Sydney dials to US East, US East will only send 10 packets back the first time, because the two ends are trying to negotiate how big the link is between them, and it has a very conservative default." }, { "speaker": "Simon", "text": "So, it'll send 10 packets." }, { "speaker": "Simon", "text": "The packets are like 1,500 bytes each." }, { "speaker": "Simon", "text": "And so a website that's like 15 kilobytes is going to load faster on most machines than one that's 16 kilobytes." }, { "speaker": "Simon", "text": "And I just sort of deduced this from Wireshark." }, { "speaker": "Simon", "text": "So, this website is in like the hundreds of kilobytes." }, { "speaker": "Simon", "text": "So, it does 15 kilobytes and then TCP says, \"Oh, okay." }, { "speaker": "Simon", "text": "I guess the link is big enough for 15 kilobytes." }, { "speaker": "Simon", "text": "Let's try 30 kilobytes the next time.\"" }, { "speaker": "Simon", "text": "So, now you've transferred 45 and it keeps doubling." }, { "speaker": "Simon", "text": "But now you're doing a lot of round trips back and forth to negotiate the size of that link." }, { "speaker": "Simon", "text": "And so what I realized was like, \"Okay, well, if you just tuned the Linux kernel's TCP settings on both ends to send 100 packets in the first round trip, this will go a lot faster.\"" }, { "speaker": "Simon", "text": "Um the downside is if you have a very very bad uplink, you're going to get a lot of packet loss, but in general it's probably a better setting."
}, { "speaker": "Simon", "text": "Um and this became applicable even to something at turbopuffer at some point, yeah." }, { "speaker": "Interviewer", "text": "So, that brings us to one of my other favorite questions." }, { "speaker": "Interviewer", "text": "Uh, so there's a two-part database question." }, { "speaker": "Interviewer", "text": "So, uh, number one, what are the great databases of the past?" }, { "speaker": "Interviewer", "text": "You know, what are the systems you found very inspirational?" }, { "speaker": "Interviewer", "text": "What have you learned from them?" }, { "speaker": "Interviewer", "text": "Um, walk me through your understanding of the evolution of databases." }, { "speaker": "Simon", "text": "So, I think there's a couple of angles to this." }, { "speaker": "Simon", "text": "So, um, the way that I think about it is that about every 15 years the ingredients are in the air to build a new database." }, { "speaker": "Simon", "text": "A lot of databases come around, right?" }, { "speaker": "Simon", "text": "Like I'm sure there's thousands of databases in production out there." }, { "speaker": "Simon", "text": "But in terms of big databases, it sort of begs for a new platform, right?" }, { "speaker": "Simon", "text": "Like Oracle for example and MySQL and Postgres were built in the '90s." }, { "speaker": "Simon", "text": "We have the web, and we have all these SaaS companies being built and lots of data going into databases." }, { "speaker": "Simon", "text": "That was the first, I would say, wave of big production databases." }, { "speaker": "Simon", "text": "There were some in the '80s, but the '80s and '70s are so far before I was born that I don't have a great understanding of them, but obviously there were databases then, often on mainframes and so on."
}, { "speaker": "Simon", "text": "Um, but the '90s was when the first wave of big database companies was built." }, { "speaker": "Simon", "text": "Then about 15 years later, you have a new workload, um, not just, you know, websites and so on trying to store data and applications starting to store data, but you had these large-scale OLAP workloads." }, { "speaker": "Simon", "text": "Um, and so then you have Snowflake and Databricks and a bunch of other companies being built, um, around this new workload." }, { "speaker": "Simon", "text": "I think big database companies basically are companies where every single company on Earth has data in that database either directly or indirectly." }, { "speaker": "Simon", "text": "Um, and so, you know, while a proverbial textile manufacturer in Bavaria, Germany, is not going to go out and buy Snowflake directly, almost certainly they're using some product that's using Snowflake or Databricks, right?" }, { "speaker": "Simon", "text": "Or tens of times." }, { "speaker": "Simon", "text": "So, their data is probably in that database tens of times." }, { "speaker": "Simon", "text": "Um and that's how the biggest database companies are built." }, { "speaker": "Simon", "text": "I think now there's a moment where that's happening again, where there's all these AI workloads that people are trying to connect to data, but that's one way I think to see it in the past." }, { "speaker": "Simon", "text": "In terms of inspirational databases, um SQLite is the first one that comes to mind for me." }, { "speaker": "Simon", "text": "What I really like about SQLite is they just have this hardcore minimalist philosophy in everything that they do."
}, { "speaker": "Simon", "text": "I think the best example, and this would go in my software AGENTS.md constitution, is to try to get as much pressure on every code path as possible, rather than having separate ones." }, { "speaker": "Simon", "text": "In SQLite, they have a phenomenal example of this." }, { "speaker": "Simon", "text": "And it's that normally when you do a join ad hoc, you'll do some version of like a nested for loop or a hash join or something like that." }, { "speaker": "Simon", "text": "And you will implement that as a particular path." }, { "speaker": "Simon", "text": "SQLite basically will construct an ad hoc B-tree in memory to perform the join, which is the same code path, as far as I understand, as the B-tree index they build when you actually build the index on disk." }, { "speaker": "Simon", "text": "So, they're putting more pressure through that single code path, with more optimization carrying over to more query plans." }, { "speaker": "Simon", "text": "I think that's brilliant." }, { "speaker": "Simon", "text": "Obviously, another database that I admire is um like Google Cloud Storage and S3 and these blob stores, where they have very very few APIs, but those APIs have a very very consistent histogram of latency, almost infinite scale, and they work, and they're extremely reliable." }, { "speaker": "Simon", "text": "I really like systems that have very few primitives, and you just know they work, and you know that they will honor their histogram bounds, even when subjected to a lot of time, so reliability, um but also a lot of scale." }, { "speaker": "Simon", "text": "Um so, I've certainly taken a lot of inspiration from that as well." }, { "speaker": "Simon", "text": "For me, there's also a bit of nostalgia, even just like in the old days of just like ripping it with FTP and phpMyAdmin and a MySQL box." }, { "speaker": "Simon", "text": "Um, I really like that."
}, { "speaker": "Simon", "text": "And uh there's a sense of that that I would love to bring into the database that we're building today." }, { "speaker": "Interviewer", "text": "So, next there's part two of my two-part database question." }, { "speaker": "Interviewer", "text": "It's one of the things I don't understand very well about the database industry." }, { "speaker": "Interviewer", "text": "This might sound naive, but most systems that you see in the world um have this property that as uh the industry matures, there's one winner and that one winner just stays the winner forever." }, { "speaker": "Interviewer", "text": "So, just to walk through a few examples, you know, operating systems, there's a lot of competition." }, { "speaker": "Interviewer", "text": "Then Linux wins in some ways." }, { "speaker": "Interviewer", "text": "Uh or like for consumers, macOS wins and that just stays that way forever." }, { "speaker": "Interviewer", "text": "And you know, for virtualization, you know, there's one winner and that, you know, winner stays the winning company forever." }, { "speaker": "Interviewer", "text": "Um and like almost all systems have this property, you know." }, { "speaker": "Interviewer", "text": "Uh it's true for OSs, it's true for uh anything you use in terms of like APIs in the real world, like even in clouds, there's like basically AWS plus like two copies and those are like the standards forever." }, { "speaker": "Interviewer", "text": "There's no like real new contender popping up every 15 years and disrupting a standard."
}, { "speaker": "Interviewer", "text": "Uh but databases are the only example where every 10 years, there's like new companies. Every year, someone starts a new database in the hopes of, you know, becoming the next big database that everyone will use, and there seems to be this consistent innovation over a period of, you know, half a century, which it doesn't feel like anything else has." }, { "speaker": "Interviewer", "text": "Uh you know, there are no great infrastructure companies where, once the infrastructure company starts, wins and stays stable, there's a new one in the category every like five years." }, { "speaker": "Interviewer", "text": "People joke that the only other answer to this is JavaScript frameworks, but outside of JavaScript frameworks, why do databases have such a property?" }, { "speaker": "Simon", "text": "So, the mental model I have of this is that um to deserve new software, um infrastructure software, there needs to be a new workload." }, { "speaker": "Simon", "text": "And so, the workload that we expect of an operating system hasn't changed enough for us to require a rewrite, right?" }, { "speaker": "Simon", "text": "You know, we expect now virtualization um and really good isolation primitives from the operating system." }, { "speaker": "Simon", "text": "Um Solaris was a lot sooner to that than Linux, right?" }, { "speaker": "Simon", "text": "Like Linux only really got good at that even just a few years ago, right?" }, { "speaker": "Simon", "text": "And they were working on that through the 2010s." }, { "speaker": "Simon", "text": "But it wasn't enough to disrupt the whole thing." }, { "speaker": "Simon", "text": "Um and so, I think with databases we do see new workloads like every 10 to 15 years, but there are not as many new workloads that really matter as I think people think."
}, { "speaker": "Simon", "text": "Now, there's a lot of niche workloads, right?" }, { "speaker": "Simon", "text": "So, there's a niche workload in graphs." }, { "speaker": "Simon", "text": "It's a fairly niche workload, but the big database companies are built when you start with some ostensibly niche workload and then expand from there." }, { "speaker": "Simon", "text": "Um I think MongoDB is a great example of that, right?" }, { "speaker": "Simon", "text": "They started with, like, very web-like, get up and running very quickly, and then they've just expanded: they do time series, they do graph, they do everything." }, { "speaker": "Simon", "text": "So, I think that's one property, workload, but workload is not enough." }, { "speaker": "Interviewer", "text": "Why can't the incumbent also get really good at that workload?" }, { "speaker": "Interviewer", "text": "Yeah, so then why can't Oracle just keep getting better and there's never a new database company but Oracle, or it's Oracle V1, V2, V3, V4?" }, { "speaker": "Interviewer", "text": "Seems like they get disrupted again and again and again." }, { "speaker": "Simon", "text": "It brings me to point number two." }, { "speaker": "Simon", "text": "So, you need a new workload because there needs to be some reason for every company on Earth to have some data in this new database, otherwise it's not going to get that big." }, { "speaker": "Simon", "text": "The second thing that you need is a fundamentally new storage architecture that is advantageous for that workload that the incumbent can't get to." }, { "speaker": "Simon", "text": "So, example of that, right?" }, { "speaker": "Simon", "text": "Is like um Snowflake and Databricks are architected with commodity HDDs or, you know, directly on object storage, um and Oracle can't do that, right?" }, { "speaker": "Simon", "text": "It does not have a separation of compute and storage."
}, { "speaker": "Simon", "text": "So, it can't do these like massive OLAP workloads even though you could ostensibly do that as an extension of the design." }, { "speaker": "Simon", "text": "Oracle has at that point what?" }, { "speaker": "Simon", "text": "30 years of heritage of assuming a tight coupling of compute and storage." }, { "speaker": "Simon", "text": "Now, right?" }, { "speaker": "Simon", "text": "We can build databases that take advantage of NVMe SSDs, and Snowflake and Databricks didn't do that because NVMe SSDs weren't even in the cloud until like 8 years ago." }, { "speaker": "Simon", "text": "And NVMe SSDs require you to architect a database fundamentally differently to take advantage of them." }, { "speaker": "Simon", "text": "You need a lot of outstanding IOPs in very few round trips." }, { "speaker": "Simon", "text": "And that's also what you need to do to get very low latency on object storage." }, { "speaker": "Simon", "text": "The other thing is metadata." }, { "speaker": "Simon", "text": "So, the first generation of databases built on object storage would have had to put all the metadata in a separate database, which has all kinds of other problems, like running in other people's clouds and running regions." }, { "speaker": "Simon", "text": "And you can build databases only as of like a year and a half ago that can have the metadata also in object storage." }, { "speaker": "Simon", "text": "So, when you have those two conditions of a new workload, um in the 90s it was like, you know, computers, internet, and 15 years ago it was OLAP with like analytics, and today it's connecting very large amounts of especially unstructured data to AI." }, { "speaker": "Simon", "text": "That's the new workload." }, { "speaker": "Simon", "text": "You need that." }, { "speaker": "Simon", "text": "The second thing you need is a new storage architecture that can't be copied, right?"
}, { "speaker": "Simon", "text": "The search engines before turbopuffer can't copy it cuz they tightly coupled compute and storage, and Oracle couldn't easily copy what Snowflake and Databricks did, which was separate compute and storage." }, { "speaker": "Interviewer", "text": "What's the story of Simon becoming a fan of databases?" }, { "speaker": "Interviewer", "text": "You're clearly very passionate about databases." }, { "speaker": "Simon", "text": "Love databases." }, { "speaker": "Simon", "text": "I think competitive programming has a lot of just like database-adjacent topics." }, { "speaker": "Simon", "text": "And so, you just start thinking in asymptotic notation and how things are executed." }, { "speaker": "Simon", "text": "Um and when I got to Shopify, I was just so drawn to the people who were working on databases." }, { "speaker": "Simon", "text": "And the thing that breaks when you scale a website is generally the database." }, { "speaker": "Simon", "text": "So, it was just always the thing that was breaking." }, { "speaker": "Simon", "text": "And so, we had to stay ahead of it not breaking tonight at the flash sale and not breaking a year from now when the flash sale would be 10 times larger." }, { "speaker": "Simon", "text": "So, it just becomes the fundamental bottleneck for scaling." }, { "speaker": "Simon", "text": "And that just drew me to it." }, { "speaker": "Simon", "text": "It's like, why are these things so hard to scale?" }, { "speaker": "Simon", "text": "Like and so, we just have to keep working on it and working on it and working on it." }, { "speaker": "Simon", "text": "And at some point, my model just started shifting from thinking about it as like a thing that executes SQL to just thinking about how the bits and bytes are laid out on disk." }, { "speaker": "Simon", "text": "They have so many fascinating trade-offs, right?"
}, { "speaker": "Simon", "text": "Like we just talked about storage architecture being different for different query workloads." }, { "speaker": "Simon", "text": "There's so many trade-offs in databases." }, { "speaker": "Simon", "text": "And I love thinking in trade-offs." }, { "speaker": "Simon", "text": "Like I love thinking in, okay, well, if we do this, it's going to be better at this but worse at that." }, { "speaker": "Simon", "text": "Especially in search, it's like applying so many parts of computer science and computer engineering to one particular domain." }, { "speaker": "Simon", "text": "There's just like an infinite amount of fascinating problems." }, { "speaker": "Interviewer", "text": "I remember the first time we met, Cursor was running into some Postgres bottlenecks." }, { "speaker": "Interviewer", "text": "And one of the most fascinating things at the time was you laid out, okay, here's how Postgres is architected." }, { "speaker": "Interviewer", "text": "I can like make a simple mental model of like things in Postgres happening this way and things in MySQL happening this other way and like at the time even though I had read about Postgres and understood some of the architecture, like I hadn't mapped it down to like all the little blocks, how they interact with each other." }, { "speaker": "Interviewer", "text": "How did you learn that?" }, { "speaker": "Interviewer", "text": "How does one learn about like the weird intricacies of like, here's how Postgres is designed in all of its little bits and components and like all the query parameters that one can play around with for those things?" }, { "speaker": "Simon", "text": "I think the short answer is that I can't help myself." }, { "speaker": "Simon", "text": "Like I don't think I'm particularly smart, so I spend a lot of time trying to dumb it down to something that I can understand and explain to other people."
}, { "speaker": "Simon", "text": "And I spend a lot of time with a little notebook, just trying to draw out exactly how the blocks move around." }, { "speaker": "Simon", "text": "And so I think it took me about 10 years to get a very very good understanding of how MySQL works." }, { "speaker": "Simon", "text": "And then when I left Shopify I was helping some of my friends' companies scale." }, { "speaker": "Simon", "text": "And I kept running into Postgres." }, { "speaker": "Simon", "text": "I was like, okay, this is cool like, you know, I don't know anything about Postgres." }, { "speaker": "Simon", "text": "So I just sat down one day and I just spent 8 hours reading the entire manual." }, { "speaker": "Simon", "text": "And with the lens of how is it similar and how is it different from MySQL?" }, { "speaker": "Simon", "text": "And just always comparing and contrasting." }, { "speaker": "Simon", "text": "So it was easier because I already had sort of a trunk to land the knowledge on." }, { "speaker": "Simon", "text": "And so the compromises just became very apparent, right?" }, { "speaker": "Simon", "text": "The way that the indexes work in Postgres is very different from how they work in MySQL, right?" }, { "speaker": "Simon", "text": "In MySQL, the way that the data is laid out on disk is dictated by the primary index." }, { "speaker": "Simon", "text": "So at Shopify we took advantage of that by having all of the data for a shop co-located together." }, { "speaker": "Simon", "text": "That's very complicated in Postgres." }, { "speaker": "Simon", "text": "Postgres handles writes in a very different way." }, { "speaker": "Simon", "text": "That requires a lot of tuning from the user that MySQL doesn't." }, { "speaker": "Simon", "text": "And I mean that was the problem you were running into when we met." }, { "speaker": "Interviewer", "text": "How do you code with AI?"
}, { "speaker": "Interviewer", "text": "Or in general, how do you use AI, like is there any way where agents and models are helping you outside of coding?" }, { "speaker": "Simon", "text": "For sure." }, { "speaker": "Simon", "text": "So um on coding in particular I just have a Cursor window with a website open." }, { "speaker": "Simon", "text": "A lot of what I do is like docs or smaller changes." }, { "speaker": "Simon", "text": "And I just yeah, I just have Cursor running." }, { "speaker": "Simon", "text": "I use like a synchronous agent and then I just choose a model based on what I need to do." }, { "speaker": "Simon", "text": "If I just need to answer a question about the code base I use the Composer model cuz it's very fast and searches a lot." }, { "speaker": "Simon", "text": "And then I use different models depending on the kind of task that I'm doing." }, { "speaker": "Simon", "text": "But generally I'm managing like a few agents inside of Cursor." }, { "speaker": "Simon", "text": "And it's especially inside of Cursor because it makes it very easy for me to review the code." }, { "speaker": "Simon", "text": "Um, I think my contrarian view is that um, we're still going to be reviewing every single line of code that goes into the database by the end of this year because that seems paramount still for the database." }, { "speaker": "Simon", "text": "Like we have a couple of individuals at turbopuffer whose job it is to just have the entire context of the code base in their wetware neural network um, and it still works really well um, because they have their manifesto and AGENTS.md also embedded in that wetware and they make good global-optimum decisions um, and the agents help them." }, { "speaker": "Interviewer", "text": "One of the things I've been excited about over the last many months is getting cloud agents to be better and better and better and better."
}, { "speaker": "Interviewer", "text": "Uh, what would it take for 50% of turbopuffer's code to be written by cloud agents?" }, { "speaker": "Interviewer", "text": "And I imagine one of the bottlenecks of writing 50% of your code is you really want to be sure that it has verified the thing in, you know, all sorts of crazy conditions." }, { "speaker": "Interviewer", "text": "You know, how do you think about testing in the, you know, age of models operating for 12, 24, 48 hours um, and in general at what point is 50% of turbopuffer's code entirely written by cloud agents?" }, { "speaker": "Simon", "text": "I think we probably should have agents running all the time trying to break the product before we do or before the customers do um, and use cloud agents." }, { "speaker": "Simon", "text": "I'm sure some of the engineers are already doing that." }, { "speaker": "Simon", "text": "I mean, generally when I use the model still on like core Rust and database things, it's still not making globally optimal decisions." }, { "speaker": "Simon", "text": "Um, so I feel like when I'm doing something in the dashboard or on the website, it can get such lax instructions and do incredibly well." }, { "speaker": "Simon", "text": "But in the database, there are just so many other properties of how is this API going to age, right?" }, { "speaker": "Simon", "text": "Like how is the data going to be laid out?" }, { "speaker": "Simon", "text": "And there's all of these war lessons, right?" }, { "speaker": "Simon", "text": "That the model has not learned about how to operate this." }, { "speaker": "Simon", "text": "Um what has to be true?" }, { "speaker": "Simon", "text": "I think the model just has to get better." }, { "speaker": "Simon", "text": "I don't think I have any like more wisdom on that."
}, { "speaker": "Simon", "text": "And I think that I want the model to talk more about um how something is going to age, what irreversible decisions this makes on the storage architecture, like ostensibly irreversible decisions." }, { "speaker": "Interviewer", "text": "So two or three turbopuffer questions." }, { "speaker": "Interviewer", "text": "So first one, what would it take for turbopuffer to become Google scale?" }, { "speaker": "Interviewer", "text": "To index the entire web?" }, { "speaker": "Simon", "text": "Yes." }, { "speaker": "Simon", "text": "It can already do that." }, { "speaker": "Simon", "text": "Uh index the entire web and serve the QPS that, you know, Google would serve." }, { "speaker": "Simon", "text": "Like, actually be usable by Google." }, { "speaker": "Simon", "text": "So I mean, I don't really know. Google, I'm sure, is a web of like a thousand services to do it, but we have customers that have indexed the entire web into turbopuffer um and it works and it can do thousands of QPS." }, { "speaker": "Simon", "text": "Now, I'm sure Google has their hands on more than the hundred billion or so documents that um we procured in the data sets that have been indexed into turbopuffer, maybe into the trillions, but it can certainly be done." }, { "speaker": "Simon", "text": "Um there's no reason why that wouldn't continue to scale and we've done the hundred billion with a P99 of 200 milliseconds, P50 of like 40 milliseconds." }, { "speaker": "Simon", "text": "So that's quite possible." }, { "speaker": "Simon", "text": "Now, to get the relevance to where something like Google is, you might have to do a lot more um and that would be like quite a few iteration cycles, but I mean you can index the entire web with not that many servers um with the turbopuffer architecture." }, { "speaker": "Interviewer", "text": "Could it ever be built without S3?"
}, { "speaker": "Interviewer", "text": "Is S3 sort of being super strongly consistent just like one of the great engineering feats of like the last decade?" }, { "speaker": "Simon", "text": "Yes, I think so." }, { "speaker": "Simon", "text": "Back to the original question, the question about why there's new database companies coming up every 10 years, is that we needed NVMe SSDs, which were not available in the clouds until like the late 2010s." }, { "speaker": "Simon", "text": "We needed S3 to be consistent, which did not happen until December 2020, which is like mind-blowingly late." }, { "speaker": "Simon", "text": "And then we needed compare-and-swap." }, { "speaker": "Interviewer", "text": "Why is it so hard for S3 to be strongly consistent?" }, { "speaker": "Simon", "text": "I don't know." }, { "speaker": "Interviewer", "text": "[laughter]" }, { "speaker": "Simon", "text": "Um I know in Google Cloud Storage, I think it's probably Spanner or something sitting in front." }, { "speaker": "Simon", "text": "Um and so that's a little bit easier for me to understand." }, { "speaker": "Simon", "text": "There are more or less no details on the S3 metadata layer." }, { "speaker": "Simon", "text": "That metadata layer is presumably very difficult to operate on." }, { "speaker": "Simon", "text": "And I know that S3, you know, yes the API surface is small, but I know they invest a lot in formal verification and all these different things, presumably because they have little bugs that people rely on and they have to make sure that even if they change the tiniest thing, it doesn't break for a lot of other people." }, { "speaker": "Simon", "text": "Um and so I don't know what makes it so difficult." }, { "speaker": "Simon", "text": "I think probably the thing that makes it most difficult is that the system existed for 15, maybe 20 years without having that."
}, { "speaker": "Simon", "text": "And from the systems that I've seen, it's a very fundamental assumption in the system and you've been engineering for 15 years assuming that this is not true." }, { "speaker": "Simon", "text": "Yeah, I'm sure that took them like 5 years to do." }, { "speaker": "Simon", "text": "I'm guessing Google Cloud Storage, because they were fronted by Spanner, might have been consistent much earlier, if not from day one." }, { "speaker": "Simon", "text": "But I also know that they consider that one of the greatest mistakes of S3, that they weren't consistent from day one." }, { "speaker": "Interviewer", "text": "What else do you like about S3?" }, { "speaker": "Interviewer", "text": "And when I say S3, I guess I mean object storage." }, { "speaker": "Simon", "text": "Yeah." }, { "speaker": "Simon", "text": "Um I think just the simplicity, right?" }, { "speaker": "Simon", "text": "Like it's like um tight latency bounds on very few things." }, { "speaker": "Simon", "text": "I think it's very very predictable systems." }, { "speaker": "Simon", "text": "It automatically shards and it's infinite." }, { "speaker": "Simon", "text": "turbopuffer would not exist without it, right?" }, { "speaker": "Simon", "text": "Like if this was like 15 years ago, we would have people full-time just racking HDDs and like trying to strike deals to get as many HDDs as possible and we'd have the problem that you probably have with GPUs, but just with HDDs." }, { "speaker": "Simon", "text": "Um I'm very happy to not have that problem." }, { "speaker": "Simon", "text": "turbopuffer would not exist and I probably wouldn't have started the company because I would not have wanted to be on call for a product that required racking HDDs and flying to Ashburn every week to rack more HDDs." }, { "speaker": "Interviewer", "text": "One of the really weird things I've had recently is it's getting harder and harder to interview people."
}, { "speaker": "Interviewer", "text": "Uh where one of the great interviewing tricks of the last decade was, instead of interviewing them on the blackboard, uh give them a really complicated code base and let them try to do something in a complicated code base." }, { "speaker": "Interviewer", "text": "Uh because uh naturally doing things in complicated code bases requires an incredible amount of RAM." }, { "speaker": "Interviewer", "text": "Like you need to be able to hold a lot of things in your head while still being able to produce net new stuff other than just getting, you know, stuck or being asked to deliver, you know, day-to-day if like if you need to deliver in 2 days uh something that is required of you, you can you just shut everything else off and actually focus on the thing." }, { "speaker": "Interviewer", "text": "Sadly uh you know language models have just gotten so good uh that you can't use that interviewing trick anymore." }, { "speaker": "Interviewer", "text": "How do you think we go back to first principles and interview great engineers?" }, { "speaker": "Interviewer", "text": "Because, you know, there's obviously drawbacks of doing things like whiteboard interviews because whiteboard interviews are uh not the thing you're doing on a day-to-day basis." }, { "speaker": "Interviewer", "text": "It isn't exactly the skill that you're trusting." }, { "speaker": "Simon", "text": "This is something I've spent a lot of time thinking about." }, { "speaker": "Simon", "text": "Um I have a document called traits of the P99." }, { "speaker": "Simon", "text": "Um I don't know if I've ever shared an early draft of this cuz you and I have spent a lot of time talking about interviewing over the years." }, { "speaker": "Simon", "text": "Um and the list, it's just a list of traits of the P99."
}, { "speaker": "Simon", "text": "And it's a long list and I can't share all of them cuz that would be too easy, but I think that the P99 is someone that you would describe as fast." }, { "speaker": "Simon", "text": "It comes out in very, very different ways in a lot of people, right?" }, { "speaker": "Simon", "text": "Um you and I can talk very fast and so that can be interpreted as fast, but some people are fast because they move very deliberately and there are not bugs in their code and they just keep moving forward and it's just one step forward, one step forward and it's never two steps backwards." }, { "speaker": "Simon", "text": "Um it comes out differently, but I've never met a P99 that I could not in some way describe as fast." }, { "speaker": "Simon", "text": "Another trait of the P99, I think, is that they have bent something to their will." }, { "speaker": "Simon", "text": "Um whether it's software or their trajectory or something like that, they have just made like the machine or something do what I mean, in Silicon Valley you call this agency, all right?" }, { "speaker": "Simon", "text": "But it's also agency or facility with the machine itself to get it to do what you need to do." }, { "speaker": "Simon", "text": "I think P99s try to the best of their abilities to surround themselves with P99s, and they probably have multiple moments in their life where they discovered that there was another level, and they could not help themselves but to try to get there, right?" }, { "speaker": "Simon", "text": "You're an immigrant to the new world like myself, right?" }, { "speaker": "Simon", "text": "And so you probably at some point tapped out of the P99s in your local community, and you went looking somewhere else, right?"
}, { "speaker": "Simon", "text": "Um and I think one of the interviews that we do is an interview that we call the life story, and our recruiter Jen spends um an hour and a half with people going through their life, like all the way from when they first were introduced to something that they were excited about." }, { "speaker": "Simon", "text": "And I think that the P99 got very excited about something very early, and they continued to seek out the next nine, like P99 to P999, right?" }, { "speaker": "Simon", "text": "And for you, I'm sure it was um like going to MIT, you discovered another level, right?" }, { "speaker": "Simon", "text": "Being at IMO, you discovered another level." }, { "speaker": "Simon", "text": "Um and I think that's another trait that we look for." }, { "speaker": "Simon", "text": "And so I just have a list of these, like they're obsessive, like they can't help themselves but get in the weeds, like on some detail they find particularly interesting about something, they just can't help themselves." }, { "speaker": "Simon", "text": "They don't remain at this level, like it's not like a monotonous abstraction level, they can't help themselves but go up and down." }, { "speaker": "Simon", "text": "And so I have a list of these, and I just look at them after I've interviewed someone, after I talked to someone, I have a particular interview that I do where I look for that." }, { "speaker": "Interviewer", "text": "What do you do?" }, { "speaker": "Interviewer", "text": "What do you look for now?" }, { "speaker": "Interviewer", "text": "Um actually, I've been thinking about it a lot." }, { "speaker": "Interviewer", "text": "I mean, I don't have like a great answer." }, { "speaker": "Interviewer", "text": "I do think probably we should go back to logic puzzles." }, { "speaker": "Interviewer", "text": "Uh where I think there's something that comes out when you go into something like these weird probability questions."
}, { "speaker": "Interviewer", "text": "Where probability questions require something like clarity of thought." }, { "speaker": "Interviewer", "text": "In periods of stress having clarity of thought is more important than uh one realizes." }, { "speaker": "Interviewer", "text": "And of course a probability question itself you know, can be easily solved if you're like, you know, very very well versed in probability." }, { "speaker": "Interviewer", "text": "Uh but also uh just being able to stay calm and then say, \"We'll reason through this together.\"" }, { "speaker": "Interviewer", "text": "Uh so for most people that, you know, it's actually a bonus." }, { "speaker": "Interviewer", "text": "Like the whole reason to pick a probability question is to pick for people who, you know, don't have probability experience." }, { "speaker": "Interviewer", "text": "Doesn't have to be probability in general." }, { "speaker": "Interviewer", "text": "It could be anything where you're like, \"I've heard of, you know, various questions where you come up with scenarios.\"" }, { "speaker": "Interviewer", "text": "And like the scenario is something that is rather tough and like the problems get tougher and tougher over time." }, { "speaker": "Interviewer", "text": "And it's just really interesting because, you know, if it's very easy to produce code, uh probably the thing that uh matters is uh can you clearly think through the system and then can you clearly articulate uh what you want out of the system and what do you not want out of the system?" }, { "speaker": "Interviewer", "text": "What are the invariants you care about?" }, { "speaker": "Interviewer", "text": "So as those things like back in the day, uh if you had to trade off between a philosopher and you know, someone who's a workhorse coder, you probably would pick the workhorse coder because they had like uh refined their skill at like Next.js or something."
}, { "speaker": "Interviewer", "text": "Uh but actually that's become less valuable over time." }, { "speaker": "Interviewer", "text": "So you probably want people who are just incredibly thoughtful." }, { "speaker": "Interviewer", "text": "I think one of the things you mentioned that like I think people really underappreciate is, you know, the person who's like one step forward and never two steps backward is like uh vastly underrated in the world." }, { "speaker": "Interviewer", "text": "I think those people are just very very very calibrated and like uh you would always trust them on the hardest problems in the company." }, { "speaker": "Interviewer", "text": "And that kind of skill, even though it's hard to interview for, is like pretty important." }, { "speaker": "Simon", "text": "I think one thing that is probably the best way that we try to figure out if people have clarity of thought is to ask them how to design a system and then see how they ratchet up complexity and navigate the trade-offs, right?" }, { "speaker": "Simon", "text": "That clarity of thought of like you're going to start with the simplest thing and then you're going to attack it with some scale or whatever and then you're going to... we talked about like the serving complexity earlier, right?" }, { "speaker": "Simon", "text": "Like sometimes engineers gravitate towards designing the perfect system that doesn't really have any trade-offs, but that's not one step forward, one step forward." }, { "speaker": "Simon", "text": "That's like trying to take 10 steps forward in the first step and you're going to fail cuz you're going to make an assumption that's wrong." }, { "speaker": "Simon", "text": "Um, I think the P99 navigates that ladder of complexity extremely elegantly." }, { "speaker": "Interviewer", "text": "What is the next frontier of databases?"
}, { "speaker": "Interviewer", "text": "15 years from now what's the next frontier?" }, { "speaker": "Interviewer", "text": "Do you think we have the last database?" }, { "speaker": "Simon", "text": "Definitely not." }, { "speaker": "Simon", "text": "There's going to be hardware advances, right?" }, { "speaker": "Simon", "text": "In the next like 10 to 15 years." }, { "speaker": "Interviewer", "text": "GPU databases?" }, { "speaker": "Simon", "text": "It might be like people have tried that, but I don't think it's quite ready, right?" }, { "speaker": "Simon", "text": "But like it makes it would make sense to me that the GPU becomes more and more general purpose, could do more and more instructions and can move more and more bandwidth." }, { "speaker": "Simon", "text": "So, that might be the next platform, right?" }, { "speaker": "Simon", "text": "Does it take 10 years again?" }, { "speaker": "Simon", "text": "I I don't know." }, { "speaker": "Simon", "text": "The GPUs are evolving very fast." }, { "speaker": "Simon", "text": "Um, so I think that's one go at it." }, { "speaker": "Simon", "text": "I think that it's like it's been a pretty like pretty consistent pattern that machines like this has happened in CPUs, it's happened in disk and now it's happened on object storage where a lot of speculation." }, { "speaker": "Simon", "text": "So, basically having like a very large amount of outstanding requests at once in few round trips." }, { "speaker": "Simon", "text": "That's been a good bet and that's also how GPUs work, right?" }, { "speaker": "Simon", "text": "You try to give them a lot of things to chew on at once, get it back and then do something else conditioned on that." }, { "speaker": "Simon", "text": "So, like making everything good for speculation and predictability has worked out really well." }, { "speaker": "Interviewer", "text": "Would you be surprised if we just we we have the last database, we won't need more database in the future." 
}, { "speaker": "Interviewer", "text": "And I actually think, like, a controversial opinion is I think turbopuffer uh and uh you know, OLAP plus Postgres is probably combined or like the levels of scale that we found to hit them." }, { "speaker": "Interviewer", "text": "Maybe not uh Postgres, but something like Vitess plus MemSQL." }, { "speaker": "Interviewer", "text": "I think we've hit everything we need." }, { "speaker": "Interviewer", "text": "I don't think we'll need another one." }, { "speaker": "Simon", "text": "I would hesitate to say never." }, { "speaker": "Simon", "text": "I would love for turbopuffer to be the last database." }, { "speaker": "Simon", "text": "Um call it a day." }, { "speaker": "Simon", "text": "Yeah." }, { "speaker": "Interviewer", "text": "Thank you so much for um you know, coming by the office." }, { "speaker": "Interviewer", "text": "I feel like I've learned a lot from you over time both in you know, architecting great systems, but also you know, building high-performance engineering teams." }, { "speaker": "Interviewer", "text": "And thank you for being a great partner to Cursor also." }, { "speaker": "Interviewer", "text": "Like I think there's you know, so many points of Cursor where like life would have been very hard if we didn't have Simon to rely on." }, { "speaker": "Interviewer", "text": "And yeah." }, { "speaker": "Interviewer", "text": "Thank you so much for coming down here." }, { "speaker": "Simon", "text": "I really appreciate that." }, { "speaker": "Simon", "text": "I mean, you know, we've had many conversations on the phone about infrastructure and you have a good enough team now that you don't call me anymore." }, { "speaker": "Simon", "text": "Um but I do kind of miss it." }, { "speaker": "Simon", "text": "And it meant a lot to me and it means a lot for you to say that cuz you built a great company."
}, { "speaker": "Simon", "text": "To have been a very very small part of that, it really means a lot." }, { "speaker": "Interviewer", "text": "Thank you so much." } ]