[
  {
    "i": 0,
    "speaker": "Speaker 1",
    "text": "You need all the infrastructure to run these environments that have to mimic as closely as possible what a user's computer would look like. And it's very important as closely as possible because"
  },
  {
    "i": 1,
    "speaker": "Speaker 1",
    "text": "sometimes the model can actually figure out when it's being run in like a fake environment or not a real one and it has like different behaviors during RL than in production. Are you saying it being"
  },
  {
    "i": 2,
    "speaker": "Speaker 2",
    "text": "conscious that it's being is in a fake environment and it starts being behaving differently?"
  },
  {
    "i": 3,
    "speaker": "Speaker 1",
    "text": ">> Yes. Yes."
  },
  {
    "i": 4,
    "speaker": "Speaker 1",
    "text": ">> Interesting. Like it's like oh I'm in a fake environment. I've learned a few tricks to like get the better reward in this environment and let me try them out. Models love to cheat. RL is really"
  },
  {
    "i": 5,
    "speaker": "Speaker 3",
    "text": "good at encouraging cheating."
  },
  {
    "i": 6,
    "speaker": "Speaker 2",
    "text": ">> [music] [music] >> I'm delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today. Federico, you are the research lead on Composer 2 at Cursor, Cursor's new agentic coding model. And"
  },
  {
    "i": 7,
    "speaker": "Speaker 2",
    "text": "Dima, you spent how many of the last few months moonlighting at Cursor in order to support all of the infrastructure required to make this gargantuan training task happen. And so, I'm"
  },
  {
    "i": 8,
    "speaker": "Speaker 2",
    "text": "excited to talk to both of you today about how the training of Composer 2 came together, what hard problems you solved together, and what you think it means for the future of of AI and"
  },
  {
    "i": 9,
    "speaker": "Speaker 2",
    "text": "foundation model companies. Exciting."
  },
  {
    "i": 10,
    "speaker": "Speaker 2",
    "text": "Yeah, exciting. Thank you for having us."
  },
  {
    "i": 11,
    "speaker": "Speaker 2",
    "text": "Thanks for joining. Okay, let's dive right in. For those who haven't been following as closely, uh Cursor recently announced Composer 2, which is an agentic coding model uh meant for long"
  },
  {
    "i": 12,
    "speaker": "Speaker 2",
    "text": "horizon coding tasks. Federico, uh up till now, um Cursor was mostly uh enabling uh other people's uh coding agents. Uh what was the impetus for Cursor to lean so heavily into Composer"
  },
  {
    "i": 13,
    "speaker": "Speaker 1",
    "text": "2, and how existential is it for you to become not just an application company but also a foundation model company yourselves? The reason why we started looking into training our own models is"
  },
  {
    "i": 14,
    "speaker": "Speaker 1",
    "text": "you can sort of think about the model as sort of like like a storage drive. It has certain amount of bits that it can store in its weights. And the idea is very simple, you know, like we care"
  },
  {
    "i": 15,
    "speaker": "Speaker 1",
    "text": "about only one task."
  },
  {
    "i": 16,
    "speaker": "Speaker 1",
    "text": "We don't even care about coding or programming necessarily. We care about software engineering inside cursor and inside cursor only. And so, what if we were to allocate all of the bits"
  },
  {
    "i": 17,
    "speaker": "Speaker 1",
    "text": "of information that can be stored inside the model weights to that one particular task?"
  },
  {
    "i": 18,
    "speaker": "Speaker 1",
    "text": "Also, as people may have noticed, composer is order of magnitude less expensive than Opus and other like coding models because we can just simply specialize all of the model weights to"
  },
  {
    "i": 19,
    "speaker": "Speaker 2",
    "text": "that particular task. And so, we can serve like a smaller model or something of that sort, yeah. So, it's about let's make sure every single bit of weight or information we have is"
  },
  {
    "i": 20,
    "speaker": "Speaker 2",
    "text": "dedicated toward the specific problem that we have at hand. Exactly. Got it."
  },
  {
    "i": 21,
    "speaker": "Speaker 2",
    "text": "Um that seems like it's an almost generalizable problem. Uh Dima, I'm curious your perspective. Do you think that every application company should be looking at cursor as a harbinger of"
  },
  {
    "i": 22,
    "speaker": "Speaker 3",
    "text": "what's to come? Like should they all be looking to do the same thing? Yeah, absolutely. I mean, we actually generally see it as a pattern of kind of evolution of the applications. You maybe"
  },
  {
    "i": 23,
    "speaker": "Speaker 3",
    "text": "start prototyping, you might be using kind of off-the-shelf model to get something running, maybe do some prompt engineering, figure out how your harness works. But the most kind of leveraged"
  },
  {
    "i": 24,
    "speaker": "Speaker 3",
    "text": "attribute of your application is the actual usage of user data or particular specific aspects of how this application works, maybe some aspects of your harness, which tools do you provide, how the"
  },
  {
    "i": 25,
    "speaker": "Speaker 3",
    "text": "application works, kind of really important bits which are important for your application. And the right way to capture that, you can do a little bit of that through prompting, but really the"
  },
  {
    "i": 26,
    "speaker": "Speaker 3",
    "text": "right way to do this is craft your model to act in your environment."
  },
  {
    "i": 27,
    "speaker": "Speaker 1",
    "text": ">> Yeah, absolutely. Like there are certain tools the agent calls that it's very hard to succinctly describe exactly the behavior of that tool to the model. And you know, with just like post-training,"
  },
  {
    "i": 28,
    "speaker": "Speaker 1",
    "text": "we can bake in the optimal way to use those tools. Like Composer, we do serve a prompt to Composer, but I I think the way we are training it, it would work even without a prompt and it would know"
  },
  {
    "i": 29,
    "speaker": "Speaker 1",
    "text": "what to do just because like we are intrinsically pushing the model to like the right direction of how it should act throughout our training. Basically, there's kind of like upper bound of like"
  },
  {
    "i": 30,
    "speaker": "Speaker 3",
    "text": "how far you can get with prompt engineering. And if you want to uh craft really great AI products, you have to go through kind of fine-tuning and influence model behavior. That's kind of"
  },
  {
    "i": 31,
    "speaker": "Speaker 3",
    "text": "one reason. I mean, reason number two is what Federico mentioned is kind of cost trade-off or XP trade-off. Like the way we kind of view it at Fireworks is that when you're trying to do optimization,"
  },
  {
    "i": 32,
    "speaker": "Speaker 3",
    "text": "you have this like three-dimensional trade-off between quality, speed, and cost."
  },
  {
    "i": 33,
    "speaker": "Speaker 3",
    "text": "And uh you can go quite far and we're doing it with all of our customers initially. We can go quite far with just optimizing infrastructure, but when you start getting to model training, you can"
  },
  {
    "i": 34,
    "speaker": "Speaker 3",
    "text": "really push this trade-off much further and you can get better model at fraction of the cost running much faster. And you know, Composer is a great example of >> Can I push on this a little bit? I want"
  },
  {
    "i": 35,
    "speaker": "Speaker 2",
    "text": "to ask you if this approach is better lesson pills. And we were we were actually all talking about TabNine on the walk-in. I'm remembering before the LLM era, there were these like small"
  },
  {
    "i": 36,
    "speaker": "Speaker 2",
    "text": "specialized coding models. And one of the things that was I think surprising to to a lot of people was as you've scaled up, you know, you scaled up just training on the internet and a lot of a"
  },
  {
    "i": 37,
    "speaker": "Speaker 2",
    "text": "bunch of English text and other languages, actually the models themselves got inherently better at coding as well. And so at least the trend line I've seen so far is just like bigger models perform better on"
  },
  {
    "i": 38,
    "speaker": "Speaker 2",
    "text": "everything including on coding. Is what you guys are saying, does that go against the grain of the better lesson?"
  },
  {
    "i": 39,
    "speaker": "Speaker 1",
    "text": "I think no, but one one sort of like thing to point out is that the big models trained by the labs train on a lot of code as well. Like code is one of the main tasks the labs are interested"
  },
  {
    "i": 40,
    "speaker": "Speaker 1",
    "text": "in pushing and so they don't just generalize to it. They're a bit specialized as well. I think for our case, actually, you know, if we believe about the bitter lesson, we are just pushing very hard on the data dimension,"
  },
  {
    "i": 41,
    "speaker": "Speaker 1",
    "text": "and we know that the models inherently have finite capacity. And so, if we want to saturate all that capacity, we need to scale data. And in order to ingest more data, we we need to like free up"
  },
  {
    "i": 42,
    "speaker": "Speaker 2",
    "text": "the weights from distractions the model may have. Mhm, okay. Got it. Super interesting. Okay, let's dig into the training of Composer 2. You launched a couple weeks ago, immediately grabbed"
  },
  {
    "i": 43,
    "speaker": "Speaker 2",
    "text": "attention. Strong benchmark numbers, much lower cost to to run inference on."
  },
  {
    "i": 44,
    "speaker": "Speaker 2",
    "text": "What's the short version of how Composer 2 works, and and what you guys did to make it so performant?"
  },
  {
    "i": 45,
    "speaker": "Speaker 1",
    "text": ">> We started from a very strong base, which is uh Kimmy 2.5. It's like a 1 trillion and parameter MoE, that's 30 B active, so very very sparse, actually."
  },
  {
    "i": 46,
    "speaker": "Speaker 1",
    "text": "We sort of like looked at the stock and realized there are like two axes. So, mainly Composer 1 was just pushing on one of these axes, which is reinforcement learning, but Composer 2"
  },
  {
    "i": 47,
    "speaker": "Speaker 1",
    "text": "pushes in two different axes. One is continual pre-training, and the other is reinforcement learning. So, the thing that made Composer 2 very good is pushing in both of these directions. So,"
  },
  {
    "i": 48,
    "speaker": "Speaker 1",
    "text": "we started off the training run by doing lots of mid-training on code tokens, almost sort of pre-training scale, actually. And then, coming out of that mid-training run, we took the"
  },
  {
    "i": 49,
    "speaker": "Speaker 1",
    "text": "checkpoints and we did very large-scale RL on lots of lots of tasks."
  },
  {
    "i": 50,
    "speaker": "Speaker 2",
    "text": "Okay, and then the premise here would be because Cursor sits in the middle of so many interesting coding tokens, you actually pretty uniquely have access to data to be able to train at almost"
  },
  {
    "i": 51,
    "speaker": "Speaker 1",
    "text": "pre-training scale. Yeah. Why not pre-train your own model, then?"
  },
  {
    "i": 52,
    "speaker": "Speaker 1",
    "text": "We just think about our approach from top-down instead of bottom-up. So, like, how do we get a model that's useful to users in the least time possible if we were to start from the bottom, sort of"
  },
  {
    "i": 53,
    "speaker": "Speaker 1",
    "text": "figure out how how we do pre-training and then scale it up to mid-training and then, okay, now we figured out mid-training, now we do reinforcement learning. That would take a very long"
  },
  {
    "i": 54,
    "speaker": "Speaker 1",
    "text": "time to get a model out to our users. By doing it the other way around, we were able to give our useful model to our users in very little time. So, hopefully, you know, like next Composer versions are going to be"
  },
  {
    "i": 55,
    "speaker": "Speaker 2",
    "text": "our own model instead of basing it off an open-source base. And what is the model roughly learning in the kind of mid-training step? And what is the model learning in the post-training step for"
  },
  {
    "i": 56,
    "speaker": "Speaker 1",
    "text": "you? Yeah, so in mid-training, it's sort of just kind of learning about libraries of code and learning about specific code patterns that are very common, like just world knowledge as well. There is like"
  },
  {
    "i": 57,
    "speaker": "Speaker 1",
    "text": "web data there as well. And this is sort of just creating a wider distribution that then reinforcement learning can sharpen on. And so, during reinforcement learning, you know, the model gets to"
  },
  {
    "i": 58,
    "speaker": "Speaker 1",
    "text": "play directly with the cursor harness."
  },
  {
    "i": 59,
    "speaker": "Speaker 1",
    "text": "And so, it gets to learn about the world the model is going to live in for the rest of its life, right? In in some way."
  },
  {
    "i": 60,
    "speaker": "Speaker 1",
    "text": "And and so, then during reinforcement learning, that's where it learns how to call tools properly, how to navigate its environment, how to write correct code."
  },
  {
    "i": 61,
    "speaker": "Speaker 1",
    "text": "Because during mid-training, it it learns how to write code. That doesn't necessarily mean it learns how to write correct code. We try to train on code that is largely only correct, but the"
  },
  {
    "i": 62,
    "speaker": "Speaker 1",
    "text": "model doesn't actually know how to differentiate between the two. While in RL, one of the key things that we're doing is we're kind of tuning the feature of the model saying, \"Hey, now"
  },
  {
    "i": 63,
    "speaker": "Speaker 1",
    "text": "you got to write correct code all the time.\" Exactly."
  },
  {
    "i": 64,
    "speaker": "Speaker 2",
    "text": ">> Interesting. And is the is the model after mid-training, is that similar to the model you guys have on Tab autocomplete, or is that a different core competency? Yeah, I mean, it's uh"
  },
  {
    "i": 65,
    "speaker": "Speaker 1",
    "text": "yeah, I think I would put it like that because like during mid-training we are just doing next token prediction, you know, like how well you predict the next token and then the token after that. So,"
  },
  {
    "i": 66,
    "speaker": "Speaker 2",
    "text": "yeah. So, why not just pre-train your tab on a complete model then? Why mid-train is a different model then?"
  },
  {
    "i": 67,
    "speaker": "Speaker 1",
    "text": ">> Yeah, I mean tab is a very small model because it's like a super low latency model."
  },
  {
    "i": 68,
    "speaker": "Speaker 1",
    "text": "Um, yes, we want it to be very fast. So, like the core two distinctions about the base models here is that tab is like small and and composer is quite large. I see. I see."
  },
  {
    "i": 69,
    "speaker": "Speaker 2",
    "text": "Okay. So, it seems like a lot of the focus of what you guys did for composer two was this large-scale reinforcement learning run. Can you break that down for us? Like what goes into that and"
  },
  {
    "i": 70,
    "speaker": "Speaker 2",
    "text": "what are the various hard problems you solved along the way?"
  },
  {
    "i": 71,
    "speaker": "Speaker 3",
    "text": "When you do a rollout it's quite different from like from like pre-training and mid-training because, you know, you're just trying to predict next token. You're actually running the"
  },
  {
    "i": 72,
    "speaker": "Speaker 3",
    "text": "entire hardness, like the entire experiment. You're letting the model act in the environment, see how how it performs for a given rollout. That's the terminology which is called rollout. And"
  },
  {
    "i": 73,
    "speaker": "Speaker 3",
    "text": "kind of assign it reward based on if it did something correctly or not, which might be some using LLM as a judge or maybe something verifiable like does the code compile or something like this,"
  },
  {
    "i": 74,
    "speaker": "Speaker 3",
    "text": "which actually means that compared to this regular training you need a bunch of other components. Like you still need large-scale training. You still need to orchestrate tens of thousands of GPUs to"
  },
  {
    "i": 75,
    "speaker": "Speaker 3",
    "text": "the forward backward propagation, do all the stuff you do in mid-training and pre-training. But now you also need to orchestrate a bunch of environments. You need to run model inference because"
  },
  {
    "i": 76,
    "speaker": "Speaker 3",
    "text": "but when you do this this rollout you're effectively running like real cursor session in some sense, right? You have your >> is like a forward pass? Uh, no, rollout is basically your entire like agent"
  },
  {
    "i": 77,
    "speaker": "Speaker 3",
    "text": "session from cursor, right? So, you we basically means it might take something like 50 turns. Model will take a take your initial prompt, then decides to call some tools, you want to execute"
  },
  {
    "i": 78,
    "speaker": "Speaker 3",
    "text": "those tools, then model generates a bunch of other code, kind of entire session which you when you interact with agent in cursor, right? You you kind of simulate this entire session as a part"
  },
  {
    "i": 79,
    "speaker": "Speaker 3",
    "text": "of your training run. You get to final reward and use the you get use that signal to now go back to trainer and kind incorporate it into the model weights. So, you have this kind of very"
  },
  {
    "i": 80,
    "speaker": "Speaker 3",
    "text": "big loop update loop, which is very heterogeneous, right? Because you have all these like different components working together. And now you're trying to orchestrate all of this to work"
  },
  {
    "i": 81,
    "speaker": "Speaker 3",
    "text": "efficiently and work with high throughputs because GPUs are expensive and you want to get your model trained quickly and in an economic fashion. So, that's by by itself is like very"
  },
  {
    "i": 82,
    "speaker": "Speaker 3",
    "text": "interesting kind of problem on intersection of algorithms and infrastructure because there are a lot of trade-offs how you can kind of co-optimize and co-design the system."
  },
  {
    "i": 83,
    "speaker": "Speaker 3",
    "text": "One aspect is kind of people call about like this async parallel of pipeline parallel. The idea is basically, okay, you're trying to update this model in steps, right? So, you have your current"
  },
  {
    "i": 84,
    "speaker": "Speaker 3",
    "text": "model version and you're trying to do a bunch of rollouts with it. What does your trainer do while you're doing this rollouts, right? Like naive approach would say that, \"Okay, now I'm going to stop my"
  },
  {
    "i": 85,
    "speaker": "Speaker 3",
    "text": "trainer and I'm going to do a bunch of sessions and those sessions might run for like 5-10 minutes or even longer if it's like longer horizon tasks. I'm going to get those outcomes and now I'm going to pause my"
  },
  {
    "i": 86,
    "speaker": "Speaker 3",
    "text": "inference and then go back to training, trying to do updates.\""
  },
  {
    "i": 87,
    "speaker": "Speaker 3",
    "text": "That's like very theoretically algorithmically robust because you are not precisely simulating everything, but it's very system inefficient because half of your capacity is sitting idle"
  },
  {
    "i": 88,
    "speaker": "Speaker 3",
    "text": "all the time. So, you can do all the clever like algorithmic tricks allowing you to Yeah, you can you can like [clears throat] kind of pipeline all of this. So, imagine this as a gigantic"
  },
  {
    "i": 89,
    "speaker": "Speaker 3",
    "text": "like factory, right? You have this like trainer building and you have a rollouts building. They're always churning, right? They also rollouts always take like latest model version and try to do"
  },
  {
    "i": 90,
    "speaker": "Speaker 3",
    "text": "new sessions and kind of simulate new agent sessions and trainer always takes new outcomes as they come and try to compute updates. So, everything is moving along all the time. The trade-off"
  },
  {
    "i": 91,
    "speaker": "Speaker 3",
    "text": "is that why I'm saying that algorithmically it's different because now by the time you finish some test rollout in your kind of simulated environment, maybe model weights already updated on some other data. So, you have"
  },
  {
    "i": 92,
    "speaker": "Speaker 3",
    "text": "this kind of staleness, like delay between how quickly model can learn updates because by the time you kind of process through some interaction session with a simulated environment, your model"
  },
  {
    "i": 93,
    "speaker": "Speaker 3",
    "text": "weights changed, and that introduces interesting training dynamics, and there are clever ways how you can address this. But, the flip side of that is that your all your GPUs, all your computers"
  },
  {
    "i": 94,
    "speaker": "Speaker 3",
    "text": "kind of loaded and churning all the time, which actually you're using more more flops and to your better less example, yeah, you you have like higher compute efficiency, you can get"
  },
  {
    "i": 95,
    "speaker": "Speaker 3",
    "text": "to a better model in small amount of time. You may be losing a few percent from being asynchronous and not doing like perfect mathematical updates, but you way compensate for that by effectively"
  },
  {
    "i": 96,
    "speaker": "Speaker 3",
    "text": "not leaving half your capacity on the table. And there are a lot of kind of depth and interesting interaction in that part. And we're very serious about performance at Cursor because unlike the"
  },
  {
    "i": 97,
    "speaker": "Speaker 1",
    "text": "big labs, you know, we have tons of thousands of GPUs, not millions, and so yeah, we do all sorts of tricks to make get the most out of a GPU like we train in production with FP4 even. We work"
  },
  {
    "i": 98,
    "speaker": "Speaker 1",
    "text": "with Fireworks to like push on inference as well. Cuz the thing about RL infrastructure is that just like it's just inherently more complex than pre-training cuz you need all the pre-training infrastructure. That's just"
  },
  {
    "i": 99,
    "speaker": "Speaker 1",
    "text": "like one of the requirements. Then you need all the infrastructure to run these environments that have to mimic as closely as possible what a user's computer would look like, and it's very"
  },
  {
    "i": 100,
    "speaker": "Speaker 1",
    "text": "important it's closely as possible because sometimes the model can actually figure out when it's being run in like a fake environment or in a real one, and it has like different behaviors during"
  },
  {
    "i": 101,
    "speaker": "Speaker 2",
    "text": "RL than in production. Are you saying it being conscious that it's being used in a fake environment and it starts being behaving differently? Yes. Yes."
  },
  {
    "i": 102,
    "speaker": "Speaker 1",
    "text": ">> Interesting. Like it's like, \"Oh, I'm in a fake environment. I've learned a few tricks to like get the better reward in this environment and let me try them out. Models love to cheat. RL is really"
  },
  {
    "i": 103,
    "speaker": "Speaker 1",
    "text": "good at encouraging cheating. Yeah."
  },
  {
    "i": 104,
    "speaker": "Speaker 1",
    "text": "[snorts] Yeah, and then we need a really efficient inference. So, this is really important. So, there is like actually this kind of myth that during RL you spend more way more inference flops than"
  },
  {
    "i": 105,
    "speaker": "Speaker 1",
    "text": "training flops. This is sort of like just because the open source inference engines are very unoptimized instead of actually being a property of RL. Roughly the same ratio is kind of the same. In"
  },
  {
    "i": 106,
    "speaker": "Speaker 1",
    "text": "theory, if you push the GPUs to the maximum, you should have 1/3 of your training GPUs allocated to inference, right? Because training is effectively three forward passes. You have the"
  },
  {
    "i": 107,
    "speaker": "Speaker 1",
    "text": "forward pass, you have the data gradient, the weight gradient. While if you really hit the critical batch size on inference, you should only have a single forward pass worth of flops. So,"
  },
  {
    "i": 108,
    "speaker": "Speaker 1",
    "text": "that's why you guys use Fireworks instead of using an open inference engine? Yeah, I mean the other alternative is we would build one in-house, but you know, we have finite engineers like everybody else. We would"
  },
  {
    "i": 109,
    "speaker": "Speaker 1",
    "text": "like prefer to have engineers make training more efficient and more precise rather than like spin up like a inference effort, yeah. Okay, that's super hardcore. What about thinking you"
  },
  {
    "i": 110,
    "speaker": "Speaker 2",
    "text": "mentioned in your technical paper paper that you were doing this in a kind of globally distributed way? Why globally distributed and then what makes that hard? Yeah. Yeah. Well, there are"
  },
  {
    "i": 111,
    "speaker": "Speaker 1",
    "text": "various reasons. One, you know, like this very large contiguous clusters are hard to find in the market. And so, what we can do instead is we have one cluster that's going to run all of training. You"
  },
  {
    "i": 112,
    "speaker": "Speaker 1",
    "text": "know, we can't do global training cluster."
  },
  {
    "i": 113,
    "speaker": "Speaker 1",
    "text": "But then the inference component of reinforcement learning, we can globally distribute that across small clusters all over the world. So, I think for the composer to run, we used the four"
  },
  {
    "i": 114,
    "speaker": "Speaker 1",
    "text": "clusters in total that were all over the world, very far away from each other."
  },
  {
    "i": 115,
    "speaker": "Speaker 1",
    "text": "And we even used some of our production traffic when it was least used. So like we had the composer 1.5, the previous model served, and when it was least used by people, we just grabbed some"
  },
  {
    "i": 116,
    "speaker": "Speaker 1",
    "text": "inference GPUs, and we put them to speed up training. And so we can do these sort of things, and sort of easily scale up our training run without having one large continuous cluster. And the thing that enables it,"
  },
  {
    "i": 117,
    "speaker": "Speaker 3",
    "text": "maybe Dima can talk more about >> kind of like to reconfirm what Federico said is basically our RL training is like very heterogeneous, right? And by leveraging heterogeneity, how different components, like what"
  },
  {
    "i": 118,
    "speaker": "Speaker 3",
    "text": "infrastructures they need, you can actually drive efficiency. And you see this pattern kind of across the board everywhere. Specifically for for training, you have all these like highly"
  },
  {
    "i": 119,
    "speaker": "Speaker 3",
    "text": "interconnected clusters, you need high-speed network, kind of need to work in lockstep. So those clusters are expensive, right? And actually it's really hard to find big ones, right?"
  },
  {
    "i": 120,
    "speaker": "Speaker 3",
    "text": "Basically, at the scale with with which composer was trained, finding like 2x larger clusters is like significantly harder than finding the current size one. And that's why if you can"
  },
  {
    "i": 121,
    "speaker": "Speaker 3",
    "text": "disaggregate these components and put them on different places, one, you don't need to find such a big cluster."
  },
  {
    "i": 122,
    "speaker": "Speaker 3",
    "text": "Two, you can actually find like different trade-offs of hardware because for inference you don't need that kind of wide interconnect. You can have smaller groups of GPUs interconnected"
  },
  {
    "i": 123,
    "speaker": "Speaker 3",
    "text": "together. You can have heterogeneous sets of GPUs. You can have different generations of GPUs. You can kind of play all these games of games of optimization. And finally, like inference, it's much easier to scale up"
  },
  {
    "i": 124,
    "speaker": "Speaker 3",
    "text": "and down as you go. And yeah, it's very connected. So like when you have off-peak hours, you can view all your kind of inference pool as one set of GPUs serving production traffic for real"
  },
  {
    "i": 125,
    "speaker": "Speaker 3",
    "text": "users or serving simulated environments for RL purposes, and kind of balance balance between this. Of course, it's a very interesting systems problem."
  },
  {
    "i": 126,
    "speaker": "Speaker 3",
    "text": "Federico mentioned like the Kimi model is like one 1 TB."
  },
  {
    "i": 127,
    "speaker": "Speaker 3",
    "text": "Training step takes somewhere between like 5 to 15 minutes. So it basically means like every like every 5 to 10 minutes you are producing like 1 terabyte new snapshot of weights. So the"
  },
  {
    "i": 128,
    "speaker": "Speaker 3",
    "text": "question is like how are you going to ship it to a different cluster on the other side of the world very efficiently, right? And you want to like do it quickly because remember you don't"
  },
  {
    "i": 129,
    "speaker": "Speaker 3",
    "text": "want to get this staleness to get out of hand. So I think that's was probably one the the kind of the the most fun part which we figured out together is that despite you know, full"
  },
  {
    "i": 130,
    "speaker": "Speaker 3",
    "text": "model being like 1 terabyte, not all the weights change every step, right?"
  },
  {
    "i": 131,
    "speaker": "Speaker 3",
    "text": "Because RL does a lot of very like precise adjustments. Especially if the training going on. So actually there are very kind of regular patterns in like which subset of weights gets changed."
  },
  {
    "i": 132,
    "speaker": "Speaker 3",
    "text": "Maybe not all of them change every time."
  },
  {
    "i": 133,
    "speaker": "Speaker 3",
    "text": "So if you were to look at like how my model changes within one training step like after 10 minutes, there is relatively small delta between those."
  },
  {
    "i": 134,
    "speaker": "Speaker 3",
    "text": "You can write write a compression algorithm which basically leverages this property and now you end up with kind of like database systems problem which is okay, I have my delta and I just want to"
  },
  {
    "i": 135,
    "speaker": "Speaker 3",
    "text": "like ship it across across the world. My delta maybe is like 20 times smaller than what shipping the full model is and that and that makes it practical. But of course now you need to build all this"
  },
  {
    "i": 136,
    "speaker": "Speaker 3",
    "text": "kind of machinery from storage systems of full snapshots and deltas and recovery and like reconciliation etc. We were able to build it kind of in lossless fashion. Basically means that"
  },
  {
    "i": 137,
    "speaker": "Speaker 3",
    "text": "like you always end up with bit equivalent model on the other side. So you don't need to worry about any mass aspects of this and you can do it really fast too. You can you you can do it"
  },
  {
    "i": 138,
    "speaker": "Speaker 3",
    "text": "under you know, under a few minutes."
  },
  {
    "i": 139,
    "speaker": "Speaker 3",
    "text": "Even in the worst conditions usually it's under a minute and most importantly you like pause only for like maybe 30 seconds to swap the weights in your actual inference model. We also like"
  },
  {
    "i": 140,
    "speaker": "Speaker 3",
    "text": "fully like saturated the band egress of the cluster by like sharding the upload and the download as well. So you can do all this like system tricks to bring the stand down. It is It is quite a few"
  },
  {
    "i": 141,
    "speaker": "Speaker 3",
    "text": "complexity but you can kind of abstract it out and just make it work great. Like it doesn't interfere with your training algorithm and on the flip side you have this kind of power to disaggregate to"
  },
  {
    "i": 142,
    "speaker": "Speaker 3",
    "text": "leverage other clusters to do that. And that's kind of goes against kind of conventional wisdom of how you should do RL infrastructure because, you know, conventional wisdom is like you Okay,"
  },
  {
    "i": 143,
    "speaker": "Speaker 3",
    "text": "you're going to have this really huge one cluster connected with RDMA, and it's going to be very expensive, and you're going to probably spend, you know, maybe you're going to allocate"
  },
  {
    "i": 144,
    "speaker": "Speaker 3",
    "text": "like 1/3 to training and 2/3 to inference. And sure, if you have very expensive network, it's much easier to copy this 1 TB quickly, but now we have like three times larger cluster. Now, if"
  },
  {
    "i": 145,
    "speaker": "Speaker 3",
    "text": "your inference engine is more optimized, then maybe you you're going to save 1/3 of that cluster in terms of GPUs anyway because you're just more efficient, and you can take, you know, half of this"
  },
  {
    "i": 146,
    "speaker": "Speaker 3",
    "text": "cluster somewhere else in a maybe cheaper hardware in a different region."
  },
  {
    "i": 147,
    "speaker": "Speaker 2",
    "text": "So, your cost comes down quite a bit. I love that you guys are just grinning as you describe this because it's like it's so hard, and this is like a systems engineer's dream, right? And so, it's"
  },
  {
    "i": 148,
    "speaker": "Speaker 2",
    "text": "just like a it's amazing amazing system you guys have built."
  },
  {
    "i": 149,
    "speaker": "Speaker 1",
    "text": ">> a bunch of nights working on this."
  },
  {
    "i": 150,
    "speaker": "Speaker 2",
    "text": ">> Yeah. You look like you spent a long time together. A lot of time together."
  },
  {
    "i": 151,
    "speaker": "Speaker 2",
    "text": "What about I mean, you mentioned at the beginning that Kimi is a very large sparse MLE model. Does that make the RL run tricky in any way?"
  },
  {
    "i": 152,
    "speaker": "Speaker 1",
    "text": ">> Mhm. Yep."
  },
  {
    "i": 153,
    "speaker": "Speaker 1",
    "text": ">> How so? Well, when you do inference, you're essentially doing like a forward pass. It's just kind of like autoregressive."
  },
  {
    "i": 154,
    "speaker": "Speaker 1",
    "text": "And in this forward pass, it produces like log probabilities of like the tokens that it has sampled. When we ship back the like generations of the model to the trainer, we have to rerun that"
  },
  {
    "i": 155,
    "speaker": "Speaker 1",
    "text": "forward pass because as we mentioned, we are doing asynchronous training. So, the model that has produced the pass may have been like actually a few steps behind what the trainer is at. And so,"
  },
  {
    "i": 156,
    "speaker": "Speaker 1",
    "text": "we have to rerun that forward pass and reproduce log probabilities. Now, the problem is in theory, these log probabilities should be exactly the same if it's the same model version. But even"
  },
  {
    "i": 157,
    "speaker": "Speaker 1",
    "text": "with the same model version, you get slightly or sometimes very different log probability values for the same tokens. So, this is often called what like a numerical mismatch for inference. You hear this about all"
  },
  {
    "i": 158,
    "speaker": "Speaker 3",
    "text": "the time these days. For mixture >> Why Why is that? Why does that happen? I mean, primarily because like fundamentally floating point arithmetic, which is doing this is non-deterministic. So, if you"
  },
  {
    "i": 159,
    "speaker": "Speaker 2",
    "text": ">> So, why floating point arithmetic is non-deterministic?"
  },
  {
    "i": 160,
    "speaker": "Speaker 3",
    "text": ">> know, you you learn this code like if you take A plus B plus C, right? And like C plus B plus A, it's going to be the same result. If you're doing this with integers, with whole numbers on the"
  },
  {
    "i": 161,
    "speaker": "Speaker 3",
    "text": "computer, that's going to be always true. If you're going to do it with floating point numbers, which are actually like approximation numbers you have this like mantissa and exponent"
  },
  {
    "i": 162,
    "speaker": "Speaker 3",
    "text": "etc. A plus B plus C and C plus B plus A is going to give you like different results or even like A plus B and B So, basically fundamentally it's accumulation order of like all these"
  },
  {
    "i": 163,
    "speaker": "Speaker 3",
    "text": "operations which models do is basically multiplications and additions. And like addition order matters to your final result. It's all like small differences, but they get amplified through like"
  },
  {
    "i": 164,
    "speaker": "Speaker 3",
    "text": "millions and billions of operations. So, when you do inference on models, usually it doesn't matter that much because you pre-train your model, you actually pretty robust. If you like flip some"
  },
  {
    "i": 165,
    "speaker": "Speaker 3",
    "text": "bits, it's still going to produce you like good results. Your benchmarks are not going to change. But, RL in particular because you're using this very, very like weak signal to teach the"
  },
  {
    "i": 166,
    "speaker": "Speaker 3",
    "text": "model, the noise from this numerical differences can make or break your training. And that's like particularly important. And again, it's a interesting intersection between like algorithmic"
  },
  {
    "i": 167,
    "speaker": "Speaker 3",
    "text": "and systems part because, you know, you can write a beautiful mess and it just doesn't work in practice."
  },
  {
    "i": 168,
    "speaker": "Speaker 3",
    "text": "There are ways how you can drive this difference to pretty much zero. There are all these like batching variant ways you can Basically, you can be very, very careful and write all your GPU kernels"
  },
  {
    "i": 169,
    "speaker": "Speaker 3",
    "text": "so they always add numbers in the same order, so you always do like A plus B plus C and not a different order."
  },
  {
    "i": 170,
    "speaker": "Speaker 3",
    "text": "It's possible, but it always has a trade-off, right? Basically, your like your system becomes maybe like 2x or 3x slower. Again, it becomes an interesting trade-off like, okay, what is there 10%"
  },
  {
    "i": 171,
    "speaker": "Speaker 3",
    "text": "of slow down which we can take or in fact it's actually probably a few percent of slow down we can take to address 90% of this difference. That's you know the right trade-off which"
  },
  {
    "i": 172,
    "speaker": "Speaker 3",
    "text": "kind of we we find together through through iteration and you mentioned that particularly for MOEs and sparsity is hard. The reason for that is that like the way MOEs work is that you would take"
  },
  {
    "i": 173,
    "speaker": "Speaker 3",
    "text": "your activations at every layer and you would run it through gating layer which is basically decides okay for this token I'm going to run out of 384 experts I'm going to run this eight. Right? So it's"
  },
  {
    "i": 174,
    "speaker": "Speaker 3",
    "text": "going to do like some messing like top eight scores. Those eight experts going to be activated other ones will not be activated for this token. This operation amplifies your small numerical"
  },
  {
    "i": 175,
    "speaker": "Speaker 3",
    "text": "differences quite a bit because maybe your hidden states were like difference by like fifth digit after dot doesn't really matter but this difference made it so you picked expert number seven"
  },
  {
    "i": 176,
    "speaker": "Speaker 3",
    "text": "versus expert number nine as a kind of at the cutoff and suddenly you went and like activated totally different part of the model and your difference got amplified quite a bit. And my models by"
  },
  {
    "i": 177,
    "speaker": "Speaker 3",
    "text": "definition are like very more sensitive to this mismatch. Again, when you do inference or when you do kind of regular loud it usually doesn't matter and they average out but now if you're trying to"
  },
  {
    "i": 178,
    "speaker": "Speaker 3",
    "text": "make this model learn this difference is huge because your inference activated expert number seven. Now in your training you're trying to like update expert number nine which didn't even"
  },
  {
    "i": 179,
    "speaker": "Speaker 2",
    "text": "contribute to that during inference. So were you guys handwriting GPU kernels then to help get around this problem?"
  },
  {
    "i": 180,
    "speaker": "Speaker 3",
    "text": ">> Yeah. Yes. So you can again you can address all of this through GPU kernels and there's always trade-off."
  },
  {
    "i": 181,
    "speaker": "Speaker 3",
    "text": "Specifically for MOE you can do this interesting trick which people call router replay but basically you can have your inference just pass extra information to training and say that hey"
  },
  {
    "i": 182,
    "speaker": "Speaker 3",
    "text": "I activated expert seven for this token."
  },
  {
    "i": 183,
    "speaker": "Speaker 3",
    "text": "This very small bit piece of information is just one integer saying that like okay this is the expert which I activated so trainer can be aligned with that. And a lot of this numerical"
  },
  {
    "i": 184,
    "speaker": "Speaker 3",
    "text": "alignment is basically doing tricks like that, matching quantization levels, matching kernels, etc."
  },
  {
    "i": 185,
    "speaker": "Speaker 3",
    "text": "to drive the divergence between training and inference implementation down. And that makes huge difference in between, you know, your own maybe divergent completely or being, you know, multiplex"
  },
  {
    "i": 186,
    "speaker": "Speaker 2",
    "text": "less compute efficient because you'll need much more data to address to this mismatch. I'd love to maybe chat a little bit more about the RL kind of recipe. Can you say a word about the"
  },
  {
    "i": 187,
    "speaker": "Speaker 2",
    "text": "reward signal you're using? Is it Is like Are you can Okay, can't say. Got it. Top secret stuff. Top secret stuff."
  },
  {
    "i": 188,
    "speaker": "Speaker 2",
    "text": "Okay, that makes sense. Like it seems this is a almost like the equivalent of learning in sim. This is simulated rollouts versus like you have so much actual user data that you could be"
  },
  {
    "i": 189,
    "speaker": "Speaker 2",
    "text": "learning on. Why not just do RL on your your actual user data and your actual user harness versus doing this in sim?"
  },
  {
    "i": 190,
    "speaker": "Speaker 1",
    "text": "Yeah, we're also doing that. So that's what we call the real-time RL. Okay. And we use the same technology to do like the inference with sync with like fireworks to do this. We find like user"
  },
  {
    "i": 191,
    "speaker": "Speaker 1",
    "text": "signals where the user was happy or sad about a particular model generation."
  },
  {
    "i": 192,
    "speaker": "Speaker 1",
    "text": "And we're able to update that model live and so then ship a new version of the model continuously every few hours."
  },
  {
    "i": 193,
    "speaker": "Speaker 1",
    "text": "We're working on decreasing that time."
  },
  {
    "i": 194,
    "speaker": "Speaker 1",
    "text": "Actually, at some point we'll have to increase that time because as the horizon of the model gets longer and longer, we'll have to re-extend that time. So like an interesting play. Like"
  },
  {
    "i": 195,
    "speaker": "Speaker 1",
    "text": "right now we are trying to decrease the time for stability because we were figuring out the right hyper parameters and then after we have figured it out we have to re-extend it again just because"
  },
  {
    "i": 196,
    "speaker": "Speaker 2",
    "text": "we want to lengthen the horizon of these models, yeah. Do you need to do any of the kind of like pre-training simulated RL? You have so much actual user data. I imagine that's just like much more"
  },
  {
    "i": 197,
    "speaker": "Speaker 2",
    "text": "valuable to to train and tune on. Like why not just go straight to the online RL step? Why why do you have to do the the offline RL? The online RL currently is pretty inefficient. We suffer from"
  },
  {
    "i": 198,
    "speaker": "Speaker 1",
    "text": "this problem that the GPUs are offline for a long time essentially. And beside that there's also like different trade-offs also in terms of efficiency and user experience."
  },
  {
    "i": 199,
    "speaker": "Speaker 3",
    "text": ">> Yeah."
  },
  {
    "i": 200,
    "speaker": "Speaker 3",
    "text": ">> If you do simulation, you actually do multiple rollouts from the same prompt, right? You effectively take a task and you ask a model to do 16 tries of the task or like 128 tries of the task like"
  },
  {
    "i": 201,
    "speaker": "Speaker 3",
    "text": "different rollouts from the same prompt."
  },
  {
    "i": 202,
    "speaker": "Speaker 3",
    "text": "Some of them are going to uh go go well, some of them are not going to go well."
  },
  {
    "i": 203,
    "speaker": "Speaker 3",
    "text": "And by doing it multiple rollouts in parallel, you are able to get much more precise signal. Maybe like, you know, maybe a model is very good and it does it well 90% of the time. Maybe it's not"
  },
  {
    "i": 204,
    "speaker": "Speaker 3",
    "text": "very good. Losses like GRPO like group group policy gradient like kind of work by doing multiple rollouts at the same time. If you're doing online, you have only one uh rollout coming back. And so"
  },
  {
    "i": 205,
    "speaker": "Speaker 3",
    "text": "so trade-offs of like how you do it algorithmically different. And most importantly, if your simulated rollout goes wrong, it's not it's not bad, right? I mean, you just, you know, maybe"
  },
  {
    "i": 206,
    "speaker": "Speaker 3",
    "text": "you spend some time on the view. Uh if it's actual user, you you have much higher like minimum bar on that because effectively you're doing AB test, right?"
  },
  {
    "i": 207,
    "speaker": "Speaker 3",
    "text": "So if the model produces something weird, like that's a bad user experience."
  },
  {
    "i": 208,
    "speaker": "Speaker 2",
    "text": ">> Yeah. Okay. So you can go off policy more often when it's not a real user because you can like you can experiment with like crazy things and without affecting the user experience. You can"
  },
  {
    "i": 209,
    "speaker": "Speaker 2",
    "text": "do a lot more rollouts. You can do GRPO."
  },
  {
    "i": 210,
    "speaker": "Speaker 2",
    "text": "Yeah. Um and then you can basically like bootstrap some level of performance that's good enough to even put in front of users. Okay."
  },
  {
    "i": 211,
    "speaker": "Speaker 1",
    "text": ">> Yeah, like we teach reasoning through like the offline RL, which is actually like called online RL. Offline RL is more like DPO kind of technique to sort of reinforce kind of RL is online. Uh"
  },
  {
    "i": 212,
    "speaker": "Speaker 1",
    "text": "and then we there we like teach the reasoning to the model. We give it some kind of input of the behavior it should have. Uh we try to give it to new information about the world and we teach"
  },
  {
    "i": 213,
    "speaker": "Speaker 1",
    "text": "it tool calling."
  },
  {
    "i": 214,
    "speaker": "Speaker 1",
    "text": "And then we put it live to users. Cuz you could imagine like if the model is bad, users don't want to use it. They're not going to give us any feedback, right? So the model has to meet some"
  },
  {
    "i": 215,
    "speaker": "Speaker 1",
    "text": "kind of bar to even like be put into online RL. Like we want to be really happy with the model, and this is the model we ship. That's kind of the paradox of online online RL or how we like to call it real"
  },
  {
    "i": 216,
    "speaker": "Speaker 1",
    "text": "time, is that, you know, we can't use this to really like create the model from scratch because users need to be using the model. And so, it has to be good already, and we can only make it"
  },
  {
    "i": 217,
    "speaker": "Speaker 1",
    "text": "better. Yeah."
  },
  {
    "i": 218,
    "speaker": "Speaker 1",
    "text": ">> Yeah. So, it's kind of like cherry on top to like really get this super delightful experience for Yeah, totally."
  },
  {
    "i": 219,
    "speaker": "Speaker 1",
    "text": "the sessions. Hopefully, one day it will be like big big cherry, you know."
  },
  {
    "i": 220,
    "speaker": "Speaker 2",
    "text": ">> [laughter] >> Yeah, Dan Roberts presented at our conference last year. I think you were there. It's like traditionally, it was the big cake and then the little cherry."
  },
  {
    "i": 221,
    "speaker": "Speaker 2",
    "text": ">> LeCun's cherry, yeah. Little cake, big cherry."
  },
  {
    "i": 222,
    "speaker": "Speaker 2",
    "text": ">> [laughter] >> Yep. I'm curious, the the Andrej Karpathy line of like, right now RL is, you know, still super inefficient. You you do a big big long rollout, and then you kind of get get"
  },
  {
    "i": 223,
    "speaker": "Speaker 2",
    "text": "like, you know, a little bit of information at the end, and it's still like, I think slurping bits from a straw. What do you think? And have you have you been able to figure out how to"
  },
  {
    "i": 224,
    "speaker": "Speaker 2",
    "text": "get more bits out of that path? Uh, I can't talk about that. Okay, okay, got it. We're back on We're back on the secret stuff. Good. That's how I know I'm asking the right questions."
  },
  {
    "i": 225,
    "speaker": "Speaker 2",
    "text": ">> [laughter] >> You mentioned the rollouts are a few minutes at a time. It seems like the whole field is pushing towards making like long horizon agents, agents that can work for for a long period of time"
  },
  {
    "i": 226,
    "speaker": "Speaker 2",
    "text": "uninterrupted, and generally not failing. I love that meter scaling charts. What goes into into the RL process to try to get the agent to run for longer? Several things. So, one problem about sort of like reinforcement"
  },
  {
    "i": 227,
    "speaker": "Speaker 1",
    "text": "learning is that the longer the trajectory is, the harder it it is to do credit assignment. So, you can imagine like we are giving thumbs up thumbs down at the bottom line right at the end of"
  },
  {
    "i": 228,
    "speaker": "Speaker 1",
    "text": "its work. And sort of like, to simplify the problem is like, the model asks itself, \"Okay, where did I do right and where did I do wrong?\" That's basically the problem called credit assignment. It"
  },
  {
    "i": 229,
    "speaker": "Speaker 1",
    "text": "gets harder as this gets longer, so you have to do a bunch of tricks there. The other problem is just like you run out of space, right? Like these models have a finite context window and at some"
  },
  {
    "i": 230,
    "speaker": "Speaker 1",
    "text": "point they're going to reach that. So, actually the way we solved this at Cursor is we put compaction inside their RL loop. So, we call this self-summarization. So, during reinforcement learning, the agent"
  },
  {
    "i": 231,
    "speaker": "Speaker 1",
    "text": "actually learns how to continue and go on forever. So, in practice, our model is like a 200,000 context window model, but in reality it can go on for millions of tokens and just because of this ability that it can"
  },
  {
    "i": 232,
    "speaker": "Speaker 1",
    "text": "summarize its work and then take that summary to restart its context window while still trying to accomplish the task. And through RL, because RL pushes the model to do things correctly towards the goal, at"
  },
  {
    "i": 233,
    "speaker": "Speaker 1",
    "text": "the same time jointly, we are training the model to produce a good summary and then we're training the model to listen to that summary very well at the same time."
  },
  {
    "i": 234,
    "speaker": "Speaker 1",
    "text": "And so, this is kind of like a continuation to reasoning almost, I feel like."
  },
  {
    "i": 235,
    "speaker": "Speaker 3",
    "text": ">> it fascinating because I mean, usually context management is considered like part of the hardness, right? In this case, you're effectively co-optimizing like how part of the hardness and like model itself work"
  },
  {
    "i": 236,
    "speaker": "Speaker 3",
    "text": "together and throwing all of that into the optimization loop. And we've seen this again and again in the AI that like the more you throw computers at the problem, the more you can solve the"
  },
  {
    "i": 237,
    "speaker": "Speaker 3",
    "text": "problem end-to-end. The magic of computing gets your lesson works and you get much better system which can work together. Totally. Totally. Do you think every company's going to be"
  },
  {
    "i": 238,
    "speaker": "Speaker 2",
    "text": "RL-ing their own harnesses?"
  },
  {
    "i": 239,
    "speaker": "Speaker 1",
    "text": "Like do you think that every company has the same shape of problem as Cursor? If they are using AI and they're like producing lots of tokens and they have a product to optimize against, I think"
  },
  {
    "i": 240,
    "speaker": "Speaker 1",
    "text": "it's it's like the right move and they're the right direction to train models. Yeah. Yeah, interesting."
  },
  {
    "i": 241,
    "speaker": "Speaker 2",
    "text": "Interesting. Um and so it so it seems like most of the reinforcement learning you did then was on the kind of like the harness / tool use part rather than on the get good at, you know, completing next"
  },
  {
    "i": 242,
    "speaker": "Speaker 2",
    "text": "token for code. Is that roughly the pattern that other founders should have in mind when they're trying to think about where should I use reinforcement learning? So, like if you're trying to"
  },
  {
    "i": 243,
    "speaker": "Speaker 2",
    "text": "get an agent to perform tasks with tools over long horizon, you need RL. If you're trying to create a model that's good at summarization or next token or whatever, you probably don't need RL. Is"
  },
  {
    "i": 244,
    "speaker": "Speaker 1",
    "text": "Is that a good framework for when you need RL? I think RL fits everywhere. So, even for tab, we used RL. Personally, this is just my theory and it's not backed up by anything. When you"
  },
  {
    "i": 245,
    "speaker": "Speaker 1",
    "text": "pre-train a model, they're just The models are just in the ingesting the totality of human knowledge. Let's say you're training a model for math. The model sort of like learns all the math"
  },
  {
    "i": 246,
    "speaker": "Speaker 1",
    "text": "on Stack Exchange. The model, when it's presented with a math problem, and this is a model that hasn't gone through RL, the model is needs to wonder what kind of person it is. Is it the expert? Or is"
  },
  {
    "i": 247,
    "speaker": "Speaker 1",
    "text": "it the student that's trying to learn?"
  },
  {
    "i": 248,
    "speaker": "Speaker 1",
    "text": "And so, one of the things that I think happens during RL is that we're tuning this knob, letting the model know, \"Hey, you are the expert. You need to do things correctly.\" So, that's like one"
  },
  {
    "i": 249,
    "speaker": "Speaker 1",
    "text": "thing that happens is we are sharpening this distribution. Sort of like RL has a few phases. So, like there is the very first phase where the model learns and becomes very good very quickly. And then there"
  },
  {
    "i": 250,
    "speaker": "Speaker 1",
    "text": "is like a second phase where like it takes a lot of compute to continuously improve the model and like that you see the model starts reasoning and have this pattern. So, in the very first phase of"
  },
  {
    "i": 251,
    "speaker": "Speaker 1",
    "text": "the curve, I think that's where we're just tuning the knob, telling the model, \"Hey, you should do things correctly here.\" And so, RL in the small compute case is also very useful just to let the"
  },
  {
    "i": 252,
    "speaker": "Speaker 3",
    "text": "model know that it has to do things correctly. That's sort of like my case to this. Yeah, I mean, so second that. I mean, you we see this pattern because many of these cases you know, we helped"
  },
  {
    "i": 253,
    "speaker": "Speaker 3",
    "text": "RL fine-tuning generally for many customers and we see this usually you kind of continuous pre-training basically we train like regular supervised fine-tuning is simplifying you can say it's transfer of"
  },
  {
    "i": 254,
    "speaker": "Speaker 3",
    "text": "new knowledge kind of in abstract way and RL is kind of sharpening the behavior or like particular qualities you would you would want from from the model and usually you end up needing"
  },
  {
    "i": 255,
    "speaker": "Speaker 3",
    "text": "both. And even to your example of summarization, it's actually like RL may be very useful for this because sometimes if it's if you want particular style out of summarization, right? It's"
  },
  {
    "i": 256,
    "speaker": "Speaker 3",
    "text": "really hard to like come up with examples of like good and bad summarization. It's actually like really describing this precisely. But if you use for example LM as a judge, right?"
  },
  {
    "i": 257,
    "speaker": "Speaker 3",
    "text": "You can actually say very precise rubrics. You can kind of prompt your well saying like, \"Okay, this is the criteria how I am going to evaluate whether summarization good or not, throw it into RL loop, and let the"
  },
  {
    "i": 258,
    "speaker": "Speaker 3",
    "text": "model kind of experiment with different summarization styles, figure out what you actually want from it.\" While maybe another LM kind of evaluate it whether it's matching particular rubric or not."
  },
  {
    "i": 259,
    "speaker": "Speaker 2",
    "text": "And that's kind of type of pattern which you see a lot, not just in coding like I see. Okay, I'm going to ask this question to Dima because Federico is going to plead the fifth. Um, you"
  },
  {
    "i": 260,
    "speaker": "Speaker 2",
    "text": "mentioned LLM as judge a couple times."
  },
  {
    "i": 261,
    "speaker": "Speaker 2",
    "text": "Do you think that ultimately companies will be more successful having like experts hand examining RL rollouts and you know, hand coaching the model behavior in some way or do you think LLM"
  },
  {
    "i": 262,
    "speaker": "Speaker 3",
    "text": "as judge, other automated rubrics are likely to get us there? You don't really like put experts directly in judging general rollouts. I mean, that would be some kind of like I mean, real-time RL"
  },
  {
    "i": 263,
    "speaker": "Speaker 3",
    "text": "if it's actually users or like some form of I don't know like RLHF or DPO. I mean, generally the more verifiable your reward is it the better because it allows you to like scale the compute and just get better"
  },
  {
    "i": 264,
    "speaker": "Speaker 3",
    "text": "outcome. In some case and by verifiable we basically means like how can you automatically produce it without a human?"
  },
  {
    "i": 265,
    "speaker": "Speaker 3",
    "text": "Uh, of course if it's like math or coding and you can craft something like very deterministic, that's the best. The reason why LLM the judge works is that it's actually it's generator discriminator distinction like"
  },
  {
    "i": 266,
    "speaker": "Speaker 3",
    "text": "it's much easier to judge I mean the central for humans, right? It's easier to judge than than to create. Does LLM have a C?"
  },
  {
    "i": 267,
    "speaker": "Speaker 3",
    "text": "Yeah."
  },
  {
    "i": 268,
    "speaker": "Speaker 3",
    "text": ">> [laughter] >> No implication there. But yeah, it's much easier to judge and you can craft precisely like different criteria you want to run some answer."
  },
  {
    "i": 269,
    "speaker": "Speaker 3",
    "text": "And you see this pattern where you might have like very complicated eval from multiple aspects, right? Because if you dump multiple aspects to a single LLM it might be get confused how to judge,"
  },
  {
    "i": 270,
    "speaker": "Speaker 3",
    "text": "right? Like you you might break it down, okay, you're going to judge rubric based based on style, based on like some different aspects like based on sexuality, kind of really craft this"
  },
  {
    "i": 271,
    "speaker": "Speaker 3",
    "text": "rewards. Some of them will be the genius, some will be LLM based and that's what guides your model behavior."
  },
  {
    "i": 272,
    "speaker": "Speaker 3",
    "text": "Then you just turn on turn on more compute and I see the graph go up."
  },
  {
    "i": 273,
    "speaker": "Speaker 2",
    "text": ">> think that we're going to see RL be more effective in the harder to verify domains? Like do you think LLM as a judge is sufficient? That's one of the big things you would you would you would start, right?"
  },
  {
    "i": 274,
    "speaker": "Speaker 3",
    "text": "Ideally you want to figure out what is their actual outcome, what is their actual metric you want to get, right? So kind of trying to approximate this through LLM is one way, trying to get"
  },
  {
    "i": 275,
    "speaker": "Speaker 3",
    "text": "bigger simulated environments is another way. Like if you can simulate more of your product, if you can simulate more of your environment, usually you have like final metric which you care about."
  },
  {
    "i": 276,
    "speaker": "Speaker 3",
    "text": "It's just harder to capture. If you can figure out how to capture this, that's great. And to your point about, you know, experts, I mean experts are still still needed, right? Because crafting"
  },
  {
    "i": 277,
    "speaker": "Speaker 3",
    "text": "this task and actually encoding the product experience you want, that's that's what matters, right? We went through software 1.0, 2.0, 3.0, right? Instead of crafting software directly, we went to crafting training"
  },
  {
    "i": 278,
    "speaker": "Speaker 3",
    "text": "data. Right now you're effectively crafting the evaluation rules. But that's still very important. You need to look at examples, you need to look at the data, you need to look at like where"
  },
  {
    "i": 279,
    "speaker": "Speaker 2",
    "text": "your product fails and how to nudge the model in the right in the right behavior. I want to ask about RL environments, which is maybe related to what you were talking about. It seems"
  },
  {
    "i": 280,
    "speaker": "Speaker 2",
    "text": "like there's been a huge explosion in just the revenue scale that some of these RL environments companies are reaching. What do they provide that's actually useful? Cuz I mean Cursor, for"
  },
  {
    "i": 281,
    "speaker": "Speaker 2",
    "text": "example, you have so much data on like how your customers are actually using your environments. What do the RL environment vendors offer you on top of what you already have? Yeah, we don't"
  },
  {
    "i": 282,
    "speaker": "Speaker 1",
    "text": "actually use any of the environment vendors."
  },
  {
    "i": 283,
    "speaker": "Speaker 1",
    "text": "I think So, it's very difficult to construct working environments. It's a valuable product for people that do not have like have access to these. However, for coding particularly, there is like a"
  },
  {
    "i": 284,
    "speaker": "Speaker 1",
    "text": "very large amount of working coding environments available to everybody."
  },
  {
    "i": 285,
    "speaker": "Speaker 1",
    "text": "That's GitHub, right? You can go in and maybe like you can have a model like just install all of the dependencies for a repository, and that's like a working environment. I think a lot of the"
  },
  {
    "i": 286,
    "speaker": "Speaker 1",
    "text": "difficulty comes from the infrastructure as well. So, you can imagine that a environment that's that works well for a particular task may need like services up. You're like making a change that um"
  },
  {
    "i": 287,
    "speaker": "Speaker 1",
    "text": "let's say like a database migration."
  },
  {
    "i": 288,
    "speaker": "Speaker 1",
    "text": "To test that it's actually working, you need the database up, right? And so, those kind of things are very tricky. I think like these environment companies are like quite helpful for that that"
  },
  {
    "i": 289,
    "speaker": "Speaker 3",
    "text": "kind of stuff. There are kind of two aspects on to this, right? First, like if you look at Frontier Labs, right?"
  },
  {
    "i": 290,
    "speaker": "Speaker 3",
    "text": "They're trying to build generic model which is good at everything, right? So, they need to cover all these different tasks underneath, package up in one model, and kind of encourage it to generalize,"
  },
  {
    "i": 291,
    "speaker": "Speaker 3",
    "text": "right? So, that's that's kind of one pattern, and that's that's very helpful, right? In cases like Composer, right?"
  },
  {
    "i": 292,
    "speaker": "Speaker 3",
    "text": "You have you have your actual product, right? And I think that's what also kind of video fireworks like yeah, if you have your actual product, you should you should do RL against it, right? The most"
  },
  {
    "i": 293,
    "speaker": "Speaker 2",
    "text": "powerful environment is your own product."
  },
  {
    "i": 294,
    "speaker": "Speaker 3",
    "text": ">> Exactly, because like that's where your model actually will be used. And And of course, if you have Frontier Labs, you're are going to do it across all the products, right?"
  },
  {
    "i": 295,
    "speaker": "Speaker 3",
    "text": "But, if you're if you're trying to build the best model for your product, specialize and tailor it. You should just use your production environment. Of course, you want to isolate it properly,"
  },
  {
    "i": 296,
    "speaker": "Speaker 3",
    "text": "right? You don't want to model or have havoc on your production database. You want to clone it, etc. And there are some, you know, tools from the underlying companies which are just like"
  },
  {
    "i": 297,
    "speaker": "Speaker 3",
    "text": "from general infrastructure which makes it easier."
  },
  {
    "i": 298,
    "speaker": "Speaker 3",
    "text": "But, generally, you want your RL environment to be as close to real production as possible. And that's what, you know, as an example, we see it is if you look at kind of toy RL examples,"
  },
  {
    "i": 299,
    "speaker": "Speaker 3",
    "text": "the toy RL framework, they always start with like, \"Oh, there's this like toy environment. I'm going to spin up a Docker container and run everything in it.\" Uh which is great for like toy"
  },
  {
    "i": 300,
    "speaker": "Speaker 3",
    "text": "examples if you are trying to teach model how to play Atari or whatever, right? But, if you're actually transition to like production cases, you can't just put your real real production"
  },
  {
    "i": 301,
    "speaker": "Speaker 3",
    "text": "application in a Docker container. And we found it pretty early ourselves like working with many of our clients. In case of course, our trainer on our side, some other customers, we run trainer on"
  },
  {
    "i": 302,
    "speaker": "Speaker 3",
    "text": "our training platform. But, for environments, we actually default to running them on the customer side because that's where the actual implementation is. And you you can you effectively have the same setup of"
  },
  {
    "i": 303,
    "speaker": "Speaker 3",
    "text": "trainer, even if it's part of our works platform or on the customer side, calling the actual production environment, not trying to kind of wrap it and componentize it. Yeah. On the on"
  },
  {
    "i": 304,
    "speaker": "Speaker 1",
    "text": "the hosted platform because that's really hard and that introduces differences. Yeah, like I mean, what we call RL environments is really three components. One is the harness."
  },
  {
    "i": 305,
    "speaker": "Speaker 1",
    "text": "So, the harness is like where where the model can submit tools and that tools get executed. And the second thing is let's call it the like a kind of operating system, right? So, like what"
  },
  {
    "i": 306,
    "speaker": "Speaker 1",
    "text": "is the actual like world, the the end state where the model is like interacting with. And then there is like the reward component. We need which needs to check at the end that the work is is done correctly. And"
  },
  {
    "i": 307,
    "speaker": "Speaker 1",
    "text": "generally, the harness is pretty portable. You can take the harness and put in in many different environments."
  },
  {
    "i": 308,
    "speaker": "Speaker 1",
    "text": "The thing that's key is the operating system. And to replicate this just normal containers don't really work very well. So, at Cursor we actually built like a whole virtual machine"
  },
  {
    "i": 309,
    "speaker": "Speaker 1",
    "text": "stack. And so, we can spin up like virtual machines really quickly. And it has to be super bursty cuz you can imagine like we are asking this system, \"Please give me 100,000 virtual machines"
  },
  {
    "i": 310,
    "speaker": "Speaker 2",
    "text": "now.\" And it has to come all come up and um um yeah. Awesome. I really enjoyed this conversation today. I think Cursor is such an inspiration in what you all are doing as a company towards going from"
  },
  {
    "i": 311,
    "speaker": "Speaker 2",
    "text": "application company to really a frontier mobile lab. And I think the work you do with Composer too really leads that charge. So, really special to hear about it. And then, Dima, really cool to hear"
  },
  {
    "i": 312,
    "speaker": "Speaker 2",
    "text": "about the hardcore infrastructure problems actually that the two of you solved together in the trenches over many, many late nights to make it all possible. So, thank you. Thank you guys"
  },
  {
    "i": 313,
    "speaker": "Speaker 1",
    "text": "for joining today."
  },
  {
    "i": 314,
    "speaker": "Speaker 1",
    "text": ">> Thank you so much for having us. Thank you."
  },
  {
    "i": 315,
    "speaker": "Speaker 2",
    "text": ">> [music] [music]"
  }
]