[ { "i": 0, "speaker": "Speaker 1", "text": "Hi, I'm Sasha, a researcher here at Cursor. I'm going to talk about Composer 2, our in-house" }, { "i": 1, "speaker": "Speaker 1", "text": "coding model. Composer 2 is a model we built for agentic coding. It's extremely strong" }, { "i": 2, "speaker": "Speaker 1", "text": "as a coding model. At release, it scored similarly to Opus 4.6 and slightly behind GPT 5.4." }, { "i": 3, "speaker": "Speaker 1", "text": "Composer 2 combines really strong coding speed with a very competitive price. So it's extremely" }, { "i": 4, "speaker": "Speaker 1", "text": "efficient to use for all sorts of coding tasks. A lot of people say it just kind of" }, { "i": 5, "speaker": "Speaker 1", "text": "flies by. And it's also extremely affordable per token. We've been building coding models" }, { "i": 6, "speaker": "Speaker 1", "text": "for the last year or so at Cursor. And the goal of our first generation of models, Composer" }, { "i": 7, "speaker": "Speaker 1", "text": "1, was to build a really helpful, interactive model." }, { "i": 8, "speaker": "Speaker 1", "text": "We saw that users were moving from a Tab-based, largely manual coding process to one that" }, { "i": 9, "speaker": "Speaker 1", "text": "relied more on agents. And this shift really picked up over the last few months." }, { "i": 10, "speaker": "Speaker 1", "text": "When we originally designed the model, the goal was to answer these coding-style" }, { "i": 11, "speaker": "Speaker 1", "text": "tasks. So the user would ask something, and it would get sent to our model. And then the" }, { "i": 12, "speaker": "Speaker 1", "text": "model would use tools. These tools read files, edit files, use codebase" }, { "i": 13, "speaker": "Speaker 1", "text": "search, collect lints, and run terminal commands. And the agent would cycle between" }, { "i": 14, "speaker": "Speaker 1", "text": "these in order to make the edit. 
Within the last several months, we've seen" }, { "i": 15, "speaker": "Speaker 1", "text": "a major change where people are moving from single code changes to full-on software engineering." }, { "i": 16, "speaker": "Speaker 1", "text": "And the goal of Composer 2 was to build a coding agent for this sort of world." }, { "i": 17, "speaker": "Speaker 1", "text": "We found that for a typical Cursor developer, agents would write almost 100% of their code." }, { "i": 18, "speaker": "Speaker 1", "text": "They would spend time doing things beyond coding, such as breaking down problems," }, { "i": 19, "speaker": "Speaker 1", "text": "reviewing artifacts, maybe running tests, and giving feedback. And there was an expectation" }, { "i": 20, "speaker": "Speaker 1", "text": "that you could spin up multiple agents simultaneously. You wouldn't have to babysit" }, { "i": 21, "speaker": "Speaker 1", "text": "them. You would just expect that they could get the job done. With this goal in mind, we set out" }, { "i": 22, "speaker": "Speaker 1", "text": "to build Composer 2 with three specific subgoals. We wanted the agent to have deep knowledge about" }, { "i": 23, "speaker": "Speaker 1", "text": "code in general. We wanted it to be able to run hard tasks to completion. And we wanted it to work" }, { "i": 24, "speaker": "Speaker 1", "text": "well for the sorts of realistic tasks that people were considering. So in this talk, I'm going to go" }, { "i": 25, "speaker": "Speaker 1", "text": "through each of these and discuss some of the machine learning techniques we used in order to" }, { "i": 26, "speaker": "Speaker 1", "text": "approach each of them." }, { "i": 27, "speaker": "Speaker 1", "text": "So let's start with the first one:" }, { "i": 28, "speaker": "Speaker 1", "text": "knowledge." }, { "i": 29, "speaker": "Speaker 1", "text": "We want to build an agent that has deep knowledge about code in general." }, { "i": 30, "speaker": "Speaker 1", "text": "And we do this by including a large-scale continued pre-training phase." }, { "i": 31, "speaker": "Speaker 1", "text": "The goal of this additional stage is to improve the core knowledge of the system: not" }, { "i": 32, "speaker": "Speaker 1", "text": "general chatbot-like knowledge, but knowledge of the specific types of coding that we'd want it to" }, { "i": 33, "speaker": "Speaker 1", "text": "do. To facilitate this process, we started with a strong base model." }, { "i": 34, "speaker": "Speaker 1", "text": "[inaudible]" }, { "i": 35, "speaker": "Speaker 1", "text": "When deciding which base model to use, we tested several different open models on several different benchmarks." }, { "i": 36, "speaker": "Speaker 1", "text": "Some of these look at our internal code. Others look at other properties we wanted our final model to have." }, { "i": 37, "speaker": "Speaker 1", "text": "We actually found that many of the open-source models are quite strong." }, { "i": 38, "speaker": "Speaker 1", "text": "And we decided to use Kimi over the others, mostly because of infrastructure factors and because it fit some of the systems we had developed." }, { "i": 39, "speaker": "Speaker 1", "text": "When we do continued pre-training, we run a relatively standard pipeline, but we run it at a very large scale." }, { "i": 40, "speaker": "Speaker 1", "text": "So we run a fairly standard short-context pre-training phase over many tokens." }, { "i": 41, "speaker": "Speaker 1", "text": "The goal here is to make the model more knowledgeable about the subjects we'd like to use it on in practice." }, { "i": 42, "speaker": "Speaker 1", "text": "We then run a long-context extension phase to provide more data at 256,000 tokens." }, { "i": 43, "speaker": "Speaker 1", "text": "And then we end with a final" }, { "i": 44, "speaker": "Speaker 1", "text": "supervised fine-tuning phase, which looks more like the kind of agent data that you might use in practice." }, { "i": 45, "speaker": "Speaker 1", "text": "On the left, you can see our continued pre-training loss curve, and it has the nice structure you'd like during pre-training, where the loss goes down over time." }, { "i": 46, "speaker": "Speaker 1", "text": "One question is whether this continued pre-training was necessary for improving the final model."
}, { "i": 47, "speaker": "Speaker 1", "text": "And we found that, yes, in practice, it did have a nice impact." }, { "i": 48, "speaker": "Speaker 1", "text": "To test this, we considered three different variants." }, { "i": 49, "speaker": "Speaker 1", "text": "For three different amounts of pre-training, small, medium, and large, the models all end up with different negative log-likelihoods, which we see on the x-axis." }, { "i": 50, "speaker": "Speaker 1", "text": "We find that after running our standard reinforcement learning phase, we get different rewards for the three systems." }, { "i": 51, "speaker": "Speaker 1", "text": "You always have to be careful about connecting pre-training to the final model, but if you can afford to run these sorts of experiments, you can see that there is a benefit." }, { "i": 52, "speaker": "Speaker 1", "text": "The next and primary phase of the system is long-horizon reinforcement learning." }, { "i": 53, "speaker": "Speaker 1", "text": "This is targeted at building an agentic model that can really run and complete very hard tasks." }, { "i": 54, "speaker": "Speaker 1", "text": "Reinforcement learning can be thought of as simulating, as closely as possible, the actual user queries people" }, { "i": 55, "speaker": "Speaker 1", "text": "will run in Cursor. And in addition to improving the intelligence of the model, it allows us to" }, { "i": 56, "speaker": "Speaker 1", "text": "tune behaviors so that the model actually works in a way that maximizes the user experience of" }, { "i": 57, "speaker": "Speaker 1", "text": "the agent. In order to run reinforcement learning, we first have to collect a large set of" }, { "i": 58, "speaker": "Speaker 1", "text": "different problems. The problems we work on, which we show on the left, target a large set of" }, { "i": 59, "speaker": "Speaker 1", "text": "real-world coding problems. 
We focus on things like iterating on features, debugging code, and adding" }, { "i": 60, "speaker": "Speaker 1", "text": "new features. But we also increasingly include tasks that look more like parts of the software" }, { "i": 61, "speaker": "Speaker 1", "text": "development process, such as writing documentation, doing migrations, or" }, { "i": 62, "speaker": "Speaker 1", "text": "managing different project structures. The tasks vary greatly in difficulty, with some of" }, { "i": 63, "speaker": "Speaker 1", "text": "them being easy to solve and some of them being nearly impossible." }, { "i": 64, "speaker": "Speaker 1", "text": "In fact, it is increasingly a challenge to find problems hard enough to really challenge the model and force it to get better." }, { "i": 65, "speaker": "Speaker 1", "text": "Before we actually run reinforcement learning, we run a process called auto install. Auto install is" }, { "i": 66, "speaker": "Speaker 1", "text": "one of my favorite parts of this system because it shows how developing a model can improve future" }, { "i": 67, "speaker": "Speaker 1", "text": "versions of the model itself. To run auto install, we use our previous model, Composer" }, { "i": 68, "speaker": "Speaker 1", "text": "1.5. It runs in two stages. In the first stage, it explores a repo and reads the docs. And then it" }, { "i": 69, "speaker": "Speaker 1", "text": "comes up with ten installation commands that it thinks the model should be able to run" }, { "i": 70, "speaker": "Speaker 1", "text": "in order for the repo to be in a good state. It then writes tests to check that these are" }, { "i": 71, "speaker": "Speaker 1", "text": "verifiable. In the second stage, Composer is given the task of installing the environment." }, { "i": 72, "speaker": "Speaker 1", "text": "It's going to run, I don't know, uv setup. It's going to run" }, { "i": 73, "speaker": "Speaker 1", "text": "all the different commands that it may need. 
And it will even go to greater lengths. It will try" }, { "i": 74, "speaker": "Speaker 1", "text": "to make sure the tests pass. It will mock up various dependencies. It will install other" }, { "i": 75, "speaker": "Speaker 1", "text": "packages that maybe didn't work on the first try. If it can pass the verification from stage one," }, { "i": 76, "speaker": "Speaker 1", "text": "we consider the environment ready and we set it up for RL training." }, { "i": 77, "speaker": "Speaker 1", "text": "During RL, we run a somewhat simple but effective process. The way RL works is that you" }, { "i": 78, "speaker": "Speaker 1", "text": "start with one of these problems that we set up in the previous step. And then you run several" }, { "i": 79, "speaker": "Speaker 1", "text": "different rollouts. Each of these rollouts runs in a simulated version of the environment." }, { "i": 80, "speaker": "Speaker 1", "text": "The agent is given the same task that it would be given in Cursor. And it tries to solve the" }, { "i": 81, "speaker": "Speaker 1", "text": "problem." }, { "i": 82, "speaker": "Speaker 1", "text": "But the infrastructure challenges of doing this are immense. These agentic rollouts can be 200,000" }, { "i": 83, "speaker": "Speaker 1", "text": "tokens long and they can call hundreds of tools. We're also going to have many of these for every" }, { "i": 84, "speaker": "Speaker 1", "text": "problem that we have to work on. The result of this RL process is that some of the rollouts are" }, { "i": 85, "speaker": "Speaker 1", "text": "better than others. Some of them solve the problem that we're interested in and others fail." }, { "i": 86, "speaker": "Speaker 1", "text": "We're basically going to find the good ones and update the model to be more similar to" }, { "i": 87, "speaker": "Speaker 1", "text": "them, and find the bad ones and update away from those."
}, { "i": 88, "speaker": "Speaker 1", "text": "In addition to trying to find rollouts that score strongly on our benchmarks," }, { "i": 89, "speaker": "Speaker 1", "text": "we also take into account the behavioral properties that we would like" }, { "i": 90, "speaker": "Speaker 1", "text": "the models to have. We put a lot of effort into developing rewards that can help" }, { "i": 91, "speaker": "Speaker 1", "text": "shape the agent's behavior and make it provide a good user experience. One of the most interesting" }, { "i": 92, "speaker": "Speaker 1", "text": "of these is trying to find the role that the agent wants to play. And we also have" }, { "i": 93, "speaker": "Speaker 1", "text": "to decide how many tokens we want the model to actually use." }, { "i": 94, "speaker": "Speaker 1", "text": "We explored a lot of different penalties for the model being" }, { "i": 95, "speaker": "Speaker 1", "text": "too talkative or not talkative enough." }, { "i": 96, "speaker": "Speaker 1", "text": "And we settled on a nonlinear length penalty, which is shown on the bottom left." }, { "i": 97, "speaker": "Speaker 1", "text": "The idea here is that we want to penalize extra length on short sequences to make them more efficient." }, { "i": 98, "speaker": "Speaker 1", "text": "If the problem is easy, we would like Composer 2 to solve it efficiently and move on." }, { "i": 99, "speaker": "Speaker 1", "text": "However, if the problem is very challenging, we'd like Composer to spend a lot more time" }, { "i": 100, "speaker": "Speaker 1", "text": "on it to try to find the right answer." }, { "i": 101, "speaker": "Speaker 1", "text": "So, in a marginal sense, the penalty for each extra token goes down as the thinking gets longer."
}, { "i": 102, "speaker": "Speaker 1", "text": "In addition, in order to encourage the model to be able to solve very long problems, we" }, { "i": 103, "speaker": "Speaker 1", "text": "developed a system called self-summarization." }, { "i": 104, "speaker": "Speaker 1", "text": "The way self-summarization works is that the model is allowed to work past its length" }, { "i": 105, "speaker": "Speaker 1", "text": "limit." }, { "i": 106, "speaker": "Speaker 1", "text": "So in this example here, maybe the first example on the left runs for longer than, say, 200,000" }, { "i": 107, "speaker": "Speaker 1", "text": "tokens." }, { "i": 108, "speaker": "Speaker 1", "text": "It then hits a trigger point." }, { "i": 109, "speaker": "Speaker 1", "text": "We ask the model to summarize what it's done so far." }, { "i": 110, "speaker": "Speaker 1", "text": "That summary is then provided to the agent in the next step." }, { "i": 111, "speaker": "Speaker 1", "text": "It runs again, maybe summarizes again, and then keeps on going." }, { "i": 112, "speaker": "Speaker 1", "text": "The benefit of this process is that during RL, we can actually treat all three of these" }, { "i": 113, "speaker": "Speaker 1", "text": "steps as steps that led to the same final reward." }, { "i": 114, "speaker": "Speaker 1", "text": "That allows us to train the model to work effectively over arbitrarily long horizons while also" }, { "i": 115, "speaker": "Speaker 1", "text": "capping the length that it actually uses in each part of the rollout." }, { "i": 116, "speaker": "Speaker 1", "text": "We found this to be an effective way to get the model to work on very hard problems." }, { "i": 117, "speaker": "Speaker 1", "text": "[inaudible]" }, { "i": 121, "speaker": "Speaker 1", "text": "model, at the cost of degrading properties like diversity, or doing well when you call the" }, { "i": 122, "speaker": "Speaker 1", "text": "model multiple times. However, in practice, we didn't see this in our approach: if" }, { "i": 123, "speaker": "Speaker 1", "text": "we consider best-of-16 performance, that also increased over time. This indicates that the" }, { "i": 124, "speaker": "Speaker 1", "text": "model is not zooming in too closely on one particular solution." }, { "i": 125, "speaker": "Speaker 1", "text": "Finally, our goal was not just to improve some random benchmarks. We wanted to do well" }, { "i": 126, "speaker": "Speaker 1", "text": "at the realistic tasks that people use Composer for every day. To do this, we built our own" }, { "i": 127, "speaker": "Speaker 1", "text": "internal evaluation that's used to drive performance on realistic coding tasks. The goal here was" }, { "i": 128, "speaker": "Speaker 1", "text": "to have queries and gold target solutions that look much more like the things that software" }, { "i": 129, "speaker": "Speaker 1", "text": "engineers do in practice." }, { "i": 130, "speaker": "Speaker 1", "text": "To do this, we've been developing a benchmark called CursorBench. This is the third iteration" }, { "i": 131, "speaker": "Speaker 1", "text": "of the CursorBench project. We actually have quite a large number of solutions that we" }, { "i": 132, "speaker": "Speaker 1", "text": "include in CursorBench, and they've gotten longer and harder over time. So this graph" }, { "i": 133, "speaker": "Speaker 1", "text": "on the left shows the average number of lines of code changed and the number of files touched" }, { "i": 134, "speaker": "Speaker 1", "text": "for each example in our benchmark."
}, { "i": 135, "speaker": "Speaker 1", "text": "The goal of this benchmark was to remain uncontaminated." }, { "i": 136, "speaker": "Speaker 1", "text": "We didn't want it to be findable on the web." }, { "i": 137, "speaker": "Speaker 1", "text": "We wanted to cover a range of different use cases." }, { "i": 138, "speaker": "Speaker 1", "text": "And we wanted to test more than just whether you passed a test." }, { "i": 139, "speaker": "Speaker 1", "text": "To do this, we're actually using code from software engineers here at Cursor." }, { "i": 140, "speaker": "Speaker 1", "text": "So these are actual problems from real queries that developers asked." }, { "i": 141, "speaker": "Speaker 1", "text": "So CursorBench has some interesting properties," }, { "i": 142, "speaker": "Speaker 1", "text": "particularly when compared to some of the famous public benchmarks." }, { "i": 143, "speaker": "Speaker 1", "text": "One is that it uses much longer code diffs." }, { "i": 144, "speaker": "Speaker 1", "text": "So it's not just making a couple of lines of code changes, but" }, { "i": 145, "speaker": "Speaker 1", "text": "actually, as I mentioned earlier, hundreds of lines over multiple files." }, { "i": 146, "speaker": "Speaker 1", "text": "More interestingly, though, this graph on the right shows that the mean problem" }, { "i": 147, "speaker": "Speaker 1", "text": "description length," }, { "i": 148, "speaker": "Speaker 1", "text": "that is, the description of" }, { "i": 149, "speaker": "Speaker 1", "text": "the task" }, { "i": 150, "speaker": "Speaker 1", "text": "we asked the agent to do, is actually much shorter than in other benchmarks." }, { "i": 151, "speaker": "Speaker 1", "text": "At first, this might seem like a negative thing," }, { "i": 152, "speaker": "Speaker 1", "text": "as the problems are much less well specified." }, { "i": 153, "speaker": "Speaker 1", "text": "But we actually think of this as a positive."
}, { "i": 154, "speaker": "Speaker 1", "text": "It looks much closer to the sort of problems agents actually have to solve in" }, { "i": 155, "speaker": "Speaker 1", "text": "practice." }, { "i": 156, "speaker": "Speaker 1", "text": "And so resolving ambiguity or guessing user intent is actually a core part of" }, { "i": 157, "speaker": "Speaker 1", "text": "what we're asking the agent to do." }, { "i": 158, "speaker": "Speaker 1", "text": "So here's an example of what this looks like." }, { "i": 159, "speaker": "Speaker 1", "text": "[...] common to ask very hard things of the model and expect it to figure them out." }, { "i": 160, "speaker": "Speaker 1", "text": "But we don't see these sorts of ill-structured problems in a lot of the public benchmarks." }, { "i": 161, "speaker": "Speaker 1", "text": "So one of the benefits of CursorBench is that it does a much better job than some other benchmarks at" }, { "i": 162, "speaker": "Speaker 1", "text": "separating models out. We see a much wider range between models we know are really good," }, { "i": 163, "speaker": "Speaker 1", "text": "like Opus 4.6, and models that we think are less good or maybe from a previous generation," }, { "i": 164, "speaker": "Speaker 1", "text": "like Sonnet 4.5. For some reason, these models all score really well on the public benchmarks," }, { "i": 165, "speaker": "Speaker 1", "text": "but that might just be because they've targeted that style of problem" }, { "i": 166, "speaker": "Speaker 1", "text": "really well over the last couple of years." }, { "i": 167, "speaker": "Speaker 1", "text": "At the end of the day, we can use tools like CursorBench to get a sense of how well" }, { "i": 168, "speaker": "Speaker 1", "text": "models are doing. So we can see the improvement from Composer 1.5 to Composer 2. And we can also" }, { "i": 169, "speaker": "Speaker 1", "text": "graph the models [inaudible]" }, { "i": 170, "speaker": "Speaker 1", "text": "GPT 5.4, there actually is quite a difference in how good they are at answering problems" }, { "i": 171, "speaker": "Speaker 1", "text": "based on how many tokens they use." }, { "i": 172, "speaker": "Speaker 1", "text": "So in this graph here, we can see the medium, high, and extra-high versions of the model" }, { "i": 173, "speaker": "Speaker 1", "text": "improving on these benchmarks as they use more tokens." }, { "i": 174, "speaker": "Speaker 1", "text": "Cool." }, { "i": 175, "speaker": "Speaker 1", "text": "So that's roughly what we're building." }, { "i": 176, "speaker": "Speaker 1", "text": "Composer 2 came out a couple of months ago, and we are currently in the process of releasing" }, { "i": 177, "speaker": "Speaker 1", "text": "Composer 2.5." }, { "i": 178, "speaker": "Speaker 1", "text": "I was hoping to be able to talk about it today, but it will come out in the very near future." }, { "i": 179, "speaker": "Speaker 1", "text": "We're also scaling our models up to a much larger training cluster." }, { "i": 180, "speaker": "Speaker 1", "text": "So we're using the clusters from SpaceX to train Composer 3 and beyond." }, { "i": 181, "speaker": "Speaker 1", "text": "So coding models are very good." }, { "i": 182, "speaker": "Speaker 1", "text": "I think we all use them in our day-to-day lives." }, { "i": 183, "speaker": "Speaker 1", "text": "And we've seen how impressive they are." }, { "i": 184, "speaker": "Speaker 1", "text": "But we think there's a lot more to be done." }, { "i": 185, "speaker": "Speaker 1", "text": "Larger pre-training and better RL will continue to improve performance." }, { "i": 186, "speaker": "Speaker 1", "text": "And I can give you a brief preview of Composer 2.5."
}, { "i": 187, "speaker": "Speaker 1", "text": "So this model, coming out soon, will have even stronger Terminal-Bench scores, as well as" }, { "i": 188, "speaker": "Speaker 1", "text": "stronger scores on other similar benchmarks." }, { "i": 189, "speaker": "Speaker 1", "text": "And that's really from refining this reinforcement learning process, refining this continued" }, { "i": 190, "speaker": "Speaker 1", "text": "pre-training process, and working on our evals to go in the correct direction." }, { "i": 191, "speaker": "Speaker 1", "text": "So thanks very much for listening." }, { "i": 192, "speaker": "Speaker 1", "text": "And I'm happy to answer any questions people might have." }, { "i": 193, "speaker": "Speaker 2", "text": "Yes, we have a lot of questions." }, { "i": 194, "speaker": "Speaker 2", "text": "And thank you, Sasha." }, { "i": 195, "speaker": "Speaker 2", "text": "I guess the first one, from Lonnie, is: have we found that Composer works better for some" }, { "i": 196, "speaker": "Speaker 2", "text": "languages and frameworks than others?" }, { "i": 197, "speaker": "Speaker 1", "text": "That's a great question." }, { "i": 198, "speaker": "Speaker 1", "text": "There are a couple of reasons why that might happen." }, { "i": 199, "speaker": "Speaker 1", "text": "One is that certain languages are just represented more in the general data." }, { "i": 200, "speaker": "Speaker 1", "text": "So languages like Python and TypeScript are extremely common and are just seen" }, { "i": 201, "speaker": "Speaker 1", "text": "way more by these models." }, { "i": 202, "speaker": "Speaker 1", "text": "There's also the question of which languages can be simulated well in the RL framework." }, { "i": 203, "speaker": "Speaker 1", "text": "So languages used for, say, web development or simple backends are really nice" }, { "i": 204, "speaker": "Speaker 1", "text": "for RL because we can simulate those and get a good reward."
}, { "i": 205, "speaker": "Speaker 1", "text": "[...] but still do relatively well in our framework." }, { "i": 206, "speaker": "Speaker 2", "text": "One question from Andrew, just generally:" }, { "i": 207, "speaker": "Speaker 2", "text": "he's wondering, when models go" }, { "i": 208, "speaker": "Speaker 2", "text": "from Composer 1 to Composer 2," }, { "i": 209, "speaker": "Speaker 2", "text": "what is leading to this jump?" }, { "i": 210, "speaker": "Speaker 2", "text": "Is it that the original LLMs have access" }, { "i": 211, "speaker": "Speaker 2", "text": "to more internet data?" }, { "i": 212, "speaker": "Speaker 2", "text": "Are we giving it more data?" }, { "i": 213, "speaker": "Speaker 2", "text": "Are we changing how we approach training?" }, { "i": 214, "speaker": "Speaker 2", "text": "What leads to this jump in capabilities?" }, { "i": 215, "speaker": "Speaker 1", "text": "Yeah, I would say that a lot of the jump in capabilities" }, { "i": 216, "speaker": "Speaker 1", "text": "comes from refining the training process." }, { "i": 217, "speaker": "Speaker 1", "text": "So we'll see that from Composer 2 to Composer 2.5," }, { "i": 218, "speaker": "Speaker 1", "text": "we're actually not changing much" }, { "i": 219, "speaker": "Speaker 1", "text": "about the base system itself," }, { "i": 220, "speaker": "Speaker 1", "text": "but we're able to provide cleaner rewards." }, { "i": 221, "speaker": "Speaker 1", "text": "We're able to refine the details and refine the data mix." }, { "i": 222, "speaker": "Speaker 1", "text": "And we're able to balance user experience" }, { "i": 223, "speaker": "Speaker 1", "text": "with intelligence in better ways." }, { "i": 224, "speaker": "Speaker 1", "text": "And that's a lot of what the work" }, { "i": 225, "speaker": "Speaker 1", "text": "and compute at Cursor goes into."
}, { "i": 226, "speaker": "Speaker 2", "text": "And then I'm seeing a few questions about summarization:" }, { "i": 227, "speaker": "Speaker 2", "text": "is it using the current version of the Composer model" }, { "i": 228, "speaker": "Speaker 2", "text": "to self-summarize?" }, { "i": 229, "speaker": "Speaker 2", "text": "And have we seen it hurt the caching rate," }, { "i": 230, "speaker": "Speaker 2", "text": "making longer-horizon tasks more expensive?" }, { "i": 231, "speaker": "Speaker 1", "text": "Yeah, really good question." }, { "i": 232, "speaker": "Speaker 1", "text": "So yeah, we call it self-summarization" }, { "i": 233, "speaker": "Speaker 1", "text": "because it's literally using the same model as we train it." }, { "i": 234, "speaker": "Speaker 1", "text": "So the agent does each of its tasks," }, { "i": 235, "speaker": "Speaker 1", "text": "it hits the trigger," }, { "i": 236, "speaker": "Speaker 1", "text": "and then we basically just slip in, right here," }, { "i": 237, "speaker": "Speaker 1", "text": "this little summary prompt that says," }, { "i": 238, "speaker": "Speaker 1", "text": "given what you've done so far," }, { "i": 239, "speaker": "Speaker 1", "text": "if you want to keep on going," }, { "i": 240, "speaker": "Speaker 1", "text": "how would you summarize what's happened?" }, { "i": 241, "speaker": "Speaker 1", "text": "And then that literally just gets plopped back" }, { "i": 242, "speaker": "Speaker 1", "text": "into the model itself." }, { "i": 243, "speaker": "Speaker 1", "text": "So it's really neat that you can do this." }, { "i": 244, "speaker": "Speaker 1", "text": "It's, in some sense, the best use of the cache," }, { "i": 245, "speaker": "Speaker 1", "text": "because this summary is actually using" }, { "i": 246, "speaker": "Speaker 1", "text": "the same cache here as you go."
}, { "i": 247, "speaker": "Speaker 1", "text": "Now, you're right that when you move on to this stage" }, { "i": 248, "speaker": "Speaker 1", "text": "and you run more tokens, you're building up a" }, { "i": 249, "speaker": "Speaker 1", "text": "new cache as well to work with." }, { "i": 250, "speaker": "Speaker 1", "text": "So there are other tricks you might be able to use" }, { "i": 251, "speaker": "Speaker 1", "text": "to make that better." }, { "i": 252, "speaker": "Speaker 2", "text": "And then, how much effort goes into data pipelines" }, { "i": 253, "speaker": "Speaker 2", "text": "and cleaning the data we have versus writing the software" }, { "i": 254, "speaker": "Speaker 2", "text": "to actually train the models and evaluate them?" }, { "i": 255, "speaker": "Speaker 1", "text": "Yeah, that's a great question." }, { "i": 256, "speaker": "Speaker 1", "text": "I would say everything about building large-scale" }, { "i": 257, "speaker": "Speaker 1", "text": "RL is challenging." }, { "i": 258, "speaker": "Speaker 1", "text": "We have a relatively small team here." }, { "i": 259, "speaker": "Speaker 1", "text": "There are maybe about 40 people involved" }, { "i": 260, "speaker": "Speaker 1", "text": "in Composer 2." }, { "i": 261, "speaker": "Speaker 1", "text": "It's about half researchers and half engineers." }, { "i": 262, "speaker": "Speaker 1", "text": "And I would say data and rewards make up a large portion of that." }, { "i": 263, "speaker": "Speaker 1", "text": "But lots of things are challenging here." }, { "i": 264, "speaker": "Speaker 1", "text": "Running good evals that run consistently" }, { "i": 265, "speaker": "Speaker 1", "text": "is an interesting problem." }, { "i": 266, "speaker": "Speaker 1", "text": "Building the low-level kernels for training" }, { "i": 267, "speaker": "Speaker 1", "text": "is a whole challenge of its own."
}, { "i": 268, "speaker": "Speaker 1", "text": "And then orchestrating this whole pipeline," }, { "i": 269, "speaker": "Speaker 1", "text": "like how you build the distributed systems to run all these," }, { "i": 270, "speaker": "Speaker 1", "text": "to build all these," }, { "i": 272, "speaker": "Speaker 1", "text": "to run all these rollouts simultaneously" }, { "i": 273, "speaker": "Speaker 1", "text": "and make it efficient, is also a big challenge" }, { "i": 274, "speaker": "Speaker 1", "text": "to work with." }, { "i": 275, "speaker": "Speaker 2", "text": "And then I'm seeing a lot of students in the audience," }, { "i": 276, "speaker": "Speaker 2", "text": "and they're very curious:" }, { "i": 277, "speaker": "Speaker 2", "text": "what would you recommend folks with access to one GPU" }, { "i": 278, "speaker": "Speaker 2", "text": "do to learn this? And," }, { "i": 279, "speaker": "Speaker 2", "text": "what would you recommend students who want to work on this" }, { "i": 280, "speaker": "Speaker 2", "text": "study and train up on?" }, { "i": 281, "speaker": "Speaker 1", "text": "Yeah, so in my former life, before working at Cursor," }, { "i": 282, "speaker": "Speaker 1", "text": "I was a professor." }, { "i": 283, "speaker": "Speaker 1", "text": "So I used to teach this stuff in practice." }, { "i": 284, "speaker": "Speaker 1", "text": "There's lots of different ways to do this." }, { "i": 286, "speaker": "Speaker 1", "text": "There's lots of great resources out there for this kind of thing."
}, { "i": 287, "speaker": "Speaker 1", "text": "One thing I would recommend is there's a lovely course at Stanford" }, { "i": 288, "speaker": "Speaker 1", "text": "that goes through all the details of large language models:" }, { "i": 289, "speaker": "Speaker 1", "text": "how they work, how they're made efficient, what data to use," }, { "i": 290, "speaker": "Speaker 1", "text": "all those sorts of things." }, { "i": 291, "speaker": "Speaker 1", "text": "I think that's a great resource." }, { "i": 292, "speaker": "Speaker 1", "text": "I think all the videos are online." }, { "i": 293, "speaker": "Speaker 1", "text": "I also personally have some resources if you're interested." }, { "i": 294, "speaker": "Speaker 1", "text": "I have a series of puzzles" }, { "i": 295, "speaker": "Speaker 1", "text": "about learning this kind of stuff that might be useful." }, { "i": 296, "speaker": "Speaker 1", "text": "But when you have very few jobs, you're going to have a lot of time." }, { "i": 297, "speaker": "Speaker 1", "text": "I think" }, { "i": 298, "speaker": "Speaker 1", "text": "figuring out how it works is worthwhile."
}, { "i": 299, "speaker": "Speaker 2", "text": "And then I think there are a few questions about whether Cursor is considering using different thinking" }, { "i": 300, "speaker": "Speaker 2", "text": "modes or thinking-effort modes." }, { "i": 301, "speaker": "Speaker 2", "text": "I know some of the other labs have high, medium, low, and so forth." }, { "i": 302, "speaker": "Speaker 1", "text": "Oh, yeah." }, { "i": 303, "speaker": "Speaker 1", "text": "No, it's something we've considered." }, { "i": 304, "speaker": "Speaker 1", "text": "I think there are nice ways of building that into the system." }, { "i": 305, "speaker": "Speaker 1", "text": "As I mentioned earlier, you can, during RL, enforce certain properties and have the" }, { "i": 306, "speaker": "Speaker 1", "text": "model work better or worse under certain capacities." }, { "i": 307, "speaker": "Speaker 1", "text": "I think from a product point of view, we found it to be a little complex, so we've kind" }, { "i": 308, "speaker": "Speaker 1", "text": "of focused on a single version." }, { "i": 309, "speaker": "Speaker 1", "text": "But I could certainly see us adding that in the future." }, { "i": 310, "speaker": "Speaker 2", "text": "And then I'm seeing some questions about overall team structure and how we do things: who owns" }, { "i": 311, "speaker": "Speaker 2", "text": "the agent harness?" }, { "i": 312, "speaker": "Speaker 2", "text": "Is this driven by the product team or the research team?" }, { "i": 313, "speaker": "Speaker 2", "text": "And similarly with the system prompt." }, { "i": 314, "speaker": "Speaker 1", "text": "Yeah, that's a great question, too." }, { "i": 315, "speaker": "Speaker 1", "text": "Cursor is a very interesting company in that we are developing a harness that is globally" }, { "i": 316, "speaker": "Speaker 1", "text": "used across all the models."
}, { "i": 317, "speaker": "Speaker 1", "text": "So we have a completely separate team that works on the harness design and the tool use" }, { "i": 318, "speaker": "Speaker 1", "text": "and things like that, because it's part of our product." }, { "i": 319, "speaker": "Speaker 1", "text": "In some sense, it is our product." }, { "i": 320, "speaker": "Speaker 1", "text": "And we want it to work extremely well with Opus and extremely well with GPT models." }, { "i": 321, "speaker": "Speaker 1", "text": "And in fact, in recent evaluations from Artificial Analysis, we've been able" }, { "i": 322, "speaker": "Speaker 1", "text": "to see that the best setups at coding are, like, Cursor's harness plus other models." }, { "i": 323, "speaker": "Speaker 1", "text": "So that gives an extremely good sense of how important the harness is." }, { "i": 324, "speaker": "Speaker 1", "text": "That being said, there is a lot of back and forth, like: do we want to change this?" }, { "i": 325, "speaker": "Speaker 1", "text": "How will it affect training?" }, { "i": 326, "speaker": "Speaker 1", "text": "How do we train certain things in or not?" }, { "i": 327, "speaker": "Speaker 1", "text": "Luckily, I would say things are stable enough now, and the models are good enough, that it's" }, { "i": 328, "speaker": "Speaker 1", "text": "kind of hard to mess this up." }, { "i": 329, "speaker": "Speaker 1", "text": "But it's an interesting back and forth." }, { "i": 330, "speaker": "Speaker 2", "text": "Awesome." }, { "i": 331, "speaker": "Speaker 2", "text": "And I know we're out of time, and I see lots of great questions in the chat." }, { "i": 332, "speaker": "Speaker 2", "text": "But I did want to end with this: why is Composer named Composer?" }, { "i": 333, "speaker": "Speaker 1", "text": "Oh, yeah." }, { "i": 334, "speaker": "Speaker 1", "text": "So I think Cursor originally had an agent product that was called Composer."
}, { "i": 335, "speaker": "Speaker 1", "text": "And then at some point, the term agent caught on globally." }, { "i": 336, "speaker": "Speaker 1", "text": "And so we changed the name of that to Agent." }, { "i": 337, "speaker": "Speaker 1", "text": "But I think everyone here kind of liked that name." }, { "i": 338, "speaker": "Speaker 1", "text": "We wanted to keep it around." }, { "i": 339, "speaker": "Speaker 1", "text": "So we kind of decided to borrow that name as the name of the model" }, { "i": 340, "speaker": "Speaker 1", "text": "itself." }, { "i": 341, "speaker": "Speaker 1", "text": "And I think it's a really nice name." }, { "i": 342, "speaker": "Speaker 2", "text": "I think we see it like an orchestra." }, { "i": 343, "speaker": "Speaker 2", "text": "You're kind of conducting the symphony of agents that are working." }, { "i": 344, "speaker": "Speaker 2", "text": "Well, thank you, everyone, for joining." }, { "i": 345, "speaker": "Speaker 2", "text": "Thank you, Sasha, for the great demo." }, { "i": 346, "speaker": "Speaker 2", "text": "And we'll send out the recording afterwards." }, { "i": 347, "speaker": "Speaker 2", "text": "Have a good one." } ]