How can AI autonomously code, test, and optimize? From independent thinking, to terminal access, and finally rewriting the future!
Compiled by | Eric Harrington
Produced by | AI Tech Headquarters (ID: rgznai100)
Who will spark the next wave in the world of code? When AI is no longer just an auxiliary tool, but transforms into an intelligent software engineer capable of independent thought, terminal access, and even its “own dedicated computer,” the future of software development is being completely rewritten. From Devin, unveiled last year and touted as the “first AI programmer,” to GitHub Copilot becoming a mainstream tool for programmers worldwide, to the explosion of Cursor this year, and then, just a few days ago, OpenAI’s release of its coding agent product Codex—these visions are steadily becoming reality.
Today, we share a recent in-depth interview from the renowned AI engineer podcast Latent Space, where the host invited Josh Ma and Alexander Embiricos, core members of the Codex team.
They shared the origin of the Codex project—from the “AGI glimmer” moment brought by giving models terminal access, to the ambitious blueprint of building an “intelligent software engineer.” This conversation not only revealed the technical thinking and product philosophy behind Codex but also explored a new paradigm of human-AI pair programming and how developers can ride the wave in this intelligent era.
AI is evolving from an auxiliary tool to an intelligent software engineer capable of independent thought, terminal access, and its “own dedicated computer,” completely rewriting the future of software development.
Granting AI models terminal access was a key “AGI glimmer” moment for the OpenAI team, spurring the idea of equipping agents with their own computers.
OpenAI core members predict that an “intelligent software engineer” capable of independently completing software engineering tasks could be built within the next two years.
Codex is not just a coding model; it’s an agent skilled at independently completing software engineering tasks and working autonomously for extended periods, aiming to “get it done in one go” for complex tasks.
In the AI era, the model itself is the product core. In the future, models will handle more decisions, while human developers will focus more on architectural design and innovative work where AI is not yet proficient.
Here are the highlights of the interview:
Host: Today’s guests are Josh Ma and Alexander Embiricos from the ChatGPT Codex team. Alexander is a new acquaintance; he’s been leading many of Codex’s tests and demos.
Alexander: Hi everyone, I’m Alexander from OpenAI’s ChatGPT Codex product team.
Host: We’ll assume readers have seen the Codex launch livestream. You also published a blog post that day with many interesting demo videos. I noticed in the demo videos that the engineers were working alone, isolated, talking to their AI partners and coding together. I don’t know if that’s the atmosphere you wanted to create, but that’s how it felt to me.
Alexander: That’s fair. With those videos, we were striving for ultimate authenticity, for the engineer themselves to talk about how AI helped them. I’ll take that feedback.
Host: That’s true, though. Sometimes, working the night shift is lonely, and mobile development can be quite solitary, as there aren’t many people in those roles. So, I completely understand. Anyway, what exactly have you all been working on? Perhaps we can start there. How did you get pulled into this project? And what has been the progress since then?
“Human-AI Pair Programming, Unleashing the Tremendous Value of Giving Models Terminal Access”
Alexander: Perhaps I’ll go first, because there’s quite an interesting story about how the two of us started working together. Before joining OpenAI, I built a macOS native application called Multi, which focused on human-to-human collaboration, a kind of pair programming tool. Then things like ChatGPT caught fire, and we started to wonder: what if it was no longer human-to-human pair programming, but human-to-AI pair programming?
So, I won’t go into the details of the twists and turns, but that was the journey, and then we both joined OpenAI. I used to primarily work on desktop software. Then we released reasoning models. I’m sure you’re far ahead in understanding the value of reasoning models, but for me, it was initially just a more powerful chat tool. But once you can give it tools, it can truly transform into an “agent.” An agent is a reasoning model equipped with tools, an environment, security boundaries, and possibly trained for specific tasks.
In any case, we became deeply interested in this and began thinking about how to bring reasoning models to the desktop. At the same time, OpenAI was conducting a lot of internal experiments, trying to give these reasoning models terminal access. To be clear, I wasn’t involved in those initial experiments, but they were the first time I truly felt the “AGI glimmer.” It happened when I was talking to a designer named David K, who was working on a project called Scientist at the time. He showed me a demo where it could self-update, something like changing a background color, which today might not impress any of us anymore.
Host: Do you mean modifying its own code?
Alexander: Yes. And they also had hot reloading set up then. I was absolutely blown away. To this day, that’s still a super cool demo. We tried a lot of similar things back then, and I later joined one of the small groups working on this direction. We realized that figuring out how to give reasoning models terminal access had immense value. Then, we had to solve the problem of how to turn it into a useful product and how to ensure its security, right? You can’t just let it run wild in your local file system, but that’s how people initially tried to use it.
Many of these experiences eventually evolved into the recently released Codex CLI. Among them, what I’m most proud of is the thinking behind implementing features like full auto mode, and in this mode, we actually enhanced sandbox isolation to ensure user safety.
We were doing things like this, and then started to realize that we wanted the model to have longer “thinking” time, we wanted the model to be larger, and we wanted the model to be able to do more things safely without any approval. So we thought, maybe we should give the model its own computer, give this agent its own dedicated computer. At the same time, we were also trying to put the CLI into our continuous integration (CI) process to make it automatically fix tests. We also came up with a whimsical hack to make it automatically fix tickets on our Linear issue tracker. And so, we eventually created this project called Codex, whose core idea is to give agents access to computers. Oh, I realize I might not have answered what I personally did, but anyway, I’ve told the story, hope it’s okay.
Host: You’ve cleverly woven your personal experience into the grand narrative. I’m sure Josh has more to add.
“Building an Intelligent Software Engineer in Two Years!”
Josh: My story is a bit different. I’ve been at OpenAI for two months, and it’s been the most interesting and chaotic two months of my life. But perhaps I should start with a company I founded a few years ago called Airplane. We were building an internal tools platform, and the original intention was to make it very easy for developers to build internal tools and truly delve into developer needs. This might not sound related to what we’re doing now, but in many ways, similar themes have re-emerged: What’s the best form of local development? How do you deploy tools to the cloud? How do you run code in the cloud? How do you combine fundamental modules like storage, compute, and user interfaces to allow developers to build software extremely quickly?
I often joke that we were just two years early. Towards the end of the project, we started trying GPT-3.5, wanting to make it cooler. We were already able to quickly set up a React view back then. I think if we had continued, perhaps it would have evolved into the AI building tools you see today. But that company was eventually acquired by Airtable, and I led some AI engineering teams there.
Personally, earlier this year, I witnessed the progress being made in agentic software development. For me, that was sort of my own “moonshot moment.” I had a premonition that this big thing was about to happen. Whether I was involved or not, within the next two years, I believe we will build an agentic software engineer. So, I reached out to a friend at OpenAI and asked, “Hey, are you guys doing something similar?” He looked at me wide-eyed and said, “I can’t tell you anything, but maybe you can talk to the team.”
So, very fortunately, Alex and Foster were just starting related projects at the time. I remember in the interview, we had a fierce debate about the product form, right? Should it be a command-line interface (CLI)? The problem with that form is that you can’t always interrupt it while it’s completing a task, and you might want to run it four or ten times simultaneously. Perhaps that’s when I said, maybe it’s better to have both. We are now working towards that direction. All in all, I was, and still am, very excited to push this project forward. I think Codex is still at a very nascent stage. It’s great to share it with the world, but there’s still a lot of work to do.
Alexander: Our first meeting was a very interesting conversation. He walked in—I’d never encountered this before—and said, “The world is undergoing such a transformation, so I want to build a product like this. I know you can’t confirm if you’re doing this, but this is the only thing I want to do.” Then I asked a few open-ended questions, and we immediately delved into some core points of contention about the tool’s form factor. I thought, great, we have to work together.
Host: I suppose people in the developer tools circle always recognize kindred spirits at a glance.
On a side note, early iPhone teams at Apple were like that too, because team members didn’t know if others were on the same project; they weren’t allowed to tell each other. So they had to rely on “triangulation” to figure it out.
Product Form Discussion: CLI or Cloud?
Host: Speaking of product form, you mentioned the already released CLI, and I think there are other cloud-based code tools on the market, like Aider, and so on. Should everyone consider Codex in ChatGPT as a hosted version of Codex CLI? Are there significant differences between the two? Let’s talk about that.
Alexander: You go.
Josh: I think, simply put, it just allows you to run the Codex agent in OpenAI’s cloud. But I think the product form is much more than just where the computer runs. It’s about how it integrates with the user interface, how it scales over time, how caching and permissions are managed, and how collaboration is achieved. So, I don’t know if you agree, but I think the product form is the core.
Alexander: Honestly, it’s been a very interesting journey. The other day, or maybe last night, Josh went to sleep because he had a live stream, and I didn’t. Anyway, a few of us reviewed the document where we planned which features to release, and we found that our project scope had expanded quite a bit without us realizing it. But in reality, all these scope increases were natural, because we increasingly agreed with the idea that this is not just a model that is good at coding, but an agent that is good at independently completing software engineering tasks. The deeper we delved into this idea, the more distinct things became.
So, we can put aside the topic of the entire computing platform that Josh is responsible for. Just talking about the model itself, we don’t just want it to be good at writing code, nor do we just want it to solve tasks on SWE-bench. SWE-bench, for those who don’t know, is an evaluation benchmark that has specific ways to functionally score output. But if you look at many of the outputs from agents tested on SWE-bench, they are actually not PRs (Pull Requests) you would merge into your codebase, because the code style might be very different. It works, but the style is off.
Therefore, we spent a lot of time ensuring that our model is very good at following instructions and very good at inferring code style, so you don’t have to explicitly tell it. But even so, suppose you get a PR with good code style and that follows your instructions well. If the model’s description of how it built it is endlessly long, this PR might still be difficult to merge. And you’ll likely need to pull it to your local machine to test the changes and verify their effectiveness. If you only run one change, this might be acceptable, but in the future world we envision, most of the code might actually be completed in parallel by the agents we delegate, so for human developers, the ability to easily integrate these changes becomes crucial.
For example, another thing we started training on was PR descriptions. We really want to perfect the art of writing good, concise, and focused PR descriptions. So, our model actually writes beautiful, brief PR descriptions, and PR titles that conform to your code repository’s format. If you like, you can also provide an agents.md file, which lets you guide it more finely. Then, in the PR description, it will also reference relevant code it found during the process, or relevant code within its PR, so you can see it on hover.
Perhaps my favorite feature is how we handle testing. The model will try to test its changes, and then in a very friendly way, like a checkmark, it tells you whether those tests passed. Similarly, if the tests pass, it will cite the relevant evidence from the logs, so you can see it and be confident: “Okay, I know this test passed.” If the tests fail, it will say: “Hey, this didn’t work. I think you might need to install pnpm or something,” and then you can check the logs to find out what went wrong. These are the things we’ve been working on, essentially building this software engineer agent in the cloud—oh, I think I forgot what the original question was, but these are the things we’ve been diving deep into.
Josh: I also feel it’s very different. You can just look at the features, but I think, for me, the feeling is that you have to take a “leap of faith.” The first few times you use it, you’ll think, “I’m really not sure if this thing will work.” Then it runs for thirty minutes. But when it comes back with results, you’ll be amazed: “Wow, this agent actually went out, wrote a bunch of code, even wrote scripts to help modify its own changes, tested them, and truly thought through the changes it wanted to make completely.” At first, I completely didn’t believe it would succeed. But after using it a few times, you’ll feel, “Wow, it actually got it done!” That ability to work independently for extended periods is hard to put into words; you have to try it yourself. But ultimately, it feels completely different, very special.
Host: I’ve used it. I just submitted a PR to it a few minutes ago. I was lucky enough to be among the first 25% of internal testers. It’s very useful. But it took a shortcut because it couldn’t figure out how to run RSpec in a Rails environment, so it just checked the Ruby file’s syntax and said, “Looks good to me.” But I guess it hasn’t used agents.md yet. Once I configure that, it should be fine.
From agents.md to “Conscious Naming”
Host: If you could list some best practices, that would be great. I noticed from the livestream that they mentioned professional users would install linters and formatters, so the agent could utilize these validators in the development environment. It turns out these are also best practices for developers, but now agents can use them automatically. Commit hooks have always been a thorny issue for humans, because in teams I’ve been on, some insisted everything must have commit hooks, while others found them obstructive and just deleted them all. But in reality, for agents, commit hooks are very useful.
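To make the commit-hook point concrete, here is a minimal sketch of the kind of pre-commit hook being described; the specific tools (Prettier, ESLint, the npm test command) are assumptions for a typical Node project, not anything Codex ships.

```sh
#!/bin/sh
# Hypothetical .git/hooks/pre-commit, illustrative only.
# Running the formatter, linter, and a quick test pass on every commit gives an
# agent the same fast feedback a human would get, before the change leaves the machine.
set -e
npx prettier --check .   # formatting check
npx eslint .             # lint check
npm test --silent        # quick test pass
```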
Josh: You’ve said exactly what I was about to say. The three points I want to make are: First, agents.md. We’ve put a lot of effort into ensuring that the agent can understand this hierarchical structure of instructions. You can put them in subdirectories, and it will understand which instructions have higher priority. And, we are now also starting to use GPT-3 and GPT-4 to write agents.md files for us.
Host: I like these tricks. You actually open-sourced the prompt descriptions here.
Josh: Yes.
Host: Is there anything worth highlighting?
Josh: I think it’s good to start simple; don’t make it too complicated from the beginning. A simple agents.md file can be a huge help, far better than none. Then, it’s about learning as you use it. What we really hope is that in the future, this file can be automatically generated for you based on the PRs you create and the feedback you give, but we decided to release it early rather than pursue perfection.
Host: You mentioned that you also use GPT-3 and GPT-4 to write agents.md.
Josh: I would give it the entire directory and say, “Hey, generate an agents.md.” In fact, these days I do this with Code One—oh, sorry, Codex One—because it can traverse your directory tree and generate these files. So, I recommend investing in agents.md gradually and incrementally. And then, as you said, configure the most basic linters and formatters. This actually brings significant benefits: it’s similar to getting some out-of-the-box checks when you open a new project in VS Code. Compared to a human, an agent starts out without that advantage, so doing this gives that advantage back to the agent. Do you have anything else to add?
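As a rough illustration of “start simple,” here is a minimal agents.md sketch; the commands and conventions below are placeholder assumptions for a typical TypeScript repo, not Codex defaults.

```markdown
# agents.md (hypothetical example)

## Setup
- Install dependencies with `pnpm install`.

## Checks before finishing
- Run `pnpm lint` and `pnpm test`; both must pass.

## Conventions
- Match the existing code style of whatever package you touch.
- Keep changes small and focused; PR titles use the form `area: short description`.
```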
Alexander: I have an analogy for that point, and then I also want to talk about how to prepare for this based on our experience with other coding agents, or any coding agents for that matter. The analogy I like is: if you start with a foundational inference model, you essentially get a very precocious, brilliant, knowledgeable, but also a bit quirky, gifted college graduate. We all know that if you hire someone like that and ask them to do software engineering work independently, they’ll be missing a lot of practical things.
So, a lot of what we’ve done with Codex One is essentially giving it those first few years of work experience. That’s really what the training is about, getting it to be more worldly. If you think about it, writing a good PR description is a classic example, maybe even knowing what not to put in.
And then you get this: a strangely knowledgeable, gifted but quirky college graduate who now also has a few years of experience. Then, every time you start a task, it’s like their first day at your company. So, agents.md is basically a way for you to compress the “onboarding exploration” time it has to do, to let it learn more upfront. As Josh said, we certainly hope to automate that eventually—right now it’s still a research preview, so you have to update it yourself—but we have a lot of ideas there. It’s just an interesting analogy.
Josh: Perhaps my last point is to make your codebase discoverable. This is equivalent to maintaining good engineering practices for new hires, allowing them to understand your codebase faster. Many of my prompts start with something like, “I’m working in this subdirectory, I want to accomplish this, can you help me do it?” So, giving that kind of guidance and scope limitation is very helpful.
Alexander: I usually give three pieces of advice. First, language choice. I was talking to a friend the other day, who’s a newcomer to AI, and he said, “I want to try to build an agent product, should I use JavaScript?” And I replied, “Are you still using JavaScript? No wonder. At least use TypeScript, give it some type information.” So, I think that’s the most basic point. I believe those listening to our podcast now probably don’t need me to emphasize this anymore.
Another point is, make your code modular. The more modular and testable your code is, the better it works. You don’t even need to write tests yourself; the agent can write them, but you have to design the architecture to be modular. I recently saw a demo from someone here who—he’s not someone who just codes by feel, but a professional software engineer—was building a new system using a tool like Codex. He built the system from scratch, and there was a chart showing his code submission rate. Then his system gained a certain user base, and it got to the stage of “now we need to port it into ChatGPT’s huge monolithic codebase,” a codebase that had experienced insane hyper-growth, so maybe the architectural planning wasn’t perfectly thought out. As a result, the same engineer, using the same tools, even with the AI tools constantly improving, saw his code submission rate plummet.
So, I think another point is that architecture, good architecture, is more important than ever. And it’s interesting that currently, this is still something humans are truly good at. So, for software engineers to do their job well, it’s good and important.
Host: Please don’t look at my codebase.
Alexander: Then don’t look at mine even more. But the last point is an interesting story: our project’s internal codename was Wham, WHAM. When we chose it, I was with our research lead, and he said, “Hey, before choosing a codename, remember to grep the codebase.” So, we searched the codebase, and the string “Wham” only appeared within a few longer strings, never as a standalone string. This meant that when we wrote prompts, we could be very efficient and just say “in Wham.” Then, whether it was Wham code in our web codebase, server codebase, shared types, or anywhere else, the agent could find it efficiently. Conversely, if we had named the product ChatGPT Code—not that we didn’t consider it—it would have been difficult for the agent to figure out where we wanted it to look, and we might have had to provide more relative folder paths. So, when you start thinking proactively: I’ll have an agent in the future, and it will use the terminal to search, then you can start naming consciously.
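The codename check Alexander describes is easy to reproduce; here is a sketch of that kind of search, where the codename “wham” comes from the anecdote and the exact flags are an assumption.

```sh
# Does the proposed codename already appear as a standalone word anywhere in the repo?
# -r recurse, -I skip binaries, -n line numbers, -i ignore case, -w whole-word match
grep -rIniw "wham" .
# Few or no hits means the name is cheap and unambiguous for an agent to grep for later.
```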
Host: Would you start sacrificing some human readability in naming for the sake of agent understanding? In your opinion, what’s the trade-off there?
Josh: That’s interesting, because when I first joined OpenAI, I definitely came in with some preconceived notions, but I now believe that these two systems (human readability and agent readability) are actually highly convergent. Maybe it’s because you see both humans and AI writing it. Perhaps in a world where only AI maintains a codebase, the assumptions would change. But once you have to break that “fourth wall” and have humans involved in code review, deploying code, you need the code to bear human marks everywhere. So, how humans convey where to modify, how to convey bugs that need fixing, or how to convey business requirements, all of that won’t disappear immediately. Therefore, I think the entire system is actually still very “human.” I know there might be cooler answers, like it’s an alien creature, completely different, but I think these things started with large language models, and their roots are deeply embedded in human communication.
Alexander: By the way, if you want to change the topic, feel free to interrupt us, because I realize we’ve just been talking among ourselves.
agents.md vs. readme.md
Host: No, I think this also relates to agents.md. Why agents.md and not readme.md? I think, in your view, there are some fundamental differences in how agents and humans consume information. So I’m curious, do you think this difference is at the class naming level, or just at the instruction level, or where is the boundary?
Josh: You go.
Alexander: For this naming, we considered several options. You could use readme.md, or contributors.md. You could also use Codex agent.md, and then maybe a Codex CLI.md, as two separate branded files.
Host: And like Cursor rules or Windsurf rules, every company has its own rules file.
Alexander: And then you could also choose agents.md. There are a few trade-offs here. I think one is openness, and the other is specificity. So what we considered was, well, there might be things you want to tell the agent, but don’t need to tell contributors. Similarly, there are things you want to tell contributors to really help them set up in your codebase, etc., which you don’t need to tell the agent; the agent can figure it out itself. So we thought, these two might be different, and the agent will read your readme anyway, so agents.md might be where you put information you need to tell the agent, but which it can’t automatically get from the readme. So we made that decision.
Then we considered, agents come in different forms. What’s most special about what we’re building and releasing is that it provides an out-of-the-box way to use cloud agents, which can process many tasks in parallel, can think for long periods, and can safely use a lot of tools. So we thought, then, how much difference is there between the set of instructions you want to give such an agent and the set of instructions needed for an agent you use more collaboratively on your local machine? Honestly, we discussed this quite a bit. Ultimately, we concluded that the differences between these instruction sets aren’t so large that you’d need to namespace this file. If there is indeed something that needs namespace differentiation, you can probably just use natural language within the file to explain it.
Finally, we considered how much difference there is between the instructions you need to give our agent and the instructions you might give an agent running on a different model or built by a different company? We just felt that if you had to create all these different agent files and so on, the experience would be very bad. This is also partly why we open-sourced Codex CLI; many problems, such as the security issues involved in how to safely deploy these things, need to be solved, and everyone shouldn’t have to reinvent the wheel. So, that’s why we chose a non-branded name.
Josh: I have a specific example of why readme and agents.md are different. For agents, I think you actually don’t have to tell it about code style. It will look at your codebase and write code consistent with that style. Human developers, however, won’t spend time reading through the entire codebase and following all conventions. This is just one example; ultimately, there’s a difference in how these two “developers” approach problems.
Model as Product?
Host: I think these suggestions are great. What you just said could totally be the title of this podcast episode. Let’s call it “Best Practices for Using ChatGPT Codex.” I think everyone wants to know what the best practices are.
I noticed a very interesting phenomenon. In building agents, there always seem to be two schools of thought. One is to try to control it more strongly, making it more deterministic. The other is to try to just give prompts and then trust the model. I think your approach leans heavily towards “prompt and trust the model.” I see in the agents.md system prompts, you just tell it to act in the way you expect, and then you count on the model to do that. Of course, you can control the model, so if it doesn’t perform well, you can train it. But one thing that puzzles me is, how do you put everything into context? What if my agents.md file is super long? In your live demo, you were working on OpenAI’s huge monorepo. So, how do you manage caching, context windows, and these things?
Josh: If I told you right now that everything still fits into the context window, would you believe me?
Host: Not for OpenAI’s codebase, right?
Josh: No, it’s everything the agent needs.
Host: Oh, I see. So you would streamline agents.md and put it at the very beginning? Like another system prompt.
Josh: No, it’s a file, and the agent knows how to grep, find, and read it into context, because there might be multiple such files. You can actually see in the work log that it actively looks for agents.md. It’s trained to do that.
What I want to say is that after joining OpenAI, I found it really interesting that when you’re thinking about where models are going in the future and what AI products will look like in a few years, you design products in a different way. Before joining OpenAI, especially when you don’t have a research team and a lot of GPU resources, you build these deterministic programs and scaffold a lot around how it works. But you’re not really letting the model reach its full potential.
When I first joined, there was an interesting phenomenon where I was often questioned: “Hey, why don’t we just hardcode this? Listen, you keep misusing this tool, can’t we just say ‘don’t do that’ in the prompt?” And the researchers would say, “No, no, we don’t do that. We want to do it the right way. We want to teach the model why this is the right way to do it.” I think this relates to a broader thought, which is: Where should you set deterministic “guardrails,” and where do you truly let the model “think”?
The same discussion applies to the planning phase. Should we set a clear planning phase, like having it “think aloud first, write down what you’re going to do, and then do it”? Of course, but what if the task is simple? Do you really want it to keep thinking? What if it needs to replan during execution? Do you set up all sorts of if-else conditions and heuristics for that? Or should you train a very good model that knows how to switch between these thinking modes? So, it’s hard. I did also advocate for setting some small “guardrails” until the next training run was complete. But I think what we’re truly building for is a future—a future where the model can make all these decisions. What’s really important is that you provide it with the right tools: methods for managing context, managing memory, and exploring the codebase. These are still very important.
Alexander: Yes, well said. I think building products here is very interesting and different. It’s not just that the model is a component of the product; the model is the product itself. You somewhat need to approach it with a humble attitude and think: there are three parties here, the user, the developer, and perhaps the model. What things does the user need to decide from the outset? And then, what things can we developers decide better than the model? And then, what things can the model itself decide best? Every decision must fall into one of these three categories.
Not everything is decided by the model. For example, we currently have two buttons in the user interface: “Ask” and “Code.” These two functionalities could perhaps be inlined into decisions made by the model. But for now, giving the user choice from the start is very reasonable, because we will launch a different container for the model based on the button the user presses. So, if you request coding, we’ll put all the dependencies in—I’m simplifying here. But if you don’t request coding, and just ask a question, we’ll do a much faster container setup before the model gets any choice. So, that might be a user decision.
In some ways, user and developer decisions converge at the environment level. But ultimately, many agents I see are impressive, but part of what makes them impressive is that a group of developers built a very customized state machine around a series of short model calls. This means the upper limit of complexity that the model can solve is actually limited by what the developer’s brain can contain. And we hope that, over time, these models will be able to independently solve increasingly complex individual tasks and deal with increasingly complex problems. Eventually, you can even imagine a team of agents working collaboratively, perhaps with one agent managing these agents. In that case, complexity would explode.
So, we genuinely want to push as much complexity as possible, as many state machines as possible, to the model to handle. Thus, you get these two building modes. On the one hand, you are building the product’s user interface and rules. On the other hand, you still need to do work to make the model learn certain things, but what you need to do is figure out what correct things this model needs to see during training to learn. So, realizing this change still requires a lot of human effort, but it’s a completely different way of thinking: we want the model to see this.
“Code” vs “Ask”
Host: But how do you build the product to capture these signals? If you think about the “Code” and “Ask” buttons, it’s almost like you’re having users tag prompts in a way—because they say “Ask,” that’s an asking prompt; they say “Code,” that’s a coding prompt. In building this product, are there any other interesting product designs based on the idea of “we think the model can learn this, but we don’t have data, so we design Codex this way to help us collect data”?
Josh: File context and scope limitation: we don’t have good built-in features for them yet, but those are features we clearly need to add. This is another example of that kind of situation. We’re often pleasantly surprised to find, “Oh, it actually found the exact file I was thinking of,” but that takes some time. So often, you’ll shorten a long chain of thinking by directly saying, “Hey, I’m looking in this directory, can you help me find something?” So I think that situation might continue for a while until we have better-architected indexing and search capabilities.
Host: Okay, cool. Regarding Codex itself, we have a few factual questions to wrap up, and then we want to dive deeper into the computing platform side of things, which I think Josh you’d prefer to talk more about. I noticed in the details that task durations are between 1 and 30 minutes. Is that a hard limit? Have you ever encountered tasks that lasted longer? Any comments on task duration?
Josh: I just checked the codebase before coming here. Others have similar questions. Our current hard limit is one hour. But, don’t take it too seriously; this can be adjusted at any time. The longest I’ve seen is two hours, that was in development mode, and the model was completely out of control. So, I think 30 minutes is a reasonable range for the types of tasks we’re trying to solve. These are tough tasks that require a lot of iteration and testing, and the model needs this time to think.
Alexander: Our average duration is far below 30 minutes, but if you give it a daunting task, it can indeed take up to 30 minutes.
Host: I think there are a few analogies here. One is that the Operator team ran a benchmark, and they had to set the upper limit at two hours. Another is the METR paper; I don’t know if it’s circulated widely, but they estimate the current average autonomous running time is about an hour, and that it might double every seven months. So an hour sounds about right, but that’s also the median, so there will definitely be some tasks that exceed that time.
Alexander: Exactly.
Host: In SWE-bench Verified, 23 examples couldn’t run. Was this related to the time limit, or other reasons?
Alexander: Honestly, I’m not entirely sure, but I feel some of the SWeBench cases themselves, saying they’re invalid might be a bit much, but I think they do have some issues when running, so they just don’t run.
Host: Okay. So what about maximum concurrency? Is there a concurrency limit? What if I run 5, 10, 100 Codex instances simultaneously?
Alexander: 5, 10 is perfectly fine. We did set a limit to prevent abuse, but I’m not sure what the exact number is.
Josh: I think it’s currently 60 per hour.
Host: One per minute.
Josh: But, listen, that’s the key. In the long run, we actually don’t want you to bother thinking whether you’re delegating tasks to AI or collaborating with AI. Imagine an AGI super assistant; you just ask questions, just talk to it, and it gets things done. When a quick answer is needed, it’s fast; when it needs to think for a long time, it takes its time. And you don’t have to just talk to it; it’s integrated into all your tools. That’s the long-term goal. But in the short term, it’s a tool you use to delegate tasks.
The best way we’ve observed using it—back to what might be the theme of this podcast, “Best Practices”—is that you must have an “abundance mindset,” and you have to view it as helping you explore things, not consuming your time. So, usually when an agent is going to work on your computer, you’ll carefully ponder the prompt because it will occupy your computer for a while, and you might not be able to do anything else. But we see that the people who love using Codex the most don’t do that; they spend at most 30 seconds thinking about the prompt. It’s like: “Oh, I have an idea, let’s go!” “Oh, I want to do this, get it done!” “Oh, I just saw this bug or this customer feedback,” and then you send the task off. So, the more tasks you run in parallel, the happier we actually are, and we also think users will be more satisfied seeing that. That’s the tone of this product.
Host: I’ll also share my personal experience. I was one of the trusted testers for this project, and you both know that very well. Then I realized I was using it wrong. I was treating it like Cursor, keeping the chat window open, watching it write code. Then it dawned on me, I shouldn’t be doing that. It clicked: “Oh, you guys just throw tasks at it and then go do your own thing.” That truly is a shift in mindset.
Alexander: A quick addition, I’ll try to keep it brief. A very interesting use case is using it on a phone, because for some reason, operating on a phone changes people’s way of thinking. So we made the website responsive, and eventually, we’ll integrate it into the app. Try it out, it’s really very interesting and satisfying.
Host: So this is different from the mobile engineer coding on a phone shown in one of the videos; it can’t currently be used in the ChatGPT app.
Alexander: Not yet.
Host: I have a question about the notifications I receive on mobile. When it starts a task, it displays “Starting research,” just like a deep research notification. Is it using deep research as a tool, or are you just reusing the same notification?
Alexander: We just used the same notification.
What Can the Agent Access? Where Are the Security Boundaries?
Host: You mentioned the computing platform, and you also mentioned that you share some infrastructure with reinforcement learning (RL). Can you give the audience a general idea of what Codex can and cannot access? It seems users can’t run commands themselves, but can only instruct the model to do so. Are there other things to note?
Josh: This is an ongoing discussion, and we are still exploring which parts can be opened up for user and agent access, and which parts currently need to be retained. So we are still learning, and we very much hope to share as much access as possible with human users and agents, provided, of course, that it complies with safety and security constraints.
What you can do currently, as a human user, is set up the environment and set up run scripts. These scripts are typically used to install dependencies, and I expect this will account for about 95% of use cases. The goal is to prepare all the correct binaries for your agent to use. We actually also provide some environment editing experience, where human users can enter an interactive interpreter (REPL) and try some things. So, please don’t abuse it, but you do have ways to interact with the environment there.
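For concreteness, here is a minimal sketch of the kind of environment setup script Josh describes, installing dependencies up front because network access is cut off once the agent starts running; the project layout and commands are assumptions, not Codex’s actual defaults.

```sh
#!/usr/bin/env bash
# Hypothetical Codex environment setup script, illustrative only.
set -euo pipefail

# Install everything the agent will need while the network is still available.
pip install -r requirements.txt
pnpm install --frozen-lockfile

# Pre-build slow artifacts so the agent's lint/test runs start fast.
pnpm build
```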
Alexander: We hadn’t planned to make an interactive REPL that updates the environment, but we tried. Josh said, “Oh my god, we need this so badly,” so that was an example of scope creep. Thank you for making it.
Josh: We do have rate limits set, and we monitor them very carefully. But there are indeed some interactive features there to help you get started. Once the agent starts running, what we currently actually do—and hope to improve on—is that we cut off internet access. Because we still don’t fully understand the consequences of letting an agent operate freely in its own environment. So far, our security testing has shown fairly rigorously that it doesn’t easily fall for certain data-exfiltration attempts via prompt injection, but there are still many risks in this area, so we’re not yet sure. That’s why we adopted a more conservative strategy initially, where the agent doesn’t have full network access when it’s running.
But I really hope to change that: to allow it limited access, for example, to specific domains or specific code repositories. All in all, this is constantly evolving as we build the right systems to support these features. Not sure if that fully answered your initial question.
However, I also want to mention one last point, which is the issue of interactivity. When the agent is running, sometimes you might think, “Oh, I want to correct it and make it go somewhere else,” or “Let me fill in this part, and then you take over.” We haven’t fully solved these problems yet either. What we really wanted to pursue from the beginning was that fully independent, one-shot, high-value delivery mode. But we are definitely thinking about how to better combine humans and agents.
Host: To be fair, I think the “get it done in one go” angle is great, and others are comparing you to Devin, Factory, and other competitors. They focus more on multiple human feedback loops. I have a website under development, and I gave it a requirement, then compared it with other tools, and that was my test for Codex. It really got it done in one go.
This is really amazing, especially when you’re running 60 tasks simultaneously. So I think it makes a lot of sense. But it truly is a very ambitious goal, because human feedback is a “crutch” we’re happy to rely on. This also forces us to write more tests, which is annoying because I don’t like writing tests, but now I have to. Fortunately, I can now have Codex write tests for me. I also especially liked the demo in the livestream where you can just have it look at your codebase and suggest what to do, because I don’t even have the energy to think about what to do anymore.
Alexander: Delegating even the delegation itself, I think that’s very well put. To be clear, we’re not saying that one form is necessarily better than another. I really like using Codex CLI, and what we really want, as we discussed in my OpenAI interview, is to have both modes. But I think Codex’s role here is to really push the boundaries of that “get it done in one go” autonomous software engineer.
Yes, I somewhat see this research preview as a thought experiment for us. It’s like exploring: what would a coding agent look like in its purest form, where it best embodies AGI potential or scale effects? And then perhaps, for me personally, part of what excites me about working at OpenAI is not just solving problems for developers, but truly thinking about how AGI can benefit all of humanity, and what non-developers will feel about this. So, for me, what’s truly interesting is to view Codex as an experiment to see what it would feel like to work in other functional departments. The goal I’m striving for is a vision where we do the ambiguous, creative, or hard-to-automate work, but beyond that, we delegate most of the work to agents. And these agents, they’re not some distant future product, nor are they fleeting short-term tools, but ubiquitous and always by your side. So, we decided to start with the purest form, which we originally thought would be the smallest release scope, but it might not be the case. However, we will eventually integrate these different things.
Towards the AGI Super Assistant
Host: Okay, I think we still have time for a few questions. Let’s dive deeper into this research preview. Why is it a “research preview”? What are its shortcomings? What standards do you think it needs to meet to be considered a formal release? In the livestream, Greg mentioned a seamless transition between cloud and CLI. Is that the reason, or do you have other considerations?
Alexander: Honestly, part of why we believe so strongly in iterative deployment is that—I can share some of my thoughts now, but we’re also really curious what it will ultimately be like. Because it’s a completely new product form. But my current top concerns include multimodal input, which we’ve discussed before.
Host: I know you like that.
Alexander: Another example is giving it a little more access to the outside world. Many people are asking for various forms of network access. I also think that the user interface we’ve currently released has already been through a lot of iteration on our side. There’s an interesting story behind it, but overall, it’s an interface people find useful, though certainly not the final form, and we really hope it can integrate more closely into the tools developers use daily. These are some of the themes we’re thinking about. But to be clear, we will continue to iterate and figure things out.
Host: I’m worried about its pricing after it’s officially released, but I’ll make good use of this free period now. Why not?
Alexander: Regarding pricing, please also give us feedback.
Host: Is it too early to talk about pricing now?
Alexander: Yes, it’s still too early.
Host: Okay. Looking at the situation with Claude Code, this is indeed a concern for everyone. Claude Code has started introducing a mix of fixed and variable pricing, which I think is a mess. There’s no standard answer. Everyone just wants the cheapest coding power they can get. So, good luck to you.
Alexander: Thank you.
Josh: My view is that our goal is to deliver tremendous value, and it’s our responsibility to demonstrate that and truly make people realize: “Wow, this thing is doing economically valuable work for me.” I think many pricing questions can start from that point. But I think the conversation should start here: Are we really delivering that kind of value?
Host: That’s great. All right. Thank you both very much. Thank you for your efforts, and thank you for your time. We’ve waited a long time for this day, but I think everyone is starting to see that OpenAI as a whole is getting increasingly serious about agents. This isn’t just coding, but coding is clearly a self-accelerating closed loop, and I think you’re passionate about that too. It’s really inspiring to see.
Alexander: We are very excited to bring this coding agent to everyone, and eventually integrate it into a general AGI super assistant. Thank you for inviting us.
Original Podcast Link: https://open.spotify.com/episode/7soF0g9cHqxKaQWWJBtKRI