Workflow strategy · 06/04/2026

Codex Sites in Practice: What AI-Generated Internal Tools Actually Deliver

OpenAI Codex Sites creates internal tools from a prompt. That is impressive, but the preview phase also reveals the risks: AI-generated code without review, incorrect KPIs, and unresolved GDPR questions. This is why the leap from playground to production is still too great.

The Backlog Problem: Internal Tools That Never Get Built

Every company has it: the list of internal tools that are urgently needed but never reach priority. A vacation planner for the department, a dashboard for open orders, an approval tool for budget requests. The development team is tied up with customer projects, and a spreadsheet has to suffice.

OpenAI addressed this problem on June 2, 2026 with a new Codex plugin called Sites. The idea: describe an internal application in natural language, and the AI builds it, hosts it, and hands you a link. No deployment, no DevOps, no ticket in the backlog. With 5 million weekly Codex users, and non-developers growing faster than engineers, the demand for such tools is real.

What Codex Sites Delivers Technically

Codex Sites is a plugin for the Codex editor that turns a text description into a web application. The architecture: Cloudflare Worker-compatible ES modules running on OpenAI infrastructure, with a D1 database for persistent data and R2 object storage for files. The workflow separates saving from deploying: a saved version is not live, and only an explicit deploy makes it accessible to workspace members.
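
The separation of saving and deploying can be pictured as a small state model. The sketch below is a hypothetical illustration, not the Codex Sites API: the class `SiteVersions` and its methods `save`, `deploy`, and `live_version` are invented names. It only shows the rule described above, namely that saving creates a version but nothing becomes reachable until someone deploys it explicitly.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SiteVersions:
    """Hypothetical model of a site's version history (not the real Sites API)."""

    saved: list[str] = field(default_factory=list)
    live_index: Optional[int] = None  # None means nothing is deployed yet

    def save(self, source: str) -> int:
        """Store a new version and return its index. Saving never changes what is live."""
        self.saved.append(source)
        return len(self.saved) - 1

    def deploy(self, index: int) -> None:
        """Explicitly make one saved version accessible to workspace members."""
        if not 0 <= index < len(self.saved):
            raise IndexError(f"no saved version with index {index}")
        self.live_index = index

    def live_version(self) -> Optional[str]:
        """Return the deployed source, or None if no version has been deployed."""
        if self.live_index is None:
            return None
        return self.saved[self.live_index]


site = SiteVersions()
v0 = site.save("dashboard v1")
print(site.live_version())  # None: saved, but not live
site.deploy(v0)
site.save("dashboard v2")   # a later save does not replace the live version
print(site.live_version())  # dashboard v1
```

The point of the model is that a review step, if there is one, belongs between `save` and `deploy`.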

A sample prompt from OpenAI's documentation:

@Sites Build a project request dashboard. Let team members submit requests, see the status, and filter the list. Require workspace login and save data between sessions.

The demo looks convincing. But what happens when that dashboard processes real business numbers?

The Problem: AI-Generated Software Without Quality Control

Here lies the critical catch. Codex Sites generates code that nobody reviews. There are no automated tests, no static code analysis, no manual verification of business logic. The application appears to work at first glance, but nobody knows whether it calculates correctly.

A real-world example: a KPI dashboard that aggregates revenue figures. The prompt describes the logic in natural language, the AI interprets it, and generates JavaScript. But does it interpret "monthly revenue" as the sum of all transactions or the sum minus cancellations? Does it count partial deliveries separately or as a single line item? Every one of these implicit assumptions can be right or wrong, and the non-technical user who wrote the prompt cannot validate the generated code.
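
How far two readings of "monthly revenue" can drift apart is easy to show with a few lines of code. The following is a minimal sketch with made-up figures. The `Transaction` record, its field names, and both functions are assumptions for illustration, not output of Codex Sites. Both functions are defensible implementations of the same prompt.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction record; field names are assumptions."""

    month: str        # e.g. "2026-05"
    amount: Decimal   # invoiced amount
    cancelled: bool   # True if the order was later cancelled


def gross_revenue(transactions, month):
    """Interpretation A: sum of all transactions in the month."""
    return sum((t.amount for t in transactions if t.month == month), Decimal("0"))


def net_revenue(transactions, month):
    """Interpretation B: sum of all transactions minus cancelled ones."""
    return sum(
        (t.amount for t in transactions if t.month == month and not t.cancelled),
        Decimal("0"),
    )


# Illustrative figures only.
sample = [
    Transaction("2026-05", Decimal("1200.00"), cancelled=False),
    Transaction("2026-05", Decimal("800.00"), cancelled=True),
    Transaction("2026-05", Decimal("500.00"), cancelled=False),
    Transaction("2026-06", Decimal("300.00"), cancelled=False),
]

print(gross_revenue(sample, "2026-05"))  # 2500.00
print(net_revenue(sample, "2026-05"))    # 1700.00
```

With these sample figures, the same month shows 2500.00 or 1700.00 depending on an assumption that never appears in the prompt.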

The consequence: an executive makes a staffing decision based on an incorrectly calculated dashboard, and the error surfaces weeks later when quarterly figures do not reconcile.

Eight Use Cases, Eight Risk Profiles

Codex Sites' strength lies in internal applications. These use cases are named by OpenAI itself:

Use Case	Risk Profile
Project request dashboard	Low: Incorrect status display, no financial damage
Onboarding hub	Low: Wrong checklist is annoying but fixable
KPI dashboard	High: Wrong decisions based on incorrect metrics
Idea board	Low: No business-critical data
Launch calendar	Medium: Date errors can have external impact
Scenario planner	High: Incorrect model calculations lead to strategic misjudgments
Enablement library	Low: Outdated content is visible and correctable
Review room	Low: Decision documentation, no automated harm

The pattern is clear: as soon as numbers are aggregated, calculated, or used as a decision basis, the risk becomes critical. Precisely the applications that promise the greatest business value are the riskiest ones.
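
The pattern in the table can be written down as a simple triage rule. This is a sketch under our own assumptions: the `ToolProfile` fields and the `risk_level` function are hypothetical and encode only the reasoning of this article, not any classification by OpenAI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical description of an internal tool; all fields are assumptions."""

    name: str
    computes_numbers: bool   # aggregates or calculates figures
    drives_decisions: bool   # output is used as a basis for decisions
    external_impact: bool    # errors can become visible outside the company


def risk_level(tool: ToolProfile) -> str:
    """Return "high", "medium", or "low" following the pattern in the table."""
    if tool.computes_numbers or tool.drives_decisions:
        return "high"
    if tool.external_impact:
        return "medium"
    return "low"


tools = [
    ToolProfile("KPI dashboard", True, True, False),
    ToolProfile("Launch calendar", False, False, True),
    ToolProfile("Idea board", False, False, False),
]
for tool in tools:
    print(f"{tool.name}: {risk_level(tool)}")
```

A rule like this is deliberately coarse. Its only job is to decide which generated tools need a code review before they are deployed.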

Sensitive Data: The Second Unanswered Question

Codex Sites runs on OpenAI infrastructure. Access control is limited to three modes: admins_only, workspace_all, and custom. What is missing are the role-based permission structures that every mid-range ERP system offers.
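
The difference between site-level access modes and role-based permissions can be sketched as follows. Only the three mode names come from the text above. Everything else is an assumption for illustration: the function names, the reading of `custom` as an explicit user list, and all role and action names are hypothetical and do not describe the actual Sites implementation.

```python
from enum import Enum


class AccessMode(Enum):
    """The three modes named above; the enum itself is an illustration."""

    ADMINS_ONLY = "admins_only"
    WORKSPACE_ALL = "workspace_all"
    CUSTOM = "custom"


def can_open_site(mode, user, is_admin, allow_list=frozenset()):
    """Site-level check: a user either gets into the application or does not."""
    if mode is AccessMode.ADMINS_ONLY:
        return is_admin
    if mode is AccessMode.WORKSPACE_ALL:
        return True
    return user in allow_list  # CUSTOM: assumed here to be an explicit user list


# Role-based permissions, by contrast, decide per action inside the application.
# Role and action names are hypothetical.
ROLE_PERMISSIONS = {
    "requester": {"submit_request", "view_own_requests"},
    "approver": {"view_all_requests", "approve_request"},
    "controller": {"view_all_requests", "export_budget"},
}


def can_perform(role, action):
    """Action-level check of the kind a role model in an ERP system provides."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(can_open_site(AccessMode.WORKSPACE_ALL, "ana", is_admin=False))  # True
print(can_perform("requester", "approve_request"))                     # False
```

In this sketch, the first check only answers who may open the tool. The second answers who may do what inside it, which is the question a budget approval tool has to settle.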

For EU-based companies, the GDPR question arises immediately. A damage-reporting tool for property management processes personal data of tenants: names, addresses, damage descriptions. Where such processing is likely to result in a high risk to the people concerned, Art. 35 GDPR requires a Data Protection Impact Assessment, and even below that threshold the basic questions remain. On which infrastructure do the D1 databases run? In which data center? Under which data processing agreement? OpenAI's documentation for Codex Sites does not answer these questions with the depth a data protection officer requires.

The AI Makes Mistakes, and Nobody Notices

The most fundamental objection is also the simplest: large language models hallucinate. They invent API endpoints that do not exist. They calculate sums incorrectly. They interpret business logic creatively rather than correctly.

In the Codex context, this means: the AI generates an application that appears to work at first glance. The button is there, the table loads, the filters respond. But is the filter logic correct? Does the scenario planner sort by the right criteria? Does the KPI dashboard calculate the moving average over the correct time period? These questions are not answered by a prompt, only by a code review, and that is exactly what is missing from the Sites workflow.
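
What such a review looks like in its smallest form is a comparison against values calculated by hand. The sketch below assumes a trailing moving average and uses illustrative figures. The functions `moving_average`, `off_by_one`, and `check_against_reference` are hypothetical names, and the faulty variant is constructed for this example rather than taken from generated code.

```python
def moving_average(values, window):
    """Trailing moving average: mean of the last `window` values at each position.

    Positions with fewer than `window` values available yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


def off_by_one(values, window):
    """Plausible-looking variant that sums one value too many once history allows."""
    return [
        None if i + 1 < window else sum(values[max(0, i - window) : i + 1]) / window
        for i in range(len(values))
    ]


def check_against_reference(func):
    """Compare an implementation with hand-calculated values; return mismatches."""
    monthly = [10, 20, 30, 40, 50]              # illustrative figures
    expected = [None, None, 20.0, 30.0, 40.0]   # 3-month trailing average by hand
    actual = func(monthly, 3)
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


print(check_against_reference(moving_average))  # []
print(check_against_reference(off_by_one))      # mismatches at positions 3 and 4
```

Both variants return the same value for the first complete window, so a glance at the first row of the dashboard would not reveal the error. Only the comparison against reference values does.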

OpenAI itself signals caution: save before deploy, review pane, access only after inspection. But what exactly is a non-technical user supposed to review? The generated code? They cannot. The output? It looks correct, until it no longer is.

Conclusion: An Impressive Playground, Not Yet Production-Ready

Codex Sites is technically impressive. The idea of creating internal tools from a plain-language prompt is sound and will become established. But in the current preview phase, the tool is not suitable for professional use as soon as financial consequences, personal data, or strategic decisions are involved.

The development is reminiscent of the early days of no-code platforms: the demos look fantastic, the first prototypes generate enthusiasm, but the leap to production reveals gaps in security, correctness, and governance that are invisible in the demo video.

For centerbit, the lesson is clear: tools like Codex Sites are valuable as a prototyping environment and for non-critical internal tools. The moment an application touches business-critical processes, it needs what AI-generated code by definition does not bring: a traceable quality process, a tested codebase, and governance that catches errors before they lead to wrong decisions.

centerbit
