Real-time OpenAI response streaming with FastAPI

Learn how to integrate OpenAI into a FastAPI application using both traditional and real-time streaming approaches.

by Nik Tomazic

In this article, you'll learn how to integrate OpenAI into your FastAPI project. We cover two approaches: a synchronous one and an asynchronous streaming one.

Objectives

By the end of this tutorial, you'll be able to:

  1. Understand what FastAPI is and its advantages
  2. Build a simple RESTful API using FastAPI
  3. Use OpenAI's Responses API to generate text
  4. Stream OpenAI responses using Server-Sent Events (SSE)
  5. Consume the streamed response in your frontend

What is FastAPI?

FastAPI is a high-performance web framework for building APIs in Python. It is based on standard Python type hints and Pydantic, which enable automatic data validation and serialization.

Thanks to its simplicity and elegance, the framework is easy to learn and quick to develop with. What's more, it helps reduce the number of developer-induced errors.

FastAPI's main features include:

  • Type safety (utilizing Python type hints and Pydantic)
  • Automatic docs generation (via SwaggerUI and ReDoc)
  • Asynchronous support (based on the asyncio library)
  • WebSocket support (via the websockets package)

Altogether, these features make FastAPI a great choice for building AI/ML-based APIs.
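
To make the type-safety point concrete, here's a minimal, self-contained sketch; the Prompt model and /echo route are purely illustrative and not part of the project we build below:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Prompt(BaseModel):
  message: str
  max_tokens: int = 256


@app.post("/echo")
def echo_view(prompt: Prompt):
  # FastAPI validates the JSON body against the Prompt model and
  # rejects invalid payloads with a 422 response automatically.
  return {"message": prompt.message, "max_tokens": prompt.max_tokens}

The same type hints also feed the automatically generated docs at /docs.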

Project introduction

Throughout the article, we build a web API that takes a user query, passes it to an OpenAI model, and returns the model's response. In other words, we create a simple OpenAI wrapper in FastAPI.

We demonstrate this with two approaches:

  1. Synchronous (waiting for the entire model response before returning it)
  2. Asynchronous (streaming the model response chunk by chunk in real time)

Let's get started!

FastAPI setup

First, create a directory for the project and navigate to it:

$ mkdir fastapi-openai
$ cd fastapi-openai

Create a virtual environment and activate it:

$ python3 -m venv venv
$ source venv/bin/activate

Next, install the FastAPI package:

(venv) $ pip install fastapi[standard]==0.121.3

We're using the [standard] extra because it comes with additional packages, such as email-validator, jinja2, uvicorn, and so on.

Create a main.py file with the following contents:

import logging
from fastapi import FastAPI

logging.basicConfig(level=logging.INFO)

app = FastAPI()


@app.get("/")
def index_view():
  return {"detail": "hello world"}

Lastly, run the development server:

(venv) $ fastapi dev main.py

# FastAPI   Starting development server 🚀
#    app   Using import string: main:app
#
#  server   Server started at http://127.0.0.1:8000
#  server   Documentation at http://127.0.0.1:8000/docs

Your web app should now be accessible at http://localhost:8000. Navigate to it in your favorite web browser and ensure you get the {"detail": "hello world"} message.

OpenAI API

To work with the OpenAI API, we use the official OpenAI Python library. The library provides typed request and response models, so that we don't need to handle raw HTTP calls or manually parse JSON.

This section requires an OpenAI account with some balance.

API key

Before using the API, we must generate an API key. To do this, navigate to the OpenAI platform, select "Settings" from the navbar, and then "API keys" from the sidebar.

Next, create a new secret key with the following details:

  • Name: fastapi-openai
  • Project: Default project
  • Permissions: All

Once the key is created, please take note of it.

Environment variables

To avoid exposing the API key in the source code, we'll use environment variables.

Firstly, add the following two lines at the top of main.py:

# main.py

from dotenv import load_dotenv
load_dotenv()

These two lines load the environment variables from the .env file. The load_dotenv() function is provided by the python-dotenv package (which comes bundled with the fastapi[standard] install).

Then create a .env file in the project root:

# .env

OPENAI_API_KEY=sk-proj-f64609172efea86a5a6fbae12ab86d33_f64609172efea86a5

Make sure to replace the API key with the key from the previous step.

Great, you can now access the OpenAI API key like this:

os.environ.get("OPENAI_API_KEY")
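
If you want the app to fail fast when the variable is missing, you can add a small optional guard near the top of main.py; this sketch isn't required for the rest of the tutorial:

# main.py (optional)

import os

if not os.environ.get("OPENAI_API_KEY"):
  raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")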

OpenAI client

Moving along, let's install the OpenAI Python library:

(venv) $ pip install openai==2.8.1

Then, initialize the client at the top of main.py like so:

openai_client = AsyncOpenAI(
  api_key=os.environ.get("OPENAI_API_KEY"),
)

Don't forget about the imports:

import os
from openai import AsyncOpenAI

We're using AsyncOpenAI instead of OpenAI because it supports Python's native async/await syntax. This allows us to add streaming support later on.

Complete response

With this approach, we wait for the model's entire response before sending it to the client.

The benefit of this approach is that it allows us to perform response post-processing, but the downside is that the user has to wait quite a long time for the response.

Go ahead and add the following code to main.py:

# main.py

from fastapi import HTTPException

# ...

@app.get("/chat-complete")
async def chat_complete_view(message: str):
  try:
    response = await openai_client.responses.create(
      model="gpt-4o-mini",
      input=message,
    )
    return {
      "data": response.output_text,
    }
  except Exception as e:
    logging.error(f"Error: {str(e)}")
    raise HTTPException(
      detail=str(e),
      status_code=500,
    )

In this code, we created a new endpoint at /chat-complete that accepts a message query parameter. Within the endpoint, we then pass the query to OpenAI's gpt-4o-mini model to generate a response. OpenAI only sends a response once it has been generated entirely.

Feel free to swap gpt-4o-mini for another model. Check out the model comparison.

To test the endpoint, run the following cURL command:

curl "http://localhost:8000/chat-complete?message=tell%20me%20a%20joke"

# {
#   "data": "Why don't skeletons fight each other? Because they don't have the guts!"
# }
#
# took: 2.279 seconds

We used %20 instead of spaces to properly encode the URL.
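
If you prefer testing from Python, httpx (installed as part of the [standard] extra) handles the URL encoding for you via its params argument. A small sketch, just as an alternative to the cURL call above:

import httpx

response = httpx.get(
  "http://localhost:8000/chat-complete",
  params={"message": "tell me a joke"},
  timeout=30,
)
print(response.json()["data"])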

Streaming response

With this approach, we stream the model's response in real time as it's being generated.

The benefit of this approach is an almost instant response for the user, while the downside is that we can't do proper response post-processing, since we don't know what token the LLM will generate next.

In the OpenAI API, streaming is implemented using Server-Sent Events (SSE).

SSE is a server push technology that enables clients to receive automatic updates from a server via an HTTP connection. The data is typically transferred as UTF-8-encoded text.

Each server-sent event has the following structure:

event: <name of the event>  # can be omitted when sending data
data: <data>
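
In Python, a server-sent event is just a specially formatted string. Here's a tiny hypothetical helper showing how such a message could be serialized; the endpoint below writes the strings inline instead:

def sse_format(data: str, event: str = "") -> str:
  # An SSE message is one or more "field: value" lines followed by a blank line.
  lines = []
  if event:
    lines.append(f"event: {event}")
  lines.append(f"data: {data}")
  return "\n".join(lines) + "\n\n"


# sse_format("[START]", event="start")  ->  "event: start\ndata: [START]\n\n"
# sse_format("Why ")                    ->  "data: Why \n\n"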

To enable streaming, add the following endpoint:

# main.py

@app.get("/chat-stream")
async def chat_stream_view(message: str):
  try:
    response = await openai_client.responses.create(
      model="gpt-4o-mini",
      input=message,
      stream=True,
    )

    async def async_generator():
      yield "event: start\ndata: [START]\n\n"

      async for event in response:
        if event.type == "response.output_text.delta":
          text = event.delta
          yield f"data: {text}\n\n"
        if event.type == "response.completed":
          total_tokens = event.response.usage.total_tokens
          logging.info(f"Used tokens: {total_tokens}")

      yield "event: end\ndata: [END]\n\n"

    return StreamingResponse(
      async_generator(),
      media_type="text/event-stream",
      headers={
        "Cache-Control": "no-cache",
        "X-Accel-Buffering": "no",
      },
    )
  except Exception as e:
    logging.error(f"Error: {str(e)}")
    raise HTTPException(
      detail=str(e),
      status_code=500,
    )

Don't forget about the import (HTTPException is already imported from the previous step):

from starlette.responses import StreamingResponse

The FastAPI part is analogous to the previous code snippet. We again defined an endpoint and sent a request to the OpenAI API. The only difference is that we're now using stream=True.

When stream is enabled, OpenAI begins sending server-sent events as soon as the response starts to be generated. In other words, OpenAI no longer waits for the entire response before sending it over.

Here's an example of OpenAI server-sent events:

ResponseTextDeltaEvent(i=0, delta='Why ',  type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='did ',  type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='the ',  type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='scare', type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='crow ', type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='win ',  type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='an ',   type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='award', type='response.output_text.delta', ...)
ResponseTextDeltaEvent(i=0, delta='? ',    type='response.output_text.delta', ...)

Check out the OpenAI Responses API reference to see the exact event structure.

As you can see, we can form the complete response by aggregating the deltas of the events of type response.output_text.delta.

In our example, that would be:

Why did the scarecrow win an award?
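
If you ever need the full text on the server side (for logging, for instance), aggregating the deltas is just a few lines. A minimal sketch of the same loop, shown here for illustration only, since in the endpoint above the stream is already consumed by the generator:

full_text = ""

async for event in response:
  if event.type == "response.output_text.delta":
    full_text += event.delta

# full_text == "Why did the scarecrow win an award? ..."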

We could forward OpenAI SSE events directly to the client, but instead, we implemented a simplified SSE system. The implementation is defined with async_generator().

All the generator does is first send a [START] event, then pass the data chunk by chunk, and finish with an [END] event.

Lastly, we stream the generator's yields using the StreamingResponse class.

To test the endpoint, use the following cURL request (the -N flag disables curl's output buffering, so the chunks appear as they arrive):

curl -N "http://localhost:8000/chat-stream?message=tell%20me%20a%20joke"

# took: 0.1s
#
# event: start
# data: [START]
#
# data: Why
#
# data:  did
#
# data:  the
#
# data:  scare
#
# data: crow
#
# data:  win
#
# data:  an
#
# data:  award
#
# data: ?
#
# event: end
# data: [END]
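
You can also consume the stream from Python. A small sketch using httpx's streaming API (httpx, again, comes with the [standard] extra):

import httpx

with httpx.stream(
  "GET",
  "http://localhost:8000/chat-stream",
  params={"message": "tell me a joke"},
  timeout=30,
) as response:
  for line in response.iter_lines():
    # Keep only the data lines, skipping event names and the [START]/[END] markers
    if line.startswith("data: "):
      chunk = line[len("data: "):]
      if chunk not in ("[START]", "[END]"):
        print(chunk, end="", flush=True)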

Frontend

In this section, we look at how to consume the API we just implemented. For the frontend, we'll use Jinja2, a fast, expressive, and extensible templating engine.

First, create a templates folder in the project root, and within the folder, create index.html.

After the changes, your project structure should look like this:

fastapi-openai/
├── templates/
│   └── index.html
├── main.py
└── .env

Put the following code in the index.html file:

<!-- templates/index.html -->

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>sevalla-fastapi-openai</title>
    <link
      href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css"
      rel="stylesheet"
      crossorigin="anonymous"
    />
    <link
      href="https://cdn.jsdelivr.net/npm/[email protected]/font/bootstrap-icons.css"
      rel="stylesheet"
    />
    <script
      src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js"
      crossorigin="anonymous"
    ></script>
  </head>
  <body>
    <div class="container">
      <h3 class="mt-3">
        <a href="https://github.com/duplxey/sevalla-fastapi-openai">
          sevalla-fastapi-openai
        </a>
      </h3>
      <div class="mb-3 mt-5">
        <h5 id="input-label" class="text-center">
          Hola 👋, what is your prompt?
        </h5>
        <div class="my-3 gap-2">
          <form id="form" class="d-flex gap-2" autocomplete="off">
            <input
              id="input"
              class="form-control flex-grow-1"
              aria-label="input-label"
              autocomplete="off"
              placeholder="Type your prompt..."
            />
            <button id="submitButton" type="submit" class="btn btn-primary">
              <i class="bi bi-send-fill"></i>
            </button>
          </form>
        </div>
      </div>
      <div>
        <label for="output-input" class="form-label">Streamed response:</label>
        <textarea
          id="output-input"
          class="form-control"
          aria-label="output-label"
          rows="10"
          autocomplete="off"
        ></textarea>
      </div>
    </div>
    <script>
      // TODO: add the script here
    </script>
  </body>
</html>

This code defines a simple HTML5 template that renders a form with an input field, a submit button, and an output area.

To serve the template, we have to define templates in our main.py like so:

# main.py

from fastapi import Request
from starlette.templating import Jinja2Templates

# ...

templates = Jinja2Templates(directory="templates")

Then modify index_view() to render the template:

# main.py

@app.get("/")
def index_view(request: Request):
  return templates.TemplateResponse(
    "index.html",
    {
        "request": request,
    },
  )

Finally, add the following script into the template's <script></script> section:

// templates/index.html

const form = document.getElementById("form")
const input = document.getElementById("input")
const submitButton = document.getElementById("submitButton")
const output = document.getElementById("output-input")

form.addEventListener("submit", (e) => {
  e.preventDefault()

  // Ensure the input is not empty
  if (!input.value) {
    return
  }

  input.disabled = true
  submitButton.disabled = true
  output.value = ""

  // Initialize an SSE connection to receive the streamed response
  const urlEncodedMessage = encodeURIComponent(input.value)
  const eventSource = new EventSource(
    `/chat-stream?message=${urlEncodedMessage}`,
  )

  // Handle SSE events (start, <unnamed>, end, error)
  eventSource.addEventListener("start", () => {
    console.log("Stream started...")
  })

  eventSource.onmessage = (event) => {
    console.log(event.data)
    output.value += event.data
  }

  eventSource.addEventListener("end", () => {
    console.log("Stream complete.")
    eventSource.close()
  })

  eventSource.onerror = (err) => {
    console.error("Oops, something went wrong with SSE...")
    console.error(err)
  }
})

The script first grabs references to the HTML elements (form, input, button, and output). After that, it attaches a submit event listener to the form. On submission, it opens an SSE connection and appends the streamed data to the output's value in real time.

That's it, good job!

Lastly, navigate to http://localhost:8000 and try running a query.

Conclusion

In this article, you've seen how to integrate OpenAI into a FastAPI project using both synchronous and asynchronous approaches. With these patterns in hand, you're ready to create your own OpenAI-powered utilities and services.

Now that the application is complete, it's a great moment to deploy it to Sevalla. At the time of writing, new users receive $50 in free credits.

The final source code is available on GitHub.
