OpenAI Assistants File Search - How Vector Stores Work in Practice

March 9, 2026 • 7 min read • Michael Ridland

OpenAI's Assistants API has a file search tool that lets you upload documents and ask questions against them. On paper, it sounds like every other RAG implementation. In practice, it's one of the faster ways to get document Q&A working without building your own vector database infrastructure. But there are trade-offs, and after building several production systems that use it, I want to share what actually matters when you're deciding whether to use this or roll your own.

What File Search Does

The Assistants API file search tool works through vector stores. You upload files (PDFs, text files, and other document formats), OpenAI chunks and embeds them into a vector store, and then when your assistant gets a question, it automatically searches the relevant vectors to find context before generating a response.

The workflow looks like this:

1. Create an assistant with the file_search tool enabled
2. Create a vector store and upload your files to it
3. Attach the vector store to your assistant
4. Create a thread with a user message and run it

The assistant automatically searches the vector store when it thinks the answer might be in your documents. You don't need to manually trigger the search or write retrieval logic - the assistant handles that decision.

OpenAI's Assistants file search documentation has the full API reference, but I'll walk through what matters practically.

Setting It Up - The Code Is Surprisingly Simple

Here's the Python version of getting a basic file search assistant running:

from contextlib import ExitStack

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create the assistant
assistant = client.beta.assistants.create(
    name="Financial Analyst Assistant",
    instructions="You are an expert financial analyst. Use your knowledge base to answer questions about audited financial statements.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
)

# Create a vector store and upload files
# (on older SDK versions this lives under client.beta.vector_stores)
vector_store = client.vector_stores.create(name="Financial Statements")

file_paths = ["edgar/goog-10k.pdf", "edgar/brka-10k.txt"]

# Open the files inside a context manager so the handles are closed after upload
with ExitStack() as stack:
    file_streams = [stack.enter_context(open(path, "rb")) for path in file_paths]
    file_batch = client.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=file_streams
    )

# Attach the vector store to the assistant
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

That's it for the setup. A few lines of code and you've got a document Q&A system. The upload_and_poll helper is nice - it handles the asynchronous upload and waits until processing is complete before returning.
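Before attaching the store, it can be worth confirming that every file actually made it through processing. Here is a minimal sketch, assuming the batch object returned by upload_and_poll exposes a status string and a file_counts object with completed, failed, and total fields. check_batch is a hypothetical helper name, not part of the SDK:

```python
def check_batch(file_batch):
    """Raise if any file in the batch did not finish processing.

    Assumes file_batch has .status and .file_counts with
    .completed, .failed and .total attributes.
    """
    counts = file_batch.file_counts
    if file_batch.status != "completed" or counts.failed > 0:
        raise RuntimeError(
            f"batch status={file_batch.status}: "
            f"{counts.completed}/{counts.total} files processed, "
            f"{counts.failed} failed"
        )
    return counts.completed
```

Calling check_batch(file_batch) right after the upload turns a silently skipped PDF into an error you see at setup time, rather than a missing answer you notice weeks later.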

Two Ways to Attach Files

There's an architectural decision here that's worth thinking about. You can attach files at the assistant level (via vector stores) or at the thread level (via message attachments).

Assistant-level vector stores are shared across all conversations. Think of these as your knowledge base - company documents, product manuals, policy documents. Every thread that uses this assistant can search these files.

Thread-level attachments are conversation-specific. When a user uploads a document as part of a message, it gets added to a thread-specific vector store. This is useful for scenarios like "analyse this specific report" where the document is unique to that conversation.

# Thread-level file attachment
# Upload the file, closing the local handle once the upload finishes
with open("edgar/aapl-10k.pdf", "rb") as f:
    message_file = client.files.create(file=f, purpose="assistants")

thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "How many shares of AAPL were outstanding at the end of October 2023?",
            "attachments": [
                {"file_id": message_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)

In practice, most production systems we build use both. The assistant has a base knowledge store, and users can upload additional documents per conversation when needed.
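When users can upload several documents in one message, building the attachment payload by hand gets repetitive. The sketch below wraps it in a small function. user_message_with_files is a hypothetical helper, and it only builds the dict shown in the thread example above - it makes no API calls:

```python
def user_message_with_files(question, file_ids):
    """Build a thread message dict that attaches each file for file_search."""
    return {
        "role": "user",
        "content": question,
        "attachments": [
            {"file_id": file_id, "tools": [{"type": "file_search"}]}
            for file_id in file_ids
        ],
    }
```

You would pass the result in the messages list of client.beta.threads.create. The assistant-level vector store stays attached as before, so the run can search both the shared knowledge base and the per-conversation uploads.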

Streaming vs Polling - Pick Based on Your UX

You can get responses two ways: streaming (tokens arrive as they're generated) or polling (wait for the complete response). The streaming approach is better for chat interfaces where you want that real-time typing effect:

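from openai import AssistantEventHandler

class EventHandler(AssistantEventHandler):
    # Print each text fragment as soon as it arrives
    def on_text_delta(self, delta, snapshot):
        print(delta.value, end="", flush=True)

with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandler(),
) as stream:
    stream.until_done()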

Polling is simpler and works better for batch processing or API backends where you just need the final answer:

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
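A polled run can end in states other than success, so for a backend it helps to check the status before reading the reply. This is a rough sketch rather than a definitive pattern: run_and_get_answer is a hypothetical helper, it takes the client as a parameter, and it assumes the first content block of the reply is text:

```python
def run_and_get_answer(client, thread_id, assistant_id):
    """Run the assistant on a thread and return the text of its reply."""
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread_id, assistant_id=assistant_id
    )
    if run.status != "completed":
        raise RuntimeError(f"run {run.id} ended with status {run.status!r}")
    # Only the messages produced by this run, newest first
    messages = list(
        client.beta.threads.messages.list(thread_id=thread_id, run_id=run.id)
    )
    # Assumption: the reply's first content block is a text block
    return messages[0].content[0].text.value
```

Usage would be answer = run_and_get_answer(client, thread.id, assistant.id).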

For the AI solutions we build, streaming is almost always the right choice for user-facing applications. Nobody likes staring at a loading spinner for 15 seconds.

Citations - The Killer Feature

Here's what separates Assistants file search from a basic RAG setup: built-in citations. When the assistant references a document, it includes annotations in the response that tell you exactly which file the information came from.

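# Fetch the messages produced by the completed run (newest first);
# run is the object returned by create_and_poll
messages = list(
    client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
)

message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    # Swap the inline citation marker for a footnote number
    message_content.value = message_content.value.replace(
        annotation.text, f"[{index}]"
    )
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")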

This matters a lot for business applications. When someone asks a question about a financial statement or a policy document, they need to know where the answer came from. "Trust but verify" is how most professionals work with AI-generated answers, and citations make verification possible.

We've built document Q&A systems for professional services firms where the citation trail is as important as the answer itself. Without citations, the system is a toy. With them, it's a tool people actually rely on.
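One practical wrinkle with a citation loop like the one shown earlier: a long answer often cites the same file several times, which means one file lookup per annotation and a footnote list full of duplicates. The sketch below numbers footnotes per file instead of per annotation. build_citations and lookup_filename are hypothetical names, and the annotation objects are assumed to have the same text and file_citation.file_id attributes used above:

```python
def build_citations(text, annotations, lookup_filename):
    """Replace citation markers with [n] and list each cited file once.

    lookup_filename is any callable mapping a file ID to a display name.
    Annotations without a file_citation are left untouched in the text.
    """
    numbers = {}    # file_id -> footnote number
    footnotes = []
    for annotation in annotations:
        file_citation = getattr(annotation, "file_citation", None)
        if file_citation is None:
            continue
        file_id = file_citation.file_id
        if file_id not in numbers:
            # First time we see this file: look it up once and number it
            numbers[file_id] = len(numbers) + 1
            footnotes.append(f"[{numbers[file_id]}] {lookup_filename(file_id)}")
        text = text.replace(annotation.text, f"[{numbers[file_id]}]")
    return text, footnotes
```

With the real client you could pass lookup_filename=lambda file_id: client.files.retrieve(file_id).filename, so each distinct file is retrieved only once per answer.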

What Works Well

Speed to production. If you need document Q&A working this week, Assistants file search gets you there faster than building a custom RAG pipeline with your own embedding model, vector database, chunking strategy, and retrieval logic.

File format handling. OpenAI handles parsing PDFs, which is non-trivial if you've ever tried to extract text from scanned documents or complex layouts. Their parser isn't perfect, but it's good enough for most business documents.

Automatic retrieval decisions. The assistant decides when to search and when to answer from its training data. This sounds small, but getting retrieval triggering right in a custom system takes more tuning than people expect.

What to Watch Out For

Cost at scale. Vector store storage and search queries add up. If you're processing thousands of documents and running hundreds of queries per day, model the costs before committing. For smaller workloads (a few hundred documents, dozens of queries daily), the costs are reasonable.
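Modelling the costs does not need to be elaborate. Here is a back-of-the-envelope sketch with two components: storage billed per GB per day, and the extra input tokens that retrieved chunks add to each query. Every price and volume is a parameter you supply, because rates change; nothing here is OpenAI's actual pricing, and estimate_monthly_cost is a made-up name:

```python
def estimate_monthly_cost(
    stored_gb,
    queries_per_day,
    *,
    storage_price_per_gb_day,   # assumption: look up the current rate
    free_gb,                    # assumption: any free storage allowance
    avg_tokens_per_query,       # prompt + retrieved chunks + answer
    price_per_million_tokens,   # assumption: blended model token price
    days=30,
):
    """Rough monthly cost split into storage and inference."""
    billable_gb = max(0.0, stored_gb - free_gb)
    storage = billable_gb * storage_price_per_gb_day * days
    tokens = queries_per_day * days * avg_tokens_per_query
    inference = tokens / 1_000_000 * price_per_million_tokens
    return {"storage": storage, "inference": inference, "total": storage + inference}
```

Plug in your own document volume and the prices from the current pricing page, then rerun it with ten times the query volume. If the total still looks fine, cost is unlikely to be what pushes you to a custom pipeline.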

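Control over chunking. You get only coarse control over how OpenAI chunks your documents: you can set a maximum chunk size and an overlap (both in tokens) when adding files to a vector store, but there's no option for structure-aware splitting. For most documents, the default chunking works fine. But if you have highly structured documents (like forms, tables, or technical specifications), the automatic chunking might split things in ways that hurt retrieval quality.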
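To see how fixed-size chunking can split a structured document badly, here is a toy sliding-window chunker. It illustrates the general technique only and is not OpenAI's implementation: whitespace-separated words stand in for tokens, and the function name and sizes are invented for the example:

```python
def chunk_with_overlap(tokens, max_chunk_size, overlap):
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# A one-row table flattened to text: header cells, then the values
row = "Item Q1 Q2 Q3 Q4 Widgets 10 12 15 11".split()
for chunk in chunk_with_overlap(row, max_chunk_size=6, overlap=2):
    print(chunk)
```

With these settings the second chunk holds all four numbers but only the Q4 header, so a retriever that returns it alone cannot tell which quarter each value belongs to. That is the failure mode to test for with your own tables and forms.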

Vendor lock-in. Your vector stores live in OpenAI's infrastructure. If you later decide to switch to a different LLM provider or want to use your own embedding model, you're rebuilding the document processing pipeline from scratch. For prototype and MVP stages, this trade-off is usually acceptable. For long-term production systems, think carefully.

Latency. File search adds latency to every response where the assistant decides to search. It's not dramatic - usually a few seconds - but it's noticeable compared to a plain chat response. Make sure your UX accounts for this.

When to Use This vs. Building Your Own RAG

This is the question we get most often from clients. My rule of thumb:

Use Assistants file search when:

You're prototyping or building an MVP
Your document collection is under a few hundred files
You want to minimise infrastructure management
Standard chunking and retrieval work for your documents
You're already committed to the OpenAI ecosystem

Build your own RAG pipeline when:

You need control over the embedding model and chunking strategy
You're working with specialised document types that need custom parsing
You need to run on your own infrastructure (for compliance or cost reasons)
You want to use multiple LLM providers
Your document collection is very large or changes frequently

Many of our AI development projects start with Assistants file search for the proof of concept, then graduate to a custom pipeline once the use case is validated and the specific requirements are clearer. Starting simple and adding complexity only when you have evidence it's needed is almost always the right approach.

Getting Started

If you're looking at building document Q&A into your products or internal tools, we can help you evaluate whether Assistants file search fits your needs or whether a custom approach makes more sense. We work across the AI agent development space and have experience with both OpenAI's tools and custom RAG implementations on Azure.

Get in touch if you want to talk through your specific situation. The right architecture depends on your documents, your scale, and your longer-term platform strategy - and those are conversations worth having before you start building.