Build RAG Without a Vector Database

ScaleDown Team • February 2025 • 12 min read

Most RAG tutorials make you spin up Pinecone, Weaviate, or Chroma before you can ask a single question. What if you could skip the vector database entirely and still get accurate, grounded answers from your documents? ScaleDown's compression API is query-aware: it takes both your document context and the user's question, then keeps only the parts that actually matter. Retrieval and compression in a single API call. No embeddings, no vector store, no infra.
This guide walks you through building a customer support bot that answers questions from a product knowledge base. All you need is ScaleDown and an LLM.

How It Works

A typical RAG pipeline has you embed documents, store them in a vector database, retrieve relevant chunks, and then send them to an LLM. That's a lot of moving parts. ScaleDown replaces the embed, store, and retrieve steps with a single compression call. The API accepts both context and prompt, so it knows what the user is asking and throws away everything else.
ScaleDown’s context + prompt design means compression is query-aware. It doesn’t just shrink text. It keeps what’s relevant to the question and removes what isn’t. Retrieval and compression in one step.
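To make the contrast concrete, here are the two pipelines sketched as illustrative pseudocode (the function names on the traditional side are placeholders, not any particular library's API):

# Traditional RAG: four moving parts
#   chunks  = split(documents)
#   vectors = embed(chunks)                  # embedding model call
#   index.upsert(vectors)                    # vector database write
#   hits    = index.query(embed(question))   # vector database read
#   answer  = llm(hits, question)
#
# ScaleDown RAG: one call replaces embed / store / retrieve
#   compressed = scaledown(context=documents, prompt=question)
#   answer     = llm(compressed, question)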

The Use Case: Customer Support Bot

Say you run a SaaS product with a knowledge base of support articles. Customers ask questions like:

- Password help: "How do I reset my password?"
- Pricing: "What's included in the Pro plan?"
- Data export: "Can I export my data as CSV?"

Your bot needs to find the right answer from the knowledge base without making things up. Here's how to build it.

Build It Step by Step

1. Set up your API keys

You need a ScaleDown API key and an OpenAI API key (or a key from any other LLM provider).
import requests
from openai import OpenAI

SCALEDOWN_URL = "https://api.scaledown.xyz/compress/raw/"
SCALEDOWN_HEADERS = {
    "x-api-key": "YOUR_SCALEDOWN_API_KEY",
    "Content-Type": "application/json"
}

openai_client = OpenAI()  # Uses OPENAI_API_KEY env var
Don’t have a ScaleDown API key yet? Contact our sales team to get one.
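A quick hygiene note: rather than hardcoding the key, you can read it from the environment, the same way the OpenAI client picks up OPENAI_API_KEY. A minimal sketch (the SCALEDOWN_API_KEY variable name is our convention, not an official one):

import os

SCALEDOWN_HEADERS = {
    "x-api-key": os.environ["SCALEDOWN_API_KEY"],  # keeps the secret out of source control
    "Content-Type": "application/json"
}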
2. Load your knowledge base

In production, this comes from your docs, CMS, or help center. Here’s a sample with 6 support articles:
knowledge_base = """
## Password Reset
To reset your password, go to Settings > Security > Reset Password.
Click "Send Reset Link" and check your email. The link expires in 24 hours.
If you don't receive the email, check your spam folder or contact support@example.com.
Two-factor authentication users must verify via their authenticator app after resetting.

## Pricing Plans
We offer three plans:
- Starter ($9/mo): 1 user, 5GB storage, email support, basic analytics.
- Pro ($29/mo): 5 users, 50GB storage, priority support, advanced analytics,
  API access, and custom integrations.
- Enterprise ($99/mo): Unlimited users, 500GB storage, dedicated account manager,
  SSO, audit logs, SLA guarantee, and custom onboarding.
All plans include a 14-day free trial. Annual billing saves 20%.

## Data Export
You can export your data in CSV, JSON, or PDF format.
Go to Settings > Data > Export. Select the date range and format.
Exports under 10,000 rows are instant. Larger exports are queued and
you'll receive an email when the file is ready for download.
Export history is retained for 30 days.

## API Rate Limits
The API allows 100 requests per minute on Starter, 500 on Pro,
and 2000 on Enterprise. Rate limit headers are included in every response.
If you exceed the limit, you'll receive a 429 status code.
Implement exponential backoff for best results.

## Account Deletion
To delete your account, go to Settings > Account > Delete Account.
This action is irreversible. All data will be permanently removed within 72 hours.
You must cancel any active subscription before deleting your account.
An email confirmation is required to proceed with deletion.

## Integrations
We support integrations with Slack, Jira, GitHub, Zapier, and Salesforce.
Go to Settings > Integrations to connect your tools.
Pro and Enterprise plans support custom webhooks and OAuth2 applications.
Starter plans are limited to 2 active integrations.
"""
That’s roughly 350 tokens across 6 topics. In production your knowledge base could be thousands of articles. ScaleDown handles compression the same way.
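If your articles live on disk rather than in one string, a minimal loader might look like this (the docs/ directory of markdown files is an assumed layout, for illustration only):

from pathlib import Path

def load_knowledge_base(docs_dir: str = "docs") -> str:
    """Concatenate every markdown article into a single context string."""
    articles = sorted(Path(docs_dir).glob("*.md"))
    return "\n\n".join(p.read_text(encoding="utf-8") for p in articles)

knowledge_base = load_knowledge_base()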
3. Compress the context with ScaleDown

This is the key step. Send your entire knowledge base as context and the user’s question as prompt. ScaleDown compresses the context while keeping what’s relevant to the question.
user_question = "What's included in the Pro plan and how much does it cost?"

# ScaleDown compresses the context, keeping what's relevant to the question
response = requests.post(
    SCALEDOWN_URL,
    headers=SCALEDOWN_HEADERS,
    json={
        "context": knowledge_base,
        "prompt": user_question,
        "model": "gpt-4o",
        "scaledown": {"rate": "auto"}
    }
)

result = response.json()
compressed_context = result["compressed_prompt"]

print(f"Original tokens:   {result['original_prompt_tokens']}")
print(f"Compressed tokens: {result['compressed_prompt_tokens']}")
# Original tokens:   350
# Compressed tokens:  85
The knowledge base has 6 articles: password reset, pricing, exports, rate limits, account deletion, and integrations. The user asked about the Pro plan. ScaleDown looked at both the context and the prompt, figured out that only the pricing section (and maybe integrations, since Pro unlocks features there) is relevant, and compressed everything else away. The result: roughly 75% fewer tokens, with the pricing details fully intact.
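In production you'll also want to guard this call. Here's a cautious sketch that falls back to the uncompressed knowledge base if compression fails for any reason (the fail-open policy is our suggestion, not part of the API):

def compress_or_fallback(context: str, question: str) -> str:
    """Return compressed context, or the original context if compression fails."""
    try:
        response = requests.post(
            SCALEDOWN_URL,
            headers=SCALEDOWN_HEADERS,
            json={
                "context": context,
                "prompt": question,
                "model": "gpt-4o",
                "scaledown": {"rate": "auto"}
            },
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["compressed_prompt"]
    except (requests.RequestException, KeyError):
        # Fail open: the answer still grounds in the full context, just at full cost
        return context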
4. Send the compressed context to your LLM

Pass the compressed context to your LLM. Since ScaleDown already removed the noise, the model gets focused, relevant context. That means faster responses, lower cost, and fewer hallucinations.
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful customer support assistant. "
                "Answer using ONLY the provided context. "
                "If the context doesn't contain the answer, say so."
            )
        },
        {
            "role": "user",
            "content": f"Context:\n{compressed_context}\n\nQuestion: {user_question}"
        }
    ],
    temperature=0.2
)

print(response.choices[0].message.content)
Example output:
The Pro plan costs $29/month and includes 5 users, 50GB storage, priority support, advanced analytics, API access, and custom integrations. A 14-day free trial is available, and annual billing saves 20%.
The answer is grounded entirely in the knowledge base. No hallucinated features, no made-up prices.

Putting It All Together

Here’s the complete, copy-paste-ready version wrapped in a reusable function:
import requests
from openai import OpenAI

SCALEDOWN_URL = "https://api.scaledown.xyz/compress/raw/"
SCALEDOWN_HEADERS = {
    "x-api-key": "YOUR_SCALEDOWN_API_KEY",
    "Content-Type": "application/json"
}
openai_client = OpenAI()


def ask(knowledge_base: str, question: str) -> str:
    """Answer a question using a knowledge base, powered by ScaleDown."""

    # Step 1: Compress. ScaleDown keeps what's relevant to the question.
    compress_response = requests.post(
        SCALEDOWN_URL,
        headers=SCALEDOWN_HEADERS,
        json={
            "context": knowledge_base,
            "prompt": question,
            "model": "gpt-4o",
            "scaledown": {"rate": "auto"}
        }
    )
    result = compress_response.json()
    compressed_context = result["compressed_prompt"]

    print(f"Tokens: {result['original_prompt_tokens']} -> {result['compressed_prompt_tokens']}")

    # Step 2: Generate. Send compressed context to your LLM.
    ai_response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful customer support assistant. "
                    "Answer using ONLY the provided context. "
                    "If the answer isn't in the context, say you don't know."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{compressed_context}\n\nQuestion: {question}"
            }
        ],
        temperature=0.2
    )

    return ai_response.choices[0].message.content


# Try it
print(ask(knowledge_base, "How do I export my data as CSV?"))
print(ask(knowledge_base, "What happens if I delete my account?"))
print(ask(knowledge_base, "How many API requests can I make on the Pro plan?"))

How This Compares

|  | Traditional RAG | ScaleDown RAG |
| --- | --- | --- |
| Retrieval | Vector DB (Pinecone, Weaviate, etc.) | ScaleDown's query-aware compression |
| Embeddings | Required (OpenAI, Cohere, etc.) | Not needed |
| Infrastructure | DB hosting, index management | Single API call |
| Context sent to LLM | Full retrieved chunks | Compressed, 40-70% fewer tokens |
| Hallucination risk | Depends on chunk quality | Lower, less noise in context |
| Setup time | Hours (provisioning, indexing) | Minutes (one API key) |
| Best for | Millions of documents | Knowledge bases, support docs, internal tools |
For very large corpora (millions of documents), a dedicated vector database gives you better indexing and query performance at scale. ScaleDown RAG works best for knowledge bases where simplicity and cost savings are the priority.
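The two approaches also compose. If you already run a vector store, you can use it for coarse retrieval over a huge corpus and pass the retrieved chunks through ScaleDown before generation. A sketch, assuming a hypothetical retrieve() function standing in for whatever store you use:

def retrieve_then_compress(question: str, top_k: int = 20) -> str:
    """Hybrid: coarse vector retrieval first, then query-aware compression."""
    chunks = retrieve(question, top_k=top_k)  # hypothetical: your vector store's query
    context = "\n\n".join(chunks)
    response = requests.post(
        SCALEDOWN_URL,
        headers=SCALEDOWN_HEADERS,
        json={
            "context": context,
            "prompt": question,
            "model": "gpt-4o",
            "scaledown": {"rate": "auto"}
        }
    )
    return response.json()["compressed_prompt"]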

Key Takeaways

No vector DB needed

ScaleDown’s compression is query-aware. It keeps what’s relevant to the question and removes the rest. Retrieval and compression in one API call.

40-70% fewer tokens

Compressed context means lower LLM costs, faster responses, and less room for the model to hallucinate from irrelevant details.
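A quick back-of-the-envelope with assumed numbers (the per-token price below is an assumption for illustration, not a quote; check your provider's current pricing):

# Hypothetical cost math; every constant here is an assumption
context_tokens = 50_000      # knowledge base tokens sent per query, uncompressed
compression_rate = 0.70      # upper end of the quoted 40-70% range
price_per_million = 2.50     # assumed USD per 1M input tokens

tokens_saved = context_tokens * compression_rate            # 35,000 per query
savings_per_query = tokens_saved / 1_000_000 * price_per_million
print(f"${savings_per_query:.4f} saved per query")          # $0.0875
print(f"${savings_per_query * 10_000:,.0f} per 10,000 queries")  # $875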

Works with any LLM

The compressed context is plain text. Send it to OpenAI, Gemini, Claude, or whatever model you use.
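For example, the same compressed context drops straight into Anthropic's Messages API (the model ID below is illustrative; check their docs for current names):

from anthropic import Anthropic

anthropic_client = Anthropic()  # uses the ANTHROPIC_API_KEY env var

message = anthropic_client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=512,
    system=(
        "You are a helpful customer support assistant. "
        "Answer using ONLY the provided context."
    ),
    messages=[{
        "role": "user",
        "content": f"Context:\n{compressed_context}\n\nQuestion: {user_question}"
    }]
)
print(message.content[0].text)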

Minutes to set up

No infrastructure to provision. Get an API key, compress, and generate. That’s the whole pipeline.