ScaleDown Team • February 2025 • 12 min read

Most RAG tutorials make you spin up Pinecone, Weaviate, or Chroma before you can ask a single question. What if you could skip the vector database entirely and still get accurate, grounded answers from your documents?

ScaleDown's compression API is query-aware. It takes both your document context and the user's question, then keeps only the parts that actually matter. Retrieval and compression in a single API call. No embeddings, no vector store, no infra.
This guide walks you through building a customer support bot that answers questions from a product knowledge base. All you need is ScaleDown and an LLM.
A typical RAG pipeline has you embed documents, store them in a vector database, retrieve relevant chunks, and then send them to an LLM. That's a lot of moving parts.

ScaleDown replaces the embed, store, and retrieve steps with a single compression call. The API accepts both a context and a prompt, so it knows what the user is asking and throws away everything else.
ScaleDown’s context + prompt design means compression is query-aware. It doesn’t just shrink text. It keeps what’s relevant to the question and removes what isn’t. Retrieval and compression in one step.
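Concretely, every ScaleDown call in this guide is a single POST with two key fields. Here's a minimal sketch of the shape (the endpoint, headers, and body fields match the full example later in this guide; the API key is a placeholder):

```python
import requests

response = requests.post(
    "https://api.scaledown.xyz/compress/raw/",
    headers={"x-api-key": "YOUR_SCALEDOWN_API_KEY", "Content-Type": "application/json"},
    json={
        "context": "<your documents>",      # everything the bot may draw on
        "prompt": "<the user's question>",  # what relevance is judged against
        "model": "gpt-4o",                  # the target LLM
        "scaledown": {"rate": "auto"},      # let ScaleDown pick the compression rate
    },
)
compressed = response.json()["compressed_prompt"]
```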
In production, your knowledge base comes from your docs, CMS, or help center. Here's a sample with 6 support articles:
knowledge_base = """## Password ResetTo reset your password, go to Settings > Security > Reset Password.Click "Send Reset Link" and check your email. The link expires in 24 hours.If you don't receive the email, check your spam folder or contact support@example.com.Two-factor authentication users must verify via their authenticator app after resetting.## Pricing PlansWe offer three plans:- Starter ($9/mo): 1 user, 5GB storage, email support, basic analytics.- Pro ($29/mo): 5 users, 50GB storage, priority support, advanced analytics, API access, and custom integrations.- Enterprise ($99/mo): Unlimited users, 500GB storage, dedicated account manager, SSO, audit logs, SLA guarantee, and custom onboarding.All plans include a 14-day free trial. Annual billing saves 20%.## Data ExportYou can export your data in CSV, JSON, or PDF format.Go to Settings > Data > Export. Select the date range and format.Exports under 10,000 rows are instant. Larger exports are queued andyou'll receive an email when the file is ready for download.Export history is retained for 30 days.## API Rate LimitsThe API allows 100 requests per minute on Starter, 500 on Pro,and 2000 on Enterprise. Rate limit headers are included in every response.If you exceed the limit, you'll receive a 429 status code.Implement exponential backoff for best results.## Account DeletionTo delete your account, go to Settings > Account > Delete Account.This action is irreversible. All data will be permanently removed within 72 hours.You must cancel any active subscription before deleting your account.An email confirmation is required to proceed with deletion.## IntegrationsWe support integrations with Slack, Jira, GitHub, Zapier, and Salesforce.Go to Settings > Integrations to connect your tools.Pro and Enterprise plans support custom webhooks and OAuth2 applications.Starter plans are limited to 2 active integrations."""
That’s roughly 350 tokens across 6 topics. In production your knowledge base could be thousands of articles. ScaleDown handles compression the same way.
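If your articles live as files on disk rather than in a CMS, assembling the context can be a few lines. A hypothetical sketch (the `docs/` folder and one-article-per-file layout are assumptions, not part of the product):

```python
from pathlib import Path

# Hypothetical layout: one markdown article per file under docs/.
articles = sorted(Path("docs").glob("*.md"))
knowledge_base = "\n\n".join(path.read_text() for path in articles)
```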
Step 3: Compress the context with ScaleDown
This is the key step. Send your entire knowledge base as context and the user’s question as prompt. ScaleDown compresses the context while keeping what’s relevant to the question.
Python:

```python
import requests

# SCALEDOWN_URL and SCALEDOWN_HEADERS are the endpoint and auth headers
# defined in the full example at the end of this guide.
user_question = "What's included in the Pro plan and how much does it cost?"

# ScaleDown compresses the context, keeping what's relevant to the question
response = requests.post(
    SCALEDOWN_URL,
    headers=SCALEDOWN_HEADERS,
    json={
        "context": knowledge_base,
        "prompt": user_question,
        "model": "gpt-4o",
        "scaledown": {"rate": "auto"},
    },
)
result = response.json()
compressed_context = result["compressed_prompt"]

print(f"Original tokens: {result['original_prompt_tokens']}")
print(f"Compressed tokens: {result['compressed_prompt_tokens']}")
# Original tokens: 350
# Compressed tokens: 85
```
TypeScript:

```typescript
// SCALEDOWN_URL and SCALEDOWN_HEADERS are the endpoint and auth headers
// defined in the full example at the end of this guide.
const userQuestion = "What's included in the Pro plan and how much does it cost?";

const response = await fetch(SCALEDOWN_URL, {
  method: "POST",
  headers: SCALEDOWN_HEADERS,
  body: JSON.stringify({
    context: knowledgeBase, // the support-articles string from the step above
    prompt: userQuestion,
    model: "gpt-4o",
    scaledown: { rate: "auto" },
  }),
});

const result = await response.json();
const compressedContext = result.compressed_prompt;

console.log(`Original tokens: ${result.original_prompt_tokens}`);
console.log(`Compressed tokens: ${result.compressed_prompt_tokens}`);
```
What just happened?
The knowledge base has 6 articles: password reset, pricing, exports, rate limits, account deletion, and integrations. The user asked about the Pro plan.

ScaleDown looked at both the context and the prompt, figured out that only the pricing section (and maybe integrations, since Pro unlocks features there) is relevant, and compressed everything else away.

The result: ~75% fewer tokens, with the pricing details fully intact.
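If you want the exact figure, you can compute it from the token counts the API returns. A two-line check using the `result` from the previous step:

```python
# Token savings from the compression response above.
savings = 1 - result["compressed_prompt_tokens"] / result["original_prompt_tokens"]
print(f"Saved {savings:.0%} of context tokens")  # 350 -> 85 prints "Saved 76% ..."
```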
Step 4: Send the compressed context to your LLM
Pass the compressed context to your LLM. Since ScaleDown already removed the noise, the model gets focused, relevant context. That means faster responses, lower cost, and fewer hallucinations.
Python:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful customer support assistant. "
                "Answer using ONLY the provided context. "
                "If the context doesn't contain the answer, say so."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{compressed_context}\n\nQuestion: {user_question}",
        },
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```
TypeScript:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content:
        "You are a helpful customer support assistant. " +
        "Answer using ONLY the provided context. " +
        "If the context doesn't contain the answer, say so.",
    },
    {
      role: "user",
      content: `Context:\n${compressedContext}\n\nQuestion: ${userQuestion}`,
    },
  ],
  temperature: 0.2,
});

console.log(completion.choices[0].message.content);
```
Example output:
The Pro plan costs $29/month and includes 5 users, 50GB storage, priority support, advanced analytics, API access, and custom integrations. A 14-day free trial is available, and annual billing saves 20%.
The answer is grounded entirely in the knowledge base. No hallucinated features, no made-up prices.
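To see that guardrail in action, try a question the knowledge base can't answer. A hypothetical spot check reusing `openai_client` and `compressed_context` from the step above (the model's exact wording will vary, but with the system prompt above it should decline rather than guess):

```python
# "Mobile app" appears nowhere in the knowledge base, so the system prompt
# should make the model say the answer isn't in the context.
check = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful customer support assistant. "
                "Answer using ONLY the provided context. "
                "If the context doesn't contain the answer, say so."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{compressed_context}\n\nQuestion: Do you offer a mobile app?",
        },
    ],
    temperature=0.2,
)
print(check.choices[0].message.content)
```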
Here’s the complete, copy-paste-ready version wrapped in a reusable function:
Python:

```python
import requests
from openai import OpenAI

SCALEDOWN_URL = "https://api.scaledown.xyz/compress/raw/"
SCALEDOWN_HEADERS = {
    "x-api-key": "YOUR_SCALEDOWN_API_KEY",
    "Content-Type": "application/json",
}

openai_client = OpenAI()


def ask(knowledge_base: str, question: str) -> str:
    """Answer a question using a knowledge base, powered by ScaleDown."""
    # Step 1: Compress. ScaleDown keeps what's relevant to the question.
    compress_response = requests.post(
        SCALEDOWN_URL,
        headers=SCALEDOWN_HEADERS,
        json={
            "context": knowledge_base,
            "prompt": question,
            "model": "gpt-4o",
            "scaledown": {"rate": "auto"},
        },
    )
    result = compress_response.json()
    compressed_context = result["compressed_prompt"]
    print(f"Tokens: {result['original_prompt_tokens']} -> {result['compressed_prompt_tokens']}")

    # Step 2: Generate. Send compressed context to your LLM.
    ai_response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful customer support assistant. "
                    "Answer using ONLY the provided context. "
                    "If the answer isn't in the context, say you don't know."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{compressed_context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.2,
    )
    return ai_response.choices[0].message.content


# Try it
print(ask(knowledge_base, "How do I export my data as CSV?"))
print(ask(knowledge_base, "What happens if I delete my account?"))
print(ask(knowledge_base, "How many API requests can I make on the Pro plan?"))
```
TypeScript:

```typescript
import OpenAI from "openai";

const SCALEDOWN_URL = "https://api.scaledown.xyz/compress/raw/";
const SCALEDOWN_HEADERS = {
  "x-api-key": "YOUR_SCALEDOWN_API_KEY",
  "Content-Type": "application/json",
};

const openai = new OpenAI();

async function ask(knowledgeBase: string, question: string): Promise<string> {
  // Step 1: Compress
  const compressRes = await fetch(SCALEDOWN_URL, {
    method: "POST",
    headers: SCALEDOWN_HEADERS,
    body: JSON.stringify({
      context: knowledgeBase,
      prompt: question,
      model: "gpt-4o",
      scaledown: { rate: "auto" },
    }),
  });
  const result = await compressRes.json();
  const compressedContext = result.compressed_prompt;
  console.log(`Tokens: ${result.original_prompt_tokens} -> ${result.compressed_prompt_tokens}`);

  // Step 2: Generate
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "You are a helpful customer support assistant. " +
          "Answer using ONLY the provided context. " +
          "If the answer isn't in the context, say you don't know.",
      },
      {
        role: "user",
        content: `Context:\n${compressedContext}\n\nQuestion: ${question}`,
      },
    ],
    temperature: 0.2,
  });

  return completion.choices[0].message.content!;
}

// Try it (knowledgeBase is the support-articles string from the step above)
console.log(await ask(knowledgeBase, "How do I export my data as CSV?"));
```
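One production note: the compression call is now on your critical path. A defensive pattern worth considering is falling back to the uncompressed knowledge base if the ScaleDown request fails, so your bot degrades to a slower, costlier answer instead of an error. A minimal sketch (the timeout value and fallback policy are assumptions, not ScaleDown recommendations):

```python
import requests

def compress_with_fallback(knowledge_base: str, question: str) -> str:
    """Return compressed context, or the full knowledge base if compression fails."""
    try:
        response = requests.post(
            SCALEDOWN_URL,
            headers=SCALEDOWN_HEADERS,
            json={
                "context": knowledge_base,
                "prompt": question,
                "model": "gpt-4o",
                "scaledown": {"rate": "auto"},
            },
            timeout=10,  # assumed budget; tune for your latency requirements
        )
        response.raise_for_status()
        return response.json()["compressed_prompt"]
    except (requests.RequestException, KeyError):
        # Degrade gracefully: answer from the full context at full token cost.
        return knowledge_base
```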
For very large corpora (millions of documents), a dedicated vector database gives you better indexing and query performance at scale. ScaleDown RAG works best for knowledge bases where simplicity and cost savings are the priority.