Version 0.1.4
ScaleDown is a context engineering platform that compresses AI prompts while preserving semantic integrity and reducing hallucinations. Its research-backed compression algorithms analyze prompt components, from reasoning chains to code contexts, and apply targeted optimization techniques that maintain output quality while reducing token consumption.
Documentation Index
Fetch the complete documentation index at: https://docs.scaledown.ai/llms.txt
Use this file to discover all available pages before exploring further.
Installation
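To use the new optimization modules, install ScaleDown with the required extras: pip install scaledown[haste] for HasteOptimizer, or pip install scaledown[semantic] for SemanticOptimizer.
Main Classes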
scaledown.compressor.ScaleDownCompressor
The compressor module contains several compressors; ScaleDownCompressor is the default.
ScaleDownCompressor is the main entry point for compressing text. It manages API communication, batch processing, and compression settings.
This class inherits from BaseCompressor and handles both single-string and list-based inputs automatically.
It maintains backward compatibility for users focused purely on context compression.
| Parameter | Type | Default | Description |
|---|---|---|---|
| target_model | str | 'gpt-4o' | The target LLM you plan to use downstream. ScaleDown optimizes specifically for this model’s tokenizer and attention biases. Supported: 'gpt-4o', 'gpt-4o-mini', 'gemini-2.5-flash', etc. |
| rate | float \| 'auto' | 'auto' | The aggressiveness of compression. 'auto': ScaleDown determines the optimal rate based on redundancy (recommended). float: a target retention rate (e.g., 0.4 keeps ~40% of tokens). |
| api_key | str | None | Your ScaleDown API key. If None, looks for the SCALEDOWN_API_KEY environment variable. |
| temperature | float | None | Controls compression randomness. Higher values introduce more variation in token selection. |
| preserve_keywords | bool | False | If True, forces the preservation of detected domain-specific keywords. |
| preserve_words | List[str] | None | A list of specific words or phrases that must never be removed during compression. |
compress
Compresses the given context and prompt.
| Parameter | Type | Description |
|---|---|---|
| context | str \| List[str] | The background information (documents, code, history) to compress. |
| prompt | str \| List[str] | The user query or instruction. This is usually not compressed but used to guide the compression of the context. |
| max_tokens | int | Optional strict limit on the output token count. |
| **kwargs | dict | Additional parameters passed directly to the API payload. |
Returns:
- CompressedPrompt: If inputs are strings.
- List[CompressedPrompt]: If inputs are lists (supports batch processing).
Usage
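The following is a minimal sketch of a single compress call, using only the class path, constructor parameters, and method arguments documented above. The helper names `compressor_settings` and `compress_context` are hypothetical, and the check that a float rate lies in (0, 1] is an assumption drawn from the "retention rate" description, not a documented rule.

```python
import os

try:
    # Import path follows the class name documented above.
    from scaledown.compressor import ScaleDownCompressor
except ImportError:
    ScaleDownCompressor = None  # ScaleDown is not installed in this environment


def compressor_settings(rate="auto", preserve_words=None):
    """Build constructor keyword arguments from the documented parameters."""
    # Assumption: a float rate is a retention fraction in (0, 1].
    if rate != "auto" and not (isinstance(rate, float) and 0.0 < rate <= 1.0):
        raise ValueError("rate must be 'auto' or a float in (0, 1]")
    settings = {"target_model": "gpt-4o", "rate": rate}
    if preserve_words:
        settings["preserve_words"] = list(preserve_words)
    return settings


def compress_context(context, prompt, **options):
    """Return a CompressedPrompt, or None if the package or API key is missing."""
    if ScaleDownCompressor is None or not os.environ.get("SCALEDOWN_API_KEY"):
        return None
    compressor = ScaleDownCompressor(**compressor_settings(**options))
    return compressor.compress(context=context, prompt=prompt)


settings = compressor_settings(rate=0.4, preserve_words=["OAuth"])
print(settings)
```

With the package installed and SCALEDOWN_API_KEY set, `compress_context(long_document, "Summarize the auth flow")` should return a CompressedPrompt whose `content` attribute holds the compressed text. Passing lists for both arguments should return a list of CompressedPrompt objects, per the return types listed above.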
scaledown.optimizer.HasteOptimizer
Requires: pip install scaledown[haste]
The HASTE (Hybrid AST-Enhanced) optimizer uses Tree-sitter parsing to understand code structure. It performs a hybrid search using BM25 and AST traversal (BFS) to identify relevant function and class definitions, ensuring that code dependencies are included in the context.
| Parameter | Type | Default | Description |
|---|---|---|---|
| top_k | int | 6 | Number of top functions/classes to retrieve initially. |
| prefilter | int | 300 | Size of candidate pool before reranking. |
| bfs_depth | int | 1 | Depth of Breadth-First Search over the call graph to find dependencies. |
| max_add | int | 12 | Maximum nodes added during BFS expansion. |
| semantic | bool | False | If True, enables semantic reranking using OpenAI embeddings. |
| sem_model | str | 'text-embedding-3-small' | OpenAI embedding model for semantic search. |
| hard_cap | int | 1200 | Strict token limit for the optimized output. |
| soft_cap | int | 1800 | Soft token limit for the optimized output. |
| target_model | str | 'gpt-4o' | Model used for token counting calculations. |
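To make the roles of bfs_depth and max_add more concrete, here is a toy breadth-first expansion over a hand-written call graph. It is an illustrative sketch only, not HASTE's actual implementation: the function `expand_with_bfs` and the dictionary-based call graph are hypothetical, and the real optimizer derives its graph from Tree-sitter parsing.

```python
from collections import deque


def expand_with_bfs(seeds, call_graph, bfs_depth=1, max_add=12):
    """Add callees of the seed functions, breadth-first, up to the given limits."""
    selected = list(seeds)
    seen = set(seeds)
    queue = deque((name, 0) for name in seeds)
    added = 0
    while queue and added < max_add:
        name, depth = queue.popleft()
        if depth >= bfs_depth:
            continue  # do not expand nodes that sit at the depth limit
        for callee in call_graph.get(name, []):
            if callee in seen:
                continue
            if added >= max_add:
                break
            seen.add(callee)
            selected.append(callee)
            queue.append((callee, depth + 1))
            added += 1
    return selected


# Hypothetical call graph: each function maps to the functions it calls.
call_graph = {
    "train": ["load_data", "step"],
    "step": ["forward", "backward"],
    "load_data": ["read_csv"],
}
print(expand_with_bfs(["train"], call_graph, bfs_depth=1))
# → ['train', 'load_data', 'step']
print(expand_with_bfs(["train"], call_graph, bfs_depth=2, max_add=3))
# → ['train', 'load_data', 'step', 'read_csv']
```

A larger bfs_depth pulls in dependencies of dependencies, while max_add bounds how much the expansion can grow the selected context.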
optimize
Extracts relevant code sections based on the query and file structure. This is the main method for optimizing code context with HASTE.
| Parameter | Type | Description |
|---|---|---|
| context | str \| List[str] | Source code content. If file_path is not provided, context is written to a temporary file for processing. |
| query | str | Query to guide context retrieval (e.g., “find training loop”). Required for HASTE. |
| max_tokens | int | Maximum token budget (uses hard_cap if not specified). |
| file_path | str | Path to the source file (required for AST parsing). |
| **kwargs | dict | Additional HASTE parameters passed dynamically. |
Usage
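A minimal sketch of calling HasteOptimizer on a single file, using only the constructor parameters and optimize arguments listed above. The wrapper `select_relevant_code` is hypothetical; it returns None when the optional haste extra is not installed.

```python
from pathlib import Path

try:
    # Import path follows the class name documented above; needs scaledown[haste].
    from scaledown.optimizer import HasteOptimizer
except ImportError:
    HasteOptimizer = None  # optional extra is not installed here


def select_relevant_code(file_path, query, max_tokens=None):
    """Return an OptimizedContext for `query`, or None if HASTE is unavailable."""
    if HasteOptimizer is None:
        return None
    source = Path(file_path).read_text()
    optimizer = HasteOptimizer(top_k=6, bfs_depth=1, max_add=12)
    # Only pass max_tokens when set, so hard_cap applies otherwise.
    options = {} if max_tokens is None else {"max_tokens": max_tokens}
    return optimizer.optimize(
        context=source, query=query, file_path=str(file_path), **options
    )
```

For example, `select_relevant_code("train.py", "find training loop")` (where train.py is a placeholder path) should return an OptimizedContext whose `content` holds the selected code and whose `metrics` describe the optimization.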
scaledown.optimizer.SemanticOptimizer
Requires: pip install scaledown[semantic]
Uses local embeddings (Sentence Transformers) and FAISS to perform purely semantic search over code chunks or document segments. Best used for large codebases where keyword matching is insufficient.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | 'Qwen/Qwen3-Embedding-0.6B' | The HuggingFace embedding model to load locally. |
| top_k | int | 3 | Number of top code chunks to retrieve. |
| target_model | str | 'gpt-4o' | Target model for code context optimization. |
optimize
Finds semantically similar code or text segments. This is the main method for optimizing code context with the semantic optimizer.
| Parameter | Type | Description |
|---|---|---|
| context | str \| List[str] | The text or code content to search within. |
| query | str | The search query used to find relevant sections. Defaults to “main logic” if not provided. |
| file_path | str | Path to the source file (required for extracting semantic units). |
| max_tokens | int | Optional token limit (currently unused in logic but accepted). |
Usage
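A minimal sketch of a SemanticOptimizer call, again limited to the documented parameters. The wrapper `find_relevant_chunks` is hypothetical and returns None when the optional semantic extra is missing.

```python
try:
    # Import path follows the class name documented above; needs scaledown[semantic].
    from scaledown.optimizer import SemanticOptimizer
except ImportError:
    SemanticOptimizer = None  # optional extra is not installed here


def find_relevant_chunks(source, file_path, query="main logic", top_k=3):
    """Return an OptimizedContext, or None if the semantic extra is unavailable."""
    if SemanticOptimizer is None:
        return None
    optimizer = SemanticOptimizer(top_k=top_k)  # keeps the default model_name
    return optimizer.optimize(context=source, query=query, file_path=file_path)
```

The default query mirrors the documented fallback of “main logic”; in practice a specific query such as "where are embeddings cached" is likely to retrieve more useful chunks.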
scaledown.pipeline.Pipeline
The Pipeline class orchestrates multiple optimization and compression steps into a single workflow. It allows you to chain a HasteOptimizer, SemanticOptimizer, and ScaleDownCompressor sequentially.
run
Executes the pipeline on the input data.
| Parameter | Type | Description |
|---|---|---|
| context | str | The initial input text or code to process. |
| **kwargs | dict | Arguments passed to each step (e.g., query, prompt, file_path). |
make_pipeline
Helper function to create a pipeline.
Usage
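The constructor signatures of Pipeline and make_pipeline are not given in this reference, so the sketch below does not call them. Instead it illustrates the idea of sequential steps that share keyword arguments and record per-stage token counts, using hypothetical stand-in functions and a whitespace word count in place of a real tokenizer.

```python
def count_tokens(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def run_steps(context, steps, **kwargs):
    """Apply each (name, step) pair in order and record token counts per stage."""
    history = []
    current = context
    for name, step in steps:
        before = count_tokens(current)
        current = step(current, **kwargs)  # every step sees the same kwargs
        history.append(
            {"step": name, "tokens_in": before, "tokens_out": count_tokens(current)}
        )
    return current, history


def keep_matching_lines(text, query="", **_):
    """Toy 'optimizer': keep only the lines that mention the query."""
    return "\n".join(line for line in text.splitlines() if query in line)


def drop_filler_words(text, **_):
    """Toy 'compressor': remove a few filler words."""
    filler = {"the", "a", "very"}
    return " ".join(word for word in text.split() if word.lower() not in filler)


context = (
    "the train loop is very long\n"
    "the eval loop is short\n"
    "the train step runs a batch"
)
final, history = run_steps(
    context,
    [("optimize", keep_matching_lines), ("compress", drop_filler_words)],
    query="train",
)
print(final)  # → train loop is long train step runs batch
for entry in history:
    print(entry)  # token counts go 17 -> 12 -> 8 across the two steps
```

The per-step records play the same role as the history field of PipelineResult: they show how much each stage contributed to the overall reduction.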
Data Structures
scaledown.types.CompressedPrompt
A Pydantic object containing the compressed text and validated metadata.
This is also the output of Compressor objects.
| Attribute | Type | Description |
|---|---|---|
| content | str | The actual compressed text string. |
| metrics | CompressionMetrics | Structured metrics object containing token counts and latency. |
| tokens | Tuple[int, int] | A tuple of (original_count, compressed_count). |
| savings_percent | float | The percentage of tokens removed (e.g., 60.0 for 60% reduction). |
| compression_ratio | float | The ratio of original size to compressed size (e.g., 2.5x). |
| latency_ms | int | Server-side processing time in milliseconds. |
- print_stats(): Prints a formatted summary of compression performance to stdout. For example:
  ScaleDown Stats:
  - Tokens: 1000 -> 400
  - Savings: 60.0%
  - Ratio: 2.5x
  - Latency: 150ms
- from_api_response(cls, content: str, raw_response: Dict[str, Any]): Factory method that creates an instance from a raw API response dict.
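The derived fields follow directly from the two token counts. The sketch below reproduces the figures from the print_stats example with a hypothetical stand-in class, `TokenStats`; it is not the library's CompressedPrompt, and the formulas are inferred from the attribute descriptions above.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Hypothetical stand-in holding the two counts from the `tokens` tuple."""

    original: int
    compressed: int

    @property
    def savings_percent(self):
        # Share of tokens removed, as a percentage of the original count.
        return 100.0 * (self.original - self.compressed) / self.original

    @property
    def compression_ratio(self):
        # Original size divided by compressed size.
        return self.original / self.compressed


stats = TokenStats(original=1000, compressed=400)
print(f"Tokens: {stats.original} -> {stats.compressed}")  # → Tokens: 1000 -> 400
print(f"Savings: {stats.savings_percent:.1f}%")           # → Savings: 60.0%
print(f"Ratio: {stats.compression_ratio:.1f}x")           # → Ratio: 2.5x
```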
scaledown.types.OptimizedContext
The output returned by HasteOptimizer and SemanticOptimizer.
| Attribute | Type | Description |
|---|---|---|
| content | str | The optimized/selected code or text. |
| metrics | dict | Metrics regarding the optimization (e.g., compression_ratio). Same as OptimizerMetrics. |
scaledown.types.PipelineResult
The final output returned by Pipeline.run().
| Attribute | Type | Description |
|---|---|---|
| final_content | str | The fully processed text (after optimization and compression). |
| savings_percent | float | Total percentage of tokens removed across all steps. |
| history | List[StepMetrics] | A breakdown of token usage at each stage of the pipeline. |
| metrics | CompressionMetrics | Aggregated metrics for the final output. |
scaledown.metrics.CompressionMetrics
Pydantic model that validates the raw metrics returned by the API. It is exposed as the metrics attribute of the CompressedPrompt object returned by the compress method of ScaleDownCompressor.
| Field | Type | Description |
|---|---|---|
| original_prompt_tokens | int | Token count before compression. Validated to be non-negative. |
| compressed_prompt_tokens | int | Token count after compression. Validated to be non-negative. |
| latency_ms | int | Processing time in milliseconds. |
| timestamp | datetime | Time when the compression request was processed. |
OptimizerMetrics
Metrics returned by the optimize method, inside OptimizedContext. Token counts, compression ratios, and latency are calculated uniformly across all optimization methods.
| Field | Type | Description |
|---|---|---|
| original_tokens | int | Token count before optimization. |
| optimized_tokens | int | Token count after optimization. |
| latency_ms | int | Processing time in milliseconds. |
| compression_ratio | float | Ratio of original to optimized tokens. |
| retrieval_mode | str | Method used (e.g., 'hybrid', 'bm25', 'semantic_search'). |
| ast_fidelty | float | The AST fidelity at which the code context was optimized; a metric indicating structural preservation. |
| chunks_retreived | int | Number of code chunks/functions selected. |
Configuration & Exceptions
Configuration
ScaleDown uses a global configuration system for API keys and endpoints.
- Environment Variables:
  - SCALEDOWN_API_KEY: Automatically loaded if not set in code.
  - SCALEDOWN_API_URL: Override the default API endpoint (Default: https://api.scaledown.xyz).
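The lookup order described above can be sketched as follows. The function `resolve_settings` is a hypothetical helper for illustration, not part of the ScaleDown package; it assumes an explicitly passed key takes precedence over the environment, which matches "loaded if not set in code".

```python
import os

DEFAULT_API_URL = "https://api.scaledown.xyz"  # default stated in this section


def resolve_settings(api_key=None, env=None):
    """Resolve the API key and endpoint from explicit values, then the environment."""
    env = os.environ if env is None else env
    return {
        # An explicit key wins; otherwise fall back to the environment variable.
        "api_key": api_key or env.get("SCALEDOWN_API_KEY"),
        # The endpoint can be overridden through the environment.
        "api_url": env.get("SCALEDOWN_API_URL", DEFAULT_API_URL),
    }


print(resolve_settings(env={"SCALEDOWN_API_KEY": "sk-example"}))
# → {'api_key': 'sk-example', 'api_url': 'https://api.scaledown.xyz'}
```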
Exceptions
All custom exceptions inherit from ScaleDownError.
| Exception | Description |
|---|---|
| ScaleDownError | Base class for all package errors. |
| OptimizerError | Raised when a local optimizer fails (e.g., missing dependencies like Tree-sitter). |
| AuthenticationError | Raised when the API key is missing, invalid, or expired. |
| APIError | Raised when the server returns a non-200 response (e.g., rate limits, server errors). |
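Because every exception derives from ScaleDownError, callers can handle specific failures first and fall back to the base class. The import path for these exceptions is not stated here, so the sketch below defines local stand-in classes that mirror the table; in real code the package's own classes would be imported instead. The helpers `describe_failure` and `call_safely` are hypothetical.

```python
# Local stand-ins mirroring the documented hierarchy (not the package's classes).
class ScaleDownError(Exception):
    pass


class OptimizerError(ScaleDownError):
    pass


class AuthenticationError(ScaleDownError):
    pass


class APIError(ScaleDownError):
    pass


def describe_failure(error):
    """Map an exception to a short hint, checking specific types before the base."""
    if isinstance(error, AuthenticationError):
        return "check SCALEDOWN_API_KEY"
    if isinstance(error, APIError):
        return "retry later or inspect the server response"
    if isinstance(error, OptimizerError):
        return "install the missing optional extra"
    return "unexpected ScaleDown failure"


def call_safely(operation):
    """Run `operation`; return (result, None) or (None, hint) on a ScaleDown error."""
    try:
        return operation(), None
    except ScaleDownError as error:
        return None, describe_failure(error)


def failing_call():
    raise AuthenticationError("API key expired")


print(call_safely(failing_call))  # → (None, 'check SCALEDOWN_API_KEY')
```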