Version 0.1.4

ScaleDown is a context engineering platform that intelligently compresses AI prompts while preserving semantic integrity and reducing hallucinations. Our research-backed compression algorithms analyze prompt components, from reasoning chains to code contexts, and apply targeted optimization techniques that maintain output quality while dramatically reducing token consumption.
The compressor module contains various compressors, the default being ScaleDownCompressor: the main entry point for compressing text. It manages API communication, batch processing, and compression settings.
This class inherits from BaseCompressor and handles both single-string and list-based inputs automatically, and it maintains backward compatibility for users focused purely on context compression.
Parameters

The compressor is configured for the target LLM you plan to use downstream; ScaleDown optimizes specifically for that model's tokenizer and attention biases. Supported models include 'gpt-4o', 'gpt-4o-mini', 'gemini-2.5-flash', etc.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| rate | float \| 'auto' | 'auto' | The aggressiveness of compression. 'auto': ScaleDown determines the optimal rate based on redundancy (recommended). float: a target retention rate (e.g., 0.4 keeps ~40% of tokens). |
| api_key | str | None | Your ScaleDown API key. If None, looks for the SCALEDOWN_API_KEY environment variable. |
| temperature | float | None | Controls compression randomness. Higher values introduce more variation in token selection. |
| preserve_keywords | bool | False | If True, forces the preservation of detected domain-specific keywords. |
| preserve_words | List[str] | None | A list of specific words or phrases that must never be removed during compression. |
Methods

`compress`: Compresses the given context and prompt.
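For intuition on the float form of `rate`, the arithmetic below shows how a retention rate maps to token savings. This is plain Python with illustrative helper names, not a ScaleDown API call:

```python
# Illustration of the `rate` parameter: a float rate is a target retention
# rate, so rate=0.4 keeps roughly 40% of the original tokens.

def expected_tokens(original_tokens: int, rate: float) -> int:
    """Approximate token count after compression at a given retention rate."""
    return round(original_tokens * rate)

def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed by compression."""
    return 100.0 * (1 - compressed_tokens / original_tokens)

kept = expected_tokens(1000, 0.4)    # ~400 tokens retained
saved = savings_percent(1000, kept)  # ~60% savings
print(kept, saved)
```

With rate='auto', the retained fraction is chosen per prompt based on measured redundancy, so actual savings will vary.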
Requires: `pip install scaledown[haste]`

The HASTE (Hybrid AST-Enhanced) optimizer uses Tree-sitter parsing to understand code structure. It performs a hybrid search using BM25 and AST traversal (BFS) to identify relevant function and class definitions, ensuring that code dependencies are included in the context.
```python
class scaledown.optimizer.HasteOptimizer(top_k: int = 6, prefilter: int = 300, bfs_depth: int = 1, max_add: int = 12, semantic: bool = False, sem_model: str = 'text-embedding-3-small', hard_cap: int = 1200, soft_cap: int = 1800, target_model: str = 'gpt-4o', **kwargs)
```
Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| top_k | int | 6 | Number of top functions/classes to retrieve initially. |
| prefilter | int | 300 | Size of the candidate pool before reranking. |
| bfs_depth | int | 1 | Depth of breadth-first search over the call graph to find dependencies. |
| max_add | int | 12 | Maximum nodes added during BFS expansion. |
| semantic | bool | False | If True, enables semantic reranking using OpenAI embeddings. |
| sem_model | str | 'text-embedding-3-small' | OpenAI embedding model for semantic search. |
| hard_cap | int | 1200 | Strict token limit for the optimized output. |
| soft_cap | int | 1800 | Soft token limit for the optimized output. |
| target_model | str | 'gpt-4o' | Model used for token counting calculations. |
Methods

`optimize`: The main function for optimizing code context with HASTE; it extracts relevant code sections based on the query and file structure.
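The BFS expansion that `bfs_depth` and `max_add` control can be sketched as follows. The graph shape and function names are made up for illustration; this is not the library's internal implementation:

```python
from collections import deque

# Hypothetical sketch of HASTE's dependency-expansion step: starting from the
# initially retrieved definitions, walk the call graph breadth-first up to
# bfs_depth levels, adding at most max_add extra nodes along the way.

def bfs_expand(call_graph, seeds, bfs_depth=1, max_add=12):
    selected = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    added = 0
    while frontier and added < max_add:
        node, depth = frontier.popleft()
        if depth == bfs_depth:
            continue  # do not expand beyond the configured depth
        for callee in call_graph.get(node, []):
            if callee not in selected and added < max_add:
                selected.add(callee)
                added += 1
                frontier.append((callee, depth + 1))
    return selected

call_graph = {
    "process_payment": ["validate_card", "charge"],
    "charge": ["log_txn"],
}
# At depth 1, direct callees are pulled in; log_txn (two hops away) is not.
print(sorted(bfs_expand(call_graph, ["process_payment"], bfs_depth=1)))
```

Raising `bfs_depth` to 2 would also include `log_txn`, while lowering `max_add` caps how many dependencies get added regardless of depth.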
Requires: `pip install scaledown[semantic]`

Uses local embeddings (Sentence Transformers) and FAISS to perform purely semantic search over code chunks or document segments. Best suited for large codebases where keyword matching is insufficient.
```python
class scaledown.optimizer.SemanticOptimizer(model_name: str = "Qwen/Qwen3-Embedding-0.6B", top_k: int = 3, target_model: str = "gpt-4o")
```
Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "Qwen/Qwen3-Embedding-0.6B" | The HuggingFace embedding model to load locally. |
| top_k | int | 3 | Number of top code chunks to retrieve. |
| target_model | str | "gpt-4o" | Target model for code context optimization. |
Methods

`optimize`: The main function for optimizing code context with the semantic optimizer; it finds semantically similar code or text segments.
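The retrieval idea can be sketched without FAISS or a local model: embed the chunks and the query, then rank by cosine similarity and keep the `top_k` best. The hand-made vectors and chunk names below are purely illustrative:

```python
import math

# Toy semantic retrieval: in the real optimizer the vectors come from a
# Sentence Transformers model and the search runs through FAISS; here the
# "embeddings" are hard-coded so the example stays self-contained.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_chunks(query_vec, chunk_vecs, top_k=3):
    scored = sorted(chunk_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

chunks = {
    "payment.py::charge": [0.9, 0.1, 0.0],
    "auth.py::login":     [0.1, 0.9, 0.0],
    "utils.py::format":   [0.0, 0.1, 0.9],
}
# A query vector pointing along the first axis ranks the payment chunk first.
print(top_k_chunks([1.0, 0.0, 0.0], chunks, top_k=2))
```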
The Pipeline class orchestrates multiple optimization and compression steps into a single workflow. It allows you to chain a HasteOptimizer, SemanticOptimizer, and ScaleDownCompressor sequentially.
```python
class scaledown.pipeline.Pipeline(steps: List[Tuple[str, Union[BaseOptimizer, BaseCompressor]]])
```
Methods

`run`: Executes the pipeline on the input data. Arguments are passed through to each step (e.g., query, prompt, file_path).
`make_pipeline`: Helper function to create a pipeline.
```python
def make_pipeline(*steps) -> Pipeline
```
Usage
```python
from scaledown.pipeline import Pipeline, make_pipeline
from scaledown.optimizer import HasteOptimizer
from scaledown.compressor import ScaleDownCompressor

# Initialize Pipeline
pipe = make_pipeline(
    ('haste', HasteOptimizer(top_k=5)),
    ('compressor', ScaleDownCompressor(rate=0.4))
)

# Run
result = pipe.run(
    context=source_code,                        # Used by Compressor and Haste (if file_path not given)
    query="Find the payment processing logic",  # Used by Haste
    prompt="Explain the payment logic",         # Used by Compressor
    file_path="payment.py"                      # Used by Haste
)

# Retrieving step data
print(f"Final Savings: {result.savings_percent}%")
for step in result.history:
    print(f"Step {step.step_name}: {step.input_tokens} -> {step.output_tokens} tokens")
```
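To show the chaining contract the pipeline relies on, here is a toy sequential pipeline in plain Python. The step class, the word-count "tokens", and the transforms are simplified stand-ins, not ScaleDown's actual Pipeline:

```python
# Toy illustration of sequential chaining: each named step transforms the
# text and hands its output to the next step, while a history records the
# per-step size change (word counts stand in for tokens here).

class ToyStep:
    def __init__(self, fn):
        self.fn = fn

    def run(self, text):
        return self.fn(text)

def run_pipeline(steps, text):
    history = []
    for name, step in steps:
        out = step.run(text)
        history.append((name, len(text.split()), len(out.split())))
        text = out
    return text, history

steps = [
    ("dedupe", ToyStep(lambda t: " ".join(dict.fromkeys(t.split())))),
    ("truncate", ToyStep(lambda t: " ".join(t.split()[:3]))),
]
final, history = run_pipeline(steps, "pay pay the the invoice now")
print(final, history)
```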
Pydantic model that validates the raw metrics returned by the API. These are the metrics carried in the CompressedPrompt object returned by the compress method of ScaleDownCompressor.
```python
class scaledown.metrics.CompressionMetrics
```
| Field | Type | Description |
| --- | --- | --- |
| original_prompt_tokens | int | Token count before compression. Validated to be non-negative. |
| compressed_prompt_tokens | int | Token count after compression. Validated to be non-negative. |
| latency_ms | int | Processing time in milliseconds. |
| timestamp | datetime | Time when the compression request was processed. |
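The non-negativity validation described in the table can be sketched in plain Python. The real class is a Pydantic model; this dataclass stand-in only mirrors the field names:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative stand-in for the validated metrics object: token counts are
# checked to be non-negative on construction, as the field table describes.

@dataclass
class CompressionMetricsSketch:
    original_prompt_tokens: int
    compressed_prompt_tokens: int
    latency_ms: int
    timestamp: datetime

    def __post_init__(self):
        for name in ("original_prompt_tokens", "compressed_prompt_tokens"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")

m = CompressionMetricsSketch(1200, 480, 35, datetime.now())
print(m.compressed_prompt_tokens / m.original_prompt_tokens)  # retention ratio
```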
```python
class scaledown.metrics.OptimizerMetrics
```
The `metrics` dict returned inside OptimizedContext by the optimize method of HasteOptimizer and SemanticOptimizer. Token counts, compression ratios, and latency are calculated uniformly across all optimization methods.
| Field | Type | Description |
| --- | --- | --- |
| original_tokens | int | Token count before optimization. |
| optimized_tokens | int | Token count after optimization. |
| latency_ms | int | Processing time in milliseconds. |
| compression_ratio | float | Ratio of original to optimized tokens. |
| retrieval_mode | str | Method used (e.g., 'hybrid', 'bm25', 'semantic_search'). |
| ast_fidelty | float | The AST fidelity at which the code context was optimized; a metric indicating structural preservation. |
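As a quick check of how `compression_ratio` relates to the two token counts, here is a minimal stand-in where the field names mirror the table but the class itself is illustrative:

```python
from dataclasses import dataclass

# Hypothetical mirror of the OptimizerMetrics token fields, showing that
# compression_ratio is original_tokens divided by optimized_tokens.

@dataclass
class OptimizerMetricsExample:
    original_tokens: int
    optimized_tokens: int

    @property
    def compression_ratio(self) -> float:
        # "Ratio of original to optimized tokens"
        return self.original_tokens / self.optimized_tokens

m = OptimizerMetricsExample(original_tokens=2400, optimized_tokens=800)
print(m.compression_ratio)  # 3.0: the context shrank to a third of its size
```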