Version 0.1.4

ScaleDown is a context engineering platform that intelligently compresses AI prompts while preserving semantic integrity and reducing hallucinations. Our research-backed compression algorithms analyze prompt components—from reasoning chains to code contexts—and apply targeted optimization techniques that maintain output quality while dramatically reducing token consumption.

Installation

To use the new optimization modules, install ScaleDown with the required extras:
# Core compression only
pip install scaledown

# With HASTE (AST-based) and Semantic search
pip install "scaledown[haste,semantic]"

Main Classes

scaledown.compressor.ScaleDownCompressor

The compressor module contains several compressors; the default, ScaleDownCompressor, is the main entry point for compressing text. It manages API communication, batch processing, and compression settings.
This class inherits from BaseCompressor and handles both single-string and list-based inputs automatically, maintaining backward compatibility for users focused purely on context compression.
class scaledown.compressor.ScaleDownCompressor(target_model: str = 'gpt-4o',
                                               rate: Union[float, str] = 'auto',
                                               api_key: Optional[str] = None,
                                               temperature: Optional[float] = None,
                                               preserve_keywords: bool = False,
                                               preserve_words: Optional[List[str]] = None)
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `target_model` | `str` | `'gpt-4o'` | The target LLM you plan to use downstream. ScaleDown optimizes specifically for this model's tokenizer and attention biases. Supported: `'gpt-4o'`, `'gpt-4o-mini'`, `'gemini-2.5-flash'`, etc. |
| `rate` | `float \| 'auto'` | `'auto'` | The aggressiveness of compression. `'auto'`: ScaleDown determines the optimal rate based on redundancy (recommended). `float`: a target retention rate (e.g., `0.4` keeps ~40% of tokens). |
| `api_key` | `str` | `None` | Your ScaleDown API key. If `None`, looks for the `SCALEDOWN_API_KEY` environment variable. |
| `temperature` | `float` | `None` | Controls compression randomness. Higher values introduce more variation in token selection. |
| `preserve_keywords` | `bool` | `False` | If `True`, forces the preservation of detected domain-specific keywords. |
| `preserve_words` | `List[str]` | `None` | A list of specific words or phrases that must never be removed during compression. |
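A fixed `rate` is a retention target, not a removal target. A quick sanity check of the arithmetic (illustrative only; actual counts depend on the target model's tokenizer and the server's optimization):

```python
def expected_tokens_kept(original_tokens: int, rate: float) -> int:
    # rate is the fraction of tokens retained, e.g. 0.4 keeps ~40%
    return round(original_tokens * rate)

print(expected_tokens_kept(2000, 0.4))  # 800
```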
Methods

compress

Compresses the given context and prompt.
def compress(context: Union[str, List[str]],
            prompt: Union[str, List[str]],
            max_tokens: int = None,
            **kwargs ) -> Union[CompressedPrompt, List[CompressedPrompt]]
| Parameter | Type | Description |
| --- | --- | --- |
| `context` | `str \| List[str]` | The background information (documents, code, history) to compress. |
| `prompt` | `str \| List[str]` | The user query or instruction. This is usually not compressed but is used to guide the compression of the context. |
| `max_tokens` | `int` | Optional strict limit on the output token count. |
| `**kwargs` | `dict` | Additional parameters passed directly to the API payload. |
Returns
  • CompressedPrompt: If inputs are strings.
  • List[CompressedPrompt]: If inputs are lists (supports batch processing).
from scaledown.compressor import ScaleDownCompressor

compressor = ScaleDownCompressor(target_model="gpt-4o", rate=0.5)

# Single input
result = compressor.compress(
    context="Long legal document text...",
    prompt="What is the liability clause?"
)
print(result.content)
print(result.savings_percent)

# Batch input
results = compressor.compress(
    context=["Doc A text", "Doc B text"],
    prompt=["Analyze A", "Analyze B"]
)

scaledown.optimizer.HasteOptimizer

Requires: pip install "scaledown[haste]"

The HASTE (Hybrid AST-Enhanced) optimizer uses Tree-sitter parsing to understand code structure. It performs a hybrid search using BM25 and AST traversal (BFS) to identify relevant function and class definitions, ensuring that code dependencies are included in the context.
class scaledown.optimizer.HasteOptimizer(top_k: int = 6,
                                         prefilter: int = 300,
                                         bfs_depth: int = 1,
                                         max_add: int = 12,
                                         semantic: bool = False,
                                         sem_model: str = 'text-embedding-3-small',
                                         hard_cap: int = 1200,
                                         soft_cap: int = 1800,
                                         target_model: str = 'gpt-4o',
                                         **kwargs)
                                    

Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `top_k` | `int` | `6` | Number of top functions/classes to retrieve initially. |
| `prefilter` | `int` | `300` | Size of the candidate pool before reranking. |
| `bfs_depth` | `int` | `1` | Depth of the breadth-first search over the call graph to find dependencies. |
| `max_add` | `int` | `12` | Maximum nodes added during BFS expansion. |
| `semantic` | `bool` | `False` | If `True`, enables semantic reranking using OpenAI embeddings. |
| `sem_model` | `str` | `'text-embedding-3-small'` | OpenAI embedding model for semantic search. |
| `hard_cap` | `int` | `1200` | Strict token limit for the optimized output. |
| `soft_cap` | `int` | `1800` | Soft token limit for the optimized output. |
| `target_model` | `str` | `'gpt-4o'` | Model used for token counting calculations. |
Methods

optimize

Extracts the code sections relevant to the query from the file structure. This is the main entry point for optimizing code context with HASTE.
def optimize(context: Union[str, List[str]],
             query: Optional[str] = None,
             max_tokens: Optional[int] = None,
             file_path: Optional[str] = None,
             **kwargs) -> Union[OptimizedContext, List[OptimizedContext]]

| Parameter | Type | Description |
| --- | --- | --- |
| `context` | `str \| List[str]` | Source code content. If `file_path` is not provided, `context` is written to a temporary file for processing. |
| `query` | `str` | Query to guide context retrieval (e.g., "find training loop"). Required for HASTE. |
| `max_tokens` | `int` | Maximum token budget (uses `hard_cap` if not specified). |
| `file_path` | `str` | Path to the source file (required for AST parsing). |
| `**kwargs` | `dict` | Additional HASTE parameters passed dynamically. |

from scaledown.optimizer import HasteOptimizer

optimizer = HasteOptimizer(top_k=5, bfs_depth=2)

# Option 1: Using a file path
optimized_ctx = optimizer.optimize(
    context="",  # Content loaded from file
    query="Where is the authentication logic?",
    file_path="src/auth_service.py"
)

# Option 2: Using raw string (creates temp file)
code_snippet = "def login(): pass\ndef logout(): pass..."
optimized_ctx = optimizer.optimize(
    context=code_snippet,
    query="login function"
)

print(optimized_ctx.content)

scaledown.optimizer.SemanticOptimizer

Requires: pip install "scaledown[semantic]"

Uses local embeddings (Sentence Transformers) and FAISS to perform purely semantic search over code chunks or document segments. Best used for large codebases where keyword matching is insufficient.
class scaledown.optimizer.SemanticOptimizer(model_name: str = "Qwen/Qwen3-Embedding-0.6B",
                                            top_k: int = 3,
                                            target_model: str = "gpt-4o")

Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `'Qwen/Qwen3-Embedding-0.6B'` | The Hugging Face embedding model to load locally. |
| `top_k` | `int` | `3` | Number of top code chunks to retrieve. |
| `target_model` | `str` | `'gpt-4o'` | Target model for code context optimization. |
Methods

optimize

Finds semantically similar code or text segments; the main entry point for optimizing code context with the semantic optimizer.
def optimize(context: str,
             query: str,
             file_path: Optional[str] = None,
             max_tokens: Optional[int] = None,
             **kwargs) -> OptimizedContext

| Parameter | Type | Description |
| --- | --- | --- |
| `context` | `str` | The text or code content to search within. |
| `query` | `str` | The search query used to find relevant sections. Defaults to "main logic" if not provided. |
| `file_path` | `str` | Path to the source file (required for extracting semantic units). |
| `max_tokens` | `int` | Optional token limit (currently unused in the logic but accepted). |
from scaledown.optimizer import SemanticOptimizer

optimizer = SemanticOptimizer(model_name="sentence-transformers/all-MiniLM-L6-v2")

result = optimizer.optimize(
    context=my_large_code_string,
    file_path="dummy_name.py", # Required for unit extraction
    query="database connection pool"
)

print(f"Retrieved {result.metrics.chunks_retrieved} chunks")

scaledown.pipeline.Pipeline

The Pipeline class orchestrates multiple optimization and compression steps into a single workflow. It allows you to chain a HasteOptimizer, SemanticOptimizer, and ScaleDownCompressor sequentially.
class scaledown.pipeline.Pipeline(steps: List[Tuple[str, Union[BaseOptimizer, BaseCompressor]]])

Methods

run

Executes the pipeline on the input data.
def run(query: str,
        file_path: str,
        prompt: str,
        context: str = "",
        **kwargs) -> PipelineResult

| Parameter | Type | Description |
| --- | --- | --- |
| `query` | `str` | Retrieval query passed to optimizer steps. |
| `file_path` | `str` | Path to the source file, passed to steps that need it for AST parsing. |
| `prompt` | `str` | The user prompt passed to compressor steps. |
| `context` | `str` | The initial input text or code to process. |
| `**kwargs` | `dict` | Additional arguments passed to each step. |
make_pipeline

Helper function to create a Pipeline from (name, step) tuples.
def make_pipeline(*steps) -> Pipeline
from scaledown.pipeline import Pipeline, make_pipeline
from scaledown.optimizer import HasteOptimizer
from scaledown.compressor import ScaleDownCompressor

# Initialize Pipeline
pipe = make_pipeline(
    ('haste', HasteOptimizer(top_k=5)),
    ('compressor', ScaleDownCompressor(rate=0.4))
)

# Run
result = pipe.run(
    context=source_code, # Used by Compressor and Haste (if file_path not given)
    query="Find the payment processing logic",  # Used by Haste
    prompt="Explain the payment logic",         # Used by Compressor
    file_path="payment.py"                      # Used by Haste
)

# Retrieving step data
print(f"Final Savings: {result.savings_percent}%")
for step in result.history:
    print(f"Step {step.step_name}: {step.input_tokens} -> {step.output_tokens} tokens")

Data Structures

scaledown.types.CompressedPrompt

A Pydantic object containing the compressed text and validated metadata. It is the output type of compressor objects.
class scaledown.types.CompressedPrompt(content: str,
                                       original_prompt: str,
                                       tokens: Tuple[int, int],
                                       latency: float,
                                       target_model: str)

Attributes
| Attribute | Type | Description |
| --- | --- | --- |
| `content` | `str` | The compressed text string. |
| `metrics` | `CompressionMetrics` | Structured metrics object containing token counts and latency. |
| `tokens` | `Tuple[int, int]` | A tuple of `(original_count, compressed_count)`. |
| `savings_percent` | `float` | The percentage of tokens removed (e.g., `60.0` for a 60% reduction). |
| `compression_ratio` | `float` | The ratio of original size to compressed size (e.g., `2.5`x). |
| `latency_ms` | `int` | Server-side processing time in milliseconds. |
Methods

print_stats()

Prints a formatted summary of compression performance to stdout. Example output:
ScaleDown Stats:
  • Tokens: 1000 -> 400
  • Savings: 60.0%
  • Ratio: 2.5x
  • Latency: 150ms

from_api_response(cls, content: str, raw_response: Dict[str, Any])

Factory method to create an instance from a raw API response dict.
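The derived attributes follow directly from the `tokens` tuple. A minimal sketch of the arithmetic (not the library's actual implementation):

```python
def savings_percent(original: int, compressed: int) -> float:
    # Percentage of tokens removed by compression
    return round((1 - compressed / original) * 100, 1)

def compression_ratio(original: int, compressed: int) -> float:
    # How many times smaller the compressed prompt is
    return round(original / compressed, 2)

# Matches the print_stats() example above: 1000 -> 400 tokens
print(savings_percent(1000, 400))    # 60.0
print(compression_ratio(1000, 400))  # 2.5
```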

scaledown.types.OptimizedContext

The output returned by HasteOptimizer and SemanticOptimizer.
class scaledown.types.OptimizedContext

| Attribute | Type | Description |
| --- | --- | --- |
| `content` | `str` | The optimized/selected code or text. |
| `metrics` | `dict` | Metrics regarding the optimization (e.g., `compression_ratio`); fields match `OptimizerMetrics`. |

scaledown.types.PipelineResult

The final output returned by Pipeline.run().
class scaledown.types.PipelineResult

| Attribute | Type | Description |
| --- | --- | --- |
| `final_content` | `str` | The fully processed text (after optimization and compression). |
| `savings_percent` | `float` | Total percentage of tokens removed across all steps. |
| `history` | `List[StepMetrics]` | A breakdown of token usage at each stage of the pipeline. |
| `metrics` | `CompressionMetrics` | Aggregated metrics for the final output. |

scaledown.metrics.CompressionMetrics

Pydantic model that validates the raw metrics returned by the API. These metrics are attached to the CompressedPrompt returned by ScaleDownCompressor.compress().
class scaledown.metrics.CompressionMetrics
| Field | Type | Description |
| --- | --- | --- |
| `original_prompt_tokens` | `int` | Token count before compression. Validated to be non-negative. |
| `compressed_prompt_tokens` | `int` | Token count after compression. Validated to be non-negative. |
| `latency_ms` | `int` | Processing time in milliseconds. |
| `timestamp` | `datetime` | Time when the compression request was processed. |
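The non-negativity validation can be pictured with a plain dataclass sketch (the real class is a Pydantic model; `CompressionMetricsSketch` here is a hypothetical stand-in):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CompressionMetricsSketch:
    original_prompt_tokens: int
    compressed_prompt_tokens: int
    latency_ms: int
    timestamp: datetime

    def __post_init__(self):
        # Mirrors the validation described above: counts must be non-negative
        if self.original_prompt_tokens < 0 or self.compressed_prompt_tokens < 0:
            raise ValueError("token counts must be non-negative")

m = CompressionMetricsSketch(1000, 400, 150, datetime.now())
```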
class scaledown.metrics.OptimizerMetrics
The metrics dict returned inside OptimizedContext by the optimize methods of HasteOptimizer and SemanticOptimizer. Token counts, compression ratios, and latency are calculated uniformly across all optimization methods.
| Field | Type | Description |
| --- | --- | --- |
| `original_tokens` | `int` | Token count before optimization. |
| `optimized_tokens` | `int` | Token count after optimization. |
| `latency_ms` | `int` | Processing time in milliseconds. |
| `compression_ratio` | `float` | Ratio of original to optimized tokens. |
| `retrieval_mode` | `str` | Method used (e.g., `'hybrid'`, `'bm25'`, `'semantic_search'`). |
| `ast_fidelity` | `float` | The AST fidelity of the optimized code context; indicates structural preservation. |
| `chunks_retrieved` | `int` | Number of code chunks/functions selected. |

Configuration & Exceptions

Configuration

ScaleDown uses a global configuration system for API keys and endpoints.
import scaledown
# Set API key globally
scaledown.set_api_key("your-api-key")

# Get current key
key = scaledown.get_api_key()
  • Environment Variables:
    • SCALEDOWN_API_KEY: Automatically loaded if not set in code.
    • SCALEDOWN_API_URL: Override the default API endpoint (Default: https://api.scaledown.xyz).
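The same configuration can be supplied via the environment instead of code, for example in a shell profile (the key value below is a placeholder; the endpoint shown is the documented default):

```shell
# Loaded automatically if set_api_key() is never called
export SCALEDOWN_API_KEY="your-api-key"

# Optional: override the default API endpoint
export SCALEDOWN_API_URL="https://api.scaledown.xyz"
```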

Exceptions

All custom exceptions inherit from ScaleDownError.
from scaledown.exceptions import (ScaleDownError,
                                  AuthenticationError,
                                  APIError,
                                  OptimizationError)
| Exception | Description |
| --- | --- |
| `ScaleDownError` | Base class for all package errors. |
| `OptimizationError` | Raised when a local optimizer fails (e.g., missing dependencies like Tree-sitter). |
| `AuthenticationError` | Raised when the API key is missing, invalid, or expired. |
| `APIError` | Raised when the server returns a non-200 response (e.g., rate limits, server errors). |
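Because every package error derives from ScaleDownError, a single except clause can act as a catch-all. A self-contained sketch of the hierarchy (these class bodies are illustrative stand-ins, not the library's source):

```python
class ScaleDownError(Exception):
    """Base class for all package errors (sketch)."""

class AuthenticationError(ScaleDownError):
    """Missing, invalid, or expired API key."""

class APIError(ScaleDownError):
    """Non-200 response from the server."""

def classify(exc: Exception) -> str:
    # A single `except ScaleDownError` catches every subclass
    try:
        raise exc
    except ScaleDownError as e:
        return f"scaledown failure: {type(e).__name__}"
    except Exception:
        return "unrelated failure"

print(classify(AuthenticationError("bad key")))  # scaledown failure: AuthenticationError
print(classify(ValueError("oops")))              # unrelated failure
```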