Performance Optimization
Get the best speed and quality from your Thox.ai device.
Thox.ai is optimized out of the box, but you can fine-tune performance for your specific workflow. This guide covers hardware, network, and software optimizations to get the fastest responses with the best quality.
Expected Performance
With default settings and the thox-coder model, you should expect:
First token latency: 50-100ms
Tokens per second: 30-50
End-to-end completion: <200ms
Results vary based on model size, context length, and network conditions.
Model Selection
Use the right model for the task
Use thox-coder-fast for quick completions, thox-coder for balanced quality, and a larger model for complex generation.
Consider model quantization
Quantized models (Q4, Q5) are faster and use less memory with minimal quality loss for most tasks.
Pre-load your primary model
Keep your most-used model loaded to avoid cold-start latency. Use thox models switch only when needed.
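For example, swapping to the fast model and confirming what is loaded might look like this (both subcommands appear in this guide; check your firmware's help output for the exact argument syntax):

```shell
# Switch the loaded model to the fast variant for quick completions.
# Switching unloads the current model, so expect a brief cold start.
thox models switch thox-coder-fast

# Confirm which model is now loaded and how much memory it uses.
thox models status
```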
Network Configuration
Use Ethernet for lowest latency
A wired connection adds only ~5ms of latency, versus 20-50ms for Wi-Fi. This makes Ethernet essential for real-time completions.
Optimize network path
Place the device on the same network segment as your development machine. Avoid routing through VPNs.
Use local DNS
Configure your router to resolve thox.local locally, or use the IP address directly in IDE settings.
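If your router can't resolve thox.local, a static hosts-file entry on your development machine achieves the same thing. The IP address below is a placeholder; substitute your device's actual address:

```shell
# Append a static entry so thox.local resolves without a DNS round trip.
# 192.168.1.50 is a placeholder; use your device's actual IP address.
echo "192.168.1.50  thox.local" | sudo tee -a /etc/hosts
```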
Thermal Management
Ensure proper ventilation
Leave at least 2 inches (5 cm) of clearance on all sides. Don't stack or enclose the device. Place it on a hard, flat surface.
Monitor thermal status
Run thox thermal status to check current temperatures. Throttling begins at a sustained 80°C.
Consider ambient temperature
Best performance at 0-35°C (32-95°F). In warm environments, a small fan can help.
Context Optimization
Minimize context size
Close unnecessary files in your IDE. Smaller context = faster processing.
Use .thoxignore
Exclude build directories, node_modules, and large files from indexing.
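A starting-point .thoxignore might look like the following. This sketch assumes gitignore-style patterns; check your device's documentation for the exact syntax it supports:

```
# Build output and dependencies
node_modules/
dist/
build/
target/

# Large binary assets
*.zip
*.tar.gz
*.mp4
```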
Target specific files
Use @filename references in chat instead of project-wide context when possible.
Useful Commands
thox status: View overall system status and resource usage
thox thermal status: Check current temperatures and throttle state
thox models status: See the loaded model and memory usage
thox benchmark: Run a performance benchmark
thox cache clear: Clear the inference cache to free memory
thox service restart: Restart the inference service
Advanced Tuning
Adjust Thread Count
By default, the device uses all available cores. Reduce threads if you need to reserve CPU for other tasks:
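As an illustration, a thread-count override might look like this. The "config set" subcommand and the inference.threads key are assumptions (the exact setting name varies by firmware); only thox service restart is taken from the command list in this guide:

```shell
# Hypothetical example: cap inference at 8 threads, reserving the rest
# for other workloads. The subcommand and key name are assumptions;
# consult your firmware documentation for the actual setting.
thox config set inference.threads 8

# Restart the inference service so the change takes effect.
thox service restart
```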
Adjust Context Length
Reduce context length for faster processing if you don't need full context:
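A context-length override might look like the sketch below. As above, the "config set" subcommand and the inference.context_length key are assumptions about the configuration interface, not documented settings:

```shell
# Hypothetical example: limit the context window to 4096 tokens.
# Smaller contexts process faster at the cost of less surrounding code.
thox config set inference.context_length 4096

# Restart the inference service so the change takes effect.
thox service restart
```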
Enable Flash Attention
Faster attention mechanism for compatible models (enabled by default):
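Since flash attention is on by default, an explicit toggle is only needed to re-enable it or to turn it off for an incompatible model. The key name below is an assumption about the configuration interface:

```shell
# Hypothetical example: explicitly enable flash attention.
# Set to false if a model misbehaves with it. Key name is an assumption.
thox config set inference.flash_attention true

# Restart the inference service so the change takes effect.
thox service restart
```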
Benchmarking Your Device
Run the built-in benchmark to measure your device's performance:
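Using the commands and log path given in this guide:

```shell
# Run the built-in performance benchmark.
thox benchmark

# Review the saved results afterward.
cat /var/log/thox/benchmark.log
```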
This tests inference speed, memory bandwidth, and network latency. Results are compared to expected baselines and saved to /var/log/thox/benchmark.log.