Features & Usage
Learn how to make the most of your Thox.ai device's capabilities.
Popular Guides
AI-powered code completion
Get intelligent suggestions as you type.
How it works
Thox.ai analyzes your code context in real-time to provide relevant completions. It considers your current file, open files, and project structure to suggest accurate code.
Triggering completions
Completions appear automatically as you type. Press Tab to accept, Escape to dismiss. In VS Code, you can also use Ctrl+Space to manually trigger suggestions.
Multi-line completions
For longer suggestions, Thox.ai can complete entire functions or code blocks. These appear with a preview showing what will be inserted.
Language support
Best results with Python, JavaScript, TypeScript, Go, Rust, Java, and C++. Other languages are supported but may have reduced accuracy.
Customization
Adjust completion behavior in settings: delay before suggestions, maximum suggestion length, and languages to enable/disable.
Choosing the right model
Select the optimal model for your use case.
thox-coder (7B)
Optimized for code completion and generation. 7B parameters, balanced speed and quality. Best for most development workflows. Runs on Ollama backend (45-72 tok/s).
thox-coder-pro (14B)
Enhanced 14B model for complex development tasks. Automatically routes to TensorRT-LLM for 60-100% faster inference. Ideal for system design and complex refactoring.
thox-coder-max (32B)
Maximum capability 32B model for enterprise workloads. Uses TensorRT-LLM backend for production performance. Best for architecture design and security auditing.
Hybrid Inference
Thox.ai automatically routes requests to the optimal backend: Ollama for smaller models (7B) and TensorRT-LLM for larger models (14B+). This provides up to 100% performance improvement for large models.
Switching models
Change active model via web interface (/admin/models) or CLI: "thox models switch [name]". The smart router automatically selects the best backend for your model.
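For scripted workflows, the same CLI command can be driven from code. The sketch below wraps it with Python's subprocess module; the idea of switching models around a heavy task is illustrative, and only the thox models switch command itself comes from this guide.

```python
import subprocess

def switch_model(name: str) -> None:
    """Switch the active model using the documented CLI command."""
    # Equivalent to running: thox models switch <name>
    subprocess.run(["thox", "models", "switch", name], check=True)

# Illustrative example: move to the 32B model before a large refactoring
# session, then drop back to the 7B default afterwards.
switch_model("thox-coder-max")
# ... run the heavy task ...
switch_model("thox-coder")
```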
Interactive chat and Q&A
Ask questions and get explanations.
Accessing chat
Use the web interface at /chat or your IDE extension's chat panel. Send questions about code, ask for explanations, or request help with debugging.
Context-aware responses
The chat understands your codebase. Reference files with @filename and it will include them in context. Ask about specific functions or classes.
Code generation
Request new code: "Write a function that validates email addresses" and receive complete, ready-to-use code blocks.
Conversation history
Chat maintains context within a session. Follow up on previous responses without repeating context. Start a new session to reset.
System prompts
Customize behavior with system prompts in settings. Define coding style preferences, language preferences, or specialized instructions.
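Because the device exposes an OpenAI-compatible API (see API and integrations below), chat can also be scripted. The following sketch uses the official openai Python client with a system prompt and a follow-up turn; the hostname thox.local and the API key placeholder are assumptions for illustration, and real keys come from /admin/api-keys.

```python
from openai import OpenAI

# Point the standard OpenAI client at the device's /v1 endpoint.
# Hostname and API key are placeholders for your own setup.
client = OpenAI(base_url="http://thox.local/v1", api_key="your-api-key")

messages = [
    # A system prompt customizes behavior, e.g. coding style preferences.
    {"role": "system", "content": "You are a concise assistant. Prefer type-annotated Python."},
    {"role": "user", "content": "Write a function that validates email addresses."},
]

reply = client.chat.completions.create(model="thox-coder", messages=messages)
print(reply.choices[0].message.content)

# Conversation history: append the response and follow up without repeating context.
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "Now add unit tests for it."})
follow_up = client.chat.completions.create(model="thox-coder", messages=messages)
print(follow_up.choices[0].message.content)
```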
Context and project understanding
How Thox.ai understands your codebase.
Automatic indexing
On first connection, Thox.ai indexes your project structure. This enables smart completions that reference other files and understand project layout.
Context window
The model can process thousands of tokens of context. It automatically selects relevant code from open files, imports, and related files.
Project configuration
Add a .thoxignore file to exclude files from indexing (similar to .gitignore). Exclude build directories, node_modules, and large binary files.
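A minimal .thoxignore might look like the following; the syntax mirrors .gitignore, and the entries are just the common exclusions mentioned above.

```
# Build output and dependencies
build/
dist/
node_modules/

# Large binary files
*.bin
*.onnx
*.zip
```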
Re-indexing
Trigger a manual re-index after major project changes with "thox index refresh" or via the web interface at /admin/index.
API and integrations
Integrate Thox.ai with your tools.
OpenAI-compatible API
Thox.ai exposes an OpenAI-compatible API at /v1. Use existing OpenAI client libraries by pointing them to your Thox.ai device.
Endpoints
/v1/completions for text completion, /v1/chat/completions for chat, /v1/embeddings for vector embeddings. Full API reference at /docs/api-reference.
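The sketch below exercises two of these endpoints over plain HTTP, passing an API key in a Bearer Authorization header (see Authentication below). The device hostname, the model names used here, and the response fields follow the usual OpenAI conventions and are assumptions beyond what this page states.

```python
import requests

BASE = "http://thox.local/v1"                      # your device's address (placeholder)
HEADERS = {"Authorization": "Bearer your-api-key"}  # generate a key in /admin/api-keys

# Text completion via /v1/completions
resp = requests.post(
    f"{BASE}/completions",
    headers=HEADERS,
    json={"model": "thox-coder", "prompt": "def fibonacci(n):", "max_tokens": 64},
)
print(resp.json()["choices"][0]["text"])

# Vector embeddings via /v1/embeddings
# (model name is a placeholder; your device may expose a dedicated embedding model)
resp = requests.post(
    f"{BASE}/embeddings",
    headers=HEADERS,
    json={"model": "thox-coder", "input": "validate email addresses"},
)
print(len(resp.json()["data"][0]["embedding"]), "dimensions")
```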
Authentication
Generate API keys in /admin/api-keys. Pass via Authorization header: "Bearer your-api-key". Keys can have scopes and rate limits.
Rate limits
The defaults are 60 requests/minute and 100k tokens/hour. Adjust per-key limits in the admin interface. Requests from the local network can be exempted from limits.
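When a key does hit its limit, the usual OpenAI-style behavior is an HTTP 429 response; the retry loop below is a minimal sketch under that assumption.

```python
import time
import requests

def post_with_retry(url, headers, payload, max_retries=5):
    """POST to the API, backing off and retrying when rate-limited (HTTP 429)."""
    resp = None
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            return resp
        # Exponential backoff: 1s, 2s, 4s, ...
        time.sleep(2 ** attempt)
    return resp
```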
Webhooks
Configure webhooks in /admin/webhooks to receive notifications on completion events, errors, or model changes.
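The event payload format is not documented here, so the receiver below simply logs whatever JSON arrives; the host, port, and path are placeholders you would register as the webhook URL in /admin/webhooks.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and log the event body; the exact payload schema is device-defined.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        print("Received event:", event)
        self.send_response(200)
        self.end_headers()

# Register http://<this-host>:8080/ as the webhook URL in /admin/webhooks.
HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```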
Getting the best performance
Optimize speed and quality of responses.
Hybrid Architecture
Thox.ai uses a hybrid Ollama + TensorRT-LLM architecture. Smaller models (7B) use Ollama for simplicity, while larger models (14B+) automatically route to TensorRT-LLM for 60-100% faster inference.
TensorRT-LLM Benefits
TensorRT-LLM provides custom attention kernels, paged KV caching, and INT8/INT4 quantization. This delivers significantly higher tokens/second for production workloads with large models.
Use Ethernet
Wired connections provide the lowest latency. Wi-Fi adds 20-50ms per request. For real-time completions, Ethernet is strongly recommended.
Smart Routing
The smart router automatically selects the optimal backend based on model size, latency requirements, and backend availability. Check router status at /router/status.
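The response format of /router/status is not specified in this guide, so the check below just prints whatever the endpoint returns; the hostname, API key, and whether the endpoint requires authentication are assumptions.

```python
import requests

# Query the smart router's status endpoint on the device (hostname is a placeholder).
resp = requests.get(
    "http://thox.local/router/status",
    headers={"Authorization": "Bearer your-api-key"},
)
print(resp.json())  # e.g. which backend each model is currently routed to
```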
Thermal management
Keep the device cool for sustained performance. TensorRT-LLM is more GPU-intensive but also more efficient. Allow cool-down periods during intensive sessions.