Mistral Small 3 sets a new benchmark in the “small” Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models.
Mistral Small can be deployed locally and is exceptionally “knowledge-dense”, fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized. Perfect for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data (see the sketch below).
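As a minimal sketch of what local inference can look like once the model is pulled into a locally running Ollama server (the `mistral-small` tag, default port, and prompt here are assumptions for illustration, not an official quickstart):

```python
# Minimal local-inference sketch against a locally running Ollama server.
# Assumes the model has already been pulled (e.g. `ollama pull mistral-small`)
# and the server is listening on the default port 11434.
import requests

def chat(prompt: str, model: str = "mistral-small") -> str:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return a single JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(chat("Summarize the Apache 2.0 license in one sentence."))
```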
Key Features
- Multilingual: Supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.
- Agent-Centric: Offers best-in-class agentic capabilities with native function calling and JSON output (see the sketch after this list).
- Advanced Reasoning: State-of-the-art conversational and reasoning capabilities.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: 32k tokens.
- System Prompt: Maintains strong adherence and support for system prompts.
- Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size.
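The agentic features above can be exercised through Ollama's chat endpoint, which accepts a `tools` list and returns any tool calls the model decides to make. The sketch below is illustrative only: the model tag, tool schema, and toy `get_current_weather` helper are assumptions, not part of the model card.

```python
# Hedged sketch of native function calling via Ollama's /api/chat `tools`
# parameter. The tool schema and the toy weather helper are hypothetical.
import json
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"
MODEL = "mistral-small"  # assumed model tag

def get_current_weather(city: str) -> str:
    """Stand-in implementation; a real agent would query an actual weather API."""
    return json.dumps({"city": city, "temperature_c": 21, "condition": "clear"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]
reply = requests.post(OLLAMA_CHAT, json={
    "model": MODEL, "messages": messages, "tools": tools, "stream": False,
}, timeout=120).json()["message"]

tool_calls = reply.get("tool_calls") or []
if tool_calls:
    # Run each requested tool and feed the results back so the model can
    # compose its final natural-language answer.
    messages.append(reply)
    for call in tool_calls:
        result = get_current_weather(**call["function"]["arguments"])
        messages.append({"role": "tool", "content": result})
    reply = requests.post(OLLAMA_CHAT, json={
        "model": MODEL, "messages": messages, "stream": False,
    }, timeout=120).json()["message"]

print(reply["content"])
```

For plain structured output without tools, the same endpoint also accepts a `"format": "json"` field to constrain the response to valid JSON.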
Human Evaluations
We conducted side-by-side evaluations with an external third-party vendor on a set of over 1,000 proprietary coding and generalist prompts. Evaluators were asked to select their preferred response from anonymized generations produced by Mistral Small 3 versus another model. We are aware that human-judgement results can differ starkly from publicly available benchmarks, but we took extra care to verify a fair evaluation and are confident these results are valid.
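As a rough illustration of how such side-by-side preferences are typically tallied into win rates (the record schema and field names here are assumptions, not the vendor's actual pipeline):

```python
# Toy tally of pairwise human-preference judgements. Each record says which
# anonymized response the evaluator preferred; field names are hypothetical.
from collections import Counter

judgements = [
    {"prompt_id": 1, "preferred": "mistral-small-3"},
    {"prompt_id": 2, "preferred": "other-model"},
    {"prompt_id": 3, "preferred": "mistral-small-3"},
    {"prompt_id": 4, "preferred": "tie"},
]

counts = Counter(j["preferred"] for j in judgements)
total = len(judgements)
for outcome, wins in counts.items():
    print(f"{outcome}: {wins}/{total} = {wins / total:.1%}")
```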
Instruct Performance
Our instruction-tuned model performs competitively with open-weight models three times its size and with the proprietary GPT-4o mini model across Code, Math, General Knowledge and Instruction Following benchmarks.
Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline; as such, numbers may vary slightly from previously reported results (Qwen2.5-32B-Instruct, Llama-3.3-70B-Instruct, Gemma-2-27B-IT). Judge-based evals such as WildBench, Arena-Hard and MT-Bench used gpt-4o-2024-05-13 as the judge.
Customers are evaluating Mistral Small 3 across multiple industries, including:
- Financial services customers for fraud detection
- Healthcare providers for customer triaging
- Robotics, automotive, and manufacturing companies for on-device command and control
- Horizontal use cases across customers, including virtual customer service and sentiment and feedback analysis.