Alibaba’s now-famous Qwen AI development team has done it again: just over a day ago it released the Qwen3.5 Medium model series, four new large language models (LLMs) with support for calling agentic tools. Three of the four are available for commercial use by enterprises and indie developers under the standard open source Apache 2.0 license:
Qwen3.5-35B-A3B
Qwen3.5-122B-A10B
Qwen3.5-27B
Developers can download them now from Hugging Face and ModelScope. A fourth model, Qwen3.5-Flash, appears to be proprietary and is only available through the Alibaba Cloud Model Studio API, but it still offers a strong cost advantage over rival Western models (see the price comparison table below).
But the big story with the open source models is that they match similarly sized proprietary models from major US labs such as OpenAI and Anthropic on third-party benchmarks, and actually beat OpenAI’s GPT-5 mini and Anthropic’s Claude Sonnet 4.5, the latter released just five months ago.
The Qwen team also says it designed these models to remain highly accurate even when “quantized,” a compression step that shrinks a model’s footprint by storing its weights with far fewer distinct values.
Crucially, this release brings frontier-level context windows to the desktop PC. The flagship Qwen3.5-35B-A3B can exceed a context length of 1 million tokens on consumer-grade GPUs with 32 GB of VRAM. Not everyone has that hardware, but it demands far less computing power than most similarly performing options.
This leap is made possible by near-lossless accuracy under 4-bit weight and KV-cache quantization, allowing developers to process massive data sets without server-grade infrastructure.
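A back-of-envelope calculation shows why 4-bit weights and a quantized KV cache can fit a 35B-parameter model with a million-token context into 32 GB of VRAM. The parameter count and bit widths come from the article; the layer and head dimensions below are illustrative assumptions, not published Qwen3.5 specs (in a hybrid linear-attention design, only the minority of full-attention layers would keep a growing KV cache, which is what the small layer count sketches):

```python
# Rough memory estimate for a 35B-parameter model with 4-bit weights
# and a 4-bit KV cache. Parameter count and bit widths are from the
# article; n_layers/n_kv_heads/head_dim are illustrative assumptions.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """GiB needed to hold the model weights."""
    return n_params * bits_per_weight / 8 / 1024**3

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bits: int) -> float:
    """GiB for the KV cache: two tensors (K and V) per layer, per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits / 8
    return bytes_per_token * context_len / 1024**3

weights = weight_memory_gb(35e9, bits_per_weight=4)
# Assume only ~12 full-attention layers cache KV in the hybrid stack:
cache = kv_cache_gb(n_layers=12, n_kv_heads=4, head_dim=128,
                    context_len=1_000_000, bits=4)

print(f"weights: {weights:.1f} GiB, KV cache: {cache:.1f} GiB, "
      f"total: {weights + cache:.1f} GiB")  # comfortably under 32 GiB
```

At 16-bit weights the same model would need roughly four times the weight memory, which is the whole argument for near-lossless quantization.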
The technology: Gated Delta power
At the heart of Qwen3.5’s performance is an advanced hybrid architecture. While many models rely solely on standard Transformer blocks, Qwen3.5 integrates Gated Delta Networks in conjunction with a sparse Mixture-of-Experts (MoE) system. The technical specifications for Qwen3.5-35B-A3B reveal a highly efficient design:
Parameter efficiency: Although the model contains 35 billion parameters in total, only 3 billion are activated for any given token.
Expert diversity: The MoE layer uses 256 experts, with 8 routed experts and 1 shared expert active per token, which helps maintain performance while reducing inference latency.
Near-lossless quantization: The series maintains high accuracy even when compressed to 4-bit weights, significantly reducing the memory footprint for local deployment.
Base model release: To support the research community, Alibaba has released the Qwen3.5-35B-A3B base model alongside the instruction-tuned versions.
Product: Intelligence that ‘thinks’ first
Qwen3.5 introduces a native “Thinking Mode” as the default state. Before the model provides a final answer, it generates an internal chain of reasoning, bounded by dedicated delimiter tokens so it can be separated from the response. The series spans three deployment tiers:
Qwen3.5-27B: Optimized for high efficiency and supports a context length of more than 800,000 tokens.
Qwen3.5-Flash: The production-grade hosted version, with a default context length of 1 million tokens and built-in first-party tools.
Qwen3.5-122B-A10B: Designed for server-grade GPUs (80 GB VRAM), this model supports context lengths over 1 million tokens while closing the gap with the world’s largest frontier models.
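In practice, applications need to strip the reasoning trace before showing the answer to users. The sketch below assumes the `<think>...</think>` delimiter convention used by earlier Qwen thinking-mode releases; check the Qwen3.5 model card for the actual format:

```python
# Separate a thinking-mode model's internal reasoning from its final
# answer. The <think>...</think> delimiter is an assumption borrowed
# from earlier Qwen releases, not a confirmed Qwen3.5 spec.
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' if no think block."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        return "", raw.strip()
    return m.group(1).strip(), raw[m.end():].strip()

raw = ("<think>2 GPUs x 16 GB each = 32 GB total.</think>"
       "You have 32 GB of VRAM.")
reasoning, answer = split_thinking(raw)
print(answer)  # -> You have 32 GB of VRAM.
```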
Benchmark results validate this architectural shift. The 35B-A3B model notably surpasses much larger predecessors such as Qwen3-235B, as well as the aforementioned proprietary GPT-5 mini and Sonnet 4.5 in categories such as knowledge (MMMLU) and visual reasoning (MMMU-Pro).
Pricing and API integration
For those who don’t host their own weights, Alibaba Cloud Model Studio offers a competitive API for Qwen3.5-Flash.
Input: $0.10 per 1 million tokens
Output: $0.40 per 1 million tokens
Cache write: $0.125 per 1 million tokens
Cache read: $0.01 per 1 million tokens
The API also features a detailed Tool Calling pricing model, with Web Search costing $10 per 1,000 calls and Code Interpreter currently offered for free for a limited time.
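To see how those line items combine, here is a minimal cost calculator using the Qwen3.5-Flash rates quoted above. The rates are copied from the article; the function name and structure are illustrative, and a real integration should read current prices from Alibaba Cloud Model Studio:

```python
# Minimal per-request cost calculator for the Qwen3.5-Flash rates
# quoted in the article (USD per 1M tokens). Illustrative only.
RATES = {
    "input": 0.10,
    "output": 0.40,
    "cache_write": 0.125,
    "cache_read": 0.01,
}
WEB_SEARCH_PER_CALL = 10.0 / 1000  # $10 per 1,000 calls

def request_cost(input_toks: int, output_toks: int,
                 cached_toks: int = 0, web_searches: int = 0) -> float:
    """Cost in USD; cached_toks are billed at the cache-read rate
    instead of the full input rate."""
    fresh = input_toks - cached_toks
    return (fresh * RATES["input"] / 1e6
            + cached_toks * RATES["cache_read"] / 1e6
            + output_toks * RATES["output"] / 1e6
            + web_searches * WEB_SEARCH_PER_CALL)

# A 1M-token prompt with 900K cache hits, 5K output, one web search:
cost = request_cost(1_000_000, 5_000, cached_toks=900_000, web_searches=1)
print(f"${cost:.4f}")  # -> $0.0310
```

Note how heavily the cache-read discount (40x cheaper than fresh input) dominates the economics of long-context, repeated-prompt workloads.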
This makes Qwen3.5-Flash one of the most affordable APIs of all the major LLMs in the world. See a table comparing them below:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Total | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5 Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| DeepSeek chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (23-01-2026) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |
What it means for technical leaders and enterprise decision makers
With the launch of the Qwen3.5 Medium Models, the rapid iteration and refinement once reserved for well-funded labs is now accessible to on-premise development at many non-tech companies, effectively decoupling advanced AI from massive capital expenditure.
Across the organization, this architecture transforms the way data is processed and secured. The ability to locally ingest massive document repositories or hour-scale videos enables deep institutional analysis without the privacy risks of third-party APIs.
By running these specialized ‘Mixture-of-Experts’ models within a private firewall, organizations can maintain sovereign control over their data, while leveraging native ‘thinking’ modes and official tool-calling capabilities to build more reliable, autonomous agents.
Early adopters on Hugging Face have specifically praised the model’s ability to “narrow the gap” in agentic scenarios where previously only the largest closed models could compete.
This shift toward architectural efficiency over raw scale ensures that AI integration remains cost-conscious, secure, and flexible enough to keep pace with changing operational needs.