Alibaba’s New Open Source Qwen3.5-Medium Models Deliver Sonnet 4.5 Performance on Local Machines

Alibaba’s now famous Qwen AI development team has done it again: just over a day ago it released the Qwen3.5 Medium model series, four new large language models (LLMs) with support for agentic tool calling. Three of the four are available for commercial use by enterprises and indie developers under the standard open source Apache 2.0 license:

  • Qwen3.5-35B-A3B

  • Qwen3.5-122B-A10B

  • Qwen3.5-27B

Developers can download them now from Hugging Face and ModelScope. A fourth model, Qwen3.5-Flash, appears to be proprietary and is only available through the Alibaba Cloud Model Studio API, but it still offers a strong cost advantage over other Western models (see the price comparison table below).

But the big difference is that the open source models deliver performance on third-party benchmark tests comparable to similarly sized proprietary models from major US labs like OpenAI and Anthropic, and actually beat OpenAI’s GPT-5 mini and Anthropic’s Claude Sonnet 4.5, the latter released just five months ago.

And the Qwen team says it has designed these models to remain highly accurate even when ‘quantized’, a process that shrinks their footprint by storing the model’s parameters at lower numerical precision, using far fewer distinct values per weight.

Crucially, this release brings frontier-level context windows to the desktop PC. The flagship Qwen3.5-35B-A3B can exceed a context length of 1 million tokens on consumer-grade GPUs with 32 GB of VRAM. While not everyone has access to such a card, this is far less computing power than many similarly performing options require.

This leap is made possible by virtually lossless accuracy under 4-bit weight and KV cache quantization, allowing developers to process massive data sets without server-grade infrastructure.
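The idea behind the weight quantization described above can be illustrated with a minimal group-wise 4-bit scheme. This is a simplified sketch, not Qwen’s actual quantization pipeline; in production the 4-bit integers would also be packed two per byte:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 32):
    """Group-wise 4-bit quantization: map each group of float weights
    to integers in [0, 15] plus a per-group scale and minimum."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0           # 4 bits -> 16 levels
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(1024,)).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min).reshape(-1)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The per-group scale keeps the rounding error bounded by half a quantization step, which is why well-designed 4-bit schemes can stay near-lossless in practice.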

Technology: Gated Delta Networks

At the heart of Qwen 3.5’s performance is an advanced hybrid architecture. While many models rely solely on standard Transformer blocks, Qwen 3.5 integrates Gated Delta Networks in conjunction with a sparse Mixture-of-Experts (MoE) system. The technical specifications for the Qwen3.5-35B-A3B reveal a highly efficient design:

  • Parameter efficiency: Although the model contains 35 billion total parameters, only 3 billion are activated for any given token.

  • Expert diversity: The MoE layer uses 256 experts, with 8 routed experts and 1 shared expert active per token, which helps maintain performance while reducing inference latency.

  • Near-lossless quantization: The series maintains high accuracy even when compressed to 4-bit weights, significantly reducing the memory footprint for local implementation.

  • Base model release: To support the research community, Alibaba has released the Qwen3.5-35B-A3B base model in addition to the instruction-tuned versions.
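The sparse activation described in the spec above can be sketched as top-k expert routing. The layer sizes below mirror the published figures (256 experts, 8 routed plus 1 shared), but the tiny dense “experts” and random router weights are placeholders for illustration, not Qwen’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 256, 8

# Router: one logit per routed expert for each token.
router_w = rng.normal(scale=0.02, size=(d_model, n_experts))
# Placeholder expert "MLPs" (single matrices) and one always-on shared expert.
experts = rng.normal(scale=0.02, size=(n_experts, d_model, d_model))
shared = rng.normal(scale=0.02, size=(d_model, d_model))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) for one token. Only top_k routed experts run."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]     # indices of the 8 routed experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    out = x @ shared                      # shared expert always runs
    for g, e in zip(gates, top):
        out += g * (x @ experts[e])
    return out

x = rng.normal(size=(d_model,))
y = moe_forward(x)
print(f"experts run per token: {top_k + 1} of {n_experts + 1}")
```

Because only 9 of 257 expert paths execute per token, compute per token scales with the 3B active parameters rather than the full 35B.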

Product: Intelligence that ‘thinks’ first

Qwen 3.5 introduces a native “Thinking Mode” as the default state. Before providing a final answer, the model generates an internal chain of reasoning, delimited by special tags, to work through complex logic. The product range is tailored to different hardware environments:

  • Qwen3.5-27B: Optimized for high efficiency and supports a context length of more than 800,000 tokens.

  • Qwen3.5-Flash: The production-quality hosted version, with a default context length of 1 million tokens and built-in official tools.

  • Qwen3.5-122B-A10B: Designed for server-grade GPUs (80 GB VRAM), this model supports context lengths over 1 million while closing the gap with the world’s largest frontier models.
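With Thinking Mode on by default, client code typically needs to strip the reasoning before showing the final answer. A minimal sketch, assuming the `<think>...</think>` delimiters used by earlier Qwen releases (verify against the actual Qwen3.5 chat template before relying on this):

```python
import re

def split_reasoning(text: str):
    """Separate the chain-of-thought from the final answer, assuming the
    reasoning is wrapped in <think>...</think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()       # no reasoning block found
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

# Hypothetical raw model output for illustration.
raw = "<think>2 apples + 3 apples = 5 apples.</think>There are 5 apples."
thought, answer = split_reasoning(raw)
print(answer)  # There are 5 apples.
```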

Benchmark results validate this architectural shift. The 35B-A3B model notably surpasses much larger predecessors such as Qwen3-235B, as well as the aforementioned proprietary GPT-5 mini and Sonnet 4.5 in categories such as knowledge (MMMLU) and visual reasoning (MMMU-Pro).

Pricing and API integration

For those who don’t host their own weights, Alibaba Cloud Model Studio offers a competitive API for Qwen3.5-Flash.

  • Input: $0.10 per 1 million tokens

  • Output: $0.40 per 1 million tokens

  • Cache write: $0.125 per 1 million tokens

  • Cache read: $0.01 per 1 million tokens

The API also features a detailed Tool Calling pricing model, with Web Search costing $10 per 1,000 calls and Code Interpreter currently offered for free for a limited time.

This makes Qwen3.5-Flash one of the most affordable APIs of all the major LLMs in the world. See a table comparing them below:

| Model | Input ($/1M) | Output ($/1M) | Total | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| DeepSeek chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (23-01-2026) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

What it means for technical leaders and enterprise decision makers

With the launch of the Qwen3.5 Medium Models, the rapid iteration and refinement once reserved for well-funded labs is now accessible to on-premise development at many non-tech companies, effectively decoupling advanced AI from massive capital expenditure.

Across the organization, this architecture transforms the way data is processed and secured. The ability to locally ingest massive document repositories or hour-scale videos enables deep institutional analysis without the privacy risks of third-party APIs.

By running these specialized ‘Mixture-of-Experts’ models within a private firewall, organizations can maintain sovereign control over their data, while leveraging native ‘thinking’ modes and official tool-calling capabilities to build more reliable, autonomous agents.

Early adopters on Hugging Face have specifically praised the model’s ability to “narrow the gap” in agentic scenarios where previously only the largest closed models could compete.

This shift toward architectural efficiency over raw scale ensures that AI integration remains cost-conscious, secure, and flexible enough to keep pace with changing operational needs.
