Alibaba’s New Open Source Qwen3.5-Medium Models Deliver Sonnet 4.5 Performance on Local Machines

Alibaba’s now famous Qwen AI development team has done it again: just over a day ago it released the Qwen3.5 Medium model series, four new large language models (LLMs) with support for agentic tool calling. Three of the four are available for commercial use by enterprises and indie developers under the standard open source Apache 2.0 license:

  • Qwen3.5-35B-A3B

  • Qwen3.5-122B-A10B

  • Qwen3.5-27B

Developers can download them now from Hugging Face and ModelScope. A fourth model, Qwen3.5-Flash, appears to be proprietary and is only available through the Alibaba Cloud Model Studio API, but it still offers a strong cost advantage over other Western models (see the price comparison table below).

But the big difference is that the open source models deliver performance on third-party benchmark tests comparable to similarly sized proprietary models from major US labs like OpenAI and Anthropic, and actually beat OpenAI’s GPT-5 mini and Anthropic’s Claude Sonnet 4.5, the latter released just five months ago.

And the Qwen team says it has designed these models to remain highly accurate even when ‘quantized’, a process that shrinks their footprint by storing the model’s parameters at lower numerical precision, using far fewer distinct values per weight.

Crucially, this release brings frontier-level context windows to the desktop PC. The flagship Qwen3.5-35B-A3B can exceed a context length of 1 million tokens on consumer-grade GPUs with 32 GB of VRAM. While not everyone has access to such a card, this is far less computing power than many similarly performing options require.

This leap is made possible by virtually lossless accuracy under 4-bit weight and KV cache quantization, allowing developers to process massive data sets without server-grade infrastructure.
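The idea behind the weight quantization described above can be illustrated with a minimal group-wise 4-bit scheme. This is a simplified sketch, not Qwen’s actual quantization pipeline; in production the 4-bit integers would also be packed two per byte:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 32):
    """Group-wise 4-bit quantization: map each group of float weights
    to integers in [0, 15] plus a per-group scale and minimum."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0           # 4 bits -> 16 levels
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(1024,)).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min).reshape(-1)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The per-group scale keeps the rounding error bounded by half a quantization step, which is why well-designed 4-bit schemes can stay near-lossless in practice.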

Technology: Gated Delta Networks

At the heart of Qwen 3.5’s performance is an advanced hybrid architecture. While many models rely solely on standard Transformer blocks, Qwen 3.5 integrates Gated Delta Networks in conjunction with a sparse Mixture-of-Experts (MoE) system. The technical specifications for the Qwen3.5-35B-A3B reveal a highly efficient design:

  • Parameter efficiency: Although the model contains 35 billion total parameters, only 3 billion are activated for any given token.

  • Expert diversity: The MoE layer uses 256 experts, with 8 routed experts and 1 shared expert active per token, which helps maintain performance while reducing inference latency.

  • Near-lossless quantization: The series maintains high accuracy even when compressed to 4-bit weights, significantly reducing the memory footprint for local implementation.

  • Base model release: To support the research community, Alibaba has released the Qwen3.5-35B-A3B base model in addition to the instruction-tuned versions.
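The sparse activation described in the spec above can be sketched as top-k expert routing. The layer sizes below mirror the published figures (256 experts, 8 routed plus 1 shared), but the tiny dense “experts” and random router weights are placeholders for illustration, not Qwen’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 256, 8

# Router: one logit per routed expert for each token.
router_w = rng.normal(scale=0.02, size=(d_model, n_experts))
# Placeholder expert "MLPs" (single matrices) and one always-on shared expert.
experts = rng.normal(scale=0.02, size=(n_experts, d_model, d_model))
shared = rng.normal(scale=0.02, size=(d_model, d_model))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) for one token. Only top_k routed experts run."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]     # indices of the 8 routed experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    out = x @ shared                      # shared expert always runs
    for g, e in zip(gates, top):
        out += g * (x @ experts[e])
    return out

x = rng.normal(size=(d_model,))
y = moe_forward(x)
print(f"experts run per token: {top_k + 1} of {n_experts + 1}")
```

Because only 9 of 257 expert paths execute per token, compute per token scales with the 3B active parameters rather than the full 35B.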

Product: Intelligence that ‘thinks’ first

Qwen 3.5 introduces a native “Thinking Mode” as the default state. Before providing a final answer, the model generates an internal chain of reasoning, delimited by special tags, to work through complex logic. The product range is tailored to different hardware environments:

  • Qwen3.5-27B: Optimized for high efficiency and supports a context length of more than 800,000 tokens.

  • Qwen3.5-Flash: The production-quality hosted version, with a default context length of 1 million tokens and built-in official tools.

  • Qwen3.5-122B-A10B: Designed for server-grade GPUs (80 GB VRAM), this model supports context lengths over 1 million while closing the gap with the world’s largest frontier models.
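With Thinking Mode on by default, client code typically needs to strip the reasoning before showing the final answer. A minimal sketch, assuming the `<think>...</think>` delimiters used by earlier Qwen releases (verify against the actual Qwen3.5 chat template before relying on this):

```python
import re

def split_reasoning(text: str):
    """Separate the chain-of-thought from the final answer, assuming the
    reasoning is wrapped in <think>...</think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()       # no reasoning block found
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

# Hypothetical raw model output for illustration.
raw = "<think>2 apples + 3 apples = 5 apples.</think>There are 5 apples."
thought, answer = split_reasoning(raw)
print(answer)  # There are 5 apples.
```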

Benchmark results validate this architectural shift. The 35B-A3B model notably surpasses much larger predecessors such as Qwen3-235B, as well as the aforementioned proprietary GPT-5 mini and Sonnet 4.5 in categories such as knowledge (MMMLU) and visual reasoning (MMMU-Pro).

Pricing and API integration

For those who don’t host their own weights, Alibaba Cloud Model Studio offers a competitive API for Qwen3.5-Flash.

  • Input: $0.10 per 1 million tokens

  • Output: $0.40 per 1 million tokens

  • Cache write: $0.125 per 1 million tokens

  • Cache read: $0.01 per 1 million tokens

The API also features a detailed Tool Calling pricing model, with Web Search costing $10 per 1,000 calls and Code Interpreter currently offered for free for a limited time.

This makes Qwen3.5-Flash one of the most affordable APIs of all the major LLMs in the world. See a table comparing them below:

| Model | Input ($/1M) | Output ($/1M) | Total | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| DeepSeek chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (23-01-2026) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

What it means for technical leaders and enterprise decision makers

With the launch of the Qwen3.5 Medium Models, the rapid iteration and refinement once reserved for well-funded labs is now accessible to on-premise development at many non-tech companies, effectively decoupling advanced AI from massive capital expenditure.

Across the organization, this architecture transforms the way data is processed and secured. The ability to locally ingest massive document repositories or hour-scale videos enables deep institutional analysis without the privacy risks of third-party APIs.

By running these specialized ‘Mixture-of-Experts’ models within a private firewall, organizations can maintain sovereign control over their data, while leveraging native ‘thinking’ modes and official tool-calling capabilities to build more reliable, autonomous agents.

Early adopters on Hugging Face have specifically praised the model’s ability to “narrow the gap” in agentic scenarios where previously only the largest closed models could compete.

This shift toward architectural efficiency over raw scale ensures that AI integration remains cost-conscious, secure, and flexible enough to keep pace with changing operational needs.
