AI Quick Take
- Willison's quick comparison favors Alibaba's Qwen3.6-35B-A3B over Anthropic's Claude Opus 4.7 on two whimsical SVG illustration prompts.
- The Qwen run used a 20.9GB Unsloth GGUF quantization, executed locally on a MacBook Pro M5 via LM Studio.
Simon Willison reports that Qwen3.6-35B-A3B produced illustrations he preferred over those from Anthropic's Claude Opus 4.7, both on his informal 'pelican riding a bicycle' test and on a separate flamingo-on-a-unicycle SVG prompt. The comparison rests on direct prompt outputs and visual judgement rather than formal metrics.
The Qwen result came from a 20.9GB GGUF quantization by Unsloth, run locally on a MacBook Pro M5 through LM Studio and the llm-lmstudio plugin. Willison also ran Opus 4.7 and retried it with thinking_level set to max; the retry did not close the gap on these creative examples.
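As a rough sketch of how a similar side-by-side run could be scripted with the llm Python library that underpins Willison's CLI, assuming the llm-lmstudio and llm-anthropic plugins are installed: the model IDs and the thinking_level option below are assumptions, not confirmed identifiers, so check `llm models` for what your setup actually registers.

```python
# Minimal sketch of an informal side-by-side run using Simon Willison's
# `llm` Python library (pip install llm llm-lmstudio llm-anthropic).
# The model IDs and the thinking_level option are assumptions; run
# `llm models` to see the identifiers your plugins actually expose.
import llm

PROMPT = "Generate an SVG of a pelican riding a bicycle"

# Local quantized model served by LM Studio via the llm-lmstudio plugin.
local_model = llm.get_model("qwen3.6-35b-a3b")  # assumed model ID
local_svg = local_model.prompt(PROMPT).text()

# Hosted Anthropic model via the llm-anthropic plugin.
opus = llm.get_model("claude-opus-4.7")  # assumed model ID
opus_svg = opus.prompt(PROMPT).text()

# Retry with extended thinking, mirroring Willison's follow-up run;
# passing thinking_level as a prompt option is assumed here.
opus_max_svg = opus.prompt(PROMPT, thinking_level="max").text()

# Write each result to disk for manual, eyeball-based comparison.
for name, svg in [("qwen", local_svg), ("opus", opus_svg), ("opus-max", opus_max_svg)]:
    with open(f"{name}.svg", "w") as f:
        f.write(svg)
```

Writing the raw SVG to disk deliberately leaves the judging step manual, which matches the eyeball-based nature of the original test.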
Willison emphasizes that the pelican benchmark is intentionally absurd and not a robust evaluation, though he notes a past informal correlation between pelican quality and broader model usefulness. He also remains skeptical that labs train specifically for this benchmark, even if results like this one feed that suspicion.
For practitioners, the post is a single narrow datapoint: it suggests quantized local inference of a 35B model can yield strong creative outputs, but it is no substitute for comprehensive benchmarks or controlled comparisons. Wait for repeatable, standardized tests over larger sample sets before letting this anecdote steer deployment or procurement decisions.