model note, 1 may '26

sonnet 4.6 outperforms opus 4.7 on my issue-triage skill.

6/7 vs 4/7 on the eval. on this narrow, well-defined task the smaller model won. saving this.

my issue-triage skill takes a raw github issue and outputs (category, severity, suggested owner). i ran the same 7-issue eval on opus 4.7 and sonnet 4.6.
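rough sketch of how the eval gets scored, so it can be re-run as-is later. assumptions, none of which are in the note above: a prediction counts as correct only if all three fields match the label exactly, and `Triage`, `Case`, `score`, and the toy `always_bug` model are made-up names standing in for the real skill call.

```python
from typing import Callable, NamedTuple


class Triage(NamedTuple):
    category: str
    severity: str
    owner: str


class Case(NamedTuple):
    issue_text: str
    expected: Triage


def score(triage_fn: Callable[[str], Triage], cases: list[Case]) -> tuple[int, int]:
    """Return (correct, total). Assumption: correct means all three fields match."""
    correct = sum(1 for c in cases if triage_fn(c.issue_text) == c.expected)
    return correct, len(cases)


# Hypothetical stand-in for a real model call; it ignores the issue text.
def always_bug(issue_text: str) -> Triage:
    return Triage("bug", "high", "alice")


# Illustrative cases only, not the real 7-issue eval.
cases = [
    Case("app crashes on login", Triage("bug", "high", "alice")),
    Case("add dark mode", Triage("feature", "low", "bob")),
]

correct, total = score(always_bug, cases)
print(f"{correct}/{total}")  # prints 1/2
```

swapping `always_bug` for a wrapper around each model keeps the cases and the scoring rule fixed between runs, so only the model changes.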

results

opus 4.7: 4/7 correct. when it was wrong, it was wrong in a 'reasoning past the spec' way — it'd suggest an owner who hadn't been on the team in months because the issue resembled an old one.

sonnet 4.6: 6/7 correct. when it was wrong, it was wrong in a smaller way — it'd over-classify severity.
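the two failure styles above land on different fields (owner vs severity), so a per-field miss count may be more useful than the headline score. minimal sketch; the field names, `missed_fields`, `miss_breakdown`, and the sample pairs are all hypothetical, not the real eval output.

```python
from collections import Counter

FIELDS = ("category", "severity", "owner")


def missed_fields(predicted: dict, expected: dict) -> list[str]:
    """List the fields where the prediction differs from the label."""
    return [f for f in FIELDS if predicted[f] != expected[f]]


def miss_breakdown(pairs: list[tuple[dict, dict]]) -> Counter:
    """Count misses per field across (predicted, expected) pairs."""
    counts = Counter()
    for predicted, expected in pairs:
        counts.update(missed_fields(predicted, expected))
    return counts


# Illustrative pairs only: one wrong owner, one overrated severity.
pairs = [
    (
        {"category": "bug", "severity": "high", "owner": "carol"},
        {"category": "bug", "severity": "high", "owner": "alice"},
    ),
    (
        {"category": "bug", "severity": "critical", "owner": "bob"},
        {"category": "bug", "severity": "medium", "owner": "bob"},
    ),
]

breakdown = miss_breakdown(pairs)
print(dict(breakdown))  # prints {'owner': 1, 'severity': 1}
```

running this per model would show whether the owner misses and the severity misses stay separated the same way on the next release.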

what i think is happening

opus is doing too much reasoning for this task. the skill has a tight spec; sonnet sticks closer to it. the gain from opus's extra capability is wiped out by its tendency to imagine adjacent context.

narrow task → smaller model. saving the eval so i can re-run it on the next release.
