sonnet 4.6 outperforms opus 4.7 on my issue-triage skill.
6/7 vs 4/7 on the eval. for narrow, well-defined tasks the smaller model wins. saving this.
my issue-triage skill takes a raw github issue and outputs (category, severity, suggested owner). i ran the same 7-issue eval on opus 4.7 and sonnet 4.6.
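a rough sketch of how an eval like this could be scored, assuming an issue only counts as correct when all three fields match. `Triage`, `score`, and the sample labels are hypothetical names for illustration, not the real harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    """One triage result: the three fields the skill outputs (hypothetical type)."""
    category: str
    severity: str
    owner: str


def score(predicted, expected):
    """Return (correct, total). Assumption: all three fields must match exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    # dataclass equality compares category, severity, and owner together
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct, len(expected)


# made-up labels, just to show the shape of the data
expected = [Triage("bug", "high", "alice"), Triage("docs", "low", "bob")]
predicted = [Triage("bug", "high", "alice"), Triage("docs", "medium", "bob")]
print(score(predicted, expected))  # (1, 2)
```

strict all-fields matching is only one choice; scoring each field separately would show whether a miss came from the owner or the severity.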
results
opus 4.7: 4/7 correct. when it was wrong, it was wrong in a 'reasoning past the spec' way — it'd suggest an owner who hadn't been on the team in months because the issue resembled an old one.
sonnet 4.6: 6/7 correct. when it was wrong, it was wrong in a smaller way: it'd rate severity higher than the issue warranted.
what i think is happening
opus is doing too much reasoning for this task. the skill has a tight spec; sonnet sticks closer to it. the gain from opus's extra capability is wiped out by its tendency to imagine adjacent context.
narrow task → smaller model. saving the eval so i can re-run it on the next release.
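one way to keep runs around for the next release, as a minimal sketch: append each result to a JSON-lines file and read them back later. `save_run`, `load_runs`, and the file name in the usage comment are assumptions for illustration.

```python
import json
from pathlib import Path


def save_run(path, model, correct, total):
    """Append one eval run as a JSON line so later runs can be compared."""
    record = {"model": model, "correct": correct, "total": total}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(path):
    """Read back every saved run, oldest first."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# usage (hypothetical file name):
#   save_run("triage_eval.jsonl", "sonnet 4.6", 6, 7)
#   save_run("triage_eval.jsonl", "opus 4.7", 4, 7)
#   for run in load_runs("triage_eval.jsonl"):
#       print(run["model"], f'{run["correct"]}/{run["total"]}')
```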