Grok-3 Thinking needed 64 answers per question to do better than o3-mini

OpenAI has used such graphs before, so it’s not the worst sin, but it does show that the o3 family is still in a league of its own.