
Can AI actually solve real math proofs? Researchers put it to the test


By Alex Sugiura, Fonda Mwangi, Joseph Howlett and Kendra Pierre-Louis

Kendra Pierre-Louis: For Scientific American’s Science Quickly, I’m Kendra Pierre-Louis, in for Rachel Feltman.

In 1997, Deep Blue, a supercomputer built by IBM, did the unexpected: it defeated chess giant Garry Kasparov at his own game, leading to a flurry of headlines about whether Deep Blue was truly intelligent and if computers could now outthink humans. The answer, at least then, was mostly no. But it’s now 2026, and we have a growing number of generative AI models that are once again making us wonder, “Can machines outthink us?” To dig into this question, a group of researchers aren’t turning to chess this time—they’re looking to math.


To learn more about that, I talked to Joe Howlett, a staff reporter here at SciAm covering math.

Thanks for joining us today, Joe.

Joe Howlett: Thank you for having me.

Pierre-Louis: So you wrote a piece that’s talking about the challenges of AI and math.

Before we kinda get into the meat and potatoes of that piece, I have a—maybe a more basic question for you.

Howlett: Yeah.

Pierre-Louis: For those of us who maybe peaked with high-school algebra, when you’re talking about AI and math problems, what are the kind of math problems we’re really talking about?

Howlett: That’s actually a lot of what this story’s about: the kind of questions that mathematicians ask and spend their time thinking about don’t really sound like, or have anything in common with, the problems that we work on for homework in math class.

Pierre-Louis: Mm-hmm.

Howlett: If you’ve recently taken a math class, you’re used to problems that have answers, right?

Pierre-Louis: Mm-hmm.

Howlett: And the answer is, like, a number ...

Pierre-Louis: Yep.

Howlett: Or something. And you hand in your homework, and the teacher can check that number [Laughs], if it’s the right number or the wrong number, and they give you a grade. But what research mathematicians are doing is trying to prove that statements are either true or false about the mathematical universe.

So what does that mean? Like, you know about triangles and squares and basic shapes, but there’s ...

Pierre-Louis: I did graduate from kindergarten, yes. [Laughs.]

Howlett: [Laughs.] That’s right, exactly.

That’s about as far as I made it, too. There’s way more complicated shapes that exist in many dimensions and have weird curvatures that you can’t even picture in your mind. But mathematicians are able to say things about them, right?

Using equations and using proofs, they’re able to learn about these objects that we can’t actually see or picture.

Pierre-Louis: So now that we kind of know what math is, in [one of your pieces] you note that LLMs have had some mathematical wins: Google Gemini Deep Think achieved a gold-level score on the International Mathematical Olympiad, and AI has solved multiple “Erdős problems.” Why isn’t that enough to show AI’s math prowess?

Howlett: Yeah, I mean, the thing about most of these so-called benchmarks, as they call ’em, is that for a lot of reasons AI companies have fixated on mathematics as, like, the next thing to prove ...

Pierre-Louis: Mm-hmm.

Howlett: That LLMs can think, or to take a step towards intelligence.

But most of those examples, like you said, they have more in common with the kind of test questions and homework problems that we were just talking about, not really looking like ...

Pierre-Louis: Mm-hmm.

Howlett: Research math, right, which is more about proving statements about the world and exploring that world, posing questions that are interesting.

So in a way all of those accomplishments are very impressive. [Laughs.] It’s crazy that a computer can win gold on the IMO ...

Pierre-Louis: Mm-hmm.

Howlett: But it doesn’t say much about whether and to what extent a computer can advance mathematics, right, on its own, or even with the help of a human.

Pierre-Louis: Kind of like the difference between a really good calculator and a mathematician.

Howlett: Exactly! Yeah.

Like, in the history of mathematics, new tools have been invented time and again that have been useful for mathematicians and have accelerated things. And one of the big questions here [is]: Is this just another one of those tools, or is this gonna fundamentally revolutionize how mathematics is done at a level that we’ve never seen before? And it’s kind of too early to say.

Pierre-Louis: And one of the ways it seems that people are trying to suss out whether AI is kind of just a giant calculator or can really advance math is this First Proof challenge that was put together by a group of 11 mathematicians. Can you explain what this challenge was?

Howlett: Yeah, so these mathematicians, who are, like, luminaries in their various fields—and they cover a broad range of subfields in mathematics—they wanted to rectify this situation where we don’t really have a good sense of how good AI is at posing and solving real research math problems.

All of them have had this anecdotal experience where LLMs have gotten a lot better in just the last few months at interrogating mathematical questions kind of in the way a mathematician would and at proposing proofs and methods of proof that seem to bear out in some situations. But then they also hallucinate a lot, and they propose a lot of very confident nonsense. So these mathematicians—who, by the way, don’t work for AI companies, right ...

Pierre-Louis: Mm-hmm.

Howlett: They decided to get together and pose actual research questions that they are trying to solve for their own mathematical research, right? So each of them has papers that are coming out with proofs, and each of them took a little section of that.

Proofs—the way mathematicians do proofs is they break them up into smaller theorems, right? So if you wanted to prove that seven is bigger than three, you might first prove that seven is bigger than five, and then prove that five is bigger than three, right? And that’s kind of how mathematicians work.
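[Editor’s note: The lemma-chaining Howlett describes can be made literal in a proof assistant. Here is a minimal, illustrative sketch in Lean 4 (core Lean only, no Mathlib assumed) that proves 7 > 3 by way of the two smaller facts from his example:]

```lean
-- Toy version of the decomposition described above:
-- prove 7 > 3 by first proving two smaller lemmas.
theorem seven_gt_five : 7 > 5 := by decide   -- lemma 1
theorem five_gt_three : 5 > 3 := by decide   -- lemma 2

-- Chain the lemmas by transitivity: 3 < 5 and 5 < 7 give 3 < 7.
theorem seven_gt_three : 7 > 3 := Nat.lt_trans five_gt_three seven_gt_five
```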

And these smaller proofs are called “lemmas.” What these mathematicians did is they each took from an upcoming paper a lemma that they proved as part of their bigger proof, picked it out of that paper and posed it as a problem for an LLM, and did all of this before uploading the paper anywhere online so that it’s not in the training data of the LLMs, right?

Pierre-Louis: Mm-hmm.

Howlett: ’Cause any math problem that I could pose an LLM has probably been posed before, and probably an answer exists on the Internet. So these are real cutting-edge research questions, and if an LLM can solve them, then it would be, like, substantially able to contribute to the practice of doing math.

Pierre-Louis: So what are the early results from running this kind of challenge?
