A New Direction for AI Patent Filings in 2025

June 12, 2025 | News

A recent article out of Tsinghua University, a top technical university in China, is generating interest in the AI field regarding potential limitations of reinforcement learning in developing reasoning models. Why does this matter? One reason is that reinforcement learning has been a major factor in the recent improvement of large language models (LLMs) at reasoning tasks. Another is that reinforcement learning is thought by many to be an important tool on the path toward artificial general intelligence (AGI).

Last fall, OpenAI released a new family of LLMs, referred to as o-series models, with enhanced reasoning capabilities. These models mimic reasoning patterns that humans use when solving problems, such as making a plan first, breaking tasks down into steps, and backtracking when mistakes are made. When given problems to solve, these models produce “chain of thought” (CoT) responses that include reasoned explanations of how a solution was reached. These models outperform non-reasoning LLMs such as GPT-4 in that they arrive at solutions faster, that is, with fewer tries.

These LLMs are trained using a type of reinforcement learning called Reinforcement Learning with Verifiable Rewards (RLVR). RLVR starts with a base LLM that is pre-trained using supervised learning (such as a version of ChatGPT), and optimizes the model using reinforcement learning in which automatically computable rewards are provided when the model’s output matches ground-truth data. For example, the model may be given a mathematical problem, where a reward is given if the correct solution is reached, or the model may be given a programming task, where a reward may be given if the code the model outputs executes successfully. During this secondary training, the LLM learns strategies that maximize the rewards. This simple concept has proven remarkably effective at training models for some kinds of reasoning tasks. For example, agents trained using reinforcement learning to play Go have outperformed humans by discovering previously unknown strategies.
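For readers curious about what a “verifiable reward” looks like in practice, here is a minimal sketch in Python. The function names, the exact-match check, and the assumption that the generated code defines a solve() function are illustrative choices of ours, not details from any particular training pipeline.

```python
# Minimal sketch of "verifiable rewards" of the kind RLVR relies on.
# The reward values (1.0 / 0.0) and helper names are illustrative assumptions.

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Reward 1.0 if the model's final answer matches the known solution."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(model_code: str, test_input, expected_output) -> float:
    """Reward 1.0 if the generated code runs and produces the expected output."""
    namespace = {}
    try:
        exec(model_code, namespace)               # run the model's code
        result = namespace["solve"](test_input)   # assumes it defines solve()
    except Exception:
        return 0.0                                # code that crashes earns no reward
    return 1.0 if result == expected_output else 0.0

# Example: a correct program earns the reward, an exact-match answer does too.
good = "def solve(x):\n    return x * 2"
print(code_reward(good, 21, 42))   # 1.0
print(math_reward("42", "42"))     # 1.0
```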

However, until this article came out, whether LLMs trained using RLVR are able to develop truly novel reasoning strategies had remained an open question. Researchers at Tsinghua set out to answer this question, with some surprising results. What they discovered is that although the reasoning models trained using RLVR are able to solve some problems faster and more efficiently than the base models, the reasoning models are only able to come up with solutions that the base models would eventually figure out if given enough time (meaning, enough tries). Even more surprising, they discovered that when given enough time, the base models actually outperformed the reasoning models. That is, if you limit the number of tries to get a correct answer to a problem, the reasoning models perform better than the base models, but if you allow enough tries, the non-reasoning base models show better performance and end up being able to solve more types of reasoning problems.
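The “limited tries versus many tries” comparison is essentially a pass@k-style evaluation. The toy sketch below uses made-up per-try success rates and problem-coverage numbers purely to illustrate the qualitative pattern described above; none of the figures come from the paper.

```python
# Toy illustration of the crossover between a reliable-but-bounded RL-tuned
# model and a less reliable base model, under a pass@k-style evaluation.
# All numbers below are invented for illustration only.

def pass_at_k(per_try_success: float, k: int) -> float:
    """Probability of solving a problem at least once in k independent tries."""
    return 1.0 - (1.0 - per_try_success) ** k

# Hypothetical profile: the RL-tuned model is more reliable per try, but a
# slice of problems is out of its reach; the base model can eventually solve
# everything, just rarely on any single try.
rl_model   = {"per_try": 0.60, "coverage": 0.80}   # 80% of problems reachable
base_model = {"per_try": 0.15, "coverage": 1.00}   # all problems reachable, rarely

for k in (1, 8, 256):
    rl   = rl_model["coverage"]   * pass_at_k(rl_model["per_try"], k)
    base = base_model["coverage"] * pass_at_k(base_model["per_try"], k)
    print(f"k={k:4d}   RL-tuned ~ {rl:.2f}   base ~ {base:.2f}")

# Small k favors the RL-tuned model; large k lets the base model overtake it.
```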

In other words, the reinforcement learning makes the models reason faster and more efficiently, but the reinforcement learning process is not able to get the models to “think outside the box”. The research suggests that this class of LLMs is capable of generalizing to new data that is similar to the data it was trained on, but not of “out-of-distribution generalization”, meaning the models cannot reason beyond what they learned during their own training. These models still lack the capacity for adaptation.

However, the research team found more positive results with respect to a different approach used to train LLMs to find solutions more efficiently, called distillation. In model distillation, a large, complex model referred to as a “teacher” is used to train a smaller, more efficient model, referred to as a “student”. The student model is trained to mimic the output of the teacher, effectively transferring the knowledge of the teacher to the student in a more compact form. The teacher is trained on a dataset, and the student is trained on both the dataset and the output predictions of the teacher. The teacher may be pre-trained, or the teacher and the student may be trained simultaneously. This can result in the student model generating predictions similar to the teacher’s, but faster and with lower computational overhead.
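A common way to implement this is to train the student against a blend of the ground-truth labels and the teacher’s softened output distribution. The PyTorch sketch below is a generic example of that recipe; the temperature, the 50/50 weighting, and the random toy inputs are illustrative assumptions rather than settings from the work discussed here.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a distillation loss: the student learns from the hard
# labels *and* from the teacher's softened output distribution.
# Temperature and alpha are illustrative defaults, not values from the paper.

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between softened teacher and student distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    soft_loss = soft_loss * (temperature ** 2)  # usual scaling for the soft term
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Tiny usage example with random logits for a batch of 4 examples, 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels).item())
```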

The research team studied the reasoning capabilities of a DeepSeek-distilled model, and they found the performance of the distilled model to be consistently and significantly above that of the base model. As stated in the paper, “this indicates that, unlike RL that is fundamentally bounded by the reasoning capacity of the base model, distillation introduces new reasoning patterns learned from a stronger teacher model. As a result, the distilled model is capable of surpassing the reasoning boundary of the base model.”

It will be interesting to see how the industry’s reaction to this paper affects tech companies’ patent portfolios. In addition to spurring advances in reinforcement learning algorithms, the findings may well lead to an increase in AI patent filings on distillation training techniques.
