Chinese AI firm DeepSeek has unveiled a new method to improve LLM reasoning, claiming it delivers more accurate and faster responses than existing approaches. The technique, developed with researchers from Tsinghua University, combines generative reward modeling (GRM) with self-principled critique tuning (SPCT).
The method aims to refine how LLMs respond to general queries by better aligning their outputs with human preferences. According to a paper published on the arXiv preprint repository, the resulting DeepSeek-GRM models outperformed existing methods and proved competitive against widely used public reward models.
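In outline, generative reward modeling means the judge model writes out evaluation principles and a critique before assigning a score, rather than emitting a bare scalar. The Python sketch below illustrates that idea only in broad strokes; `llm_generate` is a hypothetical stand-in for any text-generation call, and the prompt wording and 1-to-10 scale are illustrative assumptions, not details taken from the paper.

```python
import re
from typing import Callable

def grm_score(
    llm_generate: Callable[[str], str],  # hypothetical: prompt in, completion out
    query: str,
    response: str,
) -> float:
    """Score a response by first generating principles and a critique,
    then extracting a numeric score from the generated text -- a
    simplified illustration of the generative-reward-modeling idea."""
    prompt = (
        "You are a reward model. First, write evaluation principles for the "
        "query below. Then critique the response against those principles. "
        "Finish with a line 'Score: X' where X is an integer from 1 to 10.\n\n"
        f"Query: {query}\nResponse: {response}"
    )
    critique = llm_generate(prompt)
    match = re.search(r"Score:\s*(\d+)", critique)
    return float(match.group(1)) if match else 0.0

if __name__ == "__main__":
    # Stub generator for demonstration only; a real setup would call an LLM.
    def fake_llm(prompt: str) -> str:
        return "Principle: be factual. Critique: concise and correct. Score: 8"

    print(grm_score(fake_llm, "What is 2+2?", "4"))  # prints 8.0
```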
DeepSeek intends to release the models as open source, though no release date has been set. The move follows heightened global interest in the company, which earlier drew attention for its V3 foundation model and R1 reasoning model.