
Feature Request: Separate out reasoning content in API response #269

Open
Trippnology opened this issue Mar 8, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments

@Trippnology

Following on from this issue, it would be very helpful if reasoning content was separated from regular content when using the SDK, the same way it can be separated when using the REST API.

@ryan-the-crayon explained the reasoningType property of fragments when streaming, but that is not currently available on the final result (please refer to the issue above for code examples).
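
For reference, the REST API can already return the two parts separately; the response body looks roughly like this (an illustrative excerpt only, with the exact shape depending on the LM Studio version and whether reasoning separation is enabled in the server settings):

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user is asking ...",
        "content": "The meaning of life is ..."
      }
    }
  ]
}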

ryan-the-crayon added the enhancement label on Mar 10, 2025
@ryan-the-crayon
Collaborator

Thanks for the suggestion. We will likely add this feature. However, we still have to decide the field names.

For the time being, you can do this yourself:

import { LMStudioClient } from "@lmstudio/sdk";

const client = new LMStudioClient();
const model = await client.llm.model("deepseek-r1-distill-llama-8b");

let reasoningPart = "";
let nonReasoningPart = "";

const result = await model.respond("What is the meaning of life?", {
  // Sort each streamed fragment into the reasoning or non-reasoning bucket.
  onPredictionFragment: ({ reasoningType, content }) => {
    if (reasoningType === "reasoning") {
      reasoningPart += content;
    } else if (reasoningType === "none") {
      nonReasoningPart += content;
    }
  },
});

console.info("Reasoning part:", reasoningPart);
console.info("Non-reasoning part:", nonReasoningPart);

@ryan-the-crayon
Collaborator

Alternatively, if you are streaming already:

const model = await client.llm.model("deepseek-r1-distill-llama-8b");

let reasoningPart = "";
let nonReasoningPart = "";

// Accumulate fragments directly from the stream.
for await (const { reasoningType, content } of model.respond("What is the meaning of life?")) {
  if (reasoningType === "reasoning") {
    reasoningPart += content;
  } else if (reasoningType === "none") {
    nonReasoningPart += content;
  }
}

console.info("Reasoning part:", reasoningPart);
console.info("Non-reasoning part:", nonReasoningPart);

@Trippnology
Author

Thanks Ryan, I appreciate the explanation. I'm already doing that, as my terminal-based app handles both streaming and non-streaming responses. While it works fine for a short example, it quickly gets more complicated than it needs to be in a real app.

To explain my reasoning (see what I did there!):

const prediction = mainAgent.sendMessage();

if (options.md) {
	conversationLog.await('Generating response');
	// As this is a terminal-based app, if the response should be parsed as MD,
	// we have to wait for the entire result, as we cannot go back and parse it after
	// it's displayed, like we could with a web-based app.
} else {
	// This could be removed if reasoning was separate as we wouldn't need to compile it ourselves
	last_response = {
		text: '',
		thinking: '',
	};
	// No MD parsing means we can just output immediately
	for await (const { reasoningType, content } of prediction) {
		if (options.hideThinking) {
			if (reasoningType === 'reasoning') {
				last_response.thinking += content;
			} else if (reasoningType === 'none') {
				last_response.text += content;
				process.stdout.write(content);
			}
		} else {
			// No need to hide reasoning content, output everything
			process.stdout.write(content);
		}
	}
	process.stdout.write('\n\n');
}

const result = await prediction;

conversationLog.success(
	`Got ${result.stats.totalTokensCount} tokens at ${_.round(result.stats.tokensPerSecond, 2)} t/s`,
);

// This section could be removed if reasoning was separate, and we could just cache the
// last response directly with: last_response = result;
const parts = result.content.split('</think>');
last_response = {
	text: parts[1],
	thinking: `${parts[0]}</think>`,
};

if (options.md) {
	// Output the final result parsed as MD
	console.log(
		cliMd(
			options.hideThinking ? last_response.text : result.content,
		),
	);
}

If result had a reasoning_content property (as found in REST API responses), this could be reduced to:

const prediction = mainAgent.sendMessage();

if (options.md) {
	conversationLog.await('Generating response');
} else {
	for await (const { reasoningType, content } of prediction) {
		if (options.hideThinking) {
			if (reasoningType === 'none') {
				process.stdout.write(content);
			}
		} else {
			process.stdout.write(content);
		}
	}
	process.stdout.write('\n\n');
}

const result = await prediction;
last_response = result;

conversationLog.success(
	`Got ${result.stats.totalTokensCount} tokens at ${_.round(result.stats.tokensPerSecond, 2)} t/s`,
);

if (options.md) {
	console.log(
		cliMd(
			options.hideThinking ? result.content : `${result.reasoning_content}\n${result.content}`,
		),
	);
}

It's also important to consider that DeepSeek's official recommendation is that the thinking content should not be passed back to the model as part of the conversation, so splitting it out before adding the response to the conversation stack is essential, even if it's displayed in normal output.
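
For illustration, here is a minimal sketch of that split before the reply goes into the conversation history (conversationStack is just a placeholder for whatever structure the app actually uses):

// Hypothetical sketch: keep only the non-reasoning text in the conversation history.
// conversationStack is a placeholder for the app's conversation history.
const parts = result.content.split('</think>');
const visibleText = (parts.length > 1 ? parts[1] : parts[0]).trim();

conversationStack.push({
	role: 'assistant',
	content: visibleText, // reasoning never goes back to the model
});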

Hope this makes sense, and I appreciate that you and the rest of the team are working incredibly hard to keep up with a mountain of requests! 🤜

@ryan-the-crayon
Collaborator

ryan-the-crayon commented Mar 11, 2025

@Trippnology Thanks for the detailed information. We will definitely add support for this. 👍

Regarding stripping the reasoning content, here is something I typed earlier:

tl;dr: LM Studio performs reasoning content stripping via the prompt template, i.e. we only remove content between <think> tags if the prompt template asks us to.

Explanation:

On its own, LM Studio aims to preserve a model's output, as we have no way of knowing whether a model expects to see its previous reasoning content in subsequent generations. However, many recent reasoning models (such as deepseek) do expect reasoning content to be stripped. This is achieved by the prompt template shipped with the model. If you check the jinja template of deepseek (go to the My Models page -> your model -> gears -> Prompt -> Prompt Template), it will likely contain something like {% if '</think>' in content %}{% set content = content.split('</think>')|last %}{% endif %}, which removes the content before </think>. Since the prompt template is applied before the context is fed to the model, this ensures reasoning content from previous generations is not passed to the model.
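
To make the effect of that snippet concrete, here is a rough JavaScript equivalent of what the template does to each previous assistant message when the prompt is rendered (illustration only; the real stripping happens inside the jinja template, and messages here is just a stand-in for the chat history):

// Rough JS equivalent of: {% if '</think>' in content %}{% set content = content.split('</think>')|last %}{% endif %}
function stripReasoning(content) {
  return content.includes("</think>")
    ? content.split("</think>").pop()
    : content;
}

// `messages` is a stand-in for the chat history array.
// Applied to every earlier assistant turn before the context is fed to the model.
const history = messages.map((m) =>
  m.role === "assistant" ? { ...m, content: stripReasoning(m.content) } : m,
);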

If you suspect reasoning content from previous generations is not being stripped properly, please use the command lms log stream. It shows you the raw input right before it is passed to the model. (In general, lms log stream is what you should use whenever you need to make sure the correct content is being fed to the model.)

If you do see reasoning content from a previous generation in lms log stream, it is most likely caused by an incorrect prompt template. There are usually two causes:

  • The prompt template shipped with the model is incorrect. (Check the prompt template in the My Models page.) If you are unsure, you can ask here.
  • You have provided a prompt template override that does not strip reasoning content (you can check this by right-clicking the top gear in Chat -> Copy Model Debug Information).
    • If you are using the "Manual" prompt template, it does not support reasoning content stripping.
    • If you are using a custom jinja template, make sure it strips reasoning content.

@Trippnology
Author

Oh, that's really good information. I had no idea reasoning content was automatically stripped like that.

Pleased to hear this will land eventually; I can muddle through in the short term. Many thanks!
