-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement self-instruct and evolve-instruct synthetic data generation pipeline #720
base: master
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
implementations/example.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be better to put the logic in a main() function and then run
if __name__ == "__main__":
main()
in the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a way to pass in tasks rather than always using the whole seed_tasks file?
implementations/synthetic_datagen/self_instruct/self_instruct_generator.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can move the self_instruct/utils folder one level up and make it common to (potentially) other generation methods
implementations/synthetic_datagen/evolve_instruct/seed_files/alpaca_data.json
Outdated
Show resolved
Hide resolved
""" | ||
|
||
self.prompt_templates[Mutation.FRESH_START] = ( | ||
self.prompt_templates['base'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could move the prompts to a separate file, similar to templates.py, in self-instruct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could break down the implementation into smaller methods for ease of readability
implementations/synthetic_datagen/evolve_instruct/evolve_instruct_generator.py
Outdated
Show resolved
Hide resolved
|
||
import os | ||
|
||
if isinstance(self.seed_data, str) and os.path.exists(self.seed_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's try to use more descriptive variable names
implementations/synthetic_datagen/evolve_instruct/evolve_instruct_generator.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synthetic data does not need to be inside the package. Best to download it from huggingface.
Union[ChatCompletion, Stream[ChatCompletionChunk]]: | ||
`ChatCompletion` in the non-stream mode. | ||
""" | ||
print(messages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, you can use logging info to instead print func
class NemotronRewardEvalAgent(BaseEvalAgentSystem): | ||
def __init__(self) -> None: | ||
self.nemotron = NvidiaModelV2(model_type=ModelType.NEMOTRON_4_REWARD) | ||
|
||
def run_eval( | ||
self, synthetic_datum: SyntheticDatum | ||
) -> Optional[Dict[str, Any]]: | ||
message = [ | ||
{ | ||
"role": "user", | ||
"content": USER_PROMPT_TEMPLATE.format( | ||
instruction=synthetic_datum.instruction, | ||
input=synthetic_datum.input, | ||
), | ||
}, | ||
{ | ||
"role": "assistant", | ||
"content": synthetic_datum.output, | ||
}, | ||
] | ||
response = self.nemotron.run(message) | ||
print(response) | ||
match = re.search(r"content='(.*?)'", str(response)) | ||
if not match: | ||
logger.error("Failed to parse response from Nemotron. {response}") | ||
return None | ||
|
||
scores_str = match.group(1) | ||
logger.info(f"Scores: {scores_str}") | ||
scores = { | ||
item.split(":")[0]: float(item.split(":")[1]) | ||
for item in scores_str.split(",") | ||
} | ||
|
||
return scores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need code comments in here
@dataclass | ||
class SyntheticDatum: | ||
instruction: str | ||
input: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are moving from dataclass to pydantic BaseModel, maybe better to use pydantic BaseModel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an example in camel of using pydantic BaseModel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it's not necessary to set this SingleAgent class, in this class the model type is hard coded as ollama, why not calling agent from ChatAgent from camel.agents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns: | ||
str: The response generated by the agent. | ||
""" | ||
for agent in self.agents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how could it work? no self.agents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still in progress sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this MultiAgent class used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to initialize a multi-agent system with chat/eval agents etc. inside
import re | ||
from typing import Any, Dict, Optional | ||
|
||
from camel.models import NvidiaModelV2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if only NvidiaModelV2 is used, we can remove other models like NvidiaModelV3
match = re.search(r"content='(.*?)'", str(response)) | ||
if not match: | ||
logger.error("Failed to parse response from Nemotron. {response}") | ||
return None | ||
|
||
scores_str = match.group(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can get the score from response without using regex matching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in that way, it will give the error that we were discussing during the call
raise NotImplementedError( | ||
"This method must be implemented by subclasses." | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normally we use pass
rather than rasing error
@property | ||
def is_self_instruct(self) -> bool: | ||
return self is SyntheticDataGeneratorMethodType.SELFINSTRUCT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need to check this? I think we can do the check by using .value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closing prototype implementation. |
Description
Implementation of a generic pipeline for synthetic data generation designed to support multiple methods, starting with self-instruct (credits to https://github.com/yizhongw/self-instruct).
Motivation and Context
New project: synthetic data generation
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Implemented Tasks
Checklist
Go over all the following points, and put an
x
in all the boxes that apply.If you are unsure about any of these, don't hesitate to ask. We are here to help!