When fine-tuning StarCoder or OctoCoder on a custom dataset for integration with an IDE, is it more appropriate to format the data as question/answer pairs (masking the custom code) for instruction tuning, or to train it like a base model, concatenating entire code files with separator tokens and keeping labels identical to the inputs? Could you share any opinions or experiences regarding this?
For code completion in the IDE (GitHub Copilot style), we recommend just concatenating the code files as we did for pre-training. For chat-like applications and instruction tuning, it's more common to use the instruction/answer format.
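The pre-training-style approach could be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used for StarCoder: `pack_files` is a hypothetical helper, and it assumes files have already been tokenized and that the tokenizer provides an end-of-sequence id (e.g. StarCoder's `<|endoftext|>`). Note that labels simply mirror `input_ids`, as is standard for causal-LM fine-tuning.

```python
def pack_files(tokenized_files, eos_token_id, block_size):
    """Concatenate tokenized files with EOS separators, then chunk
    the stream into fixed-size training blocks."""
    stream = []
    for ids in tokenized_files:
        stream.extend(ids)
        stream.append(eos_token_id)  # separator between files
    # Split into fixed-size blocks; the ragged tail is dropped.
    blocks = [
        stream[i : i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]
    # Labels identical to input_ids: loss is computed on every token.
    return [{"input_ids": b, "labels": list(b)} for b in blocks]


# Toy usage with fake token ids (eos = 0):
examples = pack_files([[1, 2, 3], [4, 5]], eos_token_id=0, block_size=4)
```

For instruction tuning, by contrast, the prompt tokens would typically have their label positions set to an ignore index (e.g. -100) so loss is only computed on the answer.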