Support Lumina 2.0 #1924
I wonder if it would be possible to document what is required to implement a new model for training. With the abstractions now in place for LoRA training across different models (SD3/SDXL/Flux), we can work toward implementing support for new models, maybe even to the point where third parties could add their own model support by implementing some base functionality. For instance, we now have train_network, which the model-specific scripts implement support for and call. We have strategy_base, which implements the caching and text encoder strategies, and then the networks lora_ modules for the specific models. Beyond that, it would be nice for implementing new models to be a collaborative process, so it doesn't all land on Kohya's shoulders, and for it not to be tied directly into this repo, allowing some flexibility in iteration. Those changes could then be adapted upstream into this repo with minimal changes to the implementation where desired. The idea is that more experimental changes could be made without branch conflicts, keeping a fork with the changes that support new models, while still coordinating with this repo on officially supported models so we don't step on each other's toes by implementing the same support twice. This is broader than this ticket, but I thought it was relevant because I have been working with these abstractions and want to consider whether such an approach is possible.
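To make the idea concrete, here is a minimal sketch of what a third-party integration might look like under these abstractions. The "Lumina"-prefixed names are hypothetical, and the hook names are assumed from the pattern quoted later in this thread; the actual signatures in train_network.py and strategy_base.py may differ.

```python
# A rough sketch, not a working implementation: names prefixed "Lumina" are
# hypothetical, and the hook names follow the pattern quoted later in this thread
# (get_tokenize_strategy / get_latents_caching_strategy); check the real code.
import train_network
import strategy_base


class LuminaTokenizeStrategy(strategy_base.TokenizeStrategy):
    """Tokenization rules for the new model's text encoder (e.g. Gemma 2B)."""


class LuminaLatentsCachingStrategy(strategy_base.LatentsCachingStrategy):
    """Latent caching behavior for the new model's VAE."""


class LuminaNetworkTrainer(train_network.NetworkTrainer):
    """Overrides the hooks NetworkTrainer uses to obtain model-specific strategies."""

    def get_tokenize_strategy(self, args):
        return LuminaTokenizeStrategy()  # constructor args omitted in this sketch

    def get_latents_caching_strategy(self, args):
        return LuminaLatentsCachingStrategy()  # constructor args omitted in this sketch


if __name__ == "__main__":
    # Mirrors how the existing *_train_network.py entry points appear to drive training.
    parser = train_network.setup_parser()
    args = parser.parse_args()
    LuminaNetworkTrainer().train(args)
```

A fork could iterate on a file like this independently, then upstream only the trainer subclass and strategies with minimal changes elsewhere.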
I'm implementing the relevant code; the main problem is that I'm not very familiar with the cache strategy.
If you have any questions on the cache strategy, post them here. I just spent some time implementing a new cache strategy feature to allow outside embeddings to be cached.
#1927
I apologize for the delay in improving the implementation. I think what rockerBoo said makes a lot of sense. The current caching strategy is better than the chaos it was before, but there is still a lot of room for improvement. I've started working on improvements in #1784, but it's still a long way off; I'd like to see it through somehow. Musubi tuner (https://github.com/kohya-ss/musubi-tuner) has a much simpler implementation, so maybe sd-scripts can be simplified by dropping less-used features. Please open an issue or discussion about the cache strategy if necessary.
Thank you @kohya for your long-term support, which is also why I have always used this repository. I personally feel that the current caching strategy is organized around the text encoder but placed under the overall model. This results in having to write several nested abstract methods back and forth: the TE handling first goes into strategy_model.py and is then called from model_train_util or model_util. Could we separate the TE into its own model class, as is done in transformers? Considering that future models will have more diverse TEs for caching and for combination with the DiT, it may be better to separate the LLM/CLIP from the overall model in the future.
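A hypothetical sketch of that proposal: wrap each text encoder in its own small class that owns its tokenizer, encoding call, and input parameters, rather than routing TE handling through the per-model strategy files. Class and method names here are invented for illustration and do not exist in sd-scripts.

```python
# Hypothetical TE wrapper: self-contained tokenize/encode, decoupled from the DiT.
import torch
from transformers import AutoTokenizer, AutoModel


class WrappedTextEncoder:
    """Owns its tokenizer, encoding call, and the parameters the DiT needs."""

    def __init__(self, model_name: str, max_token_length: int, use_pooled: bool):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.max_token_length = max_token_length
        self.use_pooled = use_pooled

    @torch.no_grad()
    def encode(self, prompts: list[str]) -> dict[str, torch.Tensor]:
        tokens = self.tokenizer(
            prompts,
            max_length=self.max_token_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        outputs = self.model(**tokens)
        result = {"hidden_states": outputs.last_hidden_state}
        if self.use_pooled and hasattr(outputs, "pooler_output"):
            result["pooled"] = outputs.pooler_output
        return result
```

Caching code would then only need to know about the wrapper's outputs, not which LLM/CLIP sits behind it.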
I am truly grateful to all the contributors.
I may not fully understand... For the text encoder, we use the transformers model almost as is. Is the idea to create a model that wraps the transformer's text encoder for each architecture, rather than calling the text encoder from the strategy? |
That makes sense. I think the difficult point is that even if the same Text Encoder is used (for example in SD3 and FLUX.1), the handling differs depending on the model architecture (maximum token length, whether to use one or both of the pooled output and hidden states, etc.). For the VAE, I think we already have different strategies for different model architectures.
There is a factory singleton for the different strategies in the new system, so these might just be more legacy code as we move away from them?

```python
tokenize_strategy = self.get_tokenize_strategy(args)
strategy_base.TokenizeStrategy.set_strategy(tokenize_strategy)
tokenizers = self.get_tokenizers(tokenize_strategy)  # will be removed after sample_image is refactored
latents_caching_strategy = self.get_latents_caching_strategy(args)
strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)
```
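For context, this is how a set-once singleton like that is typically consumed later, for example by dataset or caching code that has no reference to the trainer. The accessor name and the tokenize() signature below are assumptions; verify them in strategy_base.py.

```python
# Hedged illustration only: assumes strategy_base exposes a get_strategy()
# classmethod mirroring set_strategy(), and that TokenizeStrategy has tokenize().
import strategy_base

tokenize_strategy = strategy_base.TokenizeStrategy.get_strategy()
tokens = tokenize_strategy.tokenize("a photo of a cat")  # signature is an assumption
```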
So we could collect these additional parameters in a dataclass and pass them in when calling. After that, only the text encoder's input parameters need to be passed separately, without encapsulating the entire model architecture.
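As a hedged sketch of that idea, the per-architecture differences kohya mentioned (maximum token length, pooled output vs. hidden states) become plain data handed to a shared encoding routine. All names and values here are illustrative, not taken from sd-scripts.

```python
# Hypothetical dataclass capturing per-architecture TE input parameters.
from dataclasses import dataclass


@dataclass
class TextEncoderParams:
    max_token_length: int
    use_pooled_output: bool
    use_hidden_states: bool
    hidden_state_layer: int = -1  # e.g. -2 for "penultimate layer" CLIP setups


# The same encoding code could then serve different architectures (values illustrative):
SD3_CLIP_L = TextEncoderParams(max_token_length=77, use_pooled_output=True, use_hidden_states=True)
FLUX_T5XXL = TextEncoderParams(max_token_length=512, use_pooled_output=False, use_hidden_states=True)
```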
Code: https://github.com/Alpha-VLLM/Lumina-Image-2.0
The model is currently very capable, and its Apache 2.0 license is also very friendly.
Compared to SANA, the renorm CFG method they adopt largely offsets the oversaturation and over-smoothing issues caused by synthetic training data.
Gemma 2B is also a very new LLM, which allows multilingual captions for training.
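For readers unfamiliar with the term, here is a rough sketch of the general idea behind norm-renormalized CFG: cap the norm of the guided prediction relative to the conditional prediction so that high guidance scales do not blow up saturation. This is an illustration of the concept only; the exact formulation used by Lumina-Image-2.0 should be checked in their repository.

```python
# Conceptual sketch of renorm CFG, not Lumina-Image-2.0's actual implementation.
import torch


def renorm_cfg(cond: torch.Tensor, uncond: torch.Tensor, scale: float, renorm: float = 1.0) -> torch.Tensor:
    guided = uncond + scale * (cond - uncond)  # standard classifier-free guidance
    dims = tuple(range(1, cond.ndim))
    cond_norm = cond.norm(p=2, dim=dims, keepdim=True)
    guided_norm = guided.norm(p=2, dim=dims, keepdim=True)
    max_norm = cond_norm * renorm  # allow at most `renorm` x the conditional norm
    factor = torch.clamp(max_norm / (guided_norm + 1e-8), max=1.0)
    return guided * factor  # rescale only when the guided norm exceeds the cap
```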