Support Lumina 2.0 #1924

Open
sdbds opened this issue Feb 7, 2025 · 11 comments

@sdbds
Contributor

sdbds commented Feb 7, 2025

Code: https://github.com/Alpha-VLLM/Lumina-Image-2.0

The model is currently very capable, and its Apache 2.0 license is also very friendly.

Compared to SANA, the renorm CFG method they adopt largely offsets the high saturation and over-smoothing issues introduced by synthetic data (a generic sketch of the idea follows at the end of this comment).

Gemma 2B is also a very recent LLM, which enables multilingual captions for training.
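For context, here is a minimal, generic sketch of the renorm/rescaled-CFG idea, assuming it amounts to capping the norm of the guided prediction at (a multiple of) the norm of the conditional prediction; the function name and clamping details are assumptions, not Lumina-Image-2.0's actual code:

    import torch

    def renorm_cfg(cond: torch.Tensor, uncond: torch.Tensor, guidance_scale: float,
                   renorm_scale: float = 1.0) -> torch.Tensor:
        # Standard classifier-free guidance.
        guided = uncond + guidance_scale * (cond - uncond)
        # Per-sample L2 norms over all non-batch dimensions.
        dims = tuple(range(1, guided.ndim))
        guided_norm = guided.norm(p=2, dim=dims, keepdim=True)
        cond_norm = cond.norm(p=2, dim=dims, keepdim=True)
        # Shrink the guided prediction only when its norm exceeds the (scaled)
        # norm of the conditional prediction; leave it untouched otherwise.
        scale = torch.clamp(renorm_scale * cond_norm / (guided_norm + 1e-8), max=1.0)
        return guided * scale

Capping the norm this way is the mechanism the comment above credits for reducing the over-saturated, overly smooth look associated with synthetic training data at high guidance scales.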

@rockerBOO
Contributor

I wonder if we could document what is required to implement training support for a new model. With the abstractions now in place for LoRA training across different models (SD3/SDXL/Flux), we can build on them to implement support for new models, maybe even to the point where third parties could add their own model support by implementing some base functionality.

For instance, we now have train_network, in which we implement support for each model and which we call into. We have strategy_base, which implements the caching and text encoder strategies. Then there are the lora_ modules under networks for the specific models (a rough skeleton of these hooks is sketched after this comment).

More to the point, it would be nice for implementing new models to be a collaborative process that doesn't all land on Kohya's shoulders, and one that isn't tied directly to this repo, so there is some flexibility in iteration. Those changes could then be adapted upstream into this repo with minimal changes to the implementation where desired.

The idea is that more experimental changes could be made, without conflict issues against this branch, by keeping a fork with the changes needed to support new models, while still coordinating with this repo on officially supported models so we don't step on each other's toes by implementing the same support twice.

This goes beyond this ticket, but I thought it was relevant because I have been working with these abstractions and want to consider whether such an approach would be possible.
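To make the structure above concrete, here is a hypothetical skeleton of a per-model trainer under these abstractions. Only the train_network / strategy_base module names and the get_tokenize_strategy / get_latents_caching_strategy hooks quoted later in this thread come from the existing code; the NetworkTrainer class name, the import path, and every Lumina* name are placeholders and assumptions, not real sd-scripts code:

    # Hypothetical skeleton only; see the caveats in the paragraph above.
    import train_network
    from library import strategy_base


    class LuminaTokenizeStrategy(strategy_base.TokenizeStrategy):
        # A real implementation would wrap the model's tokenizer and implement
        # the base class's abstract methods.
        pass


    class LuminaLatentsCachingStrategy(strategy_base.LatentsCachingStrategy):
        # A real implementation would define the cache file naming/format and
        # how latents are written to and read from disk.
        pass


    class LuminaNetworkTrainer(train_network.NetworkTrainer):
        # The subclass mainly answers "which strategies does this model use?";
        # the shared training loop lives in train_network itself.
        def get_tokenize_strategy(self, args):
            return LuminaTokenizeStrategy()

        def get_latents_caching_strategy(self, args):
            return LuminaLatentsCachingStrategy()

Documenting which of these hooks a new architecture must provide would go a long way toward letting third parties add model support.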

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

I'm implementing the relevant code; the main problem is that I'm not very familiar with the cache strategy.

@rockerBOO
Contributor

If you have any questions on the cache strategy, post them here. I just spent some time implementing a new cache strategy feature to allow outside embeddings to be cached.
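Not the actual feature mentioned above, but as a self-contained illustration of what caching text encoder outputs boils down to: hash the caption, save the encoder outputs to disk, and load them back instead of re-running the text encoder. All names here are hypothetical:

    import hashlib
    import os

    import numpy as np
    import torch


    class SimpleTextEncoderOutputCache:
        # Hypothetical helper, not part of sd-scripts' strategy classes.
        def __init__(self, cache_dir: str):
            self.cache_dir = cache_dir
            os.makedirs(cache_dir, exist_ok=True)

        def _path(self, caption: str) -> str:
            key = hashlib.sha256(caption.encode("utf-8")).hexdigest()
            return os.path.join(self.cache_dir, f"{key}_te.npz")

        def is_cached(self, caption: str) -> bool:
            return os.path.exists(self._path(caption))

        def save(self, caption: str, hidden_states: torch.Tensor) -> None:
            np.savez(self._path(caption), hidden_states=hidden_states.cpu().float().numpy())

        def load(self, caption: str) -> torch.Tensor:
            return torch.from_numpy(np.load(self._path(caption))["hidden_states"])

A real strategy would layer batching, disk-vs-memory options, and cache validation on top of a save/load cycle like this.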

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

Implementation is currently in progress in #1927.

@kohya-ss
Owner

I apologize for the delay in improving the implementation.

I think what rockerBOO said makes a lot of sense.

I think the current caching strategy is better than the chaos it was in before, but I still think there's a lot of room for improvement. I've started working on improvements in #1784, but it's still a long way off. I'd like to see it through somehow.

Musubi tuner (https://github.com/kohya-ss/musubi-tuner) has a much simpler implementation, so maybe sd-scripts can be simplified by dropping less-used features.

Please open an issue or discussion about the cache strategy if necessary.

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

Thank you @kohya for your long-term support, which is also why I have always used this repository.

I personally feel that the current caching strategy is built around the text encoder but organized under the overall model.

This means writing several nested abstract methods back and forth: first the TE is placed in strategy_model.py, and then it is called from model_train_util or model_util.

Could we separate the TE into its own model class, like what is done for the transformer?

Considering that future models will have increasingly diverse TEs to cache and combine with the DiT, it may be better to separate the LLM/CLIP from the overall model in the future (a rough sketch follows below).
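As a rough illustration of that separation (a hypothetical sketch; the class name, the use of transformers' Auto classes, and the choice of the last hidden state are assumptions, not existing sd-scripts code):

    import torch
    from transformers import AutoModel, AutoTokenizer


    class WrappedTextEncoder:
        # Hypothetical wrapper giving every text encoder (CLIP, T5, Gemma, ...)
        # the same encode() interface, independent of the DiT that consumes it.
        def __init__(self, model_name: str, max_token_length: int = 256):
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModel.from_pretrained(model_name)
            self.max_token_length = max_token_length

        @torch.no_grad()
        def encode(self, captions: list[str]) -> torch.Tensor:
            tokens = self.tokenizer(
                captions,
                max_length=self.max_token_length,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            # Which output (last hidden state, pooled output, a specific layer)
            # a given DiT consumes could be driven by per-model parameters.
            return self.model(**tokens).last_hidden_state

The caching and LoRA code could then target this interface instead of branching on the surrounding model architecture.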

@kohya-ss
Owner

I am truly grateful to all the contributors.

> Could we separate the TE into its own model class, like what is done for the transformer?

I may not fully understand... For the text encoder, we use the transformers model almost as is. Is the idea to create a model that wraps the transformer's text encoder for each architecture, rather than calling the text encoder from the strategy?

@sdbds
Contributor Author

sdbds commented Feb 15, 2025

[Screenshot: current cache strategy code]

I mean that, in terms of the cache strategy, the current approach is to detect the model and then assign the TextEncoder type according to the model architecture.

As more models are added in the future (such as HunYuan, Lumina, SANA, etc.), could we choose the caching strategy directly based on the TextEncoder's parameters? That way the strategy could be reused when another model comes along, just like the VAE cache.

Otherwise we must keep adding per-model judgment flags like is_xxx... (the pattern is illustrated in the sketch after this comment).

[Screenshot: LoRA module code]

The same happens in the LoRA modules: we end up adding a TextEncoder entry for each different model.
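For illustration, the branching pattern being described looks roughly like this (purely hypothetical code, not quoted from sd-scripts):

    # Every new architecture forces another is_xxx flag and another branch.
    def select_text_encoder_strategy(args):
        if getattr(args, "is_sd3", False):
            return "sd3_text_encoder_strategy"
        if getattr(args, "is_flux", False):
            return "flux_text_encoder_strategy"
        if getattr(args, "is_lumina", False):
            return "lumina_text_encoder_strategy"
        raise ValueError("unknown model: yet another is_xxx flag would be needed")

The parameter-driven alternative is sketched after the dataclass suggestion further down the thread.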

@kohya-ss
Owner

> Could we choose the caching strategy directly based on the TextEncoder's parameters?

That makes sense. I think the difficult point is that even if the same Text Encoder is used (for example SD3 and FLUX.1), the handling differs depending on the model architecture (maximum token length, whether to use one or both of the pooled output and the hidden states, etc.).

For VAE, I think we already have different strategies for different model architectures.

@rockerBOO
Contributor

There is a singleton factory for the different strategies in the new system, so these might just be legacy as we move away from them?

        tokenize_strategy = self.get_tokenize_strategy(args)
        strategy_base.TokenizeStrategy.set_strategy(tokenize_strategy)
        tokenizers = self.get_tokenizers(tokenize_strategy)  # will be removed after sample_image is refactored

        latents_caching_strategy = self.get_latents_caching_strategy(args)
        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)

@sdbds
Contributor Author

sdbds commented Feb 16, 2025

> > Could we choose the caching strategy directly based on the TextEncoder's parameters?
>
> That makes sense. I think the difficult point is that even if the same Text Encoder is used (for example SD3 and FLUX.1), the handling differs depending on the model architecture (maximum token length, whether to use one or both of the pooled output and the hidden states, etc.).
>
> For VAE, I think we already have different strategies for different model architectures.

So we could put these additional parameters into a dataclass and pass them in when calling.

Then only the text encoder's input parameters need to be passed in, without encapsulating the entire model architecture.
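A minimal sketch of that dataclass idea, covering the per-architecture differences mentioned above (maximum token length, pooled output vs. hidden states); the field names and example values are assumptions, not existing sd-scripts code:

    from dataclasses import dataclass


    @dataclass
    class TextEncoderCacheParams:
        # Hypothetical fields; the values below are illustrative only.
        max_token_length: int
        use_pooled_output: bool
        use_hidden_states: bool
        hidden_state_layer: int = -1  # e.g. -2 for the penultimate layer


    # Each architecture then contributes plain data instead of a new is_xxx branch:
    LUMINA_TE_PARAMS = TextEncoderCacheParams(
        max_token_length=256, use_pooled_output=False, use_hidden_states=True
    )

A shared caching routine could consume such a parameter object directly, so adding a new model would mostly mean defining a new parameter set.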
