Support Lumina 2.0 #1924

Open
sdbds opened this issue Feb 7, 2025 · 11 comments

@sdbds
Contributor

sdbds commented Feb 7, 2025

Code: https://github.com/Alpha-VLLM/Lumina-Image-2.0

The model is currently very capable, and its Apache 2.0 license is also very friendly.

Compared to SANA, the renorm CFG method they adopt largely offsets the high saturation and over-smoothing issues introduced by synthetic data (a generic sketch of the idea follows at the end of this comment).

Gemma 2B is also a very recent LLM, which enables multilingual captions for training.
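For context, here is a minimal, generic sketch of the renorm/rescaled-CFG idea, assuming it amounts to capping the norm of the guided prediction at (a multiple of) the norm of the conditional prediction; the function name and clamping details are assumptions, not Lumina-Image-2.0's actual code:

    import torch

    def renorm_cfg(cond: torch.Tensor, uncond: torch.Tensor, guidance_scale: float,
                   renorm_scale: float = 1.0) -> torch.Tensor:
        # Standard classifier-free guidance.
        guided = uncond + guidance_scale * (cond - uncond)
        # Per-sample L2 norms over all non-batch dimensions.
        dims = tuple(range(1, guided.ndim))
        guided_norm = guided.norm(p=2, dim=dims, keepdim=True)
        cond_norm = cond.norm(p=2, dim=dims, keepdim=True)
        # Shrink the guided prediction only when its norm exceeds the (scaled)
        # norm of the conditional prediction; leave it untouched otherwise.
        scale = torch.clamp(renorm_scale * cond_norm / (guided_norm + 1e-8), max=1.0)
        return guided * scale

Capping the norm this way is the mechanism the comment above credits for reducing the over-saturated, overly smooth look associated with synthetic training data at high guidance scales.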

@rockerBOO
Contributor

I wonder if we could document what is required to implement training support for a new model. With the abstractions now in place for LoRA training across different models (SD3/SDXL/Flux), we can build on them to implement support for new models, maybe even to the point where third parties could add their own model support by implementing some base functionality.

For instance, we now have train_network, in which we implement support for each model and which we call into. We have strategy_base, which implements the caching and text encoder strategies. Then there are the lora_ modules under networks for the specific models (a rough skeleton of these hooks is sketched after this comment).

More to the point, it would be nice for implementing new models to be a collaborative process that doesn't all land on Kohya's shoulders, and one that isn't tied directly to this repo, so there is some flexibility in iteration. Those changes could then be adapted upstream into this repo with minimal changes to the implementation where desired.

The idea is that more experimental changes could be made, without conflict issues against this branch, by keeping a fork with the changes needed to support new models, while still coordinating with this repo on officially supported models so we don't step on each other's toes by implementing the same support twice.

This goes beyond this ticket, but I thought it was relevant because I have been working with these abstractions and want to consider whether such an approach would be possible.
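To make the structure above concrete, here is a hypothetical skeleton of a per-model trainer under these abstractions. Only the train_network / strategy_base module names and the get_tokenize_strategy / get_latents_caching_strategy hooks quoted later in this thread come from the existing code; the NetworkTrainer class name, the import path, and every Lumina* name are placeholders and assumptions, not real sd-scripts code:

    # Hypothetical skeleton only; see the caveats in the paragraph above.
    import train_network
    from library import strategy_base


    class LuminaTokenizeStrategy(strategy_base.TokenizeStrategy):
        # A real implementation would wrap the model's tokenizer and implement
        # the base class's abstract methods.
        pass


    class LuminaLatentsCachingStrategy(strategy_base.LatentsCachingStrategy):
        # A real implementation would define the cache file naming/format and
        # how latents are written to and read from disk.
        pass


    class LuminaNetworkTrainer(train_network.NetworkTrainer):
        # The subclass mainly answers "which strategies does this model use?";
        # the shared training loop lives in train_network itself.
        def get_tokenize_strategy(self, args):
            return LuminaTokenizeStrategy()

        def get_latents_caching_strategy(self, args):
            return LuminaLatentsCachingStrategy()

Documenting which of these hooks a new architecture must provide would go a long way toward letting third parties add model support.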

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

I'm implementing the relevant code; the main problem is that I'm not very familiar with the cache strategy.

@rockerBOO
Contributor

If you have any questions on the cache strategy, post them here. I just spent some time implementing a new cache strategy feature to allow outside embeddings to be cached.
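Not the actual feature mentioned above, but as a self-contained illustration of what caching text encoder outputs boils down to: hash the caption, save the encoder outputs to disk, and load them back instead of re-running the text encoder. All names here are hypothetical:

    import hashlib
    import os

    import numpy as np
    import torch


    class SimpleTextEncoderOutputCache:
        # Hypothetical helper, not part of sd-scripts' strategy classes.
        def __init__(self, cache_dir: str):
            self.cache_dir = cache_dir
            os.makedirs(cache_dir, exist_ok=True)

        def _path(self, caption: str) -> str:
            key = hashlib.sha256(caption.encode("utf-8")).hexdigest()
            return os.path.join(self.cache_dir, f"{key}_te.npz")

        def is_cached(self, caption: str) -> bool:
            return os.path.exists(self._path(caption))

        def save(self, caption: str, hidden_states: torch.Tensor) -> None:
            np.savez(self._path(caption), hidden_states=hidden_states.cpu().float().numpy())

        def load(self, caption: str) -> torch.Tensor:
            return torch.from_numpy(np.load(self._path(caption))["hidden_states"])

A real strategy would layer batching, disk-vs-memory options, and cache validation on top of a save/load cycle like this.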

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

Implementation is currently in progress in #1927.

@kohya-ss
Owner

I apologize for the delay in improving the implementation.

I think what rockerBOO said makes a lot of sense.

I think the current caching strategy is better than the chaos it was in before, but I still think there's a lot of room for improvement. I've started working on improvements in #1784, but it's still a long way off. I'd like to see it through somehow.

Musubi tuner (https://github.com/kohya-ss/musubi-tuner) has a much simpler implementation, so maybe sd-scripts can be simplified by dropping less-used features.

Please open an issue or discussion about the cache strategy if necessary.

@sdbds
Contributor Author

sdbds commented Feb 12, 2025

Thank you @kohya for your long-term support, which is also why I have always used this repository.

I personally feel that the current caching strategy is built around the text encoder but organized under the overall model.

This means writing several nested abstract methods back and forth: first the TE is placed in strategy_model.py, and then it is called from model_train_util or model_util.

Could we separate the TE into its own model class, like what is done for the transformer?

Considering that future models will have increasingly diverse TEs to cache and combine with the DiT, it may be better to separate the LLM/CLIP from the overall model in the future (a rough sketch follows below).
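As a rough illustration of that separation (a hypothetical sketch; the class name, the use of transformers' Auto classes, and the choice of the last hidden state are assumptions, not existing sd-scripts code):

    import torch
    from transformers import AutoModel, AutoTokenizer


    class WrappedTextEncoder:
        # Hypothetical wrapper giving every text encoder (CLIP, T5, Gemma, ...)
        # the same encode() interface, independent of the DiT that consumes it.
        def __init__(self, model_name: str, max_token_length: int = 256):
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModel.from_pretrained(model_name)
            self.max_token_length = max_token_length

        @torch.no_grad()
        def encode(self, captions: list[str]) -> torch.Tensor:
            tokens = self.tokenizer(
                captions,
                max_length=self.max_token_length,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            # Which output (last hidden state, pooled output, a specific layer)
            # a given DiT consumes could be driven by per-model parameters.
            return self.model(**tokens).last_hidden_state

The caching and LoRA code could then target this interface instead of branching on the surrounding model architecture.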

@kohya-ss
Owner

I am truly grateful to all the contributors.

> Could we separate the TE into its own model class, like what is done for the transformer?

I may not fully understand... For the text encoder, we use the transformers model almost as is. Is the idea to create a model that wraps the transformer's text encoder for each architecture, rather than calling the text encoder from the strategy?

@sdbds
Contributor Author

sdbds commented Feb 15, 2025

[Screenshot: current cache strategy code]

I mean that, in terms of the cache strategy, the current approach is to detect the model and then assign the TextEncoder type according to the model architecture.

As more models are added in the future (such as HunYuan, Lumina, SANA, etc.), could we choose the caching strategy directly based on the TextEncoder's parameters? That way the strategy could be reused when another model comes along, just like the VAE cache.

Otherwise we must keep adding per-model judgment flags like is_xxx... (the pattern is illustrated in the sketch after this comment).

[Screenshot: LoRA module code]

The same happens in the LoRA modules: we end up adding a TextEncoder entry for each different model.
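For illustration, the branching pattern being described looks roughly like this (purely hypothetical code, not quoted from sd-scripts):

    # Every new architecture forces another is_xxx flag and another branch.
    def select_text_encoder_strategy(args):
        if getattr(args, "is_sd3", False):
            return "sd3_text_encoder_strategy"
        if getattr(args, "is_flux", False):
            return "flux_text_encoder_strategy"
        if getattr(args, "is_lumina", False):
            return "lumina_text_encoder_strategy"
        raise ValueError("unknown model: yet another is_xxx flag would be needed")

The parameter-driven alternative is sketched after the dataclass suggestion further down the thread.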

@kohya-ss
Owner

> Could we choose the caching strategy directly based on the TextEncoder's parameters?

That makes sense. I think the difficult point is that even if the same Text Encoder is used (for example SD3 and FLUX.1), the handling differs depending on the model architecture (maximum token length, whether to use one or both of the pooled output and the hidden states, etc.).

For VAE, I think we already have different strategies for different model architectures.

@rockerBOO
Contributor

There is a singleton factory for the different strategies in the new system, so these might just be legacy as we move away from them?

        tokenize_strategy = self.get_tokenize_strategy(args)
        strategy_base.TokenizeStrategy.set_strategy(tokenize_strategy)
        tokenizers = self.get_tokenizers(tokenize_strategy)  # will be removed after sample_image is refactored

        latents_caching_strategy = self.get_latents_caching_strategy(args)
        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)

@sdbds
Contributor Author

sdbds commented Feb 16, 2025

> > Could we choose the caching strategy directly based on the TextEncoder's parameters?
>
> That makes sense. I think the difficult point is that even if the same Text Encoder is used (for example SD3 and FLUX.1), the handling differs depending on the model architecture (maximum token length, whether to use one or both of the pooled output and the hidden states, etc.).
>
> For VAE, I think we already have different strategies for different model architectures.

So we could put these additional parameters into a dataclass and pass them in when calling.

Then only the text encoder's input parameters need to be passed in, without encapsulating the entire model architecture.
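A minimal sketch of that dataclass idea, covering the per-architecture differences mentioned above (maximum token length, pooled output vs. hidden states); the field names and example values are assumptions, not existing sd-scripts code:

    from dataclasses import dataclass


    @dataclass
    class TextEncoderCacheParams:
        # Hypothetical fields; the values below are illustrative only.
        max_token_length: int
        use_pooled_output: bool
        use_hidden_states: bool
        hidden_state_layer: int = -1  # e.g. -2 for the penultimate layer


    # Each architecture then contributes plain data instead of a new is_xxx branch:
    LUMINA_TE_PARAMS = TextEncoderCacheParams(
        max_token_length=256, use_pooled_output=False, use_hidden_states=True
    )

A shared caching routine could consume such a parameter object directly, so adding a new model would mostly mean defining a new parameter set.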
