Welcome to DALLE-pytorch Discussions! #14
10 comments · 8 replies
-
Thanks for starting the discussion, @lucidrains! I'm Meng Lee, a machine learning engineer based in Tokyo, and I also write a blog (in Chinese) at leemeng.tw.
-
Thanks for this and all the efforts in recreating this amazing work! I'm Theodore, and I'm only adjacent to the ML field. My work focuses on developing ML workflows for architectural design, and I'm really interested in exploring the potential of DALL-E in generative design through natural language.
-
So I was looking at CLIP and contrastive bag-of-words objectives, and I had a thought and a question; I hope this is a good place for it. Would an adversarial type of training be interesting for this text+image domain? Something like learning representations by paraphrasing the text input (there have been a few papers on that recently, like MARGE from Facebook). Would that sort of shift in the space between the two text representations also be coupled with a shift in the visual-domain representations? Could it be a way to contrastively train a DALL-E kind of system (if that is even a thing)? I apologize in advance if I'm using some terms loosely here; I'm not an expert in this by any means.
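For concreteness, here is a minimal sketch of the kind of contrastive (InfoNCE) objective CLIP uses to couple text and image representations, which is the coupling the question is about. The embedding shapes and temperature value are illustrative assumptions, not anything from this repo:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from any two encoders;
    matching rows are treated as positive pairs, all other rows as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i + loss_t) / 2
```

A paraphrased caption could, in principle, be fed through the same text encoder and treated as an extra positive, which is roughly the coupling the question asks about.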
-
To foster this discussion, I want to mention that yesterday I stumbled upon a recent project from fellow southern-Germany researchers in Heidelberg: https://compvis.github.io/taming-transformers/. Aside from the amazing visuals, the repository contains rich examples and code for the very purpose of generating high-resolution images by means of quantization. It seems that OpenAI and the Heidelberg group came up with this codebook technique at the same time.
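As a rough illustration of the codebook technique mentioned above, here is a minimal sketch of the nearest-neighbour quantization step shared by VQ-VAE-style models and the taming-transformers approach; tensor names and shapes are illustrative only, not code from either repo:

```python
import torch

def quantize(z, codebook):
    """Map continuous encoder outputs to their nearest codebook entries.

    z:        (batch, n_tokens, dim) continuous latents from an encoder
    codebook: (vocab_size, dim) learnable embedding table
    Returns the quantized latents plus the discrete indices that a
    transformer would later be trained to predict autoregressively.
    """
    # Squared L2 distance between every latent vector and every codebook entry
    dists = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ codebook.t()
             + codebook.pow(2).sum(-1))                  # (batch, n_tokens, vocab_size)
    indices = dists.argmin(dim=-1)                       # (batch, n_tokens)
    z_q = codebook[indices]                              # (batch, n_tokens, dim)
    # Straight-through estimator so gradients still reach the encoder
    z_q = z + (z_q - z).detach()
    return z_q, indices
```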
-
Maybe this is a more appropriate place to ask this question. I am wondering: if instead of text one has another image modality (say, the left image of a stereo camera pair, where the right image has been used to train the VAE), how would one go about using this in DALL-E? According to the discussion section of this repo, the conditioning image has to be tokenized. I am contemplating whether it makes more sense to use another VAE for the second stream of images and rely on its resulting codebook indices, or whether it is more reasonable to use e.g. a ViT prior to token concatenation and feeding into the main transformer of DALL-E. Maybe even a simple trainable ViT embedding layer within the forward pass of DALL-E, before the concatenation step, would suffice? I am just spitballing here and would be grateful for yours or anyone else's take on this.
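To make the first option concrete, here is a toy sketch (not DALLE-pytorch's actual API) of embedding the left image's codebook indices from a second, separately trained VAE and concatenating them in front of the right image's tokens, analogous to how DALL-E prepends text tokens; all names and sizes below are hypothetical:

```python
import torch
import torch.nn as nn

# Illustrative sizes only
LEFT_VOCAB, RIGHT_VOCAB, DIM = 8192, 8192, 512

class StereoConditioner(nn.Module):
    """Toy sketch: embed the left image's discrete codes (from a second VAE)
    and concatenate them before the right image's tokens, so the main
    transformer attends over [conditioning tokens, target tokens]."""

    def __init__(self):
        super().__init__()
        self.left_emb = nn.Embedding(LEFT_VOCAB, DIM)    # conditioning stream
        self.right_emb = nn.Embedding(RIGHT_VOCAB, DIM)  # generation target

    def forward(self, left_indices, right_indices):
        # left_indices, right_indices: (batch, seq_len) long tensors of codebook indices
        left_tokens = self.left_emb(left_indices)
        right_tokens = self.right_emb(right_indices)
        # The concatenated sequence plays the role that text + image tokens play in DALL-E
        return torch.cat((left_tokens, right_tokens), dim=1)
```

The ViT-based alternative would simply replace `left_emb` with a patch encoder producing continuous tokens instead of discrete codebook indices.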
-
Hi. Can you steer me a bit on the CLIP part? I didn't get it from the paper. Is there a way to use pretrained CLIP and then fine-tune it on a custom dataset? If I have outputs from different GANs, say a few dozen pics, and the task is to pick the most relevant pic, can I pass these image embeddings to CLIP without fine-tuning? How much data and compute is required to train the released CLIP?
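One way to do the ranking without any fine-tuning is to score every candidate with the released CLIP model (openai/CLIP) and keep the image whose embedding is closest to the prompt's. A minimal sketch, where the file paths and the prompt are placeholders:

```python
import clip                    # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder paths to a few dozen candidate GAN outputs
candidates = ["gan_out_00.png", "gan_out_01.png", "gan_out_02.png"]
prompt = "a red armchair in the shape of an avocado"

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
    image_features = model.encode_image(images)
    text_features = model.encode_text(clip.tokenize([prompt]).to(device))

    # Cosine similarity between the prompt and every candidate image
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.t()).squeeze(-1)

best = candidates[scores.argmax().item()]
print(f"most relevant image: {best}")
```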
-
Hello, I am really interested in DALL-E and exploring that sort of thing, but I have absolutely no knowledge in terms of coding or anything of that nature, and this question is probably going to be pretty dumb. Is this the OpenAI portion of DALL-E? Can I use it like the examples on the OpenAI website, by giving prompts and having images generated? And if so, how do I go about installing or interacting with something like this? Thanks.
-
I guess the DALL-E paper and the dVAE code with a pretrained model are out. 🎉
-
Hi all, thank you for your amazing work!
-
Good morning, and thank you so much for sharing this project. I'm new to machine learning in general, so apologies from the start if my questions are trivial. I have some difficulties reading the readme.
Thank you very much.
-
👋 Welcome!
We’re using Discussions as a place to connect with other members of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others and are open-minded. Remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.