Using multimodal processor vs. separate tokenizer and image_processor #6

joris-sense · 2024-08-28T07:47:57Z

The model currently returns a tokenizer separately from an image_processor, as seen e.g. here. The Hugging Face "preferred way" seems to be a single multimodal processor that handles text and images in one call (see here, or the sample code for Idefics3 here). This is causing me trouble because I am trying to use llava-more for structured text generation with the outlines package, which assumes a single multimodal processor rather than separate tokenizer and image_processor objects (see e.g. this line of code). That said, Idefics3 currently doesn't seem to work either, because of incompatible inputs to the processor.
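To make the request concrete, here is a minimal sketch of the kind of adapter that could sit between the two objects and code expecting one processor. `CombinedProcessor` is a hypothetical name, not something that exists in llava-more, transformers, or outlines. It assumes the tokenizer and the image_processor are both callables that return dict-like outputs (as Hugging Face tokenizers and image processors do), and that the caller uses the common `processor(text=..., images=...)` calling convention.

```python
from typing import Any, Callable, Dict


class CombinedProcessor:
    """Hypothetical adapter: one call that runs a tokenizer and an image_processor."""

    def __init__(self, tokenizer: Callable[..., Any], image_processor: Callable[..., Any]):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None, **kwargs) -> Dict[str, Any]:
        if text is None and images is None:
            raise ValueError("Provide text, images, or both.")
        outputs: Dict[str, Any] = {}
        if text is not None:
            # Assumption: returns a mapping such as {"input_ids": ..., "attention_mask": ...}
            outputs.update(self.tokenizer(text, **kwargs))
        if images is not None:
            # Assumption: returns a mapping such as {"pixel_values": ...}
            outputs.update(self.image_processor(images, **kwargs))
        return outputs
```

Usage would look like `processor = CombinedProcessor(tokenizer, image_processor)` followed by `processor(text=prompt, images=image, return_tensors="pt")`. This only merges the two outputs. It does not insert or expand image placeholder tokens in the prompt, which LLaVA-style models may require, and I have not verified that outlines accepts such an object.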
