initial commit of vlm #151

Open
davidkoski wants to merge 11 commits into main

Conversation

@davidkoski (Collaborator) commented Nov 1, 2024

Note: this is not ready for use, but feel free to comment! The PaliGemma model loads and "runs" but doesn't produce valid output. The structure of load/evaluation is getting close.

@@ -1,30 +1,7 @@
// Copyright © 2024 Apple Inc.

import Foundation

public enum StringOrNumber: Codable, Equatable, Sendable {

davidkoski (Collaborator, Author):
move to LMCommon


/// Container for models that guarantees single threaded access.

davidkoski (Collaborator, Author):
Move to ModelContainer

}
}
}
// TODO move? these cause some ambiguity -- how to resolve?

davidkoski (Collaborator, Author):
I was playing around with these to avoid breaking API -- moving types into LMCommon means callers will need to import LMCommon if they refer to them. This (the aliases) caused more trouble than I think it is worth
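
As an illustration of the alias approach that was tried, the re-export looks roughly like this (a sketch, not code from this PR):

import MLXLMCommon

// In the LLM module, after StringOrNumber has moved to MLXLMCommon:
// re-exporting the name keeps existing callers compiling without a new import.
public typealias StringOrNumber = MLXLMCommon.StringOrNumber

// However, a caller that imports both modules now sees the same name through
// two paths, and unqualified references can become ambiguous -- the trouble
// mentioned above.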

@@ -3,6 +3,7 @@
import Foundation
@preconcurrency import Hub
import MLX
import MLXLMCommon
import MLXNN
import MLXRandom
import Tokenizers

davidkoski (Collaborator, Author):
Ultimately I would like this to move into LMCommon -- I think it can support both LLM and VLM models, but I didn't get a chance to move this yet.

import MLXNN
import MLXOptimizers
import MLXRandom
import Tokenizers

/// Layers to apply LoRA adapters to.

davidkoski (Collaborator, Author):
Move to LMCommon

return y + scale * z
}
}

/// Equivalent to `lora.py/iterate_batches()`. Used internally by ``LoRATrain``.
struct LoRABatchIterator: Sequence, IteratorProtocol {

davidkoski (Collaborator, Author):
Ideally the rest of this moves to LMCommon as well -- I think it can.

mutating func prompt(_ prompt: MLXArray)
func process(logits: MLXArray) -> MLXArray
mutating func didSample(token: MLXArray)
}

davidkoski (Collaborator, Author):
The generate / step code has been refactored a bit and can now take custom logit samplers and processors
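
As a rough sketch of what a custom processor could look like against the shape shown above (the protocol is restated with an assumed name so the example is self-contained; the actual names in this PR may differ):

import MLX

// The processor shape from the diff above, restated for the sketch.
protocol LogitProcessor {
    mutating func prompt(_ prompt: MLXArray)
    func process(logits: MLXArray) -> MLXArray
    mutating func didSample(token: MLXArray)
}

// Example conformance: scale logits by 1/temperature before sampling.
struct TemperatureLogitProcessor: LogitProcessor {
    let temperature: Float

    mutating func prompt(_ prompt: MLXArray) {
        // nothing to track for this processor
    }

    func process(logits: MLXArray) -> MLXArray {
        logits * (1 / temperature)
    }

    mutating func didSample(token: MLXArray) {
        // not needed here
    }
}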

public init(
prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
parameters: GenerateParameters
) throws {

davidkoski (Collaborator, Author):
This now takes either a prompt (MLXArray) or an LMInput (text + image + ...) via multiple initializers.
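
Sketched shape of the two initializers described here (the LMInput variant's exact labels are an assumption based on the description, not copied from the PR):

public init(
    prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
    parameters: GenerateParameters
) throws

public init(
    input: LMInput, model: any LanguageModel, cache: [KVCache]? = nil,
    parameters: GenerateParameters
) throws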

}
}

public struct LMInput {

davidkoski (Collaborator, Author):
A new union type that holds the different inputs to generate() and LanguageModel.prepare()
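
Conceptually it is something like the following sketch (field names and nested types are illustrative, not the ones in this PR): text tokens are always present, image inputs are optional, so LLMs and VLMs can share one input type.

import MLX

public struct LMInput {
    public struct Text {
        public let tokens: MLXArray        // tokenized prompt
    }

    public struct ProcessedImage {
        public let pixels: MLXArray        // normalized image tensor(s)
    }

    public let text: Text
    public let image: ProcessedImage?      // nil for text-only models
}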

}
}

public struct LMOutput {

davidkoski (Collaborator, Author):
Union type for the output. Some of the VLMs return additional state, which is represented here.
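
Again only a sketch of the idea (names are assumptions): logits plus an optional bag of model-specific state that a VLM can thread through to the next step.

import MLX

public struct LMOutput {
    public let logits: MLXArray

    // Opaque extra state a model can hand back and receive on the next call,
    // e.g. cross-attention inputs for some VLMs.
    public struct State {
        public let values: [String: MLXArray]
    }

    public let state: State?
}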

@@ -134,6 +135,7 @@ extension ModelConfiguration {
extraEOSTokens: ["<|end|>"]
)

// TODO the prompt formatter is replaced by the chat template

davidkoski (Collaborator, Author):
Or is it? #150


import CoreImage
import Foundation
import MLX

davidkoski (Collaborator, Author):
This file may be deleted -- it was some notes & thoughts along the way

// Copyright © 2024 Apple Inc.

import Foundation
import MLX

davidkoski (Collaborator, Author):
Also to be deleted -- LMInput replaces this

private let context = CIContext()

// TODO documentation
public enum MediaProcessing {

davidkoski (Collaborator, Author):
Needs documentation, but see PaliGemmaImageProvider which implements

SiglipImageProcessor {
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_seq_length": 1024,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "processor_class": "PaliGemmaProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 448,
    "width": 448
  }
}

from the python transformers code.
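
For reference, the handful of fields the image provider needs from that preprocessor_config.json map naturally onto a small Codable struct; a sketch (type and property names here are illustrative):

import Foundation

struct SiglipProcessorConfiguration: Codable {
    struct Size: Codable {
        let height: Int
        let width: Int
    }

    let imageMean: [Float]        // [0.5, 0.5, 0.5]
    let imageStd: [Float]         // [0.5, 0.5, 0.5]
    let size: Size                // 448 x 448
    let imageSequenceLength: Int  // 1024

    enum CodingKeys: String, CodingKey {
        case imageMean = "image_mean"
        case imageStd = "image_std"
        case size
        case imageSequenceLength = "image_seq_length"
    }
}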

import MLXNN
import Tokenizers

// MARK: - Language

davidkoski (Collaborator, Author):
Note: this builds, loads weights and "runs" but doesn't produce any output -- still needs to be debugged.

davidkoski (Collaborator, Author):
it should be usable as an example of the structure I think we need

}
}

// TODO does not support multiple images -- how do we represent?

davidkoski (Collaborator, Author):
We need a protocol for the image and text processing pieces.
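
A minimal sketch of the kind of protocol this could be (the name and shape are assumptions; accepting an array of images is one possible answer to the multiple-image TODO above, but that is exactly what is still undecided):

import CoreImage

// Hypothetical protocol: turn user-facing text + image(s) into an LMInput
// ready for LanguageModel.prepare() / generate().
protocol UserInputProcessing {
    func prepare(prompt: String, images: [CIImage]) throws -> LMInput
}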

image = MediaProcessing.inSRGBToneCurveSpace(image)

image = MediaProcessing.resampleBicubic(image, to: .init(width: size, height: size))
image = MediaProcessing.normalize(image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))

davidkoski (Collaborator, Author) commented Nov 1, 2024:
(Same SiglipImageProcessor configuration as quoted above -- image_mean and image_std of [0.5, 0.5, 0.5], 448 x 448 size, bicubic resample -- which is where these values come from.)

}
}

private func loadConfiguration(url: URL) throws -> PaliGemma {

davidkoski (Collaborator, Author):
These next couple of functions are just stubs to let me try it out -- this will work much like the LLM models
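
In the LLM code the equivalent step decodes the model's config.json into a configuration struct and builds the model from it; a sketch of the same flow here (the configuration type and its init are assumptions, not code from this PR):

import Foundation

private func loadConfiguration(url: URL) throws -> PaliGemma {
    // Decode config.json from the model directory...
    let configurationURL = url.appendingPathComponent("config.json")
    let data = try Data(contentsOf: configurationURL)
    let configuration = try JSONDecoder().decode(
        PaliGemmaConfiguration.self, from: data)

    // ...and construct the model from it, assuming an init that takes the
    // decoded configuration, mirroring the LLM factories.
    return PaliGemma(configuration)
}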

private let _ropeTheta: Float?
public var ropeTheta: Float { _ropeTheta ?? 10_000 }
public let _ropeTraditional: Bool?
public var ropeTraditional: Bool { _ropeTraditional ?? false }

davidkoski (Collaborator, Author) commented Nov 1, 2024:
Rather than doing the full implementation of Codable I went a simpler route for default values. Less code, cleaner (I think)
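
Spelled out, the pattern is: store the decoded value as a private optional, expose a computed property that applies the default, and map the JSON key via CodingKeys, so decoding stays synthesized. A sketch using the fields from the hunk above (the struct name and JSON keys are assumptions):

public struct TextConfiguration: Codable {
    // Decoded as optionals so a missing key does not fail decoding...
    private let _ropeTheta: Float?
    private let _ropeTraditional: Bool?

    // ...and the defaults live in computed properties instead of init(from:).
    public var ropeTheta: Float { _ropeTheta ?? 10_000 }
    public var ropeTraditional: Bool { _ropeTraditional ?? false }

    enum CodingKeys: String, CodingKey {
        case _ropeTheta = "rope_theta"
        case _ropeTraditional = "rope_traditional"
    }
}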

@Option var path: URL

@MainActor
mutating func run() async throws {

davidkoski (Collaborator, Author):
Just stub code to exercise the model. This still needs the input processing layers, in particular the prompt processing. The image processing is in place but will need to be wrapped up API-wise.
