initial commit of vlm #151

Open
davidkoski wants to merge 11 commits into main

Conversation

@davidkoski (Collaborator) commented Nov 1, 2024

Note: this is not ready for use, but feel free to comment! The PaliGemma model loads and "runs" but doesn't produce valid output. The structure of load/evaluation is getting close.

@@ -1,30 +1,7 @@
// Copyright © 2024 Apple Inc.

import Foundation

public enum StringOrNumber: Codable, Equatable, Sendable {

davidkoski (Collaborator, Author):
move to LMCommon


/// Container for models that guarantees single threaded access.

davidkoski (Collaborator, Author):
Move to ModelContainer

}
}
}
// TODO move? these cause some ambiguity -- how to resolve?

davidkoski (Collaborator, Author):
I was playing around with these to avoid breaking API -- moving types into LMCommon means callers will need to import LMCommon if they refer to them. This (the aliases) caused more trouble than I think it is worth
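
As an illustration of the alias approach that was tried, the re-export looks roughly like this (a sketch, not code from this PR):

import MLXLMCommon

// In the LLM module, after StringOrNumber has moved to MLXLMCommon:
// re-exporting the name keeps existing callers compiling without a new import.
public typealias StringOrNumber = MLXLMCommon.StringOrNumber

// However, a caller that imports both modules now sees the same name through
// two paths, and unqualified references can become ambiguous -- the trouble
// mentioned above.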

@@ -3,6 +3,7 @@
import Foundation
@preconcurrency import Hub
import MLX
import MLXLMCommon
import MLXNN
import MLXRandom
import Tokenizers

davidkoski (Collaborator, Author):
Ultimately I would like this to move into LMCommon -- I think it can support both LLM and VLM models, but I didn't get a chance to move this yet.

import MLXNN
import MLXOptimizers
import MLXRandom
import Tokenizers

/// Layers to apply LoRA adapters to.

davidkoski (Collaborator, Author):
Move to LMCommon

return y + scale * z
}
}

/// Equivalent to `lora.py/iterate_batches()`. Used internally by ``LoRATrain``.
struct LoRABatchIterator: Sequence, IteratorProtocol {

davidkoski (Collaborator, Author):
Ideally the rest of this moves to LMCommon as well -- I think it can.

mutating func prompt(_ prompt: MLXArray)
func process(logits: MLXArray) -> MLXArray
mutating func didSample(token: MLXArray)
}

davidkoski (Collaborator, Author):
The generate / step code has been refactored a bit and can now take custom logit samplers and processors
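
As a rough sketch of what a custom processor could look like against the shape shown above (the protocol is restated with an assumed name so the example is self-contained; the actual names in this PR may differ):

import MLX

// The processor shape from the diff above, restated for the sketch.
protocol LogitProcessor {
    mutating func prompt(_ prompt: MLXArray)
    func process(logits: MLXArray) -> MLXArray
    mutating func didSample(token: MLXArray)
}

// Example conformance: scale logits by 1/temperature before sampling.
struct TemperatureLogitProcessor: LogitProcessor {
    let temperature: Float

    mutating func prompt(_ prompt: MLXArray) {
        // nothing to track for this processor
    }

    func process(logits: MLXArray) -> MLXArray {
        logits * (1 / temperature)
    }

    mutating func didSample(token: MLXArray) {
        // not needed here
    }
}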

public init(
prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
parameters: GenerateParameters
) throws {

davidkoski (Collaborator, Author):
This now takes either a prompt (MLXArray) or an LMInput (text + image + ...) via multiple initializers.
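
Sketched shape of the two initializers described here (the LMInput variant's exact labels are an assumption based on the description, not copied from the PR):

public init(
    prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
    parameters: GenerateParameters
) throws

public init(
    input: LMInput, model: any LanguageModel, cache: [KVCache]? = nil,
    parameters: GenerateParameters
) throws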

}
}

public struct LMInput {

davidkoski (Collaborator, Author):
A new union type that holds the different inputs to generate() and LanguageModel.prepare()
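
Conceptually it is something like the following sketch (field names and nested types are illustrative, not the ones in this PR): text tokens are always present, image inputs are optional, so LLMs and VLMs can share one input type.

import MLX

public struct LMInput {
    public struct Text {
        public let tokens: MLXArray        // tokenized prompt
    }

    public struct ProcessedImage {
        public let pixels: MLXArray        // normalized image tensor(s)
    }

    public let text: Text
    public let image: ProcessedImage?      // nil for text-only models
}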

}
}

public struct LMOutput {

davidkoski (Collaborator, Author):
Union type for the output. Some of the VLMs return additional state, which is represented here.
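
Again only a sketch of the idea (names are assumptions): logits plus an optional bag of model-specific state that a VLM can thread through to the next step.

import MLX

public struct LMOutput {
    public let logits: MLXArray

    // Opaque extra state a model can hand back and receive on the next call,
    // e.g. cross-attention inputs for some VLMs.
    public struct State {
        public let values: [String: MLXArray]
    }

    public let state: State?
}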

@@ -134,6 +135,7 @@ extension ModelConfiguration {
extraEOSTokens: ["<|end|>"]
)

// TODO the prompt formatter is replaced by the chat template

davidkoski (Collaborator, Author):
Or is it? #150


import CoreImage
import Foundation
import MLX

davidkoski (Collaborator, Author):
This file may be deleted -- it was some notes & thoughts along the way

// Copyright © 2024 Apple Inc.

import Foundation
import MLX

davidkoski (Collaborator, Author):
Also to be deleted -- LMInput replaces this

private let context = CIContext()

// TODO documentation
public enum MediaProcessing {

davidkoski (Collaborator, Author):
Needs documentation, but see PaliGemmaImageProvider which implements

SiglipImageProcessor {
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_seq_length": 1024,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "processor_class": "PaliGemmaProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 448,
    "width": 448
  }
}

from the python transformers code.
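
For reference, the handful of fields the image provider needs from that preprocessor_config.json map naturally onto a small Codable struct; a sketch (type and property names here are illustrative):

import Foundation

struct SiglipProcessorConfiguration: Codable {
    struct Size: Codable {
        let height: Int
        let width: Int
    }

    let imageMean: [Float]        // [0.5, 0.5, 0.5]
    let imageStd: [Float]         // [0.5, 0.5, 0.5]
    let size: Size                // 448 x 448
    let imageSequenceLength: Int  // 1024

    enum CodingKeys: String, CodingKey {
        case imageMean = "image_mean"
        case imageStd = "image_std"
        case size
        case imageSequenceLength = "image_seq_length"
    }
}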

import MLXNN
import Tokenizers

// MARK: - Language

davidkoski (Collaborator, Author):
Note: this builds, loads weights and "runs" but doesn't produce any output -- still needs to be debugged.

davidkoski (Collaborator, Author):
it should be usable as an example of the structure I think we need

}
}

// TODO does not support multiple images -- how do we represent?

davidkoski (Collaborator, Author):
We need a protocol for the image and text processing pieces.
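
A minimal sketch of the kind of protocol this could be (the name and shape are assumptions; accepting an array of images is one possible answer to the multiple-image TODO above, but that is exactly what is still undecided):

import CoreImage

// Hypothetical protocol: turn user-facing text + image(s) into an LMInput
// ready for LanguageModel.prepare() / generate().
protocol UserInputProcessing {
    func prepare(prompt: String, images: [CIImage]) throws -> LMInput
}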

image = MediaProcessing.inSRGBToneCurveSpace(image)

image = MediaProcessing.resampleBicubic(image, to: .init(width: size, height: size))
image = MediaProcessing.normalize(image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))

davidkoski (Collaborator, Author) commented Nov 1, 2024:
(Same SiglipImageProcessor configuration as quoted above -- image_mean and image_std of [0.5, 0.5, 0.5], 448 x 448 size, bicubic resample -- which is where these values come from.)

}
}

private func loadConfiguration(url: URL) throws -> PaliGemma {

davidkoski (Collaborator, Author):
These next couple of functions are just stubs to let me try it out -- this will work much like the LLM models
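
In the LLM code the equivalent step decodes the model's config.json into a configuration struct and builds the model from it; a sketch of the same flow here (the configuration type and its init are assumptions, not code from this PR):

import Foundation

private func loadConfiguration(url: URL) throws -> PaliGemma {
    // Decode config.json from the model directory...
    let configurationURL = url.appendingPathComponent("config.json")
    let data = try Data(contentsOf: configurationURL)
    let configuration = try JSONDecoder().decode(
        PaliGemmaConfiguration.self, from: data)

    // ...and construct the model from it, assuming an init that takes the
    // decoded configuration, mirroring the LLM factories.
    return PaliGemma(configuration)
}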

private let _ropeTheta: Float?
public var ropeTheta: Float { _ropeTheta ?? 10_000 }
public let _ropeTraditional: Bool?
public var ropeTraditional: Bool { _ropeTraditional ?? false }

davidkoski (Collaborator, Author) commented Nov 1, 2024:
Rather than doing the full implementation of Codable I went a simpler route for default values. Less code, cleaner (I think)
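
Spelled out, the pattern is: store the decoded value as a private optional, expose a computed property that applies the default, and map the JSON key via CodingKeys, so decoding stays synthesized. A sketch using the fields from the hunk above (the struct name and JSON keys are assumptions):

public struct TextConfiguration: Codable {
    // Decoded as optionals so a missing key does not fail decoding...
    private let _ropeTheta: Float?
    private let _ropeTraditional: Bool?

    // ...and the defaults live in computed properties instead of init(from:).
    public var ropeTheta: Float { _ropeTheta ?? 10_000 }
    public var ropeTraditional: Bool { _ropeTraditional ?? false }

    enum CodingKeys: String, CodingKey {
        case _ropeTheta = "rope_theta"
        case _ropeTraditional = "rope_traditional"
    }
}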

@Option var path: URL

@MainActor
mutating func run() async throws {

davidkoski (Collaborator, Author):
Just stub code to exercise the model. This still needs the input processing layers, in particular the prompt processing. The image processing is in place but will need to be wrapped up API-wise.
