Long Training Times and GPU Utilization Problem #860

meirm · 2024-07-02T04:08:20Z

Issue: Long Training Times and GPU Utilization Problem

Description:

Hi everyone,

I am currently working on fine-tuning an LLM using the MLX library on my MacBook Pro M1 with 16GB of RAM. While the training process works, it is taking an exceptionally long time to complete. From what I observe, the training is not utilizing the GPU and is instead maxing out the Performance-CPUs at 100%.

Here are the key details of my setup and issue:

Machine: MacBook Pro M1 with 16GB RAM
Library: MLX
Model: TinyLlama-1.1B-Chat-v1.0-4bit

Training Command:

python -m mlx_lm.lora --model mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit \
  --data data --train --batch-size 2 --lora-layers -1 --iters 4000 \
  --learning-rate 0.00001 --grad-checkpoint --seed 42 --save-every 200 \
  --steps-per-eval 20

Symptoms:

Training process is very slow.
Performance-CPUs are at 100% utilization.
GPU does not appear to be utilized at all.

Questions:

Is there a way to ensure that the GPU is used for the training process?
Are there specific configurations or settings in the MLX library that I might be missing to enable GPU utilization?
Any tips or recommendations to speed up the training process on an M1 MacBook?

Additional Information:
I have verified that the necessary packages and dependencies are installed correctly. Any insights or guidance on how to resolve this issue would be greatly appreciated.

Thanks in advance for your help!

Best regards,
Meir

awni (Maintainer) · 2024-07-02T19:40:57Z

It looks like you are using the GPU but the utilization is not great. You can double check with:

python -c "import mlx.core as mx; print(mx.default_device())"

Now why it's under-utilized is a harder question to answer. One thing that is very odd is the way your toks/sec are periodic: high, then low. Could you share some details about the dataset? Are there highly variable sequence lengths?
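
One way to answer the sequence-length question is to summarize the lengths in the training file. The following is a minimal sketch, not part of MLX: `length_stats` is a hypothetical helper, the `data/train.jsonl` path is an assumption about how the `--data` directory is laid out, and whitespace-separated word counts are used only as a rough proxy for real token counts.

```python
import json
import statistics
from pathlib import Path


def length_stats(jsonl_path):
    """Summarize per-example lengths in a {"text": ...} JSONL file.

    Lengths are whitespace-separated word counts, which is only a rough
    proxy for the token counts the model's tokenizer would produce.
    """
    lengths = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            lengths.append(len(json.loads(line)["text"].split()))
    if not lengths:
        raise ValueError(f"no examples found in {jsonl_path}")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


if __name__ == "__main__":
    # Hypothetical path: assumes the --data directory holds train.jsonl.
    path = Path("data") / "train.jsonl"
    if path.exists():
        print(length_stats(path))
```

A large gap between the minimum and maximum would be consistent with throughput that alternates between high and low, although it would not prove that this is the cause.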

austinbv · 2025-01-05T03:12:59Z

I am having the same issue on an M3 MacBook Pro with 64GB of RAM. The training data is writing samples; I added a set of the examples below.

The command I am running is:

python -m mlx_lm.lora \
  --model mlx-community/Llama-3.3-70B-Instruct-8bit \
  --data ./data \
  --train \
  --iters 1000

I am getting 100% CPU utilization with very bursty GPU utilization and really slow training.

{"text": "Write in the style of austinbv:\nThe Law of Preferential attachment: Why your User class keeps growing\n\nSoftware often models the natural world. One pattern I have been thinking about is why our biggest components continue to grow. In natural systems there is a phenomenon known as preferential attachment, which describes how new elements in a system tend to collect at the most connected nodes in the network. Zoologists observed this phenomenon in taxonomy where the largest class, order, family, etc collect newly named organisms."}
{"text": "Write in the style of austinbv:\nThe pull of the common class\n\nThis is exactly what happens in our codebases. Someone once said there are two hard things in computer science: “Naming things, cache-invalidation, and off by one errors.” I love the joke, but it hurts every time. Poorly named components create a gravitational pull forcing the next developer between a rock and a hard place: either comb through the component to understand what its purpose is and then refactor the name to be more specific, or take the path of least resistance and add new functionality further expanding the responsibility of the component. This decision creates a feedback loop attracting more and more “like” functionality.\n\nConsider a component named UserHandler. The original intent could be to manage a user's authentication, but because of the vague name, the component could also logically handle anything user related. The UserHandler soon contains preferences, notifications, and social connections along with the original intention, authentication. Each addition makes new complexity even easier to add, moving it closer and closer to the “god class.”"}
{"text": "Write in the style of austinbv:\nYes there’s math\n\nThere’s actually some math to back this up. Preferential attachment follows a power law distribution where the probability of new functionality being added to a component is proportional to the existing functionality of the component. We can express this mathematically if F(c) represents the functionality in component c, then the probability P of adding complexity to that component is:\n\n$$P(c) = \\frac{F(c)^\\alpha}{\\sum_{i} F(i)^\\alpha}$$\n\nWhere \\(\\alpha\\) represents the strength of the preferential attachment.\n\nImagine a system where we have three components:\n\nA - 100 lines of code\nB - 200 lines of code\nC - 1000 lines of code\n\nAnd the strength of preferential attachment is 2 (which is a pretty common value in the power law) then we end up with:\n\n$$P(A) = 0.006 = \\frac{100^2}{(100+200+1000)^2}$$\n\nVs\n\n$$P(C) = 0.591 = \\frac{1000^2}{(100+200+1000)^2}$$\n\nThe probability of adding to C is not 10x but 100x more likely. Which makes it more clear why some classes continue to grow while most remain small."}
{"text": "Write in the style of austinbv:\nThere is no strength in numbers\n\nIf it is not painfully obvious, the larger the component the more impact that that component has on the efficiency of future development. Large components become immovable objects in architectures shaping the systems evolution. This shape has a significant cost on the ability of a system to evolve.\n\nConsider testing. As a component attracts more and more functionality its dependencies grow as well creating more and more potential interactions that must be validated every time the component changes. Changing one part of a system will eventually cascade to the “God Component” and then the changes to that component will require testing and updates to other seemingly unrelated parts of the code.\n\nThe answer to preferential attachment is more simply said than done, just name things better. It’s always easier to bring components together than it is to untangle them and refactor them apart."}
{"text": "Write in the style of austinbv:\nUsing the UserHandler example above\n\ninterface UserHandler {\n    authenticate(credentials: Credentials): User;\n    validateSession(token: string): boolean;\n    revokeAccess(userId: string): boolean;\n    fetchPreferences(userId: string): UserPreferences;\n    updatePreferences(userId: string, preferences: UserPreferences): UserPreferences;\n    getDefaultPreferences(): UserPreferences;\n    getNotificationSettings(userId: string): NotificationSettings;\n    updateNotificationSettings(userId: string, settings: NotificationSettings): NotificationSettings;\n    addSocial(userId: string, connectionId: string): string;\n    removeSocial(userId: string, connectionId: string): string;\n    listSocial(userId: string): string[];\n}\n\nThis component clearly does a lot and if each method is implemented it could be hundreds of lines long with dependencies on databases, caches, and other whole components."}
{"text": "Write in the style of austinbv:\nThe only constant is change\n\nThere are four clear concerns here that each could have their own class or component, creating more clear naming and a more clear place for new functionality to go.\n\ninterface AuthenticationService {\n  authenticate(credentials: Credentials): User;\n  validateSession(token: string): boolean;\n  revokeAccess(userId: string): boolean;\n}\n\ninterface PreferenceService {\n  fetch(userId: string): UserPreferences;\n  update(userId: string, preferences: UserPreferences): UserPreferences;\n  getDefaults(): UserPreferences;\n}\n\ninterface NotificationService {\n  getSettings(userId: string): NotificationSettings;\n  updateSettings(userId: string, settings: NotificationSettings): NotificationSettings;\n}\n\ninterface SocialService {\n  addSocial(userId: string, connectionId: string): string;\n  removeSocial(userId: string, connectionId: string): string;\n  listSocial(userId: string): string[];\n}\n\nThis form is more clear and gives clear indication of the behavior for each component and makes it easy to test and easy for developers to grok the context."}
{"text": "Write in the style of austinbv:\nBuild for change\n\nOur job as developers is to build code that enables change in a system and by creating clear boundaries with naming we avoid the pull of preferential attachment, making new components easier to build and new functionality easier to change old.\n\nJust as natural systems tend toward entropy, software systems tend toward complexity, and the science supports it. Our role is to manage that complexity by introducing patterns and constraints ensuring sustainable and predictable growth of the system over time."}
{"text": "Write in the style of austinbv:\nHoly cow a lot has happened in the last few months. First on the personal front I had my third kid... all boys... yes my hands are full, no there’s not a lot of sleep, yes it’s the best, and no I wouldn’t trade it for the world.\n\nNext to the new beginnings in my family my company also had a new beginning of its own. It’s no secret that the industry is changing, remote work, AI development, and the ubiquity of nearshore means that software agencies need to pivot and our value proposition needs to change."}
{"text": "Write in the style of austinbv:\nThis takes me back, almost 6 years ago I started Focused Labs. I don’t think I have talked much about why I named the company what I did or its roots. My career was formed pair programming, writing tests, and looking at code as a craft but the production of software as an assembly line.\n\nI came from XP and Agile, but I never felt alliance to the methodology just to the results. Agile, is burdened with countless certifications, numbed by the desire of enterprises to remove agility for the sake of risk mitigation. Agile - when “implemented” was waterfall with a new outfit. XP, on the other hand kept its purity because of its obscurity. Instead of suffering the tarnish of mass-deployment, XP became an echo chamber of zealots, dogmatists, and purists, all congratulating each other on their test coverage, small stories, and red green refactors while complaining that the business doesn’t value quality code."}
{"text": "Write in the style of austinbv:\nWhat I love about these methods, and others, is they help teams and people bring focus to their work. A small story focuses a developer on a single piece of value. A test focuses them on a single piece of logic. A pair focuses their counterpart on moving forward. All the tooling that surrounds us allows the developer to focus on adding more code, while trusting our code will run. Focus is what brings efficiency to teams, to people, and to companies. The methodologies are playbooks to maintain focus.\n\n6 years is a lot of time, it’s the second longest job I have ever had. In those six years we’ve grown, from me and my co-owner Luke to 60 strong. We have had ups and downs and through that time we grew in complexity too. Starting as a small DevOps firm, we added software development, then design, then product. With each additional service our purpose blurred. We lost focus.\n\nOur rebrand has been an eye opening experience, refreshing the original purpose and focusing our mission back to who we are."}
{"text": "Write in the style of austinbv:\nSo we dropped the Labs from our name. It created confusion. Our developers are not scientists. Our projects are not experimental. At Focused we build massive production systems that run the world. We use tried and true methods, technology, and teams to execute with focus.\n\nWe refreshed our look. Our new look reflects our seriousness, our passion for our craft, and our acknowledgment of the need to look forward, enabling the legacy’s that power every part of the world to continue to provide value even as technology evolves.\n\nWe focused our team. Leaning into the core of delivery we have stepped away from the ambiguity of project / product management and see our place in the SDLC clearly as implementors, trades people, and builders. We are focused on creating software that is designed to change.\n\nAll of these changes welcome a new era of development and new paradigms for how developers will work.\n\nToday we are more Focused than ever."}

awni Jan 5, 2025
Maintainer

My guess is it's swapping. Llama 3 70B in 8-bit is ~70GB for the model alone, which is already more than your 64GB of RAM, and that does not include activations and other memory needed for fine-tuning.

My recommendation is:

Try with a smaller model first just to make sure your setup is good (maybe Llama 8B).
Once the above is working, if you really want to use the 70B you'd have to use it in 4-bit or 6-bit, since 8-bit is too large for your machine. Also check out this guide on reducing memory use during LoRA fine-tuning.
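
The arithmetic behind these numbers can be sketched as parameters times bits per weight. This is a rough illustration under stated assumptions, not an MLX API: `weight_memory_gb` is a hypothetical helper, the model is taken to have about 70 billion parameters, and the estimate ignores quantization scales and biases, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes).

    Ignores quantization scales/biases, activations, gradients, and
    optimizer state, so real usage during fine-tuning is higher.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    ram_gb = 64  # machine from the report above
    for bits in (8, 6, 4):
        need = weight_memory_gb(70e9, bits)
        verdict = "exceeds" if need > ram_gb else "fits within"
        print(f"{bits}-bit: ~{need:.1f} GB of weights, {verdict} {ram_gb} GB RAM")
```

Under these assumptions the 8-bit weights alone come to about 70 GB, while 6-bit and 4-bit come to roughly 52.5 GB and 35 GB. Weights fitting in RAM is necessary but not sufficient, since fine-tuning needs additional memory on top.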

austinbv Jan 7, 2025

yup that did it
