diff --git a/README.md b/README.md index 051164258f..55170272e3 100644 --- a/README.md +++ b/README.md @@ -1,549 +1,293 @@ -## Latest News -* [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits, see [rebase folder for more details](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) in terms of features and updated performance. - -## Megatron-DeepSpeed -DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```examples_deepspeed/``` folder includes example scripts about the features supported by DeepSpeed. - -### Recent sync with NVIDIA/Megatron-LM -In July 2023, we had a sync with the NVIDIA/Megatron-LM repo (where this repo is forked from) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the amount of merged commits, bugs can happen in the cases that we haven't tested, and your contribution (bug report, bug fix pull request) is highly welcomed. We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase) which is the version before this sync. This backup branch is just for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before sync. - -### Run on Azure and AzureML -To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend to start with AzureML recipe in the ```examples_deepspeed/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or Azure VM based environment, please refer to the bash scripts in the ```examples_deepspeed/azure``` folder. - -Below is Megatron-LM's original README. Note that examples mentioned below are from the original NVIDIA/Megatron-LM repo. All of them do NOT have DeepSpeed technologies integrations, and some of them may not work due to changes in this Megatron-DeepSpeed repo. Thus we recommend you to go to ```../examples_deepspeed/``` folder which includes examples that have DeepSpeed technologies integrated and are tested by DeepSpeed team. ------- - -Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf), [2](https://arxiv.org/pdf/2104.04473.pdf), and [3](https://arxiv.org/pdf/2205.05198)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel ([tensor](https://arxiv.org/pdf/1909.08053.pdf), [sequence](https://arxiv.org/pdf/2205.05198), and [pipeline](https://arxiv.org/pdf/2104.04473.pdf)), and multi-node pre-training of transformer based models such as [GPT](https://arxiv.org/abs/2005.14165), [BERT](https://arxiv.org/pdf/1810.04805.pdf), and [T5](https://arxiv.org/abs/1910.10683) using mixed precision. 
- -Below are some of the projects where we have directly used Megatron: -* [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf) -* [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf) -* [End-to-End Training of Neural Retrievers for Open-Domain Question Answering](https://arxiv.org/abs/2101.00408) -* [Large Scale Multi-Actor Generative Dialog Modeling](https://www.aclweb.org/anthology/2020.acl-main.8.pdf) -* [Local Knowledge Powered Conversational Agents](https://arxiv.org/abs/2010.10150) -* [MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.226.pdf) -* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html) -* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf) -* [Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases](https://arxiv.org/abs/2112.07868) -* [Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models](https://arxiv.org/abs/2202.04173) -* [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](https://arxiv.org/abs/2201.11990) -* [Multi-Stage Prompting for Knowledgeable Dialogue Generation](https://arxiv.org/abs/2203.08745) - -Megatron is also used in [NeMo Megatron](https://developer.nvidia.com/nvidia-nemo#nemo-megatron), a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters. - -Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specifc model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The graph below shows that we scale nearly linear up to 1 trillion parameter models running on 3072 GPUs. Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., includes all operations including data loading, optimization, and even logging. - -![Scaling Graph](images/Achieved_petaFLOPs.png) - -The following table shows both model (MFU) and hardware (HFU) FLOPs utilization for select configurations up to 1T parameters (see [our paper](https://arxiv.org/pdf/2205.05198) for a description of how these are calculated). As the model size increases, we achieve better GPU utilization and for the one trillion parameter model, we reach a MFU and HFU of 56.3% and 57.0%, respectively. Note that these numbers are also measured on benchmark runs and in this case are measured using a data parallel size of one. 
Data parallelism introduces some overhead due to the gradient all-reduce required between the data parallel groups. However, for large transformer models, this overhead is not large and can almost entirely eliminted by overlapping the gradient all-reduce with backpropagation. - -| Model Size | Model FLOPs Utilization | Hardware FLOPs Utilization | -| :---: | :---: | :---: | -| 22B | 41.5% | 43.7% | -| 175B | 51.4% | 52.8% | -| 530B | 56.0% | 57.0% | -| 1T | 56.3% | 57.0% | - -# Contents - * [Contents](#contents) - * [Setup](#setup) - * [Downloading Checkpoints](#downloading-checkpoints) - * [Usage](#usage) - * [Training](#training) - * [Data Preprocessing](#data-preprocessing) - * [BERT Pretraining](#bert-pretraining) - * [GPT Pretraining](#gpt-pretraining) - * [T5 Pretraining](#t5-pretraining) - * [Distributed Pretraining](#distributed-pretraining) - * [Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation) - * [Distributed Optimizer](#distributed-optimizer) - * [FlashAttention](#flashattention) - * [GPT-3 Example](#gpt-3-example) - * [Retro](#retro) - * [Evaluation and Tasks](#evaluation-and-tasks) - * [GPT Text Generation](#gpt-text-generation) - * [GPT Evaluation](#gpt-evaluation) - * [WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation) - * [LAMBADA Cloze Accuracy](#lambada-cloze-accuracy) - * [BERT Task Evaluation](#bert-task-evaluation) - * [RACE Evaluation](#race-evaluation) - * [MNLI Evaluation](#mnli-evaluation) - * [Datasets](#datasets) - * [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data) - * [Collecting GPT Webtext Data](#collecting-gpt-webtext-data) - * [Reproducibility](#reproducibility) +# LLM for PyTorch -# Setup -We strongly recommend using the latest release of [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) with DGX nodes. If you can't use this for some reason, use the latest pytorch, cuda, nccl, and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start) releases. Data preprocessing requires [NLTK](https://www.nltk.org/install.html), though this is not required for training, evaluation, or downstream tasks. - -You can launch an instance of the PyTorch container and mount Megatron, your dataset, and checkpoints with the following Docker commands: -``` -docker pull nvcr.io/nvidia/pytorch:xx.xx-py3 -docker run --gpus all -it --rm -v /path/to/megatron:/workspace/megatron -v /path/to/dataset:/workspace/dataset -v /path/to/checkpoints:/workspace/checkpoints nvcr.io/nvidia/pytorch:xx.xx-py3 -``` - -## Downloading Checkpoints -We have provided pretrained [BERT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m) and [GPT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m) checkpoints for use to evaluate or finetuning downstream tasks. To access these checkpoints, first [sign up](https://ngc.nvidia.com/signup) for and [setup](https://ngc.nvidia.com/setup/installers/cli) the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the [NGC documentation](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_6_4_1). - -Alternatively, you can directly download the checkpoints using: - -
-BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
-BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
-GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
-
- -The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: [uncased](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt), [cased](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt). The GPT [vocab file](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json) and [merge table](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt) can be downloaded directly. - -Additional notes for DeepSpeed. We have added a helper script to download the checkpoints and make the example runnable. - -Steps to follow: - - bash dataset/download_ckpt.sh -- this will download and extract the checkpoint - - bash dataset/download_vocab.sh -- this will download GPT merges and vocab files. - - bash examples/generate_text.sh -- this will generate examples using the 345m GPT model. - -# Usage - -After installation, there are several possible workflows. The most comprehensive is: -1. Data preprocessing -2. Pretraining -3. Finetuning (Optional for zero-shot tasks) -4. Downstream task evaluation or text generation - -However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above. - -We've provided several scripts for pretraining both BERT and GPT in [`examples`](./examples) directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation. - -# Training -## Data Preprocessing -The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example: -
-{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
-{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
-
- -The name of the `text` field of the json can be changed by using the `--json-key` flag in [`preprocess_data.py`](./tools/preprocess_data.py) The other metadata are optional and are not used in training. - -The loose json is then processed into a binary format for training. To convert the json into mmap format use `preprocess_data.py`. An example script to prepare data for BERT training is: -
-python tools/preprocess_data.py \
-       --input my-corpus.json \
-       --output-prefix my-bert \
-       --vocab-file bert-vocab.txt \
-       --tokenizer-type BertWordPieceLowerCase \
-       --split-sentences \
-       --workers 5
-
- -The output will be two files named, in this case, `my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`. The `--data-path` specified in later BERT training is the full path and new filename, but without the file extension. - -For T5 use the same preprocessing as BERT, perhaps renaming it to: -
-       --output-prefix my-t5 \
-
- -Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type: -
-python tools/preprocess_data.py \
-       --input my-corpus.json \
-       --output-prefix my-gpt2 \
-       --vocab-file gpt2-vocab.json \
-       --dataset-impl mmap \
-       --tokenizer-type GPT2BPETokenizer \
-       --merge-file gpt2-merges.txt \
-       --append-eod \
-       --workers 5
-
- -Here the output files are named `my-gpt2_text_document.bin` and `my-gpt2_text_document.idx`. As before, in GPT training, use the longer name without the extension as `--data-path`. - -Further command line arguments are described in the source file [`preprocess_data.py`](./tools/preprocess_data.py). - -## BERT Pretraining - - -The [`examples/pretrain_bert.sh`](./examples/pretrain_bert.sh) script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at `--lr` to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is a single forward-backward path batch-size and the code will perform gradient accumulation steps until it reaches `global-batch-size` which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). We use `train-iters` as the training iterations requested. Alternatively, one can provide `--train-samples` which is total number of samples to train on. If this option is present, then instead of providing `--lr-decay-iters`, one will need to provide `--lr-decay-samples`. - -The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions. - -Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py). - -To run `examples/pretrain_bert.sh`, make any desired modifications including setting the environment variables for `CHECKPOINT_PATH`, `VOCAB_FILE`, and `DATA_PATH`. Make sure to set these variables to their paths in the container. Then launch the container with Megatron and necessary paths mounted (as explained in [Setup](#setup)) and run the example script. - -## GPT Pretraining - -The `examples/pretrain_gpt.sh` script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training. - -It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a `json` vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the `--lr-decay-style` has been set to cosine decay. Note that the `--data-path` now includes the additional `_text_document` suffix added in preprocessing, but does not include the file extensions. - -Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py). +This directory provides scripts to train the GPT-based LLaMA and Mixtral models in the Megatron-DeepSpeed repository on Intel® Gaudi® 2 AI accelerator. 
+Before you get started, make sure to review the [Supported Configuration](#supported-configuration). -`examples/pretrain_gpt.sh` can be launched the same way as described for BERT. Set the env vars and make any other modifications, launch the container with appropriate mounts, and run the script. +## Table of Contents +* [Model Overview](#model-overview) +* [Setup](#setup) +* [Training Script Settings](#training-script-settings) +* [LLaMA Training and Examples](#llama-training-and-examples) +* [Mixtral Training and Examples](#mixtral-training-and-examples) +* [Changelog](#changelog) +* [Known Issues](#known-issues) -## T5 Pretraining +## Model Overview +This implementation is based on https://github.com/microsoft/Megatron-DeepSpeed at 7eb36a11b3a9c48ed07b93692ccf22bfb5577f7e. +Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf) and [2](https://arxiv.org/pdf/2104.04473.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for training large transformer language models such as LLaMA at scale. Codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. -Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture: +### How to use +Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses. -* `--kv-channels` sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5. -* `--ffn-hidden-size` sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5. - -* `--encoder-seq-length` and `--decoder-seq-length` set the sequence length for the encoder and decoder separately. - -All of the other arguments remain as they were for BERT and GPT pretraining. Run this example with the same steps described above for the other scripts. - -## Distributed Pretraining - -The `examples/pretrain_{bert,gpt,t5}_distributed.sh` scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables. See the official PyTorch [documentation](https://pytorch.org/docs/stable/elastic/run.html#launcher-api) for further description of these [environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization). By default, multi-node training uses the [nccl](https://developer.nvidia.com/nccl) distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the `torchrun` elastic launcher (equivalent to `python -m torch.distributed.run`) are the only additional requirements to adopt distributed training. See any of `examples/pretrain_{bert,gpt,t5}_distributed.sh` for more details. - -We use two types of parallelism: data and model parallelism. 
We facilitate two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of back propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back propagation computation. To switch between these two options use `--DDP-impl local` or `--DDP-impl torch`, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameters model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallel and 1.2 billion parameters with no model parallel) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time. - -Second, we developed a simple and efficient two-dimensional model-parallel approach. To use tensor model parallelism (splitting execution of a single transformer module over multiple GPUs, see Section 3 of [our paper](https://arxiv.org/pdf/1909.08053.pdf)), add the `--tensor-model-parallel-size` flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use sequence parallelism specify `--sequence-parallel`, which requires tensor model parallel as it split among the same GPUs (more details in Section 4.2.2 of [our paper](https://arxiv.org/pdf/2205.05198.pdf)). - -To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches, see Section 2.2 of [our paper](https://arxiv.org/pdf/2104.04473.pdf)), use the `--pipeline-model-parallel-size` flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers each). - - - -We have examples of how to use these two different forms of model parallelism the example scripts ending in `distributed_with_mp.sh`: - -Other than these minor changes, the distributed training is identical to the training on a single GPU. - -The interleaved pipelining schedule (more details in Section 2.2.2 of [our paper](https://arxiv.org/pdf/2104.04473.pdf)) can be enabled using the `--num-layers-per-virtual-pipeline-stage` argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with `NUM_LAYERS / PIPELINE_MP_SIZE` transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as `GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)`) should be divisible by the `PIPELINE_MP_SIZE` when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (`PIPELINE_MP_SIZE=2`). - -## Activation Checkpointing and Recomputation - -To reduce GPU memory usage so deploy a large model to a training system, we support activation checkpointing and recomputation. We support two levels of recompute granularity: `selective` and `full`. 
Selective recomputation is the default and recommended in almost all cases. It saves the activations that take less space and are expensive to recompute and recomputes activations that take a lot of space but are relatively cheap to recompute (see [our paper](https://arxiv.org/pdf/2205.05198) for details). To enable selective activation recompute simply use `--recompute-activations`. - -For cases where memory is very tight, `full` checkpointing saves just the inputs to a transformer layer, or a block of transformer layers, and recomputes everything else. To turn on full activation recompute use `--recompute-granularity full`. When using full activation recomputation, there are two methods: `uniform` and `block`, chosen using the `--recompute-method` argument. - -* Uniform method uniformly divides the Transformer layers into groups of layers and stores the input activations of each group in the memory. The baseline group size is 1 and, in this case, the input activation of each Transformer layer is checkpointed. When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage thus enables running a bigger model. For example, when using the number of layers per group of 4, the input activation of each group of 4 Transformer layers is checkpointed. - -* Block method checkpoints the input activations of a set number of individual Transformer layers per pipeline stage and do the rest of layers without any checkpointing. This method can be used to skip checkpointing some Transformer layers until the GPU memory is fully used, which is applicable only when there is unused GPU memory. Checkpointing fewer transformer layers avoids unnecessary activation recomputation in the backprop thus improves training performance. For example, when we specify 5 layers to checkpoint of 8 layers per pipeline stage, the input activations of only the first 5 Transformer layers are checkpointed and activation recomputation for the rest 3 layers is not needed in the backprop. - - -## Distributed Optimizer - -Usage: `--use-distributed-optimizer`. Compatible with all model and data types. - -The distributed optimizer is a memory savings technique, whereby the optimizer state is evenly distributed across data parallel ranks (versus the traditional method of replicating the optimizer state across data parallel ranks). As described in [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054), our implementation distributes all optimizer state that does not overlap with the model state. For example, when using fp16 model params, the distributed optimizer maintains its own separate copy of fp32 main params & grads, which are distributed across DP ranks. When using bf16 model params, however, the distributed optimizer's fp32 main grads are the same as the model's fp32 grads, and so the grads in this case are not distributed (although the fp32 main params are still distributed, as they are separate from the bf16 model params). - -Theoretical memory savings vary depending on the combination of the model's param dtype and grad dtype. In our implementation, the theoretical number of bytes per parameter is (where 'd' is the data parallel size): - -| | Non-distributed optim | Distributed optim | -|-|-|-| -| fp16 param, fp16 grads | 20 | 4 + 16/d | -| bf16 param, fp32 grads | 18 | 6 + 12/d | -| fp32 param, fp32 grads | 16 | 8 + 8/d | - -## FlashAttention - -Usage: `--use-flash-attn`. Support attention head dimensions at most 128. 
- -[FlashAttention](https://github.com/HazyResearch/flash-attention) is a fast and -memory-efficient algorithm to compute exact attention. It speeds up model -training and reduces memory requirement. - -To install FlashAttention: -```sh -pip install flash-attn +# Setup +Please follow the instructions provided in the [Intel Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html) +to set up the environment including the `$PYTHON` environment variable. To achieve the best performance, please follow the methods outlined in the [Optimizing Training Platform guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html). +The guides will walk you through the process of setting up your system to run the model on Gaudi 2. + +## Install Intel Gaudi DeepSpeed +Please follow the instructions provided in the [DeepSpeed Installation Guide](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#installing-deepspeed-library) to install deepspeed. + +## Clone Intel Gaudi Megatron-DeepSpeed +In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. +You can run the [`hl-smi`](https://docs.habana.ai/en/latest/System_Management_Tools_Guide/System_Management_Tools.html#hl-smi-utility-options) utility to determine the Intel Gaudi software version. +```bash +git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Megatron-DeepSpeed ``` -## GPT-3 Example - -In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adopted to any other scheduler. It uses 8-way and 16-way tensor and pipeline parallelism, respectively. With options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, the training will start with global batch size 16 and linearly increase the global batch size to 1536 over 5,859,375 samples with incrmeental steps 16. The training dataset can be either a single set or a multiple datasets combined with a set of weights. - -With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds resulting in 138 teraFLOPs per GPU which is 44% of the theoretical peak FLOPs. - - -## Retro - -See: - -- `tools/retro/README.md` for an overview. -- `tools/retro/examples/get_preprocess_cmd.sh` for an example of common preprocessing arguments. -- `tools/retro/examples/preprocess_data.sh` for an example of how to preprocess data. -- `tools/retro/examples/pretrain_model.sh` for an example of how to pretrain a model. - -Retro is a retrieval-enhanced model that is based on GPT. As described in [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426), Retro retrieves from a database of document chunks by performing locality search using a sample's tokens. The retrieval database can be large -- often billions or even trillions of tokens -- and provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters. - -Using Retro requires two steps: 1) preprocessing the retrieval database and pretraining neighbors, and 2) pretraining a model using this data. 
Please see `tools/retro/README.md` for a detailed overview. - - - -# Evaluation and Tasks - -We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the `--finetune` flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the `--finetune` flag before continuing, otherwise the training will start again from the beginning. - -Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on fewer GPUs in downstream tasks. The following script accomplishes this. This example reads in a GPT model with 4-way tensor and 4-way pipeline model parallelism and writes out a model with 2-way tensor and 2-way pipeline model parallelism. - -
-python tools/checkpoint_util.py \
-        --model-type GPT \
-        --load-dir checkpoints/gpt3_tp4_pp4 \
-        --save-dir checkpoints/gpt3_tp2_pp2 \
-        --target-tensor-parallel-size 2 \
-        --target-pipeline-parallel-size 2
-
-
- -Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts. - -## GPT Text Generation - -We have included a simple REST server to use for text generation in `tools/run_text_generation_server.py`. You run it much like you would start a pretraining job, specifying an appropriate pretrained checkpoint. There are also few optional parameters: `temperature`, `top-k`and `top-p`. See `--help` or the source file for more information. See [examples/run_text_generation_server_345M.sh](examples/run_text_generation_server_345M.sh) for an example of how to run the server. - -Once the server is running you can use `tools/text_generation_cli.py` to query it, it takes one argument which is the host the server is running on. - -
-tools/text_generation_cli.py localhost:5000
-
- -You can also use CURL or any other tools to query the server directly: - -
-curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8'  -d '{"prompts":["Hello world"], "tokens_to_generate":1}'
-
- -See [megatron/text_generation_server.py](megatron/text_generation_server.py) for more API options. - -### Detoxify GPT via Self-generation -We include an example in `examples/detxoify_lm/` to detoxify language models by leveraging the generative power of language models. - -See [examples/detxoify_lm/README.md](examples/detxoify_lm/README.md) for step-by-step tutorials on how to perform domain-adaptive training and detoxify LM using self-generated corpus. - - -## GPT Evaluation -We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy. - -### WikiText Perplexity Evaluation -For even comparison with prior works, we evaluate perplexity on the word-level [WikiText-103 test dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), and appropriately compute perplexity given the change in tokens when using our subword tokenizer. - -We use the following command to run WikiText-103 evaluation on a 345M parameter model. -
-TASK="WIKITEXT103"
-
-VALID_DATA=<wikitext path>.txt
-VOCAB_FILE=gpt2-vocab.json
-MERGE_FILE=gpt2-merges.txt
-CHECKPOINT_PATH=checkpoints/gpt2_345m
-
-COMMON_TASK_ARGS="--num-layers 24 \
-                  --hidden-size 1024 \
-                  --num-attention-heads 16 \
-                  --seq-length 1024 \
-                  --max-position-embeddings 1024 \
-                  --fp16 \
-                  --vocab-file $VOCAB_FILE"
-
-python tasks/main.py \
-       --task $TASK \
-       $COMMON_TASK_ARGS \
-       --valid-data $VALID_DATA \
-       --tokenizer-type GPT2BPETokenizer \
-       --merge-file $MERGE_FILE \
-       --load $CHECKPOINT_PATH \
-       --micro-batch-size 8 \
-       --activations-checkpoint-method uniform \
-       --log-interval 10 \
-       --no-load-optim \
-       --no-load-rng
-
- - -### LAMBADA Cloze Accuracy -To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl). - -We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Make that `lambada` is part of the file path. - -
-TASK="LAMBADA"
-
-VALID_DATA=<lambada path>.json
-VOCAB_FILE=gpt2-vocab.json
-MERGE_FILE=gpt2-merges.txt
-CHECKPOINT_PATH=checkpoints/gpt2_345m
-COMMON_TASK_ARGS=<same as those in WikiText Perplexity Evaluation above>
-
-python tasks/main.py \
-       --task $TASK \
-       $COMMON_TASK_ARGS \
-       --valid-data $VALID_DATA \
-       --tokenizer-type GPT2BPETokenizer \
-       --strict-lambada \
-       --merge-file $MERGE_FILE \
-       --load $CHECKPOINT_PATH \
-       --micro-batch-size 8 \
-       --activations-checkpoint-method uniform \
-       --log-interval 10 \
-       --no-load-optim \
-       --no-load-rng
-
- -Further command line arguments are described in the source file [`main.py`](./tasks/main.py) - -## BERT Task Evaluation -### RACE Evaluation -The following script finetunes the BERT model for evaluation on the [RACE dataset](http://www.cs.cmu.edu/~glai1/data/race/). The `TRAIN_DATA` and `VALID_DATA` directory contain the RACE dataset as separate `.txt` files. Note that for RACE, the batch size is the number of RACE query's to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line. - -
-TRAIN_DATA="data/RACE/train/middle"
-VALID_DATA="data/RACE/dev/middle \
-            data/RACE/dev/high"
-VOCAB_FILE=bert-vocab.txt
-PRETRAINED_CHECKPOINT=checkpoints/bert_345m
-CHECKPOINT_PATH=checkpoints/bert_345m_race
-COMMON_TASK_ARGS="--num-layers 24 \
-                  --hidden-size 1024 \
-                  --num-attention-heads 16 \
-                  --seq-length 512 \
-                  --max-position-embeddings 512 \
-                  --fp16 \
-                  --vocab-file $VOCAB_FILE"
-
-COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
-                      --valid-data $VALID_DATA \
-                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
-                      --activations-checkpoint-method uniform \
-                      --save-interval 10000 \
-                      --save $CHECKPOINT_PATH \
-                      --log-interval 100 \
-                      --eval-interval 1000 \
-                      --eval-iters 10 \
-                      --weight-decay 1.0e-1"
-
-python tasks/main.py \
-       --task RACE \
-       $COMMON_TASK_ARGS \
-       $COMMON_TASK_ARGS_EXT \
-       --tokenizer-type BertWordPieceLowerCase \
-       --epochs 3 \
-       --micro-batch-size 4 \
-       --lr 1.0e-5 \
-       --lr-warmup-fraction 0.06
-
- -### MNLI Evaluation -The following script finetunes the BERT model for evaluation with the [MultiNLI sentence pair corpus](https://www.nyu.edu/projects/bowman/multinli/). Because the matching tasks are quite similar, the script can be quickly tweaked to work with the [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset) (QQP) dataset as well. - -
-
-TRAIN_DATA="data/glue_data/MNLI/train.tsv"
-VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
-            data/glue_data/MNLI/dev_mismatched.tsv"
-PRETRAINED_CHECKPOINT=checkpoints/bert_345m
-VOCAB_FILE=bert-vocab.txt
-CHECKPOINT_PATH=checkpoints/bert_345m_mnli
-COMMON_TASK_ARGS=<same as those in RACE Evaluation above>
-COMMON_TASK_ARGS_EXT=<same as those in RACE Evaluation above>
-
-python tasks/main.py \
-       --task MNLI \
-       $COMMON_TASK_ARGS \
-       $COMMON_TASK_ARGS_EXT \
-       --tokenizer-type BertWordPieceLowerCase \
-       --epochs 5 \
-       --micro-batch-size 8 \
-       --lr 5.0e-5 \
-       --lr-warmup-fraction 0.065
-
- -# Datasets -We do not host any datasets for GPT or BERT training, however, we detail their collection so that our results may be reproduced. - -## Collecting Wikipedia Training Data -We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download [the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), extract the text with [WikiExtractor.py](https://github.com/attardi/wikiextractor), and then apply any necessary cleanup to convert it into plain text." - -We recommend using the `--json` argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset by nltk punctuation standardization. For BERT training, use the `--split-sentences` flag to `preprocess_data.py` as described [above](#data-preprocessing) to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the `--split-sentences` flag. - -## Collecting GPT Webtext Data -We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library from [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31's](https://github.com/eukaryote31/openwebtext) work to download urls. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our [openwebtext](./tools/openwebtext) directory. For reddit URLs corresponding to content up to October 2018 we arrived at approximately 37GB of content. - -# Reproducibility -Megatron training is intended to be bitwise reproducible. This means that the same training config run twice in the same HW and SW environment should produce identical model checkpoints, losses and accuracy metric values (iteration time metrics may vary). - -There are currently three known Megatron optimizations that break reproducibility whilst still producing almost identical training runs. They are only applicable when using NGC containers >=22.05. The following workarounds should be applied in cases where reproducibility is required: -1. When training using the `--bf16` option the backward pass of `torch.nn.functional.embedding` is non-deterministic. If reproducibility is required you should also use the option `--embedding-weights-in-fp32`. The speed and memory impact of this change is negligible. -2. Also when training using `--bf16`, reproducbility is only obtained when the checkpointing and resume schedule of training is identical. If the checkpointing schedule will change, i.e. checkpointing and resume will occur at different iterations, the option `--no-bias-gelu-fusion` should be used. -3. Flash attention is non-deterministic. If reproducibility is required do not use `--use-flash-attn`. - -These sources of non-determinism are under active investigation. If you observe non-determinism in Megatron training under other circumstances please open an issue. 
+``` +export MEGATRON_DEEPSPEED_ROOT=/path/to/Megatron-DeepSpeed +export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH +``` +## Install Megatron-DeepSpeed Requirements +* In the docker container, go to the Megatron-DeepSpeed directory: + ```bash + cd $MEGATRON_DEEPSPEED_ROOT + ``` + +* Install the required packages using pip: + ```bash + pip install -r megatron/core/requirements.txt + ``` + +* To run training on more than 128 cards, apply the below configuration changes: + ```bash + echo '* soft nofile unlimited' >> /etc/security/limits.conf + echo '* hard nofile unlimited' >> /etc/security/limits.conf + echo 'root soft nofile unlimited' >> /etc/security/limits.conf + echo 'root hard nofile unlimited' >> /etc/security/limits.conf + ``` + +## Dataset Preparation +Follow the instructions in https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar to download oscar-en full dataset. Note that the dataset takes around 550G of disk space. This dataset is used for training LLaMA & LLaMA 2. +### Dataset Preparation Example +The below provides the steps required to prepare your dataset. It is based on instructions in https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar. The dataset in the example is intended to be `zh` +### Step 0 : +```bash +git clone https://github.com/bigscience-workshop/bigscience.git +cd bigscience/data/oscar +# Edit the `oscar-to-jsonl.py` in the list language_subsets and remove the comment on unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en +vi oscar-to-jsonl.py +``` +### Step 1 : +```bash +# -s can be added for subset of data +$PYTHON oscar-to-jsonl.py +``` +### Step 2 : + ```bash +mkdir -p zh +mv oscar*.jsonl zh +cd zh + ``` +### Step 3 : +Use one of the three methods below to tokenize the dataset. You can use any number of workers based on the CPU cores. 
+* Tokenize the dataset using GPT2BPETokenizer: + ```bash + # download gpt2 vocab and merge files + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt + + # tokenize individual jsonl files + # loop count will change based on number of files for a given dataset + mkdir zh_tokenized + for i in $(seq 0 4); + do + $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type GPT2BPETokenizer --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --workers 80 + done + ``` + * Tokenize the dataset using GPTSentencePieceTokenizer: + ```bash + # download tokenizer.model based on model trying to train + # tokenize individual jsonl files + # loop count will change based on number of files for a given dataset + mkdir zh_tokenized + for i in $(seq 0 4); + do + $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model /path/to/tokenizer.model --append-eod --workers 80 + done + ``` + + * Tokenize the dataset using HFTokenizer: + ```bash + # path to tokenizer can be local directory path and to run custom code from it, trust remote code option(--trust-remote-code) should be passed + # or + # path to tokenizer can be link to huggingface repo model card + # if huggingface repo model card is a gated repo, Log in using a token from huggingface.co/settings/tokens with below command + # huggingface-cli login + # --seq-length value need to be passed explicitly from huggingface repo model card or local directory path which has model_max_length in tokenizer_config.json file + + # tokenize individual jsonl files + # loop count will change based on number of files for a given dataset + mkdir zh_tokenized + for i in $(seq 0 4); + do + $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type HFTokenizer --tokenizer-model /path/to/tokenizer --append-eod --workers 4 --seq-length 1000000000000000019884624838656 + done + ``` +### Step 4 : + * Multiple tokenized dataset files are merged into a single file using the below method: + ```bash + # merge tokenized files + mkdir zh_tokenized_merged + $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/merge_datasets.py --input zh_tokenized --output-prefix zh_tokenized_merged/tokenized_text_document + # use the tokenized files generated from above command to train + ``` + +# Training Script Settings +* Based on the tokenization method, update the tokenizer type: + ``` + HL_TOKENIZER_TYPE=GPT2BPETokenizer + ``` +* To run custom tokenizer code from local path using HFTokenizer method: + ``` + HL_TRUST_REMOTE_CODE=1 + ``` +* Update data root dir with the path of your choice: + ``` + HL_DATA_DIR_ROOT=/data/bigscience/oscar-en + ``` +* Update data file prefix(*.bin and *.idx) based on file name in data root dir: + ``` + HL_DATA_FILE_PREFIX=tokenized_text_document + ``` +* Update tokenizer.model file path if it is not in data root dir, required for any sentence piece based tokenizer: + ``` + HL_TOKENIZER_MODEL=path/to/tokenizer.model + ``` + +Note: For the training commands, make sure to change the IP addresses in hostsfile according to your setup. +`HL_RESULTS_DIR` and `HL_DATA_DIR_ROOT` must be shared writable across all nodes and launchers when running training on more than 8 cards. 
+The same applies to `HL_CHECKPOINTS_DIR`, `HL_TENSORBOARD_DIR` and `HL_KILL_SWITCH` if specified.
+If `HL_DATA_DIR_ROOT` is not writable, then `HL_DATA_CACHE_DIR` must be set to a writable location that is
+shared and accessible across all nodes and launchers when running training on more than 8 cards.
+
+
+# LLaMA Training and Examples
+* Training of LLaMA is based on https://arxiv.org/abs/2302.13971
+* Training of LLaMA 2 is based on https://arxiv.org/pdf/2307.09288
+
+## Multi-Card Training Examples
+* Run LLaMA 2 13B on 8 HPUs with BF16 precision:
+  ```
+  HL_NUM_NODES=1 HL_PP=2 HL_TP=2 HL_DP=2 scripts/run_llama.sh
+  ```
+
+* Run LLaMA 2 13B on 64 HPUs with BF16 precision:
+  ```
+  HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=8 HL_PP=2 HL_TP=2 HL_DP=16 scripts/run_llama.sh
+  ```
+
+* Run LLaMA 2 70B on 32 HPUs with BF16 precision:
+  ```
+  HL_HOSTSFILE=scripts/hostsfile HL_LLAMA_MODEL_SIZE=70 HL_NUM_NODES=4 HL_PP=4 HL_TP=8 HL_DP=1 scripts/run_llama.sh
+  ```
+
+LLaMA 2 training supports FP8 precision, which improves model performance. To enable FP8, set `HL_USE_TRANSFORMER_ENGINE=1`. Several FP8 parameters adjust model performance, accuracy, and memory utilization. It is not recommended to change the following default parameters, as they are set optimally:
+  - `HL_FP8_FORMAT=hybrid`
+  - `HL_FP8_MARGIN=0`
+  - `HL_FP8_AMAX_RECOMPUTE_ALGO=max`
+  - `HL_FP8_AMAX_REDUCE=1`
+  - `HL_FP8_MEASURE_INTERVAL=GBS/micro_batch_size/DP`
+  - `HL_FP8_AMAX_HISTORY_LEN=GBS/micro_batch_size/DP`
+
+The parameters below can be added to improve model performance while using FP8. Try adding them if you have enough memory:
+  - `HL_USE_CACHE_FP8_WEIGHT_FWD=1`
+  - `HL_USE_CACHE_FP8_WEIGHT=1`
+
+* Run LLaMA 2 70B on 32 HPUs with FP8 precision:
+  ```
+  HL_HOSTSFILE=scripts/hostsfile HL_LLAMA_MODEL_SIZE=70 HL_NUM_NODES=4 HL_PP=4 HL_TP=8 HL_DP=1 HL_CKP_ACT=0 HL_SEQ_LEN=4096 HL_MICRO_BATCH=1 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 scripts/run_llama.sh
+  ```
+
+* Run LLaMA 2 13B on 16 HPUs with FP8 precision:
+  ```
+  HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=2 HL_PP=2 HL_TP=2 HL_DP=4 HL_CKP_ACT=2 HL_SEQ_LEN=4096 HL_ZERO_STAGE=1 HL_USE_FAST_SOFTMAX=1 HL_MICRO_BATCH=2 HL_GRAD_ACCUM_DTYPE=bf16 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 HL_USE_CACHE_FP8_WEIGHT=1 scripts/run_llama.sh
+  ```
+
+* Run LLaMA 2 7B on 8 HPUs with FP8 precision:
+  ```
+  HL_LLAMA_MODEL_SIZE=7 HL_NUM_NODES=1 HL_PP=1 HL_TP=1 HL_DP=8 HL_CKP_ACT=2 HL_SEQ_LEN=4096 HL_ZERO_STAGE=1 HL_USE_FAST_SOFTMAX=1 HL_MICRO_BATCH=1 HL_GRAD_ACCUM_DTYPE=bf16 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 HL_USE_CACHE_FP8_WEIGHT=1 scripts/run_llama.sh
+  ```
+
+# Mixtral Training and Examples
+* Training of Mixtral is based on https://arxiv.org/abs/2401.04088
+
+## Multi-Card Training Examples
+Configure the following for the Mixtral examples below:
+* Set the correct path for `HL_DATA_DIR_ROOT`.
+* Set the correct values for `HL_TOKENIZER_TYPE` and `HL_DATA_FILE_PREFIX`.
+* Add `HL_DATA_CACHE_DIR` and/or `HL_TOKENIZER_MODEL` if necessary.
+
+Refer to [training script settings](#training-script-settings) for details.
+
+In addition, Capacity Bins functionality was introduced for Mixtral. Capacity Bins is a solution for more
+performance-efficient handling of the dynamicity in the Mixture of Experts layer. Expert capacity values are
+limited to a fixed set of values (defined by bins). Bins are auto-optimized at given step intervals,
+based on previous bin usage frequencies.
+
+Capacity bins are configured using the following variables:
+* `HL_MOE_NUM_CAPACITY_BINS` - Number of bins to be used.
+* `HL_CAPACITY_BINS_EXP_BASE` - Exponential base for the initialization of capacity bins.
+Bins are generated with exponentially growing widths; bins closer to the start are smaller
+and therefore carry less extra, non-required capacity.
+* `HL_MOE_CAPACITY_BINS_ALIGNMENT` - Every capacity bin value (initialized or optimized)
+will be a multiple of this alignment.
+* `HL_MOE_CAPACITY_BINS_OPTIMIZE_INTERVAL` - Step interval for auto-optimization of MoE capacity bins.
+* `HL_MOE_CAPACITY_BINS_OPTIMIZE_MAX_GROUP` - Maximum group size of adjacent MoE gates
+whose capacity bins are optimized jointly.
+
+Capacity bins functionality is enabled by setting `HL_MOE_NUM_CAPACITY_BINS`. The recommended
+configuration is to set `HL_MOE_NUM_CAPACITY_BINS=10` and leave the other parameters at their default values.
+
+* Run Mixtral 8x7B on 32 HPUs, Lazy mode, with BF16 precision and sequence length 32k:
+  ```
+  HL_HOSTSFILE=$MEGATRON_DEEPSPEED_ROOT/scripts/hostsfile \
+  HL_MOE_NUM_CAPACITY_BINS=10 \
+  HL_NUM_NODES=4 \
+  HL_TP=8 \
+  HL_MOE_EP=1 \
+  HL_SEQ_PARALLEL=1 \
+  HL_MOE_ENABLE_EXPERT_TP=1 \
+  HL_ZERO_STAGE=1 \
+  HL_CKP_ACT=1 \
+  $MEGATRON_DEEPSPEED_ROOT/scripts/run_mixtral.sh
+  ```
+
+# Supported Configuration
+| Validated on | Intel Gaudi Software Version | PyTorch Version | Mode     |
+|--------------|------------------------------|-----------------|----------|
+| Gaudi 2      | 1.17.0                       | 2.3.1           | Training |
+
+
+# Changelog
+## 1.17.0
+ - Added throughput timers configuration to the DeepSpeed JSON config.
+ - Rebased Megatron-DeepSpeed repository from [PR#372](https://github.com/microsoft/Megatron-DeepSpeed/pull/372) to [PR#374](https://github.com/microsoft/Megatron-DeepSpeed/pull/374).
+ - Added support for Megatron-DeepSpeed Eval Harness tasks. A usage example is available [here](tasks/eval_harness/README.md#run-mds-eval-harness).
+ - Added support for full recompute in FP8.
+ - Added Lazy mode support for Mixtral.
+ - Added Capacity Bins functionality for Mixtral.
+ - Added Megatron-DeepSpeed to Hugging Face checkpoint conversion support. A usage example is available [here](./tools/convert_checkpoint/README.md#megatron-deepspeed-to-universal-then-to-hf-transformers).
+## 1.16.0
+ - Added the Mixtral model with Eager and torch.compile mode support. Lazy mode is not supported.
+ - Rebased Megatron-DeepSpeed repository from [PR#307](https://github.com/microsoft/Megatron-DeepSpeed/pull/307) to [PR#372](https://github.com/microsoft/Megatron-DeepSpeed/pull/372).
+ - Set the LLaMA 2 model as the default.
+ - Added support for Zeroshot_gpt tasks using DeepSpeed 3D parallelism.
+ - Added support for ALiBi positional embeddings in core attention only.
+ - Added support for fast softmax. Currently disabled by default.
+ - Added support for accumulation of gradients in BF16. Currently disabled by default.
+## 1.15.0
+ - Initial release.
+
+### Script Modifications
+Major changes made to the original model from the [microsoft/Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/commit/3c5f47563f697702c1e305fa01b7563f54b747fc) repository:
+* Changed the README file content.
+* Changed the TFLOPs calculation.
+* Added HPU FP8 support.
+* Added flash attention support via FusedSDPA for the HPU accelerator.
+* Added checkpoint verification.
+* Added a kill-switch mechanism to gracefully stop training.
+
+# Known Issues
+* Only scripts and configurations mentioned in this README are supported and verified.
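To tie the pieces together, here is a minimal launch sketch that combines the dataset and tokenizer settings from the Training Script Settings section with the 8-HPU LLaMA 2 13B layout shown above. It is illustrative only: the paths are placeholders for your own locations (the data directory assumes the oscar `zh` set tokenized and merged as in the preparation steps), and any `HL_*` variable not set here keeps the script's default.

```bash
# Illustrative only: replace the placeholder paths with your own locations.
export MEGATRON_DEEPSPEED_ROOT=/path/to/Megatron-DeepSpeed
export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH

# Dataset/tokenizer settings plus the 8-HPU LLaMA 2 13B parallelism layout.
HL_TOKENIZER_TYPE=GPT2BPETokenizer \
HL_DATA_DIR_ROOT=/data/oscar/zh_tokenized_merged \
HL_DATA_FILE_PREFIX=tokenized_text_document \
HL_RESULTS_DIR=/shared/results \
HL_NUM_NODES=1 HL_PP=2 HL_TP=2 HL_DP=2 \
$MEGATRON_DEEPSPEED_ROOT/scripts/run_llama.sh
```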
diff --git a/examples_deepspeed/MoE/ds_pretrain_gpt_125M_MoE64.sh b/examples_deepspeed/MoE/ds_pretrain_gpt_125M_MoE64.sh index f93f0b7126..99ae9e8c8a 100644 --- a/examples_deepspeed/MoE/ds_pretrain_gpt_125M_MoE64.sh +++ b/examples_deepspeed/MoE/ds_pretrain_gpt_125M_MoE64.sh @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash DIR=`pwd` ############################################################################### @@ -119,8 +121,14 @@ MP_SIZE=1 ## Currently we don't support PP for MoE. To disable PP, set PP_SIZE ## to 1 and use the "--no-pipeline-parallel" arg. PP_SIZE=1 -NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) -NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +nvidia-smi || count_GPU=0 +if [[ ${count_GPU} == 0 ]];then + NUM_GPUS=$(lspci | grep -i "Processing accelerators: Habana Labs Ltd." | wc -l) + NUM_GPUS_PERNODE=${NUM_GPUS} +else + NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) + NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +fi NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} )) ############################################################################### ### MoE configs @@ -172,6 +180,7 @@ LOG_INTERVAL=10 EVAL_ITERS=10 EVAL_INTERVAL=100 SAVE_INTERVAL=10000 +EXIT_INTERVAL=${HL_EXIT_INTERVAL:-0} ## Standard deviation for weight initialization ## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B @@ -241,13 +250,17 @@ if [ "${USE_INTERNAL_DATA}" = "true" ]; then 0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \ 0.01359 ${ARX} 0.01588 ${GIT}" else - VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json - MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt + #VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json + #MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt # Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/ # For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100 - DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document + #DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document # For cluster Azure-WestUS3-A100 # DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document + BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} + VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json + MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt + DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document fi ############################################################################### data_options=" \ @@ -284,6 +297,7 @@ megatron_options=" \ --min-lr ${MIN_LR} \ --lr-decay-style cosine \ --split 98,2,0 \ + --exit-interval ${EXIT_INTERVAL} \ --log-interval ${LOG_INTERVAL} \ --eval-interval ${EVAL_INTERVAL} \ --eval-iters ${EVAL_ITERS} \ @@ -299,11 +313,12 @@ megatron_options=" \ --log-timers-to-tensorboard \ --log-batch-size-to-tensorboard \ --log-validation-ppl-to-tensorboard \ + --no-gradient-accumulation-fusion \ --tensorboard-dir ${TENSORBOARD_DIR}" if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then megatron_options="${megatron_options} \ - --checkpoint-activations" + --checkpoint-activations --recompute-granularity=full --recompute-method=uniform" fi if [[ $EP_SIZE -gt 1 ]]; then @@ -329,12 +344,12 @@ sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \ | sed 
"s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \ | sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \ | sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \ - > ${config_json} + > ${config_json} deepspeed_options=" \ - --deepspeed \ - --deepspeed_config ${config_json} \ - --pipeline-model-parallel-size ${PP_SIZE}" + --deepspeed \ + --deepspeed_config ${config_json} \ + --pipeline-model-parallel-size ${PP_SIZE}" # Currently MoE is not compatible with pipeline parallel if [[ $EP_SIZE -gt 1 ]]; then @@ -369,4 +384,4 @@ fi run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log" echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/examples_deepspeed/MoE/ds_pretrain_gpt_125M_dense_cl.sh b/examples_deepspeed/MoE/ds_pretrain_gpt_125M_dense_cl.sh index 36b654e02b..41be24b8d3 100644 --- a/examples_deepspeed/MoE/ds_pretrain_gpt_125M_dense_cl.sh +++ b/examples_deepspeed/MoE/ds_pretrain_gpt_125M_dense_cl.sh @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash DIR=`pwd` ############################################################################### @@ -123,8 +125,14 @@ NO_PP="true" ZERO_STAGE=0 ## Total number of GPUs -NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) -NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +nvidia-smi || count_GPU=0 +if [[ ${count_GPU} == 0 ]];then + NUM_GPUS=$(lspci | grep -i "Processing accelerators: Habana Labs Ltd." | wc -l) + NUM_GPUS_PERNODE=${NUM_GPUS} +else + NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) + NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +fi NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} )) DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} )) ############################################################################### @@ -143,6 +151,7 @@ LOG_INTERVAL=10 EVAL_ITERS=10 EVAL_INTERVAL=100 SAVE_INTERVAL=1000 +EXIT_INTERVAL=${HL_EXIT_INTERVAL:-0} ## Standard deviation for weight initialization. Usually larger model needs ## lower std. 
We used a heuristic equation of sqrt(1/3/HIDDEN_SIZE) from the @@ -175,13 +184,17 @@ mkdir -p ${LOG_PATH} mkdir -p ${TENSORBOARD_PATH} mkdir -p ${CHECKPOINT_PATH} -VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json -MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt +#VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json +#MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt # Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/ # For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100 -DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document +#DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document # For cluster Azure-WestUS3-A100 # DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json +MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document ############################################################################### data_options=" \ --vocab-file ${VOCAB_PATH} \ @@ -211,6 +224,7 @@ megatron_options=" \ --min-lr ${MIN_LR} \ --lr-decay-style cosine \ --split 98,2,0 \ + --exit-interval ${EXIT_INTERVAL} \ --log-interval ${LOG_INTERVAL} \ --eval-interval ${EVAL_INTERVAL} \ --eval-iters ${EVAL_ITERS} \ @@ -226,11 +240,12 @@ megatron_options=" \ --log-timers-to-tensorboard \ --log-batch-size-to-tensorboard \ --log-validation-ppl-to-tensorboard \ + --no-gradient-accumulation-fusion \ --tensorboard-dir ${TENSORBOARD_PATH}" if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then megatron_options="${megatron_options} \ - --checkpoint-activations" + --checkpoint-activations --recompute-granularity=full --recompute-method=uniform" fi if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then @@ -306,4 +321,4 @@ fi run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log" echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/examples_deepspeed/MoE/readme_evalharness.md b/examples_deepspeed/MoE/readme_evalharness.md index d30075e2fc..1b61d34300 100644 --- a/examples_deepspeed/MoE/readme_evalharness.md +++ b/examples_deepspeed/MoE/readme_evalharness.md @@ -165,4 +165,4 @@ Import location: Replace data at selected cell 4. Now it should be easy to align the new records with the old ones - delete irrelevant records and Insert->Cells where data is missing until the first 2 columns match -5. now create 2 cols in the main table on top and now it should be safe to Copy-n-Paste the 2-col data range, without the task/metrics columns into the newly created space. --> +5. now create 2 cols in the main table on top and now it should be safe to Copy-n-Paste the 2-col data range, without the task/metrics columns into the newly created space. --> \ No newline at end of file diff --git a/examples_deepspeed/run_deepspeed_example.sh b/examples_deepspeed/run_deepspeed_example.sh index 909cdf6713..de89d65b33 100644 --- a/examples_deepspeed/run_deepspeed_example.sh +++ b/examples_deepspeed/run_deepspeed_example.sh @@ -1,8 +1,12 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+ #!/bin/bash set -ex -BASE_PATH=/vc_data/Megatron-LM/data -DATA_PATH=${BASE_PATH}/indexed_datasets/megatron +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document +VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json +MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt DS_CONFIG=ds_config.json TP=1 @@ -48,7 +52,7 @@ ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}" ds_args=" --deepspeed-activation-checkpointing ${ds_args}" -deepspeed pretrain_gpt.py \ +deepspeed ../pretrain_gpt.py \ --tensor-model-parallel-size $TP \ --pipeline-model-parallel-size $PP \ --num-layers $NLAYERS \ @@ -67,8 +71,8 @@ deepspeed pretrain_gpt.py \ --eval-iters 40 \ --eval-interval 1000 \ --data-path $DATA_PATH \ - --vocab-file $BASE_PATH/gpt2-vocab.json \ - --merge-file $BASE_PATH/gpt2-merges.txt \ + --vocab-file $VOCAB_PATH \ + --merge-file $MERGE_PATH \ --save-interval 1000 \ --split 98,2,0 \ --clip-grad 1.0 \ @@ -78,7 +82,9 @@ deepspeed pretrain_gpt.py \ --init-method-std 0.006 \ --fp16 \ --checkpoint-activations \ + --recompute-granularity=full \ + --recompute-method=uniform \ + --no-gradient-accumulation-fusion \ --tensorboard-dir $OUTPUT_DIR \ $ds_args \ - --exit-interval 5000 | tee ${OUTPUT_DIR}/output.log - + --exit-interval 5000 | tee ${OUTPUT_DIR}/output.log \ No newline at end of file diff --git a/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh b/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh index da028dc731..2256adf5a5 100644 --- a/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh +++ b/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash dir=`pwd` ############################################################################### @@ -147,8 +149,14 @@ no_pp="true" zero_stage=1 ## Total number of GPUs. ds_ssh is from DeepSpeed library. -num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) -num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +nvidia-smi || count_GPU=0 +if [[ ${count_GPU} == 0 ]];then + num_gpus=$(lspci | grep -i "Processing accelerators: Habana Labs Ltd." | wc -l) + num_gpus_pernode=${num_gpus} +else + num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) + num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +fi num_node=$(( ${num_gpus} / ${num_gpus_pernode} )) ## Data parallel size. @@ -187,21 +195,28 @@ host="${HOSTNAME}" seed=1234 num_workers=0 -data_path="BookCorpusDataset_text_document" -if [ ! -f "BookCorpusDataset_text_document.bin" ]; then - wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin -fi -if [ ! -f "BookCorpusDataset_text_document.idx" ]; then - wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx -fi - -vocab_path="gpt2-vocab.json" -if [ ! -f "$vocab_path" ]; then - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -fi -merge_path="gpt2-merges.txt" -if [ ! -f "$merge_path" ]; then - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt +USE_INTERNAL_DATA="false" +if [ "${USE_INTERNAL_DATA}" = "true" ]; then + data_path="BookCorpusDataset_text_document" + if [ ! -f "BookCorpusDataset_text_document.bin" ]; then + wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin + fi + if [ ! 
-f "BookCorpusDataset_text_document.idx" ]; then + wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx + fi + vocab_path="gpt2-vocab.json" + if [ ! -f "$vocab_path" ]; then + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json + fi + merge_path="gpt2-merges.txt" + if [ ! -f "$merge_path" ]; then + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt + fi +else + BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} + data_path=${BASE_DATA_PATH}/meg-gpt2_text_document + vocab_path=${BASE_DATA_PATH}/gpt2-vocab.json + merge_path=${BASE_DATA_PATH}/gpt2-merges.txt fi prescale_grad="true" @@ -282,11 +297,12 @@ megatron_options=" \ --log-timers-to-tensorboard \ --log-batch-size-to-tensorboard \ --log-validation-ppl-to-tensorboard \ + --no-gradient-accumulation-fusion \ --tensorboard-dir ${tensorboard_path}" if [ "${activation_checkpoint}" = "true" ]; then megatron_options="${megatron_options} \ - --checkpoint-activations" + --checkpoint-activations --recompute-granularity=full --recompute-method=uniform" fi if [ "${log_optimizer_state}" = "true" ]; then @@ -338,4 +354,4 @@ if [[ $iteration -gt 0 ]]; then ds_ssh "echo $iteration_2 > $iteration_file_2" fi -deepspeed ${dir}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} 2>&1 | tee ${log_path}/${jobname}_${host}_${current_time}.log +deepspeed ${dir}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} 2>&1 | tee ${log_path}/${jobname}_${host}_${current_time}.log \ No newline at end of file diff --git a/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_30B_seq_parallel_32k.sh b/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_30B_seq_parallel_32k.sh index f23e6f9585..be1bf071f3 100644 --- a/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_30B_seq_parallel_32k.sh +++ b/examples_deepspeed/sequence_parallel/ds_pretrain_gpt_30B_seq_parallel_32k.sh @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash dir=`pwd` ############################################################################### @@ -157,8 +159,14 @@ no_pp="true" zero_stage=3 ## Total number of GPUs. ds_ssh is from DeepSpeed library. -num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) -num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +nvidia-smi || count_GPU=0 +if [[ ${count_GPU} == 0 ]];then + num_gpus=$(lspci | grep -i "Processing accelerators: Habana Labs Ltd." | wc -l) + num_gpus_pernode=${num_gpus} +else + num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2)) + num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +fi num_node=$(( ${num_gpus} / ${num_gpus_pernode} )) ## Data parallel size. @@ -197,21 +205,28 @@ host="${HOSTNAME}" seed=1234 num_workers=0 -data_path="BookCorpusDataset_text_document" -if [ ! -f "BookCorpusDataset_text_document.bin" ]; then - wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin -fi -if [ ! -f "BookCorpusDataset_text_document.idx" ]; then - wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx -fi - -vocab_path="gpt2-vocab.json" -if [ ! -f "$vocab_path" ]; then - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -fi -merge_path="gpt2-merges.txt" -if [ ! 
-f "$merge_path" ]; then - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt +USE_INTERNAL_DATA="false" +if [ "${USE_INTERNAL_DATA}" = "true" ]; then + data_path="BookCorpusDataset_text_document" + if [ ! -f "BookCorpusDataset_text_document.bin" ]; then + wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin + fi + if [ ! -f "BookCorpusDataset_text_document.idx" ]; then + wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx + fi + vocab_path="gpt2-vocab.json" + if [ ! -f "$vocab_path" ]; then + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json + fi + merge_path="gpt2-merges.txt" + if [ ! -f "$merge_path" ]; then + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt + fi +else + BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} + data_path=${BASE_DATA_PATH}/meg-gpt2_text_document + vocab_path=${BASE_DATA_PATH}/gpt2-vocab.json + merge_path=${BASE_DATA_PATH}/gpt2-merges.txt fi prescale_grad="true" @@ -292,11 +307,12 @@ megatron_options=" \ --log-timers-to-tensorboard \ --log-batch-size-to-tensorboard \ --log-validation-ppl-to-tensorboard \ + --no-gradient-accumulation-fusion \ --tensorboard-dir ${tensorboard_path}" if [ "${activation_checkpoint}" = "true" ]; then megatron_options="${megatron_options} \ - --checkpoint-activations" + --checkpoint-activations --recompute-granularity=full --recompute-method=uniform" fi if [ "${log_optimizer_state}" = "true" ]; then @@ -348,4 +364,4 @@ if [[ $iteration -gt 0 ]]; then ds_ssh "echo $iteration_2 > $iteration_file_2" fi -deepspeed ${dir}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} 2>&1 | tee ${log_path}/${jobname}_${host}_${current_time}.log +deepspeed ${dir}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} 2>&1 | tee ${log_path}/${jobname}_${host}_${current_time}.log \ No newline at end of file diff --git a/examples_deepspeed/universal_checkpointing/run_bf16.sh b/examples_deepspeed/universal_checkpointing/run_bf16.sh index 0953954222..69680e4f8b 100755 --- a/examples_deepspeed/universal_checkpointing/run_bf16.sh +++ b/examples_deepspeed/universal_checkpointing/run_bf16.sh @@ -1,14 +1,14 @@ -#!/bin/bash +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+#!/bin/bash DIR=`pwd` DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'` -BASE_DATA_PATH=datasets -DATASET=${BASE_DATA_PATH}/my-gpt2_text_document +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt - script_path=$(realpath $0) script_dir=$(dirname $script_path) CONFIG_JSON="$script_dir/ds_config.json" @@ -77,41 +77,44 @@ done options=" \ - --tensor-model-parallel-size $TP \ - --pipeline-model-parallel-size $PP \ - --ds-sequence-parallel-size $SP \ - --num-layers $LAYERS \ - --hidden-size $HIDDEN \ - --num-attention-heads 32 \ - --seq-length $SEQ \ - --loss-scale 12 \ - --max-position-embeddings $SEQ \ - --micro-batch-size $MICRO_BATCH \ - --global-batch-size $GLOBAL_BATCH \ - --train-iters $TRAIN_ITERS \ - --lr $LR \ - --min-lr $MIN_LR \ - --lr-decay-style cosine \ - --log-interval 1 \ - --eval-iters 40 \ - --eval-interval 10 \ - --data-path ${DATASET} \ - --vocab-file ${VOCAB_PATH} \ - --merge-file ${MERGE_PATH} \ - --save-interval 100 \ - --split 98,2,0 \ - --clip-grad 1.0 \ - --weight-decay 0.1 \ - --adam-beta1 0.9 \ - --adam-beta2 0.95 \ - --init-method-std 0.006 \ - --${DTYPE} \ - --checkpoint-activations \ - --exit-interval ${EXIT_INTERVAL} \ - --save ${CHECKPOINT_PATH} \ - --load ${LOAD_CHECKPOINT_PATH} \ - --make-vocab-size-divisible-by 256 \ - --tensorboard-dir $LOG_DIR + --tensor-model-parallel-size $TP \ + --pipeline-model-parallel-size $PP \ + --ds-sequence-parallel-size $SP \ + --num-layers $LAYERS \ + --hidden-size $HIDDEN \ + --num-attention-heads 32 \ + --seq-length $SEQ \ + --loss-scale 12 \ + --max-position-embeddings $SEQ \ + --micro-batch-size $MICRO_BATCH \ + --global-batch-size $GLOBAL_BATCH \ + --train-iters $TRAIN_ITERS \ + --lr $LR \ + --min-lr $MIN_LR \ + --lr-decay-style cosine \ + --log-interval 1 \ + --eval-iters 40 \ + --eval-interval 10 \ + --data-path ${DATA_PATH} \ + --vocab-file ${VOCAB_PATH} \ + --merge-file ${MERGE_PATH} \ + --save-interval 100 \ + --split 98,2,0 \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --init-method-std 0.006 \ + --${DTYPE} \ + --checkpoint-activations \ + --recompute-granularity=full \ + --recompute-method=uniform \ + --no-gradient-accumulation-fusion \ + --exit-interval ${EXIT_INTERVAL} \ + --save ${CHECKPOINT_PATH} \ + --load ${LOAD_CHECKPOINT_PATH} \ + --make-vocab-size-divisible-by 256 \ + --tensorboard-dir $LOG_DIR " options="${options} \ @@ -148,10 +151,10 @@ cat < $CONFIG_JSON EOT WORKER_STR="--num_nodes 1 --num_gpus $WORLD_SIZE" -run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/pretrain_gpt.py $@ ${options}" +run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/../../pretrain_gpt.py $@ ${options}" echo ${options} echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/examples_deepspeed/universal_checkpointing/run_fp16.sh b/examples_deepspeed/universal_checkpointing/run_fp16.sh index 691fa8a8e6..0733bb55c1 100755 --- a/examples_deepspeed/universal_checkpointing/run_fp16.sh +++ b/examples_deepspeed/universal_checkpointing/run_fp16.sh @@ -1,14 +1,15 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+ #!/bin/bash DIR=`pwd` DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'` -BASE_DATA_PATH=datasets -DATASET=${BASE_DATA_PATH}/my-gpt2_text_document +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt - script_path=$(realpath $0) script_dir=$(dirname $script_path) CONFIG_JSON="$script_dir/ds_config.json" @@ -77,41 +78,44 @@ done options=" \ - --tensor-model-parallel-size $TP \ - --pipeline-model-parallel-size $PP \ + --tensor-model-parallel-size $TP \ + --pipeline-model-parallel-size $PP \ --ds-sequence-parallel-size $SP \ - --num-layers $LAYERS \ - --hidden-size $HIDDEN \ - --num-attention-heads 32 \ - --seq-length $SEQ \ - --loss-scale 12 \ - --max-position-embeddings $SEQ \ - --micro-batch-size $MICRO_BATCH \ - --global-batch-size $GLOBAL_BATCH \ - --train-iters $TRAIN_ITERS \ - --lr $LR \ - --min-lr $MIN_LR \ - --lr-decay-style cosine \ - --log-interval 1 \ - --eval-iters 40 \ - --eval-interval 10 \ - --data-path ${DATASET} \ - --vocab-file ${VOCAB_PATH} \ - --merge-file ${MERGE_PATH} \ - --save-interval 100 \ - --split 98,2,0 \ - --clip-grad 1.0 \ - --weight-decay 0.1 \ - --adam-beta1 0.9 \ - --adam-beta2 0.95 \ - --init-method-std 0.006 \ - --${DTYPE} \ - --checkpoint-activations \ - --exit-interval ${EXIT_INTERVAL} \ - --save ${CHECKPOINT_PATH} \ - --load ${LOAD_CHECKPOINT_PATH} \ - --make-vocab-size-divisible-by 256 \ - --tensorboard-dir $LOG_DIR + --num-layers $LAYERS \ + --hidden-size $HIDDEN \ + --num-attention-heads 32 \ + --seq-length $SEQ \ + --loss-scale 12 \ + --max-position-embeddings $SEQ \ + --micro-batch-size $MICRO_BATCH \ + --global-batch-size $GLOBAL_BATCH \ + --train-iters $TRAIN_ITERS \ + --lr $LR \ + --min-lr $MIN_LR \ + --lr-decay-style cosine \ + --log-interval 1 \ + --eval-iters 40 \ + --eval-interval 10 \ + --data-path ${DATA_PATH} \ + --vocab-file ${VOCAB_PATH} \ + --merge-file ${MERGE_PATH} \ + --save-interval 100 \ + --split 98,2,0 \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --init-method-std 0.006 \ + --${DTYPE} \ + --checkpoint-activations \ + --recompute-granularity=full \ + --recompute-method=uniform \ + --no-gradient-accumulation-fusion \ + --exit-interval ${EXIT_INTERVAL} \ + --save ${CHECKPOINT_PATH} \ + --load ${LOAD_CHECKPOINT_PATH} \ + --make-vocab-size-divisible-by 256 \ + --tensorboard-dir $LOG_DIR " options="${options} \ @@ -153,11 +157,11 @@ cat < $CONFIG_JSON EOT WORKER_STR="--num_nodes 1 --num_gpus $WORLD_SIZE" -run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/pretrain_gpt.py $@ ${options}" +run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/../../pretrain_gpt.py $@ ${options}" echo ${options} echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/examples_deepspeed/universal_checkpointing/run_tb_analysis.sh b/examples_deepspeed/universal_checkpointing/run_tb_analysis.sh index 7aa988a0a0..a02c5cc7b0 100755 --- a/examples_deepspeed/universal_checkpointing/run_tb_analysis.sh +++ b/examples_deepspeed/universal_checkpointing/run_tb_analysis.sh @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash # Copyright (c) Microsoft Corporation. 
# SPDX-License-Identifier: Apache-2.0 @@ -11,7 +13,7 @@ if [ "$OUTPUT_PATH" == "" ]; then fi # Training Loss -python3 examples_deepspeed/universal_checkpointing/tb_analysis/tb_analysis_script.py \ +python3 tb_analysis/tb_analysis_script.py \ --tb_dir $OUTPUT_PATH \ --tb_event_key "lm-loss-training/lm loss" \ --plot_name "uc_char_training_loss.png" \ @@ -19,7 +21,7 @@ python3 examples_deepspeed/universal_checkpointing/tb_analysis/tb_analysis_scrip --use_sns # Validation Loss -python3 examples_deepspeed/universal_checkpointing/tb_analysis/tb_analysis_script.py \ +python3 tb_analysis/tb_analysis_script.py \ --tb_dir $OUTPUT_PATH \ --tb_event_key "lm-loss-validation/lm loss validation" \ --csv_name "val_" \ diff --git a/examples_deepspeed/universal_checkpointing/run_universal_bf16.sh b/examples_deepspeed/universal_checkpointing/run_universal_bf16.sh index ef0e134cfc..cc80ac1255 100755 --- a/examples_deepspeed/universal_checkpointing/run_universal_bf16.sh +++ b/examples_deepspeed/universal_checkpointing/run_universal_bf16.sh @@ -1,14 +1,15 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash DIR=`pwd` DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'` -BASE_DATA_PATH=datasets -DATASET=${BASE_DATA_PATH}/my-gpt2_text_document +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt - script_path=$(realpath $0) script_dir=$(dirname $script_path) CONFIG_JSON="$script_dir/ds_config.json" @@ -76,42 +77,45 @@ done options=" \ - --tensor-model-parallel-size $TP \ - --pipeline-model-parallel-size $PP \ - --ds-sequence-parallel-size $SP \ - --num-layers $LAYERS \ - --hidden-size $HIDDEN \ - --num-attention-heads 32 \ - --seq-length $SEQ \ - --loss-scale 12 \ - --max-position-embeddings $SEQ \ - --micro-batch-size $MICRO_BATCH \ - --global-batch-size $GLOBAL_BATCH \ - --train-iters $TRAIN_ITERS \ - --lr $LR \ - --min-lr $MIN_LR \ - --lr-decay-style cosine \ - --log-interval 1 \ - --eval-iters 40 \ - --eval-interval 10 \ - --data-path ${DATASET} \ - --vocab-file ${VOCAB_PATH} \ - --merge-file ${MERGE_PATH} \ - --save-interval 100 \ - --split 98,2,0 \ - --clip-grad 1.0 \ - --weight-decay 0.1 \ - --adam-beta1 0.9 \ - --adam-beta2 0.95 \ - --init-method-std 0.006 \ - --${DTYPE} \ - --checkpoint-activations \ - --exit-interval ${EXIT_INTERVAL} \ - --save ${CHECKPOINT_PATH} \ - --load ${LOAD_CHECKPOINT_PATH} \ - --make-vocab-size-divisible-by 256 \ - --universal-checkpoint \ - --tensorboard-dir $LOG_DIR + --tensor-model-parallel-size $TP \ + --pipeline-model-parallel-size $PP \ + --ds-sequence-parallel-size $SP \ + --num-layers $LAYERS \ + --hidden-size $HIDDEN \ + --num-attention-heads 32 \ + --seq-length $SEQ \ + --loss-scale 12 \ + --max-position-embeddings $SEQ \ + --micro-batch-size $MICRO_BATCH \ + --global-batch-size $GLOBAL_BATCH \ + --train-iters $TRAIN_ITERS \ + --lr $LR \ + --min-lr $MIN_LR \ + --lr-decay-style cosine \ + --log-interval 1 \ + --eval-iters 40 \ + --eval-interval 10 \ + --data-path ${DATA_PATH} \ + --vocab-file ${VOCAB_PATH} \ + --merge-file ${MERGE_PATH} \ + --save-interval 100 \ + --split 98,2,0 \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --init-method-std 0.006 \ + --${DTYPE} \ + --checkpoint-activations \ + --recompute-granularity=full \ + --recompute-method=uniform \ + --no-gradient-accumulation-fusion \ + --exit-interval ${EXIT_INTERVAL} \ + --save ${CHECKPOINT_PATH} \ + 
--load ${LOAD_CHECKPOINT_PATH} \ + --make-vocab-size-divisible-by 256 \ + --universal-checkpoint \ + --tensorboard-dir $LOG_DIR " options="${options} \ @@ -148,10 +152,10 @@ cat < $CONFIG_JSON EOT WORKER_STR="--num_nodes 1 --num_gpus $WORLD_SIZE" -run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/pretrain_gpt.py $@ ${options}" +run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/../../pretrain_gpt.py $@ ${options}" echo ${options} echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/examples_deepspeed/universal_checkpointing/run_universal_fp16.sh b/examples_deepspeed/universal_checkpointing/run_universal_fp16.sh index 1e207e422b..149f63fa6c 100755 --- a/examples_deepspeed/universal_checkpointing/run_universal_fp16.sh +++ b/examples_deepspeed/universal_checkpointing/run_universal_fp16.sh @@ -1,14 +1,15 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + #!/bin/bash DIR=`pwd` DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'` -BASE_DATA_PATH=datasets -DATASET=${BASE_DATA_PATH}/my-gpt2_text_document +BASE_DATA_PATH=${HL_DATA_DIR_ROOT:-/data/bigscience/oscar-en/} +DATA_PATH=${BASE_DATA_PATH}/meg-gpt2_text_document VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt - script_path=$(realpath $0) script_dir=$(dirname $script_path) CONFIG_JSON="$script_dir/ds_config.json" @@ -76,42 +77,45 @@ done options=" \ - --tensor-model-parallel-size $TP \ - --pipeline-model-parallel-size $PP \ + --tensor-model-parallel-size $TP \ + --pipeline-model-parallel-size $PP \ --ds-sequence-parallel-size $SP \ - --num-layers $LAYERS \ - --hidden-size $HIDDEN \ - --num-attention-heads 32 \ - --seq-length $SEQ \ - --loss-scale 12 \ - --max-position-embeddings $SEQ \ - --micro-batch-size $MICRO_BATCH \ - --global-batch-size $GLOBAL_BATCH \ - --train-iters $TRAIN_ITERS \ - --lr $LR \ - --min-lr $MIN_LR \ - --lr-decay-style cosine \ - --log-interval 1 \ - --eval-iters 40 \ - --eval-interval 10 \ - --data-path ${DATASET} \ - --vocab-file ${VOCAB_PATH} \ - --merge-file ${MERGE_PATH} \ - --save-interval 100 \ - --split 98,2,0 \ - --clip-grad 1.0 \ - --weight-decay 0.1 \ - --adam-beta1 0.9 \ - --adam-beta2 0.95 \ - --init-method-std 0.006 \ - --${DTYPE} \ - --checkpoint-activations \ - --exit-interval ${EXIT_INTERVAL} \ - --save ${CHECKPOINT_PATH} \ - --load ${LOAD_CHECKPOINT_PATH} \ - --make-vocab-size-divisible-by 256 \ - --universal-checkpoint \ - --tensorboard-dir $LOG_DIR + --num-layers $LAYERS \ + --hidden-size $HIDDEN \ + --num-attention-heads 32 \ + --seq-length $SEQ \ + --loss-scale 12 \ + --max-position-embeddings $SEQ \ + --micro-batch-size $MICRO_BATCH \ + --global-batch-size $GLOBAL_BATCH \ + --train-iters $TRAIN_ITERS \ + --lr $LR \ + --min-lr $MIN_LR \ + --lr-decay-style cosine \ + --log-interval 1 \ + --eval-iters 40 \ + --eval-interval 10 \ + --data-path ${DATA_PATH} \ + --vocab-file ${VOCAB_PATH} \ + --merge-file ${MERGE_PATH} \ + --save-interval 100 \ + --split 98,2,0 \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --init-method-std 0.006 \ + --${DTYPE} \ + --checkpoint-activations \ + --recompute-granularity=full \ + --recompute-method=uniform \ + --no-gradient-accumulation-fusion \ + --exit-interval ${EXIT_INTERVAL} \ + --save ${CHECKPOINT_PATH} \ + --load ${LOAD_CHECKPOINT_PATH} \ + --make-vocab-size-divisible-by 256 \ + --universal-checkpoint \ + --tensorboard-dir $LOG_DIR " options="${options} \ @@ -153,11 +157,11 @@ cat < $CONFIG_JSON EOT WORKER_STR="--num_nodes 1 
--num_gpus $WORLD_SIZE" -run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/pretrain_gpt.py $@ ${options}" +run_cmd="deepspeed --master_port 29700 $WORKER_STR ${DIR}/../../pretrain_gpt.py $@ ${options}" echo ${options} echo ${run_cmd} eval ${run_cmd} -set +x +set +x \ No newline at end of file diff --git a/megatron/__init__.py b/megatron/__init__.py index d92a279ec6..541df1ccea 100644 --- a/megatron/__init__.py +++ b/megatron/__init__.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. import torch @@ -5,6 +6,8 @@ from .global_vars import get_args, get_retro_args from .global_vars import get_current_global_batch_size from .global_vars import get_num_microbatches +from .global_vars import get_num_eval_microbatches +from .global_vars import get_num_microbatches_by_mode from .global_vars import get_signal_handler from .global_vars import update_num_microbatches from .global_vars import get_tokenizer diff --git a/megatron/arguments.py b/megatron/arguments.py index dad993be04..7852c719a8 100644 --- a/megatron/arguments.py +++ b/megatron/arguments.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Megatron arguments.""" @@ -16,6 +17,9 @@ from tools.retro.utils import get_args_path as get_retro_args_path from megatron.core.transformer import TransformerConfig +from megatron.model.utils import init_method_normal + +from deepspeed.accelerator import get_accelerator def parse_args(extra_args_provider=None, ignore_unknown_args=False): """Parse all arguments.""" @@ -44,6 +48,10 @@ def parse_args(extra_args_provider=None, ignore_unknown_args=False): parser = _add_inference_args(parser) parser = _add_transformer_engine_args(parser) parser = _add_retro_args(parser) + parser = _add_profiler_args(parser) + parser = _add_tensor_logger_args(parser) + parser = _add_pytorch_args(parser) + parser = _add_debug_args(parser) # Custom arguments. 
if extra_args_provider is not None: @@ -134,6 +142,11 @@ def validate_args(args, defaults={}): # exit() # del args.checkpoint_activations + if args.checkpoint_activations: + assert args.recompute_granularity == 'full', \ + 'cannot use --recompute-granularity=selective with --checkpoint-activations, ' \ + 'for --recompute-granularity=selective use --recompute-activations alone only' + if args.recompute_activations: args.recompute_granularity = 'selective' del args.recompute_activations @@ -161,6 +174,8 @@ def validate_args(args, defaults={}): print('setting global batch size to {}'.format( args.global_batch_size), flush=True) assert args.global_batch_size > 0 + if args.eval_micro_batch_size is None: + args.eval_micro_batch_size = args.micro_batch_size if args.num_layers_per_virtual_pipeline_stage is not None: assert args.pipeline_model_parallel_size > 2, \ 'pipeline-model-parallel size should be greater than 2 with ' \ @@ -206,6 +221,9 @@ def validate_args(args, defaults={}): assert args.DDP_impl == 'local' assert args.use_contiguous_buffers_in_local_ddp + if get_accelerator().device_name() != "hpu" and args.optimizer == "fusedadamw": + args.optimizer = "adamw" + # For torch DDP, we do not use contiguous buffer # if args.DDP_impl == 'torch': if args.DDP_impl != 'local': @@ -303,9 +321,9 @@ def validate_args(args, defaults={}): assert args.max_position_embeddings >= args.seq_length if args.decoder_seq_length is not None: assert args.max_position_embeddings >= args.decoder_seq_length - # When rotary position embeddings is used, set add_position_embedding + # When rotary/alibi position embeddings is used, set add_position_embedding # to false to turn off absolute position embedding. - if args.use_rotary_position_embeddings: + if args.use_rotary_position_embeddings or args.use_alibi_position_embeddings: args.add_position_embedding = False if args.lr is not None: assert args.min_lr <= args.lr @@ -360,15 +378,22 @@ def validate_args(args, defaults={}): 'v1.10 and above (Nvidia Pytorch container >= 21.07). Current ' \ 'pytorch version is v%s.%s.' 
% (TORCH_MAJOR, TORCH_MINOR) + list_of_formats = [args.fp8_e4m3, args.fp8_hybrid, args.fp8_e5m2] # Tranformer-Engine/FP8 related checking - if args.fp8_e4m3 or args.fp8_hybrid: + if any(list_of_formats): assert args.transformer_impl == 'transformer_engine', \ 'transformer-engine required for fp8 training and inference' - assert not (args.fp8_e4m3 and args.fp8_hybrid), \ - 'cannot train with both fp8 e4m3 and hybrid formatting' + assert not (str(list_of_formats).count('True') >= 2), \ + 'fp8 formatting can only be one of: e5m2, e4m3, hybrid' + + assert not (args.fp8_e4m3 and get_accelerator().device_name() == "hpu"), \ + 'hpu does not support e4m3 format type' if args.recompute_granularity == 'selective': + assert not args.partition_activations, \ + '--partition-activations is supported only with ' \ + '--recompute-granularity=full' # TODO: [SW-162958] assert args.recompute_method is None, \ 'recompute method is not yet supported for ' \ 'selective recomputing granularity' @@ -383,6 +408,12 @@ def validate_args(args, defaults={}): # model parallel memory optimization is enabled if args.sequence_parallel: args.async_tensor_model_parallel_allreduce = False + assert args.check_tp_norm == False, "with sequence_parallel enabled," \ + " args.check_tp_norm cannot be used" + else: + if args.check_tp_norm: + assert args.start_check_tp_norm_iter >= 0 + assert args.end_check_tp_norm_iter >= args.start_check_tp_norm_iter # TODO: currently DeepSpeed seems to be incompatible with # async_tensor_model_parallel_allreduce thus temporarily disabling it. @@ -391,7 +422,7 @@ def validate_args(args, defaults={}): args.async_tensor_model_parallel_allreduce = False if not args.use_dataset_only: - if os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1": + if get_accelerator().device_name() == "cuda" and os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1": if args.sequence_parallel: raise RuntimeError( "Using sequence parallelism requires setting the environment variable " @@ -422,6 +453,21 @@ def validate_args(args, defaults={}): # FlashAttention args.use_flash_attn = args.use_flash_attn_v1 or args.use_flash_attn_triton or args.use_flash_attn_v2 or args.use_flash_attn_builder + if get_accelerator().device_name() != "hpu": + args.use_fused_sdpa = False + else: + if args.use_fused_sdpa_with_recompute: + args.use_fused_sdpa = args.use_fused_sdpa_with_recompute + if args.use_fused_sdpa and args.recompute_granularity: + assert args.use_fused_sdpa_with_recompute is False, "Please use either " \ + "use_fused_sdpa_with_recompute or recompute_granularity only but not both" + + if args.use_alibi_position_embeddings: + assert not args.use_fused_sdpa, "use_alibi_position_embeddings is not " \ + "supported with use_fused_sdpa" + + if get_accelerator().device_name() != "hpu" or args.normalization != "rmsnorm": + args.use_fused_rmsnorm = False # AML if args.aml_data_download_path is not None: @@ -442,6 +488,13 @@ def validate_args(args, defaults={}): assert not args.mos, 'GQA currently does not support args.mos' assert not args.kd, 'GQA currently does not support args.kd' + # MoE + moe_ds_enabled = max(args.num_experts) > 1 + if moe_ds_enabled: + sp_enabled = args.sequence_parallel and args.tensor_model_parallel_size > 1 + assert not sp_enabled or args.enable_expert_tensor_parallelism, \ + 'MoE with sequence parallelism is only supported when using --enable-expert-tensor-parallelism' + # Print arguments. 
_print_args("arguments", args) retro_args = get_retro_args() @@ -485,15 +538,28 @@ def core_transformer_config_from_args(args): kw_args['activation_func'] = F.silu kw_args['gated_linear_unit'] = True kw_args['bias_gelu_fusion'] = False + if args.no_scaled_init: + kw_args['output_layer_init_method'] = init_method_normal(args.init_method_std) if args.init_method_xavier_uniform: kw_args['init_method'] = torch.nn.init.xavier_uniform_ - kw_args['scaled_init_method'] = torch.nn.init.xavier_uniform_ + kw_args['output_layer_init_method'] = torch.nn.init.xavier_uniform_ return TransformerConfig(**kw_args) def _add_transformer_engine_args(parser): group = parser.add_argument_group(title='Transformer-Engine') + group.add_argument('--cache-fp8-weight', + default=False, + action='store_true', + help='Cache fp8 weight from forward to backward. \ + This will increase memory usage, but improve performance.') + group.add_argument('--cache-fp8-weight-fwd', + type=lambda x: x.lower() in ['true', '1'], + default=True, + help='In forward, calculate fp8 weight only once for the entire batch.') + group.add_argument('--fp8-e5m2', action='store_true', + help='E5M2 TransformerLayer', dest='fp8_e5m2') group.add_argument('--fp8-e4m3', action='store_true', help='E4M3 TransformerLayer', dest='fp8_e4m3') group.add_argument('--fp8-hybrid', action='store_true', @@ -515,6 +581,8 @@ def _add_transformer_engine_args(parser): choices=['most_recent', 'max'], help='Algorithm for computing amax from history', dest='fp8_amax_compute_algo') + group.add_argument('--fp8-amax-reduce', action='store_true', default=False, + help='Sync amax between workers') return parser @@ -539,6 +607,10 @@ def _add_inference_args(parser): choices=["megatron", "huggingface"], help='Select either Megatron or Huggingface as the ' 'Bert embedder.') + group.add_argument('--eval-hf-rope', action='store_true', default=False, + help='Run RoPE in HuggingFace way') + group.add_argument('--eval-add-bos', action='store_true', default=False, + help='Add beginning of sentence (bos) token when encoding with tokenizer') return parser @@ -640,12 +712,16 @@ def _add_network_size_args(parser): help='Options for layer normalization type:' ' layernorm' ' rmsnorm') + group.add_argument('--use-fused-rmsnorm', + type=lambda x: x.lower() in ['true', '1'], + default=True, + help='Enable FusedRMSNorm when rmsnorm normalization is used.') group.add_argument('--layernorm-epsilon', type=float, default=1e-5, help='Layer norm epsilon.') group.add_argument('--apply-layernorm-1p', action='store_true', help='Adjust LayerNorm weights such that they are centered ' 'around zero. 
This improves numerical stability.') - group.add_argument('--disable-mem-efficient-ln', action='store_false', + group.add_argument('--disable-mem-efficient-ln', action='store_false', help='Disable the memory-efficient fused LayerNorm optimization ' 'introduced in https://github.com/NVIDIA/apex/pull/1715', dest='mem_efficient_ln') group.add_argument('--apply-residual-connection-post-layernorm', @@ -672,6 +748,16 @@ def _add_network_size_args(parser): help='Untie embeddings and output weights.'), group.add_argument('--embedding-weights-in-fp32', action='store_true', help='Cast word embedding weights to fp32 before embedding fwd.'), + group.add_argument('--fix-position-emb-redundant-alloc', action='store_true', + help='If true, will not allocate position embeddings at ' + 'the embed object that is used to generate logits.') + group.add_argument('--embed-layernorm', action='store_true', + help='use layernorm for embedding') + group.add_argument('--kill-switch-path', type=str, default=None, + help='Path to look for a kill switch. ' + 'If found will automatically exit the program.') + group.add_argument('--use-alibi-position-embeddings', action='store_true', + help='Use ALiBI positional embeddings or not') return parser @@ -773,6 +859,8 @@ def _add_regularization_args(parser): 'numerical stability') group.add_argument('--sgd-momentum', type=float, default=0.9, help='Momentum factor for sgd') + group.add_argument('--do-norm-bias-weight-decay', action='store_true', + help='Enable Weight Decay for LayerNorm/Norm (weight and bias) and all Bias Parameters') return parser @@ -784,6 +872,9 @@ def _add_training_args(parser): help='Batch size per model instance (local batch size). ' 'Global batch size is local batch size times data ' 'parallel size times number of micro batches.') + group.add_argument('--eval-micro-batch-size', type=int, default=None, + help='Batch size per model instance (local batch size) for evaluation. ' + 'If not defined, using --micro-batch-size value instead') group.add_argument('--batch-size', type=int, default=None, help='Old batch size parameter, do not use. ' 'Use --micro-batch-size instead') @@ -902,6 +993,34 @@ def _add_training_args(parser): group.add_argument('--disable-moe-top2-2nd-expert-sampling', action='store_false', help='Disable MoE top2 sampling of the 2nd expert. Instead of sampling, use argmax.', dest='moe_top2_2nd_expert_sampling') + group.add_argument('--moe-num-capacity-bins', type=int, default=0, + help='Number of MoE capacity bins to for reducing dynamic tensor shapes; ' + '0 = bins not used.') + group.add_argument('--moe-capacity-bins', + nargs="+", + action="extend", + type=lambda x: x.split(","), + default=None, + help='Initial configured capacity bin edges for MoE. ' + 'For --moe-capacity-bins-optimize-interval = 0, initial configured bins ' + 'are not changed. Configuration of bins is done separately per each value ' + 'of num-experts. For example, lets assume we have MoE layers with ' + '2 experts and MoE layers with 4 experts. ' + 'Setting "--moe-capacity-bins 2,100,200,300 4,100,300,600" will then ' + 'configure layers bins = [100, 200, 300] for 2 experts and ' + 'layers bins = [100, 300, 600] for 4 experts') + group.add_argument('--moe-capacity-bins-exp-base', type=float, default=2.0, + help='Exponential base for calculation of capacity bins. 
' + 'Used when --moe-num-capacity-bins > 0') + group.add_argument('--moe-capacity-bins-alignment', type=int, default=1, + help='Capacity bins required alignment; 1 = no alignment required.') + group.add_argument('--moe-capacity-bins-optimize-interval', type=int, default=0, + help='Interval for auto-optimization of MoE capacity bins (if used); ' + '0 = no auto-optimization.') + group.add_argument('--moe-capacity-bins-optimize-max-group', type=int, default=1, + help='Maximum group size of adjacent MoE gates that their capacity bins ' + 'are optimized jointly. For max=1, each MoE gate optimizes its own ' + 'capacity bins.') group.add_argument('--use-flash-attn', '--use-flash-attn-v1', dest='use_flash_attn_v1', action='store_true', help='use first version FlashAttention implementation of attention. ' 'https://arxiv.org/abs/2205.14135') @@ -912,11 +1031,21 @@ def _add_training_args(parser): help='use FlashAttention implementation of attention using Triton.') group.add_argument('--use-flash-attn-builder', action='store_true', help='use FlashAttention op builder.') + group.add_argument('--use-fused-sdpa', + type=lambda x: x.lower() in ['true', '1'], + default=True, + help='Enable Fused Scaled Dot Product Attention.') + group.add_argument('--use-fused-sdpa-with-recompute', + type=lambda x: x.lower() in ['true', '1'], + default=False, + help='Enable Fused Scaled Dot Product Attention with recompute feature.') + group.add_argument('--use-fast-softmax', action='store_true', + help='Enable fast softmax in flash self attention') group.add_argument('--disable-bias-linear', action='store_false', help='Disable bias in the linear layers', dest='add_bias_linear') group.add_argument('--optimizer', type=str, default='adam', - choices=['adam', 'sgd'], + choices=['adam', 'sgd', 'adamw', 'fusedadamw'], help='Optimizer function') group.add_argument('--dataloader-type', type=str, default=None, choices=['single', 'cyclic'], @@ -977,6 +1106,10 @@ def _add_initialization_args(parser): 'distribution used for weight initialization.') group.add_argument('--init-method-xavier-uniform', action='store_true', help='Enable Xavier uniform parameter initialization') + group.add_argument('--no-scaled-init', action='store_true', + help='No scaled initialization with number of ' + 'layers and have same init method for all the model ' + 'parameters') return parser @@ -1072,6 +1205,12 @@ def _add_checkpointing_args(parser): "initialization.") group.add_argument('--universal-checkpoint', action='store_true', help='Loading a universal format checkpoint.') + group.add_argument('--verify-checkpoint', action='store_true', + help='Run verification on saved checkpoint.') + group.add_argument("--verify-checkpoint-model-type", default='GPT', type=str, + help='Model family type, used for checkpoint verification only.', + choices=['GPT', 'BLOOM', 'LLAMA']) + return parser @@ -1196,6 +1335,8 @@ def _add_validation_args(parser): group.add_argument('--skip-train', action='store_true', default=False, help='If set, bypass the training loop, ' 'optionally do evaluation for validation/test, and exit.') + group.add_argument('--eval-loss-exit-value', type=float, default=None, + help='Eval loss value below which the training will exit') return parser @@ -1278,6 +1419,8 @@ def _add_data_args(parser): help='What type of tokenizer to use.') group.add_argument('--tokenizer-model', type=str, default=None, help='Sentencepiece tokenizer model.') + group.add_argument('--trust-remote-code', action='store_true', default=False, + help='to run HFTokenizer model 
from local path.') group.add_argument('--data-impl', type=str, default='infer', choices=['mmap', 'infer'], help='Implementation of indexed datasets.') @@ -1307,6 +1450,14 @@ def _add_data_args(parser): help='Force to use certain index file.') group.add_argument('--repeated-dataloader', action='store_true', help='Once all the data has been loaded, reuse the DataLoader.') + group.add_argument('--mask-tensor-adding', action='store_true', + help='Perform attention masking by adding tensor instead of doing fill') + group.add_argument('--no-seq-len-plus-one-tokens', + action='store_false', help='If set, dont get ' + 'sequence length plus one tokens for training', + dest='use_seq_len_plus_one_tokens') + group.add_argument('--disable-doc-shuffling', action='store_true', + help='If set, documents wont be shuffled before traning') return parser @@ -1531,3 +1682,79 @@ def _add_distillation_args(parser): help='Directory containing a teacher model checkpoint.') return parser + + +def _add_profiler_args(parser): + group = parser.add_argument_group(title='profiling configuration') + + group.add_argument("--profile", + type=str, + default=None, + choices=['pt', 'pt-full', 'hltv'], + help="Enable profiling") + + group.add_argument("--profile-steps", + type=str, + default='3,4', + help="Which steps to profile." + "Format: ,") + + return parser + +def _add_tensor_logger_args(parser): + group = parser.add_argument_group(title='tensor_logger \ + logging configuration') + + group.add_argument("--log-model-inputs", action="store_true", + help="If set, log model's inputs for configured" + "iterations") + + group.add_argument("--log-fwd-activations", action="store_true", + help="If set, log model's nn.Module forward activations" + "for configured iterations") + + group.add_argument("--log-bwd-grads", action="store_true", + help="If set, log model's nn.Module backward gradients" + "for configured iterations") + + group.add_argument("--tensor-logger-start-iter", type=int, default=0, + help="Set the starting number of iteration to capture." + "If 1 - from beginning, 0 - disable tensor logger") + + group.add_argument("--tensor-logger-end-iter", type=int, default=0, + help="Set the ending number of iteration to capture." 
+ "If 0 - disable tensor logger") + + group.add_argument("--tensor-logger-path", type=str, default=None, + help="Path for saving tensor logger captured" + "tensors file") + + return parser + +def _add_pytorch_args(parser): + group = parser.add_argument_group(title='pytorch') + + group.add_argument('--use-torch-compile', + type=lambda x: x.lower() in ['true', '1'], + default=False, + help='Enable model compilation using torch.compile') + + return parser + +def _add_debug_args(parser): + group = parser.add_argument_group(title='Debug configuration') + + group.add_argument("--check-tp-norm", action="store_true", + help="set this to check norm layer fwd/bwd input and" + "output in tensor parallel(tp) group") + group.add_argument('--start-check-tp-norm-iter', type=int, default=-1, + help='start iteration to check tp norm') + group.add_argument('--end-check-tp-norm-iter', type=int, default=-1, + help='end iteration to check tp norm') + group.add_argument("--check-tp-norm-type", default='all', + choices=['all', 'io', 'wb'], + help="when check tp norm is enabled, this controls the " + "scope of checking to all=weight,bias,input and output," + " io=input and output, wb=weight and bias") + + return parser diff --git a/megatron/checkpointing.py b/megatron/checkpointing.py index 3e4d20035e..116b1df04a 100644 --- a/megatron/checkpointing.py +++ b/megatron/checkpointing.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. """Input/output checkpointing.""" @@ -12,6 +13,7 @@ from megatron import update_num_microbatches, get_tokenizer from megatron.core import mpu, tensor_parallel from .global_vars import get_args +from tools import verify_checkpoint from .utils import (unwrap_model, print_rank_0, is_rank_0) @@ -67,7 +69,8 @@ def _compare(arg_name, old_arg_name=None, default=None): _compare('num_key_value_heads') _compare('add_position_embedding', default=True) if args.vocab_file: - _compare('max_position_embeddings') + if not args.use_alibi_position_embeddings: + _compare('max_position_embeddings') if not args.universal_checkpoint: _compare('make_vocab_size_divisible_by') _compare('padded_vocab_size') @@ -314,6 +317,18 @@ def state_dict_for_save_checkpoint_deepspeed(destination=None, prefix='', keep_v # And update the latest iteration if is_rank_0(): + if args.verify_checkpoint: + ckpt_folder = os.path.join(args.save, f"global_step{iteration}") + prev_iter = iteration - args.save_interval + ckpt_ok = verify_checkpoint(ckpt_folder, + args.verify_checkpoint_model_type) + if not ckpt_ok: + # Fix latest file to previous valid ckpt + with open(os.path.join(args.save, 'latest'), 'w') as fd: + fd.write(f"global_step{prev_iter}") + raise RuntimeError(f"verify_checkpoint failed!!! 
{ckpt_folder}") + else: + print_rank_0(f"successfully passed ckpt validation: {ckpt_folder}") tracker_filename = get_checkpoint_tracker_filename(args.save) with open(tracker_filename, 'w') as f: f.write(str(iteration)) @@ -511,6 +526,7 @@ def _set_arg(arg_name, old_arg_name=None, force=False): _set_arg('max_position_embeddings') _set_arg('add_position_embedding', force=True) _set_arg('use_rotary_position_embeddings', force=True) + _set_arg('use_alibi_position_embeddings', force=True) _set_arg('rotary_percent', force=True) _set_arg('add_bias_linear', force=True) _set_arg('swiglu', force=True) diff --git a/megatron/core/requirements.txt b/megatron/core/requirements.txt index 732feb812c..873290cc58 100644 --- a/megatron/core/requirements.txt +++ b/megatron/core/requirements.txt @@ -1,3 +1,9 @@ pybind11 torch -regex \ No newline at end of file +regex +einops +datasets +sentencepiece +transformers +nltk +git+https://github.com/EleutherAI/lm-evaluation-harness.git@master diff --git a/megatron/core/tensor_parallel/data.py b/megatron/core/tensor_parallel/data.py index 4de1a8b65b..c32fbbf452 100644 --- a/megatron/core/tensor_parallel/data.py +++ b/megatron/core/tensor_parallel/data.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. import torch @@ -14,6 +15,7 @@ from deepspeed.accelerator import get_accelerator _MAX_DATA_DIM = 5 +_GLOBAL_CACHED_BROADCAST_SIZES = [] def _check_data_types(keys, data, target_dtype): @@ -23,6 +25,11 @@ def _check_data_types(keys, data, target_dtype): 'is different than {}'.format(key, data[key].dtype, target_dtype) +def reset_cached_broadcast_sizes(): + global _GLOBAL_CACHED_BROADCAST_SIZES + _GLOBAL_CACHED_BROADCAST_SIZES = [] + + def _build_key_size_numel_dictionaries(keys, data, group=None, rank=-1, src_rank=-1): if group is None: group = get_tensor_model_parallel_group() @@ -45,12 +52,18 @@ def _build_key_size_numel_dictionaries(keys, data, group=None, rank=-1, src_rank sizes[i + offset] = s offset += max_dim - # Move to GPU and broadcast. - sizes_cuda = get_accelerator().LongTensor(sizes) - torch.distributed.broadcast(sizes_cuda, src_rank, group=group) + global _GLOBAL_CACHED_BROADCAST_SIZES + if not _GLOBAL_CACHED_BROADCAST_SIZES: + # Move to GPU and broadcast. + sizes_cuda = get_accelerator().LongTensor(sizes) + torch.distributed.broadcast(sizes_cuda, src_rank, group=group) + # Move back to cpu and unpack. + sizes_cpu = sizes_cuda.cpu() + sizes_cpu = sizes_cpu.tolist() + _GLOBAL_CACHED_BROADCAST_SIZES = sizes_cpu + else: + sizes_cpu = _GLOBAL_CACHED_BROADCAST_SIZES - # Move back to cpu and unpack. - sizes_cpu = sizes_cuda.cpu() key_size = {} key_numel = {} total_numel = 0 diff --git a/megatron/core/tensor_parallel/layers.py b/megatron/core/tensor_parallel/layers.py index 020d25915a..e2064c4433 100644 --- a/megatron/core/tensor_parallel/layers.py +++ b/megatron/core/tensor_parallel/layers.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
# Parts of the code here are adapted from PyTorch @@ -23,7 +24,10 @@ get_tensor_model_parallel_world_size, get_tensor_model_parallel_group, get_global_memory_buffer, + is_pipeline_first_stage, ) +from megatron import get_args +from megatron.global_vars import get_num_microbatches from .mappings import ( copy_to_tensor_model_parallel_region, gather_from_tensor_model_parallel_region, @@ -49,6 +53,15 @@ except ImportError: _grad_accum_fusion_available = False +try: + import habana_frameworks.torch.hpex.experimental.transformer_engine as te +except ImportError: + if get_accelerator().device_name() == 'hpu' and get_args().transformer_impl == "transformer_engine": + raise RuntimeError( + "Device name is hpu and transformer implementation is transformer_engine" + "but couldn't import habana_frameworks.torch.hpex.experimental.transformer_engine" + ) + _MODEL_PARALLEL_ATTRIBUTE_DEFAULTS = {'tensor_model_parallel': False, 'partition_dim': -1, 'partition_stride': 1} @@ -136,6 +149,41 @@ def _initialize_affine_weight_cpu(weight, output_size, input_size, return None +# This class encapsulates the behavior related to two mechanisms: hpu graph and amax measuring interval +class FP8ModuleRunner(): + def __init__(self, module, measure_interval: int=1, cache_fp8_weight_fwd=False): + self.module = module + self.measure_interval = measure_interval + self.cache_fp8_weight_fwd = cache_fp8_weight_fwd + self.run_cnt = 0 + self.in_activation_recompute_phase = None + + def _is_first_microbatch(self): + if not self.cache_fp8_weight_fwd: + return None + + return self.run_cnt % get_num_microbatches() in [1,2] + + def __call__(self, input, weight, bias=None): + if te.distributed.is_fp8_activation_recompute_enabled(): + if not torch.is_grad_enabled(): + # grad disabled - first non-recompute phase + self.in_activation_recompute_phase = False + elif self.in_activation_recompute_phase == False: + # grad enabled after being disabled - second recompute phase + self.in_activation_recompute_phase = True + + if not self.in_activation_recompute_phase: + self.run_cnt += 1 + + measure = self.measure_interval == 1 or self.run_cnt % self.measure_interval == 1 + te.fp8.set_measurement_mode(manual=True, manual_value=measure) + + is_first_microbatch = self._is_first_microbatch() + + return self.module(input, weight, bias, is_first_microbatch=is_first_microbatch) + + class VocabParallelEmbedding(torch.nn.Module): """Embedding parallelized in the vocabulary dimension. @@ -172,6 +220,15 @@ def __init__(self, num_embeddings: int, embedding_dim: int, *, self.num_embeddings_per_partition = self.vocab_end_index - \ self.vocab_start_index + # Allocate weights and initialize. + args = get_args() + # only the first stage embedding runs this class' forward. The head's embedding does its own + # thing, so don't waste memory allocating LN weights. + self.layer_norm = None + if is_pipeline_first_stage() and args.embed_layernorm: + from megatron.model import LayerNorm + self.layer_norm = LayerNorm(embedding_dim, sequence_parallel=config.sequence_parallel) + # Allocate weights and initialize. if config.use_cpu_initialization: self.weight = Parameter(torch.empty( @@ -210,6 +267,10 @@ def forward(self, input_): output_parallel[input_mask, :] = 0.0 # Reduce across all the model parallel GPUs. 
output = reduce_from_tensor_model_parallel_region(output_parallel) + + if self.layer_norm is not None: + output = self.layer_norm(output) + return output @@ -241,6 +302,9 @@ class LinearWithGradAccumulationAndAsyncCommunication(torch.autograd.Function): @custom_fwd def forward(ctx, input, weight, bias, gradient_accumulation_fusion, async_grad_allreduce, sequence_parallel): + # sequence parallel will cause all_gather which requires contiguous tensors + input = input.contiguous() if sequence_parallel else input + ctx.save_for_backward(input, weight) ctx.use_bias = bias is not None ctx.gradient_accumulation_fusion = gradient_accumulation_fusion @@ -270,9 +334,10 @@ def forward(ctx, input, weight, bias, gradient_accumulation_fusion, else: total_input = input - output = torch.matmul(total_input, weight.t()) - if bias is not None: - output = output + bias + # output = torch.matmul(total_input, weight.t()) + # if bias is not None: + # output = output + bias + output = F.linear(total_input, weight, bias) return output @staticmethod @@ -431,6 +496,8 @@ def linear_with_grad_accumulation_and_async_allreduce( all gathered, and the backward pass the input gradients are reduce scattered. """ + if not sequence_parallel: + return F.linear(input, weight, bias) args = [ input, weight, @@ -441,7 +508,7 @@ def linear_with_grad_accumulation_and_async_allreduce( ] if not linear_with_grad_accumulation_and_async_allreduce.warned: - if os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1": + if get_accelerator().device_name() == "cuda" and os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1": if sequence_parallel: warnings.warn( "When using sequence parallelism it is recommended to set the " @@ -520,6 +587,8 @@ def __init__(self, input_size, output_size, *, self.skip_bias_add = skip_bias_add self.config = config + args = get_args() + # Parameters. # Note: torch.nn.functional.linear performs XA^T + b and as a result # we allocate the transpose. @@ -591,6 +660,16 @@ def __init__(self, input_size, output_size, *, "cannot be enabled at the same time." ) + self.output_parallel_linear = F.linear + if self.training and args.transformer_impl == "transformer_engine" \ + and get_accelerator().device_name() == "hpu": + linear_fp8 = te.Linear( + self.input_size, + self.output_size_per_partition, + skip_weight_param_allocation=True, + bias=bias, + minimize_memory=not args.cache_fp8_weight) + self.output_parallel_linear = FP8ModuleRunner(linear_fp8, args.fp8_interval, args.cache_fp8_weight_fwd) def forward(self, input_: torch.Tensor, @@ -608,6 +687,8 @@ def forward(self, - bias """ + args = get_args() + if weight is None: if self.weight is None: raise RuntimeError("weight was not supplied to ColumnParallelLinear forward pass " @@ -629,14 +710,20 @@ def forward(self, else: input_parallel = copy_to_tensor_model_parallel_region(input_) # Matrix multiply. 
- output_parallel = linear_with_grad_accumulation_and_async_allreduce( + if args.transformer_impl == "transformer_engine" and get_accelerator().device_name() == 'hpu': + gather_input = lambda x: x + if self.sequence_parallel: + gather_input = gather_from_sequence_parallel_region + output_parallel = self.output_parallel_linear(gather_input(input_parallel), self.weight, self.bias) + else: + output_parallel = linear_with_grad_accumulation_and_async_allreduce( input=input_parallel, weight=weight, bias=bias, gradient_accumulation_fusion=self.gradient_accumulation_fusion, async_grad_allreduce=self.async_tensor_model_parallel_allreduce, sequence_parallel=self.sequence_parallel - ) + ) if self.gather_output and not self.is_expert_without_slicing: # All-gather across the partitions. assert not self.sequence_parallel @@ -711,6 +798,8 @@ def __init__(self, input_size: int, output_size: int, *, if self.sequence_parallel and not self.input_is_parallel: raise RuntimeError("To enable `sequence_parallel`, `input_is_parallel` must be `True`") + args = get_args() + # Parameters. # Note: torch.nn.functional.linear performs XA^T + b and as a result # we allocate the transpose. @@ -749,6 +838,15 @@ def __init__(self, input_size: int, output_size: int, *, else: self.register_parameter('bias', None) + self.output_parallel_linear = F.linear + if self.training and args.transformer_impl == "transformer_engine" and get_accelerator().device_name() == 'hpu': + linear_fp8 = te.Linear( + self.input_size_per_partition, + self.output_size, + skip_weight_param_allocation=True, + bias=bias, + minimize_memory=not args.cache_fp8_weight) + self.output_parallel_linear = FP8ModuleRunner(linear_fp8, args.fp8_interval, args.cache_fp8_weight_fwd) def forward(self, input_): @@ -768,14 +866,17 @@ def forward(self, input_): assert not self.sequence_parallel input_parallel = scatter_to_tensor_model_parallel_region(input_) # Matrix multiply. - output_parallel = linear_with_grad_accumulation_and_async_allreduce( - input=input_parallel, - weight=self.weight, - bias=None, - gradient_accumulation_fusion=self.gradient_accumulation_fusion, - async_grad_allreduce=False, - sequence_parallel=False, - ) + if get_args().transformer_impl == "transformer_engine" and get_accelerator().device_name() == 'hpu': + output_parallel = self.output_parallel_linear(input_parallel, self.weight, self.bias) + else: + output_parallel = linear_with_grad_accumulation_and_async_allreduce( + input=input_parallel, + weight=self.weight, + bias=None, + gradient_accumulation_fusion=self.gradient_accumulation_fusion, + async_grad_allreduce=False, + sequence_parallel=False, + ) # All-reduce across all the partitions. if self.sequence_parallel: diff --git a/megatron/core/tensor_parallel/mappings.py b/megatron/core/tensor_parallel/mappings.py index ae8d63ab2c..ed01e691ee 100644 --- a/megatron/core/tensor_parallel/mappings.py +++ b/megatron/core/tensor_parallel/mappings.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. import torch @@ -11,6 +12,13 @@ from deepspeed.accelerator import get_accelerator +def get_async_op(): + async_op = False + if get_accelerator().device_name() == "hpu": + async_op = True + return async_op + + def _reduce(input_): """All-reduce the input tensor across model parallel group.""" @@ -19,7 +27,7 @@ def _reduce(input_): return input_ # All-reduce. 
- torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group()) + torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group(), async_op=get_async_op()) return input_ @@ -79,7 +87,7 @@ def _gather_along_last_dim(input_): tensor_list = [torch.empty_like(input_) for _ in range(world_size)] tensor_list[rank] = input_ - torch.distributed.all_gather(tensor_list, input_, group=get_tensor_model_parallel_group()) + torch.distributed.all_gather(tensor_list, input_, group=get_tensor_model_parallel_group(), async_op=get_async_op()) # Note: torch.cat already creates a contiguous tensor. output = torch.cat(tensor_list, dim=last_dim).contiguous() @@ -101,7 +109,8 @@ def _gather_along_first_dim(input_): output = torch.empty(dim_size, dtype=input_.dtype, device=get_accelerator().current_device_name()) torch.distributed._all_gather_base(output, input_.contiguous(), - group=get_tensor_model_parallel_group()) + group=get_tensor_model_parallel_group(), + async_op=get_async_op()) return output @@ -121,7 +130,8 @@ def _reduce_scatter_along_first_dim(input_): output = torch.empty(dim_size, dtype=input_.dtype, device=get_accelerator().current_device_name()) torch.distributed._reduce_scatter_base(output, input_.contiguous(), - group=get_tensor_model_parallel_group()) + group=get_tensor_model_parallel_group(), + async_op=get_async_op()) return output diff --git a/megatron/core/transformer/attention.py b/megatron/core/transformer/attention.py index 15818bddf1..ecd3f59dd8 100644 --- a/megatron/core/transformer/attention.py +++ b/megatron/core/transformer/attention.py @@ -1,6 +1,8 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. from abc import ABC, abstractmethod +from deepspeed.accelerator import get_accelerator from .enums import AttnMaskType from .transformer_config import TransformerConfig import torch @@ -13,7 +15,11 @@ from megatron.core.transformer.enums import AttnType, AttnMaskType from megatron.core.transformer.transformer_config import TransformerConfig from megatron.core.transformer.custom_layers.transformer_engine import \ - TECoreAttention, TEColumnParallelLinear, TERowParallelLinear + TEColumnParallelLinear, TERowParallelLinear + +if get_accelerator().device_name() == "cuda": + from megatron.core.transformer.custom_layers.transformer_engine import \ + TECoreAttention class Attention(MegatronModule, ABC): """Attention layer abstract class. diff --git a/megatron/core/transformer/custom_layers/transformer_engine.py b/megatron/core/transformer/custom_layers/transformer_engine.py index 8d5c6aa15c..664c3b9295 100644 --- a/megatron/core/transformer/custom_layers/transformer_engine.py +++ b/megatron/core/transformer/custom_layers/transformer_engine.py @@ -1,28 +1,53 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + import torch -import transformer_engine as te from typing import Callable +from deepspeed.accelerator import get_accelerator + from megatron.core.transformer.transformer_config import TransformerConfig from megatron.core.transformer.enums import AttnMaskType from megatron.core.parallel_state import get_tensor_model_parallel_group from megatron.core.tensor_parallel import get_cuda_rng_tracker -class TELayerNorm(te.pytorch.module.LayerNorm): - """ - Wrapper for the Transformer-Engine's `LayerNorm`. 
- """ - def __init__(self, - hidden_size: int, - eps: float = 1e-5, - sequence_parallel: bool = False, - **kwargs): - super().__init__( - hidden_size=hidden_size, - eps=eps, - sequence_parallel=sequence_parallel + +cuda_available = False +hpu_available = False +if get_accelerator().device_name() == "cuda": + cuda_available = True +elif get_accelerator().device_name() == "hpu": + hpu_available = True + + +if cuda_available: + import transformer_engine as te +elif hpu_available: + import habana_frameworks.torch.hpex.experimental.transformer_engine as te + + +if cuda_available: + class TELayerNorm(te.pytorch.module.LayerNorm): + """ + Wrapper for the Transformer-Engine's `LayerNorm`. + """ + def __init__(self, + hidden_size: int, + eps: float = 1e-5, + sequence_parallel: bool = False, + **kwargs): + super().__init__( + hidden_size=hidden_size, + eps=eps, + sequence_parallel=sequence_parallel ) -class TELinear(te.pytorch.module.Linear): +telinear = None +if cuda_available: + telinear = te.pytorch.module.Linear +elif hpu_available: + telinear = te.Linear + +class TELinear(telinear): """ Wrapper for the Transformer-Engine's `Linear` layer. @@ -47,12 +72,13 @@ def __init__(self, # ourselves. This way our forward always returns two values # and we don't have to deal with the zero length Tensor. self.te_return_bias = skip_bias_add and bias + if cuda_available: + kwargs["fuse_wgrad_accumulation"] = self.config.gradient_accumulation_fusion super().__init__( in_features=input_size, out_features=output_size, sequence_parallel=self.config.sequence_parallel, - fuse_wgrad_accumulation=self.config.gradient_accumulation_fusion, tp_group=get_tensor_model_parallel_group(check_initialized=False), tp_size=self.config.tensor_model_parallel_size, get_rng_state_tracker=get_cuda_rng_tracker, @@ -112,30 +138,31 @@ def __init__(self, **kwargs ) -class TECoreAttention(te.pytorch.transformer.DotProductAttention): - """ - Wrapper for the Transformer-Engine's `DotProductAttention` layer that also - has "flash attention" enabled. +if cuda_available: + class TECoreAttention(te.pytorch.transformer.DotProductAttention): + """ + Wrapper for the Transformer-Engine's `DotProductAttention` layer that also + has "flash attention" enabled. - Note that if Megatron's parallel_state has not been initialized - yet, the tp_group passed to TE will be None and must be set later - via set_tensor_parallel_group(). - """ - def __init__(self, - config: TransformerConfig, - layer_number: int = 1, - attn_mask_type: AttnMaskType = AttnMaskType.padding, - **kwargs): - self.config = config - super().__init__( - num_attention_heads=self.config.num_attention_heads, - kv_channels=self.config.kv_channels, - attention_dropout=self.config.attention_dropout, - layer_number=layer_number, - attn_mask_type=attn_mask_type.name, - sequence_parallel=self.config.sequence_parallel, - tp_size=self.config.tensor_model_parallel_size, - get_rng_state_tracker=get_cuda_rng_tracker, - tp_group=get_tensor_model_parallel_group(check_initialized=False), - **kwargs - ) + Note that if Megatron's parallel_state has not been initialized + yet, the tp_group passed to TE will be None and must be set later + via set_tensor_parallel_group(). 
+ """ + def __init__(self, + config: TransformerConfig, + layer_number: int = 1, + attn_mask_type: AttnMaskType = AttnMaskType.padding, + **kwargs): + self.config = config + super().__init__( + num_attention_heads=self.config.num_attention_heads, + kv_channels=self.config.kv_channels, + attention_dropout=self.config.attention_dropout, + layer_number=layer_number, + attn_mask_type=attn_mask_type.name, + sequence_parallel=self.config.sequence_parallel, + tp_size=self.config.tensor_model_parallel_size, + get_rng_state_tracker=get_cuda_rng_tracker, + tp_group=get_tensor_model_parallel_group(check_initialized=False), + **kwargs + ) diff --git a/megatron/core/transformer/transformer_config.py b/megatron/core/transformer/transformer_config.py index de21046f02..ae60921d83 100644 --- a/megatron/core/transformer/transformer_config.py +++ b/megatron/core/transformer/transformer_config.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. from dataclasses import dataclass @@ -75,6 +76,8 @@ class TransformerConfig(ModelParallelConfig): This kernel only supports a fixed set of hidden sizes. Defaults to False. bias_dropout_fusion (bool): If true, uses bias dropout fusion. + use_fused_sdpa (bool): If true, use Fused Scaled Dot Product Attention. + use_fused_sdpa_with_recompute (bool): If true, use Fused Scaled Dot Product Attention with recompute feature. # activation recomputation @@ -137,6 +140,8 @@ class TransformerConfig(ModelParallelConfig): masked_softmax_fusion: bool = False persist_layer_norm: bool = False bias_dropout_fusion: bool = False # TODO: this should be bias_dropout_add_fusion? + use_fused_sdpa: bool = True + use_fused_sdpa_with_recompute: bool = False # activation recomputation recompute_granularity: str = None diff --git a/megatron/core/transformer/transformer_layer.py b/megatron/core/transformer/transformer_layer.py index 19804e4c60..081d6f7f06 100644 --- a/megatron/core/transformer/transformer_layer.py +++ b/megatron/core/transformer/transformer_layer.py @@ -1,5 +1,7 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +from deepspeed.accelerator import get_accelerator import torch from megatron.core.transformer.module import MegatronModule @@ -9,7 +11,9 @@ from megatron.core.transformer.attention import SelfAttention from megatron.core.transformer.mlp import MLP from megatron.core.utils import make_viewless_tensor -from megatron.core.transformer.custom_layers.transformer_engine import TELayerNorm + +if get_accelerator().device_name() == "cuda": + from megatron.core.transformer.custom_layers.transformer_engine import TELayerNorm class TransformerLayer(MegatronModule): """A single transformer layer. diff --git a/megatron/data/biencoder_dataset_utils.py b/megatron/data/biencoder_dataset_utils.py index d4661cd34a..02d2c079a6 100644 --- a/megatron/data/biencoder_dataset_utils.py +++ b/megatron/data/biencoder_dataset_utils.py @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+ import os import time @@ -25,6 +27,8 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None): """Specifically one epoch to be used in an indexing job.""" args = get_args() + assert args.micro_batch_size == args.eval_micro_batch_size, \ + "get_one_epoch_dataloader (biencoder) - Unsupported for split micro batch size" if micro_batch_size is None: micro_batch_size = args.micro_batch_size num_workers = args.num_workers @@ -36,9 +40,10 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None): batch_sampler = MegatronPretrainingSampler( total_samples=len(dataset), consumed_samples=0, - micro_batch_size=args.micro_batch_size, + micro_batch_size=micro_batch_size, data_parallel_rank=mpu.get_data_parallel_rank(), data_parallel_size=mpu.get_data_parallel_world_size(), + is_train=False, drop_last=False) return torch.utils.data.DataLoader(dataset, diff --git a/megatron/data/blendable_dataset.py b/megatron/data/blendable_dataset.py index 2516e58415..b53b40ee79 100644 --- a/megatron/data/blendable_dataset.py +++ b/megatron/data/blendable_dataset.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Blendable dataset.""" @@ -25,6 +26,11 @@ def __init__(self, datasets, weights, size, *, self.size = size + if size == -1: + self.size = 0 + for dataset in self.datasets: + self.size += len(dataset) + # Normalize weights. weights = np.array(weights, dtype=np.float64) sum_weights = np.sum(weights) diff --git a/megatron/data/data_samplers.py b/megatron/data/data_samplers.py index 2d7da67e15..627e2cbeed 100644 --- a/megatron/data/data_samplers.py +++ b/megatron/data/data_samplers.py @@ -1,38 +1,50 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
"""Dataloaders.""" +from itertools import chain import random import torch import numpy as np from torch.utils.data import Dataset -from megatron import get_args +from megatron import get_args, get_num_microbatches_by_mode from megatron.core import mpu from deepspeed.runtime.dataloader import RepeatingLoader -def build_pretraining_data_loader(dataset, consumed_samples): +def build_pretraining_data_loader(dataset, consumed_samples, is_train, use_all_samples=False): """Build dataloader given an input dataset.""" if dataset is None: return None args = get_args() + assert not use_all_samples or args.dataloader_type == 'single', \ + 'consuming whole dataset supported only for "single" dataloader type' + + if is_train: + micro_batch_size=args.micro_batch_size + else: + micro_batch_size=args.eval_micro_batch_size # Megatron sampler if args.dataloader_type == 'single': batch_sampler = MegatronPretrainingSampler( total_samples=len(dataset), consumed_samples=consumed_samples, - micro_batch_size=args.micro_batch_size, + micro_batch_size=micro_batch_size, data_parallel_rank=mpu.get_data_parallel_rank(), - data_parallel_size=mpu.get_data_parallel_world_size()) + data_parallel_size=mpu.get_data_parallel_world_size(), + is_train=is_train, + drop_last=not use_all_samples, + pad_negative_indices=use_all_samples) elif args.dataloader_type == 'cyclic': batch_sampler = MegatronPretrainingRandomSampler( dataset, total_samples=len(dataset), consumed_samples=consumed_samples, - micro_batch_size=args.micro_batch_size, + micro_batch_size=micro_batch_size, data_parallel_rank=mpu.get_data_parallel_rank(), data_parallel_size=mpu.get_data_parallel_world_size(), data_sharding=args.data_sharding) @@ -52,7 +64,8 @@ def build_pretraining_data_loader(dataset, consumed_samples): class MegatronPretrainingSampler: def __init__(self, total_samples, consumed_samples, micro_batch_size, - data_parallel_rank, data_parallel_size, drop_last=True): + data_parallel_rank, data_parallel_size, is_train, drop_last=True, + pad_negative_indices=False): # Keep a copy of input params for later use. self.total_samples = total_samples self.consumed_samples = consumed_samples @@ -61,6 +74,10 @@ def __init__(self, total_samples, consumed_samples, micro_batch_size, self.micro_batch_times_data_parallel_size = \ self.micro_batch_size * data_parallel_size self.drop_last = drop_last + self.global_batch_size = (self.micro_batch_times_data_parallel_size + * get_num_microbatches_by_mode(is_train)) + self.pad_negative_indices = pad_negative_indices + self.is_train = is_train # Sanity checks. assert self.total_samples > 0, \ @@ -85,7 +102,22 @@ def get_start_end_idx(self): def __iter__(self): batch = [] # Last batch will be dropped if drop_last is not set False - for idx in range(self.consumed_samples, self.total_samples): + indices = range(self.consumed_samples, self.total_samples) + if (not self.drop_last) and self.pad_negative_indices: + # TODO: this approach (padding to global_batch_size) is not optimal + # since many batches could be empty (only padding) on all devices. + # This should be fixed by creating a microbatches calculator + # that can be instructed (e.g. with `update_num_microbatches`) to + # use fewer num_microbatches in the last valid iteration. + # The code here will not change except for replacing + # `self.global_batch_size` with + # `self.micro_batch_times_data_parallel_size`; already done for eval.
+ remainder = self.global_batch_size if self.is_train else self.micro_batch_times_data_parallel_size + pad_samples_num = -len(indices) % remainder + pad_indices = range(-1, -pad_samples_num - 1, -1) + indices = chain(indices, pad_indices) + + for idx in indices: batch.append(idx) if len(batch) == self.micro_batch_times_data_parallel_size: start_idx, end_idx = self.get_start_end_idx() @@ -94,6 +126,8 @@ def __iter__(self): # Check the last partial batch and see drop_last is set if len(batch) > 0 and not self.drop_last: + assert not self.pad_negative_indices, \ + 'with pad_negative_indices all batches should be complete' start_idx, end_idx = self.get_start_end_idx() yield batch[start_idx:end_idx] diff --git a/megatron/data/gpt_dataset.py b/megatron/data/gpt_dataset.py index 1d9b7e1c1d..eb14f45610 100644 --- a/megatron/data/gpt_dataset.py +++ b/megatron/data/gpt_dataset.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """GPT style dataset.""" @@ -24,7 +25,8 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, valid_data_prefix=None, test_data_prefix=None, return_doc_ids=False, *, - data_cache_path=None): + data_cache_path=None, + use_seq_len_plus_one_tokens=True): """Build train, valid, and test datasets.""" if data_prefix: @@ -36,7 +38,8 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, data_impl, splits_string, train_valid_test_num_samples, seq_length, seed, skip_warmup, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) # Blending dataset. # Parse the values. @@ -58,7 +61,8 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, datasets_train_valid_test_num_samples[i], seq_length, seed, skip_warmup, return_doc_ids, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) if train_ds: train_datasets.append(train_ds) if valid_ds: @@ -93,14 +97,16 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, splits_string, train_valid_test_num_samples[0], seq_length, seed, skip_warmup, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) if valid_data_prefix is not None: valid_dataset = build_dataset("valid", valid_data_prefix, data_impl, splits_string, train_valid_test_num_samples[1], seq_length, seed, False, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) if test_data_prefix is not None: @@ -108,7 +114,8 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string, splits_string, train_valid_test_num_samples[2], seq_length, seed, False, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) return (train_dataset, valid_dataset, test_dataset) @@ -117,7 +124,8 @@ def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string, train_valid_test_num_samples, seq_length, seed, skip_warmup, return_doc_ids=False, *, - data_cache_path=None): + data_cache_path=None, + use_seq_len_plus_one_tokens): """Build train, valid, and test datasets.""" # Indexed dataset. 
@@ -150,7 +158,8 @@ def build_dataset(index, name): train_valid_test_num_samples[index], seq_length, seed, return_doc_ids, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) return dataset train_dataset = build_dataset(0, 'train') @@ -164,13 +173,15 @@ def build_dataset(dataset_name, data_prefix, data_impl, splits_string, num_samples, seq_length, seed, skip_warmup, *, - data_cache_path=None): + data_cache_path=None, + use_seq_len_plus_one_tokens=True): dataset = None if len(data_prefix) == 1: dataset = _build_dataset(dataset_name, data_prefix[0], data_impl, splits_string, num_samples, seq_length, seed, skip_warmup, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) else: # Blending dataset. # Parse the values. @@ -184,7 +195,8 @@ def build_dataset(dataset_name, data_prefix, data_impl, ds = _build_dataset(dataset_name, prefixes[i], data_impl, splits_string, dataset_num_samples[i], seq_length, seed, skip_warmup, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, + use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) if ds: datasets.append(ds) @@ -198,7 +210,8 @@ def build_dataset(dataset_name, data_prefix, data_impl, def _build_dataset(dataset_name, data_prefix, data_impl, splits_string, num_samples, seq_length, seed, skip_warmup, *, - data_cache_path=None): + data_cache_path=None, + use_seq_len_plus_one_tokens=True): """ Build dataset. This method is called when individual train, valid, test datasets are provided @@ -220,7 +233,7 @@ def _build_dataset(dataset_name, data_prefix, data_impl, splits_string, dataset = GPTDataset(dataset_name, data_prefix, documents, indexed_dataset, splits_string, num_samples, seq_length, seed, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, use_seq_len_plus_one_tokens=use_seq_len_plus_one_tokens) return dataset @@ -246,11 +259,16 @@ class GPTDataset(torch.utils.data.Dataset): def __init__(self, name, data_prefix, documents, indexed_dataset, splits_string, num_samples, seq_length, seed, return_doc_ids=False, *, - data_cache_path=None): + data_cache_path=None, + use_seq_len_plus_one_tokens): self.name = name self.indexed_dataset = indexed_dataset self.return_doc_ids = return_doc_ids + self.seq_length = seq_length + self.add_extra_token = 0 + if use_seq_len_plus_one_tokens: + self.add_extra_token = 1 # Checks assert np.min(documents) >= 0 @@ -261,7 +279,7 @@ def __init__(self, name, data_prefix, documents, indexed_dataset, _build_index_mappings(self.name, data_prefix, documents, self.indexed_dataset.sizes, splits_string, num_samples, seq_length, seed, - data_cache_path=data_cache_path) + data_cache_path=data_cache_path, add_extra_token=self.add_extra_token) def __len__(self): @@ -271,6 +289,8 @@ def __len__(self): def __getitem__(self, idx): args = get_args() + dummy_sample = idx < 0 + idx = np.abs(idx) orig_idx = idx # Get the shuffled index. idx = self.shuffle_idx[idx] @@ -285,7 +305,7 @@ def __getitem__(self, idx): doc_ids.append(self.doc_idx[doc_index_f]) sample = self.indexed_dataset.get(self.doc_idx[doc_index_f], offset=offset_f, - length=offset_l - offset_f + 1) + length=offset_l - offset_f + self.add_extra_token) else: # Otherwise, get the rest of the initial document. 
doc_ids.append(self.doc_idx[doc_index_f]) @@ -299,7 +319,7 @@ def __getitem__(self, idx): doc_ids.append(self.doc_idx[doc_index_l]) sample_list.append(self.indexed_dataset.get( self.doc_idx[doc_index_l], - length=offset_l + 1)) + length=offset_l + self.add_extra_token)) sample = np.concatenate(sample_list) text_name = 'text' @@ -315,13 +335,26 @@ def __getitem__(self, idx): if args.use_dataset_only: sample_dict.update({'labels': np.array(sample, dtype=np.int64)}) - return sample_dict + if len(sample) != (self.seq_length + self.add_extra_token): + sample = np.array(sample, dtype=np.int64) + sample = np.pad(sample, (0, self.seq_length + self.add_extra_token - len(sample)), mode='constant', constant_values=-1) + if args.return_data_index: + return {'text': np.array(sample, dtype=np.int64), + 'index': np.array([orig_idx], dtype=np.int64)} + elif self.return_doc_ids: # for retro preprocessing + return {'text': np.array(sample, dtype=np.int64), + 'doc_ids': np.array(doc_ids, dtype=np.int64)} + else: + return {'text': np.array(sample, dtype=np.int64), + 'dummy_sample': np.array(int(dummy_sample), dtype=np.int64)} + + return sample_dict def _build_index_mappings(name, data_prefix, documents, sizes, splits_string, num_samples, seq_length, seed, *, - data_cache_path): + data_cache_path, add_extra_token): """Build doc-idx, sample-idx, and shuffle-idx. doc-idx: is an array (ordered) of documents to be used in training. sample-idx: is the start document index and document offset for each @@ -331,7 +364,11 @@ def _build_index_mappings(name, data_prefix, documents, sizes, args = get_args() # Number of tokens in each epoch and number of required epochs. tokens_per_epoch = _num_tokens(documents, sizes) - num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples) + num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples, add_extra_token) + if num_samples < 0: + print_num_samples = tokens_per_epoch // seq_length + else: + print_num_samples = num_samples if args.train_data_exact_num_epochs is not None and name == 'train': num_epochs = args.train_data_exact_num_epochs @@ -342,7 +379,7 @@ def _build_index_mappings(name, data_prefix, documents, sizes, desc = "GPT Dataset\n\n" desc += f"Data prefix {data_prefix}\n" desc += f"Dataset name {name}\n" - desc += f"Number of samples {num_samples}\n" + desc += f"Number of samples {print_num_samples}\n" desc += f"Number of epochs {num_epochs}\n" desc += f"Sequence length {seq_length}\n" desc += f"Random seed {seed}\n" @@ -405,13 +442,14 @@ def _build_index_mappings(name, data_prefix, documents, sizes, else: # Get the number of samples for the last epoch + assert num_samples >= 0, 'number of samples should be non-negative' num_samples_from_epochs_minus_one = ( - (num_epochs - 1) * tokens_per_epoch - 1) // seq_length + (num_epochs - 1) * tokens_per_epoch - add_extra_token) // seq_length last_epoch_num_samples = num_samples - \ num_samples_from_epochs_minus_one assert last_epoch_num_samples >= 0, \ 'last epoch number of samples should be non-negative.' - num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length + num_samples_per_epoch = (tokens_per_epoch - add_extra_token) // seq_length assert last_epoch_num_samples <= (num_samples_per_epoch + 1), \ 'last epoch number of samples exceeded max value.' 
# If we have less than 80% of the samples for the last epoch, @@ -454,7 +492,8 @@ def _build_index_mappings(name, data_prefix, documents, sizes, assert doc_idx.dtype == np.int32 assert sizes.dtype == np.int32 sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length, - num_epochs, tokens_per_epoch) + num_epochs, tokens_per_epoch, + num_samples < 0, add_extra_token) np.save(idx_path['sample'], sample_idx, allow_pickle=True) print_rank_0(' > elasped time to build and save sample-idx mapping ' '(seconds): {:4f}'.format(time.time() - start_time)) @@ -514,7 +553,7 @@ def _num_tokens(documents, sizes): return np.sum(sizes[documents]) -def _num_epochs(tokens_per_epoch, seq_length, num_samples): +def _num_epochs(tokens_per_epoch, seq_length, num_samples, add_extra_token): """Based on number of samples and sequence lenght, calculate how many epochs will be needed.""" num_epochs = 0 @@ -525,19 +564,23 @@ def _num_epochs(tokens_per_epoch, seq_length, num_samples): # -1 is because we need to retrieve seq_length + 1 token each time # but the last token will overlap with the first token of the next # sample except for the last sample. - if ((total_tokens - 1) // seq_length) >= num_samples: + if ((total_tokens - add_extra_token) // seq_length) >= num_samples: return num_epochs def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): """Build an array with length = number-of-epochs * number-of-dcuments. Each index is mapped to a corresponding document.""" + args = get_args() if not separate_last_epoch or num_epochs == 1: doc_idx = np.mgrid[0:num_epochs, 0:len(documents)][1] doc_idx[:] = documents doc_idx = doc_idx.reshape(-1) doc_idx = doc_idx.astype(np.int32) - np_rng.shuffle(doc_idx) + if args.disable_doc_shuffling: + print_rank_0(' > Disabled document shuffling...') + else: + np_rng.shuffle(doc_idx) return doc_idx doc_idx_first = _build_doc_idx(documents, num_epochs-1, np_rng, False) @@ -546,14 +589,19 @@ def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): def _build_sample_idx(sizes, doc_idx, seq_length, - num_epochs, tokens_per_epoch): + num_epochs, tokens_per_epoch, + keep_last_sequence, add_extra_token): """Sample index mapping is a 2D array with sizes [number-of-samples + 1, 2] where [..., 0] contains the index into `doc_idx` and [..., 1] is the starting offset in that document.""" # Total number of samples. For -1 see comments in `_num_epochs`. - num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length + if keep_last_sequence: + import math + num_samples = math.ceil((num_epochs * tokens_per_epoch - add_extra_token) / seq_length) + else: + num_samples = (num_epochs * tokens_per_epoch - add_extra_token) // seq_length sample_idx = np.zeros([num_samples + 1, 2], dtype=np.int32) # Index into sample_idx. @@ -568,7 +616,7 @@ def _build_sample_idx(sizes, doc_idx, seq_length, sample_index += 1 while sample_index <= num_samples: # Start with a fresh sequence. - remaining_seq_length = seq_length + 1 + remaining_seq_length = seq_length + add_extra_token while remaining_seq_length != 0: # Get the document length. doc_id = doc_idx[doc_idx_index] @@ -580,10 +628,14 @@ def _build_sample_idx(sizes, doc_idx, seq_length, # Note that -1 here is for the same reason we have -1 in # `_num_epochs` calculations. if remaining_seq_length <= 0: - doc_offset += (remaining_seq_length + doc_length - 1) + doc_offset += (remaining_seq_length + doc_length - add_extra_token) remaining_seq_length = 0 else: # Otherwise, start from the begining of the next document. 
+ if doc_idx_index == (len(doc_idx) - 1): + assert sample_index == num_samples, F"sample_index={sample_index} and num_samples={num_samples} should be the same" + doc_offset = sizes[doc_idx[doc_idx_index]] - add_extra_token + break doc_idx_index += 1 doc_offset = 0 # Record the sequence. diff --git a/megatron/data/helpers.cpp b/megatron/data/helpers.cpp index 5c3a054875..f6d7f78db5 100644 --- a/megatron/data/helpers.cpp +++ b/megatron/data/helpers.cpp @@ -1,3 +1,4 @@ +/* Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. */ /* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. */ /* Helper methods for fast index mapping builds */ @@ -84,7 +85,9 @@ py::array build_sample_idx(const py::array_t& sizes_, const py::array_t& doc_idx_, const int32_t seq_length, const int32_t num_epochs, - const int64_t tokens_per_epoch) { + const int64_t tokens_per_epoch, + const bool keep_last_sequence, + const int32_t add_extra_token) { /* Sample index (sample_idx) is used for gpt2 like dataset for which the documents are flattened and the samples are built based on this 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2] @@ -101,7 +104,12 @@ py::array build_sample_idx(const py::array_t& sizes_, auto doc_idx = doc_idx_.unchecked<1>(); // Mapping and it's length (1D). - int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length; + int64_t num_samples = 0; + if (keep_last_sequence) { + num_samples = ceil(float(num_epochs * tokens_per_epoch - add_extra_token) / seq_length); + } else { + num_samples = (num_epochs * tokens_per_epoch - add_extra_token) / seq_length; + } int64_t* sample_idx = new int64_t[2*(num_samples+1)]; cout << " using:" << endl << std::flush; @@ -127,7 +135,7 @@ py::array build_sample_idx(const py::array_t& sizes_, while (sample_index <= num_samples) { // Start with a fresh sequence. - int64_t remaining_seq_length = seq_length + 1; + int64_t remaining_seq_length = seq_length + add_extra_token; while (remaining_seq_length != 0) { // Get the document length. auto doc_id = static_cast(doc_idx[doc_idx_index]); @@ -139,10 +147,15 @@ py::array build_sample_idx(const py::array_t& sizes_, // Note that -1 here is for the same reason we have -1 in // `_num_epochs` calculations. if (remaining_seq_length <= 0) { - doc_offset += (remaining_seq_length + doc_length - 1); + doc_offset += (remaining_seq_length + doc_length - add_extra_token); remaining_seq_length = 0; } else { // Otherwise, start from the begining of the next document. + if (doc_idx_index == (doc_idx_.shape(0) - 1)) { + assert(sample_index == num_samples); + doc_offset = sizes[doc_idx[doc_idx_index]] - add_extra_token; + break; + } ++doc_idx_index; doc_offset = 0; } diff --git a/megatron/fused_kernels/__init__.py b/megatron/fused_kernels/__init__.py index 6886c17e8b..ee134abfd1 100644 --- a/megatron/fused_kernels/__init__.py +++ b/megatron/fused_kernels/__init__.py @@ -1,5 +1,7 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
+from deepspeed.accelerator import get_accelerator import os import pathlib import subprocess @@ -16,6 +18,8 @@ def load(args): + if get_accelerator().device_name() != "cuda": + return # Check if cuda 11 is installed for compute capability 8.0 cc_flag = [] diff --git a/megatron/fused_kernels/tests/test_fused_kernels.py b/megatron/fused_kernels/tests/test_fused_kernels.py index 74024c5020..8ceea51be8 100644 --- a/megatron/fused_kernels/tests/test_fused_kernels.py +++ b/megatron/fused_kernels/tests/test_fused_kernels.py @@ -1,3 +1,6 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + +from deepspeed.accelerator import get_accelerator import math import torch @@ -9,6 +12,20 @@ from megatron.model.utils import attention_mask_func from megatron.fused_kernels import load +try: + from transformers import BertTokenizer, GPT2Tokenizer + from transformers.models.bert.modeling_bert import BertModel + from transformers.models.gpt2.modeling_gpt2 import GPT2Model + import transformers + + transformers.logging.set_verbosity( + transformers.logging.FATAL, + ) + +except: + print("\n[Fail] Please install `transformers` package to test fused kernels\n") + exit(-1) + def test_load_fused_kernels(): try: import fused_layer_norm_cuda @@ -18,11 +35,12 @@ def test_load_fused_kernels(): print("[Success] load_fused_kernels") except ImportError as e: - print("[Fail] load_fused_kernels") - raise e + if get_accelerator().device_name() == "cuda": + print("[Fail] load_fused_kernels") + raise e def test_fused_softmax(): - bert = BertModel.from_pretrained("bert-base-cased").cuda().half() + bert = BertModel.from_pretrained("bert-base-cased").to(get_accelerator().device_name()).half() tokenizer = BertTokenizer.from_pretrained("bert-base-cased") test_text = ( "Hello. How are you? I am fine thank you and you? yes Good. " @@ -35,16 +53,16 @@ def test_fused_softmax(): ) embedding_output = bert.embeddings( - input_ids=tokens["input_ids"].cuda(), + input_ids=tokens["input_ids"].to(get_accelerator().device_name()), position_ids=None, - token_type_ids=tokens["token_type_ids"].cuda(), + token_type_ids=tokens["token_type_ids"].to(get_accelerator().device_name()), inputs_embeds=None, past_key_values_length=0, ) # (bsz, 1, 1, seq_len) mask = bert.get_extended_attention_mask( - attention_mask=tokens["attention_mask"].cuda(), + attention_mask=tokens["attention_mask"].to(get_accelerator().device_name()), input_shape=tokens["input_ids"].shape, device=bert.device, ) @@ -68,7 +86,7 @@ def test_fused_softmax(): attn_mask_type=AttnMaskType.padding, scaled_masked_softmax_fusion=True, ) - .cuda() + .to(get_accelerator().device_name()) .half() ) @@ -87,7 +105,7 @@ def test_fused_softmax(): attn_mask_type=AttnMaskType.padding, scaled_masked_softmax_fusion=False, ) - .cuda() + .to(get_accelerator().device_name()) .half() ) @@ -120,7 +138,7 @@ def test_fused_softmax(): def test_fused_upper_triangle_mask_softmax(): - gpt = GPT2Model.from_pretrained("gpt2").cuda().half() + gpt = GPT2Model.from_pretrained("gpt2").to(get_accelerator().device_name()).half() tokenizer = GPT2Tokenizer.from_pretrained("gpt2") test_text = ( "Hello. How are you? I am fine thank you and you? yes Good. 
" @@ -132,14 +150,14 @@ def test_fused_upper_triangle_mask_softmax(): return_tensors="pt", ) - attention_mask = tokens["attention_mask"].cuda() + attention_mask = tokens["attention_mask"].to(get_accelerator().device_name()) attention_mask = attention_mask.view(attention_mask.size(0), -1) attention_mask = attention_mask[:, None, None, :] attention_mask = (1.0 - attention_mask) * -10000.0 attention_mask = attention_mask.repeat(1, 1, attention_mask.size()[-1], 1) attn = gpt.h[0] - hidden_states = gpt.wte(tokens["input_ids"].cuda()) + hidden_states = gpt.wte(tokens["input_ids"].to(get_accelerator().device_name())) q, k, v = attn.attn.c_attn(hidden_states).split(768, dim=-1) q = attn.attn._split_heads(q, attn.attn.num_heads, attn.attn.head_dim) k = attn.attn._split_heads(k, attn.attn.num_heads, attn.attn.head_dim) @@ -168,7 +186,7 @@ def test_fused_upper_triangle_mask_softmax(): attn_mask_type=AttnMaskType.causal, scaled_masked_softmax_fusion=True, ) - .cuda() + .to(get_accelerator().device_name()) .half() ) @@ -187,7 +205,7 @@ def test_fused_upper_triangle_mask_softmax(): attn_mask_type=AttnMaskType.causal, scaled_masked_softmax_fusion=False, ) - .cuda() + .to(get_accelerator().device_name()) .half() ) @@ -220,7 +238,7 @@ def test_fused_upper_triangle_mask_softmax(): def test_layer_norm(): - bert = BertModel.from_pretrained("bert-base-cased").cuda().half() + bert = BertModel.from_pretrained("bert-base-cased").to(get_accelerator().device_name()).half() tokenizer = BertTokenizer.from_pretrained("bert-base-cased") test_text = ( "Hello. How are you? I am fine thank you and you? yes Good. " @@ -235,22 +253,22 @@ def test_layer_norm(): # [bsz, seq_len, d_model] embedding_output = ( bert.embeddings( - input_ids=tokens["input_ids"].cuda(), + input_ids=tokens["input_ids"].to(get_accelerator().device_name()), position_ids=None, - token_type_ids=tokens["token_type_ids"].cuda(), + token_type_ids=tokens["token_type_ids"].to(get_accelerator().device_name()), inputs_embeds=None, past_key_values_length=0, ) - .cuda() + .to(get_accelerator().device_name()) .half() ) fused_layernorm_layer = ( - MixedFusedLayerNorm(normalized_shape=embedding_output.size(-1)).cuda().half() + MixedFusedLayerNorm(normalized_shape=embedding_output.size(-1)).to(get_accelerator().device_name()).half() ) torch_layernorm_layer = ( - LayerNorm(normalized_shape=embedding_output.size(-1)).cuda().half() + LayerNorm(normalized_shape=embedding_output.size(-1)).to(get_accelerator().device_name()).half() ) fused_output = fused_layernorm_layer(embedding_output) @@ -291,6 +309,8 @@ def forward_torch_softmax(input, mask, scale): def test_masked_softmax_forward(): + if get_accelerator().device_name() != "cuda": + return import scaled_masked_softmax_cuda batch = 2 @@ -306,6 +326,8 @@ def test_masked_softmax_forward(): assert error < 1e-3 def test_masked_softmax_backward(): + if get_accelerator().device_name() != "cuda": + return import scaled_masked_softmax_cuda batch = 2 @@ -327,6 +349,8 @@ def test_masked_softmax_backward(): def test_allmasked_softmax_forward(): + if get_accelerator().device_name() != "cuda": + return import scaled_masked_softmax_cuda batch = 2 @@ -343,6 +367,8 @@ def test_allmasked_softmax_forward(): def test_allmasked_softmax_backward(): + if get_accelerator().device_name() != "cuda": + return import scaled_masked_softmax_cuda batch = 2 @@ -363,20 +389,6 @@ def test_allmasked_softmax_backward(): if __name__ == "__main__": - try: - from transformers import BertTokenizer, GPT2Tokenizer - from transformers.models.bert.modeling_bert 
import BertModel - from transformers.models.gpt2.modeling_gpt2 import GPT2Model - import transformers - - transformers.logging.set_verbosity( - transformers.logging.FATAL, - ) - - except: - print("\n[Fail] Please install `transformers` package to test fused kernels\n") - exit(-1) - load() test_masked_softmax_forward() test_masked_softmax_backward() diff --git a/megatron/global_vars.py b/megatron/global_vars.py index cb284b3c34..9d1d127c27 100644 --- a/megatron/global_vars.py +++ b/megatron/global_vars.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Megatron global variables.""" @@ -14,6 +15,7 @@ _GLOBAL_ARGS = None _GLOBAL_RETRO_ARGS = None _GLOBAL_NUM_MICROBATCHES_CALCULATOR = None +_GLOBAL_NUM_EVAL_MICROBATCHES_CALCULATOR = None _GLOBAL_TOKENIZER = None _GLOBAL_TENSORBOARD_WRITER = None _GLOBAL_ADLR_AUTORESUME = None @@ -44,6 +46,19 @@ def update_num_microbatches(consumed_samples, consistency_check=True): consistency_check) +def get_num_eval_microbatches(): + return _GLOBAL_NUM_EVAL_MICROBATCHES_CALCULATOR.get() + + +# When using different micro batch size for training and evaluation/validation +# we have different number of micro batches. +def get_num_microbatches_by_mode(is_training): + if is_training: + return get_num_microbatches() + else: + return get_num_eval_microbatches() + + def get_tokenizer(): """Return tokenizer.""" _ensure_var_is_initialized(_GLOBAL_TOKENIZER, 'tokenizer') @@ -111,11 +126,16 @@ def set_retro_args(retro_args): def _build_num_microbatches_calculator(args): global _GLOBAL_NUM_MICROBATCHES_CALCULATOR + global _GLOBAL_NUM_EVAL_MICROBATCHES_CALCULATOR _ensure_var_is_not_initialized(_GLOBAL_NUM_MICROBATCHES_CALCULATOR, 'num microbatches calculator') + _ensure_var_is_not_initialized(_GLOBAL_NUM_EVAL_MICROBATCHES_CALCULATOR, + 'num eval microbatches calculator') _GLOBAL_NUM_MICROBATCHES_CALCULATOR = build_num_microbatches_calculator( - args) + args, args.micro_batch_size) + _GLOBAL_NUM_EVAL_MICROBATCHES_CALCULATOR = build_num_microbatches_calculator( + args, args.eval_micro_batch_size) def _build_tokenizer(args): diff --git a/megatron/initialize.py b/megatron/initialize.py index 85c37a5767..248971e34a 100644 --- a/megatron/initialize.py +++ b/megatron/initialize.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Megatron initialization.""" @@ -5,6 +6,7 @@ import random import os import time +import shutil import numpy as np import torch @@ -17,7 +19,7 @@ from megatron.core import mpu, tensor_parallel from megatron.arguments import (parse_args, validate_args) from megatron.checkpointing import load_args_from_checkpoint -from megatron.global_vars import set_global_variables +from megatron.global_vars import set_global_variables, set_args from megatron.model.transformer import bias_dropout_add_fused_train from megatron.model.fused_bias_gelu import bias_gelu from megatron.utils import is_rank_0 @@ -29,12 +31,19 @@ def initialize_megatron(extra_args_provider=None, args_defaults={}, - ignore_unknown_args=False, allow_no_cuda=False, external_args={}): + ignore_unknown_args=False, allow_no_cuda=False, + external_args={}, allow_parsing=True, + allow_validating_args=True): """Set global variables, initialize distributed, and set autoresume and random seeds. `allow_no_cuda` should not be set unless using megatron for cpu only data processing. 
In general this arg should not be set unless you know what you are doing. + `allow_parsing` should be set to False when argument parsing has already been done. In general + this arg should not be set to False, unless you know what you are doing. + `allow_validating_args` should be set to False when the parsed args do not need to be validated. + In general this arg should not be set to False, unless you know what you + are doing. Returns a function to finalize distributed env initialization (optionally, only when args.lazy_mpu_init == True) """ @@ -42,8 +51,14 @@ def initialize_megatron(extra_args_provider=None, args_defaults={}, # Make sure cuda is available. assert get_accelerator().is_available(), 'Megatron requires accelerator.' - # Parse arguments - args = parse_args(extra_args_provider, ignore_unknown_args) + if allow_parsing: + # Parse arguments + args = parse_args(extra_args_provider, ignore_unknown_args) + else: + # arguments were already parsed and set + args = get_args() + # reset it so that set_global_variables below doesn't assert + set_args(None) for key in external_args: if key in args: @@ -53,12 +68,19 @@ def initialize_megatron(extra_args_provider=None, args_defaults={}, assert args.load is not None, '--use-checkpoints-args requires --load argument' load_args_from_checkpoint(args) - validate_args(args, args_defaults) + if allow_validating_args: + # validate parsed arguments + validate_args(args, args_defaults) # set global args, build tokenizer, and set adlr-autoresume, # tensorboard-writer, and timers. set_global_variables(args) + # profiler config, must be done before hpu initialization + if args.profile == 'hltv': + os.environ['HABANA_PROFILE'] = 'profile_api_with_nics' + shutil.rmtree('.graph_dumps', ignore_errors=True) + # torch.distributed initialization def finish_mpu_init(): args = get_args() diff --git a/megatron/microbatches.py b/megatron/microbatches.py index 6449d7479c..72ab0b2dc1 100644 --- a/megatron/microbatches.py +++ b/megatron/microbatches.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Megatron number of micro-batches calculators.""" @@ -6,12 +7,12 @@ from abc import abstractmethod -def build_num_microbatches_calculator(args): +def build_num_microbatches_calculator(args, micro_batch): # Constant num micro-batches. if args.rampup_batch_size is None: num_microbatches_calculator = ConstantNumMicroBatches( - args.global_batch_size, args.micro_batch_size, + args.global_batch_size, micro_batch, args.data_parallel_size) if args.rank == 0: print('setting number of micro-batches to constant {}'.format( @@ -21,6 +22,9 @@ def build_num_microbatches_calculator(args): assert len(args.rampup_batch_size) == 3, 'expected the following ' \ 'format: --rampup-batch-size ' \ ' ' + assert args.micro_batch_size == args.eval_micro_batch_size, \ + "build_num_microbatches_calculator with rampup_batch_size - " \ + "Unsupported for split micro batch size" start_batch_size = int(args.rampup_batch_size[0]) batch_size_increment = int(args.rampup_batch_size[1]) ramup_samples = int(args.rampup_batch_size[2]) diff --git a/megatron/model/__init__.py b/megatron/model/__init__.py index 2306749fcb..9fba283628 100644 --- a/megatron/model/__init__.py +++ b/megatron/model/__init__.py @@ -1,13 +1,18 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
from deepspeed.accelerator.real_accelerator import get_accelerator -if get_accelerator().device_name() == 'cuda': + +if get_accelerator().device_name() in ['cuda', 'hpu']: from .fused_layer_norm import MixedFusedLayerNorm as LayerNorm - from apex.normalization import MixedFusedRMSNorm as RMSNorm else: - from .rmsnorm import RMSNorm from torch.nn import LayerNorm +if get_accelerator().device_name() == 'cuda': + from .rmsnorm_apex import RMSNorm +else: + from .rmsnorm import RMSNorm + from .distributed import DistributedDataParallel from .bert_model import BertModel from .gpt_model import GPTModel, GPTModelPipe diff --git a/megatron/model/fused_layer_norm.py b/megatron/model/fused_layer_norm.py index 2f3b89014b..5376b906a5 100644 --- a/megatron/model/fused_layer_norm.py +++ b/megatron/model/fused_layer_norm.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """This code is copied fron NVIDIA apex: @@ -13,6 +14,9 @@ import inspect from megatron.core.utils import make_viewless_tensor +from megatron import get_args + +from deepspeed.accelerator.real_accelerator import get_accelerator try: from apex.contrib.layer_norm.layer_norm import FastLayerNormFN @@ -20,7 +24,10 @@ except: HAVE_PERSIST_LAYER_NORM = False -from apex.normalization.fused_layer_norm import FusedLayerNormAffineFunction +try: + from apex.normalization.fused_layer_norm import FusedLayerNormAffineFunction +except ModuleNotFoundError: + pass global fused_layer_norm_cuda @@ -39,8 +46,9 @@ def __init__(self, normalized_shape, eps=1e-5, self.apply_layernorm_1p = apply_layernorm_1p self.mem_efficient_ln = mem_efficient_ln - global fused_layer_norm_cuda - fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda") + if get_accelerator().device_name() == 'cuda': + global fused_layer_norm_cuda + fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda") # List of hiddens sizes supported in the persistent layer norm kernel # If the hidden size is not supported, fall back to the non-persistent @@ -56,8 +64,19 @@ def __init__(self, normalized_shape, eps=1e-5, normalized_shape = (normalized_shape,) self.normalized_shape = torch.Size(normalized_shape) self.eps = eps - self.weight = Parameter(torch.Tensor(*normalized_shape)) - self.bias = Parameter(torch.Tensor(*normalized_shape)) + try: + args = get_args() + except: + args = None + params_dtype = torch.float32 + if args is not None: + params_dtype = args.params_dtype + self.weight = Parameter(torch.empty(*normalized_shape, + device=get_accelerator().current_device_name(), + dtype=params_dtype)) + self.bias = Parameter(torch.empty(*normalized_shape, + device=get_accelerator().current_device_name(), + dtype=params_dtype)) self.reset_parameters() self.no_persist_layer_norm = no_persist_layer_norm self.sequence_parallel = sequence_parallel @@ -81,8 +100,9 @@ def forward(self, input): weight = self.weight + 1 if self.apply_layernorm_1p else self.weight # CPU path is here for unittest sake. if not input.is_cuda: - print("WARNING! The input of FusedLayerNorm should be on the GPU." - "This warning should only be triggered in the FusedLayerNorm unit tests.") + if get_accelerator().device_name() == 'cuda': + print("WARNING! The input of FusedLayerNorm should be on the GPU." 
+ "This warning should only be triggered in the FusedLayerNorm unit tests.") return F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps) if self.no_persist_layer_norm: diff --git a/megatron/model/fused_softmax.py b/megatron/model/fused_softmax.py index 2fe61d4073..0ca005bf16 100644 --- a/megatron/model/fused_softmax.py +++ b/megatron/model/fused_softmax.py @@ -1,6 +1,8 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +from deepspeed.accelerator import get_accelerator import torch import torch.nn as nn from megatron.model.enums import AttnMaskType @@ -144,7 +146,7 @@ def forward(self, input, mask): # [b, np, sq, sk] assert input.dim() == 4 - if self.is_kernel_available(mask, *input.size()): + if get_accelerator().device_name() == "cuda" and self.is_kernel_available(mask, *input.size()): return self.forward_fused_softmax(input, mask) else: return self.forward_torch_softmax(input, mask) diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py index 8968c96655..bbd9cbc33c 100644 --- a/megatron/model/gpt_model.py +++ b/megatron/model/gpt_model.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. """GPT-2 model.""" @@ -347,6 +348,7 @@ def _to_float16(inputs): args.max_position_embeddings, args.hidden_dropout, config, + add_position_embedding=args.add_position_embedding, num_tokentypes=num_tokentypes, embedding_weights_in_fp32=args.embedding_weights_in_fp32,)) else: @@ -357,6 +359,7 @@ def _to_float16(inputs): args.max_position_embeddings, args.hidden_dropout, config, + add_position_embedding=args.add_position_embedding, num_tokentypes=num_tokentypes, embedding_weights_in_fp32=args.embedding_weights_in_fp32, tied_weight_attr='word_embeddings_weight')) @@ -393,9 +396,13 @@ def _to_float16(inputs): if args.normalization == 'layernorm': self.specs.append(LayerSpec(LayerNorm, args.hidden_size, - eps=args.layernorm_epsilon)) + eps=args.layernorm_epsilon, + sequence_parallel=args.sequence_parallel, + apply_layernorm_1p=args.apply_layernorm_1p)) else: - self.specs.append(LayerSpec(RMSNorm, args.hidden_size, args.layernorm_epsilon)) + self.specs.append(LayerSpec(RMSNorm, args.hidden_size, + args.layernorm_epsilon, + sequence_parallel=args.sequence_parallel)) def _logits_helper(embedding, lm_output): """A wrapper to massage inputs/outputs from pipeline. 
""" @@ -404,8 +411,9 @@ def _logits_helper(embedding, lm_output): embedding.word_embeddings_weight, self.parallel_output) if args.untie_embeddings_and_output_weights: + gather_output = not parallel_output self.specs.append( - LayerSpec(LMHeadPipe, args.hidden_size, args.padded_vocab_size, config) + LayerSpec(LMHeadPipe, args.hidden_size, args.padded_vocab_size, config, gather_output) ) else: self.specs.append( @@ -416,6 +424,7 @@ def _logits_helper(embedding, lm_output): args.max_position_embeddings, args.hidden_dropout, config, + add_position_embedding=(args.add_position_embedding and (not args.fix_position_emb_redundant_alloc)), num_tokentypes=num_tokentypes, embedding_weights_in_fp32=args.embedding_weights_in_fp32, forward_fn=_logits_helper, @@ -431,6 +440,8 @@ def _logits_helper(embedding, lm_output): self.last_lm_loss = None # detached, for display only self.last_moe_loss = None # detached, for display only + # for selective -> use --recompute-granularity='selective' + # for full -> use --recompute-granularity='full' --recompute-method='uniform' if args.checkpoint_activations: interval = args.checkpoint_num_layers elif args.recompute_granularity == "full" and args.recompute_method == 'uniform': diff --git a/megatron/model/language_model.py b/megatron/model/language_model.py index ec2ae1877a..71cc96775a 100644 --- a/megatron/model/language_model.py +++ b/megatron/model/language_model.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. """Transformer based language model.""" @@ -141,6 +142,7 @@ def __init__(self, max_sequence_length, embedding_dropout_prob, config, + add_position_embedding=True, num_tokentypes=0, embedding_weights_in_fp32=False): super(Embedding, self).__init__() @@ -159,7 +161,7 @@ def __init__(self, self._word_embeddings_key = 'word_embeddings' # Position embedding (serial). - self.add_position_embedding = args.add_position_embedding + self.add_position_embedding = add_position_embedding if self.add_position_embedding: self._position_embeddings_key = 'position_embeddings' if args.sequence_parallel: @@ -256,8 +258,8 @@ def forward(self, input_ids, position_ids, tokentype_ids=None): # Dropout. if self.sequence_parallel: - # already partition sequence, do not need scatter_to_sequence_parallel_region - # embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings) + # already partition sequence, do not need scatter_to_sequence_parallel_region ? + embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings) with tensor_parallel.get_cuda_rng_tracker().fork(): embeddings = self.embedding_dropout(embeddings) else: @@ -415,6 +417,7 @@ def __init__(self, args.max_position_embeddings, args.hidden_dropout, config, + args.add_position_embedding, self.num_tokentypes, args.embedding_weights_in_fp32) self._embedding_key = 'embedding' diff --git a/megatron/model/module.py b/megatron/model/module.py index 963ad2d29d..59a13aed89 100644 --- a/megatron/model/module.py +++ b/megatron/model/module.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
"""Megatron Module""" @@ -10,10 +11,9 @@ from megatron.core import mpu, tensor_parallel -_FLOAT_TYPES = [get_accelerator().FloatTensor(0).dtype] -_HALF_TYPES = [get_accelerator().HalfTensor(0).dtype] -_BF16_TYPES = [get_accelerator().BFloat16Tensor(0).dtype] - +_FLOAT_TYPES = None +_HALF_TYPES = None +_BF16_TYPES = None def param_is_not_shared(param): @@ -131,6 +131,9 @@ def conversion_helper(val, conversion): def fp32_to_float16(val, float16_convertor): """Convert fp32 `val` to fp16/bf16""" + global _FLOAT_TYPES + if _FLOAT_TYPES is None: + _FLOAT_TYPES = [get_accelerator().FloatTensor(0).dtype] def half_conversion(val): val_typecheck = val if isinstance(val_typecheck, (Parameter, Variable)): @@ -143,6 +146,12 @@ def half_conversion(val): def float16_to_fp32(val): """Convert fp16/bf16 `val` to fp32""" + global _HALF_TYPES + global _BF16_TYPES + if _HALF_TYPES is None: + _HALF_TYPES = [get_accelerator().HalfTensor(0).dtype] + if _BF16_TYPES is None: + _BF16_TYPES = [get_accelerator().BFloat16Tensor(0).dtype] def float_conversion(val): val_typecheck = val if isinstance(val_typecheck, (Parameter, Variable)): diff --git a/megatron/model/rmsnorm.py b/megatron/model/rmsnorm.py index 60e8978171..f85f36a920 100644 --- a/megatron/model/rmsnorm.py +++ b/megatron/model/rmsnorm.py @@ -1,16 +1,38 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + +from deepspeed.accelerator import get_accelerator +from megatron import get_args import torch +from torch.nn import init from torch.nn.parameter import Parameter +try: + from habana_frameworks.torch.hpex.normalization import FusedRMSNorm +except: + FusedRMSNorm = None + + # Taken from facebookresearch/llama class RMSNorm(torch.nn.Module): - def __init__(self, dim: int, eps: float = 1e-6): + def __init__(self, dim: int, eps: float = 1e-6, sequence_parallel=False): super().__init__() self.eps = eps - self.weight = Parameter(torch.ones(dim)) + self.weight = Parameter(torch.empty(dim, + device=get_accelerator().current_device_name(), + dtype=get_args().params_dtype)) + init.ones_(self.weight) + + self.use_fused_rmsnorm = get_args().use_fused_rmsnorm + + if sequence_parallel: + setattr(self.weight, 'sequence_parallel', sequence_parallel) def _norm(self, x): return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) def forward(self, x): + if self.use_fused_rmsnorm and x.device.type == "hpu": + assert FusedRMSNorm is not None, "failed to import FusedRMSNorm" + return FusedRMSNorm.apply(x, self.weight, self.eps) output = self._norm(x.float()).type_as(x) - return output * self.weight \ No newline at end of file + return output * self.weight diff --git a/megatron/model/rmsnorm_apex.py b/megatron/model/rmsnorm_apex.py new file mode 100644 index 0000000000..7290aa37a2 --- /dev/null +++ b/megatron/model/rmsnorm_apex.py @@ -0,0 +1,21 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+ +from deepspeed.accelerator import get_accelerator + +try: + from apex.normalization import MixedFusedRMSNorm as ApexMixedFusedRMSNorm +except: + assert False, "Failed: from apex.normalization import MixedFusedRMSNorm" + + +class RMSNorm(ApexMixedFusedRMSNorm): + """ Derived class to handle sequence parallel configuration """ + def __init__(self, dim: int, eps: float = 1e-5, **kwargs): + assert get_accelerator().device_name() == 'cuda', f"Unsupported device: {get_accelerator().device_name()}" + sequence_parallel = kwargs.pop('sequence_parallel') if 'sequence_parallel' in kwargs else False + super().__init__(dim, eps, **kwargs) + if sequence_parallel: + setattr(self.weight, 'sequence_parallel', sequence_parallel) + + def forward(self, x): + return super().forward(x) diff --git a/megatron/model/rotary_pos_embedding.py b/megatron/model/rotary_pos_embedding.py index 4d4497e0cd..535ad062fc 100644 --- a/megatron/model/rotary_pos_embedding.py +++ b/megatron/model/rotary_pos_embedding.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # coding=utf-8 # The following code has been taken from https://github.com/NVIDIA/NeMo/blob/ \ @@ -5,21 +6,38 @@ # common/megatron/rotary_pos_embedding.py import importlib.util +from megatron.global_vars import get_args import torch from torch import einsum, nn __all__ = ['RotaryEmbedding', 'apply_rotary_pos_emb'] +try: + from habana_frameworks.torch.hpex.kernels import RotaryPosEmbeddingHelperV1 +except ImportError: + RotaryPosEmbeddingHelperV1 = None + +# sin, cos tensors cached for all devices +cos_cached = None +sin_cached = None + + class RotaryEmbedding(nn.Module): def __init__(self, dim, theta=10000): super().__init__() + args = get_args() + self.eval_hf_rope = args.eval_hf_rope inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) + if self.eval_hf_rope and inv_freq.dtype != args.params_dtype: + inv_freq = inv_freq.to(args.params_dtype) self.register_buffer('inv_freq', inv_freq) if importlib.util.find_spec('einops') is None: raise RuntimeError("einops is required for Rotary Embedding") def forward(self, max_seq_len, offset=0): + if self.eval_hf_rope and self.inv_freq.dtype != torch.float32: + self.inv_freq = self.inv_freq.float() seq = torch.arange(max_seq_len, device=self.inv_freq.device) + offset freqs = einsum('i , j -> i j', seq.type_as(self.inv_freq), self.inv_freq) # first part even vector components, second part odd vector components, @@ -43,14 +61,30 @@ def _rotate_half(x): def apply_rotary_pos_emb(t, freqs): """ input tensor t is of shape [seq_length, ..., dim] - rotary positional embeding tensor freqs is of shape [seq_length, ..., dim] + rotary positional embedding tensor freqs is of shape [seq_length, ..., dim] check https://kexue.fm/archives/8265 for detailed formulas """ rot_dim = freqs.shape[-1] - # ideally t_pass is empty so rotary pos embedding is applied to all tensor t - t, t_pass = t[..., :rot_dim], t[..., rot_dim:] + t_pass = None + # due to 0 dim of t_pass tensor, there is zeros tensor DMA from H2D which + # affects performance, check whether we need t_pass + if t.shape[-1] != rot_dim: + # ideally t_pass is empty so rotary pos embedding is applied to all tensor t + t, t_pass = t[..., :rot_dim], t[..., rot_dim:] + + global cos_cached, sin_cached + if cos_cached is None or sin_cached is None or t.shape[0] != cos_cached.shape[0]: + freqs_ = freqs[:t.shape[0]] + cos_cached = freqs_.cos().to(t.dtype) + sin_cached = freqs_.sin().to(t.dtype) - # first part is cosine component - # second part is sine 
component, need to change signs with _rotate_half method - t = (t * freqs.cos().to(t.dtype)) + (_rotate_half(t) * freqs.sin().to(t.dtype)) - return t if t_pass.shape[-1] == 0 else torch.cat((t, t_pass), dim=-1) + if t.device.type == "hpu": + assert RotaryPosEmbeddingHelperV1 is not None, "failed to import RotaryPosEmbeddingHelperV1" + t = RotaryPosEmbeddingHelperV1.apply(t, cos_cached, sin_cached, 0) # offset already used in RotaryEmbedding.forward + else: + # first part is cosine component + # second part is sine component, need to change signs with _rotate_half method + t = (t * cos_cached) + (_rotate_half(t) * sin_cached) + if t_pass is None: + return t + return torch.cat((t, t_pass), dim=-1) diff --git a/megatron/model/transformer.py b/megatron/model/transformer.py index e75f13a24f..3ca7204188 100644 --- a/megatron/model/transformer.py +++ b/megatron/model/transformer.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. """Transformer.""" @@ -49,6 +50,15 @@ FlashAttentionBuilder = get_accelerator().get_op_builder("FlashAttentionBuilder") flash_attn_builder = None +try: + flash_attn_builder = FlashAttentionBuilder().load() +except (TypeError, ValueError): + flash_attn_builder = None + +try: + from habana_frameworks.torch.hpex.kernels import FusedSDPA +except ImportError: + FusedSDPA = None """ We use the following notation throughout this file: @@ -232,7 +242,7 @@ def __init__(self, layer_number, config, self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32 if self.apply_query_key_layer_scaling: self.attention_softmax_in_fp32 = True - self.layer_number = max(1, layer_number) + self.layer_number = max(1, layer_number+1) self.attn_mask_type = attn_mask_type self.sequence_parallel = config.sequence_parallel @@ -268,10 +278,10 @@ def __init__(self, layer_number, config, # Dropout. Note that for a single iteration, this layer will generate # different outputs on different number of parallel partitions but # on average it should not be partition dependent. - self.attention_dropout = torch.nn.Dropout(config.attention_dropout) + self.attention_dropout = torch.nn.Dropout(config.attention_dropout) if config.attention_dropout != 0 else None def forward(self, query_layer, key_layer, - value_layer, attention_mask): + value_layer, attention_mask, alibi=None): # =================================== # Raw attention scores. [b, np, s, s] @@ -290,17 +300,22 @@ def forward(self, query_layer, key_layer, key_layer = key_layer.view(output_size[3], output_size[0] * output_size[1], -1) - # preallocting input tensor: [b * np, sq, sk] - matmul_input_buffer = parallel_state.get_global_memory_buffer().get_tensor( - (output_size[0]*output_size[1], output_size[2], output_size[3]), - query_layer.dtype, "mpu") + beta = 0.0 + if alibi is None: + # preallocting input tensor: [b * np, sq, sk] + matmul_input_buffer = parallel_state.get_global_memory_buffer().get_tensor( + (output_size[0]*output_size[1], output_size[2], output_size[3]), + query_layer.dtype, "mpu") + else: + matmul_input_buffer = alibi[:output_size[0]*output_size[1], :, :output_size[3]] + beta = 1.0 # Raw attention scores. 
[b * np, sq, sk] matmul_result = torch.baddbmm( matmul_input_buffer, query_layer.transpose(0, 1), # [b * np, sq, hn] key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk] - beta=0.0, alpha=(1.0/self.norm_factor)) + beta=beta, alpha=(1.0/self.norm_factor)) # change view to [b, np, sq, sk] attention_scores = matmul_result.view(*output_size) @@ -313,13 +328,14 @@ def forward(self, query_layer, key_layer, attention_probs = self.scale_mask_softmax(attention_scores, attention_mask) - # This is actually dropping out entire tokens to attend to, which might - # seem a bit unusual, but is taken from the original Transformer paper. - if not self.sequence_parallel: - with tensor_parallel.get_cuda_rng_tracker().fork(): + if self.attention_dropout is not None: + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + if not self.sequence_parallel: + with tensor_parallel.get_cuda_rng_tracker().fork(): + attention_probs = self.attention_dropout(attention_probs) + else: attention_probs = self.attention_dropout(attention_probs) - else: - attention_probs = self.attention_dropout(attention_probs) # ========================= # Context layer. [sq, b, hp] @@ -490,6 +506,51 @@ def forward(self, q, k, v): output = rearrange(output, 'b s h d -> s b (h d)').contiguous() return output + +class HabanaFlashSelfAttention(MegatronModule): + + def __init__(self, config, attn_mask_type=AttnMaskType.padding): + super(HabanaFlashSelfAttention, self).__init__() + assert FusedSDPA is not None, "Failed to import FusedSDPA" + self.attn_mask_type = attn_mask_type + self.attention_dropout_p = config.attention_dropout + self.use_fused_sdpa = config.use_fused_sdpa + self.use_fused_sdpa_with_recompute = config.use_fused_sdpa_with_recompute + self.use_fast_softmax = "fast" if get_args().use_fast_softmax is True else "None" + + # Per attention head and per partition values. + seq_parallel_world_size = 1 + if parallel_state.sequence_parallel_is_initialized(): + seq_parallel_world_size = parallel_state.get_sequence_parallel_world_size() + world_size = seq_parallel_world_size if seq_parallel_world_size > 1 else parallel_state.get_tensor_model_parallel_world_size() + + projection_size = config.kv_channels * config.num_attention_heads + self.hidden_size_per_partition = core.utils.divide(projection_size, + world_size) + + def forward(self, query_layer, key_layer, + value_layer, attention_mask, alibi=None): + # [sq, b, np, hn] -> [b, np, sq, hn] + q, k, v = [x.transpose(0, 1).transpose(1, 2) for x in [query_layer, key_layer, value_layer]] + causal = True + scale = None + attn_mask = None + context_layer = FusedSDPA.apply( + q, k, v, attn_mask, self.attention_dropout_p, causal, scale, + self.use_fast_softmax, self.use_fused_sdpa_with_recompute + ) + + # [b, np, sq, hn] --> [sq, b, np, hn] + context_layer = context_layer.permute(2, 0, 1, 3).contiguous() + + # [sq, b, np, hn] --> [sq, b, hp] + new_context_layer_shape = context_layer.size()[:-2] + \ + (self.hidden_size_per_partition,) + context_layer = context_layer.view(*new_context_layer_shape) + + return context_layer + + class ParallelAttention(MegatronModule): """Parallel self-attention layer abstract class. 
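# --- Illustrative sketch (not part of the patch) ------------------------------
# Pure-torch rendition of the cached rotary-embedding application added in
# megatron/model/rotary_pos_embedding.py above: cos/sin are computed once from
# the frequency table and then applied as t*cos + rotate_half(t)*sin. Shapes
# follow the file's [seq, ..., dim] convention; the sizes below are made up.
import torch

def rotate_half(x):
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

dim, seq_len, theta = 8, 6, 10000
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
seq = torch.arange(seq_len).float()
freqs = torch.einsum('i,j->ij', seq, inv_freq)              # [seq, dim/2]
emb = torch.cat((freqs, freqs), dim=-1)[:, None, None, :]   # [seq, 1, 1, dim]
cos_cached, sin_cached = emb.cos(), emb.sin()               # cached, as in the patch

t = torch.randn(seq_len, 2, 3, dim)                         # [sq, b, np, hn]
t_rot = t * cos_cached + rotate_half(t) * sin_cached
# Rotary embedding is a per-pair rotation, so per-vector norms are preserved.
assert torch.allclose(t.norm(dim=-1), t_rot.norm(dim=-1), atol=1e-5)
# ------------------------------------------------------------------------------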
@@ -509,6 +570,8 @@ def __init__(self, config, layer_number, self.sequence_parallel = config.sequence_parallel self.num_attention_heads = config.num_attention_heads self.num_key_value_heads = config.num_key_value_heads + # TODO - Remove self.attention_dropout usage when SW-172239 is solved + self.attention_dropout = config.attention_dropout self.use_gqa = (self.num_attention_heads != self.num_key_value_heads) self.use_flash_attn = (args.use_flash_attn_v1 or args.use_flash_attn_triton or args.use_flash_attn_v2 or \ @@ -516,6 +579,7 @@ def __init__(self, config, layer_number, and attention_type == AttnType.self_attn \ and self.attn_mask_type == AttnMaskType.causal self.use_flash_attn_triton = args.use_flash_attn_triton + self.use_fused_sdpa = config.use_fused_sdpa if self.use_flash_attn: global flash_attn_builder try: @@ -590,6 +654,8 @@ def __init__(self, config, layer_number, local_attn = FlashSelfAttentionTriton(causal=True, attention_dropout=args.attention_dropout) elif self.use_flash_attn: local_attn = FlashSelfAttention(causal=True, attention_dropout=config.attention_dropout) + elif self.use_fused_sdpa and get_accelerator().device_name() == "hpu": + local_attn = HabanaFlashSelfAttention(config, self.attn_mask_type) else: local_attn = CoreAttention(self.layer_number, config, self.attn_mask_type) @@ -615,19 +681,27 @@ def __init__(self, config, layer_number, bias=args.add_bias_linear, input_is_parallel=True, skip_bias_add=True) + # Alibi + if args.use_alibi_position_embeddings: + alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size) + alibi = alibi.to(get_accelerator().current_device_name()) + self.alibi = alibi.to(args.params_dtype) + else: + self.alibi = None def _checkpointed_attention_forward(self, query_layer, key_layer, value_layer, attention_mask, - rotary_pos_emb=None): + rotary_pos_emb=None, alibi=None): """Forward method with activation checkpointing.""" def custom_forward(*inputs): query_layer = inputs[0] key_layer = inputs[1] value_layer = inputs[2] attention_mask = inputs[3] + alibi = inputs[6] output_ = self.core_attention(query_layer, key_layer, - value_layer, attention_mask) + value_layer, attention_mask, alibi) return output_ q_pos_emb, k_pos_emb = (None, None) if rotary_pos_emb is None \ @@ -636,7 +710,7 @@ def custom_forward(*inputs): hidden_states = tensor_parallel.checkpoint( custom_forward, False, query_layer, key_layer, value_layer, attention_mask, - q_pos_emb, k_pos_emb) + q_pos_emb, k_pos_emb, alibi) return hidden_states @@ -653,16 +727,22 @@ def repeat_kv(self, hidden_states, n_rep): slen, batch, num_key_value_heads_per_partition, head_dim = hidden_states.shape if n_rep == 1: return hidden_states - hidden_states = hidden_states[:, :, :, None, :].expand( - slen, batch, num_key_value_heads_per_partition, n_rep, head_dim) - return hidden_states.reshape(slen, batch, - num_key_value_heads_per_partition * n_rep, - head_dim) + elif num_key_value_heads_per_partition == 1: + # If no of KV heads is 1 then just perform expand operation + # instead of unsqueeze, expand and reshape to match query states. 
+ return hidden_states.expand(slen, batch, n_rep, head_dim) + else: + hidden_states = hidden_states[:, :, :, None, :].expand( + slen, batch, num_key_value_heads_per_partition, n_rep, head_dim) + return hidden_states.reshape(slen, batch, + num_key_value_heads_per_partition * n_rep, + head_dim) def split_tensor(self, mixed_x_layer): - query_layer = mixed_x_layer[:, :, :, :-2, :].reshape(mixed_x_layer.shape[:2] + (-1, self.hidden_size_per_attention_head)) - key_layer = mixed_x_layer[:, :, :, -2, :] - value_layer = mixed_x_layer[:, :, :, -1, :] + query_layer, key_layer, value_layer = torch.split(mixed_x_layer, [self.num_key_value_groups, 1, 1], dim=-2) + query_layer = query_layer.reshape(mixed_x_layer.shape[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)) + key_layer = torch.squeeze(key_layer, -2) + value_layer = torch.squeeze(value_layer, -2) return query_layer, key_layer, value_layer @@ -709,8 +789,9 @@ def forward(self, hidden_states, attention_mask, key_layer, value_layer) = self.split_tensor(mixed_x_layer) - # Repeat kv - if self.use_gqa: + # Repeat kv ; fused SPDA internally handles it + # TODO - Remove self.attention_dropout check when SW-172239 is solved + if self.use_gqa and (not self.use_fused_sdpa or self.attention_dropout != 0): key_layer = self.repeat_kv(key_layer, self.num_key_value_groups) value_layer = self.repeat_kv(value_layer, self.num_key_value_groups) @@ -831,10 +912,10 @@ def forward(self, hidden_states, attention_mask, else: if self.checkpoint_core_attention: context_layer = self._checkpointed_attention_forward( - query_layer, key_layer, value_layer, attention_mask) + query_layer, key_layer, value_layer, attention_mask, alibi=self.alibi) else: context_layer = self.core_attention( - query_layer, key_layer, value_layer, attention_mask) + query_layer, key_layer, value_layer, attention_mask, alibi=self.alibi) # ================= # Output. [sq, b, h] @@ -844,12 +925,45 @@ def forward(self, hidden_states, attention_mask, return output, bias + @staticmethod + def _build_alibi_tensor(max_seq_len, num_attention_heads, batch_size): + # Based on https://github.com/ofirpress/attention_with_linear_biases/blob/a35aaca144e0eb6b789dfcb46784c4b8e31b7983/fairseq/models/transformer.py#L742 + """Returns tensor shaped (batch_size * num_attention_heads, 1, max_seq_len)""" + + def get_slopes(n): + def get_slopes_power_of_2(n): + start = (2 ** (-2 ** -(math.log2(n) - 3))) + ratio = start + return [start * ratio ** i for i in range(n)] + + if math.log2(n).is_integer(): + return get_slopes_power_of_2(n) + else: + closest_power_of_2 = 2 ** math.floor(math.log2(n)) + return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][ + :n - closest_power_of_2] + + slopes = torch.Tensor(get_slopes(num_attention_heads)) + alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand( + num_attention_heads, -1, -1) + + #Select the part of the tensor that corresponds to our tensor parallel index. 
+ tp_world_size = parallel_state.get_tensor_model_parallel_world_size() + tp_index = parallel_state.get_tensor_model_parallel_rank() + alibi = alibi.reshape((tp_world_size, -1, *alibi.shape[1:]))[tp_index] + + alibi = alibi.repeat(batch_size, 1, 1) + return alibi + def bias_dropout_add(x, bias, residual, prob, training): # type: (Tensor, Optional[Tensor], Tensor, float, bool) -> Tensor if bias is not None: x = x + bias - out = torch.nn.functional.dropout(x, p=prob, training=training) + if prob == 0: + out = x + else: + out = torch.nn.functional.dropout(x, p=prob, training=training) out = residual + out return out @@ -902,7 +1016,7 @@ def __init__(self, config, # Layernorm on the input data. if args.normalization == 'layernorm': - if get_accelerator().device_name() == 'cuda': + if get_accelerator().device_name() in ['cuda', 'hpu']: self.input_layernorm = LayerNorm( config.hidden_size, eps=config.layernorm_epsilon, @@ -915,7 +1029,9 @@ def __init__(self, config, config.hidden_size, eps=config.layernorm_epsilon) else: - self.input_layernorm = RMSNorm(config.hidden_size, config.layernorm_epsilon) + self.input_layernorm = RMSNorm(config.hidden_size, + config.layernorm_epsilon, + sequence_parallel=config.sequence_parallel) # Self attention. self.self_attention = ParallelAttention( config, @@ -928,7 +1044,7 @@ def __init__(self, config, # Layernorm on the attention output if args.normalization == 'layernorm': - if get_accelerator().device_name() == 'cuda': + if get_accelerator().device_name() in ['cuda', 'hpu']: self.post_attention_layernorm = LayerNorm( config.hidden_size, eps=config.layernorm_epsilon, @@ -941,7 +1057,9 @@ def __init__(self, config, config.hidden_size, eps=config.layernorm_epsilon) else: - self.post_attention_layernorm = RMSNorm(config.hidden_size, config.layernorm_epsilon) + self.post_attention_layernorm = RMSNorm(config.hidden_size, + config.layernorm_epsilon, + sequence_parallel=config.sequence_parallel) # Cross attention. if self.layer_type in (LayerType.decoder, LayerType.retro_decoder, @@ -961,7 +1079,9 @@ def __init__(self, config, apply_layernorm_1p=args.apply_layernorm_1p, mem_efficient_ln=args.mem_efficient_ln) else: - self.post_inter_attention_layernorm = RMSNorm(config.hidden_size, config.layernorm_epsilon) + self.post_inter_attention_layernorm = RMSNorm(config.hidden_size, + config.layernorm_epsilon, + sequence_parallel=config.sequence_parallel) # MLP self.num_experts = num_experts @@ -972,6 +1092,7 @@ def __init__(self, config, self.mlp = ParallelMLP(config) else: # DeepSpeed's MoE enable_expert_tensor_parallelism = args.enable_expert_tensor_parallelism + configured_capacity_bins = self.get_configured_capacity_bins() self.mlp = MoE(args.hidden_size, ParallelMLP(config, moe=True, @@ -986,7 +1107,12 @@ def __init__(self, config, drop_tokens=args.moe_token_dropping, use_tutel=args.use_tutel, enable_expert_tensor_parallelism=enable_expert_tensor_parallelism, - top2_2nd_expert_sampling=args.moe_top2_2nd_expert_sampling) + top2_2nd_expert_sampling=args.moe_top2_2nd_expert_sampling, + sequence_parallel=args.sequence_parallel, + num_capacity_bins=args.moe_num_capacity_bins, + capacity_bins_exp_base=args.moe_capacity_bins_exp_base, + capacity_bins_alignment=args.moe_capacity_bins_alignment, + configured_capacity_bins=configured_capacity_bins) # Set bias+dropout+add fusion grad_enable execution handler. 
TORCH_MAJOR = int(torch.__version__.split('.')[0]) @@ -1004,8 +1130,7 @@ def __init__(self, config, # Retriever (bi-directional transformer with cross attention) if layer_type == LayerType.retro_decoder_with_retriever: self.retriever = ParallelTransformer( - init_method, - output_layer_init_method, + config=config, model_type=ModelType.retro_encoder, self_attn_mask_type=AttnMaskType.padding, pre_process=True, @@ -1015,6 +1140,21 @@ def __init__(self, config, else: self.retriever = None + def get_configured_capacity_bins(self): + args = get_args() + ret = None + if self.num_experts > 1 and args.moe_num_capacity_bins > 0 and args.moe_capacity_bins is not None: + for ne_plus_bins in args.moe_capacity_bins: + ne_plus_bins = [int(v) for v in ne_plus_bins] + ne = ne_plus_bins[0] + bins = ne_plus_bins[1:] + assert len(bins) == args.moe_num_capacity_bins, \ + f'Mismatch len of {bins=} for num-experts={self.num_experts}, expected={args.moe_num_capacity_bins}' + if self.num_experts == ne: + ret = bins + break + return ret + def default_decoder_cross_attention(self, encoder_output, enc_dec_attn_mask, @@ -1571,6 +1711,7 @@ def __init__(self, config, self.transformer_impl = args.transformer_impl self.retro_add_retriever = args.retro_add_retriever self.ds_inference = args.ds_inference + self.device_name = get_accelerator().device_name() # Store activation checkpoiting flag. self.checkpoint_activations = args.checkpoint_activations @@ -1583,9 +1724,10 @@ def __init__(self, config, self.sequence_parallel = config.sequence_parallel + self.use_fp8 = False # Transformer Engine Init. self.transformer_engine_rope_available = False - if self.transformer_impl == 'transformer_engine': + if self.transformer_impl == 'transformer_engine' and self.device_name == 'cuda': global transformer_engine import transformer_engine from importlib.metadata import version @@ -1597,23 +1739,23 @@ def __init__(self, config, del version, packaging - self.use_fp8 = args.fp8_e4m3 or args.fp8_hybrid - self.fp8_recipe = None - self.fp8_group = None - if self.use_fp8: - self.fp8_group = parallel_state.get_data_parallel_group() - if args.fp8_e4m3: - fp8_format = transformer_engine.common.recipe.Format.E4M3 - elif args.fp8_hybrid: - fp8_format = transformer_engine.common.recipe.Format.HYBRID - self.fp8_recipe = transformer_engine.common.recipe.DelayedScaling( - margin=args.fp8_margin, - interval=args.fp8_interval, - fp8_format=fp8_format, - amax_history_len=args.fp8_amax_history_len, - amax_compute_algo=args.fp8_amax_compute_algo, - override_linear_precision=(False, False, not args.fp8_wgrad), - ) + self.use_fp8 = args.fp8_e4m3 or args.fp8_hybrid + self.fp8_recipe = None + self.fp8_group = None + if self.use_fp8: + self.fp8_group = parallel_state.get_data_parallel_group() + if args.fp8_e4m3: + fp8_format = transformer_engine.common.recipe.Format.E4M3 + elif args.fp8_hybrid: + fp8_format = transformer_engine.common.recipe.Format.HYBRID + self.fp8_recipe = transformer_engine.common.recipe.DelayedScaling( + margin=args.fp8_margin, + interval=args.fp8_interval, + fp8_format=fp8_format, + amax_history_len=args.fp8_amax_history_len, + amax_compute_algo=args.fp8_amax_compute_algo, + override_linear_precision=(False, False, not args.fp8_wgrad), + ) self.num_microbatches_in_previous_step = -1 self.microbatch_count = 0 @@ -1642,7 +1784,7 @@ def __init__(self, config, assert args.transformer_impl == 'local', \ "Transformer engine does not support Retro layers." 
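# --- Illustrative sketch (not part of the patch) ------------------------------
# Stand-alone version of the ALiBi bias built by _build_alibi_tensor in
# megatron/model/transformer.py above: per-head slopes form a geometric
# sequence (2^(-8/n * k) for n heads), and the bias added to attention logits
# grows linearly with key position. Tensor-parallel slicing and the batch
# repetition from the patch are omitted here to keep the example runnable.
import math
import torch

def alibi_slopes(num_heads: int):
    def power_of_2(n):
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return [start * start ** i for i in range(n)]
    if math.log2(num_heads).is_integer():
        return power_of_2(num_heads)
    closest = 2 ** math.floor(math.log2(num_heads))
    return power_of_2(closest) + alibi_slopes(2 * closest)[0::2][:num_heads - closest]

num_heads, seq_len = 8, 16
slopes = torch.tensor(alibi_slopes(num_heads))
bias = slopes[:, None, None] * torch.arange(seq_len)[None, None, :]  # [heads, 1, seq]
assert bias.shape == (num_heads, 1, seq_len)
# With 8 heads the slopes are 2^-1, 2^-2, ..., 2^-8.
assert torch.allclose(slopes, torch.tensor([2.0 ** -(i + 1) for i in range(num_heads)]))
# ------------------------------------------------------------------------------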
def build_layer(layer_number, n_e): - if args.transformer_impl == 'local': + if args.transformer_impl == 'local' or self.device_name == 'hpu': current_layer_type = _get_layer_type( model_type, layer_type, self.retro_layer_numbers, layer_number) @@ -1751,7 +1893,7 @@ def build_layer(layer_number, n_e): if self.post_process and self.post_layer_norm: # Final layer norm before output. if args.normalization == 'layernorm': - if get_accelerator().device_name() == 'cuda': + if get_accelerator().device_name() in ['cuda', 'hpu']: self.final_layernorm = LayerNorm( config.hidden_size, eps=config.layernorm_epsilon, @@ -1764,7 +1906,9 @@ def build_layer(layer_number, n_e): config.hidden_size, eps=config.layernorm_epsilon) else: - self.final_layernorm = RMSNorm(config.hidden_size, config.layernorm_epsilon) + self.final_layernorm = RMSNorm(config.hidden_size, + config.layernorm_epsilon, + sequence_parallel=config.sequence_parallel) def _get_layer(self, layer_number): return self.layers[layer_number] @@ -2043,14 +2187,15 @@ class LMHeadPipe(MegatronModule): config: """ - def __init__(self, hidden_size, vocab_size, config): + def __init__(self, hidden_size, vocab_size, config, gather_output=False): args = get_args() super(LMHeadPipe, self).__init__() self.lm_head = tensor_parallel.ColumnParallelLinear(input_size=hidden_size, output_size=vocab_size, bias=False, config=config, - init_method=config.init_method,) + init_method=config.init_method, + gather_output=gather_output,) def forward(self, inputs, **kwargs): assert torch.is_tensor(inputs) or isinstance(inputs, tuple) diff --git a/megatron/model/utils.py b/megatron/model/utils.py index 6c94921c95..2d3a758587 100644 --- a/megatron/model/utils.py +++ b/megatron/model/utils.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
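# --- Illustrative sketch (not part of the patch) ------------------------------
# What the new `gather_output = not parallel_output` flag on LMHeadPipe (see the
# transformer.py hunk above) amounts to, shown with plain tensors instead of
# tensor_parallel.ColumnParallelLinear. With parallel_output=True the logits
# stay sharded over the vocabulary, one shard per tensor-parallel rank, for the
# parallel cross-entropy; otherwise the shards are gathered into full logits.
import torch

hidden, vocab, tp = 4, 10, 2
x = torch.randn(3, hidden)                                        # [tokens, hidden]
w_shards = [torch.randn(vocab // tp, hidden) for _ in range(tp)]  # vocab-sharded weights

sharded_logits = [x @ w.t() for w in w_shards]       # parallel_output=True: one piece per rank
gathered_logits = torch.cat(sharded_logits, dim=-1)  # gather_output=True: full [tokens, vocab]
assert sharded_logits[0].shape == (3, vocab // tp)
assert gathered_logits.shape == (3, vocab)
# ------------------------------------------------------------------------------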
"""Utilities for models.""" @@ -7,8 +8,15 @@ import torch from megatron import get_args +from megatron.core.parallel_state import ( + get_tensor_model_parallel_rank, + get_tensor_model_parallel_world_size, + get_tensor_model_parallel_group, +) +from megatron.model import RMSNorm, LayerNorm from deepspeed.runtime.zero import GatheredParameters +from deepspeed.accelerator import get_accelerator def init_method_normal(sigma): """Init method based on N(0, sigma).""" @@ -33,6 +41,13 @@ def gather_and_init(param, init_method): init_method(param) +def perform_masking(attention_scores, attention_mask): + if attention_mask.dtype == torch.bool: + attention_scores.masked_fill_(attention_mask, -10000.0) + else: + attention_scores.add_(attention_mask) + + def attention_mask_func(attention_scores, attention_mask): args = get_args() if args.curriculum_learning_legacy or args.data_efficiency_curriculum_learning: @@ -41,15 +56,17 @@ def attention_mask_func(attention_scores, attention_mask): if actual_seqlen != attention_mask_.size()[2]: # attention_mask has size [1, 1, seqlen, seqlen] attention_mask_ = attention_mask_[:, :, :actual_seqlen, :actual_seqlen].contiguous() - attention_scores.masked_fill_(attention_mask_, -10000.0) + perform_masking(attention_scores, attention_mask_) else: - attention_scores.masked_fill_(attention_mask, -10000.0) + perform_masking(attention_scores, attention_mask) return attention_scores def get_linear_layer(rows, columns, init_method, gather_params_on_init=False): """Simple linear layer with weight initialization.""" - layer = torch.nn.Linear(rows, columns) + layer = torch.nn.Linear(rows, columns, + device=get_accelerator().current_device_name(), + dtype=get_args().params_dtype) if get_args().perform_initialization: with GatheredParameters(layer.weight, modifier_rank=0, enabled=gather_params_on_init): init_method(layer.weight) @@ -70,3 +87,87 @@ def openai_gelu(x): @torch.jit.script def erf_gelu(x): return x * 0.5 * (torch.erf(x / 1.41421).to(dtype=x.dtype)+torch.ones_like(x).to(dtype=x.dtype)) + + +@torch.no_grad() +def gather_tensors(input_): + world_size = get_tensor_model_parallel_world_size() + rank = get_tensor_model_parallel_rank() + + tensor_list = [torch.empty_like(input_) for _ in range(world_size)] + tensor_list[rank] = input_.clone() + torch.distributed.all_gather(tensor_list, input_, group=get_tensor_model_parallel_group()) + + return tensor_list + + +@torch.no_grad() +def compare_tensors(inputs): + ok = all([x.eq(inputs[0]).all().item() for x in inputs[1:]]) + return ok + + +def check_equal(inp): + return compare_tensors(gather_tensors(inp)) + + +def assert_equal(inp, msg=""): + assert check_equal(inp), msg + + +def get_check_tp_norm(): + args = get_args() + check_tp_norm = args.curr_iteration >= args.start_check_tp_norm_iter + check_tp_norm &= args.curr_iteration <= args.end_check_tp_norm_iter + check_tp_norm &= args.check_tp_norm + return check_tp_norm + + +def tp_norm_module_hook(mod, inp, out, fwd=None, layer_name=""): + if get_check_tp_norm(): + args = get_args() + if not isinstance(inp, tuple): + inputs = [inp] + else: + inputs = inp + if not isinstance(out, tuple): + outputs = [out] + else: + outputs = out + + def get_message(message): + msg = f"error in {message}, fwd={fwd}" + if not layer_name: + return msg + return msg + f", layer_name = {layer_name}" + + if args.check_tp_norm_type in ["all", "wb"]: + # compare weight and weight grad + assert_equal(mod.weight, get_message("mod.weight")) + assert_equal(mod.weight.grad, get_message("mod.weight.grad")) + # 
compare bias and bias grad if present + if hasattr(mod, "bias"): + assert_equal(mod.bias, get_message("mod.bias")) + assert_equal(mod.bias.grad, get_message("mod.bias.grad")) + + if args.check_tp_norm_type in ["all", "io"]: + # compare inputs + for i, in_ in enumerate(inputs): + assert_equal(in_, get_message(f"in_ {i}")) + # compare outputs + for i, out_ in enumerate(outputs): + assert_equal(out_, get_message(f"out_ {i}")) + + +def layer_name_tp_norm_module_hook(fwd=None, layer_name=""): + def hook(mod, inp, out): + tp_norm_module_hook(mod, inp, out, fwd, layer_name) + return hook + + +def add_tp_norm_hooks(model, args): + if args.check_tp_norm: + for param_name, mod in model.named_modules(): + if isinstance(mod, (RMSNorm, LayerNorm)): + mod.register_forward_hook(layer_name_tp_norm_module_hook(True, param_name)) + mod.register_full_backward_hook(layer_name_tp_norm_module_hook(False, param_name)) diff --git a/megatron/mpu/tests/commons.py b/megatron/mpu/tests/commons.py index 0fdb5ab0fc..bc611b66f4 100644 --- a/megatron/mpu/tests/commons.py +++ b/megatron/mpu/tests/commons.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. import argparse @@ -6,7 +7,7 @@ import numpy import torch -import mpu +from megatron.core import mpu from deepspeed.accelerator import get_accelerator class IdentityLayer(torch.nn.Module): @@ -54,7 +55,7 @@ def initialize_distributed(backend='nccl'): master_port = os.getenv('MASTER_PORT', '6000') init_method += master_ip + ':' + master_port torch.distributed.init_process_group( - backend=backend, + backend=get_accelerator().communication_backend_name(), world_size=world_size, rank=rank, init_method=init_method) diff --git a/megatron/mpu/tests/test_cross_entropy.py b/megatron/mpu/tests/test_cross_entropy.py index 7f161348ce..45bfb3d138 100644 --- a/megatron/mpu/tests/test_cross_entropy.py +++ b/megatron/mpu/tests/test_cross_entropy.py @@ -1,11 +1,12 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. -from commons import set_random_seed -from commons import IdentityLayer -from commons import print_separator -from commons import initialize_distributed -from mpu.cross_entropy import vocab_parallel_cross_entropy -import mpu +from .commons import set_random_seed +from .commons import IdentityLayer +from .commons import print_separator +from .commons import initialize_distributed +from megatron.core.tensor_parallel.cross_entropy import vocab_parallel_cross_entropy +from megatron.core import mpu import torch.nn.functional as F import torch import random diff --git a/megatron/mpu/tests/test_data.py b/megatron/mpu/tests/test_data.py index 1e95447099..087260be1e 100644 --- a/megatron/mpu/tests/test_data.py +++ b/megatron/mpu/tests/test_data.py @@ -1,10 +1,11 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
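# --- Illustrative sketch (not part of the patch) ------------------------------
# Behaviour of perform_masking() added in megatron/model/utils.py above: a bool
# mask keeps the legacy masked_fill_ path, while a floating-point mask (e.g. an
# additive ALiBi-style bias) is simply added to the attention scores. The
# -10000.0 fill value comes from the patch; the tensors below are example data.
import torch

def perform_masking(attention_scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    if attention_mask.dtype == torch.bool:
        attention_scores.masked_fill_(attention_mask, -10000.0)
    else:
        attention_scores.add_(attention_mask)
    return attention_scores

scores = torch.zeros(1, 1, 4, 4)
causal_bool = torch.triu(torch.ones(4, 4), diagonal=1).bool()[None, None]
assert perform_masking(scores.clone(), causal_bool)[0, 0, 0, 3] == -10000.0

additive = torch.full((1, 1, 4, 4), -2.5)
assert perform_masking(scores.clone(), additive)[0, 0, 0, 0] == -2.5
# ------------------------------------------------------------------------------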
-from commons import print_separator -from commons import initialize_distributed +from .commons import print_separator +from .commons import initialize_distributed from deepspeed.accelerator import get_accelerator -from mpu import data as data_utils -import mpu +from megatron.core import mpu +from megatron.core.tensor_parallel import data as data_utils import torch import functools import operator diff --git a/megatron/mpu/tests/test_initialize.py b/megatron/mpu/tests/test_initialize.py index e5d2be37e2..bf8e943e74 100644 --- a/megatron/mpu/tests/test_initialize.py +++ b/megatron/mpu/tests/test_initialize.py @@ -1,8 +1,9 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. -from commons import print_separator -from commons import initialize_distributed -import mpu +from .commons import print_separator +from .commons import initialize_distributed +from megatron.core import mpu import torch import sys sys.path.append("../..") diff --git a/megatron/mpu/tests/test_layers.py b/megatron/mpu/tests/test_layers.py index f524f94441..3c4fe25d75 100644 --- a/megatron/mpu/tests/test_layers.py +++ b/megatron/mpu/tests/test_layers.py @@ -1,14 +1,16 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. -from mpu import layers -from commons import set_random_seed -from commons import print_separator -from commons import initialize_distributed -import mpu +from megatron.core.tensor_parallel import layers +from .commons import set_random_seed +from .commons import print_separator +from .commons import initialize_distributed +from megatron.core import mpu from torch.nn.parameter import Parameter import torch.nn.init as init import torch import random +from deepspeed.accelerator import get_accelerator import sys sys.path.append("../..") diff --git a/megatron/mpu/tests/test_random.py b/megatron/mpu/tests/test_random.py index 21dcd10907..eb8d3c11c8 100644 --- a/megatron/mpu/tests/test_random.py +++ b/megatron/mpu/tests/test_random.py @@ -1,9 +1,11 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. -from commons import print_separator -from commons import initialize_distributed -import mpu +from .commons import print_separator +from .commons import initialize_distributed +from megatron.core import mpu import torch +from deepspeed.accelerator import get_accelerator import sys sys.path.append("../..") diff --git a/megatron/optimizer/__init__.py b/megatron/optimizer/__init__.py index 12a458375d..94a985a733 100644 --- a/megatron/optimizer/__init__.py +++ b/megatron/optimizer/__init__.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. from deepspeed.accelerator import get_accelerator @@ -8,6 +9,9 @@ from torch.optim import Adam from torch.optim import SGD +import torch +from torch.optim import AdamW + from megatron import get_args from .distrib_optimizer import DistributedOptimizer @@ -24,6 +28,7 @@ def get_param_groups(modules, scale_lr_cond is used during finetuning where head of the network requires a scaled version of the base learning rate. 
""" + args = get_args() wd_no_scale_lr = [] wd_scale_lr = [] no_wd_no_scale_lr = [] @@ -33,9 +38,10 @@ def get_param_groups(modules, if not param.requires_grad: continue + no_wd = None if no_weight_decay_cond is not None: no_wd = no_weight_decay_cond(name, param) - else: + elif not args.do_norm_bias_weight_decay: # do not regularize biases nor Norm parameters no_wd = name.endswith(".bias") or len(param.shape) == 1 @@ -108,6 +114,20 @@ def get_megatron_optimizer(model, lr=args.lr, weight_decay=args.weight_decay, momentum=args.sgd_momentum) + elif args.optimizer == 'adamw': + optimizer = AdamW(param_groups, + lr=args.lr, + weight_decay=args.weight_decay, + betas=(args.adam_beta1, args.adam_beta2), + eps=args.adam_eps) + elif args.optimizer == 'fusedadamw': + assert get_accelerator().device_name() == "hpu", "FusedAdamW optimizer is supported only when using HPU" + from habana_frameworks.torch.hpex.optimizers import FusedAdamW + optimizer = FusedAdamW(param_groups, + lr=args.lr, + weight_decay=args.weight_decay, + betas=(args.adam_beta1, args.adam_beta2), + eps=args.adam_eps) else: raise Exception('{} optimizer is not supported.'.format( args.optimizer)) diff --git a/megatron/profiler.py b/megatron/profiler.py new file mode 100644 index 0000000000..7fddfad7aa --- /dev/null +++ b/megatron/profiler.py @@ -0,0 +1,87 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch +import sys +import os + +on_step_begin = [] +on_step_end = [] + +def trigger(phase): + [f() for f in phase] + +def setup_profiler(args, device): + if args.profile is None: + return + + start_step, end_step = map(int, args.profile_steps.split(',')) + active_steps = end_step - start_step + 1 + cur_step = 0 + + def on_step_begin_fn(): + nonlocal cur_step + cur_step = cur_step + 1 + on_step_begin.append(on_step_begin_fn) + + def when(cond, clbk): + def fn(): + if cond(): + clbk() + return fn + + def is_start_step(): + return cur_step == start_step + + def is_end_step(): + return cur_step == end_step + + def is_capture_step(): + return cur_step >= start_step and cur_step <= end_step + + if args.profile.startswith('pt'): + schedule = torch.profiler.schedule(wait=0, warmup=0, active=active_steps, repeat=1) + activities = [torch.profiler.ProfilerActivity.CPU] + activities.extend([torch.profiler.ProfilerActivity.HPU] if device.startswith("hpu") else []) + activities.extend([torch.profiler.ProfilerActivity.CUDA] if device.startswith("cuda") else []) + full = args.profile == 'pt-full' + + profiler = torch.profiler.profile( + schedule=schedule, + activities=activities, + on_trace_ready=torch.profiler.tensorboard_trace_handler(args.tensorboard_dir, use_gzip=True), + with_stack=full) + + on_step_begin.append(when(is_start_step, profiler.start)) + on_step_end.append(when(is_capture_step, profiler.step)) + on_step_end.append(when(is_end_step, profiler.stop)) + + elif args.profile == 'hltv': + sys.path.append(os.environ['PYTORCH_MODULES_ROOT_PATH']) + from topologies.tools import SynapseProfilerApi, TraceType + api = SynapseProfilerApi() + + def on_start_step(): + nonlocal api + api.profiler_start(TraceType.TraceAll, 0) + + def on_end_step(): + nonlocal api + import habana_frameworks.torch.hpu as hpu + hpu.synchronize() + api.profiler_stop(TraceType.TraceAll, 0) + api.profiler_get_trace_json(TraceType.TraceAll, 0) + + on_step_begin.append(when(is_start_step, on_start_step)) + on_step_end.append(when(is_end_step, on_end_step)) diff --git a/megatron/tokenizer/tokenizer.py b/megatron/tokenizer/tokenizer.py index 43c251bab1..bb9eae6135 100644 --- a/megatron/tokenizer/tokenizer.py +++ b/megatron/tokenizer/tokenizer.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Megatron tokenizers.""" @@ -34,18 +35,21 @@ def build_tokenizer(args): tokenizer = _SentencePieceTokenizer(args.tokenizer_model, vocab_extra_ids=args.vocab_extra_ids) elif args.tokenizer_type == 'GPTSentencePieceTokenizer': assert args.tokenizer_model is not None - tokenizer = _GPTSentencePieceTokenizer(args.tokenizer_model) + tokenizer = _GPTSentencePieceTokenizer(args.tokenizer_model, args.eval_add_bos) elif args.tokenizer_type == 'NullTokenizer': assert args.vocab_size is not None tokenizer = _NullTokenizer(args.vocab_size) elif args.tokenizer_type == 'HFTokenizer': assert args.tokenizer_model is not None - tokenizer = _HFTokenizer(args.tokenizer_model,args.seq_length) + tokenizer = _HFTokenizer(args.tokenizer_model,args.seq_length, + args.trust_remote_code) else: raise NotImplementedError('{} tokenizer is not ' 'implemented.'.format(args.tokenizer_type)) # Add vocab size. 
+ if args.vocab_size is None: + args.vocab_size = tokenizer.vocab_size args.padded_vocab_size = _vocab_size_with_padding(tokenizer.vocab_size, args) @@ -294,12 +298,12 @@ def eod(self): class _SentencePieceTokenizer(AbstractTokenizer): """SentencePieceTokenizer-Megatron wrapper""" - def __init__(self, model_file, vocab_extra_ids=0): + def __init__(self, model_file, add_bos=False, vocab_extra_ids=0): name = 'SentencePieceTokenizer' super().__init__(name) import sentencepiece - self.tokenizer = sentencepiece.SentencePieceProcessor(model_file=model_file) + self.tokenizer = sentencepiece.SentencePieceProcessor(model_file=model_file, add_bos=add_bos) self._initalize(vocab_extra_ids) def _populate_vocab(self): @@ -468,8 +472,8 @@ def additional_special_tokens_ids(self): class _GPTSentencePieceTokenizer(_SentencePieceTokenizer): """SentencePieceTokenizer-Megatron wrapper""" - def __init__(self, model_file,): - super().__init__(model_file, vocab_extra_ids=0) + def __init__(self, model_file, add_bos=False): + super().__init__(model_file, add_bos=add_bos, vocab_extra_ids=0) def _initalize(self, vocab_extra_ids): self._populate_vocab() @@ -540,10 +544,13 @@ def additional_special_tokens_ids(self): class _HFTokenizer(AbstractTokenizer): """HF Tokenizer""" - def __init__(self, tokenizer_name_or_path,max_seq_len): + def __init__(self, tokenizer_name_or_path,max_seq_len, trust_remote_code): name = tokenizer_name_or_path super().__init__(name) - self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path,padding_side="right",use_fast=False) + self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, + padding_side="right", + trust_remote_code=trust_remote_code, + use_fast=False) DEFAULT_PAD_TOKEN = "[PAD]" DEFAULT_EOS_TOKEN = "" diff --git a/megatron/training.py b/megatron/training.py index 19b8a6c71f..7a05a06e6f 100644 --- a/megatron/training.py +++ b/megatron/training.py @@ -1,7 +1,9 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
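# --- Illustrative sketch (not part of the patch) ------------------------------
# How the step-window profiling added in megatron/profiler.py above is driven by
# the training loop in megatron/training.py below: trigger(on_step_begin) and
# trigger(on_step_end) fire around each iteration, and for --profile=pt a
# torch.profiler session is started at the first step of the window, stepped on
# capture steps, and stopped at the last one. The step counts, the log directory,
# and the dummy "training step" below are placeholders.
import torch

start_step, end_step = 3, 5                       # e.g. --profile-steps "3,5"
active_steps = end_step - start_step + 1
prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=0, active=active_steps, repeat=1),
    activities=[torch.profiler.ProfilerActivity.CPU],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./tb_logs', use_gzip=True))

for step in range(1, 8):                          # stand-in for the training loop
    if step == start_step:
        prof.start()
    torch.randn(64, 64) @ torch.randn(64, 64)     # stand-in for one training step
    if start_step <= step <= end_step:
        prof.step()
    if step == end_step:
        prof.stop()
# ------------------------------------------------------------------------------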
"""Pretrain utilities.""" +from contextlib import nullcontext from datetime import datetime import math import sys @@ -18,7 +20,7 @@ from megatron import get_timers from megatron import get_tensorboard_writer from megatron import get_current_global_batch_size -from megatron import get_num_microbatches +from megatron import get_num_microbatches, get_num_eval_microbatches from megatron import is_last_rank from megatron import update_num_microbatches from megatron.core import mpu, tensor_parallel @@ -36,13 +38,16 @@ from megatron.optimizer_param_scheduler import OptimizerParamScheduler from megatron.model import DistributedDataParallel as LocalDDP from megatron.utils import check_adlr_autoresume_termination -from megatron.utils import unwrap_model +from megatron.utils import unwrap_model, found_kill_switch from megatron.data.data_samplers import build_pretraining_data_loader from megatron.utils import calc_params_l2_norm from megatron.core.pipeline_parallel import get_forward_backward_func -from megatron.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator, update_rotary_pos_emb +from megatron.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator, update_rotary_pos_emb, get_fp8_recipe +from megatron.core.tensor_parallel.data import reset_cached_broadcast_sizes +from megatron.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator from megatron.model.vision.knn_monitor import compute_feature_bank from megatron.arguments import core_transformer_config_from_args +from megatron.profiler import setup_profiler, trigger, on_step_begin, on_step_end import deepspeed from deepspeed.accelerator import get_accelerator @@ -52,11 +57,23 @@ from deepspeed import comm as dist +try: + from deepspeed.tools.tensor_logger import TensorLogger, save_logged_tensors +except: + TensorLogger = None + save_logged_tensors = None + try: import wandb except (ImportError, ModuleNotFoundError): wandb = None +try: + from habana_frameworks.torch.hpex.experimental.transformer_engine import fp8_autocast + from habana_frameworks.torch.hpex.experimental.transformer_engine.distributed import activation_checkpointing +except (ImportError, ModuleNotFoundError): + fp8_autocast = None + activation_checkpointing = None def print_datetime(string): """Note that this call will sync across all ranks.""" @@ -126,6 +143,14 @@ def pretrain(train_valid_test_dataset_provider, # Initalize and get arguments, timers, and Tensorboard writer. initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=args_defaults, external_args=external_args) + + args = get_args() + + if found_kill_switch(): + print_datetime(f"Detected kill switch at {args.kill_switch_path}. Exiting") + torch.distributed.barrier() + sys.exit() + # Set pytorch JIT layer fusion options and warmup JIT functions. 
if get_accelerator().device_name() == 'cuda': set_jit_fusion_options() @@ -142,7 +167,6 @@ def pretrain(train_valid_test_dataset_provider, time.time() - _TRAIN_START_TIME)) print_datetime('after megatron is initialized') - args = get_args() timers = get_timers() if args.deepspeed: @@ -240,21 +264,23 @@ def pretrain(train_valid_test_dataset_provider, print_rank_0('skipping training (--skip-train is on) ...') iteration = args.iteration + if args.save and (iteration != 0 or args.universal_checkpoint): + save_checkpoint(iteration, model, optimizer, opt_param_scheduler) config = core_transformer_config_from_args(args) if args.do_valid: prefix = f'iteration {iteration} on {args.eval_iters * args.global_batch_size}-sample draw from validation set' - evaluate_and_print_results(prefix, forward_step_func, - valid_data_iterator, model, - iteration, process_non_loss_data_func, config, - verbose=True, write_to_tensorboard=not args.skip_train) + _ = evaluate_and_print_results(prefix, forward_step_func, + valid_data_iterator, model, + iteration, process_non_loss_data_func, config, + verbose=True, write_to_tensorboard=not args.skip_train) if args.do_test: prefix = f'iteration {iteration} on {args.eval_iters * args.global_batch_size}-sample draw from test set' - evaluate_and_print_results(prefix, forward_step_func, - test_data_iterator, model, - iteration, process_non_loss_data_func, config, - verbose=True, write_to_tensorboard=not args.skip_train, test=True) + _ = evaluate_and_print_results(prefix, forward_step_func, + test_data_iterator, model, + iteration, process_non_loss_data_func, config, + verbose=True, write_to_tensorboard=not args.skip_train, test=True) return model @@ -374,7 +400,7 @@ def get_model(model_provider_func, model_type=ModelType.encoder_or_decoder, wrap # Disallow training and inference with Transformer Engine # for non-GPT models - args.allow_transformer_engine = all([type(m) == GPTModel for m in model]) + args.allow_transformer_engine = all([type(m).__name__ in ['GPTModelPipe', 'GPTModel'] for m in model]) assert args.allow_transformer_engine or args.transformer_impl == 'local', \ 'Transformer Engine is only approved for GPT models' @@ -614,6 +640,8 @@ def setup_model_and_optimizer(model_provider_func, mpu=mpu if args.no_pipeline_parallel else None, config=args.deepspeed_config_dict, ) + if args.use_torch_compile: + model.compile() if isinstance(model, deepspeed.PipelineEngine): # hack to get batch_fn from pretrain_gpt.py model.set_batch_fn(model.module._megatron_batch_fn) @@ -662,7 +690,6 @@ def train_step(forward_step_func, data_iterator, """Single training step.""" args = get_args() timers = get_timers() - if args.deepspeed and args.ds_pipeline_enabled: skipped_iter = 0 num_zeros_in_grad = 0 @@ -1119,6 +1146,14 @@ def training_log(loss_dict, total_loss_dict, learning_rate, iteration, log_string += ' curriculum seqlen: {:5d} |'.format(args.curriculum_seqlen) if args.random_ltd: log_string += ' random ltd reserved length: {:5d} |'.format(args.random_ltd_reserved_length) + if args.deepspeed and model[0].has_moe_layers and hasattr(model[0].gate_modules[0], 'get_stats'): + # to reduce clutter, log stats only first and last gates + gate_indexes = [0] if len(model[0].gate_modules) == 1 else [0, len(model[0].gate_modules)-1] + for i in gate_indexes: + stats = model[0].gate_modules[i].get_stats() + if stats is not None and 'capacity_bins' in stats: + log_string += f' moe_{i} stats: {stats["capacity_bins"]["summary"]} |' + log_string += ' actual seqlen: {:5d} |'.format(seq_len) log_string 
+= ' number of skipped iterations: {:3d} |'.format( total_loss_dict[skipped_iters_key]) @@ -1161,6 +1196,8 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, # Write args to tensorboard write_args_to_tensorboard() + setup_profiler(args, get_accelerator().device_name()) + if args.random_ltd: # random-ltd requires different randomness on each rank import random @@ -1185,12 +1222,23 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, timers('interval-time', log_level=0).start(barrier=True) print_datetime('before the start of training step') report_memory_flag = True + tensor_logger = None + if args.tensor_logger_end_iter > 0: + tensor_logger = TensorLogger(model[0].module, + log_activations_enabled=args.log_fwd_activations, + start_iteration=args.tensor_logger_start_iter, + end_iteration=args.tensor_logger_end_iter, + log_grads_enabled=args.log_bwd_grads, + log_inputs_enabled=args.log_model_inputs, + prefix=None) if args.random_ltd: assert model[0].random_ltd_enabled() args.random_ltd_layer_num = model[0].random_ltd_scheduler.get_random_ltd_layer_num() - + + hpu_transformer_engine = get_accelerator().device_name() == 'hpu' and get_args().transformer_impl == "transformer_engine" while iteration < args.train_iters and (args.train_tokens is None or \ args.consumed_train_tokens < args.train_tokens): + trigger(on_step_begin) update_num_microbatches(args.consumed_train_samples) if args.deepspeed: # inform deepspeed of any batch size changes @@ -1207,14 +1255,21 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, update_rotary_pos_emb(curriculum_seqlen) args.curriculum_seqlen = curriculum_seqlen args.curr_iteration = iteration - loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \ - train_step(forward_step_func, - train_data_iterator, - model, - optimizer, - opt_param_scheduler, - config) + + with fp8_autocast(enabled=True, fp8_recipe=get_fp8_recipe(args), fp8_group=mpu.get_data_parallel_group()) \ + if hpu_transformer_engine else nullcontext(): + with activation_checkpointing() if args.recompute_granularity == 'full' else nullcontext(): + with tensor_logger.log_iteration(iteration+1) if tensor_logger else nullcontext(): + loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \ + train_step(forward_step_func, + train_data_iterator, + model, + optimizer, + opt_param_scheduler, + config) iteration += 1 + if args.tensor_logger_end_iter > 0: + save_logged_tensors(tensor_logger, args.tensor_logger_path, args.rank, iteration=iteration) args.iteration = iteration new_samples = mpu.get_data_parallel_world_size() * \ args.micro_batch_size * \ @@ -1257,6 +1312,11 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, grad_norm, params_norm, num_zeros_in_grad, model, optimizer) + if args.deepspeed and model[0].has_moe_layers and hasattr(model[0], 'optimize_moe') \ + and args.moe_capacity_bins_optimize_interval > 0 \ + and iteration > 0 and iteration % args.moe_capacity_bins_optimize_interval == 0: + model[0].optimize_moe(step=iteration, max_grouped_experts=args.moe_capacity_bins_optimize_max_group) + # Autoresume if args.adlr_autoresume and \ (iteration % args.adlr_autoresume_interval == 0): @@ -1267,10 +1327,20 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, if args.eval_interval and iteration % args.eval_interval == 0 and \ args.do_valid: prefix = 'iteration {}'.format(iteration) - evaluate_and_print_results(prefix, forward_step_func, - valid_data_iterator, model, - iteration, process_non_loss_data_func, - 
config, False) + eval_loss = evaluate_and_print_results(prefix, forward_step_func, + valid_data_iterator, model, + iteration, + process_non_loss_data_func, + config, False) + # Exiting based on eval loss + if args.eval_loss_exit_value is not None and eval_loss <= args.eval_loss_exit_value: + if args.save: + save_checkpoint_and_time(iteration, model, optimizer, + opt_param_scheduler) + torch.distributed.barrier() + print_datetime(f"Reached target loss value: {args.eval_loss_exit_value}. " + f"Stopping the training at iteration: {iteration} with loss: {eval_loss}") + sys.exit() # Checkpointing saved_checkpoint = False @@ -1311,7 +1381,17 @@ def train(forward_step_func, model, optimizer, opt_param_scheduler, torch.distributed.barrier() print_datetime('exiting program at iteration {}'.format(iteration)) sys.exit() + trigger(on_step_end) + # Exiting based on kill-switch + if found_kill_switch(): + if not saved_checkpoint: + save_checkpoint_and_time(iteration, model, optimizer, + opt_param_scheduler) + print_datetime(f"Detected kill switch at {args.kill_switch_path}, " + f"iteration={iteration}. Exiting") + torch.distributed.barrier() + sys.exit() return iteration @@ -1344,12 +1424,29 @@ def evaluate(forward_step_func, update_rotary_pos_emb(args.curriculum_seqlen) model[0].reset_activation_shape() + if args.eval_micro_batch_size != args.micro_batch_size: + reset_cached_broadcast_sizes() + model[0].reset_activation_shape() + total_loss_dict = {} with torch.no_grad(): iteration = 0 - while iteration < args.eval_iters: + total_iterations = args.eval_iters + if args.eval_iters == -1: + print_rank_0(F"Evaluation on the entire set as eval-iters is set to {args.eval_iters}") + samples_per_iteration = mpu.get_data_parallel_world_size() \ + * args.eval_micro_batch_size \ + * get_num_eval_microbatches() + total_iterations = math.ceil(args.eval_total_samples / samples_per_iteration) + print_rank_0(F"Evaluation Iterations: {total_iterations}, Total Eval Samples: {args.eval_total_samples}, samples per iteration: {samples_per_iteration}") + args.consumed_valid_samples = 0 + num_eval_microbatches = get_num_eval_microbatches() + while iteration < total_iterations: iteration += 1 + if iteration == total_iterations and args.eval_iters == -1: + num_eval_microbatches = math.ceil((args.eval_total_samples - args.consumed_valid_samples) / \ + (mpu.get_data_parallel_world_size() * args.eval_micro_batch_size)) if verbose and iteration % args.log_interval == 0: print_rank_0('Evaluating iter {}/{}'.format(iteration, args.eval_iters)) @@ -1360,9 +1457,12 @@ def evaluate(forward_step_func, if args.deepspeed and args.ds_pipeline_enabled: # DeepSpeed uses eval_batch() and already aggregates losses. 
assert isinstance(model, list) and len(model) == 1 - loss = model[0].eval_batch(data_iterator) - loss_dicts = [{'lm loss' : loss}] * get_num_microbatches() + loss = model[0].eval_batch(data_iterator, num_micro_batches=num_eval_microbatches) + loss_dicts = [{'lm loss' : loss}] * num_eval_microbatches else: + assert args.micro_batch_size == args.eval_micro_batch_size, \ + "evaluate (training) - Megatron's forward_backward_func options - " \ + "Unsupported for split micro batch size" loss_dicts = forward_backward_func( forward_step_func=forward_step_func, data_iterator=data_iterator, @@ -1387,8 +1487,8 @@ def evaluate(forward_step_func, key, get_accelerator().FloatTensor([0.0])) + loss_dict[key] args.consumed_valid_samples += mpu.get_data_parallel_world_size() \ - * args.micro_batch_size \ - * get_num_microbatches() + * args.eval_micro_batch_size \ + * num_eval_microbatches collected_non_loss_data = None if process_non_loss_data_func is not None and is_last_rank(): collected_non_loss_data = forward_backward_func( @@ -1407,7 +1507,7 @@ def evaluate(forward_step_func, model_module.train() for key in total_loss_dict: - total_loss_dict[key] /= args.eval_iters * get_num_microbatches() + total_loss_dict[key] /= (((total_iterations-1) * get_num_eval_microbatches()) + num_eval_microbatches) if args.curriculum_learning_legacy and not args.no_pipeline_parallel: # roll back to actual curriculum seqlen at the end of eval. @@ -1418,6 +1518,9 @@ def evaluate(forward_step_func, update_rotary_pos_emb(args.curriculum_seqlen) model[0].reset_activation_shape() + if args.eval_micro_batch_size != args.micro_batch_size: + reset_cached_broadcast_sizes() + model[0].reset_activation_shape() return total_loss_dict, collected_non_loss_data def evaluate_and_print_results(prefix, forward_step_func, @@ -1435,20 +1538,22 @@ def evaluate_and_print_results(prefix, forward_step_func, forward_step_func, data_iterator, model, process_non_loss_data_func, config, verbose) string = ' validation loss at {} | '.format(prefix) + eval_loss = 0 for key in total_loss_dict: - string += '{} value: {:.6E} | '.format(key, total_loss_dict[key].item()) - ppl = math.exp(min(20, total_loss_dict[key].item())) + eval_loss = total_loss_dict[key].item() + string += '{} value: {:.6E} | '.format(key, eval_loss) + ppl = math.exp(min(20, eval_loss)) string += '{} PPL: {:.6E} | '.format(key, ppl) if writer and is_last_rank(): data_type = 'test' if test else 'validation' writer.add_scalar(f'lm-loss-validation/{key} {data_type}', - total_loss_dict[key].item(), + eval_loss, iteration) writer.add_scalar(f'lm-loss-validation/{key} {data_type} vs samples', - total_loss_dict[key].item(), + eval_loss, args.consumed_train_samples) writer.add_scalar(f'lm-loss-validation/{key} {data_type} vs tokens', - total_loss_dict[key].item(), + eval_loss, args.consumed_train_tokens) if args.log_validation_ppl_to_tensorboard: writer.add_scalar(f'lm-loss-validation/{key} {data_type} ppl', ppl, @@ -1466,6 +1571,13 @@ def evaluate_and_print_results(prefix, forward_step_func, print_rank_last(string) print_rank_last('-' * length) + if args.eval_loss_exit_value is not None: + eval_loss_tensor = get_accelerator().FloatTensor([eval_loss]) + torch.distributed.all_reduce(eval_loss_tensor, op=torch.distributed.ReduceOp.MAX) + eval_loss = eval_loss_tensor.item() + + return eval_loss + def cyclic_iter(iter): while True: @@ -1486,9 +1598,13 @@ def build_train_valid_test_datasets(build_train_valid_test_datasets_provider): eval_iters = (args.train_iters // args.eval_interval + 1) * \ 
args.eval_iters test_iters = args.eval_iters - train_val_test_num_samples = [train_samples, - eval_iters * args.global_batch_size, - test_iters * args.global_batch_size] + if args.eval_iters == -1: + print_rank_0("Evaluation iterations are set to -1") + train_val_test_num_samples = [train_samples, -1, -1] + else: + train_val_test_num_samples = [train_samples, + eval_iters * args.global_batch_size, + test_iters * args.global_batch_size] print_rank_0(' > datasets target sizes (minimum size):') print_rank_0(' train: {}'.format(train_val_test_num_samples[0])) print_rank_0(' validation: {}'.format(train_val_test_num_samples[1])) @@ -1526,22 +1642,31 @@ def build_train_valid_test_data_loaders( train_ds, valid_ds, test_ds = build_train_valid_test_datasets( build_train_valid_test_datasets_provider) + if args.eval_iters == -1: + eval_total_samples = len(valid_ds) + consumed_valid_samples = 0 + use_all_eval_samples = True + else: + eval_total_samples = 0 + consumed_valid_samples = args.consumed_valid_samples + use_all_eval_samples = False + # Build dataloders. train_dataloader = build_pretraining_data_loader( - train_ds, args.consumed_train_samples) + train_ds, args.consumed_train_samples, True) valid_dataloader = build_pretraining_data_loader( - valid_ds, args.consumed_valid_samples) - test_dataloader = build_pretraining_data_loader(test_ds, 0) + valid_ds, consumed_valid_samples, False, use_all_eval_samples) + test_dataloader = build_pretraining_data_loader(test_ds, 0, False) # Flags to know if we need to do training/validation/testing. do_train = train_dataloader is not None and args.train_iters > 0 - do_valid = valid_dataloader is not None and args.eval_iters > 0 + do_valid = valid_dataloader is not None and (args.eval_iters > 0 or args.eval_iters == -1) do_test = test_dataloader is not None and args.eval_iters > 0 # Need to broadcast num_tokens and num_type_tokens. flags = get_accelerator().LongTensor( - [int(do_train), int(do_valid), int(do_test)]) + [int(do_train), int(do_valid), int(do_test), int(eval_total_samples)]) else: - flags = get_accelerator().LongTensor([0, 0, 0]) + flags = get_accelerator().LongTensor([0, 0, 0, 0]) # Broadcast num tokens. if ds_sequence_parallel: @@ -1555,6 +1680,7 @@ def build_train_valid_test_data_loaders( args.do_train = flags[0].item() args.do_valid = flags[1].item() args.do_test = flags[2].item() + args.eval_total_samples = flags[3].item() return train_dataloader, valid_dataloader, test_dataloader @@ -1581,7 +1707,7 @@ def build_train_valid_test_data_iterators( train_data_iterator = None if valid_dataloader is not None: - valid_data_iterator = iter(valid_dataloader) if dl_type == 'single' \ + valid_data_iterator = iter(valid_dataloader) if (dl_type == 'single' and args.eval_iters != -1) \ else iter(cyclic_iter(valid_dataloader)) else: valid_data_iterator = None diff --git a/megatron/utils.py b/megatron/utils.py index 97294070af..9d5aa61ca7 100644 --- a/megatron/utils.py +++ b/megatron/utils.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
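The `eval_iters == -1` path above switches validation from a fixed iteration count to one full pass over the validation set, broadcasting `eval_total_samples` to all ranks and shrinking the last iteration's microbatch count. A minimal sketch of that sizing arithmetic, with hypothetical values (names mirror the patch, but this block is illustrative and not part of the patch):

```python
import math

# Hypothetical values; names mirror the patch.
data_parallel_world_size = 8
eval_micro_batch_size = 2
num_eval_microbatches = 4            # microbatches per evaluation iteration
eval_total_samples = 5000            # len(valid_ds), broadcast to all ranks

samples_per_iteration = (data_parallel_world_size
                         * eval_micro_batch_size
                         * num_eval_microbatches)
total_iterations = math.ceil(eval_total_samples / samples_per_iteration)

# The final iteration runs only enough microbatches to cover the remainder.
consumed = (total_iterations - 1) * samples_per_iteration
last_iteration_microbatches = math.ceil(
    (eval_total_samples - consumed)
    / (data_parallel_world_size * eval_micro_batch_size))

print(total_iterations, last_iteration_microbatches)  # 79 1
```

This is also why the loss denominator in `evaluate()` becomes `((total_iterations - 1) * get_num_eval_microbatches()) + num_eval_microbatches`: it counts exactly the microbatches that actually ran, including the shortened last iteration.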
"""General utilities.""" @@ -23,6 +24,11 @@ from megatron.model.module import param_is_not_shared from megatron.model.rotary_pos_embedding import RotaryEmbedding +try: + from habana_frameworks.torch.hpex.experimental.transformer_engine import recipe +except (ImportError, ModuleNotFoundError): + recipe = None + def update_rotary_pos_emb(seq_length): args = get_args() @@ -163,7 +169,9 @@ def get_ltor_masks_and_position_ids(data, reset_position_ids, reset_attention_mask, eod_mask_loss, - skip_mask=False): + skip_mask=False, + dummy_sample=None, + labels=None): """Build masks and position id for left to right model.""" # Extract batch size and sequence length. @@ -178,13 +186,19 @@ def get_ltor_masks_and_position_ids(data, attention_mask = None if not skip_mask: attention_mask = torch.tril(torch.ones( - (att_mask_batch, seq_length, seq_length))).view(att_mask_batch, 1, seq_length, seq_length) + (att_mask_batch, seq_length, seq_length), device=data.device)).view(att_mask_batch, 1, seq_length, seq_length) # Loss mask. loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device) if eod_mask_loss: loss_mask[data == eod_token] = 0.0 + if dummy_sample is not None: + loss_mask[dummy_sample.bool()] = 0.0 + + if labels is not None: + loss_mask[labels == -1] = 0.0 + # Position ids. position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device) @@ -218,7 +232,6 @@ def get_ltor_masks_and_position_ids(data, # Convert attention mask to binary: if not skip_mask: attention_mask = (attention_mask < 0.5) - attention_mask = attention_mask.to(data.device) return attention_mask, loss_mask, position_ids @@ -275,22 +288,38 @@ def throughput_calculator(model, args, iteration_time, total_iterations): #flops calculator hidden_size = args.hidden_size + num_attention_heads = args.num_attention_heads + head_dim = hidden_size // num_attention_heads + ffn_hidden_size = args.ffn_hidden_size num_layers = args.num_layers vocab_size = args.padded_vocab_size + gqa = args.num_attention_heads // args.num_key_value_heads + ffn_multiplier = 3 if args.swiglu else 2 + macs_per_flops = 2 # General TFLOPs formula (borrowed from Equation 3 in Section 5.1 of # https://arxiv.org/pdf/2104.04473.pdf). - # The factor of 4 is when used with activation check-pointing, - # otherwise it will be 3. - checkpoint_activations_factor = 3 - if hasattr(args, 'checkpoint_activations') and args.checkpoint_activations: - checkpoint_activations_factor = 4 - if hasattr(args, 'recompute_granularity') and (args.recompute_granularity == 'selective' or args.recompute_granularity == 'full'): - checkpoint_activations_factor = 4 + # correction has been made to TFLOPs formula due to incorrect behavior + # observed with selective recompute when GQA not used and for all with GQA seq_len = args.seq_length if hasattr(args, 'actual_seq_length'): seq_len = args.actual_seq_length - flops_per_iteration = (24 * checkpoint_activations_factor * batch_size * seq_len * num_layers * (hidden_size**2)) * (1. + (seq_len / (6. * hidden_size)) + (vocab_size / (16. 
* num_layers * hidden_size))) + + pre_and_post_mha_gemm_macs = batch_size * num_layers * (1 + (2 // gqa) + 1) * (hidden_size**2) * seq_len + mha_bgemm_macs = batch_size * num_layers * 2 * head_dim * num_attention_heads * (seq_len**2) + ffn_gemm_macs = batch_size * num_layers * ffn_multiplier * ffn_hidden_size * hidden_size * seq_len + logit_lmhead_gemm_macs = batch_size * vocab_size * hidden_size * seq_len + + fwd_macs = pre_and_post_mha_gemm_macs + mha_bgemm_macs + ffn_gemm_macs + logit_lmhead_gemm_macs + bwd_macs = 2 * fwd_macs + fwd_bwd_macs = fwd_macs + bwd_macs + + if (hasattr(args, 'checkpoint_activations') and args.checkpoint_activations) or (hasattr(args, 'recompute_granularity') and args.recompute_granularity == 'full'): + fwd_bwd_macs += fwd_macs + if hasattr(args, 'recompute_granularity') and args.recompute_granularity == 'selective': + fwd_bwd_macs += mha_bgemm_macs + + flops_per_iteration = fwd_bwd_macs * macs_per_flops tflops = flops_per_iteration / (elapsed_time_per_iter * args.world_size * (10**12)) return samples_per_second, tflops, approx_parameters_in_billions @@ -301,7 +330,6 @@ def checkpoint_throughput_calculator(model, latency_second): GB_per_second = checkpoint_GB / latency_second print_rank_0(f"Checkpoint Save GB: {round(checkpoint_GB, 3)}, GB/Sec: {round(GB_per_second,2)}, Latency(second): {round(latency_second, 3)}") - def get_fingerprint_header(): return f"{'min':^13} {'max':^13} {'mean':^13} {'l2 norm':^12} metadata" @@ -381,3 +409,29 @@ def dump_weights(preamble, iteration, model, optimizer, tensor=None): p = model[0].module.tied_modules.embed.word_embeddings.weight._hp_param fh.write(f"{get_fingerprint(p)} module.tied_modules.embed.word_embeddings.weight._hp_param {p.shape}\n") +def found_kill_switch(): + args = get_args() + if args.kill_switch_path is not None and os.path.exists(args.kill_switch_path): + return True + else: + return False + + +FP8_RECIPE=None +def get_fp8_recipe(args): + global FP8_RECIPE + if FP8_RECIPE is None: + if args.fp8_e5m2: + fp8_format = recipe.Format.E5M2 + elif args.fp8_hybrid: + fp8_format = recipe.Format.HYBRID + fp8_interval = get_args().fp8_interval + FP8_RECIPE = recipe.DelayedScaling( + margin=args.fp8_margin, + interval=fp8_interval, + fp8_format=fp8_format, + amax_history_len=args.fp8_amax_history_len, + amax_compute_algo=args.fp8_amax_compute_algo, + reduce_amax=args.fp8_amax_reduce, + ) + return FP8_RECIPE \ No newline at end of file diff --git a/pretrain_gpt.py b/pretrain_gpt.py index 52681e5f8f..c7aea08e77 100644 --- a/pretrain_gpt.py +++ b/pretrain_gpt.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
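The reworked `throughput_calculator` above counts multiply-accumulates (MACs) per GEMM group instead of scaling the earlier `24 * B * s * l * h^2` closed form by a checkpoint-activations factor. A rough, self-contained restatement of the same arithmetic for hypothetical LLaMA-2-7B-like settings (illustrative only, not part of the patch):

```python
# Hypothetical 7B-like settings; numbers are illustrative only.
batch_size, seq_len = 1, 4096
num_layers, hidden_size, ffn_hidden_size = 32, 4096, 11008
num_attention_heads, num_key_value_heads = 32, 32
vocab_size, swiglu = 32000, True

gqa = num_attention_heads // num_key_value_heads
head_dim = hidden_size // num_attention_heads
ffn_multiplier = 3 if swiglu else 2
macs_per_flops = 2                    # 2 FLOPs per MAC (name kept as in the patch)

# Per-GEMM-group MAC counts for one forward pass.
pre_and_post_mha = batch_size * num_layers * (1 + (2 // gqa) + 1) * hidden_size**2 * seq_len
mha_bgemm = batch_size * num_layers * 2 * head_dim * num_attention_heads * seq_len**2
ffn_gemm = batch_size * num_layers * ffn_multiplier * ffn_hidden_size * hidden_size * seq_len
logit_lmhead = batch_size * vocab_size * hidden_size * seq_len

fwd_macs = pre_and_post_mha + mha_bgemm + ffn_gemm + logit_lmhead
fwd_bwd_macs = fwd_macs + 2 * fwd_macs   # backward costs roughly 2x forward
# Full activation recompute would add another fwd_macs; selective adds mha_bgemm.

tflops_of_work = fwd_bwd_macs * macs_per_flops / 1e12
print(round(tflops_of_work, 1))          # ~188.8 TFLOPs for this micro-batch
```

The reported rate then divides this work by the iteration time and world size, as in the patch.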
"""Pretrain GPT""" @@ -13,6 +14,7 @@ from megatron.core.enums import ModelType from megatron.data.gpt_dataset import build_train_valid_test_datasets from megatron.model import GPTModel, GPTModelPipe +from megatron.model.utils import add_tp_norm_hooks from megatron.training import pretrain from megatron.utils import get_ltor_masks_and_position_ids from megatron.utils import average_losses_across_data_parallel_group, update_rotary_pos_emb @@ -28,7 +30,7 @@ import torch.nn.functional as F -def model_provider(pre_process=True, post_process=True): +def model_provider(pre_process=True, post_process=True, parallel_output=True): """Build the model.""" print_rank_0('building GPT model ...') @@ -51,8 +53,10 @@ def model_provider(pre_process=True, post_process=True): model = GPTModelPipe( config=config, num_tokentypes=0, - parallel_output=True + parallel_output=parallel_output ) + add_tp_norm_hooks(model, args) + # This is a hack to give us a reference to get_batch_pipe from within training.py # We need to call model.set_batch_fn after deepspeed.initialize model._megatron_batch_fn = get_batch_pipe @@ -71,18 +75,23 @@ def model_provider(pre_process=True, post_process=True): elif args.bf16: attention_mask = attention_mask.bfloat16() - # Attention mask must be bool. - args.attn_mask = attention_mask.to(torch.bool) + if args.mask_tensor_adding: + args.attn_mask = attention_mask * -10000.0 + else: + # Attention mask must be bool. + args.attn_mask = attention_mask.to(torch.bool) # For prertaining, since sequence length is fixed, cache rotary embedding in args, to avoid communicating around if args.use_rotary_position_embeddings: update_rotary_pos_emb(args.seq_length) else: + assert not args.use_alibi_position_embeddings, \ + "GPTModel doesn't yet support ALiBi positional encoding" model = GPTModel( config=config, num_tokentypes=0, - parallel_output=True, + parallel_output=parallel_output, pre_process=pre_process, post_process=post_process ) @@ -108,8 +117,13 @@ def get_batch(data_iterator): # Unpack. tokens_ = data_b['text'].long() - labels = tokens_[:, 1:].contiguous() - tokens = tokens_[:, :-1].contiguous() + if not args.use_seq_len_plus_one_tokens: + labels = torch.roll(tokens_, shifts=-1, dims=1) + labels[:, -1] = -1 + tokens = tokens_ + else: + labels = tokens_[:, 1:].contiguous() + tokens = tokens_[:, :-1].contiguous() # Get the masks and postition ids. skip_mask = args.use_flash_attn or args.use_flash_attn_triton @@ -119,7 +133,9 @@ def get_batch(data_iterator): args.reset_position_ids, args.reset_attention_mask, args.eod_mask_loss, - skip_mask) + skip_mask, + labels = labels, + dummy_sample= None,) # For DS's sequence parallel seq_parallel_world_size = mpu.get_sequence_parallel_world_size() @@ -136,6 +152,9 @@ def get_batch(data_iterator): sub_seq_start = seq_parallel_world_rank * sub_seq_length sub_seq_end = (seq_parallel_world_rank + 1) * sub_seq_length + tokens[tokens == -1] = 0 + labels[labels == -1] = 0 + tokens = tokens[:, sub_seq_start:sub_seq_end] position_ids = position_ids[:, sub_seq_start:sub_seq_end] # For DS's sequence parallel @@ -183,8 +202,13 @@ def get_batch_pipe(data): # Unpack. tokens_ = data_b['text'].long() - labels = tokens_[:, 1:].contiguous() - tokens = tokens_[:, :-1].contiguous() + if not args.use_seq_len_plus_one_tokens: + labels = torch.roll(tokens_, shifts=-1, dims=1) + labels[:, -1] = -1 + tokens = tokens_ + else: + labels = tokens_[:, 1:].contiguous() + tokens = tokens_[:, :-1].contiguous() # Get the masks and postition ids. 
attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids( @@ -192,7 +216,13 @@ def get_batch_pipe(data): tokenizer.eod, args.reset_position_ids, args.reset_attention_mask, - args.eod_mask_loss) + args.eod_mask_loss, + labels = labels, + dummy_sample = None,) + + tokens[tokens == -1] = 0 + labels[labels == -1] = 0 + if args.curriculum_learning_legacy and args.curriculum_seqlen < tokens.size()[1]: # seqlen-based curriculum learning # tokens, position_ids, labels, loss_mask have size [batch size, seqlen] @@ -318,7 +348,8 @@ def train_valid_test_datasets_provider(train_val_test_num_samples): train_data_prefix=args.train_data_path, valid_data_prefix=args.valid_data_path, test_data_prefix=args.test_data_path, - data_cache_path=args.data_cache_path) + data_cache_path=args.data_cache_path, + use_seq_len_plus_one_tokens=args.use_seq_len_plus_one_tokens) print_rank_0("> finished creating GPT datasets ...") return train_ds, valid_ds, test_ds diff --git a/scripts/convert_ds_to_universal.sh b/scripts/convert_ds_to_universal.sh new file mode 100755 index 0000000000..67096eb917 --- /dev/null +++ b/scripts/convert_ds_to_universal.sh @@ -0,0 +1,75 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#!/bin/bash + +echo "**************************************************************************" +echo "Script to convert Megatron-DeepSpeed Checkpoint into a Universal Checkpoint" +echo "**************************************************************************" + +###### INPUTS : START ##### +# here for testing only +# MEGATRON_DEEPSPEED_ROOT +# DEEPSPEED_ROOT +# HL_LATEST_CHECKPOINT=<>/checkpoints/global_step48600 +# HL_UNIV_CP_EXTRACT_WORKERS +# HL_UNIV_CP_MERGE_WORKERS +###### INPUTS : END ###### + +LATEST_CHECKPOINT=${HL_LATEST_CHECKPOINT:-} +EXTRACT_WORKERS=${HL_UNIV_CP_EXTRACT_WORKERS:-} +MERGE_WORKERS=${HL_UNIV_CP_MERGE_WORKERS:-} + +if [[ -z "$MEGATRON_DEEPSPEED_ROOT" ]]; then + MEGATRON_DEEPSPEED_ROOT=$(realpath $(dirname $0)/../) +fi + +if [[ -z "$DEEPSPEED_ROOT" ]]; then + res=$(deepspeed --help) + if [ $? -ne 0 ]; then + echo "please install deepspeed or set DEEPSPEED_ROOT" + fi + DEEPSPEED_ROOT=$(pip show deepspeed | grep -i "^Location:" | cut -d" " -f 2) +fi + +if [[ -z "$LATEST_CHECKPOINT" ]]; then + echo "please set HL_LATEST_CHECKPOINT" + exit 1 +else + LATEST_CHECKPOINT=${LATEST_CHECKPOINT%/} +fi + +export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH +UNIV_CP_PATH=${LATEST_CHECKPOINT}_universal +mkdir -p $UNIV_CP_PATH +PYTHON_CMD="python ${DEEPSPEED_ROOT}/deepspeed/checkpoint/ds_to_universal.py --input_folder ${LATEST_CHECKPOINT} --output_folder ${UNIV_CP_PATH}" + +if [ -n "$EXTRACT_WORKERS" ]; then + PYTHON_CMD="${PYTHON_CMD} --num_extract_workers ${EXTRACT_WORKERS}" +fi + +if [ -n "$MERGE_WORKERS" ]; then + PYTHON_CMD="${PYTHON_CMD} --num_merge_workers ${MERGE_WORKERS}" +fi + +echo $PYTHON_CMD +eval $PYTHON_CMD + +if [ $? 
-ne 0 ]; then + echo 'Failed to run ds_to_universal.py ' + exit 1 +else + echo "Conversion to universal checkpoint finished. Converted checkpoint available at ${UNIV_CP_PATH} " + exit 0 +fi diff --git a/scripts/hostsfile b/scripts/hostsfile new file mode 100644 index 0000000000..19903f01eb --- /dev/null +++ b/scripts/hostsfile @@ -0,0 +1,2 @@ +10.10.100.101 slots=8 +10.10.100.102 slots=8 diff --git a/scripts/run_llama.sh b/scripts/run_llama.sh new file mode 100755 index 0000000000..6d855223de --- /dev/null +++ b/scripts/run_llama.sh @@ -0,0 +1,468 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + +#!/bin/bash + +# ------------------------------------------- +# LLaMA 7/13/65 B training script +# Paper: https://arxiv.org/abs/2302.13971 +# LLaMA 2 7/13/34/70 B training script +# Paper: https://arxiv.org/pdf/2307.09288.pdf +# ------------------------------------------- + +# LLaMA Reference script : ../examples_deepspeed/pretrain_llama_distributed.sh +# LLaMA 2 Reference script : ../examples_deepspeed/pretrain_llama2_distributed.sh + +set -ex + +# ---------------------- +# Configurable parameters + +TOKENIZER_TYPE=${HL_TOKENIZER_TYPE:-GPTSentencePieceTokenizer} +TRUST_REMOTE_CODE=${HL_TRUST_REMOTE_CODE:-0} +DATA_DIR=${HL_DATA_DIR_ROOT:-/data/datasets/red_pajama} +DATA_CACHE_DIR=${HL_DATA_CACHE_DIR:-} +DATA_FILE_PREFIX=${HL_DATA_FILE_PREFIX:-redpajama} +TOKENIZER_MODEL=${HL_TOKENIZER_MODEL:-} +NUM_NODES=${HL_NUM_NODES:-1} +DP=${HL_DP:-2} +TP=${HL_TP:-2} +PP=${HL_PP:-2} +MICRO_BATCH=${HL_MICRO_BATCH:-1} +EXIT_INTERVAL=${HL_EXIT_INTERVAL:-0} +OUTPUT_DIR=${HL_RESULTS_DIR:-} +OUTPUT_DIR_PREFIX=${HL_RESULTS_DIR_PREFIX:-.} +CHECKPOINT_SAVE=${HL_SAVE:-1} +SAVE_INTERVAL=${HL_SAVE_INTERVAL:-2000} +CHECKPOINTS_DIR=${HL_CHECKPOINTS_DIR:-} +TENSORBOARD_DIR=${HL_TENSORBOARD_DIR:-} +KILL_SWITCH_FILE=${HL_KILL_SWITCH:-} +HOSTSFILE=${HL_HOSTSFILE:-} +CKP_ACT=${HL_CKP_ACT:-0} +UNIV_CP=${HL_UNIV_CP:-0} +QNPU_DIR=${HL_QNPU_DIR:-} +LOG_INTERVAL=${HL_LOG_INTERVAL:-10} +LLAMA_VER=${HL_LLAMA_VER:-2} # 1 for LLaMA and 2 for LLaMA 2 +LLAMA_MODEL_SIZE=${HL_LLAMA_MODEL_SIZE:-13} +DEVICES_PER_NODE=${HL_DEVICES_PER_NODE:-8} +ZERO_STAGE=${HL_ZERO_STAGE:-0} +SEQ_PARALLEL=${HL_SEQ_PARALLEL:-1} +OPTIMIZER=${HL_OPTIMIZER:-fusedadamw} +DROPOUT=${HL_DROPOUT:-0.0} +EVAL_ITERS=${HL_EVAL_ITERS:-100} +EVAL_INTERVAL=${HL_EVAL_INTERVAL:-1000} +USE_FUSED_SDPA=${HL_USE_FUSED_SDPA:-1} +USE_FUSED_SDPA_WITH_RECOMPUTE=${HL_USE_FUSED_SDPA_WITH_RECOMPUTE:-0} +USE_FUSED_RMSNORM=${HL_USE_FUSED_RMSNORM:-1} +PROFILE=${HL_PROFILE:-} # provide either of pt, pt-full, hltv +PROFILE_STEPS=${HL_PROFILE_STEPS:-"3,4"} +USE_TRANSFORMER_ENGINE=${HL_USE_TRANSFORMER_ENGINE:-0} +USE_CACHE_FP8_WEIGHT=${HL_USE_CACHE_FP8_WEIGHT:-0} +USE_CACHE_FP8_WEIGHT_FWD=${HL_USE_CACHE_FP8_WEIGHT_FWD:-0} +FP8_FORMAT=${HL_FP8_FORMAT:-hybrid} # hybrid or e5m2 +GRAD_ACCUM_DTYPE=${HL_GRAD_ACCUM_DTYPE} +FP8_MARGIN=${HL_FP8_MARGIN:-0} +FP8_AMAX_RECOMPUTE_ALGO=${HL_FP8_AMAX_RECOMPUTE_ALGO:-max} # max or most_recent +TENSOR_LOGGER=${HL_TENSOR_LOGGER:-0} +TENSOR_LOGGER_DIR=${HL_TENSOR_LOGGER_DIR:-} +TENSOR_LOGGER_START_ITER=${HL_TENSOR_LOGGER_START_ITER:-0} +TENSOR_LOGGER_END_ITER=${HL_TENSOR_LOGGER_END_ITER:-0} +USE_LAZY_MODE=${HL_USE_LAZY_MODE:-1} +USE_TORCH_COMPILE=${HL_USE_TORCH_COMPILE:-0} +NO_PIPELINE_PARALLEL=${HL_NO_PIPELINE_PARALLEL:-0} +POSITION_EMBEDDING_TYPE=${HL_POSITION_EMBEDDING_TYPE:-rotary} +USE_FAST_SOFTMAX=${HL_USE_FAST_SOFTMAX:-0} +IMMEDIATE_GRAD_UPDATE=${HL_IMMEDIATE_GRAD_UPDATE:-true} +CHECK_TP_NORM=${HL_CHECK_TP_NORM:-0} 
+START_CHECK_TP_NORM_ITER=${HL_START_CHECK_TP_NORM_ITER:--1} +END_CHECK_TP_NORM_ITER=${HL_END_CHECK_TP_NORM_ITER:--1} +CHECK_TP_NORM_TYPE=${HL_CHECK_TP_NORM_TYPE:-all} # all, io, wb +# ---------------------- + +if [ $((NUM_NODES*DEVICES_PER_NODE)) -ne $((DP*TP*PP)) ]; then + echo "NUM_NODES*DEVICES_PER_NODE != DP*TP*PP" + exit 1 +fi + +if [ $(( HL_NUM_LAYERS % PP )) -ne 0 ]; then + echo 'HL_NUM_LAYERS must be divisible by PP' + exit 1 +fi + +if [[ -z "$MEGATRON_DEEPSPEED_ROOT" ]]; then + MEGATRON_DEEPSPEED_ROOT=$(realpath $(dirname $0)/../) +fi + +DATA_PATH=${DATA_DIR}/${DATA_FILE_PREFIX} + +if [ "$LLAMA_VER" = "1" ]; then + GLOBAL_BATCH=${HL_GBS:-2048} # microbatches in the pipeline (computed as `GLOBAL_BATCH / (DP * MICRO_BATCH)`) should be divisible by the PP + SEQ_LEN=${HL_SEQ_LEN:-2048} + TRAIN_ITERS=${HL_TRAIN_ITERS:-250000} + if [ $LLAMA_MODEL_SIZE -eq 65 ]; then + # LLaMA-65B model architecture + N_LAYERS=${HL_NUM_LAYERS:-80} # must be divisible by PP + NHIDDEN=8192 + NHEADS=64 # must be divisible by TP + FFN_HIDDEN_SIZE=22016 + LR=1.5e-4 + MIN_LR=1.5e-5 + elif [ $LLAMA_MODEL_SIZE -eq 13 ]; then + # LLaMA-13B model architecture + N_LAYERS=${HL_NUM_LAYERS:-40} # must be divisible by PP + NHIDDEN=5120 + NHEADS=40 # must be divisible by TP + FFN_HIDDEN_SIZE=13824 + LR=3e-4 + MIN_LR=3e-5 + elif [ $LLAMA_MODEL_SIZE -eq 7 ]; then + # LLaMA-7B model architecture + N_LAYERS=${HL_NUM_LAYERS:-32} # must be divisible by PP + NHIDDEN=4096 + NHEADS=32 # must be divisible by TP + FFN_HIDDEN_SIZE=11008 + LR=3e-4 + MIN_LR=3e-5 + else + echo "incorrect HL_LLAMA_MODEL_SIZE=$LLAMA_MODEL_SIZE is set" + exit 1 + fi +else + GLOBAL_BATCH=${HL_GBS:-1024} # microbatches in the pipeline (computed as `GLOBAL_BATCH / (DP * MICRO_BATCH)`) should be divisible by the PP + SEQ_LEN=${HL_SEQ_LEN:-4096} + TRAIN_ITERS=${HL_TRAIN_ITERS:-500000} + if [ $LLAMA_MODEL_SIZE -eq 70 ]; then + # LLaMA2-70B model architecture + N_LAYERS=${HL_NUM_LAYERS:-80} # must be divisible by PP + NHIDDEN=8192 + NHEADS=64 # must be divisible by TP + NUM_KV_HEADS=$((NHEADS/8)) # must be divisible by TP + FFN_HIDDEN_SIZE=28672 + LR=1.5e-4 + MIN_LR=1.5e-5 + elif [ $LLAMA_MODEL_SIZE -eq 34 ]; then + # LLaMA2-34B model architecture + N_LAYERS=${HL_NUM_LAYERS:-48} # must be divisible by PP + NHIDDEN=8192 + NHEADS=64 # must be divisible by TP + NUM_KV_HEADS=$((NHEADS/8)) # must be divisible by TP + FFN_HIDDEN_SIZE=22016 + LR=1.5e-4 + MIN_LR=1.5e-5 + elif [ $LLAMA_MODEL_SIZE -eq 13 ]; then + # LLaMA2-13B model architecture + N_LAYERS=${HL_NUM_LAYERS:-40} # must be divisible by PP + NHIDDEN=5120 + NHEADS=40 # must be divisible by TP + NUM_KV_HEADS=${NHEADS} # must be divisible by TP + FFN_HIDDEN_SIZE=13824 + LR=3e-4 + MIN_LR=3e-5 + elif [ $LLAMA_MODEL_SIZE -eq 7 ]; then + # LLaMA2-7B model architecture + N_LAYERS=${HL_NUM_LAYERS:-32} # must be divisible by PP + NHIDDEN=4096 + NHEADS=32 # must be divisible by TP + NUM_KV_HEADS=${NHEADS} # must be divisible by TP + FFN_HIDDEN_SIZE=11008 + LR=3e-4 + MIN_LR=3e-5 + else + echo "incorrect HL_LLAMA_MODEL_SIZE=$LLAMA_MODEL_SIZE is set" + exit 1 + fi +fi + +RUNTIME=`date +"%Y%m%d_%H%M"` +# output paths +if [ -z "$OUTPUT_DIR" ]; then + NUM_DEVICES=$(($DP * $TP * $PP)) + # Experiment name + if [ -z "$EXP_NAME" ]; then + EXP_NAME="default" + fi + OUTPUT_DIR=${OUTPUT_DIR_PREFIX}/out/llama${LLAMA_VER}_${LLAMA_MODEL_SIZE}b/ds_${EXP_NAME}_z${ZERO_STAGE}_nl${N_LAYERS}_hs${NHIDDEN}_ffn${FFN_HIDDEN_SIZE}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}_sp${SEQ_PARALLEL}_D${DP}_T${TP}_P${PP}_devices${NUM_DEVICES}_${RUNTIME} +fi + 
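The launcher validates the parallel topology and layer split before running (see the checks near the top of the script). A minimal Python restatement of those constraints, using the script's default LLaMA-2-13B settings as hypothetical inputs:

```python
# Hypothetical values matching the script defaults (DP=2, TP=2, PP=2, 13B).
num_nodes, devices_per_node = 1, 8
dp, tp, pp = 2, 2, 2                 # data / tensor / pipeline parallel sizes
global_batch, micro_batch = 1024, 1
num_layers = 40

assert num_nodes * devices_per_node == dp * tp * pp
assert num_layers % pp == 0          # layers must split evenly across pipeline stages

# Microbatches in flight per pipeline: GLOBAL_BATCH / (DP * MICRO_BATCH),
# which should itself be divisible by PP for a balanced schedule.
pipeline_microbatches = global_batch // (dp * micro_batch)
assert pipeline_microbatches % pp == 0
print(pipeline_microbatches)         # 512
```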
+if [ -z "$CHECKPOINTS_DIR" ]; then + CHECKPOINTS_DIR=$OUTPUT_DIR/checkpoints +fi + +# if [ $UNIV_CP -eq 1 ] +# then +# ckpt_name=$(cat $CHECKPOINTS_DIR/latest) +# res=$(HL_NUM_NODES=$NUM_NODES HL_DEVICES_PER_NODE=$DEVICES_PER_NODE HL_LATEST_CHECKPOINT=$CHECKPOINTS_DIR/$ckpt_name $MEGATRON_DEEPSPEED_ROOT/scripts/convert_ds_to_universal.sh) +# fi + +if [ -z "$TENSORBOARD_DIR" ]; then + TENSORBOARD_DIR=$OUTPUT_DIR/tensorboard +fi + +mkdir -p ${OUTPUT_DIR} +mkdir -p ${TENSORBOARD_DIR} + +PARTITIONED_MODE="true" +if [ $SEQ_PARALLEL -eq 1 ]; then + PARTITIONED_MODE="false" +fi + +# create DS config + +# optional grad_accum_dtype setting +DS_CONFIG_GRAD_ACCUM_DTYPE="" +if [[ -n "$GRAD_ACCUM_DTYPE" ]]; then + DS_CONFIG_GRAD_ACCUM_DTYPE=", + \"data_types\": { + \"grad_accum_dtype\": \"$GRAD_ACCUM_DTYPE\" + }" +fi + +DS_CONFIG=${OUTPUT_DIR}/ds_config.json +cat << EOT > $DS_CONFIG +{ + "train_batch_size" : $GLOBAL_BATCH, + "train_micro_batch_size_per_gpu": $MICRO_BATCH, + "steps_per_print": $LOG_INTERVAL, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE + }, + "zero_allow_untested_optimizer": true, + "bf16": { + "enabled": true, + "immediate_grad_update": $IMMEDIATE_GRAD_UPDATE + }, + "fp16": {"enabled": false}, + "wall_clock_breakdown": false, + "pipeline": { + "pipe_partitioned": $PARTITIONED_MODE, + "grad_partitioned": $PARTITIONED_MODE + }, + "timers": { + "throughput": { + "enabled": true, + "synchronized": false + } + }$DS_CONFIG_GRAD_ACCUM_DTYPE +} +EOT + +# configure multi-node +MULTINODE_CMD="" +if [ "$NUM_NODES" -ne "1" -a -f "$HOSTSFILE" ]; then + MULTINODE_CMD="--hostfile=$HOSTSFILE \ + --master_addr $(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p) " +fi + +# training script command +CMD="" +if [ ! -z "$QNPU_DIR" ]; then + CMD="source ${QNPU_DIR}/activate ;" +fi + +if [ $USE_LAZY_MODE -eq 0 ]; then + CMD="${CMD} PT_HPU_LAZY_MODE=0" +else + LOWER_CASE_USE_TORCH_COMPILE=$(echo "$USE_TORCH_COMPILE" | tr '[:upper:]' '[:lower:]') + if [[ "$LOWER_CASE_USE_TORCH_COMPILE" == "true" || "$LOWER_CASE_USE_TORCH_COMPILE" == "1" ]]; then + echo "Cannot use lazy(HL_USE_LAZY_MODE) and torch.compile(HL_USE_TORCH_COMPILE) modes together" + exit 1 + fi +fi + +CMD="${CMD} \ + python3 -u ${MEGATRON_DEEPSPEED_ROOT}/pretrain_gpt.py \ + --deepspeed \ + --tensor-model-parallel-size $TP \ + --pipeline-model-parallel-size $PP \ + --num-layers ${N_LAYERS} \ + --hidden-size ${NHIDDEN} \ + --ffn-hidden-size ${FFN_HIDDEN_SIZE} \ + --num-attention-heads ${NHEADS} \ + --seq-length ${SEQ_LEN} \ + --micro-batch-size ${MICRO_BATCH} \ + --global-batch-size ${GLOBAL_BATCH} \ + --train-iters ${TRAIN_ITERS} \ + --log-interval ${LOG_INTERVAL} \ + --eval-iters ${EVAL_ITERS} \ + --eval-interval ${EVAL_INTERVAL} \ + --data-path ${DATA_PATH} \ + --optimizer ${OPTIMIZER} \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --adam-eps 1e-8 \ + --lr ${LR} \ + --min-lr ${MIN_LR} \ + --lr-decay-style cosine \ + --lr-warmup-iters 2000 \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --tensorboard-dir ${TENSORBOARD_DIR} \ + --log-validation-ppl-to-tensorboard \ + --log-batch-size-to-tensorboard \ + --log-timers-to-tensorboard \ + --load ${CHECKPOINTS_DIR} \ + --deepspeed_config=${DS_CONFIG} \ + --use-torch-compile=${USE_TORCH_COMPILE} \ + --zero-stage=${ZERO_STAGE} \ + --exit-interval ${EXIT_INTERVAL} \ + --no-masked-softmax-fusion \ + --no-bias-gelu-fusion \ + --no-bias-dropout-fusion \ + --no-gradient-accumulation-fusion \ + --bf16 \ + --max-position-embeddings $SEQ_LEN \ + 
--untie-embeddings-and-output-weights \ + --swiglu \ + --normalization rmsnorm \ + --disable-bias-linear \ + --no-query-key-layer-scaling \ + --attention-dropout ${DROPOUT} \ + --hidden-dropout ${DROPOUT} \ + --use-fused-sdpa $USE_FUSED_SDPA \ + --use-fused-sdpa-with-recompute $USE_FUSED_SDPA_WITH_RECOMPUTE \ + --use-fused-rmsnorm $USE_FUSED_RMSNORM" + +if [ "$POSITION_EMBEDDING_TYPE" = "rotary" ]; then + CMD="${CMD} --use-rotary-position-embeddings" +else + CMD="${CMD} --use-alibi-position-embeddings" +fi + +if [ "$TOKENIZER_TYPE" = "GPTSentencePieceTokenizer" ]; then + CMD="${CMD} --tokenizer-type GPTSentencePieceTokenizer" + if [[ -z "$TOKENIZER_MODEL" ]]; then + TOKENIZER_MODEL="${DATA_DIR}/tokenizer.model" + fi + CMD="${CMD} --tokenizer-model $TOKENIZER_MODEL" +elif [ "$TOKENIZER_TYPE" = "GPT2BPETokenizer" ]; then + CMD="${CMD} --tokenizer-type GPT2BPETokenizer" + CMD="${CMD} --vocab-file $DATA_DIR/gpt2-vocab.json" + CMD="${CMD} --merge-file $DATA_DIR/gpt2-merges.txt" +elif [ "$TOKENIZER_TYPE" = "HFTokenizer" ]; then + CMD="${CMD} --tokenizer-type HFTokenizer" + if [[ -z "$TOKENIZER_MODEL" ]]; then + echo "HL_TOKENIZER_MODEL path is not set" + exit 1 + fi + CMD="${CMD} --tokenizer-model $TOKENIZER_MODEL" + if [ $TRUST_REMOTE_CODE -eq 1 ]; then + CMD="${CMD} --trust-remote-code" + fi +else + echo "incorrect HL_TOKENIZER_TYPE=$TOKENIZER_TYPE is set" + exit 1 +fi + +if [ "$LLAMA_VER" = "2" ] && [ $NHEADS -ne $NUM_KV_HEADS ]; then + CMD="${CMD} --num-key-value-heads $NUM_KV_HEADS" +fi + +if [ ! -z "$DATA_CACHE_DIR" ]; then + CMD="${CMD} --data-cache-path ${DATA_CACHE_DIR}" +fi + +# handle kill switch argument +if [ ! -z "$KILL_SWITCH_FILE" ]; then + CMD="${CMD} --kill-switch-path $KILL_SWITCH_FILE" +fi + +if [ $SEQ_PARALLEL -eq 1 ] +then + CMD="${CMD} --sequence-parallel" +fi + +if [ $USE_FAST_SOFTMAX -eq 1 ] +then + CMD="${CMD} --use-fast-softmax" +fi + +if [ $UNIV_CP -eq 1 ] +then + echo "Loading Universal Checkpoint from ${CHECKPOINTS_DIR}" + CMD="${CMD} --universal-checkpoint" +fi + +# fp8 args +if [ $USE_TRANSFORMER_ENGINE -eq 1 ]; then + CMD="${CMD} --transformer-impl transformer_engine" + + if [ $USE_CACHE_FP8_WEIGHT -eq 1 ]; then + CMD="${CMD} --cache-fp8-weight" + fi + + FP8_MEASURE_INTERVAL=${HL_FP8_MEASURE_INTERVAL:-$(( GLOBAL_BATCH / MICRO_BATCH / DP ))} + FP8_AMAX_HISTORY_LEN=${HL_FP8_AMAX_HISTORY_LEN:-$(( GLOBAL_BATCH / MICRO_BATCH / DP ))} + FP8_AMAX_REDUCE=${HL_FP8_AMAX_REDUCE:-1} + + CMD="${CMD} --cache-fp8-weight-fwd $USE_CACHE_FP8_WEIGHT_FWD" + CMD="${CMD} --fp8-interval $FP8_MEASURE_INTERVAL" + CMD="${CMD} --fp8-margin $FP8_MARGIN" + CMD="${CMD} --fp8-amax-compute-algo $FP8_AMAX_RECOMPUTE_ALGO" + CMD="${CMD} --fp8-amax-history-len $FP8_AMAX_HISTORY_LEN" + + if [ "$FP8_FORMAT" = "e5m2" ]; then + CMD="${CMD} --fp8-e5m2" + else + CMD="${CMD} --fp8-hybrid" + fi + + if [ $FP8_AMAX_REDUCE -eq 1 ]; then + CMD="${CMD} --fp8-amax-reduce" + fi +fi + +if [[ "$NO_PIPELINE_PARALLEL" == "1" ]]; then + CMD="${CMD} --no-pipeline-parallel" +fi + +if [ $CHECKPOINT_SAVE -eq 1 ] +then + mkdir -p ${CHECKPOINTS_DIR} + CMD="${CMD} --save $CHECKPOINTS_DIR --save-interval $SAVE_INTERVAL --verify-checkpoint --verify-checkpoint-model-type LLAMA" +fi + +if [ $CKP_ACT -eq 1 ] +then + CMD="${CMD} --deepspeed-activation-checkpointing --recompute-granularity=full --recompute-method uniform" +elif [ $CKP_ACT -eq 2 ] +then + CMD="${CMD} --deepspeed-activation-checkpointing --recompute-granularity=selective" +fi + +if [ $TENSOR_LOGGER -eq 1 ]; then + if [ -z "$TENSOR_LOGGER_DIR" ]; then + 
TENSOR_LOGGER_DIR=$OUTPUT_DIR/tensordumps + fi + mkdir -p $TENSOR_LOGGER_DIR + CMD="${CMD} --log-model-inputs" + CMD="${CMD} --log-fwd-activations" + CMD="${CMD} --log-bwd-grads" + CMD="${CMD} --tensor-logger-start-iter $TENSOR_LOGGER_START_ITER" + CMD="${CMD} --tensor-logger-end-iter $TENSOR_LOGGER_END_ITER" + CMD="${CMD} --tensor-logger-path $TENSOR_LOGGER_DIR" +fi + +if [ $CHECK_TP_NORM -eq 1 ]; then + CMD="${CMD} --check-tp-norm" + CMD="${CMD} --start-check-tp-norm-iter $START_CHECK_TP_NORM_ITER" + CMD="${CMD} --end-check-tp-norm-iter $END_CHECK_TP_NORM_ITER" + CMD="${CMD} --check-tp-norm-type $CHECK_TP_NORM_TYPE" +fi + +if [ ! -z "$PROFILE" ]; then + CMD="${CMD} --profile ${PROFILE}" + CMD="${CMD} --profile-steps ${PROFILE_STEPS}" +fi + +if [ ! -z "$QNPU_DIR" ]; then + rm -rf $HOME/.deepspeed_env + echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> $HOME/.deepspeed_env +fi + +# run! +deepspeed --num_nodes ${NUM_NODES} \ + --num_gpus ${DEVICES_PER_NODE} \ + --no_local_rank \ + --no_python \ + $MULTINODE_CMD \ + /usr/bin/bash -c "$CMD" #2>&1 | tee ${OUTPUT_DIR}/log_${RUNTIME}.txt diff --git a/scripts/run_mixtral.sh b/scripts/run_mixtral.sh new file mode 100755 index 0000000000..086abf1942 --- /dev/null +++ b/scripts/run_mixtral.sh @@ -0,0 +1,467 @@ +# Copyright (c) 2024 Habana Labs, Ltd. an Intel Company. + +#!/bin/bash + +# ---------------------------------------------------------------------------- +# Mixtral model +# Paper: https://arxiv.org/pdf/2401.04088.pdf +# ---------------------------------------------------------------------------- + +set -ex + +# ---------------------------------------------------------------------------- +# User configurable parameters + +DATA_DIR=${HL_DATA_DIR_ROOT:-/data/datasets/oscar} +DATA_CACHE_DIR=${HL_DATA_CACHE_DIR:-} +DATA_FILE_PREFIX=${HL_DATA_FILE_PREFIX:-oscar} +TOKENIZER_TYPE=${HL_TOKENIZER_TYPE:-GPTSentencePieceTokenizer} +TOKENIZER_MODEL=${HL_TOKENIZER_MODEL:-} +NUM_NODES=${HL_NUM_NODES:-1} +DP=${HL_DP:-4} +TP=${HL_TP:-2} +PP=${HL_PP:-1} +NO_PIPELINE_PARALLEL=${HL_NO_PIPELINE_PARALLEL:-0} +MICRO_BATCH=${HL_MICRO_BATCH:-1} +EXIT_INTERVAL=${HL_EXIT_INTERVAL:-0} +OUTPUT_DIR=${HL_RESULTS_DIR:-} +OUTPUT_DIR_PREFIX=${HL_RESULTS_DIR_PREFIX:-.} +CHECKPOINT_SAVE=${HL_SAVE:-1} +SAVE_INTERVAL=${HL_SAVE_INTERVAL:-2000} +CHECKPOINTS_DIR=${HL_CHECKPOINTS_DIR:-} +CHECKPOINT_LOAD_TAG=${HL_CHECKPOINT_LOAD_TAG:-} +TENSORBOARD_DIR=${HL_TENSORBOARD_DIR:-} +KILL_SWITCH_FILE=${HL_KILL_SWITCH:-} +HOSTSFILE=${HL_HOSTSFILE:-} +CKP_ACT=${HL_CKP_ACT:-0} +UNIV_CP=${HL_UNIV_CP:-0} +VERIFY_CP=${HL_VERIFY_CP:-0} +QNPU_DIR=${HL_QNPU_DIR:-} +LOG_INTERVAL=${HL_LOG_INTERVAL:-10} +MIXTRAL_MODEL=${HL_MIXTRAL_MODEL:-8x7b} +DEVICES_PER_NODE=${HL_DEVICES_PER_NODE:-8} +ZERO_STAGE=${HL_ZERO_STAGE:-0} +SEQ_PARALLEL=${HL_SEQ_PARALLEL:-0} +OPTIMIZER=${HL_OPTIMIZER:-fusedadamw} +DROPOUT=${HL_DROPOUT:-0.0} +TRAIN_ITERS=${HL_TRAIN_ITERS:-250000} +LR_WARMUP_ITERS=${HL_LR_WARMUP_ITERS:-2000} +EVAL_ITERS=${HL_EVAL_ITERS:-100} +EVAL_INTERVAL=${HL_EVAL_INTERVAL:-1000} +PROFILE=${HL_PROFILE:-} # provide either of pt, pt-full, hltv +PROFILE_STEPS=${HL_PROFILE_STEPS:-"3,4"} +MOE_NUM_CAPACITY_BINS=${HL_MOE_NUM_CAPACITY_BINS:-0} +MOE_CAPACITY_BINS=${HL_MOE_CAPACITY_BINS:-} +MOE_CAPACITY_BINS_EXP_BASE=${HL_CAPACITY_BINS_EXP_BASE:-1.5} +MOE_CAPACITY_BINS_ALIGNMENT=${HL_MOE_CAPACITY_BINS_ALIGNMENT:-64} +MOE_CAPACITY_BINS_OPTIMIZE_INTERVAL=${HL_MOE_CAPACITY_BINS_OPTIMIZE_INTERVAL:-300} +MOE_CAPACITY_BINS_OPTIMIZE_MAX_GROUP=${HL_MOE_CAPACITY_BINS_OPTIMIZE_MAX_GROUP:-4} +MOE_MIN_CAP=${HL_MOE_MIN_CAP:-64} 
+MOE_ENABLE_EXPERT_TP=${HL_MOE_ENABLE_EXPERT_TP:-0} +MOE_EP=${HL_MOE_EP:-} +MOE_USE_DATA_BEFORE_EXPERT_PARALLEL=${HL_MOE_USE_DATA_BEFORE_EXPERT_PARALLEL:-0} +USE_LAZY_MODE=${HL_USE_LAZY_MODE:-1} +USE_TORCH_COMPILE=${HL_USE_TORCH_COMPILE:-0} +USE_FUSED_SDPA=${HL_USE_FUSED_SDPA:-1} +USE_FUSED_SDPA_WITH_RECOMPUTE=${HL_USE_FUSED_SDPA_WITH_RECOMPUTE:-0} +PARTITIONED_MODE=${HL_PARTITIONED_MODE:-false} +USE_TRANSFORMER_ENGINE=${HL_USE_TRANSFORMER_ENGINE:-0} +USE_CACHE_FP8_WEIGHT=${HL_USE_CACHE_FP8_WEIGHT:-0} +USE_CACHE_FP8_WEIGHT_FWD=${HL_USE_CACHE_FP8_WEIGHT_FWD:-0} +FP8_FORMAT=${HL_FP8_FORMAT:-hybrid} # hybrid or e5m2 +GRAD_ACCUM_DTYPE=${HL_GRAD_ACCUM_DTYPE} +FP8_MARGIN=${HL_FP8_MARGIN:-0} +FP8_AMAX_RECOMPUTE_ALGO=${HL_FP8_AMAX_RECOMPUTE_ALGO:-max} # max or most_recent +USE_FUSED_RMSNORM=${HL_USE_FUSED_RMSNORM:-1} +# Following configuration are dependant on specific model definitions, but can +# be overridden for debug purposes +# - HL_MOE_NUM_EXPERTS +# - HL_NUM_LAYERS +# - HL_SEQ_LEN +# - HL_GBS +# - HL_TRAIN_ITERS + +# ---------------------------------------------------------------------------- +# Verify supported configuration + +if [ $PARTITIONED_MODE -ne 'false' ]; then + echo "Currently PipelineEngine does not support partitioning of 2+ outputs from MoE; Configured with HL_PARTITIONED_MODE=${HL_PARTITIONED_MODE}" + exit 1 +fi + +if [[ $MOE_ENABLE_EXPERT_TP -eq 0 && $TP -ne 1 ]]; then + echo "When using TP, MOE must also be configured with TP" + exit 1 +fi + +if [ $UNIV_CP -ne 0 ]; then + echo "No support for loading from universal checkpoint; Configured with HL_UNIV_CP=${HL_UNIV_CP}" + exit 1 +fi + +if [ $VERIFY_CP -ne 0 ]; then + echo "No support for checkpoint verification; Configured with HL_VERIFY_CP=${HL_VERIFY_CP}" + exit 1 +fi + +NUM_DEVICES=$(($DP * $TP * $PP)) +NUM_DEVICES_2=$(($DEVICES_PER_NODE * $NUM_NODES)) +if [ $NUM_DEVICES -ne $NUM_DEVICES_2 ]; then + echo "Bad devices configuration. DPxTPxPP=${NUM_DEVICES} != N_NODES*N_DEVICES_PER_NODE=${NUM_DEVICES_2}" + exit 1 +fi + +# ---------------------------------------------------------------------------- +# Mixtral architecture + +if [ $MIXTRAL_MODEL == "8x7b" ]; then + # Mixtral-8x7B model architecture + MOE_NUM_EXPERTS=${HL_MOE_NUM_EXPERTS:-8} + N_LAYERS=${HL_NUM_LAYERS:-32} + SEQ_LEN=${HL_SEQ_LEN:-32768} + NHIDDEN=4096 + FFN_HIDDEN_SIZE=14336 + NHEADS=32 + NUM_KV_HEADS=8 + LR=3e-4 + MIN_LR=3e-6 # using 0.01 of max-lr (DeepSpeed-MoE https://arxiv.org/pdf/2201.05596.pdf section 3.2) +elif [ $MIXTRAL_MODEL == "small" ]; then + MOE_NUM_EXPERTS=${HL_MOE_NUM_EXPERTS:-4} + N_LAYERS=${HL_NUM_LAYERS:-8} + SEQ_LEN=${HL_SEQ_LEN:-256} + NHIDDEN=768 + FFN_HIDDEN_SIZE=3072 + NHEADS=16 + NUM_KV_HEADS=8 + LR=3e-4 + MIN_LR=3e-6 +else + echo "Unsupported HL_MIXTRAL_MODEL=$MIXTRAL_MODEL" + exit 1 +fi + +if [ -z "${MOE_EP}" ]; then + if [[ $MOE_NUM_EXPERTS -gt $NUM_DEVICES ]]; then + MOE_EP=${NUM_DEVICES} + else + MOE_EP=${MOE_NUM_EXPERTS} + fi +fi +echo "Using Num Experts=${MOE_NUM_EXPERTS} with MoE EP=${MOE_EP}" + +# ---------------------------------------------------------------------------- +# Training configuration: Mixtral paper has no details on training regime. +# Therefore using LLAMA1 regime. 
+# So, assuming LLAMA1 regime with few exceptions: +# - seq_len = 32768 +# - smaller min_lr + +TOKENS_IN_BATCH=$((2 ** 22)) # 4M tokens +CALCULATED_GBS=$(($TOKENS_IN_BATCH / $SEQ_LEN)) +GLOBAL_BATCH=${HL_GBS:-$CALCULATED_GBS} +TOTAL_TOKENS=$((250000 * $TOKENS_IN_BATCH)) # ~1T tokens + +# ---------------------------------------------------------------------------- +# PATHs + +if [[ -z "$MEGATRON_DEEPSPEED_ROOT" ]]; then + MEGATRON_DEEPSPEED_ROOT=$(realpath $(dirname $0)/../) +fi + +DATA_PATH=${DATA_DIR}/${DATA_FILE_PREFIX} + +RUNTIME=`date +"%Y%m%d_%H%M"` +# output paths +if [ -z "$OUTPUT_DIR" ]; then + # Experiment name + if [ -z "$EXP_NAME" ]; then + EXP_NAME="default" + fi + OUTPUT_DIR=${OUTPUT_DIR_PREFIX}/out/mixtral_${MIXTRAL_MODEL}/ds_${EXP_NAME}_z${ZERO_STAGE}_nl${N_LAYERS}_hs${NHIDDEN}_ffn${FFN_HIDDEN_SIZE}_moe_exp${MOE_NUM_EXPERTS}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}_sp${SEQ_PARALLEL}_D${DP}_T${TP}_P${PP}_E${MOE_EP}_moeT${MOE_ENABLE_EXPERT_TP}_devices${NUM_DEVICES}_${RUNTIME} +fi + +if [ -z "$CHECKPOINTS_DIR" ]; then + CHECKPOINTS_DIR=$OUTPUT_DIR/checkpoints +fi + +if [ -z "$TENSORBOARD_DIR" ]; then + TENSORBOARD_DIR=$OUTPUT_DIR/tensorboard +fi + +mkdir -p ${OUTPUT_DIR} +mkdir -p ${TENSORBOARD_DIR} + +# ---------------------------------------------------------------------------- +# Create DS config + +if [ $SEQ_PARALLEL -eq 1 ]; then + PARTITIONED_MODE="false" +fi + +# Currently, PipelineEngine does not support partitioning of 2+ outputs that +# require gradients). Therefore, disable partitioned mode if using pipeline +# and MoE experts +if [[ ${MOE_NUM_EXPERTS} -gt 1 ]] && [[ ${PP} -ne 1 ]]; then + PARTITIONED_MODE="false" +fi + +DS_CONFIG=${OUTPUT_DIR}/ds_config.json +cat << EOT > $DS_CONFIG +{ + "train_batch_size" : $GLOBAL_BATCH, + "train_micro_batch_size_per_gpu": $MICRO_BATCH, + "steps_per_print": $LOG_INTERVAL, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": $ZERO_STAGE + }, + "zero_allow_untested_optimizer": true, + "bf16": { + "enabled": true, + "immediate_grad_update": true + }, + "fp16": {"enabled": false}, + "wall_clock_breakdown": false, + "pipeline": { + "pipe_partitioned": $PARTITIONED_MODE, + "grad_partitioned": $PARTITIONED_MODE + }, + "use_data_before_expert_parallelism": $MOE_USE_DATA_BEFORE_EXPERT_PARALLEL +} +EOT + +# ---------------------------------------------------------------------------- +# Create command + +# configure multi-node +MULTINODE_CMD="" +if [ "$NUM_NODES" -ne "1" -a -f "$HOSTSFILE" ]; then + MULTINODE_CMD="--hostfile=$HOSTSFILE \ + --master_addr $(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p) " +fi + +# training script command +CMD="" +if [ ! 
-z "$QNPU_DIR" ]; then + CMD="source ${QNPU_DIR}/activate ;" +fi + +if [ $USE_LAZY_MODE -eq 0 ]; then + CMD="${CMD} PT_HPU_LAZY_MODE=0" +else + LOWER_CASE_USE_TORCH_COMPILE=$(echo "$USE_TORCH_COMPILE" | tr '[:upper:]' '[:lower:]') + if [[ "$LOWER_CASE_USE_TORCH_COMPILE" == "true" || "$LOWER_CASE_USE_TORCH_COMPILE" == "1" ]]; then + echo "Cannot use lazy(HL_USE_LAZY_MODE) and torch.compile(HL_USE_TORCH_COMPILE) modes together" + exit 1 + fi +fi + +CMD="${CMD} \ + python3 -u ${MEGATRON_DEEPSPEED_ROOT}/pretrain_gpt.py \ + --bf16 \ + --deepspeed \ + --tensor-model-parallel-size ${TP} \ + --pipeline-model-parallel-size ${PP} \ + --num-layers ${N_LAYERS} \ + --hidden-size ${NHIDDEN} \ + --ffn-hidden-size ${FFN_HIDDEN_SIZE} \ + --num-attention-heads ${NHEADS} \ + --num-key-value-heads ${NUM_KV_HEADS} \ + --seq-length ${SEQ_LEN} \ + --micro-batch-size ${MICRO_BATCH} \ + --global-batch-size ${GLOBAL_BATCH} \ + --train-iters ${TRAIN_ITERS} \ + --log-interval ${LOG_INTERVAL} \ + --eval-iters ${EVAL_ITERS} \ + --eval-interval ${EVAL_INTERVAL} \ + --data-path ${DATA_PATH} \ + --optimizer ${OPTIMIZER} \ + --adam-beta1 0.9 \ + --adam-beta2 0.95 \ + --adam-eps 1e-8 \ + --lr ${LR} \ + --min-lr ${MIN_LR} \ + --lr-decay-style cosine \ + --lr-warmup-iters ${LR_WARMUP_ITERS} \ + --clip-grad 1.0 \ + --weight-decay 0.1 \ + --tensorboard-dir ${TENSORBOARD_DIR} \ + --log-validation-ppl-to-tensorboard \ + --log-batch-size-to-tensorboard \ + --log-timers-to-tensorboard \ + --load ${CHECKPOINTS_DIR} \ + --deepspeed_config=${DS_CONFIG} \ + --use-torch-compile=${USE_TORCH_COMPILE} \ + --zero-stage=${ZERO_STAGE} \ + --exit-interval ${EXIT_INTERVAL} \ + --no-masked-softmax-fusion \ + --no-bias-gelu-fusion \ + --no-bias-dropout-fusion \ + --no-gradient-accumulation-fusion \ + --max-position-embeddings ${SEQ_LEN} \ + --use-rotary-position-embeddings \ + --rotary-position-embeddings-theta 1000000 \ + --untie-embeddings-and-output-weights \ + --swiglu \ + --normalization rmsnorm \ + --disable-bias-linear \ + --no-query-key-layer-scaling \ + --attention-dropout ${DROPOUT} \ + --hidden-dropout ${DROPOUT} \ + --use-fused-sdpa ${USE_FUSED_SDPA} \ + --use-fused-sdpa-with-recompute ${USE_FUSED_SDPA_WITH_RECOMPUTE} \ + --use-fused-rmsnorm $USE_FUSED_RMSNORM" + +# ------------- +# MoE arguments +# ------------- +MOE_ARGS=" \ + --num-experts ${MOE_NUM_EXPERTS} \ + --moe-expert-parallel-size ${MOE_EP} \ + --topk 2 \ + --disable-moe-token-dropping \ + --moe-loss-coeff 0.02 \ + --expert-interval 1 \ + --moe-train-capacity-factor 1.0 \ + --moe-eval-capacity-factor 1.0 \ + --moe-min-capacity ${MOE_MIN_CAP} \ + " + +if [[ $MOE_NUM_EXPERTS -gt 1 ]]; then + MOE_ARGS="${MOE_ARGS} --create-moe-param-group" +fi + +if [[ $MOE_ENABLE_EXPERT_TP -gt 0 ]]; then + MOE_ARGS="${MOE_ARGS} --enable-expert-tensor-parallelism" +fi + +# --------------------------- +# MoE Capacity Bins arguments +# --------------------------- + +MOE_CAPACITY_BINS_ARGS=" \ + --moe-num-capacity-bins ${MOE_NUM_CAPACITY_BINS} \ + --moe-capacity-bins-exp-base ${MOE_CAPACITY_BINS_EXP_BASE} \ + --moe-capacity-bins-alignment ${MOE_CAPACITY_BINS_ALIGNMENT} \ + --moe-capacity-bins-optimize-interval ${MOE_CAPACITY_BINS_OPTIMIZE_INTERVAL} \ + --moe-capacity-bins-optimize-max-group ${MOE_CAPACITY_BINS_OPTIMIZE_MAX_GROUP} \ + " + +if [ ! 
-z "$MOE_CAPACITY_BINS" ]; then + MOE_CAPACITY_BINS_ARGS="${MOE_CAPACITY_BINS_ARGS} --moe-capacity-bins ${MOE_CAPACITY_BINS}" +fi + +if [[ $MOE_NUM_CAPACITY_BINS -gt 0 ]]; then + MOE_ARGS="${MOE_ARGS} ${MOE_CAPACITY_BINS_ARGS}" +fi + +CMD="${CMD} ${MOE_ARGS}" + +# --------------------------- +# FP8 arguments +# --------------------------- +if [ $USE_TRANSFORMER_ENGINE -eq 1 ]; then + CMD="${CMD} --transformer-impl transformer_engine" + + if [ $USE_CACHE_FP8_WEIGHT -eq 1 ]; then + CMD="${CMD} --cache-fp8-weight" + fi + + FP8_MEASURE_INTERVAL=${HL_FP8_MEASURE_INTERVAL:-$(( GLOBAL_BATCH / MICRO_BATCH / DP ))} + FP8_AMAX_HISTORY_LEN=${HL_FP8_AMAX_HISTORY_LEN:-$(( GLOBAL_BATCH / MICRO_BATCH / DP ))} + FP8_AMAX_REDUCE=${HL_FP8_AMAX_REDUCE:-1} + + CMD="${CMD} --cache-fp8-weight-fwd $USE_CACHE_FP8_WEIGHT_FWD" + CMD="${CMD} --fp8-interval $FP8_MEASURE_INTERVAL" + CMD="${CMD} --fp8-margin $FP8_MARGIN" + CMD="${CMD} --fp8-amax-compute-algo $FP8_AMAX_RECOMPUTE_ALGO" + CMD="${CMD} --fp8-amax-history-len $FP8_AMAX_HISTORY_LEN" + + if [ "$FP8_FORMAT" = "e5m2" ]; then + CMD="${CMD} --fp8-e5m2" + else + CMD="${CMD} --fp8-hybrid" + fi + + if [ $FP8_AMAX_REDUCE -eq 1 ]; then + CMD="${CMD} --fp8-amax-reduce" + fi +fi + +# --------------------------- +# Additonal arguments +# --------------------------- + +if [ "$TOKENIZER_TYPE" = "GPTSentencePieceTokenizer" ]; then + CMD="${CMD} --tokenizer-type GPTSentencePieceTokenizer" + if [[ -z "$TOKENIZER_MODEL" ]]; then + TOKENIZER_MODEL="${DATA_DIR}/tokenizer.model" + fi + CMD="${CMD} --tokenizer-model $TOKENIZER_MODEL" +elif [ "$TOKENIZER_TYPE" = "GPT2BPETokenizer" ]; then + CMD="${CMD} --tokenizer-type GPT2BPETokenizer" + CMD="${CMD} --vocab-file $DATA_DIR/gpt2-vocab.json" + CMD="${CMD} --merge-file $DATA_DIR/gpt2-merges.txt" +else + echo "incorrect HL_TOKENIZER_TYPE=$TOKENIZER_TYPE is set" + exit 1 +fi + +if [ ! -z "$CHECKPOINT_LOAD_TAG" ]; then + CMD="${CMD} --load-tag ${CHECKPOINT_LOAD_TAG}" +fi + +if [ ! -z "$KILL_SWITCH_FILE" ]; then + CMD="${CMD} --kill-switch-path $KILL_SWITCH_FILE" +fi + +if [ ! -z "$DATA_CACHE_DIR" ]; then + CMD="${CMD} --data-cache-path ${DATA_CACHE_DIR}" +fi + +if [ $SEQ_PARALLEL -eq 1 ]; then + CMD="${CMD} --sequence-parallel" +fi + +if [ $NO_PIPELINE_PARALLEL -eq 1 ]; then + CMD="${CMD} --no-pipeline-parallel" +fi + +if [ $UNIV_CP -eq 1 ]; then + echo "Loading Universal Checkpoint from ${CHECKPOINTS_DIR}" + CMD="${CMD} --universal-checkpoint" +fi + +if [ $CHECKPOINT_SAVE -eq 1 ]; then + mkdir -p ${CHECKPOINTS_DIR} + CMD="${CMD} --save $CHECKPOINTS_DIR --save-interval $SAVE_INTERVAL" + + if [ $VERIFY_CP -eq 1 ]; then + # TODO: can we use LLaMA model type to verify Mixtral? + CMD="${CMD} --verify-checkpoint --verify-checkpoint-model-type LLAMA" + fi +fi + +if [ $CKP_ACT -eq 1 ]; then + CMD="${CMD} --deepspeed-activation-checkpointing --recompute-granularity=full --recompute-method uniform" +elif [ $CKP_ACT -eq 2 ]; then + CMD="${CMD} --deepspeed-activation-checkpointing --recompute-granularity=selective" +fi + +if [ ! -z "$PROFILE" ]; then + CMD="${CMD} --profile ${PROFILE}" + CMD="${CMD} --profile-steps ${PROFILE_STEPS}" +fi + +if [ ! -z "$QNPU_DIR" ]; then + rm -rf $HOME/.deepspeed_env + echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> $HOME/.deepspeed_env +fi + +# run! 
+deepspeed --num_nodes ${NUM_NODES} \ + --num_gpus ${DEVICES_PER_NODE} \ + --no_local_rank \ + --no_python \ + $MULTINODE_CMD \ + /usr/bin/bash -c "$CMD" #2>&1 | tee ${OUTPUT_DIR}/log_${RUNTIME}.txt diff --git a/tasks/ckp_utils.py b/tasks/ckp_utils.py new file mode 100644 index 0000000000..f796d3e812 --- /dev/null +++ b/tasks/ckp_utils.py @@ -0,0 +1,165 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + +import torch +import megatron +import deepspeed +from functools import partial +from megatron.arguments import parse_args +from megatron.global_vars import get_args, set_args +from megatron.initialize import initialize_megatron +from megatron.training import setup_model_and_optimizer +from megatron.core.enums import ModelType +from deepspeed.checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint +from pretrain_gpt import model_provider + + +def allow_loading_from_hpu_checkpoint(): + """ Allows loading a checkpoint trained on HPU to a system without HPU software stack""" + def my_rebuild(data, dtype, device, requires_grad): + device = 'cpu' if 'hpu' in device else device + tensor = torch.from_numpy(data).to(dtype=dtype, device=device) + tensor.requires_grad = requires_grad + return tensor + + torch._utils._rebuild_device_tensor_from_numpy = my_rebuild + + +def override_args(base_args, override, skip_keys, skip_if_specified_keys): + for k, v in vars(override).items(): + if k in skip_keys: + continue + if k in skip_if_specified_keys and getattr(base_args, k) is not None: + continue + setattr(base_args, k, v) + + +def parse_args_and_setup_megatron(extra_args_provider, pre_init_megatron_fn=None): + """ Sets up the arguments and initializes the megatron + + Below note was copied from eval_harness/evaluate.py from method load_ds_checkpoint_and_setup_megatron() + + Note(Hesslow): + The model loading is a bit convoluted. + We want to parse out the model arguments from the checkpoint and use those to initialize megatron-ds. + + However, megatron-ds expects its arguments on the command line. And at that point we don't know them. + + Instead, we use Jasons way: we load the arguments form the checkpoint and then override _parse_args to + return whatever args we want. + + If the checkpoint is old, some new arguments may have been introduced and the code will expect these arguments to + exist. In order to support this we _first_ parse the arguments normally, and then override them with the arguments + from the checkpoint. Keeping the default-value of newer arguments. + """ + + # avoid printing the arguments, since they will later be overridden. + _print_args = megatron.arguments._print_args + megatron.arguments._print_args = lambda *_args, **kwarg: None + + # parse the megatorn args, but wait with initializing megatron as they will be overridden later. 
+ args = parse_args(extra_args_provider) + + # we set below values as we don't validate args + args.sequence_parallel = False + model_parallel_size = args.tensor_model_parallel_size * args.pipeline_model_parallel_size + args.data_parallel_size = args.world_size // model_parallel_size + args.eval_micro_batch_size = args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = args.micro_batch_size * args.data_parallel_size + if args.weight_decay_incr_style == 'constant': + assert args.start_weight_decay is None + assert args.end_weight_decay is None + args.start_weight_decay = args.weight_decay + args.end_weight_decay = args.weight_decay + args.curriculum_learning_legacy = False + + # load DeepSpeed checkpoint + ds_checkpoint = DeepSpeedCheckpoint(args.load, + tp_degree=args.tensor_model_parallel_size, + pp_degree=args.pipeline_model_parallel_size, + dp_degree=args.data_parallel_size) + + # Merge the current args with the checkpoint args. + cp_args = ds_checkpoint.get_args() + + # update arguments due to name difference from ckpt + old_to_new_arg = {"apply_layernorm_weight_plus_one": "apply_layernorm_1p"} + for key in old_to_new_arg.keys(): + if hasattr(cp_args, key): + setattr(args, old_to_new_arg[key], getattr(cp_args, key)) + + skip_keys = ['world_size', 'rank', 'local_rank', 'device_count', 'micro_batch_size', 'global_batch_size', + 'batch_size', 'tensorboard_dir', 'deepspeed', 'deepspeed_config', 'deepspeed_configuration', + 'data_parallel_size', 'pipeline_model_parallel_size', 'tensor_model_parallel_size', + 'moe_expert_parallel_size', 'moe_token_dropping', 'load', 'rampup_batch_size', 'iteration', + 'inference', 'bias_dropout_fusion', 'masked_softmax_fusion', 'bias_dropout_fusion', + 'gradient_accumulation_fusion', 'fp16', 'bf16', 'use_seq_len_plus_one_tokens', 'log_interval', + 'seq_length', 'max_position_embeddings', 'encoder_seq_length', 'distributed_backend', 'device', + 'recompute_granularity', 'deepspeed_activation_checkpointing', 'eval_micro_batch_size', 'random_ltd', + 'use_fused_sdpa', 'use_fused_rmsnorm', 'tokenizer_model', 'attention_dropout', 'hidden_dropout', + 'attention_softmax_in_fp32', 'eval_hf_rope', 'sequence_parallel', 'eval_add_bos'] + + skip_if_specified = ['merge_file', 'vocab_file'] + + # allow special handling before arguments override + if pre_init_megatron_fn is not None: + pre_init_megatron_fn(args, cp_args, skip_keys, skip_if_specified) + + override_args(args, cp_args, skip_keys, skip_if_specified) + + # stop megatron from reparsing the arguments. + set_args(args) + initialize_megatron(allow_parsing=False, allow_validating_args=False) + torch.distributed.barrier() + + # Initializing megatron will update eg. tokenizer size. Override again. + override_args(args, cp_args, skip_keys, skip_if_specified) + + # Create minimal deepspeed configuration + if args.deepspeed_config is None: + args.deepspeed_config_dict = { + 'train_batch_size': args.global_batch_size, + 'train_micro_batch_size_per_gpu': args.micro_batch_size, + 'bf16': {'enabled': args.bf16}, + 'fp16': {'enabled': args.fp16}, + 'zero_optimization': {'stage': 0}, + } + + # print final arguments. 
+    _print_args("arguments", args)
+
+    return args
+
+
+def load_ds_model(parallel_output=True):
+    args = get_args()
+    assert args.deepspeed, "load_ds_model() only supports DeepSpeed models"
+
+    # Loading pipelined models in deepspeed with different TP than it was originally trained on fails
+    # due to a sanity check, that makes sure that all state_dicts that we merge contain attention layers.
+    # This, however, is not true for pipelining when we will merge the state_dict for the embeddings,
+    # which does not contain these attention-specific keys.
+    # Deepspeed does however manage to load the model if we just turn off this sanity check.
+    deepspeed.runtime.state_dict_factory.MegatronSDLoader.sanity_check = lambda self, ckpt_file_name: None
+
+    if args.deepspeed_config is None:
+        args.deepspeed_config_dict = {
+            'train_batch_size': args.global_batch_size,
+            'train_micro_batch_size_per_gpu': args.micro_batch_size,
+            'bf16': {'enabled': args.bf16},
+            'fp16': {'enabled': args.fp16},
+            'zero_optimization': {'stage': 0},
+        }
+
+    cp_path = args.load
+    args.load = None
+    model_provider_ = partial(model_provider, parallel_output=parallel_output)
+    model, _, _ = setup_model_and_optimizer(model_provider_, ModelType.encoder_or_decoder)
+    model = model[0]
+    zero_enabled = model._config.zero_enabled
+    model._config.zero_enabled = False
+    _, _ = model.load_checkpoint(cp_path, tag=args.load_tag, load_optimizer_states=False,
+                                 load_lr_scheduler_states=False, load_module_only=True)
+    model._config.zero_enabled = zero_enabled
+
+    return model
diff --git a/tasks/eval_harness/README.md b/tasks/eval_harness/README.md
new file mode 100644
index 0000000000..42b0bbdb25
--- /dev/null
+++ b/tasks/eval_harness/README.md
@@ -0,0 +1,23 @@
+# How to run lm-eval on a Megatron-DeepSpeed checkpoint
+* Follow the setup instructions [here](../../README.md#setup)
+
+# Run MDS Eval Harness
+
+The example below shows how to run the eval harness for a LLaMA model.
+Set `num_gpus, PP, TP, seq_length, tokenizer_model, MBS, GBS, load, load_tag, task` appropriately for the trained Megatron-DeepSpeed checkpoint and its location. To match the way lm-eval computes results on a HuggingFace checkpoint when using a Megatron-DeepSpeed checkpoint, add the `--attention-softmax-in-fp32`, `--eval-add-bos` and `--eval-hf-rope` command line arguments.
+
+```bash
+deepspeed --num_gpus num_gpus $MEGATRON_DEEPSPEED_ROOT/tasks/eval_harness/evaluate.py --pipeline-model-parallel-size PP --tensor-model-parallel-size TP --seq-length seq_length --tokenizer-model /path/to/tokenizer.model --micro-batch-size MBS --global-batch-size GBS --no-load-optim --no-load-rng --no-gradient-accumulation-fusion --bf16 --deepspeed --load /path/to/checkpoint --load-tag /path/to/folder/in/checkpoint/location --inference --eval_fp32 --adaptive_seq_len --use-fused-sdpa 0 --eval-add-bos --task_list task
+```
+
+# How to run lm-eval on a Hugging Face checkpoint
+* Follow the instructions on how to set up Optimum for Intel Gaudi [here](https://github.com/huggingface/optimum-habana/tree/main?tab=readme-ov-file#gaudi-setup)
+* Follow the instructions on how to set up lm-eval [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#lm-eval-requirements)
+
+# Run Eval Harness on Optimum for Intel Gaudi to compare
+* Set `num_nodes, num_gpus, BS, task` to match the desired running configuration.
+* You can choose the number of buckets and their values, but the maximum model sequence length must be accommodated by the buckets.
+ +```bash +python -m deepspeed.launcher.runner --num_nodes num_nodes --num_gpus num_gpus --no_local_rank examples/text-generation/run_lm_eval.py --model_name_or_path /path/to/converted/model --batch_size BS --tasks task -o results.txt --warmup 0 --buckets 16 32 64 128 max_model_sequence_length +``` diff --git a/tasks/eval_harness/evaluate.py b/tasks/eval_harness/evaluate.py index 860d3cf016..741f79a99b 100644 --- a/tasks/eval_harness/evaluate.py +++ b/tasks/eval_harness/evaluate.py @@ -1,12 +1,36 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + # This code is originally from https://github.com/bigscience-workshop/Megatron-DeepSpeed # under the license https://huggingface.co/spaces/bigscience/license -from functools import reduce -from logging import logMultiprocessing +from functools import partial import os import sys +import json sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), - os.path.pardir,os.path.pardir))) + os.path.pardir, os.path.pardir))) + +os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "1" + +import multiprocessing as mp +import psutil + +# This hack is a workaround to limitations of lm_eval which always allocates +# mp.Pool with max cpu count which explodes on multinode scenarios and for hpu +# create multiprocess with spawn context +OrigPool = mp.Pool +def LimitedSpawnPool(_): + spawn_context = mp.get_context("spawn") + physical_cpu_count = psutil.cpu_count(logical=False) + pool_size = physical_cpu_count + world_size = int(os.getenv("WORLD_SIZE", 1)) + if world_size == 0: + world_size = 1 + pool_size //= world_size + if (pool_size * world_size) != physical_cpu_count: + pool_size -= 1 + return spawn_context.Pool(pool_size) +mp.Pool = LimitedSpawnPool from lm_eval.models.gpt2 import GPT2LM from lm_eval import evaluator, tasks, utils @@ -21,23 +45,22 @@ import torch from megatron import get_args -from megatron import print_rank_0 from megatron import get_tokenizer -from megatron.core.enums import ModelType from megatron.core import mpu -from megatron.training import setup_model_and_optimizer, get_model -from megatron.core.tensor_parallel.mappings import gather_from_tensor_model_parallel_region +from megatron.training import get_model from megatron.utils import get_ltor_masks_and_position_ids, unwrap_model from megatron.p2p_communication import recv_forward, send_forward -import pickle -import json from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP from megatron.model.distributed import DistributedDataParallel as LocalDDP from megatron.model.module import Float16Module -from deepspeed.runtime.pipe import schedule +from tools.convert_checkpoint.deepspeed_to_megatron import _create_rank_checkpoint from deepspeed.accelerator import get_accelerator +from deepspeed.checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint + +from tasks.ckp_utils import allow_loading_from_hpu_checkpoint, parse_args_and_setup_megatron, load_ds_model + class EvalHarnessAdaptor(GPT2LM): def __init__(self, model, tokenizer): @@ -52,7 +75,8 @@ def __init__(self, model, tokenizer): # For ds we split into mini batches and then micro batches to keep pipelining api happy. 
# With Megatron we just go to micro_batches directly - self._batch_size = args.micro_batch_size + self._batch_size = args.global_batch_size if (args.deepspeed and not args.no_pipeline_parallel) \ + else args.micro_batch_size self.cache_hook = CacheHook(None) self.is_main = args.rank == 0 @@ -80,16 +104,26 @@ def device(self): return self._device + def _encode_pair(self, context, continuation): + n_spaces = len(context) - len(context.rstrip()) + if n_spaces > 0: + continuation = context[-n_spaces:] + continuation + context = context[:-n_spaces] + whole_enc = self.tokenizer_encode(context + continuation) + context_enc = self.tokenizer_encode(context) + context_enc_len = len(context_enc) + continuation_enc = whole_enc[context_enc_len:] + return context_enc, continuation_enc + def loglikelihood(self, requests): new_reqs = [] for context, continuation in requests: if context == "": # end of text as context context_enc = [self.EOT_TOKEN_ID] + continuation_enc = self.tokenizer_encode(continuation) else: - context_enc = self.tokenizer_encode(context) - - continuation_enc = self.tokenizer_encode(continuation) + context_enc, continuation_enc = self._encode_pair(context, continuation) new_reqs.append(((context, continuation), context_enc, continuation_enc)) @@ -162,6 +196,10 @@ def _collate(x): logits = self._model_call(torch.cat(inps, dim=0)) res_len += len(chunk) if logits is not None: + # GptModel/GptModelPipe transpose batch and seq dims. + # They transpose back at loss fn, which we replace. Therefore, transpose back. + logits = logits.transpose(0, 1) + multi_logits = F.log_softmax(logits, dim=-1).cpu() # [batch, seq, vocab] for (cache_key, _, _), logits, inp, inplen, cont_toks in zip(chunk, multi_logits, inps, inplens, contlens): @@ -244,14 +282,15 @@ def _model_call(self, inps): self.model.micro_batches = len(data_iterator) output = self.model.eval_batch(iter(data_iterator), compute_loss = False, reduce_output = None) - if output is not None: - output = torch.cat(output, 0)[:len(inps)] + output = torch.cat(output, 1) else: output = None # hack #2 for adaptive_seq_len to work as total_loss gets appended to and shapes aren't the same + # in addition, need to reset pipeline activation shapes if args.adaptive_seq_len: + self.model.reset_activation_shape() self.model.total_loss = None else: # Since the shape of the micro-batch will change @@ -271,127 +310,39 @@ def _model_call(self, inps): output = self.model(*self.create_model_inputs(inps)[0]) send_forward(output) - if mpu.is_pipeline_last_stage(): - return gather_from_tensor_model_parallel_region(output)[..., :self.tokenizer.vocab_size] - else: - return None + return output if mpu.is_pipeline_last_stage() else None def tokenizer_encode(self, text): """Tokenize text *without* adding special tokens.""" # Splitting this into its own method in case we need to handle special cases for different tokenizers from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer - if isinstance(self.tokenizer.tokenizer, GPT2Tokenizer): + from megatron.tokenizer.tokenizer import _GPTSentencePieceTokenizer + if isinstance(self.tokenizer.tokenizer, GPT2Tokenizer) \ + or isinstance(self.tokenizer, _GPTSentencePieceTokenizer): return self.tokenizer.tokenizer.encode(text) else: return self.tokenizer.tokenizer.encode(text, add_special_tokens=False) +def load_non_ds_model(): + args = get_args() + assert not args.deepspeed, "setup_non_ds_model() does not support DeepSpeed models" -from megatron.initialize import initialize_megatron -import megatron - -from 
tools.convert_checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint -from tools.convert_checkpoint.deepspeed_to_megatron import _create_rank_checkpoint - -def override_args(args, override_args, skip_keys, skip_if_specified_keys): - for k, v in vars(override_args).items(): - if k in skip_keys: - continue - if k in skip_if_specified_keys and getattr(args, k) is not None: - continue - setattr(args, k, v) - - -# Note(Hesslow): -# The model loading is a bit convoluted. -# We want to parse out the model arguments from the checkpoint and use those to initialize megatron-ds. -# -# However megatron-ds expects its arguments on the command line. -# And at that point we don't know them. -# -# Instead we use Jasons way: we load the arguments form the checkpoint and then override _parse_args to return whatever args we want. -# -# If the checkpoint is old, some new arguments may have been introduced and the code will expect these arguments to exist. -# In order to support this we _first_ parse the arguments normally, and then override them with the arguments from the checkpoint. -# Keeping the default-value of newer arguments. -# -# We then use the megatron deepspeed converter to load the deepspeed checkpoints as if they we're megatron checkpoints. -def load_ds_checkpoint_and_setup_megatron(extra_args_provider): - # parse the megatorn args. But wait with initalizing megatron. - # avoid printing the arguments, since they will later be overridden. - _print_args = megatron.arguments._print_args - megatron.arguments._print_args = lambda *_args, **kwarg: None - args = parse_args(extra_args_provider=extra_args_provider) - + # Initialize megatron model using the parsed state dict. + model_provider_ = partial(model_provider, parallel_output=False) + model = get_model(model_provider_)[0] ds_checkpoint = DeepSpeedCheckpoint(args.load, tp_degree=args.tensor_model_parallel_size, pp_degree=args.pipeline_model_parallel_size, - no_pp=args.no_pipeline_parallel) - - - cp_args = ds_checkpoint.get_args() - # Merge the current args with the checkpoint args. - skip_keys = ['world_size', 'rank', 'local_rank','device_count', 'micro_batch_size','global_batch_size', 'batch_size', 'tensorboard_dir', 'deepspeed', 'deepspeed_config', - 'data_parallel_size', 'pipeline_model_parallel_size', 'tensor_model_parallel_size', 'moe_expert_parallel_size', 'moe_token_dropping', 'load', 'rampup_batch_size', 'iteration', 'inference', 'random_ltd'] - - skip_if_specified = ['merge_file', 'vocab_file'] - - if args.eval_fp32: - cp_args.fp16 = False - cp_args.bf16 = False - cp_args.params_dtype = torch.float32 - - cp_args.tokenizer_type = 'GPT2BPETokenizer' - - override_args(args, cp_args, skip_keys, skip_if_specified) + dp_degree=args.data_parallel_size) + sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), + mpu.get_pipeline_model_parallel_rank(), True) - # stop megatron from reparsing the arguments. - megatron.arguments.parse_args = lambda *_args, **kwarg: args - megatron.global_vars._ensure_var_is_not_initialized = lambda *_args, **kwarg: None - megatron.global_vars._GLOBAL_ARGS = args - - initialize_megatron(extra_args_provider=extra_args_provider) - megatron.global_vars._GLOBAL_ARGS = args - torch.distributed.barrier() - - # Initializing megatron will update eg. tokenizer size. Override again. - override_args(args, cp_args, skip_keys, skip_if_specified) - - # print final arguments. 
- _print_args("eval_harness arguments", args) - if args.deepspeed: - - # Hack #3: - # Loading pipelined models in deepspeed with different TP than it was originally trained on fails - # due to a sanity check, that makes sure that all state_dicts that we merge contains attention layers. - # This, however, is not true for pipelining when we will merge the state_dict for the embeddings which - # which does not contain these attention-specific keys. - # - # Deepspeed does however manage to load the model if we just turn off this sanity check. - import deepspeed - deepspeed.runtime.state_dict_factory.MegatronSDLoader.sanity_check = lambda self, ckpt_file_name: None - - - cp_path = args.load - args.load = None - model, _, _ = setup_model_and_optimizer(model_provider, ModelType.encoder_or_decoder) - model = model[0] - zero_enabled = model._config.zero_enabled - model._config.zero_enabled = False - _, _ = model.load_checkpoint(cp_path, tag = '.', load_optimizer_states=False, load_lr_scheduler_states=False, load_module_only=True) - model._config.zero_enabled = zero_enabled - else: - model = get_model(model_provider)[0] - # Initialize megatron model using the parsed state dict. - sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True) - - model.load_state_dict(sd['model'], strict=True) - - if args.eval_fp32: - model = model.float() + model.load_state_dict(sd['model'], strict=True) torch.distributed.barrier() return model + def tasks_args(parser): """Provide extra arguments required for tasks.""" group = parser.add_argument_group(title='Evaluation options') @@ -401,19 +352,39 @@ def tasks_args(parser): help='Should the sequence length be adapted to the batch during evaluation, if in fp16 the results will be slightly different due to numerical errors but greatly speed up evaluation.') group.add_argument('--num_fewshot', type=int, default = 0, help='Number of few-shot prompts.') group.add_argument('--eval_fp32', default = False, action='store_true', help='Should the evaluation run in fp32') + group.add_argument('--num_iters', type=int, default = 0, help='Number of few-shot prompts.') return parser -from megatron.arguments import parse_args def main(): start = time.time() - model = load_ds_checkpoint_and_setup_megatron(extra_args_provider=tasks_args) + allow_loading_from_hpu_checkpoint() + + def pre_init_megatron_fn(args_, cp_args, _skip_keys, _skip_if_specified): + if args_.eval_fp32: + cp_args.fp16 = False + cp_args.bf16 = False + cp_args.params_dtype = torch.float32 + args_.fp16 = False + args_.bf16 = False + args_.params_dtype = torch.float32 + + parse_args_and_setup_megatron(extra_args_provider=tasks_args, pre_init_megatron_fn=pre_init_megatron_fn) args = get_args() + + if args.deepspeed: + model = load_ds_model(parallel_output=False) + else: + model = load_non_ds_model() + + if args.eval_fp32: + model = model.float() + if args.deepspeed and args.adaptive_seq_len: # adaptive_seq_len hack #1: # CL automatically enables reset_activation_shape() which allows us to change input shapes - # and it also reshapes the attenion scores in attention_mask_func + # and it also reshapes the attention scores in attention_mask_func args.curriculum_learning_legacy = 1 task_list = ALL_TASKS if args.task_list == 'all' else args.task_list.split(',') @@ -425,14 +396,17 @@ def main(): tokenizer = get_tokenizer() adaptor = EvalHarnessAdaptor(model, tokenizer) - results = evaluator.evaluate(adaptor, task_dict, False, args.num_fewshot, None) + 
results = evaluator.evaluate(adaptor, task_dict, False, args.num_fewshot, + limit=None if args.num_iters == 0 else args.num_iters) if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0: print(json.dumps(results, indent=2)) with open(args.results_path, 'w') as outfile: json.dump(results, outfile, indent = 4) + end = time.time() print("evaluation of {} ends in {:.2f} sec, or {:.2f} min, or {:.2f} hr".format(args.task_list, end-start, (end-start)/60.0, (end-start)/3600.0)) + if __name__ == '__main__': main() diff --git a/tasks/main.py b/tasks/main.py index 9bc38f5fd2..e20568e32d 100644 --- a/tasks/main.py +++ b/tasks/main.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Main tasks functionality.""" @@ -50,10 +51,6 @@ def get_tasks_args(parser): help='Number of blocks to use as top-k during retrieval') # finetune for retriever - group.add_argument('--eval-micro-batch-size', type=int, default=None, - help='Eval Batch size per model instance (local batch ' - 'size). Global batch size is local batch size ' - 'times data parallel size.') group.add_argument('--train-with-neg', action='store_true', help='Whether to use negative examples during model ' 'training') @@ -72,6 +69,10 @@ def get_tasks_args(parser): ' take from each question pool') + # checkpoint manipulations + group.add_argument('--checkpoint-override-tokenizer', action='store_true', + help='If set, override checkpoint tokenizer information with current args') + return parser diff --git a/tasks/main_3d.py b/tasks/main_3d.py new file mode 100644 index 0000000000..8014d00010 --- /dev/null +++ b/tasks/main_3d.py @@ -0,0 +1,42 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# coding=utf-8 +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Main 3D parallel tasks functionality.""" + +import os +import sys +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), + os.path.pardir))) +from main import get_tasks_args +from tasks.ckp_utils import allow_loading_from_hpu_checkpoint, parse_args_and_setup_megatron + + +if __name__ == '__main__': + allow_loading_from_hpu_checkpoint() + + def pre_init_megatron_fn(args_, _cp_args, skip_keys, _skip_if_specified): + if args_.checkpoint_override_tokenizer: + skip_keys += ['merge_file', 'tokenizer_model', 'tokenizer_type', + 'vocab_extra_ids', 'vocab_file'] + + args = parse_args_and_setup_megatron(get_tasks_args, pre_init_megatron_fn) + + if args.task in ['LAMBADA', 'WIKITEXT103']: + from zeroshot_gpt.evaluate import main + else: + raise NotImplementedError(f'Task {args.task} is not implemented.') + + main() diff --git a/tasks/zeroshot_gpt/evaluate.py b/tasks/zeroshot_gpt/evaluate.py index 3f136a2687..43ded61d8b 100644 --- a/tasks/zeroshot_gpt/evaluate.py +++ b/tasks/zeroshot_gpt/evaluate.py @@ -1,7 +1,9 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """GPT zero-shot evaluation.""" +import functools import math import torch @@ -24,6 +26,8 @@ from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP from megatron.model import DistributedDataParallel as LocalDDP from megatron.model import Float16Module +from tasks.ckp_utils import load_ds_model + def get_model_provider(eval_metric): """Based on evaluation metric set the parallel-output flag and @@ -61,7 +65,7 @@ def process_batch(batch): labels = tokens_[:, 1:].contiguous() tokens = tokens_[:, :-1].contiguous() - # Get the masks and postition ids. + # Get the masks and position ids. attention_mask, _, position_ids = get_ltor_masks_and_position_ids( tokens, tokenizer.eod, @@ -72,6 +76,33 @@ def process_batch(batch): return tokens, labels, attention_mask, position_ids, loss_mask +def calculate_metric(output, labels, loss_mask, eval_metric): + # For loss, return the unreduced loss. + if eval_metric == 'loss': + if output.shape != labels.shape: + labels = labels.transpose(0, 1) + losses = tensor_parallel.vocab_parallel_cross_entropy( + output.contiguous().float(), labels.contiguous()) + if losses.shape != loss_mask.shape: + losses = losses.transpose(0, 1).contiguous() + loss = torch.sum( + losses.view(-1) * loss_mask.contiguous().view(-1).float()) + return loss + + # For accuracy, return the number of correctly predicted samples. + if eval_metric == 'accuracy': + outputs = torch.argmax(output, -1) + if outputs.shape != labels.shape: + outputs = outputs.transpose(0, 1).contiguous() + correct = (outputs == labels).float() + correct[(1 - loss_mask).bool()] = 1 + correct = correct.prod(-1) + return correct.sum() + + raise NotImplementedError('calculate_metric method for evaluation metric {} ' + 'is not implemented.'.format(eval_metric)) + + def forward_step(batch, model, eval_metric): """Forward step.""" @@ -94,27 +125,86 @@ def forward_step(batch, model, eval_metric): send_forward(output) if parallel_state.is_pipeline_last_stage(): - # For loss, return the unreduced loss. - if eval_metric == 'loss': - losses = tensor_parallel.vocab_parallel_cross_entropy( - output.contiguous().float(), labels.contiguous()) - loss = torch.sum( - losses.view(-1) * loss_mask.contiguous().view(-1).float()) - return loss - - # For accuracy, return the number of correctly predicted samples. 
- if eval_metric == 'accuracy': - outputs = torch.argmax(output, -1) - correct = (outputs == labels).float() - correct[(1 - loss_mask).bool()] = 1 - correct = correct.prod(-1) - return correct.sum() - - raise NotImplementedError('forward method for evaluation metric {} ' - 'is not implemented.'.format(eval_metric)) + return calculate_metric(output, labels, loss_mask, eval_metric) return None +class PeekableIterator: + def __init__(self, iterator): + self.iterator = iterator + self.next_item = None + + def __iter__(self): + return self + + def __next__(self): + if self.next_item is not None: + item = self.next_item + self.next_item = None + return item + else: + return next(self.iterator) + + def peek(self): + if self.next_item is None: + self.next_item = next(self.iterator) + return self.next_item + + +def evaluate_3d(loader, model, eval_metric, do_print): + args = get_args() + + total_iters = len(loader) + peekable_loader = PeekableIterator(iter(loader)) + + dp_world_size = parallel_state.get_data_parallel_world_size() + + total_output, total_tokens = 0.0, 0 + last_batch_size = loader.batch_size + for i in range(total_iters): + batch = peekable_loader.peek() + + # We create the data_loader with drop_last=False + # This can cause the last batch to be smaller than loader.batch_size + # However, Megatron caches the size of the batch + # Therefore, we detect that the current batch size has changed and reset the cache + # In addition, Pipeline model engine calculates total_loss aggregated over micro batches. + # However, total_loss has no meaning for eval, yet being calculated. + # Reset total_loss to avoid above similar batch size issue + batch_size = batch['text'].shape[0] + if batch_size != last_batch_size: + model.reset_activation_shape() + tensor_parallel.data.reset_cached_broadcast_sizes() + model.total_loss = None + last_batch_size = batch_size + + output = model.eval_batch(peekable_loader, compute_loss=False, reduce_output=None) + + # output logits are available only on last stage pipeline workers + if parallel_state.is_pipeline_last_stage(): + output = torch.cat(output) + + _, labels, _, _, loss_mask = process_batch(batch) + + res = calculate_metric(output, labels, loss_mask, eval_metric) + total_output += res + total_tokens += loss_mask.view(-1).eq(1).sum() + + # Average loss across DP + # HCCL does not support torch.distributed.ReduceOp.AVG + torch.distributed.all_reduce(total_output, + group=parallel_state.get_data_parallel_group(), + op=torch.distributed.ReduceOp.SUM) + total_output = total_output / dp_world_size + + if do_print and (i+1) % args.log_interval == 0: + avg_metric = total_output / total_tokens + print(f'Iteration: {i+1}: avg_{eval_metric}={avg_metric}') + + loss = total_output * dp_world_size + return loss + + def evaluate(data_loader, model, eval_metric): """Evaluation.""" args = get_args() @@ -141,14 +231,22 @@ def evaluate(data_loader, model, eval_metric): return total_output -def evaluate_and_print_results(task, data_loader, model, eval_metric): +def evaluate_and_print_results(task, data_loader, model, eval_metric, using_3d): """Evaluate and print results on screen.""" # Evaluate and get results. 
- output = evaluate(data_loader, model, eval_metric) + if using_3d: + # only a single last stage worker will print + do_print = parallel_state.is_pipeline_last_stage() \ + and (parallel_state.get_data_parallel_rank() == 0) \ + and (parallel_state.get_tensor_model_parallel_rank() == 0) + output = evaluate_3d(data_loader, model, eval_metric, do_print) + else: + do_print = is_last_rank() + output = evaluate(data_loader, model, eval_metric) string = ' validation results on {} | '.format(task) - if is_last_rank(): + if do_print: if eval_metric == 'loss': num_tokenized_tokens = data_loader.dataset.num_tokenized_tokens num_original_tokens = data_loader.dataset.num_original_tokens @@ -195,19 +293,22 @@ def main(): args.task)) # Set up model and load checkpoint. - model = get_model(get_model_provider(eval_metric), wrap_with_ddp=False) + if args.deepspeed: + parallel_output = (eval_metric == 'loss') + model = load_ds_model(parallel_output=parallel_output) + else: + model = get_model(get_model_provider(eval_metric), wrap_with_ddp=False) + assert len(model) == 1, "Above condition should have caught this" + model = model[0] if args.load is not None: _ = load_checkpoint(model, None, None) - assert len(model) == 1, "Above condition should have caught this" - model = model[0] - # Data stuff. dataset = build_dataset(args.task) dataloader = build_data_loader(dataset, args.micro_batch_size, args.num_workers, drop_last=False) # Run evaluation. - evaluate_and_print_results(args.task, dataloader, model, eval_metric) + evaluate_and_print_results(args.task, dataloader, model, eval_metric, using_3d=args.deepspeed) print_rank_0('done :-)') diff --git a/tests/conftest.py b/tests/conftest.py index f711e58a27..18cc4d1c52 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,22 +1,64 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# import os import pytest +from deepspeed.accelerator import get_accelerator + +# from megatron import initialize_megatron from megatron.core import parallel_state from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed from megatron.core.transformer.transformer_config import TransformerConfig +import torch + +world_size = 1 +rank = 0 +torch.distributed.init_process_group( + backend=get_accelerator().communication_backend_name(), + world_size=world_size, rank=rank) + +tp_world_size = world_size +pp_world_size = world_size +assert world_size == (tp_world_size * pp_world_size) + # initialize model parallel for tests -parallel_state.set_tensor_model_parallel_world_size(1) -parallel_state.set_tensor_model_parallel_rank(0) -parallel_state._set_global_memory_buffer() -parallel_state.set_pipeline_model_parallel_rank(0) -parallel_state.set_pipeline_model_parallel_world_size(1) +parallel_state.set_tensor_model_parallel_world_size(tp_world_size) +parallel_state.set_tensor_model_parallel_rank(rank) +# parallel_state._set_global_memory_buffer() +parallel_state.set_pipeline_model_parallel_world_size(pp_world_size) +parallel_state.set_pipeline_model_parallel_rank(rank) +parallel_state.initialize_model_parallel() model_parallel_cuda_manual_seed(123) +num_layers = 2 +hidden_size = 12 +num_attention_heads = 4 +use_cpu_initialization = True +# seq_len = 16 +# tokenizer_type = 'GPT2BPETokenizer' +# data_dir = os.getenv("HL_DATA_DIR_ROOT", "") + +# external_args = {} +# external_args.update({"micro_batch_size": 1}) +# external_args.update({"num_layers": num_layers}) +# external_args.update({"hidden_size": hidden_size}) +# external_args.update({"num_attention_heads": num_attention_heads}) +# external_args.update({"seq_length": seq_len}) +# external_args.update({"max_position_embeddings": seq_len}) +# external_args.update({'tokenizer_type': tokenizer_type}) +# external_args.update({'vocab_file': os.path.join(data_dir, "vocab.json")}) +# external_args.update({'merge_file': os.path.join(data_dir, "merges.txt")}) + +# initialize_megatron(ignore_unknown_args=True, external_args=external_args) + @pytest.fixture def transformer_config(): - return TransformerConfig(num_layers=2, hidden_size=12, num_attention_heads=4, use_cpu_initialization=True) + print(f"transformer_config") + return TransformerConfig(num_layers=num_layers, hidden_size=hidden_size, + num_attention_heads=num_attention_heads, + use_cpu_initialization=use_cpu_initialization) diff --git a/tests/functional_tests/python_test_utils/test_ci_pipeline.py b/tests/functional_tests/python_test_utils/test_ci_pipeline.py index 829ebeec41..1324515b1a 100644 --- a/tests/functional_tests/python_test_utils/test_ci_pipeline.py +++ b/tests/functional_tests/python_test_utils/test_ci_pipeline.py @@ -6,7 +6,7 @@ from tensorboard.backend.event_processing import event_accumulator LOGS_DIR = os.getenv('LOGS_DIR') -EXPECTED_METRICS_FILE = os.getenv('EXPECTED_METRICS_FILE') +EXPECTED_METRICS_FILE = os.getenv('EXPECTED_METRICS_FILE', "") import enum diff --git a/tests/functional_tests/python_test_utils/test_resume_checkpoint_pipeline.py b/tests/functional_tests/python_test_utils/test_resume_checkpoint_pipeline.py index 5d3e69d123..7cf741e0d5 100644 --- a/tests/functional_tests/python_test_utils/test_resume_checkpoint_pipeline.py +++ b/tests/functional_tests/python_test_utils/test_resume_checkpoint_pipeline.py @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+ import os import sys import json @@ -34,8 +36,9 @@ def collect_train_test_metrics(logs_dir, index): class TestCIPipeline: - train_metrics_100 = collect_train_test_metrics(LOGS_DIR, 0) - train_metrics_50_to_100 = collect_train_test_metrics(LOGS_DIR, 1) + if LOGS_DIR: + train_metrics_100 = collect_train_test_metrics(LOGS_DIR, 0) + train_metrics_50_to_100 = collect_train_test_metrics(LOGS_DIR, 1) def _test_helper(self, loss_type): expected = self.train_metrics_100[loss_type] @@ -52,4 +55,5 @@ def _test_helper(self, loss_type): assert actual[i] == expected[start_idx_expected + i], f"The value at step {i} should be {expected[start_idx_expected + i]} but it is {actual[i]}." def test_lm_loss_deterministic(self): - self._test_helper("lm loss") \ No newline at end of file + if LOGS_DIR: + self._test_helper("lm loss") diff --git a/tests/models/test_gpt_embedding.py b/tests/models/test_gpt_embedding.py index 700990adc2..199f29dede 100644 --- a/tests/models/test_gpt_embedding.py +++ b/tests/models/test_gpt_embedding.py @@ -1,15 +1,22 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. import pytest import torch +import types from megatron.core.transformer.transformer_config import TransformerConfig from megatron.core.models.gpt.gpt_embedding import GPTEmbedding +from megatron.global_vars import set_args +from deepspeed.accelerator import get_accelerator +device_name = get_accelerator().device_name() @pytest.fixture def gpt_embedding(transformer_config): + args = types.SimpleNamespace(params_dtype=torch.float32, embed_layernorm=False) + set_args(args) embedding = GPTEmbedding(config=transformer_config, vocab_size=100, max_sequence_length=4) return embedding @@ -36,12 +43,12 @@ def test_cpu_forward(self, gpt_embedding: GPTEmbedding): assert embeddings.shape[1] == input_ids.shape[0] assert embeddings.shape[2] == gpt_embedding.config.hidden_size - def test_gpu_forward(self, gpt_embedding: GPTEmbedding): - gpt_embedding.cuda() - input_ids = torch.tensor([0, 1, 2, 3], dtype=torch.int64).repeat((2, 1)).cuda() - position_ids = torch.tensor([0, 1, 2, 3], dtype=torch.int64).repeat((2, 1)).cuda() + def test_accelerator_forward(self, gpt_embedding: GPTEmbedding): + gpt_embedding.to(device_name) + input_ids = torch.tensor([0, 1, 2, 3], dtype=torch.int64).repeat((2, 1)).to(device_name) + position_ids = torch.tensor([0, 1, 2, 3], dtype=torch.int64).repeat((2, 1)).to(device_name) embeddings = gpt_embedding(input_ids, position_ids) - assert embeddings.device.type == 'cuda' + assert embeddings.device.type == device_name assert embeddings.shape[0] == gpt_embedding.max_sequence_length assert embeddings.shape[1] == input_ids.shape[0] assert embeddings.shape[2] == gpt_embedding.config.hidden_size diff --git a/tests/models/test_gpt_model.py b/tests/models/test_gpt_model.py index b854ecd918..cf322908b3 100644 --- a/tests/models/test_gpt_model.py +++ b/tests/models/test_gpt_model.py @@ -1,20 +1,28 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
import pytest import torch +import types from megatron.core.transformer.transformer_config import TransformerConfig from megatron.core.models.gpt.gpt_model import GPTModel +from megatron.global_vars import set_args +from deepspeed.accelerator import get_accelerator +device_name = get_accelerator().device_name() @pytest.fixture def gpt_model(transformer_config): + args = types.SimpleNamespace(params_dtype=torch.float32, embed_layernorm=False) + set_args(args) language_model = GPTModel(config=transformer_config, vocab_size=100, max_sequence_length=4) return language_model class TestGPTModel: + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_constructor(self, gpt_model: GPTModel): assert isinstance(gpt_model, GPTModel) @@ -23,6 +31,7 @@ def test_constructor(self, gpt_model: GPTModel): num_weights = sum([p.numel() for p in gpt_model.parameters()]) assert num_weights == 5040 + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_set_input_tensor(self, gpt_model: GPTModel): config: TransformerConfig = gpt_model.config sequence_length = gpt_model.max_sequence_length @@ -37,17 +46,18 @@ def test_set_input_tensor(self, gpt_model: GPTModel): assert gpt_model.decoder.input_tensor.shape[1] == micro_batch_size assert gpt_model.decoder.input_tensor.shape[2] == config.hidden_size + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_post_process_forward(self, gpt_model: GPTModel): config: TransformerConfig = gpt_model.config sequence_length = gpt_model.max_sequence_length micro_batch_size = 2 - gpt_model.cuda() + gpt_model.to(device_name) data = list(range(sequence_length)) - input_ids = torch.tensor(data, dtype=torch.int64).repeat((micro_batch_size, 1)).cuda() - position_ids = torch.tensor(data, dtype=torch.int64).repeat((micro_batch_size, 1)).cuda() - attention_mask = torch.ones((1, 1, sequence_length, sequence_length), dtype=bool).cuda() + input_ids = torch.tensor(data, dtype=torch.int64).repeat((micro_batch_size, 1)).to(device_name) + position_ids = torch.tensor(data, dtype=torch.int64).repeat((micro_batch_size, 1)).to(device_name) + attention_mask = torch.ones((1, 1, sequence_length, sequence_length), dtype=bool).to(device_name) logits = gpt_model.forward(input_ids=input_ids, position_ids=position_ids, attention_mask=attention_mask) @@ -55,15 +65,19 @@ def test_post_process_forward(self, gpt_model: GPTModel): assert logits.shape[1] == sequence_length assert logits.shape[2] == gpt_model.vocab_size + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_no_post_process_forward(self, gpt_model: GPTModel): pass + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_no_preprocess_forward(self, gpt_model: GPTModel): pass + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_state_dict_for_save_checkpoint(self, gpt_model: GPTModel): pass + @pytest.mark.xfail(device_name=='hpu', reason="TELayerNorm is not defined in HPU") def test_load_state_dict(self, gpt_model: GPTModel): pass diff --git a/tests/old_tests/ds_config_bf16.json b/tests/old_tests/ds_config_bf16.json new file mode 100644 index 0000000000..6afd1f6b2e --- /dev/null +++ b/tests/old_tests/ds_config_bf16.json @@ -0,0 +1,14 @@ +{ + "train_micro_batch_size_per_gpu": 1, + "train_batch_size": 16, + "gradient_clipping": 1.0, + "zero_optimization": { + "stage": 0 + }, + "bf16": { + "enabled": true + }, + 
"zero_allow_untested_optimizer": true, + "steps_per_print": 2000, + "wall_clock_breakdown": false +} diff --git a/tests/old_tests/test_checkpoints.py b/tests/old_tests/test_checkpoints.py new file mode 100644 index 0000000000..ad1bd6207d --- /dev/null +++ b/tests/old_tests/test_checkpoints.py @@ -0,0 +1,440 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# Copyright 2020 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +import os +import pytest +import shutil + +from parameterized import parameterized +from testing_utils import ( + CaptureStdout, + TestCasePlus, + execute_subprocess_async, + get_accelerator_count, + require_deepspeed, + require_torch_accelerator, + require_torch_multi_accelerator, + set_seed +) + +set_seed(42) + + +def parameterized_custom_name_func(func, param_num, param): + # customize the test name generator function as we want both params to appear in the subtest + # name, as by default it shows only the first param + param_based_name = parameterized.to_safe_name("_to_".join(str(x) for x in param.args)) + return f"{func.__name__}_{param_based_name}" + + +params = [ + # TP_PP_DP + ["1_1_1", "1_1_1"], + ["2_1_1", "1_1_1"], + ["1_2_1", "1_1_1"], + ["1_1_2", "1_1_1"], + + ["1_1_1", "2_1_1"], + ["1_1_1", "1_2_1"], + ["1_1_1", "1_1_2"], + + ["1_1_2", "1_1_2"], + ["1_1_2", "2_1_1"], + ["1_1_2", "1_2_1"], + + ["1_2_1", "1_2_1"], + ["1_2_1", "2_1_1"], + ["1_2_1", "1_1_2"], + + ["2_1_1", "2_1_1"], + ["2_1_1", "1_2_1"], + ["2_1_1", "1_1_2"], + + ["2_2_2", "1_1_1"], + ["2_2_2", "2_2_2"], + ["1_1_1", "2_2_2"], + + ["1_1_8", "2_2_2"], + +] + + +def get_launcher(num_accelerators): + # 1. 
explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup + # - it won't be able to handle that + return f"deepspeed --num_nodes 1 --num_gpus {num_accelerators}".split() + + +@require_deepspeed +@require_torch_accelerator +class MegDSTestCheckpoints(TestCasePlus): + """ """ + + def setUp(self): + super().setUp() + + # at times magatron fails to build kernels and doesn't remove the lock file, which makes + # subsequent runs hang - so make sure there is no lock when starting the testing + meg_lock_file_path = self.repo_root_dir_str + "/megatron/fused_kernels/build/lock" + if os.path.exists(meg_lock_file_path): + os.unlink(meg_lock_file_path) + + @staticmethod + def find_lines_with_pattern_in_buffer(buffer, pattern): + lines = buffer.splitlines() + res = [] + for line in lines: + if line.find(pattern) != -1: + res.append(line) + return res + + def get_config(self, output_dir, tp_size, pp_size, dp_size, n_iters=None, + exit_interval=None, save_interval= None, skip_train=False, + use_bloom=False): + + data_dir = os.getenv("HL_DATA_DIR_ROOT", "") + if data_dir == "": + data_dir = f"{self.data_dir}/gpt2" + + num_accelerators = pp_size * tp_size * dp_size + print(f"Using {num_accelerators} Accelerators") + + n_iters = 8 if n_iters is None else n_iters + exit_interval = n_iters // 2 if exit_interval is None else exit_interval + save_interval = 1 if save_interval is None else save_interval + seq_len = 8 + + # common/shared configs + + ds_args = f""" + --deepspeed + --deepspeed_config {self.test_file_dir_str}/ds_config_bf16.json + --zero-stage 0 + --deepspeed-activation-checkpointing + """.split() + + args = f""" + --tensor-model-parallel-size {tp_size} + --pipeline-model-parallel-size {pp_size} + --distributed-backend hccl + + --log-interval 1 + --save-interval {save_interval} + --eval-interval 10 + --eval-iters 1 + --exit-interval {exit_interval} + + --merge-file {data_dir}/merges.txt + --vocab-file {data_dir}/vocab.json + --data-path {data_dir}/c4_en_6_c4_spm_text_document + + --split 99,0,1 + --save {output_dir}/checkpoints + --load {output_dir}/checkpoints + + --num-layers 2 + --hidden-size 8 + --num-attention-heads 2 + --seq-length {seq_len} + --max-position-embeddings 8 + --micro-batch-size 1 + --global-batch-size 16 + --train-iters {n_iters} + + --recompute-granularity=full + --recompute-method=uniform + --partition-activations + + --optimizer adam + --adam-beta1 0.9 + --adam-beta2 0.95 + --adam-eps 1e-8 + --lr 1e-4 + --lr-warmup-iters 1 + --lr-decay-iters 6 + --clip-grad 1.0 + --weight-decay 1e-1 + --bf16 + --no-gradient-accumulation-fusion + """ + + # removed below args to speedup test + _ = f""" + --tensorboard-dir {output_dir}/tensorboard + --tensorboard-queue-size 5 + --log-timers-to-tensorboard + --log-batch-size-to-tensorboard + --log-validation-ppl-to-tensorboard + """ + + if skip_train: + args += "--skip-train" + + args = args.split() + + if use_bloom: + bloom_args = f""" + --embed-layernorm + --use-alibi-position-embeddings + --use-fused-sdpa 0 + """.split() + args.extend(bloom_args) + + return args, ds_args, num_accelerators + + def train_checkpoint(self, output_dir, tp_size=1, pp_size=1, dp_size=1, + n_iters=None, exit_interval=None, save_interval=None, + skip_train=False, use_bloom=False): + src_dir = self.src_dir + script = [f"{src_dir}/pretrain_gpt.py"] + + args, ds_args, num_accelerators = self.get_config(output_dir, tp_size, pp_size, dp_size, + n_iters=n_iters, exit_interval=exit_interval, + save_interval=save_interval, + skip_train=skip_train, 
use_bloom=use_bloom) + launcher = get_launcher(num_accelerators) + cmd = launcher + script + args + ds_args + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die + + # 1. test training from scratch (no checkpoint) + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test deepspeed is running + self.assertIn("DeepSpeed info", cs.out) + + # test reports + self.assertIn("consumed samples", cs.out) + + # test there should be no checkpoint this round + self.assertIn(f"Unable to find latest file at {output_dir}/checkpoints/latest", cs.out) + + # test checkpoint saving + self.assertIn("successfully saved checkpoint at iteration", cs.out) + return cs.out + + def convert_checkpoint_to_universal(self, output_dir, step): + DEEPSPEED_ROOT = os.getenv("DEEPSPEED_FORK_ROOT", "") + if DEEPSPEED_ROOT == "": + assert False, "please set DEEPSPEED_FORK_ROOT to deepspeed path" + cmd = f""" + python {DEEPSPEED_ROOT}/deepspeed/checkpoint/ds_to_universal.py + --input_folder {output_dir}/checkpoints/global_step{step} + --output_folder {output_dir}/checkpoints/global_step{step}_universal + """.split() + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die + + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + self.assertIn("Convert DeepSpeed Checkpoint to Universal Checkpoint", cs.out) + + def resume_from_checkpoint(self, output_dir, tp_size=1, pp_size=1, dp_size=1): + src_dir = self.src_dir + script = [f"{src_dir}/pretrain_gpt.py"] + + args, ds_args, num_accelerators = self.get_config(output_dir, tp_size, pp_size, dp_size) + launcher = get_launcher(num_accelerators) + cmd = launcher + script + args + ds_args + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die + + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test checkpoint loading + self.assertIn(f"successfully loaded checkpoint from {output_dir}/checkpoints", cs.out) + + # test reports + self.assertIn("consumed samples", cs.out) + + # test checkpoint saving + self.assertIn("successfully saved checkpoint at iteration", cs.out) + return cs.out + + def resume_from_universal_checkpoint(self, output_dir, tp_size=1, pp_size=1, dp_size=1, + n_iters=None, exit_interval=None, save_interval=None, + skip_train=False, use_bloom=False): + src_dir = self.src_dir + script = [f"{src_dir}/pretrain_gpt.py"] + + args, ds_args, num_accelerators = self.get_config(output_dir, tp_size, pp_size, dp_size, + n_iters=n_iters, exit_interval=exit_interval, + save_interval=save_interval, + skip_train=skip_train, use_bloom=use_bloom) + launcher = get_launcher(num_accelerators) + extra_args = ["--universal-checkpoint"] + if skip_train: + extra_args.append("--skip-train") + + cmd = launcher + script + args + ds_args + extra_args + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die + + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test checkpoint loading + self.assertIn(f"successfully loaded checkpoint from {output_dir}/checkpoints", cs.out) + + # test reports + if not skip_train: + self.assertIn("consumed samples", cs.out) + + # test checkpoint saving + self.assertIn("successfully saved checkpoint at iteration", cs.out) + return cs.out + + @staticmethod + def copy_checkpoint(src_ckp_root, dst_ckp_root, ckp_name, is_universal=False): + src_root = os.path.join(src_ckp_root, 'checkpoints') + 
dst_root = os.path.join(dst_ckp_root, 'checkpoints') + os.makedirs(dst_root, exist_ok=True) + src_folder = os.path.join(src_root, ckp_name) + dst_folder = os.path.join(dst_root, ckp_name) + shutil.copytree(src=src_folder, dst=dst_folder) + latest_filename = 'latest_universal' if is_universal else 'latest' + dst_latest = os.path.join(dst_root, latest_filename) + with open(dst_latest, "w") as f: + f.write(ckp_name) + + @require_torch_multi_accelerator + @parameterized.expand(params, name_func=parameterized_custom_name_func) + def _test_checkpoint_reshaping_main(self, src, tgt): + # this test needs at least 2 accelerators - if there are more accelerators it will do more extensive testing + + tp_size_src, pp_size_src, dp_size_src = list(map(int, src.split('_'))) + tp_size_tgt, pp_size_tgt, dp_size_tgt = list(map(int, tgt.split('_'))) + + n_accelerators = get_accelerator_count() + n_accelerators_src = tp_size_src * pp_size_src * dp_size_src + n_accelerators_tgt = tp_size_tgt * pp_size_tgt * dp_size_tgt + + if n_accelerators_src > n_accelerators: + pytest.skip(f"the test requires {n_accelerators_src} accelerators for source topology but have only {n_accelerators}") + if n_accelerators_tgt > n_accelerators: + pytest.skip(f"the test requires {n_accelerators_tgt} accelerators for target topology but have only {n_accelerators}") + + output_dir = self.get_auto_remove_tmp_dir("./xxx", after=False) + + # 1. train with initial topology defined in the first arg of params + self.train_checkpoint(output_dir, tp_size=tp_size_src, pp_size=pp_size_src, dp_size=dp_size_src) + + # 2. convert checkpoint to universal checkpoint (topology ) + self.convert_checkpoint_to_universal(output_dir=output_dir, step=1) + + # 3. check we can resume training from a reshaped checkpoint to the target topology - the last arg of params + self.resume_from_universal_checkpoint(output_dir, tp_size=tp_size_tgt, pp_size=pp_size_tgt, dp_size=dp_size_tgt) + + @require_torch_multi_accelerator + def _test_checkpoint_reshaping_empty_dir(self): + + output_dir = self.get_auto_remove_tmp_dir() + with self.assertRaises(RuntimeError): + self.convert_checkpoint_to_universal(output_dir=output_dir, step=1) + + @require_torch_multi_accelerator + @parameterized.expand([True, False]) + def test_checkpoint_reshaping_2x2x2_to_2x2x1_to_2x2x2(self, use_bloom): + # this test needs at least 8 accelerators + + tp_size_src, pp_size_src, dp_size_src = 2, 2, 2 + tp_size_tgt, pp_size_tgt, dp_size_tgt = 2, 2, 1 + + n_accelerators = get_accelerator_count() + n_accelerators_src = tp_size_src * pp_size_src * dp_size_src + n_accelerators_tgt = tp_size_tgt * pp_size_tgt * dp_size_tgt + n_required_accelerators = max(n_accelerators_src, n_accelerators_tgt) + if n_required_accelerators > n_accelerators: + pytest.skip(f"the test requires {n_required_accelerators} accelerators but have only {n_accelerators}") + + root_dir = self.get_auto_remove_tmp_dir(after=True) + output_2x2x2_dir = os.path.join(root_dir, 'topo_2x2x2') + output_2x2x1_dir = os.path.join(root_dir, 'topo_2x2x1') + output_2x2x2_final_dir = os.path.join(root_dir, 'topo_2x2x2_final') + + total_n_iters = 20 + checkpoint_iter = total_n_iters // 2 + + # 1. 
train with initial 2x2x2 topology + out = self.train_checkpoint(output_2x2x2_dir, + tp_size=tp_size_src, + pp_size=pp_size_src, + dp_size=dp_size_src, + n_iters=total_n_iters, + exit_interval=total_n_iters + 1, + save_interval=checkpoint_iter, + use_bloom=use_bloom) + + try: + orig_2x2x2_test_loss = float(re.search( + 'test set \| lm loss value: (\d+\.\d+E+\++\d+)', out).group(1)) + except AttributeError: + assert False, 'Not found test set loss in original 2x2x2 training' + + # 2. convert 2x2x2 checkpoint to universal checkpoint + self.convert_checkpoint_to_universal(output_dir=output_2x2x2_dir, step=checkpoint_iter) + + # 3. copy 2x2x2 universal checkpoint (step 10) to 2x2x1 + univ_ckp_name = f'global_step{checkpoint_iter}_universal' + self.copy_checkpoint(src_ckp_root=output_2x2x2_dir, + dst_ckp_root=output_2x2x1_dir, + ckp_name=univ_ckp_name, + is_universal=True) + + # 3. use trainer to convert from universal to 2x2x1: + # 3.1. load universal checkpoint + # 3.1. skip actual training + # 3.1. save checkpoint for 2x2x1 topology + self.resume_from_universal_checkpoint(output_2x2x1_dir, + tp_size=tp_size_tgt, + pp_size=pp_size_tgt, + dp_size=dp_size_tgt, + n_iters=total_n_iters, + exit_interval=checkpoint_iter, + save_interval=total_n_iters, + skip_train=True, + use_bloom=use_bloom) + + # 4. copy 2x2x1 checkpoint (step 10) to 2x2x2_final + ckp_name = f'global_step{checkpoint_iter}' + self.copy_checkpoint(src_ckp_root=output_2x2x1_dir, + dst_ckp_root=output_2x2x2_final_dir, + ckp_name=ckp_name, + is_universal=False) + + # 5. convert 2x2x1 step 10 checkpoint to universal checkpoint + self.convert_checkpoint_to_universal(output_dir=output_2x2x2_final_dir, step=checkpoint_iter) + + # 6. Load from universal created from 2x2x1 and resume training till end + out = self.resume_from_universal_checkpoint(output_2x2x2_final_dir, + tp_size=tp_size_src, + pp_size=pp_size_src, + dp_size=dp_size_src, + n_iters=total_n_iters, + exit_interval=total_n_iters + 1, + save_interval=total_n_iters, + use_bloom=use_bloom) + try: + final_2x2x2_test_loss = float(re.search( + 'test set \| lm loss value: (\d+\.\d+E+\++\d+)', out).group(1)) + except AttributeError: + assert False, 'Not found test set loss in final 2x2x2 training' + + # 7. Verify same test loss for original training and final training + assert orig_2x2x2_test_loss == final_2x2x2_test_loss diff --git a/tests/old_tests/test_training.py b/tests/old_tests/test_training.py new file mode 100644 index 0000000000..f7aa712584 --- /dev/null +++ b/tests/old_tests/test_training.py @@ -0,0 +1,282 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# Copyright 2020 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
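+
+# These tests launch short pretrain_gpt.py runs through the deepspeed launcher, capture
+# stdout, and assert on log markers (e.g. "DeepSpeed info", "consumed samples",
+# "successfully saved checkpoint at iteration") to verify training from scratch, resuming
+# from a checkpoint, and the kill-switch path.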
+ +import os +import glob +import shutil +from parameterized import parameterized + +from testing_utils import ( + CaptureStdout, + CaptureStd, + TestCasePlus, + execute_subprocess_async, + get_accelerator_count, + require_deepspeed, + require_torch_accelerator, + set_seed +) + +set_seed(42) + + +def get_launcher(num_accelerators): + # 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup + # - it won't be able to handle that + return f"deepspeed --num_nodes 1 --num_gpus {num_accelerators}".split() + + +def get_3d_dimensions(): + num_accelerators = get_accelerator_count() + + # with fewer accelerators the preference is first to do PP>1, then TP>1, then DP>1 + if num_accelerators >= 8: + dp_size = 2 + pp_size = 2 + tp_size = 2 + elif num_accelerators >= 4: + dp_size = 1 + pp_size = 2 + tp_size = 2 + elif num_accelerators >= 2: + dp_size = 1 + pp_size = 2 + tp_size = 1 + else: + dp_size = 1 + pp_size = 1 + tp_size = 1 + + return pp_size, tp_size, dp_size + + +@require_deepspeed +@require_torch_accelerator +class MegDSTestTraining(TestCasePlus): + """ """ + + def setUp(self): + super().setUp() + + # at times magatron fails to build kernels and doesn't remove the lock file, which makes + # subsequent runs hang - so make sure there is no lock when starting the testing + meg_lock_file_path = self.repo_root_dir_str + "/megatron/fused_kernels/build/lock" + if os.path.exists(meg_lock_file_path): + os.unlink(meg_lock_file_path) + + def copy_data_to_temp(self, root_dir, prefix): + """copy data to temp, and return paths to temp version""" + src_path = os.path.join(root_dir, prefix) + src_dirname = os.path.dirname(src_path) + + tmp_dir = self.get_auto_remove_tmp_dir() + dest_path = os.path.join(tmp_dir, prefix) + dest_dirname = os.path.dirname(dest_path) + os.makedirs(dest_dirname, exist_ok=True) + for folder in os.listdir(src_dirname): + src_folder = os.path.join(src_dirname, folder) + dest_folder = os.path.join(dest_dirname, folder) + if src_folder.startswith(src_path): + if os.path.isdir(src_folder): + shutil.copytree(src_folder, dest_folder) + else: + shutil.copy2(src_folder, dest_folder) + return dest_path + + def get_variation_config(self, variation, output_dir, n_samples=None): + data_dir = os.getenv("HL_DATA_DIR_ROOT", "") + if data_dir == "": + data_dir = self.copy_data_to_temp(self.data_dir, "gpt2") + + pp_size, tp_size, dp_size = get_3d_dimensions() + num_accelerators = pp_size * tp_size * dp_size + print(f"Using {num_accelerators} Accelerators") + + if n_samples is None: + n_samples = 300 # about 56 iterations + + exit_interval = 20 # some samples in the first half and then some more in the 2nd half after resume + seq_len = 128 + + # common/shared configs + ds_args = f""" + --deepspeed + --deepspeed_config {self.test_file_dir_str}/ds_config_bf16.json + --zero-stage 1 + --deepspeed-activation-checkpointing + """.split() + + args = f""" + --tensor-model-parallel-size {tp_size} + --pipeline-model-parallel-size {pp_size} + --distributed-backend hccl + + --log-interval 1 + --save-interval 10 + --eval-interval 10 + --eval-iters 5 + --recompute-activations + --exit-interval {exit_interval} + + --merge-file {data_dir}/merges.txt + --vocab-file {data_dir}/vocab.json + --data-path {data_dir}/c4_en_6_c4_spm_text_document + + --save {output_dir}/checkpoints + --load {output_dir}/checkpoints + --tensorboard-dir {output_dir}/tensorboard + --tensorboard-queue-size 5 + --log-timers-to-tensorboard + --log-batch-size-to-tensorboard + --log-validation-ppl-to-tensorboard + + 
--num-layers 2 + --hidden-size 64 + --num-attention-heads 2 + --seq-length {seq_len} + --max-position-embeddings 1024 + --micro-batch-size 1 + --global-batch-size 16 + + --optimizer adamw + --adam-beta1 0.9 + --adam-beta2 0.95 + --adam-eps 1e-8 + --lr 1e-4 + --lr-warmup-samples 5 + --clip-grad 1.0 + --weight-decay 1e-1 + --bf16 + --no-gradient-accumulation-fusion + """.split() + # adam causes NaN and fails in group norm assert + + if variation == "base": + + new_args = f""" + --rampup-batch-size 2 2 {n_samples} + --train-samples {n_samples} + --lr-decay-samples 6 + """.split() + + new_ds_args = f""" + --deepspeed_config {self.test_file_dir_str}/ds_config_bf16.json + """.split() + elif variation == "alibi": + new_args = f""" + --rampup-batch-size 2 2 {n_samples} + --train-samples {n_samples} + --lr-decay-samples 6 + --use-alibi-position-embeddings + --use-fused-sdpa 0 + """.split() + + new_ds_args = [] + else: + raise ValueError(f"Don't know of variation {variation}") + + args.extend(new_args) + ds_args.extend(new_ds_args) + return args, ds_args, num_accelerators + + def test_kill_switch(self): + + variation = "base" + + src_dir = self.src_dir + output_dir = self.get_auto_remove_tmp_dir() # "./xxx", after=False) + kill_switch_path = os.path.join(output_dir, "kill-switch-xyz") + args, ds_args, num_accelerators = self.get_variation_config(variation, output_dir) + args += f"--kill-switch-path {kill_switch_path}".split() + + script = [f"{src_dir}/pretrain_gpt.py"] + launcher = get_launcher(num_accelerators) + + cmd = launcher + script + args + ds_args + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] +cmd)); die + + # 1. kill switch armed but not triggered + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test deepspeed is running + self.assertIn("DeepSpeed info", cs.out) + + # 2. trigger kill switch + open(kill_switch_path, "w") + with CaptureStd() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + self.assertIn(f"Detected kill switch at {kill_switch_path}", cs.out) + + # test deepspeed wasn't run + self.assertNotIn("DeepSpeed info", cs.out) + + @parameterized.expand(["base", "alibi"]) + def test_training_all(self, variation): + + # optional runs + # all in one test + src_dir = self.src_dir + output_dir = self.get_auto_remove_tmp_dir() + + args, ds_args, num_accelerators = self.get_variation_config(variation, output_dir) + + script = [f"{src_dir}/pretrain_gpt.py"] + launcher = get_launcher(num_accelerators) + + cmd = launcher + script + args + ds_args + # keep for quick debug + # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] +cmd)); die + + # 1. test training from scratch (no checkpoint) + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test deepspeed is running + self.assertIn("DeepSpeed info", cs.out) + + # test reports + self.assertIn("consumed samples", cs.out) + + # test there should be no checkpoint this round + self.assertIn(f"Unable to find latest file at {output_dir}/checkpoints/latest", cs.out) + + # test checkpoint saving + self.assertIn("successfully saved checkpoint at iteration", cs.out) + + # test tensorboard + tensorboard_files = glob.glob(f"{output_dir}/tensorboard/events*") + self.assertEqual(len(tensorboard_files), 1, "tensorboard files") + + # 2. 
test training from checkpoint: resume + # now do it again, this time resuming from the checkpoint + with CaptureStdout() as cs: + execute_subprocess_async(cmd, env=self.get_env()) + + # test checkpoint loading + self.assertIn(f"successfully loaded checkpoint from {output_dir}/checkpoints", cs.out) + + # test reports + self.assertIn("consumed samples", cs.out) + + # test checkpoint saving + self.assertIn("successfully saved checkpoint at iteration", cs.out) + + # test tensorboard (1 file from the first run, plus 1 now) + tensorboard_files = glob.glob(f"{output_dir}/tensorboard/events*") + self.assertEqual(len(tensorboard_files), 2, "tensorboard files") + diff --git a/tests/old_tests/testing_utils.py b/tests/old_tests/testing_utils.py new file mode 100644 index 0000000000..f33bd15488 --- /dev/null +++ b/tests/old_tests/testing_utils.py @@ -0,0 +1,888 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# Copyright 2020 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import contextlib +import importlib.util +import inspect +import logging +import numpy as np +import os +import random +import re +import shutil +import sys +import tempfile +import unittest + +from distutils.util import strtobool +from io import StringIO +from packaging import version +from pathlib import Path +from typing import Iterator, Union +from unittest import mock +from unittest.case import SkipTest +from deepspeed.accelerator import get_accelerator + + +try: + import torch + _torch_available = True +except: + _torch_available = False + + +try: + import datasets + _datasets_available = True +except: + _datasets_available = False + + +try: + import tensorflow + _tf_available = True +except: + _tf_available = False + +def is_tf_available(): + return _tf_available + + +def is_datasets_available(): + return _datasets_available + + +def is_torch_available(): + return _torch_available + + +def parse_flag_from_env(key, default=False): + try: + value = os.environ[key] + except KeyError: + # KEY isn't set, default to `default`. + _value = default + else: + # KEY is set, convert it to True or False. + try: + _value = strtobool(value) + except ValueError: + # More values are supported, but let's keep the message simple. + raise ValueError(f"If set, {key} must be yes or no.") + return _value + + +def parse_int_from_env(key, default=None): + try: + value = os.environ[key] + except KeyError: + _value = default + else: + try: + _value = int(value) + except ValueError: + raise ValueError(f"If set, {key} must be a int.") + return _value + + +def require_torch(test_case): + """ + Decorator marking a test that requires PyTorch. + + These tests are skipped when PyTorch isn't installed. + + """ + if not is_torch_available(): + return unittest.skip("test requires PyTorch")(test_case) + else: + return test_case + + +def require_torch_multi_accelerator(test_case): + """ + Decorator marking a test that requires a multi-accelerators setup (in PyTorch). 
These tests are skipped on a machine without + multiple Accelerators. + + To run *only* the multi_accelerator tests, assuming all test names contain multi_accelerator: $ pytest -sv ./tests/ -k "multi_accelerator" + """ + if not is_torch_available(): + return unittest.skip("test requires PyTorch")(test_case) + + if get_accelerator().device_count() < 2: + return unittest.skip("test requires multiple Accelerators")(test_case) + else: + return test_case + + +def require_torch_non_multi_accelerator(test_case): + """ + Decorator marking a test that requires 0 or 1 Accelerator setup (in PyTorch). + """ + if not is_torch_available(): + return unittest.skip("test requires PyTorch")(test_case) + + if get_accelerator().device_count() > 1: + return unittest.skip("test requires 0 or 1 Accelerator")(test_case) + else: + return test_case + + +def require_torch_up_to_2_accelerators(test_case): + """ + Decorator marking a test that requires 0 or 1 or 2 Accelerator setup (in PyTorch). + """ + if not is_torch_available(): + return unittest.skip("test requires PyTorch")(test_case) + + if get_accelerator().device_count() > 2: + return unittest.skip("test requires 0 or 1 or 2 Accelerators")(test_case) + else: + return test_case + + +if is_torch_available(): + # Set env var CUDA_VISIBLE_DEVICES="" to force cpu-mode + torch_device = get_accelerator().device_name() +else: + torch_device = None + + +def require_torch_accelerator(test_case): + """Decorator marking a test that requires Accelerator and PyTorch.""" + if torch_device == "cpu": + return unittest.skip("test requires Accelerator")(test_case) + else: + return test_case + + +def require_datasets(test_case): + """Decorator marking a test that requires datasets.""" + + if not is_datasets_available(): + return unittest.skip("test requires `datasets`")(test_case) + else: + return test_case + + +def is_deepspeed_available(): + return importlib.util.find_spec("deepspeed") is not None + + +def require_deepspeed(test_case): + """ + Decorator marking a test that requires deepspeed + """ + if not is_deepspeed_available(): + return unittest.skip("test requires deepspeed")(test_case) + else: + return test_case + + +def is_bnb_available(): + return importlib.util.find_spec("bitsandbytes") is not None + + +def require_bnb(test_case): + """ + Decorator marking a test that requires bitsandbytes + """ + if not is_bnb_available(): + return unittest.skip("test requires bitsandbytes from https://github.com/facebookresearch/bitsandbytes")(test_case) + else: + return test_case + + +def require_bnb_non_decorator(): + """ + Non-Decorator function that would skip a test if bitsandbytes is missing + """ + if not is_bnb_available(): + raise SkipTest("Test requires bitsandbytes from https://github.com/facebookresearch/bitsandbytes") + + +def set_seed(seed: int=42): + """ + Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` + + Args: + seed (:obj:`int`): The seed to set. 
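+ + Example (illustrative; any integer seed works):: + + set_seed(123)  # seeds Python's random module, NumPy and, when available, torch and the accelerator RNGs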
+ """ + random.seed(seed) + np.random.seed(seed) + if is_torch_available(): + torch.manual_seed(seed) + get_accelerator().manual_seed_all(seed) + + +def get_accelerator_count(): + """ + Return the number of available accelerators (regardless of whether torch or tf is used) + """ + if is_torch_available(): + return get_accelerator().device_count() + elif is_tf_available(): + import tensorflow as tf + return len(tf.config.list_physical_devices("GPU")) + else: + return 0 + +def torch_assert_equal(actual, expected, **kwargs): + # assert_close was added around pt-1.9, it does better checks - e.g will check dimensions match + if hasattr(torch.testing, "assert_close"): + return torch.testing.assert_close(actual, expected, rtol=0.0, atol=0.0, **kwargs) + else: + return torch.allclose(actual, expected, rtol=0.0, atol=0.0) + + +def torch_assert_close(actual, expected, **kwargs): + # assert_close was added around pt-1.9, it does better checks - e.g will check dimensions match + if hasattr(torch.testing, "assert_close"): + return torch.testing.assert_close(actual, expected, **kwargs) + else: + kwargs.pop("msg", None) # doesn't have msg arg + return torch.allclose(actual, expected, **kwargs) + + +def is_torch_bf16_available(): + # from https://github.com/huggingface/transformers/blob/26eb566e43148c80d0ea098c76c3d128c0281c16/src/transformers/file_utils.py#L301 + if is_torch_available(): + return get_accelerator().is_bf16_supported() + else: + return False + + +def require_torch_bf16(test_case): + """Decorator marking a test that requires Accelerator hardware supporting bf16 and PyTorch >= 1.9.""" + if not is_torch_bf16_available(): + return unittest.skip("test requires Accelerator hardware supporting bf16 and PyTorch >= 1.9")(test_case) + else: + return test_case + + +def get_tests_dir(append_path=None): + """ + Args: + append_path: optional path to append to the tests dir path + + Return: + The full path to the `tests` dir, so that the tests can be invoked from anywhere. Optionally `append_path` is + joined after the `tests` dir the former is provided. + + """ + # this function caller's __file__ + caller__file__ = inspect.stack()[1][1] + tests_dir = os.path.abspath(os.path.dirname(caller__file__)) + if append_path: + return os.path.join(tests_dir, append_path) + else: + return tests_dir + + +# +# Helper functions for dealing with testing text outputs +# The original code came from: +# https://github.com/fastai/fastai/blob/master/tests/utils/text.py + +# When any function contains print() calls that get overwritten, like progress bars, +# a special care needs to be applied, since under pytest -s captured output (capsys +# or contextlib.redirect_stdout) contains any temporary printed strings, followed by +# \r's. 
This helper function ensures that the buffer will contain the same output +# with and without -s in pytest, by turning: +# foo bar\r tar mar\r final message +# into: +# final message +# it can handle a single string or a multiline buffer +def apply_print_resets(buf): + return re.sub(r"^.*\r", "", buf, 0, re.M) + + +def assert_screenout(out, what): + out_pr = apply_print_resets(out).lower() + match_str = out_pr.find(what.lower()) + assert match_str != -1, f"expecting to find {what} in output: f{out_pr}" + + +class CaptureStd: + """ + Context manager to capture: + + - stdout: replay it, clean it up and make it available via ``obj.out`` + - stderr: replay it and make it available via ``obj.err`` + + init arguments: + + - out - capture stdout:`` True``/``False``, default ``True`` + - err - capture stdout: ``True``/``False``, default ``True`` + - replay - whether to replay or not: ``True``/``False``, default ``True``. By default each + captured stream gets replayed back on context's exit, so that one can see what the test was + doing. If this is a not wanted behavior and the captured data shouldn't be replayed, pass + ``replay=False`` to disable this feature. + + Examples:: + + # to capture stdout only with auto-replay + with CaptureStdout() as cs: + print("Secret message") + assert "message" in cs.out + + # to capture stderr only with auto-replay + import sys + with CaptureStderr() as cs: + print("Warning: ", file=sys.stderr) + assert "Warning" in cs.err + + # to capture both streams with auto-replay + with CaptureStd() as cs: + print("Secret message") + print("Warning: ", file=sys.stderr) + assert "message" in cs.out + assert "Warning" in cs.err + + # to capture just one of the streams, and not the other, with auto-replay + with CaptureStd(err=False) as cs: + print("Secret message") + assert "message" in cs.out + # but best use the stream-specific subclasses + + # to capture without auto-replay + with CaptureStd(replay=False) as cs: + print("Secret message") + assert "message" in cs.out + + """ + + def __init__(self, out=True, err=True, replay=True): + + self.replay = replay + + if out: + self.out_buf = StringIO() + self.out = "error: CaptureStd context is unfinished yet, called too early" + else: + self.out_buf = None + self.out = "not capturing stdout" + + if err: + self.err_buf = StringIO() + self.err = "error: CaptureStd context is unfinished yet, called too early" + else: + self.err_buf = None + self.err = "not capturing stderr" + + def __enter__(self): + if self.out_buf: + self.out_old = sys.stdout + sys.stdout = self.out_buf + + if self.err_buf: + self.err_old = sys.stderr + sys.stderr = self.err_buf + + return self + + def __exit__(self, *exc): + if self.out_buf: + sys.stdout = self.out_old + captured = self.out_buf.getvalue() + if self.replay: + sys.stdout.write(captured) + self.out = apply_print_resets(captured) + + if self.err_buf: + sys.stderr = self.err_old + captured = self.err_buf.getvalue() + if self.replay: + sys.stderr.write(captured) + self.err = captured + + def __repr__(self): + msg = "" + if self.out_buf: + msg += f"stdout: {self.out}\n" + if self.err_buf: + msg += f"stderr: {self.err}\n" + return msg + + +# in tests it's the best to capture only the stream that's wanted, otherwise +# it's easy to miss things, so unless you need to capture both streams, use the +# subclasses below (less typing). Or alternatively, configure `CaptureStd` to +# disable the stream you don't need to test. 
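+# +# Illustrative sketch of the stream-specific usage recommended above: +# +#    with CaptureStderr() as cs: +#        print("some warning", file=sys.stderr) +#    assert "warning" in cs.err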
+ + +class CaptureStdout(CaptureStd): + """Same as CaptureStd but captures only stdout""" + + def __init__(self, replay=True): + super().__init__(err=False, replay=replay) + + +class CaptureStderr(CaptureStd): + """Same as CaptureStd but captures only stderr""" + + def __init__(self, replay=True): + super().__init__(out=False, replay=replay) + + +class CaptureLogger: + """ + Context manager to capture `logging` streams + + Args: + + - logger: 'logging` logger object + + Results: + The captured output is available via `self.out` + + Example:: + + >>> from transformers import logging + >>> from transformers.testing_utils import CaptureLogger + + >>> msg = "Testing 1, 2, 3" + >>> logging.set_verbosity_info() + >>> logger = logging.get_logger("transformers.models.bart.tokenization_bart") + >>> with CaptureLogger(logger) as cl: + ... logger.info(msg) + >>> assert cl.out, msg+"\n" + """ + + def __init__(self, logger): + self.logger = logger + self.io = StringIO() + self.sh = logging.StreamHandler(self.io) + self.out = "" + + def __enter__(self): + self.logger.addHandler(self.sh) + return self + + def __exit__(self, *exc): + self.logger.removeHandler(self.sh) + self.out = self.io.getvalue() + + def __repr__(self): + return f"captured: {self.out}\n" + + + +@contextlib.contextmanager +# adapted from https://stackoverflow.com/a/64789046/9201239 +def ExtendSysPath(path: Union[str, os.PathLike]) -> Iterator[None]: + """ + Temporary add given path to `sys.path`. + + Usage :: + + with ExtendSysPath('/path/to/dir'): + mymodule = importlib.import_module('mymodule') + + """ + + path = os.fspath(path) + try: + sys.path.insert(0, path) + yield + finally: + sys.path.remove(path) + + +class TestCasePlus(unittest.TestCase): + """ + This class extends `unittest.TestCase` with additional features. + + Feature 1: A set of fully resolved important file and dir path accessors. + + In tests often we need to know where things are relative to the current test file, and it's not trivial since the + test could be invoked from more than one directory or could reside in sub-directories with different depths. This + class solves this problem by sorting out all the basic paths and provides easy accessors to them: + + * ``pathlib`` objects (all fully resolved): + + - ``test_file_path`` - the current test file path (=``__file__``) + - ``test_file_dir`` - the directory containing the current test file + - ``tests_dir`` - the directory of the ``tests`` test suite + - ``data_dir`` - the directory of the ``tests/data`` test suite + - ``repo_root_dir`` - the directory of the repository + - ``src_dir`` - the directory of ``src`` (i.e. where the ``transformers`` sub-dir resides) + + * stringified paths---same as above but these return paths as strings, rather than ``pathlib`` objects: + + - ``test_file_path_str`` + - ``test_file_dir_str`` + - ``tests_dir_str`` + - ``data_dir_str`` + - ``repo_root_dir_str`` + - ``src_dir_str`` + + Feature 2: Flexible auto-removable temporary dirs which are guaranteed to get removed at the end of test. + + 1. Create a unique temporary dir: + + :: + + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() + + ``tmp_dir`` will contain the path to the created temporary dir. It will be automatically removed at the end of the + test. + + + 2. Create a temporary dir of my choice, ensure it's empty before the test starts and don't + empty it after the test. 
+ + :: + + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir("./xxx") + + This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests + didn't leave any data in there. + + 3. You can override the first two options by directly overriding the ``before`` and ``after`` args, leading to the + following behavior: + + ``before=True``: the temporary dir will always be cleared at the beginning of the test. + + ``before=False``: if the temporary dir already existed, any existing files will remain there. + + ``after=True``: the temporary dir will always be deleted at the end of the test. + + ``after=False``: the temporary dir will always be left intact at the end of the test. + + Note 1: In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are + allowed if an explicit ``tmp_dir`` is used, so that by mistake no ``/tmp`` or similar important part of the + filesystem will get nuked. i.e. please always pass paths that start with ``./`` + + Note 2: Each test can register multiple temporary dirs and they all will get auto-removed, unless requested + otherwise. + + Feature 3: Get a copy of the ``os.environ`` object that sets up ``PYTHONPATH`` specific to the current test suite. + This is useful for invoking external programs from the test suite - e.g. distributed training. + + + :: + def test_whatever(self): + env = self.get_env() + + """ + + def setUp(self): + # get_auto_remove_tmp_dir feature: + self.teardown_tmp_dirs = [] + + # figure out the resolved paths for repo_root, tests, etc. + self._test_file_path = inspect.getfile(self.__class__) + path = Path(self._test_file_path).resolve() + self._test_file_dir = path.parents[0] + for up in [1, 2, 3]: + tmp_dir = path.parents[up] + if (tmp_dir / "megatron").is_dir() and (tmp_dir / "tests").is_dir(): + break + if tmp_dir: + self._repo_root_dir = tmp_dir + else: + raise ValueError(f"can't figure out the root of the repo from {self._test_file_path}") + self._tests_dir = self._repo_root_dir / "tests" + self._data_dir = self._repo_root_dir / "tests" / "data" + self._src_dir = self._repo_root_dir # megatron doesn't use "src/" prefix in the repo + + @property + def test_file_path(self): + return self._test_file_path + + @property + def test_file_path_str(self): + return str(self._test_file_path) + + @property + def test_file_dir(self): + return self._test_file_dir + + @property + def test_file_dir_str(self): + return str(self._test_file_dir) + + @property + def tests_dir(self): + return self._tests_dir + + @property + def tests_dir_str(self): + return str(self._tests_dir) + + @property + def data_dir(self): + return self._data_dir + + @property + def data_dir_str(self): + return str(self._data_dir) + + @property + def repo_root_dir(self): + return self._repo_root_dir + + @property + def repo_root_dir_str(self): + return str(self._repo_root_dir) + + @property + def src_dir(self): + return self._src_dir + + @property + def src_dir_str(self): + return str(self._src_dir) + + def get_env(self): + """ + Return a copy of the ``os.environ`` object that sets up ``PYTHONPATH`` correctly. This is useful + for invoking external programs from the test suite - e.g. distributed training. + + It always inserts ``.`` first, then ``./tests`` depending on the test suite type and + finally the preset ``PYTHONPATH`` if any (all full resolved paths). 
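+ + Example (mirrors how the training tests launch external programs):: + + cmd = launcher + script + args  # any argv list + execute_subprocess_async(cmd, env=self.get_env())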
+ + """ + env = os.environ.copy() + paths = [self.src_dir_str] + paths.append(self.tests_dir_str) + paths.append(env.get("PYTHONPATH", "")) + + env["PYTHONPATH"] = ":".join(paths) + return env + + def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None): + """ + Args: + tmp_dir (:obj:`string`, `optional`): + if :obj:`None`: + + - a unique temporary path will be created + - sets ``before=True`` if ``before`` is :obj:`None` + - sets ``after=True`` if ``after`` is :obj:`None` + else: + + - :obj:`tmp_dir` will be created + - sets ``before=True`` if ``before`` is :obj:`None` + - sets ``after=False`` if ``after`` is :obj:`None` + before (:obj:`bool`, `optional`): + If :obj:`True` and the :obj:`tmp_dir` already exists, make sure to empty it right away if :obj:`False` + and the :obj:`tmp_dir` already exists, any existing files will remain there. + after (:obj:`bool`, `optional`): + If :obj:`True`, delete the :obj:`tmp_dir` at the end of the test if :obj:`False`, leave the + :obj:`tmp_dir` and its contents intact at the end of the test. + + Returns: + tmp_dir(:obj:`string`): either the same value as passed via `tmp_dir` or the path to the auto-selected tmp + dir + """ + if tmp_dir is not None: + + # defining the most likely desired behavior for when a custom path is provided. + # this most likely indicates the debug mode where we want an easily locatable dir that: + # 1. gets cleared out before the test (if it already exists) + # 2. is left intact after the test + if before is None: + before = True + if after is None: + after = False + + # using provided path + path = Path(tmp_dir).resolve() + + # to avoid nuking parts of the filesystem, only relative paths are allowed + if not tmp_dir.startswith("./"): + raise ValueError( + f"`tmp_dir` can only be a relative path, i.e. `./some/path`, but received `{tmp_dir}`" + ) + + # ensure the dir is empty to start with + if before is True and path.exists(): + shutil.rmtree(tmp_dir, ignore_errors=True) + + path.mkdir(parents=True, exist_ok=True) + + else: + # defining the most likely desired behavior for when a unique tmp path is auto generated + # (not a debug mode), here we require a unique tmp dir that: + # 1. is empty before the test (it will be empty in this situation anyway) + # 2. gets fully removed after the test + if before is None: + before = True + if after is None: + after = True + + # using unique tmp dir (always empty, regardless of `before`) + tmp_dir = tempfile.mkdtemp() + + if after is True: + # register for deletion + self.teardown_tmp_dirs.append(tmp_dir) + + return tmp_dir + + def tearDown(self): + + # get_auto_remove_tmp_dir feature: remove registered temp dirs + for path in self.teardown_tmp_dirs: + shutil.rmtree(path, ignore_errors=True) + self.teardown_tmp_dirs = [] + + +def mockenv(**kwargs): + """ + this is a convenience wrapper, that allows this :: + + @mockenv(RUN_SLOW=True, USE_TF=False) + def test_something(): + run_slow = os.getenv("RUN_SLOW", False) + use_tf = os.getenv("USE_TF", False) + + """ + return mock.patch.dict(os.environ, kwargs) + + +# from https://stackoverflow.com/a/34333710/9201239 +@contextlib.contextmanager +def mockenv_context(*remove, **update): + """ + Temporarily updates the ``os.environ`` dictionary in-place. Similar to mockenv + + The ``os.environ`` dictionary is updated in-place so that the modification is sure to work in all situations. + + Args: + remove: Environment variables to remove. + update: Dictionary of environment variables and values to add/update. 
+ """ + env = os.environ + update = update or {} + remove = remove or [] + + # List of environment variables being updated or removed. + stomped = (set(update.keys()) | set(remove)) & set(env.keys()) + # Environment variables and values to restore on exit. + update_after = {k: env[k] for k in stomped} + # Environment variables and values to remove on exit. + remove_after = frozenset(k for k in update if k not in env) + + try: + env.update(update) + [env.pop(k, None) for k in remove] + yield + finally: + env.update(update_after) + [env.pop(k) for k in remove_after] + + + +# --- distributed testing functions --- # + +# adapted from https://stackoverflow.com/a/59041913/9201239 +import asyncio # noqa + + +class _RunOutput: + def __init__(self, returncode, stdout, stderr): + self.returncode = returncode + self.stdout = stdout + self.stderr = stderr + + +async def _read_stream(stream, callback): + while True: + line = await stream.readline() + if line: + callback(line) + else: + break + + +async def _stream_subprocess(cmd, env=None, stdin=None, timeout=None, quiet=False, echo=False) -> _RunOutput: + if echo: + print("\nRunning: ", " ".join(cmd)) + + p = await asyncio.create_subprocess_exec( + cmd[0], + *cmd[1:], + stdin=stdin, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + env=env, + ) + + # note: there is a warning for a possible deadlock when using `wait` with huge amounts of data in the pipe + # https://docs.python.org/3/library/asyncio-subprocess.html#asyncio.asyncio.subprocess.Process.wait + # + # If it starts hanging, will need to switch to the following code. The problem is that no data + # will be seen until it's done and if it hangs for example there will be no debug info. + # out, err = await p.communicate() + # return _RunOutput(p.returncode, out, err) + + out = [] + err = [] + + def tee(line, sink, pipe, label=""): + line = line.decode("utf-8").rstrip() + sink.append(line) + if not quiet: + print(label, line, file=pipe) + + # XXX: the timeout doesn't seem to make any difference here + await asyncio.wait( + [ + _read_stream(p.stdout, lambda l: tee(l, out, sys.stdout, label="stdout:")), + _read_stream(p.stderr, lambda l: tee(l, err, sys.stderr, label="stderr:")), + ], + timeout=timeout, + ) + return _RunOutput(await p.wait(), out, err) + + +def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False, echo=True) -> _RunOutput: + + loop = asyncio.get_event_loop() + result = loop.run_until_complete( + _stream_subprocess(cmd, env=env, stdin=stdin, timeout=timeout, quiet=quiet, echo=echo) + ) + + cmd_str = " ".join(cmd) + if result.returncode > 0: + stderr = "\n".join(result.stderr) + raise RuntimeError( + f"'{cmd_str}' failed with returncode {result.returncode}\n\n" + f"The combined stderr from workers follows:\n{stderr}" + ) + + # check that the subprocess actually did run and produced some output, should the test rely on + # the remote side to do the testing + if not result.stdout and not result.stderr: + raise RuntimeError(f"'{cmd_str}' produced no output.") + + return result + + +# --- Misc utils --- # + +def flatten_arguments(args): + """ + Converts dictionary argument to a list. 
+ + Note: we add "IGNORED" at the beginning as this value is ignored by the argparser + + Example: {"arg1": "value1", "arg2": "value2"} -> ["IGNORED", "arg1", "value1", "arg2", "value2"] + """ + return ["IGNORED"] + [item for key_value in args.items() for item in key_value if item != ""] diff --git a/tests/pipeline_parallel/test_schedules.py b/tests/pipeline_parallel/test_schedules.py index a6bac5b2a3..5c9f6f383d 100644 --- a/tests/pipeline_parallel/test_schedules.py +++ b/tests/pipeline_parallel/test_schedules.py @@ -1,5 +1,7 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + import torch -from tests.test_utilities import Utils +from tests.unit_tests.test_utilities import Utils from megatron.core import ModelParallelConfig import megatron.core.pipeline_parallel.schedules as schedule from pytest_mock import mocker @@ -21,7 +23,9 @@ def test_get_forward_backward_func(): def test_deallocate_output_tensor(): out = torch.tensor([[1, 2, 3], [4, 5, 6]]) schedule.deallocate_output_tensor(out) - assert(out.nelement() == 1) + assert(out.nelement() == 6) + schedule.deallocate_output_tensor(out, True) + assert(out.nelement() == 1) def test_forward_backward_func_without_pipeline_parallel(mocker): from megatron.core.pipeline_parallel import get_forward_backward_func diff --git a/tests/run_megatron.py b/tests/run_megatron.py index 4f8fdf7944..8e99513aa2 100644 --- a/tests/run_megatron.py +++ b/tests/run_megatron.py @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + import torch import deepspeed import megatron @@ -102,7 +104,7 @@ def add_text_generate_args(parser): model=model, mp_size=args.tensor_model_parallel_size, tensor_parallel={"mpu": mpu}, - dtype=torch.half, + dtype=torch.bfloat16, replace_with_kernel_inject=True, moe_experts=args.num_experts, moe_type=args.mlp_type, diff --git a/tests/test_megatron.py b/tests/test_megatron.py index e7342c244c..62f1a7a301 100644 --- a/tests/test_megatron.py +++ b/tests/test_megatron.py @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + import pytest import os import re @@ -39,7 +41,7 @@ def params(moe_num_experts, mp_size): "--num-experts", moe_num_experts, "--mlp-type", "standard", "--num-samples", "0", - "--fp16", + "--bf16", ] diff --git a/tests/transformer/test_parallel_attention.py b/tests/transformer/test_parallel_attention.py index fe1e674e12..88ce228b0c 100644 --- a/tests/transformer/test_parallel_attention.py +++ b/tests/transformer/test_parallel_attention.py @@ -1,10 +1,11 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. import pytest import torch -from megatron.core.transformer.parallel_attention import ParallelAttention +from megatron.model.transformer import ParallelAttention @pytest.fixture diff --git a/tests/transformer/test_parallel_mlp.py b/tests/transformer/test_parallel_mlp.py index f43dc0b467..a297099a88 100644 --- a/tests/transformer/test_parallel_mlp.py +++ b/tests/transformer/test_parallel_mlp.py @@ -1,14 +1,30 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
import pytest import torch +import types -from megatron.core.transformer.parallel_mlp import ParallelMLP +from megatron.model.transformer import ParallelMLP +from megatron.global_vars import set_args +from deepspeed.accelerator import get_accelerator +device_name = get_accelerator().device_name() @pytest.fixture def mlp(transformer_config): + mlp_args = types.SimpleNamespace( + swiglu=False, + openai_gelu=True, + onnx_safe=False, + bias_gelu_fusion=False, + transformer_impl="", + cache_fp8_weight=False, + fp8_interval=False, + cache_fp8_weight_fwd=False + ) + set_args(mlp_args) return ParallelMLP(transformer_config) @@ -19,28 +35,27 @@ def test_constructor(self, mlp): num_weights = sum([p.numel() for p in mlp.parameters()]) assert num_weights == 1212 - def test_cpu_forward(self, mlp): + def test_cpu_forward(self, mlp, transformer_config): # [sequence length, micro batch size, hidden size] - hidden_states = torch.ones((32, 2, mlp.config.hidden_size)) + hidden_states = torch.ones((32, 2, transformer_config.hidden_size)) output, output_bias = mlp(hidden_states) assert output.shape[0] == 32 assert output.shape[1] == 2 - assert output.shape[2] == mlp.config.hidden_size - assert output_bias.shape[0] == mlp.config.hidden_size + assert output.shape[2] == transformer_config.hidden_size + assert output_bias == None assert output.dtype == torch.float32 - @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") - def test_gpu_forward(self, mlp): - mlp.cuda() + @pytest.mark.skipif(not get_accelerator().is_available(), reason="accelerator not available") + def test_accelerator_forward(self, mlp, transformer_config): + mlp.to(device_name) # [sequence length, batch size, hidden size] - hidden_states = torch.ones((32, 2, mlp.config.hidden_size)) - hidden_states = hidden_states.cuda() + hidden_states = torch.ones((32, 2, transformer_config.hidden_size)) + hidden_states = hidden_states.to(device_name) output, output_bias = mlp(hidden_states) assert output.shape[0] == 32 assert output.shape[1] == 2 - assert output.shape[2] == mlp.config.hidden_size - assert output_bias.shape[0] == mlp.config.hidden_size + assert output.shape[2] == transformer_config.hidden_size + assert output_bias == None assert output.dtype == torch.float32 - assert output.device.type == 'cuda' - assert output_bias.device.type == 'cuda' + assert output.device.type == device_name diff --git a/tests/transformer/test_parallel_transformer_block.py b/tests/transformer/test_parallel_transformer_block.py index baa8ae3e14..a208f6d08b 100644 --- a/tests/transformer/test_parallel_transformer_block.py +++ b/tests/transformer/test_parallel_transformer_block.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
import pytest @@ -5,8 +6,11 @@ import torch from megatron.core.transformer.transformer_config import TransformerConfig -from megatron.core.transformer.parallel_transformer_layer import ParallelTransformerLayer -from megatron.core.transformer.parallel_transformer_block import ParallelTransformerBlock +from megatron.model.transformer import ParallelTransformerLayer +try: + from megatron.core.transformer.parallel_transformer_block import ParallelTransformerBlock +except: + ParallelTransformerBlock = None @pytest.fixture diff --git a/tests/transformer/test_parallel_transformer_layer.py b/tests/transformer/test_parallel_transformer_layer.py index 9ab5003eff..b527cf474f 100644 --- a/tests/transformer/test_parallel_transformer_layer.py +++ b/tests/transformer/test_parallel_transformer_layer.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. @@ -6,7 +7,7 @@ import torch from megatron.core.transformer.transformer_config import TransformerConfig -from megatron.core.transformer.parallel_transformer_layer import ParallelTransformerLayer +from megatron.model.transformer import ParallelTransformerLayer @pytest.fixture diff --git a/tests/unit_tests/test_utilities.py b/tests/unit_tests/test_utilities.py index b35c77b58d..23d3373dd4 100644 --- a/tests/unit_tests/test_utilities.py +++ b/tests/unit_tests/test_utilities.py @@ -1,3 +1,5 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + import os import torch import megatron.core.parallel_state as ps @@ -5,7 +7,7 @@ class Utils: world_size = torch.cuda.device_count() - rank = int(os.environ['LOCAL_RANK']) + rank = int(os.getenv('LOCAL_RANK', 0)) @staticmethod def initialize_distributed(): @@ -27,4 +29,4 @@ def initialize_model_parallel(tensor_model_parallel_size = 1, pipeline_model_par ps.destroy_model_parallel() if not torch.distributed.is_initialized(): Utils.initialize_distributed() - ps.initialize_model_parallel(tensor_model_parallel_size, pipeline_model_parallel_size, virtual_pipeline_model_parallel_size, pipeline_model_parallel_split_rank) \ No newline at end of file + ps.initialize_model_parallel(tensor_model_parallel_size, pipeline_model_parallel_size, virtual_pipeline_model_parallel_size, pipeline_model_parallel_split_rank) diff --git a/tools/__init__.py b/tools/__init__.py index e69de29bb2..f2bc185ace 100644 --- a/tools/__init__.py +++ b/tools/__init__.py @@ -0,0 +1,3 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. + +from .verify_checkpoint_non_tp_consistency import verify_checkpoint diff --git a/tools/convert_checkpoint/README.md b/tools/convert_checkpoint/README.md index 3f74bb1aa4..7496bd4c12 100644 --- a/tools/convert_checkpoint/README.md +++ b/tools/convert_checkpoint/README.md @@ -6,8 +6,9 @@ The folder also contains scripts for inspecting checkpoint files and folders, wh Here are the list and details of checkpoint conversions provided by the available scripts: -1. [Megatron-DeepSpeed to Megatron-LM](#Megatron-DeepSpeed-to-Megatron) -1. [Megatron-DeepSpeed to HF Transformers](#Megatron-DeepSpeed-to-HF-Transformers) +1. [Megatron-DeepSpeed to Megatron-LM](#megatron-deepspeed-to-megatron) +2. [Megatron-DeepSpeed to HF Transformers](#megatron-deepspeed-to-hf-transformers) +3. 
[Megatron-DeepSpeed to universal then to HF Transformers](#megatron-deepspeed-to-universal-then-to-hf-transformers) ## Megatron-DeepSpeed to Megatron @@ -76,3 +77,16 @@ cd /hf/transformers python src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py \ /path/to/Megatron/checkpoint/iter_0097500/mp_rank_00/model_optim_rng.pt ``` + +## Megatron-DeepSpeed to Universal then to HF Transformers + +The conversion is done in two steps, Megatron-DeepSpeed to Universal and then Universal to HF Transformers: + +```bash +# 1. Megatron-DeepSpeed to Universal +HL_LATEST_CHECKPOINT=/path/to/checkpoints/global_step*/ $MEGATRON_DEEPSPEED_ROOT/scripts/convert_ds_to_universal.sh + +# 2. Universal to HF Transformers +python $MEGATRON_DEEPSPEED_ROOT/tools/convert_checkpoint/mds_universal_to_huggingface.py --output-dir /path/to/output/dir --hf-out-format safetensors --universal-dir /path/to/universal/dir/ --model-type llama --config $MEGATRON_DEEPSPEED_ROOT/tools/convert_checkpoint/json/mds_to_hf_llama_7b.json +``` +Note: Validated on LLaMA 2 - 7B and 70B models. diff --git a/tools/convert_checkpoint/json/mds_to_hf_llama_13b.json b/tools/convert_checkpoint/json/mds_to_hf_llama_13b.json new file mode 100644 index 0000000000..a047c9a056 --- /dev/null +++ b/tools/convert_checkpoint/json/mds_to_hf_llama_13b.json @@ -0,0 +1,40 @@ +{ + "MODEL": { + "num_hidden_layers": 40, + "hidden_size": 5120, + "num_attention_heads": 40, + "num_key_value_heads": 40, + "intermediate_size": 13824 + }, + "LAYER_MAPPINGS" : { + "word_embeddings": 1, + "transformer": [2, 41], + "final_layernorm": 42, + "final_word_embeddings": 43 + }, + "FULL_NAME_MAPPINGS": { + }, + "PARTIAL_NAME_MAPPINGS": { + "final_word_embeddings": { + "43": "lm_head" + }, + "final_layernorm": { + "42": "model.norm" + }, + "word_embeddings": { + "word_embeddings": "model.embed_tokens" + }, + "transformer": { + "dense_h_to_4h": {"gate_proj": "mlp.gate_proj", "up_proj": "mlp.up_proj"}, + "dense_4h_to_h": "mlp.down_proj", + "post_attention_layernorm": "post_attention_layernorm", + "input_layernorm": "input_layernorm", + "dense": "self_attn.o_proj", + "query_key_value": {"query": "self_attn.q_proj", "key": "self_attn.k_proj", "value": "self_attn.v_proj"} + } + }, + "SPECIAL": { + "query_key_value": "attention_qkv", + "dense_h_to_4h": "mlp_gate_up_proj" + } +} diff --git a/tools/convert_checkpoint/json/mds_to_hf_llama_70b.json b/tools/convert_checkpoint/json/mds_to_hf_llama_70b.json new file mode 100644 index 0000000000..b110defe28 --- /dev/null +++ b/tools/convert_checkpoint/json/mds_to_hf_llama_70b.json @@ -0,0 +1,40 @@ +{ + "MODEL": { + "num_hidden_layers": 80, + "hidden_size": 8192, + "num_attention_heads": 64, + "num_key_value_heads": 8, + "intermediate_size": 28672 + }, + "LAYER_MAPPINGS" : { + "word_embeddings": 1, + "transformer": [2, 81], + "final_layernorm": 82, + "final_word_embeddings": 83 + }, + "FULL_NAME_MAPPINGS": { + }, + "PARTIAL_NAME_MAPPINGS": { + "final_word_embeddings": { + "83": "lm_head" + }, + "final_layernorm": { + "82": "model.norm" + }, + "word_embeddings": { + "word_embeddings": "model.embed_tokens" + }, + "transformer": { + "dense_h_to_4h": {"gate_proj": "mlp.gate_proj", "up_proj": "mlp.up_proj"}, + "dense_4h_to_h": "mlp.down_proj", + "post_attention_layernorm": "post_attention_layernorm", + "input_layernorm": "input_layernorm", + "dense": "self_attn.o_proj", + "query_key_value": {"query": "self_attn.q_proj", "key": "self_attn.k_proj", "value": "self_attn.v_proj"} + } + }, + "SPECIAL": { + "query_key_value":
"attention_qkv", + "dense_h_to_4h": "mlp_gate_up_proj" + } +} diff --git a/tools/convert_checkpoint/json/mds_to_hf_llama_7b.json b/tools/convert_checkpoint/json/mds_to_hf_llama_7b.json new file mode 100644 index 0000000000..02b7459f29 --- /dev/null +++ b/tools/convert_checkpoint/json/mds_to_hf_llama_7b.json @@ -0,0 +1,40 @@ +{ + "MODEL": { + "num_hidden_layers": 32, + "hidden_size": 4096, + "num_attention_heads": 32, + "num_key_value_heads": 32, + "intermediate_size": 11008 + }, + "LAYER_MAPPINGS" : { + "word_embeddings": 1, + "transformer": [2, 33], + "final_layernorm": 34, + "final_word_embeddings": 35 + }, + "FULL_NAME_MAPPINGS": { + }, + "PARTIAL_NAME_MAPPINGS": { + "final_word_embeddings": { + "35": "lm_head" + }, + "final_layernorm": { + "34": "model.norm" + }, + "word_embeddings": { + "word_embeddings": "model.embed_tokens" + }, + "transformer": { + "dense_h_to_4h": {"gate_proj": "mlp.gate_proj", "up_proj": "mlp.up_proj"}, + "dense_4h_to_h": "mlp.down_proj", + "post_attention_layernorm": "post_attention_layernorm", + "input_layernorm": "input_layernorm", + "dense": "self_attn.o_proj", + "query_key_value": {"query": "self_attn.q_proj", "key": "self_attn.k_proj", "value": "self_attn.v_proj"} + } + }, + "SPECIAL": { + "query_key_value": "attention_qkv", + "dense_h_to_4h": "mlp_gate_up_proj" + } +} diff --git a/tools/convert_checkpoint/json/mds_to_hf_llama_7b_full_names.json b/tools/convert_checkpoint/json/mds_to_hf_llama_7b_full_names.json new file mode 100644 index 0000000000..d5d311e8ec --- /dev/null +++ b/tools/convert_checkpoint/json/mds_to_hf_llama_7b_full_names.json @@ -0,0 +1,57 @@ +{ + "MODEL": { + "num_hidden_layers": 32, + "hidden_size": 4096, + "num_attention_heads": 32, + "num_key_value_heads": 32, + "intermediate_size": 11008 + }, + "LAYER_MAPPINGS" : { + "word_embeddings": 1, + "transformer": [2, 33], + "final_layernorm": 34, + "final_word_embeddings": 35 + }, + "FULL_NAME_MAPPINGS": { + "1": { + "1.word_embeddings.weight": "model.embed_tokens.weight" + }, + "2": { + "2.mlp.dense_h_to_4h.weight": { + "gate_proj": "model.layers.0.mlp.gate_proj.weight", + "up_proj": "model.layers.0.mlp.up_proj.weight" + }, + "2.post_attention_layernorm.weight": "model.layers.0.post_attention_layernorm.weight", + "2.input_layernorm.weight": "model.layers.0.input_layernorm.weight", + "2.mlp.dense_4h_to_h.weight": "model.layers.0.mlp.down_proj.weight", + "2.self_attention.dense.weight": "model.layers.0.self_attn.o_proj.weight", + "2.self_attention.query_key_value.weight": { + "query": "model.layers.0.self_attn.q_proj.weight", + "key": "model.layers.0.self_attn.k_proj.weight", + "value": "model.layers.0.self_attn.v_proj.weight" + } + }, + "34": { + "34.weight": "model.norm.weight" + }, + "35": { + "35.lm_head.weight": "lm_head.weight" + } + }, + "PARTIAL_NAME_MAPPINGS": { + "transformer": { + "dense_h_to_4h": {"gate_proj": "mlp.gate_proj", "up_proj": "mlp.up_proj"}, + "dense_4h_to_h": "mlp.down_proj", + "post_attention_layernorm": "post_attention_layernorm", + "input_layernorm": "input_layernorm", + "dense": "self_attn.o_proj", + "query_key_value": {"query": "self_attn.q_proj", "key": "self_attn.k_proj", "value": "self_attn.v_proj"} + } + }, + "SPECIAL": { + "query_key_value": "attention_qkv", + "2.self_attention.query_key_value.weight": "attention_qkv", + "dense_h_to_4h": "mlp_gate_up_proj", + "2.mlp.dense_h_to_4h.weight": "mlp_gate_up_proj" + } +} diff --git a/tools/convert_checkpoint/mds_universal_to_huggingface.py b/tools/convert_checkpoint/mds_universal_to_huggingface.py new file mode 
100644 index 0000000000..b2333a02d9 --- /dev/null +++ b/tools/convert_checkpoint/mds_universal_to_huggingface.py @@ -0,0 +1,455 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. +# coding=utf-8 +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import argparse +import glob +import json +import os +import torch +from transformers import AutoModelForCausalLM, LlamaConfig, AutoConfig +import sys +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), + os.path.pardir, os.path.pardir))) + + +def parse_arguments(): + parser = argparse.ArgumentParser(description="MDS universal to HF checkpoint") + + parser.add_argument('--output-dir', type=str, + help='Output HF checkpoint folder', + default='./MDS_universal_to_HF_checkpoint/') + parser.add_argument('--hf-out-format', type=str, default='safetensors', + choices=['safetensors', 'bin'], + help='Huggingface model output format') + parser.add_argument('--universal-dir', type=str, required=True, + help='Path to universal checkpoint to be converted') + parser.add_argument('--hf-model', type=str, default=None, + help='Huggingface model name or path') + parser.add_argument('--model-type', type=str, default=None, + choices=['llama', None], + help='Model type to convert; architecture parameters are taken from the MODEL section of the json config') + parser.add_argument('--config', type=str, required=True, + help='path to json config file with conversion ' + 'information') + parser.add_argument('--save-conversion', type=str, default="", + help='json file to save the conversion dict') + parser.add_argument('--no-strict', action='store_false', dest='strict', + help='allow non-strict conversion: convert partially ' + 'even when failing to convert some of the model ' + 'weight names') + + args = parser.parse_args() + + assert (args.hf_model is not None) ^ (args.model_type is not None), \ + 'Exactly one of --hf-model or --model-type must be provided' + return args + + +def load_config(path): + """ This function is used for loading the conversion config given by the user """ + config = json.load(open(path, 'r', encoding='utf-8')) + start_idx, end_idx = config['LAYER_MAPPINGS']['transformer'] + config['LAYER_MAPPINGS']['transformer'] = list(range(start_idx, end_idx+1)) + if 'MODEL' in config.keys(): + fields = config['MODEL'].keys() + assert ('hidden_size' in fields) & ('intermediate_size' in fields) & \ + ('num_attention_heads' in fields) & ('num_hidden_layers' in fields), \ + 'Required fields of MODEL are missing in json config file' + assert config['MODEL']['num_hidden_layers'] == len(config['LAYER_MAPPINGS']['transformer']), \ + 'Inconsistency of provided num hidden layers of model in json file' + return config + + +def create_model(args, config): + """ + This function creates the HuggingFace model that the converted checkpoint will be loaded into. + It is used to identify the model weight names and for saving the converted checkpoint. + The HF model is given by the user either as a HuggingFace model name or path + or as a model type with architecture params in the json config.
+ """ + if args.hf_model is not None: + model_config = AutoConfig.from_pretrained(args.hf_model) + model = AutoModelForCausalLM.from_config(model_config) + else: + assert 'MODEL' in config.keys(), f'When using model type, model parameters must be ' \ + 'included in json configuration file' + if args.model_type == 'llama': + mp_rank_file_path = get_universal_checkpoint_mp_rank_files(args.universal_dir)[0] + mp_rank_file = torch.load(mp_rank_file_path) + ckpt_args = mp_rank_file['args'] + model_config = LlamaConfig(rms_norm_eps=ckpt_args.layernorm_epsilon, + max_position_embeddings=ckpt_args.max_position_embeddings) + model_config.hidden_size = config['MODEL']['hidden_size'] + model_config.intermediate_size = config['MODEL']['intermediate_size'] + model_config.num_attention_heads = config['MODEL']['num_attention_heads'] + model_config.num_hidden_layers = config['MODEL']['num_hidden_layers'] + if 'num_key_value_heads' in config['MODEL'].keys(): + model_config.num_key_value_heads = config['MODEL']['num_key_value_heads'] + else: + model_config.num_key_value_heads = model_config.num_attention_heads + else: + raise NotImplementedError(f'Unsupported model type {args.model_type}') + model = AutoModelForCausalLM.from_config(model_config) + return model + + +def write_to_file_and_print(text, log_file='log.txt', create=False): + print(text) + file_mode = "a" + if create: + file_mode = "w" + with open(log_file, file_mode) as f: + f.write(text + '\n') + + +def get_universal_checkpoint_mp_rank_files(checkpoint_dir): + """ + This function is used to get all universal checkpoint file names by layer idx. + """ + layer_ckpt_path = os.path.join(checkpoint_dir, 'mp_rank_*_model_states.pt') + ckpt_files = glob.glob(layer_ckpt_path) + return ckpt_files + + +def get_universal_checkpoint_files(checkpoint_dir, layer_idx): + """ + This function is used to get all universal checkpoint file names by layer idx. + """ + layer_ckpt_path = os.path.join(checkpoint_dir, f'zero/{layer_idx}.*') + ckpt_files = glob.glob(layer_ckpt_path) + return ckpt_files + + +def convert_partial_name(mds_name, config, layer_idx, layer_type, key, special=None): + """ + This function is used to convert weight name from universal to HF. + + Arguments: + mds_name: the weight name to convert (universal) + config: conversion configuration given by the user through json file + layer_idx: index of the current layer conversion is performed over + layer_type: layer type as appears in json config file (e.g. transformer, word_embeddings) + key: keyword from mds name used for conversion (as appears in json config file, should be + indicative and unique). Used for partial name conversion using MDS_SUBLAYER_MAPPINGS. + special: string used as placeholder for special weights (e.g. query_key_value concatenation) + """ + suffix = mds_name.rsplit('.', 1)[-1] + suffix = '.' + suffix if suffix in ['weight', 'bias'] else '' + if layer_type == 'transformer': + prefix = f'model.layers.{layer_idx-config["LAYER_MAPPINGS"]["transformer"][0]}.' + else: + prefix = '' + if special is None: + hf_name = prefix + config['PARTIAL_NAME_MAPPINGS'][layer_type][key] + suffix + else: + hf_name = prefix + special + suffix + return hf_name + + +def convert_layer(state_dict, config, layer_idx, layer_type, universal_dir, + missing, unexpected, log_file, conversion_dict, model): + """ + This function is used to convert all weight names in a specific layer from universal to HF. 
+ + Arguments: + state_dict: HF model state dict with all model weights + config: conversion configuration given by the user through json file + layer_idx: index of the current layer conversion is performed over + layer_type: layer type as appears in json config file (e.g. transformer, word_embeddings) + universal_dir: directory with universal checkpoint files + missing: set of HF weight names there was no successfull conversion to yet + unexpected: list of converted weight names not matching the model state dict + (unsuccessfull conversion) + log_file: path to log file of the conversion process + conversion_dict: path to save conversion dict (or None) + model: HuggingFace model to create checkpoint for + """ + mds_weights = get_universal_checkpoint_files(universal_dir, layer_idx) + mds_weight_names = set(mds_weights) + # try to convert using full name mappings given by user + # remove successfully converted names to ignore with partial name mappings + if str(layer_idx) in config['FULL_NAME_MAPPINGS'].keys(): + for mds_name, hf_name in config['FULL_NAME_MAPPINGS'][str(layer_idx)].items(): + success = False + full_mds_name = os.path.join(universal_dir, 'zero/', mds_name) + if mds_name not in config['SPECIAL'].keys(): + success = load_weight(full_mds_name, state_dict, hf_name, + missing, conversion_dict, log_file) + else: + if config['SPECIAL'][mds_name] == 'attention_qkv': + success = qkv_split(full_mds_name, hf_name, state_dict, + model.config, missing, conversion_dict, + log_file) + if config['SPECIAL'][mds_name] == 'mlp_gate_up_proj': + success = gate_up_proj_split(full_mds_name, hf_name, + state_dict, model.config, + missing, conversion_dict, + log_file) + if success: + mds_weight_names.remove(full_mds_name) + else: + unexpected.append(hf_name) + # try converting remaining weights using partial name mappings given by user + if layer_type in config['PARTIAL_NAME_MAPPINGS'].keys(): + for mds_name in mds_weight_names: + success = False + for key in config['PARTIAL_NAME_MAPPINGS'][layer_type].keys(): + keyword = key + '.' if key[-1] != '.' 
else key + if keyword in mds_name: + if key not in config['SPECIAL'].keys(): + hf_name = convert_partial_name(mds_name, config, layer_idx, + layer_type, key) + success = load_weight(mds_name, state_dict, hf_name, + missing, conversion_dict, log_file) + else: + if config['SPECIAL'][key] == 'attention_qkv': + place_holder = 'qkv' + tmp_name = convert_partial_name(mds_name, config, layer_idx, + layer_type, key, special=place_holder) + qkv_dict = config['PARTIAL_NAME_MAPPINGS'][layer_type][key] + query_name = tmp_name.replace(place_holder, qkv_dict['query']) + key_name = tmp_name.replace(place_holder, qkv_dict['key']) + value_name = tmp_name.replace(place_holder, qkv_dict['value']) + hf_name = {'query': query_name, 'key': key_name, 'value': value_name} + success = qkv_split(mds_name, hf_name, + state_dict, model.config, missing, + conversion_dict, log_file) + if config['SPECIAL'][key] == 'mlp_gate_up_proj': + place_holder = 'mlp_gate_up_proj' + tmp_name = convert_partial_name(mds_name, config, layer_idx, + layer_type, key, special=place_holder) + gate_up_dict = config['PARTIAL_NAME_MAPPINGS'][layer_type][key] + gate_name = tmp_name.replace(place_holder, gate_up_dict['gate_proj']) + up_name = tmp_name.replace(place_holder, gate_up_dict['up_proj']) + hf_name = {'gate_proj': gate_name, 'up_proj': up_name} + success = gate_up_proj_split(mds_name, hf_name, + state_dict, model.config, missing, + conversion_dict, log_file) + if success: + break + if not success: + unexpected.append(mds_name) + return + + +def qkv_split(mds_name, hf_name, state_dict, model_config, + missing, conversion_dict, log_file): + """ + This function is used to convert query-key-value weights from universal to HF. + We use this special function because of difference in shapes: + in universal, query-key-value is one matrix whereas in HF there are 3 separate matrices. + This is done by loading the qkv weight and doing some reshapes before concatenation based on + model parameters (MDS qkv is based on division between attention heads). 
+
+    Arguments:
+        mds_name: name of weight in universal format
+        hf_name: dict with the target HF weight names for 'query', 'key' and 'value'
+        state_dict: HF model state dict with all model weights
+        model_config: HF model configuration
+        missing: set of HF weight names that have not been successfully converted yet
+        conversion_dict: dict mapping universal weight names to HF weight names (or None)
+        log_file: path to log file of the conversion process
+    """
+    mds_weight = torch.load(os.path.join(mds_name, 'fp32.pt'))['param']
+    num_heads = model_config.num_attention_heads
+    num_key_value_heads = model_config.num_key_value_heads
+    hidden_size = model_config.hidden_size
+    head_dim = hidden_size // num_heads
+    kv_hidden_size = num_key_value_heads * head_dim
+
+    # transformations from the MDS/universal query-key-value matrix to the 3 HF matrices
+    qkv = mds_weight.reshape((num_key_value_heads,
+                              num_heads // num_key_value_heads + 2,
+                              head_dim, hidden_size))
+    query, key, value = torch.split(qkv, [num_heads // num_key_value_heads, 1, 1], dim=1)
+    query = query.reshape((-1, hidden_size))
+    key = key.reshape((-1, hidden_size))
+    value = value.reshape((-1, hidden_size))
+
+    assert query.shape[0] == hidden_size, "shape mismatch in query"
+    assert key.shape[0] == kv_hidden_size, "shape mismatch in key"
+    assert value.shape[0] == kv_hidden_size, "shape mismatch in value"
+
+    # load each matrix into the matching key in the model's state dict
+    success_q = load_weight(mds_name, state_dict, hf_name['query'],
+                            missing, conversion_dict, log_file, query)
+    success_k = load_weight(mds_name, state_dict, hf_name['key'],
+                            missing, conversion_dict, log_file, key)
+    success_v = load_weight(mds_name, state_dict, hf_name['value'],
+                            missing, conversion_dict, log_file, value)
+    return all([success_q, success_k, success_v])
+
+
+def gate_up_proj_split(mds_name, hf_name, state_dict, model_config,
+                       missing, conversion_dict, log_file):
+    """
+    This function is used to convert the fused dense_h_to_4h weight from universal to HF.
+    We use this special function because of a difference in shapes:
+    in universal, dense_h_to_4h is one matrix, whereas in HF there are 2 separate matrices.
+    This is done by loading the dense_h_to_4h weight and splitting it in half along its
+    first dimension into gate_proj and up_proj.
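+
+    For example (illustrative sizes, not tied to a specific model): with hidden_size=4096 and
+    intermediate_size=11008, the universal dense_h_to_4h weight has shape (22016, 4096) and is
+    split along dim 0 into gate_proj (11008, 4096) and up_proj (11008, 4096).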
+
+    Arguments:
+        mds_name: name of weight in universal format
+        hf_name: dict with the target HF weight names for 'gate_proj' and 'up_proj'
+        state_dict: HF model state dict with all model weights
+        model_config: HF model configuration
+        missing: set of HF weight names that have not been successfully converted yet
+        conversion_dict: dict mapping universal weight names to HF weight names (or None)
+        log_file: path to log file of the conversion process
+    """
+    hidden_size = model_config.hidden_size
+    ffn_hidden_size = model_config.intermediate_size
+    mds_weight = torch.load(os.path.join(mds_name, 'fp32.pt'))['param']
+
+    # transformations from the MDS/universal gate-up-proj matrix to the 2 HF matrices
+    gate_up = torch.split(mds_weight, mds_weight.shape[0] // 2)
+    gate_proj = gate_up[0]
+    up_proj = gate_up[1]
+
+    assert gate_proj.shape[0] == ffn_hidden_size, "shape mismatch in gate_proj"
+    assert up_proj.shape[1] == hidden_size, "shape mismatch in up_proj"
+
+    # load each matrix into the matching key in the model's state dict
+    success_gate = load_weight(mds_name, state_dict, hf_name['gate_proj'],
+                               missing, conversion_dict, log_file, gate_proj)
+    success_up = load_weight(mds_name, state_dict, hf_name['up_proj'],
+                             missing, conversion_dict, log_file, up_proj)
+    return all([success_gate, success_up])
+
+
+def load_weight(mds_name, state_dict, hf_name, missing, conversion_dict, log_file, weight=None):
+    """
+    This function is used to load a universal weight into the matching HF weight name in the
+    model state dict. It also warns the user about unexpected names, shape mismatches and
+    duplicate conversions.
+
+    Arguments:
+        mds_name: name of weight in universal format
+        state_dict: HF model state dict with all model weights
+        hf_name: name after conversion to HF format
+        missing: set of HF weight names that have not been successfully converted yet
+        conversion_dict: dict mapping universal weight names to HF weight names (or None)
+        log_file: path to log file of the conversion process
+        weight: pre-loaded weight tensor for special cases (None by default, in which case the
+                weight is loaded from the universal checkpoint by name)
+    """
+    # load weight by name unless it is given
+    if weight is None:
+        weight = torch.load(os.path.join(mds_name, 'fp32.pt'))['param']
+
+    # converted name is not in model state dict
+    if hf_name not in state_dict.keys():
+        write_to_file_and_print(f'WARNING: conversion failed. tried to convert {mds_name} to ' \
+                                f'{hf_name}', log_file)
+        return False
+
+    # mismatch of shapes
+    if weight.shape != state_dict[hf_name].shape:
+        write_to_file_and_print(f'WARNING: weight shape mismatch! ' \
+                                f'MDS weight {mds_name} of shape {weight.shape} ' \
+                                f'HF weight {hf_name} of shape {state_dict[hf_name].shape} ',
+                                log_file)
+
+    # hf_name was already converted to (duplicate conversion)
+    if hf_name not in missing:
+        write_to_file_and_print(f'WARNING: converted to {hf_name} more than once? 
' \
+                                f'(tried to convert {mds_name})', log_file)
+        if conversion_dict is not None:
+            if mds_name not in conversion_dict.keys():
+                conversion_dict[mds_name] = []
+            conversion_dict[mds_name].append(hf_name)
+    else:
+        # save successful conversion
+        missing.remove(hf_name)
+        if conversion_dict is not None:
+            if mds_name not in conversion_dict.keys():
+                conversion_dict[mds_name] = []
+            conversion_dict[mds_name].append(hf_name)
+    state_dict[hf_name] = weight
+    return True
+
+
+def main():
+    args = parse_arguments()
+
+    # create output dir and log file
+    os.makedirs(args.output_dir, exist_ok=True)
+    log_file = os.path.join(args.output_dir, 'log.txt')
+    write_to_file_and_print(f'Converting Megatron-DeepSpeed model from {args.universal_dir} ' \
+                            f'weights to HuggingFace model checkpoint in {args.output_dir}', \
+                            log_file, create=True)
+    write_to_file_and_print(f'args = {args}', log_file)
+
+    # load conversion config from json
+    config = load_config(args.config)
+    write_to_file_and_print(f'successfully loaded model conversion config from {args.config}',
+                            log_file)
+
+    # load HF target model
+    if args.hf_model is not None:
+        write_to_file_and_print(f'Using HuggingFace model {args.hf_model} for conversion', log_file)
+    else:
+        write_to_file_and_print(f'Using model type {args.model_type} for conversion', log_file)
+
+    model = create_model(args, config)
+    state_dict = model.state_dict()
+
+    # do conversion layer by layer and keep track of missing/unexpected weight names
+    missing_weight_names = set(state_dict.keys())
+    unexpected_weight_names = []
+    conversion_dict = {} if args.save_conversion else None
+    for layer_type in config['LAYER_MAPPINGS'].keys():
+        if isinstance(config['LAYER_MAPPINGS'][layer_type], list):
+            layers = config['LAYER_MAPPINGS'][layer_type]
+        else:
+            layers = [config['LAYER_MAPPINGS'][layer_type]]
+        for layer_idx in layers:
+            write_to_file_and_print(f'Converting layer {layer_idx} '\
+                                    f'of type {layer_type}', log_file)
+            convert_layer(state_dict, config, layer_idx, layer_type, args.universal_dir,
+                          missing_weight_names, unexpected_weight_names, log_file,
+                          conversion_dict, model)
+
+    # check for missing / unexpected weight names and warn user
+    if unexpected_weight_names or missing_weight_names:
+        write_to_file_and_print(f'WARNING: found {len(unexpected_weight_names)} unexpected ' \
+                                f'weights and {len(missing_weight_names)} missing weights.',
+                                log_file)
+        write_to_file_and_print(f'unexpected: {unexpected_weight_names}', log_file)
+        write_to_file_and_print(f'missing: {missing_weight_names}', log_file)
+        assert not args.strict, 'weight name conversion failed'
+
+    # load converted weights to HF model and save
+    model.load_state_dict(state_dict)
+
+    safe_serialization = (args.hf_out_format == "safetensors")
+    model.save_pretrained(args.output_dir, safe_serialization=safe_serialization)
+
+    write_to_file_and_print('Successfully saved all converted weights', log_file)
+    write_to_file_and_print('Note: add tokenizer files to the output directory '
+                            'to make the converted checkpoint usable', log_file)
+
+    if args.save_conversion:
+        with open(args.save_conversion, 'w') as f:
+            json.dump(conversion_dict, f, indent=2)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/tools/hf2megads_weight_converter.py b/tools/hf2megads_weight_converter.py
index bfbde1fd05..ae05e729b5 100755
--- a/tools/hf2megads_weight_converter.py
+++ b/tools/hf2megads_weight_converter.py
@@ -1,3 +1,5 @@
+# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.
+ import torch import re import sys @@ -136,20 +138,26 @@ def _qkv_refactor(self, pname, p, hf_layer): hidden_size, self.tp_rank, self.tp_size) hidden_size_per_attention_head = divide(hidden_size, self.config.num_attention_heads) + key_value_factor = int(wq.shape[0] / wk.shape[0]) + key_value_hidden_size_per_attention_head = hidden_size_per_attention_head + hidden_size_per_attention_head = int(hidden_size_per_attention_head * key_value_factor) num_attention_heads_per_partition = divide(self.config.num_attention_heads, self.tp_size) - new_w = torch.zeros((per_partition_size * 3, wq.shape[1]), dtype=wq.dtype) + new_w = torch.zeros((int(per_partition_size * (1+2/key_value_factor)), wq.shape[1]), dtype=wq.dtype) for i in range(num_attention_heads_per_partition): current_index = start_index + i * hidden_size_per_attention_head next_index = current_index + hidden_size_per_attention_head - new_w_index = i * (3 * hidden_size_per_attention_head) - new_w[new_w_index: new_w_index + (3 * hidden_size_per_attention_head), :] = \ + key_value_current_index = start_index + i * key_value_hidden_size_per_attention_head + key_value_next_index = key_value_current_index + key_value_hidden_size_per_attention_head + combined_hidden_size = hidden_size_per_attention_head + 2 * key_value_hidden_size_per_attention_head + new_w_index = i * combined_hidden_size + new_w[new_w_index: new_w_index + combined_hidden_size, :] = \ torch.cat([ wq[current_index: next_index, :], - wk[current_index: next_index, :], - wv[current_index: next_index, :] + wk[key_value_current_index: key_value_next_index, :], + wv[key_value_current_index: key_value_next_index, :] ], dim=0) self.record_mapping_info( f"mega-ds:{pname,p.data.shape}<--hf{hf_wq_name,hf_wk_name,hf_wv_name,} cat q,k,v [{current_index}:{next_index},:] of q,k,v{wq.shape}" diff --git a/tools/preprocess_data.py b/tools/preprocess_data.py index 399f93c10e..4345de1026 100644 --- a/tools/preprocess_data.py +++ b/tools/preprocess_data.py @@ -1,3 +1,4 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. """Processing large data for pretraining.""" @@ -193,10 +194,15 @@ def get_args(): group.add_argument('--tokenizer-type', type=str, required=True, choices=['BertWordPieceLowerCase','BertWordPieceCase', 'GPT2BPETokenizer', 'SentencePieceTokenizer', - 'GPTSentencePieceTokenizer', 'NullTokenizer'], + 'GPTSentencePieceTokenizer', 'HFTokenizer', + 'NullTokenizer'], help='What type of tokenizer to use.') group.add_argument('--tokenizer-model', type=str, default=None, help='YTTM tokenizer model.') + group.add_argument('--seq-length', type=int, default=None, + help='Maximum sequence length to process.') + group.add_argument('--trust-remote-code', action='store_true', + help='to run HFTokenizer model from local path.') group.add_argument('--vocab-file', type=str, default=None, help='Path to the vocab file') group.add_argument('--vocab-size', default=786, @@ -229,7 +235,7 @@ def get_args(): print("Are you sure you don't want to split sentences?") # some default/dummy values for the tokenizer - args.rank = 1 + args.rank = 0 args.make_vocab_size_divisible_by = 128 args.tensor_model_parallel_size = 1 args.vocab_extra_ids = 0 diff --git a/tools/verify_checkpoint_non_tp_consistency.py b/tools/verify_checkpoint_non_tp_consistency.py new file mode 100644 index 0000000000..398b2505e9 --- /dev/null +++ b/tools/verify_checkpoint_non_tp_consistency.py @@ -0,0 +1,329 @@ +# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +import tqdm +import argparse +from dataclasses import dataclass +import torch +from deepspeed.checkpoint import DeepSpeedCheckpoint + + +def parse_arguments(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--folder", default=None, type=str, help="DeepSpeed Checkpoint folder" + ) + parser.add_argument( + "--model_type", + default="GPT", + type=str, + help="Type of the model", + choices=["GPT", "BLOOM", "LLAMA"], + ) + args = parser.parse_args() + print(f"args = {args}") + return args + + +def show_3d(ds_checkpoint): + src_3d = ds_checkpoint.zero_checkpoint.src_3d + dp, tp, pp = src_3d.dp_degree, src_3d.tp_degree, src_3d.pp_degree + print(f"3D configuration: DP={dp} TP={tp} PP={pp}") + + +def get_layer_patterns_for_non_sharded(model_type): + if model_type == "GPT": + return [ + "position_embeddings.weight", + "input_layernorm.weight", + "input_layernorm.bias", + "self_attention.query_key_value.bias", + "self_attention.dense.bias", + "post_attention_layernorm.weight", + "post_attention_layernorm.bias", + "mlp.dense_h_to_4h.bias", + "mlp.dense_4h_to_h.bias", + "weight", + "bias", + ] + elif model_type == "BLOOM": + return [ + "input_layernorm.weight", + "input_layernorm.bias", + "self_attention.query_key_value.bias", + "self_attention.dense.bias", + "post_attention_layernorm.weight", + "post_attention_layernorm.bias", + "mlp.dense_h_to_4h.bias", + "mlp.dense_4h_to_h.bias", + "weight", + "bias", + ] + elif model_type == "LLAMA": + return [ + "input_layernorm.weight", + "input_layernorm.bias", + "self_attention.query_key_value.bias", + "self_attention.dense.bias", + "post_attention_layernorm.weight", + "post_attention_layernorm.bias", + "mlp.dense_h_to_4h.bias", + "mlp.dense_4h_to_h.bias", + "weight", + "bias", + ] + + +def get_zero_patterns_for_non_sharded(model_type): + if model_type == "GPT": + patterns = [ + r"tied_modules.embed.word_embeddings.norm.weight", + r"tied_modules.embed.word_embeddings.norm.bias", + r"tied_modules.embed.position_embeddings.weight", + r"\d+.self_attention.query_key_value.bias", + r"\d+.self_attention.dense.bias", + r"\d+.mlp.dense_h_to_4h.bias", + r"\d+.mlp.dense_4h_to_h.bias", + r"\d+.input_layernorm.weight", + r"\d+.input_layernorm.bias", + r"\d+.post_attention_layernorm.weight", + r"\d+.post_attention_layernorm.bias", + r"\d+.weight", + r"\d+.bias", + ] + return patterns + if model_type == "BLOOM": + patterns = [ + r"tied_modules.embed.word_embeddings.norm.weight", + r"tied_modules.embed.word_embeddings.norm.bias", + r"\d+.self_attention.query_key_value.bias", + r"\d+.self_attention.dense.bias", + r"\d+.mlp.dense_h_to_4h.bias", + r"\d+.mlp.dense_4h_to_h.bias", + r"\d+.input_layernorm.weight", + r"\d+.input_layernorm.bias", + r"\d+.post_attention_layernorm.weight", + r"\d+.post_attention_layernorm.bias", + r"\d+.weight", + r"\d+.bias", + ] + return patterns + if model_type == "LLAMA": + patterns = [ + r"\d+.word_embeddings.bias", + r"\d+.self_attention.query_key_value.bias", + 
r"\d+.self_attention.dense.bias",
+            r"\d+.mlp.dense_h_to_4h.bias",
+            r"\d+.mlp.dense_4h_to_h.bias",
+            r"\d+.input_layernorm.weight",
+            r"\d+.input_layernorm.bias",
+            r"\d+.post_attention_layernorm.weight",
+            r"\d+.post_attention_layernorm.bias",
+            r"\d+.weight",
+            r"\d+.bias",
+        ]
+        return patterns
+
+
+@dataclass
+class ParamInfo:
+    pp: int
+    tp: int
+    dp: int
+    data: torch.Tensor
+    numel: int
+
+
+def get_zero_pp_stage_non_sharded_params(
+    ds_checkpoint, model_type, pp_stage, dp_stage
+):
+    # Collect, for every TP rank of the given (pp, dp) stage, the optimizer state fragments
+    # (exp_avg, exp_avg_sq, fp32) of parameters that are not sharded across TP.
+    patterns = get_zero_patterns_for_non_sharded(model_type)
+    params = {}
+    for tp_stage in tqdm.tqdm(range(ds_checkpoint.tp_degree), desc="bf16 zero files"):
+        sd = ds_checkpoint.get_zero_checkpoint_state(
+            pp_index=pp_stage, tp_index=tp_stage, dp_index=dp_stage
+        )
+
+        optim_sd = sd["optimizer_state_dict"]
+        param_slice_mappings = optim_sd["param_slice_mappings"]
+        state_groups = optim_sd["base_optimizer_state"]["state"]
+        fp32_groups = optim_sd["single_partition_of_fp32_groups"]
+
+        for param_group_id in range(len(state_groups)):
+            flat_state = dict(
+                exp_avg=state_groups[param_group_id]["exp_avg"],
+                exp_avg_sq=state_groups[param_group_id]["exp_avg_sq"],
+                fp32=fp32_groups[param_group_id],
+            )
+
+            for name, fragment_mapping in param_slice_mappings[param_group_id].items():
+                if not any(re.match(pattern, name) for pattern in patterns):
+                    continue
+
+                for state_key in flat_state.keys():
+                    tensor = (
+                        flat_state[state_key]
+                        .narrow(
+                            dim=0,
+                            start=fragment_mapping.start,
+                            length=fragment_mapping.numel,
+                        )
+                        .clone()
+                    )
+                    info = ParamInfo(
+                        pp=pp_stage,
+                        tp=tp_stage,
+                        dp=dp_stage,
+                        data=tensor,
+                        numel=fragment_mapping.numel,
+                    )
+                    full_name = name + ".__" + state_key
+                    if full_name not in params:
+                        params[full_name] = []
+                    params[full_name].append(info)
+    return params
+
+
+def verify_equal_params(params, tp):
+    # Every non-TP-sharded parameter is expected to appear once per TP rank with identical data.
+    failed = 0
+    report = {}
+    for name, info in params.items():
+        n = len(info)
+        if n != tp:
+            ok = False
+            print(f"{name}: FAILED expected n={n} == tp={tp}")
+        elif n == 1:
+            ok = True
+        else:
+            ok = all([(x.numel == info[0].numel) for x in info[1:]])
+            if not ok:
+                print(f"{name}: FAILED numel comparison [n={n}]")
+            else:
+                ok = all([x.data.eq(info[0].data).all().item() for x in info[1:]])
+                if not ok:
+                    print(f"{name}: FAILED data comparison [n={n}]")
+        failed += not ok
+        report[name] = (ok, n)
+        if ok:
+            print(f"{name}: OK [n={n}]")
+    return failed, report
+
+
+def update_layer_non_sharded_params(params, model_type, filename, pp_index, tp_index):
+    # Collect the non-TP-sharded parameters stored in a layer_* checkpoint file.
+    layer_id, file_tp_index = re.search(r"layer_(\d+)-model_(\d+)", filename).groups()
+    layer_id = int(layer_id)
+    file_tp_index = int(file_tp_index)
+    if tp_index != file_tp_index:
+        print(f"WARNING: inconsistent tp index: tp_index={tp_index} "
+              f"file_tp_index={file_tp_index}")
+
+    sd = torch.load(filename, map_location=torch.device("cpu"))
+    sequential_layers = get_layer_patterns_for_non_sharded(model_type)
+    for key in sd.keys():
+        if key in sequential_layers:
+            param_key = str(layer_id) + "." 
+ key + if param_key not in params: + params[param_key] = [] + info = ParamInfo( + pp=pp_index, tp=tp_index, dp=-1, data=sd[key], numel=sd[key].numel() + ) + params[param_key].append(info) + return params + + +def verify_layer_files(ds_checkpoint, model_type): + src_3d = ds_checkpoint.zero_checkpoint.src_3d + dp, tp, pp = src_3d.dp_degree, src_3d.tp_degree, src_3d.pp_degree + + total_failed = 0 + for pp_index in range(pp): + print(f"\nChecking pp_stage={pp_index}") + params = {} + if pp_index == 0: + for tp_index in range(tp): + for filename in ds_checkpoint.tp_to_embedding_map[tp_index]: + update_layer_non_sharded_params( + params, model_type, filename, pp_index, tp_index + ) + for tp_index in range(tp): + for filename_list in ds_checkpoint.transformer_file_map[ + (tp_index, pp_index) + ]: + for filename in filename_list: + update_layer_non_sharded_params( + params, model_type, filename, pp_index, tp_index + ) + if pp_index == (pp - 1): + for tp_index in range(tp): + for filename in ds_checkpoint.tp_to_final_norm_map[tp_index]: + update_layer_non_sharded_params( + params, model_type, filename, pp_index, tp_index + ) + failed, report = verify_equal_params(params, tp) + total_failed += failed + return total_failed + + +def verify_zero_files(ds_checkpoint, model_type): + src_3d = ds_checkpoint.zero_checkpoint.src_3d + dp, tp, pp = src_3d.dp_degree, src_3d.tp_degree, src_3d.pp_degree + + total_failed = 0 + for i in range(pp): + for j in range(dp): + print(f"\nChecking pp_stage={i} dp_stage={j}") + params = get_zero_pp_stage_non_sharded_params( + ds_checkpoint, model_type, pp_stage=i, dp_stage=j + ) + failed, report = verify_equal_params(params, tp) + total_failed += failed + return total_failed + + +def verify_checkpoint(folder, model_type): + final_layer_norm_idx = -2 if model_type == "LLAMA" else -1 + ds_checkpoint = DeepSpeedCheckpoint( + folder, final_layer_norm_idx=final_layer_norm_idx + ) + ds_checkpoint.validate_files() + show_3d(ds_checkpoint) + + print("\nVerify ** layer_ ** files") + total_failed_layer = verify_layer_files(ds_checkpoint, model_type) + if total_failed_layer == 0: + print("\nCheckpoint layer files OK") + else: + print(f"\nCheckpoint layer files BAD with total_failed={total_failed_layer}") + + print("\nVerify ** bf16_zero_ ** files") + total_failed_zero = verify_zero_files(ds_checkpoint, model_type) + if total_failed_zero == 0: + print("\nCheckpoint zero files OK") + else: + print(f"\nCheckpoint zero files BAD with total_failed={total_failed_zero}") + + return (total_failed_layer + total_failed_zero) == 0 + + +def main(): + print(f"Verify DeepSpeed Checkpoint consistency for non-TP-sharded parameters") + args = parse_arguments() + assert ( + verify_checkpoint(args.folder, args.model_type) is True + ), "Checkpoint verification failed" + + +if __name__ == "__main__": + main()
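+
+# Example usage (illustrative; the checkpoint folder below is an assumption, adjust to your run):
+#   python tools/verify_checkpoint_non_tp_consistency.py \
+#       --folder /path/to/checkpoints/global_step1000 \
+#       --model_type LLAMA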