-
Hm, that doesn't sound right.
-
By creating a MWE to show the issue, I've also discovered that submodules do look like a workaround to the issue. Each submodule uses its own .dvc cache, and dvc commands only traverse that submodule, so you get a nice flat plot as the number of datasets increases.

```bash
#!/bin/bash
ROOT_DPATH=$HOME/tmp/dvc-submodule-mwe
EXTERN_DPATH=$ROOT_DPATH/extern
REPO_DPATH=$ROOT_DPATH/demo-repo
NUM_SUBREPOS=128
create_git_repo(){
    # Initialize git repo
    git init
    git branch -m main
    git config --local user.email [email protected]
    git config --local user.name "DVC Tester"
    # Initialize DVC inside of the git repo
    dvc init --verbose
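    # Link types are tried in order; preferring links avoids copying data into the workspace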
    dvc config cache.type symlink,reflink,hardlink,copy
    dvc config cache.protected true
    dvc config core.analytics false
    dvc config core.check_update false
    dvc config core.autostage true
    git commit -am "Initial Commit"
}
create_demo_dataset_part_modality(){
    DATASET_NAME=$1
    PART_NAME=$2
    MODALITY_NAME=$3
    MODALITY_DPATH=$DATASET_NAME/$PART_NAME/$MODALITY_NAME
    FILE_SIZE="8192"
    #FILE_SIZE="16384"
    #FILE_SIZE="32768"
    #FILE_SIZE="16777216"
    NUM_FILES_PER_MODALITY=256
    mkdir -p "$MODALITY_DPATH"
    echo "Create $MODALITY_DPATH"
    for index in $(seq 1 $NUM_FILES_PER_MODALITY); do
        head -c "$FILE_SIZE" /dev/random | base32 > "$MODALITY_DPATH"/"data_$index"
    done
}
create_demo_dataset_part(){
    DATASET_NAME=$1
    PART_NAME=$2
    PART_DPATH=$DATASET_NAME/$PART_NAME
    MANIFEST_FPATH="$PART_DPATH/$PART_NAME.manifest.json"
    mkdir -p "$PART_DPATH"
    echo "$DATASET_NAME manifest $PART_NAME" > "$MANIFEST_FPATH"
    echo "Create $PART_DPATH"
    create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_1"
    #create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_2"
    #create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_3"
}
create_demo_dataset(){
    DATASET_NAME=$1
    create_demo_dataset_part "$DATASET_NAME" "part_01"
    create_demo_dataset_part "$DATASET_NAME" "part_02"
    create_demo_dataset_part "$DATASET_NAME" "part_03"
    create_demo_dataset_part "$DATASET_NAME" "part_04"
    git commit -am "Add dataset $DATASET_NAME"
}
create_subrepos(){
    # Create an empty git+dvc repo for each dataset
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        echo "CREATE SUBREPO_DPATH = $SUBREPO_DPATH"
        mkdir -p "$SUBREPO_DPATH"
        cd "$SUBREPO_DPATH" || exit
        create_git_repo
    done
    cd "$ROOT_DPATH"
}
create_main_repo(){
    mkdir -p "$REPO_DPATH"
    cd "$REPO_DPATH"
    create_git_repo
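    # Newer git blocks file-protocol submodule clones by default; allow it for this local MWE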
    git config --local protocol.file.allow always
    git config --global protocol.file.allow always
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        echo "SUBREPO_DPATH = $SUBREPO_DPATH"
        git submodule add "$SUBREPO_DPATH"/.git
    done
    cd "$ROOT_DPATH"
}
plot(){
    echo "
    REQUIREMENTS:
    pip install ubelt pandas seaborn kwplot[headless] PyQt5
    "
    python -c "if 1:
        import kwplot
        import ubelt as ub
        import pandas as pd
        sns = kwplot.autosns()
        root_dpath = ub.Path('~/tmp/dvc-submodule-mwe').expand()
        fpaths = list((root_dpath / 'records').glob('*.time'))
        print(f'{fpaths=}')
        rows = []
        for fpath in fpaths:
            action = '_'.join(fpath.stem.split('_')[0:2])
            index = int(fpath.stem.split('_')[-1])
            row = {}
            try:
                stdout_fpath = fpath.augment(ext='.stdout')
                row['index'] = index
                row['fpath'] = fpath
                text = fpath.read_text()
                parts = text.split(' ')
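                # GNU time writes e.g. '1.23user 0.45system 0:01.70elapsed ...' by default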
                row['user'] = float(parts[0].split('user')[0])
                row['system'] = float(parts[1].split('system')[0])
                row['action'] = action
                row['stdout_lines'] = stdout_fpath.read_text().count('\n')
                #parts[2].split('elapsed')[0]
                rows.append(row)
            except Exception as ex:
                print(ex)
                print(fpath)
                continue
        df = pd.DataFrame(rows)
        print(df)
        sns.lineplot(data=df, x='index', y='user', hue='action')
        kwplot.plt.show()
    "
}
main(){
    # Create a fresh workspace
    rm -rf "$ROOT_DPATH"
    mkdir -p "$ROOT_DPATH"
    cd "$ROOT_DPATH"
    create_subrepos
    create_main_repo

    # Store timing records outside of the repo
    RECORDS_DPATH=$ROOT_DPATH/records
    echo "RECORDS_DPATH = $RECORDS_DPATH"
    mkdir -p "$RECORDS_DPATH"

    # Add multiple simple datasets and time how long it takes to add each one
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        cd "$SUBREPO_DPATH"
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        create_demo_dataset "$DATASET_NAME"
        #/usr/bin/time --output "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time" -- dvc add -vvv "$DATASET_NAME"/*/*.json > "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time" -- \
            dvc add "$DATASET_NAME"/*/*.json \
            > "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.stdout"
        cat "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time"
        #/usr/bin/time --output "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time" -- dvc add -vvv "$DATASET_NAME"/*/MODALITY_* > "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time" -- \
            dvc add "$DATASET_NAME"/*/MODALITY_* \
            > "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.stdout"
        cat "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time"
        git commit -am "Add DVC files"
    done

    cd "$REPO_DPATH"
    git submodule update --init --recursive
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        SUBMODULE_DPATH=$REPO_DPATH/$SUBREPO_NAME
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        echo "SUBMODULE_DPATH = $SUBMODULE_DPATH"
        cd "$SUBMODULE_DPATH"
        git pull
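        # Register the source repo's cache as a DVC remote named 'local' so pulls work offline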
        dvc remote add local "$SUBREPO_DPATH/.dvc/cache"
    done

    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        SUBMODULE_DPATH=$REPO_DPATH/$SUBREPO_NAME
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        echo "SUBMODULE_DPATH = $SUBMODULE_DPATH"
        cd "$SUBMODULE_DPATH"
        /usr/bin/time --output "$RECORDS_DPATH"/"pull_manifest_$DATASET_NAME.time" -- \
            dvc pull -r local -vvv */*/*.json.dvc \
            > "$RECORDS_DPATH"/"pull_manifest_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"pull_modality_$DATASET_NAME.time" -- \
            dvc pull -r local -vvv "$DATASET_NAME"/*/MODALITY_*.dvc \
            > "$RECORDS_DPATH"/"pull_modality_$DATASET_NAME.stdout"
    done

    cd "$ROOT_DPATH"
    plot
}
# bpkg convention
# https://github.com/bpkg/bpkg
if [[ ${BASH_SOURCE[0]} != "$0" ]]; then
    # We are sourcing the library
    echo "Sourcing the library"
else
    # Executing file as a script
    main "${@}"
    exit $?
fi
```

Of course, it would be great if DVC simply didn't need to TRACE the entire repo when executing an add/pull on a subset of files. That would eliminate the problem. Perhaps this discussion should turn into an issue?
-
Okay, thanks for the detailed explanation. You are right that dvc will always traverse the repo to check that there are no overlaps (e.g., two .dvc files tracking the same file), although it will not traverse within a dvc-tracked directory, nor outside of the dvc repo (which is why using submodules helps in your use case). Another option would be to consolidate more of your data into fewer, larger .dvc files, along the lines of the sketch below.
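For instance, a rough sketch (directory names follow the layout in this thread; `dvc add` on a directory tracks the whole tree as a single output):

```bash
# Fine-grained: one .dvc file per manifest and per modality directory
# dvc add dataset_001/part_01/part_01.manifest.json
# dvc add dataset_001/part_01/MODALITY_1

# Coarse-grained: one .dvc file for the whole dataset, so there are far
# fewer .dvc files for dvc to enumerate on each add/push/pull
dvc add dataset_001
git add dataset_001.dvc .gitignore
git commit -m "Track dataset_001 as a single DVC output"
```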
-
I'm having a scalability issue with DVC. If I create a repo with many datasets, any modification to one dataset takes a very long time because DVC wants to crawl the filesystem to reason about the current repo state.
The way I structure DVC repos for a project is one "data" repo that houses multiple datasets, which consist of multiple sub-parts. So I would have a tree that looks like:
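```
data-repo/
├── dataset_001/
│   ├── part_01/
│   │   ├── part_01.manifest.json
│   │   ├── MODALITY_1/
│   │   │   ├── data_1
│   │   │   ├── ...
│   │   │   └── data_256
│   │   └── MODALITY_2/
│   │       └── ...
│   ├── part_02/
│   │   └── ...
│   └── ...
├── dataset_002/
│   └── ...
└── ...
```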
Generating code:
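Roughly like this (a sketch; names, sizes, and counts are illustrative):

```bash
#!/bin/bash
# Illustrative generator: many datasets, each with parts, each with modality dirs
for i in $(seq -w 1 100); do
    for j in 01 02 03 04; do
        for k in 1 2 3; do
            MODALITY_DPATH="dataset_$i/part_$j/MODALITY_$k"
            mkdir -p "$MODALITY_DPATH"
            for n in $(seq 1 256); do
                head -c 8192 /dev/random | base32 > "$MODALITY_DPATH/data_$n"
            done
        done
        echo "manifest for dataset_$i part_$j" > "dataset_$i/part_$j/part_$j.manifest.json"
    done
done
```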
With the number of datasets, parts, and modalities allowed to grow fairly large.
Every time I do a `dvc add`, `dvc push`, or `dvc pull` on one dataset, it has to check every other dataset, which adds too much overhead. I'm wondering: if I split each dataset into a git submodule, and then make one mono-repo that adds all the datasets as submodules (i.e. each `dataset_<xxx>` directory is a git submodule), will DVC operations only crawl the relevant submodule? I think the overhead isn't so bad when there is only 1 dataset in a repo, but if you have a repo with 3 small datasets and one HUGE dataset, then doing anything to the small datasets incurs the overhead of the HUGE dataset, and I'd like to avoid that. Will git submodules work as a way to do that? (A sketch of the layout I have in mind is below.)
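Concretely, something like this (a sketch; repo paths are hypothetical):

```bash
# One mono-repo whose datasets are git submodules, each with its own DVC cache
git init data-monorepo
cd data-monorepo
git submodule add ../dataset_001.git dataset_001   # hypothetical local paths
git submodule add ../dataset_002.git dataset_002
git commit -m "Add dataset submodules"
# The hope: dvc commands run inside one submodule only see that submodule's repo
cd dataset_001
dvc status
```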