-
Hm, that doesn't sound right.
-
By creating a MWE to show the issue, I've also discovered that submodules do look like a workaround to the issue. Each submodule uses its own .dvc cache, and dvc commands only traverse that submodule, so you get a nice flat plot as the number of datasets increases.

```bash
#!/bin/bash
ROOT_DPATH=$HOME/tmp/dvc-submodule-mwe
EXTERN_DPATH=$ROOT_DPATH/extern
REPO_DPATH=$ROOT_DPATH/demo-repo
NUM_SUBREPOS=128
create_git_repo(){
    # Initialize git repo
    git init
    git branch -m main
    git config --local user.email [email protected]
    git config --local user.name "DVC Tester"
    # Initialize DVC inside of the git repo
    dvc init --verbose
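    # Link types are tried in order; preferring links avoids copying data into the workspace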
    dvc config cache.type symlink,reflink,hardlink,copy
    dvc config cache.protected true
    dvc config core.analytics false
    dvc config core.check_update false
    dvc config core.autostage true
    git commit -am "Initial Commit"
}
create_demo_dataset_part_modality(){
    DATASET_NAME=$1
    PART_NAME=$2
    MODALITY_NAME=$3
    MODALITY_DPATH=$DATASET_NAME/$PART_NAME/$MODALITY_NAME
    FILE_SIZE="8192"
    #FILE_SIZE="16384"
    #FILE_SIZE="32768"
    #FILE_SIZE="16777216"
    NUM_FILES_PER_MODALITY=256
    mkdir -p "$MODALITY_DPATH"
    echo "Create $MODALITY_DPATH"
    for index in $(seq 1 $NUM_FILES_PER_MODALITY); do
        head -c "$FILE_SIZE" /dev/random | base32 > "$MODALITY_DPATH"/"data_$index"
    done
}
create_demo_dataset_part(){
    DATASET_NAME=$1
    PART_NAME=$2
    PART_DPATH=$DATASET_NAME/$PART_NAME
    MANIFEST_FPATH="$PART_DPATH/$PART_NAME.manifest.json"
    mkdir -p "$PART_DPATH"
    echo "$DATASET_NAME manifest $PART_NAME" > "$MANIFEST_FPATH"
    echo "Create $PART_DPATH"
    create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_1"
    #create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_2"
    #create_demo_dataset_part_modality "$DATASET_NAME" "$PART_NAME" "MODALITY_3"
}
create_demo_dataset(){
    DATASET_NAME=$1
    create_demo_dataset_part "$DATASET_NAME" "part_01"
    create_demo_dataset_part "$DATASET_NAME" "part_02"
    create_demo_dataset_part "$DATASET_NAME" "part_03"
    create_demo_dataset_part "$DATASET_NAME" "part_04"
    git commit -am "Add dataset $DATASET_NAME"
}
create_subrepos(){
    # Create an empty git+dvc repo for each dataset
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        echo "CREATE SUBREPO_DPATH = $SUBREPO_DPATH"
        mkdir -p "$SUBREPO_DPATH"
        cd "$SUBREPO_DPATH" || exit
        create_git_repo
    done
    cd "$ROOT_DPATH"
}
create_main_repo(){
    mkdir -p "$REPO_DPATH"
    cd "$REPO_DPATH"
    create_git_repo
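    # Newer git blocks file-protocol submodule clones by default; allow it for this local MWE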
    git config --local protocol.file.allow always
    git config --global protocol.file.allow always
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        echo "SUBREPO_DPATH = $SUBREPO_DPATH"
        git submodule add "$SUBREPO_DPATH"/.git
    done
    cd "$ROOT_DPATH"
}
plot(){
    echo "
    REQUIREMENTS:
    pip install ubelt pandas seaborn kwplot[headless] PyQt5
    "
    python -c "if 1:
        import kwplot
        import ubelt as ub
        import pandas as pd
        sns = kwplot.autosns()
        root_dpath = ub.Path('~/tmp/dvc-submodule-mwe').expand()
        fpaths = list((root_dpath / 'records').glob('*.time'))
        print(f'{fpaths=}')
        rows = []
        for fpath in fpaths:
            action = '_'.join(fpath.stem.split('_')[0:2])
            index = int(fpath.stem.split('_')[-1])
            row = {}
            try:
                stdout_fpath = fpath.augment(ext='.stdout')
                row['index'] = index
                row['fpath'] = fpath
                text = fpath.read_text()
                parts = text.split(' ')
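                # GNU time writes e.g. '1.23user 0.45system 0:01.70elapsed ...' by default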
                row['user'] = float(parts[0].split('user')[0])
                row['system'] = float(parts[1].split('system')[0])
                row['action'] = action
                row['stdout_lines'] = stdout_fpath.read_text().count('\n')
                #parts[2].split('elapsed')[0]
                rows.append(row)
            except Exception as ex:
                print(ex)
                print(fpath)
                continue
        df = pd.DataFrame(rows)
        print(df)
        sns.lineplot(data=df, x='index', y='user', hue='action')
        kwplot.plt.show()
    "
}
main(){
    # Create a fresh workspace
    rm -rf "$ROOT_DPATH"
    mkdir -p "$ROOT_DPATH"
    cd "$ROOT_DPATH"
    create_subrepos
    create_main_repo

    # Store timing records outside of the repo
    RECORDS_DPATH=$ROOT_DPATH/records
    echo "RECORDS_DPATH = $RECORDS_DPATH"
    mkdir -p "$RECORDS_DPATH"

    # Add multiple simple datasets and time how long it takes to add each one
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        cd "$SUBREPO_DPATH"
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        create_demo_dataset "$DATASET_NAME"
        #/usr/bin/time --output "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time" -- dvc add -vvv "$DATASET_NAME"/*/*.json > "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time" -- \
            dvc add "$DATASET_NAME"/*/*.json \
            > "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.stdout"
        cat "$RECORDS_DPATH"/"add_manifest_$DATASET_NAME.time"
        #/usr/bin/time --output "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time" -- dvc add -vvv "$DATASET_NAME"/*/MODALITY_* > "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time" -- \
            dvc add "$DATASET_NAME"/*/MODALITY_* \
            > "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.stdout"
        cat "$RECORDS_DPATH"/"add_modality_$DATASET_NAME.time"
        git commit -am "Add DVC files"
    done

    cd "$REPO_DPATH"
    git submodule update --init --recursive
    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        SUBMODULE_DPATH=$REPO_DPATH/$SUBREPO_NAME
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        echo "SUBMODULE_DPATH = $SUBMODULE_DPATH"
        cd "$SUBMODULE_DPATH"
        git pull
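        # Register the source repo's cache as a DVC remote named 'local' so pulls work offline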
        dvc remote add local "$SUBREPO_DPATH/.dvc/cache"
    done

    for subrepo_index in $(seq 1 $NUM_SUBREPOS); do
        SUBREPO_NAME="subrepo_${subrepo_index}"
        SUBREPO_DPATH=$EXTERN_DPATH/$SUBREPO_NAME
        SUBMODULE_DPATH=$REPO_DPATH/$SUBREPO_NAME
        dataset_index=$subrepo_index
        DATASET_NAME="dataset_$dataset_index"
        echo "SUBMODULE_DPATH = $SUBMODULE_DPATH"
        cd "$SUBMODULE_DPATH"
        /usr/bin/time --output "$RECORDS_DPATH"/"pull_manifest_$DATASET_NAME.time" -- \
            dvc pull -r local -vvv */*/*.json.dvc \
            > "$RECORDS_DPATH"/"pull_manifest_$DATASET_NAME.stdout"
        /usr/bin/time --output "$RECORDS_DPATH"/"pull_modality_$DATASET_NAME.time" -- \
            dvc pull -r local -vvv "$DATASET_NAME"/*/MODALITY_*.dvc \
            > "$RECORDS_DPATH"/"pull_modality_$DATASET_NAME.stdout"
    done

    cd "$ROOT_DPATH"
    plot
}
# bpkg convention
# https://github.com/bpkg/bpkg
if [[ ${BASH_SOURCE[0]} != "$0" ]]; then
    # We are sourcing the library
    echo "Sourcing the library"
else
    # Executing file as a script
    main "${@}"
    exit $?
fi
```

Of course, it would be great if DVC simply didn't need to TRACE the entire repo when executing an add/pull on a subset of files. That would eliminate the problem. Perhaps this discussion should turn into an issue?
-
Okay, thanks for the detailed explanation. You are right that dvc will always traverse the repo to check that there are no overlaps (e.g., two .dvc files tracking the same file), although it will not traverse within a dvc-tracked directory, nor outside of the dvc repo (which is why using submodules helps in your use case). Another option would be to consolidate more of your data into fewer, larger .dvc files, along the lines of the sketch below.
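For instance, a rough sketch (directory names follow the layout in this thread; `dvc add` on a directory tracks the whole tree as a single output):

```bash
# Fine-grained: one .dvc file per manifest and per modality directory
# dvc add dataset_001/part_01/part_01.manifest.json
# dvc add dataset_001/part_01/MODALITY_1

# Coarse-grained: one .dvc file for the whole dataset, so there are far
# fewer .dvc files for dvc to enumerate on each add/push/pull
dvc add dataset_001
git add dataset_001.dvc .gitignore
git commit -m "Track dataset_001 as a single DVC output"
```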
-
I'm having a scalability issue with DVC. If I create a repo with many datasets, any modification to one dataset takes a very long time because DVC wants to crawl the filesystem to reason about the current repo state.
The way I structure DVC repos for a project is one "data" repo that houses multiple datasets, which consist of multiple sub-parts. So I would have a tree that looks like:
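```
data-repo/
├── dataset_001/
│   ├── part_01/
│   │   ├── part_01.manifest.json
│   │   ├── MODALITY_1/
│   │   │   ├── data_1
│   │   │   ├── ...
│   │   │   └── data_256
│   │   └── MODALITY_2/
│   │       └── ...
│   ├── part_02/
│   │   └── ...
│   └── ...
├── dataset_002/
│   └── ...
└── ...
```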
Generating code:
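Roughly like this (a sketch; names, sizes, and counts are illustrative):

```bash
#!/bin/bash
# Illustrative generator: many datasets, each with parts, each with modality dirs
for i in $(seq -w 1 100); do
    for j in 01 02 03 04; do
        for k in 1 2 3; do
            MODALITY_DPATH="dataset_$i/part_$j/MODALITY_$k"
            mkdir -p "$MODALITY_DPATH"
            for n in $(seq 1 256); do
                head -c 8192 /dev/random | base32 > "$MODALITY_DPATH/data_$n"
            done
        done
        echo "manifest for dataset_$i part_$j" > "dataset_$i/part_$j/part_$j.manifest.json"
    done
done
```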
With the number of datasets, parts, and modalities allowed to grow fairly large.
Every time I do a `dvc add`, `dvc push`, or `dvc pull` on one dataset, it has to check every other dataset, which adds too much overhead. I'm wondering: if I split each dataset into a git submodule, and then make one mono-repo that adds all the datasets as submodules (i.e. each `dataset_<xxx>` directory is a git submodule), will DVC operations only crawl the relevant submodule? I think the overhead isn't so bad when there is only 1 dataset in a repo, but if you have a repo with 3 small datasets and one HUGE dataset, then doing anything to the small datasets incurs the overhead of the HUGE dataset, and I'd like to avoid that. Will git submodules work as a way to do that? (A sketch of the layout I have in mind is below.)
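Concretely, something like this (a sketch; repo paths are hypothetical):

```bash
# One mono-repo whose datasets are git submodules, each with its own DVC cache
git init data-monorepo
cd data-monorepo
git submodule add ../dataset_001.git dataset_001   # hypothetical local paths
git submodule add ../dataset_002.git dataset_002
git commit -m "Add dataset submodules"
# The hope: dvc commands run inside one submodule only see that submodule's repo
cd dataset_001
dvc status
```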