Hi,
I executed Deckard to detect clones on a dataset of 47k source files. However, after a day of execution, I ran into an error. Below you can find the content of the different log files.
cluster_vdb_50_4_g9_2.50998_30_100000
Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
Warning: output all clones. Takes more time...
Warning: will compute parameters
Error: the structure supports at most 2097151 points (3238525 were specified).
real 2m58.162s
user 2m50.464s
sys 0m7.492s
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
paramsetting_50_4_0.79_30
paramsetting: 50 4 0.79 ...Looking for optimal parameters by Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
Error: paramsetting failure...exit.
grouping_50_4_2.50998_30
grouping: vectors/vdb_50_4 with distance=2.50998...Total 7602630 vectors read in; 11282415 vectors dispatched into 57 ranges (actual groups may be many fewer).
real 410m12.610s
user 6m43.592s
sys 26m6.544s
Done grouping 50 4 2.50998. See groups in vectors/vdb_50_4_g[0-9]_2.50998_30
Note that I have sufficient memory for execution; thus, I added two extra conditions to the memory limit setting in both the vecquery and vertical-param-batch files. I increased the memory limit because my vector file is larger than 2 GB, and availability of memory is not a problem. The conditions now look like this:
# dumb (not flexible) memory limit setting
mem=`wc "$vdb" | awk '{printf("%.0f", $3/1024/1024+0.5)}'`
if [ $mem -lt 2 ]; then
    mem=10000000
elif [ $mem -lt 5 ]; then
    mem=20000000
elif [ $mem -lt 10 ]; then
    mem=30000000
elif [ $mem -lt 20 ]; then
    mem=60000000
elif [ $mem -lt 50 ]; then
    mem=150000000
elif [ $mem -lt 100 ]; then
    mem=300000000
elif [ $mem -lt 200 ]; then
    mem=600000000
elif [ $mem -lt 500 ]; then
    mem=900000000
elif [ $mem -lt 1024 ]; then
    mem=1900000000
elif [ $mem -lt 2048 ]; then
    mem=3800000000
elif [ $mem -lt 4096 ]; then # this condition is added by me
    mem=7600000000
elif [ $mem -lt 8192 ]; then # this condition is added by me
    mem=15200000000
else
    echo "Error: Size of $vdb > 8G. I don't want to do it before you think of any optimization." | tee -a "$TIME_DIR/cluster_${vfile}"
    exit 1
fi
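For reference, this mem value is the byte limit passed to enumBuckets via -M; the -M 7600000000 in the log above matches the new "elif [ $mem -lt 4096 ]" branch, i.e., a vector file between 2 GB and 4 GB.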
Deckard's parameters are set to the following values:
MIN_TOKENS='50'
STRIDE='4'
SIMILARITY='0.79'
MAX_PROCS='40'
I have attached the log files. Please help me mitigate this problem; I need your tool for my experiments.
deckard log.zip
Error: the structure supports at most 2097151 points (3238525 were specified).
The error is an inherent limitation of the LSH library used in Deckard; it cannot handle more than about 2 million vectors (2,097,151 points) at a time.
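For what it's worth, 2097151 is exactly 2^21 - 1, which suggests the point index is packed into a 21-bit field inside the LSH data structure; that is an assumption about the library's internals, not something verified against its code.
# 2097151 = 2^21 - 1: a 21-bit index can address at most this many points
echo $(( (1 << 21) - 1 ))   # prints 2097151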
Using 0.79 similarity is not recommended, as it often leads to many false positives.
Use a higher similarity, say 0.90.
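(In the configuration shown above, that would simply be SIMILARITY='0.90'.)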
Alternatively, split your input dataset into smaller ones before feeding it into Deckard. After getting clone results for each smaller dataset, de-duplicate the vectors identified as clones into one, merge all the remaining vectors into one dataset, and run Deckard again (you will need to write your own scripts for this, and the split/merge process may introduce false negatives). A sketch of the split step follows.
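Here is a rough sketch of that split at the vector-file level, assuming one vector per line in the vdb file (verify this against your format) and reusing the enumBuckets flags and per-group radius from the logs above; the chunk size, part names, and Deckard path are illustrative:
# split the oversized group into chunks below the 2^21 - 1 point limit
split -l 2000000 vectors/vdb_50_4_g9_2.50998_30_100000 vectors/vdb_part_
for part in vectors/vdb_part_*; do
  /path/to/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A \
    -f "$part" -c -p "$part.param" > "clusters/cluster_$(basename "$part")"
done
# then de-duplicate the clone vectors across chunks, merge the remaining
# vectors into one file, and run Deckard once more on the merged set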
@alex-from-intuita
As far as I remember, this is related to the limitations of one of the bash commands used in the tool. For my part, I ended up using a higher similarity threshold to deal with it. Also, please visit our lab website at https://sail.cs.queensu.ca/ for more information about the state of our research. I am particularly studying the quality of smart contracts.
Best,
Amir.