Large Variation in Block Lengths Within Clusters in post_cluster_* #36

Momo-Not-Emo · 2025-02-24T18:43:12Z

Description

I am using Deckard with the following configuration:

MIN_TOKENS='20 30 50'  # can be a sequence of integers
STRIDE='inf'  # can be a sequence of integers
SIMILARITY='1'  # can be a sequence of values <= 1

I set STRIDE to inf because of these two statements:

"The setting with an infinite stride means that vector merging was disabled." (reference)
"If stride is set to infinity, only non-overlapping and syntactically complete pieces of code (e.g., a complete if statement or a complete for statement) are considered for clones." (reference)

After running Deckard, I noticed that some clusters in post_cluster_* contain code blocks of vastly different lengths. For example, the following cluster pairs one block spanning 1057 lines with another spanning only 110 lines:

000000042	dist:0.0	FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536
000000008	dist:0.0	FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313

Since dist:0.0 suggests they are considered identical clones, I would like to understand why blocks of such different lengths are grouped together.
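
To make the mismatch concrete, here is a minimal sketch that parses the two entries above and compares their fields. It rests on my own reading of the output, not on Deckard's source: I assume LINE:a:b means "starts at line a and spans b lines", and that TBID/TEID are begin/end token IDs. The names parse_entry and span_ratio are hypothetical helpers, not part of Deckard.

```python
import re

# Field layout inferred from the two sample lines in this issue (an assumption,
# not taken from Deckard's source code).
ENTRY_RE = re.compile(
    r"^(?P<id>\d+)\s+dist:(?P<dist>[\d.]+)\s+FILE\s+(?P<file>\S+)\s+"
    r"LINE:(?P<start>\d+):(?P<span>\d+)\s+.*?"
    r"NUM_NODE:(?P<nodes>\d+)\s+TBID:(?P<tbid>\d+)\s+TEID:(?P<teid>\d+)"
)


def parse_entry(line):
    """Parse one cluster entry into a dict, or return None if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m["file"],
        "start": int(m["start"]),
        "span": int(m["span"]),  # assumed: number of source lines covered
        "nodes": int(m["nodes"]),
        "token_width": int(m["teid"]) - int(m["tbid"]),  # assumed: token ID range
    }


def span_ratio(entries):
    """Ratio between the largest and smallest line span in one cluster."""
    spans = [e["span"] for e in entries]
    return max(spans) / max(1, min(spans))


cluster = [
    "000000042\tdist:0.0\tFILE src/findbugs/src/java/edu/umd/cs/findbugs/"
    "SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 "
    "TBID:517 TEID:536",
    "000000008\tdist:0.0\tFILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/"
    "findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 "
    "NUM_NODE:277 TBID:294 TEID:313",
]

entries = [parse_entry(line) for line in cluster]
for e in entries:
    print(e["file"].rsplit("/", 1)[-1], e["span"], e["nodes"], e["token_width"])
print(f"line-span ratio: {span_ratio(entries):.1f}")  # prints "line-span ratio: 9.6"
```

Under those assumptions, both entries report the same NUM_NODE (277) and the same TEID - TBID difference (19), while their line spans differ by a factor of about 9.6. So the two blocks look identical in the fields other than LINE, which is part of what I am trying to understand.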

Here are the original files in the above cluster:
files.zip

I also tried setting stride=2, but the large variation in block lengths within clusters still persists.

Questions

How does Deckard determine similarity when block lengths vary significantly?
Could this be due to my configuration, or is it expected behavior?
