Large Variation in Block Lengths Within Clusters in post_cluster_* #36

Open
Momo-Not-Emo opened this issue Feb 24, 2025 · 0 comments

Description

I am using Deckard with the following configuration:

MIN_TOKENS='20 30 50'  # can be a sequence of integers
STRIDE='inf'  # can be a sequence of integers
SIMILARITY='1'  # can be a sequence of values <= 1

I set STRIDE to inf because:

  • The setting with an infinite stride means that vector merging was disabled. reference
  • If stride is set to infinity, only non-overlapping and syntactically complete pieces of code (e.g., a complete if statement or a complete for statement) are considered for clones. reference

After running Deckard, I noticed that some clusters in post_cluster_* contain code blocks of vastly different lengths. For example, the following cluster pairs one block that spans 1057 lines with another that spans only 110 lines:

000000042	dist:0.0	FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536
000000008	dist:0.0	FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313

Since dist:0.0 suggests they are considered identical clones, I would like to understand why blocks of such different lengths are grouped together.
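
To make the mismatch easier to check across a whole post_cluster_* file, here is a minimal Python sketch that parses lines like the two above and reports the ratio between the largest and smallest line span in a cluster. The field meanings are my assumptions, inferred only from the quoted output: I read LINE:a:b as start line and number of lines, and TBID/TEID as begin and end token IDs. The helper names `parse_entry` and `span_ratio` are hypothetical and not part of Deckard.

```python
import re
from typing import Dict, List

# Assumed layout of one post_cluster_* line, inferred from the two lines
# quoted in this issue; this is not an official Deckard format description.
ENTRY_RE = re.compile(
    r"^(?P<index>\d+)\s+dist:(?P<dist>[0-9.]+)\s+FILE\s+(?P<file>.+?)\s+"
    r"LINE:(?P<start>\d+):(?P<span>\d+)\s+.*?NUM_NODE:(?P<num_node>\d+)\s+"
    r"TBID:(?P<tbid>-?\d+)\s+TEID:(?P<teid>-?\d+)"
)

SAMPLE = [
    "000000042\tdist:0.0\tFILE src/findbugs/src/java/edu/umd/cs/findbugs/"
    "SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 "
    "TBID:517 TEID:536",
    "000000008\tdist:0.0\tFILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/"
    "findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 "
    "NUM_NODE:277 TBID:294 TEID:313",
]


def parse_entry(line: str) -> Dict[str, object]:
    """Parse one cluster line into a dict; raise ValueError if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized cluster line: {line!r}")
    return {
        "file": m.group("file"),
        "dist": float(m.group("dist")),
        "start": int(m.group("start")),
        "span": int(m.group("span")),      # assumed: number of lines covered
        "num_node": int(m.group("num_node")),
        "tbid": int(m.group("tbid")),      # assumed: first token ID
        "teid": int(m.group("teid")),      # assumed: last token ID
    }


def span_ratio(entries: List[Dict[str, object]]) -> float:
    """Ratio of the largest to the smallest line span within one cluster."""
    spans = [int(e["span"]) for e in entries]
    return max(spans) / min(spans)


if __name__ == "__main__":
    cluster = [parse_entry(line) for line in SAMPLE]
    for e in cluster:
        tokens = e["teid"] - e["tbid"] + 1
        print(e["file"], "span:", e["span"], "token IDs covered:", tokens)
    print("span ratio:", round(span_ratio(cluster), 2))
```

If my reading of TBID and TEID is right, both blocks cover the same number of token IDs and report the same NUM_NODE, even though their line spans differ by almost a factor of ten.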

Here are the original files in the above cluster:
files.zip

I also tried STRIDE=2, but the large variation in block lengths within clusters persists.

Questions

  1. How does Deckard determine similarity when block lengths vary significantly?
  2. Could this be due to my configuration, or is it expected behavior?