Sourcery Starbot ⭐ refactored rubby33/CodeBERT #1

Open
wants to merge 1 commit into rubby33:master from sourcery-ai-bot:master

Conversation

SourceryAI

Thanks for starring sourcery-ai/sourcery ✨ 🌟 ✨

Here's your pull request refactoring your most popular Python repo.

If you want Sourcery to refactor all your Python repos and incoming pull requests, install our bot.

Review changes via command line

To manually merge these changes, make sure you're on the master branch, then run:

git fetch https://github.com/sourcery-ai-bot/CodeBERT master
git merge --ff-only FETCH_HEAD
git reset HEAD^

@SourceryAI left a comment

Due to GitHub API limits, only the first 60 comments can be shown.

-    s = " %s " % s
+    s = f" {s} "

Function normalize refactored with the following changes:
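For reference, a minimal standalone sketch of the `%`-interpolation to f-string pattern (toy values, not from the repo):

    s = "token"
    old = " %s " % s   # printf-style interpolation
    new = f" {s} "     # f-string: same result, evaluated inline
    assert old == new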

Comment on lines -91 to +94
-    result = {}
-    result["testlen"] = len(test)
+    result = {"testlen": len(test)}

     # Calculate effective reference sentence length.
     if eff_ref_len == "shortest":
         result["reflen"] = min(reflens)
-    elif eff_ref_len == "average":
+    if eff_ref_len == "average":

Function cook_test refactored with the following changes:

  • Merge dictionary assignment with declaration (merge-dict-assign)
  • Simplify conditional into switch-like form [×2] (switch)
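For illustration, a minimal sketch of both patterns together (hypothetical function and key names, not the PR's code):

    def summarize(test, reflens, eff_ref_len):
        # merge-dict-assign: seed the dict at declaration instead of
        # creating an empty dict and assigning the first key afterwards
        result = {"testlen": len(test)}
        # switch-like form: independent if-tests on the same variable
        if eff_ref_len == "shortest":
            result["reflen"] = min(reflens)
        if eff_ref_len == "average":
            result["reflen"] = sum(reflens) / len(reflens)
        return result

    assert summarize("abc", [2, 4], "average")["reflen"] == 3.0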

Comment on lines -157 to +172
     predictionMap = {}
     goldMap = {}
     gf = open(goldfile, 'r')

     for row in predictions:
         cols = row.strip().split('\t')
-        if len(cols) == 1:
-            (rid, pred) = (cols[0], '')
-        else:
-            (rid, pred) = (cols[0], cols[1])
+        (rid, pred) = (cols[0], '') if len(cols) == 1 else (cols[0], cols[1])
         predictionMap[rid] = [splitPuncts(pred.strip().lower())]

     for row in gf:
         (rid, pred) = row.split('\t')
         if rid in predictionMap:  # Only insert if the id exists for the method
             if rid not in goldMap:
                 goldMap[rid] = []
             goldMap[rid].append(splitPuncts(pred.strip().lower()))

-    sys.stderr.write('Total: ' + str(len(goldMap)) + '\n')
+    sys.stderr.write(f'Total: {len(goldMap)}' + '\n')
     return (goldMap, predictionMap)

Function computeMaps refactored with the following changes:
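The key change is the conditional expression; a standalone sketch with toy data (not from the repo):

    cols = "id42\tpredicted text".split('\t')
    # One conditional expression replaces the four-line if/else
    (rid, pred) = (cols[0], '') if len(cols) == 1 else (cols[0], cols[1])
    assert (rid, pred) == ("id42", "predicted text")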

Comment on lines -183 to +186
     score = [0] * 5
     num = 0.0

     for key in m1:
         if key in m2:
             bl = bleu(m1[key], m2[key][0])
-            score = [ score[i] + bl[i] for i in range(0, len(bl))]
+            score = [score[i] + bl[i] for i in range(len(bl))]
             num += 1
     return [s * 100.0 / num for s in score]

Function bleuFromMaps refactored with the following changes:

Comment on lines -194 to +192
     reference_file = sys.argv[1]
-    predictions = []
-    for row in sys.stdin:
-        predictions.append(row)
+    predictions = list(sys.stdin)
     (goldMap, predictionMap) = computeMaps(predictions, reference_file)
     print (bleuFromMaps(goldMap, predictionMap)[0])

Lines 194-199 refactored with the following changes:
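`list(sys.stdin)` materializes the line iterator directly; a sketch of the equivalence, using an in-memory stream as a stand-in for `sys.stdin`:

    import io

    stream = io.StringIO("line one\nline two\n")  # stand-in for sys.stdin
    predictions = list(stream)                    # same result as the removed append loop
    assert predictions == ["line one\n", "line two\n"]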

-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, train_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, train_file)}")

Function CodesearchProcessor.get_train_examples refactored with the following changes:

-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, dev_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, dev_file)}")

Function CodesearchProcessor.get_dev_examples refactored with the following changes:

-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, test_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, test_file)}")

Function CodesearchProcessor.get_test_examples refactored with the following changes:

Comment on lines -119 to +125
-            guid = "%s-%s" % (set_type, i)
+            guid = f"{set_type}-{i}"
             text_a = line[3]
             text_b = line[4]
-            if (set_type == 'test'):
-                label = self.get_labels()[0]
-            else:
-                label = line[0]
+            label = self.get_labels()[0] if (set_type == 'test') else line[0]
             examples.append(
                 InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-        if (set_type == 'test'):
-            return examples, lines
-        else:
-            return examples
+        return (examples, lines) if (set_type == 'test') else examples

Function CodesearchProcessor._create_examples refactored with the following changes:

Comment on lines -164 to +159
-    else:
-        # Account for [CLS] and [SEP] with "- 2"
-        if len(tokens_a) > max_seq_length - 2:
-            tokens_a = tokens_a[:(max_seq_length - 2)]
+    elif len(tokens_a) > max_seq_length - 2:
+        tokens_a = tokens_a[:(max_seq_length - 2)]

Function convert_examples_to_features refactored with the following changes:

This removes the following comments (why?):

# Account for [CLS] and [SEP] with "- 2"
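The dropped comment documented why 2 is subtracted: it reserves room for the [CLS] and [SEP] special tokens. A hedged sketch of the else-plus-nested-if to `elif` merge, with that rationale restored as a comment (hypothetical helper, not the repo's exact function):

    def truncate_tokens(tokens_a, tokens_b, max_seq_length):
        if tokens_b:
            tokens_b = tokens_b[:max_seq_length // 2]
        # Reserve 2 positions for the [CLS] and [SEP] special tokens
        elif len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[:max_seq_length - 2]
        return tokens_a, tokens_b

    assert truncate_tokens(list("abcdefgh"), [], 6) == (list("abcd"), [])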



Found the following improvement in Function Model.forward:

Comment on lines -77 to +91
         pass
     #obtain dataflow
     if lang=="php":
-        code="<?php"+code+"?>"
+        code = f"<?php{code}?>"
     try:
         tree = parser[0].parse(bytes(code,'utf8'))
         root_node = tree.root_node
         tokens_index=tree_to_token_index(root_node)
         code=code.split('\n')
         code_tokens=[index_to_code_token(x,code) for x in tokens_index]
-        index_to_code={}
-        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
-            index_to_code[index]=(idx,code)
+        index_to_code = {
+            index: (idx, code)
+            for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
+        }


Function extract_dataflow refactored with the following changes:
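A standalone sketch of the dict-comprehension pattern, mapping each token span to its position and text (toy spans, not real tree-sitter output):

    tokens_index = [(0, 4), (5, 8)]   # toy byte spans
    code_tokens = ["def", "foo"]      # toy token texts
    # One expression replaces an empty dict plus an assignment loop
    index_to_code = {
        index: (idx, code)
        for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
    }
    assert index_to_code[(0, 4)] == (0, "def")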

Comment on lines -152 to +169

     for url in [url1,url2]:
         if url not in cache:
             func=url_to_code[url]

             #extract data flow
             code_tokens,dfg=extract_dataflow(func,parser,'java')
-            code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)]
-            ori2cur_pos={}
-            ori2cur_pos[-1]=(0,0)
+            code_tokens = [
+                tokenizer.tokenize(f'@ {x}')[1:]
+                if idx != 0
+                else tokenizer.tokenize(x)
+                for idx, x in enumerate(code_tokens)
+            ]
+
+            ori2cur_pos = {-1: (0, 0)}
             for i in range(len(code_tokens)):
                 ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i]))
             code_tokens=[y for x in code_tokens for y in x]


Function convert_examples_to_features refactored with the following changes:

Comment on lines -203 to +205


Function TextDataset.__init__ refactored with the following changes:

-        node_index=sum([i>1 for i in self.examples[item].position_idx_1])
-        max_length=sum([i!=1 for i in self.examples[item].position_idx_1])
+        node_index = sum(i>1 for i in self.examples[item].position_idx_1)
+        max_length = sum(i!=1 for i in self.examples[item].position_idx_1)

Function TextDataset.__getitem__ refactored with the following changes:
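Passing a generator expression to `sum` counts matches without building an intermediate list (booleans count as 0/1); a sketch with toy positions:

    position_idx = [0, 1, 2, 5, 7, 1, 1]            # toy position ids
    node_index = sum(i > 1 for i in position_idx)   # 3: values 2, 5, 7
    max_length = sum(i != 1 for i in position_idx)  # 4: values 0, 2, 5, 7
    assert (node_index, max_length) == (3, 4)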

Comment on lines -73 to +87
         pass
     #obtain dataflow
     if lang=="php":
-        code="<?php"+code+"?>"
+        code = f"<?php{code}?>"
     try:
         tree = parser[0].parse(bytes(code,'utf8'))
         root_node = tree.root_node
         tokens_index=tree_to_token_index(root_node)
         code=code.split('\n')
         code_tokens=[index_to_code_token(x,code) for x in tokens_index]
-        index_to_code={}
-        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
-            index_to_code[index]=(idx,code)
+        index_to_code = {
+            index: (idx, code)
+            for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
+        }


Function extract_dataflow refactored with the following changes:

Comment on lines -135 to +142
-            code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)]
-            ori2cur_pos={}
-            ori2cur_pos[-1]=(0,0)
+            code_tokens = [
+                tokenizer.tokenize(f'@ {x}')[1:] if idx != 0 else tokenizer.tokenize(x)
+                for idx, x in enumerate(code_tokens)
+            ]
+
+            ori2cur_pos = {-1: (0, 0)}
             for i in range(len(code_tokens)):
                 ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i]))
             code_tokens=[y for x in code_tokens for y in x]

Function convert_examples_to_features refactored with the following changes:

-        cache_file=args.output_dir+'/'+prefix+'.pkl'
+        cache_file = f'{args.output_dir}/{prefix}.pkl'

Function TextDataset.__init__ refactored with the following changes:

Comment on lines -211 to +212
-        node_index=sum([i>1 for i in self.examples[item].position_idx])
-        max_length=sum([i!=1 for i in self.examples[item].position_idx])
+        node_index = sum(i>1 for i in self.examples[item].position_idx)
+        max_length = sum(i!=1 for i in self.examples[item].position_idx)

Function TextDataset.__getitem__ refactored with the following changes:

Comment on lines -251 to +255

#get optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,num_training_steps=len(train_dataloader)*args.num_train_epochs)


Function train refactored with the following changes:



Function evaluate refactored with the following changes:

Comment on lines -414 to +428

    parser.add_argument("--lang", default=None, type=str,
                        help="language.")

    parser.add_argument("--model_name_or_path", default=None, type=str,
                        help="The model checkpoint for weights initialization.")
    parser.add_argument("--config_name", default="", type=str,
                        help="Optional pretrained config name or path if not the same as model_name_or_path")
    parser.add_argument("--tokenizer_name", default="", type=str,
                        help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")

    parser.add_argument("--nl_length", default=128, type=int,
                        help="Optional NL input sequence length after tokenization.")
    parser.add_argument("--code_length", default=256, type=int,
                        help="Optional Code input sequence length after tokenization.")
    parser.add_argument("--data_flow_length", default=64, type=int,
                        help="Optional Data Flow input sequence length after tokenization.")

    parser.add_argument("--do_train", action='store_true',
                        help="Whether to run training.")
    parser.add_argument("--do_eval", action='store_true',
                        help="Whether to run eval on the dev set.")
    parser.add_argument("--do_test", action='store_true',
                        help="Whether to run eval on the test set.")


Function main refactored with the following changes:

This removes the following comments (why?):

# Evaluation

Comment on lines -16 to +17
     do_first_statement=['for_in_clause']
     def_statement=['default_parameter']
     states=states.copy()

Function DFG_python refactored with the following changes:

@@ -185,7 +184,6 @@ def DFG_java(root_node,index_to_code,states):
     for_statement=['for_statement']
     enhanced_for_statement=['enhanced_for_statement']
     while_statement=['while_statement']
-    do_first_statement=[]

Function DFG_java refactored with the following changes:

Comment on lines -27 to +32
# This series of conditionals removes docstrings:
elif token_type == tokenize.STRING:
if prev_toktype != tokenize.INDENT:
# This is likely a docstring; double-check we're not inside an operator:
if prev_toktype != tokenize.NEWLINE:
if start_col > 0:
out += token_string
if (
prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
and start_col > 0
):
out += token_string

Function remove_comments_and_docstrings refactored with the following changes:

This removes the following comments (why?):

# note: a space and not an empty string
# This is likely a docstring; double-check we're not inside an operator:
# This series of conditionals removes docstrings:
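The three nested tests guard a single statement, so they collapse into one boolean condition with the same truth table; a standalone sketch (toy inputs, not the tokenizer loop itself):

    import tokenize

    def keep_string(prev_toktype, start_col):
        # Equivalent to: prev != INDENT and prev != NEWLINE and start_col > 0
        return (
            prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
            and start_col > 0
        )

    assert keep_string(tokenize.NAME, 4) is True
    assert keep_string(tokenize.INDENT, 4) is False
    assert keep_string(tokenize.NAME, 0) is False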

Comment on lines -16 to +17
     do_first_statement=['for_in_clause']
     def_statement=['default_parameter']
     states=states.copy()

Function DFG_python refactored with the following changes:

@@ -185,7 +184,6 @@ def DFG_java(root_node,index_to_code,states):
     for_statement=['for_statement']
     enhanced_for_statement=['enhanced_for_statement']
     while_statement=['while_statement']
-    do_first_statement=[]

Function DFG_java refactored with the following changes:

Comment on lines -27 to +32
# This series of conditionals removes docstrings:
elif token_type == tokenize.STRING:
if prev_toktype != tokenize.INDENT:
# This is likely a docstring; double-check we're not inside an operator:
if prev_toktype != tokenize.NEWLINE:
if start_col > 0:
out += token_string
if (
prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
and start_col > 0
):
out += token_string

Function remove_comments_and_docstrings refactored with the following changes:

This removes the following comments (why?):

# note: a space and not an empty string
# This is likely a docstring; double-check we're not inside an operator:
# This series of conditionals removes docstrings:

-    else:
-        code_tokens=[]
-        for child in root_node.children:
-            code_tokens+=tree_to_token_index(child)
-        return code_tokens
+    code_tokens=[]
+    for child in root_node.children:
+        code_tokens+=tree_to_token_index(child)
+    return code_tokens

Function tree_to_token_index refactored with the following changes:
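When the `if` branch returns, the trailing `else` is redundant and its body can be dedented; a sketch on a toy recursive node type (hypothetical `Node`, standing in for tree-sitter nodes):

    class Node:
        def __init__(self, children=None, leaf=None):
            self.children = children or []
            self.leaf = leaf

    def leaves(node):
        if node.leaf is not None:
            return [node.leaf]
        # No else needed: the early return above already handled leaves
        out = []
        for child in node.children:
            out += leaves(child)
        return out

    tree = Node(children=[Node(leaf="a"), Node(children=[Node(leaf="b")])])
    assert leaves(tree) == ["a", "b"]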

-    for i in range(0, len(segment) - order + 1):
+    for i in range(len(segment) - order + 1):

Function _get_ngrams refactored with the following changes:
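`range` starts at 0 by default, so the explicit start argument is noise; a hedged sketch of n-gram extraction in the spirit of `_get_ngrams` (simplified: returns a plain list, not whatever the original collects):

    def ngrams(segment, order):
        # range(len(segment) - order + 1) == range(0, len(segment) - order + 1)
        return [tuple(segment[i:i + order])
                for i in range(len(segment) - order + 1)]

    assert ngrams(["a", "b", "c"], 2) == [("a", "b"), ("b", "c")]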
