Sourcery Starbot ⭐ refactored rubby33/CodeBERT #1
base: master
Conversation
Due to GitHub API limits, only the first 60 comments can be shown.
-    s = " %s " % s
+    s = f" {s} "
Function normalize refactored with the following changes:
- Replace interpolated string formatting with f-string (replace-interpolation-with-fstring)
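The two spellings are equivalent, which is easy to check directly. A minimal sketch (normalize_pad is a hypothetical helper, not the repo's normalize):

```python
def normalize_pad(s: str) -> str:
    # Hypothetical helper illustrating the refactor: %-interpolation
    # and the equivalent f-string produce identical output.
    padded_old = " %s " % s   # before
    padded_new = f" {s} "     # after
    assert padded_old == padded_new
    return padded_new
```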
-    result = {}
-    result["testlen"] = len(test)
+    result = {"testlen": len(test)}
     # Calculate effective reference sentence length.
     if eff_ref_len == "shortest":
         result["reflen"] = min(reflens)
-    elif eff_ref_len == "average":
+    if eff_ref_len == "average":
Function cook_test refactored with the following changes:
- Merge dictionary assignment with declaration (merge-dict-assign)
- Simplify conditional into switch-like form [×2] (switch)
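Both patterns can be sketched side by side; cook_lengths below is a hypothetical stand-in for cook_test, not the repo's function:

```python
def cook_lengths(test, reflens, eff_ref_len="shortest"):
    # merge-dict-assign: the first key goes into the dict literal instead
    # of a separate assignment after an empty-dict declaration.
    result = {"testlen": len(test)}
    # switch-like form: each branch is an independent `if` on the same
    # variable rather than an if/elif chain.
    if eff_ref_len == "shortest":
        result["reflen"] = min(reflens)
    if eff_ref_len == "average":
        result["reflen"] = sum(reflens) / len(reflens)
    return result
```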
     predictionMap = {}
     goldMap = {}
     gf = open(goldfile, 'r')

     for row in predictions:
         cols = row.strip().split('\t')
-        if len(cols) == 1:
-            (rid, pred) = (cols[0], '')
-        else:
-            (rid, pred) = (cols[0], cols[1])
+        (rid, pred) = (cols[0], '') if len(cols) == 1 else (cols[0], cols[1])
         predictionMap[rid] = [splitPuncts(pred.strip().lower())]

     for row in gf:
         (rid, pred) = row.split('\t')
         if rid in predictionMap:  # Only insert if the id exists for the method
             if rid not in goldMap:
                 goldMap[rid] = []
             goldMap[rid].append(splitPuncts(pred.strip().lower()))

-    sys.stderr.write('Total: ' + str(len(goldMap)) + '\n')
+    sys.stderr.write(f'Total: {len(goldMap)}' + '\n')
     return (goldMap, predictionMap)
Function computeMaps refactored with the following changes:
- Replace if statement with if expression (assign-if-exp)
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
- Remove unnecessary calls to str() from formatted values in f-strings (remove-str-from-fstring)
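The three patterns combine as below; parse_prediction_row is a hypothetical helper sketching the idea, under the assumption of tab-separated id/prediction rows:

```python
import sys

def parse_prediction_row(row: str):
    cols = row.strip().split('\t')
    # assign-if-exp: one conditional expression replaces the if/else block
    rid, pred = (cols[0], '') if len(cols) == 1 else (cols[0], cols[1])
    # use-fstring-for-concatenation + remove-str-from-fstring: the f-string
    # formats the value directly, so an explicit str() call is unnecessary
    sys.stderr.write(f'Parsed id: {rid}\n')
    return rid, pred
```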
     score = [0] * 5
     num = 0.0

     for key in m1:
         if key in m2:
             bl = bleu(m1[key], m2[key][0])
-            score = [ score[i] + bl[i] for i in range(0, len(bl))]
+            score = [score[i] + bl[i] for i in range(len(bl))]
             num += 1
     return [s * 100.0 / num for s in score]
Function bleuFromMaps refactored with the following changes:
- Replace range(0, x) with range(x) (remove-zero-from-range)
     reference_file = sys.argv[1]
-    predictions = []
-    for row in sys.stdin:
-        predictions.append(row)
+    predictions = list(sys.stdin)
     (goldMap, predictionMap) = computeMaps(predictions, reference_file)
     print (bleuFromMaps(goldMap, predictionMap)[0])
Lines 194-199 refactored with the following changes:
- Convert for loop into list comprehension (list-comprehension)
- Replace identity comprehension with call to collection constructor (identity-comprehension)
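An append loop that copies an iterable element for element is just a call to the collection constructor. A small sketch (read_rows is a hypothetical helper; io.StringIO stands in for sys.stdin):

```python
import io

def read_rows(stream):
    # identity-comprehension: the loop
    #   predictions = []
    #   for row in stream:
    #       predictions.append(row)
    # collapses to a direct list() call over the iterable
    return list(stream)

rows = read_rows(io.StringIO("first\nsecond\n"))
```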
-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, train_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, train_file)}")
Function CodesearchProcessor.get_train_examples refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, dev_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, dev_file)}")
Function CodesearchProcessor.get_dev_examples refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
-        logger.info("LOOKING AT {}".format(os.path.join(data_dir, test_file)))
+        logger.info(f"LOOKING AT {os.path.join(data_dir, test_file)}")
Function CodesearchProcessor.get_test_examples refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
-            guid = "%s-%s" % (set_type, i)
+            guid = f"{set_type}-{i}"
             text_a = line[3]
             text_b = line[4]
-            if (set_type == 'test'):
-                label = self.get_labels()[0]
-            else:
-                label = line[0]
+            label = self.get_labels()[0] if (set_type == 'test') else line[0]
             examples.append(
                 InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-        if (set_type == 'test'):
-            return examples, lines
-        else:
-            return examples
+        return (examples, lines) if (set_type == 'test') else examples
Function CodesearchProcessor._create_examples refactored with the following changes:
- Replace interpolated string formatting with f-string (replace-interpolation-with-fstring)
- Replace if statement with if expression [×2] (assign-if-exp)
-    else:
-        # Account for [CLS] and [SEP] with "- 2"
-        if len(tokens_a) > max_seq_length - 2:
-            tokens_a = tokens_a[:(max_seq_length - 2)]
+    elif len(tokens_a) > max_seq_length - 2:
+        tokens_a = tokens_a[:(max_seq_length - 2)]
Function convert_examples_to_features refactored with the following changes:
- Merge else clause's nested if statement into elif (merge-else-if-into-elif)
- Replace interpolated string formatting with f-string [×4] (replace-interpolation-with-fstring)

This removes the following comments (why?):
# Account for [CLS] and [SEP] with "- 2"
Found the following improvement in Function Model.forward:
         pass
     #obtain dataflow
     if lang=="php":
-        code="<?php"+code+"?>"
+        code = f"<?php{code}?>"
     try:
         tree = parser[0].parse(bytes(code,'utf8'))
         root_node = tree.root_node
         tokens_index=tree_to_token_index(root_node)
         code=code.split('\n')
         code_tokens=[index_to_code_token(x,code) for x in tokens_index]
-        index_to_code={}
-        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
-            index_to_code[index]=(idx,code)
+        index_to_code = {
+            index: (idx, code)
+            for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
+        }
Function extract_dataflow refactored with the following changes:
- Use f-string instead of string concatenation [×2] (use-fstring-for-concatenation)
- Convert for loop into dictionary comprehension (dict-comprehension)
- Convert for loop into list comprehension (list-comprehension)
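The loop-to-comprehension rewrite can be sketched in isolation; build_index is a hypothetical stand-in using plain tuples instead of tree-sitter position spans:

```python
def build_index(tokens_index, code_tokens):
    # dict-comprehension: the empty-dict-plus-loop pattern
    #   index_to_code = {}
    #   for idx, (index, code) in enumerate(zip(...)):
    #       index_to_code[index] = (idx, code)
    # becomes a single expression; enumerate(zip(...)) pairs each
    # span with its ordinal position and token text.
    return {
        index: (idx, code)
        for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
    }
```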
     for url in [url1,url2]:
         if url not in cache:
             func=url_to_code[url]

             #extract data flow
             code_tokens,dfg=extract_dataflow(func,parser,'java')
-            code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)]
-            ori2cur_pos={}
-            ori2cur_pos[-1]=(0,0)
+            code_tokens = [
+                tokenizer.tokenize(f'@ {x}')[1:]
+                if idx != 0
+                else tokenizer.tokenize(x)
+                for idx, x in enumerate(code_tokens)
+            ]
+
+            ori2cur_pos = {-1: (0, 0)}
             for i in range(len(code_tokens)):
                 ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i]))
             code_tokens=[y for x in code_tokens for y in x]
Function convert_examples_to_features refactored with the following changes:
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
- Merge dictionary assignment with declaration (merge-dict-assign)
- Replace unused for index with underscore [×2] (for-index-underscore)
- Convert for loop into dictionary comprehension (dict-comprehension)
Function TextDataset.__init__ refactored with the following changes:
- Replace if statement with if expression (assign-if-exp)
- Replace call to format with f-string [×10] (use-fstring-for-formatting)
-        node_index=sum([i>1 for i in self.examples[item].position_idx_1])
-        max_length=sum([i!=1 for i in self.examples[item].position_idx_1])
+        node_index = sum(i>1 for i in self.examples[item].position_idx_1)
+        max_length = sum(i!=1 for i in self.examples[item].position_idx_1)
Function TextDataset.__getitem__ refactored with the following changes:
- Replace unneeded comprehension with generator [×4] (comprehension-to-generator)
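The generator form avoids materializing an intermediate list before summing; count_nodes is a hypothetical helper over a plain list of position indices:

```python
def count_nodes(position_idx):
    # comprehension-to-generator: sum([i > 1 for i in ...]) builds a
    # throwaway list, while sum(i > 1 for i in ...) feeds sum() lazily
    # with the same result
    return sum(i > 1 for i in position_idx)
```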
         pass
     #obtain dataflow
     if lang=="php":
-        code="<?php"+code+"?>"
+        code = f"<?php{code}?>"
     try:
         tree = parser[0].parse(bytes(code,'utf8'))
         root_node = tree.root_node
         tokens_index=tree_to_token_index(root_node)
         code=code.split('\n')
         code_tokens=[index_to_code_token(x,code) for x in tokens_index]
-        index_to_code={}
-        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
-            index_to_code[index]=(idx,code)
+        index_to_code = {
+            index: (idx, code)
+            for idx, (index, code) in enumerate(zip(tokens_index, code_tokens))
+        }
Function extract_dataflow refactored with the following changes:
- Use f-string instead of string concatenation [×2] (use-fstring-for-concatenation)
- Convert for loop into dictionary comprehension (dict-comprehension)
- Convert for loop into list comprehension (list-comprehension)
-        code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)]
-        ori2cur_pos={}
-        ori2cur_pos[-1]=(0,0)
+        code_tokens = [
+            tokenizer.tokenize(f'@ {x}')[1:] if idx != 0 else tokenizer.tokenize(x)
+            for idx, x in enumerate(code_tokens)
+        ]
+
+        ori2cur_pos = {-1: (0, 0)}
         for i in range(len(code_tokens)):
             ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i]))
         code_tokens=[y for x in code_tokens for y in x]
Function convert_examples_to_features refactored with the following changes:
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
- Merge dictionary assignment with declaration (merge-dict-assign)
- Replace unused for index with underscore [×2] (for-index-underscore)
- Convert for loop into dictionary comprehension (dict-comprehension)
-        cache_file=args.output_dir+'/'+prefix+'.pkl'
+        cache_file = f'{args.output_dir}/{prefix}.pkl'
Function TextDataset.__init__ refactored with the following changes:
- Use f-string instead of string concatenation [×3] (use-fstring-for-concatenation)
- Replace call to format with f-string [×6] (use-fstring-for-formatting)
-        node_index=sum([i>1 for i in self.examples[item].position_idx])
-        max_length=sum([i!=1 for i in self.examples[item].position_idx])
+        node_index = sum(i>1 for i in self.examples[item].position_idx)
+        max_length = sum(i!=1 for i in self.examples[item].position_idx)
Function TextDataset.__getitem__ refactored with the following changes:
- Replace unneeded comprehension with generator [×2] (comprehension-to-generator)
     #get optimizer and scheduler
     optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=1e-8)
     scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader)*args.num_train_epochs)
Function train refactored with the following changes:
- Replace call to format with f-string [×3] (use-fstring-for-formatting)
- Simplify unnecessary nesting, casting and constant values in f-strings (simplify-fstring-formatting)
- Replace f-string with no interpolated values with string (remove-redundant-fstring)
Function evaluate refactored with the following changes:
- Inline variable that is immediately returned (inline-immediately-returned-variable)
- Move assignment closer to its usage within a block [×2] (move-assign-in-block)
- Convert for loop into list comprehension [×2] (list-comprehension)
     parser.add_argument("--lang", default=None, type=str,
                         help="language.")

     parser.add_argument("--model_name_or_path", default=None, type=str,
                         help="The model checkpoint for weights initialization.")
     parser.add_argument("--config_name", default="", type=str,
                         help="Optional pretrained config name or path if not the same as model_name_or_path")
     parser.add_argument("--tokenizer_name", default="", type=str,
                         help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")

     parser.add_argument("--nl_length", default=128, type=int,
                         help="Optional NL input sequence length after tokenization.")
     parser.add_argument("--code_length", default=256, type=int,
                         help="Optional Code input sequence length after tokenization.")
     parser.add_argument("--data_flow_length", default=64, type=int,
                         help="Optional Data Flow input sequence length after tokenization.")

     parser.add_argument("--do_train", action='store_true',
                         help="Whether to run training.")
     parser.add_argument("--do_eval", action='store_true',
                         help="Whether to run eval on the dev set.")
     parser.add_argument("--do_test", action='store_true',
                         help="Whether to run eval on the test set.")
Function main refactored with the following changes:
- Simplify if expression by using or (or-if-exp-identity)
- Move assignment closer to its usage within a block (move-assign-in-block)
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
- Inline variable that is immediately returned (inline-immediately-returned-variable)

This removes the following comments (why?):
# Evaluation
-    do_first_statement=['for_in_clause']
-    def_statement=['default_parameter']
     states=states.copy()
Function DFG_python refactored with the following changes:
- Move assignments closer to their usage (move-assign)
- Hoist repeated code outside conditional statement (hoist-statement-from-if)
- Simplify sequence length comparison [×4] (simplify-len-comparison)
- Replace unused for index with underscore [×2] (for-index-underscore)
@@ -185,7 +184,6 @@ def DFG_java(root_node,index_to_code,states):
     for_statement=['for_statement']
     enhanced_for_statement=['enhanced_for_statement']
     while_statement=['while_statement']
-    do_first_statement=[]
Function DFG_java refactored with the following changes:
- Move assignments closer to their usage (move-assign)
- Hoist repeated code outside conditional statement (hoist-statement-from-if)
- Replace unused for index with underscore [×2] (for-index-underscore)
-    # This series of conditionals removes docstrings:
     elif token_type == tokenize.STRING:
-        if prev_toktype != tokenize.INDENT:
-            # This is likely a docstring; double-check we're not inside an operator:
-            if prev_toktype != tokenize.NEWLINE:
-                if start_col > 0:
-                    out += token_string
+        if (
+            prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
+            and start_col > 0
+        ):
+            out += token_string
Function remove_comments_and_docstrings refactored with the following changes:
- Merge nested if conditions [×2] (merge-nested-ifs)
- Replace if statement with if expression (assign-if-exp)
- Replace multiple comparisons of same variable with in operator (merge-comparisons)

This removes the following comments (why?):
# note: a space and not an empty string
# This is likely a docstring; double-check we're not inside an operator:
# This series of conditionals removes docstrings:
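The collapsed condition can be isolated as a predicate. The sketch below is a hypothetical extraction, assuming only the standard tokenize module's token types:

```python
import tokenize

def keep_string_token(prev_toktype, start_col):
    # merge-nested-ifs + merge-comparisons: three nested ifs collapse into
    # one boolean expression, and the two != checks against the same
    # variable become a single `not in` membership test
    return (
        prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
        and start_col > 0
    )
```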
-    do_first_statement=['for_in_clause']
-    def_statement=['default_parameter']
     states=states.copy()
Function DFG_python refactored with the following changes:
- Move assignments closer to their usage (move-assign)
- Hoist repeated code outside conditional statement (hoist-statement-from-if)
- Simplify sequence length comparison [×4] (simplify-len-comparison)
- Replace unused for index with underscore [×2] (for-index-underscore)
@@ -185,7 +184,6 @@ def DFG_java(root_node,index_to_code,states):
     for_statement=['for_statement']
     enhanced_for_statement=['enhanced_for_statement']
     while_statement=['while_statement']
-    do_first_statement=[]
Function DFG_java refactored with the following changes:
- Move assignments closer to their usage (move-assign)
- Hoist repeated code outside conditional statement (hoist-statement-from-if)
- Replace unused for index with underscore [×2] (for-index-underscore)
-    # This series of conditionals removes docstrings:
     elif token_type == tokenize.STRING:
-        if prev_toktype != tokenize.INDENT:
-            # This is likely a docstring; double-check we're not inside an operator:
-            if prev_toktype != tokenize.NEWLINE:
-                if start_col > 0:
-                    out += token_string
+        if (
+            prev_toktype not in [tokenize.INDENT, tokenize.NEWLINE]
+            and start_col > 0
+        ):
+            out += token_string
Function remove_comments_and_docstrings refactored with the following changes:
- Merge nested if conditions [×2] (merge-nested-ifs)
- Replace if statement with if expression (assign-if-exp)
- Replace multiple comparisons of same variable with in operator (merge-comparisons)

This removes the following comments (why?):
# note: a space and not an empty string
# This is likely a docstring; double-check we're not inside an operator:
# This series of conditionals removes docstrings:
-    else:
-        code_tokens=[]
-        for child in root_node.children:
-            code_tokens+=tree_to_token_index(child)
-        return code_tokens
+    code_tokens=[]
+    for child in root_node.children:
+        code_tokens+=tree_to_token_index(child)
+    return code_tokens
Function tree_to_token_index refactored with the following changes:
- Remove unnecessary else after guard condition (remove-unnecessary-else)
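The guard-clause shape generalizes to any recursive tree walk. The sketch below uses a minimal hypothetical Node class rather than tree-sitter nodes:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: str
    children: list = field(default_factory=list)

def collect_leaves(node):
    # remove-unnecessary-else: once the guard condition returns early,
    # the else wrapper around the recursive case can be dedented away
    if not node.children:
        return [node.value]
    tokens = []
    for child in node.children:
        tokens += collect_leaves(child)
    return tokens
```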
-    for i in range(0, len(segment) - order + 1):
+    for i in range(len(segment) - order + 1):
Function _get_ngrams refactored with the following changes:
- Replace range(0, x) with range(x) (remove-zero-from-range)
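A hypothetical n-gram helper shows why the two spellings are interchangeable:

```python
def get_ngrams(segment, order):
    # remove-zero-from-range: range(0, n) and range(n) iterate the same
    # values, so the explicit 0 start is noise
    return [tuple(segment[i:i + order]) for i in range(len(segment) - order + 1)]
```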
Thanks for starring sourcery-ai/sourcery ✨ 🌟 ✨
Here's your pull request refactoring your most popular Python repo.
If you want Sourcery to refactor all your Python repos and incoming pull requests install our bot.
Review changes via command line
To manually merge these changes, make sure you're on the master branch, then run: