AttributeError: 'FusedFilter' object has no attribute '_name' #495

xunmenglt · 2024-11-24T02:22:34Z

The config file is as follows:

project_name: 'code'
dataset_path: 'processed_starcode.jsonl' # path to your dataset directory or file
export_path: 'dataset.jsonl'

text_keys: 'text'

export_in_parallel: false # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. Notice: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 40 # number of subprocesses used to process your dataset
text_keys: 'text' # the key name of field where the sample texts to be processed, e.g., text, instruction, output, ...
# Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of text_keys when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: false # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: /opt/data/private/liuteng/dataset/dj_cache # cache dir for Hugging Face datasets. In default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: /opt/data/private/liuteng/dataset/dj_cache
open_tracer: true # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: [] # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: true # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: zstd # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

save_stats_in_one_file: true # whether to store all stats result into one file

process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - alphanumeric_filter: # 18766
      tokenization: false
      min_ratio: 0.2 # < 3sigma (0.3791)
      max_ratio: 0.9163 # 3sigma
  - alphanumeric_filter: # 146432
      tokenization: true
      min_ratio: 0.546 # 3sigma
      max_ratio: 3.65 # 3sigma
  - average_line_length_filter: # for code
      min_len: 10 # > 3sigma (0) -- 48790
      max_len: 150 # < 3sigma (15603) -- 233275
  - character_repetition_filter:
      max_ratio: 0.36 # 3sigma -- 346875
  - maximum_line_length_filter: # for code
      max_len: 1000 # remove 256670 samples
  - text_length_filter:
      max_len: 96714 # 3sigma -- 190006
  - words_num_filter:
      min_num: 20 # remove 1504958 samples
      max_num: 6640 # 3sigma -- remove 179847 samples
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357 # 3sigma -- 598462
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4

Error:

AttributeError: 'FusedFilter' object has no attribute '_name'

HYLcool · 2024-11-25T02:34:23Z

Hi @xunmenglt , thanks for your report!

Sorry for the trouble. We found this issue earlier and have already fixed it in PR #464, which is now merged into the main branch. Please pull the latest code from the main branch and try again.
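
If updating is not immediately possible, one possible stopgap is to turn off operator fusion in the config. This is an assumption drawn from the op_fusion comment in the config above and from the class name in the error, not something confirmed in this thread: with fusion disabled, the filters should run individually, so no FusedFilter object would be built.

```yaml
# Assumed workaround, not confirmed by the maintainers in this thread:
# disable automatic operator fusion so filters run one by one.
op_fusion: false
```

Everything else in the config can stay unchanged. Once the fix from the main branch is installed, op_fusion can be set back to true.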

HYLcool self-assigned this Nov 25, 2024

HYLcool added the bug and dj:op labels Nov 25, 2024

github-project-automation bot added this to data-juicer Nov 25, 2024

github-project-automation bot moved this to Todo in data-juicer Nov 25, 2024
