AttributeError: 'FusedFilter' object has no attribute '_name' #495

Open
xunmenglt opened this issue Nov 24, 2024 · 1 comment
@xunmenglt

The configuration file is as follows:

project_name: 'code'
dataset_path: 'processed_starcode.jsonl' # path to your dataset directory or file
export_path: 'dataset.jsonl'

text_keys: 'text'

export_in_parallel: false # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. Notice: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 40 # number of subprocess to process your dataset
text_keys: 'text' # the key name of field where the sample texts to be processed, e.g., text, instruction, output, ...
# Note: currently, we support specify only ONE key for each op, for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of text_keys when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: false # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: /opt/data/private/liuteng/dataset/dj_cache # cache dir for Hugging Face datasets. In default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: /opt/data/private/liuteng/dataset/dj_cache
open_tracer: true # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: [] # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: true # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: zstd # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

save_stats_in_one_file: true # whether to store all stats result into one file

process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:

  - alphanumeric_filter: # 18766
      tokenization: false
      min_ratio: 0.2 # < 3sigma (0.3791)
      max_ratio: 0.9163 # 3sigma
  - alphanumeric_filter: # 146432
      tokenization: true
      min_ratio: 0.546 # 3sigma
      max_ratio: 3.65 # 3sigma
  - average_line_length_filter: # for code
      min_len: 10 # > 3sigma (0) -- 48790
      max_len: 150 # < 3sigma (15603) -- 233275
  - character_repetition_filter:
      max_ratio: 0.36 # 3sigma -- 346875
  - maximum_line_length_filter: # for code
      max_len: 1000 # remove 256670 samples
  - text_length_filter:
      max_len: 96714 # 3sigma -- 190006
  - words_num_filter:
      min_num: 20 # remove 1504958 samples
      max_num: 6640 # 3sigma -- remove 179847 samples
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357 # 3sigma -- 598462

  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4

Error:

AttributeError: 'FusedFilter' object has no attribute '_name'
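
For readers unfamiliar with this class of error, the sketch below shows one common way it can arise in Python: a subclass defines its own `__init__` without calling the base initializer, so an attribute the base class normally sets is never created. The classes `BaseOp` and `FusedOp` and the helper `op_label` are hypothetical names made up for illustration; this is not Data-Juicer's actual code, and the thread does not confirm that this is the cause fixed in PR #464.

```python
class BaseOp:
    """Hypothetical base operator (illustration only, not Data-Juicer code)."""

    def __init__(self, name):
        # The base initializer is the only place where _name is set.
        self._name = name


class FusedOp(BaseOp):
    """Hypothetical wrapper that groups several operators."""

    def __init__(self, ops):
        # Assumption for this sketch: super().__init__() is not called,
        # so instances of FusedOp never get a _name attribute.
        self.ops = ops


def op_label(op):
    # Defensive lookup: fall back to the class name when _name is missing.
    return getattr(op, "_name", type(op).__name__)


fused = FusedOp([BaseOp("a_filter"), BaseOp("b_filter")])

try:
    fused._name  # direct access fails because _name was never set
except AttributeError as exc:
    message = str(exc)

print(message)
print(op_label(fused))
```

Code that reads `_name` directly (for example, a tracing or logging step) would fail on such an object, while a `getattr` lookup with a default would not.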


@HYLcool
Collaborator

HYLcool commented Nov 25, 2024

Hi @xunmenglt, thanks for your report!

Sorry for this problem. We found this issue earlier, and it has already been fixed in PR #464, which is now merged into the main branch. Please pull the latest code from the main branch and try again.

@HYLcool HYLcool self-assigned this Nov 25, 2024
@HYLcool HYLcool added bug Something isn't working dj:op issues/PRs about some specific OPs labels Nov 25, 2024