
[Bug]: KeyError: 'resource' #440

Open
luckystar1992 opened this issue Sep 29, 2024 · 3 comments

@luckystar1992

Before Reporting

  • I have pulled the latest code of the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

macOS 15.0

Installation Method

source

Data-Juicer Version

0.2.0

Python Version

3.10

Describe the bug

With the latest code on the main branch, running operators produces the following error:

Traceback (most recent call last):
  File "/Users/zyc/code/data-juicer/data_juicer/core/data.py", line 199, in process
    dataset, resource_util_per_op = Monitor.monitor_func(
                                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zyc/code/data-juicer/data_juicer/core/monitor.py", line 210, in monitor_func
    resource_util_dict['resource'] = mdict['resource']
                                     ~~~~~^^^^^^^^^^^^
  File "<string>", line 2, in __getitem__
  File "/Users/zyc/miniconda3/lib/python3.11/multiprocessing/managers.py", line 837, in _callmethod

    raise convert_to_error(kind, result)
KeyError: 'resource'

This error did not occur with earlier versions of the code. I tried several operators and all of them fail the same way.
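
For context, here is a minimal sketch of the failure pattern the traceback points at. This is illustrative only, not Data-Juicer's actual Monitor code: monitor_func reads mdict['resource'] from a multiprocessing.Manager dict that a monitoring subprocess is expected to fill in, so if that subprocess never writes the key (for example, it exits early or fails to start under macOS's default "spawn" start method), the parent's lookup raises the same proxied KeyError: 'resource'.

```python
# Illustrative sketch of the failure mode (NOT Data-Juicer's actual code):
# a Manager dict shared with a subprocess that is supposed to write 'resource'.
import multiprocessing as mp


def monitor(mdict):
    # Imagine this crashing or returning before the write happens,
    # so mdict['resource'] is never set.
    return


if __name__ == '__main__':
    with mp.Manager() as manager:
        mdict = manager.dict()
        p = mp.Process(target=monitor, args=(mdict,))
        p.start()
        p.join()
        resource_util_dict = {}
        # Raises KeyError: 'resource', re-raised through the manager proxy
        # exactly as in the traceback above.
        resource_util_dict['resource'] = mdict['resource']
```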

To Reproduce

python process_data.py --config ../configs/demo/process_demo.yaml

The config file is as follows:

```yaml
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'
np: 1  # number of subprocesses to process your dataset
text_keys: ["messages"]
export_path: '/Users/zyc/code/data-juicer/outputs/demo-process/demo-processed_chatml.jsonl'
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null  # cache dir for Hugging Face datasets. By default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path, it will override the default cache dir
use_checkpoint: false  # whether to use checkpoint management to save the latest version of the dataset to the work dir when processing. Rerunning the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: null  # the path to the temp directory for intermediate caches when cache is disabled; these cache files will be removed on-the-fly. By default it's None, so the temp dir is chosen by the system. NOTICE: be cautious when setting this argument, because an unsafe directory might cause unexpected program behaviors.
open_tracer: true  # whether to open the tracer to trace the changes during processing. It might take more time when the tracer is opened
op_list_to_trace: []  # only ops in this list will be traced by the tracer. If it's empty, all ops will be traced. Only available when the tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when the tracer is opened.
op_fusion: false  # whether to automatically fuse operators that share the same intermediate variables. Op fusion might reduce memory requirements slightly and speed up the whole process.
cache_compress: null  # the compression method of the cache file, which can be one of ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend turning this on when your input dataset is larger than tens of GB and disk space is limited.

# for distributed processing
executor_type: default  # type of executor, supports "default" or "ray" for now.
ray_address: auto  # the address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false  # whether to store all stats results in one file

# process schedule: a list of several process operators with their arguments
process:
  # Mapper ops. Most of these ops need no arguments.
  - generate_instruction_mapper:  # generate new instruction samples based on seed instruction data
      hf_model: '/Users/zyc/data/models/qwen/Qwen2-1___5B-Instruct'  # model name on HuggingFace used to generate instructions.
      seed_file: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'  # seed file with instruction samples used to generate new instructions, in chatml format.
      instruct_num: 3  # the number of generated samples.
      similarity_threshold: 0.7  # the similarity score threshold between the generated samples and the seed samples. Range from 0 to 1. Samples with a similarity score less than this threshold will be kept.
      prompt_template: null  # prompt template for generating samples. Please make sure the template contains "{augmented_data}", which corresponds to the augmented samples.
      qa_pair_template: null  # prompt template for the generated question-and-answer pair description. Please make sure the template contains two "{}" to format question and answer. Default: '【问题】\n{}\n【回答】\n{}\n'.
      example_template: null  # prompt template for generated examples. Please make sure the template contains "{qa_pairs}", which corresponds to the question-and-answer pair description generated by param qa_pair_template.
      qa_extraction_pattern: null  # regular expression pattern for parsing the question and answer from the model response.
      enable_vllm: false  # whether to use vllm for inference acceleration.
      tensor_parallel_size: null  # only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.
      max_model_len: null  # only valid when enable_vllm is True. Model context length. If unspecified, it will be automatically derived from the model config.
      max_num_seqs: 256  # only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.
      sampling_params: { "max_length": 1024 }
```

Configs

No response

Logs

No response

Screenshots

Uploading 截屏2024-09-29 下午5.22.36.png…

Additional

No response

luckystar1992 added the bug label on Sep 29, 2024
HYLcool (Collaborator) commented Oct 17, 2024

@luckystar1992, thanks for using Data-Juicer and for the feedback!

We were not able to reproduce the problem on our side. Please pull the latest code and try again; if you still run into a similar issue, feel free to continue the discussion with us.

SnoopyXI commented

@luckystar1992 Hi, have you solved this problem? I'm running into the same issue!

HYLcool mentioned this issue on Dec 6, 2024
HYLcool (Collaborator) commented Dec 6, 2024

Hi @luckystar1992 @SnoopyXI, thanks for using Data-Juicer!

This error comes from Data-Juicer's resource monitor (Monitor), but we still could not reproduce it in our tests.

However, the latest PR #483 adds a switch for the Monitor, open_monitor. If you still hit this problem after pulling the latest code, you can set open_monitor: false in your config file to turn the Monitor off; this does not affect the normal data-processing pipeline.
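
For reference, a minimal sketch of where such a switch might go, assuming open_monitor is a top-level flag alongside the other global parameters (please check PR #483 for the exact name and placement):

```yaml
# Hypothetical placement (verify against PR #483): disable the resource Monitor
# from the global parameters of the process config.
project_name: 'demo-process'
np: 1
open_monitor: false  # skip resource monitoring; data processing itself is unaffected
```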

If you have any other questions, feel free to discuss them with us anytime!
