Datasets Profiler

This project profiles a dataset. It has two use cases: the first produces a profile of a dataset, and the second prepares the dataset to serve as input for the Testbed project.

The project includes a set of integration tests that check whether a modification introduces functional changes.

Before proceeding, make sure the parameters and inputs are in place.

The parameters are files that contain a list of configurations. Each configuration defines the behaviour of an application. When the Datasets Profiler is executed, it goes through the list of configurations and executes an Application for each one.
The parameters file looks as follows:

[
  {
    "use_case": "get_description",
    "input_path": "input/Ad_click_on_taobao/Ad_click_on_taobao.csv",
    "output_path": "output/datasets/Ad_click_on_taobao_sample",
    "parser": "ad_click_on_taobao_log_parser_strategy"
  },
  {
    "use_case": "get_described_dataset",
    "input_path": "input/Ad_click_on_taobao/Ad_click_on_taobao.csv",
    "output_directory": "output/described_datasets/Ad_click_on_taobao_10000",
    "parser": "ad_click_on_taobao_log_parser_strategy",
    "limit": 10000
  },
  {
    "use_case": "get_description",
    "input_path": "input/Android/Android.log",
    "output_path": "output/samples_1000/Android_sample_1000",
    "parser": "android_log_parser_strategy",
    "specific_formatters": [
      "no_year_datetime_specific_formatter",
      "string_specific_formatter",
      "string_specific_formatter",
      "string_specific_formatter",
      "string_specific_formatter",
      "string_specific_formatter"
    ],
    "limit": 1000
  }
]
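
The "one Application per configuration" behaviour described above can be pictured with a small Python sketch. It is only an illustration under assumptions: the two handler functions and the `APPLICATIONS` table are hypothetical stand-ins and are not taken from the project's code; only the `use_case` and `input_path` keys and their values come from the example parameters file.

```python
import json


# Hypothetical stand-ins for the project's applications. They only return
# a label so that the dispatch logic can be followed in isolation.
def get_description(configuration):
    return f"get_description on {configuration['input_path']}"


def get_described_dataset(configuration):
    return f"get_described_dataset on {configuration['input_path']}"


# Maps each "use_case" value seen in the example file to a handler.
APPLICATIONS = {
    "get_description": get_description,
    "get_described_dataset": get_described_dataset,
}


def run_all(parameters_text):
    """Run one application per configuration, in the order they are listed."""
    results = []
    for configuration in json.loads(parameters_text):
        application = APPLICATIONS[configuration["use_case"]]
        results.append(application(configuration))
    return results
```

A configuration whose `use_case` is not in the table raises a `KeyError` in this sketch; how the Datasets Profiler itself handles that case is not stated here.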

The inputs are the raw dataset files. Their paths are defined in the parameters file under the key input_path.
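
To check that the inputs are present before a local run, a short helper along these lines may be useful. It is a minimal sketch, not part of the project: the function name `missing_inputs` and its `root` argument are assumptions, and it only looks at the local filesystem, so it does not apply to inputs stored on HDFS.

```python
import json
from pathlib import Path


def missing_inputs(parameters_file, root="."):
    """Return the input_path values that do not exist under `root`.

    `root` is assumed to be the directory the relative input paths are
    resolved against, for example the root of the project.
    """
    with open(parameters_file, encoding="utf-8") as handle:
        configurations = json.load(handle)

    missing = []
    for configuration in configurations:
        input_path = configuration["input_path"]
        # Report each missing path once, even if several configurations use it.
        if not (Path(root) / input_path).exists() and input_path not in missing:
            missing.append(input_path)
    return missing
```

For instance, calling `missing_inputs("parameters/parameters_integration_test.json")` from the root of the project returns an empty list when every input referenced by that file exists.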

To run the Datasets Profiler with the integration tests, execute the following command from the root of the project, DatasetsProfiler/:

$ bash scripts/check_tests.sh

You also have the option to run the Datasets Profiler either locally or in a cluster, using Yarn.

  • If you want to run the Datasets Profiler locally, execute the following script:

    $ bash scripts/run-local.sh PATH_TO_PARAMETERS_FILE
    

    Where PATH_TO_PARAMETERS_FILE is the path to the parameters file: an absolute path if the file is on HDFS, or possibly a relative path if it is on the local filesystem.

  • If you want to run the Datasets Profiler in a cluster, execute the following script:

    $ bash scripts/run-cluster.sh PATH_TO_PARAMETERS_FILE
    

    Where PATH_TO_PARAMETERS_FILE is the path to the parameters file.

An example of each execution type of the Datasets Profiler is shown below:

$ bash scripts/run-local.sh parameters/parameters_integration_test.json
$ bash scripts/run-cluster.sh /user/bochileanu/datasets_profiler_parameters/parameters_ad_click_on_taobao_log_parser_strategy.json

For more information, see the dissertation associated with this project.
