This documentation outlines the process of data extraction, cleaning, transformation, and analysis of a CSV file using a configuration file in JSON format.
Below is an example of a JSON configuration file structure:
clean_data.json
{
"data_file": "path/to/your/input.csv",
"columns": [
"Nest ID",
"Nestling ID",
"Nestling",
"Sex",
"Treatment"
],
"transformations": {
"Nest ID": "strip",
"Date": "to_datetime"
},
"fill_na": {
"numeric": 0,
"string": ""
},
"rename_columns": {
"Cluster6": "Babbles",
"Bout ID (sans subtype)": "Bout ID",
"Bout no.": "Bout Number",
"No. eggs hatched from nest": "Number eggs hatched from nest",
"No. birds fledged from nest": "Number birds fledged from nest"
},
"data_types": {
"Nest ID": "str",
"Nestling ID": "str"
}
}
- data_file: Path to the raw CSV file to be processed.
- columns: List of columns to be included in the dataset.
- transformations: Dictionary specifying any transformations to be applied to the data.
- fill_na: Dictionary of fill values for missing data, keyed by value type (numeric or string in the example above).
- rename_columns: Dictionary mapping original column names to new names.
- data_types: Dictionary specifying data types for columns, if applicable.
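As a minimal sketch of how such a file might be read and sanity-checked before use, the snippet below parses the JSON text and verifies that the keys described above have the expected container types. The function name `load_config` and the choice to treat only `data_file` as mandatory are assumptions made for illustration, not the script's confirmed behaviour.

```python
import json

# Expected container type for each configuration key described above.
# Treating "data_file" as the only mandatory key is an assumption.
EXPECTED_TYPES = {
    "data_file": str,
    "columns": list,
    "transformations": dict,
    "fill_na": dict,
    "rename_columns": dict,
    "data_types": dict,
}


def load_config(text: str) -> dict:
    """Parse JSON configuration text and check the shape of known keys."""
    config = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    if "data_file" not in config:
        raise ValueError("missing required key: data_file")
    for key, expected in EXPECTED_TYPES.items():
        if key in config and not isinstance(config[key], expected):
            raise ValueError(f"{key} must be a {expected.__name__}")
    return config


# Hypothetical, shortened configuration used only to exercise the function.
example = """
{
  "data_file": "path/to/your/input.csv",
  "columns": ["Nest ID", "Sex"],
  "rename_columns": {"Cluster6": "Babbles"}
}
"""
config = load_config(example)
print(config["columns"])
```

Because `json.loads` follows strict JSON, a trailing comma after the last element of a list or object raises `json.JSONDecodeError`, so the configuration file has to avoid them.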
The following command-line arguments can be used to run the script:
Argument | Type | Required | Default | Description |
---|---|---|---|---|
-i, --input | String | Yes | N/A | Path to the input JSON file for data cleaning and transformation steps. |
-m, --minlength | Int | No | 2 | Minimum length for sequences used in pair analysis. |
-k, --kmeans | Int | No | 6 | Number of clusters for k-means clustering. |
-a, --analysis | String | No | N/A | Type of frequency analysis to perform (choices: singles, pairs, triples, quads, quints, all). |
-d, --dump | Flag | No | False | Flag to indicate if sequences should be dumped into a plot. |
-l, --loglevel | String | No | WARNING | Log level for script execution (choices: DEBUG, INFO, WARNING, ERROR, CRITICAL). |
-sc, --sequenceclass | String | No | N/A | Names of the DataFrame columns used to configure the data input for the sequence classification model (for example, "Sex, Treatment"). |
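The table above could be expressed with `argparse` roughly as follows. This is a sketch built only from the table; the function name `build_parser`, the help strings, and the use of `None` for the "N/A" defaults are assumptions rather than the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the argument table above."""
    parser = argparse.ArgumentParser(description="Clean and analyze sequence data.")
    parser.add_argument("-i", "--input", required=True,
                        help="Path to the input JSON configuration file.")
    parser.add_argument("-m", "--minlength", type=int, default=2,
                        help="Minimum length for sequences used in pair analysis.")
    parser.add_argument("-k", "--kmeans", type=int, default=6,
                        help="Number of clusters for k-means clustering.")
    parser.add_argument("-a", "--analysis", default=None,
                        choices=["singles", "pairs", "triples", "quads", "quints", "all"],
                        help="Type of frequency analysis to perform.")
    parser.add_argument("-d", "--dump", action="store_true",
                        help="Dump sequences into a plot.")
    parser.add_argument("-l", "--loglevel", default="WARNING",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Log level for script execution.")
    parser.add_argument("-sc", "--sequenceclass", default=None,
                        help="Column names for the sequence classification input.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["-i", "config.json", "-a", "pairs", "--dump"])
print(args.minlength, args.kmeans, args.analysis, args.dump)  # prints: 2 6 pairs True
```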
- Configuration Parsing:
- The script reads the JSON configuration file specified by the -i argument.
- Data Extraction:
- The CSV file specified in data_file is loaded.
- Data Cleaning and Transformation:
- The script applies column selection, renaming, transformations, and missing value handling as defined in the JSON configuration.
- Analysis:
- Depending on the --analysis argument, the script performs frequency analysis on singles, pairs, triples, quads, quints, or all of them.
- Logging:
- Log messages are configured based on the --loglevel argument for better tracking and debugging.
- Dumping Sequences:
- If --dump is specified, the processed sequences are saved for further inspection.
- Sequence Classification:
- If --sequenceclass is given, the script prepares the DataFrame and CSV file used as input to the sequence classification model, based on the named columns.
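The cleaning and transformation step can be sketched with pandas as shown below. The order of operations follows the list above. The function name `clean`, the interpretation of the `strip` and `to_datetime` transformation names, the reading of `fill_na` as per-type fill values, and the small inline CSV are all assumptions for illustration; the actual script may differ.

```python
import io

import pandas as pd


def clean(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Apply selection, renaming, transformations, fill values, and dtypes."""
    # Column selection: keep configured columns that are actually present.
    wanted = [c for c in config.get("columns", []) if c in df.columns]
    df = df[wanted].copy() if wanted else df.copy()

    # Renaming: keys that are not present are ignored by DataFrame.rename.
    df = df.rename(columns=config.get("rename_columns", {}))

    # Transformations: only the two names from the example are handled here.
    for column, name in config.get("transformations", {}).items():
        if column not in df.columns:
            continue
        if name == "strip":
            df[column] = df[column].map(
                lambda v: v.strip() if isinstance(v, str) else v
            )
        elif name == "to_datetime":
            df[column] = pd.to_datetime(df[column], errors="coerce")

    # Missing values: choose the fill value by broad dtype family (assumption).
    fill = config.get("fill_na", {})
    for column in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[column]):
            continue
        key = "numeric" if pd.api.types.is_numeric_dtype(df[column]) else "string"
        if key in fill:
            df[column] = df[column].fillna(fill[key])

    # Data types: cast only the columns that exist after the steps above.
    dtypes = {c: t for c, t in config.get("data_types", {}).items() if c in df.columns}
    return df.astype(dtypes)


# Made-up rows, used only to exercise the function.
raw = io.StringIO("Nest ID,Sex,Weight,Cluster6\n N1 ,M,1.5,x\nN2,,,y\n")
demo_config = {
    "columns": ["Nest ID", "Sex", "Weight", "Cluster6"],
    "transformations": {"Nest ID": "strip"},
    "fill_na": {"numeric": 0, "string": ""},
    "rename_columns": {"Cluster6": "Babbles"},
    "data_types": {"Nest ID": "str"},
}
cleaned = clean(pd.read_csv(raw), demo_config)
print(cleaned)
```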
Run the script using the following command:
python script_name.py -i config.json -m 3 -k 5 -a pairs -l INFO --dump -sc "Sex, Treatment"
- Reads configuration from config.json.
- Minimum sequence length is set to 3.
- Runs k-means clustering with 5 clusters.
- Performs pair analysis.
- Logs messages at INFO level.
- Dumps sequences into a plot.
- Builds the corresponding DataFrame and CSV for the Sex and Treatment columns.
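To make the pair analysis in this example concrete, the sketch below counts how often each run of consecutive items occurs across a set of sequences. It is an illustration only: the function name `count_ngrams`, the mapping from analysis names to window sizes, the rule that sequences shorter than the minimum length are skipped, and the sample bouts are all assumptions, not the script's confirmed logic.

```python
from collections import Counter

# Window size for each analysis name (assumed mapping).
NGRAM_SIZES = {"singles": 1, "pairs": 2, "triples": 3, "quads": 4, "quints": 5}


def count_ngrams(sequences, analysis="pairs", min_length=2):
    """Count consecutive n-item runs across all sufficiently long sequences."""
    n = NGRAM_SIZES[analysis]
    counts = Counter()
    for seq in sequences:
        # Skip sequences that are too short for the window or the minimum length.
        if len(seq) < max(min_length, n):
            continue
        for start in range(len(seq) - n + 1):
            counts[tuple(seq[start:start + n])] += 1
    return counts


# Made-up sequences of labels, used only to exercise the function.
bouts = [["A", "B", "A", "B"], ["A", "B"], ["B", "C", "A"]]
pair_counts = count_ngrams(bouts, analysis="pairs", min_length=3)
print(pair_counts.most_common(1))  # prints: [(('A', 'B'), 2)]
```

With a minimum length of 3, the second bout is ignored, so only the first and third contribute pairs.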
Logs are configured to display timestamps, log levels, and messages to help track the script's progress and troubleshoot any issues.
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
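One way to connect the --loglevel argument to this call is to translate the level name into the corresponding `logging` constant before configuring the root logger. The helper name `configure_logging` is hypothetical, and the sketch assumes the level arrives as a plain string such as "INFO".

```python
import logging


def configure_logging(level_name: str = "WARNING") -> int:
    """Configure root logging from a level name such as 'INFO'."""
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {level_name}")
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # replace handlers set up by any earlier basicConfig call
    )
    return level


configure_logging("INFO")
logging.getLogger(__name__).info("Configuration loaded")
```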
The script does not return a value but writes logs and, if specified, dumps sequence outputs for plotting.
Analysis of variance (ANOVA) is an important method in both exploratory and confirmatory data analysis. In exploratory data analysis (EDA), an ANOVA test helps identify potential differences in the mean of a continuous variable across categorical groups in your dataset, which can point to relationships and patterns between variables.
- Group comparisons: When exploring a dataset, ANOVA can quickly reveal whether there are statistically significant differences between various groups, allowing you to focus your analysis on those areas.
- Identifying potential relationships: By comparing means across different groups, ANOVA can highlight potential relationships between categorical variables and continuous variables, prompting further investigation.
- Data exploration: Even if you don't have a specific hypothesis in mind, running an ANOVA can help you discover interesting patterns in your data that might warrant further exploration.
- Data visualization: The results of an ANOVA can be used to inform visualizations like box plots or bar charts, which can visually represent the differences between groups.
- Independent Variable: Categorical (e.g., sex, treatment groups).
- Dependent Variable: Continuous (e.g., vocalization frequency or amplitude).
- One-Way ANOVA: Test differences among means for one categorical independent variable.
  - Example: Do parrot vocalizations differ based on sex (male vs. female)?
- Two-Way ANOVA: Explore interaction effects between two categorical independent variables.
  - Example: Is there an interaction between sex and treatment on vocalization patterns?
- Post-Hoc Tests: If the ANOVA reveals significant differences, use post-hoc tests (like Tukey's HSD) to determine which groups differ from each other.
- Visualization: Complement ANOVA results with visualizations (e.g., boxplots, violin plots) to help interpret differences in group means.
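For reference, the one-way ANOVA F statistic is the ratio of between-group to within-group mean squares. The sketch below computes it with the standard library only, so the arithmetic is visible. The function name `one_way_anova_f`, the group labels, and the measurements are made up for illustration and do not come from the dataset.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")

    grand_mean = fmean([v for g in groups for v in g])
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical measurements for three treatment groups.
samples = {
    "control": [1.0, 2.0, 3.0],
    "low": [2.0, 3.0, 4.0],
    "high": [5.0, 6.0, 7.0],
}
f_stat = one_way_anova_f(list(samples.values()))
print(round(f_stat, 3))  # prints: 13.0
```

In practice a statistics library is the better choice because it also returns a p-value; for example, `scipy.stats.f_oneway` computes the same F statistic.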