Data Clean-up Tool

Setup

To set up and run the Data Clean-up Tool, follow these steps:

Clone this repository to your local machine:

git clone https://github.com/mustachemo/data-runners.git

Change to the project directory:

cd data-runners

Create a conda environment from the provided environment.yml file:

conda env create -f environment.yml

Optionally, to update an existing conda environment from the same file:

conda env update --file environment.yml --prune

Activate the newly created conda environment:

conda activate data_cleanup_env

Run the Application:
- Open a terminal and execute the following command:
```
python run.py
```
- OR
  - Open the command palette by pressing Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
  - Type "Run Task" and select "Tasks: Run Task" from the dropdown list.
  - Alternatively, you can use the keyboard shortcut Ctrl+Shift+B (Windows/Linux) or Cmd+Shift+B (Mac).

Features

Standard Features:
- Upload button to upload various forms of data
  - Style the button to make it look nice
- Display the data in a table
- All columns are displayed and are scrollable
- Up to 259 rows are displayed and are scrollable. The rest of the data is reached through pagination
- Export data into various forms
  - We are able to export the data, but not the modified data. Must fix that
  - Also need to make a button that exports the data into the format specified by the radio buttons
  - Style export button!
- Make the columns/first row/headers sticky, meaning they stay in place when scrolling
- Add enforcement of types. If a column holds only numbers, for example money, text cannot be entered into it; only numeric values are allowed (https://dash.plotly.com/datatable/typing)
- Add a formatting setting that formats columns to a specified preference. For example, a cost column will show a $ sign and commas where needed, along with number type enforcement (https://dash.plotly.com/datatable/typing)
- Adding or removing columns and rows
- Update the parse_content function to include "xlsx", "xml", "html", and "pdf" if we can (pdf is a bonus feature)
- Combine two or more data of the same format into one file
- [z] After "enforcing" dtypes or formatting, highlight the affected cells (https://dash.plotly.com/datatable/conditional-formatting). Highlighting could also serve other use cases. We should have a legend that says what each highlight color means
  - dtype highlighting
  - Highlighting None, NaN, or Empty String Values
  - formatting highlighting
- [z] Make legend for filtering operations/syntax (https://dash.plotly.com/datatable/filtering)
- Testing (https://dash.plotly.com/testing)
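
The type-enforcement and formatting items above could be expressed through DataTable column definitions. The sketch below builds those definitions as plain dictionaries, assuming the `type`, `on_change`, and `format` column keys described in the linked typing docs. The helper name `build_columns`, the "money" shorthand, and the example column names are hypothetical and not part of this project.

```python
def build_columns(column_types):
    """Build DataTable column definitions from a {column_id: kind} mapping.

    `kind` is "text", "numeric", or "money" (a hypothetical shorthand
    used only in this sketch).
    """
    columns = []
    for column_id, kind in column_types.items():
        column = {"name": column_id, "id": column_id}
        if kind in ("numeric", "money"):
            column["type"] = "numeric"
            # Coerce edits to numbers; reject edits that cannot be coerced.
            column["on_change"] = {"action": "coerce", "failure": "reject"}
        else:
            column["type"] = "text"
        if kind == "money":
            # d3-format specifier: dollar sign, thousands separator, 2 decimals.
            column["format"] = {"specifier": "$,.2f"}
        columns.append(column)
    return columns


# Hypothetical example columns.
columns = build_columns({"item": "text", "quantity": "numeric", "cost": "money"})
```

The resulting list would be passed as the `columns` argument of the DataTable, with `editable=True` so that the coercion rules apply to user edits.
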

Bonus Features:
- Make a tab option for graphs (https://dash.plotly.com/dash-core-components/tab)
- Highlight Changes: Display changed cells in a different color for easier tracking
- Add loading animation (https://dash.plotly.com/dash-core-components/loading)
- Make column selection available through checkboxes (https://dash.plotly.com/datatable/editable)
- Displaying Errors with dash.no_update (https://dash.plotly.com/advanced-callbacks)
- Tabs for visuals and/or data analytics information (https://dash.plotly.com/dash-core-components/tabs) (https://dash.plotly.com/dash-core-components/graph)
- Could style the table to make it nicer (bonus feature)
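
The highlighting items in the feature lists above (None, NaN, or empty values, plus a color legend) could be driven by DataTable's `style_data_conditional` rules. The following is a minimal sketch that generates one rule per column using the `is blank` filter operator from the conditional-formatting docs. The `LEGEND` colors and the helper name `missing_value_rules` are assumptions made for illustration.

```python
# Hypothetical legend: highlight reason -> background color.
LEGEND = {
    "missing value": "#FFD6D6",
    "wrong dtype": "#FFF3B0",
}


def missing_value_rules(column_ids, color=LEGEND["missing value"]):
    """Return one style_data_conditional rule per column for blank cells."""
    rules = []
    for column_id in column_ids:
        rules.append(
            {
                "if": {
                    # `is blank` is the DataTable filter operator for
                    # empty or missing cells.
                    "filter_query": f"{{{column_id}}} is blank",
                    "column_id": column_id,
                },
                "backgroundColor": color,
            }
        )
    return rules


# Hypothetical example columns.
rules = missing_value_rules(["item", "cost"])
```

The list would be assigned to the table's `style_data_conditional` property, and the same `LEGEND` mapping could be rendered next to the table so users can see what each color means.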

Optimization

Get rid of df-store; there is no need to store the data in memory because the df is already held as a variable on the DataHandler instance
Use callback_context to combine multiple callbacks into one [Determining which Input Has Fired with dash.callback_context] (https://dash.plotly.com/advanced-callbacks); also look at Duplicate Callback Outputs (https://dash.plotly.com/duplicate-callback-outputs) and at (https://dash.plotly.com/determining-which-callback-input-changed)
Use Partial Property Update callback to highlight/unhighlight cells/rows/columns that match a specific pattern [Could be used for other cases] (https://dash.plotly.com/partial-properties) (Check clear section)
Make callbacks more readable with Flexible Callback Signatures (https://dash.plotly.com/flexible-callback-signatures)
Use backend paging to load only a few rows per page (https://dash.plotly.com/datatable/callbacks)
Performance (https://dash.plotly.com/performance)
Remember user preferences? (https://dash.plotly.com/persistence)
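
The backend paging idea above comes down to slicing the full dataset by page on the server instead of sending every row to the browser. A rough sketch of that slicing logic follows, assuming zero-based page numbers as in the DataTable's `page_current` property. The function names and sample rows are hypothetical.

```python
def page_rows(rows, page_current, page_size):
    """Return the rows that belong to one zero-based page."""
    start = page_current * page_size
    return rows[start : start + page_size]


def page_count(total_rows, page_size):
    """Number of pages needed to show every row (ceiling division)."""
    return -(-total_rows // page_size)


# Hypothetical data: ten rows shown four at a time.
rows = [{"id": i} for i in range(10)]
first_page = page_rows(rows, page_current=0, page_size=4)
last_page = page_rows(rows, page_current=2, page_size=4)
```

In a Dash callback, `page_current` and `page_size` would be inputs from a DataTable configured with `page_action="custom"`, and the returned slice would be written to the table's `data` property.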

Extras

Periodic calling of callbacks/refreshing of the page for real-time monitoring (https://dash.plotly.com/dash-core-components/interval)
Add more highlighting for different cases (https://dash.plotly.com/datatable/conditional-formatting)
Format numbers [i.e. adding $ or % sign before or after numbers, commas between numbers, padding] (https://dash.plotly.com/datatable/data-formatting)
Check out dash AG Grid (https://dash.plotly.com/dash-ag-grid/getting-started)
Dev tools, Custom Timing Events
Loading for text/small components (https://dash.plotly.com/loading-states)
Deployment (https://dash.plotly.com/deployment)
Help (https://community.plotly.com/c/python/25?utm_medium=dash_docs&utm_content=sidebar)

Problem

The warehouse management system (WMS) contains large amounts of bad data: data that does not comply with the required format, is no longer relevant, or was entered incorrectly and cannot be used for any purpose. This data hinders many daily activities, becomes a hurdle when the company transitions to a new WMS, and, most importantly, occupies a large amount of memory on the servers. A tool that identifies this bad data, converts it to the required format, and deletes gaps where necessary can help resolve many of these issues.

Objectives

The objective is to design a functioning tool that helps the company identify, modify, fix, and delete this bad data and the gaps in the data, and eliminate large amounts of it according to user requirements. This will reduce the manual work involved in fixing bad data.

Standard Features:
- Ability to read various formats of data (xml, csv, pdf, etc.) and display it in rows and columns.
- Give the user the ability to define each row or column of data according to the user’s preference, and modify or display the data that does not match the defined parameters, preferably in a GUI that a layperson can use.
- Combine different sets of data of same format into one set and customize as per user requirements.
- Ability to export into different formats as per user needs.
Bonus Features:
- Identify duplicate data in different formats, errors such as wrong address format, punctuation, spellings, and address styles. Filter the data and display the rows and columns with these discrepancies.
- Creating visuals from the data.
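
Reading several input formats, as described in the standard features above, usually starts by decoding the uploaded file and dispatching on its extension. The sketch below assumes the upload arrives as a base64 data URL string (the form Dash's `dcc.Upload` provides in `contents`) and handles only CSV and JSON with the standard library. `parse_upload` and the sample file are hypothetical and are not the project's `parse_content` implementation.

```python
import base64
import csv
import io
import json


def parse_upload(contents, filename):
    """Decode an upload string and return a list of row dictionaries.

    Assumes `contents` has the form "data:<mime>;base64,<payload>".
    Only CSV and JSON (a list of objects) are handled in this sketch.
    """
    _, payload = contents.split(",", 1)
    text = base64.b64decode(payload).decode("utf-8")
    extension = filename.rsplit(".", 1)[-1].lower()
    if extension == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if extension == "json":
        return json.loads(text)
    raise ValueError(f"Unsupported file type: {extension}")


# Hypothetical sample upload: a one-row CSV file.
encoded = base64.b64encode(b"sku,cost\nA1,3.50\n").decode("ascii")
sample = "data:text/csv;base64," + encoded
rows = parse_upload(sample, "inventory.csv")
```

Other formats such as xlsx, xml, or html would need additional branches, most likely backed by a library such as pandas rather than the standard library alone.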
