- “Transforming and mapping data from one form into another with the intent of making it more appropriate for analytics” - Wikipedia
- “Process of cleaning and unifying messy and complex data sets for easy access and analysis.” - Altair
Goals of wrangling data
- derive value out of data through analysis
- make data-driven decisions
- optimize business metrics such as cost, expenditure, etc.
- save the world
Time spent on preparing data
Data wrangling tasks
Discovering
Understanding and exploring the dataset, then planning the actions needed to take it from its “raw” form to the desired final state.
- defining all entities and variables,
- visualizing the dataset,
- etc.
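A minimal sketch of a discovery pass, assuming the raw data arrives as CSV text. The sample data, the column names, and the `profile` helper are hypothetical and only illustrate the idea; in practice a dataframe library would usually do this work.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """city,temp,unit
Nairobi,25,C
Mombasa,,C
nairobi,77,F
"""

def profile(rows):
    """Return, per column, the number of missing and distinct values."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        summary[column] = {
            "missing": sum(1 for v in values if v == ""),
            "distinct": len({v for v in values if v != ""}),
        }
    return summary

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

Even this small profile surfaces questions for the later steps: one temperature is missing, the same city appears with two spellings, and two units are in use.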
Structuring
Actions that change the form or schema of the data.
- reordering columns,
- filtering records,
- aggregating/pivoting records,
- etc.
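The structuring actions listed above can be sketched with plain Python data structures. The sales records and field names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales records.
records = [
    {"amount": 120, "region": "east", "product": "tea"},
    {"amount": 80, "region": "west", "product": "tea"},
    {"amount": 0, "region": "east", "product": "coffee"},
    {"amount": 50, "region": "east", "product": "coffee"},
]

# Reorder columns: dicts preserve insertion order, so rebuild each record.
column_order = ["region", "product", "amount"]
reordered = [{col: rec[col] for col in column_order} for rec in records]

# Filter records: drop rows with a zero amount.
filtered = [rec for rec in reordered if rec["amount"] > 0]

# Aggregate: total amount per region.
totals = defaultdict(int)
for rec in filtered:
    totals[rec["region"]] += rec["amount"]
totals = dict(totals)
print(totals)
```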
Cleaning
Actions that standardize or fix irregularities in a dataset.
Addressing:
- MISSING values,
- INVALID values,
- MISSPELLED values,
- different DATE FORMATS,
- different UNITS of measurement,
- etc.
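A rough sketch of how these irregularities might be handled, one record at a time. The sample readings, the two accepted date formats, the -999 sentinel, and the plausible temperature range are all assumptions chosen for the example, not rules from any particular dataset.

```python
from datetime import datetime

# Hypothetical raw readings with typical irregularities.
raw = [
    {"city": "Nairobi", "date": "2021-03-01", "temp": "25", "unit": "C"},
    {"city": "nairobi ", "date": "01/03/2021", "temp": "77", "unit": "F"},
    {"city": "Mombasa", "date": "2021-03-02", "temp": "", "unit": "C"},
    {"city": "Mombasa", "date": "2021-03-03", "temp": "-999", "unit": "C"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats

def parse_date(text):
    """Try each known format and return an ISO 8601 date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(row):
    # Missing values: an empty string becomes None.
    temp = float(row["temp"]) if row["temp"] else None
    # Units: convert Fahrenheit to Celsius.
    if temp is not None and row["unit"] == "F":
        temp = round((temp - 32) * 5 / 9, 1)
    # Invalid values: discard readings outside an assumed plausible range.
    if temp is not None and not -90 <= temp <= 60:
        temp = None
    return {
        # Spelling variants: normalize whitespace and letter case.
        "city": row["city"].strip().title(),
        # Date formats: standardize on ISO 8601.
        "date": parse_date(row["date"]),
        "temp_c": temp,
    }

cleaned = [clean(row) for row in raw]
for row in cleaned:
    print(row)
```

Whether a bad value should become `None`, be imputed, or cause the whole record to be dropped is a decision to make per dataset; the sketch only marks such values as missing.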
Enriching
Actions that introduce new variables/observations to the dataset.
- unions,
- joins,
- deriving new fields from old ones,
- etc.
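As an illustration, the three enriching actions can be sketched on two hypothetical order tables and a customer lookup table. All names and values here are invented for the example.

```python
# Hypothetical order tables from two sources, plus a customer lookup table.
orders_2020 = [{"order_id": 1, "customer_id": "a", "qty": 2, "unit_price": 5.0}]
orders_2021 = [{"order_id": 2, "customer_id": "b", "qty": 1, "unit_price": 8.0}]
customers = {"a": {"country": "KE"}, "b": {"country": "UG"}}

# Union: append observations that share the same schema.
orders = orders_2020 + orders_2021

enriched = []
for order in orders:
    row = dict(order)
    # Join: add customer attributes by key (a left join; unmatched keys get None).
    row["country"] = customers.get(order["customer_id"], {}).get("country")
    # Derive: compute a new field from existing ones.
    row["total"] = order["qty"] * order["unit_price"]
    enriched.append(row)

for row in enriched:
    print(row)
```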
Validating
Actions that ensure the transformed data is consistent with the agreed-upon definitions and assumptions.
- consistency in variable definition,
- outlier/anomaly detection,
- etc.
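One possible shape for these checks is sketched below. The schema rule and the sample rows are hypothetical, and the outlier test uses the modified z-score (based on the median absolute deviation) with the commonly used cutoff of 3.5; this is one technique among many, chosen here only because it is short.

```python
import statistics

# Hypothetical cleaned temperatures in Celsius.
rows = [
    {"city": "Nairobi", "temp_c": 24.0},
    {"city": "Nairobi", "temp_c": 25.0},
    {"city": "Nairobi", "temp_c": 26.0},
    {"city": "Nairobi", "temp_c": 25.5},
    {"city": "Nairobi", "temp_c": 58.0},
]

def check_schema(row):
    """Consistency check: required fields exist and have the agreed types."""
    return isinstance(row.get("city"), str) and isinstance(row.get("temp_c"), float)

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread around the median, nothing to flag
    return [v for v in values if 0.6745 * abs(v - median) / mad > cutoff]

schema_ok = all(check_schema(row) for row in rows)
outliers = mad_outliers([row["temp_c"] for row in rows])
print(schema_ok, outliers)
```

A flagged value is a prompt to investigate, not proof of an error; whether 58.0 is a sensor fault or a real reading depends on the agreed-upon definitions for the dataset.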
Publishing
Actions that integrate, expose or present the wrangled dataset to the final consumers of the data.
- download,
- DB integrations,
- exposing through APIs,
- presentations,
- etc.
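Two of the publishing routes above, a downloadable file and a database integration, can be sketched with the standard library. The table name, columns, and rows are hypothetical, and an in-memory SQLite database stands in for whatever database the consumers actually use.

```python
import csv
import io
import sqlite3

# Hypothetical wrangled dataset.
rows = [("Nairobi", "2021-03-01", 25.0), ("Mombasa", "2021-03-02", 30.5)]

# Download: serialize to CSV text that could be written to a file or served.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city", "date", "temp_c"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# DB integration: load the rows into a table (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, date TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)
```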