Data Wrangling

Barkha Verma
4 min read · Nov 8, 2020
Photo by Elder Research

You may have heard about data wrangling before. In this blog you will read about what exactly data wrangling is, how it is defined, how to do it with pandas, and a few examples that illustrate these methods.

Data Wrangling Definition

It is often the case with data science projects that you’ll have to deal with messy or incomplete data. Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. As data has become more diverse and unstructured, more time has to be spent cleaning and organizing it ahead of broader analysis. Data wrangling involves processing the data in various ways, such as merging, grouping and concatenating.

As most statisticians, data scientists and data analysts will admit, most of the time spent implementing an analysis is devoted to cleaning or wrangling the data itself, rather than to coding or running a particular model that uses the data.

Data wrangling is a process that includes six steps:

  1. Discovering
  2. Structuring
  3. Cleaning
  4. Enriching
  5. Validating
  6. Publishing

I will explain each one.

1. Discovering

In this step, you get to understand the data more deeply. Before applying any methods to clean it, you need a clear idea of what the data contains: its columns, types, value ranges and gaps.
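As a rough illustration of discovery, the sketch below uses a few common pandas inspection calls on a small made-up table. The data and column names are hypothetical and only serve to show what each call reports.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Kings", None],
    "Points": [876, 863, None, 701],
})

print(df.shape)          # number of rows and columns
print(df.dtypes)         # data type of each column
print(df.isna().sum())   # count of missing values per column
print(df.describe())     # summary statistics for the numeric columns
```

Looking at shapes, types and missing-value counts first usually tells you which of the later steps will need the most attention.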

2. Structuring

This data wrangling step means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis. One column may become two. The data is rearranged to make computation and analysis easier.
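Both kinds of restructuring can be sketched in pandas. The example below is only one possible approach, on a hypothetical table whose column names (team_city and the year columns) are invented for illustration: it splits one column into two and then reshapes the per-year columns into rows.

```python
import pandas as pd

# Hypothetical raw table: one combined column and one column per year.
raw = pd.DataFrame({
    "team_city": ["Riders-Delhi", "Kings-Mumbai"],
    "2014": [876, 741],
    "2015": [789, 812],
})

# One column becomes two: split "team_city" on the hyphen.
raw[["Team", "City"]] = raw["team_city"].str.split("-", expand=True)
raw = raw.drop(columns="team_city")

# Wide to long: each year column is turned into its own set of rows.
tidy = raw.melt(id_vars=["Team", "City"], var_name="Year", value_name="Points")
print(tidy)
```

The long layout has one row per team and year, which is the shape that grouping and plotting functions generally expect.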

3. Cleaning

Most real-world datasets contain some outliers, which can skew the results of the analysis. These have to be handled for the best results. In this step, the data is cleaned thoroughly for high-quality analysis. Null values have to be filled in or removed, and the formatting is standardized to make the data of higher quality.
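A minimal cleaning sketch, assuming a hypothetical table with inconsistent text formatting and one missing value. Filling with the median is just one of several reasonable choices; dropping the row would be another.

```python
import pandas as pd

# Hypothetical messy data: inconsistent casing, stray spaces, a missing value.
df = pd.DataFrame({
    "Team": [" riders", "Kings ", "kings", "Devils"],
    "Points": [876.0, None, 812.0, 863.0],
})

# Standardize formatting: trim whitespace and use consistent capitalization.
df["Team"] = df["Team"].str.strip().str.title()

# Replace the missing value with the column median (one option among several).
df["Points"] = df["Points"].fillna(df["Points"].median())

print(df)
```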

4. Enriching

Data enrichment is the process of combining first-party data from internal sources with disparate data from other internal systems or third-party data from external sources. Enriched data is a valuable asset for any organization because it becomes more useful and insightful.
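As a small illustration of enrichment, the sketch below adds a column from a second table to an internal one. Both tables and their contents are hypothetical; the point is only that a left join keeps every internal row and fills in extra detail where it exists.

```python
import pandas as pd

# Hypothetical internal data.
scores = pd.DataFrame({
    "Team": ["Riders", "Devils", "Kings"],
    "Points": [876, 863, 741],
})

# Hypothetical external reference data (no entry for "Devils").
cities = pd.DataFrame({
    "Team": ["Riders", "Kings"],
    "City": ["Delhi", "Mumbai"],
})

# A left join keeps every internal row and adds City where one is available.
enriched = scores.merge(cities, on="Team", how="left")
print(enriched)
```

Rows without a match get a missing value in the new column, which is worth checking for before moving on.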

5. Validating

Validation rules are repeatable programming steps used to verify the consistency, quality and security of your data. Validating is the activity that surfaces data quality and consistency issues, or verifies that they have been properly addressed by the applied transformations.
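Validation rules can be written as a small reusable function. The sketch below is an assumption about what such rules might look like: the validate helper, the 1 to 10 rank range and the column names are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Kings"],
    "Rank": [1, 2, 3],
    "Points": [876, 863, 741],
})

def validate(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values present")
    if frame.duplicated().any():
        problems.append("duplicate rows present")
    if not frame["Rank"].between(1, 10).all():
        problems.append("Rank outside the expected 1-10 range")
    if (frame["Points"] < 0).any():
        problems.append("negative Points")
    return problems

print(validate(df))
```

Running the same function before and after each transformation makes it easy to see whether a cleaning step actually fixed the issue it targeted.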

6. Publishing

The prepared, wrangled data is published so that it can be used further down the line, which is its purpose after all. If needed, you will also have to document the steps taken or the logic used to wrangle the data.
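In the simplest case, publishing can mean writing the wrangled table to a file that others can read. The sketch below writes a CSV to a temporary directory and reads it back; the file name and location are placeholders, and a real project would more likely publish to a shared location or a database.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Team": ["Riders", "Devils"], "Points": [876, 863]})

# Write to a temporary directory; the file name is a placeholder.
out_path = os.path.join(tempfile.mkdtemp(), "wrangled.csv")
df.to_csv(out_path, index=False)

# Read the file back to confirm that it can be consumed downstream.
restored = pd.read_csv(out_path)
print(restored)
```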

Data Wrangling with Pandas

Pandas is one of the most popular Python libraries for data wrangling. In this example we’ll use Pandas to learn data wrangling techniques for some of the most common data formats and their transformations. We’ll be working with Pandas DataFrames, which are structured as tables whose rows and columns you can easily manipulate with Python code.

Examples of Data Wrangling:

1 Merging Data

The Pandas library in Python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects. Its most commonly used parameters, with their defaults, are:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False)
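To make the how parameter concrete, here is a minimal sketch on two small made-up DataFrames that share an id column. The data is hypothetical; only the inner and outer join types are shown.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "Name": ["Alex", "Amy", "Allen"]})
right = pd.DataFrame({"id": [2, 3, 4], "subject_id": ["sub2", "sub4", "sub6"]})

# Inner join: keep only the ids present in both frames.
inner = pd.merge(left, right, on="id", how="inner")

# Outer join: keep the ids present in either frame, filling gaps with NaN.
outer = pd.merge(left, right, on="id", how="outer")

print(inner)
print(outer)
```

The inner result has two rows (ids 2 and 3), while the outer result has four, with missing values where one side had no match.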

2 Grouping Data

Grouping is a frequent need in data analysis, where we want results broken down by the various groups present in the data set. Pandas has built-in methods that can roll the data up into such groups.

In the example below we group the data by year and then get the result for a specific year.

# import the pandas library
import pandas as pd

# sample data: one row per team and year
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# group the rows by year, then select the rows for 2014
grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows (recent versions of Pandas keep the columns in the order they were defined):

     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
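Selecting one group is only part of what grouping is used for; more often each group is summarized. The sketch below, on a shortened version of the same kind of data, computes the total and average points per team. The agg call with a list of function names is standard pandas; the reduced data set is an assumption made to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Year": [2014, 2015, 2014, 2015],
    "Points": [876, 789, 863, 673],
})

# Total and average points for each team.
summary = df.groupby("Team")["Points"].agg(["sum", "mean"])
print(summary)
```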

3 Concatenating Data

Pandas provides various facilities for easily combining Series and DataFrame objects (the older Panel type has been removed from recent versions). In the example below, the concat function performs concatenation along an axis. Let us create two DataFrames and concatenate them.

import pandas as pd

# first DataFrame
one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [98, 90, 87, 69, 78]},
    index=[1, 2, 3, 4, 5])

# second DataFrame with the same columns and index labels
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [89, 80, 79, 97, 88]},
    index=[1, 2, 3, 4, 5])

# stack the rows of the second DataFrame below the first
print(pd.concat([one, two]))

Its output is as follows (note that the original index labels 1 to 5 are repeated):

     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
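Repeated index labels after a concatenation can be confusing. Two common ways to deal with them are sketched below on shortened, hypothetical versions of the two DataFrames: ignore_index renumbers the rows, and keys records which source each row came from.

```python
import pandas as pd

one = pd.DataFrame({"Name": ["Alex", "Amy"], "Marks_scored": [98, 90]}, index=[1, 2])
two = pd.DataFrame({"Name": ["Billy", "Brian"], "Marks_scored": [89, 80]}, index=[1, 2])

# ignore_index=True discards the original labels and renumbers from 0.
renumbered = pd.concat([one, two], ignore_index=True)

# keys= labels each source frame, producing a two-level index.
labelled = pd.concat([one, two], keys=["one", "two"])

print(renumbered)
print(labelled)
```

Which option fits depends on whether the original labels carry meaning; if they are just row numbers, renumbering is usually enough.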

Conclusion:

Data wrangling is an important part of any data analysis. You’ll want to make sure your data is in tip-top shape and ready for convenient consumption before you apply any algorithms to it. By dropping null values, filtering and selecting the right data, and merging, grouping and concatenating it into the shape you need, you help ensure that any machine learning or other treatment you apply to your cleaned-up data is as effective as it can be.

Barkha Verma

I am Barkha Verma from India. I am pursuing my B.Tech in computer science and engineering with a specialization in data science.