Data Wrangling Using Pandas

Data Wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw data into a format that is suitable for analysis. It is a crucial step in the data science process, as it allows you to uncover insights and patterns that may not be immediately visible in the raw data.

Dataset Used

We have picked a CSV dataset for car price analysis across different models. The dataset is available on Kaggle and is free to use: https://www.kaggle.com/code/dronax/car-prices-dataset/data. We will explore this dataset and perform wrangling on it.

Steps Involved in Data Wrangling Using Pandas

There are several common tasks involved in data wrangling, including:

  1. Reading the dataset:

The first step of any data analysis process is to read the dataset. pandas can read data in several formats, including CSV, Excel, JSON, and SQL. We will use CSV, since that is the format our data comes in. To read the data:

import pandas as pd

data = pd.read_csv('/content/drive/MyDrive/CarPrice_Assignment.csv')
data

The data contains 205 rows and 26 columns.
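pandas provides analogous readers for the other formats (pd.read_excel, pd.read_json, pd.read_sql). Here is a minimal, self-contained sketch of read_csv; it uses a small in-memory string instead of the actual Kaggle file, and the values shown are made up for illustration:

```python
import pandas as pd
from io import StringIO

# A tiny CSV kept in memory so the example runs anywhere;
# in practice you would pass a file path such as "CarPrice_Assignment.csv".
csv_text = (
    "car_ID,CarName,price\n"
    "1,alfa-romero giulia,13495\n"
    "2,alfa-romero stelvio,16500\n"
)

data = pd.read_csv(StringIO(csv_text))

print(data.shape)             # (2, 3)
print(data.columns.tolist())  # ['car_ID', 'CarName', 'price']
```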

  2. Handling missing data: 

This can involve removing rows with missing values or imputing them. Common imputation choices are the mean, the median, or the mode of the column. To check for missing values:

data.isna().sum()

car_ID               0
symboling            0
CarName              0
fueltype             0
aspiration           0
doornumber           0
carbody              0
drivewheel           0
enginelocation       0
wheelbase            0
carlength            0
carwidth             0
carheight            0
curbweight           0
enginetype           0
cylindernumber       0
enginesize           0
fuelsystem           0
boreratio            0
stroke               0
compressionratio     0
horsepower           0
peakrpm              0
citympg              0
highwaympg           0
price               12
dtype: int64

You can see that the price column contains 12 missing values.

data.loc[data["price"].isna()]

These are the rows with null price values.

One way to handle these missing values is to fill them with the column mean.

mean = data['price'].mean()
mean

13228.94387046632

Now fill the missing values with this mean and check again:

data["price"] = data["price"].fillna(mean)

data.isna().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

The null values of price are now filled with the mean value of the column.
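The median and mode are common alternatives to the mean: the median is more robust to extreme prices, and the mode suits categorical columns. A small sketch on a toy frame (the column name mirrors the article's dataset; the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with two missing prices.
df = pd.DataFrame({"price": [10000.0, 12000.0, np.nan, 20000.0, np.nan]})

# Median fill: robust to outliers in the column.
df["price_median"] = df["price"].fillna(df["price"].median())

# Mode fill: pick the first mode, since mode() can return several values.
df["price_mode"] = df["price"].fillna(df["price"].mode().iloc[0])

print(df["price_median"].tolist())
# [10000.0, 12000.0, 12000.0, 20000.0, 12000.0]
```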

  3. Handling duplicate data: 

This can involve removing or merging duplicate rows. Depending on the data, exact duplicates can be dropped outright, or duplicates can be aggregated (merged) if they carry complementary information.

To check for duplicated entries, just do this:

data.duplicated().sum()

0

There are no duplicated rows in our dataset. If there were, we could drop them by writing:

data = data.drop_duplicates()
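drop_duplicates also accepts subset and keep parameters, which helps when only some columns define a duplicate. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "CarName": ["audi 100 ls", "audi 100 ls", "volvo 264gl"],
    "price": [13950.0, 13950.0, 19045.0],
})

# Default: drop rows that are exact duplicates across all columns.
deduped = df.drop_duplicates()
print(deduped.shape)  # (2, 2)

# Deduplicate on one column only, keeping the last occurrence.
by_name = df.drop_duplicates(subset=["CarName"], keep="last")
print(by_name.shape)  # (2, 2)
```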
  4. Handling outliers: 

This can involve removing or transforming extreme values, i.e., values that are implausibly small or large for the column. For example, in our car dataset, a price below 100 or above 150,000 would be considered an outlier, so we remove such rows for a cleaner analysis.

To remove outliers, keep only the rows inside the valid range:

data = data[(data["price"] >= 100) & (data["price"] <= 150000)]
data.shape

(205, 26)

We inverted the condition and kept only the rows that are not outliers. The shape of the dataset is unchanged, so there were no outliers in the data.
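Fixed thresholds like 100 and 150,000 require domain knowledge. A common data-driven alternative is the interquartile-range (IQR) rule, which flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A minimal sketch on made-up prices:

```python
import pandas as pd

prices = pd.Series([5118, 7775, 9995, 13495, 16500, 18920, 45400, 500000])

# IQR rule: compute quartiles, then keep values inside the fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
mask = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# The extreme 500000 falls outside the upper fence and is filtered out.
print(prices[mask].tolist())
```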

  5. Formatting data: 

This can involve changing the data types of columns or formatting text data. Suppose we want to rename the columns so that all names are uppercase, with spaces replaced by “_”. We can do it like this:

data.columns=[x.upper().replace(" ", "_") for x in data.columns]
data

You can see that the column names are now uppercase. For lowercase, just replace .upper() with .lower().

To transform the contents of a column to uppercase or lowercase, do this:

data["CARNAME"].str.upper()

0            ALFA-ROMERO GIULIA
1           ALFA-ROMERO STELVIO
2      ALFA-ROMERO QUADRIFOGLIO
3                   AUDI 100 LS
4                    AUDI 100LS
                 ...           
200             VOLVO 145E (SW)
201                 VOLVO 144EA
202                 VOLVO 244DL
203                   VOLVO 246
204                 VOLVO 264GL
Name: CARNAME, Length: 205, dtype: object

The content of “CARNAME” has been transformed into upper case.

Changing the datatype of a column is a fairly simple task. Suppose you need to change the datatype of the “PRICE” column from float to string.

data["PRICE"] = data["PRICE"].astype(str)
data.dtypes

CAR_ID                int64
SYMBOLING             int64
CARNAME              object
FUELTYPE             object
ASPIRATION           object
DOORNUMBER           object
CARBODY              object
DRIVEWHEEL           object
ENGINELOCATION       object
WHEELBASE           float64
CARLENGTH           float64
CARWIDTH            float64
CARHEIGHT           float64
CURBWEIGHT            int64
ENGINETYPE           object
CYLINDERNUMBER       object
ENGINESIZE            int64
FUELSYSTEM           object
BORERATIO           float64
STROKE              float64
COMPRESSIONRATIO    float64
HORSEPOWER            int64
PEAKRPM               int64
CITYMPG               int64
HIGHWAYMPG            int64
PRICE                object
dtype: object

The datatype of “PRICE” is now object (string). To change it back to float, do this:

data["PRICE"] = data["PRICE"].astype(float)
data.dtypes

CAR_ID                int64
SYMBOLING             int64
CARNAME              object
FUELTYPE             object
ASPIRATION           object
DOORNUMBER           object
CARBODY              object
DRIVEWHEEL           object
ENGINELOCATION       object
WHEELBASE           float64
CARLENGTH           float64
CARWIDTH            float64
CARHEIGHT           float64
CURBWEIGHT            int64
ENGINETYPE           object
CYLINDERNUMBER       object
ENGINESIZE            int64
FUELSYSTEM           object
BORERATIO           float64
STROKE              float64
COMPRESSIONRATIO    float64
HORSEPOWER            int64
PEAKRPM               int64
CITYMPG               int64
HIGHWAYMPG            int64
PRICE               float64
dtype: object

The datatype of “PRICE” has been changed back to float64.
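Note that astype(float) raises an error if a column contains non-numeric placeholders. In that case pd.to_numeric with errors="coerce" is a safer option, turning bad values into NaN. A small sketch on made-up values (the "?" placeholder is an assumption for illustration):

```python
import pandas as pd

# Price strings with a "?" placeholder that astype(float) would choke on.
raw = pd.Series(["13495", "16500", "?", "13950"])

# errors="coerce" converts unparseable entries to NaN instead of raising.
prices = pd.to_numeric(raw, errors="coerce")

print(prices.isna().sum())  # 1
print(prices.dtype)         # float64
```

The resulting NaN can then be imputed with the same fillna techniques shown earlier.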

Sometimes we have to remove rows that contain invalid values. For example, we have a column named “DRIVEWHEEL”, and we know there are three valid drive types: rear-wheel drive (“rwd”), front-wheel drive (“fwd”), and four-wheel drive (“4wd”). So we know the set of valid values for this column.

v_conditions = ["rwd", "fwd", "4wd"]
data = data[data["DRIVEWHEEL"].isin(v_conditions)]
data.shape

(205, 26)

The shape of the dataset is unchanged, so every row already had a valid value.

  6. Merging and joining data: 

This can involve combining data from multiple sources, such as different tables or files. In pandas, you can use the concat() function to merge multiple dataframes together. You can specify the axis (0 for rows, 1 for columns) and the dataframes to be concatenated. For example, to merge two data frames df1 and df2 on rows:

merge_df = pd.concat([df1, df2], axis=0)
print(df1.shape)
print(df2.shape)
print(merge_df.shape)

(103, 26)

(102, 26)

(205, 26)

Both dataframes had the same number of columns, so pd.concat() simply stacked them row-wise.
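When the dataframes share a key column instead of identical columns, pd.merge performs SQL-style joins. A minimal sketch on made-up frames:

```python
import pandas as pd

cars = pd.DataFrame({"car_ID": [1, 2, 3],
                     "CarName": ["giulia", "stelvio", "audi 100ls"]})
prices = pd.DataFrame({"car_ID": [1, 2, 4],
                       "price": [13495.0, 16500.0, 13950.0]})

# Inner join: keep only car_IDs present in both frames.
joined = pd.merge(cars, prices, on="car_ID", how="inner")
print(joined.shape)  # (2, 3)
```

Other how values ("left", "right", "outer") control which unmatched keys survive the join.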

  7. Save the cleaned dataset to a new CSV file:

To save the cleaned and wrangled data into a CSV file, we can use the following command:

data.to_csv("cleaned_data_cars.csv", index=False)
print("the data has been saved as CSV")

the data has been saved as CSV

Tools like pandas and NumPy make data wrangling efficient. It is important to note that data wrangling can be a time-consuming and iterative process, as you may need to go back and revise the data based on the results of your analysis.
