Data Wrangling Using Pandas
Data Wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw data into a format that is suitable for analysis. It is a crucial step in the data science process, as it allows you to uncover insights and patterns that may not be immediately visible in the raw data.
Dataset Used
We have picked a csv dataset of car price analysis of different models. The dataset is available on Kaggle and is free to use. The link to the dataset is provided as follows: https://www.kaggle.com/code/dronax/car-prices-dataset/data. We will explore this dataset and will try to perform wrangling on it.
Steps Involved in Data Wrangling Using Pandas
There are several common tasks involved in data wrangling, including:
- Reading the dataset:
The first step of any data analysis process is to read the dataset. Pandas can read data from several formats, including CSV, Excel, JSON, and SQL. We will use CSV, since that is the format our data comes in. To read the data:
import pandas as pd
data= pd.read_csv('/content/drive/MyDrive/CarPrice_Assignment.csv')
data
The data contains 205 rows and 26 columns.
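Pandas ships parallel readers for the other formats mentioned above. The file names below are placeholders for illustration; a minimal, self-contained sketch uses an in-memory CSV instead of a file on disk:

```python
import pandas as pd
from io import StringIO

# Other readers, shown for reference (file names are hypothetical):
# df = pd.read_excel("cars.xlsx")                 # needs openpyxl installed
# df = pd.read_json("cars.json")
# df = pd.read_sql("SELECT * FROM cars", conn)    # needs a DB connection

# read_csv also accepts any file-like object, handy for quick experiments:
csv_text = "car_ID,price\n1,13495.0\n2,16500.0\n"
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 2)
```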
- Handling missing data:
This can involve removing or imputing missing values. We can find missing data and fill it with a representative value, such as the mean, median, or mode of the column. To check for missing values:
data.isna().sum()
car_ID 0
symboling 0
CarName 0
fueltype 0
aspiration 0
doornumber 0
carbody 0
drivewheel 0
enginelocation 0
wheelbase 0
carlength 0
carwidth 0
carheight 0
curbweight 0
enginetype 0
cylindernumber 0
enginesize 0
fuelsystem 0
boreratio 0
stroke 0
compressionratio 0
horsepower 0
peakrpm 0
citympg 0
highwaympg 0
price 12
dtype: int64
You can see that the price column contains 12 missing values.
data.loc[data["price"].isna()]
These are the rows that have null price values.
One way to fill these missing values is with the mean of the column.
mean=data['price'].mean()
mean
13228.94387046632
Now fill the missing values with this mean:
data["price"] = data["price"].fillna(mean)
data.isna().sum()
car_ID 0
symboling 0
CarName 0
fueltype 0
aspiration 0
doornumber 0
carbody 0
drivewheel 0
enginelocation 0
wheelbase 0
carlength 0
carwidth 0
carheight 0
curbweight 0
enginetype 0
cylindernumber 0
enginesize 0
fuelsystem 0
boreratio 0
stroke 0
compressionratio 0
horsepower 0
peakrpm 0
citympg 0
highwaympg 0
price 0
dtype: int64
The null values of price are now filled with the mean value of the column.
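The text above mentions the median and mode as alternatives to the mean; only the mean is demonstrated. A small sketch on a toy frame (hypothetical prices, same column name) shows the other two fills:

```python
import pandas as pd
import numpy as np

# Toy frame with one missing price (made-up numbers for illustration).
toy = pd.DataFrame({"price": [100.0, 200.0, np.nan, 200.0]})

# Median fill: middle value of the non-null prices.
median_fill = toy["price"].fillna(toy["price"].median())

# Mode fill: most frequent value; mode() returns a Series, so take [0].
mode_fill = toy["price"].fillna(toy["price"].mode()[0])

print(median_fill.tolist())  # [100.0, 200.0, 200.0, 200.0]
```

The median is often preferred over the mean when the column contains outliers, since a few extreme prices can pull the mean far from typical values.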
- Handling duplicate data:
This can involve removing or merging duplicate rows. Exact duplicates are usually safe to drop; merging is preferable when duplicated rows carry partially different information that would otherwise be lost.
To check for duplicated entries, just do this:
data.duplicated().sum()
0
There are no duplicated rows in our dataset. If there were, we could drop them by writing:
data = data.drop_duplicates()
- Handling outliers:
This can involve removing or transforming extreme values. There might be values in the data that are either extremely small or extremely large. For example, in our car dataset, a price below 100 or above 150,000 would be considered an outlier, so we remove such rows for better analysis.
To remove outliers:
data = data[(data["price"] >= 100) & (data["price"] <= 150000)]
data.shape
(205, 26)
We kept only the rows that fall within the valid range. It seems there were no outliers in the data, as the shape of the dataset is unchanged.
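The bounds above are hard-coded. A common data-driven alternative is the 1.5 × IQR rule, sketched here on a toy price column (made-up numbers, not the Kaggle data):

```python
import pandas as pd

# Toy prices with one obvious outlier (hypothetical values).
toy = pd.DataFrame({"price": [5000.0, 7000.0, 8000.0, 9000.0, 500000.0]})

# Bounds derived from the interquartile range (IQR).
q1 = toy["price"].quantile(0.25)
q3 = toy["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = toy[(toy["price"] >= lower) & (toy["price"] <= upper)]
print(filtered.shape)  # (4, 1) -> the 500000.0 row is dropped
```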
- Formatting data:
This can involve changing the data types of columns or formatting text data. Suppose we want to rename the columns so that all names are uppercase, with spaces replaced by underscores:
data.columns=[x.upper().replace(" ", "_") for x in data.columns]
data
You can see that the column names have been converted to uppercase. For lowercase, just replace .upper() with .lower().
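The list comprehension above rewrites every column name at once. When only a few columns need new names, `rename()` takes an explicit mapping instead (toy frame below, column names borrowed from the dataset):

```python
import pandas as pd

# Toy frame mirroring two of the dataset's columns.
toy = pd.DataFrame({"CarName": ["audi 100ls"], "price": [13950.0]})

# rename() maps old names to new ones and leaves the rest untouched.
toy = toy.rename(columns={"CarName": "CAR_NAME", "price": "PRICE"})
print(list(toy.columns))  # ['CAR_NAME', 'PRICE']
```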
To transform the contents of a column to uppercase or lowercase, just do this:
data["CARNAME"].str.upper()
0 ALFA-ROMERO GIULIA
1 ALFA-ROMERO STELVIO
2 ALFA-ROMERO QUADRIFOGLIO
3 AUDI 100 LS
4 AUDI 100LS
...
200 VOLVO 145E (SW)
201 VOLVO 144EA
202 VOLVO 244DL
203 VOLVO 246
204 VOLVO 264GL
Name: CARNAME, Length: 205, dtype: object
The content of “CARNAME” has been transformed into upper case.
Changing the datatype of a column is a fairly simple task. Suppose you need to change the datatype of the PRICE column from float to string.
data["PRICE"] = data["PRICE"].astype(str)
data.dtypes
CAR_ID int64
SYMBOLING int64
CARNAME object
FUELTYPE object
ASPIRATION object
DOORNUMBER object
CARBODY object
DRIVEWHEEL object
ENGINELOCATION object
WHEELBASE float64
CARLENGTH float64
CARWIDTH float64
CARHEIGHT float64
CURBWEIGHT int64
ENGINETYPE object
CYLINDERNUMBER object
ENGINESIZE int64
FUELSYSTEM object
BORERATIO float64
STROKE float64
COMPRESSIONRATIO float64
HORSEPOWER int64
PEAKRPM int64
CITYMPG int64
HIGHWAYMPG int64
PRICE object
dtype: object
The datatype of PRICE is converted to string. To change it back to float:
data["PRICE"] = data["PRICE"].astype(float)
data.dtypes
CAR_ID int64
SYMBOLING int64
CARNAME object
FUELTYPE object
ASPIRATION object
DOORNUMBER object
CARBODY object
DRIVEWHEEL object
ENGINELOCATION object
WHEELBASE float64
CARLENGTH float64
CARWIDTH float64
CARHEIGHT float64
CURBWEIGHT int64
ENGINETYPE object
CYLINDERNUMBER object
ENGINESIZE int64
FUELSYSTEM object
BORERATIO float64
STROKE float64
COMPRESSIONRATIO float64
HORSEPOWER int64
PEAKRPM int64
CITYMPG int64
HIGHWAYMPG int64
PRICE float64
dtype: object
The datatype of "PRICE" has been successfully changed back to float64.
Sometimes we have to remove rows that contain invalid values. We have a column named "DRIVEWHEEL". We know there are three valid drivetrain types: rear-wheel drive ("rwd"), front-wheel drive ("fwd"), and four-wheel drive ("4wd"). So we know the set of valid values for this column.
v_conditions = ["rwd", "fwd", "4wd"]
data = data[data["DRIVEWHEEL"].isin(v_conditions)]
data.shape
(205, 26)
The shape of the dataset is unchanged, so every row already holds a valid value.
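Before dropping rows, it can help to inspect the invalid ones first; negating `isin()` with `~` surfaces them. A toy frame illustrates this ("awd" stands in for an unexpected label, not a value from the real dataset):

```python
import pandas as pd

# Toy column with one label outside the valid set (hypothetical data).
toy = pd.DataFrame({"DRIVEWHEEL": ["rwd", "fwd", "awd", "4wd"]})
v_conditions = ["rwd", "fwd", "4wd"]

# ~ inverts the boolean mask, selecting rows that fail the check.
invalid = toy[~toy["DRIVEWHEEL"].isin(v_conditions)]
print(invalid["DRIVEWHEEL"].tolist())  # ['awd']
```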
- Merging and joining data:
This can involve combining data from multiple sources, such as different tables or files. In pandas, you can use the concat() function to merge multiple dataframes together. You can specify the axis (0 for rows, 1 for columns) and the dataframes to be concatenated. For example, to merge two data frames df1 and df2 on rows:
merge_df = pd.concat([df1, df2], axis=0)
print(df1.shape)
print(df2.shape)
print(merge_df.shape)
(103, 26)
(102, 26)
(205, 26)
Both dataframes had the same number of columns, so pd.concat() simply stacked their rows.
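Concatenation with `axis=1` instead glues frames side by side, while `merge()` joins them on a key column, SQL-style. A sketch on toy frames (made-up rows, key column borrowed from the dataset):

```python
import pandas as pd

# Two toy frames sharing the car_ID key (hypothetical data).
left = pd.DataFrame({"car_ID": [1, 2], "CarName": ["audi", "bmw"]})
right = pd.DataFrame({"car_ID": [1, 2], "price": [13950.0, 16430.0]})

# axis=1 appends columns; both car_ID columns are kept.
side_by_side = pd.concat([left, right], axis=1)

# merge() joins on the key, which appears only once in the result.
joined = left.merge(right, on="car_ID")

print(side_by_side.shape)  # (2, 4)
print(joined.shape)        # (2, 3)
```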
- Save the cleaned dataset to a new CSV file:
To save the cleaned and wrangled data into a CSV file, we can use the following command:
data.to_csv("cleaned_data_cars.csv", index=False)
print("the data has been saved as CSV")
the data has been saved as CSV
Tools like pandas and NumPy make data wrangling efficient. It is important to note that data wrangling can be a time-consuming and iterative process, as you may need to go back and make changes to the data based on the results of your analysis.