Steps Involved in Data Analysis Using Pandas
Data analysis is not a single-step process. It requires a sequence of well-defined steps to ensure everything is heading in the right direction. Before we discuss those steps in detail, let's first look at the dataset itself.
Dataset Used in Project
Have you ever been on a roller coaster ride? Most of us find them fascinating. Today we are going to explore an interesting dataset about roller coasters at several locations in the United States. The dataset has details of 1087 roller coasters, each with a number of attributes, including speed, opening date, location, and track length. The data is available online on Kaggle, and anyone can access and work with it.
We will work on this dataset, preprocess it, and then try to gain valuable insights from it.
The steps we will take to ensure a successful analysis of this dataset are:
Step 0: Downloading the Dataset and Importing it into the System
This step is labeled step 0 because it is standard for every data analysis operation: the data must be present in the system before we can work on it.
As mentioned earlier, the dataset is available on Kaggle. To download any dataset from Kaggle manually, go to the website, sign in to your account, and download it. Once you have downloaded the data, you can bring it into Jupyter Notebook or Google Colab by simply dragging and dropping the file.
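For example, in Colab the manual route might look like the following sketch (illustrative only; it assumes you downloaded coaster_db.csv from the Kaggle page and that the file keeps that name after uploading):
from google.colab import files
import pandas as pd

uploaded = files.upload()                 # opens a file picker so you can choose the downloaded CSV
dataset = pd.read_csv('coaster_db.csv')   # assumes the uploaded file is named coaster_db.csv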
Another method is to download the dataset directly into the system using Python's opendatasets module. For that, you first have to install and import opendatasets. Installing it is simple; just run:
!pip install opendatasets
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (7.0.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.8/dist-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (4.0.0)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22
The dependencies were already present on my system, so pip reports that those requirements are already satisfied. If it's your first time installing, it may take a little longer, and pip will do the rest for you. Now import opendatasets by running the following command:
import opendatasets as od
Now opendatasets is ready to download the dataset for you; you can access it through the alias 'od'. We will download the dataset directly from Kaggle into our Colab runtime environment. To do so, run:
dataset_url = 'https://www.kaggle.com/datasets/robikscube/rollercoaster-database'
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: *****
Your Kaggle Key: ··········
Downloading rollercoaster-database.zip to ./rollercoaster-database
100%|██████████| 984k/984k [00:00<00:00, 90.8MB/s]
You have to provide your Kaggle username and API key to download the dataset. Both can be found in the Account section of your Kaggle profile.
Now that the dataset has been downloaded into the system, you have to load it with pandas so that you can work with it. Reading a dataset into a pandas DataFrame is quite easy. Just run the following commands:
import pandas as pd
dataset = pd.read_csv('/content/rollercoaster-database/coaster_db.csv')
dataset
| index | coaster_name | Length | Speed | Location | Status | Opening date |
|---|---|---|---|---|---|---|
| 0 | Switchback Railway | 600 ft (180 m) | 6 mph (9.7 km/h) | Coney Island | Removed | June 16, 1884 |
| 1 | Flip Flap Railway | NaN | NaN | Sea Lion Park | Removed | 1895 |
| 2 | Switchback Railway (Euclid Beach Park) | NaN | NaN | Cleveland, Ohio, United States | Closed | NaN |
| 3 | Loop the Loop (Coney Island) | NaN | NaN | Other | Removed | 1901 |
| 4 | Loop the Loop (Young’s Pier) | NaN | NaN | Other | Removed | 1901 |
Now your dataset is loaded and you are good to go.
Step 1: Data Cleaning
Data cleaning is an integral part of any data analysis assignment. You cannot perform accurate analysis unless your data is cleaned and in perfect shape. It is always advised to have some understanding of the data and perform cleaning before proceeding to manipulation and analysis.
Before cleaning the data, you must be aware of the structure of the dataset: how many rows and columns there are, and a few other characteristics we will look at in a moment.
These are the important steps to take into account while cleaning your dataset.
- Data Profiling with Pandas:
Being aware of the dataset is the most basic step. Let's take an example: if you wanted to conquer a palace in your opponent's empire, what would your first step be? You would do your homework and get to know your opponent before attacking: how many soldiers they have, what equipment they use, and what other military assets they own. Only when you are aware of all these things can you make a strategy to defeat them. Likewise, if we want to conquer our dataset and extract interesting insights from it, we must first take the following steps to build an understanding of the data.
- Knowing the Shape of the dataset:
Knowing the dataset's shape is important as it tells how many rows and columns are present. Think of rows as unique individual records, and columns as attributes related to each record.
To check the shape of your dataset, write:
dataset.shape
(1087, 56)
The output comes in the form (rows, columns). So, our dataset contains 1087 rows and 56 columns, meaning 1087 individual records with 56 attributes associated with each of them. Look at how much information the shape attribute provides.
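If you want the two counts as separate variables, the tuple can simply be unpacked (a small illustrative snippet, not part of the original walkthrough):
rows, cols = dataset.shape                               # unpack the (rows, columns) tuple
print(f'{rows} records with {cols} attributes each')     # 1087 records with 56 attributes each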
- Viewing Head and Tail of the Dataset:
Our dataset has 1087 rows; we cannot view all of them at once. Instead, we can have a look at its head and tail. The head method shows the first 5 rows by default, and the tail method shows the last 5.
dataset.head()
It returns the first 5 rows of the dataset. If you want a specific number of rows, you can pass it as dataset.head(n).
Likewise, the tail method returns the last 5 rows.
dataset.tail()
These two methods help us understand what our data looks like and whether there are missing values we need to take care of.
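As a quick illustration of passing n yourself (the numbers here are arbitrary; run each line in its own cell to see its output):
dataset.head(3)    # first 3 rows
dataset.tail(10)   # last 10 rows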
- Knowing the Columns and their datatypes:
For a successful analysis, it is important to know which datatypes your columns contain. Are any columns stored with a misleading datatype? If so, you will have to convert or drop them, which we will see in a while. First, let's have a look at the columns themselves; we know there are 56 of them.
dataset.columns
Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date', 'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height','Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section','Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle','Drop', 'Soft opening date', 'Fast Lane available', 'Replaced','Track layout', 'Fastrack available', 'Soft opening date.1','Closing date', 'Opened', 'Replaced by', 'Website','Flash Pass Available', 'Must transfer from wheelchair', 'Theme','Single rider line available', 'Restraint Style','Flash Pass available', 'Acceleration', 'Restraints', 'Name','year_introduced', 'latitude', 'longitude', 'Type_Main','opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit','speed_mph', 'height_value', 'height_unit', 'height_ft','Inversions_clean', 'Gforce_clean'],dtype='object')
Here is the list of all the columns in the dataset. Let's see what their datatypes are.
dataset.dtypes
coaster_name object
Length object
Speed object
Location object
Status object
Opening date object
Type object
Manufacturer object
Height restriction object
Model object
Height object
Inversions float64
Lift/launch system object
Cost object
Trains object
Park section object
Duration object
Capacity object
G-force object
Designer object
Max vertical angle object
Drop object
Soft opening date object
Fast Lane available object
Replaced object
Track layout object
Fastrack available object
Soft opening date.1 object
Closing date object
Opened object
Replaced by object
Website object
Flash Pass Available object
Must transfer from wheelchair object
Theme object
Single rider line available object
Restraint Style object
Flash Pass available object
Acceleration object
Restraints object
Name object
year_introduced int64
latitude float64
longitude float64
Type_Main object
opening_date_clean object
speed1 object
speed2 object
speed1_value float64
speed1_unit object
speed_mph float64
height_value float64
height_unit object
height_ft float64
Inversions_clean int64
Gforce_clean float64
dtype: object
The object dtype generally indicates strings. We can see there are multiple datatypes: some columns are objects, others are floats or integers. But some columns have the wrong datatype for analysis. Opening date should not be an object; it should be datetime. Height, speed, and length should be numeric. We will deal with them later on.
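As a preview of what such conversions could look like (a sketch only, not the approach we take below: the 'preview' copy and the 'Speed_value' column are made-up names, and it assumes the raw Speed strings look like '6 mph (9.7 km/h)'):
preview = dataset.copy()                                                              # throwaway copy so the original stays untouched
preview['Opening date'] = pd.to_datetime(preview['Opening date'], errors='coerce')    # unparseable dates become NaT
preview['Speed_value'] = preview['Speed'].str.extract(r'([\d.]+)\s*mph', expand=False).astype(float)   # pull the mph number out of the string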
I believe by this point, we might be comfortable enough to start conquering our dataset.
- Data Cleaning:
Data cleaning can be an enjoyable task: you perform several operations to extract the part of a larger dataset that is useful to you. While cleaning, keep these goals in mind:
- Your data must not contain irrelevant columns
- Column names must follow one standard; if one starts with an uppercase letter, all of them should.
- Duplicates must be dealt with, because duplicated data distorts the analysis.
- You must start preparing for the next step which is exploratory data analysis.
How to Clean your data using Pandas?
- Pick columns that you are interested in
Choose only those columns that you are interested in analyzing. The rest of them must be ignored.
To get started, let's first see what our columns are:
dataset.columns
Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date', 'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height','Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section','Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle','Drop', 'Soft opening date', 'Fast Lane available', 'Replaced','Track layout', 'Fastrack available', 'Soft opening date.1','Closing date', 'Opened', 'Replaced by', 'Website','Flash Pass Available', 'Must transfer from wheelchair', 'Theme','Single rider line available', 'Restraint Style','Flash Pass available', 'Acceleration', 'Restraints', 'Name','year_introduced', 'latitude', 'longitude', 'Type_Main','opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit','speed_mph', 'height_value', 'height_unit', 'height_ft','Inversions_clean', 'Gforce_clean'],dtype='object')
You can see a list of columns here. Now if you want to drop a single column, you can do so by:
a=dataset.drop(['Speed'], axis=1)
a.columns
Index(['coaster_name', 'Length', 'Location', 'Status', 'Opening date', 'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height','Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section','Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle','Drop', 'Soft opening date', 'Fast Lane available', 'Replaced','Track layout', 'Fastrack available', 'Soft opening date.1','Closing date', 'Opened', 'Replaced by', 'Website','Flash Pass Available', 'Must transfer from wheelchair', 'Theme','Single rider line available', 'Restraint Style','Flash Pass available', 'Acceleration', 'Restraints', 'Name','year_introduced', 'latitude', 'longitude', 'Type_Main','opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit','speed_mph', 'height_value', 'height_unit', 'height_ft','Inversions_clean', 'Gforce_clean'],dtype='object')
axis=1 indicates that you want to remove a column rather than a row. Assigning the result to the variable a creates a new copy without modifying the original. You can drop columns one by one, or you can simply select the ones you are interested in and the rest will be left out:
df = dataset[['coaster_name',
'Location', 'Status','Manufacturer','year_introduced','latitude', 'longitude','Type_Main','opening_date_clean','speed_mph', 'height_ft','Inversions_clean', 'Gforce_clean']]
I am interested in just these 13 columns out of 56. The resulting table has 1087 rows and 13 columns. The first part of data cleaning is complete here: we have picked the columns we are interested in.
- Check for Ambiguous column names and data types.
Ambiguity in column names can cause problems later on, so let's fix that now. We will look at the column names again.
df.columns
Index(['coaster_name', 'Location', 'Status','Manufacturer','year_introduced',
'latitude', 'longitude', 'Type_Main', 'opening_date_clean', 'speed_mph',
'height_ft', 'Inversions_clean', 'Gforce_clean'],
dtype='object')
You can see that some of the column names start with a lowercase letter and some with uppercase. Let's make them uniform; I prefer to capitalize all of them. Also avoid overly long column names. We can do this by calling the rename function.
df = df.rename(columns={'year_introduced':'Year_Introduced', 'opening_date_clean':'Opening_Date',
'speed_mph':'Speed_mph',
'coaster_name':'Coaster_Name',
'height_ft':'Height_ft',
'Inversions_clean':'Inversions',
'Gforce_clean':'Gforce'})
df.columns
Index(['Coaster_Name', 'Location', 'Status', 'Manufacturer', 'Year_Introduced',
'latitude', 'longitude', 'Type_Main', 'Opening_Date', 'Speed_mph',
'Height_ft', 'Inversions', 'Gforce'],
dtype='object')
We provided a Python dictionary with old names as keys and new names as values, separated by a colon. Now the names make sense and are all in the same format.
Now let’s see what our datatypes look like.
df.dtypes
Coaster_Name object
Location object
Status object
Manufacturer object
Year_Introduced int64
latitude float64
longitude float64
Type_Main object
Opening_Date object
Speed_mph float64
Height_ft float64
Inversions int64
Gforce float64
It seems all our datatypes are fine except Opening_Date, which is stored as a string (object). We should convert it to datetime so that it will be easier to build charts in the next step.
The Opening_Date column currently looks like this:
0 1884-06-16
1 1895-01-01
2 NaN
3 1901-01-01
4 1901-01-01
...
1082 NaN
1083 2022-01-01
1084 2016-06-16
1085 NaN
1086 2022-01-01
Name: Opening_Date, Length: 1087, dtype: object
To convert a column’s datatype, you need to follow this step:
df['Opening_Date'] = pd.to_datetime(df['Opening_Date'])
df['Opening_Date']
0 1884-06-16
1 1895-01-01
2 NaT
3 1901-01-01
4 1901-01-01
...
1082 NaT
1083 2022-01-01
1084 2016-06-16
1085 NaT
1086 2022-01-01
Name: Opening_Date, Length: 1087, dtype: datetime64[ns]
The string Opening_Date column has been converted to datetime. Now it will be easier to build charts in later phases.
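One small aside, in case your copy of the data contains dates pandas cannot parse (not something we hit here): passing errors='coerce' turns those entries into NaT instead of raising an error.
df['Opening_Date'] = pd.to_datetime(df['Opening_Date'], errors='coerce')   # bad date strings become NaT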
- Checking for Null Values:
You must check for null values in your dataset. Null values can be lethal and can lead to wrong analysis.
df.isna().head()
| index | Coaster_Name | Location | Status | Manufacturer | Year_Introduced | latitude | longitude | Type_Main | Opening_Date | Speed_mph | Height_ft | Inversions | Gforce |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False | False |
| 1 | False | False | False | False | False | False | False | False | False | True | True | False | False |
| 2 | False | False | False | True | False | False | False | False | True | True | True | False | True |
| 3 | False | False | False | False | False | False | False | False | False | True | True | False | True |
| 4 | False | False | False | False | False | False | False | False | False | True | True | False | True |
True indicates that a cell contains a null value, and False indicates that a value is present.
To check how many null values are present in each column, run:
df.isna().sum()
Coaster_Name 0
Location 0
Status 213
Manufacturer 59
Year_Introduced 0
latitude 275
longitude 275
Type_Main 0
Opening_Date 250
Speed_mph 150
Height_ft 916
Inversions 0
Gforce 725
dtype: int64
This shows exactly how many null values are present in each column of the dataset. Let's start by filling the null values in the numeric columns and leave the string columns for now. One common method is to fill a numeric column with its mean value; let's try that with Speed_mph.
mean=df['Speed_mph'].mean()
mean
48.617289220917826
This gives us the mean value of the speed column. Now we have to fill the null values with this mean value.
df['Speed_mph'].fillna(mean,inplace=True)
df.isna().sum()
Coaster_Name 0
Location 0
Status 213
Manufacturer 59
Year_Introduced 0
latitude 275
longitude 275
Type_Main 0
Opening_Date 250
Speed_mph 0
Height_ft 916
Inversions 0
Gforce 725
dtype: int64
Now you can see that Speed_mph has 0 null values; they have been filled with the mean value. It is important to pass inplace=True so that the DataFrame itself gets updated.
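If you prefer to avoid inplace=True (recent pandas versions may warn when it is used on a selected column), an equivalent sketch simply assigns the filled column back:
df['Speed_mph'] = df['Speed_mph'].fillna(mean)   # same effect as the inplace call above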
You can fill null values in string datatypes as well by doing so:
df["Manufacturer"].fillna("No Manufacturer", inplace = True)
Coaster_Name 0
Location 0
Status 213
Manufacturer 0
Year_Introduced 0
latitude 275
longitude 275
Type_Main 0
Opening_Date 250
Speed_mph 0
Height_ft 916
Inversions 0
Gforce 725
dtype: int64
Now you can see the null values in Manufacturer have been filled with 'No Manufacturer', so the column now has 0 null values.
- Check for duplications in Dataset:
Duplicated data makes analysis faulty, so we have to check for duplicated rows and remove them. A duplicate occurs when the key values of a row are repeated during data collection. In our case, if the same coaster name, latitude, and longitude occur more than once, the data is duplicated.
We will now check for any duplications in our data.
df.duplicated().sum()
0
This shows that we do not have any completely duplicated rows. There might still be partial duplications, though. To check for those based on a subset of columns, run:
df.duplicated(subset=['Coaster_Name'])
0 False
1 False
2 False
3 False
4 False
...
1082 True
1083 False
1084 True
1085 False
1086 False
Length: 1087, dtype: bool
You can see some True values, which suggests that some coaster names are duplicated. Let's dig into more detail.
df.query('Coaster_Name == "Blue Streak (Conneaut Lake)"')
| index | Coaster_Name | Location | Status | Manufacturer | Year_Introduced | latitude | longitude | Type_Main | Opening_Date | Speed_mph | Height_ft | Inversions | Gforce |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Blue Streak (Conneaut Lake) | Conneaut Lake Park | Closed | No Manufacturer | 1937 | 41.6349 | -80.318 | Wood | 1938-05-23 0:00:00 | 50 | NaN | 0 | NaN |
| 61 | Blue Streak (Conneaut Lake) | Conneaut Lake Park | Closed | No Manufacturer | 1938 | 41.6349 | -80.318 | Wood | 1938-05-23 0:00:00 | 50 | NaN | 0 | NaN |
We searched for one coaster name and found two entries that are nearly identical (only Year_Introduced differs). Now we need a method to remove such duplicates.
df.loc[df.duplicated(subset=['Coaster_Name','Location','latitude', 'longitude', 'Manufacturer'])]
This returns the 97 rows that are duplicated with respect to those columns. We have to eliminate them.
df = df.loc[~df.duplicated(subset=['Coaster_Name','Location','latitude', 'longitude', 'Manufacturer'])] \
.reset_index(drop=True)
df.shape
(990, 13)
In this code, we use .loc to keep only the rows that are not duplicated with respect to the indicated subset of columns, then reset the index with drop=True. We reset the index because otherwise it would have gaps where the removed rows used to be. The '~' sign means logical NOT in pandas.
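As a quick sanity check (using the same subset of key columns as above), the duplicate count should now be zero:
df.duplicated(subset=['Coaster_Name', 'Location', 'latitude', 'longitude', 'Manufacturer']).sum()   # expected output: 0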
In this way, we can precisely handle duplications. That ends data cleaning.