Steps Involved in Data Analysis Using Pandas

Data analysis is not a single-step process; it requires a step-by-step approach to make sure everything is heading in the right direction. Multiple steps are involved in data manipulation and analysis. Before we discuss those in detail, we will first look at the dataset itself.

Dataset Used in Project

Have you ever been on a roller coaster ride? It is a fascinating experience for most of us. Today we are going to explore an interesting dataset about roller coasters at several locations in the United States. The dataset has details of 1087 roller coasters, with attributes such as speed, manufacturing date, location, and length of the track. The data is available online on Kaggle, and everyone can access and work on it.

Today, we will work on this dataset, preprocess it, and then try to gain valuable insights from it.

The steps we will take to ensure a successful analysis of this dataset are:

Step 0: Downloading the Dataset and Importing It into the System

This step is called step 0 because it is standard for every data analysis task: the data must be present in the system before we can work on it.

As mentioned earlier, the dataset is available on Kaggle. To download any dataset from Kaggle, go to the website; you must have an account and be signed in. Once you have downloaded the data, you can import it into Jupyter Notebook or Google Colab by simply dragging and dropping the file.
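
If you are working in Google Colab, the upload step can also be done in code. This is a minimal sketch, assuming the Colab environment (the files module is only available there):

# Sketch: manual upload in Google Colab; opens a file picker in the browser.
from google.colab import files

uploaded = files.upload()
print(list(uploaded.keys()))   # names of the files you just uploaded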

Another method is to download the dataset directly into the system using Python's opendatasets module. For that, you first have to install and import opendatasets. Installing it is simple; just run:

!pip install opendatasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (7.0.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.8/dist-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (4.0.0)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22

The dependencies were already present on my system, so pip reports those requirements as already satisfied and only installs opendatasets itself. If this is your first time installing it, it may take a little while, and pip will do the rest for you. Now, to import opendatasets, run the following command:

import opendatasets as od

Now opendatasets is ready to download the dataset for you; you can access it through the od alias. We will download the dataset directly from Kaggle into our Colab runtime environment. To do so, run:

dataset_url = 'https://www.kaggle.com/datasets/robikscube/rollercoaster-database'
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: *****
Your Kaggle Key: ··········
Downloading rollercoaster-database.zip to ./rollercoaster-database
100%|██████████| 984k/984k [00:00<00:00, 90.8MB/s]

You have to provide your Kaggle username and key to download the dataset. Both can be found in the Account section of your Kaggle profile, where you can create an API token (a kaggle.json file that contains them).

Now that the dataset is downloaded onto the system, you have to load it with pandas so that you can work with it. Reading a CSV file into a pandas DataFrame is quite easy; just run the following commands:

import pandas as pd

dataset = pd.read_csv('/content/rollercoaster-database/coaster_db.csv')
dataset
   coaster_name                            Length          Speed             Location                        Status   Opening date
0  Switchback Railway                      600 ft (180 m)  6 mph (9.7 km/h)  Coney Island                    Removed  June 16, 1884
1  Flip Flap Railway                       NaN             NaN               Sea Lion Park                   Removed  1895
2  Switchback Railway (Euclid Beach Park)  NaN             NaN               Cleveland, Ohio, United States  Closed   NaN
3  Loop the Loop (Coney Island)            NaN             NaN               Other                           Removed  1901
4  Loop the Loop (Young's Pier)            NaN             NaN               Other                           Removed  1901

Now your dataset is loaded and you are good to go.

Step 1: Data Cleaning

Data cleaning is an integral part of any data analysis assignment. You cannot perform accurate analysis unless your data is cleaned and in perfect shape. It is always advised to have some understanding of the data and perform cleaning before proceeding to manipulation and analysis. 

Before cleaning the data, you must be aware of the structure of the dataset: how many rows and columns there are, and a few other characteristics that we will look at in a moment.

These are the important steps you must take into account while trying to clean your dataset.

Steps to perform data cleaning with Pandas
  1. Data Profiling with Pandas:

Being aware of the dataset is the basic step. Let's take an example: if you wanted to conquer a palace in your opponent's empire, what would your first step be? You would do your homework and learn about your opponent before attacking: how many soldiers they have, what equipment they use, and what other military assets they own. Only when you know all of this can you make a strategy to defeat them. The same goes for data: if we want to conquer our dataset and take interesting insights out of it, we must first take these steps to understand it.

  • Knowing the Shape of the dataset: 

Knowing the dataset's shape is important, as it tells you how many rows and columns are present. Think of rows as unique individual records and columns as attributes of each record.

To find the shape of your dataset, write:

dataset.shape

(1087, 56)

The output comes in the form (rows, columns). So, our dataset contains 1087 rows and 56 columns, meaning 1087 individual records with 56 attributes associated with each of them. Look at how much information the shape attribute gives us.
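
Since the result is an ordinary tuple, you can also unpack it when you only need one of the two numbers. A small sketch:

n_rows, n_cols = dataset.shape
print(f"{n_rows} coasters with {n_cols} attributes each")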

  • Viewing Head and Tail of the Dataset: 

Our dataset has 1087 rows; we cannot look at all of them. Instead, we can look at its head and tail. The head method shows the first 5 rows if no number is given; likewise, tail shows the last 5.

dataset.head()

It returns the first 5 rows of the dataset. If you want a specific number of rows from the head, you can specify it as dataset.head(n).

Likewise, the tail method returns the last 5 rows.

dataset.tail()

These two methods help us understand what our data looks like and whether there are missing values we need to take care of.
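
For example, a quick sketch of passing an explicit row count to both methods:

dataset.head(3)    # first 3 rows
dataset.tail(10)   # last 10 rows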

  • Knowing the Columns and their datatypes: 

To start a successful analysis, it is important to know which datatypes your columns hold. Are any of them stored with a misleading datatype? If so, you will have to fix or drop them, which we will see in a while. First, let's have a look at the columns themselves. We know there are 56 of them.

dataset.columns

Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date',
       'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height',
       'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
       'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
       'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
       'Track layout', 'Fastrack available', 'Soft opening date.1',
       'Closing date', 'Opened', 'Replaced by', 'Website',
       'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
       'Single rider line available', 'Restraint Style',
       'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
       'year_introduced', 'latitude', 'longitude', 'Type_Main',
       'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
       'speed_mph', 'height_value', 'height_unit', 'height_ft',
       'Inversions_clean', 'Gforce_clean'],
      dtype='object')

Here is the list of all the columns in the dataset. Let's see what their datatypes are.

dataset.dtypes
coaster_name                      object
Length                            object
Speed                             object
Location                          object
Status                            object
Opening date                      object
Type                              object
Manufacturer                      object
Height restriction                object
Model                             object
Height                            object
Inversions                       float64
Lift/launch system                object
Cost                              object
Trains                            object
Park section                      object
Duration                          object
Capacity                          object
G-force                           object
Designer                          object
Max vertical angle                object
Drop                              object
Soft opening date                 object
Fast Lane available               object
Replaced                          object
Track layout                      object
Fastrack available                object
Soft opening date.1               object
Closing date                      object
Opened                            object
Replaced by                       object
Website                           object
Flash Pass Available              object
Must transfer from wheelchair     object
Theme                             object
Single rider line available       object
Restraint Style                   object
Flash Pass available              object
Acceleration                      object
Restraints                        object
Name                              object
year_introduced                    int64
latitude                         float64
longitude                        float64
Type_Main                         object
opening_date_clean                object
speed1                            object
speed2                            object
speed1_value                     float64
speed1_unit                       object
speed_mph                        float64
height_value                     float64
height_unit                       object
height_ft                        float64
Inversions_clean                   int64
Gforce_clean                     float64
dtype: object

Object means a string datatype. We can see there are multiple datatypes: some columns are objects, some are floats. But some columns have the wrong datatype for analysis: Opening date should not be an object but rather datetime, and Height, Speed, and Length should be floats. We will deal with them later on.
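
We will convert the opening date with pandas' datetime tools later on. For a column like Speed, which stores strings such as '6 mph (9.7 km/h)', one possible approach is to pull the numeric part out with a regular expression. This is only a sketch for illustration; the dataset already ships a pre-parsed speed_mph column, which is what we will actually use:

# Sketch: extract the mph figure from strings like "6 mph (9.7 km/h)".
speed_numeric = (
    dataset['Speed']
    .str.extract(r'([\d.]+)\s*mph', expand=False)   # capture the number before "mph"
    .astype(float)
)
speed_numeric.head()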

I believe by this point, we might be comfortable enough to start conquering our dataset. 

  2. Data Cleaning:

Data cleaning is an enjoyable task: you perform a handful of operations to carve out the part of a larger dataset that is useful to you. While performing data cleaning, keep these goals in mind:

  • Your data must not contain irrelevant columns
  • The columns must be renamed according to one standard. If one is in uppercase, then all should go with uppercase. 
  • Duplications must be dealt with. Duplicated data makes our analysis wrong. 
  • You must start preparing for the next step which is exploratory data analysis.
How to Clean your data using Pandas?
  1. Pick columns that you are interested in

Choose only those columns that you are interested in analyzing. The rest of them must be ignored. 

To get started, let's first see what the columns are:

dataset.columns

Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date',
       'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height',
       'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
       'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
       'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
       'Track layout', 'Fastrack available', 'Soft opening date.1',
       'Closing date', 'Opened', 'Replaced by', 'Website',
       'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
       'Single rider line available', 'Restraint Style',
       'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
       'year_introduced', 'latitude', 'longitude', 'Type_Main',
       'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
       'speed_mph', 'height_value', 'height_unit', 'height_ft',
       'Inversions_clean', 'Gforce_clean'],
      dtype='object')

You can see a list of columns here. Now if you want to drop a single column, you can do so by:

a = dataset.drop(['Speed'], axis=1)
a.columns

Index(['coaster_name', 'Length', 'Location', 'Status', 'Opening date', 'Type',
       'Manufacturer', 'Height restriction', 'Model', 'Height', 'Inversions',
       'Lift/launch system', 'Cost', 'Trains', 'Park section', 'Duration',
       'Capacity', 'G-force', 'Designer', 'Max vertical angle', 'Drop',
       'Soft opening date', 'Fast Lane available', 'Replaced', 'Track layout',
       'Fastrack available', 'Soft opening date.1', 'Closing date', 'Opened',
       'Replaced by', 'Website', 'Flash Pass Available',
       'Must transfer from wheelchair', 'Theme', 'Single rider line available',
       'Restraint Style', 'Flash Pass available', 'Acceleration', 'Restraints',
       'Name', 'year_introduced', 'latitude', 'longitude', 'Type_Main',
       'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
       'speed_mph', 'height_value', 'height_unit', 'height_ft',
       'Inversions_clean', 'Gforce_clean'],
      dtype='object')

axis=1 indicates that you want to drop a column (axis=0 would drop rows). Assigning the result to the new variable a means you are making a new copy; the original dataset is unchanged. You can drop columns one by one, or you can simply select the ones you are interested in and the rest will be left out.
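
To drop several columns in one call, pass a list of names; a quick sketch using columns from this dataset:

# Dropping more than one column at a time; the result is again a new DataFrame.
b = dataset.drop(['Speed', 'Length', 'Cost'], axis=1)
b.shape   # (1087, 53)

Coming back to our cleaning, here is the column selection we will actually keep working with: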

df = dataset[['coaster_name', 'Location', 'Status', 'Manufacturer', 'year_introduced',
              'latitude', 'longitude', 'Type_Main', 'opening_date_clean',
              'speed_mph', 'height_ft', 'Inversions_clean', 'Gforce_clean']]

I am just interested in these 13 columns out of the 56. The resulting DataFrame has 1087 rows and 13 columns. The first part of data cleaning is complete here: we have picked the columns we are interested in.

  2. Check for Ambiguous Column Names and Datatypes

Ambiguity in column names can cause problems later on, so let's fix it now. We will look again at what our columns are called.

df.columns

Index(['coaster_name', 'Location', 'Status', 'Manufacturer', 'year_introduced',
'latitude', 'longitude', 'Type_Main', 'opening_date_clean', 'speed_mph',
'height_ft', 'Inversions_clean', 'Gforce_clean'],
dtype='object')

You can see that some of the column names start with lower case and some with upper case. Let's make them uniform; I will capitalize each word, and keep the names from getting too long. We can do this by calling the rename method.

df = df.rename(columns={'year_introduced': 'Year_Introduced',
                        'opening_date_clean': 'Opening_Date',
                        'speed_mph': 'Speed_mph',
                        'coaster_name': 'Coaster_Name',
                        'height_ft': 'Height_ft',
                        'Inversions_clean': 'Inversions',
                        'Gforce_clean': 'Gforce'})
df.columns

Index(['Coaster_Name', 'Location', 'Status', 'Manufacturer', 'Year_Introduced',
'latitude', 'longitude', 'Type_Main', 'Opening_Date', 'Speed_mph',
'Height_ft', 'Inversions', 'Gforce'],
dtype='object')

We provided a Python dictionary whose keys are the old names and whose values are the new names. Now the names make sense and all follow the same format.
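
If you would rather not type out every mapping, rename also accepts a function that is applied to each column name. A sketch (note that str.title would produce 'Speed_Mph' rather than 'Speed_mph', so the explicit dictionary above gives finer control):

# Sketch: apply str.title to every column name instead of listing them one by one.
df_titled = df.rename(columns=str.title)
df_titled.columns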

Now let’s see what our datatypes look like.

df.dtypes

Coaster_Name        object
Location            object
Status              object
Manufacturer        object
Year_Introduced      int64
latitude           float64
longitude          float64
Type_Main           object
Opening_Date        object
Speed_mph          float64
Height_ft          float64
Inversions           int64
Gforce             float64

It seems all our datatypes are fine except Opening_Date, which is still a string (object). We should convert it to datetime so that it will be easier to create charts in the next step.

The Opening_Date column currently looks like this:

0       1884-06-16
1       1895-01-01
2              NaN
3       1901-01-01
4       1901-01-01
          ...   
1082           NaN
1083    2022-01-01
1084    2016-06-16
1085           NaN
1086    2022-01-01
Name: Opening_Date, Length: 1087, dtype: object

To convert a column’s datatype, you need to follow this step:

df['Opening_Date'] = pd.to_datetime(df['Opening_Date'])
df['Opening_Date']

0      1884-06-16
1      1895-01-01
2             NaT
3      1901-01-01
4      1901-01-01
          ...   
1082          NaT
1083   2022-01-01
1084   2016-06-16
1085          NaT
1086   2022-01-01
Name: Opening_Date, Length: 1087, dtype: datetime64[ns]

The Opening_Date column, previously strings, has now been converted to datetime; note that the missing values now show up as NaT (Not a Time) instead of NaN. This will make it easier to build charts in the next phases.
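
If some of the date strings were malformed, to_datetime would raise an error by default. You can tell it to turn unparseable values into NaT instead; a sketch:

# Sketch: coerce unparseable dates to NaT instead of raising an error.
df['Opening_Date'] = pd.to_datetime(df['Opening_Date'], errors='coerce')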

  3. Checking for Null Values:

You must check for null values in your dataset. Null values can be lethal and can lead to wrong analysis.

df.isna().head()
   Coaster_Name  Location  Status  Manufacturer  Year_Introduced  latitude  longitude  Type_Main  Opening_Date  Speed_mph  Height_ft  Inversions  Gforce
0  False         False     False   False         False            False     False      False      False         False      True       False       False
1  False         False     False   False         False            False     False      False      False         True       True       False       False
2  False         False     False   True          False            False     False      False      True          True       True       False       True
3  False         False     False   False         False            False     False      False      False         True       True       False       True
4  False         False     False   False         False            False     False      False      False         True       True       False       True

True indicates that the value in that cell is missing, and False indicates that a value is present.
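
A related sketch: to pull out only the rows that contain at least one missing value, combine isna with any along the columns:

# Sketch: show only the rows that have at least one null value.
rows_with_nulls = df[df.isna().any(axis=1)]
rows_with_nulls.head()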

To check how many null values are present in each column, just run:

df.isna().sum()

Coaster_Name         0
Location             0
Status             213
Manufacturer        59
Year_Introduced      0
latitude           275
longitude          275
Type_Main            0
Opening_Date       250
Speed_mph          150
Height_ft          916
Inversions           0
Gforce             725
dtype: int64

It shows exactly how many null values are present in each column of the dataset. We should fill the null values in numeric columns first and leave the string columns for a moment. Let's fill the Speed_mph column; one common method is to fill the missing entries with the column's mean.

mean = df['Speed_mph'].mean()
mean

48.617289220917826

This gives us the mean value of the speed column. Now we have to fill the null values with this mean value.

df['Speed_mph'].fillna(mean, inplace=True)
df.isna().sum()

Coaster_Name         0
Location             0
Status             213
Manufacturer        59
Year_Introduced      0
latitude           275
longitude          275
Type_Main            0
Opening_Date       250
Speed_mph            0
Height_ft          916
Inversions           0
Gforce             725
dtype: int64

Now you can see that Speed_mph has 0 null values; the missing entries have been filled with the mean. It is important to pass inplace=True so that the DataFrame itself gets updated (otherwise fillna returns a new Series and leaves df unchanged).
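
The mean is only one choice. Here is a sketch of filling with the median instead, which is less sensitive to extreme values (shown on Height_ft purely as an illustration and not applied to df, so Height_ft keeps its nulls in the outputs below):

# Sketch: the median is more robust to outliers than the mean.
height_filled = df['Height_ft'].fillna(df['Height_ft'].median())
height_filled.isna().sum()   # 0 after filling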

You can fill null values in string columns as well, and then check the counts again:

df["Manufacturer"].fillna("No Manufacturer", inplace = True)

Coaster_Name         0
Location             0
Status             213
Manufacturer         0
Year_Introduced      0
latitude           275
longitude          275
Type_Main            0
Opening_Date       250
Speed_mph            0
Height_ft          916
Inversions           0
Gforce             725
dtype: int64

Now you can see the null values in Manufacturer have been filled with 'No Manufacturer', and the column has 0 null values.

  4. Check for Duplications in the Dataset:

Duplicated data makes analysis faulty, so we have to check whether there are duplicated rows and deal with them. Duplicates are rows whose key values were recorded more than once while the data was being collected. In our case, if the same coaster name, latitude, and longitude occur more than once, the data is duplicated.

We will now check for any duplications in our data.

df.duplicated().sum()

0

It shows that we do not have any rows that are completely duplicated. There might be partial duplicates as well, where only the key columns repeat. To check for those, pass a subset of columns:

df.duplicated(subset=['Coaster_Name'])

0       False
1       False
2       False
3       False
4       False
        ... 
1082     True
1083    False
1084     True
1085    False
1086    False
Length: 1087, dtype: bool

You can see some True values there, which suggests that some coaster names are duplicated. Let's dig into one of them in more detail.

df.query('Coaster_Name == "Blue Streak (Conneaut Lake)"')
    Coaster_Name                 Location            Status  Manufacturer     Year_Introduced  latitude  longitude  Type_Main  Opening_Date        Speed_mph  Height_ft  Inversions  Gforce
59  Blue Streak (Conneaut Lake)  Conneaut Lake Park  Closed  No Manufacturer  1937             41.6349   -80.318    Wood       1938-05-23 0:00:00  50         NaN        0           NaN
61  Blue Streak (Conneaut Lake)  Conneaut Lake Park  Closed  No Manufacturer  1938             41.6349   -80.318    Wood       1938-05-23 0:00:00  50         NaN        0           NaN

We searched for one coaster name and found two entries that are nearly identical; they differ only in Year_Introduced. Now we need a way to remove such duplicates.

df.loc[df.duplicated(subset=['Coaster_Name','Location','latitude', 'longitude', 'Manufacturer'])]

This returns the rows that are duplicates of earlier ones, 97 in total. We have to eliminate them.

df = df.loc[~df.duplicated(subset=['Coaster_Name','Location','latitude', 'longitude', 'Manufacturer'])] \
    .reset_index(drop=True)
df.shape

(990, 13)

In this code, we keep only the rows that are not duplicated with respect to the indicated subset of columns, and then reset the index, passing drop=True. We reset the index because otherwise it would keep gaps where the dropped rows used to be. The '~' sign is the logical NOT operator in Python/pandas.
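
An equivalent, and arguably more direct, way to express the same operation is pandas' built-in drop_duplicates, which keeps the first occurrence of each key combination by default; a sketch:

# Sketch: drop_duplicates expresses the same idea without the ~ mask.
df = df.drop_duplicates(
    subset=['Coaster_Name', 'Location', 'latitude', 'longitude', 'Manufacturer']
).reset_index(drop=True)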

In this way, we can precisely handle duplications. That ends data cleaning. 
