Cleaning the Data
- Read
- Discuss
In this article we will focus on cleaning the data which will include removing the null values from the dataset and identifying and removing the outliers(extreme values).
Firstly we will count the number of records in the table using the following query:
df=spark.sql("Select * from winedata")
df.count()
Output
You can see that there are a total 1600 records.
Remove Null values
To remove the null values we will use the drop() function of the dataframe object, which itself will find and remove all the null values.
%sql
select * from winedata where (quality is Null) ;
df=df.na.drop()
After executing the above query, again count the number of records
df.count()
Output
You will see that now the count is 1599 which means there was one row having null value and that row has been removed from the dataset.
Identifying and Removing Outliers
In order to identify the outliers, we will display all the records for quality and the count of the number of records for each quality and from that we will identify the outliers using visualization.
Use following code:
display(df.groupBy("quality").count().orderBy("quality"))
Output
Visualization, in the form of bar graph, will look as follows
From here you can clearly see that 100 value for quantity is the outlier
For removing the outliers, we will have to calculate the interquartile rank which actually is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data.
And for calculating interquartile rank we will first have to calculate the Approximate Quantile which is a method to read a stream of values in a continuous fashion and can, at any time, be queried about the approximate value of a specified quantile.
For calculating Approximate Quantile use following code:
aq=df.approxQuantile("quality",[0.25,0.75],0)
print(aq)
Output
Here the lower value is 5.0 and higher value is 6.0.
Next, we will use this approxQuantile to calculate the interquartile rank as shown in the following code snippet:
iqr=aq[1]-aq[0]
print(iqr)
Output
Here the interquartile rank is 1.0
Now we will have to find the highest and lowest value between which the data should be.
Lowest value will be calculated by subtracting the low value of approxQuantile by the interquartile rank*2.
Highest value will be calculated by adding the high value of approxQuantile with the interquartile rank*2.
lowval=aq[0]-(iqr*2)
highval=aq[1]+(iqr*2)
print(lowval)
print(highval)
Output
Here the range is between 3 and 8, so we will filter the dataframe to contain data that is only within this range. For that use the following code:
df=df.filter((df.quality>=lowval) & (df.quality<=highval))
Once again we will count the number of records in the dataframe
df.count()
Output
Now, you can see that our records are reduced from 1599 to 1598 which means that one row having extreme value for quantity has been removed.
In Order to re-use this, we can store it in a new table
df.write.saveAsTable("cleanedwinedata")
Leave a Reply
You must be logged in to post a comment.