Cleaning the Data

  • Read
  • Discuss

In this article we will focus on cleaning the data which will include removing the null values from the dataset and identifying and removing the outliers(extreme values).

Firstly we will count the number of records in the table using the following query:

df=spark.sql("Select * from winedata")
df.count()

Output

You can see that there are a total 1600 records.

Remove Null values

To remove the null values we will use the drop() function of the dataframe object, which itself will find and remove all the null values.

%sql
select * from winedata where (quality is Null) ;
df=df.na.drop()

After executing the above query, again count the number of records

df.count()

Output

You will see that now the count is 1599 which means there was one row having null value and that row has been removed from the dataset.

Identifying and Removing Outliers

In order to identify the outliers, we will display all the records for quality and the count of the number of records for each quality and from that we will identify the outliers using visualization.

Use following code:

display(df.groupBy("quality").count().orderBy("quality"))

Output

Visualization, in the form of bar graph, will look as follows 

From here you can clearly see that 100 value for quantity is the outlier

For removing the outliers, we will have to calculate the interquartile rank which actually is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data.

And for calculating interquartile rank we will first have to calculate the Approximate Quantile which is a method to read a stream of values in a continuous fashion and can, at any time, be queried about the approximate value of a specified quantile.

For calculating Approximate Quantile use following code:

aq=df.approxQuantile("quality",[0.25,0.75],0)
print(aq)

Output

Here the lower value is 5.0 and higher value is 6.0.

Next, we will use this approxQuantile to calculate the interquartile rank as shown in the following code snippet:

iqr=aq[1]-aq[0]
print(iqr)

Output

Here the interquartile rank is 1.0

Now we will have to find the highest and lowest value between which the data should be.

Lowest value will be calculated by subtracting the low value of approxQuantile by the interquartile rank*2.

Highest value will be calculated by adding the high value of approxQuantile with the interquartile rank*2.

lowval=aq[0]-(iqr*2)
highval=aq[1]+(iqr*2)
print(lowval)
print(highval)

Output

Here the range is between 3 and 8, so we will filter the dataframe to contain data that is only within this range. For that use the following code:

df=df.filter((df.quality>=lowval) & (df.quality<=highval))

Once again we will count the number of records in the dataframe

df.count()

Output

Now, you can see that our records are reduced from 1599 to 1598 which means that one row having extreme value for quantity has been removed.

In Order to re-use this, we can store it in a new table

df.write.saveAsTable("cleanedwinedata")

Leave a Reply

Leave a Reply

Scroll to Top