Investigate, Normalize and Visualize Data
Once the data is successfully loaded into your table, the next step is to investigate, normalize and visualize it.
Load Data into DataFrame
To do so, open your notebook and, in a command cell, run the following query to load the data from the table into a DataFrame:
df=spark.sql("SELECT * from PetDB.WineQuality")
Create Correlation Matrix
A correlation matrix is a table of correlation coefficients between variables. Each cell in the table holds the correlation between two variables, and its value lies between -1 and 1. A correlation matrix is used to summarize data, as a diagnostic for advanced analyses, and as an input to further analysis. The two key components of a correlation coefficient are:
- Magnitude: the larger the magnitude, the stronger the correlation.
- Sign: if positive, the two variables tend to rise and fall together. If negative, the correlation is inverse: one variable tends to fall as the other rises.
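To make magnitude and sign concrete, here is a minimal pure-Python sketch of the Pearson coefficient, which is the default method used by Spark's Correlation.corr. The helper name pearson and the toy values are hypothetical and are not taken from the wine table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, chosen only for illustration
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [4.0, 5.0, 6.0, 7.0]        # rises as alcohol rises
density = [1.00, 0.99, 0.98, 0.97]    # falls as alcohol rises

r_pos = pearson(alcohol, quality)  # positive sign: same direction
r_neg = pearson(alcohol, density)  # negative sign: inverse correlation
print(r_pos, r_neg)
```

Both toy pairs are perfectly linear, so the magnitudes are at the maximum; real columns would normally land somewhere strictly between -1 and 1.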
To create a correlation matrix, run the following code:
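from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
# Name of the temporary vector column that will hold all input columns
vcol="test"
# Combine every column of df into a single vector column
assembler=VectorAssembler(inputCols=df.columns,outputCol=vcol)
vdf=assembler.transform(df).select(vcol)
# Compute the correlation matrix (Pearson by default)
matrix=Correlation.corr(vdf,vcol)
# Turn the matrix into a DataFrame labeled with the original column names
outputdf=spark.createDataFrame(matrix.collect()[0][0].toArray().tolist(),df.columns)
display(outputdf)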
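To assemble the input columns into a single features vector, run the following code. The slice df.columns[:-1] leaves out the last column, assumed here to be quality, so that it stays separate as the label: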
from pyspark.ml.feature import VectorAssembler
assembler=VectorAssembler(inputCols=df.columns[:-1],outputCol="features")
featuresdf=assembler.transform(df)
display(featuresdf)
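As a rough illustration of what this assembly step does to one row, the following plain-Python sketch mimics the column slice. The column names, their order, and the row values are hypothetical; the real table has more columns.

```python
# Hypothetical column order with the label last, as the slice [:-1] assumes
columns = ["alcohol", "pH", "quality"]
row = {"alcohol": 9.4, "pH": 3.51, "quality": 5}

# Every column except the last one becomes part of the feature vector
feature_cols = columns[:-1]
features = [float(row[c]) for c in feature_cols]

# The assembled row keeps the original columns and gains a "features" entry
assembled = {**row, "features": features}
print(assembled)
```

If the label were not the last column, the slice would silently place it among the features, so it is worth checking df.columns before relying on it.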
Normalize Data
To normalize the data in the features column, run the following code:
from pyspark.ml.feature import StandardScaler
featuredf=featuresdf.select("features","quality")
s=StandardScaler().setInputCol("features").setOutputCol("normalized")
display(s.fit(featuredf).transform(featuredf))
Store the normalized data in another DataFrame object:
sdf=s.fit(featuredf).transform(featuredf)
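With its default settings (withStd=True, withMean=False), StandardScaler divides each feature by its sample standard deviation and does not subtract the mean. The sketch below reproduces that arithmetic for a single feature in plain Python; the function name scale_to_unit_std and the values are hypothetical.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (n - 1 denominator)."""
    std = statistics.stdev(values)
    return [v / std for v in values]

# Hypothetical values for one feature
ph = [3.0, 3.2, 3.4, 3.6]
scaled = scale_to_unit_std(ph)

# The spread becomes 1, but the values are not centered around 0
print(statistics.stdev(scaled), statistics.mean(scaled))
```

If zero-centered features are needed, the scaler can be configured with setWithMean(True), which produces dense vectors as output.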
Moving Data into Table
Finally, save the normalized data as a table using the following code:
sdf.write.saveAsTable("NormalizedWineData")