# Investigate, Normalize and Visualize Data


Once the data has been successfully loaded into your table, the next step is to investigate, normalize, and visualize it.

To do so, open your notebook and, in a command cell, run the following to load the data from the table into a DataFrame:

```python
df = spark.sql("SELECT * FROM PetDB.WineQuality")
```

## Create Correlation Matrix

A correlation matrix is a table of correlation coefficients between variables. Each cell represents the correlation between two variables, with a value between -1 and 1. A correlation matrix is used to summarize data, as a diagnostic before advanced analyses, and as an input into a more advanced analysis. The two key components of a correlation coefficient are:

• Magnitude: the larger the magnitude, the stronger the correlation.
• Sign: a positive sign indicates a direct correlation; a negative sign indicates an inverse correlation.
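To make magnitude and sign concrete, here is a minimal pure-Python sketch of the Pearson correlation coefficient (the statistic `Correlation.corr` computes by default). The sample values below are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]    # made-up values
quality = [5, 5, 6, 6, 7]                 # rises with alcohol -> positive r
acidity = [0.70, 0.66, 0.58, 0.50, 0.40]  # falls with alcohol -> negative r

print(pearson(alcohol, quality))  # positive: direct correlation
print(pearson(alcohol, acidity))  # negative: inverse correlation
```

A coefficient near +1 or -1 means the points lie close to a straight line; a coefficient near 0 means little linear relationship.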

To create a correlation matrix, run the following:

```python
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# Assemble all columns into a single vector column, as Correlation.corr requires
vcol = "corr_features"
assembler = VectorAssembler(inputCols=df.columns, outputCol=vcol)
vdf = assembler.transform(df).select(vcol)

# Compute the Pearson correlation matrix and convert it back into a DataFrame
matrix = Correlation.corr(vdf, vcol)
outputdf = spark.createDataFrame(matrix.collect()[0][0].toArray().tolist(), df.columns)
display(outputdf)
```

Output:
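A matrix of this size is easier to scan when its cells are flattened into variable pairs ranked by absolute correlation. The sketch below uses a made-up 3x3 matrix and column names purely for illustration; in the notebook you would substitute the values from `outputdf`:

```python
# Toy symmetric correlation matrix (ones on the diagonal); values are illustrative
cols = ["alcohol", "pH", "quality"]
corr = [
    [1.00, -0.20, 0.48],
    [-0.20, 1.00, -0.06],
    [0.48, -0.06, 1.00],
]

# Collect each unique off-diagonal pair and sort by correlation strength
pairs = [
    (cols[i], cols[j], corr[i][j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
]
pairs.sort(key=lambda p: abs(p[2]), reverse=True)

for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

The strongest pairs (by magnitude, regardless of sign) are usually the most promising features to investigate against the label.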

To assemble the input columns into a single features vector, run the following:

```python
from pyspark.ml.feature import VectorAssembler

# Combine every column except the label ("quality", the last column) into one vector
assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol="features")
featuresdf = assembler.transform(df)
display(featuresdf)
```

Output:

## Normalize Data

To normalize the data based on the assembled features, run the following:

```python
from pyspark.ml.feature import StandardScaler

featuredf = featuresdf.select("features", "quality")
# By default, StandardScaler scales each feature to unit standard deviation
s = StandardScaler().setInputCol("features").setOutputCol("normalized")
display(s.fit(featuredf).transform(featuredf))
```

Store the normalized data in another DataFrame:

```python
sdf = s.fit(featuredf).transform(featuredf)
```
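Under its defaults (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation without mean-centering, so every scaled feature ends up with unit variance. A minimal pure-Python sketch of that computation on a made-up column:

```python
import math

def scale_to_unit_std(values):
    """Divide each value by the column's unbiased sample standard deviation,
    mirroring StandardScaler's default behaviour (no mean-centering)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [v / std for v in values]

fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4]  # made-up sample
scaled = scale_to_unit_std(fixed_acidity)

# After scaling, the sample variance of the column is 1
mean_s = sum(scaled) / len(scaled)
var_s = sum((v - mean_s) ** 2 for v in scaled) / (len(scaled) - 1)
print(var_s)  # ~1.0
```

Putting all features on a common scale prevents columns with large numeric ranges from dominating distance- or gradient-based models.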

## Moving Data into a Table

Then save the normalized data into a table with the following command:

```python
sdf.write.saveAsTable("NormalizedWineData")
```
