Hello ppl,
In a series of posts, let us see how to visualize and
preprocess the data given to us. There are many types of input that we can
get. Let us start with the most basic kind, a table in which multiple
features (or parameters) describe each entity.
Data
The data we are going to use is the breast cancer dataset
from Kaggle. The task with this dataset is to classify each
tumor as malignant or benign based on the given features. There are 30
features describing each tumor. Some examples are radius, texture, and
perimeter.
Raw Analysis
1) The first step is to view the data. I am using the pandas library for this.
2) Get the data types of all the variables.
3) Remove unnecessary columns (just use del data['particular_feature_name']).
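The three steps above can be sketched roughly as follows. This is only a minimal illustration: the small DataFrame is a made-up stand-in for the Kaggle file, and the column names (id, diagnosis, radius_mean, texture_mean) are assumptions. With the real file you would load it with pd.read_csv() instead.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV. Column names and values are
# illustrative assumptions, not the real dataset.
data = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
    "texture_mean": [10.4, 15.7, 18.2, 21.3],
})

# 1) View the first few rows of the table
print(data.head())

# 2) Get the data type of every column
print(data.dtypes)

# 3) Remove a column that is not useful for classification
#    (equivalent to: del data["id"])
data = data.drop(columns=["id"])
print(data.columns.tolist())
```

Using drop(columns=...) instead of del returns a new DataFrame, which makes it easier to keep the original around while experimenting.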
Understanding specific feature
1) nunique() and unique() give the number of unique elements and the unique values themselves. In this dataset, the label column has 2 unique values: ['M', 'B'].
2) We can replace particular values with the desired ones in one or more features. Here we have replaced the values 'M' and 'B' with '1' and '0'.
3) We can list all the features that share a data type. In this example, we list only the features of float type.
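Here is a small sketch of these three operations. The DataFrame and its column names are again made-up assumptions standing in for the real data. I use map() with a dictionary for the replacement, which is one of several ways to do it (replace() also works).

```python
import pandas as pd

# Toy stand-in for the real data (illustrative assumption).
data = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
    "texture_mean": [10.4, 15.7, 18.2, 21.3],
})

# 1) Count of unique values and the unique values themselves
n_classes = data["diagnosis"].nunique()
classes = data["diagnosis"].unique()
print(n_classes, classes)

# 2) Encode the label: malignant -> 1, benign -> 0
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})

# 3) List only the columns that hold floating-point values
float_cols = data.select_dtypes(include=["float64"]).columns.tolist()
print(float_cols)
```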
Visualizing the data
1) First, we need to split the features and the label.
2) Before doing any visualization, we need to check for empty cells (or NaN values).
3) Check the important statistics of each feature, such as mean, standard deviation, minimum, and maximum, using describe().
4) The most basic visualization is a histogram, which shows how the data is spread. We can use the seaborn library for all visualizations. There are many more types of visualization than the examples provided here. To get a histogram, we use,
seaborn.histplot(data)
(Older seaborn versions used seaborn.distplot(data), which has been deprecated since seaborn 0.11.)
The plot can be seen below,
5) Another important visualization is the box plot. It is important because it shows the outliers in the given data. We get this with,
seaborn.boxplot(data)
The plot can be interpreted using the image below,
Median (Q2/50th percentile): the middle value of the dataset.
First quartile (Q1/25th percentile): the middle value between the smallest value (not the "minimum") and the median of the dataset.
Third quartile (Q3/75th percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.
Interquartile range (IQR): the range from the 25th to the 75th percentile, that is, Q3 - Q1.
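The data points outside the "minimum" and "maximum" marked in the figure are outliers. In the usual box plot convention, which seaborn follows by default, the "minimum" is Q1 - 1.5 * IQR and the "maximum" is Q3 + 1.5 * IQR. We can
assess how well the data is distributed using this plot. The plot can be seen below,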