Tuesday, August 25, 2020

Analysis of data - Part 1

 Hello ppl,

In a series of posts, let us see how to visualize and preprocess the data given to us. There are many types of input that we can get. Let us start with the basic kind, a table, which has multiple features or parameters representing a particular entity.

Data

The data we are going to use is the breast cancer dataset from Kaggle. The task with this dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing each tumor. Some examples are radius, texture, and perimeter.

Raw Analysis

1) The first step is to view the data. I am using the pandas library to view the data.
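Here is a minimal sketch of this step. The tiny CSV text is a toy stand-in for the Kaggle file, and the column names (id, diagnosis, radius_mean and so on) are my assumptions about the dataset, so adjust them to whatever your file actually contains.

```python
import io

import pandas as pd

# Toy stand-in for the Kaggle CSV; column names and values are assumptions.
csv_text = """id,diagnosis,radius_mean,texture_mean,perimeter_mean
1,M,17.99,10.38,122.8
2,M,20.57,17.77,132.9
3,B,13.54,14.36,87.46
4,B,13.08,15.71,85.63
"""

# With the real file this would be something like pd.read_csv("data.csv"),
# where "data.csv" is a hypothetical path.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first five rows of the table
```

head() only shows the first few rows, which is usually enough to spot obviously wrong columns.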

2) Get the datatypes of all the variables
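A small sketch of how the datatypes could be listed, again on a toy frame with assumed column names:

```python
import pandas as pd

# Toy frame; the column names are assumptions, not the full dataset.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 13.54],
})

print(data.dtypes)  # one dtype per column
data.info()         # dtypes plus non-null counts and memory usage
```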

 

3) Remove unnecessary columns (just use del data['particular_feature_name'])
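A sketch of two ways to drop columns. The columns removed here ("id" and "Unnamed: 32") are my assumption about what is unnecessary in this dataset; check your own copy before deleting anything.

```python
import pandas as pd

# Toy frame; "id" and "Unnamed: 32" are assumed to be the unneeded columns.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 13.54],
    "Unnamed: 32": [None, None, None],
})

# Option 1: delete a single column in place.
del data["id"]

# Option 2: drop one or more columns and get a new frame back.
data = data.drop(columns=["Unnamed: 32"])

print(data.columns.tolist())
```

drop() is handy when several columns go at once, since it takes a list.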

Understanding a specific feature

1) nunique() and unique() give the count of unique elements and the values of the unique elements, respectively. In the example below, there are 2 unique elements, with values ['M', 'B'].
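A quick sketch on a toy label column (the name "diagnosis" is an assumption). I have also added value_counts(), which is not mentioned above but shows how many rows fall in each class.

```python
import pandas as pd

# Toy label column; the column name "diagnosis" is an assumption.
data = pd.DataFrame({"diagnosis": ["M", "M", "B", "B", "M"]})

n_classes = data["diagnosis"].nunique()    # number of distinct values
classes = data["diagnosis"].unique()       # the distinct values, in order of appearance
counts = data["diagnosis"].value_counts()  # rows per distinct value

print(n_classes)
print(list(classes))
print(counts)
```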

2) We can replace particular element values with the desired ones in one or more features. Here we have replaced the values 'M' and 'B' with '1' and '0'.
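One possible way to do the replacement is sketched here. Note that I am using Series.map with an explicit dictionary and integer codes, which may differ from the call used in the original code, and the column name is assumed.

```python
import pandas as pd

# Toy label column; the column name "diagnosis" is an assumption.
data = pd.DataFrame({"diagnosis": ["M", "M", "B", "B", "M"]})

# Map malignant to 1 and benign to 0.
# Any value missing from the dictionary would become NaN.
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})

print(data["diagnosis"].tolist())
```

Using integers rather than the strings '1' and '0' keeps the label numeric, which most models expect.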

3) We can list all the features that share a data type. In the example below, we are listing only the features of float data type.
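A sketch of one way to list the float features, using select_dtypes on a toy frame with assumed column names:

```python
import pandas as pd

# Toy frame mixing integer, float, and text columns (names are assumptions).
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 13.54],
    "texture_mean": [10.38, 17.77, 14.36],
})

# Keep only the float64 columns and collect their names.
float_cols = data.select_dtypes(include="float64").columns.tolist()

print(float_cols)
```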

Visualizing the data

1) First, we need to split the features and the label.
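A minimal sketch of the split, assuming the label lives in a column called "diagnosis". The names X and y are just my convention here.

```python
import pandas as pd

# Toy frame; "diagnosis" is assumed to be the label column.
data = pd.DataFrame({
    "diagnosis": [1, 1, 0],
    "radius_mean": [17.99, 20.57, 13.54],
    "texture_mean": [10.38, 17.77, 14.36],
})

y = data["diagnosis"]                 # label
X = data.drop(columns=["diagnosis"])  # everything else is a feature

print(X.shape, y.shape)
```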

2) Before doing any visualization we need to check for empty cells (or NaN values)
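A sketch of the NaN check on a toy feature table. I have put one NaN in on purpose so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Toy feature table with one deliberately missing value.
X = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 13.54],
    "texture_mean": [10.38, 17.77, 14.36],
})

missing_per_column = X.isnull().sum()  # NaN count for each feature
has_missing = X.isnull().values.any()  # True if any cell is NaN

print(missing_per_column)
print(has_missing)
```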

3) Check the important summary statistics of the features, like the mean, standard deviation, minimum and maximum value, using describe()
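A small sketch of describe() on toy values (the column names are assumptions). Note that pandas reports the sample standard deviation here.

```python
import pandas as pd

# Toy feature table; values are chosen to make the statistics easy to check.
X = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 16.0],
    "texture_mean": [15.0, 15.0, 20.0, 20.0],
})

summary = X.describe()  # count, mean, std, min, quartiles, max per column

print(summary.loc[["mean", "std", "min", "max"]])
```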

4) The most basic visualization of the data is a histogram, which shows how the data is spread. We can use the seaborn library for all visualizations. There are many more types of visualization than the examples provided here. For getting a histogram, we use,

seaborn.distplot(data)

The plot can be seen below,

5) Another important visualization is the boxplot. It is important because it shows the outliers in the given data. We get this with,

seaborn.boxplot(data)

The plot can be interpreted using the image below, 

Median (Q2/50th Percentile): the middle value of the dataset.

First quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.

Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.

Interquartile range (IQR): the range from the 25th to the 75th percentile, that is, Q3 - Q1.

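The data points outside the “minimum” and “maximum” specified in the figure are outliers. In the usual boxplot convention these two limits are Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, which is why they are not the same as the smallest and highest values in the data. We can assess how well the data is distributed using this plot. The plot can be seen below,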
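To make the outlier rule concrete, here is a sketch that computes the quartiles and the limits by hand with pandas. It assumes the usual 1.5 × IQR convention (the default whisker length in seaborn and matplotlib), and the toy values contain one obvious outlier.

```python
import pandas as pd

# Toy values for one feature; 40.0 is far away from the rest.
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 40.0], name="radius_mean")

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr      # the "minimum" of the boxplot
upper = q3 + 1.5 * iqr      # the "maximum" of the boxplot

# Anything outside [lower, upper] is drawn as an outlier point.
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr, lower, upper)
print(outliers.tolist())
```

Here Q1 is 12.5 and Q3 is 15.5, so the limits are 8.0 and 20.0, and only the value 40.0 falls outside them.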

The complete code of this post can be seen in this link
