Saturday, September 5, 2020
Understanding Short Term Fourier Transform (STFT)
Hello ppl,
Let us understand the problem with a simple example. We take two signals for comparison. The first signal is a concatenation of sine waves of three different frequencies (5 Hz, 10 Hz and 15 Hz), i.e. different frequencies in different time windows (a non-stationary signal). The second signal is a sum of sine waves of the same three frequencies, i.e. the same frequencies at all points of time (a stationary signal). All the implementations can be seen in the Github code page here.
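As a rough sketch of how the two signals could be generated with numpy (the exact construction is my assumption; only the frequencies and the non-stationary/stationary distinction come from the post):

import numpy as np

fs = 256                                   # sampling frequency (samples per second), as used later in the post
t = np.arange(0, 1, 1 / fs)                # one second worth of time samples

# Signal 1: 5 Hz, 10 Hz and 15 Hz sine waves concatenated one after another (non-stationary)
signal_1 = np.concatenate([np.sin(2 * np.pi * f * t) for f in (5, 10, 15)])

# Signal 2: the same three sine waves summed over the whole duration (stationary)
t_full = np.arange(0, 3, 1 / fs)
signal_2 = sum(np.sin(2 * np.pi * f * t_full) for f in (5, 10, 15))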
We use the FFT implementation to take the Fourier transform. When plotting, we usually drop the negative frequencies. Taking the Fourier transform of both signals, we get the following:
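A minimal sketch of this step using scipy.fftpack.fft (the plotting helper plot_spectrum is my own; signal_1, signal_2 and fs are from the sketch above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft, fftfreq

def plot_spectrum(signal, fs, title):
    n = len(signal)
    freqs = fftfreq(n, d=1 / fs)           # frequency bin for every FFT output sample
    spectrum = np.abs(fft(signal))         # magnitude of the Fourier transform
    positive = freqs >= 0                  # drop the negative frequencies for plotting
    plt.plot(freqs[positive], spectrum[positive])
    plt.title(title)
    plt.xlabel("Frequency (Hz)")
    plt.show()

plot_spectrum(signal_1, fs, "FFT of the non-stationary signal")
plot_spectrum(signal_2, fs, "FFT of the stationary signal")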
From this, it is clear that the Fourier transform does not give us any detail about the time at which a particular frequency is present.
How do we tackle this issue?
We can fix a window and calculate the Fourier transform within that particular window. This idea gives rise to the Short Term Fourier Transform (STFT).
The mathematical representation of the Short Term Fourier Transform can be given as follows:

Continuous-time STFT:
STFT{x(t)}(τ, ω) = ∫ x(t) w(t − τ) e^(−jωt) dt

Discrete-time STFT:
STFT{x(n)}(m, ω) = Σₙ x(n) w(n − m) e^(−jωn)

Here,
x(n) – signal
w(n) – sliding window
m – location of the window
ω – angular frequency
We can see that, in this form, only the window is an extra addition. For a practical implementation using scipy in Python, we need a few more parameters beyond the ones mentioned in the formula:
1) Sampling frequency
2) Length of the window
3) Amount of overlap between windows (this determines the value of 'm')
The sampling frequency is the number of samples taken per second. In practice there is rarely a truly discrete signal; we discretize an analog signal to get a discrete one. If we take 256 samples per second, then 256 Hz is our sampling frequency and 1/256 s is our sampling time, i.e. the spacing between two consecutive samples. (We choose the sampling frequency based on the maximum frequency present in the signal. By the Nyquist criterion, the sampling frequency must be more than twice the maximum frequency. Here, in our example, we are dealing with frequencies of 5 Hz, 10 Hz and 15 Hz, so a sampling frequency of 256 Hz is more than enough.)
The next parameter is the length of the window, which we specify in number of samples. For an unknown signal this is usually found by trial and error. In our case, we know that the signal changes every second (every 256 samples), so we fix that as our window length. A longer window gives higher frequency resolution, because a larger part of the signal is available to assess the frequencies present in it. (Say the period of a frequency component (1/frequency) is 100 samples and our window length is just 25 samples; we may not find the presence of that particular frequency.) Conversely, the shorter the window, the higher the time resolution.
The next parameter is the amount of overlap between consecutive windows, which is also chosen by trial and error. Here I choose the maximum possible value (255 samples), so the window slides one sample at a time; the overlap determines the step size of the window.
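Putting the three parameters together, here is a minimal sketch using scipy.signal.stft (the post only says scipy is used; the exact call and the plotting code are my assumptions):

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

fs = 256                                             # sampling frequency
# signal_1 is the non-stationary signal built earlier
f, t, Zxx = stft(signal_1, fs=fs, nperseg=256, noverlap=255)

plt.pcolormesh(t, f, np.abs(Zxx), shading="auto")    # time vs. frequency, magnitude as colour
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Magnitude")
plt.show()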
We get the spectrogram (the plot of time, frequency and amplitude from the STFT) as follows:
Here we can clearly see the time at which each frequency is present.
Now, the STFT has helped us overcome one problem, but not completely.
Why?
Because we have to make a trade-off between time resolution and frequency resolution. To overcome this drawback we can use the wavelet transform, which we will see in another post. The code for all the implementations in this post, along with a few more examples, can be seen in the Github page.
References
https://en.wikipedia.org/wiki/Short-time_Fourier_transform
http://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.fft.html
Tuesday, August 25, 2020
Analysis of data - Part 1
Hello ppl,
In a series of posts, let us see how to visualize and preprocess the data given to us. There are many types of inputs that we can get. Let us start with the basic kind of tabular input, which has multiple features or parameters representing a particular entity.
Data
The data we are going to use is the breast cancer dataset from Kaggle. The task with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the data; some examples are radius, texture and perimeter.
Raw Analysis
1) The first step is to view the data. I am using the pandas library to view the data.
2) Get the datatypes of all the variables.
3) Remove unnecessary columns (just use del data['particular_feature_name']).
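A minimal sketch of these three steps with pandas (the file name data.csv and the dropped column 'id' are assumptions; adjust them for your copy of the dataset):

import pandas as pd

data = pd.read_csv("data.csv")             # 1) load and view the data (file name assumed)
print(data.head())

print(data.dtypes)                         # 2) datatypes of all the variables

del data["id"]                             # 3) remove an unnecessary column (column name assumed)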
Understanding specific feature
1) nunique() and unique() give the count of unique elements and the values of the unique elements, respectively. In the example below, there are 2 unique elements with values ['M', 'B'].
2) We can replace particular element values with desired ones in one or more features. Here we have replaced the values 'M' and 'B' with 1 and 0.
3) We can list all the features with a similar data type. In the example below, we are listing only the float data type.
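A minimal sketch of these steps, assuming the label column is called 'diagnosis' as in the Kaggle dataset:

# 1) Count and list the unique values of the label column
print(data["diagnosis"].nunique())         # 2
print(data["diagnosis"].unique())          # ['M' 'B']

# 2) Replace 'M' and 'B' with 1 and 0
data["diagnosis"] = data["diagnosis"].replace({"M": 1, "B": 0})

# 3) List all the features of a given data type (here, float)
print(list(data.select_dtypes(include="float64").columns))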
Visualizing the data
1) First, we need to split the features and the label.
2) Before doing any visualization, we need to check for empty cells (NaN values).
3) Check all the important parameters of the features, such as mean, standard deviation, minimum and maximum value, using the describe() method.
4) The basic visualization of the data is a histogram, to see how the data is spread. We can use the seaborn library for all visualizations; other than the examples provided here, there are many more types of visualizations. For getting a histogram, we use,
seaborn.distplot(data)
The plot can be seen below,
5) Another important visualization is the boxplot. It is important because it shows the outliers in the given data. We get this with,
seaborn.boxplot(data)
The plot can be interpreted using the image below,
Median (Q2/50th Percentile): the middle value of the dataset.
First quartile (Q1/25th Percentile): the middle number between the smallest number (not the "minimum") and the median of the dataset.
Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.
Interquartile range (IQR): 25th to the 75th percentile
The data points outside the "minimum" and "maximum" marked in the figure (i.e. beyond Q1 − 1.5·IQR and Q3 + 1.5·IQR) are outliers. We can assess how well the data is distributed using this plot. The plot can be seen below,
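Putting the five visualization steps above together, a minimal sketch (the feature name 'radius_mean' is an assumption; distplot is kept as in the post, though newer seaborn versions prefer histplot):

import seaborn
import matplotlib.pyplot as plt

# 1) Split the features and the label
y = data["diagnosis"]
X = data.drop(columns=["diagnosis"])

# 2) Check for empty cells (NaN values)
print(X.isnull().sum())

# 3) Mean, standard deviation, minimum and maximum of every feature
print(X.describe())

# 4) Histogram of one feature to see how the data is spread
seaborn.distplot(X["radius_mean"])
plt.show()

# 5) Boxplot of the same feature to spot outliers
seaborn.boxplot(x=X["radius_mean"])
plt.show()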