Saturday, September 5, 2020
Understanding Short Term Fourier Transform (STFT)
Hello ppl,
Let us understand the problem with a simple example. We take two signals for comparison. The first signal is a concatenation of sine waves of three different frequencies (5 Hz, 10 Hz and 15 Hz), i.e. different frequencies in different time windows (a non-stationary signal). The second signal is a sum of sine waves of the same three frequencies, i.e. the same frequencies at all points of time (a stationary signal). All the implementations can be seen in the Github code page here.
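As a rough sketch of how the two signals could be generated with numpy (the exact construction is my assumption; only the frequencies and the non-stationary/stationary distinction come from the post):

import numpy as np

fs = 256                                   # sampling frequency (samples per second), as used later in the post
t = np.arange(0, 1, 1 / fs)                # one second worth of time samples

# Signal 1: 5 Hz, 10 Hz and 15 Hz sine waves concatenated one after another (non-stationary)
signal_1 = np.concatenate([np.sin(2 * np.pi * f * t) for f in (5, 10, 15)])

# Signal 2: the same three sine waves summed over the whole duration (stationary)
t_full = np.arange(0, 3, 1 / fs)
signal_2 = sum(np.sin(2 * np.pi * f * t_full) for f in (5, 10, 15))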
We use the FFT implementation to take the Fourier transform. When plotting, we usually drop the negative frequencies. Taking the Fourier transform of both signals, we get the following:
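A minimal sketch of this step using scipy.fftpack.fft (the plotting helper plot_spectrum is my own; signal_1, signal_2 and fs are from the sketch above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft, fftfreq

def plot_spectrum(signal, fs, title):
    n = len(signal)
    freqs = fftfreq(n, d=1 / fs)           # frequency bin for every FFT output sample
    spectrum = np.abs(fft(signal))         # magnitude of the Fourier transform
    positive = freqs >= 0                  # drop the negative frequencies for plotting
    plt.plot(freqs[positive], spectrum[positive])
    plt.title(title)
    plt.xlabel("Frequency (Hz)")
    plt.show()

plot_spectrum(signal_1, fs, "FFT of the non-stationary signal")
plot_spectrum(signal_2, fs, "FFT of the stationary signal")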
From this, it is clear that the Fourier transform does not give us any detail about the time at which a particular frequency is present.
How do we tackle this issue?
We can fix a window and calculate the Fourier transform within that particular window. This idea gives rise to the Short Term Fourier Transform (STFT).
The mathematical representation of the Short Term Fourier Transform can be given as follows:

Continuous-time STFT:
STFT{x(t)}(τ, ω) = ∫ x(t) w(t − τ) e^(−jωt) dt

Discrete-time STFT:
STFT{x(n)}(m, ω) = Σₙ x(n) w(n − m) e^(−jωn)

Here,
x(n) – signal
w(n) – sliding window
m – location of the window
ω – angular frequency
We can see that, in this form, only the window is an extra addition. For a practical implementation using scipy in Python, we need a few more parameters beyond the ones mentioned in the formula:
1) Sampling frequency
2) Length of the window
3) Amount of overlap between windows (this determines the value of 'm')
The sampling frequency is the number of samples taken per second. In practice there is rarely a truly discrete signal; we discretize an analog signal to get a discrete one. If we take 256 samples per second, then 256 Hz is our sampling frequency and 1/256 s is our sampling time, i.e. the spacing between two consecutive samples. (We choose the sampling frequency based on the maximum frequency present in the signal. By the Nyquist criterion, the sampling frequency must be more than twice the maximum frequency. Here, in our example, we are dealing with frequencies of 5 Hz, 10 Hz and 15 Hz, so a sampling frequency of 256 Hz is more than enough.)
The next parameter is the length of the window, which we specify in number of samples. For an unknown signal this is usually found by trial and error. In our case, we know that the signal changes every second (every 256 samples), so we fix that as our window length. A longer window gives higher frequency resolution, because a larger part of the signal is available to assess the frequencies present in it. (Say the period of a frequency component (1/frequency) is 100 samples and our window length is just 25 samples; we may not find the presence of that particular frequency.) Conversely, the shorter the window, the higher the time resolution.
The next parameter is the amount of overlap between consecutive windows, which is also chosen by trial and error. Here I choose the maximum possible value (255 samples), so the window slides one sample at a time; the overlap determines the step size of the window.
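Putting the three parameters together, here is a minimal sketch using scipy.signal.stft (the post only says scipy is used; the exact call and the plotting code are my assumptions):

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

fs = 256                                             # sampling frequency
# signal_1 is the non-stationary signal built earlier
f, t, Zxx = stft(signal_1, fs=fs, nperseg=256, noverlap=255)

plt.pcolormesh(t, f, np.abs(Zxx), shading="auto")    # time vs. frequency, magnitude as colour
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Magnitude")
plt.show()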
We get the spectrogram (the plot of time, frequency and amplitude from the STFT) as follows:
Here we can clearly see the time at which each frequency is present.
Now, the STFT has helped us overcome one problem, but not completely.
Why?
Because we have to make a trade-off between time resolution and frequency resolution. To overcome this drawback we can use the wavelet transform, which we will see in another post. The code for all the implementations in this post, along with a few more examples, can be seen in the Github page.
References
https://en.wikipedia.org/wiki/Short-time_Fourier_transform
http://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.fft.html
Tuesday, August 25, 2020
Analysis of data - Part 1
Hello ppl,
In a series of posts, let us see how to visualize and preprocess the data given to us. There are many types of inputs that we can get. Let us start with the basic kind of tabular input, which has multiple features or parameters representing a particular entity.
Data
The data we are going to use is the breast cancer dataset from Kaggle. The task with this particular dataset is to classify a tumor as malignant or benign based on the given features. There are 30 features describing the data; some examples are radius, texture and perimeter.
Raw Analysis
1) The first step is to view the data. I am using the pandas library to view the data.
2) Get the datatypes of all the variables.
3) Remove unnecessary columns (just use del data['particular_feature_name']).
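A minimal sketch of these three steps with pandas (the file name data.csv and the dropped column 'id' are assumptions; adjust them for your copy of the dataset):

import pandas as pd

data = pd.read_csv("data.csv")             # 1) load and view the data (file name assumed)
print(data.head())

print(data.dtypes)                         # 2) datatypes of all the variables

del data["id"]                             # 3) remove an unnecessary column (column name assumed)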
Understanding specific feature
1) nunique() and unique() give the count of unique elements and the values of the unique elements, respectively. In the example below, there are 2 unique elements with values ['M', 'B'].
2) We can replace particular element values with desired ones in one or more features. Here we have replaced the values 'M' and 'B' with 1 and 0.
3) We can list all the features with a similar data type. In the example below, we are listing only the float data type.
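A minimal sketch of these steps, assuming the label column is called 'diagnosis' as in the Kaggle dataset:

# 1) Count and list the unique values of the label column
print(data["diagnosis"].nunique())         # 2
print(data["diagnosis"].unique())          # ['M' 'B']

# 2) Replace 'M' and 'B' with 1 and 0
data["diagnosis"] = data["diagnosis"].replace({"M": 1, "B": 0})

# 3) List all the features of a given data type (here, float)
print(list(data.select_dtypes(include="float64").columns))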
Visualizing the data
1) First, we need to split the features and the label.
2) Before doing any visualization, we need to check for empty cells (NaN values).
3) Check all the important parameters of the features, such as mean, standard deviation, minimum and maximum value, using the describe() method.
4) The basic visualization of the data is a histogram, to see how the data is spread. We can use the seaborn library for all visualizations; other than the examples provided here, there are many more types of visualizations. For getting a histogram, we use,
seaborn.distplot(data)
The plot can be seen below,
5) Another important visualization is the boxplot. It is important because it shows the outliers in the given data. We get this with,
seaborn.boxplot(data)
The plot can be interpreted using the image below,
Median (Q2/50th Percentile): the middle value of the dataset.
First quartile (Q1/25th Percentile): the middle number between the smallest number (not the "minimum") and the median of the dataset.
Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.
Interquartile range (IQR): 25th to the 75th percentile
The data points outside the "minimum" and "maximum" marked in the figure (i.e. beyond Q1 − 1.5·IQR and Q3 + 1.5·IQR) are outliers. We can assess how well the data is distributed using this plot. The plot can be seen below,
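Putting the five visualization steps above together, a minimal sketch (the feature name 'radius_mean' is an assumption; distplot is kept as in the post, though newer seaborn versions prefer histplot):

import seaborn
import matplotlib.pyplot as plt

# 1) Split the features and the label
y = data["diagnosis"]
X = data.drop(columns=["diagnosis"])

# 2) Check for empty cells (NaN values)
print(X.isnull().sum())

# 3) Mean, standard deviation, minimum and maximum of every feature
print(X.describe())

# 4) Histogram of one feature to see how the data is spread
seaborn.distplot(X["radius_mean"])
plt.show()

# 5) Boxplot of the same feature to spot outliers
seaborn.boxplot(x=X["radius_mean"])
plt.show()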