I am a geophysics researcher at the Institut de Physique du Globe de Paris. This blog is intended to provide a space to share random things with the world. Due to my work, these random things are mostly related to computers, and especially to Python (hence the title).

The following script shows how to extract N-dimensional B-spline basis functions from scipy.

1D Example:
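A minimal sketch of the 1D case (the knot vector, point grid, and function name here are illustrative, not the original script): each basis function is extracted by handing scipy's `BSpline` a unit coefficient vector.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis_1d(x, knots, degree):
    # evaluate every B-spline basis function on the knot vector at points x
    # by passing unit coefficient vectors to BSpline
    nbasis = len(knots) - degree - 1
    return np.array([BSpline(knots, np.eye(nbasis)[i], degree)(x)
                     for i in range(nbasis)])

# clamped cubic knot vector on [0, 1] (illustrative choice)
knots = np.concatenate([[0.0] * 3, np.linspace(0.0, 1.0, 8), [1.0] * 3])
x = np.linspace(0.0, 1.0, 200)
basis = bspline_basis_1d(x, knots, degree=3)
```

For a clamped knot vector like this one, the basis functions form a partition of unity on [0, 1], i.e. `basis.sum(axis=0)` is 1 everywhere.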

2D Example:
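The 2D basis can be sketched as tensor products of two 1D bases (again with an illustrative knot vector; the helper name is not from the original script):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis_1d(x, knots, degree):
    # one row per basis function, evaluated via unit coefficient vectors
    nbasis = len(knots) - degree - 1
    return np.array([BSpline(knots, np.eye(nbasis)[i], degree)(x)
                     for i in range(nbasis)])

# clamped cubic knots on [0, 1], used for both axes (illustrative choice)
knots = np.concatenate([[0.0] * 3, np.linspace(0.0, 1.0, 6), [1.0] * 3])
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 1.0, 40)
bx = bspline_basis_1d(x, knots, 3)   # shape (nbasis, 50)
by = bspline_basis_1d(y, knots, 3)   # shape (nbasis, 40)
# tensor-product basis: basis2d[i, j] is the surface B_i(x) * B_j(y)
basis2d = np.einsum('ix,jy->ijxy', bx, by)
```

Since each 1D basis sums to one, the tensor-product surfaces also sum to one at every (x, y).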

Real world data is often cluttered with outliers. Such contaminated data points can strongly distort derived quantities such as the Fourier spectrum. Outliers can be handled by different algorithms. The Python library Scikit-Learn, for example, contains several robust models that identify outliers and reduce their influence on linear data fits. Although linear models are often insufficient to capture real world data on their own, they can be combined with non-linear models, as shown in this example where the data is non-linearly transformed into polynomial space and then fit linearly and robustly.

The same idea can be applied in Fourier instead of polynomial space: a data point with a given value x (e.g. from a time series) is transformed into Fourier feature space by evaluating, at the point x, all sine and cosine functions that constitute the Fourier basis. The result is stored in a large 'feature' vector. The value y at x is a linear combination of the evaluated sine and cosine functions, i.e. the dot product of a coefficient vector with the feature vector. The linear regression models can now find the coefficient vector that best predicts the data points y for given x.
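A sketch of this idea, with `HuberRegressor` standing in as one of scikit-learn's robust linear models (the estimator choice, function names, and all parameters below are illustrative, not the exact code of the post):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

def fourier_features(x, nfreq, period):
    # feature vector: constant plus cos/sin pairs up to nfreq harmonics
    w = 2.0 * np.pi / period
    cols = [np.ones_like(x)]
    for k in range(1, nfreq + 1):
        cols.append(np.cos(k * w * x))
        cols.append(np.sin(k * w * x))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 300))
y = np.sin(2.0 * np.pi * x)
y[rng.choice(len(x), 30, replace=False)] += 8.0   # contaminate 10% of the points

X = fourier_features(x, nfreq=4, period=1.0)
model = HuberRegressor(fit_intercept=False).fit(X, y)
coeffs = model.coef_        # robust estimate of the Fourier coefficients
y_fit = X @ coeffs
```

Despite the gross outliers, the coefficient of the first sine harmonic stays close to the true value of 1, whereas a plain least-squares fit would be pulled far off.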

Seeing the Fourier transform from this perspective has the advantage that a plethora of linear regression models can be used to fit the data and to find the coefficients of the Fourier basis (the spectrum). The following image demonstrates how a simple sine wave with outliers can be accurately fit using the robust linear estimators implemented in scikit-learn. The short code can be found here.

Another useful application of custom non-linear features in Scikit-Learn is B-splines. B-splines are built from piecewise non-linear basis functions. To use them in Scikit-Learn, we need to build a custom feature transformer class that transforms the single feature x into the feature vector of B-spline basis functions evaluated at x, as in the case of the Fourier transform. This feature transformer can be pipelined with regression models to build a robust spline regression. The results are shown in the following image, and the code is located here.
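A minimal sketch of such a transformer (the class name, knot vector, and choice of robust estimator are illustrative; recent versions of scikit-learn also ship a built-in `SplineTransformer` for this purpose):

```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline

class BSplineFeatures(BaseEstimator, TransformerMixin):
    """Transform a single feature x into B-spline basis function values."""

    def __init__(self, knots, degree=3):
        self.knots = knots
        self.degree = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        nbasis = len(self.knots) - self.degree - 1
        features = np.empty((X.shape[0], nbasis))
        for i in range(nbasis):
            # evaluate the i-th basis function via a unit coefficient vector
            features[:, i] = BSpline(self.knots, np.eye(nbasis)[i],
                                     self.degree)(X[:, 0])
        return features

knots = np.concatenate([[0.0] * 3, np.linspace(0.0, 1.0, 10), [1.0] * 3])
model = make_pipeline(BSplineFeatures(knots),
                      HuberRegressor(fit_intercept=False, max_iter=500))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 300))
y = np.sin(2.0 * np.pi * x)
y[rng.choice(len(x), 30, replace=False)] += 8.0   # 10% outliers
model.fit(x[:, None], y)
y_fit = model.predict(x[:, None])
```

Since the clamped basis functions already sum to one, the intercept is redundant and is switched off here.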

In a similar manner, 2d B-splines can be used to regress images with outliers. For the next example, we want to use a 2d spline basis to fit a Gaussian function that is sampled at random locations. 10% of the data points are set to a value of 10 as outliers. We then run the same estimators as above to find the spline coefficients that best fit the image. Again, the robust estimators manage to fit the original input function fairly well, whereas the least-squares fit is strongly perturbed by the outliers. The code for this example is available here.
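A sketch of the 2d case under the same assumptions (sample counts, knot vector, outlier value, and the `HuberRegressor` stand-in are all illustrative): the feature matrix holds one column per tensor-product pair of 1D basis functions.

```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.linear_model import HuberRegressor

def bspline_basis(x, knots, degree):
    # columns: all B-spline basis functions evaluated at the points x
    nbasis = len(knots) - degree - 1
    return np.column_stack([BSpline(knots, np.eye(nbasis)[i], degree)(x)
                            for i in range(nbasis)])

rng = np.random.default_rng(2)
npts = 2000
x, y = rng.uniform(0.0, 1.0, (2, npts))            # random sample locations
z_true = np.exp(-((x - 0.5)**2 + (y - 0.5)**2) / 0.05)
z = z_true.copy()
z[rng.choice(npts, npts // 10, replace=False)] = 10.0   # 10% outliers

knots = np.concatenate([[0.0] * 3, np.linspace(0.0, 1.0, 8), [1.0] * 3])
bx = bspline_basis(x, knots, 3)
by = bspline_basis(y, knots, 3)
# tensor-product features: one column per pair of 1D basis functions
X = np.einsum('ni,nj->nij', bx, by).reshape(npts, -1)
model = HuberRegressor(fit_intercept=False, max_iter=500).fit(X, z)
z_fit = X @ model.coef_
```

The robust fit stays close to the clean Gaussian, while an ordinary least-squares fit of the same features would be dragged toward the outlier value of 10.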

Labels:
Fourier Transform,
Non-linear Features,
Outliers,
Python,
ScikitLearn

The interaction of ocean waves that run into each other is the major source of seismic noise in the frequency band between 0.1 and 0.3 Hz all over the Earth. This animation shows the interaction pressure fluctuations and their dominant frequency, displayed as color lightness and hue, respectively, from 2003 to 2015. The data shown in this video is based on the wave model WAVEWATCH III. Seismic recordings make it possible to monitor such microseisms directly.

For further information about secondary microseisms you can check out this talk.

Labels:
Basemap,
microseism,
ocean waves,
Python,
Seismology,
Visualization

CT Scan examinations often come with a data CD that contains the raw images of the scan. These can be easily visualized in 3D using freely available software such as Paraview. The steps to do this are as follows:

**Preparation:**

- Find the raw images. These can have an ending .dcm but sometimes the files have no ending at all. The folder structure that I received on a CD from my particular examination looked as follows:

```
IMAGES/
└── PAT00000
    └── ST000000
        ├── SE000000
        │   └── CT000000
        ├── SE000001
        │   ├── CT000000
        │   ├── CT000001
        │   ├── CT000002
        │   ├── CT000003
        │   ├── CT000[...]
        │   └── CT000213
        └── SE000002
            └── CT000000
```
The folder SE000001 contains the set of sections that you want to visualize. Alternatively, you can explore the OsiriX database, which contains publicly available datasets.

- Copy all images to a new folder on your computer and change the file endings to .dcm.
- dcm images can be raw or compressed (as is the case for the OsiriX database). If the files are compressed, you might get an error later in Paraview. In this case, you can uncompress them in the terminal (bash) with the tool *gdcmconv*: `for file in *.dcm; do gdcmconv -i $file --raw -o $file; done`

**Open the files in Paraview**

- Open Paraview and click on File -> Open. Browse to the folder that contains the set of sections. The files should show up as a collapsed list, e.g. as CT...dcm . Open them and select "DICOM Files (directory)" as the reader to use. If you get an error (warnings are often OK), check that the files are really uncompressed, as explained above.
- If everything went well, you should see an object appear in the Paraview "pipeline browser" (usually on the left).
- Further down, in the "Properties" window, click on "Apply" to display the data.
- Once the data is loaded, you can select "Volume" rendering in the top toolbar instead of "Outline".
- In the Properties window click on "Edit" under the "Coloring" tab. You should see a new window on the right hand side that you can use to adapt the opacity and color of the volume.
- Click the colorbar to add new points. Lower them to make regions with certain data values more transparent. (see image below)

UPDATE: check out the GitHub project with Blender scripts and a Python module.

A 2d colormap is designed to represent two dimensional data on a two dimensional domain. Common examples are wind direction and magnitude, signal amplitude and frequency, or a complex function's magnitude and argument. Typically, direction or frequency is represented by the color hue, and magnitude by the lightness and/or saturation of the colors.

A simple 2d colormap can easily be designed in the HSV (hue, saturation, value) color space. Unfortunately, HSV is not perceptually uniform: equal distances in the HSV domain can be perceived very differently. It therefore distorts the represented data.
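This non-uniformity can be illustrated with the standard library alone, using Rec. 709 luma as a rough proxy for perceived lightness (a simplification; a proper comparison would use a perceptually uniform space such as the one discussed below):

```python
import colorsys

# six equally spaced hues at full saturation and value: equal steps in HSV
hues = [i / 6.0 for i in range(6)]
lumas = []
for hue in hues:
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    # Rec. 709 luma: a crude stand-in for perceived lightness
    lumas.append(0.2126 * r + 0.7152 * g + 0.0722 * b)

# despite equal HSV "value", the apparent lightness varies wildly
print(dict(zip(['red', 'yellow', 'green', 'cyan', 'blue', 'magenta'],
               [round(l, 3) for l in lumas])))
```

Yellow comes out roughly thirteen times brighter than blue, even though both have the same HSV value.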

The Python module colorspacious (pip install colorspacious) provides access to a colorspace in which equal distances map to similar perceived color differences. I have written three short Blender scripts (available on GitHub) that allow you to visualize this colorspace in Blender and to export color values on paths and surfaces to colormaps.

In contrast to a 1d colormap, which can be represented as a path in the 3d colorspace, a 2d colormap can be represented as a surface. A surface whose topology maps easily onto a rectangle is particularly useful, because the two dimensions then remain clear. Blender offers so-called NURBS surfaces, which are defined by a rectangular grid of spline nodes and are perfect for this purpose.

The simplest and most obvious choice for a 2d colormap is to use lightness as one axis and color hue as the other. The color saturation can then be adapted for each lightness to span the maximum possible saturation range. However, this approach has several disadvantages. The uniform colorspace has six corners that hold the most saturated colors: blue, cyan, green, yellow, red, magenta. This is exploited in hexagon representations of the colorspace. To make use of the full extent of the colorspace, we should therefore adapt the saturation to the range available for each color and reach into these corners. On the other hand, this shouldn't distort distances too much, because the perceived color differences would become less uniform. More problematic is that each corner has a different lightness: blue and red, for example, are darker colors than yellow and cyan.

The following two surfaces try to find a good balance between all of these constraints. The topology of each surface is shown as a wireframe mesh. Neither surface is fully uniform, which is good because it allows us to better explore the color space.
This colormap goes from black to white and has a maximally saturated area in the middle (like HSL).

This colormap goes from black to white and has a maximally saturated area in the middle (like HSV):
Here is a comparison of these colormaps, showing a complex polynomial. In the case of poles and zeros, the black-to-white colormap is the best. For other applications, other colormaps will be better. The main point is that they are more uniform than HSV, which is shown in the last row. HSV enhances a few colors and therefore distorts the data:

A simple way to map data to these colors in python is shown in the following code snippet. data is a [2, nwidth, nheight] array with the data and cmap a [ndim1, ndim2, 3] array with the rgb values of the colormap. The resulting rgb [nwidth, nheight, 3] image can be displayed with plt.imshow(rgb).

```
import numpy as np
from scipy.ndimage import map_coordinates

def cmap2d_to_rgb(data, cmap):
    """Map 2d data in [0, 1] x [0, 1] onto a 2d colormap.

    data: [2, nrows, ncols] array; cmap: [ndim1, ndim2, 3] rgb array.
    Returns an [nrows, ncols, 3] rgb image for plt.imshow.
    """
    data_dim, nrows, ncols = data.shape
    # convert data values to (fractional) index coordinates of the colormap
    idata = np.copy(data)
    idata[0] *= cmap.shape[0] - 1
    idata[1] *= cmap.shape[1] - 1
    # linearly interpolate each color channel at the index coordinates
    r = map_coordinates(cmap[:, :, 0], idata, order=1, mode='nearest')
    g = map_coordinates(cmap[:, :, 1], idata, order=1, mode='nearest')
    b = map_coordinates(cmap[:, :, 2], idata, order=1, mode='nearest')
    return np.stack([r, g, b], axis=-1)
```

This is a simple test I have done that plots wavelet statistics for a stationary white noise random model realisation. In this case, the model amplitudes are expected to follow a Gaussian (normal) distribution. Due to the orthogonality of the Daubechies wavelet basis (using PyWavelets), the coefficient power of a single wavelet follows a chi-square distribution with 1 degree of freedom: this is simply the distribution of the square of independent, normally distributed random variables. Finally, the power averaged over all wavelets at one time follows a (scaled) chi-square distribution with the number of wavelets as degrees of freedom (16 in this case). The plot and the attached script show these predictions compared to the measured distribution of synthesized white noise.

Now imagine that the Gaussian distributed white noise is modulated (multiplied) in time by another (smooth) oscillatory function. This will not necessarily be visible in the distribution of the model itself, because the overall amplitudes might stay unchanged. Different statistics, however, do change: first, modulation in time will affect *all* frequencies in the same manner. This means that we introduce a correlation between frequencies that Gaussian white noise doesn't have. This correlation in turn means that the averages of the 16 wavelets at each time are *not* chi-square distributed anymore, which can be tested. This example is shown in the image below. Another possible statistic would be to check whether successive wavelet components are correlated and no longer chi-square distributed. This could be another hint for a modulating background function.

```
#!/usr/bin/env python
"""
This script demonstrates wavelet statistics of a white noise stationary random model
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2, norm
import pywt


def main():
    # prepare time grid:
    # np.random.seed(0)
    tmax = 1.
    npts = 2**15
    times = np.linspace(-tmax, tmax, npts)
    dt = times[1] - times[0]

    # generate real stationary time series:
    freqs = np.fft.rfftfreq(npts, d=dt)
    nfreqs = len(freqs)
    coeffs = np.random.normal(loc=0., scale=1., size=nfreqs) + \
        1j * np.random.normal(loc=0., scale=1., size=nfreqs)
    power = np.abs(coeffs)**2
    coeffs /= np.sqrt(2. * np.sum(power))  # normalize to unit model variance
    coeffs *= npts
    model = np.fft.irfft(coeffs)
    coeffs = np.fft.fft(model)

    # discrete wavelet packet transform
    wavelet = 'db4'
    level = 4
    order = "freq"
    wp = pywt.WaveletPacket(model, wavelet, 'symmetric', maxlevel=level)
    nodes = wp.get_level(level, order=order)
    labels = [n.path for n in nodes]
    wl_coeffs = np.array([n.data for n in nodes], 'd')
    wl_coeffs = np.abs(wl_coeffs)**2
    nbands, nnodes = wl_coeffs.shape

    # statistics:
    # 1a) model histogram
    model_bins = np.linspace(-5., 5., 30)
    model_hist, _ = np.histogram(model, bins=model_bins)
    model_width = model_bins[1] - model_bins[0]
    model_hist = model_hist.astype(float) / model_width / npts

    # 1b) model distribution
    values = np.linspace(-5., 5., 100)
    variance = np.sum(np.abs(coeffs)**2) / npts**2
    model_pdf = norm.pdf(values, loc=0, scale=np.sqrt(variance))

    # 2a) wavelet histograms
    wl_bins = np.linspace(0., 3., 50)
    wl_hist1, _ = np.histogram(wl_coeffs[0, :], bins=wl_bins)
    wl_hist2, _ = np.histogram(np.mean(wl_coeffs, axis=0), bins=wl_bins)
    wl_width = wl_bins[1] - wl_bins[0]
    wl_hist1 = wl_hist1.astype(float) / nnodes / wl_width
    wl_hist2 = wl_hist2.astype(float) / nnodes / wl_width

    # 2b) predicted chi-square distributions
    chi2_pdf1 = chi2.pdf(wl_bins, 1)
    chi2_pdf2 = chi2.pdf(wl_bins * nbands, nbands) * nbands

    # plotting
    fig, axes = plt.subplots(2, 2, figsize=(14, 7))
    axes[0, 0].plot(times, model, label='stationary random model')
    axes[0, 0].set(xlabel='time', ylabel='amplitude',
                   title='stationary Gaussian model white noise realisation')
    axes[1, 0].bar(model_bins[:-1], model_hist, alpha=0.5, width=model_width,
                   color='green')
    axes[1, 0].plot(values, model_pdf, lw=2)
    axes[1, 0].set(xlabel='model amplitude', ylabel='frequency',
                   title=r'amplitude distribution ($\mathcal{N}$)')
    axes[0, 1].imshow(wl_coeffs, interpolation='nearest', cmap='viridis',
                      aspect="auto", origin="lower",
                      extent=[-tmax, tmax, 0, len(wl_coeffs)])
    axes[0, 1].set(xlabel='time', ylabel='wavelet number',
                   title='discrete wavelet coefficient power')
    axes[1, 1].bar(wl_bins[:-1], wl_hist1, width=wl_width, alpha=0.5,
                   color='blue')
    axes[1, 1].bar(wl_bins[:-1], wl_hist2, width=wl_width, alpha=0.5,
                   color='green')
    axes[1, 1].plot(wl_bins, chi2_pdf1, lw=2, c='red',
                    label=r'isolated wavelets ($\chi^2_{%d}$)' % 1)
    axes[1, 1].plot(wl_bins, chi2_pdf2, lw=2, c='magenta',
                    label=r'averaged wavelets ($\chi^2_{%d}$)' % nbands)
    axes[1, 1].set(xlabel='variance', ylabel='frequency',
                   title='wavelet variance distribution')
    axes[1, 1].legend()
    fig.tight_layout(pad=0.1)
    plt.show()


if __name__ == "__main__":
    main()
```

This is an illustration of different paths that seismic waves take when they propagate through the Earth.

This video shows the individual phases in an animation:
