Commit d743a3c7 authored by Parish, Chad's avatar Parish, Chad

Replace mean_vs_data_B.ipynb

parent 97c5d2b6
%% Cell type:markdown id: tags:
# Rule #3: Showing the data beats mean $\pm$ standard deviation
### Notebook 3b: More complicated examples
Including non-normal data
%% Cell type:markdown id: tags:
Imports:
%% Cell type:code id: tags:
``` python
import numpy as np
from numpy import matlib
import matplotlib.pyplot as plt
import h5py
np.random.seed(seed=1) # for repeatability
```
%% Cell type:markdown id: tags:
Make the plots more easily readable:
%% Cell type:code id: tags:
``` python
plt.style.use({'font.sans-serif': 'arial', 'font.size': 16, 'font.weight': 'bold', 'svg.fonttype': 'none'})
# 'svg.fonttype': 'none' forces Matplotlib to burn the text into the .SVG file, rather than create
# shapes that look like the text.
```
%% Cell type:markdown id: tags:
### Read in the data
The data has previously been exported from the vendor-specific data formats into an HDF5 ".h5" file called `grain_file.h5`. HDF5 is a preferred format for the transfer and archiving of scientific data because it is readable on all common operating systems (Unix/Linux, Mac, Windows) and has APIs for all common (and most uncommon) programming languages.<br>
<br>
The data is from the publication: Gerczak et al., Restructuring in high burnup UO$_2$ studied using modern electron microscopy, <b>Journal of Nuclear Materials</b>, 2018<br>
<br>
Here, we use the `h5py` library as a Python interface to HDF5. We read each h5 "dataset" into a numpy array `D` and append this array to a list `grain_list`. We then read the position, in units of normalized radius $r/r_0$, and append it to a list `pos_list`. Finally, we calculate an equivalent grain area `A` from the grain diameter `D` (using $A = \frac{1}{4}\cdot \pi D^2$) and store this numpy array in a list `area_list`.
%% Cell type:code id: tags:
``` python
grain_list = []
pos_list = []
area_list = []
with h5py.File('grain_file.h5', mode='r') as h5_handle:
    for i in h5_handle: # iterate over the items in the file
        D = np.array(h5_handle[i]) # load the diameters
        pos = h5_handle[i].attrs['Position'] # .attrs reads attributes in the h5 format
        grain_list.append(D)
        pos_list.append(pos)
        A = 0.25 * (np.pi * D * D)
        area_list.append(A)
```
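%% Cell type:markdown id: tags:
For reference, a file with the layout the read loop above expects can be sketched as follows. This is *not* the original export script; the file name `demo_grain_file.h5`, the dataset names, and the lognormal diameters are all hypothetical, but the pattern (one dataset of diameters per radial position, with the position stored as an HDF5 attribute) matches what the loop reads back.
%% Cell type:code id: tags:
``` python
import numpy as np
import h5py

# Hypothetical writer sketch -- not the original export script.
rng = np.random.default_rng(0)
with h5py.File('demo_grain_file.h5', mode='w') as h5_handle:
    for k, pos in enumerate([0.25, 0.82, 0.95]):
        dset = h5_handle.create_dataset(f'dataset_{k}',
                                        data=rng.lognormal(mean=1.0, sigma=0.5, size=200))
        dset.attrs['Position'] = pos  # position stored as an attribute, as read above

# Read it back with the same pattern used in the loop above.
with h5py.File('demo_grain_file.h5', mode='r') as h5_handle:
    positions = [float(h5_handle[i].attrs['Position']) for i in h5_handle]
print(positions)
```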
%% Cell type:markdown id: tags:
### Calculate the weighted average and standard deviation
See, for example, https://edaxblog.com/2014/06/23/time-for-a-change-new-perspectives-in-grain-size-analysis/ for a discussion of grain size "averages" in EBSD. We are using the definition: <br>
\begin{equation}
\bar{D}_W =\frac{\sum A_g \cdot p_g}{\sum A_g}
\end{equation} <br>
for the area-weighted mean grain diameter $\bar{D}_W$. Here $A_g$ is the area of grain $g$ and $p_g$ is the parameter of interest (diameter) of grain $g$. We also use the simple arithmetic mean (average), $\bar{D}$, trivially calculated as `np.mean()` on any numpy array, along with the standard deviation $\sigma$ from `np.std()`.<br>
<br>
The weighted standard deviation (From communication with Dr. S. Wright, EDAX):<br><br>
\begin{equation}
\sigma_{W} = \sqrt{
\frac{\sum{A_g}}{(\sum{A_g})^2 - \sum{(A_g)^2}}
\left[
\left(\sum \limits_{g=1}^{N} A_g p_g^2\right) -
\frac{1}{\sum A_g}
\left(\sum \limits_{g=1}^{N} A_g p_g\right)^2
\right]
}
\end{equation}<br>
<br>
We have defined:<br> `first_term` = $\frac{\sum{A_g}}{(\sum{A_g})^2 - \sum{(A_g)^2}}$<br>
<br>
`second_term` = $\sum \limits_{g=1}^{N} A_g p_g^2$<br>
<br>
`third_term_parens` = $\left(\sum \limits_{g=1}^{N} A_g p_g\right)^2$<br>
<br>
`third_term` = $\frac{1}{\sum A_g}
\left(\sum \limits_{g=1}^{N} A_g p_g\right)^2$<br>
<br>
The function `weighted_stats` returns the weighted mean $\bar{D}_W$ and the weighted standard deviation $\sigma_{W}$.
%% Cell type:code id: tags:
``` python
def weighted_stats(A, D): # A is area, D is diameter; both are numpy arrays of the same shape
    sum_weights = A.sum()
    sum_product = (A * D).sum()
    weighted_mean = sum_product / sum_weights
    sigma_A_g = A.sum()
    sigma_A_g_squared = np.power(A.sum(), 2)
    sigma_A_square_g = (np.power(A, 2)).sum()
    first_term = sigma_A_g / (sigma_A_g_squared - sigma_A_square_g)
    second_term = (A * np.power(D, 2)).sum()
    third_term_parens = np.power((A * D).sum(), 2)
    third_term = third_term_parens / sigma_A_g
    weighted_std = np.sqrt(first_term * (second_term - third_term))
    return weighted_mean, weighted_std
```
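%% Cell type:markdown id: tags:
As a quick sanity check of these formulas (on synthetic numbers, not the grain data): with equal weights, $\bar{D}_W$ must reduce to the arithmetic mean and $\sigma_{W}$ to the sample (ddof = 1) standard deviation. The arrays below are hypothetical check values.
%% Cell type:code id: tags:
``` python
import numpy as np

# Hypothetical check values -- not the grain data.
D = np.array([2.0, 4.0, 6.0, 8.0])
A = np.ones_like(D) # equal weights

# Same formulas as weighted_stats, written out inline:
weighted_mean = (A * D).sum() / A.sum()
first_term = A.sum() / (A.sum()**2 - (A**2).sum())
weighted_std = np.sqrt(first_term * ((A * D**2).sum() - (A * D).sum()**2 / A.sum()))

# With equal weights the weighted statistics reduce to the unweighted ones:
print(np.isclose(weighted_mean, D.mean()), np.isclose(weighted_std, D.std(ddof=1)))
```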
%% Cell type:markdown id: tags:
### Now to generate the figure
There are a few sub-parts to this procedure. First, `jitter` and `alpha` are used because of the huge number of overlapping points at each $r/r_0$ value. We also create lists `mean_list`, `w_mean_list`, and others to hold the mean, weighted mean, and their sorted versions, so the trend lines can be plotted after the raw scatter points. Two different datasets fall at $r/r_0\approx$0.82; these are marked with an arrowed annotation for later analysis.
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(figsize=(10,4))
jitter=0.002 # adjustable here
mean_list = [] # will hold the means in the order imported
std_list = []
w_mean_list = [] # will hold weighted means
w_std_list = [] # weighted standard deviations
for i, j in enumerate(grain_list): # iterate over the grain diameter data
    X = j.shape[0]
    pos = pos_list[i]
    # np.random.uniform adds random jitter to the X-coordinate of each point;
    # adding the scalar pos broadcasts it across the jittered column.
    ax.plot(pos + np.random.uniform(low=-jitter,
                                    high=jitter,
                                    size=(X, 1)),
            j, 'k.', markersize=3, alpha=0.15)
    mean_list.append(j.mean()) # put the mean into mean_list
    std_list.append(j.std())
    WM, WS = weighted_stats(area_list[i], j) # weighted mean and standard deviation
    w_mean_list.append(WM)
    w_std_list.append(WS)
PL = np.array(pos_list) # turn the Python list pos_list into a numpy array.
# Numpy arrays are easier to extract the sort order from.
WML = np.array(w_mean_list)[PL.argsort()] # sorted (W)eighted (M)ean (L)ist
ML = np.array(mean_list)[PL.argsort()] # sorted (M)ean (L)ist
STD = np.array(std_list)[PL.argsort()] # sorted standard deviation
WSL = np.array(w_std_list)[PL.argsort()]
PL = np.array(pos_list)[PL.argsort()] # sorted (P)osition (L)ist
ax.plot(PL, WML,'r*-', linewidth=.8, markersize=9, label='Area-weighted mean')
ax.fill_between(PL, WML-WSL, WML+WSL, alpha=0.1, color='r', label=r'$\pm$1$\sigma$, weighted')
ax.plot(PL, ML, 'c*-', linewidth=.8, markersize=9, label='Arithmetic mean')
ax.fill_between(PL, ML-STD, ML+STD, alpha=0.1, color='c', label=r'$\pm$1$\sigma$, arithmetic')
# Plot appearance
ax.invert_xaxis()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(top=16)
L = ax.legend(loc=1, fontsize=10)
L.set_frame_on(False)
ax.set_ylabel(r'Grain diameter, $\mu$m')
ax.set_xlabel(r'Radial position, $r/r_0$')
# label the points for later analysis
ax.annotate('See\nhistograms', xy=(0.82, 6), xytext=(0.8, 7),
            arrowprops={'arrowstyle': '-|>', 'color': 'black'}, fontsize=12) # the 's=' keyword was removed in newer Matplotlib
fig.tight_layout()
# dump to disk
fig.savefig('grain_sizes.png', dpi=300)
fig.savefig('grain_sizes.svg', dpi=300)
fig.savefig('grain_sizes.pdf', dpi=300)
```
%% Output
%% Cell type:markdown id: tags:
### Now let's talk about standard deviation
Grain sizes are one example of a very non-normal distribution, and we would not expect standard deviation (for instance) to have much relevance for these distributions. There are two sets of grain diameters with the same (to two decimal places) $r/r_0$ value, $\approx$0.82. These are both plotted as histograms below.
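%% Cell type:markdown id: tags:
The problem with mean $\pm\sigma$ for skewed data can be illustrated with a synthetic lognormal sample (a stand-in, not the grain data): the "mean minus one sigma" bound falls below zero even though every value is strictly positive, and the $\pm1\sigma$ band captures far more than the $\approx$68% of points it would for normal data.
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a skewed size distribution -- not the grain data:
d = rng.lognormal(mean=0.0, sigma=1.0, size=100_000) # strictly positive, long right tail

m, s = d.mean(), d.std()
inside = np.mean((d > m - s) & (d < m + s)) # fraction of points within +/- 1 sigma

# For normal data ~68% of samples fall within one sigma of the mean. Here the
# lower bound m - s is negative even though every value is positive, and the
# +/- 1 sigma band captures far more than 68% of the points.
print(m - s < 0, inside > 0.68)
```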
%% Cell type:code id: tags:
``` python
L = [2, 15] # datasets with r/r_0 ~ 0.82
# get the weighted means
D_W = []
D_W_STD = []
# I'm calling weighted_stats() twice per dataset. This is inefficient but not worth
# optimizing for only two datasets.
D_W.append(weighted_stats(area_list[L[0]], grain_list[L[0]])[0])
D_W.append(weighted_stats(area_list[L[1]], grain_list[L[1]])[0])
D_W_STD.append(weighted_stats(area_list[L[0]], grain_list[L[0]])[1])
D_W_STD.append(weighted_stats(area_list[L[1]], grain_list[L[1]])[1])
#get the arithmetic means and stds
D_M = []
D_M.append(grain_list[L[0]].mean())
D_M.append(grain_list[L[1]].mean())
D_STD = []
D_STD.append(grain_list[L[0]].std())
D_STD.append(grain_list[L[1]].std())
#set some export parameters
color_list = ['xkcd:coral', 'xkcd:slate']
alpha_list=[0.4, 0.3]
#make the figure
fig, ax = plt.subplots(figsize=(10,4))
for i, j in enumerate(L):
    ax.hist(grain_list[j], alpha=alpha_list[i], density=True, bins=100, color=color_list[i]) # index with j to select datasets 2 and 15
ax.set_yscale('log')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# create strings for annotations.
data0 = r'$\bar{D}$ =' + f'{D_M[0]:2.3}' + r'$\pm$' \
+ f'{D_STD[0]:2.2}' + r' $\mu$m' + '\n' + \
r'$\bar{D}_W$=' + f'{D_W[0]:2.3}' + r'$\pm$' + f'{D_W_STD[0]:2.3}'+ r' $\mu$m' + '\n' + \
f'N={len(grain_list[L[0]])}'
data1 = r'$\bar{D}$ =' + f'{D_M[1]:2.3}' + r'$\pm$' \
+ f'{D_STD[1]:2.2}' + r' $\mu$m' + '\n' + \
r'$\bar{D}_W$=' + f'{D_W[1]:2.3}' + r'$\pm$' + f'{D_W_STD[1]:2.3}'+ r' $\mu$m' + '\n' + \
f'N={len(grain_list[L[1]])}'
# add the comments to the plot
ax.text(7.5, 0.65, data0, fontsize=14, color=color_list[0])
ax.text(7.5, 0.05, data1, fontsize=14, color=color_list[1])
ax.set_xlabel(r'Grain diameter, $\mu$m')
ax.set_ylabel('Relative fraction')
LW=1.2 # line width
Y = ax.get_ylim()[1] # find top of the chart to plot properly
ax.set_ylim(top=12)
# plot the arithmetic means
ax.plot([D_M[0], D_M[0]],
[0, Y],
linewidth=LW, color=color_list[0])
ax.plot([D_M[1], D_M[1]],
[0, Y],
linewidth=LW, color=color_list[1])
ax.text(x=grain_list[L[1]].mean(), y=Y*1.07, s=r'$\bar{D}$')
# plot the weighted means
ax.plot([D_W[0], D_W[0]],
[0, Y],
linewidth=LW, color=color_list[0])
ax.plot([D_W[1], D_W[1]],
[0, Y],
linewidth=LW, color=color_list[1])
ax.text(x=D_W[0] + 0.2, y=Y*1.07, s=r'$\bar{D}_W$')
fig.suptitle(r'Grain diameters at $r/r_0\approx$0.82', fontsize=14)
fig.tight_layout()
fig.savefig('grain_histograms.png', dpi=300)
fig.savefig('grain_histograms.svg', dpi=300)
```
%% Output
%% Cell type:markdown id: tags:
An important note: with `density=True`, Matplotlib's `pyplot.hist` does not normalize the bar heights to sum to 1.0; it normalizes so that the *integral* of the histogram (bar height times bin width) equals 1.0. This is why the peak of the plot above can exceed 1.0.
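%% Cell type:markdown id: tags:
This normalization can be verified directly with `np.histogram` (which `pyplot.hist` uses under the hood), here on a synthetic narrow normal sample so the density bars are tall:
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=0.1, size=10_000) # narrow spread -> tall density bars

heights, edges = np.histogram(x, bins=50, density=True)
integral = (heights * np.diff(edges)).sum() # bar height times bin width, summed

# The bar heights do not sum to 1; their integral does, so individual
# bars can exceed 1 when the bins are narrow.
print(heights.max() > 1.0, np.isclose(integral, 1.0))
```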