# 'svg.fonttype': 'none' forces Matplotlib to keep the text in the .SVG file as real,
# editable text elements, rather than converting it to path outlines that merely look like text.
```
%% Cell type:markdown id: tags:
### Read in the data
The data has previously been exported from the vendor-specific data formats into an HDF5 ".h5" file called `grain_file.h5`. HDF5 is a preferred format for the transfer and archiving of scientific data because it is readable on all operating systems (Unix/Linux, Mac, Windows) and has APIs for all common (and most uncommon) programming languages.<br>
<br>
The data is from the publication: Gerczak et al., Restructuring in high burnup UO$_2$ studied using modern electron microscopy, <b>Journal of Nuclear Materials</b>, 2018<br>
<br>
Here, we use the `h5py` library as a Python interface to HDF5. We read each h5 "dataset" into a numpy array `D` and append this array to a list `grain_list`. We then read in the position, in units of normalized radius $r/r_0$, and append it to a list `pos_list`. Finally, we calculate an equivalent grain area `A` from the grain diameter `D` (using $A = \frac{1}{4}\cdot \pi D^2$) and store this numpy array in a list `area_list`.
%% Cell type:code id: tags:
``` python
grain_list=[]# grain-diameter arrays, one per radial position
pos_list=[]# normalized radial positions r/r_0
area_list=[]# equivalent grain areas
h5_handle=h5py.File('grain_file.h5','r')
for i in h5_handle:# iterate over the h5 datasets (reconstructed loop)
    D=h5_handle[i][()]# read the dataset into a numpy array
    pos=h5_handle[i].attrs['Position']# .attrs reads attributes in the h5 format
    grain_list.append(D)
    pos_list.append(pos)
    A=0.25*(np.pi*D*D)# A = (1/4)*pi*D^2
    area_list.append(A)
```
%% Cell type:markdown id: tags:
### Calculate the weighted average and standard deviation
See, for example, https://edaxblog.com/2014/06/23/time-for-a-change-new-perspectives-in-grain-size-analysis/ for a discussion of grain size "averages" in EBSD. We are using the definition: <br>
\begin{equation}
\bar{D}_W =\frac{\sum A_g \cdot p_g}{\sum A_g}
\end{equation} <br>
for the area-weighted mean grain diameter $\bar{D}_W$. This defines $A_g$ as the area $A$ of grain $g$ and $p_g$ as the parameter of interest (here, the diameter) of grain $g$. We also use the simple arithmetic mean (average), $\bar{D}$, trivially calculated with `np.mean()` on any numpy array, along with the standard deviation $\sigma$ from `np.std()`.<br>
<br>
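As a quick numerical illustration of how the area weighting biases the mean toward large grains (the two diameters below are made up for illustration, not taken from the dataset):

```python
import numpy as np

D = np.array([1.0, 10.0])            # two hypothetical grain diameters
A = 0.25 * np.pi * D**2              # equivalent areas, A = (1/4)*pi*D^2
D_bar = D.mean()                     # arithmetic mean: 5.5
D_bar_W = np.sum(A * D) / np.sum(A)  # area-weighted mean: ~9.91
```

Because $A_g \propto D_g^2$, the single large grain dominates the weighted mean.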
The weighted standard deviation (From communication with Dr. S. Wright, EDAX):<br><br>
\begin{equation}
\sigma_{W} = \sqrt{
\frac{\sum{A_g}}{(\sum{A_g})^2 - \sum{(A_g)^2}}
\left[
\left(\sum \limits_{g=1}^{N} A_g p_g^2\right) -
\frac{1}{\sum A_g}
\left(\sum \limits_{g=1}^{N} A_g p_g\right)^2
\right]
}
\end{equation}<br>
<br>
We have defined:<br>`first_term` = $\frac{\sum{A_g}}{(\sum{A_g})^2 - \sum{(A_g)^2}}$<br>
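Putting the two formulas together, a `weighted_stats` helper (called in the cell below) might look like the following sketch; the notebook's actual implementation is defined elsewhere and may differ in detail:

```python
import numpy as np

def weighted_stats(A, p):
    """Area-weighted mean and standard deviation of parameter p (a sketch)."""
    A = np.asarray(A, dtype=float).ravel()
    p = np.asarray(p, dtype=float).ravel()
    w_mean = np.sum(A * p) / np.sum(A)
    # first_term = sum(A_g) / ((sum A_g)^2 - sum(A_g^2)), as defined above
    first_term = np.sum(A) / (np.sum(A)**2 - np.sum(A**2))
    w_var = first_term * (np.sum(A * p**2) - np.sum(A * p)**2 / np.sum(A))
    return w_mean, np.sqrt(w_var)
```

With all weights equal, this reduces to the ordinary unbiased (n−1) sample standard deviation, which is a useful sanity check.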
There are a few sub-parts to this procedure. First, `jitter` and `alpha` are used due to the huge number of overlapping points at each $r/r_0$ value. We also create lists `mean_list`, `w_mean_list`, and others to hold the mean, weighted mean, and sorted versions, to plot the trend lines after plotting the raw scatter points. Two different datasets are at $r/r_0\approx$0.82, and are marked with an arrowed annotation for later analysis.
%% Cell type:code id: tags:
``` python
fig,ax=plt.subplots(figsize=(10,4))
jitter=0.002# adjustable here
mean_list=[]# will hold the means in the order imported
std_list=[]
w_mean_list=[]# will hold weighted means
w_std_list=[]# weighted standard deviations
for i,j in enumerate(grain_list):# iterate over the grain diameter data
X=j.shape[0]
pos=pos_list[i]
# np.random.uniform adds random jitter to the X-coordinate for each.
ax.plot(np.random.uniform(low=-jitter,
high=jitter,
size=(X,1))+matlib.repmat(pos,X,1),
j,'k.',markersize=3,alpha=0.15)
mean_list.append(j.mean())# put the mean into mean_list
std_list.append(j.std())
WM,WS=weighted_stats(area_list[i],j)# weighted mean and standard deviation
w_mean_list.append(WM)
w_std_list.append(WS)
PL=np.array(pos_list)# turn the Python list pos_list into a numpy array.
# Numpy arrays are easier to extract the sort order from.
```
%% Cell type:markdown id: tags:
Grain sizes are one example of a very non-normal distribution, so we would not expect the standard deviation (for instance) to be particularly meaningful for these distributions. There are two sets of grain diameters with the same (to two decimal places) $r/r_0$ value, $\approx$0.82. These are both plotted as histograms below.
%% Cell type:code id: tags:
``` python
L=[2,15]# indices of the two datasets with r/r_0 ~ 0.82
# get the weighted means
D_W=[]
D_W_STD=[]
# I'm calling weighted_stats() twice per dataset. This is inefficient, but not
# worth optimizing for only two datasets.
# (Reconstructed plotting code; the bin count and styling are assumptions.)
fig,ax=plt.subplots(figsize=(10,4))
for i in L:
    WM,WS=weighted_stats(area_list[i],grain_list[i])
    D_W.append(WM)
    D_W_STD.append(WS)
    # density=True normalizes each histogram to integrate (not sum) to 1.0
    ax.hist(grain_list[i],bins=50,density=True,alpha=0.5)
fig.suptitle(r'Grain diameters at $r/r_0\approx$0.82',fontsize=14)
fig.tight_layout()
fig.savefig('grain_histograms.png',dpi=300)
fig.savefig('grain_histograms.svg',dpi=300)
```
%% Output
%% Cell type:markdown id: tags:
An important note: with `density=True`, Matplotlib's `pyplot.hist` does not normalize the bar heights to *sum* to 1.0; it normalizes the histogram to *integrate* to 1.0. This is why the peak of the plot above can exceed 1.0.
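The same normalization is used by `np.histogram`, which makes the point easy to verify (a standalone check with synthetic data, not part of the original analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
# a narrow distribution, so the probability density rises well above 1.0
samples = rng.normal(loc=0.0, scale=0.1, size=1000)
counts, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)

print(np.sum(counts * widths))  # the histogram integrates to 1.0
print(counts.max())             # but individual bar heights exceed 1.0
```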