Commit 46e84478 authored by Unknown

Docstrings and example formatting fixes

parent d717d140
......@@ -25,8 +25,8 @@ Software Prerequisites:
* **cvxopt** for fully constrained least squares fitting
* install in a terminal via **`conda install -c https://conda.anaconda.org/omnia cvxopt`**
* **pycroscopy** : Though pycroscopy is mainly used here for plotting purposes only, its true capabilities
  are realized through the ability to seamlessly perform these analyses on any imaging dataset (regardless
  of origin, size, or complexity) and to store the results back into the same dataset, among other things
"""
......@@ -69,13 +69,14 @@ import pycroscopy as px
#
# In this example, we will work on a **Band Excitation Piezoresponse Force Microscopy (BE-PFM)** imaging dataset
# acquired from advanced atomic force microscopes. In this dataset, a spectrum was collected for each position in a
# two dimensional grid of spatial locations. Thus, this is a three dimensional dataset that has been flattened to a
# two dimensional matrix in accordance with the pycroscopy data format.
#
# Fortunately, all statistical analysis, machine learning, and spectral unmixing algorithms accept data
# formatted in this same two dimensional [position x spectra] manner.
#
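# As a quick illustration (synthetic shapes, not the real data), this flattening is just a numpy reshape from
# a three dimensional array to the two dimensional [position x spectra] matrix:
import numpy as np
example_3d = np.random.rand(4, 5, 128)       # (rows, cols, spectrum length) - made-up sizes
example_2d = example_3d.reshape(4 * 5, 128)  # (positions, spectrum length)
#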
# We will begin by downloading the BE-PFM dataset from GitHub
#
data_file_path = 'temp_um.h5'
# download the data file from GitHub:
......
......@@ -216,7 +216,7 @@ h5_path = tran.translate(h5_path, raw_data_2d, num_rows, num_cols,
# * Steps 1-3 would be performed anyway in order to begin data analysis
# * The actual pycroscopy translation step is reduced to just 3-4 lines in step 4.
# * While this approach is feasible and encouraged for simple and small data, it may be necessary to use lower level
#   calls to write efficient translators
#
# Verifying the newly written H5 file:
# ====================================
......
......@@ -57,17 +57,21 @@ Why bother with Microdata and ioHDF5?
=====================================
* These classes simplify the process of writing to H5 files considerably. The programmer only needs to construct
  the tree structure with simple python objects such as dictionaries for parameters, numpy datasets for storing data, etc.
* It is easy to corrupt H5 files. ioHDF5 uses defensive programming strategies to solve these problems.
Translation can be challenging in many cases:
* It may not be possible to read the entire data from the raw data file to memory as we did in the tutorial on
  Translation
* ioHDF5 allows the general tree structure and the attributes to be written before the data is populated.
* Sometimes, the raw data files do not come with sufficient parameters that describe the size and shape of the data.
  This makes it challenging to prepare the H5 file.
  * ioHDF5 allows datasets to be created first and populated with data at a later time
* File I/O is expensive and we don't want to read the same raw data files multiple times
"""
......@@ -224,18 +228,22 @@ px.plot_utils.plot_cluster_results_together(np.reshape(labels, (num_rows, num_co
#
# Identifying the ancillary datasets:
# ===================================
#
# * `centroids`:
#
# * Spectroscopic Indices and Values: Since the `source` dataset and the `centroids` datasets both contain the
#   same spectral information, the `centroids` dataset can simply reuse the ancillary spectroscopic datasets used by
#   the `source` dataset.
# * Position Indices and Values: The `centroids` dataset has `k` instances while the `source` dataset has `P`,
#   so we need to create new position indices and position values datasets for `centroids`
#
# * `labels`:
#
# * Spectroscopic Indices and Values: Unlike the `source` dataset that has spectra of length `S`, this dataset
#   only has a single value (cluster index) at each location. Consequently, `labels` needs two new ancillary datasets
# * Position Indices and Values: Since both `source` and `labels` have the same number of positions and the
#   positions mean the same quantities for both datasets, we can simply reuse the ancillary dataset from `source`
#   for `labels`
#
###############################################################################
......@@ -244,10 +252,11 @@ px.plot_utils.plot_cluster_results_together(np.reshape(labels, (num_rows, num_co
#
# 1. Since `labels` is a main dataset, it needs to be a two dimensional matrix of size `P x 1`
# 2. The `Spectroscopic` ancillary datasets for `labels` need to be of the form `dimension x points`. Since the
#    spectroscopic axis of `labels` is only one deep, `labels` has only one spectroscopic dimension which itself has
#    just one point. Thus the `Spectroscopic` matrix should be of size `1 x 1`
# 3. The `centroids` matrix is already of the form: `position x spectra`, so it does not need any reshaping
# 4. The `Position` ancillary datasets for `centroids` need to be of the form `points x dimensions` as well.
#
# In this case, `centroids` has `k` positions all in one dimension. Thus the matrix needs to be reshaped to `k x 1`
ds_labels_spec_inds, ds_labels_spec_vals = px.io.translators.utils.build_ind_val_dsets([1], labels=['Label'])
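###############################################################################
# As a companion sketch for item 4 above (assuming the number of clusters is held in a variable like
# `num_clusters`, and that this helper's `is_spectral` flag works as in this version), the same helper can
# build the position datasets for `centroids` by marking the dimension as a position axis instead:
ds_cent_pos_inds, ds_cent_pos_vals = px.io.translators.utils.build_ind_val_dsets([num_clusters], is_spectral=False,
                                                                                 labels=['Cluster'])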
......@@ -382,11 +391,12 @@ px.hdf_utils.checkAndLinkAncillary(h5_centroids,
###############################################################################
# Why bother with all this?
# =========================
#
# * Though long, this simple file writing procedure needs to be written once for a given data analysis / processing tool
# * The general nature of this Clustering algorithm facilitates its application to several other datasets
#   regardless of their origin
# * Once the data is written in the pycroscopy format, it is possible to apply other data analytics operations
#   to the datasets with a single line
# * Generalized versions of visualization algorithms can be written to visualize clustering results quickly.
#
# Here is an example of very quick visualization with effectively just a single parameter - the group containing
......
......@@ -217,8 +217,11 @@ print('Positions:', pos_dim_sizes, '\nSpectroscopic:', spec_dim_sizes)
#
# Let's assume that we are interested in visualizing the spectrograms at the first field of the second cycle at
# the position row 3, column 2. There are two ways of accessing the data:
#
# 1. The easier method - reshape the data to N dimensions and slice the dataset
#
# * This approach, while trivial, may not be suitable for large datasets that cannot fit in memory
#
# 2. The harder method - find the spectroscopic and position indices of interest and slice the 2D dataset
#
# Approach 1 - N-dimensional form
......@@ -233,7 +236,7 @@ print(labels)
#########################################################################
# Now that we have the data in its original N dimensional form, we can easily slice the dataset:
spectrogram = ds_nd[2, 3, :, 0, :, 1]
# Now the spectrogram is of order (frequency x DC_Offset).
spectrogram = spectrogram.T
# Now the spectrogram is of order (DC_Offset x frequency)
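###############################################################################
# Approach 2 - slicing the flattened 2D dataset directly. A sketch only: the dimension order is assumed from
# the N-dimensional slice above, and `h5_main`, `h5_pos_inds` and `h5_spec_inds` stand for the main dataset
# and its ancillary index datasets:
import numpy as np
# positions are rows of h5_main: keep column 2, row 3 (column order in h5_pos_inds assumed)
pos_mask = np.logical_and(h5_pos_inds[:, 0] == 2, h5_pos_inds[:, 1] == 3)
# spectral points are columns of h5_main: keep Field 0 and Cycle 1, all frequencies and DC offsets
spec_mask = np.logical_and(h5_spec_inds[1, :] == 0, h5_spec_inds[3, :] == 1)
# the same spectrogram values as Approach 1, still flattened along the spectroscopic axis
data_2d = h5_main[pos_mask, :][:, spec_mask]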
......
......@@ -69,6 +69,7 @@ sho32 = np.dtype({'names': field_names,
# 5. _write_results_chunk - writes the computed results back to the file
#
# Note that:
#
# * Only the code specific to this process needs to be implemented. However, the generic portions common to most
# Processes will be handled by the Process class.
# * The other functions, such as sho_function and sho_fast_guess, are specific to this process. These have
......@@ -78,7 +79,7 @@ sho32 = np.dtype({'names': field_names,
# function. The additional code to turn this operation into a Pycroscopy Process is actually rather minimal. As
# described earlier, the goal of the Process class is to modularize and compartmentalize the main sections of the code
# in order to facilitate faster and more robust implementation of data processing algorithms.
#
class ShoGuess(px.Process):
......
......@@ -3,6 +3,7 @@
Created on Tue Jan 05 07:55:56 2016
@author: Suhas Somnath, Chris Smith
"""
from __future__ import division, print_function, absolute_import
import numpy as np
......@@ -19,6 +20,7 @@ from ..io.microdata import MicroDataGroup, MicroDataset
class Cluster(object):
"""
Pycroscopy wrapper around the sklearn.cluster classes.
"""
def __init__(self, h5_main, method_name, num_comps=None, *args, **kwargs):
......@@ -26,14 +28,15 @@ class Cluster(object):
Constructs the Cluster object
Parameters
----------
h5_main : HDF5 dataset object
Main dataset with ancillary spectroscopic, position indices and values datasets
method_name : string / unicode
Name of the sklearn.cluster estimator
num_comps : (optional) unsigned int
Number of features / spectroscopic indices to be used to cluster the data. Default = all
args and kwargs : arguments to be passed to the estimator
"""
allowed_methods = ['AgglomerativeClustering', 'Birch', 'KMeans',
......
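# Usage sketch (hypothetical call, assuming `h5_main` is a main dataset in an open pycroscopy h5 file and
# that this version of the class exposes a `do_cluster` method):
#
#     estimator = Cluster(h5_main, 'KMeans', n_clusters=4)
#     h5_results_group = estimator.do_cluster()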
......@@ -34,7 +34,8 @@ class Decomposition(object):
Name of the sklearn.decomposition estimator
n_components : (Optional) unsigned int
Number of components for decomposition
args and kwargs : arguments to be passed to the estimator
"""
if n_components is not None:
......@@ -103,6 +104,7 @@ class Decomposition(object):
Returns
-------
None
"""
if data is None:
if self.method_name == 'NMF':
......