Commit d70245b6 authored by syz

updated from master

parents 99556b27 7ffa2f29
{
"metadata": {
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"version": "3.5.2",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"codemirror_mode": {
"name": "ipython",
"version": 3
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"cells": [
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n====================================================================================================\nTutorial 2: Writing to pycroscopy H5 files\n====================================================================================================\n\n**Suhas Somnath**\n8/8/2017\n\nThis set of tutorials serves as examples for developing end-to-end workflows with pycroscopy.\n\nWhile pycroscopy contains many popular data processing functions, it may not have the function you need. Since pycroscopy\nis data-centric, it is preferable to write processing results back to the same file as well.\n\n**In this example, we will write the results of K-Means clustering (on a Scanning Tunnelling Spectroscopy (STS) dataset)\nback to the file.**\n\nK-Means clustering is a quick and simple method to determine the types of spectral responses present in the data and\ntheir spatial occurrence.\n\nIntroduction:\n=============\n\nData structuring and file format:\n=================================\n\n**Before proceeding with this example, we highly recommend you read about data formatting in pycroscopy as well as\nreading and writing to HDF5 files.**\n\nThis bookkeeping is necessary for helping the code understand the dimensionality and structure of the data. While\nthese rules may seem tedious, there are several functions and a few classes that make these tasks much easier.\n\nClasses for writing files\n=========================\n\nIn order to deal with the numerous challenges in writing data to the pycroscopy format in a consistent manner,\nespecially during translation, we developed two main classes: **MicroData** and **ioHDF5**.\n\nMicroData\n=========\n\nThe abstract class MicroData is extended by **MicroDataset** and **MicroDataGroup**, which are skeletal counterparts\nof the h5py.Dataset and h5py.Group classes respectively. These classes allow programmers to quickly and simply\nset up the tree structure that needs to be written to H5 files without having to worry about the low-level HDF5\nconstructs or the defensive programming strategies necessary for writing H5 files. Besides facilitating the\nconstruction of a tree structure, each of these classes has a few pycroscopy-specific features that ease file\nwriting.\n\nioHDF5\n======\n\nWhile we use **h5py** to read from pycroscopy files, the ioHDF5 class is used to write data to H5 files. ioHDF5\ntranslates the tree structure described by a MicroDataGroup object and writes the contents to H5 files in a\nstandardized manner. As a wrapper around h5py, it handles the low-level file I/O calls and includes defensive\nprogramming strategies to minimize issues with writing to H5 files.\n\nWhy bother with MicroData and ioHDF5?\n=====================================\n\n* These classes simplify the process of writing to H5 files considerably. The programmer only needs to construct\n the tree structure with simple python objects such as dictionaries for parameters, numpy arrays for storing data, etc.\n* It is easy to corrupt H5 files. ioHDF5 uses defensive programming strategies to avoid these problems.\n\nTranslation can be challenging in many cases:\n\n* It may not be possible to read the entire data from the raw data file to memory as we did in the tutorial on\n Translation.\n\n * ioHDF5 allows the general tree structure and the attributes to be written before the data is populated.\n\n* Sometimes, the raw data files do not come with sufficient parameters that describe the size and shape of the data.\n This makes it challenging to prepare the H5 file.\n\n * ioHDF5 allows the datasets to be set up first and filled with data later.\n\n* File I/O is expensive and we don't want to read the same raw data files multiple\n times.\n\n\n"
]
},
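{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# A minimal sketch of how MicroDataset and MicroDataGroup are combined into a tree that ioHDF5 can write.\n# The names 'Example_Group', 'Example_Data' and the attribute values below are made up purely for\n# illustration - the real tree for this tutorial is assembled further below.\nimport numpy as np\nimport pycroscopy as px\n\nsketch_data = np.random.rand(4, 8)  # a small placeholder matrix\nds_sketch = px.MicroDataset('Example_Data', sketch_data)  # skeletal stand-in for an h5py.Dataset\nds_sketch.attrs = {'quantity': 'Current', 'units': 'nA'}  # attributes are plain dictionaries\n\ngrp_sketch = px.MicroDataGroup('Example_Group', '/')  # skeletal stand-in for an h5py.Group at the file root\ngrp_sketch.attrs = {'illustrative_parameter': 42}\ngrp_sketch.addChildren([ds_sketch])  # assemble the tree structure\ngrp_sketch.showTree()  # inspect what would be written\n\n# Writing the tree would then be a single call through ioHDF5, for example:\n# hdf_sketch = px.ioHDF5('some_file.h5')\n# hdf_sketch.writeData(grp_sketch)"
]
},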
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# Ensure python 3 compatibility:\nfrom __future__ import division, print_function, absolute_import, unicode_literals\n\n# The package for accessing files in directories, etc.:\nimport os\n\n# Warning package in case something goes wrong\nfrom warnings import warn\n\n# Package for downloading online files:\ntry:\n # This package is not part of anaconda and may need to be installed.\n import wget\nexcept ImportError:\n warn('wget not found. Will install with pip.')\n import pip\n pip.main(['install', 'wget'])\n import wget\n\n# The mathematical computation package:\nimport numpy as np\n\n# The package used for creating and manipulating HDF5 files:\nimport h5py\n\n# Packages for plotting:\nimport matplotlib.pyplot as plt\n\n# Package for performing k-Means clustering:\nfrom sklearn.cluster import KMeans\n\n# Finally import pycroscopy for certain scientific analysis:\ntry:\n import pycroscopy as px\nexcept ImportError:\n warn('pycroscopy not found. Will install with pip.')\n import pip\n pip.main(['install', 'pycroscopy'])\n import pycroscopy as px\nfrom pycroscopy.io.translators.omicron_asc import AscTranslator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading the dataset\n===================\n\nWe will start by downloading the raw data file as generated by the microscope and then translate the file to a\npycroscopy H5 file.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# download the raw data file from Github:\ndata_file_path = 'temp_2.asc'\nurl = 'https://raw.githubusercontent.com/pycroscopy/pycroscopy/master/data/STS.asc'\nif os.path.exists(data_file_path):\n os.remove(data_file_path)\n_ = wget.download(url, data_file_path, bar=None)\n\n# Translating from raw data to h5:\ntran = AscTranslator()\nh5_path = tran.translate(data_file_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the H5 dataset\n======================\n\nThis data is a Scanning Tunnelling Spectroscopy (STS) dataset wherein current was measured as a function of voltage\non a two dimensional grid of points. Thus, the data has three dimensions (X, Y, Bias). Note that, in pycroscopy, all\nposition dimensions are collapsed onto the first axis and all spectroscopic dimensions (only bias in this case)\nare collapsed onto the second axis of a two dimensional matrix. So, the data is represented as (position, bias)\ninstead.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# opening the file:\nhdf = px.ioHDF5(h5_path)\nh5_file = hdf.file\n\n# Visualize the tree structure in the file\nprint('Tree structure within the file:')\npx.hdf_utils.print_tree(h5_file)\n\n# Extracting some parameters that will be necessary later on:\nh5_meas_grp = h5_file['Measurement_000']\nnum_cols = int(px.hdf_utils.get_attr(h5_meas_grp, 'x-pixels'))\nnum_rows = int(px.hdf_utils.get_attr(h5_meas_grp, 'y-pixels'))\n\n# There are multiple ways of accessing the Raw_Data dataset. Here's one approach:\nh5_main = h5_meas_grp['Channel_000/Raw_Data']\n\n# Prepare the label for plots:\ny_label = px.hdf_utils.get_attr(h5_main, 'quantity') + ' [' + px.hdf_utils.get_attr(h5_main, 'units') + ']'\n\n# Get the voltage vector that this data was acquired as a function of:\nh5_spec_vals = px.hdf_utils.getAuxData(h5_main, 'Spectroscopic_Values')[0]\nvolt_vec = np.squeeze(h5_spec_vals[()])\n\n# Get the descriptor for this\nx_label = px.hdf_utils.get_attr(h5_spec_vals, 'labels')[0] + ' [' + px.hdf_utils.get_attr(h5_spec_vals, 'units')[0] + ']'\n\n# Currently, the data is within the h5 dataset. We need to read this to memory:\ndata_mat = h5_main[()]\n\nprint('\\nData now loaded to memory and is of shape:', data_mat.shape)\nprint('Data has', num_rows, 'rows and', num_cols, 'columns each having a',\n data_mat.shape[1], 'long measurement of', y_label,'as a function of', x_label)"
]
},
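{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# A quick, purely illustrative check of the (position x bias) convention described above: the flattened\n# matrix can be unfolded back into (rows, columns, bias) form with a plain numpy reshape, matching the\n# row-major ordering that the plotting cells below rely on.\ndata_3d = data_mat.reshape(num_rows, num_cols, -1)\nprint('Flattened shape:', data_mat.shape, '-> unfolded shape:', data_3d.shape)\n\n# For example, the pixel at row 3, column 5 corresponds to position index 3 * num_cols + 5:\npixel_index = 3 * num_cols + 5\nprint('Same spectrum recovered?', np.allclose(data_mat[pixel_index], data_3d[3, 5]))"
]
},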
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performing k-Means Clustering:\n==============================\n\nNow that the data is loaded to memory, we can perform k-Means clustering on data_mat. As a reminder, K-Means\nclustering is a quick and simple method to determine the types of spectral responses present in the data and their\nspatial occurrence.\n\nLet us assume that we have a `P x S` dataset with `P` positions each with spectra that are `S` long. When K-Means\nis asked to identify `k` clusters, it will produce two results:\n* cluster_centers: This contains the different kinds of spectral responses and is represented as a two dimensional\narray of the form [cluster number, representative spectra for this cluster]. Thus this dataset will have a shape\nof `k x S`\n* labels: This provides the information about which spatial pixel belongs to which group. It will be a\n1 dimensional array of size `P` wherein the value for each element in the array (cluster id for each pixel) will\nbe within `[0, k)`\n\n**Our goal is to write back these two datasets to the H5 file**\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"num_clusters = 9\n\n# Now, we can perform k-Means clustering:\nestimators = KMeans(num_clusters)\nresults = estimators.fit(data_mat)\n\nprint('K-Means Clustering performed on the dataset of shape', data_mat.shape,\n 'resulted in a cluster centers matrix of shape', results.cluster_centers_.shape,\n 'and a labels array of shape', results.labels_.shape)\n\n\"\"\"\nBy default, the clusters identified by K-Means are NOT arranged according to their relative \ndistances to each other. Visualizing and interpreting this data is challenging. We will sort the \nresults using a handy function already in pycroscopy:\n\"\"\"\nlabels, centroids = px.processing.cluster.reorder_clusters(results.labels_, results.cluster_centers_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the results:\n======================\n\nWe will visualize both the raw results from k-Means as well as the distance-sorted results from pycroscopy.\nYou will notice that the sorted results are easier to understand and interpret. This is an example of the kind of\nadditional value that can be packed into pycroscopy wrappers on existing data analysis / processing functions.\n\nA second example of value addition - The pycroscopy wrapper for Clustering handles real, complex, and compound\nvalued datasets seamlessly in the background.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"px.plot_utils.plot_cluster_results_together(np.reshape(results.labels_, (num_rows, num_cols)),\n results.cluster_centers_, spec_val=volt_vec, cmap=plt.cm.inferno,\n spec_label=x_label, resp_label=y_label);\n\npx.plot_utils.plot_cluster_results_together(np.reshape(labels, (num_rows, num_cols)),\n centroids, spec_val=volt_vec, cmap=plt.cm.inferno,\n spec_label=x_label, resp_label=y_label);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preparing to write results\n==========================\n\nThe two datasets we need to write back to the H5 file are the `centroids` and `labels` matrices. Both the\n`centroids` and `labels` matrices satisfy the condition to be elevated to the status of **`main`** datasets.\nHowever, in order to be recognized as **`main`** datasets, they need the four ancillary datasets to go along with\nthem. Recall that the main datasets only need to store references to the ancillary datasets and that we do not\nneed to store copies of the same ancillary datasets if multiple main datasets use them.\n\nHere, we will refer to the dataset on which K-Means was performed as the **`source`** dataset.\n\nIdentifying the ancillary datasets:\n===================================\n\n* `centroids`:\n\n * Spectroscopic Indices and Values: Since the `source` dataset and the `centroids` dataset both contain the\n same spectral information, the `centroids` dataset can simply reuse the ancillary spectroscopic datasets used by\n the `source` dataset.\n * Position Indices and Values: The `centroids` dataset has `k` instances while the `source` dataset has `P`,\n so we need to create a new position indices dataset and a new position values dataset for `centroids`.\n\n* `labels`:\n\n * Spectroscopic Indices and Values: Unlike the `source` dataset that has spectra of length `S`, this dataset\n only has a single value (cluster index) at each location. Consequently, `labels` needs two new ancillary datasets.\n * Position Indices and Values: Since both `source` and `labels` have the same number of positions and the\n positions mean the same quantities for both datasets, we can simply reuse the ancillary datasets from `source`\n for `labels`.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reshape the matrices to the correct dimensions\n===============================================\n\n1. Since `labels` is a main dataset, it needs to be a two dimensional matrix of size `P x 1`.\n2. The `Spectroscopic` ancillary datasets for `labels` need to be of the form `dimension x points`. Since the\n spectroscopic axis of `labels` is only one deep, `labels` has only one spectroscopic dimension which itself has\n just one point. Thus the `Spectroscopic` matrix should be of size `1 x 1`.\n3. The `centroids` matrix is already of the form: `position x spectra`, so it does not need any reshaping.\n4. The `Position` ancillary datasets for `centroids` need to be of the form `points x dimensions` as well.\n\nIn this case, `centroids` has `k` positions all in one dimension. Thus the matrix needs to be reshaped to `k x 1`.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"ds_labels_spec_inds, ds_labels_spec_vals = px.io.translators.utils.build_ind_val_dsets([1], labels=['Label'])\nds_cluster_inds, ds_cluster_vals = px.io.translators.utils.build_ind_val_dsets([centroids.shape[0]], is_spectral=False,\n labels=['Cluster'])\nlabels_mat = np.uint32(labels.reshape([-1, 1]))\n\n# Rename the datasets\nds_labels_spec_inds.name = 'Label_Spectroscopic_Indices'\nds_labels_spec_vals.name = 'Label_Spectroscopic_Values'\nds_cluster_inds.name = 'Cluster_Indices'\nds_cluster_vals.name = 'Cluster_Values'\n\nprint('Spectroscopic Dataset for Labels', ds_labels_spec_inds.shape)\nprint('Position Dataset for Centroids', ds_cluster_inds.shape)\nprint('Centroids',centroids.shape)\nprint('Labels', labels_mat.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the Main MicroDataset objects\n====================================\nRemember that it is important to either inherit or add the `quantity` and `units` attributes to each **main** dataset\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# The two main datasets\nds_label_mat = px.MicroDataset('Labels', labels_mat, dtype=np.uint32)\n# Adding the mandatory attributes\nds_label_mat.attrs = {'quantity': 'Cluster ID', 'units': 'a. u.'}\n\nds_cluster_centroids = px.MicroDataset('Mean_Response', centroids, dtype=h5_main.dtype)\n# Inheriting / copying the mandatory attributes\npx.hdf_utils.copy_main_attributes(h5_main, ds_cluster_centroids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the group that will contain these datasets\n==================================================\nWe will be appending data to the existing h5 file and, since HDF5 uses a tree structure to store information, we\nneed to specify where to add the sub-tree that we are building.\n\nRecall that the name of the DataGroup provides information about the operation that has been performed on the\n`source` dataset. Therefore, we need to be careful about naming the group.\n\nIt is also important to add relevant information about the operation. For example, the name of our operation\nis `Cluster`, analogous to the `SkLearn` package organization. Thus, the name of the algorithm - `k-Means` - needs\nto be written as an attribute of the group as well.\n\nOccasionally, the same operation may be performed multiple times on the same dataset with different parameters.\nIn the case of K-Means, this may be the number of clusters. pycroscopy allows all these results to be stored instead\nof being overwritten by appending an index number to the end of the group name. Thus, one could have a tree\nthat contains the following groups:\n* Raw_Data-Cluster_000 <--- K-means with 9 clusters\n* Raw_Data-Cluster_001 <--- Agglomerative clustering\n* Raw_Data-Cluster_002 <--- K-means again with 4 clusters\n\nLeaving a '_' at the end of the group name will instruct ioHDF5 to look for the last instance of the same\noperation being performed on the same dataset. The index will then be updated accordingly.\n\n"
]
},
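{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# The trailing underscore convention can be illustrated with plain python. This is NOT how ioHDF5 is\n# implemented internally - just a sketch of the book-keeping it performs when appending an index to the\n# group name.\ndef next_group_name(existing_names, basename='Raw_Data-Cluster_'):\n    # indices already used by groups that share this basename\n    used = [int(name.split(basename)[-1]) for name in existing_names if name.startswith(basename)]\n    next_index = max(used) + 1 if used else 0\n    return '{}{:03d}'.format(basename, next_index)\n\nprint(next_group_name([]))  # -> Raw_Data-Cluster_000\nprint(next_group_name(['Raw_Data-Cluster_000', 'Raw_Data-Cluster_001']))  # -> Raw_Data-Cluster_002"
]
},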
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"source_dset_name = h5_main.name.split('/')[-1]\noperation_name = 'Cluster'\n\nsubtree_root_path = h5_main.parent.name[1:]\n\ncluster_grp = px.MicroDataGroup(source_dset_name + '-' + operation_name + '_',\n subtree_root_path)\nprint('New group to be created with name:', cluster_grp.name)\nprint('This group (subtree) will be appended to the H5 file under the group:', subtree_root_path)\n\n# Making a tree structure by adding the MicroDataset objects as children of this group\ncluster_grp.addChildren([ds_label_mat, ds_cluster_centroids, ds_cluster_inds, ds_cluster_vals, ds_labels_spec_inds,\n ds_labels_spec_vals])\n\nprint('\\nWill write the following tree:')\ncluster_grp.showTree()\n\ncluster_grp.attrs['num_clusters'] = num_clusters\ncluster_grp.attrs['num_samples'] = h5_main.shape[0]\ncluster_grp.attrs['cluster_algorithm'] = 'KMeans'\n\n# Get the parameters of the KMeans object that was used and write them as attributes of the group\nfor parm in estimators.get_params().keys():\n cluster_grp.attrs[parm] = estimators.get_params()[parm]\n\nprint('\\nWriting the following attributes to the group:')\nfor at_name in cluster_grp.attrs:\n print(at_name, ':', cluster_grp.attrs[at_name])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write to H5 and access the written objects\n==========================================\n\nOnce the tree is prepared (previous cell), ioHDF5 will handle all the file writing.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"h5_clust_refs = hdf.writeData(cluster_grp, print_log=True)\n\nh5_labels = px.hdf_utils.getH5DsetRefs(['Labels'], h5_clust_refs)[0]\nh5_centroids = px.hdf_utils.getH5DsetRefs(['Mean_Response'], h5_clust_refs)[0]\nh5_clust_inds = px.hdf_utils.getH5DsetRefs(['Cluster_Indices'], h5_clust_refs)[0]\nh5_clust_vals = px.hdf_utils.getH5DsetRefs(['Cluster_Values'], h5_clust_refs)[0]\nh5_label_inds = px.hdf_utils.getH5DsetRefs(['Label_Spectroscopic_Indices'], h5_clust_refs)[0]\nh5_label_vals = px.hdf_utils.getH5DsetRefs(['Label_Spectroscopic_Values'], h5_clust_refs)[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the H5 file contents now\n================================\nCompare this tree with the one printed earlier. The new group and datasets should be apparent\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"px.hdf_utils.print_tree(h5_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make `centroids` and `labels` -> `main` datasets\n================================================\nWe elevate the status of these datasets by linking them to the four ancillary datasets. This part is also made\nrather easy by a few pycroscopy functions.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# we already got the reference to the spectroscopic values in the first few cells\nh5_spec_inds = px.hdf_utils.getAuxData(h5_main, 'Spectroscopic_Indices')[0]\n\npx.hdf_utils.checkAndLinkAncillary(h5_labels,\n ['Position_Indices', 'Position_Values'],\n h5_main=h5_main)\npx.hdf_utils.checkAndLinkAncillary(h5_labels,\n ['Spectroscopic_Indices', 'Spectroscopic_Values'],\n anc_refs=[h5_label_inds, h5_label_vals])\n\npx.hdf_utils.checkAndLinkAncillary(h5_centroids,\n ['Spectroscopic_Indices', 'Spectroscopic_Values'],\n anc_refs=[h5_spec_inds, h5_spec_vals])\n\npx.hdf_utils.checkAndLinkAncillary(h5_centroids,\n ['Position_Indices', 'Position_Values'],\n anc_refs=[h5_clust_inds, h5_clust_vals])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why bother with all this?\n=========================\n\n* Though long, this simple file writing procedure needs to be written once for a given data analysis / processing tool\n* The general nature of this Clustering algorithm facilitates the application to several other datasets\n regardless of their origin\n* Once the data is written in the pycroscopy format, it is possible to apply other data analytics operations\n to the datasets with a single line\n* Generalized versions of visualization algorithms can be written to visualize clustering results quickly.\n\nHere is an example of very quick visualization with effectively just a single parameter - the group containing\nclustering results. The ancillary datasets linked to `labels` and `centroids` instructed the code about the\nspatial and spectroscopic dimensionality and enabled it to automatically render the plots below\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"px.plot_utils.plot_cluster_h5_group(h5_labels.parent, '');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleanup\n=======\nDeletes the temporary files created in the example\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"os.remove(data_file_path)\nhdf.close()\nos.remove(h5_path)"
]
}
],
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
@@ -500,22 +500,22 @@ operation being performed on the same dataset. The index will then be updated accordingly

Measurement_000/Channel_000/Raw_Data-Cluster_/Label_Spectroscopic_Values

Writing the following attributes to the group:
num_clusters : 9
cluster_algorithm : KMeans
-timestamp : 2017_11_29-10_05_46
+timestamp : 2017_11_16-14_42_00
n_init : 10
n_jobs : 1
max_iter : 300
precompute_distances : auto
algorithm : auto
random_state : None
verbose : 0
machine_id : challtdow-ThinkPad-T530
copy_x : True
n_clusters : 9
tol : 0.0001
num_samples : 10000
init : k-means++

Write to H5 and access the written objects
@@ -546,21 +546,21 @@ Once the tree is prepared (previous cell), ioHDF5 will handle all the file writing

Out::

Created group /Measurement_000/Channel_000/Raw_Data-Cluster_000
-Writing attribute: timestamp with value: 2017_11_29-10_05_46
+Writing attribute: timestamp with value: 2017_11_16-14_42_00
Writing attribute: n_init with value: 10
Writing attribute: n_jobs with value: 1
Writing attribute: algorithm with value: auto
Writing attribute: num_clusters with value: 9
Writing attribute: verbose with value: 0
Writing attribute: num_samples with value: 10000
Writing attribute: tol with value: 0.0001
Writing attribute: init with value: k-means++
Writing attribute: cluster_algorithm with value: KMeans
Writing attribute: max_iter with value: 300
Writing attribute: precompute_distances with value: auto
Writing attribute: machine_id with value: challtdow-ThinkPad-T530
Writing attribute: copy_x with value: True
Writing attribute: n_clusters with value: 9
Wrote attributes to group: Raw_Data-Cluster_000
Created Dataset /Measurement_000/Channel_000/Raw_Data-Cluster_000/Labels
@@ -724,7 +724,7 @@ Deletes the temporary files created in the example

-**Total running time of the script:** ( 2 minutes 28.277 seconds)
+**Total running time of the script:** ( 1 minutes 18.893 seconds)
{
"metadata": {
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"version": "3.5.2",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"codemirror_mode": {
"name": "ipython",
"version": 3
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"cells": [
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n========================================================================================================\nTutorial 3: Handling Multidimensional datasets\n========================================================================================================\n\n**Suhas Somnath**\n8/8/2017\n\nThis set of tutorials serves as examples for developing end-to-end workflows with pycroscopy.\n\n**In this example, we will learn how to slice multidimensional datasets.**\n\nIntroduction\n============\n\nIn pycroscopy, all position dimensions of a dataset are collapsed into the first dimension and all other\n(spectroscopic) dimensions are collapsed into the second dimension to form a two dimensional matrix. The ancillary\nmatrices, namely the spectroscopic indices and values matrices as well as the position indices and values matrices,\nwill be essential for reshaping the data back to its original N dimensional form and for slicing multidimensional\ndatasets.\n\nWe highly recommend reading about the pycroscopy data format - available in the docs.\n\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# Ensure python 3 compatibility:\nfrom __future__ import division, print_function, absolute_import, unicode_literals\n\n# The package for accessing files in directories, etc.:\nimport os\n\n# Warning package in case something goes wrong\nfrom warnings import warn\n\n# Package for downloading online files:\ntry:\n # This package is not part of anaconda and may need to be installed.\n import wget\nexcept ImportError:\n warn('wget not found. Will install with pip.')\n import pip\n pip.main(['install', 'wget'])\n import wget\n\n# The mathematical computation package:\nimport numpy as np\n\n# The package used for creating and manipulating HDF5 files:\nimport h5py\n\n# Packages for plotting:\nimport matplotlib.pyplot as plt\n\n# basic interactive widgets:\nfrom ipywidgets import interact\n\n# Finally import pycroscopy for certain scientific analysis:\ntry:\n import pycroscopy as px\nexcept ImportError:\n warn('pycroscopy not found. Will install with pip.')\n import pip\n pip.main(['install', 'pycroscopy'])\n import pycroscopy as px"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the dataset\n================\n\nFor this example, we will be working with a Band Excitation Polarization Switching (BEPS)\ndataset acquired from advanced atomic force microscopes. In the much simpler Band Excitation (BE)\nimaging datasets, a single spectrum is acquired at each location in a two dimensional grid of spatial locations.\nThus, BE imaging datasets have two position dimensions (X, Y) and one spectroscopic dimension (frequency - against\nwhich the spectrum is recorded). The BEPS dataset used in this example has a spectrum for each combination of\nthree other parameters (DC offset, Field, and Cycle). Thus, this dataset has three new spectral\ndimensions in addition to the frequency axis. Hence, this dataset becomes a 2+4 = 6 dimensional dataset.\n\n"
]
},
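{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# A quick numpy illustration (with a tiny made-up array, not the real BEPS data) of how a six dimensional\n# dataset collapses into the two dimensional (position x spectroscopic) form used by pycroscopy:\nimport numpy as np\n\nnum_x, num_y = 3, 2  # two position dimensions\nnum_freq, num_dc, num_field, num_cycle = 5, 4, 2, 2  # four spectroscopic dimensions\nraw_6d = np.random.rand(num_x, num_y, num_freq, num_dc, num_field, num_cycle)\n\n# All position dimensions fold into the first axis and all spectroscopic dimensions into the second:\nraw_2d = raw_6d.reshape(num_x * num_y, num_freq * num_dc * num_field * num_cycle)\nprint('6D shape:', raw_6d.shape, '-> 2D shape:', raw_2d.shape)"
]
},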
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# download the raw data file from Github:\nh5_path = 'temp_3.h5'\nurl = 'https://raw.githubusercontent.com/pycroscopy/pycroscopy/cades_dev/data/BEPS_small.h5'\nif os.path.exists(h5_path):\n os.remove(h5_path)\n_ = wget.download(url, h5_path, bar=None)"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# Open the file in read-only mode\nh5_file = h5py.File(h5_path, mode='r')\n\nprint('Datasets and datagroups within the file:\\n------------------------------------')\npx.hdf_utils.print_tree(h5_file)"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"h5_meas_grp = h5_file['Measurement_000']\nh5_main = h5_meas_grp['Channel_000/Raw_Data']\nprint('\\nThe main dataset:\\n------------------------------------')\nprint(h5_main)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main dataset clearly does not provide the multidimensional information about the data that is necessary to\nslice the data. For that, we need the ancillary datasets that support this main dataset.\n\n"
]
},
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# pycroscopy has a convenient function to access datasets linked to a given dataset:\nh5_spec_ind = px.hdf_utils.getAuxData(h5_main, 'Spectroscopic_Indices')[0]\nh5_spec_val = px.hdf_utils.getAuxData(h5_main, 'Spectroscopic_Values')[0]\nh5_pos_ind = px.hdf_utils.getAuxData(h5_main, 'Position_Indices')[0]\nh5_pos_val = px.hdf_utils.getAuxData(h5_main, 'Position_Values')[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Understanding the ancillary datasets:\n=====================================\n\nThe position datasets are shaped as [spatial points, dimension] while the spectroscopic datasets are shaped as\n[dimension, spectral points]. Clearly, the first axis of the position dataset and the second axis of the spectroscopic\ndatasets match the corresponding sizes of the main dataset.\n\nAgain, the sum of the position and spectroscopic dimensions results in the 6 dimensions originally described above.\n\nEssentially, there is a unique combination of position and spectroscopic parameters for each cell in the two\ndimensional main dataset. The interactive widgets below illustrate this point. The first slider represents the\nposition dimension while the second represents the spectroscopic dimension. Each position index can be decoded\ninto a set of X and Y indices and values, while each spectroscopic index can be decoded into a set of frequency,\nDC offset, field, and cycle parameters.\n\n"
]
},
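{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false
},
"source": [
"# A minimal sketch of the decoding described above: pick an arbitrary cell of the main dataset and look up\n# the position and spectroscopic indices it corresponds to. The row and column numbers below are arbitrary\n# examples, and the 'labels' attribute is assumed to be present on the ancillary index datasets.\npos_index = 7  # a row of the main dataset\nspec_index = 19  # a column of the main dataset\n\npos_labels = px.hdf_utils.get_attr(h5_pos_ind, 'labels')\nspec_labels = px.hdf_utils.get_attr(h5_spec_ind, 'labels')\n\nprint('Position index', pos_index, 'decodes to:')\nfor dim_name, dim_ind in zip(pos_labels, h5_pos_ind[pos_index, :]):\n    print(' ', dim_name, '->', dim_ind)\n\nprint('Spectroscopic index', spec_index, 'decodes to:')\nfor dim_name, dim_ind in zip(spec_labels, h5_spec_ind[:, spec_index]):\n    print(' ', dim_name, '->', dim_ind)"
]
},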
{
"execution_count": null,
"cell_type": "code",
"outputs": [],
"metadata": {
"collapsed": false