.. _DataFilesForTesting:
======================
Data Files for Testing
======================
.. contents::
:local:
Summary
#######
This page gives an overview of how data files are managed within Mantid.
Motivation
##########
Some unit tests use a small amount of data that is created by the test
harness and others load data from a file. Take the example of
``ApplyCalibrationTest``. In its first test, ``testSimple``, it creates a
workspace with
``WorkspaceCreationHelper::create2DWorkspaceWithFullInstrument()``. In
the second test, ``testComplex``, it reads the file
**IDFs_for_UNIT_TESTING/MAPS_Definition_Reduced.xml**, which contains
the definition of a MAPS instrument with the number of detectors greatly
reduced so that it loads quickly while preserving the other properties of
the instrument. However, new tests should avoid even this kind of file
loading unless there is a strong justification for it.
Main issues:
- need to store data, mainly for testing, alongside the code
- some data needs to be versioned
- merging system tests back with main code requires handling large data
files
- git is bad at handling binary files
Possible solutions:
- don't have any reference to data in git and force developers to
manage the data stored on a file server
- extensions to git, e.g.
`git-fat <https://github.com/jedbrown/git-fat>`__,
`git-annex <https://git-annex.branchable.com/>`__ to deal with large
files
- CMake's
`ExternalData <http://www.kitware.com/source/home/post/107>`__
We have chosen to use CMake as it is already in use as a build system
and it doesn't involve introducing extra work with git.
CMake's External Data
#####################
.. figure:: images/ExternalDataSchematic.png
:alt: Image originated at http://www.kitware.com/source/home/post/107
:align: center
Image originated at http://www.kitware.com/source/home/post/107
Terminology:
- content - the real data
- content link - text file containing a hash (MD5) of the real content.
The filename is the filename of the real data plus the ``.md5``
extension
- object - a file that stores the real data and whose name is the MD5
hash of the content
Overview:
- git does not store any content, it only stores content links
- content is stored on a remote server that can be accessed via a
``http`` link
- running cmake sets up build rules so that the content is downloaded
when dependent projects are built
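The relationship between content, content link and object can be illustrated with a short Python sketch. It is not part of CMake or Mantid: the function name ``describe_external_data`` is hypothetical, and it only derives the names involved for some in-memory content.

.. code-block:: python

   import hashlib


   def describe_external_data(filename, content):
       """Return the names used for ``content`` stored under ``filename``.

       ``filename`` is the name of the real data file and ``content`` its bytes.
       This helper is hypothetical and only illustrates the terminology.
       """
       md5 = hashlib.md5(content).hexdigest()
       return {
           "content_link": filename + ".md5",  # small text file kept in git
           "link_text": md5,                   # what the content link contains
           "object": md5,                      # name of the file holding the data
       }


   if __name__ == "__main__":
       names = describe_external_data("INST12345.nxs", b"example bytes")
       for key, value in names.items():
           print("{0}: {1}".format(key, value))

Only the file named by ``content_link`` would be committed to git; the file named by ``object`` lives on the remote server and in the local object store.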
Local Object Store
##################
CMake does not download content directly but stores the content in a
*Local Object Store*, whose location is defined by the
``ExternalData_OBJECT_STORES`` CMake variable. This allows it to share
content between build trees, especially useful for continuous
integration servers.
Binary Root
###########
The final step is to create the *real* filename and link it to the
object in the local object store: a symbolic link on Linux/Mac, a copy
on Windows. The location of the *real* filenames is controlled by the
``ExternalData_BINARY_ROOT`` CMake variable and defaults to
``build/ExternalData``.
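The link-or-copy step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how CMake implements it: the function name ``place_real_file`` is hypothetical, and the fallback to a copy whenever a symbolic link cannot be created is an assumption that mirrors the Windows behaviour described above.

.. code-block:: python

   import os
   import shutil


   def place_real_file(object_path, real_path):
       """Make ``real_path`` refer to the data stored in ``object_path``.

       Tries a symbolic link first and falls back to a plain copy.
       Returns "symlink" or "copy" to report what was done.
       """
       parent = os.path.dirname(os.path.abspath(real_path))
       os.makedirs(parent, exist_ok=True)

       # Remove a stale link or copy left over from an earlier run
       if os.path.lexists(real_path):
           os.remove(real_path)

       try:
           os.symlink(os.path.abspath(object_path), real_path)
           return "symlink"
       except (OSError, NotImplementedError):
           # Symbolic links unavailable or not permitted: copy instead
           shutil.copy2(object_path, real_path)
           return "copy"

Either way, a test that opens the *real* filename reads the bytes held by the object in the store.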
Using Existing Data
###################
There are two places files may be found:
- `../mantid/Testing/Data/UnitTest <https://github.com/mantidproject/mantid/tree/master/Testing/Data/UnitTest>`__
- `../mantid/instrument/IDFs_for_UNIT_TESTING <https://github.com/mantidproject/mantid/tree/master/instrument/IDFs_for_UNIT_TESTING>`__
for test `IDF <IDF>`__ files
.. _DataFilesForTesting_AddingANewFile:
Adding A New File(s)
####################
A helper git command called ``add-test-data`` is defined. It is called
like this:
.. code-block:: sh
git add-test-data Testing/Data/UnitTest/INST12345.nxs
This does the following:
- computes the MD5 hash of the data, e.g.
``d6948514d78db7fe251efb6cce4a9b83``
- stores the MD5 hash in a file called
``Testing/Data/UnitTest/INST12345.nxs.md5``
- renames the original data file to
``Testing/Data/UnitTest/d6948514d78db7fe251efb6cce4a9b83``
- runs ``git add Testing/Data/UnitTest/INST12345.nxs.md5``
- tells the user to upload the file(s),
``d6948514d78db7fe251efb6cce4a9b83``, to the remote store (URL:
http://198.74.56.37/ftp/external-data/upload)
- re-run cmake
Notes:
- Under Windows you need to use a shell to add and modify data files in
  this way. Not every shell works as described, though the `Github for
  Windows <https://windows.github.com/>`__ shell allows you to do
  everything described here step by step without deviations.
  Unfortunately, the MINGW32 shell that you have to select for this is
  not the most convenient shell under Windows. In addition, the
  ``add-test-data`` script is currently broken (at least it was on
  20/11/2015). It is therefore suggested to use the small Python script
  provided below, which calculates the MD5 hash, creates the ``.md5``
  file and renames your test or reference file to the calculated hash.
  You then have to put the ``.md5`` file manually in the required
  reference data location and add it to git by the usual means. The
  file named after the hash should be uploaded, as on Unix, to the
  `remote store <http://198.74.56.37/ftp/external-data/upload>`__
- Note that ILL test data should be placed under ``ILL/${INSTRUMENT}``
  subdirectories (e.g. ``ILL/IN16B``), and the file name should not
  contain any instrument prefix.
Developer Setup
###############
**You need cmake 2.8.11+**
To add the ``add-test-data`` command alias to git run
.. code-block:: sh
git config alias.add-test-data '!bash -c "tools/Development/git/git-add-test-data $*"'
in the git bash shell. The single quotes are important so that bash
passes the exclamation mark and ``$*`` through to git unchanged instead
of expanding them itself.
It is advised that CMake is told where to put the "real" data as the
default is ``$HOME/MantidExternalData`` on Linux/Mac or
``C:/MantidExternalData`` on Windows. Over time the store will grow so
it is recommended that it be placed on a disk with a large amount of
space. CMake uses the ``MANTID_DATA_STORE`` variable to define where the
data is stored.
Example cmake command:
Linux/Mac:
.. code-block:: sh

   mkdir -p build
   cd build
   cmake -DMANTID_DATA_STORE=/home/mgigg/Data/LocalObjectStore ../Code/Mantid
Windows:
.. code-block:: sh

   mkdir build
   cd build
   cmake -DMANTID_DATA_STORE=D:/Data/LocalObjectStore ../Code/Mantid
Setting With Dropbox:
This is for people in the ORNL dropbox share and has the effect of
reducing external network traffic. There is a
`gist <http://gist.github.com/peterfpeterson/638490530e37c3d8dba5>`__
for getting dropbox running on linux.
.. code-block:: sh
mkdir build
cmake -DMANTID_DATA_STORE=/home/mgigg/Dropbox\ \(ORNL\)/MantidExternalData ../Code/Mantid
If you don't want to define ``MANTID_DATA_STORE`` every time you run
cmake, you can link the default data store location to the Dropbox one.
.. code-block:: sh
ln -s ~/Dropbox\ \(ORNL\)/MantidExternalData ~
Proxy Settings
--------------
If you are behind a proxy server then the shell or Visual Studio
needs to know about it. You must set the ``http_proxy``
environment variable to ``http://HOSTNAME:PORT``.
On Windows you go to **Control Panel->System and
Security->System->Advanced System settings->Environment Variables** and
click **New...** to add a variable.
On Linux/Mac you will need to set the variable in the shell profile or
on Linux you can set it system wide in ``/etc/environment``.
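To confirm that the variable is visible to processes started from your environment, a quick check can be run in Python. This is only a convenience sketch: the function name ``configured_http_proxy`` is hypothetical, and it relies on the standard library's ``urllib.request.getproxies()``, which reads ``http_proxy`` from the environment. It says nothing about whether CMake itself will reach the server.

.. code-block:: python

   import os
   import urllib.request


   def configured_http_proxy():
       """Return the HTTP proxy URL picked up from the environment, or None."""
       return urllib.request.getproxies().get("http")


   if __name__ == "__main__":
       proxy = configured_http_proxy()
       if proxy is None:
           print("No http proxy is configured")
       else:
           print("http proxy: {0}".format(proxy))

If the script reports no proxy in the shell from which you run cmake, the variable has not been exported to that shell.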
Troubleshooting
---------------
If you find that your tests cannot find the data they require check the
following gotchas:
- Check that you have uploaded the original file renamed as a hash to
the Mantid file repository
- Check that you have re-run CMake in the build directory.
- Check that you have removed any user defined data search directories
in ~/.mantid
- Check that you have rebuilt the test executable you're trying to run
- Check that you have rebuilt the SystemTestData target
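Some of these checks can be partly automated. The sketch below lists content links whose object is not yet present in the local object store. It is illustrative only: the function name ``find_missing_objects`` is hypothetical, and it assumes that objects are stored as ``<store>/MD5/<hash>``, which should be verified against your own store before relying on the result.

.. code-block:: python

   import os


   def find_missing_objects(data_dir, store_dir):
       """Return (content link path, md5) pairs that have no object in the store.

       Assumes the store keeps objects as ``<store_dir>/MD5/<md5>``.
       """
       missing = []
       for root, _dirs, files in os.walk(data_dir):
           for name in sorted(files):
               if not name.endswith(".md5"):
                   continue
               link_path = os.path.join(root, name)
               with open(link_path) as f:
                   md5 = f.read().strip()
               if not os.path.isfile(os.path.join(store_dir, "MD5", md5)):
                   missing.append((link_path, md5))
       return missing

For example, ``find_missing_objects("Testing/Data/UnitTest", os.path.expanduser("~/MantidExternalData"))`` would report links that have not been downloaded yet, which can indicate that cmake or the build has not been re-run or that the upload is missing.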
Python script to produce hash files
-----------------------------------
.. code:: python

   #!/usr/bin/env python3
   """Hash a data file, write its .md5 content link and rename it to the hash."""
   import hashlib
   import os
   import sys


   def md5sum(filename, blocksize=65536):
       """Calculate the MD5 hash sum of the given file."""
       digest = hashlib.md5()
       with open(filename, "rb") as f:
           # Read in blocks so that large files need not fit in memory
           for block in iter(lambda: f.read(blocksize), b""):
               digest.update(block)
       return digest.hexdigest()


   def save_sum(filename, md_sum):
       """Save the hash sum into a file named filename + '.md5'."""
       md_fname = filename + ".md5"
       # The file must be opened for writing; the default mode is read-only
       with open(md_fname, "w") as f:
           f.write(md_sum)


   if __name__ == "__main__":
       if len(sys.argv) < 2 or not os.path.isfile(sys.argv[1]):
           print("Usage: hash_file.py file_name")
           sys.exit(1)

       filename = sys.argv[1]
       path, fname = os.path.split(filename)
       hash_sum = md5sum(filename)
       print("MD SUM FOR FILE: {0} is {1}".format(fname, hash_sum))

       # Save the hash sum in a file with the original file name plus .md5
       save_sum(os.path.join(path, fname), hash_sum)

       # Rename the hashed file to its hash sum
       hash_file = os.path.join(path, hash_sum)
       if os.path.isfile(hash_file):
           print("file: {0} already exists".format(hash_sum))
       else:
           os.rename(filename, hash_file)