authors:
    corresponding: true
    affiliation: 1
  - name: Cecilia N. Clark
    orcid: 0000-0002-4826-2152
    affiliation: 1
affiliations:
  - name: Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America
bibliography: paper.bib
---

# Summary

We present PIPE, a novel, portable workflow enabling fully automated and parallel processing of large-scale data. Originally designed to process satellite imagery for geospatial analytics, PIPE supports the repeatable execution of many processing tasks across scientific domains. The PIPE configuration can codify almost any scripted project task: if a project can be built and run via a finite set of script commands without manual intervention, it can be configured as a standalone PIPE module.

# Statement of Need

Researchers who handle terabyte-scale datasets often rely on ad-hoc scripts bound to specific machines to process data, resulting in difficulties with high-performance processing, reproducibility, and peer review. Building a reliable and robust workflow is pertinent to the overall goals of processing voluminous and complex data. Developing a reliable pipeline provides data quality assurance by incorporating quality checks and validation stages to ensure outputs meet predefined accuracy and usability standards. Additionally, mechanisms to detect, report, and correct errors in the data or processing provide an additional layer of strength to the workflow. Overall, the community lacks a lightweight, stand-alone engine that brings high-level parallelism and strict reproducibility to any data domain without imposing new domain logic or infrastructure overhead.
# Features and Implementation

## Overview

PIPE provides a portable, straightforward, and trustworthy workflow designed with performance concerns and limitations in mind. Toward this goal, PIPE can be configured via flat files and executed in multiple cross-platform and cross-architecture contexts.

In general terms, PIPE is a command line interface (CLI) application and a collection of standards and conventions. The CLI application was developed in the Rust programming language, which provides unique benefits such as memory safety and “fearless” concurrency, turning many concurrency errors into compile-time errors that must be addressed before code execution.

Simply put, researchers specify any number of processing modules in a plain-text configuration file, which serves as the input to PIPE. Depending on the modules specified, PIPE then pulls version-locked modules from a trusted repository, for reproducibility, or loads editable local copies, so researchers can tailor the processing logic without altering the underlying workflow. Finally, PIPE prepares and executes the specified modules in parallel, identifying any fatal errors or verification issues before execution.

PIPE's fully modular design offers several advantages: scalability, flexibility, maintainability, and reusability. Throughout development, we applied methodologies and practices for sustainable scientific software research and development at scale, with an emphasis on automation, portability, and correctness.
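To make the flat-file workflow concrete, the fragment below sketches what a module specification might look like. It is illustrative only: the key names (`[[module]]`, `source`, `version`, `path`) and the TOML format are assumptions for this sketch, not PIPE's actual configuration schema.

```toml
# Hypothetical PIPE configuration (key names are illustrative, not the real schema).

# A version-locked module pulled from a trusted repository, for reproducibility.
[[module]]
name    = "ingest-imagery"
source  = "repository"
version = "1.4.2"

# An editable local copy, so the processing logic can be tailored in place.
[[module]]
name   = "cloud-mask"
source = "local"
path   = "./modules/cloud-mask"
```

A configuration like this would let PIPE resolve each module before execution, surfacing missing versions or paths as fatal errors up front rather than mid-run.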
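As a minimal illustration of the compile-time concurrency guarantees described above (a generic Rust sketch, not PIPE source code): the compiler requires shared mutable state to sit behind synchronization such as `Arc<Mutex<…>>`, so an unsynchronized data race on the counter below would be rejected at compile time rather than surfacing at runtime.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Increment one shared counter from `workers` threads, `per_worker` times each.
// Removing the Arc/Mutex here is a compile-time error, not a runtime race.
fn parallel_count(workers: u64, per_worker: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

fn main() {
    println!("total = {}", parallel_count(4, 1000));
}
```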
Specifically, the PIPE workflow benefits the research community by:

- Providing a modular, portable platform optimized for processing large-scale data
- Detecting fatal issues before execution, virtually eliminating non-productive runtime
- Constraining complexity to processing modules, which are resolved dynamically
- Running as a single executable available on multiple operating systems and architectures

# Acknowledgements

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.

# References

See [paper.bib](https://code.ornl.gov/GSHS/common/pipe/command/-/blob/paper/paper.bib?ref_type=heads)