Commit 93da4fe8 authored by Clark, Niki

Edit paper.md

parent 528258a0
+47 −5
@@ -11,7 +11,7 @@ authors:
    corresponding: true
    affiliation: 1
  - name: Cecilia N. Clark
    orcid: https://orcid.org/0000-0002-4826-2152
    orcid: 0000-0002-4826-2152
    affiliation: 1
affiliations:
  - name: Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
@@ -22,10 +22,52 @@ bibliography: paper.bib
---

# Summary
> 🚧 Under construction
We present PIPE, a novel, portable workflow enabling fully automated and parallel processing of large-scale data. Originally 
designed to process satellite imagery for geospatial analytics, PIPE supports the repeatable execution of many processing tasks 
across scientific domains. A PIPE configuration can codify almost any scripted project task: if a project can be built and 
run via a finite set of script commands without manual intervention, it can be configured as a standalone PIPE module.

# Statement of need
> 🚧 Under construction
# Statement of Need
Researchers who handle terabyte-scale datasets often rely on ad-hoc scripts bound to specific machines to process data, 
which hinders high-performance processing, reproducibility, and peer review. A reliable and robust workflow is essential 
to processing voluminous and complex data. Such a pipeline provides 
data quality assurance by incorporating quality checks and validation stages to ensure outputs meet predefined accuracy and 
usability standards. Additionally, mechanisms to detect, report, and correct errors in the data or processing make the 
workflow more robust. Overall, the community lacks a lightweight, stand-alone engine that brings 
high-level parallelism and strict reproducibility to any data domain without imposing new domain logic or infrastructure 
overhead. 

# Features and Implementation

## Overview
PIPE provides a portable, straightforward, trustworthy workflow that accounts for performance concerns and limitations. 
Toward this goal, PIPE is configured via flat files and can be executed across multiple platforms and architectures. 
In general terms, PIPE is a command line interface (CLI) application and a collection of standards and conventions. 
The CLI application was developed in the Rust programming language, which provides benefits such as memory safety and 
“fearless” concurrency, turning many concurrency errors into compile-time errors that must be addressed before code 
execution.  
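
As an illustration of the compile-time guarantee described above, the following is a minimal sketch using only the Rust standard library. It is not taken from the PIPE code base; the function name `parallel_sum` and the chunking strategy are assumptions made for the example.

```rust
use std::thread;

/// Hypothetical helper (not part of PIPE): sum a slice by giving each
/// worker thread its own read-only chunk.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    // Guard against zero workers and against a zero chunk length,
    // because `chunks(0)` would panic.
    let workers = workers.max(1);
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow `data` because the scope joins every
    // thread before it returns.
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        // If the workers instead tried to mutate one shared accumulator
        // without a Mutex or an atomic, the borrow checker would reject
        // the program at compile time rather than allow a data race.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

Calling `parallel_sum(&[1, 2, 3, 4, 5], 2)` splits the slice into chunks of three and two elements and adds the two partial sums.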

Simply put, researchers specify any number of processing modules in a plain-text configuration file, which serves as the 
input to PIPE. Depending on the module specified, PIPE then either pulls version-locked modules from a trusted repository, 
for reproducibility, or loads editable local copies, so researchers can tailor the processing logic without altering the 
underlying workflow. Finally, PIPE prepares the specified modules, identifies any fatal errors or verification issues 
before execution, and executes the modules in parallel.  
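
The flow described above (read a flat file, resolve each module as pinned or local, validate everything up front, then run in parallel) can be sketched as follows. This is an illustrative sketch only: the paper does not describe PIPE's actual configuration syntax or internal types, so the `name@version` and `local:<path>` line formats, the `ModuleSource` enum, and the functions `parse_line`, `load_config`, and `run_all` are all hypothetical.

```rust
use std::thread;

/// Hypothetical module source; PIPE's real types are not given in the paper.
#[derive(Debug, Clone, PartialEq)]
enum ModuleSource {
    /// Version-locked module pulled from a trusted repository.
    Pinned { name: String, version: String },
    /// Editable local copy.
    Local { path: String },
}

/// Parse one line of an invented flat-file syntax:
/// `name@version` for a pinned module, `local:<path>` for a local copy.
fn parse_line(line: &str) -> Result<ModuleSource, String> {
    if let Some(path) = line.strip_prefix("local:") {
        if path.is_empty() {
            return Err(format!("empty local path in `{line}`"));
        }
        return Ok(ModuleSource::Local { path: path.to_string() });
    }
    match line.split_once('@') {
        Some((name, version)) if !name.is_empty() && !version.is_empty() => {
            Ok(ModuleSource::Pinned {
                name: name.to_string(),
                version: version.to_string(),
            })
        }
        _ => Err(format!("expected `name@version` or `local:<path>`, got `{line}`")),
    }
}

/// Validate the whole configuration first, collecting every error,
/// so that no module starts when any entry is invalid.
fn load_config(text: &str) -> Result<Vec<ModuleSource>, Vec<String>> {
    let mut modules = Vec::new();
    let mut errors = Vec::new();
    let entries = text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'));
    for line in entries {
        match parse_line(line) {
            Ok(module) => modules.push(module),
            Err(error) => errors.push(error),
        }
    }
    if errors.is_empty() { Ok(modules) } else { Err(errors) }
}

/// Run each module on its own scoped thread. The body is a stand-in that
/// only reports what would run; results come back in configuration order.
fn run_all(modules: &[ModuleSource]) -> Vec<String> {
    thread::scope(|scope| {
        let handles: Vec<_> = modules
            .iter()
            .map(|module| {
                scope.spawn(move || match module {
                    ModuleSource::Pinned { name, version } => format!("ran {name} {version}"),
                    ModuleSource::Local { path } => format!("ran local {path}"),
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("module thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

Separating `load_config` from `run_all` mirrors the idea of catching fatal issues before any processing time is spent: a malformed entry such as `fetch` (no version) makes `load_config` return an error list, and `run_all` is never reached.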

As a fully modular workflow, PIPE offers scalability, flexibility, maintainability, and reusability. 
Throughout its development, we followed methodologies and practices for sustainable scientific 
software research and development at scale, with emphasis on automation, portability, and correctness. 
Specifically, the PIPE workflow benefits the research community by: 
- Providing a modular, portable platform optimized for processing large-scale data 
- Detecting fatal issues before execution, reducing non-productive runtime 
- Constraining complexity to processing modules, which are resolved dynamically 
- Running as a single executable available on multiple operating systems and architectures 

# Acknowledgements
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. 
The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United 
States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form 
of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide 
public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.

# References
> 🚧 Under construction
See [paper.bib](https://code.ornl.gov/GSHS/common/pipe/command/-/blob/paper/paper.bib?ref_type=heads)