Commit 1fd8ffec authored by Carson, Audrey

fix(paper): Final copy edits and trimming

parent e707002f
+11 −11
@@ -23,7 +23,7 @@ affiliations:
  - name: Oak Ridge National Laboratory, Oak Ridge, Tennessee, TN, United States of America
    index: 1
    ror: 01qz5mb56
-date: 29 July 2025
+date: 22 August 2025
bibliography: paper.bib
---

@@ -31,31 +31,31 @@ bibliography: paper.bib
Accessible Content Optimization for Research Needs (ACORN) applies standardization, automation, linked data, and institutional knowledge to research activity data (RAD) to create actionable insights and ultimately enable new research. ACORN is a command line multitool that creates analysis-ready data from RAD. It can also run on remote continuous integration servers for shared RAD repositories. ACORN employs automated processes for informing and/or enforcing defined content schemas to create standardized, highly structured data. Because of its standardized data source, ACORN easily applies computer automation to generate communication assets such as PDFs, PPTs, and web pages. Built using memory-safe Rust, ACORN is portable and accessible for use on any Windows, Mac, or Linux machine. ACORN's standardized approach ingests and maintains data in a consistent format to enable immediate analysis and use, building progressively more powerful datasets. 

# Statement of need
-Communicating research can be challenging — from the high-level overview of a research institution, down to singular projects within that institution. Science data systems created to help communicate research are often isolated and/or specialized to individual suborganizations, teams, or domains. Research communication is further complicated by a lack of consistency and documentation in research data and metadata, preventing external audiences, such as jobseekers, policymakers, funders, and the general public from finding the information they need, despite federal guidance for clear, consistent documentation.[@Lin,2020],[OSTP:2022] True innovation requires reusable systems that can standardize data across domain boundaries and serve as a nexus for scientists, developers, and communicators.[@Sochat:2018],[@Puebla:2024]
+Communicating research can be challenging — from the high-level overview of a research institution, down to singular projects within that institution. Science data systems created to help communicate research are often isolated and/or specialized to individual suborganizations, teams, or domains. Research communication is further complicated by a lack of consistency and documentation in research data and metadata, preventing external audiences, such as jobseekers, policymakers, funders, and the general public, from finding the information they need, despite federal guidance for clear, consistent documentation.[@Lin,2020],[OSTP:2022] True innovation requires reusable systems that can standardize data across domain boundaries and serve as a nexus for scientists, developers, and communicators.[@Sochat:2018],[@Puebla:2024]

-Traditional research practices are built on antiquated, habitual processes. These are particularly dangerous in an environment where publishing is critical to career survival.[@Grimes:2018] Researchers may be tempted to do the bare minimum, skip steps, and pursue sensational or novel paths in the name of journal acceptance and credibility.[@van_Dalen:2012] These practices have led to the twin reproducibility [@Baker:2016] and replicability [@Camerer:2018] crises.
+Traditional research practices are built on antiquated, habitual processes. These are particularly troublesome in an environment where publishing is critical to career survival.[@Grimes:2018] Researchers may be tempted to do the bare minimum, skip steps, and pursue sensational or novel paths in the name of journal acceptance and credibility.[@van_Dalen:2012] These practices have led to the twin reproducibility [@Baker:2016] and replicability [@Camerer:2018] crises.

-Conducting tustworthy research is hard work. Cataloging and analyzing collections of research prove another challenge, and institutions should hold research metadata to a high standard of replicability, accessibility, and ensure they are an integral part of data systems.[@Baca:2016],[OSTP:2022] Automation and data architecture, enabled through ACORN, can make it easier.
+Conducting trustworthy research is hard work. Cataloging and analyzing collections of research prove another challenge. Institutions should hold research metadata to a high standard of replicability and accessibility and ensure they are an integral part of organizational data systems.[@Baca:2016],[OSTP:2022] Automation and data architecture, enabled through ACORN, can make this easier.

-ACORN enables quick analysis of research project portfolios, allowing decision-makers to pick and pull solutions for execution, sponsor discussions, and mission applications. ACORN has three main outputs: analysis-ready data applicable to AI/ML research; target artifacts: from the ACORN-enabled content process that creates a single source of truth for research activity data from which users can generate communication artifacts; and understanding: maintaining data in the same format for programmatic analysis and enhanced understanding and better application of AI/ML practices. This collection of tools allows researchers to leverage the benefits of connected data and automate numerous tasks essential to science and communication.
+ACORN enables quick analysis of research project portfolios, allowing decision-makers to pick and pull solutions for execution, sponsor discussions, and mission applications. This collection of tools allows researchers to leverage the benefits of connected data and automate numerous tasks essential to science and communication.

# Research Activity Data Workflows
-At the intersection of research and communications, research activity data describes an identifiable package of work involving organized, systematic investigation. ACORN helps capture, standardize, and analyze research at the project level, one organization at a time. 
+At the intersection of research and communication, research activity data describes an identifiable package of work involving organized, systematic investigation. ACORN helps capture, standardize, and analyze research at the project level. 

![ACORN works with unique data schemas and applies automation to analyze, format, export, and download research activity data. Its outputs include 1) analysis-ready data: highly standardized data from the automated check process immediately applicable to AI/ML research; 2) target artifacts: communication pieces such as PDFs and web pages; 3) programmatic data analysis made easier through standardized, structured data.](./figures/acorn_input_output.png)

![Within an organization, ACORN allows research activity data to be documented, analyzed, and publicized - even for projects without publications. By entering at the project level, ACORN provides unique visibility into a new set of variables that incrementally begin to document the broader scientific ecosystem through widespread adoption.](./figures/acorn_workflow.png)

# Future Work
-ACORN's standardization allows great scalability and accessiblity possibilities. By publishing the RAD contents in a structured way, including description of mission, challenge, approach, and impact, along with metadata, ACORN can help build knowledge graphs of particular research institutions, domains, and even the larger scientific community. Incorporating linked data through machine-readable, standards-based [JSON-LD]( https://json-ld.org/) can further build out a RAD knowledgebase with domain-specific ontologies [@Ciccarese:2008],[@Tan:2025],[@Wu:2023] that enable complex queries and automated inference.[@Tan:2025] 
+ACORN's standardization allows great scalability and accessiblity possibilities. By publishing the RAD contents in a structured way, including description of mission, challenge, approach, and impact, along with metadata, ACORN can help build knowledge graphs of research institutions, domains, and the larger scientific community. Incorporating linked data through machine-readable, standards-based [JSON-LD]( https://json-ld.org/) can further build out a RAD knowledge base with domain-specific ontologies [@Ciccarese:2008],[@Tan:2025],[@Wu:2023] that enable complex queries and automated inference.[@Tan:2025] 

-Another piece of the research communications puzzle to address is representing often-verbalized and much-cited claims to bring provenance to RAD such as journal articles and fact sheets. This process can prove time-consuming and often results in redundant content, as multiple researchers try build novel claims on the same foundation of truth.
+Another piece of the research communications puzzle to address is how to represent often-verbalized and much-cited claims to bring provenance to RAD such as journal articles and fact sheets. This process can prove time-consuming and redundant, as multiple researchers try to build novel claims on the same foundation of truth.

-Nanopublications can serve as a standard framework to represent these claims, providing a concentrated source of truth for RAD. Nanopublications can bring finer granularity to RAD-adjacent information, including instrument, device, and physical technology offerings.[@Groth:2010] They've also been shown to be an indicator of contradiction between different claims. [@Asif:2021]
+Nanopublications can serve as a standard framework to represent these claims, providing a concentrated source of truth for RAD. Nanopublications can bring finer granularity to RAD-adjacent information, including instrument, device, and physical technology offerings.[@Groth:2010] They've also been shown to be an indicator of contradiction between different claims, and with their use could allow ACORN to truly deliver truth and stability to projects and RAD. [@Asif:2021]

-Bringing an interface to the knowledge graph, chatbots could be the initial implementation of ACORN AI functionality, created by embedding ACORN-formatted RAD in a vector database and querying an off-the-shelf large language model (LLM) on the database. The LLM would use the query to find the most similar or related context, using RAD to inform retrieval augmented generation (RAG) for extra prompt content and then feed through the LLM again to give the user an answer. Because RAD is already input in JSON format, it's clearly defined, machine friendly, and in a predictable, easy-to-validate format. This makes research data highly accessible for AI applications, which, in turn, makes it highly accessible for the layperson.[@Shao:2025]
+Additionally, chatbots could bring an interface to the knowledge graph as the initial implementation of ACORN AI functionality, created by embedding ACORN-formatted RAD in a vector database and querying an off-the-shelf large language model (LLM) on the database. The LLM would use the query to find the most similar or related context, using RAD to inform retrieval augmented generation (RAG) for extra prompt content and then feed through the LLM again to give the user an answer. Because RAD is already input in JSON format, it's clearly defined, machine friendly, and in a predictable, easy-to-validate format. This makes research data highly accessible for AI applications, which, in turn, makes it highly accessible for the layperson.[@Shao:2025]
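
Reviewer note: the retrieval step this paragraph describes could be sketched roughly as follows. This is only an illustration under stated assumptions, not ACORN's implementation: the record fields and titles are hypothetical (loosely following the mission, challenge, approach, and impact description in the paper), a bag-of-words counter stands in for a real embedding model, an in-memory list stands in for the vector database, and the LLM call itself is left out.

```python
import json
import math
import re
from collections import Counter

# Hypothetical RAD records in JSON; the real ACORN schema may differ.
RAD_JSON = """
[
  {"title": "Grid Storage", "mission": "energy resilience",
   "challenge": "battery degradation", "approach": "solid electrolyte materials",
   "impact": "longer lasting grid batteries"},
  {"title": "Isotope Production", "mission": "medical isotopes",
   "challenge": "limited supply", "approach": "reactor target irradiation",
   "impact": "more cancer treatments"}
]
"""

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def record_text(record):
    """Flatten one RAD record into a single string for embedding."""
    return " ".join(str(value) for value in record.values())

def retrieve(query, records, k=1):
    """Return the k records most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(record_text(r))), reverse=True)
    return ranked[:k]

def build_prompt(query, records):
    """Assemble the augmented prompt that would be passed to an LLM."""
    context = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return f"Context:\n{context}\n\nQuestion: {query}"

records = json.loads(RAD_JSON)
question = "How do we reduce battery degradation?"
top = retrieve(question, records)
prompt = build_prompt(question, top)
```

Because the records are already structured JSON, the retrieved context can be serialized straight into the prompt without extra parsing, which is the property the paragraph highlights.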

-Coupled with RAG methodology, a chatbot could add necessary research data context for truly unique science communication capabilities.[@Patel:2024] Training could allow an existing model to make a base model with desired characteristics (e.g. follow AP style guide).[@Gao:2024] Through existing open-source LLMs, a trained RAD chatbot could provide information on the same project to both a fifth-grader writing a science report to a researcher at a partnering national laboratory.[@Hemmat:2024]
+Coupled with RAG methodology, a chatbot could add necessary research data context for truly unique science communication capabilities.[@Patel:2024] Training could allow an existing model to make a base model with desired characteristics (e.g. follow AP style guide).[@Gao:2024] Through existing open-source LLMs, a trained RAD chatbot could provide context-appropriate information on the same project to both a fifth-grader writing a science report and a researcher at a partnering national laboratory.[@Hemmat:2024]

The chatbot would let the science speak for itself and could become part of an envisioned federated AI agent ecosystem. This could allow users to ask questions of the entire Department of Energy national laboratories, for example. Leveraging agent-to-agent communication [@Yang:2025], such as [MCP](https://modelcontextprotocol.io/docs/getting-started/intro) or [A2A](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/), could provide a synopsis answer based on answers from all national lab agents to tell a user about the DOE Office of Science capabilities to address a particular challenge.