Commit 8258d9d7 authored by Wohlgemuth, Jason's avatar Wohlgemuth, Jason
Browse files

feat: Replace clones with Arc in bucket code for memory and performance

parent 3ccc7979
Loading
Loading
Loading
Loading
Loading
+443 −0
Original line number Diff line number Diff line
@@ -265,3 +265,446 @@ pub fn run(path: PathBuf, offline: bool, verbose: Verbosity) {
    );
}
```

---

# Arc Optimization Opportunities

This document lists locations in the ACORN codebase where `clone()` calls could be replaced with `Arc` for better performance and memory management, organized by impact level.

> **Note:** High Impact #1 (Bucket/BucketOptions in `acorn-lib/src/io/config.rs`) has already been implemented.

---

## HIGH IMPACT

These are the most impactful changes: large data structures being cloned in parallel contexts, or structs cloned many times across the codebase.

---

### 2. ResearchActivity in parallel schema validation

**File:** `acorn-lib/src/analyzer/mod.rs`
**Lines:** 129, 265

**What is cloned:**
- Line 129: `data.clone()` -- `Cff` struct passed to `collect_validation_checks`
- Line 265: `data.clone()` -- `ResearchActivity` struct passed to `collect_validation_checks`

Both occur inside `.par_iter().flat_map()` closures. `ResearchActivity` is a very large struct containing nested `ResearchActivityMetadata`, `Sections`, `ContactPoint`, `AspectFramework`, and `Other` -- all with many `String`, `Vec<String>`, and nested struct fields.

**Why Arc is better:** These are parallel iterations over file paths. Each path is read, deserialized into a potentially large `ResearchActivity` or `Cff`, then immediately cloned for validation. Using `Arc<ResearchActivity>` would allow the validation checks to share ownership without deep-copying the entire struct. The `collect_validation_checks` function only needs read access.

**Estimated impact:** **HIGH** -- `ResearchActivity` is the central data model; cloning it in parallel processing is one of the most expensive operations in the codebase.

**Suggested approach:**
```rust
// Before
.par_iter().flat_map(|path| {
    let data = ResearchActivity::read(path)?;
    collect_validation_checks(data.clone())
})

// After
.par_iter().flat_map(|path| {
    let data = Arc::new(ResearchActivity::read(path)?);
    collect_validation_checks(Arc::clone(&data))
})
```

---

### 3. ResearchActivity triple-cloned in website checks

**File:** `acorn-lib/src/analyzer/mod.rs`
**Lines:** 284, 292, 300

**What is cloned:**
- Line 284: `data.clone().meta.doi`
- Line 292: `data.clone().meta.websites`
- Line 300: `data.clone().contact.url`

Three separate clones of the full `ResearchActivity` struct within the same async closure, each to access a single nested field.

**Why Arc is better:** Three clones of a large struct to access individual fields. With `Arc<ResearchActivity>`, you'd clone only the `Arc` pointer (16 bytes on 64-bit) instead of the entire struct. Or simply borrow the fields directly since they're only read.

**Estimated impact:** **HIGH** -- Three unnecessary full struct clones in a loop over paths.

**Suggested approach:**
```rust
// Before
async move {
    let doi = data.clone().meta.doi;
    let websites = data.clone().meta.websites;
    let url = data.clone().contact.url;
    // ...
}

// After (Option 1: Arc)
async move {
    let doi = data.meta.doi.clone();
    let websites = data.meta.websites.clone();
    let url = data.contact.url.clone();
    // ...
}

// After (Option 2: References if lifetimes allow)
// Borrow fields directly without cloning the parent struct
```

---

### 4. PowerPoint export redundant clones

**File:** `acorn-cli/src/commands/export/powerpoint.rs`
**Lines:** 104-179

**What is cloned:**
- Line 104: `options.clone().unwrap()` -- `CommandOptions` (contains 3x `Option<PathBuf>`, 3x `Option<String>`)
- Line 106: `paths.clone()` -- `Vec<PathBuf>` cloned just for `.len()`
- Line 108: `paths.clone()` -- same `Vec` cloned again for `.iter()`
- Line 110: `path.clone()` -- per-item `PathBuf` clone
- Line 112: `path.clone()` -- same path cloned again for context
- Line 125: `path.clone()` -- options field cloned
- Line 145: `reference_path.clone()`
- Line 155: `reference_extract_path.clone()` -- per slide in iteration
- Line 156: `paths.clone()` -- same `Vec` cloned a third time
- Line 170: `data.clone()` -- `ResearchActivity` cloned per item in parallel iteration
- Line 174: `paths[0].clone()` -- yet another clone
- Line 179: `reference_extract_path.clone()` -- another clone

**Why Arc is better:** `paths` is cloned 4 times in this function. `CommandOptions` is cloned for destructuring. `ResearchActivity` is cloned inside a `par_iter` block. Using `Arc<PathBuf>` for paths and `&ResearchActivity` (or `Arc<ResearchActivity>`) would eliminate most of these. The `reference_extract_path` is cloned once per slide -- an `Arc<PathBuf>` would reduce this to a pointer copy.

**Estimated impact:** **HIGH** -- Many redundant clones in the same function, plus `ResearchActivity` clone in parallel context.

**Suggested approach:**
```rust
// Wrap paths in Arc once
let paths = Arc::new(paths);

// Use Arc::clone instead of .clone() for Vec<PathBuf>
let paths_len = paths.len();  // No clone needed
for path in paths.iter() {    // No clone needed
    // ...
}

// For ResearchActivity in par_iter
let data = Arc::new(data);
paths.par_iter().for_each(|path| {
    let data = Arc::clone(&data);
    // ...
});
```

---

### 5. Vale struct in prose checking

**File:** `acorn-lib/src/analyzer/check.rs`
**Lines:** 629-633

**What is cloned:**
- Line 629: `vale.clone()` -- `Vale` struct cloned (contains `Option<SemanticVersion>`, `Option<PathBuf>`, `Option<ValeConfig>`)
- Line 632: `vale.clone()` -- cloned again inside the `.map()` closure for each path

**Why Arc is better:** `Vale` is cloned once, then cloned again per path in the async iterator. The cloned `Vale` is only used to call `.run()`, which takes `&self`. Wrapping in `Arc<Vale>` would reduce per-path cloning to a pointer copy.

**Estimated impact:** **HIGH** -- `Vale` cloning happens for every file being prose-checked, and prose checking is one of the more expensive operations.

**Suggested approach:**
```rust
// Before
let vale = Vale::new(...);
items.iter().map(|path| {
    let vale = vale.clone();
    async move { vale.run(path).await }
})

// After
let vale = Arc::new(Vale::new(...));
items.iter().map(|path| {
    let vale = Arc::clone(&vale);
    async move { vale.run(path).await }
})
```

---

## MEDIUM IMPACT

Structs that are cloned repeatedly but are smaller, or clones that occur less frequently.

---

### 6. CheckOptions cloned 3x in check command

**File:** `acorn-cli/src/commands/check/mod.rs`
**Lines:** 80, 113, 199

**What is cloned:**
- Line 80: `options.clone()` -- `CheckOptions` in `apply_early_exit_policy` (11 fields including `Vec<String>`)
- Line 113: `options.clone()` -- `CheckOptions` in `handle()`
- Line 199: `options.clone()` -- `CheckOptions` in `resolve_skipped_categories()`

**Why Arc is better:** `CheckOptions` is cloned 3 separate times in the check flow. While individual clones are cheap (primitives + one `Vec<String>`), the cumulative cost across many files adds up. More importantly, `CheckOptions` is passed by reference to many async operations that clone it internally. Sharing via `Arc<CheckOptions>` would allow all functions to share a single allocation.

**Estimated impact:** **MEDIUM-HIGH** -- Moderate struct size, but cloned frequently across the check command's hot path.

---

### 7. CommandOptions in export

**File:** `acorn-cli/src/commands/export/mod.rs`
**Lines:** 141, 165

**What is cloned:**
- Line 141: `CommandOptions::init().path(path.clone()).maybe_output(output.clone()).build()` -- creates new options but clones inputs
- Line 165: `options.clone()` inside async closure for parallel processing

**Why Arc is better:** The options struct is rebuilt and cloned multiple times within the export flow. Using `Arc` for the shared fields would reduce allocations, especially for the parallel processing closure at line 165.

**Estimated impact:** **MEDIUM** -- `CommandOptions` is moderate size (~5 Option fields), but cloned in a parallel context.

---

### 8. BucketOptions in download command

**File:** `acorn-cli/src/commands/download/mod.rs`
**Lines:** 51, 105

**What is cloned:**
- Line 51: `options.clone()` -- `BucketOptions`
- Line 105: Same pattern repeated for URL-based download

**Why Arc is better:** In the loop over buckets, `options.clone()` happens for each bucket. If a config file defines many buckets, this creates repeated allocations. `Arc<BucketOptions>` would share the options struct across all buckets.

**Estimated impact:** **MEDIUM** -- Depends on number of buckets; typically small but can grow.

---

### 9. GitLab Options cloned 4x

**File:** `acorn-cli/src/commands/gather/mod.rs`
**Lines:** 60, 64, 78, 80

**What is cloned:**
- Line 60: `options.clone()` -- `gitlab::Options` passed to `runners()`
- Line 64: `options.clone()` -- cloned again with modified identifier
- Line 78: `options.clone()` -- passed to `create_runner()`
- Line 80: `options.clone()` -- passed to `groups()`

**Why Arc is better:** `gitlab::Options` is cloned 4 times in quick succession. If this struct contains HTTP client state or authentication tokens, the clone cost could be non-trivial. Using `Arc` would share the options across API calls.

**Estimated impact:** **MEDIUM** -- Small number of clones, but potentially expensive struct.

---

### 10. Cff in website checks

**File:** `acorn-lib/src/analyzer/mod.rs`
**Lines:** 144, 174

**What is cloned:**
- Line 144: `path.clone()` -- in async closure for website checks
- Line 174: `value.clone()` -- `Identifier.value` String cloned

**Why Arc is better:** The `Cff` struct contains many `Vec` fields (`identifiers`, `references`) that are cloned when the struct is moved into the async closure. Using `Arc<Cff>` for the website check would avoid deep-cloning these collections.

**Estimated impact:** **MEDIUM** -- `Cff` is smaller than `ResearchActivity` but still contains nested collections.

---

### 11. Bucket in file_paths method

**File:** `acorn-lib/src/io/config.rs`
**Lines:** 397-401

**What is cloned:**
- Line 397: `self.clone()` -- full `Bucket` struct destructured
- Line 398: `name.clone()` -- `Option<String>` cloned from the destructured bucket
- Line 401: `code_repository.clone()` -- cloned from the same bucket

**Why Arc is better:** The entire `Bucket` is cloned just to access `name` and `code_repository`. Using `&self` would eliminate this clone entirely since both fields are only read.

**Estimated impact:** **MEDIUM** -- Single clone but of a struct containing `Repository` enum.

---

### 12. Vale cloned 7x across trait implementations

**File:** `acorn-lib/src/analyzer/mod.rs`
**Lines:** 413, 498, 517, 547, 567, 609, 686

**What is cloned:**
- Line 413: `self.clone().command()` -- `Vale` cloned to call `.command()`
- Line 498: `self.clone().download_checksums()` -- `Vale` cloned for async call
- Line 517: `self.clone().extract()` -- `Vale` cloned
- Line 547: `self.clone().command()` -- cloned again
- Line 567: `self.clone().command()` -- cloned again
- Line 609: `self.clone().command()` -- cloned again
- Line 686: `self.clone().command()` -- cloned again (in `with_system_command`)

**Why Arc is better:** `Vale` is cloned 7 times throughout its trait implementations, mostly just to call `.command()`. The `.command()` method takes `self` by value but only reads fields. Changing the signature to take `&self` would eliminate all these clones.

**Estimated impact:** **MEDIUM** -- Many clones but struct is small; the pattern is the issue (consuming `self` for a read-only operation).

**Suggested approach:**
```rust
// Before
fn command(self) -> Command {
    // ...
}

// After
fn command(&self) -> Command {
    // ...
}
```

---

## LOW IMPACT

Small structs, single clones, or clones of types that are already cheap (e.g., `PathBuf` is ~24 bytes, `String` in trivial contexts).

---

### 13. PathBuf cloned in format command (per file, parallel)

**File:** `acorn-cli/src/commands/format/mod.rs`
**Lines:** 37, 47, 57

**What is cloned:**
- Line 37: `path.clone()` -- `PathBuf` in `par_iter`
- Line 47: `path.clone()` -- same path for context
- Line 57: `path.clone()` -- for display string

**Why Arc is better:** `PathBuf` is small (~24 bytes with inline short string optimization), but cloned 3 times per file in a parallel context. Using a reference or `Arc<PathBuf>` would reduce allocations when processing thousands of files.

**Estimated impact:** **LOW** -- Small type, but frequent in parallel context.

---

### 14. PathBuf cloned in link command (per file, parallel)

**File:** `acorn-cli/src/commands/link/mod.rs`
**Lines:** 29, 39, 49, 55

**What is cloned:**
- Line 29: `path.clone()` -- in `par_iter`
- Line 39: `path.clone()` -- for context
- Line 49: `path.clone()` -- for display
- Line 55: `path.clone()` -- for filepath conversion

**Why Arc is better:** Same pattern as format command. 4 clones of `PathBuf` per file in parallel processing.

**Estimated impact:** **LOW** -- Small type but frequent.

---

### 15. Options cloned in resolve_paths

**File:** `acorn-cli/src/cli/mod.rs`
**Lines:** 96

**What is cloned:**
- Line 96: `ignore.clone()` and `filter.clone().map(regex_inverse)` -- both `Option<String>` cloned

**Why Arc is better:** Minor; these are `Option<String>` clones that could be avoided with references.

**Estimated impact:** **LOW** -- Small strings, single location.

---

### 16. Check struct cloned in render function

**File:** `acorn-cli/src/commands/check/mod.rs`
**Lines:** 171, 176

**What is cloned:**
- Line 171: `issue.clone()` -- `Check` struct cloned to access `.uri`
- Line 176: `issue.clone()` -- `Check` cloned to call `.with_index().report()`

**Why Arc is better:** `Check` struct contains several `Option` fields and an `ErrorKind` enum. Cloned twice per issue in the render function. Since the `Check` is only being read and then transformed, a reference-based approach or `&self` methods would avoid cloning.

**Estimated impact:** **LOW** -- Small struct, limited to render path.

---

### 17. Check struct cloned in filter_by_visibility

**File:** `acorn-cli/src/commands/check/mod.rs`
**Line:** 151

**What is cloned:**
- Line 151: `.cloned().collect()` -- every `Check` in the slice is cloned to create a new `Vec`

**Why Arc is better:** If `Check` were `Arc<Check>`, this would become a cheap pointer copy instead of a full struct clone for each element. Alternatively, using references and changing downstream code to accept `&[Check]` would eliminate the clone entirely.

**Estimated impact:** **LOW** -- One-time clone per check cycle, but of all results.

---

### 18. Database path cloned in main.rs

**File:** `acorn-cli/src/main.rs`
**Line:** 169

**What is cloned:**
- Line 169: `database_path.clone()` -- `Option<PathBuf>`

**Why Arc is better:** Single clone at startup, trivial impact.

**Estimated impact:** **LOW** -- One-time clone at startup.

---

## Summary Table

| Priority | File | Lines | Struct Being Cloned | Context | Impact |
|----------|------|-------|-------------------|---------|--------|
| ~~1~~ | ~~acorn-lib/src/io/config.rs~~ | ~~279-311, 344-383~~ | ~~BucketOptions, Bucket~~ | ~~Parallel file operations~~ | ~~HIGH (DONE)~~ |
| 2 | acorn-lib/src/analyzer/mod.rs | 129, 265 | ResearchActivity, Cff | Parallel schema checks | HIGH |
| 3 | acorn-lib/src/analyzer/mod.rs | 284, 292, 300 | ResearchActivity (3x) | Parallel website checks | HIGH |
| 4 | acorn-cli/src/commands/export/powerpoint.rs | 104-179 | CommandOptions, paths, ResearchActivity | Export pipeline | HIGH |
| 5 | acorn-lib/src/analyzer/check.rs | 629, 632 | Vale | Parallel prose checks | HIGH |
| 6 | acorn-cli/src/commands/check/mod.rs | 80, 113, 199 | CheckOptions (3x) | Check command hot path | MED-HIGH |
| 7 | acorn-cli/src/commands/export/mod.rs | 141, 165 | CommandOptions | Export parallel processing | MEDIUM |
| 8 | acorn-cli/src/commands/download/mod.rs | 51, 105 | BucketOptions | Bucket loop | MEDIUM |
| 9 | acorn-cli/src/commands/gather/mod.rs | 60, 64, 78, 80 | gitlab::Options (4x) | API call chain | MEDIUM |
| 10 | acorn-lib/src/analyzer/mod.rs | 144, 174 | Cff | Website checks | MEDIUM |
| 11 | acorn-lib/src/io/config.rs | 397-401 | Bucket | file_paths method | MEDIUM |
| 12 | acorn-lib/src/analyzer/mod.rs | 413, 498, 517, 547, 567, 609, 686 | Vale (7x) | Trait implementations | MEDIUM |
| 13-18 | Various | Various | PathBuf, Check, Option<String> | Per-file parallel ops | LOW |

---

## Implementation Notes

### When to use Arc vs &self

- **Use `Arc`** when:
  - Data needs to be shared across multiple async closures or threads
  - Data is cloned multiple times in a loop
  - Data is passed through multiple layers of functions

- **Use `&self` or references** when:
  - Data is only read and doesn't need to outlive the current scope
  - Clone is only needed due to method signature taking ownership (change signature instead)
  - Single clone that could be avoided with proper borrowing

### Performance Considerations

- `Arc::clone()` is very cheap (atomic increment, ~nanoseconds)
- Deep cloning large structs can be expensive (microseconds to milliseconds)
- The benefit of `Arc` scales with:
  - Size of the cloned struct
  - Frequency of cloning
  - Number of parallel operations

### Testing Arc Changes

After implementing Arc optimizations:
1. Run `cargo test` to ensure no regressions
2. Run `cargo clippy` to catch any new warnings
3. Consider benchmarking before/after if performance is critical
+54 −47
Original line number Diff line number Diff line
@@ -10,7 +10,7 @@ use crate::io::api::Endpoint;
use crate::io::http::get;
use crate::io::ApiResult;
use crate::io::{files_all, read_file, with_progress, write_file, write_file_bytes, FromPath, InputOutput, ProgressType};
use crate::prelude::{self, exit, PathBuf};
use crate::prelude::{self, exit, Arc, PathBuf};
use crate::util::{detect_json, suffix, Label, MimeType, StringConversion};
use crate::{Location, Repository, Scheme};
use bon::Builder;
@@ -276,40 +276,43 @@ impl Bucket {
            quiet,
            ignore,
            filter,
        } = options.clone();
        let output = output.unwrap_or_default();
        } = options;
        let output = Arc::new(output.clone().unwrap_or_default());
        let Bucket { name, code_repository, .. } = self.clone();
        if code_repository.is_local() {
            let uri = code_repository.clone().location().uri();
            let uri = code_repository.location().uri();
            let bucket_root = match uri {
                | Some(value) => PathBuf::from(value.path().to_string()).to_absolute_string(),
                | None => {
                    unimplemented!()
                }
            };
            let files = files_all(PathBuf::from(bucket_root.clone()), None)
            let files = files_all(PathBuf::from(&bucket_root), None)
                .into_iter()
                .map(|x| x.display().to_string())
                .collect::<Vec<String>>();
            let items = files
                .into_iter()
                .filter(|path| !is_ignored_path(path, &ignore) && PathBuf::from(path).is_file() && is_filtered_path(path, &filter))
                .filter(|path| !is_ignored_path(path, ignore) && PathBuf::from(path).is_file() && is_filtered_path(path, filter))
                .collect::<Vec<String>>();
            let message: fn(&String) -> String = |path| format!("Copying {path}");
            let bucket_root_clone = bucket_root.clone();
            let output_clone = output.clone();
            let operation = move |path: String| {
                let bucket_root = bucket_root_clone.clone();
                let output = output_clone.clone();
            let bucket_root = Arc::new(bucket_root);
            let operation = {
                let bucket_root = Arc::clone(&bucket_root);
                let output = Arc::clone(&output);
                move |path: String| {
                    let bucket_root = Arc::clone(&bucket_root);
                    let output = Arc::clone(&output);
                    async move {
                        let path_buf = PathBuf::from(&path);
                    let relative = match path_buf.strip_prefix(PathBuf::from(&bucket_root)) {
                        let relative = match path_buf.strip_prefix(PathBuf::from(bucket_root.as_str())) {
                            | Ok(rel) => rel.to_path_buf(),
                            | Err(_) => return Err(eyre!("Failed to strip prefix {bucket_root} from {path}")),
                        };
                        let output_filepath = output.join(relative);
                        write_file_bytes(output_filepath, || async { prelude::read(path.clone()) }).await
                    }
                }
            };
            let total_data: usize = count_json_files(&items);
            let total_images: usize = count_image_files(&items);
@@ -318,7 +321,7 @@ impl Bucket {
                | true => ProgressType::Silent,
                | false => ProgressType::Bar,
            };
            match with_progress(items, message, operation, finish_message, Some(threads), progress_type).await {
            match with_progress(items, message, operation, finish_message, Some(*threads), progress_type).await {
                | Ok(_) => Ok(total_data + total_images),
                | Err(why) => {
                    error!("=> {} Copy files with progress bar — {why}", Label::fail());
@@ -341,13 +344,13 @@ impl Bucket {
            quiet,
            filter,
            ignore,
        } = options.clone();
        let output = output.unwrap_or_default();
        } = options;
        let output = Arc::new(output.clone().unwrap_or_default());
        let Bucket { name, code_repository, .. } = self.clone();
        let items = match self.clone().file_paths("").await {
            | Ok(paths) => paths
                .into_iter()
                .filter(|path| !is_ignored_path(path, &ignore) && is_filtered_path(path, &filter))
                .filter(|path| !is_ignored_path(path, ignore) && is_filtered_path(path, filter))
                .collect::<Vec<String>>(),
            | Err(why) => {
                error!("=> {} Get file paths for download — {why}", Label::fail());
@@ -355,9 +358,12 @@ impl Bucket {
            }
        };
        let message: fn(&String) -> String = |path| format!("Downloading {path}");
        let operation = move |path: String| {
            let output = output.clone();
            let repository = code_repository.clone();
        let operation = {
            let output = Arc::clone(&output);
            let code_repository = Arc::new(code_repository);
            move |path: String| {
                let output = Arc::clone(&output);
                let repository = Arc::clone(&code_repository);
                async move {
                    let output_filepath = format!("{}/{}", output.display(), path);
                    let url = repository
@@ -380,6 +386,7 @@ impl Bucket {
                        | Err(why) => Err(why),
                    }
                }
            }
        };
        let total_data: usize = count_json_files(&items);
        let total_images: usize = count_image_files(&items);
@@ -388,7 +395,7 @@ impl Bucket {
            | true => ProgressType::Silent,
            | false => ProgressType::Bar,
        };
        match with_progress(items, message, operation, finish_message, Some(threads), progress_type).await {
        match with_progress(items, message, operation, finish_message, Some(*threads), progress_type).await {
            | Ok(_) => Ok(total_data + total_images),
            | Err(why) => Err(why),
        }
@@ -398,7 +405,7 @@ impl Bucket {
        let bucket_name = name.clone().unwrap_or_else(|| "Bucket".to_string()).to_uppercase();
        match &code_repository {
            | Repository::Git { .. } => {
                let path = match code_repository.clone().location().uri() {
                let path = match code_repository.location().uri() {
                    | Some(value) => PathBuf::from(value.path().to_string()),
                    | None => {
                        unimplemented!()
+2 −0
Original line number Diff line number Diff line
@@ -4,6 +4,8 @@
//! `acorn-lib` is a one-stop-shop for everything related to building and maintaining research activity data (RAD)-related technology, including the Accessible Content Optimization for Research Needs (ACORN) tool.
//! The modules, structs, enums and constants found here support the ACORN CLI, which checks, analyzes, and exports research activity data into useable formats.
//!
extern crate alloc;

use derive_more::Display;
use serde::{Deserialize, Serialize};
#[cfg(feature = "std")]
+2 −0
Original line number Diff line number Diff line
@@ -3,6 +3,7 @@
/// Module that provides `std` support
#[allow(unused_imports)]
mod std {
    pub use alloc::sync::Arc;
    pub use hashbrown::{DefaultHashBuilder, HashMap, HashSet};
    pub use std::env::{self, consts, var};
    pub use std::ffi::OsStr;
@@ -19,6 +20,7 @@ mod std {
/// Module that provides `no-std` support
#[allow(unused_imports)]
mod no_std {
    pub use alloc::sync::Arc;
    pub use hashbrown::{DefaultHashBuilder, HashMap, HashSet};
}