Unverified Commit 07312fbf authored by Gregory Cage's avatar Gregory Cage Committed by GitHub

Merge pull request #46 from nova-model/update-user-docs

Update docs and usage examples.
parents 579d0c76 040c5850
Data Stores
-------------------------

A `Datastore` (or `Data Store`) in nova-galaxy represents a Galaxy history. It serves as a container for organizing your data and tool outputs within Galaxy.

.. code-block:: python

    from nova.galaxy import Connection

    galaxy_url = "your_galaxy_url"
    galaxy_key = "your_galaxy_api_key"
    connection = Connection(galaxy_url, galaxy_key)

    with connection.connect() as conn:
        data_store = conn.create_data_store("My Data Store")

By default, data stores are persisted, meaning that their jobs and outputs remain available even after the connection is closed.
Data stores also keep their namespace after the application exits: if you name a data store "Data1" and later create a new data store named "Data1", nova-galaxy will automatically connect the new instance to the old one, assuming it has not been deleted.
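That name-reuse behavior can be sketched in plain Python (a hypothetical stand-in, not nova-galaxy internals): stores are keyed by name, so creating one with an existing name hands back the old store.

```python
# Hypothetical sketch of name-keyed data stores; nova-galaxy's real
# create_data_store talks to a Galaxy server instead.
class SketchConnection:
    def __init__(self):
        self._stores = {}

    def create_data_store(self, name):
        # Reuse the existing store if one with this name already exists.
        return self._stores.setdefault(name, {"name": name, "outputs": []})

conn = SketchConnection()
first = conn.create_data_store("Data1")
first["outputs"].append("result.txt")
second = conn.create_data_store("Data1")
print(second["outputs"])  # ['result.txt'] (the old store was reconnected)
```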

In order to delete and clean up your data stores (i.e. delete all outputs and resources associated with the data store), there are a few methods.

First, you can mark a data store for automatic cleanup when you close your connection.

.. code-block:: python

    with connection.connect() as conn:
        data_store = conn.create_data_store("My Data Store")
        data_store.mark_for_cleanup()
        # when the 'with' block exits, the data store will be cleaned up.

This will also work when the connection class is used without the 'with' syntax.

.. code-block:: python

    active_connection = connection.connect()
    data_store = active_connection.create_data_store("My Data Store")
    data_store.mark_for_cleanup()
    active_connection.close()
    # when close() is called, the data store will be cleaned up.


You can also clean a data store manually by invoking its `cleanup` method:

.. code-block:: python

    active_connection = connection.connect()
    data_store = active_connection.create_data_store("My Data Store")
    # Do work
    data_store.cleanup()
    data_store = active_connection.create_data_store("My Data Store")
    # To use this data store again, you must call create_data_store again.
    # The new store will be empty, since the previous one was cleaned up.

If at any point you want to persist a store that has been marked for cleanup, you can call its `persist` method:

.. code-block:: python

    active_connection = connection.connect()
    data_store = active_connection.create_data_store("My Data Store")
    # Run your first tool
    data_store.cleanup()
    data_store = active_connection.create_data_store("My Data Store")
    # Run your second tool
    data_store.persist()
    active_connection.close()
    # All data in the store from the second tool will be persisted, whereas the first tool's outputs will be gone.

Datasets
-------------------------

nova-galaxy provides abstractions for handling individual files (`Dataset`) and collections of files (`DatasetCollection`).

.. code-block:: python

   from nova.galaxy import DatasetCollection

   # Create a DatasetCollection (implementation for upload pending)
   my_collection = DatasetCollection("path/to/my/collection")


By default, a Dataset takes its name from the file path given, but it can be given a unique name by passing a string into the constructor.

.. code-block:: python

    my_dataset = Dataset(path="path/to/file.txt", name="cool_dataset_name")
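As a pure-Python illustration (not nova-galaxy code), the default name is presumably the final component of the file path:

```python
from pathlib import Path

# Illustrative sketch only: derive a dataset's default name from its file
# path, as described above (nova-galaxy's internals may differ).
def default_dataset_name(path: str) -> str:
    return Path(path).name

print(default_dataset_name("path/to/file.txt"))  # file.txt
```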

A Dataset can be marked as a remote file if you don't want to upload it from your local machine. Remote files are files that your upstream Galaxy instance has access to.
For example, if your upstream Galaxy instance has access to a directory named `/SNS`, you can load a file from there as a dataset:

.. code-block:: python

    my_dataset = Dataset(path="/SNS/path/to/file.txt", remote_file=True)

Datasets can be uploaded to a store by calling the upload method.

.. code-block:: python

    connection = Connection("galaxy_url", "galaxy_key").connect()
    store = connection.create_data_store("store")
    my_dataset = Dataset("filepath/file.txt")
    my_dataset.upload(store, name="optional name")


Note that when the `remote_file` flag is set to `True`, the files are not actually "uploaded". Instead, they are ingested into Galaxy as links to the actual files, so file size should not slow down the system.
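As a rough sketch of that distinction (hypothetical code, not nova-galaxy internals), a remote file is recorded as a link while a local file's bytes would be transferred:

```python
# Hypothetical sketch of the upload-vs-ingest distinction described above;
# nova-galaxy's real implementation is not shown in these docs.
def ingest(path: str, remote_file: bool = False) -> dict:
    if remote_file:
        # No bytes move: Galaxy records a link to a file it can already see.
        return {"kind": "link", "target": path}
    # A local upload would transfer the file's contents to Galaxy.
    return {"kind": "upload", "target": path}

print(ingest("/SNS/path/to/file.txt", remote_file=True)["kind"])  # link
```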

When running tools, any Dataset used as an input parameter will be automatically uploaded or ingested, unless that dataset has already been uploaded.
To force the dataset to be uploaded when a tool runs, even if it has been uploaded before, mark the dataset with `force_upload`:

.. code-block:: python

     my_dataset = Dataset(path="/SNS/path/to/file.txt", force_upload=True)

Note that `force_upload` defaults to `True`.

If, instead of loading a file from disk or ingesting a remote file, you want to directly upload some text or another serializable Python value, you can set the dataset content directly:

.. code-block:: python

    my_dataset = Dataset()
    my_dataset.set_content("Some text that will be uploaded as a text file", file_type=".txt")

The `file_type` argument is optional and will default to a text file.

In order to fetch the content of a dataset, you can either download the dataset to a path or fetch the content and store it directly in memory (be careful using this with large files):

.. code-block:: python

    my_dataset.download("/path/to/local/location/where/you/want/to/download/this.txt")
    dataset_content = my_dataset.get_content() # will store content in memory


DatasetCollections currently have less functionality than individual Datasets, as most collections will come from tool outputs.
The `get_content()` method will return a list of info on each element in the collection rather than the content of each element.
The `download()` method will save the collection (with all content included) as a zip archive to the given path.
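Assuming the downloaded collection archive is a standard zip file (an assumption; check your Galaxy instance), it can be inspected with Python's `zipfile` module:

```python
import io
import zipfile

# Build a small in-memory zip standing in for a downloaded collection
# archive, since creating a real one requires a live Galaxy connection.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("element_1.txt", "first element")
    zf.writestr("element_2.txt", "second element")

# Inspect the archive as you would a collection downloaded to disk.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    first_element = zf.read("element_1.txt").decode()
print(names)  # ['element_1.txt', 'element_2.txt']
```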

Interactive Tools
-------------------------

nova-galaxy allows running Galaxy tools in interactive mode, which is especially useful for tools that serve a user interface.

.. code-block:: python

    url = my_tool.run_interactive(data_store, params)
    print(f"Interactive tool URL: {url}")

By default, interactive tools are not stopped automatically when the Nova connection is closed. To override this behavior, use the `Datastore` `mark_for_cleanup` method, which causes the tool to stop automatically once the connection is closed (or the `with` block is exited). You can also stop these tools manually with the `Tool` `stop_all_tools_in_store` method.

If you want to get the url of an interactive tool at a later point, you can use the `get_url` method:

.. code-block:: python

     my_tool.get_url()

Tools
-------------------------

The `Tool` class represents a Galaxy tool. You can run tools, manage their inputs, and retrieve their outputs.

.. code-block:: python

   from nova.galaxy import Connection, Tool, Parameters, Dataset

   # Get a tool instance
   my_tool = Tool("tool_id")

   connection = Connection("galaxy_url", "galaxy_key")
   active_connection = connection.connect()
   data_store = active_connection.create_data_store("cool store")
   inputs = Parameters()
   # Run the tool
   outputs = my_tool.run(data_store, inputs)

By default, tools run synchronously. In order to run a tool in an "async" manner, set the `wait` argument to `False`.

.. code-block:: python

    outputs = my_tool.run(data_store=data_store, params=inputs, wait=False)
    # any code after will be executed immediately. Outputs will be None in this case.

You can get the status of the tool as a `WorkState` enum value (from the nova-common library):

.. code-block:: python

    status = my_tool.get_status()
    print(status) # could print "running", "queued", "error", etc
    full_status = my_tool.get_full_status()
    print(full_status) # Gives you details on error states, etc
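For long-running tools, a simple polling loop over the status can be sketched as follows (the status callable below is a stand-in for `Tool.get_status`, since the real method needs a live Galaxy connection):

```python
import time

# Hedged sketch: poll a status callable until it leaves "queued"/"running".
def wait_until_done(get_status, poll=0.01, timeout=5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status not in ("queued", "running"):
            return status
        time.sleep(poll)
    raise TimeoutError("tool did not finish in time")

# Stand-in status sequence for demonstration.
statuses = iter(["queued", "running", "ok"])
print(wait_until_done(lambda: next(statuses)))  # ok
```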

If a tool has already been run, and you want to get the results/outputs again:

.. code-block:: python

    outputs = my_tool.get_results()

If you have run a tool asynchronously, and at a later point, you want to wait for the tool, you can use the `wait_for_results` method:

.. code-block:: python

    my_tool.run(data_store=data_store, params=inputs, wait=False)

    # do some stuff

    my_tool.wait_for_results()
    # Any code after will be executed after tool has finished running

If you want to stop a tool from running, but keep any existing outputs from the Tool, use the `stop` method.

.. code-block:: python

    my_tool.run(data_store=data_store, params=inputs, wait=False)
    my_tool.stop()
    outputs = my_tool.get_results()

If you want to cancel a tool from running and throw away any output from it, use the `cancel` method:

.. code-block:: python

    my_tool.run(data_store=data_store, params=inputs, wait=False)
    my_tool.cancel()

You can get any current stdout and stderr from a Tool:

.. code-block:: python

    stdout = my_tool.get_stdout() # Get current stdout
    stderr = my_tool.get_stderr(position=10, length=300) # Gets 300 characters of stderr, starting from index 10.
    # The starting position and length can be specified for both stdout and stderr.

These methods work regardless of whether the job is running or has been completed.
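The position/length arguments make it possible to page through long output in fixed-size chunks; here is a pure-Python stand-in for that pattern (not the nova-galaxy API itself):

```python
# Sketch of paging through long tool output with position/length,
# mirroring the get_stdout/get_stderr arguments shown above.
def iter_chunks(text, length=300):
    # Yield successive (position, chunk) pairs until the text is exhausted.
    for position in range(0, len(text), length):
        yield position, text[position:position + length]

log = "x" * 650
chunks = list(iter_chunks(log, length=300))
print([(pos, len(chunk)) for pos, chunk in chunks])  # [(0, 300), (300, 300), (600, 50)]
```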

Advanced users may find they need to access the underlying job ID for a tool, which they can do with `get_uid`:

.. code-block:: python

    upstream_id = my_tool.get_uid() # Galaxy job ID

Tools can also be assigned to already running or completed jobs by using `assign_id`:

.. code-block:: python

    second_tool = Tool("tool_id")
    second_tool.assign_id(upstream_id)
    # second_tool now can access status, outputs, stdout, stderr, etc from first tool

Example 1: Uploading a Dataset and Running a Tool
--------------------------------------------------

This example demonstrates how to upload a dataset to Galaxy and run a tool using nova-galaxy.

.. code-block:: python

       # Get the content of the output dataset
       content = output_dataset.get_content()
       print(content)
       # Because data stores persist by default, this content will still be saved after the with block is exited.


Example 2: Manually managing a Connection
--------------------------------------------------
.. code-block:: python

    from nova.galaxy import Connection, Dataset, Tool, Parameters

    galaxy_url = "your_galaxy_url"
    galaxy_key = "your_galaxy_api_key"
    connection = Connection(galaxy_url, galaxy_key)

    # Open the connection
    conn = connection.connect()

    # Create a data store
    data_store = conn.create_data_store("Example Data Store")

    # Create a dataset from a local file
    my_dataset = Dataset("path/to/your/file.txt")

    # Define tool parameters
    params = Parameters()
    params.add_input("input", my_dataset)

    # Get the tool
    my_tool = Tool("some_tool_id") # Replace with the actual tool ID

    # Run the tool asynchronously
    my_tool.run(data_store, params, wait=False)

    # Get Tool Status
    print(my_tool.get_status())

    # Wait for tool to finish
    my_tool.wait_for_results()

    # Get the results from the tool
    results = my_tool.get_results()
    output_coll = results.get_collection("my_output_collection")

    # Download the output collection to a local path
    output_coll.download("/local/path/where/I/want/to/download/")

    # Mark the data store for cleanup (remove all files and outputs) when the connection is closed
    data_store.mark_for_cleanup()

    # Manually close connection
    conn.close()
    # Results have been removed from upstream since the data store was cleaned up.