Understanding Caching in TangleML

TangleML's caching system is one of its most powerful features, designed to dramatically reduce compute time and accelerate your ML pipeline iterations. Unlike traditional pipeline systems, TangleML implements sophisticated caching strategies that can save hours or even days of computation time.

Why Caching Matters

Imagine you have a pipeline that runs for two days, involving data preprocessing, model training, and evaluation. Without caching:

  • Every pipeline run would start from scratch
  • Small changes would require complete re-execution
  • Parallel experiments would duplicate work unnecessarily

With TangleML's caching:

  • Previous computations are automatically reused
  • Only changed components are re-executed
  • Multiple pipeline variants share common computations
Tip: A pipeline that took a full day to run initially might complete in minutes on subsequent runs if most components can use cached results!

How Caching Works

The Cache Key

TangleML generates a unique cache key for each task execution based on:

  1. Container specification - The exact container image, commands, and environment
  2. Input data hashes - Cryptographic hashes of all input artifacts

When a task is about to execute, TangleML checks if an identical task has been run before by comparing cache keys.
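As a rough illustration, a content-based cache key can be derived by hashing a canonical serialization of the container spec together with the input hashes. This is a minimal sketch, not TangleML's actual key derivation; the `cache_key` function and the example `spec`/`inputs` values are hypothetical.

```python
import hashlib
import json

def cache_key(container_spec: dict, input_hashes: dict) -> str:
    # Canonical JSON (sorted keys) so that equivalent specs
    # always serialize, and therefore hash, identically.
    canonical = json.dumps(
        {"container": container_spec, "inputs": input_hashes},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

spec = {"image": "python:3.11", "command": ["python", "train.py"]}
inputs = {"dataset": "3f2a", "config": "9b1c"}
key = cache_key(spec, inputs)
```

Because the key covers both the container specification and the input hashes, changing either one produces a different key and forces a fresh execution.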

Content-Based vs. Lineage-Based Hashing

TangleML uses content-based hashing, which is superior to the lineage-based approach used by most other systems:

| Feature | Content-Based (TangleML) | Lineage-Based (Others) |
| --- | --- | --- |
| Component upgrades | Downstream uses cache if outputs are identical | Must re-execute all downstream |
| Non-deterministic components | Downstream uses cache when output doesn't change | Always re-executes downstream |
| Constant value replacement | Can swap hardcoded values with components without breaking cache | Cache breaks when lineage changes |
Tip: Content-based hashing means that if a component produces the same output (same hash), all downstream components can still use their cached results, even if the upstream component was re-executed!
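The effect can be demonstrated in a few lines of Python: if a re-executed upstream step produces byte-identical output, a downstream cache key derived purely from content hashes is unchanged. This sketch is illustrative only; the names are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Upstream component was upgraded and re-executed,
# but its output happens to be byte-identical.
old_output = b"preprocessed rows v1"
new_output = b"preprocessed rows v1"

# A downstream key that depends only on the *content* of its inputs
# is identical before and after the upstream re-execution.
downstream_key_before = content_hash(old_output)
downstream_key_after = content_hash(new_output)
```

Under lineage-based hashing, the downstream key would instead incorporate the upstream task's identity or execution ID, so the upgrade alone would invalidate the cache.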

Reusing Running Executions

One of TangleML's unique features is the ability to reuse still-running executions. This is crucial for parallel pipeline submissions:

Pipeline 1: [Data Processing] → [Training A] → [Evaluation]
Pipeline 2: [Data Processing] → [Training B] → [Evaluation]
Pipeline 3: [Data Processing] → [Training C] → [Evaluation]

When submitted simultaneously:

  • Other systems: Launch 3 separate data processing tasks
  • TangleML: All pipelines share the same data processing execution

Cache Configuration

Controlling Cache Behavior

You can control caching behavior at the component task level:

| Setting | Description | Example Use Case |
| --- | --- | --- |
| Default | Cache enabled, no staleness limit | Most components |
| Disabled | max_cache_staleness: "P0D" | Components reading volatile external data |
| Time-limited | max_cache_staleness: "P7D" | Data that becomes stale after a week |
Warning: Components should be pure functions: given the same inputs, they should produce the same outputs. If your component reads from volatile sources, consider adding date parameters or disabling caching.

For example, caching for a single task can be disabled in the pipeline specification:

...
implementation:
  graph:
    tasks:
      ...
      Fill all missing values using Pandas on CSV data:
        executionOptions:
          cachingStrategy:
            maxCacheStaleness: P0D

Cache Staleness Format

Cache staleness uses the ISO 8601 duration format:

  • P30D - 30 days
  • P7D - 7 days
  • PT1H - 1 hour
  • P0D - Disable caching
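For illustration, the duration forms listed above (days and hours only) can be parsed with the standard library alone. The `parse_staleness` helper below is hypothetical, not part of TangleML; a zero timedelta corresponds to caching being disabled.

```python
import re
from datetime import timedelta

# Matches only the subset used here: P<n>D, PT<n>H, and P0D.
_DURATION = re.compile(r"^P(?:(?P<days>\d+)D)?(?:T(?P<hours>\d+)H)?$")

def parse_staleness(value: str) -> timedelta:
    """Parse a days/hours ISO 8601 duration into a timedelta."""
    m = _DURATION.match(value)
    if m is None:
        raise ValueError(f"unsupported duration: {value}")
    return timedelta(days=int(m["days"] or 0), hours=int(m["hours"] or 0))
```

A full ISO 8601 duration grammar also covers years, months, minutes, and seconds; this sketch deliberately handles only the values shown in the list above.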
Info: Large artifacts are purged after 30 days due to data retention policies.

What Survives Purging

After purging, you can still see:

  • File sizes
  • Content hashes
  • Directory structure
  • Small values (metrics, parameters)
  • Execution status and timing

Cache Breaking Strategies

When you need fresh data despite caching:

1. Cache Breaker Input (Nonce)

The nonce input introduces non-determinism into the component's cache key, ensuring that the component is re-executed even when its other inputs haven't changed. The nonce can be a random string or the current time, supplied by the caller.

@component
def search_web(
    query: str,
    nonce: str | None = None,
) -> Results:
    # Pass a timestamp or random value as the nonce
    ...

For scheduled runs, pipelines can declare an input with a placeholder that the system substitutes with the run's creation time:

...
inputs:
  - name: Pipeline Creation Time
    type: String
    default: '{{CreationTime}}'
...

This substituted input can be used to pass the cache breaker value to the component:

...
arguments:
  nonce:
    graphInput:
      inputName: Pipeline Creation Time

2. Disable Caching via maxCacheStaleness

For components that should never cache:

...
implementation:
  graph:
    tasks:
      ...
      Fill all missing values using Pandas on CSV data:
        executionOptions:
          cachingStrategy:
            maxCacheStaleness: P0D

Summary

TangleML's caching system provides:

  • Automatic optimization - No manual cache management needed
  • Intelligent reuse - Content-based hashing and running execution sharing
  • Flexible control - Configure staleness and breaking strategies
  • Performance gains - Dramatic reduction in execution time

By understanding and leveraging these caching capabilities, you can iterate faster, experiment more freely, and make the most of your computational resources.