Tangle: Getting Started

Tangle is a service and web app that lets users build and run machine learning pipelines with a drag-and-drop editor, without having to set up a development environment.

Jump to Tangle

The experimental new version of the Tangle app is now available.
No registration is required to experiment with building pipelines.

What does a pipeline system do?

Tangle is a pipeline system that:

  • Orchestrates distributed execution, scheduling, data passing, and caching
  • Containerizes tasks for isolation and reproducibility
  • Uses command-line interfaces as the true interface with user code
  • Runs programs, not just functions passing shared in-memory objects
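
To make the last two points concrete, here is a minimal sketch of what "a program with a command-line interface" could look like. It is an illustration only, not code from Tangle: the task (counting lines), the flag names `--input-path` and `--output-path`, and the function names are all assumptions. The point is that the program receives file paths as arguments and exchanges data through files rather than through shared in-memory objects.

```python
import argparse
from pathlib import Path


def count_lines(input_path: Path, output_path: Path) -> int:
    """Count the lines in input_path and write the count to output_path."""
    with input_path.open() as f:
        line_count = sum(1 for _ in f)
    # The caller chooses the output location, so do not assume that its
    # parent directory already exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(str(line_count))
    return line_count


def main(argv=None) -> None:
    # Hypothetical flag names; any CLI convention works as long as inputs
    # and outputs are passed as paths on the command line.
    parser = argparse.ArgumentParser(description="Count lines in a text file.")
    parser.add_argument("--input-path", type=Path, required=True)
    parser.add_argument("--output-path", type=Path, required=True)
    args = parser.parse_args(argv)
    count_lines(args.input_path, args.output_path)
```

A script that ends by calling `main()` can be packaged in a container image and run by any orchestrator that knows how to build the command line, which is what keeps the task independent of the pipeline system's own language or runtime.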

Why use Tangle

Tangle offers unique advantages for both teams and individual ML engineers:

Visual pipeline editor

Advanced execution caching

  • Content-based caching saves significant time and compute costs
  • When you modify a pipeline, only changed tasks are re-executed
  • Both upstream and downstream cached results are reused when possible
  • Tangle can even reuse executions that are still running, not just completed ones
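
The general idea behind content-based caching can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Tangle's actual implementation: the names `content_hash`, `cache_key`, and `ExecutionCache` are hypothetical, SHA-256 is an arbitrary choice of hash, and the cache lives in memory. The cache key is derived from the task definition plus the content of its inputs, so a downstream task is reused whenever its inputs are byte-for-byte identical, even if an upstream task was modified.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Identify a piece of data by its content rather than by its origin."""
    return hashlib.sha256(data).hexdigest()


def cache_key(component: dict, input_hashes: dict) -> str:
    """Build a key from the task definition and the hashes of its inputs."""
    payload = json.dumps(
        {"component": component, "inputs": input_hashes}, sort_keys=True
    )
    return content_hash(payload.encode("utf-8"))


class ExecutionCache:
    """In-memory stand-in for a cache shared by all pipelines."""

    def __init__(self) -> None:
        self._outputs = {}
        self.executions = 0  # counts real runs, i.e. cache misses

    def run(self, component: dict, inputs: dict, func) -> bytes:
        input_hashes = {name: content_hash(value) for name, value in inputs.items()}
        key = cache_key(component, input_hashes)
        if key not in self._outputs:
            self.executions += 1
            self._outputs[key] = func(**inputs)
        return self._outputs[key]


cache = ExecutionCache()
upper_v1 = {"name": "upper", "version": 1}
upper_v2 = {"name": "upper", "version": 2}  # modified component, same behavior
count = {"name": "count", "version": 1}


def run_pipeline(upper_component: dict) -> bytes:
    text = cache.run(upper_component, {"data": b"hello"}, lambda data: data.upper())
    return cache.run(count, {"data": text}, lambda data: str(len(data)).encode())


run_pipeline(upper_v1)  # both tasks execute
run_pipeline(upper_v1)  # both tasks are served from the cache
result = run_pipeline(upper_v2)  # only "upper" re-executes; "count" is reused
```

After the three runs, `cache.executions` is 3 rather than 6. The sketch does not cover reusing executions that are still in progress; that would require tracking pending keys as well as completed ones.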

Tracking and reproducibility

  • All pipeline runs are automatically recorded with graphs, logs, and artifact metadata
  • Intermediate data is immutable and never overwritten
  • Clone any pipeline run to reproduce exact results
  • Strict component versioning ensures reproducibility

Component reusability

Open source and flexible

  • Run on any cloud provider or locally
  • Own your data and infrastructure
  • No vendor lock-in

How Tangle compares

Tangle vs. Kubeflow Pipelines

Kubeflow Pipelines pioneered the component-based approach to ML workflows that Tangle builds upon. In fact, Tangle uses the same ComponentSpec format introduced in KFP v1, meaning components can be reused between systems.
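
For readers unfamiliar with that format, the sketch below shows its general shape as a Python dictionary (component specs are normally written as YAML files). The field names and the `inputPath` / `outputPath` / `inputValue` placeholders follow the KFP v1 component specification; the example component, the container image, and the `resolve_command` helper are assumptions made for illustration and are not taken from Tangle or KFP.

```python
# A component spec describes a containerized command-line program.
component_spec = {
    "name": "Count lines",
    "inputs": [{"name": "Text"}],
    "outputs": [{"name": "Line count"}],
    "implementation": {
        "container": {
            "image": "alpine:3.19",  # illustrative image choice
            "command": [
                "sh", "-c", 'wc -l < "$0" > "$1"',
                {"inputPath": "Text"},
                {"outputPath": "Line count"},
            ],
        }
    },
}


def resolve_command(spec, input_paths, output_paths, input_values=None):
    """Replace placeholders with concrete paths and values (simplified)."""
    input_values = input_values or {}
    resolved = []
    for part in spec["implementation"]["container"]["command"]:
        if isinstance(part, str):
            resolved.append(part)
        elif "inputValue" in part:
            resolved.append(str(input_values[part["inputValue"]]))
        elif "inputPath" in part:
            resolved.append(input_paths[part["inputPath"]])
        elif "outputPath" in part:
            resolved.append(output_paths[part["outputPath"]])
        else:
            raise ValueError(f"Unsupported placeholder: {part!r}")
    return resolved


command = resolve_command(
    component_spec,
    input_paths={"Text": "/tmp/inputs/Text/data"},
    output_paths={"Line count": "/tmp/outputs/Line_count/data"},
)
```

Because the spec only names inputs and outputs and leaves the actual file locations to the orchestrator, the same component definition can be run by any system that understands the placeholders.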

Here's how Kubeflow compares to Tangle:

  • Visual pipeline editor: While Kubeflow requires writing Python code to define pipelines, Tangle offers a visual drag-and-drop editor that makes pipeline creation accessible to non-engineers.
  • Execution caching: Tangle's content-based caching is more advanced than Kubeflow's lineage-based approach. Tangle caches globally across all pipelines and can even reuse running executions, whereas Kubeflow only caches successful executions within each pipeline. This means significantly better resource utilization and faster iteration cycles.
  • No Kubernetes requirement: Kubeflow Pipelines requires Kubernetes infrastructure, which can be complex to set up and maintain. Tangle does not require Kubernetes.

Tangle vs. Vertex AI Pipelines

Vertex AI Pipelines is Google Cloud's managed pipeline orchestration service. It uses a component model similar to Tangle's but is proprietary and only available on Google Cloud.

Here's how Vertex AI Pipelines compares to Tangle:

  • Visual pipeline editor: While Vertex AI requires writing code to define pipelines, Tangle offers a visual drag-and-drop editor that makes pipeline creation accessible to non-engineers.
  • Execution caching: Like Kubeflow, Vertex Pipelines uses lineage-based caching that's limited to successful executions within each pipeline. Tangle's content-based caching works globally across all pipelines and can reuse partial results, making experimentation more efficient.
  • Open source and cloud-agnostic: Because Tangle is open source, you can run it anywhere you want: on your laptop, on-premises, or on any cloud provider.

Tangle vs. Apache Airflow

Apache Airflow was built for data engineering workflows rather than ML pipelines. This shows in fundamental architectural differences that matter for ML use cases.

Here's how Airflow compares to Tangle:

  • Visual pipeline editor: In Airflow, everything is code: you write Python operators that execute locally or call remote services. Tangle provides a visual interface.
  • Data passing: Airflow relies on XComs, which are limited to small JSON data and require manual data management for anything substantial. Tangle provides explicit input/output definitions and manages data storage and passing automatically.
  • Execution caching: Airflow doesn't provide execution caching. Every time you run a workflow in Airflow, all tasks execute from scratch. Tangle uses content-based caching. That means you only pay for computation that actually needs to be redone, which can translate to significant cost savings in ML experimentation where pipeline iterations are common.
  • Containerization: Airflow operators typically run in a shared Python environment, which can lead to the dependency management challenges familiar to any Python developer. Tangle runs each task in an isolated container with clearly defined interfaces, ensuring reproducibility and eliminating dependency conflicts.

Credits

The Tangle app is based on the Pipeline Editor app created by Alexey Volkov as part of the Cloud Pipelines project.

Please report any bugs you find using GitHub Issues.