
Getting Started: Your First Pipeline in TangleML

Build and run your first machine learning pipeline in under 10 minutes using the TangleML editor's drag-and-drop interface.

Overview

This guide walks you through creating a complete XGBoost machine learning pipeline using TangleML's standard library components. You'll learn how to:

  • Access and use standard library components
  • Connect components to create data flows
  • Run and monitor pipeline execution
  • Troubleshoot common issues
[Screenshot: First pipeline preview]

Note: Beyond the basics, this quickstart also covers tips for troubleshooting common issues you might encounter as you build more advanced pipelines.

Accessing Standard Library Components

The standard library provides pre-built components for common ML tasks. To access it:

  1. Click on Standard Library Components in the left panel
  2. Navigate to the Quick Start folder for beginner-friendly components
  3. Browse available components organized by category
Tip: Each component in the standard library is fully documented. You can inspect its implementation details by opening the component info dialog and checking the Implementation tab.

Building Your First Pipeline

Step 1: Add a Data Source

Start by adding the Chicago Taxi Trips Dataset component to your canvas:

  1. Find the component in the Quick Start folder
  2. Drag and drop it onto the canvas
  3. This component fetches data from Chicago's open API using a simple cURL command
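Under the hood, the dataset component essentially issues an HTTP request against Chicago's open-data API. Here is a minimal sketch of how such a request URL might be assembled; the dataset path and query parameters are illustrative assumptions, not taken from the component's source:

```python
# Hypothetical sketch of what the Chicago Taxi Trips Dataset component does:
# build a query URL for Chicago's open-data API, then fetch it with cURL.
# The dataset path and parameters below are illustrative assumptions.
from urllib.parse import urlencode

def build_taxi_query(base_url: str, limit: int = 1000) -> str:
    """Assemble a CSV export URL like the one the component fetches."""
    params = {"$limit": limit, "$order": "trip_start_timestamp"}
    return f"{base_url}?{urlencode(params)}"

url = build_taxi_query("https://data.cityofchicago.org/resource/taxi.csv", limit=500)
print(url)
# The component would then run something like: curl -o data.csv "<url>"
```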
Tip: You can run the pipeline at this point and check the results in the Outputs or Artifacts tab.

Step 2: Add Training and Prediction Components

Next, add the ML components:

  1. Drag Train XGBoost Model on CSV onto the canvas
  2. Drag XGBoost Predict on CSV onto the canvas
  3. You'll now have three components ready to connect

Step 3: Connect Components

Create the data flow by connecting component outputs to inputs:

  1. Connect the dataset to training:

    • Drag from the table output of Chicago Taxi Trips Dataset
    • Connect to the training_data input of Train XGBoost Model
  2. Connect to prediction component:

    • Connect the same dataset table output to the prediction component's data input
    • Connect the training component's model output to the prediction component's model input
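Conceptually, the connections you just drew form a small dependency graph. The steps above can be sketched in plain Python like this (the task and port names are paraphrased from the UI, not TangleML's internal identifiers):

```python
# Illustrative in-memory picture of the graph built in Step 3.
# Each input maps to an (upstream task, output port) pair; all names
# are paraphrased from the UI, not TangleML's internal identifiers.
pipeline = {
    "chicago_taxi_trips_dataset": {"inputs": {}},
    "train_xgboost_model": {
        "inputs": {"training_data": ("chicago_taxi_trips_dataset", "table")},
    },
    "xgboost_predict": {
        "inputs": {
            "data": ("chicago_taxi_trips_dataset", "table"),
            "model": ("train_xgboost_model", "model"),
        },
    },
}

def upstream_tasks(task: str) -> set:
    """Return the set of tasks whose outputs this task consumes."""
    return {source for source, _port in pipeline[task]["inputs"].values()}

print(upstream_tasks("xgboost_predict"))
```

Notice that the prediction task depends on both the dataset and the trained model, which is exactly why it can only start after both finish.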
Tip: Red dots indicate required inputs that need configuration. The pipeline cannot run until all validation errors are resolved.

Step 4: Configure Required Parameters

Some components need additional configuration:

  1. Click on the Train XGBoost Model component
  2. In the arguments editor, set the Label Column Name (e.g., "tips")
  3. Set the same label column name for the prediction component
  4. Red validation dots will disappear when properly configured
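The red-dot validation can be pictured as a simple check for unconfigured required arguments. A toy sketch of that idea, with hypothetical field names:

```python
# Toy model of the red-dot validation: a task is runnable only when
# every required argument has a value. Field names are hypothetical.
required_arguments = {
    "train_xgboost_model": ["label_column_name"],
    "xgboost_predict": ["label_column_name"],
}
configured_arguments = {
    "train_xgboost_model": {"label_column_name": "tips"},
    "xgboost_predict": {},  # red dot: label column not set yet
}

def missing_arguments(task: str) -> list:
    """Arguments that would show as red dots for this task."""
    configured = configured_arguments.get(task, {})
    return [arg for arg in required_arguments[task] if arg not in configured]

print(missing_arguments("xgboost_predict"))
```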
Tip: TangleML automatically saves the canvas after every change. You can verify the auto-save status by checking the "Last saved" timestamp in the Pipeline Actions section of the left panel.

Step 5: Running and Troubleshooting Your Pipeline

Now that all components are connected, let's run the pipeline:

  1. Click the Submit Run button on the left panel to submit your pipeline
  2. Components will transition through states:
    • Queued → Pending → Running → Succeeded (or Failed)

You may see that a component failed. That's OK! We'll troubleshoot it next.

[Screenshot: Submit Run button highlighted in the pipeline UI]
Note: Tasks that depend on failed tasks are immediately marked as "Skipped".
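The skip behavior can be sketched as a small propagation rule over the task graph (a toy model, not TangleML's actual scheduler):

```python
# Toy model of skip propagation (not TangleML's actual scheduler):
# a task whose upstream failed or was skipped is marked "Skipped".
def final_states(deps: dict, failed_task: str) -> dict:
    states = {task: "Succeeded" for task in deps}
    states[failed_task] = "Failed"
    changed = True
    while changed:  # propagate until no further task flips to Skipped
        changed = False
        for task, upstream in deps.items():
            if states[task] == "Succeeded" and any(
                states[up] in ("Failed", "Skipped") for up in upstream
            ):
                states[task] = "Skipped"
                changed = True
    return states

deps = {"dataset": [], "train": ["dataset"], "predict": ["dataset", "train"]}
print(final_states(deps, "train"))
```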

Handling Pipeline Failures:

The most common way to troubleshoot a pipeline failure is to check the logs.

  1. Click on the failed component
  2. Select View Logs to see detailed error messages.
Tip: Use the "Fullscreen" button to view the logs in full-screen mode, and press CMD/CTRL + F to search for specific text in the logs.

Note: Different components produce different error messages, so troubleshooting a failure requires some familiarity with the specific component involved.

In this case, the component produced the following error message:

Check failed: valid: Label contains NaN, infinity or a value too large.

This means that some rows in the dataset have empty values in the label column. That's OK; we can fix it. The standard library provides a component for exactly this purpose: Fill all missing values using Pandas on CSV data.
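To see why the training component rejects the data, here is a stdlib-only sketch of a check that flags rows whose label cell is empty or non-finite. The toy data is made up for illustration; the real validation happens inside XGBoost:

```python
# Stdlib sketch of a check that reproduces the failure: flag rows whose
# label cell is empty, NaN, or infinite. Toy data; the real validation
# happens inside XGBoost.
import csv
import io
import math

raw = "trip_seconds,tips\n300,1.5\n480,\n120,2.0\n"  # row 1 has no label

def bad_label_rows(csv_text: str, label_column: str) -> list:
    """Indices of rows whose label is missing or non-finite."""
    bad = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        try:
            ok = math.isfinite(float(row[label_column]))
        except ValueError:
            ok = False
        if not ok:
            bad.append(i)
    return bad

print(bad_label_rows(raw, "tips"))
```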

Return to Editor mode by clicking the "Inspect" button in the right context panel.

Fix data quality issues:

  1. Search for and add the Fill All Missing Values on CSV component

  2. Drag it onto the canvas between your dataset and training components

  3. Redirect connections:

    • Connect dataset table output → Fill Missing Values data input
    • Connect Fill Missing Values table output → Training and Prediction data inputs
  4. Re-run the pipeline by clicking Submit Run button again
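What the fill-missing component does can be approximated in a few lines of stdlib Python. The real component uses Pandas and likely supports richer strategies; this sketch just replaces empty CSV cells with a fill value:

```python
# Rough stdlib approximation of the fill-missing component: replace
# empty CSV cells with a fill value. The real component uses Pandas;
# this sketch only captures the core idea.
import csv
import io

def fill_missing(csv_text: str, fill_value: str = "0") -> str:
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    filled = [[cell if cell != "" else fill_value for cell in row] for row in body]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(filled)
    return out.getvalue()

print(fill_missing("trip_seconds,tips\n300,1.5\n480,\n"))
```

After this transformation, every label cell has a finite value, so the training component's validation passes.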

Note: You may notice that some tasks executed instantly. TangleML caches deterministic component results: when you re-run a pipeline, any task with an identical component digest and identical inputs reuses its cached outputs, dramatically speeding up iteration. Cached tasks complete instantly with a "Succeeded" status.
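Caching along these lines can be sketched as hashing the component digest together with its inputs to form a lookup key. This is a simplified model, not TangleML's actual cache implementation:

```python
# Simplified model of result caching: hash the component digest together
# with the (canonicalized) inputs; identical keys mean cached outputs can
# be reused. Not TangleML's actual cache implementation.
import hashlib
import json

def cache_key(component_digest: str, inputs: dict) -> str:
    payload = json.dumps(
        {"component": component_digest, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same component + same inputs -> same key -> cache hit on re-run.
print(cache_key("sha256:abc", {"data": "artifact-123"}))
```

Sorting the keys before hashing makes the key independent of dictionary ordering, which is what makes the lookup deterministic.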

Step 6: Using Input Nodes for Repeated Values

Avoid repetition and make your pipeline more maintainable by using Input nodes as variables:

  1. Add an Input node:

    • From the component panel, drag an Input node onto the canvas
    • Name it descriptively (e.g., "label_column_name")
    • Set its value to "tips" (or your label column name)
  2. Connect the Input node:

    • Connect the Input node's output to both:
      • Train XGBoost Model's label_column_name parameter
      • XGBoost Predict's label_column_name parameter
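The fan-out from an Input node can be pictured as a single named value referenced by several (task, parameter) connections. All names here are illustrative:

```python
# Illustrative model of an Input node feeding several parameters:
# one named value, many (task, parameter) connections.
inputs = {"label_column_name": "tips"}
connections = {
    ("train_xgboost_model", "label_column_name"): "label_column_name",
    ("xgboost_predict", "label_column_name"): "label_column_name",
}

def resolve(task: str, parameter: str) -> str:
    """Look up the value a parameter receives from its Input node."""
    return inputs[connections[(task, parameter)]]

print(resolve("train_xgboost_model", "label_column_name"))
# Editing inputs["label_column_name"] once updates both components.
```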
Note: Using Input nodes as pipeline-wide constants or variables has several benefits:

  • Change the value once to update all connected components
  • Makes pipelines more readable and maintainable
  • Reduces configuration errors
[Screenshot: Using Input nodes for repeated values]

Summary

You've successfully created and run your first machine learning pipeline in TangleML! The platform's visual interface, combined with automatic caching and comprehensive logging, makes it easy to iterate quickly on ML workflows. Remember that parallel execution and result caching will significantly speed up subsequent runs as you refine your pipeline.