Creating Components
Components are self-contained units of code that perform specific tasks in your ML workflow. This guide demonstrates how to create components using different programming languages.
Understanding Components
Learn more about Components.
A component consists of two essential parts:
- Implementation: Your executable code and the container image it runs in
- Specification: A YAML file defining the component's interface (inputs/outputs) and metadata. Learn more about the ComponentSpec.
Components can be written in any language that runs in a container. The key is proper containerization and defining clear input/output interfaces.
Step 1. Designing a Component
When TangleML executes a component, it starts a container image in a Kubernetes Pod. Your component's inputs are passed in as command-line arguments. You can pass small inputs, such as strings and numbers, by value. Larger inputs, such as CSV or Parquet data, are passed as paths to files. Note that all input files are mounted as read-only disks and are available as local files, and all output files are mounted as writable disks and are available as local files. When your component finishes running, its outputs are returned as files.
When you design your component's code, consider the following:
- Which inputs can be passed to your TangleML component by value? Examples of inputs that you can pass by value include numbers, booleans, and short strings. Any value that you could reasonably pass as a command-line argument can be passed to your TangleML component by value. All other inputs are passed to your TangleML component as a reference to the input's path.
- To return an output from your TangleML component, the output's data must be stored as a file. When you define your component, you let TangleML know what outputs your component produces. When your pipeline runs, TangleML passes the paths that you use to store your component's outputs as inputs to your component.
- Outputs are usually written to a single file. In some cases, you may need to return a directory of files as an output. In this case, create a directory at the output path and write the output files to that location. In both cases, it may be necessary to create parent directories if they do not exist.
- Your TangleML component's goal may be to create a dataset in an external service, such as a BigQuery table. In this case, it may make sense for the component to output an identifier for the produced data, such as a table name, instead of the data itself. It is recommended that you use this pattern only when the data must be put into an external system instead of keeping it inside the TangleML system.
- Since your inputs and output paths are passed in as command-line arguments, your component's code must be able to read inputs from the command line. If your component is built with Python, libraries such as argparse and absl.flags make it easier to read your component's inputs.
- Your component's code can be implemented in any language, as long as it can run in a container image.
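The advice above about output paths and parent directories can be sketched in Python. This is an illustrative sketch, not TangleML API code; the function and argument names are hypothetical examples:

```python
import json
from pathlib import Path


def write_outputs(output_dir, records):
    """Write one JSON file per record into a directory-style output.

    TangleML hands the component an output path as a command-line
    argument; the component itself must create the directory (and any
    missing parent directories) before writing its output files.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create parents if missing
    for i, record in enumerate(records):
        (out / f"part-{i}.json").write_text(json.dumps(record))
```

In a real component, `output_dir` would come from a command-line argument parsed with a library such as argparse.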
Component Design Principles
In summary, when designing components, remember these core principles:
- Composability: design your components with composability in mind.
  - Think about upstream and downstream components.
  - Decide which formats to consume as inputs from upstream components.
  - Decide which formats to use for output data so that downstream components can consume it.
- Pureness: components must be pure.
  - A component must not use any data other than what arrives through its inputs (unless that is impossible). Everything should either be inside the container or come from inputs.
  - Network access is strongly discouraged unless it is the explicit purpose of the component (for example, upload or download).
- Input/output handling (learn more about Inputs and Outputs):
  - Small inputs (numbers, strings) can be passed by value.
  - Large inputs (datasets, models) must be passed as file paths. Component code must use local files for input/output data.
  - Outputs are always written to files.
  - For inputs and outputs, use only the paths provided by TangleML through the InputPath and OutputPath placeholders.
- CLI: each container component is just a CLI program.
  - Your code must accept arguments from the command line.
  - Use standard argument-parsing libraries for your language.
Example Component Design
TangleML supports any language that can run in a container, and this example demonstrates how to create the same component in several of them. The program reads items from an input file and writes those whose value passes a threshold to an output file. It therefore accepts three command-line parameters:
- The path to the input file.
- The threshold for filtering.
- The path to the output file.
Example input file:
[
{ "name": "item1", "value": 5 },
{ "name": "item2", "value": 15 },
{ "name": "item3", "value": 8 },
{ "name": "item4", "value": 25 },
{ "name": "item5", "value": 2 }
]
- Python
- TypeScript
#!/usr/bin/env python3
import argparse
import json
from pathlib import Path


def process_data(input_path, output_path, threshold):
    """Process input data and filter by threshold."""
    with open(input_path, "r") as f:
        data = json.load(f)

    filtered_data = [item for item in data if item["value"] > threshold]

    # Ensure output directory exists
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w") as f:
        json.dump(filtered_data, f)

    print(f"Processed {len(filtered_data)} items")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Data processor")
    parser.add_argument(
        "--input-path", type=str, required=True, help="Path to input data file"
    )
    parser.add_argument(
        "--output-path", type=str, required=True, help="Path for output data file"
    )
    parser.add_argument(
        "--threshold", type=float, default=0.5, help="Threshold for filtering"
    )
    args = parser.parse_args()
    process_data(args.input_path, args.output_path, args.threshold)
If this program is saved as program.py, the command-line invocation of this
program is:
python3 program.py --input-path input.json --output-path output.json --threshold 10
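Given the sample input file shown earlier, a threshold of 10 keeps only the items whose value exceeds 10. The filtering step can be checked in isolation:

```python
sample = [
    {"name": "item1", "value": 5},
    {"name": "item2", "value": 15},
    {"name": "item3", "value": 8},
    {"name": "item4", "value": 25},
    {"name": "item5", "value": 2},
]

threshold = 10
# Same comparison the component uses: strictly greater than the threshold
filtered = [item for item in sample if item["value"] > threshold]
# filtered now holds item2 (value 15) and item4 (value 25)
```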
#!/usr/bin/env -S npx tsx
import { readFileSync, writeFileSync, mkdirSync } from 'fs';
import { dirname } from 'path';
import { parseArgs } from 'util';

// Define the shape of our data
interface DataItem {
  name: string;
  value: number;
}

function processData(inputPath: string, outputPath: string, threshold: number): void {
  /**
   * Process input data and filter by threshold.
   */
  // Read and parse input data
  const rawData = readFileSync(inputPath, 'utf-8');
  const data: DataItem[] = JSON.parse(rawData);

  // Filter data by threshold
  const filteredData = data.filter(item => item.value > threshold);

  // Ensure output directory exists
  mkdirSync(dirname(outputPath), { recursive: true });

  // Write filtered data to output file
  writeFileSync(outputPath, JSON.stringify(filteredData, null, 2));
  console.log(`Processed ${filteredData.length} items`);
}

function main(): void {
  // Parse command-line arguments
  const { values } = parseArgs({
    options: {
      'input-path': { type: 'string', short: 'i' },
      'output-path': { type: 'string', short: 'o' },
      threshold: { type: 'string', short: 't' },
      help: { type: 'boolean', short: 'h' },
    },
  });

  const inputPath = values['input-path'];
  const outputPath = values['output-path'];
  const threshold = values.threshold ? parseFloat(values.threshold) : 0.5;

  // Both paths are required
  if (!inputPath || !outputPath) {
    console.error('Usage: program.ts --input-path <file> --output-path <file> [--threshold <n>]');
    process.exit(1);
  }

  // Process the data
  try {
    processData(inputPath, outputPath, threshold);
  } catch (error) {
    console.error('Error processing data:', error instanceof Error ? error.message : error);
    process.exit(1);
  }
}

// Run if this is the main module
if (require.main === module) {
  main();
}

// Export for potential module usage
export { processData };
export type { DataItem };
If this program is saved as program.ts, the command-line invocation of this
program is:
npx tsx program.ts --input-path input.json --output-path output.json --threshold 10
Step 2. Containerize your component's code
To run your component in TangleML, you must package your component as a Docker container image. You should then publish this image to a container registry that your TangleML Kubernetes cluster can access. The steps required to create a container image are not specific to TangleML. To make things easier for you, this section provides some guidelines on standard container creation.
You can skip building the container image every time you make a code change by using the Lightweight Python Components.
1. Create a Dockerfile for your container. A Dockerfile specifies:
   - The base container image. For example, the operating system that your code runs on.
   - Any dependencies that need to be installed for your code to run.
   - Files to copy into the container, such as the runnable code for this component.

   The following is an example Dockerfile.

   - Python
   - TypeScript

   FROM python:3.14
   RUN python3 -m pip install keras
   COPY ./src /pipelines/component/src

   In this example:
   - The base container image is python:3.14.
   - The keras Python package is installed in the container image.
   - Files in your ./src directory are copied into /pipelines/component/src in the container image.

   FROM node:22-slim
   RUN npm install -g tsx
   COPY ./src /pipelines/component/src

   In this example:
   - The base container image is node:22-slim.
   - The tsx package is installed in the container image.
   - Files in your ./src directory are copied into /pipelines/component/src in the container image.
2. Create a script named build_image.sh that uses Docker to build your container image and push it to a container registry. Your Kubernetes cluster must be able to access your container registry to run your component.

   The following example builds a container image, pushes it to a container registry, and outputs the strict image name. It is a best practice to use the strict image name in your component specification to ensure that you are using the expected version of a container image in each component execution.

   #!/bin/bash -e
   image_name=gcr.io/my-org/my-image
   image_tag=latest
   full_image_name=${image_name}:${image_tag}

   cd "$(dirname "$0")"
   docker build -t "${full_image_name}" .
   docker push "$full_image_name"

   # Output the strict image name, which contains the sha256 image digest
   docker inspect --format="{{index .RepoDigests 0}}" "${full_image_name}"

   In the preceding example:
   - image_name specifies the full name of your container image in the container registry.
   - image_tag specifies that this image should be tagged as latest.

   Save this file and run the following to make this script executable.

   chmod +x build_image.sh
3. Run your build_image.sh script to build your container image and push it to a container registry.

4. Use docker run to test your container image locally. If necessary, revise your application and Dockerfile until your application works as expected in the container.
Local Testing
Test your component locally before deploying:
docker run -v $(pwd)/test-data:/data \
my-component:latest \
--input-path /data/input.json \
--output-path /data/output.json \
--threshold 0.5
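After the container run completes, it can help to sanity-check the output file before wiring the component into a pipeline. A minimal sketch, assuming the test-data/output.json path produced by the volume mount above:

```python
import json
from pathlib import Path


def check_output(path, threshold):
    """Verify that every item in the output file passes the threshold."""
    data = json.loads(Path(path).read_text())
    bad = [item for item in data if item["value"] <= threshold]
    if bad:
        raise ValueError(f"{len(bad)} item(s) do not pass the threshold")
    return len(data)


# Example: check_output("test-data/output.json", 0.5)
```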
Step 3. Define Component Specification
Define Component Metadata
Start your component specification by defining basic metadata. This helps users understand what your component does.
name: Data Processor
description: Processes and transforms input data according to specified parameters
metadata:
annotations:
author: Data Team
version: "1.0.0"
category: preprocessing
Use clear, descriptive names that indicate the component's purpose. Include version information to track component evolution.
Define Component Inputs
Inputs are the data and parameters your component needs to function. Each input must specify its name and can include type, description, and default values.
Basic Input Definition
inputs:
- name: input_data
description: Path to the input dataset
type: Data
Input with Default Value
inputs:
- name: threshold
description: Minimum threshold for filtering
type: Float
default: 0.5
Optional Input
inputs:
- name: config_file
description: Optional configuration file
type: String
optional: true
Large datasets should always be passed as file paths, not as values. Only pass small data like numbers or short strings by value.
Define Component Outputs
Outputs define what your component produces. Unlike inputs, outputs don't have default values; at runtime, TangleML supplies the path where the component must write each result.
Basic Output Definition
outputs:
- name: processed_data
description: The processed dataset
type: Data
Multiple Outputs Example
outputs:
- name: trained_model
description: The trained machine learning model
type: Model
- name: metrics
description: Training and validation metrics
type: Metrics
- name: visualization
description: Performance visualization plots
type: HTML
- name: model_metadata
description: Model configuration and parameters
type: Data
Define Container Implementation
The implementation section specifies how to execute your component. It includes the container image and the command to run.
Basic Container Implementation
implementation:
container:
image: gcr.io/my-project/data-processor:v1.0.0
command: ["python", "/app/process.py"]
Container with Input/Output Placeholders
The command section uses placeholders that are replaced at runtime with actual values:
implementation:
container:
image: gcr.io/my-project/ml-trainer:v1.0.0
command:
[
"python",
"/app/train.py",
"--training-data",
{ inputPath: training_data },
"--validation-data",
{ inputPath: validation_data },
"--learning-rate",
{ inputValue: learning_rate },
"--batch-size",
{ inputValue: batch_size },
"--model-output",
{ outputPath: trained_model },
"--metrics-output",
{ outputPath: metrics },
]
Understanding Placeholders
There are three types of placeholders you can use in the command:
- {inputValue: <input-name>} - replaced with the actual value of the input.

  command: [
    "python",
    "script.py",
    "--threshold",
    { inputValue: threshold }, # Becomes "--threshold", "0.5"
  ]

- {inputPath: <input-name>} - replaced with the file path to the input data.

  command: [
    "python",
    "script.py",
    "--input-file",
    { inputPath: dataset }, # Becomes "--input-file", "/path/to/dataset"
  ]

- {outputPath: <output-name>} - replaced with the path where the output should be written.

  command: [
    "python",
    "script.py",
    "--output-file",
    { outputPath: results }, # Becomes "--output-file", "/path/for/results"
  ]
The placeholder names must exactly match the input/output names defined in your component's interface.
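To make the substitution concrete, here is a rough sketch of how a runner could resolve these placeholders. This is illustrative only, not TangleML's actual implementation; the function name and dictionary arguments are hypothetical:

```python
def resolve_command(command, input_values, input_paths, output_paths):
    """Replace placeholder dicts in a command list with concrete strings.

    Placeholders are modeled as single-key dicts mirroring the YAML:
    {"inputValue": name}, {"inputPath": name}, {"outputPath": name}.
    """
    resolved = []
    for part in command:
        if isinstance(part, dict):
            kind, name = next(iter(part.items()))
            if kind == "inputValue":
                resolved.append(str(input_values[name]))
            elif kind == "inputPath":
                resolved.append(input_paths[name])
            elif kind == "outputPath":
                resolved.append(output_paths[name])
            else:
                raise ValueError(f"Unknown placeholder: {kind}")
        else:
            resolved.append(part)
    return resolved
```

For example, resolving ["python", "script.py", "--threshold", {"inputValue": "threshold"}] with input_values={"threshold": 0.5} yields ["python", "script.py", "--threshold", "0.5"].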
Container with Arguments
Sometimes you need to separate the command from its arguments:
implementation:
container:
image: gcr.io/my-project/processor:v1.0.0
command: ["python", "/app/main.py"]
args:
[
"--mode",
{ inputValue: processing_mode },
"--input",
{ inputPath: input_data },
"--output",
{ outputPath: output_data },
]
Container with Environment Variables
You can also pass inputs through environment variables:
implementation:
container:
image: gcr.io/my-project/analyzer:v1.0.0
command: ["./analyze.sh"]
env:
- name: INPUT_PATH
value: { inputPath: input_data }
- name: OUTPUT_PATH
value: { outputPath: results }
- name: PROCESSING_THREADS
value: { inputValue: num_threads }
- name: DEBUG_MODE
value: "false"
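Inside the container, the component then reads these variables from its environment instead of parsing command-line arguments. The example spec above runs a shell script (analyze.sh); as a sketch, equivalent Python code using the same variable names might look like this:

```python
import os


def read_config():
    """Read component configuration from environment variables."""
    input_path = os.environ["INPUT_PATH"]    # set by TangleML from inputPath
    output_path = os.environ["OUTPUT_PATH"]  # set by TangleML from outputPath
    threads = int(os.environ.get("PROCESSING_THREADS", "1"))
    debug = os.environ.get("DEBUG_MODE", "false").lower() == "true"
    return input_path, output_path, threads, debug
```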
Component Specification
Regardless of the implementation language, all components share the same YAML specification format.