Commit 23d50fe

dandawg and mprahl authored

pipelines: updated KFP component concept page (#4062)

* updated KFP component concept page
* fix links
* added back next steps section
* minor wording changes
* added todo comment
* minor updates
* small formatting updates
* fixed link, small formatting
* Update content/en/docs/components/pipelines/concepts/component.md
* replaced hard-coded image with command for maintainability

Signed-off-by: Daniel Dowler <[email protected]>
Co-authored-by: Matt Prahl <[email protected]>

1 parent 239078a commit 23d50fe

File tree

1 file changed: +149 -37 lines changed

content/en/docs/components/pipelines/concepts/component.md

Lines changed: 149 additions & 37 deletions
+++
title = "Component"
description = "Conceptual overview of components in Kubeflow Pipelines"
weight = 20
+++

A **pipeline component** is the fundamental building block for an ML engineer to construct a Kubeflow Pipelines [pipeline][pipeline]. The component structure packages a functional unit of code along with its dependencies so that it can run as part of a workflow in a Kubernetes environment. Components can be combined in a [pipeline][pipeline] to create a repeatable workflow, with individual components coordinating on inputs and outputs like parameters and [artifacts][artifacts].

A component is similar to a programming function. Indeed, it is most often implemented as a wrapper around a Python function using the [KFP Python SDK][KFP SDK]. However, a KFP component goes further than a simple function, with support for code dependencies, runtime environments, and distributed execution requirements.

KFP components are designed to simplify constructing and running ML workflows in a Kubernetes environment. Using KFP components, engineers can iterate faster, reduce maintenance overhead, and focus more attention on ML work.

## The Why Behind KFP Components

Running ML code in a Kubernetes cluster presents many challenges. Some of the main ones are:

- Managing **code dependencies** (Python libraries and versions)
- Handling **system dependencies** (OS-level packages, GPU drivers, runtime environments)
- **Building and maintaining container images**, along with everything around them, from container registry support to CVE (Common Vulnerabilities and Exposures) fixes
- **Deploying supporting resources** like PersistentVolumeClaims and ConfigMaps
- Handling **inputs and outputs**, including metadata, parameters, logs, and artifacts
- **Ensuring compatibility** across clusters, images, and dependencies

KFP components simplify these challenges by enabling ML engineers to:

- **Stay at the Python level** - where most modern ML work occurs
- **Iterate quickly** - modify code without creating or rebuilding containers at each step
- **Focus on ML tasks** - rather than on platform and infrastructure concerns
- **Work seamlessly with Python IDE tools** - enable debugging, syntax highlighting, type checking, and docstring usage
- **Move between environments** - transition from local development to distributed execution with minimal changes

## What Does a Component Consist Of?

A KFP component consists of the following key elements:

### 1. Code
- Typically a Python function, but can be other code such as a Bash command.

### 2. Dependency Support
- **Python libraries** - to be installed at runtime
- **Environment variables** - to be available in the runtime environment
- **Python package indices** - for example, private PyPI servers, if needed to support installations
- **Cluster resources** - to support use of ConfigMaps, Secrets, PersistentVolumeClaims, and more
- **Runtime dependencies** - to support CPU, memory, and GPU requests and limits

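The sketch below shows where these hooks live in the SDK. It assumes the KFP v2 Python SDK plus the `kfp-kubernetes` extension; the package index URL, secret name, and environment variable are hypothetical:

```python
from kfp import dsl, kubernetes  # `kubernetes` comes from the kfp-kubernetes extension

@dsl.component(
    base_image="python:3.12-slim-bookworm",             # base image
    packages_to_install=["pandas>=2.2.3"],              # Python libraries installed at runtime
    pip_index_urls=["https://pypi.example.org/simple"],  # hypothetical private package index
)
def train(epochs: int):
    import pandas as pd
    print(f"training for {epochs} epochs with pandas {pd.__version__}")

@dsl.pipeline(name="dependency-demo")
def dependency_demo():
    task = train(epochs=10)
    # environment variables and compute resources are configured on the task
    task.set_env_variable(name="LOG_LEVEL", value="INFO")
    task.set_cpu_limit("1").set_memory_limit("2G")
    # cluster resources such as Secrets come from the kfp-kubernetes extension
    kubernetes.use_secret_as_env(
        task, secret_name="my-secret", secret_key_to_env={"token": "API_TOKEN"}
    )
```
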
### 3. Base Image
- Defines the base container runtime environment (defaults to a generic Python base image)
- May include system dependencies and pre-installed Python libraries

### 4. Input/Output (I/O) Specification
Individual components cannot share in-memory data with each other, so they use the following concepts to support exchanging information and publishing results:

- **Parameters** - for small values
- **[Artifacts][artifacts]** - for larger data like model files, processed datasets, and metadata

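For instance, a hypothetical component might mix both kinds of inputs, taking a parameter alongside an artifact produced by an upstream component (`Input[Dataset]` is the consuming counterpart of the `Output[Dataset]` shown later on this page):

```python
from kfp.dsl import component, Input, Dataset

@component()
def count_long_rows(min_length: int, data: Input[Dataset]) -> int:
    # `min_length` arrives as a parameter (a small value), while `data`
    # arrives as an artifact, materialized as a file at data.path
    with open(data.path) as f:
        return sum(1 for line in f if len(line.strip()) >= min_length)
```
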
## Constructing a Component

### 1. Python-Based Components

The recommended way to define a component is using the `@dsl.component` decorator from the [KFP Python SDK][KFP SDK]. Below are two basic component definition examples:
```python
from kfp.dsl import component, Output, Dataset

# hello world component
@component()
def hello_world(name: str = "World") -> str:
    print(f"Hello {name}!")
    return name

# process data component
@component(
    base_image="python:3.12-slim-bookworm",
    packages_to_install=["pandas>=2.2.3"],
)
def process_data(output_data: Output[Dataset]):
    '''Write a small dataset to the output artifact path'''
    import pandas as pd
    # create dataset to write to output
    data = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
    data.to_csv(output_data.path)
```

Observe that these are wrapped Python functions. The `@component` wrapper helps the KFP Python SDK supply the needed context for running these functions in containers as part of a KFP [pipeline][pipeline].

The `hello_world` component just uses the default behavior, which is to run the Python function on the default base image (`kfp.dsl.component_factory._DEFAULT_BASE_IMAGE`).

The `process_data` component adds layers of customization by supplying a specific `base_image` and `packages_to_install`. Note the `import pandas as pd` statement inside the function; since the function will run inside a container (and won't have the script context), all Python library dependencies need to be imported within the component function. This component also uses KFP's `Output[Dataset]` class, which takes care of creating a KFP [artifact][artifacts] type output.

Note that inputs and outputs are defined as Python function parameters. Also, dependencies can often be installed at runtime, avoiding the need for custom base containers. Python-based components give close access to the Python tools that ML experimenters rely on, like modules and imports, usage information, type hints, and debugging tools.
#### Run a Component's Python Function

Provided that dependencies are satisfied in your environment, it is also easy to run Python-based components as simple Python functions, which can be useful for local work. For example, to run `process_data` as a Python function, try:
```python
from kfp.dsl import Dataset  # already imported in the example above

# Provide path as dataset type (as the function expects)
dataset = Dataset(uri="data.csv")
# execute the function
# (writes data to data.csv locally)
process_data.execute(output_data=dataset)
# access the underlying function docstring
print(process_data.python_func.__doc__)
```

Component usage can get much more complex, as AI/ML use-cases often have demanding code and environment dependencies. For more on creating Python-based components, see the [component][python-sdk-component] SDK documentation.
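
To see how Python-based components fit together, here is a minimal sketch (reusing `hello_world` and `process_data` from above; the pipeline name and output file are illustrative) of wiring components into a pipeline and compiling it for submission:

```python
from kfp import compiler, dsl

@dsl.pipeline(name="hello-data-pipeline")
def hello_data_pipeline(name: str = "World"):
    # each component call creates a task (one step) in the pipeline
    hello_task = hello_world(name=name)
    data_task = process_data()

# compile to the YAML intermediate representation that the KFP backend runs
compiler.Compiler().compile(hello_data_pipeline, "hello_data_pipeline.yaml")
```
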
### 2. YAML-Based Components

The KFP backend uses YAML-based definitions to specify components. While the [KFP Python SDK][KFP SDK] can do this conversion automatically when a Python-based [pipeline][pipeline] is submitted, some use-cases can benefit from the direct YAML-based component approach.

A YAML-based component definition has the following parts:

* **Metadata:** name, description, etc.
* **Interface:** input/output specifications (name, type, description, default value, etc.).
* **Implementation:** A specification of how to run the component given a set of argument values for the component's inputs. The implementation section also describes how to get the output values from the component once the component has finished running.

YAML-based components support system commands directly. In fact, any command (or binary) that exists on the base image can be run. Here is a simple YAML-based component example:
```yaml
# my_component.yaml file
name: my-component
description: "Component that outputs \"<string prefix>...<num>\""

inputs:
- {name: string prefix, type: String}
- {name: num, type: Integer}

outputs: []

implementation:
  container:
    image: python:3.12-slim-bookworm
    args:
    - echo
    - {inputValue: string prefix}
    - ...
    - {inputValue: num}
```

For the complete definition of a YAML-based component, see the [component specification][yaml-component].

YAML-based components can be loaded for use in the Python SDK alongside Python-based components:
```python
from kfp.components import load_component_from_file

my_comp = load_component_from_file("my_component.yaml")
```
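
Once loaded, the component can be used in a pipeline like any other. A sketch, assuming the SDK's usual sanitization of the YAML input name `string prefix` into the keyword argument `string_prefix`:

```python
from kfp import dsl

@dsl.pipeline(name="yaml-component-demo")
def yaml_demo():
    # inputs declared in my_component.yaml become keyword arguments
    my_comp(string_prefix="count is", num=42)
```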

Note that a component loaded from a YAML definition will not have the same level of Python support that Python-based components do (like executing the function locally).

<!-- TODO: Briefly discuss graph components, container components, and importer components (see sdk dsl scripts) -->

## "Containerize" a Component

The KFP command-line tool contains a build command to help users "containerize" a component. This can be used to create the `Dockerfile`, `runtime-dependencies.txt`, and other supporting files, or even to build the custom image and push it to a registry. In order to use this utility, the `target_image` parameter must be set in the Python-based component definition, which itself is saved in a file.
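
For instance, a component file prepared for this workflow might look like the following sketch (the file name matches the `--component-filepattern` flag below, and the registry path is illustrative):

```python
# the_component.py
from kfp.dsl import component

@component(
    # target_image tells `kfp component build` how to tag the image it builds
    target_image="registry.example.com/ml/my-component:v1",
    packages_to_install=["pandas>=2.2.3"],
)
def my_component(text: str) -> str:
    return text.upper()
```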

```bash
# build Dockerfile and runtime-dependencies.txt
kfp component build --component-filepattern the_component.py --no-build-image --platform linux/amd64 .
```

Note that creating and maintaining custom containers can carry a significant maintenance burden. In general, a 1-to-1 relationship between components and containers is not needed or recommended, as AI/ML work is often highly iterative. A best practice is to work with a small set of base images that can support many components. If you need more control over the container build than the `kfp` CLI provides, consider using a container CLI like [docker][docker-cli] or [podman][podman-cli].

## Next steps

* Read the user guides for [Creating Components][Creating Components].
* Read an [overview of Kubeflow Pipelines](/docs/components/pipelines/overview/).
* Follow the [pipelines quickstart guide](/docs/components/pipelines/getting-started/) to deploy Kubeflow and run a sample pipeline directly from the Kubeflow Pipelines UI.
* Build your own [component and pipeline](/docs/components/pipelines/legacy-v1/sdk/component-development/).
* Build a [reusable component](/docs/components/pipelines/legacy-v1/sdk/component-development/) for sharing in multiple pipelines.

[pipeline]: /docs/components/pipelines/concepts/pipeline
[KFP SDK]: https://kubeflow-pipelines.readthedocs.io
[artifacts]: /docs/components/pipelines/concepts/output-artifact
[python-sdk-component]: https://kubeflow-pipelines.readthedocs.io/en/stable/source/dsl.html#kfp.dsl.component
[yaml-component]: /docs/components/pipelines/reference/component-spec
[docker-cli]: https://github.com/docker/cli
[podman-cli]: https://github.com/containers/podman
[Creating Components]: /docs/components/pipelines/user-guides/components
