+++
title = "Component"
description = "Conceptual overview of components in Kubeflow Pipelines"
weight = 20
+++

A **pipeline component** is the fundamental building block for an ML engineer constructing a Kubeflow Pipelines [pipeline][pipeline]. A component packages a functional unit of code along with its dependencies so that it can run as part of a workflow in a Kubernetes environment. Components can be combined into a [pipeline][pipeline] that creates a repeatable workflow, with individual components coordinating on inputs and outputs such as parameters and [artifacts][artifacts].

|
15 |
| -The code for each component includes the following: |
| 10 | +A component is similar to a programming function. Indeed, it is most often implemented as a wrapper to a Python function using the [KFP Python SDK][KFP SDK]. However, a KFP component goes further than a simple function, with support for code dependencies, runtime environments, and distributed execution requirements. |
16 | 11 |
|
17 |
| -* **Client code:** The code that talks to endpoints to submit jobs. For example, |
18 |
| - code to talk to the Google Dataproc API to submit a Spark job. |
| 12 | +KFP components are designed to simplify constructing and running ML workflows in a Kubernetes environment. Using KFP components, engineers can iterate faster, reduce maintenance overhead, and focus more attention on ML work. |
19 | 13 |
|
20 |
| -* **Runtime code:** The code that does the actual job and usually runs in the |
21 |
| - cluster. For example, Spark code that transforms raw data into preprocessed |
22 |
| - data. |
| 14 | +## The Why Behind KFP Components |
23 | 15 |
|
24 |
| -Note the naming convention for client code and runtime code—for a task |
25 |
| -named "mytask": |
| 16 | +Running ML code in a Kubernetes cluster presents many challenges. Some of the main challenges are: |
| 17 | +- Managing **code dependencies** (Python libraries and versions) |
| 18 | +- Handling **system dependencies** (OS-level packages, GPU drivers, runtime environments) |
| 19 | +- **Building and maintaining container images** and everything around this from container registry support to CVE (Common Vulnerabilities and Exposures) fixes |
| 20 | +- **Deploying supporting resources** like PersistentVolumeClaims and ConfigMaps |
| 21 | +- Handling **inputs and outputs** including metadata, parameters, logs, and artifacts |
| 22 | +- **Ensuring compatibility** across clusters, images, and dependencies |
26 | 23 |
|
27 |
| -* The `mytask.py` program contains the client code. |
28 |
| -* The `mytask` directory contains all the runtime code. |
| 24 | +KFP components simplify these challenges by enabling ML engineers to: |
| 25 | +- **Stay at the Python level** - where most modern ML work occurs |
| 26 | +- **Iterate quickly** – modify code without creating or rebuilding containers at each step |
| 27 | +- **Focus on ML tasks** - rather than on platform and infrastructure concerns |
| 28 | +- **Work seamlessly with Python IDE tools** – enable debugging, syntax highlighting, type checking, and docstring usage |
| 29 | +- **Move between environments** – transition from local development to distributed execution with minimal changes |
29 | 30 |
## What Does a Component Consist Of?

A KFP component consists of the following key elements:

### 1. Code
- Typically a Python function, but can be other code such as a Bash command.

### 2. Dependency Support
- **Python libraries** – to be installed at runtime
- **Environment variables** – to be available in the runtime environment
- **Python package indices** – for example, private PyPI servers, if needed to support installations
- **Cluster resources** – to support use of ConfigMaps, Secrets, PersistentVolumeClaims, and more
- **Runtime dependencies** – to support CPU, memory, and GPU requests and limits

### 3. Base Image
- Defines the base container runtime environment (defaults to a generic Python base image)
- May include system dependencies and pre-installed Python libraries

### 4. Input/Output (I/O) Specification
- Individual components cannot share in-memory data with each other, so they use the following concepts to exchange information and publish results:
  - **Parameters** – for small values
  - **[Artifacts][artifacts]** – for larger data like model files, processed datasets, and metadata

## Constructing a Component

### 1. Python-Based Components
The recommended way to define a component is using the `@dsl.component` decorator from the [KFP Python SDK][KFP SDK]. Below are two basic component definitions:
```python
from kfp.dsl import component, Output, Dataset

# hello world component
@component()
def hello_world(name: str = "World") -> str:
    print(f"Hello {name}!")
    return name

# process data component
@component(
    base_image="python:3.12-slim-bookworm",
    packages_to_install=["pandas>=2.2.3"],
)
def process_data(output_data: Output[Dataset]):
    """Create a small dataset and write it to the output artifact."""
    import pandas as pd
    # create dataset to write to output
    data = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
    data.to_csv(output_data.path)
```

Observe that these are wrapped Python functions. The `@component` wrapper gives the KFP Python SDK the context it needs to run these functions in containers as part of a KFP [pipeline][pipeline].

The `hello_world` component uses the default behavior, which is to run the Python function on the default base image (`kfp.dsl.component_factory._DEFAULT_BASE_IMAGE`).

The `process_data` component adds layers of customization by supplying a specific `base_image` and `packages_to_install`. Note the `import pandas as pd` statement inside the function: since the function will run inside a container (and won't have the script context), all Python library dependencies need to be imported within the component function. This component also uses KFP's `Output[Dataset]` class, which takes care of creating a KFP [artifact][artifacts] type output.

Note that inputs and outputs are defined as Python function parameters. Also, dependencies can often be installed at runtime, avoiding the need for custom base containers. Python-based components give close access to the Python tools that ML experimenters rely on, like modules and imports, usage information, type hints, and debugging tools.

#### Run a Component's Python Function
Provided that dependencies are satisfied in your environment, it is also easy to run Python-based components as plain Python functions, which can be useful for local work. For example, to run `process_data` as a Python function try:
```python
from kfp.dsl import Dataset

# Provide a path as a Dataset artifact (as the function expects)
dataset = Dataset(uri="data.csv")
# execute the function
# (writes data to data.csv locally)
process_data.execute(output_data=dataset)
# access the underlying function's docstring
print(process_data.python_func.__doc__)
```

Component usage can get much more complex, as AI/ML use-cases often have demanding code and environment dependencies. For more on creating Python-based components, see the [component][python-sdk-component] SDK documentation.

### 2. YAML-Based Components

The KFP backend uses YAML-based definitions to specify components. While the [KFP Python SDK][KFP SDK] can do this conversion automatically when a Python-based [pipeline][pipeline] is submitted, some use-cases benefit from working with the YAML-based component definitions directly.

A YAML-based component definition has the following parts:

* **Metadata:** name, description, etc.
* **Interface:** input/output specifications (name, type, description, default value, etc.)
* **Implementation:** A specification of how to run the component given a set of argument values for the component's inputs. The implementation section also describes how to get the output values from the component once the component has finished running.

YAML-based components support system commands directly. In fact, any command (or binary) that exists on the base image can be run. Here is a simple YAML-based component example:
```yaml
# my_component.yaml file
name: my-component
description: "Component that outputs \"<string prefix>...<num>\""

inputs:
- {name: string prefix, type: String}
- {name: num, type: Integer}

outputs: []

implementation:
  container:
    image: python:3.12-slim-bookworm
    args:
    - echo
    - {inputValue: string prefix}
    - ...
    - {inputValue: num}
```

For the complete definition of a YAML-based component, see the [component specification][yaml-component].

YAML-based components can be loaded for use in the Python SDK alongside Python-based components:
```python
from kfp.components import load_component_from_file

my_comp = load_component_from_file("my_component.yaml")
```

Note that a component loaded from a YAML definition will not have the same level of Python support that Python-based components do (such as executing the underlying function locally).

<!-- TODO: Briefly discuss graph components, container components, and importer components (see sdk dsl scripts) -->

## "Containerize" a Component

The KFP command-line tool provides a build command to help users "containerize" a component. It can create the `Dockerfile`, `runtime-dependencies.txt`, and other supporting files, or even build the custom image and push it to a registry. To use this utility, the `target_image` parameter must be set in the Python-based component definition, which itself is saved in a file.
```bash
# build Dockerfile and runtime-dependencies.txt
kfp component build --component-filepattern the_component.py --no-build-image --platform linux/amd64 .
```
Note that creating and maintaining custom containers can carry a significant maintenance burden. In general, a 1-to-1 relationship between components and containers is not needed or recommended, as AI/ML work is often highly iterative. A best practice is to work with a small set of base images that can support many components. If you need more control over the container build than the `kfp` CLI provides, consider using a container CLI like [docker][docker-cli] or [podman][podman-cli].
## Next steps

* Read the user guides for [Creating Components][Creating Components].
* Read an [overview of Kubeflow Pipelines](/docs/components/pipelines/overview/).
* Follow the [pipelines quickstart guide](/docs/components/pipelines/getting-started/)
  to deploy Kubeflow and run a sample pipeline directly from the Kubeflow
  Pipelines UI.
* Build your own
  [component and pipeline](/docs/components/pipelines/legacy-v1/sdk/component-development/).
* Build a [reusable component](/docs/components/pipelines/legacy-v1/sdk/component-development/) for
  sharing in multiple pipelines.


[pipeline]: /docs/components/pipelines/concepts/pipeline
[KFP SDK]: https://kubeflow-pipelines.readthedocs.io
[artifacts]: /docs/components/pipelines/concepts/output-artifact
[python-sdk-component]: https://kubeflow-pipelines.readthedocs.io/en/stable/source/dsl.html#kfp.dsl.component
[yaml-component]: /docs/components/pipelines/reference/component-spec
[docker-cli]: https://github.com/docker/cli
[podman-cli]: https://github.com/containers/podman
[Creating Components]: /docs/components/pipelines/user-guides/components