Description
Environment
- How did you deploy Kubeflow Pipelines (KFP)?
Standalone installation on AWS EKS cluster, using AWS Aurora MySQL database and AWS S3 for storage.
- KFP version:
2.5.0
- KFP SDK version:
2.13.0
Steps to reproduce
When trying to write classification metrics larger than ~4MB, the following error is raised:
main.go:51] failed to execute component: unable to produce MLMD artifact for output "classification_metrics": rpc error: code = ResourceExhausted desc = Received message larger than max (91000474 vs. 4194304)
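For context, the two numbers in the error can be read directly: 4194304 bytes is exactly 4 MiB, which is the default gRPC maximum receive message size, and the rejected message is roughly 87 MiB. A minimal sketch that parses the message and does the arithmetic (the helper name `parse_grpc_size_error` is hypothetical, introduced only for illustration):

```python
import re

# Error text copied from the launcher log above.
error = (
    "rpc error: code = ResourceExhausted desc = "
    "Received message larger than max (91000474 vs. 4194304)"
)

MIB = 1024 * 1024


def parse_grpc_size_error(message: str) -> tuple:
    """Return (actual_bytes, limit_bytes) from a 'larger than max' gRPC error.

    Hypothetical helper: it only pattern-matches the text shown in this issue.
    """
    match = re.search(r"larger than max \((\d+) vs\. (\d+)\)", message)
    if match is None:
        raise ValueError("not a gRPC message-size error")
    return int(match.group(1)), int(match.group(2))


actual, limit = parse_grpc_size_error(error)
print(f"limit:  {limit / MIB:.2f} MiB")
print(f"actual: {actual / MIB:.2f} MiB ({actual / limit:.1f}x the limit)")
```

With 1,000,000 ROC points in the reproduction below, that works out to roughly 91 bytes per point in the rejected message.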
To reproduce, you can create the following pipeline (V2):
import kfp
from kfp.dsl import ClassificationMetrics, Output, component


@component(base_image="python:3.9", packages_to_install=["numpy"])
def generate_large_classification_metrics(
    classification_metrics: Output[ClassificationMetrics],
) -> None:
    """Generate a classification metrics artifact larger than 4MB."""
    import json
    import os

    import numpy as np

    # Generate a large number of ROC curve points to create a large metrics file.
    # This creates 1,000,000 data points; each point is (fpr, tpr, threshold),
    # which far exceeds 4MB when serialized.
    num_points = 1_000_000

    # Create evenly spaced values between 0 and 1
    fpr = np.linspace(0, 1, num_points).tolist()
    tpr = np.linspace(0, 1, num_points).tolist()
    thresholds = np.linspace(1, 0, num_points).tolist()

    # Log the ROC curve data
    classification_metrics.log_roc_curve(fpr, tpr, thresholds)

    # Save the metrics to disk
    with open(classification_metrics.path, "w") as file_handle:
        json.dump(classification_metrics.metadata, file_handle)

    # Print the size of the metrics file
    file_size_bytes = os.path.getsize(classification_metrics.path)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"Generated classification metrics file size: {file_size_mb:.2f} MB")


# Define the pipeline
@kfp.dsl.pipeline(
    name="large-metrics-test-pipeline",
    description="Pipeline to test large classification metrics artifacts",
)
def large_metrics_pipeline():
    generate_large_classification_metrics()


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(large_metrics_pipeline, "large_metrics_pipeline.yaml")
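Until the limit itself can be raised, one possible stopgap is to thin the ROC curve before logging it. The sketch below is a workaround under stated assumptions, not a fix for the bug: `downsample` and `estimated_metadata_bytes` are hypothetical helpers, the `confidenceMetrics` layout is an assumption about how the SDK stores ROC points in the artifact metadata, and the JSON size is only a rough proxy for the size of the message actually sent to MLMD.

```python
import json
from typing import List, Sequence

GRPC_DEFAULT_LIMIT = 4 * 1024 * 1024  # 4194304 bytes, the limit quoted in the error


def downsample(values: Sequence[float], max_points: int) -> List[float]:
    """Keep at most max_points evenly spaced values, always keeping both ends."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    n = len(values)
    if n <= max_points:
        return list(values)
    step = (n - 1) / (max_points - 1)
    return [values[round(i * step)] for i in range(max_points)]


def estimated_metadata_bytes(
    fpr: Sequence[float], tpr: Sequence[float], thresholds: Sequence[float]
) -> int:
    """Estimate the serialized size of the ROC metadata as JSON.

    Assumption: one dict per point under a 'confidenceMetrics' key. This
    layout is not confirmed by the issue and may differ between SDK versions.
    """
    metadata = {
        "confidenceMetrics": [
            {"confidenceThreshold": t, "falsePositiveRate": f, "recall": r}
            for f, r, t in zip(fpr, tpr, thresholds)
        ]
    }
    return len(json.dumps(metadata).encode("utf-8"))


# Smaller than the reproduction (100,000 points) so the sketch runs quickly.
num_points = 100_000
fpr = [i / (num_points - 1) for i in range(num_points)]
tpr = list(fpr)
thresholds = fpr[::-1]

full_size = estimated_metadata_bytes(fpr, tpr, thresholds)

max_points = 10_000  # arbitrary illustrative cap
reduced = [downsample(series, max_points) for series in (fpr, tpr, thresholds)]
reduced_size = estimated_metadata_bytes(*reduced)

print(f"full:    {full_size} bytes, over limit: {full_size > GRPC_DEFAULT_LIMIT}")
print(f"reduced: {reduced_size} bytes, over limit: {reduced_size > GRPC_DEFAULT_LIMIT}")
```

Inside the component, the three thinned lists would be passed to `classification_metrics.log_roc_curve(...)` in place of the full ones. This trades curve resolution for a smaller artifact, so it does not replace the expected behavior described in this issue.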
Expected result
MLMD artifact is created successfully, no error is raised.
Materials and Reference
We've tried to:
- Increase GroupConcatMaxLen in the apiserver config.json by mounting a custom config map to the ml-pipeline container. The file was successfully replaced and the deployment restarted, but it didn't help.
- Patch the cache server by setting the db_group_concat_max_len argument, which made no change:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-server
spec:
  template:
    spec:
      containers:
        - name: server
          args:
            - "--tls_cert_filename=tls.crt"
            - "--tls_key_filename=tls.key"
            - "--db_host=$(DBCONFIG_HOST_NAME)"
            - "--db_user=$(DBCONFIG_USER)"
            - "--db_password=$(DBCONFIG_PASSWORD)"
            - "--db_group_concat_max_len=18446744073709551615" # https://dev.mysql.com/doc/refman/8.4/en/server-system-variables.html#sysvar_group_concat_max_len
Impacted by this bug? Give it a 👍.