
[backend] Received message larger than max when creating MLMD artifact #11949

@lifo9


Environment

  • How did you deploy Kubeflow Pipelines (KFP)?

Standalone installation on AWS EKS cluster, using AWS Aurora MySQL database and AWS S3 for storage.

  • KFP version:

2.5.0

  • KFP SDK version:

2.13.0

Steps to reproduce

When trying to write classification metrics larger than ~4MB, the following error is raised:

main.go:51] failed to execute component: unable to produce MLMD artifact for output "classification_metrics": rpc error: code = ResourceExhausted desc = Received message larger than max (91000474 vs. 4194304)
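
For context, the limit in the error, 4194304, is 4 MiB (4 × 1024 × 1024 bytes), which matches gRPC's default maximum receive message size. Below is a minimal sketch for checking ahead of time whether an artifact's metadata is likely to cross that limit. The helper name `exceeds_grpc_default_limit` and the sample dictionaries are hypothetical, and the JSON size only approximates the size of the message that is actually sent to MLMD:

```python
import json

# gRPC's default maximum receive message size: 4 MiB, the "max" in the error above.
GRPC_DEFAULT_MAX_RECEIVE_BYTES = 4 * 1024 * 1024  # 4194304


def exceeds_grpc_default_limit(
    metadata: dict, limit: int = GRPC_DEFAULT_MAX_RECEIVE_BYTES
) -> bool:
    """Return True if the JSON-serialized metadata is larger than `limit` bytes.

    Hypothetical helper: the JSON size is only an approximation of the
    size of the message sent to MLMD.
    """
    serialized_size = len(json.dumps(metadata).encode("utf-8"))
    return serialized_size > limit


# Illustrative payloads, not the real artifact metadata layout.
small = {"points": [[0.0, 0.0, 1.0]]}
large = {"values": [0.123456789] * 1_000_000}

print(exceeds_grpc_default_limit(small))  # False
print(exceeds_grpc_default_limit(large))  # True
```

In the component above, the same check could be run on `classification_metrics.metadata` before the component returns.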

To reproduce, you can create the following pipeline (V2):

import kfp
from kfp.dsl import ClassificationMetrics, Output, component


@component(base_image="python:3.9", packages_to_install=["numpy"])
def generate_large_classification_metrics(
    classification_metrics: Output[ClassificationMetrics],
) -> None:
    """Generate a classification metrics artifact larger than 4MB."""
    import json
    import numpy as np

    # Generate a large number of ROC curve points to create a large metrics file.
    # This creates 1,000,000 points; each point is (fpr, tpr, threshold), so
    # 3,000,000 values in total, which far exceeds 4 MB when serialized.
    num_points = 1_000_000

    # Create evenly spaced values between 0 and 1
    fpr = np.linspace(0, 1, num_points).tolist()
    tpr = np.linspace(0, 1, num_points).tolist()
    thresholds = np.linspace(1, 0, num_points).tolist()

    # Log the ROC curve data
    classification_metrics.log_roc_curve(fpr, tpr, thresholds)

    # Save the metrics to disk
    with open(classification_metrics.path, "w") as file_handle:
        json.dump(classification_metrics.metadata, file_handle)

    # Print the size of the metrics file
    import os

    file_size_bytes = os.path.getsize(classification_metrics.path)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"Generated classification metrics file size: {file_size_mb:.2f} MB")


# Define the pipeline
@kfp.dsl.pipeline(
    name="large-metrics-test-pipeline",
    description="Pipeline to test large classification metrics artifacts",
)
def large_metrics_pipeline():
    generate_large_classification_metrics()


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(large_metrics_pipeline, "large_metrics_pipeline.yaml")

Expected result

MLMD artifact is created successfully, no error is raised.

Materials and Reference

We've tried to:

  • Increase GroupConcatMaxLen in the API server's config.json by mounting a custom ConfigMap into the ml-pipeline container. The file was replaced and the deployment restarted, but the error persisted.
  • Patch the cache server by setting the db_group_concat_max_len argument. This made no difference either:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-server
spec:
  template:
    spec:
      containers:
        - name: server
          args:
            - "--tls_cert_filename=tls.crt"
            - "--tls_key_filename=tls.key"
            - "--db_host=$(DBCONFIG_HOST_NAME)"
            - "--db_user=$(DBCONFIG_USER)"
            - "--db_password=$(DBCONFIG_PASSWORD)"
            - "--db_group_concat_max_len=18446744073709551615" # https://dev.mysql.com/doc/refman/8.4/en/server-system-variables.html#sysvar_group_concat_max_len
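
A possible stopgap that does not depend on any server setting is to thin the ROC curve before logging it, so that the metadata stays well under 4 MiB. This is only a sketch and not a fix for the underlying limit. The helper `downsample_roc` and its `max_points` parameter are hypothetical, and evenly spaced sampling is an assumption that may not suit every curve:

```python
from typing import List, Sequence, Tuple


def downsample_roc(
    fpr: Sequence[float],
    tpr: Sequence[float],
    thresholds: Sequence[float],
    max_points: int = 1000,
) -> Tuple[List[float], List[float], List[float]]:
    """Keep at most `max_points` evenly spaced ROC points, including both endpoints."""
    n = len(fpr)
    if not (n == len(tpr) == len(thresholds)):
        raise ValueError("fpr, tpr and thresholds must have the same length")
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    if n <= max_points:
        return list(fpr), list(tpr), list(thresholds)
    # Evenly spaced indices from 0 to n - 1, so the first and last points are kept.
    indices = [round(i * (n - 1) / (max_points - 1)) for i in range(max_points)]
    return (
        [fpr[i] for i in indices],
        [tpr[i] for i in indices],
        [thresholds[i] for i in indices],
    )


# Small stand-in for the curve in the reproduction above.
fpr = [i / 10_000 for i in range(10_001)]
tpr = list(fpr)
thresholds = [1 - x for x in fpr]

fpr_small, tpr_small, thresholds_small = downsample_roc(
    fpr, tpr, thresholds, max_points=101
)
print(len(fpr_small))  # 101
```

Inside the component, the reduced lists would then be passed to `classification_metrics.log_roc_curve(fpr_small, tpr_small, thresholds_small)` in place of the full ones.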

Impacted by this bug? Give it a 👍.

Status: Triaged