@njnu-seafish
Contributor

Purpose of the pull request

close #17795

Brief change log

Add dispatch timeout checking logic to handle cases where the worker group does not exist or no workers are available

Verify this pull request

Successfully tested and verified in a real-world environment.

Pull Request Notice

If your pull request contains an incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

* Maximum time allowed for a task to be successfully dispatched.
* Default: 5 minutes.
*/
private Duration dispatchTimeout = Duration.ofMinutes(5);
Member

We should make this configurable by users in the configuration file, and it should be possible to turn it off.

Contributor Author

We should make this configurable by users in the configuration file, and it should be possible to turn it off.

good idea!

@SbloodyS SbloodyS added the improvement make more easy to user or prompt friendly label Dec 16, 2025
@SbloodyS SbloodyS added this to the 3.4.0 milestone Dec 16, 2025
* Maximum time allowed for a task to be successfully dispatched.
* Default: 5 minutes.
*/
private Duration dispatchTimeout = Duration.ofMinutes(5);
Member

Suggested change
- private Duration dispatchTimeout = Duration.ofMinutes(5);
+ private Duration maxTaskDispatchDuration;

dispatchTimeout might be confused with the timeout of a single RPC call made when dispatching a task. The default value should be null or a large duration, so that it stays compatible with the previous behavior.

You also need to add this to the documentation.

Contributor Author

maxTaskDispatchDuration

ok

private void doDispatchTask(ITaskExecutionRunnable taskExecutionRunnable) {
final int taskId = taskExecutionRunnable.getId();
final TaskExecutionContext taskExecutionContext = taskExecutionRunnable.getTaskExecutionContext();
final long timeoutMs = this.dispatchTimeout.toMillis();
Member

It's better to store the millis rather than the duration; then you can avoid executing dispatchTimeout.toMillis() here.

Contributor Author

It's better to store the millis rather than the duration; then you can avoid executing dispatchTimeout.toMillis() here.

Using a duration format like '5m' is more concise and readable. Most other configurations I’ve seen also avoid millisecond-level granularity.

Member

You can translate the duration into milliseconds in the constructor.
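
A minimal sketch of that suggestion, assuming the dispatcher receives the configured Duration through its constructor. DispatcherSketch and isDispatchTimedOut are hypothetical names used only for illustration; they are not taken from the pull request.

```java
import java.time.Duration;

/** Hypothetical stand-in for the dispatcher; only the timeout handling is shown. */
class DispatcherSketch {

    // Converted once at construction time, so the dispatch loop never calls Duration#toMillis().
    private final long maxTaskDispatchMillis;

    DispatcherSketch(Duration maxTaskDispatchDuration) {
        this.maxTaskDispatchMillis = maxTaskDispatchDuration.toMillis();
    }

    /** Returns true when the task has waited strictly longer than the configured limit. */
    boolean isDispatchTimedOut(long firstDispatchTimeMs, long nowMs) {
        return nowMs - firstDispatchTimeMs > maxTaskDispatchMillis;
    }
}
```

This keeps the readable duration format (for example '5m') in the configuration file while the hot path works on a plain long.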

}

private void doDispatchTask(ITaskExecutionRunnable taskExecutionRunnable) {
final int taskId = taskExecutionRunnable.getId();
Member

Suggested change
- final int taskId = taskExecutionRunnable.getId();
+ final int taskExecutionRunnableId = taskExecutionRunnable.getId();

This avoids confusion with the task definition id.

Contributor Author

taskExecutionRunnableId

Would final int taskInstanceId = taskExecutionRunnable.getId(); be better?

/**
 * Get the task instance id.
 * <p>
 * Need to know the id might change since the task instance might be regenerated.
 */
default int getId() {
    return getTaskInstance().getId();
}

Member

It's better to use taskExecutionRunnableId, since the outer layer is unaware that this is the taskInstanceId.

* @param elapsed the time (in milliseconds) already spent attempting to dispatch the task
* @param timeoutMs the configured dispatch timeout threshold (in milliseconds)
*/
private void handleDispatchFailure(ITaskExecutionRunnable taskExecutionRunnable, TaskDispatchException exception,
Member

Please don't print the taskId and workflowId here; all ids should already be added by MDC. We only need to print the exception here, since it already contains the failure message.

Contributor Author

Please don't print the taskId and workflowId here; all ids should already be added by MDC. We only need to print the exception here, since it already contains the failure message.

ok


- public WorkerGroupNotFoundException(String workerGroup) {
- super("Cannot find worker group: " + workerGroup);
+ public WorkerGroupNotFoundException(String message) {
Member

Why make this change?

Contributor Author

Why make this change?

Restored; removed the task info.


public class WorkerNotFoundException extends TaskDispatchException {

public WorkerNotFoundException(String message) {
Member

Suggested change
public WorkerNotFoundException(String message) {
public NoAvailableWorkerException(String workerGroup) {
super("Cannot find available worker under worker group: " + workerGroup);
}

Contributor Author

ok
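
For reference, the suggested exception could look roughly like the sketch below. The TaskDispatchException defined here is only a stand-in so the example compiles on its own; whether the real base class is checked or unchecked is an assumption.

```java
/** Stand-in for the project's base dispatch exception (assumed here to be a checked exception). */
class TaskDispatchException extends Exception {
    TaskDispatchException(String message) {
        super(message);
    }
}

/** Thrown when a worker group exists but currently has no usable worker. */
class NoAvailableWorkerException extends TaskDispatchException {

    // The caller passes only the worker group name; the exception builds the message itself.
    NoAvailableWorkerException(String workerGroup) {
        super("Cannot find available worker under worker group: " + workerGroup);
    }
}
```

Taking the worker group name instead of a free-form message keeps the wording consistent at every call site.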

Comment on lines 50 to 51
final MasterConfig masterConfig = new MasterConfig();
dispatcher = new WorkerGroupDispatcher("TestGroup", taskExecutorClient, masterConfig.getDispatchTimeout());
Member

You should add a validated test case.

Contributor Author

You should add a validated test case.

ok, add test for dispatch timeout checker

Comment on lines 129 to 132
/**
* Timestamp (ms) when the task was first enqueued for dispatch.
*/
private final long firstDispatchEnqueueTimeMs = System.currentTimeMillis();
Member

Suggested change
/**
* Timestamp (ms) when the task was first enqueued for dispatch.
*/
private final long firstDispatchEnqueueTimeMs = System.currentTimeMillis();
private final long firstDispatchTime = System.currentTimeMillis();

This is more like the creation time of the taskExecutionContext.

Contributor Author

This is more like the creation time of the taskExecutionContext.

ok

@ruanwenjun ruanwenjun added the feature new feature label Dec 16, 2025
Comment on lines 77 to 82
/**
* Configuration for the master's task dispatch timeout check mechanism.
* This controls whether the system enforces a time limit for dispatching tasks to workers,
* and if so, how long to wait before marking a task as failed due to dispatch timeout.
*/
private MasterDispatchTimeoutCheckerConfig dispatchTimeoutChecker = new MasterDispatchTimeoutCheckerConfig();
Member

Suggested change
/**
* Configuration for the master's task dispatch timeout check mechanism.
* This controls whether the system enforces a time limit for dispatching tasks to workers,
* and if so, how long to wait before marking a task as failed due to dispatch timeout.
*/
private MasterDispatchTimeoutCheckerConfig dispatchTimeoutChecker = new MasterDispatchTimeoutCheckerConfig();
private TaskDispatchPolicy taskDispatchPolicy = new TaskDispatchPolicy();

Contributor Author

ok

* If enabled, tasks that remain in the dispatch queue longer than {@link #maxTaskDispatchDuration} will be marked as failed to prevent indefinite queuing.
*/
@Data
public class MasterDispatchTimeoutCheckerConfig {
Member

Suggested change
- public class MasterDispatchTimeoutCheckerConfig {
+ public class TaskDispatchPolicy {

Contributor Author

ok

/**
* Whether to enable the dispatch timeout checking mechanism.
*/
private boolean enabled = false;
Member

Suggested change
- private boolean enabled = false;
+ private boolean dispatchTimeoutFailedEnabled = false;

Contributor Author

ok

* Tasks exceeding this duration in the dispatch queue will be failed.
* Examples: "2m", "5m", "30m". Defaults to 5 minutes.
*/
private Duration maxTaskDispatchDuration = Duration.ofMinutes(5);
Member

Suggested change
- private Duration maxTaskDispatchDuration = Duration.ofMinutes(5);
+ private Duration maxTaskDispatchDuration;

When we don't have a good suggested value, don't set a default value, as this can mislead many users.

Contributor Author

When we don't have a good suggested value, don't set a default value, as this can mislead many users.

ok
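
Putting the naming and default-value suggestions together, the policy class might end up looking something like this sketch. The field names come from the review comments; the validate() method and the plain setters (the original uses Lombok's @Data) are assumptions added to keep the example self-contained.

```java
import java.time.Duration;

/** Sketch of the renamed config class: disabled by default, with no default duration. */
class TaskDispatchPolicySketch {

    // Off by default, so existing deployments keep the previous behavior.
    private boolean dispatchTimeoutFailedEnabled = false;

    // Deliberately left null: users must choose a value that fits their cluster.
    private Duration maxTaskDispatchDuration;

    void setDispatchTimeoutFailedEnabled(boolean enabled) {
        this.dispatchTimeoutFailedEnabled = enabled;
    }

    void setMaxTaskDispatchDuration(Duration duration) {
        this.maxTaskDispatchDuration = duration;
    }

    /** Hypothetical check: enabling the feature requires an explicit, positive duration. */
    void validate() {
        if (!dispatchTimeoutFailedEnabled) {
            return;
        }
        if (maxTaskDispatchDuration == null
                || maxTaskDispatchDuration.isZero()
                || maxTaskDispatchDuration.isNegative()) {
            throw new IllegalArgumentException(
                    "maxTaskDispatchDuration must be a positive duration when the dispatch timeout check is enabled");
        }
    }
}
```

With no default duration, a misconfiguration fails fast at startup instead of silently applying a value nobody chose.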

}
taskExecutorClient.dispatch(taskExecutionRunnable);
- } catch (Exception e) {
+ } catch (TaskDispatchException ex) {
Member

Suggested change
- } catch (TaskDispatchException ex) {
+ } catch (Exception ex) {

Currently, at least in this PR, you can't guarantee that other exceptions won't be thrown here, right?
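
The concern can be illustrated with a small sketch. DispatchClientSketch and DispatchAttemptSketch are hypothetical stand-ins, not classes from the pull request; the point is only that a narrow catch clause would let unchecked exceptions escape the dispatch attempt.

```java
/** Hypothetical stand-in for the client that sends a task to a worker. */
interface DispatchClientSketch {
    void dispatch(String task) throws Exception;
}

class DispatchAttemptSketch {

    /** Tries to dispatch once and reports the outcome instead of letting any exception escape. */
    static String tryDispatch(DispatchClientSketch client, String task) {
        try {
            client.dispatch(task);
            return "DISPATCHED";
        } catch (Exception ex) {
            // Catches dispatch-specific failures as well as unexpected runtime exceptions,
            // so a single bad task cannot terminate the surrounding dispatch loop.
            return "FAILED: " + ex.getClass().getSimpleName();
        }
    }
}
```

If the catch clause named only a dispatch-specific exception type, an IllegalStateException thrown by the client would propagate out of tryDispatch.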

Comment on lines 31 to 52
@Data
@Builder
@AllArgsConstructor
public class TaskFatalLifecycleEvent extends AbstractTaskLifecycleEvent {

private final ITaskExecutionRunnable taskExecutionRunnable;

private final Date endTime;

@Override
public ILifecycleEventType getEventType() {
return TaskLifecycleEventType.FATAL;
}

@Override
public String toString() {
return "TaskFatalLifecycleEvent{" +
"task=" + taskExecutionRunnable.getName() +
", endTime=" + endTime +
'}';
}
}
Member

Please remove this.

Contributor Author

Please remove this.

ok, will publish TaskFailedLifecycleEvent

.untilAsserted(() -> verify(taskExecutorClient, times(2)).dispatch(taskExecutionRunnable));
}

@Test
Member

You should add a UT or IT test case to verify the timeout check when it is enabled. Your core logic has not been tested.

Contributor Author

You should add a UT or IT test case to verify the timeout check when it is enabled. Your core logic has not been tested.

OK. I need to modify the master's configuration. I'll submit the changes and test them — I can't run the integration tests locally.
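
One possible way to make the enabled path unit-testable without waiting for real time to pass is to inject the clock. This is only a sketch under that assumption; DispatchTimeoutCheckerSketch and shouldFail are hypothetical names, and the pull request may structure the check differently.

```java
import java.util.function.LongSupplier;

/** Hypothetical timeout check with an injectable clock, so a test can control "now". */
class DispatchTimeoutCheckerSketch {

    private final boolean enabled;
    private final long maxTaskDispatchMillis;
    private final LongSupplier clock;

    DispatchTimeoutCheckerSketch(boolean enabled, long maxTaskDispatchMillis, LongSupplier clock) {
        this.enabled = enabled;
        this.maxTaskDispatchMillis = maxTaskDispatchMillis;
        this.clock = clock;
    }

    /** Returns true only when the check is enabled and the task has waited longer than the limit. */
    boolean shouldFail(long firstDispatchTimeMs) {
        return enabled && clock.getAsLong() - firstDispatchTimeMs > maxTaskDispatchMillis;
    }
}
```

In production the clock would be System::currentTimeMillis; a unit test can pass a supplier backed by a mutable value and advance it past the limit.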

@SbloodyS SbloodyS modified the milestones: 3.4.0, 3.4.1 Dec 23, 2025
Labels

backend feature new feature improvement make more easy to user or prompt friendly test

Development

Successfully merging this pull request may close these issues.

[Improvement][Master] Add dispatch timeout checking logic to handle cases where the worker group does not exist or no workers are available.

3 participants