Lift session issues (#454)

Posted by @cwrx777:

HI,

rmf release: iron 20231229

I can’t provide any more detail at this moment, but we encountered a few issues in interactions with lift with multiple robots:

  1. When a robot was having a lift session and then the task was cancelled, and the lift session didn’t get released. After another task to go to a waypoint that is on the same leve was sent, the robot still received path request to enter lift waypoint. the path requests consisted of less than 4 waypoints, from current position to the lift waypoint even though there are multiple intermediate points, including named ones. I have configured the bulding yaml to use docking to enter and exit lift.

Path request:
image

  1. A robot (robot-1) currently holding a lift session and waiting at the lift entrance position was ‘forced’ to deconflict by another robot (robot-2) that wanted to go to that position and result in deadlock (ie. robot-1 hold the lift, but robot-2 was at lift entrance waiting for the lift to be released by robot-1).

  2. A robot entered lift when it’s not having a lift session.

  3. When there are two robots on different floors queueing for the same lift, to which robot the lift will be given? is it based on based on the request time? or based on the lift position? we encountered situations whereby a lift is bring down a robot (robot 1) to level 1, and robot 2 is alredy waiting for the lift at level 1 to go to level 5 and robot 3 is waiting for lift at level 5. after robot 1 exit the lift, the lift session is given to robot 3 (i.e. the lift went to level 5 empty when it could’ve brought robot 1 to level 5 also).

  4. is there a chance that robot need to go toother floor to perform deconflict? we encountered that a robot was sent to a floor that is not included in the list of waypoins in a task. in the log below, door_08_osu58_01 is on different floor from robot current floor.

[WardRobot_FlAdpt]: Planning for [WardRobot/WardRobot_101] to [CART_W56_01] from one of these locations:
[WardRobot_FlAdpt]: Ensuring door [door_08_58_01] is open for [WardRobot/WardRobot_101] after a replan
[WardRobot_FlAdpt]: Executing go_to_place [CART_W56_01] for robot [WardRobot/WardRobot_101]
[WardRobot_FlAdpt]: Opening door [door_08_58_01] for [WardRobot/WardRobot_101]
[WardRobot_FlAdpt]: Robot [WardRobot/WardRobot_101] will release door [door_08_58_01] after a replan because it is no longer needed.
  1. With reference to the above traffic lanes, when there is obstacle in front of the lift while robot is docking into the lift (with finish_waypoint = LIFT_ROBOT_POS, my implementation is that the docking will timeout and the robot will be sent to the LIFT_ROBOT_EXIT_POS and the fleet adapter will not call docking_finished_calback but call replan API so that replan gets triggered. The replan did get triggered, and as the lanes are made directional, the robot need to ‘rejoin’ the queue, and the queue waypoints are few lanes away from the lift calling position. however, the lift session was still held by the robot.

Edited by @cwrx777 at 2024-04-25T05:12:53Z

Chosen answer

Answer chosen by @cwrx777 at 2024-05-06T01:43:12Z.
Answered by @mxgrey:

If I dont call replan or docking_finish_callback after release_lift, the lift was released when code reach the condition you mentioned.

If you’ve manually released the lift then RMF won’t relock the lift until it begins a new attempt to have the robot enter or exit the lift. By not calling replan or docking_finished_callback you’re just leaving RMF in a hung state that it will never be able to make progress from.

My strongest suspicion right now is that the last position update that you gave before calling replan indicated that the robot is either inside the lift or somewhere on the lane that comes out of the lift, which prompts RMF to lock the lift for the robot at the very start of the new plan. I would recommend making sure you call update_current_waypoint with the graph index of the exit waypoint before you trigger the replan. Alternatively using the EasyFullControl API and calling its update function before replan would also work.

[fleet_adapter_mir-1] [INFO]: Planning for [WardRobot/WardRobot_082] to [TEST_L02_03] from one of these locations:

Was there no list of starting locations from this line? If there are no start locations for the planner then that would indicate more significant problems than the lift getting relocked.

Posted by @mxgrey:

After another task to go to a waypoint that is on the same leve was sent, the robot still received path request to enter lift waypoint.

Behaviors around lifts are extremely complex, so I can’t say with any certainty what caused this, but the overarching cause is almost certainly that RMF believed that the robot was still inside the lift. Here are some possible reasons for this:

  1. The wifi cut out while the robot was exiting the lift, so the last state update that RMF had was still inside the lift. It’s well-known that wifi connections cut out while the robot is inside the lift with the doors closed, but we’ve been finding that wifi is also generally unstable while the robot is moving out of a lift. As a result the robot might manage to reconnect to wifi, report its state to RMF, and receive a command from RMF to exit the lift while it sits inside the lift with the doors open, but then that connection can drop out while the robot is moving into the lobby. If this happens when a replan gets triggered for RMF (e.g. a new task is received or a conflict is detected) then RMF will believe the robot is inside the lift and will think it needs to command a path beginning from there.
  2. The RobotUpdateHandle is not being given very good nav graph element updates. If you’re using the old update_positions API there’s a high likelihood that you won’t be giving the fleet adapter the best position information for you robot. That API was very difficult to use correctly, and using it in slightly wrong ways can lead to undesirable behavior for the robot. I recommend transitioning to the EasyFullControl API if you haven’t already. If you’re using MiR then we have an EasyFullControl adapter in progress that you can fork. It’s derived directly from a private adapter that we’re using from a deployment, but it’s not finished yet because we’re redesigning it to separate out any sensitive operation-specific code.

A robot (robot-1) currently holding a lift session and waiting at the lift entrance position was ‘forced’ to deconflict by another robot (robot-2) that wanted to go to that position and result in deadlock

I recommend using mutex groups to create a queuing system for the lift as described here. You should try to have as many queuing points as the number of robots that you expect will be trying to get into the lift from one floor at any given moment.

A robot entered lift when it’s not having a lift session.

I haven’t seen this happen before so I don’t have an explanation to offer here. I would need you to elaborate on the exact circumstances around this.

When there are two robots on different floors queueing for the same lift, to which robot the lift will be given? is it based on based on the request time?

Yes, right now the lift utilization is not very intelligent so it’s first-come-first-serve based on lift request times. If this is a problem for your deployment, then you can write a custom lift supervisor node that assigns the lift based on the current floor of the lift and which floor the requests are coming from. We have plans to provide an intelligent reservation system for lift usage out of the box, but it’s low priority at the moment as we try to focus on developing more critical improvements.

is there a chance that robot need to go toother floor to perform deconflict? we encountered that a robot was sent to a floor that is not included in the list of waypoins in a task.

This is technically possible with the current negotiation system. Usually this will happen when the nav graph around a lift lobby gets too crowded with robots and the traffic flow is too loose. For sensitive areas like lift lobbies, I recommend making very conscious use of traffic flow controls like mutex groups and unidirectional lanes. If you can share a rough sketch of your traffic lane setup around the lift lobbies, I’m happy to give feedback on it.

the robot need to ‘rejoin’ the queue, and the queue waypoints are few lanes away from the lift calling position. however, the lift session was still held by the robot.

I understand what caused this, but it’s a very very tricky edge case to deal with. I intentionally programmed RMF to not release the lift following a replan if the new plan will still involve the robot taking the lift. The rationale is that in most situations where this happens, the robot may need to move to a certain waypoint to reposition itself to re-attempt entering the lift. We do not want the robot to release the lift while it repositions itself because that would allow another robot to snatch the lift away, creating more delays or possibly even new traffic conflicts. It’s very difficult for RMF to infer what exact situations it should hold versus release the lift when an entry attempt fails. I can make two recommendations for you to consider:

  1. When docking fails, send the robot back to the pre-entry point instead of the exit point and trigger the replan from there. If you use a mutex label on the pre-entry point as well as the lane from the pre-entry point into the lift then that will ensure that no other robot tries to move onto that point while the robot is docking.
  2. When docking fails and the robot has arrived at the lift exit point, call release_lift() before triggering the replan. If you’re using the Python API then I think we’ll need to add a binding for this.

Posted by @cwrx777:

Hi @mxgrey,

  1. When docking fails, send the robot back to the pre-entry point instead of the exit point and trigger the replan from there. If you use a mutex label on the pre-entry point as well as the lane from the pre-entry point into the lift then that will ensure that no other robot tries to move onto that point while the robot is docking.

For the first recommendation, how stable is the mutex feature in iron 20231229 release? I notice there are multiple PRs on mutex feature after that release.

  1. When docking fails and the robot has arrived at the lift exit point, call release_lift() before triggering the replan. If you’re using the Python API then I think we’ll need to add a binding for this.

I tried the second recommendation and I have added the python binding for release_lift() (based on iron 2023129 release) and called it from my fleet adapter when the robot has arrived at exit point but I didn’t see that the lift session was released based on published lift_state.
from the log, I did notice the following:

Releasing lift [lift_name] for [RobotName] because of user request
Releasing lift [lift_name] for [RobotName]

I believe those lines are coming from:
here and here

and after adding my own log, the execution went to:

  1. else part of this and
  2. here

Edited by @cwrx777 at 2024-05-02T02:59:01Z

Posted by @mxgrey:

how stable is the mutex feature in iron 20231229 release?

I generally recommend using main since it gets actively updated to improve behavior and reliability in deployments that we are working on. We don’t merge anything into main without testing it in a deployment first. I’m aware of two updates to mutex groups since its initial release:

I’d say these are both important updates that you’re going to want when using mutex groups. We don’t have any more updates planned for mutex groups in the current generation of Open-RMF.

I didn’t see that the lift session was released based on published lift_state.

I’ll need more information about the exact circumstances because the stack trace that you’re indicating doesn’t add up.. This line would have cleared out _lift_destination so this condition would have failed, but you said your logs made it inside of that block.

There must be something else causing the fleet adapter to re-lock the lift session after you’ve released it. The only way that would happen is if the fleet adapter believes the robot still has an immediate need for the lift. For example, if the robot’s position has not been updated recently or accurately enough so the fleet adapter still believes that the robot needs to finish traversing out through the lift exit lane. Maybe double-check how you’re updating the positions right before you trigger the replan, or switch to the Easy Full Control which should ensure that positions are being updated correctly.


Edited by @mxgrey at 2024-05-02T03:37:24Z

Posted by @cwrx777:

Hi @mxgrey,

I have added additional logs after fleet adapter called release_lift and immediately followed by replan.

[fleet_adapter_mir-1] [INFO]: Releasing lift [lift_AL04] for [WardRobot/WardRobot_082] because of a user request
[fleet_adapter_mir-1] [INFO]: Releasing lift [lift_AL04] for [WardRobot/WardRobot_082]
[fleet_adapter_mir-1] [INFO]: Replanning requested for [WardRobot/WardRobot_082]
[fleet_adapter_mir-1] [INFO]: Planning for [WardRobot/WardRobot_082] to [TEST_L02_03] from one of these locations:
[fleet_adapter_mir-1] [INFO]: Executing go_to_place [TEST_L02_03] for robot [WardRobot/WardRobot_082]
[fleet_adapter_mir-1] [INFO]: Robot [WardRobot/WardRobot_082] will summon lift [lift_AL04] to floor [L01] after a replan
[fleet_adapter_mir-1] [INFO]: Setting lift [lift_AL04] destination for [WardRobot/WardRobot_082] to level [L01].
[fleet_adapter_mir-1] [INFO]: BE STUBBORN - lift [lift_AL04] for robot [WardRobot/WardRobot_082]
[fleet_adapter_mir-1] [INFO]: Lift has arrived on floor [L01] and opened its doors for robot [WardRobot/WardRobot_082]
[fleet_adapter_mir-1] [INFO]: Setting lift [lift_AL04] destination for [WardRobot/WardRobot_082] to level [L01].
[fleet_adapter_mir-1] [INFO]: Requesting replan for [WardRobot/WardRobot_082] because it has finished waiting lift [lift_AL04] to arrive at [L01]

The code to publish lift request to release the lift is done when the lift state is handled in _check_lift_state. However, since the lift_destination was set again at set_lift_destination, the releasing of the lift was not executed.

If I dont call replan or docking_finish_callback after release_lift, the lift was released when code reach the condition you mentioned.


Edited by @cwrx777 at 2024-05-02T12:28:18Z

Posted by @mxgrey:

If I dont call replan or docking_finish_callback after release_lift, the lift was released when code reach the condition you mentioned.

If you’ve manually released the lift then RMF won’t relock the lift until it begins a new attempt to have the robot enter or exit the lift. By not calling replan or docking_finished_callback you’re just leaving RMF in a hung state that it will never be able to make progress from.

My strongest suspicion right now is that the last position update that you gave before calling replan indicated that the robot is either inside the lift or somewhere on the lane that comes out of the lift, which prompts RMF to lock the lift for the robot at the very start of the new plan. I would recommend making sure you call update_current_waypoint with the graph index of the exit waypoint before you trigger the replan. Alternatively using the EasyFullControl API and calling its update function before replan would also work.

[fleet_adapter_mir-1] [INFO]: Planning for [WardRobot/WardRobot_082] to [TEST_L02_03] from one of these locations:

Was there no list of starting locations from this line? If there are no start locations for the planner then that would indicate more significant problems than the lift getting relocked.


This is the chosen answer.

Posted by @cwrx777:

Hi @mxgrey,
explicitly calling update_current_waypoint right before replan works.

Posted by @cwrx777:

Hi @mxgrey ,

  1. is there a chance that robot need to go toother floor to perform deconflict? we encountered that a robot was sent to a floor that is not included in the list of waypoins in a task. in the log below, door_08_osu58_01 is on different floor from robot current floor.
[WardRobot_FlAdpt]: Planning for [WardRobot/WardRobot_101] to [CART_W56_01] from one of these locations:
[WardRobot_FlAdpt]: Ensuring door [door_08_58_01] is open for [WardRobot/WardRobot_101] after a replan
[WardRobot_FlAdpt]: Executing go_to_place [CART_W56_01] for robot [WardRobot/WardRobot_101]
[WardRobot_FlAdpt]: Opening door [door_08_58_01] for [WardRobot/WardRobot_101]
[WardRobot_FlAdpt]: Robot [WardRobot/WardRobot_101] will release door [door_08_58_01] after a replan because it is no longer neede

is my understading correct that the following code snippet did not check if the door is on the robot current map, and if the plan will mistakenly include door open of a door on another floor while the robot is travelling between two waypoints that happen to be in the same stack on the current floor.

for (const auto& door : graph.all_known_doors())
{

  if (door->intersects(p0, p1, envelope))
  {
	RCLCPP_INFO(
	  context->node()->get_logger(),
	  "Ensuring door [%s] is open for [%s] after a replan",
	  door->name().c_str(),
	  context->requester_id().c_str());

	legacy_phases.emplace_back(
	  std::make_shared<phases::DoorOpen::PendingPhase>(
		context, door->name(), context->requester_id(), t0),
	  t0, rmf_traffic::Dependencies(), std::nullopt);
  }
}

Posted by @mxgrey:

That’s a great catch. We should be filtering the doors based on the map name. I’ll write up a PR to fix this.

Posted by @mxgrey:

Fixed by Filter DoorOpen insertion by map name by mxgrey · Pull Request #353 · open-rmf/rmf_ros2 · GitHub

Posted by @cwrx777:

For issue that robot enter lift when the lift is not in AGV mode, should this code also check if the lift is already in AGV mode?

LegacyTask::StatusMsg RequestLift::ActivePhase::_get_status(
  const rmf_lift_msgs::msg::LiftState::SharedPtr& lift_state)
{
  using rmf_lift_msgs::msg::LiftState;
  using rmf_lift_msgs::msg::LiftRequest;
  LegacyTask::StatusMsg status{};
  status.state = LegacyTask::StatusMsg::STATE_ACTIVE;
  if (!_rewaiting &&
    lift_state->lift_name == _lift_name &&
    lift_state->current_floor == _destination &&
    lift_state->door_state == LiftState::DOOR_OPEN &&
    lift_state->current_mode == LiftState::MODE_AGV &&  //check if the lift is AGV mode
    lift_state->session_id == _context->requester_id())
  {
    RCLCPP_INFO(
      _context->node()->get_logger(),
      "Lift has arrived on floor [%s] and opened its doors for robot [%s]",
      lift_state->current_floor.c_str(),
      lift_state->session_id.c_str());
...


Edited by @cwrx777 at 2024-08-21T03:06:07Z

Posted by @cwrx777:

I have created a discussion for this. open-rmf#513