Starting with version 0.20.0, Horovod supports elastic training: the number of cooperating worker processes in a distributed job can be increased or decreased while training is running, so the job can scale out and in elastically. This article walks through the Horovod Elastic demo of PyTorch ResNet-50 classification on ImageNet.
Startup
The Horovod Elastic launch command is as follows:
horovodrun --log-level DEBUG --verbose -np 8 --min-np 1 --max-np 128 \
    --host-discovery-script /etc/edl/discover_hosts.sh \
    python examples/elastic/pytorch/pytorch_imagenet_resnet50_elastic.py \
    --train-dir=/dataset/ILSVRC2012_img_train/ --val-dir=/dataset/ILSVRC2012_img_val/
The entry point of this command is the horovodrun Python script under /usr/local/bin, which calls horovod/runner/launch.py in the horovod Python package to parse the command-line arguments: the log level, the worker-count settings, the list of distributed hosts, the task command to execute in a distributed fashion, and so on. Once parsing finishes, the ElasticSettings for elastic training is constructed.
settings = elastic_settings.ElasticSettings(discovery=discover_hosts,
                                            min_np=args.min_np or args.np,
                                            max_np=args.max_np,
                                            elastic_timeout=args.elastic_timeout,
                                            reset_limit=args.reset_limit,
                                            num_proc=args.np,
                                            verbose=2 if args.verbose else 0,
                                            ssh_port=args.ssh_port,
                                            ssh_identity_file=args.ssh_identity_file,
                                            extra_mpi_args=args.mpi_args,
                                            key=secret.make_secret_key(),
                                            start_timeout=tmout,
                                            output_filename=args.output_filename,
                                            run_func_mode=args.run_func is not None,
                                            nics=args.nics,
                                            prefix_output_with_timestamp=args.prefix_output_with_timestamp)
The example command obtains the list of distributed hosts from the discover_hosts.sh script and then initializes the ElasticSettings.
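The contract of the discovery script is simple: print one line per currently available host, either a bare hostname or hostname:slots. A minimal static sketch of /etc/edl/discover_hosts.sh could look like the following (hostnames and slot counts are placeholders; a real script would typically query the cluster scheduler or the EToperator):

#!/bin/bash
# Report two hosts, each offering 4 worker slots.
echo "host-1:4"
echo "host-2:4"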
Two node-level roles in Horovod's distributed training design are worth explaining here: the Launcher and the Worker. The Launcher is the node that runs the horovodrun command; it also receives and prints the log output of the Worker nodes.
The parallel communication layer underneath Horovod elastic training is, by default and by design, Facebook's open-source Gloo framework. After the ElasticSettings is initialized, the Launcher therefore has to set up Gloo: it first calls gloo_built to check whether this Horovod installation was compiled with Gloo, and then passes the ElasticSettings into gloo_run_elastic.
def gloo_run_elastic(settings, env, command):
    def get_common_interfaces(driver):
        min_hosts = _get_min_start_hosts(settings)
        current_hosts = driver.wait_for_available_slots(settings.num_proc, min_hosts=min_hosts)
        return driver_service.get_common_interfaces(settings, current_hosts.host_assignment_order)

    exec_command = _exec_command_fn(settings)
    rendezvous = RendezvousServer(settings.verbose)
    launch_gloo_elastic(command, exec_command, settings, env, get_common_interfaces, rendezvous)
Two further roles matter in the elastic training scenario: Worker and Driver. A Worker runs the training task on the data assigned to its worker node, i.e. it is the process that actually performs distributed training. The Driver runs on the Launcher and is mainly responsible for keeping communication alive: it continuously detects whether the existing Workers have changed (for example, a worker dropping offline), monitors host changes in real time through the discovery script, and assigns work to the surviving workers. If an AllReduce call fails and a process dies, the Driver organizes the surviving Workers into a new ring; if a new host joins, new workers are started on that host and the new and old workers together form a new communication ring.
The Launcher starts a RendezvousServer. The RendezvousServer is a KVStore whose server and clients interact over HTTP; roughly, it provides the following methods.
def create_rendezvous_handler(driver):
    class ElasticRendezvousHandler(RendezvousHandler):
        def _get_value(self, scope, key):
            if scope == GET_RANK_AND_SIZE:
                host, local_rank = key.split(':')
                return self._get_rank_and_size(host, int(local_rank))
            return super(RendezvousHandler, self)._get_value(scope, key)

        def _get_rank_and_size(self, host, local_rank):
            logging.info('_get_rank_and_size: {} {}'.format(host, local_rank))
            driver.record_ready(host, local_rank)
            slot_info = driver.get_slot_info(host, local_rank)
            logging.info('rank and size: {} {}'.format(slot_info.rank, slot_info.size))
            return slot_info.to_response_string().encode('ascii')

        def _put_value(self, scope, key, value):
            if scope == PUT_WORKER_ADDRESSES:
                host, local_rank = key.split(':')
                addresses, secret_key = codec.loads_base64(value)
                self._put_worker_addresses(host, int(local_rank), addresses, secret_key)
            super(RendezvousHandler, self)._put_value(scope, key, value)

        def _put_worker_addresses(self, host, local_rank, addresses, secret_key):
            driver.register_worker_server(host, local_rank, addresses, secret_key)

    return ElasticRendezvousHandler
Next an ElasticDriver is created, the available slots are initialized according to the settings, and the RendezvousServer is started.
driver = ElasticDriver(rendezvous, settings.discovery,
                       settings.min_np, settings.max_np,
                       timeout=settings.elastic_timeout,
                       reset_limit=settings.reset_limit,
                       verbose=settings.verbose)

handler = create_rendezvous_handler(driver)
global_rendezv_port = rendezvous.start(handler)
driver.wait_for_available_slots(settings.num_proc)

nics = get_common_interfaces(driver)
server_ip = network.get_driver_ip(nics)
Using the driver that has just been built, commands are sent over passwordless SSH to the hosts behind the available slots to launch tasks that check connectivity between the tasks; this again involves a HorovodRunTaskService and a HorovodRunTaskClient talking to each other. The Launcher log at this stage looks like this:
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
INFO:root:wait for available slots: 8
DEBUG:root:current available slots: 8
DEBUG:root:current available hosts: 8.
INFO:root:reset workers: 8
Each Worker node executes the task_fn method from the horovod package: it first starts a TaskService for its own index, then creates a DriverClient and registers its task information. It then creates a TaskClient pointing at index+1, checks the communication between the tasks, and registers with the Driver's RendezvousServer, so that the RendezvousServer ends up holding a ring-shaped record of this information.
def _task_fn(index, num_hosts, driver_addresses, settings):
    task = task_service.HorovodRunTaskService(index, settings.key, settings.nics)
    try:
        driver = driver_service.HorovodRunDriverClient(
            driver_addresses, settings.key, settings.verbose)
        driver.register_task(index,
                             task.addresses(),
                             host_hash.host_hash())
        task.wait_for_initial_registration(settings.start_timeout)

        next_task_index = (index + 1) % num_hosts
        next_task_addresses = driver.all_task_addresses(next_task_index)
        next_task = task_service.HorovodRunTaskClient(
            next_task_index,
            next_task_addresses,
            settings.key,
            settings.verbose,
            match_intf=True,
            attempts=10)
        driver.register_task_to_task_addresses(next_task_index,
                                               next_task.addresses())
        next_task.task_to_task_address_check_completed()
        task.wait_for_task_to_task_address_check_finish_signal(settings.start_timeout)
    finally:
        task.shutdown()
When the checks complete normally, the Launcher collects the results. If everything passed, it starts the worker processes, again by sending the command to execute over SSH, i.e. a shell command that sets environment variables and then runs the Python training script.
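In spirit, that command is a set of environment-variable assignments followed by the training script. The exact variables are assembled internally by the Gloo runner and differ between Horovod versions, so the line below is only an illustration of its shape; the address, port, hostname and local rank shown here are placeholders:

HOROVOD_GLOO_RENDEZVOUS_ADDR=10.0.0.1 HOROVOD_GLOO_RENDEZVOUS_PORT=12345 \
HOROVOD_CONTROLLER=gloo HOROVOD_CPU_OPERATIONS=gloo \
HOROVOD_HOSTNAME=host-1 HOROVOD_LOCAL_RANK=0 \
python examples/elastic/pytorch/pytorch_imagenet_resnet50_elastic.py --train-dir=/dataset/ILSVRC2012_img_train/ --val-dir=/dataset/ILSVRC2012_img_val/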
This is where the actual training task starts: each Worker begins executing the distributed training Python script inside its image. horovod.init runs and loads the Gloo communication library. Based on the information held by the global rendezvous, each Worker initializes its own Gloo communication environment and creates its own HorovodBasics, which records that Worker's configuration, while HorovodGlobalState holds the global variables that all worker threads can access (the background thread, buffers, tensor queues, and so on).
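From the training script's point of view, all of this is triggered by a single hvd.init() call at the top of the script. A minimal worker-side snippet, following what the demo does, looks like this:

import torch
import horovod.torch as hvd

# hvd.init() joins the rendezvous and sets up the Gloo context for this process.
hvd.init()

# After init, every worker knows its position in the ring.
print('rank {} of {}, local rank {}'.format(hvd.rank(), hvd.size(), hvd.local_rank()))

# Pin each worker to a single GPU, as the demo does when CUDA is used.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())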
Below is the code in which the Worker, based on the information in the rendezvous, sets up the processes it will communicate with in the ring (the process sets) and initializes the HorovodBasics information:
def _init_process_sets(process_set_list: List[ProcessSet]):
    ids_seen_in_process_set_list = {0}
    id_to_ranks_dict = _basics._get_process_set_ids_and_ranks()
    ranks_to_id_dict = {tuple(ranks): process_set_id for process_set_id, ranks in id_to_ranks_dict.items()}
    for ps in process_set_list:
        if ps.ranks is not None:
            ps.process_set_id = ranks_to_id_dict[tuple(ps.ranks)]
        elif ps.mpi_comm is not None:
            ps.process_set_id = _basics._comm_process_set_id(ps.mpi_comm)
            ps.ranks = list(id_to_ranks_dict[ps.process_set_id])
        if ps.process_set_id in ids_seen_in_process_set_list:
            ps._invalidate()
        else:
            ids_seen_in_process_set_list.add(ps.process_set_id)

    if global_process_set.ranks != id_to_ranks_dict[0]:
        global_process_set.ranks = id_to_ranks_dict[0]
With the process-related information in place, the subsequent communication operations can be carried out.
The background thread is then started. One more point worth explaining: every distributed Worker node runs a copy of Horovod, and each copy starts two threads; one loads the C++ MPI_lib to start the MPI thread, and the other is the background thread, which interacts with the coordinator and takes part in communication. The coordinator role is normally taken by the Worker with rank 0.
This completes Horovod's initialization; from here on, the Workers execute the concrete training logic.
Training
Once Horovod has been initialized and started, each Worker continues with the body of the training task.
Because Horovod's distributed training is data parallel, the dataset has to be partitioned before training. The ImageNet dataset is loaded with torchvision, and Horovod provides ElasticSampler to partition the full dataset for a torch Dataset.
train_dataset = \
    datasets.ImageFolder(args.train_dir,
                         transform=transforms.Compose([
                             transforms.RandomResizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
train_sampler = hvd.elastic.ElasticSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=allreduce_batch_size,
    sampler=train_sampler,
    **kwargs)

val_dataset = \
    datasets.ImageFolder(args.val_dir,
                         transform=transforms.Compose([
                             transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
val_sampler = hvd.elastic.ElasticSampler(val_dataset)
val_loader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=args.val_batch_size,
    sampler=val_sampler,
    **kwargs)
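Besides partitioning the dataset across the current workers, ElasticSampler also records which samples have already been processed in the running epoch, so that after a rescale the remaining workers do not revisit them. Condensed from the demo's train function (sub-batching, metrics and logging omitted; F is torch.nn.functional as imported in the demo), the sampler is used roughly like this:

def train(state):
    model.train()
    train_sampler.set_epoch(state.epoch)              # reshuffle and repartition for this epoch
    for batch_idx, (data, target) in enumerate(train_loader):
        state.batch = batch_idx                       # track progress inside the elastic state
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()
        # Mark these samples as processed so they are not revisited after a rescale.
        state.train_sampler.record_batch(batch_idx, allreduce_batch_size)
        state.commit()                                # the demo commits every few batches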
The first key point of data-parallel distributed training is partitioning the dataset; the second is synchronizing parameters during backpropagation. Horovod provides DistributedOptimizer, which wraps the user's Optimizer to perform this synchronization.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters(),
    compression=compression,
    backward_passes_per_step=args.batches_per_allreduce,
    op=hvd.Adasum if args.use_adasum else hvd.Average,
    gradient_predivide_factor=args.gradient_predivide_factor)
Internally, the methods wrapped by DistributedOptimizer are implemented on top of allreduce and notify calls.
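The real DistributedOptimizer launches these allreduces asynchronously from gradient hooks (allreduce_async_) and waits for the resulting handles in step() via synchronize, so communication overlaps with backpropagation. Conceptually, though, what it automates boils down to the naive sketch below (illustrative only; the tiny model exists just to make the snippet self-contained):

import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def averaged_gradient_step(loss):
    # A naive version of what hvd.DistributedOptimizer automates:
    # average every gradient across workers, then apply the update.
    optimizer.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            # hvd.allreduce averages the tensor over all workers by default.
            p.grad.data = hvd.allreduce(p.grad.data, name=name)
    optimizer.step()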
Next, the training state has to be synchronized. Horovod introduces elastic.TorchState to synchronize global state; here the model, the optimizer, the dataset samplers, the epoch and the batch are passed in as the state to keep in sync.
state = hvd.elastic.TorchState(model=model,
                               optimizer=optimizer,
                               train_sampler=train_sampler,
                               val_sampler=val_sampler,
                               epoch=resume_from_epoch,
                               batch=0)
The underlying implementation of TorchState is shown below; the state can be updated and synchronized through these methods.
class TorchState(ObjectState):
    def __init__(self, model=None, optimizer=None, **kwargs):
        kwargs.update(dict(model=model, optimizer=optimizer))
        self._handlers, kwargs = _get_handlers(kwargs)
        for name, handler in self._handlers.items():
            setattr(self, name, handler.value)
        super(TorchState, self).__init__(bcast_object=broadcast_object,
                                         get_rank=rank,
                                         **kwargs)

    def save(self):
        for handler in self._handlers.values():
            handler.save()
        super(TorchState, self).save()

    def restore(self):
        for handler in self._handlers.values():
            handler.restore()
        super(TorchState, self).restore()

    def sync(self):
        for handler in self._handlers.values():
            handler.sync()
        super(TorchState, self).sync()

    def __setattr__(self, name, value):
        if hasattr(self, name) and name in self._handlers:
            self._handlers[name].set_value(value)
        super().__setattr__(name, value)
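In the training script, the state object is used together with a reset callback: when the set of workers changes, Horovod invokes the registered callbacks so that, for example, the learning rate can be rescaled to the new world size, and state.commit() periodically snapshots the state in memory so it can be rolled back if an allreduce fails. Adapted from the demo:

def on_state_reset():
    # Rescale the learning rate to the new number of workers after a resize.
    for param_group in optimizer.param_groups:
        param_group['lr'] = args.base_lr * hvd.size()

state.register_reset_callbacks([on_state_reset])

# Inside the training loop: take an in-memory snapshot of the state so it
# can be restored if a worker fails in the middle of an epoch.
state.commit()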
For a distributed job, log printing and coordination of the processes are generally handled through the rank-0 worker, and each node then selects its training device. The remaining work differs little from an ordinary training job: every Worker focuses on executing its own share. To unify the reported metrics, a Metric class built on allreduce computes the final values.
# Horovod: average metrics from distributed training.
class Metric(object):
    def __init__(self, name):
        self.name = name
        self.sum = torch.tensor(0.)
        self.n = torch.tensor(0.)

    def update(self, val):
        self.sum += hvd.allreduce(val.detach().cpu(), name=self.name)
        self.n += 1

    @property
    def avg(self):
        return self.sum / self.n


@hvd.elastic.run
def full_train(state):
    while state.epoch < args.epochs:
        train(state)
        validate(state.epoch)
        save_checkpoint(state.epoch)
        end_epoch(state)
Looking inside allreduce, it performs these operations: compress, communicate, decompress. Communication between the threads happens pairwise and asynchronously; when a communication instruction is received, the appropriate low-level library (NCCL, MPI) is invoked depending on which communication method is available. The peer processes for this communication were set up during startup.
handle = getattr(mpi_lib, function)(tensor, output, divisor,
                                    name.encode() if name is not None else _NULL, op,
                                    prescale_factor, postscale_factor, process_set.process_set_id)
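The compression step mentioned above is controlled by the compression object handed to DistributedOptimizer earlier; in the demo it is selected by a command-line flag:

# Horovod: choose gradient compression for allreduce (from the demo).
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none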
With that, the basic training flow has been covered.
Elasticity
The Driver has a discover_hosts mechanism for watching host changes and runs it in a dedicated thread. When elasticity is required (a user-initiated scale-out/in, or failover after a Worker goes down), the HostManager performs the detection, mainly by checking the hosts behind the current slots; the available hosts and slots themselves are discovered by HostDiscoveryScript. With the EToperator, external elasticity events are surfaced so that the HostDiscoveryScript can observe them.
class HostDiscoveryScript(HostDiscovery):
    def __init__(self, discovery_script, slots):
        self._discovery_script = discovery_script
        self._default_slots = slots
        super(HostDiscoveryScript, self).__init__()

    def find_available_hosts_and_slots(self):
        stdout = io.StringIO()
        exit_code = safe_shell_exec.execute(self._discovery_script, stdout=stdout)
        if exit_code != 0:
            raise RuntimeError('Failed to execute discovery script: {}. Exit code: {}'
                               .format(self._discovery_script, exit_code))

        host_slots = {}
        lines = set(stdout.getvalue().strip().split('\n'))
        for line in lines:
            host = line
            if ':' in line:
                host, slots = line.split(':')
                host_slots[host] = int(slots)
            else:
                host_slots[host] = self._default_slots
        return host_slots
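As a concrete example of the parsing above: given a discovery script that prints the lines host-1:4 and host-2 (no slot count), and a default of 8 slots per host, find_available_hosts_and_slots returns:

# Hostnames are placeholders; host-2 falls back to the default slot count.
{'host-1': 4, 'host-2': 8}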
External events are picked up through a threading.Event(). When an elasticity signal appears, the Driver performs a notify_workers_host_changes operation, and the broadcast is carried out through the coordinator, i.e. the rank-0 Worker.
def _notify_workers_host_changes(self, current_hosts, update_res):
    next_host_assignments = {}
    if current_hosts.count_available_slots() >= self._min_np:
        # Assignments are required to be stable via contract
        next_host_assignments, _ = self._get_host_assignments(current_hosts)

    if next_host_assignments == self.host_assignments:
        # Skip notifying workers when host changes would not result in changes of host assignments
        logging.debug('no host assignment changes, skipping notifications')
        return

    coordinator_slot_info = self.get_coordinator_info()
    if not coordinator_slot_info:
        logging.debug('no coordinator info, skipping notifications')
        return

    coordinator_client = self.get_worker_client(coordinator_slot_info)
    if not coordinator_client:
        logging.debug('no coordinator client, skipping notifications')
        return

    timestamp = _epoch_time_s()
    try:
        coordinator_client.notify_hosts_updated(timestamp, update_res)
    except:
        if self._verbose >= 2:
            logging.exception('failed to notify {}[{}] of host updates'
                              .format(coordinator_slot_info.hostname,
                                      coordinator_slot_info.local_rank))
When a Worker receives the elasticity notification, it waits for the current epoch to finish, then ends its current task process and waits for the Driver to set up a new RendezvousServer and build new slots from the currently available hosts. The command to execute is then sent to every Worker in the host list over SSH, in the same way as before.
def _start_worker_process(self, slot_info):
    create_worker_fn = self._create_worker_fn
    shutdown_event = self._shutdown
    host_event = self._host_manager.get_host_event(slot_info.hostname)

    def run_worker():
        res = create_worker_fn(slot_info, [shutdown_event, host_event])
        exit_code, timestamp = res
        self._handle_worker_exit(slot_info, exit_code, timestamp)

    thread = threading.Thread(target=run_worker)
    thread.daemon = True
    thread.start()
    self._results.expect(thread)
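On the Worker side, the @hvd.elastic.run decorator (applied to full_train earlier) is what turns these events into a restart of the training function: a failed allreduce raises HorovodInternalError and rolls the state back to the last commit, while a host-change notification raises HostsUpdatedInterrupt and keeps the current state; in both cases the ring is reset and the wrapped function is re-entered. A simplified sketch of that retry loop (illustrative only, not the actual Horovod implementation) is:

# The exception classes below live in horovod.common.exceptions
# (their exact location may vary across Horovod versions).
from horovod.common.exceptions import HorovodInternalError, HostsUpdatedInterrupt

def elastic_run(func, state, reset):
    while True:
        state.sync()                      # broadcast state from the coordinator to all workers
        try:
            return func(state)            # run the user's training function
        except HorovodInternalError:
            state.restore()               # an allreduce failed: roll back to the last commit
        except HostsUpdatedInterrupt:
            pass                          # hosts changed: keep the current state
        reset()                           # rebuild the communication ring, then retry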
Once the new RendezvousServer has been set up and the Worker receives the new run instruction, it re-executes the training Python script and the ring is rebuilt. After the ring is rebuilt, state has to be recovered from the previously saved checkpoint. In the pytorch_imagenet_resnet50_elastic demo, checkpoints are saved at epoch granularity, and each checkpoint contains the model, the optimizer, and the epoch at which the checkpoint was saved.
# Restore from a previous checkpoint, if initial_epoch is specified.
# Horovod: restore on the first worker which will broadcast weights to other workers.
resume_from_epoch = 0
if hvd.rank() == 0:
    for try_epoch in range(args.epochs, 0, -1):
        if os.path.exists(args.checkpoint_format.format(epoch=try_epoch)):
            resume_from_epoch = try_epoch
            break

    if resume_from_epoch > 0:
        filepath = args.checkpoint_format.format(epoch=resume_from_epoch)
        checkpoint = torch.load(filepath)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])

state = hvd.elastic.TorchState(model=model,
                               optimizer=optimizer,
                               train_sampler=train_sampler,
                               val_sampler=val_sampler,
                               epoch=resume_from_epoch,
                               batch=0)
After the restore, the training task can continue.