Installing Horovod, a distributed machine learning framework (Linux)
2021/12/19 7:20:30
1. Download and install Open MPI
Download link:
https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
Installation commands:
shell$ gunzip -c openmpi-4.0.1.tar.gz | tar xf -
shell$ cd openmpi-4.0.1
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install
sudo ldconfig
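After `make all install` and `sudo ldconfig`, it is worth confirming that the `mpirun` found on the PATH is the one that was just built. The following is a minimal Python sketch, assuming the first line of `mpirun --version` has the form `mpirun (Open MPI) 4.0.1`. The helper names `parse_open_mpi_version` and `installed_open_mpi_version` are invented for this illustration.

```python
import re
import shutil
import subprocess


def parse_open_mpi_version(output):
    """Extract the version string from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+(?:\.\d+)*)", output)
    return match.group(1) if match else None


def installed_open_mpi_version():
    """Return the version of the mpirun found on PATH, or None if unavailable."""
    if shutil.which("mpirun") is None:
        return None
    result = subprocess.run(["mpirun", "--version"],
                            capture_output=True, text=True, check=False)
    return parse_open_mpi_version(result.stdout)


if __name__ == "__main__":
    print("Open MPI version:", installed_open_mpi_version())
```

If the printed version differs from the one that was built, an older system-wide Open MPI is probably earlier on the PATH.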
2. Install Horovod
Official documentation: https://github.com/horovod/horovod#install
[sudo] pip3 install horovod
To install a Horovod build with NCCL support:
HOROVOD_GPU_ALLREDUCE=NCCL pip3 install --no-cache-dir horovod
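Once pip finishes, the build can be checked from Python. This sketch assumes the TensorFlow bindings were installed and relies on `hvd.mpi_built()` and `hvd.nccl_built()`, which recent Horovod releases expose. Older releases may lack them, so each lookup is guarded. `describe_build` is a hypothetical helper name, not part of Horovod.

```python
def describe_build(hvd_module):
    """Report which communication backends a Horovod module was built with.

    Query functions that are missing (older releases) are reported as None.
    """
    report = {}
    for backend in ("mpi", "nccl"):
        query = getattr(hvd_module, backend + "_built", None)
        report[backend] = bool(query()) if callable(query) else None
    return report


if __name__ == "__main__":
    try:
        import horovod.tensorflow as hvd
    except ImportError as exc:
        print("Horovod TensorFlow bindings are not importable:", exc)
    else:
        print(describe_build(hvd))
```

A `False` for `nccl` after installing with `HOROVOD_GPU_ALLREDUCE=NCCL` suggests that pip reused a cached wheel or that the NCCL libraries were not found at build time.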
3. Using Horovod
3.1 TensorFlow code changes
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
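Conceptually, `hvd.DistributedOptimizer` combines the gradients computed by every worker with an allreduce (an average by default) before the wrapped optimizer applies them. This is why the example scales the learning rate by `hvd.size()`. The pure-Python sketch below only simulates that averaging step on plain lists and does not call Horovod. `allreduce_average` and `scaled_learning_rate` are illustrative names, not library functions.

```python
def allreduce_average(per_worker_grads):
    """Average gradients element-wise across workers (simulated allreduce)."""
    world_size = len(per_worker_grads)
    return [sum(values) / world_size for values in zip(*per_worker_grads)]


def scaled_learning_rate(base_lr, world_size):
    """Scale the single-worker learning rate linearly with the worker count."""
    return base_lr * world_size


# Two simulated workers, each holding gradients for the same two parameters.
grads = [[1.0, 2.0], [3.0, 4.0]]
averaged = allreduce_average(grads)  # every worker would receive this list
lr = scaled_learning_rate(0.01, world_size=len(grads))
```

Because every worker applies the same averaged gradient to variables that were broadcast from rank 0, the model replicas stay in sync without a parameter server.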
3.2 Running the TensorFlow job
Launch with mpirun, passing the SSH port through the MCA parameter plm_rsh_args:
mpirun --allow-run-as-root --oversubscribe \
    -np 8 -H ubuntu1:4,ubuntu2:4 \
    -bind-to none -map-by slot \
    -mca plm_rsh_args "-p 22" \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python3 -u train.py
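The value given to `-np` has to match the total number of slots listed after `-H` (here 4 + 4 = 8). The following sketch derives both from a single host-to-slot mapping so they cannot drift apart. It assumes Python 3.8+ for `shlex.join`, and `build_mpirun_command` is a hypothetical helper, not part of Open MPI or Horovod.

```python
import shlex


def build_mpirun_command(hosts, script, ssh_port=22):
    """Assemble an mpirun argument list for the given host -> slot mapping."""
    total_processes = sum(hosts.values())
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "mpirun", "--allow-run-as-root", "--oversubscribe",
        "-np", str(total_processes), "-H", host_spec,
        "-bind-to", "none", "-map-by", "slot",
        "-mca", "plm_rsh_args", f"-p {ssh_port}",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python3", "-u", script,
    ]


command = build_mpirun_command({"ubuntu1": 4, "ubuntu2": 4}, "train.py")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

The argument list can also be passed directly to `subprocess.run(command)`, which avoids shell quoting issues with `^openib` and the `-p 22` argument.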