Specify the files required by the related version of the framework and its dependent libraries with submit parameters such as --files, --cacheFile, or --cacheArchive. If necessary, also set the PYTHONPATH environment variable, for example: export PYTHONPATH=./:$PYTHONPATH
For example, if the TensorFlow module is not installed on the cluster nodes, the user can ship it with --cacheArchive. In more detail:
cd /usr/lib/python2.7/site-packages/tensorflow/
tar -zcvf tensorflow.tgz ./*
# upload tensorflow.tgz to HDFS, then submit with:
--cacheArchive /tmp/tensorflow.tgz#tensorflow
export PYTHONPATH=./:$PYTHONPATH
In order to view the execution progress both at the XLearning client and on the application web interface, the user program needs to print the progress to standard error in the format "report:progress:<float value>".
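For instance, a training script can emit such a line after each batch or epoch. A minimal sketch (the helper name report_progress is illustrative, not part of XLearning):

```python
import sys

def report_progress(progress):
    # XLearning parses lines of the form "report:progress:<float>" from the
    # standard error of the user program; progress is a fraction in [0.0, 1.0].
    sys.stderr.write("report:progress:%.2f\n" % progress)
    sys.stderr.flush()

# e.g. after finishing 30 of 100 training steps:
report_progress(30 / 100.0)
```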
XLearning supports distributed deep learning frameworks such as TensorFlow, MXNet, XGBoost, and LightGBM:

- TensorFlow: set --app-type as TensorFlow; stand-alone and distributed mode are distinguished by the number of ps applied.
- MXNet: set --app-type as MXNet; stand-alone and distributed mode are distinguished by the number of ps applied.
- Distributed XGBoost: set --app-type as distxgboost.
- Distributed LightGBM: set --app-type as distlightgbm.
- LightLDA: set --app-type as lightlda; stand-alone and distributed mode are distinguished by the number of ps applied.
- XFlow: set --app-type as XFlow; stand-alone and distributed mode are distinguished by the number of ps applied.

In the distributed mode of a TensorFlow application, the ClusterSpec is normally defined by setting the host and port of ps and worker in advance. XLearning implements automatic construction of the ClusterSpec: the user can get the ClusterSpec, job_name, and task_index from the environment variables TF_CLUSTER_DEF, TF_ROLE, and TF_INDEX, such as:
import os
import json
import tensorflow as tf

cluster_def = json.loads(os.environ["TF_CLUSTER_DEF"])
cluster = tf.train.ClusterSpec(cluster_def)
job_name = os.environ["TF_ROLE"]
task_index = int(os.environ["TF_INDEX"])
The error "java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf" appears after submitting the application. By default, the yarn.application.classpath set in yarn-site.xml does not contain the MapReduce-related lib packages; try adding the related lib paths, such as:
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*</value>
</property>
In a distributed LightGBM application, the user can get the number of machines and the local port from environment variables and write them into the configuration file (more details in $XLEARNING_HOME/examples/distLightGBM). Note that it is necessary to copy the configuration file in the current directory first, to avoid multiple containers on the same machine modifying the same file, like:
cp train.conf train_real.conf
chmod 777 train_real.conf
echo "num_machines = $LIGHTGBM_NUM_MACHINE" >> train_real.conf
echo "local_listen_port = $LIGHTGBM_LOCAL_LISTEN_PORT" >> train_real.conf
./LightGBM/lightgbm config=train_real.conf
The user also needs to set the machine list file in the configuration; XLearning generates it as lightGBMlist.txt in the executive directory of each worker, like:
machine_list_file = lightGBMlist.txt
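The same configuration steps can also be done from Python instead of shell. A minimal sketch (the helper name write_lightgbm_conf is illustrative) that reads the LIGHTGBM_NUM_MACHINE and LIGHTGBM_LOCAL_LISTEN_PORT variables shown above:

```python
import os
import shutil

def write_lightgbm_conf(src_conf, dst_conf):
    # Copy the template first so containers sharing a machine do not modify
    # the same file, then append the values XLearning exports via environment.
    shutil.copy(src_conf, dst_conf)
    with open(dst_conf, "a") as f:
        f.write("num_machines = %s\n" % os.environ["LIGHTGBM_NUM_MACHINE"])
        f.write("local_listen_port = %s\n" % os.environ["LIGHTGBM_LOCAL_LISTEN_PORT"])
```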
Example of TF_CONFIG for chief training in the distributed mode of a TensorFlow Estimator application (more details in $XLEARNING_HOME/examples/tfEstimator):
import os
import json

cluster = json.loads(os.environ["TF_CLUSTER_DEF"])
task_index = int(os.environ["TF_INDEX"])
task_type = os.environ["TF_ROLE"]

# chief: worker 0 acts as chief; the index of every other worker shifts down by one
tf_config = dict()
worker_num = len(cluster["worker"])
if task_type == "ps":
    tf_config["task"] = {"index": task_index, "type": task_type}
elif task_type == "worker":
    if task_index == 0:
        tf_config["task"] = {"index": 0, "type": "chief"}
    else:
        tf_config["task"] = {"index": task_index - 1, "type": task_type}
elif task_type == "evaluator":
    tf_config["task"] = {"index": task_index, "type": task_type}

if worker_num == 1:
    cluster["chief"] = cluster["worker"]
    del cluster["worker"]
else:
    cluster["chief"] = [cluster["worker"][0]]
    del cluster["worker"][0]

tf_config["cluster"] = cluster
os.environ["TF_CONFIG"] = json.dumps(tf_config)
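The role remapping can also be packed into a function, which makes it easy to check locally without setting environment variables. A condensed sketch (the function name build_tf_config and the host:port values are illustrative):

```python
def build_tf_config(cluster, task_type, task_index):
    # Remap XLearning's ps/worker roles to the Estimator roles: worker 0
    # becomes "chief" and the remaining workers shift their index down by one.
    cluster = dict(cluster)
    task = {"index": task_index, "type": task_type}
    if task_type == "worker":
        if task_index == 0:
            task = {"index": 0, "type": "chief"}
        else:
            task = {"index": task_index - 1, "type": task_type}
    if len(cluster["worker"]) == 1:
        cluster["chief"] = cluster.pop("worker")
    else:
        cluster["chief"] = [cluster["worker"][0]]
        cluster["worker"] = cluster["worker"][1:]
    return {"cluster": cluster, "task": task}

# Worker 1 of a two-worker job is renamed to worker 0 in TF_CONFIG:
cfg = build_tf_config(
    {"ps": ["host0:2200"], "worker": ["host1:2200", "host2:2200"]},
    "worker", 1)
```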
Loading the js files required by the CPU Metrics functionality relies on the WebApp build method, which is not available in Hadoop versions lower than 2.6.4. If necessary, the cpu metrics can still be displayed by patching the hadoop-yarn-common-xxx.jar on the cluster. In more detail:
1) Copy src\main\resources\xlWebApp from the XLearning source code into the directory webapps/static obtained after unpacking hadoop-yarn-common-xxx.jar, then repackage the jar;
2) Use --jars to load the patched jar when submitting the application, or replace the hadoop-yarn-common-xxx.jar under $XLEARNING_HOME/lib (generated after unpacking the XLearning dist) following the above method, then restart the service.
XLearning 1.1 supports application retry and automatic memory scaling after failure by setting the configuration:
When the XLearning client submits a job, add --user-path "/root/anaconda2/lib/python2.7/site-packages/tensorboard" to specify the TensorBoard path.
Set --conf xlearning.input.strategy or --input-strategy as PLACEHOLDER. With the input strategy set to PLACEHOLDER, worker containers pass the assigned input file list to the program through the environment variable INPUT_FILE_LIST in json format, with the input local path as the key and the list of HDFS file names as the value. However, an error occurs when the environment variable is too long for the user program to be executed; in this situation, the content of INPUT_FILE_LIST is written to the local file inputFileList.txt in the current path.
User can get the file list like this:
import os
import json

if 'INPUT_FILE_LIST' in os.environ:
    inputfile = json.loads(os.environ["INPUT_FILE_LIST"])
    data_file = inputfile["data"]
else:
    with open("inputFileList.txt") as f:
        fileStr = f.readline()
        inputfile = json.loads(fileStr)
Upload the files to each container using --files. Append the related path to the system path, such as sys.path.append(os.getcwd()), before importing the module.
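As a minimal self-contained sketch of this pattern (it simulates a shipped module by writing mylib.py, a made-up name, into a temporary directory; in a real container the file already sits in the working directory):

```python
import os
import sys
import tempfile

# Simulate a module uploaded with --files: write mylib.py into a directory
# standing in for the container's working directory.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "mylib.py"), "w") as f:
    f.write("def greet():\n    return 'hello from mylib'\n")

# In a container this would be sys.path.append(os.getcwd()).
sys.path.append(workdir)
import mylib

print(mylib.greet())
```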
Set --conf xlearning.am.nodeLabelExpression, xlearning.worker.nodeLabelExpression, and xlearning.ps.nodeLabelExpression to define the nodes to schedule on.
With --conf xlearning.tf.distribution.strategy=true, XLearning builds the cluster information for applications that use the advanced API of the TensorFlow distribution strategy.
1) Unzip the version 3.1.1 openmpi package provided by XLearning under examples/mpi/ to /usr/local as /usr/local/openmpinossh;
2) Configure the install dir in xlearning-site.xml:
1) Set the container type as docker: --conf xlearning.container.type=docker;
2) Assign the docker image, such as --conf xlearning.docker.image=tensorflow/tensorflow:devel-gpu;
3) Define the working dir: --conf xlearning.docker.worker.dir=/work;