"Different node should owns different parts of all Train data. This simple script did not do this job, so you should prepare it at last. " I saw this in cluster training wiki. So, could paddle read data from hdfs and distribute data to each node automatically?
1 Answer
Distributing data to the cluster is not supported in PaddlePaddle yet. You can read data directly from an HDFS file path with PyDataProvider2.
PaddlePaddle does not handle fetching the data file remotely; it just passes the file path into a Python function. It is the user's job to open the file (or SQL connection string, or HDFS path) and read the samples from it one by one.
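As a rough sketch of what this looks like, a PyDataProvider2 provider can fetch the bytes itself, for example by shelling out to `hadoop fs -cat`. The HDFS path, input dimensions, and line format below are hypothetical placeholders, not part of the original answer:

```python
import subprocess

from paddle.trainer.PyDataProvider2 import provider, dense_vector, integer_value


@provider(input_types=[dense_vector(784), integer_value(10)])
def process(settings, file_name):
    # file_name is whatever path string is listed in the train/test file list,
    # e.g. "hdfs:///user/demo/train/part-00000" (hypothetical).
    # PaddlePaddle only passes the string through; fetching the data is up to us,
    # here by streaming it through `hadoop fs -cat`.
    proc = subprocess.Popen(['hadoop', 'fs', '-cat', file_name],
                            stdout=subprocess.PIPE)
    for line in proc.stdout:
        # Assumed line format: "label feat_1 feat_2 ... feat_784" (hypothetical).
        parts = line.split()
        label = int(parts[0])
        features = [float(x) for x in parts[1:]]
        yield features, label
    proc.stdout.close()
    proc.wait()
```

The same pattern applies to any other remote source: open the connection inside the provider function and yield one sample at a time.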
Contributions of a script that distributes data to the cluster are welcome. Or we may add it soon if this feature turns out to be necessary.