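If any DataFrame partition is empty after the dataset has been split and preprocessed (when training with the Ray/Dask backend), Ray raises the following error.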
E ray.exceptions.RayTaskError(AssertionError): ray::_get_read_tasks() (pid=10328, ip=127.0.0.1)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/read_api.py", line 1136, in _get_read_tasks
E reader = ds.create_reader(**kwargs)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 167, in create_reader
E return _ParquetDatasourceReader(**kwargs)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 230, in __init__
E self._encoding_ratio = self._estimate_files_encoding_ratio()
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 318, in _estimate_files_encoding_ratio
E sample_ratios = ray.get(futures)
E ray.exceptions.RayTaskError(AssertionError): ray::_sample_piece() (pid=10352, ip=127.0.0.1)
E File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 437, in _sample_piece
E assert num_rows > 0 and metadata.num_rows > 0, (
E AssertionError: Sampled number of rows: 0 and total number of rows: 0 should be positive
To reproduce
Run the following unit test with num_examples=20 and npartitions=10:
pytest -xsrP tests/integration_tests/test_preprocessing.py::test_dask_known_divisions
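To see why these two values are enough to trigger the assertion, here is a rough pure-Python model of the arithmetic. It is only a sketch under stated assumptions: partitions are modeled as plain lists, the 70/10/20 positional split is an assumed example, and none of the helper names (make_partitions, split_label, filter_partitions, drop_empty) come from Ludwig, Dask, or Ray. The one thing it relies on is that a partition-wise filter keeps the partition count unchanged, so a split with fewer rows than partitions must leave some partitions with zero rows.

```python
from typing import Dict, List

Row = Dict[str, int]


def make_partitions(num_examples: int, npartitions: int) -> List[List[Row]]:
    """Split row ids 0..num_examples-1 into contiguous, near-equal partitions."""
    base, extra = divmod(num_examples, npartitions)
    partitions, start = [], 0
    for i in range(npartitions):
        size = base + (1 if i < extra else 0)
        partitions.append([{"id": j} for j in range(start, start + size)])
        start += size
    return partitions


def split_label(row_id: int, num_examples: int, percents=(70, 10, 20)) -> str:
    """Assumed positional split (integer percentages); not Ludwig's actual logic."""
    train_end = num_examples * percents[0] // 100
    val_end = train_end + num_examples * percents[1] // 100
    if row_id < train_end:
        return "train"
    if row_id < val_end:
        return "validation"
    return "test"


def filter_partitions(
    partitions: List[List[Row]], label: str, num_examples: int
) -> List[List[Row]]:
    """Filter each partition independently; the number of partitions is unchanged."""
    return [
        [row for row in part if split_label(row["id"], num_examples) == label]
        for part in partitions
    ]


def drop_empty(partitions: List[List[Row]]) -> List[List[Row]]:
    """Hypothetical workaround: discard zero-row partitions before writing them out."""
    return [part for part in partitions if part]


partitions = make_partitions(num_examples=20, npartitions=10)
validation = filter_partitions(partitions, "validation", num_examples=20)
sizes = [len(part) for part in validation]
non_empty = drop_empty(validation)

print(sizes)           # rows per partition in the validation split
print(len(non_empty))  # partitions left after dropping the empty ones
```

In this model the validation split holds only 2 of the 20 rows, so most of its 10 partitions are empty. If each partition were written out as its own Parquet file, the zero-row files would be the ones that trip the num_rows > 0 check in _sample_piece shown in the traceback. Dropping or repartitioning empty partitions before the write is one possible mitigation; it is not necessarily what the eventual fix does.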
- OS: macOS
- Version: 12.3.1
- Python version: 3.9
- Ludwig version: 0.6.dev0
- Ray version: nightly (July 28, 2022)
1 answer
A more permanent fix is being added in this PR: #2328