ludwig compare_classifiers_performance_from_pred: HDF5 is not supported for the ground truth file

Posted by 0h4hbjxa 5 months ago in Other

Describe the bug

The compare_classifiers_performance_from_pred visualization does not work; it fails with the following error:
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}
According to the documentation, the ground_truth parameter should be the name of the HDF5 file obtained during training preprocessing.
Documentation: https://ludwig.ai/latest/user_guide/visualizations/#compare_classifiers_performance_from_pred

To reproduce

Steps to reproduce the behavior:

Go to Google Colab and generate some training and prediction data.
Generate a visualization:

!ludwig visualize --visualization compare_classifiers_performance_from_pred \
  --predictions predictions_20230827_183245.csv \
  --ground_truth train.hdf5 \
  --ground_truth_metadata 1dbf206244e911ee93d40242ac1c000c.meta.json \
  --output_feature_name MyTarget

See the error:

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4172, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 469, in compare_classifiers_performance_from_pred_cli
    ground_truth = _extract_ground_truth_values(ground_truth, output_feature_name, ground_truth_split, split_file)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 264, in _extract_ground_truth_values
    ground_truth_df = _get_ground_truth_df(ground_truth) if isinstance(ground_truth, str) else ground_truth
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 233, in _get_ground_truth_df
    raise ValueError(
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}

Expected behavior

A chart like the ones shown in the documentation.

Environment

OS: Google Colab - Linux Ubuntu
Python version: 3.10
Ludwig version: 0.8.1.post1

Additional context

I tried to find the error in ludwig/utils/data_utils.py, but it looks fine. I also tried calling compare_classifiers_performance_from_pred_cli directly from a Jupyter Notebook, but got the same error.

Source: https://github.com/ludwig-ai/ludwig/issues/3550

6 answers

lawou6xi1#

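Hey @iflow, thanks for reporting this! It looks like we left HDF5 out of the list of valid file formats in this check. Could you try running with the changes in #3557 and let me know if that resolves the issue?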
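
For readers following along, here is a minimal sketch of how an extension allowlist like the one described above can produce this kind of error when "hdf5" is missing from the set. This is not Ludwig's actual code: `SUPPORTED_FORMATS` and `check_ground_truth_format` are hypothetical names, and the set is deliberately shortened.

```python
import os

# Hypothetical allowlist for illustration only; "hdf5" is intentionally
# missing, mirroring the situation described in this issue.
SUPPORTED_FORMATS = {"csv", "tsv", "json", "jsonl", "parquet", "feather", "pickle"}


def check_ground_truth_format(path, supported=SUPPORTED_FORMATS):
    """Return the extension of `path`, or raise ValueError if it is not allowed."""
    # splitext("train.hdf5") gives ("train", ".hdf5"); drop the dot and lowercase.
    data_format = os.path.splitext(path)[1].lstrip(".").lower()
    if data_format not in supported:
        raise ValueError(
            f"{data_format} is not supported for ground truth file, "
            f"valid types are {sorted(supported)}"
        )
    return data_format


if __name__ == "__main__":
    try:
        check_ground_truth_format("train.hdf5")
    except ValueError as err:
        print(err)
    # Adding the missing entry to the allowlist makes the same call succeed.
    print(check_ground_truth_format("train.hdf5", SUPPORTED_FORMATS | {"hdf5"}))
```

In this sketch the fix is a one-entry change to the set, which matches the description of the problem as a format left out of the list.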

5 months ago

2eafrhcq2#

Hi iflow. Please confirm whether this fix resolves the issue; if it does, we can go ahead and merge it!

5 months ago

vs3odd8k3#

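Thanks for the quick fix! Unfortunately I haven't tried it yet, because I installed the library on Google Colab. So I'll have to set up my local machine, which may take some time.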

5 months ago

dy2hfwbg4#

Hey @iflow, in Colab you can install Ludwig like this to test the branch:

!pip install "git+https://github.com/ludwig-ai/ludwig.git@fix-gt-formats#egg=ludwig[llm]" --quiet

5 months ago

y53ybaqx5#

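Thank you @tgaddair, I didn't know this great command :)
With the fixed version, the error "hdf5 is not supported..." no longer appears 👍
However, a different error shows up:
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
I guess it's unrelated to this issue?
Full traceback: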

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4175, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 475, in compare_classifiers_performance_from_pred_cli
    predictions_per_model = _get_cols_from_predictions(predictions, [col], metadata)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 305, in _get_cols_from_predictions
    pred_df = pd.read_parquet(predictions_path)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    dataset = _ParquetDatasetV2(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

5 months ago

oug3syen6#

Hey @iflow, for --predictions, could you try using the Parquet file generated by Ludwig instead of the CSV? There should be a file in the same folder named something like predictions_20230827_183245.parquet.
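
As a quick way to tell the two files apart: a standard Parquet file begins and ends with the 4-byte marker `PAR1`, which is what the "magic bytes not found in footer" message refers to. The sketch below checks for that marker using only the standard library. `looks_like_parquet` is a hypothetical helper, not part of Ludwig or pyarrow, and it only inspects the markers rather than validating the whole file.

```python
import os

# 4-byte marker found at both the start and the end of a standard Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Header magic (4) + footer length (4) + footer magic (4) is the bare minimum.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python check_parquet.py predictions.csv predictions.parquet
    for candidate in sys.argv[1:]:
        print(candidate, looks_like_parquet(candidate))
```

A CSV file passed to a Parquet reader fails this check, which is consistent with the traceback above ending in pd.read_parquet.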

5 months ago