Can't load an ARFF dataset with scipy (arff.loadarff)

Asked by 1wnzp6jl on 2023-04-30 in Other

I am trying to download ARFF datasets from https://cometa.ujaen.es/ (for example https://cometa.ujaen.es/datasets/yahoo_arts) and load them into Python with scipy.io.arff.loadarff.
However, scipy seems to expect something like a CSV file after the header, and it fails to parse the vast majority of these datasets.
For example, to reproduce the problem:

from scipy.io.arff import loadarff  # loadarff lives in scipy.io.arff
import urllib.request

urllib.request.urlretrieve('https://cometa.ujaen.es/public/full/yahoo_arts.arff', 'yahoo_arts.arff')
data, meta = loadarff('yahoo_arts.arff')

(In this case I get ValueError: could not convert string to float: '{8 1'.)
Is this expected? (That is, is the scipy implementation not fully compliant with the ARFF format?) Do you know of a workaround, or a hand-rolled parsing function?
Thanks for any help or suggestions on this topic.
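To illustrate what trips scipy up, here is a stdlib-only sketch (the miniature file content below is my own made-up example, not from the dataset site) that peeks at the first line of the @data section; a leading '{' marks the sparse format that loadarff rejects:

```python
# Detect whether an ARFF file uses the sparse "{index value, ...}" data format.
# sample_arff is a made-up miniature mimicking the structure of yahoo_arts.arff.
sample_arff = """@relation yahoo_arts_mini
@attribute f0 numeric
@attribute f1 numeric
@data
{0 1}
"""

def first_data_line(arff_text):
    """Return the first line after the @data marker."""
    lines = arff_text.splitlines()
    data_idx = next(i for i, l in enumerate(lines) if l.lower().startswith('@data'))
    return lines[data_idx + 1]

line = first_data_line(sample_arff)
print(line)                  # {0 1}
print(line.startswith('{'))  # True -> sparse format, loadarff will fail
```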


whlutmcx1#

Is this expected? (That is, is the scipy implementation not fully compliant with the ARFF format?)
Yes, unfortunately. As mentioned in the docstring for loadarff, "It cannot read files with sparse data ({} in the file)." The file yahoo_arts.arff uses the sparse format in its @data section.
You can try searching PyPI for "arff" to find an alternative. I haven't used any of them, so I don't have a specific recommendation.
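For reference, the sparse entries are "column value" pairs; a stdlib-only sketch (my own illustration, not a library API) of expanding one sparse row into a dense one:

```python
def expand_sparse_row(line, n_cols):
    """Expand an ARFF sparse row like '{8 1,30 1}' into a dense list of floats."""
    entries = line.strip().lstrip('{').rstrip('}')
    row = [0.0] * n_cols  # unlisted columns default to zero in sparse ARFF
    for entry in entries.split(','):
        col, value = entry.split()
        row[int(col)] = float(value)
    return row

print(expand_sparse_row('{1 1,3 2}', 5))  # [0.0, 1.0, 0.0, 2.0, 0.0]
```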


c90pui9n2#

You can use the following workaround:

import numpy as np
import pandas as pd

with open('yahoo_arts.arff', 'r') as fp:
    file_content = fp.readlines()

def parse_row(line, len_row):
    line = line.replace('{', '').replace('}', '')

    row = np.zeros(len_row)
    for data in line.split(','):
        index, value = data.split()
        row[int(index)] = float(value)

    return row

columns = []
len_attr = len('@attribute')

# get the columns
for line in file_content:
    if line.startswith('@attribute '):
        col_name = line[len_attr:].split()[0]
        columns.append(col_name)

rows = []
len_row = len(columns)
# get the rows
for line in file_content:
    if line.startswith('{'):
        rows.append(parse_row(line, len_row))

df = pd.DataFrame(data=rows, columns=columns)

df.head()

Output: a pandas DataFrame with one column per @attribute and one row per sparse data line.
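To sanity-check this workaround without downloading anything, here is a self-contained run on a tiny made-up ARFF snippet (the attribute names and values are my own example):

```python
import numpy as np
import pandas as pd

# Tiny made-up sparse ARFF content, mirroring the structure of yahoo_arts.arff.
file_content = """@relation mini
@attribute a numeric
@attribute b numeric
@attribute c numeric
@data
{0 1,2 1}
{1 1}
""".splitlines(keepends=True)

def parse_row(line, len_row):
    # Same idea as the workaround above: strip braces, then fill a zero row
    # from the "index value" pairs.
    line = line.replace('{', '').replace('}', '')
    row = np.zeros(len_row)
    for data in line.split(','):
        index, value = data.split()
        row[int(index)] = float(value)
    return row

columns = [l[len('@attribute'):].split()[0] for l in file_content if l.startswith('@attribute ')]
rows = [parse_row(l, len(columns)) for l in file_content if l.startswith('{')]
df = pd.DataFrame(data=rows, columns=columns)
print(df.values.tolist())  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```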


unftdfkk3#

As pointed out in Warren Weckesser's answer, scipy cannot read sparse ARFF files. I implemented a quick fix to parse sparse ARFF files, and I'm sharing it below in case it helps someone else. If I find the time to write a clean version, I'll try to contribute it to scipy.
Edit: sorry rusu_ro1, I hadn't seen your version, but I assume it works as well.

from scipy.sparse import coo_matrix
import pandas as pd

def loadarff(filename):

  features = list()
  data = list()
  row_idx = 0

  with open(filename, "rb") as f:
    for line in f:
      line = line.decode("utf8")
      if line.startswith("@data") or line.startswith("@relation"):
        continue
      elif line.startswith("@attribute"):
        try:
          # the attribute name is the second whitespace-separated token
          features.append(line.split(" ")[1])
        except Exception as e:
          print(f"Cannot parse {line}")
          raise e
      elif line.startswith("{"):
        try:
          # sparse row "{col val,col val,...}" -> list of [row, col, val] triplets
          line = line.replace("{", "").replace("}", "")
          line = [[row_idx] + [int(x) for x in v.split()] for v in line.split(",")]
          data.append(line)
          row_idx += 1
        except Exception as e:
          print(f"Cannot parse {line}")
          raise e
      elif not line.strip():
        continue  # skip blank lines
      else:
        print(f"Cannot parse {line}")

  # flatten the per-row lists into one list of [row, col, val] triplets
  flatten = lambda l: [item for sublist in l for item in sublist]
  data = flatten(data)

  sparse_matrix = coo_matrix(
      ([x[2] for x in data], ([x[0] for x in data], [x[1] for x in data])),
      shape=(row_idx, len(features)),
  )

  return pd.DataFrame(sparse_matrix.todense(), columns=features)
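The key step in the function above is turning each sparse line into [row, col, value] triplets and flattening them into the three parallel lists that coo_matrix expects. A stdlib-only sketch of that step (the sample lines are my own):

```python
# Each sparse line becomes [row, col, value] triplets; coo_matrix then takes
# the values, the row indices, and the column indices as three parallel lists.
lines = ["{0 1,2 1}", "{1 1}"]

data = []
for row_idx, line in enumerate(lines):
    line = line.replace("{", "").replace("}", "")
    data.append([[row_idx] + [int(x) for x in v.split()] for v in line.split(",")])

flatten = lambda l: [item for sublist in l for item in sublist]
triplets = flatten(data)
print(triplets)  # [[0, 0, 1], [0, 2, 1], [1, 1, 1]]

values = [t[2] for t in triplets]
rows = [t[0] for t in triplets]
cols = [t[1] for t in triplets]
print(values, rows, cols)  # [1, 1, 1] [0, 0, 1] [0, 2, 1]
```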

xe55xuns4#

Building on the great answers from @Kederrac and @ThR37, here is my suggested improvement:

import numpy as np
import pandas as pd
from pathlib import Path
from typing import Any, List, Tuple, Union


def _arff_to_csv(input_path: Union[str, Path]) -> pd.DataFrame:
    """
    Converts an ARFF file to a DataFrame.

    Args:
        input_path (Union[str, Path]): Path to the input ARFF file.

    Returns:
        pd.DataFrame: Converted DataFrame.
    """

    def parse_row(line: str, row_len: int) -> List[Any]:
        """
        Parses a row of data from an ARFF file.

        Args:
            line (str): A row from the ARFF file.
            row_len (int): Length of the row.

        Returns:
            List[Any]: Parsed row as a list of values.
        """
        line = line.strip()  # strip the newline character
        if '{' in line and '}' in line:
            # Sparse data row
            line = line.replace('{', '').replace('}', '')
            row = np.zeros(row_len, dtype=object)
            for data in line.split(','):
                index, value = data.split()
                try:
                    row[int(index)] = float(value)
                except ValueError:
                    row[int(index)] = np.nan if value == '?' else value.strip("'")
        else:
            # Dense data row
            row = [
                float(value) if value.replace(".", "", 1).isdigit()
                else (np.nan if value == '?' else value.strip("'"))
                for value in line.split(',')
            ]

        return row

    def extract_columns_and_data_start_index(
            file_content: List[str]
    ) -> Tuple[List[str], int]:
        """
        Extracts column names and the index of the @data line from ARFF file content.

        Args:
            file_content (List[str]): List of lines from the ARFF file.

        Returns:
            Tuple[List[str], int]: List of column names and the index of the @data line.
        """
        columns = []
        len_attr = len('@attribute')

        for i, line in enumerate(file_content):
            if line.startswith('@attribute '):
                col_name = line[len_attr:].split()[0]
                columns.append(col_name)
            elif line.startswith('@data'):
                return columns, i

        return columns, 0

    with open(input_path, 'r') as fp:
        file_content = fp.readlines()

    columns, data_index = extract_columns_and_data_start_index(file_content)
    len_row = len(columns)
    rows = [
        parse_row(line, len_row)
        for line in file_content[data_index + 1:]
        if line.strip()  # skip blank lines after @data
    ]
    return pd.DataFrame(data=rows, columns=columns)
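The dense-row branch above distinguishes numeric values, missing values ('?'), and quoted nominals. A stdlib-only sketch of that dispatch (the sample line is my own example):

```python
import math

def parse_dense_value(value):
    """Mirror the dense-row dispatch: float if numeric, NaN for '?', else unquoted string."""
    if value.replace(".", "", 1).isdigit():
        return float(value)
    return math.nan if value == '?' else value.strip("'")

line = "3.5,?,'comedy'"
row = [parse_dense_value(v) for v in line.split(',')]
print(row[0], row[2])      # 3.5 comedy
print(math.isnan(row[1]))  # True
```

Note that the `isdigit` check does not accept negative numbers, so a sign-aware test would be needed for datasets with negative values.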
