numpy: Quickest way to access & compare huge data in Python

Posted by e4yzc0pl on 2023-01-20 in Python


I am new to Pandas, and new to Python as well.
I am looking at stock data, which is read in as CSV and is typically 500,000 rows in size.
I need to check the data against itself. The basic algorithm is a loop similar to:

Row = 0

x = get "low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y

if (a):
    append ("A") at the end of row ROW  # in the dataframe
else:
    print ("B") at the end of row ROW

Row = Row + 1

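On the next iteration, the data pointer should reset to row 1. The same process then runs each time, adding a note to the dataframe at the row index.
I looked at Pandas and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified):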

import pandas as pd

df = pd.read_csv('data.csv')
calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()

    for outerindex in dfouter.index:
        # "Open" feeds the confidential calculation, which is omitted here
        dfouter_openval = dfouter.at[outerindex, "Open"]

        # Rows before outerindex never match; the first later row that
        # crosses a threshold decides the note for outerindex.
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"

    return dfouter

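This method takes a long time (7+ minutes) per 5K rows, which will be quite long for 500,000 rows. There may be data with over 1 million rows.
I have tried the two-loop method with the following variants: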

using iloc - e.g. df.iloc[index,2]
using at   - e.g. df.at[index,"low"]
using numpy & at - e.g. df.at[index,"low"] = np.where((df.at[index,"low"] < ..."

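The data is floating point values and datetime strings.
Is it better to use numpy? Maybe an alternative to using two loops? Any other methods, such as R, mongo, or some other database, would also be useful even if they are not Python. I just need the results and am not tied to Python.
Any help and constructs would be greatly helpful.
Thanks in advance.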
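
On the NumPy question, one possible direction is to keep the outer loop but replace the inner loop with array comparisons. The following is only a minimal sketch under several assumptions: the function name `annotate_first_hit` and the demo frame are made up for illustration, the thresholds are plain constants as in the simplified code (the confidential per-row calculation is not reproduced), and the frame has the default 0..n-1 index that `read_csv` produces, so position and label order coincide.

```python
import numpy as np
import pandas as pd

def annotate_first_hit(df, calc1, calc2):
    """For each row, find the first row at or after it that crosses a threshold."""
    low = df["Low"].to_numpy()
    high = df["High"].to_numpy()
    notes = np.full(len(df), "message3", dtype=object)  # default when nothing hits

    for i in range(len(df)):
        low_hit = low[i:] <= calc1    # array comparison instead of the inner loop
        high_hit = high[i:] >= calc2
        hit = low_hit | high_hit
        if hit.any():
            j = hit.argmax()          # offset of the first True
            # The Low test comes first in the original if/elif, so it wins ties.
            notes[i] = "message 1" if low_hit[j] else "message2"

    out = df.copy()
    out["notes"] = notes
    return out

# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "Open": [5.0, 5.0, 5.0, 5.0],
    "Low":  [4.0, 0.5, 4.0, 4.0],
    "High": [6.0, 6.0, 9.0, 6.0],
})
result = annotate_first_hit(demo, calc1=1, calc2=8)
print(result["notes"].tolist())
# ['message 1', 'message 1', 'message2', 'message3']
```

This avoids per-cell `.at` lookups, which tend to dominate the runtime of the two-loop version, but each comparison still covers the whole tail of the array, so the worst case remains quadratic in the number of rows.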

numpy

Source: https://stackoverflow.com/questions/75182811/quickest-way-to-access-compare-huge-data-in-python

1 Answer


jm81lzqq1#

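You have to split the main dataset into smaller datasets, for example 50 sub-datasets of 10,000 rows each, to improve speed. Run the function on each sub-dataset using threading or concurrency, then merge the final results.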
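
The split, run, and merge steps described here could be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the real data: `notes_for_rows` and `annotate_in_chunks` are hypothetical names, the thresholds are constants as in the question's simplified code, and the frame is assumed to have the default 0..n-1 index. Only the outer row positions are split into chunks. Every worker still reads the full `Low` and `High` arrays, because a row's scan can run past the end of its own chunk.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def notes_for_rows(rows, low, high, calc1, calc2):
    """Compute the note for each outer row position in `rows`."""
    notes = []
    for i in rows:
        low_hit = low[i:] <= calc1
        high_hit = high[i:] >= calc2
        hit = low_hit | high_hit
        if not hit.any():
            notes.append("message3")
        else:
            j = hit.argmax()  # offset of the first row that crosses a threshold
            notes.append("message 1" if low_hit[j] else "message2")
    return notes

def annotate_in_chunks(df, calc1, calc2, n_chunks=4, max_workers=4):
    low = df["Low"].to_numpy()
    high = df["High"].to_numpy()
    # Split the outer row positions, not the data itself.
    chunks = np.array_split(np.arange(len(df)), n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in chunk order, so merging is a plain concatenation.
        parts = pool.map(
            lambda rows: notes_for_rows(rows, low, high, calc1, calc2), chunks
        )
        merged = [note for part in parts for note in part]
    out = df.copy()
    out["notes"] = merged
    return out

# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "Low":  [4.0, 0.5, 4.0, 4.0],
    "High": [6.0, 6.0, 9.0, 6.0],
})
chunked = annotate_in_chunks(demo, calc1=1, calc2=8, n_chunks=2, max_workers=2)
print(chunked["notes"].tolist())
# ['message 1', 'message 1', 'message2', 'message3']
```

Threads in CPython share one interpreter lock, so the gain from `ThreadPoolExecutor` depends on how much of the time is spent inside NumPy rather than in Python-level loop code. Swapping in `ProcessPoolExecutor` is the usual alternative for CPU-bound work, at the cost of copying the arrays to each worker process.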

Answered 2023-01-20
