Search in large CSV files

Posted by cigdeys3 on 2021-06-23 in Mysql


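Question

I have thousands of CSV files in a folder. Every file has 128,000 entries, with four columns in each line. From time to time (twice a day) I need to compare a list (10,000 entries) with all the CSV files. If one of the entries is identical to the third or fourth column of a row in one of the CSV files, I need to write the whole CSV row to an extra file.

Possible solutions

Grep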


#!/bin/bash

getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"
for e in "${array[@]}"
do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done

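This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 of about 10,000 entries.

MySQL

The next attempt was to import all the CSV files into a MySQL database. But I ran into problems with my table at around 50,000,000 entries. So I wrote a script that created a new table after every 49,000,000 entries, and that way I was able to import all the CSV files. I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either: it slowed the import down from a few hours to days. The select statement was horrible, but it worked. It was much faster than the "grep" solution, but still too slow.

My question

What else can I try in order to search within the CSV files? To speed things up I copied all the CSV files to an SSD. But I hope there are other ways.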

mysql shell csv search grep

Source: https://stackoverflow.com/questions/50590336/search-in-large-csv-files

3 Answers


#1 w8rqjzmb

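In awk, assuming all the CSV files change between runs; otherwise it would be wise to keep track of the files already checked. But first, some test material: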

$ mkdir test        # the csvs go here
$ cat > test/file1  # has a match in 3rd
not not this not
$ cat > test/file2  # no match
not not not not
$ cat > test/file3  # has a match in 4th
not not not that
$ cat > list        # these we look for
this
that

Then the script:

$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that

Explained:

$ awk '                   # awk
NR==FNR {                 # process the list file
    a[$1]                 # hash list entries to a
    next                  # next list item
} 
($3 in a) || ($4 in a) {  # if 3rd or 4th field entry in hash
    print >> "out"        # append whole record to file "out"
}' list test/*            # first list then the rest of the files

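The script hashes all the list entries into `a`, then reads through the CSV files, looking up the 3rd and 4th field of each row in the hash and printing the row when there is a match.
If you test it, let me know how long it ran.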
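
The test material above is whitespace-separated, which is what awk splits on by default. For rows that are actually comma-separated, a minimal sketch of the same hash lookup could look like the following. The scratch directory `demo_awk` and the file names are made up for the demo, and the sketch assumes plain fields with no quoted or embedded commas and Unix line endings.

```shell
#!/bin/sh
# Sketch only: same lookup as the awk answer, adapted to comma-separated rows.
# demo_awk/ and every file name below are hypothetical demo names.
set -eu
mkdir -p demo_awk/csv
printf 'a,b,this,d\n'    > demo_awk/csv/file1.csv   # match in column 3
printf 'a,b,c,d\n'       > demo_awk/csv/file2.csv   # no match
printf 'this,b,c,that\n' > demo_awk/csv/file3.csv   # match in column 4 (column 1 is not checked)
printf 'this\nthat\n'    > demo_awk/list            # entries to look for

# -F, splits each row on commas. The list is keyed on the whole line ($0),
# so an entry is never cut at a comma. A pattern without an action prints
# the matching row.
awk -F, 'NR==FNR { a[$0]; next }
         ($3 in a) || ($4 in a)' demo_awk/list demo_awk/csv/*.csv > demo_awk/found

cat demo_awk/found
```

Redirecting awk's standard output once, instead of appending with `print >> "out"`, is only a stylistic choice here; both write the whole matching row.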


#2 xt0899hw

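This first change is unlikely to give you a meaningful benefit, but here are some improvements to your script.
Use the built-in `mapfile` to slurp the file into an array: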

mapfile -t array < entries.log

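Use grep with a file of patterns and the appropriate flags.
I assume you want to match the items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.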

grep -Fwf entries.log ./csv/*

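This means you don't have to grep the thousands of CSV files thousands of times (once for each item in entries.log). This alone should give you a real, meaningful performance improvement.
It also removes the need to read entries.log into an array at all.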
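
One thing to keep in mind is that `grep -Fw` looks at the whole line, not at a particular column. The small sketch below illustrates this on made-up data (the `demo_grep` directory and its files are hypothetical). It also adds `-h`, which GNU and BSD grep use to omit the file name prefix, so that only the CSV row is written.

```shell
#!/bin/sh
# Sketch only: shows what grep -Fwf matches on comma-separated rows.
# demo_grep/ and every file name below are hypothetical demo names.
set -eu
mkdir -p demo_grep/csv
printf 'a,b,this,d\n'    > demo_grep/csv/file1.csv   # entry in column 3
printf 'this,b,c,d\n'    > demo_grep/csv/file2.csv   # entry only in column 1
printf 'a,b,c,thisone\n' > demo_grep/csv/file3.csv   # not a whole-word match
printf 'this\n'          > demo_grep/entries.log

# -F fixed strings, -w whole words only, -f read the patterns from a file,
# -h suppress the "file:" prefix when several files are searched.
grep -hFwf demo_grep/entries.log demo_grep/csv/*.csv > demo_grep/found

cat demo_grep/found
```

Here the row from file2.csv is reported as well, because the entry appears in its first column. If only the third and fourth columns count, the grep result would still need a column check afterwards.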


#3 ua4mk5z4

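You can build a patterns file and then use `xargs` and `grep -Ef` to search for all the patterns in batches of CSV files, rather than one pattern at a time as in your current solution: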


# prepare patterns file

while read -r line; do
  printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"       # find value in third column
  printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"       # find value in fourth column
done < entries.log > patterns.dat

find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
`find ...` - emits a NUL-delimited list of all the CSV files found
`xargs -0 ...` - passes the list of files to grep in batches
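
Because the entries are pasted into extended regular expressions, an entry containing a character such as `.` or `+` would be read as regex syntax. The following is a hedged sketch of the same pipeline with the entries escaped first. The `demo_xargs` directory, its contents, and the `sed` escaping step are assumptions added for illustration and are not part of the answer above.

```shell
#!/bin/sh
# Sketch only: build anchored column patterns from escaped entries.
# demo_xargs/ and every file name below are hypothetical demo names.
set -eu
mkdir -p demo_xargs/csv
printf 'x,y,a.b,z\nx,y,aXb,z\nx,y,z,1+1\nx,y,z,11\n' > demo_xargs/csv/data.csv
printf 'a.b\n1+1\n' > demo_xargs/entries.log

# Put a backslash in front of ERE metacharacters so each entry matches literally.
sed 's/[][\.|$(){}?+*^]/\\&/g' demo_xargs/entries.log |
while IFS= read -r esc; do
  printf '^[^,]+,[^,]+,%s,[^,]+$\n' "$esc"   # value in third column
  printf '^[^,]+,[^,]+,[^,]+,%s$\n' "$esc"   # value in fourth column
done > demo_xargs/patterns.dat

# -print0 / -0 keep file names with spaces intact; -h drops the file name prefix.
find demo_xargs/csv -type f -name '*.csv' -print0 |
  xargs -0 grep -hEf demo_xargs/patterns.dat > demo_xargs/found.dat

cat demo_xargs/found.dat
```

Without the escaping step, the pattern built from `a.b` would also accept the row containing `aXb`, and the one built from `1+1` would accept `11` instead of `1+1`.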
