macOS Terminal: Split a CSV file by column 1 values

yr9zkbsy asked on 2023-11-14 · Mac
Follow (0) | Answers (7) | Views (119)

I have a two-column spreadsheet (saved as CSV) that looks like this:

COLUMN 1,COLUMN 2
3-Entrepreneurship,"innovation, daily"
,countless
2-Police/Enforcement,"innocent, protect"
2-Bathroom:home room,toilet handle
3-Companies,née dresses
2-Sense of Smell,odorless
3-Entrepreneurship,old ideas
3-Entrepreneurship,¡new income streams!
3-Companies,Zoë’s food store
,many
2-Police/Enforcement,crime
2-Bathroom:home room,bath room
,ring
3-Companies,móvíl résumés
2-Sense of Smell,musty smell
3-Entrepreneurship,good publicity guru!
3-Companies,Señor

The full spreadsheet has 1000 rows (saved as CSV, with a comma separating the two columns), and column 1 contains more categories than the ones listed here.
As shown above, some of the column 2 entries consist of two or three words separated by spaces. They also use commas, apostrophes, and accented characters (and these appear across several categories, not just the one titled 3-Companies).
I want to split the CSV file into separate TXT files according to the value in column 1. The individual files would no longer be spreadsheets, just lists of words.
For example, after splitting:

  • In the file 3-Entrepreneurship.txt
3-Entrepreneurship
innovation, daily
old ideas
¡new income streams!
good publicity guru!

  • In the file 2-Bathroom:home room.txt
2-Bathroom:home room
toilet handle
bath room

  • In the file 2-Police/Enforcement.txt
2-Police/Enforcement
innocent, protect
crime

  • In the file 2-Sense of Smell.txt
2-Sense of Smell
odorless
musty smell

  • In the file 3-Companies.txt
3-Companies
née dresses
Zoë’s food store
móvíl résumés
Señor


This is just an example. The full file has more than 5 categories (in column 1), so there will be more than 5 split files.

Environment: I am using Terminal on macOS 12.6.9. Ideally, I would like to copy and paste a single line of code and have it act on the CSV file in Terminal's current working directory (so I don't have to hard-code the file name into the code).
First attempt

I actually asked a different variant of this question here. In that version, column 2 (rather than column 1) was used for the split, and the category was not placed as the first line of each split TXT file.
I tried to modify it as follows:

tail -n +2 *.csv | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\x0d$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'


However, instead of splitting by column 1 and putting the category name at the top, it ignores column 1 entirely and splits the column 2 values into separate files.
Please note: I'm not wedded to this code. Any solution that works on macOS 12.6.9 is fine.

oewdyzsn 1#

One awk idea:

awk '
#NR==1 { next }                                                 # uncomment this line if the file *DOES* include the header record "COLUMN 1,COLUMN 2"
       { pos = index($0,",")                                    # find location of first comma

         cat = substr($0,1,pos-1)                               # extract the category
         gsub(/\//,":",cat)                                     # replace "/" with ":" (unix/linux filenames cannot include "/")
         if (! cat) cat = prevcat                               # if cat is empty/blank then use prevcat

         words = substr($0,pos+1)                               # extract the words
         gsub(/^"|"$/,"",words)                                 # strip any leading/trailing double quote

         cats[cat] = (cats[cat] ? cats[cat] : cat) ORS words    # update our categories array (cats[]) with our new data; if cats[cat] is empty (ie, this is a new array entry) then start by adding our cat to the list

         prevcat = cat                                          # update previous cat
       }

END    { for (cat in cats) {                                    # loop through list of categories (ie, the indexes of the cats[] array)
             outfile = cat ".txt"                               # create output file name
             print cats[cat] > outfile                          # print array entry to output file
             close (outfile)                                    # close the output file
         }
       }
' full.csv


**NOTE:** this solution replaces all of the OP's current tail | sort | awk code.

This generates:

$ head -20 [0-9]*.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police:Enforcement.txt <==
2-Police:Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!
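
The OP asked for something that acts on whatever CSV happens to be in the current directory without hard-coding its name. One possible way (my addition, not part of the original answer; the file name split_by_cat.awk is made up) is to save the awk program above, minus the surrounding single quotes and the trailing full.csv, to a file and loop over the CSVs in the directory:

# hypothetical wrapper; assumes the directory holds only the CSV you want to split
for f in ./*.csv; do awk -f split_by_cat.awk "$f"; done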

nkhmeac6 2#

Using any POSIX awk, regardless of how many output files get generated, and without storing all of the input in memory, this will correctly handle escaped double quotes in the input (e.g. it converts a,"foo""bar",b to a,foo"bar,b rather than a,foo""bar,b):

$ cat tst.awk
NR > 1 {
    pos = index($0,",")
    tag = substr($0,1,pos-1)    # tag is everything before first ,
    val = substr($0,pos+1)      # val[ue] is everything after it

    gsub(/^"|"$/,"",val)        # strip field-enclosing quotes
    gsub(/""/,"\"",val)         # de-escape CSV quotes so "" -> "

    out = (tag == "" ? prevTag : tag ) ".txt"
    gsub("/","_",out)           # file names cannot contain / so map to _

    if ( !seen[out]++ ) {
        print tag > out
    }
    print val >> out
    close(out)                  # necessary to avoid "too many open files"

    prevTag = tag
}


$ awk -f tst.awk file.csv

$ head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police_Enforcement.txt <==
2-Police/Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!


Alternatively, you could use the Decorate-Sort-Undecorate idiom for efficiency, since then each output file doesn't have to be opened/closed on every write, just once per tag:

$ cat tst.sh
#!/usr/bin/env bash

awk -v OFS=',' '
    NR > 1 {
        pos = index($0,",")
        tag = substr($0,1,pos-1)    # tag is everything before first ,
        val = substr($0,pos+1)      # val[ue] is everything after it

        print NR, (tag == "" ? prevTag : tag), val
        prevTag = tag
    }
' "${@:--}" |
sort -t',' -k2,2 -k1,1n |
awk -F',' '
    $2 != prevTag {
        close(out)
        out = $2 ".txt"
        gsub("/","_",out)       # file names cannot contain / so map to _
        print $2 > out
        prevTag = $2
    }
    {
        sub(/[^,]+,[^,]+,/,"")  # remove the NR and tag
        gsub(/^"|"$/,"")        # strip field-enclosing quotes
        gsub(/""/,"\"")         # de-escape CSV quotes so "" -> "
        print > out
    }
'


You'd call it as ./tst.sh file.csv
One benefit of the DSU script over storing all of the input in awk and then processing it in the END section is that, in the pipeline above, only sort has to operate on the whole input at once rather than awk, and sort is designed to handle arbitrarily large inputs by using demand paging, etc.
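
For illustration (my addition, not part of the original answer), the decorated records produced by the first awk stage for the sample data look like this before sort reorders them; the leading record number is what lets sort -t',' -k2,2 -k1,1n group lines by tag while preserving the original input order within each tag:

2,3-Entrepreneurship,"innovation, daily"
3,3-Entrepreneurship,countless
4,2-Police/Enforcement,"innocent, protect"
5,2-Bathroom:home room,toilet handle
(and so on for the remaining rows)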

gj3fmq9x 3#

Here's a Ruby way to do it:

ruby -r csv -e 'BEGIN{h=Hash.new { |x, key| x[key] = [] }; last_col_1=""}   # hash of category => list of values
$<.each_line.with_index{|line,i|
    next if i==0                                   # skip the header line
    lc=CSV.parse(line).flatten                     # parse one CSV record
    if lc[0].nil? then col_1=last_col_1 else col_1=lc[0] end   # empty col 1: reuse the previous category
    last_col_1=col_1
    h[col_1] << lc[1]                              # append the col 2 value to that category
}
h.each{|k,v| File.open("#{k.gsub(/\//,":")}.txt", "w"){|f| f.write "#{k}\n#{v.join("\n")}"} }   # one file per category; / in names becomes :
' your_file

(Note that the / character in the file name is replaced with : for the sample input, producing 2-Police:Enforcement.txt, since 2-Police/Enforcement.txt is not a legal file name.)
Producing:

% head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring
==> 2-Police:Enforcement.txt <==
2-Police/Enforcement
innocent, protect
crime
==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell
==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor
==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!

cbwuti44 4#

You can also solve this with Perl:

perl -F, -lane 'if ($. != 1) { $pos = index($_, ","); $cat = substr($_, 0, $pos); $cat =~ s/\//:/g; $cat = $prevcat if $cat eq ""; $words = substr($_, $pos+1); $words =~ s/^"|"$//g; $cats{$cat} = ($cats{$cat} ? $cats{$cat} : $cat) . "\n" . $words; $prevcat = $cat; } END { while (($cat, $words) = each %cats) { open(my $fh, ">", "$cat.txt") or die "Cannot open $cat.txt: $!"; print $fh $words; close($fh); } }' your_data.csv

The generated files:

$ ls -1 *.txt
2-Bathroom:home room.txt
2-Police:Enforcement.txt
2-Sense of Smell.txt
3-Companies.txt
3-Entrepreneurship.txt


Output:

$ head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police:Enforcement.txt <==
2-Police:Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!

afdcj2ne 5#

Python can handle CSV data properly. The program below uses a dictionary (map) to store each value (col 2) in a list associated with the last-seen category (col 1). The last-seen approach lets values with a missing category be associated with the most recent category above them:

import csv

category_values: dict[str, list[str]] = {}

with open("input.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    category = ""
    for row in reader:
        _category, value = row[0], row[1]

        if _category != "" and _category != category:
            category = _category

        if category not in category_values:
            category_values[category] = []

        category_values[category].append(value)

We can inspect the dictionary with a pair of nested loops:

for category, values in category_values.items():
    print(category)
    for value in values:
        print(f"  {value}")


I get:

3-Entrepreneurship
  innovation, daily
  countless
  old ideas
  ¡new income streams!
  good publicity guru!
2-Police/Enforcement
  innocent, protect
  crime
2-Bathroom:home room
  toilet handle
  bath room
  ring
3-Companies
  née dresses
  Zoë’s food store
  many
  móvíl résumés
  Señor
2-Sense of Smell
  odorless
  musty smell


Then a similar pair of loops writes the category values out to files of their own. I do some basic file-name sanitization based on the category:

for category, values in category_values.items():
    fname = category.replace("/", "-").replace(":", "-").replace("\\", "-")
    with open(f"output-{fname}.txt", "w", newline="", encoding="utf-8") as f:
        f.write(category + "\n")
        for value in values:
            f.write(value + "\n")


I then get a list of files like:

output-2-Bathroom-home room.txt
output-2-Police-Enforcement.txt
output-2-Sense of Smell.txt
output-3-Companies.txt
output-3-Entrepreneurship.txt


output-2-Bathroom-home room.txt looks like:

2-Bathroom:home room
toilet handle
bath room
ring

vybvopom 6#

For completeness, you can also do this with plain bash (with only 1000 rows, performance should not be an issue). Note that, unlike the awk-based answers, this actually creates the file you asked for, even the one in a subdirectory (e.g. 2-Police/Enforcement.txt). Like the other answers, it will not behave correctly if the input CSV contains multi-line records. Also, if the first field of the second line is empty, the text file created will be named .txt. Put the following into a file (e.g. ~/bin/csv2txt):

#!/usr/bin/env bash

tail -n+2 "$1" | while IFS=, read -r tmp b; do
  a="${tmp:-$a}"
  [[ "$a" == */* ]] && mkdir -p "${a%/*}"
  [[ -f "$a.txt" ]] || printf '%s\n' "$a" > "$a.txt"
  b="${b#\"}"
  printf '%s\n' "${b%\"}" >> "$a.txt"
done

Make it executable:

chmod +x ~/bin/csv2txt


Then:

~/bin/csv2txt file.csv


Or, if ~/bin is already in your PATH:

csv2txt file.csv


Explanation:

  • tail -n+2 "$1" | while IFS=, read -r tmp b; do: we use tail to drop the first line of the CSV file and pipe the remaining lines into a while loop whose Input Field Separator (IFS) is set to a comma. For each line we store the first field (before the first comma) in tmp and the rest of the line (after the first comma) in b. read's -r option prevents backslashes from escaping any characters.
  • a="${tmp:-$a}": if tmp is non-empty we assign it to a; otherwise (e.g. on line 3, countless) we leave a unmodified (see the short demo after this list).
  • [[ "$a" == */* ]] && mkdir -p "${a%/*}": if a contains a slash (e.g. 2-Police/Enforcement), we create the corresponding directory.
  • [[ -f "$a.txt" ]] || printf '%s\n' "$a" > "$a.txt": if the target text file ("$a.txt") does not exist yet, we print $a into it.
  • b="${b#\"}": we strip any leading " from b.
  • printf '%s\n' "${b%\"}" >> "$a.txt": we strip any trailing " from b and append the value to the target text file.
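
For readers less familiar with these parameter expansions, here is a tiny standalone demo (my addition; the sample values are made up to mirror the input data):

#!/usr/bin/env bash
# Demo of the parameter expansions used in csv2txt, on made-up sample values.
a="3-Entrepreneurship"
tmp=""                       # empty first field, as on the ",countless" line
a="${tmp:-$a}"               # tmp is empty, so a keeps its previous value
echo "$a"                    # -> 3-Entrepreneurship

a="2-Police/Enforcement"
echo "${a%/*}"               # strip the shortest trailing /* pattern -> 2-Police

b='"innocent, protect"'
b="${b#\"}"                  # drop a leading double quote
echo "${b%\"}"               # drop a trailing double quote -> innocent, protect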

zy1mlcev 7#

To correctly handle quoted values, which may contain escaped quotes and the like, you need a language with a proper CSV parser.

#!/usr/bin/env python3

import csv

with open('input.csv') as csvin:
    for row in csv.reader(csvin):
        with open(row[0], 'a') as txtout:
            txtout.write(row[1] + '\n')

Here is a slightly embellished version which covers a few more corner cases. I'm assuming the rows with an empty cell in the first column should simply be skipped, and that tags containing a slash should result in subdirectories.

import csv
from pathlib import Path

with open('input.csv') as csvin:
    # skip header
    reader = csv.reader(csvin)
    reader.__next__()
    for row in reader:
        if row[0]:
            if '/' in row[0]:
                Path(row[0]).parent.mkdir(parents=True, exist_ok=True)
            with open(row[0] + '.txt', 'a+') as txtout:
                txtout.write(row[1] + '\n')


Demo: https://ideone.com/7akFvw
If you wanted the first line of each file to contain a header, that would complicate the code (you would need to check whether the file already exists), but I regard the missing header as a feature, not a bug.
If you really needed to optimize this for speed, you would keep as many file handles open as possible, falling back once you open more than the operating system allows (typically on the order of 20). Closing and immediately reopening a file tends to be a lot slower.
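
If you want to see the actual limit on your own system (my addition, not part of the original answer), the shell can report the per-process open-file limit:

ulimit -n        # prints the maximum number of open file descriptors for this shell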
