regex: What regular expression removes trailing/leading spaces/tabs from only the first row of a CSV file?

Posted by v64noz0r on 2023-06-07 in Other

Consider the following CSV dataset:

"Column1 "," Column2"," Column3 ","Column4","Column5"
"Record11 "," Record12"," Record13 ","Record14","Record15"
"Record21 "," Record22"," Record23 ","Record24","Record25"
"Record31 "," Record32"," Record33 ","Record34","Record35"

Expected output after applying the regex to the CSV dataset:

"Column1","Column2","Column3","Column4","Column5" >trailing/leading space removed from row1
"Record11 "," Record12"," Record13 ","Record14","Record15" > other rows should remain intact
"Record21 "," Record22"," Record23 ","Record24","Record25" > other rows should remain intact
"Record31 "," Record32"," Record33 ","Record34","Record35" > other rows should remain intact

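Note: trailing/leading spaces/tabs can occur in any row; the regex should process only the row/record specified by the user.
I tried Python's re.sub() function with the expression “\s([^"][^"\s])\s*”, replacing each match with "\1". However, this regex replacement is applied to every record in the CSV file, whereas it is expected to apply only to the first row.
Note that we must use only Python's re.sub() method.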

p4tfgftt  #1

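If you have to apply the regex to the entire content of the CSV file, I really don't know how to make the match work only on the first line.
However, if you know the number of columns, or compute it by counting the fields on the first line, you can call re.sub() with its fourth argument (count) set to the number of columns. Since matches are replaced from left to right, it stops after the first line's fields and only the first line is changed.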
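As a minimal sketch of how count limits the substitution, the snippet below uses a deliberately simplified field pattern (an assumption for illustration: it does not handle doubled quotes inside fields) on a two-row sample with two columns:

```python
import re

# Two rows, two columns each; every field has stray spaces.
data = '" A ","B "\n" C ","D "'

# Simplified pattern (illustrative only): a quoted field without inner quotes.
field = re.compile(r'"\s*([^"]*?)\s*"')

# Without count, every field in every row is trimmed.
trimmed_all = field.sub(r'"\1"', data)

# With count=2 (the number of columns), only the first two matches,
# i.e. the fields of the first row, are replaced.
trimmed_first_row = field.sub(r'"\1"', data, count=2)

print(trimmed_all)
print(trimmed_first_row)
```

Passing count by keyword keeps the call readable and avoids confusing it with the flags argument of the module-level re.sub().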
For the regex pattern itself, I would do this:

/
"              # opening quote.
\s*            # spaces to drop at the beginning.
(?P<text>      # capturing group named "text"
  (?:          # non-capturing group, repeated 0 or n times, ungreedy.
    "{2}|[^\"] # either an escaped quote (2 quotes) or any non-quote char.
  )*?
)
\s*            # spaces to drop at the end.
"(?!")         # closing quote, not followed by a quote. This is to
               # make the ungreedy text capturing work correctly.
/gx

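This works even when the strings themselves contain quotes. In CSV, quotes are not escaped with \" but by doubling them ("").
You can test the regex here: https://regex101.com/r/t4UAyZ/4
Note that you cannot simply count commas to get the number of columns, because commas can also appear inside the strings themselves.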
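To illustrate the point about commas, here is a small sketch with a made-up header line (the column names are hypothetical) in which one field contains a comma. Counting commas overestimates the number of columns, while counting quoted-field matches does not:

```python
import re

# Hypothetical header: the second field contains a comma.
header = '"Name","City, Country","Age"'

# Naive approach: number of commas plus one.
naive_count = header.count(",") + 1

# Counting quoted fields instead; "" inside a field is an escaped quote.
quoted_field = re.compile(r'"(?:""|[^"])*"(?!")')
field_count = len(quoted_field.findall(header))

print(naive_count)   # 4
print(field_count)   # 3
```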
Python code for the whole approach:

import re

csv = """\"Column1 \",\" Column2\",\" Column3 \",\"Column4\",\"\"\"stupid\"\",\"\"col\"\" name \"
\"Record11 \",\" Record12\",\" Record13 \",\"Record14\",\"Record15\"
\"Record21 \",\" Record22\",\" Record23 \",\"Record24\",\"Record25\"
\"Record31 \",\" Record32\",\" Record33 \",\"Record34\",\"Record35\"
\"  It's \"\"allowed\"\" to have quotes in strings => double them\",\"val2\",\"val3\",\" \",\"\""""

print("CSV input:\n----------\n" + csv + "\n")

# Regex to match CSV string fields.
csvFieldRegex = re.compile(r"""
    \"              # opening quote.
    \s*             # spaces to drop at the beginning.
    (?P<text>       # capturing group named "text"
      (?:           # non-capturing group, repeated 0 or n times, ungreedy.
        \"{2}|[^\"] # either an escaped quote (2 quotes) or any non-quote char.
      )*?
    )
    \s*             # spaces to drop at the end.
    \"(?!\")        # closing quote, not followed by a quote. This is to
                    # make the ungreedy text capturing work correctly.
    """, re.X)

# Substitution to have the trimmed text.
csvFieldSubst = '"\\g<text>"'

# Extract the first line to get the column names.
# [^\r\n]+ keeps a trailing CR out of the capture and also works for a one-line CSV.
firstLineMatch = re.match(r"([^\r\n]+)", csv)
if not firstLineMatch:
    raise Exception("Could not extract the first line of the CSV!")

print("\nFirst line:\n-----------\n" + firstLineMatch.group(1))

# Match all the string fields to count them.
startFieldMatches = csvFieldRegex.findall(firstLineMatch.group(1))
if startFieldMatches:
    nbrCols = len(startFieldMatches)
    print("Number of columns: ", nbrCols)
else:
    raise Exception("Could not extract the fields from the first line!")

# Trim the string fields, but only for the number found on the first line.
result = csvFieldRegex.sub(csvFieldSubst, csv, count=nbrCols)
if result:
    print("\nResulting CSV:\n--------------\n" + result)

Output:

CSV input:
----------
"Column1 "," Column2"," Column3 ","Column4","""stupid"",""col"" name "
"Record11 "," Record12"," Record13 ","Record14","Record15"
"Record21 "," Record22"," Record23 ","Record24","Record25"
"Record31 "," Record32"," Record33 ","Record34","Record35"
"  It's ""allowed"" to have quotes in strings => double them","val2","val3"," ",""

First line:
-----------
"Column1 "," Column2"," Column3 ","Column4","""stupid"",""col"" name "
Number of columns:  5

Resulting CSV:
--------------
"Column1","Column2","Column3","Column4","""stupid"",""col"" name"
"Record11 "," Record12"," Record13 ","Record14","Record15"
"Record21 "," Record22"," Record23 ","Record24","Record25"
"Record31 "," Record32"," Record33 ","Record34","Record35"
"  It's ""allowed"" to have quotes in strings => double them","val2","val3"," ",""

But personally, I would use a CSV parser instead of a regex. A regex is not a good tool for this job.
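
For comparison, a parser-based version could look like the sketch below. It uses Python's standard csv module; the helper name strip_row and its row_index parameter are hypothetical, chosen here so that any row can be targeted, and writing every field back with quotes (csv.QUOTE_ALL) is an assumption made to match the sample data:

```python
import csv
import io

def strip_row(csv_text, row_index=0):
    """Return csv_text with the cells of one row (0-based row_index) stripped."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Only the selected row is trimmed; all other rows are left untouched.
    rows[row_index] = [cell.strip() for cell in rows[row_index]]
    out = io.StringIO()
    # QUOTE_ALL re-quotes every field, as in the sample data.
    csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
    return out.getvalue()

sample = (
    '"Column1 "," Column2"," Column3 "\n'
    '"Record11 "," Record12"," Record13 "\n'
)
result = strip_row(sample)
print(result)
```

The parser takes care of doubled quotes and commas inside fields, so no field-matching pattern is needed.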

pqwbnv8z  #2

You should take a look at str.strip, which does what you need; you can call it on each element.
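
A short sketch of what that looks like, assuming the header cells are already available as a list of strings (the list itself is made up for illustration):

```python
# Hypothetical header cells with stray spaces and a tab.
header = ["Column1 ", " Column2", "\tColumn3 ", "Column4"]

# str.strip() with no argument removes all leading/trailing whitespace.
cleaned = [name.strip() for name in header]

# Passing " \t" restricts stripping to spaces and tabs (a newline stays).
spaces_tabs_only = "\t Column5 \n".strip(" \t")

print(cleaned)
print(repr(spaces_tabs_only))
```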
