Consider the following CSV dataset:
"Column1 "," Column2"," Column3 ","Column4","Column5"
"Record11 "," Record12"," Record13 ","Record14","Record15"
"Record21 "," Record22"," Record23 ","Record24","Record25"
"Record31 "," Record32"," Record33 ","Record34","Record35"
Expected output after applying the regex to the CSV dataset:
"Column1","Column2","Column3","Column4","Column5" > leading/trailing spaces removed from row 1
"Record11 "," Record12"," Record13 ","Record14","Record15" > other rows should remain intact
"Record21 "," Record22"," Record23 ","Record24","Record25" > other rows should remain intact
"Record31 "," Record32"," Record33 ","Record34","Record35" > other rows should remain intact
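Note: leading/trailing spaces or tabs can appear in any row, and the regex should process only the row/record specified by the user.
I tried Python's re.sub() function with the expression "\s([^"][^"\s])\s*" and "\1" as the replacement. However, this regex replacement is applied to every record in the CSV file, whereas it is expected to apply only to the first row.
Note that we must use only Python's re.sub() method.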
2 Answers
p4tfgftt #1
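If you have to apply the regex to the entire content of the CSV file, I really don't know how to make it match only on the first row.
However, if you know the number of columns, or work it out by counting the columns in the first row, you can call re.sub() with its fourth parameter (count) set to that number. The substitution then stops after that many replacements, so only the first row is changed. For the regex pattern itself, I would do this: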
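This also works when the strings contain quotes. In CSV, a quote is not escaped with \" but by doubling it: "". You can test the regex here: https://regex101.com/r/t4UAyZ/4
Note that you cannot simply count commas to get the number of columns, because a comma can also appear inside a string.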
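To illustrate why counting commas is unreliable, here is a minimal sketch that counts quoted fields instead. The pattern and the names QUOTED_FIELD and count_fields are assumptions made for illustration, and the sketch assumes every field is wrapped in double quotes.

```python
import re

# Hypothetical pattern: one double-quoted field, where a literal quote
# inside the field is written as "" (the usual CSV convention).
QUOTED_FIELD = re.compile(r'"(?:[^"]|"")*"')


def count_fields(line: str) -> int:
    """Count the double-quoted fields on a single CSV line."""
    return len(QUOTED_FIELD.findall(line))


header = '"Name","City, Country","Note ""x"""'
print(header.count(",") + 1)  # 4: the comma inside "City, Country" is miscounted
print(count_fields(header))   # 3: the actual number of fields
```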
Python code:
Output:
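The count-based approach described in this answer could look roughly like the sketch below. The pattern TRIM_FIELD and the function trim_first_row are assumptions made for illustration, not necessarily the regex behind the regex101 link. The sketch also assumes that every field is double-quoted and that the first row contains no embedded newline.

```python
import re

# Hypothetical pattern: a quoted field with optional whitespace just inside
# the quotes. Group 1 is the field content without that whitespace.
# (?!") keeps the closing quote from matching the first half of an escaped "".
TRIM_FIELD = re.compile(r'"\s*((?:[^"]|"")*?)\s*"(?!")')


def trim_first_row(csv_text: str) -> str:
    """Trim whitespace inside the quoted fields of the first row only."""
    first_line = csv_text.split("\n", 1)[0]
    n_fields = len(TRIM_FIELD.findall(first_line))
    if n_fields == 0:
        # count=0 would tell re.sub to replace every match, so bail out.
        return csv_text
    # Matches are replaced left to right, so limiting the count to the
    # number of header fields leaves all later rows untouched.
    return TRIM_FIELD.sub(r'"\1"', csv_text, count=n_fields)


data = (
    '"Column1 "," Column2"," Column3 ","Column4","Column5"\n'
    '"Record11 "," Record12"," Record13 ","Record14","Record15"\n'
)
print(trim_first_row(data))
```

The guard for zero fields matters because re.sub treats count=0 as "no limit".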
Personally, though, I would use a CSV parser rather than a regex. A regex is not the right tool for this job.
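As a rough sketch of the parser-based alternative, the standard csv module can read the rows, strip the fields of one chosen row, and write everything back. The function name strip_row and its row_index parameter are hypothetical, and the sketch assumes the output should quote every field, as the sample data does.

```python
import csv
import io


def strip_row(csv_text: str, row_index: int = 0) -> str:
    """Strip leading/trailing whitespace from every field of one row."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    rows[row_index] = [field.strip() for field in rows[row_index]]
    out = io.StringIO(newline="")
    # QUOTE_ALL mirrors the input, where every field is quoted.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


data = (
    '"Column1 "," Column2"," Column3 ","Column4","Column5"\n'
    '"Record11 "," Record12"," Record13 ","Record14","Record15"\n'
)
print(strip_row(data))               # trims the header row only
print(strip_row(data, row_index=1))  # trims the second row only
```

Because the row is selected by index, this also covers the case where the user wants a row other than the first.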
pqwbnv8z #2
You should take a look at str.strip. It does what you need, and you can call it on each element individually.
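A minimal sketch of that idea, assuming the row has already been split into a list of field strings (the helper name strip_cells is hypothetical):

```python
def strip_cells(cells: list[str]) -> list[str]:
    """Return a copy of the row with whitespace stripped from each cell."""
    # str.strip() with no argument removes spaces, tabs and newlines
    # from both ends of the string.
    return [cell.strip() for cell in cells]


header = ["Column1 ", " Column2", "\tColumn3 ", "Column4"]
print(strip_cells(header))  # ['Column1', 'Column2', 'Column3', 'Column4']
```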