Ignore commas that appear between quotes in a PySpark RDD

bvuwiixz · posted 2021-05-27 in Spark

I have loaded a CSV file into an RDD, which looks like this:

['"3331/587","Sub,Metro","1235","1000"',
'"1234/232","City","8479","2000"',
'"5987/215","Sub,Metro","1111","Unknown"',
'"8794/215","Sub,Metro","1112","1000"',
'"1254/951","City","6598","XXXX"',
'"1584/951","City","1548","Unknown"',
'"1833/331","Sub,Metro","1009","2000"',
'"2213/987","City","1197", ']

What I ultimately want to end up with is:

[["3331/587","Sub,Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Sub,Metro","1111","Unknown"],
["8794/215","Sub,Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unknown"],
["1833/331","Sub,Metro","1009","2000"],
["2213/987","City","1197", ]]

If I use this code:

sc.textFile(file).map(lambda l: l.replace(r'"', '').split(','))

it also splits on the comma inside the value ("Sub,Metro").
How can I make the split automatically ignore every comma that appears between double quotes?
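
For illustration, here is the symptom in plain Python (a minimal sketch, not from the original post): stripping the quotes first discards the information about which commas are field separators.

line = '"3331/587","Sub,Metro","1235","1000"'

# All quotes are gone before the split, so the comma inside
# "Sub,Metro" is indistinguishable from a field separator:
line.replace('"', '').split(',')
# -> ['3331/587', 'Sub', 'Metro', '1235', '1000']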

laximzn5 · 1#

Here is my regular-expression example.

import re

rdd = sc.parallelize(['"3331/587","Sub,Metro","1235","1000"',
'"1234/232","City","8479","2000"',
'"5987/215","Sub,Metro","1111","Unknown"',
'"8794/215","Sub,Metro","1112","1000"',
'"1254/951","City","6598","XXXX"',
'"1584/951","City","1548","Unknown"',
'"1833/331","Sub,Metro","1009","2000"',
'"2213/987","City","1197", '])

# Remove the leading and trailing quote, then split on a comma that is
# flanked by non-letter characters (the closing/opening quotes of
# adjacent fields), so commas inside values like "Sub,Metro" survive.
rdd.map(lambda l: re.split(r'[^A-Za-z],[^A-Za-z]',
                           re.sub(r'(^")|("$)', '', l))).collect()

[['3331/587', 'Sub,Metro', '1235', '1000'],
 ['1234/232', 'City', '8479', '2000'],
 ['5987/215', 'Sub,Metro', '1111', 'Unknown'],
 ['8794/215', 'Sub,Metro', '1112', '1000'],
 ['1254/951', 'City', '6598', 'XXXX'],
 ['1584/951', 'City', '1548', 'Unknown'],
 ['1833/331', 'Sub,Metro', '1009', '2000'],
 ['2213/987', 'City', '1197', '']]
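
As a side note beyond the answer above: Python's built-in csv module already implements quote-aware splitting, so a sketch like the following avoids a hand-written regex (csv.reader accepts any iterable of strings, so a one-element list per line works):

import csv

# Parse each line with a quote-aware CSV reader instead of a regex.
# skipinitialspace=True turns the trailing ', ' of the last row
# into an empty field rather than a single space.
rdd.map(lambda l: next(csv.reader([l], skipinitialspace=True))).collect()

If the file can be read as a DataFrame rather than a raw RDD, spark.read.csv also handles quoted fields out of the box.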
