I'm trying to parse a CSV file in R, ideally with read_csv(). Below is a sample of my data where the error occurs:
"ScheduleDId","ReportId","CommitteeContactId","FirstName","MiddleName","LastOrCompanyName","Prefix","Suffix","AddressLine1","AddressLine2","City","StateCode","ZipCode","IsIndividual","TransactionDate","Amount","AuthorizingName","ItemOrService","ScheduleId","ReportUID"
"2543592","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/03/2019","1250.00","Raymond A. Bell, Jr.","Consulting Fees 10 hours @ $125 per hour","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543593","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/10/2019","750.00","Raymond A. Bell, Jr.","Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24" 1/0 coroplast yard signs/stakes/delivery - $405","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543594","168976","572005","","","Freelancer Limited - Freelancer.CO.UK","","","680 George Street","","Sidney, New South Wales - Australia","N/A","20000","False","06/07/2019","204.90","Raymond A. Bell, Jr.","Website design","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
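I should get three rows, but R returns two. I think the double quote that denotes inches in one of the strings is causing the problem: it disappears from the output and throws off the result. See the last columns below: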
# A tibble: 2 × 7
IsIndividual TransactionDate Amount AuthorizingName ItemOrService ScheduleId ReportUID
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 False 06/03/2019 1250.00 Raymond A. Bell, Jr. "Consulting Fees 10 hours @ $125 per hour" NA {0A43968…
2 False 06/10/2019 750.00 Raymond A. Bell, Jr. "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24 1/0 coroplast yard signs… "New Sout… N/A
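Google Sheets parses the data without any problems, so I figure I should be able to do the same in R. I tried fread() with no luck. I also tried read_delim(data, delim = ",", escape_double = FALSE).
This happens throughout the file, so I can't just remove the quotes by hand. There are more than 27K rows, and I want to import multiple files. You can download one of the files I'm referring to here: https://apps.elections.virginia.gov/SBE_CSV/CF/2019_07/ScheduleD.csv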
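Update
Robert's answer below got me closer: I was able to read many more rows without issue, but there are still a few problem rows. Here is the code I used: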
library(readr)

parse_csv <- function(x) {
  dat <- readLines(x)
  # Backslash-escape quotes that do not look like field delimiters
  dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl = TRUE)
  read_delim(I(dat), escape_backslash = TRUE)
}

test <- parse_csv("test.csv")
problems(test)
And here are the problems it reports:
# A tibble: 2 × 5
row col expected actual
<int> <int> <chr> <chr>
1 27349 18 20 columns 18 columns
2 27350 5 20 columns 5 columns
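I opened the CSV file in a text editor, went to the problem rows, and found that the end of the CSV record was on a separate line.
The file contains many rows like this. Once I cleaned them up, the custom function worked perfectly.
But I'm working with more than 200 files, so I can't check each one by hand to make sure every record sits on a single line. The line breaks always seem to be in the ItemOrService column. Is there a way to clean this up programmatically?
Here is a link to another CSV that has rows split across two lines. The first break is at line 89.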
https://gist.github.com/jrcloutier/cc48a230d99a983f40c94f7800e69952
1 Answer
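With some luck, the following approach will work. You can use gsub() to escape every quote that is probably not an enclosing quote. I define these as the quotes that are not at the start or end of a line and are neither preceded by ", (quote then comma) nor followed by ," (comma then quote). I couldn't connect to virginia.gov, so I could only test this on the small sample shown in your question, where it works. When applied to the full dataset, however, some rows may still fail, or the 27K rows may cause performance problems.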
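A minimal sketch of that escaping step, reusing the regular expression that appears in the question's update. It uses base R only so that it runs on its own. The helper name escape_inner_quotes and the shortened sample record are made up for illustration.

```r
# Hypothetical helper: backslash-escape quotes that do not look like
# field delimiters.
escape_inner_quotes <- function(lines) {
  # A quote is left untouched when it sits at the start or end of the line,
  # is preceded by `",`, or is followed by `,"`. Every other quote gets a
  # backslash in front of it.
  gsub('(?<!",|^)"(?!,"|$)', '\\\\"', lines, perl = TRUE)
}

# Shortened, made-up record with an inch mark inside a quoted field
line <- '"2543593","750.00","24 18x24" 1/0 coroplast yard signs","","{0A43}"'
escaped <- escape_inner_quotes(line)
cat(escaped, "\n")

# Assumption: with readr installed, the escaped lines could then be parsed
# the way the question's update does it, for example:
#   readr::read_delim(I(escaped), delim = ",", escape_backslash = TRUE)
```

Only the inch mark is escaped here. Quotes around empty fields are left alone because each one is either preceded by ", or followed by ,".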
Data: the sample rows shown in the question.
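To merge records that were split inside a column, parse_csv() can be updated to rejoin them. I haven't tested this, but the idea is to find every line that does not start with a quote, paste it onto the previous line, and then drop it. This may not work properly if the line break falls somewhere other than inside a field.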
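The merging step described above might look roughly like the sketch below. It is not the answer's original code; the helper name merge_broken_lines and the sample lines are invented for illustration. It assumes that every complete record starts with a double quote and that a continuation line never does.

```r
# Hypothetical helper: glue continuation lines back onto the record they
# belong to.
merge_broken_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  # Assumption: a line that does not start with a double quote continues
  # the previous record.
  starts_record <- startsWith(lines, '"')
  starts_record[1] <- TRUE
  # Lines that belong to the same record share one running id
  record_id <- cumsum(starts_record)
  # Paste each group back together, using a space in place of the line break
  vapply(split(lines, record_id), paste, character(1),
         collapse = " ", USE.NAMES = FALSE)
}

# Made-up lines: the third record is split across two lines
lines <- c(
  '"Id","ItemOrService","ReportUID"',
  '"1","Website design","{A}"',
  '"2","Yard signs',
  'and stakes","{B}"',
  '"3","Consulting","{C}"'
)
fixed <- merge_broken_lines(lines)
print(fixed)
```

Inside parse_csv(), this would presumably run on the output of readLines() before the gsub() escaping, so that the start-of-line and end-of-line checks in the regular expression see whole records. A continuation line that happens to begin with a quote would be treated as a new record, which is the limitation mentioned above.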