Parsing a CSV with internal quotes in R

wrrgggsh, posted 2023-09-28 in Other

I'm trying to parse a CSV file in R, ideally with read_csv(). Here is a sample of my data where the error occurs:

"ScheduleDId","ReportId","CommitteeContactId","FirstName","MiddleName","LastOrCompanyName","Prefix","Suffix","AddressLine1","AddressLine2","City","StateCode","ZipCode","IsIndividual","TransactionDate","Amount","AuthorizingName","ItemOrService","ScheduleId","ReportUID"
"2543592","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/03/2019","1250.00","Raymond A. Bell, Jr.","Consulting Fees 10 hours @ $125 per hour","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543593","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/10/2019","750.00","Raymond A. Bell, Jr.","Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24" 1/0 coroplast yard signs/stakes/delivery - $405","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543594","168976","572005","","","Freelancer Limited - Freelancer.CO.UK","","","680 George Street","","Sidney, New South Wales - Australia","N/A","20000","False","06/07/2019","204.90","Raymond A. Bell, Jr.","Website design","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"

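I should get three rows, but R returns two. I think the double quote that stands for inches inside one of the strings is causing this: it disappears from the output and throws off the rest of the result. See the last columns below: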

# A tibble: 2 × 7
  IsIndividual TransactionDate Amount  AuthorizingName      ItemOrService                                                                                      ScheduleId ReportUID
  <chr>        <chr>           <chr>   <chr>                <chr>                                                                                              <chr>      <chr>    
1 False        06/03/2019      1250.00 Raymond A. Bell, Jr. "Consulting Fees 10 hours @ $125 per hour"                                                          NA        {0A43968…
2 False        06/10/2019      750.00  Raymond A. Bell, Jr. "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24 1/0 coroplast yard signs… "New Sout… N/A

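Google Sheets parses the data without any problem, so I figure I should be able to do it in R. I tried fread() with no luck. I also tried read_delim(data, delim = ",", escape_double = FALSE).
This happens throughout the file, so I can't just delete the quotes by hand. There are more than 27K rows, and I want to import multiple files. You can download one of the files I'm referring to here: https://apps.elections.virginia.gov/SBE_CSV/CF/2019_07/ScheduleD.csv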

Update

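Robert's answer below got me closer: I was able to read many more rows without problems, but there are still a few problem rows. Here is the code I used: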

library(readr)

parse_csv <- function(x){
  dat <- readLines(x)
  # Backslash-escape quotes that are not at a field boundary
  dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl = TRUE)
  read_delim(I(dat), escape_backslash = TRUE)
}

test <- parse_csv("test.csv")

problems(test)

And these are the remaining problems:

# A tibble: 2 × 5
    row   col expected   actual                                                                                   
  <int> <int> <chr>      <chr>                                                                                    
1 27349    18 20 columns 18 columns 
2 27350     5 20 columns 5 columns

I opened the CSV file in a text editor and went to the problem rows. I found that the end of the CSV row was on a separate line.

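There are many rows like this in the file. Once I cleaned them up, the custom function worked perfectly.
But I'm working with more than 200 files, so I can't check each one by hand to make sure every row sits on a single line. The line breaks always seem to be in the ItemOrService column. Is there a way to clean this up programmatically?
Here is a link to another CSV with rows split across two lines. The first break is at line 89.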
https://gist.github.com/jrcloutier/cc48a230d99a983f40c94f7800e69952


fsi0uk1n1#

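With some luck, the following will work. You can use gsub to escape every quote that is probably not an outer quote. I define these as quotes that are not at the start or end of a line and are neither preceded by ", nor followed by ,".
I couldn't connect to virginia.gov, so I could only test it on the small sample shown in your question, where it works. When applied to the full dataset, however, some cases may still go wrong, and with 27K rows there may be performance issues.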
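
To see which quotes the pattern touches, here is a minimal base-R sketch on a made-up line (the sample string is an assumption for illustration, not taken from the real file). Only the quote after 18x24 is neither at a line boundary nor next to a field separator, so it should be the only one that gets a backslash.

```r
# Pattern from the answer: a quote that is not at the start or end of the
# line, not preceded by `",` and not followed by `,"`.
inner_quote <- '(?<!",|^)"(?!,"|$)'

# Made-up sample line with one inch mark inside a field and one empty field.
line <- '"a","18x24" sign","","b"'

# '\\\\"' is the two-character replacement \" once R and gsub have both
# processed the backslashes.
escaped <- gsub(inner_quote, '\\\\"', line, perl = TRUE)
cat(escaped, "\n")
```

A quote inside a field that happens to be followed by ," would still be mistaken for a field boundary, which is one way this heuristic can fail on the full dataset.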

# dat <- readLines('ScheduleD.csv')  # I used the dat object defined below
# Backslash-escape quotes that are not at a field boundary
dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl = TRUE)
dat <- read_delim(I(dat), escape_backslash = TRUE)

dim(dat)
# [1]  3 20
dat$ItemOrService[2]
# [1] "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24\" 1/0 coroplast yard signs/stakes/delivery - $405"

Data:

dat <- c("\"ScheduleDId\",\"ReportId\",\"CommitteeContactId\",\"FirstName\",\"MiddleName\",\"LastOrCompanyName\",\"Prefix\",\"Suffix\",\"AddressLine1\",\"AddressLine2\",\"City\",\"StateCode\",\"ZipCode\",\"IsIndividual\",\"TransactionDate\",\"Amount\",\"AuthorizingName\",\"ItemOrService\",\"ScheduleId\",\"ReportUID\"", 
"\"2543592\",\"168976\",\"568106\",\"\",\"\",\"MacGyver Group Solution Consultants\",\"\",\"\",\"6113 Sunlight Mountain Road\",\"\",\"Spotsylvania\",\"VA\",\"22553\",\"False\",\"06/03/2019\",\"1250.00\",\"Raymond A. Bell, Jr.\",\"Consulting Fees 10 hours @ $125 per hour\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"", 
"\"2543593\",\"168976\",\"568106\",\"\",\"\",\"MacGyver Group Solution Consultants\",\"\",\"\",\"6113 Sunlight Mountain Road\",\"\",\"Spotsylvania\",\"VA\",\"22553\",\"False\",\"06/10/2019\",\"750.00\",\"Raymond A. Bell, Jr.\",\"Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24\" 1/0 coroplast yard signs/stakes/delivery - $405\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"", 
"\"2543594\",\"168976\",\"572005\",\"\",\"\",\"Freelancer Limited - Freelancer.CO.UK\",\"\",\"\",\"680 George Street\",\"\",\"Sidney, New South Wales - Australia\",\"N/A\",\"20000\",\"False\",\"06/07/2019\",\"204.90\",\"Raymond A. Bell, Jr.\",\"Website design\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"")

To merge rows that were split inside a column, updating the parse_csv() function like this should work:

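parse_csv <- function(x) {

  dat <- readLines(x)
  # Lines that do not start with a quote are treated as continuations
  to.merge <- grep('^[^"]', dat)
  if (length(to.merge) > 0) {
    # Append each continuation to the line before it, then drop it
    dat[to.merge - 1] <- paste0(dat[to.merge - 1], dat[to.merge])
    dat <- dat[-to.merge]
  }
  # Backslash-escape quotes that are not at a field boundary
  dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl = TRUE)
  read_delim(I(dat), escape_backslash = TRUE)
}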

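I haven't tested this, but it should find all lines that do not start with a quote, paste them onto the previous line, and remove them. It may not work correctly if a line break occurs somewhere other than inside a field.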
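
One case the pairwise paste above does not cover is a field broken over three or more lines, because the second continuation is pasted onto a line that is then deleted. A possible variant, sketched here in base R, groups every physical line under the most recent line that starts with a quote. The helper name merge_broken_lines and the sample rows are hypothetical, and the sketch assumes that every real record starts with a double quote and that no continuation line does.

```r
# Hypothetical helper (not part of the answer above): rebuild records that
# were split across several physical lines.
# Assumption: every real record starts with a double quote, and no
# continuation line does.
merge_broken_lines <- function(lines) {
  lines <- lines[nzchar(lines)]               # drop empty lines
  if (length(lines) == 0) return(character(0))
  # Each line starting with a quote opens a new record; the lines that
  # follow it keep the same id until the next record starts.
  record_id <- cumsum(startsWith(lines, '"'))
  merged <- vapply(split(lines, record_id), paste0, character(1), collapse = "")
  unname(merged)
}

# Made-up rows: the second record is spread over three physical lines.
raw <- c(
  '"id","item","uid"',
  '"1","yard signs, ',
  'stakes, ',
  'delivery","x"',
  '"2","website design","y"'
)
fixed <- merge_broken_lines(raw)
print(fixed)
```

The merged vector could then go through the same gsub and read_delim(I(dat), ...) steps as in parse_csv(). A continuation line that itself begins with a quote would still be treated as a new record, so the caveat about where the breaks occur applies here as well.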
