Why does LumenWorks return a field with its quotes from a perfectly ordinary CSV file?

zynd9foi  posted 12 months ago  in  Other

I have used the LumenWorks CSV reader a few times before and have never seen this problem. I have a simple CSV file:

"1","001333","Test Company","","123 Test St","","Eland","NS","58601","USA","","","","","Company 1","","123 Destination St","","Schefield","ND","58601","USA","","","Standard No Options","Label 001","2","1","5","","","","","0","0","0","0","0","0","0","0","0","0","0","05/02/2023","001333","0"
"1","001333","Test Company","","123 Test St","","Eland","NS","58601","USA","","","","","Company 1","","123 Destination St","","Schefield","ND","58601","USA","","","Standard No Options","Label 001","2","2","125","","","","","0","0","0","0","0","0","0","0","0","0","0","05/02/2023","001333","0"

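I read it by creating a new stream, writing the header definition at the top, and then copying the file stream in after it, so that I end up reading a CSV with a header row.
Here is a very simplified excerpt of the code: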

using StreamReader filestream = new StreamReader(csvfilepath);
using var finalcsvstream = await PrependHeaderToStream(filestream);
using var csv = new CsvReader(finalcsvstream, true);
while (csv.ReadNextRecord())
{
  var fileversionnumber = csv["FileVersionNumber"];
  var field2 = csv["field2"];
  // etc
}

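FileVersionNumber is the first column, and it is the only column I have trouble with. In the file it is clearly the number 1, but when I read it I get a string that includes escaped double quotes: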

\"1\"

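After more than an hour of tinkering and searching Google, I have tried specifying the delimiter and quote characters and played with the various trimming options, all to no avail. I looked at the source code of a fork of the library, and it looks like this should not happen. For now I need this working, so I have put in a specific workaround that trims this one column.
Any idea what is going wrong? Should I build a complete working toy example to see whether the problem persists?
I should also mention that I tried the header both with and without double quotes to see whether that made any difference.
Edit: the source of my CSV file is a base64-encoded string. One would think that would be safe from any crazy character shenanigans. The file looks fine when I open it in Notepad, but if I write the base64 string to disk and run the following: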

base64 -d < /tmp/b64.txt | hexdump -c

I see the following:

0000000 357 273 277   "   1   "   ,   "   0   0   1   3   3   3   "   ,
0000010   "   T   e   s   t       C   o   m   p   a   n   y   "   ,   "

Any suggestions on how to trim this from the source before opening it in LumenWorks?

von4xj4u1#

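The hex dump shows that your data contains a UTF-8 byte order mark (BOM). (hexdump is displaying its component bytes in octal: 357 273 277 is EF BB BF in hexadecimal.)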
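As a quick check of that reading, here is a minimal sketch that converts the three octal values from the dump and compares them with the UTF-8 preamble reported by .NET. The class and method names (`BomCheck`, `FromOctal`, `IsUtf8Bom`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomCheck
{
    // Converts octal byte strings, as printed by `hexdump -c`, into bytes.
    public static byte[] FromOctal(params string[] octalBytes) =>
        octalBytes.Select(o => Convert.ToByte(o, 8)).ToArray();

    // True when the bytes are exactly the UTF-8 byte order mark (EF BB BF).
    public static bool IsUtf8Bom(byte[] bytes) =>
        bytes.SequenceEqual(Encoding.UTF8.GetPreamble());
}
```

Calling `BomCheck.FromOctal("357", "273", "277")` gives the bytes EF BB BF, and `BomCheck.IsUtf8Bom` returns true for them.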
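A byte order mark belongs to the start of the stream, not to the first line of data. The header row cannot go in front of the BOM; if you try, the special character sequence no longer meets the definition of a byte order mark and shows up in the content instead, which breaks the test for whether the content begins with a quote.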
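The effect can be reproduced without LumenWorks. The sketch below is an illustration only: the two-column header and the shortened data line are stand-ins for the real file, and `MisplacedBomDemo` is a made-up name. It builds header + BOM + data and returns the first data line as a `StreamReader` decodes it.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class MisplacedBomDemo
{
    // Builds header + BOM + data, which is what prepending a header to the raw
    // file bytes produces, and returns the first data line as decoded text.
    public static string FirstDataLine()
    {
        byte[] header = Encoding.ASCII.GetBytes("FileVersionNumber,field2\r\n");
        byte[] bom = Encoding.UTF8.GetPreamble();                 // EF BB BF
        byte[] data = Encoding.ASCII.GetBytes("\"1\",\"001333\"\r\n");
        byte[] all = header.Concat(bom).Concat(data).ToArray();

        using var reader = new StreamReader(new MemoryStream(all), Encoding.UTF8);
        reader.ReadLine();        // skip the header line
        return reader.ReadLine(); // first char is U+FEFF, not a double quote
    }
}
```

Because the BOM bytes are no longer at the start of the stream, they decode to the ordinary character U+FEFF. The first field then starts with that character rather than with `"`, so a CSV parser has no reason to treat the field as quoted.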
You have two options for inserting the header row:

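  • Put it between the BOM and the first line of data (if the consuming library even knows what to do with a correctly positioned BOM)
  • Replace the BOM outright (probably the safest)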

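When doing this, you can rely on the fact that there are only a few possible BOMs: UTF-8, UTF-16BE, and UTF-16LE. The latter two would both require an Encoding argument to StreamReader, so for the code shown you really only need to worry about the UTF-8 version, and you can look for and trim that exact set of three bytes.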
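A minimal sketch of the "replace the BOM" option, working on raw bytes. It is not the asker's `PrependHeaderToStream`, whose body is not shown; `CsvBomHelper` and `PrependHeaderWithoutBom` are hypothetical names. It buffers the whole file in memory, which is an assumption that only suits small files.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class CsvBomHelper
{
    // Hypothetical helper: returns a new stream holding headerLine followed by
    // the source bytes, with a leading UTF-8 BOM (EF BB BF) dropped if present.
    public static MemoryStream PrependHeaderWithoutBom(Stream source, string headerLine)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();

        // Read everything into memory (acceptable for small files only).
        using var buffer = new MemoryStream();
        source.CopyTo(buffer);
        byte[] data = buffer.ToArray();

        // Skip exactly the three BOM bytes when the data starts with them.
        bool hasBom = data.Length >= bom.Length
                      && data.Take(bom.Length).SequenceEqual(bom);
        int offset = hasBom ? bom.Length : 0;

        // Write the header without a BOM of its own, then the remaining data.
        var result = new MemoryStream();
        byte[] header = new UTF8Encoding(false).GetBytes(headerLine + "\r\n");
        result.Write(header, 0, header.Length);
        result.Write(data, offset, data.Length - offset);
        result.Position = 0;
        return result;
    }
}
```

With the code from the question, the returned stream could be wrapped in a `StreamReader` and handed to `new CsvReader(reader, true)` in place of `finalcsvstream`.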
