How can I preserve unescaped double quotes in TSV data?

Posted by oaxa6hgo on 2023-04-27 in Other

I have unescaped double quotes in my TSV (tab-separated values) data, and I want to preserve them when reading it with CsvHelper. For example:

Column1    Column2    Column3
Value "1"  Value "2"  Value "3"

Right now my CsvConfiguration looks like this:

new CsvConfiguration(CultureInfo.InvariantCulture)
{
    HasHeaderRecord = true,
    Delimiter = "/t",
    NewLine = "/r/n",
    IgnoreBlankLines = true,
    MissingFieldFound = null,
    HeaderValidated = null,
    CacheFields = true,
    PrepareHeaderForMatch = args => args.Header.Trim(),
    TrimOptions = TrimOptions.Trim,
    LineBreakInQuotedFieldIsBadData = false,
            
};

I could set BadDataFound = null, but since Value "2" is not actually bad data, I'm not sure that is a valid option.

i2loujxw  #1

The TSV (Tab-Separated Values) text file format is defined by the U.S. Library of Congress as follows:

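A tab-separated values (TSV) file is a text format whose primary function is to store data in a table structure, where each record in the table is recorded as one line of the text file. The field values within a record are separated by tab characters. A header line may provide information about the semantics of the table's columns.
...Field values cannot contain tab or newline characters, so converting plain text to TSV requires the following escapes (with the corresponding ASCII codes in parentheses):
\n for newline (ASCII 0x0A)
\t for tab (ASCII 0x09)
\r for carriage return (ASCII 0x0D)
\\ for backslash (ASCII 0x5C)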
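By setting CsvConfiguration.Delimiter = "\t" you can easily make CsvHelper use the tab character as its delimiter, but the escapes specified by the LoC do not correspond to any of the escaping modes that CsvHelper supports. To handle them, you can:
1. Set CsvConfiguration.Mode = CsvMode.NoEscape to disable CsvHelper's own escaping.
2. Write your own custom type converter for string that handles the escaping manually, then register it globally.
First, define the following ITypeConverter and extension methods: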

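using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;
using System.Text;
using CsvHelper;
using CsvHelper.Configuration;
using CsvHelper.TypeConversion;

public class TSVStringConverter : CsvHelper.TypeConversion.StringConverter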
{
    public static TSVStringConverter Instance { get; } = new TSVStringConverter();

    static ReadOnlyCollection<KeyValuePair<string, string>> EscapeMap = new List<KeyValuePair<string, string>>
        {
            //https://www.loc.gov/preservation/digital/formats/fdd/fdd000533.shtml
            //The order here is important, the \\ must come first.
            { new("\\", "\\\\") },
            { new("\n", "\\n") },
            { new("\t", "\\t") },
            { new("\r", "\\r") },
        }.AsReadOnly();

    public override string ConvertToString(object value, IWriterRow row, MemberMapData memberMapData)
    {
        if (value is string s)
            value = EscapeMap.Aggregate(new StringBuilder(s), (sb, p) => sb.Replace(p.Key, p.Value)).ToString();
        return base.ConvertToString(value, row, memberMapData);
    }

    public override object ConvertFromString(string text, IReaderRow row, MemberMapData memberMapData)
    {
        var obj = base.ConvertFromString(text, row, memberMapData);
        if (obj is string s)
            obj = EscapeMap.Reverse().Aggregate(new StringBuilder(s), (sb, p) => sb.Replace(p.Value, p.Key)).ToString();
        return obj;
    }
}   

public static class CsvHelperExtensions
{
    public static CsvConfiguration SetupTSV(this CsvConfiguration config)
    {
        config.Delimiter = "\t"; // FIXED
        config.NewLine = "\r\n"; // FIXED
        config.Mode = CsvMode.NoEscape; // ADDED
        config.LineBreakInQuotedFieldIsBadData = true; // Changed false => true as per LoC requirement
        return config;
    }
    
    public static CsvContext SetupTSV(this CsvContext context)
    {
        context.TypeConverterCache.AddConverter<string>(TSVStringConverter.Instance);
        return context;
    }
}
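One caveat about the chained Replace calls in ConvertFromString above: because each escape is undone in a separate pass, an escaped backslash followed by a plain n, t, or r (for example the four characters `\\nb`) can be misread as an escaped newline, tab, or carriage return. The following is a minimal sketch of a single-pass alternative. TsvEscaping and Unescape are hypothetical names used for illustration; they are not part of CsvHelper, and keeping unknown escapes unchanged is an assumption.

```csharp
using System;
using System.Text;

public static class TsvEscaping
{
    // Hypothetical helper (not a CsvHelper API): undoes the LoC escapes
    // \\, \n, \t and \r in one left-to-right pass, so that an escaped
    // backslash is never re-interpreted as the start of another escape.
    public static string Unescape(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c != '\\' || i + 1 == text.Length)
            {
                // Ordinary character, or a lone backslash at the very end.
                sb.Append(c);
                continue;
            }

            char next = text[++i]; // consume the character after the backslash
            switch (next)
            {
                case '\\': sb.Append('\\'); break;
                case 'n': sb.Append('\n'); break;
                case 't': sb.Append('\t'); break;
                case 'r': sb.Append('\r'); break;
                // Assumption: unknown escapes are kept as written.
                default: sb.Append('\\').Append(next); break;
            }
        }
        return sb.ToString();
    }
}
```

If this behavior is wanted, the body of ConvertFromString could call TsvEscaping.Unescape(s) in place of the Aggregate expression; the escaping direction in ConvertToString is unaffected because it already replaces the backslash first.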

Now, if your record model looks like this:

public record Model(string Column1, string Column2, string Column3);

You will be able to deserialize a TSV string as follows:

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    HasHeaderRecord = true,
    IgnoreBlankLines = true,
    MissingFieldFound = null,
    HeaderValidated = null,
    CacheFields = true,
    PrepareHeaderForMatch = args => args.Header.Trim(),
    TrimOptions = TrimOptions.Trim, 
    //LineBreakInQuotedFieldIsBadData = false, REMOVED
}
.SetupTSV(); // Add TSV specific options

using (var reader = new StringReader(tsvString)) // Or use a StreamReader when reading from a file
using (var csv = new CsvReader(reader, config))
{
    csv.Context.SetupTSV(); // Add TSV string converter for escaping and unescaping
    // Register your class map here if needed.
    
    var newRecords = csv.GetRecords<Model>().ToList();
}

Notes:

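  • In your question, you escaped the tab, carriage return, and newline characters incorrectly in your string literals. As explained on the documentation page "Quoted string literals", you need to use a backslash rather than a forward slash. So your code should look like this: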
Delimiter = "\t", 
NewLine = "\r\n",

Demo fiddle here.
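To make the forward-slash mistake from the question concrete, here is a small sketch that prints the character codes a string literal actually contains. LiteralInspector and ToHex are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Linq;

public static class LiteralInspector
{
    // Hypothetical helper: renders each character of a string as an
    // uppercase hexadecimal code (at least two digits), separated by spaces.
    public static string ToHex(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X2")));
}
```

With this helper, LiteralInspector.ToHex("/t") yields "2F 74" (a slash followed by the letter t), whereas LiteralInspector.ToHex("\t") yields "09" (a single tab character). Likewise "/r/n" is four ordinary characters, while "\r\n" is the two-character CR LF sequence.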

lo8azlld  #2

You can use CsvMode.NoEscape; this simply means that you cannot have newlines or tabs inside any field.

void Main()
{
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = true,
        Delimiter = "\t",
        NewLine = "\r\n",
        IgnoreBlankLines = true,
        MissingFieldFound = null,
        HeaderValidated = null,
        CacheFields = true,
        PrepareHeaderForMatch = args => args.Header.Trim(),
        TrimOptions = TrimOptions.Trim,
        LineBreakInQuotedFieldIsBadData = false,
        Mode = CsvMode.NoEscape
    };

    using (var reader = new StringReader("Column1\tColumn2\tColumn3\r\nValue \"1\"\tValue \"2\"\tValue \"3\""))
    using (var csv = new CsvReader(reader, config))
    {
        var records = csv.GetRecords<Foo>().Dump();
    }
}

public class Foo
{
    public string Column1 { get; set; }
    public string Column2 { get; set; }
    public string Column3 { get; set; }
}
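Since this approach relies on fields never containing a tab or line break, it may be worth checking values before writing them out in the same format. A minimal sketch, assuming a hypothetical helper named TsvFieldGuard (not part of CsvHelper) and assuming that a null value counts as safe:

```csharp
using System;

public static class TsvFieldGuard
{
    private static readonly char[] Forbidden = { '\t', '\r', '\n' };

    // Hypothetical helper: returns true when the value contains no tab,
    // carriage return, or line feed, i.e. nothing that would break a
    // record when no escaping is applied.
    // Assumption: null is treated as safe because it has no characters.
    public static bool IsSafeForNoEscape(string field) =>
        field == null || field.IndexOfAny(Forbidden) < 0;
}
```

Double quotes are not in the forbidden set, so a value such as Value "2" passes the check, while a value with an embedded tab does not.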
