使用C#读取过大的CSV

bn31dyow  于 12个月前  发布在  C#
关注(0)|答案(1)|浏览(231)

我必须用C#从一个太大的CSV文件中提取特定的信息,该文件大小为233 MB,它有1000008行和28列,并将信息插入数据表中。实际上我尝试过OleDB和Aspose.cells,但是两种方法都太慢了
ASPOSE代码:

public static DataTable CsvReader(string CsvFile)
{
    DataTable DtTemp = new DataTable();
    try
    {
        System.Windows.Forms.Application.DoEvents();
        Aspose.Cells.LoadOptions loadOptions = new LoadOptions(LoadFormat.CSV);
        Aspose.Cells.Workbook workbook = new Aspose.Cells.Workbook(CsvFile, loadOptions);
        string SheetName = workbook.Worksheets[0].Name;
        DtTemp = workbook.Worksheets[SheetName].Cells.ExportDataTable(0, 0, workbook.Worksheets[SheetName].Cells.MaxDataRow + 1, workbook.Worksheets[SheetName].Cells.MaxDataColumn + 1);
        //filter DtTemp datatable based on 4th column
    }
    catch (Exception) { DtTemp.Reset(); }
    return DtTemp;
}

OLEDB

public static DataTable CsvReader(string CsvFile)
{
    DataTable DtTemp = new DataTable();
    try
    {
        System.Windows.Forms.Application.DoEvents();
        OleDbConnection conn = null;
        string strConnString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Path.GetDirectoryName(CsvFile) + "; Extended Properties='text; HDR=NO; IMEX=3; FMT=Delimited(,)';";
        string sql_select;
        conn = new OleDbConnection(strConnString.Trim());
        sql_select = "select * from [" + Path.GetFileName(CsvFile) + "] WHERE F4 Like '%27628%' OR F4 Like '%35627%'";
        conn.Open();
        OleDbCommand cmd = new OleDbCommand(sql_select, conn);
        OleDbDataAdapter obj_oledb_da = new OleDbDataAdapter(cmd);
        DataTable dtSchema = new DataTable();
        obj_oledb_da.FillSchema(dtSchema, SchemaType.Source);
        if (dtSchema != null)
        {
            writeSchema(dtSchema, CsvFile);
            obj_oledb_da.Fill(DtTemp);
        }
    }
    catch (Exception) { DtTemp.Reset(); }
    return DtTemp;
}
izj3ouym

izj3ouym1#

为了在事件驱动模式下有效地提取大的CSV数据(逐行),您可以尝试使用Aspose.Cells for .NET提供的LightCells API,这是一种轻量级模式,可以获得更好的性能。此外,您可以在lightcell API的实现中逐行将数据填充到datatable中。
请参阅以下示例代码段以获取完整参考:
例如

示例代码:

public void TestExportingDataTable()
{
            //here you may initialize the table according to the data
            //if possible, you can also connect to the database directly and send data to database directly in DataTableExporter
            DataTable dt = new DataTable();
            dt.Columns.Add("Column1", typeof(string));
            dt.Columns.Add("Column2", typeof(DateTime));
            dt.Columns.Add("Column3", typeof(double));
            
            TxtLoadOptions opts = new TxtLoadOptions();
            //if you can optimize the process of parsing data, disbaling the automatic conversion and use your own logic should be better for performance
            opts.ConvertDateTimeData = false;
            opts.ConvertNumericData = false;
            //using lightcells to improve performance for both time cost and memeory cost
            opts.LightCellsDataHandler = new DataTableExporter(dt);
            Workbook wb = new Workbook("largedata.csv", opts);
            //here you may check the generated data table by yourself.
}
private class DataTableExporter : LightCellsDataHandler
{
            private readonly DataTable mTable;
            private DataRow mDataRow;

            public DataTableExporter(DataTable dt)
            {
                mTable = dt;
            }
            public bool StartSheet(Worksheet sheet)
            {
                return true;
            }
            public bool StartRow(int row)
            {
                mDataRow = mTable.NewRow();
                mTable.Rows.Add(mDataRow);
                return true;
            }
            public bool ProcessRow(Row row)
            {
                return true;
            }
            public bool StartCell(int col)
            {
                return true;
            }
            public bool ProcessCell(Cell cell)
            {
                //in this sample, we just put values into the DataTable in memory
                //if you can save those data to database directly, we think it may improve performance significantly too.
                int col = cell.Column;
                if (col == 1)
                {
                    mDataRow[1] = DateTime.Parse(cell.StringValue);
                }
                else if (col == 2)
                {
                    mDataRow[2] = double.Parse(cell.StringValue) + col;
                }
                else
                {
                    mDataRow[col] = cell.StringValue;
                }
                return false;
            }
}

我们希望这对您有所帮助。
您也可以在专用的forum中发布您的疑问或进一步讨论。
PS.我在Aspose担任支持开发人员/布道者。

相关问题