97 Views
5 Comments
Solved

[CSVUtil] CSVUtil slowness for large file

csvutil (Forge asset by Wei Zhu)
Application Type: Service

Hi Team,

We have a requirement to import data from a CSV file that contains 1 million+ records.

We run this operation through a Timer, importing the data in chunks of 10,000 records:

  • We use the CSVUtil Forge component to read the records, with MaxNum = 10000 and the current StartIndex persisted in an Entity.
  • We use the BulkInsert Forge component to insert each chunk of 10,000 records into the database.
  • As the image below shows, the LoadCSV2RecordList action is fast for the first 50,000 records.
  • Toward the end of the file the LoadCSV2RecordList action becomes extremely slow. In the Service Center log below, fetching CSV rows 400,000 to 410,000 took more than 20 seconds, and we have 1 million records to process. How can we overcome this slowness in CSVUtil?
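For reference, the Timer logic described above is equivalent to this sketch (Python pseudocode, illustrative only; `load_csv_chunk`, `bulk_insert`, and the start-index callbacks are stand-ins for the LoadCSV2RecordList and BulkInsert Forge actions and the Entity that stores StartIndex, not real APIs):

```python
CHUNK = 10_000  # MaxNum passed to the CSV-loading action

def run_import(load_csv_chunk, bulk_insert, get_start, save_start):
    """Model of the Timer: read one chunk at a time, insert it,
    and checkpoint the next StartIndex so a re-run resumes safely."""
    start = get_start()                  # StartIndex kept in an Entity
    rows = load_csv_chunk(start, CHUNK)  # stand-in for LoadCSV2RecordList
    while rows:
        bulk_insert(rows)                # stand-in for BulkInsert
        start += len(rows)
        save_start(start)                # persist progress for the Timer
        rows = load_csv_chunk(start, CHUNK)
```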

Best Regards,

Nitin

2021-06-21 12-37-59
Wei Zhu
Solution

@Nitin Chavan

StartIndex / MaxNum are meant for loading a large CSV file that cannot fit into memory all at once.
For example, with a 5 GB CSV you could load around 100 MB at a time to avoid an Out Of Memory error.
On the other hand, each call has to parse the CSV from the first line and skip rows up to StartIndex, so the last 100 MB takes much longer than the first 100 MB.
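That re-parsing cost can be shown with a minimal sketch (Python, illustrative only; this is not the CSVUtil implementation, just the same access pattern):

```python
import csv

def load_chunk(path, start_index, max_num):
    """Illustrative stand-in for LoadCSV2RecordList: every call
    re-opens the file and parses rows one by one up to start_index,
    so the skip cost grows linearly with start_index."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f)):
            if i < start_index:
                continue              # skipped rows are still parsed
            if len(rows) >= max_num:
                break
            rows.append(row)
    return rows

# Reading 1,000,000 rows in chunks of 10,000 means the parser
# re-scans 0 + 10,000 + 20,000 + ... + 990,000 = 49,500,000 rows
# just to skip ahead, on top of the 1,000,000 rows actually used.
```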

In your case you have 1 million+ records.
I'm not sure of your data size, but MaxNum = 10000 seems too small.
Why don't you try MaxNum = 200K, or even load the whole CSV if you have enough memory?
This will decrease the overhead and improve efficiency.

If you still need BulkInsert with a ChunkSize of 10000,
simply loop over the CSVList and append each row to a temporary buffer;
when the buffer is full (the list reaches 10000 rows) or you hit the last line of the CSV, flush it (BulkInsert, then ListClear).
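The buffer-and-flush loop can be sketched like this (Python, illustrative only; `bulk_insert` stands in for the BulkInsert Forge action):

```python
def import_all(csv_list, bulk_insert, chunk_size=10_000):
    """Loop over the fully loaded CSV list, fill a temporary buffer,
    and flush it when it is full or at the last row."""
    buffer = []
    for row in csv_list:
        buffer.append(row)
        if len(buffer) == chunk_size:
            bulk_insert(list(buffer))  # flush a full chunk (BulkInsert)
            buffer.clear()             # ListClear equivalent
    if buffer:                         # flush the final partial chunk
        bulk_insert(list(buffer))
```

This way the CSV is parsed exactly once, and only the database inserts are chunked.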




Regards
Wei
2022-08-26 11-04-22
Nitin Chavan

Thank you @Wei Zhu !

Your solution worked for me :)

I loaded the whole CSV. Here are the statistics:

  • CSV file size = 450 MB
  • Rows in CSV file = 1 million

According to the Service Center log:

CSVUtilExtension.LoadCSV2RecordList took 106308 ms.

2024-12-17 14-32-59
Matthias Preuter
 
MVP

Maybe you could split the file first using the FileSystem component's low-level file actions:


FS_Open

Opens a file for low-level reading.


FS_ReadBytes

Reads the specified number of bytes from a file opened with FS_Open.


FS_Seek

Moves the file read pointer to a specific position in the file opened with FS_Open.


FS_Close

Closes a file previously opened with FS_Open.
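A split along those lines could look like this sketch (Python, illustrative only; plain file I/O modeling the FS_Open / FS_Seek / FS_ReadBytes / FS_Close actions, with each chunk extended to the next newline so no CSV row is cut in half):

```python
def read_line_aligned_chunks(path, chunk_bytes):
    """Read a file in fixed-size byte chunks, extending each chunk
    to the end of the current line so rows stay intact."""
    chunks = []
    with open(path, "rb") as f:            # FS_Open
        f.seek(0, 2)
        size = f.tell()                    # total file size in bytes
        pos = 0
        while pos < size:
            f.seek(pos)                    # FS_Seek: Position = pos
            data = f.read(chunk_bytes)     # FS_ReadBytes: NumberOfBytes
            if pos + len(data) < size and not data.endswith(b"\n"):
                data += f.readline()       # extend to the line boundary
            chunks.append(data)
            pos += len(data)
    return chunks                          # file closed on 'with' exit
```

Note one caveat: this naive splitting assumes no field contains an embedded newline inside quotes; quoted multi-line fields would need a real CSV-aware scan.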


2022-08-26 11-04-22
Nitin Chavan

Thanks @Matthias Preuter for the quick response!

How do I calculate the specific Position to read the file from?

Will it be the same as the StartIndex of the LoadCSV2RecordList action?

2022-08-26 11-04-22
Nitin Chavan

Hi @Matthias Preuter

Is there an example of reading a CSV file in chunks?

I am not sure what values to set for:

FS_ReadBytes – the NumberOfBytes attribute

FS_Seek – the Position attribute
