Is there a common best practice/pattern for bulk inserting large amounts of data?

Hi,

What is the common best practice for inserting large amounts of records into tables, with validation?
For example, our test(!) set consists of 300,000 records.

1. They need to be validated (values not empty, proper integers, etc.)
2. Lookups to get the identifiers
3. Insert the data

I remember some presentation which, of course, I cannot find anymore.
I see something like BulkCopy?
Is a simple commit every 1,000 records sufficient?



I also remember that presentation (it has to be from NextStep 2012, but I couldn't find it).
Some of the things we do:

1. Do it in a timer!
2. Put the raw records into an Entity as a queue. Add a "status" attribute for "pending", "success", and "error".
3. Have the timer only pull raw records with a status of "pending".
4. Have the timer record the ID of the real record that each raw row was turned into, so you can audit and tie the new rows back to their source.
5. Commit periodically; I like every 100 records, which seems to reduce locking issues and allows scheduled maintenance like backups to run more smoothly (rough sketch after this list).
6. You can use the "LogRecord" action from the "AsynchronousLogging" extension if you wish to do the equivalent of a "bulk insert" and don't need to know the status... I do NOT recommend this since it makes your status attribute pretty useless and doesn't let you tie the created record to the raw record.
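
For what it's worth, here is a rough sketch of that queue/timer pattern in plain C#/ADO.NET, just to make the flow concrete. In OutSystems this would of course be a Timer working with entity actions; the table and column names (RawRecordQueue, Status, CreatedId, RawValue) are made up for illustration.

```csharp
// Rough ADO.NET sketch of the queue/timer pattern; names are placeholders.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class QueueProcessor
{
    public static void ProcessPending(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Only pull raw records that are still "pending" (step 3).
            var pendingRows = new List<Tuple<int, string>>();
            using (var select = new SqlCommand(
                "SELECT TOP 1000 Id, RawValue FROM RawRecordQueue WHERE Status = 'pending'", conn))
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                    pendingRows.Add(Tuple.Create(reader.GetInt32(0), reader.GetString(1)));
            }

            var tx = conn.BeginTransaction();
            int processed = 0;
            foreach (var row in pendingRows)
            {
                // Validate, do the lookups and insert the real record here; then mark
                // the raw row as done and store the created record's Id (step 4).
                int createdId = 0; // would come from the insert of the real record
                using (var update = new SqlCommand(
                    "UPDATE RawRecordQueue SET Status = 'success', CreatedId = @createdId WHERE Id = @id",
                    conn, tx))
                {
                    update.Parameters.AddWithValue("@createdId", createdId);
                    update.Parameters.AddWithValue("@id", row.Item1);
                    update.ExecuteNonQuery();
                }

                // Commit every 100 records to keep locks short (step 5).
                if (++processed % 100 == 0)
                {
                    tx.Commit();
                    tx = conn.BeginTransaction();
                }
            }
            tx.Commit();
        }
    }
}
```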

J.Ja
Greetings,

I've worked on several projects that required inserting / updating large amounts of records daily and used a similar process to what Justin James described above.

Since we had time constraints on those insert / update processes, we also used an extension to perform inserts into the queue entity with SqlBulkCopy. That really sped up inserts a lot (around 10 times faster than inserting one record at a time). We didn't create a generic extension for that; it had methods to insert into specific entities, which meant any changes to the entities would require changes to the extension.
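
To make that concrete, a minimal SqlBulkCopy sketch of what such an extension method could look like (the staging table name and columns below are placeholders, not the actual entities):

```csharp
// Minimal SqlBulkCopy sketch; "StagingRecords" and its columns are placeholders.
using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    public static void Load(string connectionString, DataTable records)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "StagingRecords";
            bulk.BatchSize = 5000;      // rows sent to the server per round trip
            bulk.BulkCopyTimeout = 0;   // no timeout for large loads
            // Map DataTable columns to staging table columns explicitly.
            bulk.ColumnMappings.Add("Name", "Name");
            bulk.ColumnMappings.Add("Value", "Value");
            bulk.WriteToServer(records);
        }
    }
}
```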

I also used SQL Server's MERGE statement to do bulk "create or update" operations, but since it is not standard SQL and also not "maintenance friendly", I would advise against it unless you have tight time constraints.
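
For reference, a rough sketch of what such a MERGE-based "create or update" can look like when run from an extension; all table and column names here are invented:

```csharp
// Sketch of a "create or update" from a staging table using SQL Server's MERGE.
// TargetRecords, StagingRecords and their columns are placeholders.
using System.Data.SqlClient;

class MergeUpserter
{
    public static void Upsert(string connectionString)
    {
        const string mergeSql = @"
            MERGE INTO TargetRecords AS t
            USING StagingRecords AS s
                ON t.BusinessKey = s.BusinessKey
            WHEN MATCHED THEN
                UPDATE SET t.Name = s.Name, t.Value = s.Value
            WHEN NOT MATCHED THEN
                INSERT (BusinessKey, Name, Value)
                VALUES (s.BusinessKey, s.Name, s.Value);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(mergeSql, conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0;   // large merges can take a while
            cmd.ExecuteNonQuery();
        }
    }
}
```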


I assume that SqlBulkCopy is only for SQL Server. But a timer is the way to go: screen and business processes will time out at 2 and 5 minutes, respectively.
Currently my status is:

- It's in a timer.
- I need to import 18 CSVs, ranging from 100 records in one CSV to 2 million records in another.
- I need to create FKs between them.
- I'm running out of time.
- It's basically a 2-stage process for every CSV: bulk upload into a staging table, then onwards to the real table with validation (rough sketch after this list).
- Crappiest part: only the initial load is this high; after that I will only get incrementals.
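
To make that 2-stage idea concrete, here is a rough sketch of the second stage (staging table to real table, with basic validation and a foreign-key lookup). All table and column names are placeholders, not the actual entities:

```csharp
// Sketch of stage 2: promote validated rows from a staging table into the real
// table, resolving a foreign key via a lookup join. Names are placeholders.
using System.Data.SqlClient;

class StagingPromoter
{
    public static int Promote(string connectionString)
    {
        const string promoteSql = @"
            INSERT INTO Customers (Name, Amount, CountryId)
            SELECT s.Name,
                   CAST(s.Amount AS INT),
                   c.Id                           -- FK resolved via lookup
            FROM   StagingCustomers s
            JOIN   Countries c ON c.IsoCode = s.CountryCode
            WHERE  s.Name IS NOT NULL             -- basic validations
              AND  s.Amount NOT LIKE '%[^0-9]%';";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(promoteSql, conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0;
            return cmd.ExecuteNonQuery();   // number of rows promoted
        }
    }
}
```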

Hi,

In what step is it timing out?
The upload of the CSVs, or the processing afterwards?

I made a presentation at NextStep 2011 that might help regarding timers. You can find it here: http://www.slideshare.net/jfrodrigues/easy-and-efficient-processing-of-massive-data-sets
If you can share in more detail the operations you are doing, I can try to help you out with my experience (feel free to message me).

Cheers