I need to copy data from TableA to TableB (>5 million rows). The two tables are in the same database.
What is the best way of doing this?
I was thinking about using a simple INSERT INTO ... SELECT statement. Is there a faster way to do it with SSIS?
Thanks
If you are going to use SSIS for this, I think your best performance option will be to use an Execute SQL task and then perform the "INSERT INTO ... SELECT" statement.|||fleo wrote:
I need to copy data from TableA to TableB (>5 million rows). The two tables are in the same database.
What is the best way of doing this?
I was thinking about using a simple INSERT INTO ... SELECT statement. Is there a faster way to do it with SSIS?
Thanks
If they are in the same database then I'd be shocked if anything outperforms INSERT INTO... SELECT...
Try that out and also try the SSIS data flow to see how much the performance differs (if at all). Let us know as well; I for one would be very interested to know.
You may also want to test whether it is any quicker to drop all indexes on the target table prior to loading and then recreate them afterwards.
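Roughly along these lines - just a sketch, and the index, column and schema names below are placeholders you would swap for your own:

-- Drop the nonclustered indexes on the target before the load (placeholder names)
DROP INDEX IX_TableB_SomeColumn ON dbo.TableB;

-- Set-based copy
INSERT INTO dbo.TableB (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM dbo.TableA;

-- Recreate the indexes afterwards
CREATE NONCLUSTERED INDEX IX_TableB_SomeColumn ON dbo.TableB (SomeColumn);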
-Jamie
|||I made a simple comparison between the two:
With 7 million rows (approximate figures):
SQL: 13 min.
Data flow task: 26 min.
I was thinking maybe SSIS had a "Bulk Insert" mechanism to pump data from one table to another...
That brings me to the "Update/Insert" example (http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx)
How come SSIS sometimes outperforms pure SQL? Which part is faster? If SSIS performs multiple steps at a time, why can't SQL Server do the same?
Thanks guys!
|||fleo wrote:
I was thinking maybe SSIS had a "Bulk Insert" mechanism to pump data from one table to another...
They do. It's the "Fast load" option.|||
fleo wrote:
I made a simple comparison between the two:
With 7 million rows (approximate figures):
SQL: 13 min.
Data flow task: 26 min.
Damn. I'd hoped the difference would have been smaller than that.
Which destination adapter did you use?
fleo wrote:
I was thinking maybe SSIS had a "Bulk Insert" mechanism to pump data from one table to another...
Well essentially it does. If you use OLE DB Destination with FastLoad then it uses SQL Server's Bulk Insert mechanism (all explained here: http://blogs.conchango.com/jamiethomson/archive/2006/08/14/SSIS_3A00_-Destination-Adapter-Comparison.aspx).
The SQL Server Destination should be even quicker, though (although it only works if you're running the package on the same server where the destination table resides).
fleo wrote:
That brings me to the "Update/Insert" example (http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx)
How come SSIS sometimes outperforms pure SQL? Which part is faster? If SSIS performs multiple steps at a time, why can't SQL Server do the same?
Thanks guys!
There are so many variables in this that it's impossible to give a definitive answer. If it's a straight table-to-table copy on the same server, though, I would expect pure T-SQL to be quicker.
-Jamie
|||I used the OLE DB Destination adapter with the Fast Load option. I also expected the difference to be smaller...
Thanks for the link... I had already read it :) I often go to your blog for info. I ask a lot of questions because my focus is on determining the best implementation for data transfer/manipulation.
My rule of thumb so far has been to use SSIS only to transfer data between heterogeneous sources (text files, other DBMSs, ...). Once the data is in a SQL Server database I always use T-SQL and expect it to perform better than any other method.
But it doesn't seem to always be the case... I'll try to figure that out later on.
Thanks all
|||The "INSERT INTO... SELECT..." inserts all the rows in one trasaction verses SSIS with batch size option inserts fewer rows in a transaction. The later puts less stress on the system especially if the recovery model is "Simple".
|||It's a very good point.
bkallich wrote:
The "INSERT INTO... SELECT..." inserts all the rows in one trasaction verses SSIS with batch size option inserts fewer rows in a transaction. The later puts less stress on the system especially if the recovery model is "Simple".
Luckily we have an identity field we can loop through.
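Roughly what I have in mind - a sketch only, assuming an integer identity column called Id; the table, column names and batch size are placeholders:

DECLARE @BatchSize int, @MinId int, @MaxId int, @CurrentId int;
SET @BatchSize = 100000;
SELECT @MinId = MIN(Id), @MaxId = MAX(Id) FROM dbo.TableA;
SET @CurrentId = @MinId;

WHILE @CurrentId <= @MaxId
BEGIN
    -- Each iteration commits a smaller transaction instead of one huge one
    INSERT INTO dbo.TableB (Id, Col1, Col2)
    SELECT Id, Col1, Col2
    FROM dbo.TableA
    WHERE Id >= @CurrentId AND Id < @CurrentId + @BatchSize;

    SET @CurrentId = @CurrentId + @BatchSize;
END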
|||
bkallich wrote:
The "INSERT INTO... SELECT..." inserts all the rows in one trasaction verses SSIS with batch size option inserts fewer rows in a transaction. The later puts less stress on the system especially if the recovery model is "Simple".
If you set MaxInsertCommitSize to a value larger than the number of rows passing through the destination, then it will perform the inserts in one batch as well.
RowsPerBatch is a query optimizer hint. Nothing more, really.|||
Phil Brammer wrote:
bkallich wrote:
The "INSERT INTO... SELECT..." inserts all the rows in one trasaction verses SSIS with batch size option inserts fewer rows in a transaction. The later puts less stress on the system especially if the recovery model is "Simple".
If you set MaxInsertCommitSize to a value larger than the number of rows passing through the destination, then it will perform the inserts in one batch as well.
RowsPerBatch is a query optimizer hint. Nothing more, really.
He told me that RowsPerBatch is the fetch size from the source and that the commit size relates to the destination. Is that correct?
|||From BOL:
ROWS_PER_BATCH = rows_per_batch
Indicates the approximate number of rows of data in the data file.
By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer. If you specify ROWS_PER_BATCH (with a value > 0), the server uses this value to optimize the bulk-import operation. The value specified for ROWS_PER_BATCH should be approximately the same as the actual number of rows.
BATCHSIZE = batch_size = MaxInsertCommitSize in SSIS
Specifies the number of rows in a batch. Each batch is copied to the server as one transaction. If this fails, SQL Server commits or rolls back the transaction for every batch. By default, all data in the specified data file is one batch.
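To make those two options concrete, a rough sketch of how they look on a plain BULK INSERT (the file path is hypothetical, and BULK INSERT loads from a file rather than table-to-table; it is only here to illustrate the two settings quoted above). They are normally used as alternatives: BATCHSIZE when you want smaller commits, ROWS_PER_BATCH as an optimizer hint when loading in a single transaction.

-- Commit every 100,000 rows (the behaviour MaxInsertCommitSize maps to)
BULK INSERT dbo.TableB
FROM 'C:\exports\tablea.dat'
WITH (BATCHSIZE = 100000);

-- Or load as one transaction and just hint the approximate row count
BULK INSERT dbo.TableB
FROM 'C:\exports\tablea.dat'
WITH (ROWS_PER_BATCH = 7000000);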
|||I tested the process with different values.
As you mentioned, the MaxCommitSize has the most impact on performance.
With RowsPerBatch = 100 000
MaxCommitSize = 1 -> around 13 000 records/min
MaxCommitSize = 7 000 000 -> 600 000 records/min
RowsPerBatch:
With MaxCommitSize = 100 000 I can't see much difference between RowsPerBatch = 1 and 7 000 000...
Phil, can these properties be set dynamically? They do not seem to be available through expressions or package configurations. I wanted to create a basic benchmark and pass these values in as parameters.
Thanks
|||I don't think you can set those values dynamically. Anyway, post the results of your tests, as they may be very interesting. A few months ago I was playing with those values and could never get more than 10k rows per batch, but I did not really dig into it.