Friday, August 29, 2008

SQL INSERTs

The other day, we were confronted with an interesting SQL problem. A user wanted to be able to paste in a set of record IDs to be saved and made available for use throughout the rest of the product. For large sets of data (in the millions of rows), it makes sense for users to upload a file to us via FTP or even through the web server and then have it loaded into the database; in those cases, users most likely already have the data readily available in some file format, so it's easy for them to just upload it to us. But in the small to medium cases (anywhere from hundreds to tens of thousands of records), the user may be pulling the data from an Excel spreadsheet or similar document, in which case it's easier to simply paste the data into a text field on a form. For simplicity's sake, we can assume that the data consists of a set of ID numbers delimited by carriage returns.

So then, how do we persist these IDs to the database? Essentially we are given a ginormous string of delimited IDs that we want to dump into a table with a single record ID column, like so:


CREATE TABLE records
(
recordID INT NOT NULL,
CONSTRAINT PK_records PRIMARY KEY CLUSTERED (recordID)
)


In tackling this problem, we considered four different approaches. Our goal was to find the fastest approach in terms of user wait time, since in many cases the user could be forced to wait several minutes for the upload to complete. As with many enterprise web applications, our DB resides on a different machine than our web server, so each approach's performance is really driven by two factors (roughly combined in the cost model sketched after this list):

  • Query Time – the total amount of time the DB spends processing the queries that insert IDs into the table we designate
  • Network Transfer Time – the total amount of time spent transferring requests from our web server to the DB
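
To make that concrete, here is a rough back-of-the-envelope cost model (our own framing, with hand-wavy constants rather than measured numbers) for pasting $n$ IDs using $q$ queries sent over $r$ network requests:

$$T \approx r \cdot c_{request} + \frac{D}{B} + q \cdot c_{query} + n \cdot c_{row}$$

where $c_{request}$ is the fixed cost of a round trip from the web server to the DB, $D$ is the total bytes transmitted, $B$ is the effective bandwidth, $c_{query}$ is the fixed per-query cost (parsing, transaction, and log-flush overhead), and $c_{row}$ is the unavoidable per-row insert cost. Each approach below attacks the $r \cdot c_{request}$, $D/B$, and $q \cdot c_{query}$ terms in some combination.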

    The first approach we tried, which we’ll refer to as the Naïve Insert Loop, was to loop through the delimited IDs, inserting a single row into the database for each ID. Each query to insert a row was fired off as a separate DB request:


    INSERT INTO records (recordID)
    SELECT 12

    INSERT INTO records (recordID)
    SELECT 15

    INSERT INTO records (recordID)
    SELECT 17
    ...


This approach is problematic for several reasons, all stemming from the fact that it creates an individual query per record ID and fires each one off from the web app to the DB one at a time. Since every query and every request carries a fixed overhead, this solution pays that overhead once per ID, which adds up to a huge penalty for large pastes. We used this naïve approach as a baseline to improve upon.

Recognizing that we needed to reduce the number of queries and requests fired, we then considered the Improved Insert Loop, which was very similar to the Naïve Insert Loop except that we combined the INSERTs via UNION ALL before firing them off to the DB:


    INSERT INTO records (recordID)
    SELECT 12
    UNION ALL
    SELECT 15
    UNION ALL
    SELECT 17
    ...

    INSERT INTO records (recordID)
    SELECT 27
    UNION ALL
    SELECT 28
    UNION ALL
    SELECT 54
    ...


We combine these INSERTs into batches of a thousand* SELECT statements UNION ALLed together, so we have essentially reduced the number of queries and network requests by a factor of a thousand. Each query and request is itself roughly a thousand times larger than before, but net performance improves because we now pay the per-query and per-request overhead once per batch instead of once per ID. For instance, fewer queries means fewer DB transactions, which means fewer transaction log flushes and therefore fewer disk writes on the DB.
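
As an aside, on SQL Server 2008 and later the same batch can be written with a table value constructor instead of UNION ALL; the idea is identical, and it comes with a similar cap (at most 1,000 rows per VALUES list):

-- Equivalent batch using a table value constructor (SQL Server 2008+ syntax);
-- the UNION ALL form above is the portable way to express this on older versions.
INSERT INTO records (recordID)
VALUES (12), (15), (17)
-- ... up to 1,000 rows per VALUES list ...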

    In both approaches above, we are forced to wrap the IDs in queries that insert them into the table on the DB. If we could somehow transmit the raw IDs to the DB and have them parsed and inserted completely on the DB side, we could greatly reduce the size of the data sent over the network and thus greatly reduce the Network Transfer Time.

With that in mind, we came up with the Stored Procedure Loop approach. Essentially we would pass the entire string of delimited IDs as a TEXT parameter to a stored procedure, which would do the work of parsing it and INSERTing the individual records into a target table. Below are the staging table, the stored procedure definition, and a sample invocation. The procedure starts by logging the entire paste request into dataStagingTable and then parses the data directly out of that table.


-- Staging table that logs each paste request before it is parsed
CREATE TABLE dataStagingTable (
    logID INT IDENTITY(1,1) NOT NULL,
    data TEXT,
    pastingTime DATETIME
)
GO

CREATE PROCEDURE dbo.processPastedData
    @targetTable VARCHAR(32),
    @data TEXT,
    @rowDelimiter CHAR,
    @colDelimiter CHAR
AS
BEGIN
    DECLARE @logID INT
    DECLARE @offset INT
    DECLARE @linePtr INT
    DECLARE @buf VARCHAR(4000)
    DECLARE @cols VARCHAR(255)

    SET NOCOUNT ON

    -- Log the entire paste so it can be parsed on the DB side
    INSERT INTO dataStagingTable (data, pastingTime)
    SELECT @data, GETDATE()

    SELECT @logID = @@IDENTITY
    SELECT @offset = 1

    -- Single-column case: every parsed value goes into recordID
    SELECT @cols = 'recordID'

    WHILE (@offset > 0)
    BEGIN
        -- Find the next row delimiter within a 4000-character window
        SELECT @linePtr = CHARINDEX(@rowDelimiter, SUBSTRING(data, @offset, 4000))
        FROM dataStagingTable
        WHERE logID = @logID

        -- Pull out the next line; the column-delimiter replacement only matters
        -- for multi-column pastes and is a no-op for a single ID column
        IF (@linePtr > 0)
            SELECT @buf = REPLACE(SUBSTRING(data, @offset, @linePtr - 1), @colDelimiter, ''',''')
            FROM dataStagingTable
            WHERE logID = @logID
        ELSE
            -- No more delimiters: take the remainder, capped at the buffer size
            SELECT @buf = REPLACE(SUBSTRING(data, @offset, 4000), @colDelimiter, ''',''')
            FROM dataStagingTable
            WHERE logID = @logID

        -- Strip carriage returns left over from CR/LF line endings
        SELECT @buf = REPLACE(@buf, CHAR(13), '')

        -- Insert the parsed value(s); skip blank lines such as a trailing delimiter
        IF (LEN(@buf) > 0)
            EXEC ('INSERT INTO ' + @targetTable + ' (' + @cols + ') SELECT ' + @buf)

        SET @offset = @offset + @linePtr

        IF (@linePtr = 0)
            BREAK
    END
END
GO

-- Sample invocation: paste the IDs 2 through 5 into a table named wilfred
-- (assumed to already exist with a single recordID column), using a line
-- break as the row delimiter and a comma as the column delimiter
EXEC processPastedData 'wilfred', '2
3
4
5', '
', ','

SELECT * FROM wilfred


This approach yielded a huge performance improvement, as we had essentially minimized the Network Transfer Time by minimizing the amount of data being transmitted. We could have further improved the Query Time by reducing the number of transactions, either with explicit transaction blocks or by combining INSERTs into batches as we did with the Improved Insert Loop. But even with those improvements, we still would have run approximately the same number of queries as we did for the Improved Insert Loop (although now they would run within the stored procedure on the DB side).
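
To make the explicit-transaction idea concrete, here is a minimal sketch (illustrative only; not something we actually shipped) using the records table from above. The point is simply that the transaction log is flushed once at COMMIT rather than once per autocommitted statement:

-- Hypothetical sketch: a batch of generated INSERTs wrapped in one explicit
-- transaction, so their log writes are flushed together at COMMIT instead of
-- after every individual statement.
BEGIN TRANSACTION

INSERT INTO records (recordID) SELECT 12
INSERT INTO records (recordID) SELECT 15
INSERT INTO records (recordID) SELECT 17
-- ... remaining IDs in this batch ...

COMMIT TRANSACTION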

    In order to further improve the Query Time, we finally arrived at a fourth approach, the BCP Insert. BCP is a utility included with SQL Server that loads data from a file into a DB table. In this case, we dump the delimited IDs into a text file, then invoke BCP on the new text file. To run the BCP utility, we had to make sure there was a way for the web server to transmit these files to the DB machine. After that, we could run the BCP utility as a console command:


    bcp ClientDB.dbo.records in dataDump.txt -S DBMachine -U userID -P password -f formatFile.fmt

    ------------

    ClientDB.dbo.records = [DATABASE].[SCHEMA].[TABLE TO LOAD DATA INTO]
    dataDump.txt = data file that contains pasted data
    -S DBMachine = name of server to connect to
    -U userID = tells bcp to use a specific userID to log into DB machine
    -P password = tells bcp the password to use with userID to log into DB machine
    -f formatFile.fmt = format file that tells bcp how to parse the data file and how to insert into records table


    Creating the format file for configuring BCP to parse the data correctly was also straightforward:


    8.0
    1
    1 SQLCHAR 0 50 "\r\n" 1 recordID SQL_Latin1_General_CP1_CI_AS

    ------------

    8.0 = format file version (corresponds to SQL Server 2000)
    1 = number of columns

    Third line, left to right:
    1 = file field order
    SQLCHAR = host file data type
    0 = prefix length
    50 = host file data length
    "\r\n" = field terminator (the line break, since each row holds a single field)
    1 = server column order
    recordID = name of the column we are loading into
    SQL_Latin1_General_CP1_CI_AS = column collation



As with the Stored Procedure Loop, the data sent is essentially just the raw data, but in this case SQL Server loads it all without running a whole slew of queries. Furthermore, this kind of bulk load can be minimally logged (depending on the database's recovery model and the state of the target table), which results in much less overhead than the previous approaches. This final approach does require additional time to move the data to and from the file system, but in sum it was still faster than the other approaches. It is important to note, however, that as the number of records being pasted decreases, the difference in performance between these approaches also decreases. In fact, if we are inserting fewer than a thousand records, it is actually faster to revert from the BCP Insert to one of the more naïve solutions, since the overhead of the file dump begins to dominate the actual Query and Network Transfer Time.
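
As an aside, the same bulk-load machinery is also exposed through T-SQL's BULK INSERT statement. We did not use it here, but assuming the data file and format file are readable from the DB machine (the paths below are hypothetical), a roughly equivalent load would look like:

-- Illustrative sketch only: file paths are hypothetical and must be visible
-- from the DB server; minimal logging still depends on the recovery model
-- and the state of the target table.
BULK INSERT ClientDB.dbo.records
FROM 'D:\uploads\dataDump.txt'
WITH (FORMATFILE = 'D:\uploads\formatFile.fmt', TABLOCK)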

In conclusion, the BCP Insert solution was the fastest of the four. This is not unexpected, since these are exactly the kinds of operations BCP was designed to perform quickly. Another interesting extension to these solutions would have been to compress the raw data before transmitting it from the web server to the DB server. That said, we found this to be an interesting chance to experiment with a few different creative solutions.

    * The thousand here is selected for simplicity; we can continue to increase this number to maximize the benefit of these batch combinations. The cap on the batch size is dependent on the RDBMS you are using.
Thursday, August 14, 2008

    Debugging at APT - Part 3: Fiddler

    Fiddler has been around for a while, but I was only introduced to it a few months ago. Earlier this year, we were able to budget some time toward improving our software's performance, rather than continuing to add new features as quickly as we had been. Fiddler was especially useful for telling us just how much work our web servers (and our clients' browsers) were doing, and where some of our bottlenecks and improvement opportunities were.

    In my first post I mentioned that we have some pretty complex pages throughout our software. Before optimizing them, we were making hundreds of HTTP requests per page (including the now-infamous 85 iframes), and we were applying the same super-strict "no cache" rules to our static files that we applied to our dynamically-generated content. As a result, even our simplest pages required well over 100K of HTTP traffic, and some of the most complex pages were closer to 2MB. Fiddler's biggest strength for us was seeing how each of the various caching directives you can send in HTTP headers affects the browser and network traffic.
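
For context, the contrast we were staring at in Fiddler was roughly between response headers like these on static files (illustrative values only, not our exact production configuration):

Cache-Control: public, max-age=86400
Expires: Sat, 30 Aug 2008 12:00:00 GMT
Last-Modified: Fri, 22 Aug 2008 09:00:00 GMT
ETag: "abc123"

versus the strict no-cache headers we had been sending on everything, static and dynamic alike:

Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0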

    Coincidentally, right after I learned about Fiddler, I was asked to help troubleshoot a performance concern raised by one of our newer clients. They were sporadically and unpredictably experiencing "This page cannot be displayed (cannot find server or DNS error)" messages. It was inconsistent but seemed to happen primarily during the client's core business hours, and none of our other clients were experiencing comparable symptoms, so we figured it could be a network problem on their end. A combination of Fiddler (on one client user's computer) and Wireshark (on our web server) showed us a pretty clear pattern of dropped and resubmitted requests at the HTTP level. At the same time, our own internal logging showed an extraordinarily high variance in round-trip performance.

    It turned out that the client's internal firewall was sporadically dropping some outgoing requests and incoming responses during times of peak traffic on their general-purpose network. The solution was to establish a VPN tunnel, using a separate infrastructure that they already had in place for their vendors and business partners. In hindsight, we might have reached the same conclusion even more quickly if we had run Fiddler on both sides and compared the traces.