Friday, August 29, 2008

SQL INSERTs

The other day, we were confronted with an interesting SQL problem. A user wanted to be able to paste in a set of record IDs to be saved and made available for use throughout the rest of the product. For large sets of data (in the millions of rows), it makes sense for users to upload a file to us via FTP or even through the web server and then have it loaded into the database; in those cases, users most likely already have the data readily available in some file format, so it's easy for them to just upload it to us. But in the small to medium cases (anywhere from hundreds to tens of thousands of records), the user may be pulling the data from an Excel spreadsheet or similar document, in which case it's easier to simply paste the data into a text field on a form. For simplicity's sake, we can assume that the data consists of a set of ID numbers delimited by carriage returns.

So then, how do we persist these IDs to the database? Essentially we are given a ginormous string of delimited IDs that we want to dump into a table with a single record ID column, like so:


CREATE TABLE records
(
recordID INT NOT NULL,
CONSTRAINT PK_records PRIMARY KEY CLUSTERED (recordID)
)


In tackling this problem, we considered four different approaches. Our goal was to find the fastest approach in terms of user wait time, since in many cases the user could be forced to wait several minutes for the upload to complete. As with many enterprise web applications, our DB resides on a different machine than our web server, so each approach's performance is really driven by two factors (roughly combined in the cost model sketched after this list):

  • Query Time – the total amount of time the DB spends processing the queries that insert IDs into the table we designate
  • Network Transfer Time – the total amount of time spent transferring requests from our web server to the DB
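
To make that concrete, here is a rough back-of-the-envelope cost model (our own framing, with hand-wavy constants rather than measured numbers) for pasting $n$ IDs using $q$ queries sent over $r$ network requests:

$$T \approx r \cdot c_{request} + \frac{D}{B} + q \cdot c_{query} + n \cdot c_{row}$$

where $c_{request}$ is the fixed cost of a round trip from the web server to the DB, $D$ is the total bytes transmitted, $B$ is the effective bandwidth, $c_{query}$ is the fixed per-query cost (parsing, transaction, and log-flush overhead), and $c_{row}$ is the unavoidable per-row insert cost. Each approach below attacks the $r \cdot c_{request}$, $D/B$, and $q \cdot c_{query}$ terms in some combination.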

    The first approach we tried, which we’ll refer to as the Naïve Insert Loop, was to loop through the delimited IDs, inserting a single row into the database for each ID. Each query to insert a row was fired off as a separate DB request:


    INSERT INTO records (recordID)
    SELECT 12

    INSERT INTO records (recordID)
    SELECT 15

    INSERT INTO records (recordID)
    SELECT 17
    ...


This approach is problematic for several reasons, all stemming from the fact that it creates an individual query per record ID and fires each one off from the web app to the DB one at a time. Since every query and every request carries a fixed overhead, this solution pays that overhead once per ID, which adds up to a huge penalty for large pastes. We used this naïve approach as a baseline to improve upon.

Recognizing that we needed to reduce the number of queries and requests fired, we then considered the Improved Insert Loop, which was very similar to the Naïve Insert Loop except that we combined the INSERTs via UNION ALL before firing them off to the DB:


    INSERT INTO records (recordID)
    SELECT 12
    UNION ALL
    SELECT 15
    UNION ALL
    SELECT 17
    ...

    INSERT INTO records (recordID)
    SELECT 27
    UNION ALL
    SELECT 28
    UNION ALL
    SELECT 54
    ...


We combine these INSERTs into batches of a thousand* SELECT statements UNION ALLed together, so we have essentially reduced the number of queries and network requests by a factor of a thousand. Each query and request is itself roughly a thousand times larger than before, but net performance improves because we now pay the per-query and per-request overhead once per batch instead of once per ID. For instance, fewer queries means fewer DB transactions, which means fewer transaction log flushes and therefore fewer disk writes on the DB.
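
As an aside, on SQL Server 2008 and later the same batch can be written with a table value constructor instead of UNION ALL; the idea is identical, and it comes with a similar cap (at most 1,000 rows per VALUES list):

-- Equivalent batch using a table value constructor (SQL Server 2008+ syntax);
-- the UNION ALL form above is the portable way to express this on older versions.
INSERT INTO records (recordID)
VALUES (12), (15), (17)
-- ... up to 1,000 rows per VALUES list ...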

    In both approaches above, we are forced to wrap the IDs in queries that insert them into the table on the DB. If we could somehow transmit the raw IDs to the DB and have them parsed and inserted completely on the DB side, we could greatly reduce the size of the data sent over the network and thus greatly reduce the Network Transfer Time.

With that in mind, we came up with the Stored Procedure Loop approach. Essentially we would pass the entire string of delimited IDs as a TEXT parameter to a stored procedure, which would do the work of parsing it and INSERTing the individual records into a target table. Below are the staging table, the stored procedure definition, and a sample invocation. The procedure starts by logging the entire paste request into dataStagingTable and then parses the data directly out of that table.


-- Staging table that logs each paste request before it is parsed
CREATE TABLE dataStagingTable (
    logID INT IDENTITY(1,1) NOT NULL,
    data TEXT,
    pastingTime DATETIME
)
GO

CREATE PROCEDURE dbo.processPastedData
    @targetTable VARCHAR(32),
    @data TEXT,
    @rowDelimiter CHAR,
    @colDelimiter CHAR
AS
BEGIN
    DECLARE @logID INT
    DECLARE @offset INT
    DECLARE @linePtr INT
    DECLARE @buf VARCHAR(4000)
    DECLARE @cols VARCHAR(255)

    SET NOCOUNT ON

    -- Log the entire paste so it can be parsed on the DB side
    INSERT INTO dataStagingTable (data, pastingTime)
    SELECT @data, GETDATE()

    SELECT @logID = @@IDENTITY
    SELECT @offset = 1

    -- Single-column case: every parsed value goes into recordID
    SELECT @cols = 'recordID'

    WHILE (@offset > 0)
    BEGIN
        -- Find the next row delimiter within a 4000-character window
        SELECT @linePtr = CHARINDEX(@rowDelimiter, SUBSTRING(data, @offset, 4000))
        FROM dataStagingTable
        WHERE logID = @logID

        -- Pull out the next line; the column-delimiter replacement only matters
        -- for multi-column pastes and is a no-op for a single ID column
        IF (@linePtr > 0)
            SELECT @buf = REPLACE(SUBSTRING(data, @offset, @linePtr - 1), @colDelimiter, ''',''')
            FROM dataStagingTable
            WHERE logID = @logID
        ELSE
            -- No more delimiters: take the remainder, capped at the buffer size
            SELECT @buf = REPLACE(SUBSTRING(data, @offset, 4000), @colDelimiter, ''',''')
            FROM dataStagingTable
            WHERE logID = @logID

        -- Strip carriage returns left over from CR/LF line endings
        SELECT @buf = REPLACE(@buf, CHAR(13), '')

        -- Insert the parsed value(s); skip blank lines such as a trailing delimiter
        IF (LEN(@buf) > 0)
            EXEC ('INSERT INTO ' + @targetTable + ' (' + @cols + ') SELECT ' + @buf)

        SET @offset = @offset + @linePtr

        IF (@linePtr = 0)
            BREAK
    END
END
GO

-- Sample invocation: paste the IDs 2 through 5 into a table named wilfred
-- (assumed to already exist with a single recordID column), using a line
-- break as the row delimiter and a comma as the column delimiter
EXEC processPastedData 'wilfred', '2
3
4
5', '
', ','

SELECT * FROM wilfred


This approach yielded a huge performance improvement, as we had essentially minimized the Network Transfer Time by minimizing the amount of data being transmitted. We could have further improved the Query Time by reducing the number of transactions, either with explicit transaction blocks or by combining INSERTs into batches as we did with the Improved Insert Loop. But even with those improvements, we still would have run approximately the same number of queries as we did for the Improved Insert Loop (although now they would run within the stored procedure on the DB side).
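
To make the explicit-transaction idea concrete, here is a minimal sketch (illustrative only; not something we actually shipped) using the records table from above. The point is simply that the transaction log is flushed once at COMMIT rather than once per autocommitted statement:

-- Hypothetical sketch: a batch of generated INSERTs wrapped in one explicit
-- transaction, so their log writes are flushed together at COMMIT instead of
-- after every individual statement.
BEGIN TRANSACTION

INSERT INTO records (recordID) SELECT 12
INSERT INTO records (recordID) SELECT 15
INSERT INTO records (recordID) SELECT 17
-- ... remaining IDs in this batch ...

COMMIT TRANSACTION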

    In order to further improve the Query Time, we finally arrived at a fourth approach, the BCP Insert. BCP is a utility included with SQL Server that loads data from a file into a DB table. In this case, we dump the delimited IDs into a text file, then invoke BCP on the new text file. To run the BCP utility, we had to make sure there was a way for the web server to transmit these files to the DB machine. After that, we could run the BCP utility as a console command:


    bcp ClientDB.dbo.records in dataDump.txt -S DBMachine -U userID -P password -f formatFile.fmt

    ------------

    ClientDB.dbo.records = [DATABASE].[SCHEMA].[TABLE TO LOAD DATA INTO]
    dataDump.txt = data file that contains pasted data
    -S DBMachine = name of server to connect to
    -U userID = tells bcp to use a specific userID to log into DB machine
    -P password = tells bcp the password to use with userID to log into DB machine
    -f formatFile.fmt = format file that tells bcp how to parse the data file and how to insert into records table


    Creating the format file for configuring BCP to parse the data correctly was also straightforward:


    8.0
    1
    1 SQLCHAR 0 50 "\r\n" 1 recordID SQL_Latin1_General_CP1_CI_AS

    ------------

    8.0 = format file version (corresponds to SQL Server 2000)
    1 = number of columns

    Third line, left to right:
    1 = file field order
    SQLCHAR = host file data type
    0 = prefix length
    50 = host file data length
    "\r\n" = field terminator (the line break, since each row holds a single field)
    1 = server column order
    recordID = name of the column we are loading into
    SQL_Latin1_General_CP1_CI_AS = column collation



As with the Stored Procedure Loop, the data sent is essentially just the raw data, but in this case SQL Server loads it all without running a whole slew of queries. Furthermore, this kind of bulk load can be minimally logged (depending on the database's recovery model and the state of the target table), which results in much less overhead than the previous approaches. This final approach does require additional time to move the data to and from the file system, but in sum it was still faster than the other approaches. It is important to note, however, that as the number of records being pasted decreases, the difference in performance between these approaches also decreases. In fact, if we are inserting fewer than a thousand records, it is actually faster to revert from the BCP Insert to one of the more naïve solutions, since the overhead of the file dump begins to dominate the actual Query and Network Transfer Time.
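
As an aside, the same bulk-load machinery is also exposed through T-SQL's BULK INSERT statement. We did not use it here, but assuming the data file and format file are readable from the DB machine (the paths below are hypothetical), a roughly equivalent load would look like:

-- Illustrative sketch only: file paths are hypothetical and must be visible
-- from the DB server; minimal logging still depends on the recovery model
-- and the state of the target table.
BULK INSERT ClientDB.dbo.records
FROM 'D:\uploads\dataDump.txt'
WITH (FORMATFILE = 'D:\uploads\formatFile.fmt', TABLOCK)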

In conclusion, the BCP Insert solution was the fastest of the four. This is not unexpected, since these are exactly the kinds of operations BCP was designed to perform quickly. Another interesting extension to these solutions would have been to compress the raw data before transmitting it from the web server to the DB server. That said, we found this to be an interesting chance to experiment with a few different creative solutions.

    * The thousand here is selected for simplicity; we can continue to increase this number to maximize the benefit of these batch combinations. The cap on the batch size is dependent on the RDBMS you are using.
Thursday, August 14, 2008

    Debugging at APT - Part 3: Fiddler

    Fiddler has been around for a while, but I was only introduced to it a few months ago. Earlier this year, we were able to budget some time toward improving our software's performance, rather than continuing to add new features as quickly as we had been. Fiddler was especially useful for telling us just how much work our web servers (and our clients' browsers) were doing, and where some of our bottlenecks and improvement opportunities were.

    In my first post I mentioned that we have some pretty complex pages throughout our software. Before optimizing them, we were making hundreds of HTTP requests per page (including the now-infamous 85 iframes), and we were applying the same super-strict "no cache" rules to our static files that we applied to our dynamically-generated content. As a result, even our simplest pages required well over 100K of HTTP traffic, and some of the most complex pages were closer to 2MB. Fiddler's biggest strength for us was seeing how each of the various caching directives you can send in HTTP headers affects the browser and network traffic.
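
For context, the contrast we were staring at in Fiddler was roughly between response headers like these on static files (illustrative values only, not our exact production configuration):

Cache-Control: public, max-age=86400
Expires: Sat, 30 Aug 2008 12:00:00 GMT
Last-Modified: Fri, 22 Aug 2008 09:00:00 GMT
ETag: "abc123"

versus the strict no-cache headers we had been sending on everything, static and dynamic alike:

Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0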

    Coincidentally, right after I learned about Fiddler, I was asked to help troubleshoot a performance concern raised by one of our newer clients. They were sporadically and unpredictably experiencing "This page cannot be displayed (cannot find server or DNS error)" messages. It was inconsistent but seemed to happen primarily during the client's core business hours, and none of our other clients were experiencing comparable symptoms, so we figured it could be a network problem on their end. A combination of Fiddler (on one client user's computer) and Wireshark (on our web server) showed us a pretty clear pattern of dropped and resubmitted requests at the HTTP level. At the same time, our own internal logging showed an extraordinarily high variance in round-trip performance.

    It turned out that the client's internal firewall was sporadically dropping some outgoing requests and incoming responses during times of peak traffic on their general-purpose network. The solution was to establish a VPN tunnel, using a separate infrastructure that they already had in place for their vendors and business partners. In hindsight, we might have reached the same conclusion even more quickly if we had run Fiddler on both sides and compared the traces.