This is a story from real life that explains one of the reasons why I love developers. At my current assignment as a SQL Server consultant I hold the DW DBA role. Among our adventures, the project deployed a report that a Cognos report developer had built against a fact table with 3,685,952 rows and, among many dimensions, one particular dimension table with 38,572 rows that needed to be joined twice in the query. The report was supposed to return slightly fewer rows than the fact table, due to a condition on the dimension table. But something went terribly wrong somewhere in the Cognos report design. As it later turned out, because both columns in the bad join predicate happened to be INT, nothing was detected at development time, and the developer's own testing could not have been very thorough, to say the least, before the report was delivered to the development environment.
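To make the failure mode concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical stand-ins, not the actual Cognos-generated SQL: the point is that when the wrong columns are both integers, the join predicate is perfectly valid and runs without error, it just matches nearly every fact row against nearly every dimension row.

```python
# Hypothetical illustration of a "bad join on two INT columns".
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fact (fact_id INTEGER, dim_key INTEGER, load_flag INTEGER)")
cur.execute("CREATE TABLE dim  (dim_key INTEGER, active_flag INTEGER)")
# Tiny stand-ins for the real tables: 4 fact rows, 3 dimension rows.
cur.executemany("INSERT INTO fact VALUES (?, ?, 1)", [(i, i % 3) for i in range(4)])
cur.executemany("INSERT INTO dim  VALUES (?, 1)", [(k,) for k in range(3)])

# Intended join: each fact row finds exactly one dimension row.
good = cur.execute(
    "SELECT COUNT(*) FROM fact JOIN dim ON fact.dim_key = dim.dim_key"
).fetchone()[0]

# Accidental join on the wrong INT columns: both are 1 on every row,
# so the predicate is true for every fact/dim pair -- a Cartesian product.
bad = cur.execute(
    "SELECT COUNT(*) FROM fact JOIN dim ON fact.load_flag = dim.active_flag"
).fetchone()[0]

print(good, bad)  # 4 rows vs 4 * 3 = 12 rows
```

With 4 and 3 rows the blow-up is harmless; with millions of fact rows it fills tempdb.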
The report runs for many minutes, and Cognos finally gives the tester a diffuse error message saying that the report crashed. Simultaneously, the maintenance DBA gets a notice that "someone or something" has filled the tempdb disk (tempdb has expanded until the disk is full), and the server service has to be restarted to shrink tempdb back to its startup size. This happens a couple of times. But having worked as a maintenance DBA myself, and with our project members sitting in the same room, I heard about it, knew about the new report, and got the feeling that this report most likely had something to do with the tempdb crashes. I managed to capture the query by running a Profiler trace while we reran the report ourselves. Tempdb blew up again, though. 🙂
Instead of the fact table being restricted by the dimension through a join, I found that almost every row in the fact table was joined to almost every row in the dimension. The number of rows SQL Server was juggling was mind-blowing. I couldn't even do a COUNT(*) on the join: the integer data type wasn't big enough (COUNT_BIG would have been needed). So I did a rough count myself, slightly exaggerated because not quite all of the rows hit the join condition, but close to it. A calculator is not needed when you have T-SQL:

declare @bigint1 bigint, @bigint2 bigint
set @bigint1 = 38572
set @bigint2 = 3685952
select @bigint1 * @bigint2 -- 142 174 540 544

Wow! That is a lot of rows to process, and part of the query was a GROUP BY. That would heat up the tempdb disks a bit, I guess. And that was just the first join to that dimension; the same mistake was made on the second one:

select @bigint1 * @bigint1 * @bigint2 -- 5 483 956 377 863 168
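For anyone without a SQL Server at hand, the same back-of-the-envelope numbers can be checked in plain Python, where integers have arbitrary precision, so no BIGINT is needed:

```python
# Re-check the row-count estimates from the T-SQL above.
dim_rows = 38_572
fact_rows = 3_685_952

first_join = dim_rows * fact_rows                 # one bad join
second_join = dim_rows * dim_rows * fact_rows     # same mistake twice

print(f"{first_join:,}")   # 142,174,540,544
print(f"{second_join:,}")  # 5,483,956,377,863,168
```

Note that even the first product is far beyond the INT maximum of 2,147,483,647, which is why the plain COUNT(*) failed.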
Wow! I'm glad, though, that tempdb crashed before we got to show that report to the customer's testers. The users wouldn't have trusted IT for a long time. I would love to have a huge amount of disks, and time, to test whether SQL Server would ever return an answer to that query. The report design flaw will probably be easy to correct, but that's the Cognos developer's headache. I thank them, though, for giving us DBAs work.