Database design tricks using Mathematics

by Amol 10. March 2011 21:54

Increase performance by 30-40% and reduce the table size by 90% by considering mathematics while database design

Writer: Amol Rajamane

Applies to: SQL Server 2000 / 2005 / 2008 / 2011

Summary:

In many area Mathematics makes your life easier and faster. This paper contains the SQL Server database design to get benefits of Bitwise operation in terms of size and performance.

Introduction:

When database table grows in size then its bit difficult to manage insert update deletes and Index maintenance. It’s very difficult to modify the existing schema as well. For example if you want change one datatype  say, INT to BIGINT, the operation would take its own sweet time to complete which is directly proportional to the number of records in the table. 
More number of records resulting in slow query performance and large amount of space.  Which could affect the ETL process.
To overcome these two problems, size and performance, database design can  play a very important role. This paper tells you some tricky design to overcome these two problems.

Overview:  

You might be aware of the bitwise operation in SQL Server.  Let’s step back and see how it works. Lets consider the simple example of Student and their SkillSet. We need to populate SkillSet table as shown in below.

SkillSetId

Skill

1

DB2

2

FoxPro

4

Informix

8

Interbase

16

MS Access

32

MS SQL Server

64

MySQL

SkillSetId whould be the sequence of POWER of 2, like 1, 2, 4, 8 and so on …
If you see carefully, sum of bit values will never exceeds next bit value. For example
1 + 2 = 3
1 + 2 + 4 = 7 
1 + 2 + 4 + 8 = 15
. . .

Consider Student table as shown in below. (Students may be in millions) 

StudentId

Name

1

Student 1

2

Student 2

3

Student 3

4

Student 4

Now consider Student 1 has skills sets MS Access, MS SQL Server, MySQL. We need to store Student and their skill sets. Consider the table StudentSkillSet like below. This is the normal way of storing the relational data.

StudentId

SkillSetId

1

16

1

32

1

64

 In another approach, Instead of storing SkillSetIds of student, we will store the SUM of SkillSetIds.  Here 16 + 32 + 64 = 112

StudentId

SkillSetId

1

112

This approach reduces the number of records. As you see earlier you have 3 records for StudentId 1.
Now how to relate this 112 number with IDs associated with MSAccess(16), MS SQL Server(32) and MySQL(64).  Here we will use bitwise AND to get the SkillSetIds associated to number 112. Just run the following statement in query analyzer.

SELECT  112 & 1 AS 'DB2', 
        112 & 2 AS 'FoxPro', 
        112 & 4 AS 'Informix', 
        112 & 8 AS 'Interbase', 
        112 & 16 AS 'MS Access', 
        112 & 32 AS 'MS SQL Server', 
        112 & 64 AS 'MySQL'
GO

This will return the output like 

DB2

FoxPro

Informix

Interbase

MS Access

MS SQL Server

MySQL

0

0

0

0

16

32

64


Only skill sets which are non zero are associated to 112. Isn’t it simple?

How it reduces the number of rows and size of table. For that consider there are 1 million students and each student has 10 skillsets. Then there will be 10 millions of rows in StudentSkillSet table. If we store the SUM of skillSetIds there will be the same number of records as in Student table. That means 1 millions.  To get the space used I created two tables StudentSkillSet and StudentSkillSetTemp. StudentSkillSet table contains StudentId and sum of SkillSetIds. And StudentSkillSetTemp contains StudentId and SkillSetId. Here is size difference between both approaches.

Execute below query to get the space used by StudentSkillSet

EXEC sp_spaceused StudentSkillSet 
GO

Out is as below which indicate total space used = 35.976 MB  (Data + Index_size)

Name

rows

Reserved

data

index_size

unused

StudentSkillSet

1000000   

37216 KB

20816 KB

15160 KB

1240 KB

Same way get the space used by StudentSkillSetTemp

EXEC sp_spaceused StudentSkillSetTemp 
GO

Out is as below which indicate total space used = 357.544 MB (Data + Index_size)

name

rows

reserved

Data

index_size

unused

StudentSkillSetTemp

10000000  

358224 KB

207824 KB

149720 KB

680 KB

You can see the difference between these two tables in size. There is size saving almost 90% 

 

This is all about the size. Now step back and see if there is any performance impact of this technic.
Let’s create the following tables

1. SkillSet   –- To store all the skillsets
2. Student   –- To store all the students
3. StudentSkillSet  -- To store sum of skillsets associated to the student
4. StudentSkillSetTemp -- To store skillsets associated to the student

Run the below script in your test database.  Please be patient for 4-5 minutes to execute below script.

-- First drop the dependent objects  
IF OBJECT_ID('[dbo].[vwStudentSkillSet]') IS NOT NULL 
    DROP VIEW [dbo].[vwStudentSkillSet]  
GO 
IF OBJECT_ID('[dbo].[vwStudentSkillSetTemp]') IS NOT NULL 
    DROP VIEW [dbo].[vwStudentSkillSetTemp]  
GO 
  
-- Drop if exists  
IF OBJECT_ID('[dbo].[SkillSet]') IS NOT NULL 
    DROP TABLE [dbo].[SkillSet]  
GO 
  
-- Create SkillSet table 
CREATE TABLE [dbo].[SkillSet] (SkillSetID BIGINT NOT NULL, Skill VARCHAR(256) NOT NULL) 
  
GO 
  
-- Insert skillsets, SkillSetIDs must be in POWER of 2  
INSERT INTO [dbo].[SkillSet] (SkillSetID, Skill) 
SELECT 1 AS SkillSetID, 'DB2' AS Skill UNION ALL 
SELECT 2 AS SkillSetID, 'FoxPro' AS Skill UNION ALL 
SELECT 4 AS SkillSetID, 'Informix' AS Skill UNION ALL 
SELECT 8 AS SkillSetID, 'Interbase' AS Skill UNION ALL 
SELECT 16 AS SkillSetID, 'MS Access' AS Skill UNION ALL 
SELECT 32 AS SkillSetID, 'MS SQL Server' AS Skill UNION ALL 
SELECT 64 AS SkillSetID, 'MySQL' AS Skill UNION ALL 
SELECT 128 AS SkillSetID, 'Oracle' AS Skill UNION ALL 
SELECT 256 AS SkillSetID, 'Paradox' AS Skill UNION ALL 
SELECT 512 AS SkillSetID, 'PostgreSQL' AS Skill UNION ALL 
SELECT 1024 AS SkillSetID, 'Borland C++ Builder' AS Skill UNION ALL 
SELECT 2048 AS SkillSetID, 'Borland Delphi' AS Skill UNION ALL 
SELECT 4096 AS SkillSetID, 'Borland JBuilder' AS Skill UNION ALL 
SELECT 8192 AS SkillSetID, 'ErWIN' AS Skill UNION ALL 
SELECT 16384 AS SkillSetID, 'GNU C/C++' AS Skill UNION ALL 
SELECT 32768 AS SkillSetID, 'MS Visual C++' AS Skill UNION ALL 
SELECT 65536 AS SkillSetID, 'Sun ONE Studio' AS Skill UNION ALL 
SELECT 131072 AS SkillSetID, 'HP ANSI C++' AS Skill UNION ALL 
SELECT 262144 AS SkillSetID, 'JDBC' AS Skill UNION ALL 
SELECT 524288 AS SkillSetID, 'JDK' AS Skill UNION ALL 
SELECT 1048576 AS SkillSetID, 'Macromedia Flash' AS Skill UNION ALL 
SELECT 2097152 AS SkillSetID, 'Macromedia Flash Media Server' AS Skill UNION ALL 
SELECT 4194304 AS SkillSetID, 'MS Visual Basic' AS Skill UNION ALL 
SELECT 8388608 AS SkillSetID, 'MS Visual J++' AS Skill UNION ALL 
SELECT 16777216 AS SkillSetID, 'MS Visual Source Safe' AS Skill UNION ALL 
SELECT 33554432 AS SkillSetID, 'Powerbuilder Foundation Classes' AS Skill UNION ALL 
SELECT 67108864 AS SkillSetID, 'Rational Rose' AS Skill UNION ALL 
SELECT 134217728 AS SkillSetID, 'UML' AS Skill UNION ALL 
SELECT 268435456 AS SkillSetID, 'JavaScript' AS Skill UNION ALL 
SELECT 536870912 AS SkillSetID, 'Lua' AS Skill UNION ALL 
SELECT 1073741824 AS SkillSetID, 'Modula-2' AS Skill UNION ALL 
SELECT 2147483648 AS SkillSetID, 'Pascal' AS Skill UNION ALL 
SELECT 4294967296 AS SkillSetID, 'Perl' AS Skill UNION ALL 
SELECT 8589934592 AS SkillSetID, 'PL/I' AS Skill UNION ALL 
SELECT 17179869184 AS SkillSetID, 'PHP' AS Skill UNION ALL 
SELECT 34359738368 AS SkillSetID, 'Python' AS Skill UNION ALL 
SELECT 68719476736 AS SkillSetID, 'REALBasic' AS Skill UNION ALL 
SELECT 137438953472 AS SkillSetID, 'Visual Basic' AS Skill UNION ALL 
SELECT 274877906944 AS SkillSetID, 'Active X' AS Skill UNION ALL 
SELECT 549755813888 AS SkillSetID, 'ADO' AS Skill UNION ALL 
SELECT 1099511627776 AS SkillSetID, 'ASP' AS Skill UNION ALL 
SELECT 2199023255552 AS SkillSetID, 'ATL' AS Skill UNION ALL 
SELECT 4398046511104 AS SkillSetID, 'BDE' AS Skill UNION ALL 
SELECT 8796093022208 AS SkillSetID, 'CGI' AS Skill UNION ALL 
SELECT 17592186044416 AS SkillSetID, 'Client-Server Architecture' AS Skill UNION ALL 
SELECT 35184372088832 AS SkillSetID, 'COM/DCOM' AS Skill UNION ALL 
SELECT 70368744177664 AS SkillSetID, 'CORBA' AS Skill UNION ALL 
SELECT 140737488355328 AS SkillSetID, 'DAO/RDO' AS Skill UNION ALL 
SELECT 281474976710656 AS SkillSetID, 'DICOM' AS Skill UNION ALL 
SELECT 562949953421312 AS SkillSetID, 'HTML' AS Skill UNION ALL 
SELECT 1125899906842624 AS SkillSetID, 'HL7' AS Skill UNION ALL 
SELECT 2251799813685248 AS SkillSetID, 'IBM WebSphere' AS Skill UNION ALL 
SELECT 4503599627370496 AS SkillSetID, 'Image processing' AS Skill UNION ALL 
SELECT 9007199254740992 AS SkillSetID, 'JDBC' AS Skill UNION ALL 
SELECT 18014398509481984 AS SkillSetID, 'MFC' AS Skill UNION ALL 
SELECT 36028797018963968 AS SkillSetID, 'Microsoft .NET' AS Skill UNION ALL 
SELECT 72057594037927936 AS SkillSetID, 'MIDAS' AS Skill UNION ALL 
SELECT 144115188075855872 AS SkillSetID, 'ODBC' AS Skill UNION ALL 
SELECT 288230376151711744 AS SkillSetID, 'OLE' AS Skill UNION ALL 
SELECT 576460752303423488 AS SkillSetID, 'Oracle Developer/Programmer' AS Skill UNION ALL 
SELECT 1152921504606846976 AS SkillSetID, 'Sound processing' AS Skill UNION ALL 
SELECT 2305843009213693952 AS SkillSetID, 'VBA' AS Skill  
  
  
GO 
-- Create PK 
ALTER TABLE [dbo].[SkillSet] ADD CONSTRAINT pk_SkillSet PRIMARY KEY (SkillSetID)  
  
GO 
  
-- Drop if exists  
IF OBJECT_ID('[dbo].[Student]') IS NOT NULL 
    DROP TABLE [dbo].[Student]  
GO 
  
-- Create student table 
CREATE TABLE [dbo].[Student] (StudentID INT IDENTITY(1,1), StudentName VARCHAR(100) NOT NULL) 
GO 
  
-- Add 1 million records in Student table 
SET NOCOUNT ON
DECLARE @LoopCounter INT
SET @LoopCounter = 1 
  
BEGIN TRAN 
WHILE @LoopCounter <= 100000 
BEGIN
    INSERT [dbo].[Student] VALUES('Student ' + CAST(@LoopCounter AS VARCHAR)) 
    SET @LoopCounter = @LoopCounter + 1 
END
COMMIT 
GO 
  
-- Create PK 
ALTER TABLE [dbo].[Student] ADD CONSTRAINT pk_Student PRIMARY KEY (StudentID)  
  
GO 
  
-- Drop if exists  
IF OBJECT_ID('[dbo].[vwStudentSkillSet]') IS NOT NULL 
    DROP VIEW [dbo].[vwStudentSkillSet]  
  
GO 
IF OBJECT_ID('[dbo].[vwStudentSkillSetTemp]') IS NOT NULL 
    DROP VIEW [dbo].[vwStudentSkillSetTemp]  
  
GO 
  
-- Drop if exists  
IF OBJECT_ID('[dbo].[StudentSkillSet]') IS NOT NULL 
    DROP TABLE [dbo].[StudentSkillSet]  
GO 
  
-- Create StudentSkillSet 
CREATE TABLE [dbo].[StudentSkillSet] (StudentID INT NOT NULL, SkillSetID BIGINT NOT NULL) 
GO 
  
-- Drop if exists  
IF OBJECT_ID('[dbo].[StudentSkillSetTemp]') IS NOT NULL 
    DROP TABLE [dbo].[StudentSkillSetTemp]  
GO 
  
-- Create StudentSkillSetTemp 
CREATE TABLE [dbo].[StudentSkillSetTemp] (StudentID INT NOT NULL, SkillSetID BIGINT NOT NULL) 
GO 
  
  
-- Insert random 10 skills for each student into StudentSkillSetTemp 
SET NOCOUNT ON
DECLARE @LoopCounter INT, @StudentID INT, @strLoopCounter VARCHAR(100) 
SET @LoopCounter = 1 
  
BEGIN TRAN 
WHILE @LoopCounter <= 100000 
BEGIN
  
    SELECT @StudentID = StudentID 
    FROM Student  
    WHERE StudentID = @LoopCounter 
  
    INSERT [dbo].[StudentSkillSetTemp] (StudentID, SkillSetID) 
    SELECT TOP 10 @StudentID, SkillSetID 
    FROM SkillSet 
    ORDER BY NEWID() 
  
    SET @LoopCounter = @LoopCounter + 1 
      
    SET @strLoopCounter = CAST(@LoopCounter AS VARCHAR) 
    RAISERROR (@strLoopCounter, 0, 1) WITH NOWAIT 
  
END
COMMIT 
  
GO 
  
-- Create PK 
ALTER TABLE [dbo].[StudentSkillSetTemp] ADD CONSTRAINT pk_StudentSkillSetTemp PRIMARY KEY (SkillSetID, StudentID) 
GO 
  
-- Create NC on StudentID 
CREATE NONCLUSTERED INDEX [ix_StudentSkillSetTemp_StudentID] ON [dbo].[StudentSkillSetTemp] ([StudentID]) 
GO 
  
-- Insert SUM of Skillsets for students into StudentSkillSet 
INSERT [dbo].[StudentSkillSet] (StudentID, SkillSetID) 
SELECT StudentID, SUM(SkillSetID) AS SkillSetID 
FROM [dbo].[StudentSkillSetTemp] 
GROUP BY StudentID 
GO 
  
-- Create PK, Here don't need to create composit key as PK as we have 1-1 mapping 
ALTER TABLE [dbo].[StudentSkillSet] ADD CONSTRAINT pk_StudentSkillSet PRIMARY KEY (SkillSetID) 
GO 
  
-- Create NC on SkillSetID 
CREATE NONCLUSTERED INDEX [ix_StudentSkillSet_StudentID] ON [dbo].[StudentSkillSet] (StudentID)  
GO 
  
  
-- Create view to get the actual skillset ids associated to students 
CREATE VIEW [dbo].[vwStudentSkillSet] 
WITH SCHEMABINDING  
AS
    SELECT  
        sss.StudentID,  
        ss.SkillSetID, 
        ss.Skill   
    FROM [dbo].[StudentSkillSet] sss  
    CROSS JOIN [dbo].[SkillSet] ss 
    WHERE ss.SkillSetID > 0  
          AND ss.SkillSetID | sss.SkillSetID = sss.SkillSetID 
  
GO 
  
-- Here we will create composite PK 
CREATE UNIQUE CLUSTERED INDEX ucix_vwStudentSkillSet_StudentID 
ON [dbo].[vwStudentSkillSet] (SkillSetID, StudentID) 
GO 
  
-- Create the same view for our temp table as well  
CREATE VIEW [dbo].[vwStudentSkillSetTemp] 
WITH SCHEMABINDING  
AS
    SELECT  sss.StudentID, ss.SkillSetID, ss.Skill 
    FROM    [dbo].[StudentSkillSetTemp] sss 
    INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID) 
    WHERE   ss.SkillSetID > 0 
  
GO 
  
-- Create composite PK 
CREATE UNIQUE CLUSTERED INDEX ucix_vwStudentSkillSetTemp_StudentID 
ON [dbo].[vwStudentSkillSetTemp] (SkillSetID, StudentID) 
GO

Now Lets look at the performance different between these two approaches.
First we will run some queries against the tables directly instead of indexed view which we created above.
Query Number  1 :  -- Get all the Students having Skill Set  1, 2,  and 3

--With new design  
SELECT  sss.StudentID,  
        ss.SkillSetID,  
        ss.Skill   
FROM    [dbo].[StudentSkillSet] sss 
CROSS JOIN [dbo].[SkillSet] ss 
WHERE   ss.SkillSetID IN (1,2,4) AND 
        ss.SkillSetID | sss.SkillSetID = sss.SkillSetID  
GO 
--With normal design  
SELECT  sss.StudentID,  
        ss.SkillSetID,  
        ss.Skill   
FROM    [dbo].[StudentSkillSetTemp] sss 
INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID) 
WHERE   ss.SkillSetID IN (1,2,4)
GO

Run these queries and see the execution plan.
Figure 1 

See the execution plan in Figure 1, It’s 22% faster.

Query Number  2 :  -- Get all the Skill Sets of Student  1

-- With new design  
SELECT  sss.StudentID,  
        ss.SkillSetID,  
        ss.Skill   
FROM    [dbo].[StudentSkillSet] sss 
CROSS JOIN [dbo].[SkillSet] ss 
WHERE   ss.SkillSetID > 0 AND 
        ss.SkillSetID | sss.SkillSetID = sss.SkillSetID  
        AND sss.StudentID = 1 
GO 
  
-- With normal design  
SELECT  sss.StudentID,  
        ss.SkillSetID,  
        ss.Skill   
FROM    [dbo].[StudentSkillSetTemp] sss 
INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID) 
WHERE   ss.SkillSetID > 0 AND 
        sss.StudentID = 1
GO

Execute above queries and see the execution plan.

Figure 2

 

See the execution plan in Figure 2, It’s 8% faster.

Now lets run some queries against Indexed views, [dbo].[vwStudentSkillSet] and [dbo].[vwStudentSkillSetTemp]

Query Number  3 :  -- Get all the Students having Skill Sets  1, 2,  and 3 using Indexed views

--With new design  
SELECT  StudentID,  
        SkillSetID,  
        Skill  
FROM    [dbo].[vwStudentSkillSet] 
WHERE   SkillSetID IN (1,2,4) 
  
GO 
--With normal design  
SELECT  StudentID,  
        SkillSetID,  
        Skill  
FROM    [dbo].[vwStudentSkillSetTemp] 
WHERE   SkillSetID IN (1,2,4)
GO

Figure 3

 

See the execution plan in Figure 3, It’s 22% faster.

Query Number  4 :  -- Get all the Skill Sets of Student  1 using Indexed views

--With new design  
SELECT  StudentID,  
        SkillSetID,  
        Skill  
FROM    [dbo].[vwStudentSkillSet] 
WHERE   StudentID = 1 
  
GO 
--With new design  
SELECT  StudentID,  
        SkillSetID,  
        Skill  
FROM    [dbo].[vwStudentSkillSetTemp] 
WHERE   StudentID  = 1
GO

Figure 4

 

Query Number  5 :  -- Get the number of skillsets for Student  1 using Indexed views

SELECT  COUNT(*)  
FROM    [dbo].[vwStudentSkillSet] 
WHERE   SkillSetID = 1 
  
GO 
  
SELECT  COUNT(*)  
FROM    [dbo].[vwStudentSkillSetTemp] 
WHERE   SkillSetID  = 1
GO

Figure 5 

 

If you observe, all the queries are running at least 22% (this number may vary depending on hardware) faster than old design.

Limitations:

The biggest disadvantage of this design is, you cannot use it if you have more than 62 dimensions keys. In this case I used 62 Skill Sets. This is because the POWER(2,63) crosses the max limit of big integer supported by SQL Server. And binary operations cannot be performed against Float or Decimal values.

Inferences:

This approach is very useful if you have dimensions having keys less than 63 and you have huge facts which holds these dimension keys. 
Consider the case where you have 100 millions fact keys and each key hold minimum 10 dimension keys. So there will be 100 * 10 = 1000 Millions of rows. But if you use this design there will be 100 millions of rows. So at the max you could save the space 62 time, it’s huge one …

SQL Queries: 

SQL Queries.sql (10.31 kb)

Tags: , ,

Whitepaper

Category

Recent Posts

Tag cloud