SSMS Data Classification part 2

Oopps! Upgrade your browser pretty please.Oopps! Upgrade your browser pretty please.

Introduction

In my previous post I used SQL Profiler to investigate how the Data Classification feature works under the covers and we also found out a couple more shortcomings
(doesn’t parse XML data, tries to and fails adding extended properties to in-memory tables).

But we did establish that it is possible to add the relevant extended property tags directly from the backend and that the SSMS reports and features will pick them up and display them as usual.

In this post I will create a proof of concept for creating our own dictionary table and using these custom values (in Swedish naturally) to provide data classification capabilities for databases that have column names defined in Swedish (or whatever language you add to your dictionary table). I will also try and produce remedies to the limitations in the feature that I found last time.

OK – first of all we will need a couple of tables with column names in Swedish, here’s a very basic example (Person & Företag – which translates to Person & Company).


-- Add a couple of tables with Swedish table and column names
IF Object_id('dbo.Person') is not null
BEGIN
Drop Table dbo.Person
END

Create Table dbo.Person
(
Efternamn nvarchar(80) NOT NULL
,Förnamn nvarchar(80) NOT NULL
,Födelseort nvarchar(80) NOT NULL
,medborgarskap nvarchar(80) NOT NULL
,Födelseland nvarchar(80) NOT NULL
,Födelsedatum datetime NOT NULL
,Adress nvarchar(80) NOT NULL
,Postadress nvarchar(80) NOT NULL
,Postnummer varchar(10) NOT NULL
,Personnummer varchar(10) NOT NULL
,Kön varchar(8) NOT NULL
)

IF Object_id('dbo.Företag') is not null
BEGIN
Drop Table dbo.Företag
END

Create Table dbo.Företag
(
Namn nvarchar(100) NOT NULL
,Adress nvarchar(80) NOT NULL
,Postadress nvarchar(80) NOT NULL
,Postnummer varchar(10) NOT NULL
,OrgNummer varchar(10)
)

Dictionary

Then we will just steal borrow the dictionary table which our trace last time revealed, as an aside I believe that adding a language column and then filtering to english plus the language of the database collation would be a shortcut to much better performance (and fewer false positives).

Now we just need to populate the dictionary, I will take some of the english terms and translate them to swedish. I can’t resist leaving in the spelling mistake %mainden%name%
– I guess this rule won’t fire very often.
The following rows / rules should give us enough to test with a full GDPR scan would obviously require a fuller dictionary.


('%användarnamn%' ,'Credentials','Confidential' ,1),
('%lösenord%' ,'Credentials','Confidential' ,1),
('%password%' ,'Credentials','Confidential' ,1),
('%emejl%' ,'Contact Info','Confidential - GDPR' ,0),
('%epost%' ,'Contact Info','Confidential - GDPR' ,0),
('%e-mejl%' ,'Contact Info','Confidential - GDPR' ,0),
('%e-post%' ,'Contact Info','Confidential - GDPR' ,0),
('%efter%namn%' ,'Name','Confidential - GDPR' ,0),
('%för%namn%' ,'Name','Confidential - GDPR' ,0),
('%familjenamn%' ,'Name','Confidential - GDPR' ,0),
-- from original dictionary - to show spelling mistake - this isn't how you spell maiden !!!
('%mainden%name%' ,'Name','Confidential - GDPR' ,0),
('%flick%namn%' ,'Name','Confidential - GDPR' ,0),
('%adress%' ,'Contact Info','Confidential - GDPR' ,0),
('%telefon%' ,'Contact Info','Confidential - GDPR' ,1),
('%mobil%' ,'Contact Info','Confidential - GDPR' ,1),
('%postnummer%' ,'Contact Info','Confidential - GDPR' ,1),
('%person%nummer%' ,'National ID','Confidential - GDPR' ,1)

After this come the Information type and Sensitivity label tables (@InfoTypeRanking @SensitivityLabel), the entries are the same as before but I have added an extra column with the datatype uniqueidentifier to these tables as these are required by the data classification page (but not the report) we will add these as extended properties.


DECLARE @InfoTypeRanking TABLE
(
info_type NVARCHAR(128),
info_type_guid uniqueidentifier default newid(),
ranking INT
)
-- If we get multiple matches we will apply the lowest ranking first
-- So if a column matches on both Name and CreditCard it will be classified as Name
-- Your priorities may differ from Microsoft's
INSERT INTO @InfoTypeRanking (info_type, ranking)
VALUES
('Networking', 100),
('Contact Info', 200),
('Credentials', 300),
('Name', 400),
('National ID', 500),
('SSN', 600),
('Credit Card', 700),
('Banking', 800),
('Financial', 900),
('Health', 1000),
('Date Of Birth', 1100),
('Other', 1200)

DECLARE @SensitivityLabel TABLE
(
sensitivity_label NVARCHAR(128),
sensitivity_label_guid uniqueidentifier default newid()
)

Insert @SensitivityLabel (sensitivity_label) select distinct sensitivity_label from @Dictionary

If you want to know more about extended properties I refer you to the following article
https://www.red-gate.com/simple-talk/sql/database-delivery/scripting-description-database-tables-using-extended-properties/
Note that Phil is using JSON formatting for his extended properties, by doing this he can store multiple properties in a single extended property and this allows one to store a single extended property per column holding all of the values (but using this technique would create problems with compatibility with versions earlier than SQL Server 2016 and is the probably explanation why ).

Calculate Results

I populate the @ClassifcationResults table using the exact Microsoft code now all we need to do is use a common table expression to generate the calls to sp_addextendedproperty to tag the columns that have been identified. So simply copy the output calls and paste into a new SSMS window connected to the right database and execute.


DECLARE @ClassifcationResults TABLE
(
schema_name NVARCHAR(128),
table_name NVARCHAR(128),
column_name NVARCHAR(128),
info_type NVARCHAR(128),
sensitivity_label NVARCHAR(128),
ranking INT,
can_be_numeric BIT
)
INSERT INTO @ClassifcationResults
SELECT DISTINCT S.NAME AS schema_name,
T.NAME AS table_name,
C.NAME AS column_name,
D.info_type,
D.sensitivity_label,
R.ranking,
D.can_be_numeric
FROM sys.schemas S
INNER JOIN sys.tables T
ON S.schema_id = T.schema_id
INNER JOIN sys.columns C
ON T.object_id = C.object_id
INNER JOIN sys.types TP
ON C.system_type_id = TP.system_type_id
LEFT OUTER JOIN @Dictionary D
ON (D.pattern NOT LIKE '%[%]%' AND LOWER(C.name) = LOWER(D.pattern) COLLATE DATABASE_DEFAULT) OR
(D.pattern LIKE '%[%]%' AND LOWER(C.name) LIKE LOWER(D.pattern) COLLATE DATABASE_DEFAULT)
LEFT OUTER JOIN @infoTypeRanking R
ON (R.info_type = D.info_type)
WHERE (D.info_type IS NOT NULL ) AND
NOT (D.can_be_numeric = 0 AND TP.name IN ('bigint','bit','decimal','float','int','money','numeric','smallint','smallmoney','tinyint'))

; WITH Class AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY MR.schema_name, MR.table_name, MR.column_name) as rnum,
CR.schema_name AS schema_name,
CR.table_name AS table_name,
CR.column_name AS column_name,
OBJECTPROPERTY(OBJECT_ID(N''+CR.schema_name+'.'+CR.table_name+''),'TableIsMemoryOptimized') as IsInMemory,
CR.info_type AS information_type_name,
convert(varchar(50), ITR.info_type_guid) as info_type_guid,
CR.sensitivity_label AS sensitivity_label_name,
convert(varchar(50), SL.sensitivity_label_guid) as sensitivity_label_guid
FROM @ClassifcationResults CR
INNER JOIN
(
SELECT
schema_name,
table_name,
column_name,
MIN(ranking) AS min_ranking
FROM
@ClassifcationResults
GROUP BY
schema_name,
table_name,
column_name
) MR
ON CR.schema_name = MR.schema_name
AND CR.table_name = MR.table_name
AND CR.column_name = MR.column_name
AND CR.Ranking = MR.min_ranking
JOIN @InfoTypeRanking ITR ON
ITR.info_type = CR.info_type
JOIN @SensitivityLabel SL ON
SL.sensitivity_label = CR.sensitivity_label
)
-- uncomment and run the below if you want a report
-- rather than the calls to add the extended properties
--SELECT
--C.schema_name,
--C.table_name,
--C.column_name,
--OBJECTPROPERTY(OBJECT_ID(N''+C.schema_name+'.'+C.table_name+''),'TableIsMemoryOptimized') as IsInMemory,
--C.info_type AS information_type_name,
--C.sensitivity_label AS sensitivity_label_name,

SELECT rnum, 1 as subsort,
'EXEC sp_addextendedproperty @name = N''sys_information_type_name'', @value = '''+C.information_type_name+''', '+
'@level0type = N''Schema'', @level0name = '''+C.schema_name+''', '+
'@level1type = N''Table'', @level1name = '''+C.table_name+''','+
'@level2type = N''Column'', @level2name = '''+C.column_name+''';' as xp1
FROM Class C
UNION ALL
SELECT rnum, 2,
'EXEC sp_addextendedproperty @name = N''sys_information_type_id'', @value = '''+C.info_type_guid+''', '+
'@level0type = N''Schema'', @level0name = '''+C.schema_name+''', '+
'@level1type = N''Table'', @level1name = '''+C.table_name+''','+
'@level2type = N''Column'', @level2name = '''+C.column_name+''';'
FROM Class C
UNION ALL
SELECT rnum, 3,
'EXEC sp_addextendedproperty @name = N''sys_sensitivity_label_name'', @value = '''+C.sensitivity_label_name+''', '+
'@level0type = N''Schema'', @level0name = '''+C.schema_name+''', '+
'@level1type = N''Table'', @level1name = '''+C.table_name+''','+
'@level2type = N''Column'', @level2name = '''+C.column_name+''';'
FROM Class C
UNION ALL
SELECT rnum, 4,
'EXEC sp_addextendedproperty @name = N''sys_sensitivity_label_id'', @value = '''+C.sensitivity_label_guid+''', '+
'@level0type = N''Schema'', @level0name = '''+C.schema_name+''', '+
'@level1type = N''Table'', @level1name = '''+C.table_name+''','+
'@level2type = N''Column'', @level2name = '''+C.column_name+''';'
FROM Class C
-- We can't add extended properties to InMemory tables so lets filter those out
WHERE IsInMemory = 0
order by rnum asc, subsort

Now when I open the data classification window I see the matching columns correctly tagged and similarly within the report. QED

As a final flourish let us go back to the AdventureWorks2016 database and match the XML elements within the HumanResources.JobCandidate table within the Resume column – we can
then use a LIKE match to retrieve Elements against the dictionary table.


--Add some new rows to the dictionary table to use in the XML element demo
('%street%' ,'Contact Info' , 'Confidential - GDPR' ,0),
('%city%' ,'Contact Info' , 'Confidential - GDPR' ,0),
('%postal%' ,'Contact Info' , 'Confidential - GDPR' ,1),
('%zip%' ,'Contact Info' , 'Confidential - GDPR' ,1),
('%Addr%' ,'Contact Info' , 'Confidential - GDPR' ,1),
('%Telephone%' ,'Contact Info' , 'Confidential - GDPR' ,1)

-- Inspiration from this post - thanks Aaronaught
-- https://stackoverflow.com/questions/2266132/how-can-i-get-a-list-of-element-names-from-an-xml-value-in-sql-server

;WITH Xml_CTE AS
(
SELECT
CAST('/' + node.value('fn:local-name(.)',
'varchar(100)') AS varchar(100)) AS name,
node.query('*') AS children
FROM HumanResources.JobCandidate
CROSS APPLY Resume.nodes('/*') AS roots(node)
UNION ALL
SELECT
CAST(x.name + '/' +
node.value('fn:local-name(.)', 'varchar(100)') AS varchar(100)),
node.query('*') AS children
FROM Xml_CTE x
CROSS APPLY x.children.nodes('*') AS child(node)
)
SELECT DISTINCT name, replace(XC.name, '/',''),D.*, R.info_type
FROM Xml_CTE XC
INNER JOIN @Dictionary D ON
replace(XC.name, '/','') LIKE D.pattern COLLATE DATABASE_DEFAULT
LEFT OUTER JOIN @infoTypeRanking R
ON (LOWER(R.info_type) = LOWER(D.info_type))
WHERE (D.info_type IS NOT NULL )
OPTION (MAXRECURSION 1000)