Over the last several years not a month has gone by where I have not heard someone mention R – with regards to risk analysis or risk modeling – either in discussion or on a mailing list. If you do not know what R is, take a few minutes to read about it at the project’s main site. Simply put, R is a free software environment for statistical computing and graphics. Most of my quantitative modeling and analysis has been strictly Excel-based, which to date has been more then sufficient for my needs. However, Excel is not the ‘end-all-be-all’ tool. Excel does not contain every statistical distribution that risk practitioners may need to work with, there is no native Monte Carlo engine and it does have graphing limitations short of purchasing third party add-ons (advanced charts, granular configuration of graphs, etc…).
Thanks to some industry peer prodding (Jay Jacobs of Verizon’s Risk Intelligence team and Alex Hutton suggesting that ‘Numbers’ is a superior tool for visualizations). I finally bit the bullet, downloaded and then installed R. For those completely new to R you have to realize that R is a platform to build amazing things upon. It is very command-line like in nature. You type in instructions and it executes. I like this approach because you are forced to learn the R language and syntax. Thus, in the end you will probably understand your data and resulting analysis much better.
One of the first graphics I wanted to explore with R was heat maps. At first, as I was thinking a standard risk management heat map; a 5×5 matrix with issues plotted on the matrix relative to frequency and magnitude. However, when I started searching Google for ‘R heat map’, a similar yet different style of heat map – referred to as a cluster heat map – was first returned in the search results. A cluster heat map is useful for comparing data elements in a matrix against each other depending on how your data is laid out. It is very visual in nature and allows the reader to quickly zero in on data elements or visual information of importance. From an information risk management perspective, if we have quantitative risk information and some metadata, we can begin a discussion with management by leveraging a heat map visualization. If additional information is needed as to why there are dark areas, then we can have the discussion about the underlying quantitative data. Thus, I decided to build a cluster heat map in R.
I referenced three blogs to guide my efforts – they can be found here, here and here. What I am writing here is in no way a complete copy and paste of their content because I provide some additional details on some steps that generated errors for me that in some cases took hours to figure out. This is not unexpected given the difference in data sets.
Let’s do it.
1. Download and install R. After installation, start an R session. The version of R used for this post is 2.14.0. You can check your version by typing version at the command prompt and pressing ENTER.
2. You will need to download and install the ggplot2 package / library. Do this through the R gui by referencing an online CRAN repository (packages -> install packages …). This method seems to be cleaner then downloading a package to your hard disk and then telling R to install it. In addition, if you reference an online repository, it will also grab any dependent packages at the same time. You can learn more about ggplot2 here.
3. Once you have installed the ggplot2 package, we have to load it into our current R workspace.
4. Next, we are going to import data to work with in R. Download ‘risktical_csv1.csv’ to your hard disk and execute the following command. Change the file path to match the file path for where you saved the file to.
risk <- read.csv(“C:/temph/risktical_csv1.csv”, sep=”,”, check.names= FALSE)
a. We are telling R to import a Comma Separated Value file and assign it to a variable called ‘risk’.
b. Read.csv is the method or function type of import.
c. Notice that the slashes in the file name are opposite of what they normally would be when working with other common Windows-based applications.
d. ‘sep=”,”’ tells R what character is used to separate values within the data set.
e. ‘check.names=FALSE’ tells R not to check the column headers for correctness. R expects to see only letters, if it sees numbers, it will prepend an X to the column headers – we don’t want that based off the data set we are using.
f. Once you hit enter, you can type ‘risk’ and hit enter again. The data from the file will be displayed on the screen.
5. Now we need to ‘shape’ the data. The ggplot graphing function we want to use cannot consume the data as it currently is, so we are going to reformat the data first. The ‘melt’ function helps us accomplish this.
risk.m <- melt(risk)
a. We are telling R to use the melt function against the ‘risk’ variable. Then we are going to take the output from melt and create a new variable called risk.m.
b. Melt rearranges the data elements. Type ‘help(melt)’ for more information.
c. After you hit enter, you can type ‘risk.m’ and hit enter again. Notice the way the data is displayed compared to the data prior to ‘melting’ (variable ‘risk’).
6. Next, we have to rescale our numerical values so we can know how to shade any given section of our heat map. The higher the numerical value within a series of data, the darker the color or shade that tile of the heat map should be. The ‘ddply’ function helps us accomplish the rescaling; type ‘help(ddply)’ for more information.
risk.m <- ddply(risk.m, .(variable), transform, rescale = rescale(value), reorder=FALSE)
a. We are telling R to execute the ‘ddply’ function against the risk.m variable.
b. We are also passing some arguments to ‘ddply’ telling it to transform and reshape the numerical values. The result of this command produces a new column of values between 0 and 1.
c. Finally, we pass an argument to ‘ddply’ not to reorder any rows.
d. After you hit enter, you can type ‘risk.m’ and hit enter again and observe changes to the data elements; there should be two new columns of data.
7. We are now ready to plot our heat map.
(p <- ggplot(risk.m, aes(variable, BU.Name)) + geom_tile(aes(fill = rescale), colour = “grey20″) + scale_fill_gradient(low = “white”, high = “red”))
a. This command will produce a very crude looking heat map plot.
b. The plot itself is assigned to a variable called p
c. ‘scale_fill_gradient’ is the argument that associates color shading to the numerical values we rescaled in step 6. The higher the rescaling value – the darker the shading.
d. The ‘aes’ function of ggplot is related to aesthetics. You can type in ‘help(aes)’ to learn about the various ‘aes’ arguments.
8. Before we tidy up the plot, let’s set a variable that we will use in formatting axis values in step 9.
base_size <- 9
9. Now we are going to tidy up the plot. There is a lot going on here.
p + theme_grey(base_size = base_size) + labs(x = “”, y = “”) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0)) + opts(legend.position = “none”, axis.ticks = theme_blank(), axis.text.x = theme_text(size = base_size * 0.8, angle = -90, hjust = 0, colour = “black”), axis.text.y = theme_text(size = base_size * 0.8, angle = 0, hjust = 0, colour = “black”))
a. ‘labs(x = “”, y = “”)’ removes the axis labels.
b. ‘opts(legend.position = “none”’ gets rid of the scaling legend.
c. ‘axis.text.x = theme_text(size = base_size * 0.8, angle = -90’ sets the X axis text size as well as orientation.
d. The heat map should look like the image below.
A few final notes:
1. The color shading is performed within series of data, vertically. Thus, in the heat map we have generated, the color for any given tile is relative to the tile above and below it –IN THE SAME COLUMN – or in our case for a given ISO 2700X policy section.
2. If we transposed our original data set – risktical_cvs2 – and applied the same commands with the exception of replacing BU.Name with Policy in our initial ggplot command (step 7), you should get a heat map that looks like the one below.
3. In this heat map, we can quickly determine key areas of exposure for all 36 of our fictional business units relative to ISO 2700X. For example, most of BU3’s exposure is related to Compliance, followed by Organizational Security Policy and Access Control. If the executive in that business unit wanted more granular information in terms of dollar value exposure, we could share that information with them.
So there you have it! A quick R tutorial on developing a cluster heat map for information risk management purposes. I look forward to learning more about R and leveraging it to analyze and visualize data in unique and thought-provoking ways. As always, feel free to leave comments!