The importance of good data visualization is often neglected when planning experiments and timelines. This is a costly oversight, as the graphics are what tend to attract the most attention when we present our results. Unfortunately, good data visualization is time-consuming. Unless there is a hired data analyst or graphic designer (which is often not the case), we'll have to do our data visualization ourselves. This might seem a daunting endeavor, given that the guidelines for what makes a "good" scientific graphic are fuzzy or close to nonexistent. Color palettes, font faces, the distribution of elements, line thicknesses, and opacities all affect the overall perception of our work. This is on top of selecting the right plot type and the right framework to work in. As such, it is a good idea to review and practice some of the most current trends and technologies in the subject.
Data visualization is, as the name suggests, all about the data. As such, it is natural to think about it as part of a data-analysis workflow. At first glance it might seem that generating "nice plots" is only important in the final stages of an analysis, but the clever use of visual cues can save a lot of time when performing experiments and debugging code.
In practice, we visualize data for several reasons:
- To explore/understand experiments
- To validate results
- To raise interest
- To communicate concepts
- To get funding
There are several visual cues that help viewers interpret plots. The most common ones are:
- Position: The coordinates of points are usually related to the variables shown on the plot's axes.
- Length: One of the easiest cues to decode, length is usually interpreted as magnitude.
- Direction: When used in time series, it relates to increases/decreases of quantities.
- Angle: Generally mapped to rate of change in vector representations.
- Shape: Can be effectively used to differentiate between categorical datasets, or to emphasize data points.
- Area: Used to encode magnitude, this variable can be an effective, albeit tricky, method to show an extra dimension of our data in a two-dimensional setting.
- Volume: Although it is hard to use effectively, it can be used in analogous ways to area.
- Color:
  - Hue: When used correctly in a color palette, it can be a very effective way to transmit either magnitude, or difference between datasets.
  - Saturation/Transparency: The density of the color can be mapped to the magnitude of the dependent variable.
There are always tradeoffs to take into account when generating graphics for scientific or engineering contexts, and depending on our application we should be able to weigh which elements are more important for our specific purposes.
The tradeoff between time and quality is obvious to anyone who has ever worked with a deadline: the more time we spend working on a graphic, the better it will look. However, we do not always have the luxury of spending a week making one plot as aesthetically pleasing as possible. This is why practicing data visualization is generally a good idea. Through failure and experimentation, we iteratively find better ways to be time-efficient.
Increasing the level of abstraction usually improves the aesthetics of a visualization. This can be achieved by removing labels and other elements that might interfere with the graphics, but it comes at the cost of information.
Whilst it is often good practice to make our visualizations as clean and "publication-ready" as possible, preparing a good graphic is time-consuming. If we are only interested in exploration, we might not need to devote much time to labeling axes correctly or choosing the appropriate font size, unless doing so makes the exploration easier.
A less obvious distinction than the one between exploration and communication, but still an important one, is that between plots made to detect large trends in a dataset (in an initial stage of experimentation) and plots specifically designed to analyze a clearly identified phenomenon.
Dynamic content has the advantage of being manipulable and interactive, which often results in a better exploratory understanding of a concept. This, however, comes at the cost of media compatibility, and sometimes of graphics quality.
Although they are less often thought about, there are still some constraints in print media that are worth considering. Some readers still prefer to read papers in printed format, so font sizes, colors, and shapes should ideally remain readable even at lower resolutions.
This will become more apparent when we look into image formats, but the amount of information we put into our graphics is related to their file size (this is particularly important for raster-based file types). As such, we have to take into account the limits set by publishers and the computational cost of rendering our graphics.
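As a rough illustration of the file-size point, the sketch below exports the same plot as a raster image at two resolutions and as a vector file, then reads back the size of each file. The random points and the file names are placeholders chosen for this example only; the resulting sizes depend on the data, the plot and the export settings.

```wolfram
(* Synthetic points, used only to give the plot enough elements to compare file sizes *)
plot = ListPlot[RandomReal[1, {5000, 2}]];

(* Hypothetical output paths in the system's temporary directory *)
pngFile = FileNameJoin[{$TemporaryDirectory, "tradeoff.png"}];
pdfFile = FileNameJoin[{$TemporaryDirectory, "tradeoff.pdf"}];

(* Raster export at a low and at a high resolution (in dpi) *)
Export[pngFile, plot, ImageResolution -> 72];
lowRes = FileByteCount[pngFile];
Export[pngFile, plot, ImageResolution -> 300];
highRes = FileByteCount[pngFile];

(* Vector export of the same plot *)
Export[pdfFile, plot];
vector = FileByteCount[pdfFile];

(* File sizes in bytes: {72 dpi PNG, 300 dpi PNG, PDF} *)
{lowRes, highRes, vector}
```

Comparing the three numbers for a plot of our own is a quick way to check whether it fits a publisher's size limits before submission.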
To illustrate the usual steps taken to generate an effective visualization, we will follow an example inspired by this post, in which NFL players' heights and weights are plotted to compare their sizes across positions.
Note: The code to do these plots is available in this exercise.
The NFL Players Dataset can be downloaded in CSV format. After a quick inspection we can see that it does contain the information we need to create the visualization.
Age,Birth Place,Birthday,College,Current Status,Current Team,Experience,Height (inches),High School,High School Location,Name,Number,Player Id,Position,Weight (lbs),Years Played
,"Grand Rapids , MI",5/23/1921,Notre Dame,Retired,,3 Seasons,71,,,"Evans, Fred",,fredevans/2513736,,185,1946 - 1948
,"Dayton , OH",12/21/1930,Dayton,Retired,,1 Season,70,,,"Raiff, Jim",,jimraiff/2523700,,235,1954 - 1954
56,"Temple , TX",9/11/1960,Louisiana Tech,Retired,,1 Season,74,,,"Fowler, Bobby",,bobbyfowler/2514295,,230,1985 - 1985
30,"New Orleans , LA",9/30/1986,LSU,Retired,,5 Seasons,73,,,"Johnson, Quinn",,quinnjohnson/79593,,255,2009 - 2013
We must take into account, though, that some of the players' positions are missing (denoted by an empty string in their position field).
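SetDirectory[NotebookDirectory[]];
rawData = Import["NFL.csv"];
(* Split the header row from the player records *)
{header, data} = {First[rawData], Rest[rawData]};
(* Index of the "Position" column, looked up by name in the header *)
position = First@FirstPosition[header, "Position"];
(* Sorted list of distinct position labels; "" marks a missing position *)
positionsID = Sort[DeleteDuplicates[data[[All, position]]]]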
out={"","C","CB","DB","DE","DL","DT","FB","FS","G","ILB","K","LB","LS","MLB","NT","OG","OL","OLB","OT","P","QB","RB",
"SAF","SS","T","TE","WR"}
After doing so, we can generate an initial draft of our plot.
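A minimal sketch of how such a draft could be produced is shown here. `groupByPosition` is a hypothetical helper written for this example, not the code from the linked exercise; it assumes the column names of the CSV header shown above, drops players with a missing position or non-numeric measurements, and groups the remaining {height, weight} pairs by position. The choice of height on the horizontal axis is also an assumption.

```wolfram
(* Hypothetical helper: position -> list of {height, weight} pairs *)
groupByPosition[header_List, rows_List] := Module[{col, pos, h, w, clean},
  (* Look up column indices by name instead of hard-coding them *)
  col[name_] := First@FirstPosition[header, name];
  {pos, h, w} = col /@ {"Position", "Height (inches)", "Weight (lbs)"};
  (* Keep rows with a known position and numeric height and weight *)
  clean = Select[rows,
    #[[pos]] =!= "" && NumericQ[#[[h]]] && NumericQ[#[[w]]] &];
  (* Group the measurement pairs by position label, sorted by label *)
  KeySort@GroupBy[clean, (#[[pos]] &) -> (#[[{h, w}]] &)]
];
```

With the `header` and `data` variables from the import step, a first draft with default styling could then be drawn as:

```wolfram
byPosition = groupByPosition[header, data];
ListPlot[Values[byPosition],
 Frame -> True,
 FrameLabel -> {"Height (inches)", "Weight (lbs)"}]
```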
Even though it's not the prettiest representation of our data, this scatter plot might be a useful way to present the results if we work on it.
In this case, highlighting the position of some players might make it easier to read due to the high level of overlapping between clusters.
We can improve the aesthetics a bit by changing the color palette, and increasing the dot size.
After doing so, we add the color swatch, and iterate through previous steps if we need to.
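The styling steps above could look roughly like the following sketch. `styledScatter` is a hypothetical helper for this example; the palette (`ColorData[97]`), point size and opacity are arbitrary starting values rather than the ones used in the original figures, and they are the parameters we would adjust while iterating.

```wolfram
(* Hypothetical helper: styled scatter plot from position -> {{height, weight}, ...} *)
styledScatter[groups_Association] := Module[{pts, colors},
  (* Keep only numeric pairs so that empty CSV fields are skipped *)
  pts = KeySort[Cases[{_?NumericQ, _?NumericQ}] /@ groups];
  (* One color per position, taken from the built-in indexed palette 97 *)
  colors = ColorData[97] /@ Range[Length[pts]];
  ListPlot[Values[pts],
   (* Larger, semi-transparent dots make overlapping clusters easier to read *)
   PlotStyle -> (Directive[PointSize[0.012], Opacity[0.6], #] & /@ colors),
   (* Color swatch that maps each color to its position label *)
   PlotLegends -> Keys[pts],
   Frame -> True,
   FrameLabel -> {"Height (inches)", "Weight (lbs)"}]
];
```

Using the `data` table imported earlier, where 14, 8 and 15 are the indices of the Position, Height (inches) and Weight (lbs) columns in the header shown above:

```wolfram
known = Select[data, #[[14]] =!= "" &];
styledScatter[GroupBy[known, (#[[14]] &) -> (#[[{8, 15}]] &)]]
```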
Finally, we can compile our plots and put them together in a way that favors readability of the data.
Data visualization is also a tool to aid our narratives. Whether it is a scientific paper, a lecture, a talk, or an internal report, there is always an audience we want to convince and a story we want to tell.