\Chapter{GENERAL RESEARCH WORKFLOW}\label{sec:Theme1}
As stated in \autoref{sec:Introduction}, the goal of our research project is to create an expertise model capable of detecting maintainers in a given subsystem. Although our model could be applied to other projects, we focused our evaluation on the Linux Kernel.
Our expertise model takes into account many different activities that are present not only in the Linux contribution process, but also in other large-scale software engineering projects. We translate these activities into metrics in order to quantify them. To give back to the Linux community, we made those metrics readily available through two open-source tools.
In this chapter, we describe the general workflow of the research. \autoref{sec:expertise_model} introduces the expertise model we created during our research project, whose approach and evaluation were submitted as a paper to the 15th International Conference on Mining Software Repositories\footnote{\url{https://2018.msrconf.org/}}. In \autoref{sec:srcmap} and \autoref{sec:email2git}, we introduce the tools we created and explain how their metrics helped with the creation of our expertise model.
\section{Srcmap}
\label{sec:srcmap}
To increase the visibility of the authors of the Linux Kernel, we built a data visualization tool that displays a wide array of information about the directories and files found in the Linux Git repository. This information includes code age and individual developer footprint. We wanted to display the following data points about each file and directory of the source code:
\begin{itemize}
\item \ac{LOC}
\item Median age of the \ac{LOC} within a file/directory
\item Number of lines of code modified since 2016
\item A list of the 20 developers with the most lines of code
\item A bar plot displaying the age distribution of the lines of code
\end{itemize}
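As a minimal sketch of how most of these data points could be derived, the following Python function summarizes one file from per-line authorship records. The input format (one \texttt{(author, date)} pair per surviving line), the function name \texttt{summarize\_lines}, and the cutoff of January 1, 2016 are assumptions made for illustration; they do not describe the actual Srcmap implementation, and the bar plot is left out.

```python
from collections import Counter
from datetime import date
from statistics import median


def summarize_lines(lines, today, hot_since=date(2016, 1, 1), top_n=20):
    """Summarize one file from per-line authorship records.

    `lines` is a hypothetical input: a list of (author, last_modified_date)
    pairs, one per line currently present in the file.
    """
    # Age of each surviving line, in days, relative to `today`.
    ages = [(today - modified).days for _, modified in lines]
    # Number of surviving lines attributed to each author.
    by_author = Counter(author for author, _ in lines)
    return {
        "loc": len(lines),
        "median_age_days": median(ages) if ages else None,
        # Lines last modified on or after the (assumed) 2016 cutoff.
        "lines_since_2016": sum(1 for _, m in lines if m >= hot_since),
        # Authors ranked by number of surviving lines.
        "top_authors": by_author.most_common(top_n),
    }
```

The same summary could be produced for a directory by concatenating the records of all files it contains before calling the function.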
We needed an interface that would allow the user to navigate the different files and directories of Linux while displaying our list of data points, which is why we chose to base the tool on a treemap. Treemaps, described by~\citet{Bederson-2002} as a solution for displaying large hierarchical datasets on a two-dimensional plane, were a great fit for the structure of the Linux directory tree.
This tool served two purposes. First, because it was perceived as a contribution to the Linux community, it earned us many contacts and a good deal of goodwill. Second, building it gave us a greater understanding of the contribution process and led us to discover the concept of decreasing \ac{LOC} footprint, as explained in \autoref{sec:lessons_srcmap}.
\subsection{Srcmap 1.0}
In the first version of Srcmap\footnote{\url{http://mcis.polymtl.ca/~courouble/linux.html}}, we used the Google Charts treemap implementation\footnote{\url{https://developers.google.com/chart/interactive/docs/gallery/treemap}}. This easy-to-use library allowed us to create a quick proof of concept.
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{srcmap1}
\caption{First version of Srcmap}
\label{fig:srcmap1}
\end{figure}
\autoref{fig:srcmap1} shows the first version of Srcmap. Each box represents a subdirectory of the Linux Kernel. The colors within a box give a preview of its content, that is, the files and subdirectories contained in that directory. In this version of the tool, the color identifies the developer who contributed the most lines of code to the contained files. Furthermore, the size of each box is proportional to the number of lines of code in the file or directory it represents. The panel on the right of the screen and the tooltip contain most of the data: the exact number of lines of code, the age of the first and last lines of code added, and the top 20 authors with their percentage of lines of code contributed.
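The aggregation behind such a view can be sketched as follows. This is an illustrative example under stated assumptions, not the code used in Srcmap: the input dictionary, the node layout, and the helper names \texttt{build\_tree} and \texttt{top\_author} are hypothetical. Each directory node accumulates the lines of code of everything below it (the box size) and the per-author line counts (from which the dominant author, and hence the color, can be chosen).

```python
from collections import Counter


def _new_node(name):
    # One treemap node: accumulated size, per-author line counts, children.
    return {"name": name, "loc": 0, "authors": Counter(), "children": {}}


def build_tree(file_stats):
    """Build a nested directory tree from per-file statistics.

    `file_stats` is a hypothetical input mapping a path such as
    "net/ipv4/tcp.c" to a pair (loc, {author: lines_contributed}).
    """
    root = _new_node("")
    for path, (loc, authors) in file_stats.items():
        node = root
        node["loc"] += loc
        node["authors"].update(authors)
        # Walk down the path, creating nodes as needed and adding this
        # file's numbers to every ancestor directory and to the file itself.
        for part in path.split("/"):
            node = node["children"].setdefault(part, _new_node(part))
            node["loc"] += loc
            node["authors"].update(authors)
    return root


def top_author(node):
    """Return the author with the most lines under `node`, or None."""
    ranked = node["authors"].most_common(1)
    return ranked[0][0] if ranked else None
```

Accumulating the counts at every level while walking each path avoids a second pass over the tree, at the cost of storing a counter per directory.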
\subsection{Srcmap 2.0}
We encountered scalability issues with the first implementation after adding data to the treemap.
After some research, we discovered a new treemap implementation\footnote{\url{https://carrotsearch.com/foamtree/}} capable of handling large datasets and deeply nested structures, which we used for the second version of the tool\footnote{\url{http://mcis.polymtl.ca/~courouble/dev/}}. This new version, shown in \autoref{fig:srcmap2}, introduced three important features:
\begin{itemize}
\item Coloring the files and directories according to one of three metrics:
\begin{itemize}
\item \ac{LOC}
\item Median age
\item Number of commits since 2016, or ``hot files''
\end{itemize}
\item File search
\item A plot displaying the \textit{age} distribution of \ac{LOC} present in the file/directory.
\end{itemize}
\begin{figure}[htb]
\centering
\includegraphics[width=5in]{srcmap2}
\caption{Second version of Srcmap}
\label{fig:srcmap2}
\end{figure}
Although the FoamTree library had a steeper learning curve than the Google Charts treemap library, its versatility allowed us to provide a much more pleasant user experience and easier access to the data.
The very first Srcmap prototype consisted of a treemap displaying only the \texttt{net} subdirectory, which was small enough to provide a smooth user experience. However, scalability issues started to arise when we attempted to display the entire Linux Kernel source code. The data would take up to 30 seconds to load, and navigating between nodes became very slow.
Furthermore, we discovered the limitations of Google Charts when we tried adding new features to the tool.
This is why we decided to create a new version of the tool with a different treemap library. The new library, FoamTree, provided a richer API and allowed for smooth browsing through deeply nested trees.
\subsection{Community Engagement}
After creating the first implementation of Srcmap, we attended LinuxCon 2016 to meet with a number of Linux developers and maintainers. The goal of these informal interviews was to receive early feedback on the tool. In addition, we traveled to Santa Fe, New Mexico to present Srcmap at the Linux Plumbers Conference. After discussing Srcmap and our research with these developers, we drew the following conclusions.
Firstly, experienced Linux developers and maintainers are accustomed to their own development workflow. According to our interviews, experienced developers have acquired \textit{muscle memory} from developing in the same editor and terminal over the years. Moreover, experienced maintainers were not interested in a web-based visualization tool, especially since the metrics displayed in Srcmap were accessible from the Linux Git repository.
However, the interviewed developers showed interest in combining Srcmap with work that was previously done by \cite{jiang14}: linking Linux Git commits to email patches and code reviews. Since we wished to provide a tool that would increase developers' productivity, this became our next goal. Access to the original patches and code reviews would save a lot of time for developers trying to understand an unfamiliar subsystem.
\subsection{Lessons learned}
\label{sec:lessons_srcmap}
We learned many important lessons during the creation of Srcmap. From a community perspective, we now appreciate the importance of understanding the needs and habits of the targeted community.
From a research point of view, the creation of Srcmap allowed us to understand an important concept in the creation of our expertise model. The data displayed in the tool represents the footprint of each developer in a given file or directory. As we updated Srcmap to a newer release of the Linux Kernel, we noticed that the larger footprints were decreasing, which led us to discover the concept of decreasing footprint, an important aspect of our expertise model.
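To make the notion of a decreasing footprint concrete, the following sketch compares one developer's footprint across two releases. It assumes, for illustration only, that a footprint is measured as the developer's share of the surviving lines of code in a file or directory; the function names and the input dictionaries are hypothetical and are not taken from Srcmap or from our expertise model.

```python
def footprint_share(lines_by_author, author):
    """Share of surviving lines attributed to `author`, between 0 and 1.

    `lines_by_author` is a hypothetical mapping {author: surviving_lines}
    for one file or directory in one release.
    """
    total = sum(lines_by_author.values())
    return lines_by_author.get(author, 0) / total if total else 0.0


def footprint_change(old_release, new_release, author):
    """Change in footprint share between two releases.

    A negative value means the author's footprint has decreased.
    """
    return footprint_share(new_release, author) - footprint_share(old_release, author)
```

For example, a developer who owns 80 of 100 lines in one release and 60 of 120 lines in the next goes from a share of 0.8 to 0.5, even though other developers did not remove most of that code: the footprint erodes both through rewritten lines and through growth contributed by others.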
In conclusion, although Srcmap was not as successful as we originally hoped, we learned many important concepts that helped us in the rest of our research.
\section{Expertise Based Maintainer Recommendation Model}
\label{sec:expertise_model}
To address the claim made in our hypothesis, we created an expertise model based on multiple different activities and with a historical dimension. As stated in chapter 1, we discovered that maintainers' contribution frequencies decreased over time. This causes their \textit{LOC footprint} to slowly erode because of the contributions coming from other developers. Wondering whether this decrease in footprint was the result of maintainers reconsidering their involvement with the Linux community, we analyzed other metrics. Starting with the data collected for the creation of Email2git, we looked at the other aspects of large-scale software development. We discovered that as the Linux community becomes larger, maintainers are spending an increasing amount of time reviewing code changes submitted by other developers. With this data available, we were able to improve on state-of-the-art expertise models. Chapter 5 presents the paper in which we introduce and evaluate our expertise model.
\section{Email2git}
\label{sec:email2git}
The Linux contribution process has been a reliable way to pipe code contributions (patches) from developers around the world to the main Linux repository. With a working copy of the Linux Kernel on their computers, developers can modify the source code and, if desired, submit their changes for review, in the hope of having them integrated into the main tree. If the changes are accepted, the maintainer will \textit{commit} them to their local Git repository and submit them \textit{upstream} to another maintainer.
Although this system has been very reliable, it has one major drawback: once a patch is committed, it is difficult to find the email conversation that eventually led to it. We addressed this drawback by implementing the algorithm introduced by~\citet{jiang14}, which traces commits in the Linux Git repository back to their origin. The algorithm and the issues related to scalability are described in chapter 4.
The data generated by the algorithm consists of a list of commit-to-patch matches. The matches are accessible online through two interfaces: as a commit ID search through the Email2git interface\footnote{\url{http://mcis.polymtl.ca/~courouble/email2git/}}, or through the Cregit interface\footnote{\url{https://cregit.linuxsources.org/}}. Chapter 5 provides an in-depth description of our implementation of Email2git.