RStudio-GitHub-Analysis

Contributors: Juno Chen, Ian Flores, Rayce Rossum, Richie Zitomer

Project Mentor: Dr. Tiffany Timbers

Project Partner: Dr. Greg Wilson

Overview

This project aims to understand how people are currently using GitHub, with the eventual goal of building an easy-to-use alternative to Git.

This project includes the ability to cluster similar GitHub projects and pick out their most commonly-occuring subgraphs.

Motivation behind this project: http://third-bit.com/2017/09/30/git-graphs-and-engineering.html

Useful documents

Installation instructions

First, to get credentials file neccessary for pulling the GitHub Torrent from Google Cloud (necessary for re-generating images for our analysis):

Follow the instructions under 'Set up a service account' to create and download a credentials file: https://cloud.google.com/video-intelligence/docs/common/auth
Change the name of the file to credentials_file.json and put it in the root directory of the project (a sample file with the name credentials_file_EXAMPLE.json is included as a reference).

Usage

Run the following commands to reproduce this analysis:

snakemake get_ght_data # Downloads GH Torrent data from figshare. Be aware that the file is quite large, and downloading can take 1-2 hours.

snakemake run_analysis # Run our pipeline; generate embeddings, clusters, tsne graph, motif report, etc.

snakemake generate_images # Generate images of our most important findings.

To change parameters from the command line, simply put --config param=value after your snakemake call. For a full list of configurable parameters, see the config.json file in the root directory of this project. For example, if you wanted to run the analysis with 5 workers instead of the default, run:

snakemake run_analysis --config n_workers=5

Config Parameters

Short Name	Long Name	Description	Default	Type
-rp	--results_path	The folder to output results of the analysis. e.g. embeddings and plots	./results/	String
-nw	--n_workers	The number of workers to use when running the analysis.	1	int
-dp	--data_path	The path to the commits.feather file. e.g. /home/user/RStudio-Data-Repository/clean_data/commits_by_org.feather	./data/commits_by_org.feather	String
-np	--n_projects	The number of projects to sample from the dataset.	1000	int
-mc	--min_commits	The minimum number of commits for a project to be included in the sample.	None	none_or_int
-mcount	--min_count	The min_count parameter for the graph2vec model.	5	int
-nps	--n_personas	The number of personas to extract from each cluster.	5	int
-nn	--n_neurons	The number of neurons to use for Graph2Vec (project level)	128	int
-ni	--n_iter	The number of iteration to use to run the WeisfeilerLehmanMachine	10	int
-rs	--random_state	The random state to initalize all random states.	1	int

Data Repositories

RStudio-Data-Repository

Figshare Upload

Docker

To run Docker you have to run:

docker build --tag rstudio:1.0.0 .
docker run -it -v $(pwd):/rstudio_analysis rstudio:1.0.0 /bin/bash

Once inside the container you run:

cd rstudio_analysis
snakemake get_ght_data
snakemake run_analysis
snakemake generate_images

Software and Dependencies

MulticoreTSNE==0.1
pandas-gbq==0.10.0
panel==0.6.0
networkx==2.3
joblib==0.12.3
gensim==3.7.1
tqdm==4.26.0
pyviz-comms==0.7.2
snakemake=5.5.2

Name		Name	Last commit message	Last commit date
Latest commit History 315 Commits
data		data
docs		docs
github_analysis.egg-info		github_analysis.egg-info
results		results
src/github_analysis		src/github_analysis
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.json		config.json
credentials_file_EXAMPLE.json		credentials_file_EXAMPLE.json
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RStudio-GitHub-Analysis

Overview

Useful documents

Installation instructions

Usage

Config Parameters

Data Repositories

Docker

Software and Dependencies

About

Releases 4

Packages

Contributors 6

Languages

License

UBC-MDS/RStudio-GitHub-Analysis

Folders and files

Latest commit

History

Repository files navigation

RStudio-GitHub-Analysis

Overview

Useful documents

Installation instructions

Usage

Config Parameters

Data Repositories

Docker

Software and Dependencies

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 6

Languages

Packages