GH action to generate report #199
Draft: asmacdo wants to merge 96 commits into main from cron-data-usage-report
Changes from all commits (96 commits, all authored by asmacdo):
30e9cbf  Inital commit to add GH action to generate report
3bcba91  Assume Jupyterhub Provisioning Role
b5cdcf3  Fixup: indent
6e118cc  Rename job
5062f08  Add assumed role to update-kubeconfig
d21a3a9  No need to add ProvisioningRole to masters
403028f  Deploy a pod to the cluster, and schedule with Karpenter
92b9925  Fixup: correct path to pod manifest
478a31f  Fixup again ugh, rename file
9db914e  Delete Pod even if previous step times out
8458d01  Hack out initial du
7999455  tmp comment out job deployment, test dockerhub build
d2e65de  Fixup hyphens for image name
5e9e7df  Write file to output location
d33973c  use kubectl cp to retrieve report
98fecbc  Combine run blocks to use vars
40ae0e8  Mount efs and pass arg to du script
4c978f7  Comment out repo pushing, lets see if the report runs
6bd7b82  Restrict job to asmacdo for testing
73c3e80  Sanity check. Just list the directories
685dfb1  Job was deployed, but never assigned to node, back to sanity check
f6afefc  change from job to pod
6dad759  deploy pod to same namespace as pvc
3a33937  Use ns in action
1ffb1c9  increase timeout to 60s
58e0753  fixup: image name in manifest
6767755  increase timeout to 150
cbf951e  override entrypoint so i can debug with exec
59eb045  bound /home actually meant path was /home/home/asmacdo
db140d5  Create output dir prior to writing report
f90176a  pod back to job
c31ccdd  Fixup use the correct job api
3ee9d9f  Add namespace to pod retrieval
d7f81ba  write directly to pv to test job
0856baa  fixup script fstring
5301b1b  no retry on failure, we were spinning up 5 pods, lets just fail 1 time
7384274  Fixup backup limit job not template
8e81e38  Initial report
cb5db49  disable report
5d188a7  deploy ec2 instance directly
2f39e9c  Update AMI image
3a21106  update sg and subnet
6a54da0  terminate even if job fails
87075fb  debug: print public ip
48c7f35  explicitly allocate public ip for ec2 instance
743359e  Add WIP scripts
0ba12f2  rm old unused
2893ab2  initial commit of scripts
5ef8f80  clean up launch script
b02720e  make scripe executable
ae98909  fixup cleanup script
7e80e4a  add a name to elastic ip (for easier manual cleanup)
f2a4116  Exit on fail
6ffef17  Add permission for aws ec2 wait instance-status-ok
20cc085  Upload scripts to instance
76477df  explicitly return
b38ded1  output session variables to file
f795570  modify cleanup script to retrieve instance from temporary file
f8a92b2  All ec2 persmissions granted
e9726df  Add EFS mount (hardcoded)
c6e92f9  No pager for termination
17d77cd  force pseudo-terminal, otherwise hangs after yum install
2246af5  Add doublequotes to variable usage for proper expansion
b49b7b5  Fixup -t goes on ssh, not scp
584ac4d  Mount as a single command, since we dont have access to pty
4a700e5  add todos for manual steps
6339924  Disable job for now
17130ef  Update AMI to ubuntu
cc29df5  Roll back to AL 2023
295361c  drop gzip, just write json
a667c04  include target dir in relative paths
a91beb0  Second script will not produce user report, but directory stats json
9371982  inital algorithm hackout
8cead5a  Clean up and refactor for simplicity
86a7c72  Add basic tests
fc1cab1  test multiple directories in root
2308aed  comment about [:-1]
84754fe  support abspaths
a1427ac  [DATALAD RUNCMD] blacken
16e4890  test propagation with files in all dirs
528833d  Write files to disk as they are inspected
3c0e7f7  Comment out column headers in output
260c69d  Write all fields for every file
87dd8ca  Convert to reading tsv
e0e0a32  Fixup: update test to match tsv-read data
41aaa2a  update for renamed script
25e27eb  install pip
204b70e  install parallel
64d653e  install dependencies in launch script
6475f11  Output to tmp, accept only 1 arg, target dir
b67c063  add up sizes
3241473  print useful info as index is created
f4eb101  dont fail if output dir exists
13e0e75  Create a report dict with only relevant stats
a7e6991  output data reports
845df00  Remove unused
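Read in sequence, the commits converge on the following flow: launch an EC2 instance with an Elastic IP, mount the EFS volume, build one <username>-index.tsv per home directory, run the statistics script over each index (GNU parallel is installed by one of the launch-script commits), and terminate the instance even on failure. A rough sketch of that loop follows; every script name and path here is an assumption for illustration, since only the cleanup script and the stats script appear in this diff:

#!/usr/bin/env bash
# Hypothetical end-to-end outline; launch-ec2.sh, index-home.sh,
# generate_report.py, and the /mnt/efs and /tmp paths are illustrative
# names, not part of this PR.
set -e

./launch-ec2.sh    # allocate Elastic IP, boot instance, write .ec2-session.env

# On the instance: index each user home, then build one report per index
mkdir -p /tmp/du-reports
for home in /mnt/efs/home/*/; do
    user=$(basename "$home")
    ./index-home.sh "$home" > "/tmp/du-reports/${user}-index.tsv"
done
# Run from the output dir: the stats script derives the username from
# the TSV filename prefix, so bare filenames must be passed.
cd /tmp/du-reports
parallel python3 /opt/scripts/generate_report.py ::: ./*-index.tsv

./cleanup-ec2.sh   # terminate instance, release Elastic IP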
@@ -0,0 +1,193 @@
#!/usr/bin/env python3

import os
import csv
import json
import sys
import unittest
from collections import defaultdict
from pathlib import Path
from pprint import pprint
from typing import Iterable


def propagate_dir(stats, current_parent, previous_parent):
    assert os.path.isabs(current_parent) == os.path.isabs(
        previous_parent
    ), "current_parent and previous_parent must both be abspath or both be relpath"
    highest_common = os.path.commonpath([current_parent, previous_parent])
    assert highest_common, "highest_common must either be a target directory or /"

    path_to_propagate = os.path.relpath(previous_parent, highest_common)
    # leaves off last to avoid propagating to the path we are propagating from
    nested_dir_list = path_to_propagate.split(os.sep)[:-1]
    # Add each dir count to all ancestors up to highest common dir
    while nested_dir_list:
        working_dir = os.path.join(highest_common, *nested_dir_list)
        stats[working_dir]["file_count"] += stats[previous_parent]["file_count"]
        stats[working_dir]["total_size"] += stats[previous_parent]["total_size"]
        nested_dir_list.pop()
        previous_parent = working_dir
    stats[highest_common]["file_count"] += stats[previous_parent]["file_count"]
    stats[highest_common]["total_size"] += stats[previous_parent]["total_size"]


def generate_directory_statistics(data: Iterable[str]):
    # Assumes dirs are listed depth first (files are listed prior to directories)
    stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
    previous_parent = ""
    for filepath, size, modified, created, error in data:
        # TODO if error is not None:
        this_parent = os.path.dirname(filepath)
        stats[this_parent]["file_count"] += 1
        stats[this_parent]["total_size"] += int(size)

        if previous_parent == this_parent:
            continue
        # going deeper
        elif not previous_parent or previous_parent == os.path.dirname(this_parent):
            previous_parent = this_parent
            continue
        else:  # previous dir done, roll its totals up into its ancestors
            propagate_dir(stats, this_parent, previous_parent)
            previous_parent = this_parent

    # Run a final time with the root directory as this parent
    # During final run, leading dir cannot be empty string, propagate_dir requires
    # both to be abspath or both to be relpath
    leading_dir = previous_parent.split(os.sep)[0] or "/"
    propagate_dir(stats, leading_dir, previous_parent)
    return stats


def iter_file_metadata(file_path):
    """
    Reads a tsv and returns an iterable that yields one row of file metadata at
    a time, excluding comments.
    """
    file_path = Path(file_path)
    with file_path.open(mode="r", newline="", encoding="utf-8") as file:
        reader = csv.reader(file, delimiter="\t")
        for row in reader:
            # Skip empty lines or lines starting with '#'
            if not row or row[0].startswith("#"):
                continue
            yield row


def update_stats(stats, directory, stat):
    stats["total_size"] += stat["total_size"]
    stats["file_count"] += stat["file_count"]

    # Caches track directories, but not report as a whole
    if stats.get("directories") is not None:
        stats["directories"].append(directory)


def main():
    if len(sys.argv) != 2:
        print("Usage: python script.py <input_tsv_file>")
        sys.exit(1)

    input_tsv_file = sys.argv[1]
    username = input_tsv_file.split("-index.tsv")[0]

    data = iter_file_metadata(input_tsv_file)
    stats = generate_directory_statistics(data)
    cache_types = ["pycache", "user_cache", "yarn_cache", "pip_cache", "nwb_cache"]
    report_stats = {
        "total_size": 0,
        "file_count": 0,
        "caches": {
            cache_type: {"total_size": 0, "file_count": 0, "directories": []}
            for cache_type in cache_types
        },
    }

    # print(f"{directory}: File count: {stat['file_count']}, Total Size: {stat['total_size']}")
    for directory, stat in stats.items():
        if directory.endswith("__pycache__"):
            update_stats(report_stats["caches"]["pycache"], directory, stat)
        elif directory.endswith(f"{username}/.cache"):
            update_stats(report_stats["caches"]["user_cache"], directory, stat)
        elif directory.endswith(".cache/yarn"):
            update_stats(report_stats["caches"]["yarn_cache"], directory, stat)
        elif directory.endswith(".cache/pip"):
            update_stats(report_stats["caches"]["pip_cache"], directory, stat)
        elif directory == username:
            update_stats(report_stats, username, stat)

    OUTPUT_DIR = "/home/austin/hub-user-reports/"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(f"{OUTPUT_DIR}{username}-report.json", "w") as out:
        json.dump(report_stats, out)

    # Largest directories first; currently unused, handy when debugging
    sorted_dirs = sorted(stats.items(), key=lambda x: x[1]["total_size"], reverse=True)
    print(f"Finished {username} with Total {report_stats['total_size']}")


class TestDirectoryStatistics(unittest.TestCase):
    def test_propagate_dir(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["a/b/c"] = {"total_size": 100, "file_count": 3}
        stats["a/b"] = {"total_size": 10, "file_count": 0}
        stats["a"] = {"total_size": 1, "file_count": 0}

        propagate_dir(stats, "a", "a/b/c")
        self.assertEqual(stats["a"]["file_count"], 3)
        self.assertEqual(stats["a/b"]["file_count"], 3)
        self.assertEqual(stats["a"]["total_size"], 111)

    def test_propagate_dir_abs_path(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["/a/b/c"] = {"total_size": 0, "file_count": 3}
        stats["/a/b"] = {"total_size": 0, "file_count": 0}
        stats["/a"] = {"total_size": 0, "file_count": 0}

        propagate_dir(stats, "/a", "/a/b/c")
        self.assertEqual(stats["/a"]["file_count"], 3)
        self.assertEqual(stats["/a/b"]["file_count"], 3)

    def test_propagate_dir_files_in_all(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["a/b/c"] = {"total_size": 0, "file_count": 3}
        stats["a/b"] = {"total_size": 0, "file_count": 2}
        stats["a"] = {"total_size": 0, "file_count": 1}

        propagate_dir(stats, "a", "a/b/c")
        self.assertEqual(stats["a"]["file_count"], 6)
        self.assertEqual(stats["a/b"]["file_count"], 5)

    def test_generate_directory_statistics(self):
        sample_data = [
            ("a/b/file3.txt", 3456, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/file1.txt", 1234, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/file2.txt", 2345, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/d/file4.txt", 4567, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/file3.txt", 5678, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/file1.txt", 6789, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/file2.txt", 7890, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/g/file4.txt", 8901, "2024-12-01", "2024-12-02", "OK"),
        ]
        stats = generate_directory_statistics(sample_data)
        self.assertEqual(stats["a/b/c/d"]["file_count"], 1)
        self.assertEqual(stats["a/b/c"]["file_count"], 3)
        self.assertEqual(stats["a/b"]["file_count"], 4)
        self.assertEqual(stats["a/e/f/g"]["file_count"], 1)
        self.assertEqual(stats["a/e/f"]["file_count"], 3)
        self.assertEqual(stats["a/e"]["file_count"], 4)
        self.assertEqual(stats["a"]["file_count"], 8)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        unittest.main(
            argv=sys.argv[:1]
        )  # Run tests if "test" is provided as an argument
    else:
        try:
            main()
        except Exception as e:
            # Swallow per-user failures so a batch run over many index files
            # keeps going; uncomment below to debug a single input.
            # print(f"FAILED ------------------------------ {sys.argv[1]}")
            # raise(e)
            pass
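A quick smoke test of the script as committed; a minimal sketch, assuming it is saved as generate_report.py (file names are not visible in this view) and that the index TSV columns are path, size, modified, created, error, matching the tuple unpacking in generate_directory_statistics:

# Run the built-in unittest suite
python3 generate_report.py test

# Build a two-file depth-first index for a user "alice" and generate her report
printf 'alice/data/file1.bin\t1234\t2024-12-01\t2024-12-02\tOK\n'   >  alice-index.tsv
printf 'alice/.cache/pip/w.whl\t2048\t2024-12-01\t2024-12-02\tOK\n' >> alice-index.tsv
python3 generate_report.py alice-index.tsv
# If the hardcoded /home/austin/hub-user-reports/ is writable, this writes
# alice-report.json there and prints "Finished alice with Total 3282";
# otherwise the bare except in __main__ swallows the error silently.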
@@ -0,0 +1,63 @@
#!/usr/bin/env bash

set -e

# Load environment variables from the file if they are not already set
ENV_FILE=".ec2-session.env"

if [ -f "$ENV_FILE" ]; then
    echo "Loading environment variables from $ENV_FILE..."
    source "$ENV_FILE"
else
    echo "Warning: Environment file $ENV_FILE not found."
fi

# Ensure required environment variables are set
if [ -z "$INSTANCE_ID" ]; then
    echo "Error: INSTANCE_ID is not set. Cannot proceed with cleanup."
    exit 1
fi

if [ -z "$ALLOC_ID" ]; then
    echo "Error: ALLOC_ID is not set. Cannot proceed with cleanup."
    exit 1
fi

# Check for AWS CLI and credentials
if ! command -v aws &>/dev/null; then
    echo "Error: AWS CLI is not installed. Please install it and configure your credentials."
    exit 1
fi

if ! aws sts get-caller-identity &>/dev/null; then
    echo "Error: Unable to access AWS. Ensure your credentials are configured correctly."
    exit 1
fi

# Terminate EC2 instance
echo "Terminating EC2 instance with ID: $INSTANCE_ID..."
if aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --no-cli-pager; then
    echo "Instance termination initiated. Waiting for the instance to terminate..."
    if aws ec2 wait instance-terminated --instance-ids "$INSTANCE_ID"; then
        echo "Instance $INSTANCE_ID has been successfully terminated."
    else
        echo "Warning: Instance $INSTANCE_ID may not have terminated correctly."
    fi
else
    echo "Warning: Failed to terminate instance $INSTANCE_ID. It may already be terminated."
fi

# Release Elastic IP
echo "Releasing Elastic IP with Allocation ID: $ALLOC_ID..."
if aws ec2 release-address --allocation-id "$ALLOC_ID"; then
    echo "Elastic IP with Allocation ID $ALLOC_ID has been successfully released."
else
    echo "Warning: Failed to release Elastic IP with Allocation ID $ALLOC_ID. It may already be released."
fi

# Cleanup environment file
if [ -f "$ENV_FILE" ]; then
    echo "Removing environment file $ENV_FILE..."
    rm -f "$ENV_FILE"
fi

echo "Cleanup complete."