
Tracking progress from the Cloud Journey Template and displaying it in our app #12

Open · omenking opened this issue Jul 6, 2020 · 8 comments


omenking commented Jul 6, 2020

This is our Cloud Journey Template.
It needs to be forked by each participant.
https://github.com/100DaysOfCloud/100DaysOfCloud

We need a way to programmatically track people's progress when they update their forks.

  • Can we use GitHub Actions? I have seen that GitHub Learning Lab has a GitHub bot; investigate and get some ideas:
    https://lab.github.com/

  • Would the repo trigger an endpoint with a payload, i.e. a webhook into our system?

  • Would we have to check periodically, e.g. a CloudWatch Event > Lambda?

  • Can we store it in DynamoDB? What should the table structure look like?

We don't have to worry about displaying it in our app, just storing it in a DB that will be accessed by AWS Amplify.

what-name commented

Fork repo or use template - I'm 100% on the fork side.

  • Fork
    • pros: straightforward to keep track of who forked the repo; enables updates from upstream; would teach participants how forking on GitHub works, since they'll work with it in real life.
    • cons: a fork includes all commit history; this can be solved by creating a new repo, but it's honestly not a problem IMO.
  • Template
    • pros: does not include commit history, meh.
    • cons: very difficult to keep track of who is using it; no updates from upstream.

Keep track of activity

  • Options:
    • GitHub Action: security might be an issue, but if somebody starts abusing it we will notice, since it's all public. If it happens (and that's a big if), it can be handled manually at the time. A GitHub Actions workflow that pings our backend on every push to the repo is very easy to implement, and re-run job abuse can be mitigated with a simple cooldown period on our backend using timestamps in the database entries (see the sketch after this list). The con is that participants can technically modify the workflow themselves, but YAML can have comments, and a big "don't touch this" works marvels. The Action could simply send the username to our backend and that's it, because we know it only gets triggered when there is a push to the repo. The cooldown period can be set to ~3 hours (kept secret).
    • Manually scraping activity: very difficult to make work reliably; a lot of coding and even more scraping and parsing. A lot of work compared to the other option.
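
A minimal sketch of that cooldown idea on the backend, assuming the Action POSTs a JSON body containing the username to an API Gateway + Lambda endpoint; the table and field names here are made up for illustration:

import datetime
import json

import boto3

COOLDOWN = datetime.timedelta(hours=3)  # kept secret on the backend

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("GithubActivity")  # hypothetical table, partition key "username"

def handler(event, context):
  username = json.loads(event["body"])["username"]
  now = datetime.datetime.utcnow()

  # Look up the participant's last recorded push
  existing = table.get_item(Key={"username": username}).get("Item")
  if existing:
    last_seen = datetime.datetime.fromisoformat(existing["last_push"])
    if now - last_seen < COOLDOWN:
      # Re-run abuse: the ping arrived inside the cooldown window, ignore it
      return {"statusCode": 200, "body": "cooldown"}

  # Record the push with a fresh timestamp
  table.put_item(Item={"username": username, "last_push": now.isoformat()})
  return {"statusCode": 200, "body": "ok"}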

Backend database

  • SQL or NoSQL? There is no strict requirement to use DynamoDB. We could very well use Aurora, as it's supported by Amplify as well. Is NoSQL actually better for this?
  • Access pattern:
    • GitHub Action pings our backend when there is a commit. The only crucial data transmitted here is the GitHub username of the participant - a single string. A database entry should be that plus a timestamp, and that's it. If we need the total number of pushes (unlikely), we can just do a simple count over all entries.
    • Twitter scrape: How do we scrape the data from Twitter? It's important to store every entry only once. This can be done with a scheduled CloudWatch event and a well-written Lambda function that polls the Twitter API and stores new entries in the database (see the sketch after this list). Entry format: Twitter username and tweet timestamp; not much is needed besides these two IMO. Username as the "secondary key" and the timestamp to implement the cooldown.
    • Update frontend leaderboard can happen either
      • A scheduled CloudWatch event that updates the static website in the GitHub repo. A Lambda clones the repo, updates the static content of the leaderboard and new journeyers, and pushes back to the repo. A GitHub Action (or webhook) in turn deploys the new version via Amplify. A very dirty solution, but super cost effective and rather easy to implement. It also only needs to query the whole database (or the past X days) and create one source of truth every once in a while (every X hours maybe). The leaderboard is not time sensitive and doesn't need to reflect changes immediately.
      • Keep the leaderboard as up to date as possible by pulling it dynamically on every page load. This is much harder to implement on the backend, as it requires an API that recalculates that ground truth on every request, or stores the data ready to go - an API cache can of course be implemented in this case. SELECT * FROM tweets FILTER last week & sum based on username & order by sum of tweets -> weigh this against the GitHub commits and come to a final leaderboard order. This is a very expensive and computationally heavy query and doesn't necessarily offer any benefit over updating it every X hours.
      • Make a separate place to store a slightly older source of truth for the leaderboard and query that on page load. This is a middle ground between the two options above. A scheduled CW event would query the databases and create a source of truth every X hours, then store it in some format (maybe a json file in an S3 bucket🙃 honestly not a bad idea lol). On every page load this file would be served back dynamically. This would simplify the page request load and remove the need to couple the "query databases -> display fresh content on website" workflow (which is not great to begin with).
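
A sketch of the store-every-entry-only-once part, assuming a scheduled Lambda and a hypothetical Tweets table with the username as partition key and the tweet timestamp as sort key; a conditional put makes re-processing the same tweet a no-op. fetch_hashtag_tweets is a stand-in for whatever Twitter API client we end up using:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
tweets = dynamodb.Table("Tweets")  # hypothetical: partition key "username", sort key "tweeted_at"

def store_tweet(username, tweeted_at):
  try:
    # The condition makes the write idempotent: an entry that already
    # exists is left untouched, so every tweet is stored exactly once
    tweets.put_item(
      Item={"username": username, "tweeted_at": tweeted_at},
      ConditionExpression="attribute_not_exists(tweeted_at)",
    )
  except ClientError as e:
    if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
      raise  # only swallow the "already stored" case

def handler(event, context):
  # Triggered by the scheduled CloudWatch event
  for tweet in fetch_hashtag_tweets("#100DaysOfCloud"):  # stand-in helper
    store_tweet(tweet["username"], tweet["created_at"])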

Need to commit all this for now. Will add further info in a bit.


what-name commented Jul 22, 2020

Okay, I've been fighting so much with understanding Amplify that I propose we go Serverless instead. To give a little context: I've put a lot of work into understanding Amplify. I can totally see it being a great tool if you know its small (and very unintuitive) quirks, but since nobody on our team is at that place (I honestly don't even want to be), it would add a lot of time and unnecessary headache IMO. Most of us are pretty familiar with serverless, however, and since this is a Gatsby project, it works perfectly as well.

Architecture idea:

  • Frontend: Gatsby static site hosted in S3+CloudFront.
  • Leaderboard updates are easy to do since Gatsby uses separate "containers" for components of the website. The leaderboard component can be updated by a very simple Lambda function that pulls the data from the database and updates that file statically, or it can simply pull the finished numbers from DynamoDB through Lambda and an API Gateway endpoint. It would be REST, yes, but it works and it's fast to build [for now]. Another Lambda can look through the database periodically to create the up-to-date and ready-to-go leaderboard stats (a sketch follows this list).
  • We have easy and full control over the DynamoDB database. (I honestly have zero idea how to create a DynamoDB database with a proper setup and "connectors" in Amplify, after a week of messing around). For data transform/update inside the database we can use scheduled Lambda functions as mentioned above. - Among other things.
  • We don't need Cognito. All user data can be stored perfectly in DynamoDB, and authentication will happen through GitHub OAuth anyway.
  • The input form for manual entries is a very simple API Gateway/Lambda integration. We can go with this setup in the beginning and move to GraphQL later if deemed necessary (I have no knowledge of how GraphQL works yet either, but REST I do).
  • The OAuth login workflow with Serverless has been done quite a lot. There's this post for example (bit off), which was the very first search result - there's info on it, unlike Amplify, for which we're still waiting on that PR.
  • Teamwork and dev-staging environments are easy to do with SAM [and CloudFormation]. I could not find a single, even remotely viable solution for how to do this with Amplify. CI/CD can be done via GitHub Actions or CodePipeline, including unit tests.
  • We have so much more wiggle room on scraping or automatically keeping track of participants' activity with Lambdas. I know this could have been simply added next to the Amplify deployment but keeping everything together in a single stack just feels better and easier. Also this would consolidate all our deployment and testing workflows.
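
A sketch of that periodic leaderboard Lambda, combining it with the JSON-in-S3 idea from earlier in the thread; the table, bucket, and attribute names are illustrative, and pagination of the scan is skipped for brevity:

import json
from collections import Counter

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
activity = dynamodb.Table("Activity")  # hypothetical table of per-day activity entries

def handler(event, context):
  # Count activity entries per participant; a scan is fine at this scale
  counts = Counter()
  for item in activity.scan()["Items"]:
    counts[item["username"]] += 1

  leaderboard = [{"username": name, "days": days} for name, days in counts.most_common()]

  # One source of truth, regenerated every few hours and served on page load
  s3.put_object(
    Bucket="100daysofcloud-leaderboard",  # hypothetical bucket
    Key="leaderboard.json",
    Body=json.dumps(leaderboard),
    ContentType="application/json",
  )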

That's a bit of a braindump for now; I'm going to start working on the Serverless app. In the meantime, I'd suggest checking out the following video: OAuth and OIDC in plain English. - This will take you from "what is happening" to OAuth pro in an hour.

I didn't mention any of the frontend requirements needed to work with this solution, but I could look into that as well. Technically it's only a couple: the input form (for now), the leaderboard window, and the OAuth authentication (which is cleared up by the video just above).


antoniolofiego commented Jul 22, 2020

Let me go through all of these in order.

  • Frontend: Gatsby static site hosted in S3+CloudFront.
    Pretty straightforward

  • Leaderboard updates are easy to do since Gatsby uses separate "containers" for components of the website. The leaderboard component can be updated by a very simple Lambda function that pulls the data from the database and updates that file statically, or it can simply pull the finished numbers from DynamoDB through Lambda and an API Gateway endpoint. It would be REST, yes, but it works and it's fast to build [for now]. Another Lambda can look through the database periodically to create the up-to-date and ready-to-go leaderboard stats.
    Leaderboards are a good fit for services like Redis, but a table that is updated on a schedule would work as well. For the frontend, we can just call an API that returns the most up-to-date leaderboard.

  • We have easy and full control over the DynamoDB database. (I honestly have zero idea how to create a DynamoDB database with a proper setup and "connectors" in Amplify, after a week of messing around). For data transform/update inside the database we can use scheduled Lambda functions as mentioned above.
    See point above.

  • We don't need Cognito. All user data can be stored perfectly in DynamoDB, and authentication will happen through GitHub OAuth anyway.
    In terms of fast turnaround, Cognito is still a feasible option for the MVP. We can figure out OAuth at a later stage, but to get the product out of the door we can go with an email/password combination.

  • The input form for manual entries is a very simple API Gateway/Lambda integration. We can go with this setup in the beginning and move to GraphQL later if deemed necessary (I have no knowledge of how GraphQL works yet either, but REST I do).
    I will likely use React-Forms to handle the frontend, and we can just have a POST endpoint that triggers a Lambda to put the correct details where they need to be.

  • The OAuth login workflow with Serverless has been done quite a lot. There's this post for example (bit off), which was the very first search result - there's info on it, unlike Amplify, for which we're still waiting on that PR.
    See points above.

  • Teamwork and dev-staging environments are easy to do with SAM [and CloudFormation]. I could not find a single, even remotely viable solution for how to do this with Amplify. CI/CD can be done via GitHub Actions or CodePipeline, including unit tests.
    It is definitely much easier to do this via CloudFormation/SAM, indeed.

  • We have so much more wiggle room on scraping or automatically keeping track of participants' activity with Lambdas. I know this could have been simply added next to the Amplify deployment but keeping everything together in a single stack just feels better and easier. Also this would consolidate all our deployment and testing workflows.
    Nothing to add here.

I would definitely need help with the frontend coding eventually, just to speed everything up. I am starting to work on it right now again after a few days MIA, but more hands and brains can be helpful. Given that we are going fully serverless, can we have a chat on how to structure our endpoints and similar things? Also, where are we going to store things like avatars and articles?

what-name commented

Also, were are we going to store things like avatars and articles?
As far as I know, articles on Gatsby should be easy to do from our end. When it comes to articles written by others, we could store it's title and necessary info in DynamoDB and store their thumbnails in S3. If we want to make them appear on our site insted of simply being redirected to their original location (which is not a good idea from an SEO standpoint), can store that in DynamoDB as well actually or store a markdown version of that in S3 and just pull it from there (weird I know lol) - but this is all for the future, just wanted to mention that it's feasable.

In terms of fast turnaround, Cognito is still a feasible option for the MVP. We can figure out OAuth at a later stage, but to get the product out of the door we can go with an email/password combination.
That depends on what we want to do first. If it's just a simple input form like @omenking suggested, there's no need for either at first. After seeing how that works and how it will be displayed, we can go for more authentication - in my opinion. I think a simple input form would be a great starting point. I can't really see the authentication part from a user's perspective without knowing how the data generated by it will be used, but that's just me.


antoniolofiego commented Jul 22, 2020

import datetime

class Score:
  def __init__(self):
    self.twitter = 0    # Current score accrued through Twitter
    self.github = 0     # Current score accrued through GitHub

    self.old_twitter = datetime.datetime(2020, 1, 1)    # Second-to-last Twitter activity
    self.new_twitter = datetime.datetime(2020, 1, 1)    # Last Twitter activity

    self.old_github = datetime.datetime(2020, 1, 1)     # Second-to-last GitHub activity
    self.new_github = datetime.datetime(2020, 1, 1)     # Last GitHub activity

    self.twitter_streak = 0
    self.github_streak = 0

  def github_activity(self):
    # Rotate activity dates
    self.old_github = self.new_github
    self.new_github = datetime.datetime.now()
    days_gap = (self.new_github - self.old_github).days

    # Only count points and streaks when at least a full day has passed
    # since the previous activity; repeat same-day pings change nothing
    if days_gap != 0:
      # A gap of 3 days or fewer keeps the streak alive, otherwise it resets
      if days_gap <= 3:
        self.github_streak += 1
      else:
        self.github_streak = 1
      # A GitHub activity is worth 2 points, multiplied by the streak length
      self.github += 2 * self.github_streak
      print("GitHub points for today:", 2 * self.github_streak)

    print("Total GitHub points:", self.github)
    print("GitHub streak:", self.github_streak)

  def twitter_activity(self):
    # Rotate activity dates
    self.old_twitter = self.new_twitter
    self.new_twitter = datetime.datetime.now()
    days_gap = (self.new_twitter - self.old_twitter).days

    # Only count points and streaks once per day, as above
    if days_gap != 0:
      if days_gap <= 3:
        self.twitter_streak += 1
      else:
        self.twitter_streak = 1
      # A tweet is worth 1 point, multiplied by the streak length
      self.twitter += 1 * self.twitter_streak
      print("Twitter points for today:", 1 * self.twitter_streak)

    print("Total Twitter points:", self.twitter)
    print("Twitter streak:", self.twitter_streak)

  def get_score(self):
    # Full score is the sum of both sources
    return self.twitter + self.github

Draft for the leaderboard logic.
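
A quick way to sanity-check the draft (a fresh instance, so both gaps from the 2020-01-01 defaults are large enough to start new streaks):

score = Score()
score.github_activity()    # first ever GitHub activity: streak 1, +2 points
score.twitter_activity()   # first ever tweet: streak 1, +1 point
print(score.get_score())   # -> 3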


what-name commented Jul 23, 2020

Notes from the Zoom meeting on 23.07.2020

Current objectives mostly in order

  1. Recreate the current site with links and our own articles.
  2. Log in with Cognito username/pass and get all info from the participant (GitHub, Twitter, full name, LinkedIn, [email with consent box?]). Save this info in Cognito and in its [own] DynamoDB database (for now separate DB tables). This will allow us to have a list of participants and their socials from the beginning.
  3. Input form for today's tweet/commit entry. This is in case we don't get to automatic collection first.
  4. Show a Twitter feed for #100DaysOfCloud - should be easy enough, as Twitter provides drop-ins AFAIK.
  5. Leaderboard - total days OR a score calculated from commits, tweets and streaks.

LEADERBOARD

TL;DR we need more diversity in knowledge, experience and ideas on what the leaderboard should be based on. @omenking @cmgorton @rishabkumar7 @madebygps

This feature is a bit down the road, but it needs to be somewhat planned for in advance - so we know the flow of the system and what to work on.

1:

  • We can't form a clear idea of what the leaderboard should look like and display. I think it needs to be carefully decided upon, because a bad leaderboard strategy can very well drive people away from using the website completely. What I mean is that a simple score "X" for "top performers" can be confusing if it's produced by a complicated algorithm.
  • I think a simple and easy-to-measure value would be enough, like total days of challenge completion. This would show participants where they are and where everybody else is. If there were a carefully calculated number based on tweets and GitHub commits, that could make the whole scoring system difficult to understand and therefore drive away "engagement" with that component.
  • "I have a score of 745, now what does that mean tho and how does that person have 34280?!" for example.
  • I would also like to put less emphasis on perfect streaks, as a lot of people can't commit to the challenge every day. People with fewer everyday obligations would get a head start by default, because they have more time and are less likely to miss a day. Posting updates is also tricky, as forgetting to update your GitHub repo every day is very real, and sometimes people backfill 2-3 days (like I do).
  • A simple scoring system simplifies the backend a lot as well.
  • (3-10 days) "The key to this challenge is commitment. Doing even a tiny bit every single day is the most powerful way to learn and practice. We understand that life happens, therefore we have a four-day grace period where you can be inactive and nothing will happen. If you are inactive for more than four days however, your total days committed might get deleted." - something like this as a heads up and a way to focus on consistency without putting emphasis on streaks. (A minimal sketch of this day-counting logic follows this list.)
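
A minimal sketch of that total-days idea, assuming each participant's activity is stored as a set of dates; the four-day grace period and the reset rule are the only moving parts, and the names are illustrative:

import datetime

GRACE_DAYS = 4  # maximum gap between active days before the count resets

def total_days(activity_dates):
  # Count completed days, starting over if the participant was
  # inactive for more than GRACE_DAYS between two active days
  total = 0
  previous = None
  for day in sorted(set(activity_dates)):
    if previous is not None and (day - previous).days > GRACE_DAYS:
      total = 0  # inactive too long: the total "gets deleted"
    total += 1
    previous = day
  return total

print(total_days([datetime.date(2020, 7, 1),
                  datetime.date(2020, 7, 2),
                  datetime.date(2020, 7, 9)]))  # -> 1, the 7-day gap reset the count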

2:

  • Another option is from @antoniolofiego, where the participants' activity on Twitter and GitHub would be kept track of. This would make up a system where a GitHub commit (to the journey template) is worth twice as much as a tweet, and streaks would be rewarded a lot. For the logic, please see the draft above.
  • In my opinion, encouraging public exposure is great in terms of the tweeting, but - from what I've seen - people who'll tweet about it already do, without our gentle push. This brings up the GitHub Journey template, which should definitely be encouraged to be filled out and kept up to date as a source of truth [note]. Maybe we can solve displaying or using it on our end in some other way [with the goal of more recognition and encouragement for participants]. I'm not fully understanding this system, but maybe that's just me; @antoniolofiego, please elaborate more on it so we have it written down.

I'm personally recommending the first option, where only the total number of days is kept track of. My opinion is based on a case study from Jocko Willink's book Extreme Ownership, where a company was underperforming because their bonus system was too confusing. Employees didn't understand it, got seemingly random bonuses with each paycheck, and therefore didn't perform well. After the company revised their bonus system to be dead simple, they took off, because everyone understood the system and could work with it.

  • Edit: Both solutions have to take into account time zone differences!

Misc

  • We have limited personnel who can work on the frontend code. A "call to action" was shared in the Discord group and on Twitter in case somebody wants to work with us on this.
  • Idea: use DynamoDB Streams to record new activity for commits and tweets. This could also feed a "dynamic" table with all the latest 100DaysOfCloud activity, displayed on the website under a "What's happening now" panel. It could also simplify the fan-out of the activity data (i.e. total number of days). A rough sketch follows.
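
A rough sketch of that Streams idea, assuming a Lambda subscribed to the activity table's stream that fans new entries out into a hypothetical LatestActivity table backing the "What's happening now" panel (all names illustrative):

import boto3

dynamodb = boto3.resource("dynamodb")
feed_table = dynamodb.Table("LatestActivity")  # hypothetical "What's happening now" table

def handler(event, context):
  # Each record is a change event from the activity table's stream
  for record in event["Records"]:
    if record["eventName"] != "INSERT":
      continue  # only fan out brand-new activity entries
    item = record["dynamodb"]["NewImage"]
    feed_table.put_item(Item={
      "username": item["username"]["S"],
      "timestamp": item["timestamp"]["S"],
      "source": item.get("source", {}).get("S", "unknown"),
    })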

what-name commented

Possible problems for the future with the leaderboard:

  • People backfilling 2 days at once
  • Making sure people who can't commit every single day - due to random life events or children - don't lose their streaks
