This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

Performance Issue(s) #456

Open
punkdis opened this issue Feb 16, 2017 · 38 comments

Comments

@punkdis

punkdis commented Feb 16, 2017

6d04149

AWS EC2, Ubuntu 14.04.5 LTS on a m4.xlarge

When we have over 150 participants over 20 mins the CPU slowly goes to 80%. Once at 80% performance goes to the crapper and the UI is slow and participants are seeing the impact in game play.

[screenshot]

@juliannagler
Contributor

Thanks for filing the issue @rfrillman - we'll investigate ASAP.

@justinwray
Contributor

Do you happen to know if you were running Memcached? Was the database server on the same system as the webserver (a single system)?

@punkdis
Author

punkdis commented Feb 17, 2017

Yes Memcached was running.

memcache 1359 0.0 0.2 327440 2580 ? Sl 04:11 0:00 /usr/bin/memcached -m 64 -p 11211 -u memcache -l 127.0.0.1
ubuntu 1886 0.0 0.0 11744 932 pts/0 S+ 04:19 0:00 grep --color=auto memcached

Yes the database server was on the same system as the webserver.
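Beyond checking the process list, you can ask Memcached itself how it is performing via its `stats` command. A hypothetical sketch (the two stat lines below are made-up sample values, not from this event; against a live server you would pipe `printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211` into the same awk):

```shell
# Sketch: compute the Memcached hit ratio from `stats` output.
# Sample values stand in for a live server's response.
ratio=$(printf 'STAT get_hits 900\nSTAT get_misses 100\n' |
  awk '/get_hits/ {h=$3} /get_misses/ {m=$3} END {printf "hit ratio: %.2f", h/(h+m)}')
echo "$ratio"
```

A low hit ratio would suggest the cache isn't actually absorbing the database load.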

@justinwray
Contributor

Interesting, nearly all of the data is cached now (even sessions). So you shouldn't see a high load in relation to the database. In the past we've seen this issue when Memcached wasn't running/working.

Did you have the binlog enabled by chance? Or the general query log? (Note: You probably shouldn't for performance reasons - but it would be useful for some investigative information in this case).

@punkdis
Author

punkdis commented Feb 17, 2017

binlog is/was disabled. I can see if there are any query logs

@punkdis
Author

punkdis commented Feb 18, 2017

After reviewing the logs, nothing stands out. Maybe the resources of an m4.xlarge are not enough.

[screenshot]

I have another event in a month or so and I am going to move to the C4.4xlarge

[screenshot]

@shargon
Contributor

shargon commented Feb 22, 2017

We would like to expose our experience using the Facebook CTF platform.
First of all, the platform was deployed on a Virtual Private Server (OVH SSD3 VPS Server), whose specifications are:

  • 8 GB RAM.
  • 2v cores.
  • 2.4 GHz.
  • SSD 40 GB.
  • KVM OpenStack.
  • Local RAID 10.

After deploying the platform for our CTF, we tested it with 15 players and did not experience any flooding or lag in the platform.

Once the competition opened, 340 participants were registered, and the platform remained open from 02/16/2017 to 02/19/2017.
At this point the platform began to flood, making it almost impossible to interact with the interface, even when only 30 players were in game.

System memory (RAM) was not a problem; however, the CPU load increased considerably.
To correct these problems, we made the following modifications:

  • Raising the MySQL stack threads from 10 to 25.
  • Restarting the HHVM service every 5 minutes.
  • Deleting active sessions from the database every hour to remove inactive users (including flushing Memcached).
  • Changing the refresh callbacks from 5000 to 30000 and from 10000 to 120000.

The main problem comes from HHVM: although these modifications fixed the lag at times, we never managed to correct the problems completely, and users kept complaining about the flooding of the system until we deleted the user sessions and restarted the HHVM service again.

@punkdis
Author

punkdis commented Feb 24, 2017

Any other thoughts?

@wisco24
Contributor

wisco24 commented Feb 25, 2017

I had this same issue a while back, and still do at times for large events. HHVM is tying up the CPU waiting for database queries. I just ran an event last week with 30 teams (50-60 sessions) on a c4.4xl (16 cores), but moved the DB to AWS Aurora (db.r3.2xl); I'm working on the code so that I can do a pull request for it. Late in the afternoon, when the game log grew from all the guesses/answers, the CPU was sitting around 60%. The biggest help in my testing was having the DB separate, so the DB and HHVM were not both beating up on the same CPU.

@wisco24
Contributor

wisco24 commented Feb 25, 2017

From the queries I looked at, the issue is with how the leaderboard scores are computed. They are recalculated every time a team scores by querying and summing the full game logs table. A much more efficient approach would be to store the score in a separate DB table so it doesn't have to go through all the guesses each time.
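The cost difference can be seen in a toy shell sketch (not FBCTF code, just an illustration of the idea): keep a running total updated once per correct submission, instead of re-scanning every logged guess on each scoreboard refresh.

```shell
# Toy illustration (not FBCTF code): a running per-team score, updated with
# O(1) work per correct submission, versus re-summing the whole game log.
score=0
for points in 100 250 50; do    # three correct submissions, made-up values
  score=$((score + points))     # one incremental update per submission
done
echo "running score: $score"
```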

@wisco24
Contributor

wisco24 commented Feb 25, 2017

Just looked at pull request #459, and that might be a mitigation method as well, since it will create separate DBs, etc.; you could then put a load balancer with sticky sessions in front so a team stays stuck to one system.

@punkdis
Author

punkdis commented Apr 4, 2017

Has anyone used the Live Sync API?

justinwray added a commit that referenced this issue Aug 4, 2017
…port (#535)

* Separate docker containers per service

* Provision Streamlined, Quick Setup Added, and Multiple Containers Support

* The project now includes a number of "Quick Setup" options to ease the installation or startup process of the platform.  The following Quick Setup modes are available:

  *  Direct Installation - Used when directly installing to the system you are on; this is useful when installing on bare metal, an existing VM, or a cloud-based host.

      * `source ./extra/lib.sh`
      * `quick_setup install <dev/prod>`

  * Multi-Server Direct Installation - Used when directly installing the platform with each service on a separate system; this is useful when installing on bare metal systems, existing VMs, or cloud-based hosts.

    * Database Server (MySQL)
      * `source ./extra/lib.sh`
      * `quick_setup install_multi_mysql <dev/prod>`

    * Cache Server (Memcached)
      * `source ./extra/lib.sh`
      * `quick_setup install_multi_nginx <dev/prod>`

    * HHVM Server (HHVM)
      * `source ./extra/lib.sh`
      * `quick_setup install_multi_hhvm <dev/prod> <IP of MySQL Server> <IP of Memcached Server>`

    * Web Server (Nginx)
      * `source ./extra/lib.sh`
      * `quick_setup install_multi_nginx <dev/prod> <IP of HHVM Server>`

  * Standard Docker Startup - Used when running FBCTF as a single docker container.
      * `source ./extra/lib.sh`
      * `quick_setup start_docker <dev/prod>`

  * Multi-Container Docker Startup - Used when running FBCTF on docker with each service hosted in a separate docker container.
      * `source ./extra/lib.sh`
      * `quick_setup start_docker_multi <dev/prod>`

  * Standard Vagrant Startup - Used when running FBCTF as a single vagrant container.
      * `source ./extra/lib.sh`
      * `quick_setup start_docker <dev/prod>`

  * Multi-Container Vagrant Startup - Used when running FBCTF on vagrant with each service hosted in a separate vagrant container.
      * `source ./extra/lib.sh`
      * `quick_setup start_docker_multi <dev/prod>`

* Each installation platform now supports both Production Mode (prod) and Development Mode (dev).

* The `provision.sh` script has been streamlined and organized based on the services being installed.  The installation process now also includes more logging and error handling.  Common and core functionality has been migrated to `lib.sh` where appropriate.  Color coding has been added to the various output to make quick visual monitoring of the process easier.

* Package installation, specifically the check for existing packages has been updated to fix an issue where packages would sometimes not be installed if a similarly named package was already present on the system.

* The `provision.sh` script now supports separate installations for each service using the `--multiple-servers` and `--server-type` options.

* HHVM configuration has been updated to run HHVM as a network-service.

* Nginx configuration is now included in the platform code base and utilized.

* Docker service startup scripts are included for each of the services:
  * `./extra/mysql/mysql_startup.sh`
  * `./extra/hhvm/hhvm_startup.sh`
  * `./extra/nginx/nginx_startup.sh`

* This PR fixes the docker installation dependencies issue #534.

* This PR includes docker-compose configurations for multi-docker containers, fixing issue #440.

* Services on Docker (both single container and multi-container) are now monitored to ensure they do not fail.

* This PR updates HHVM to the latest stable version for Ubuntu 14.04, HHVM Version 3.18.1, fixing issue #496.

* Attachment/Upload permissions have been corrected across the installation environments.  This fixes issues with improper permissions on Docker and Vagrant while still enforcing secure file permissions.  This should resolve issues like #280 going forward.

* Implemented stricter permissions on the CTF path (755 versus 777).

* Fixed long-standing, upstream induced, HHVM socket permission issues (like #229), mostly experienced in Docker or after a restart (resulting in a _502 Bad Gateway_):  facebook/hhvm#6336.  Note that this fix is a temporary workaround until the upstream issue is resolved.

* With the introduction of the latest available version of HHVM and the inclusion of multiple-server support, performance increases should be noticeable.  This should help alleviate issues like #456.

* This PR was derived, in part, from PR #530.

* Added Memcached Service Restart to container service script

* Added logging of PHP/HHVM version to provision script.

* Added logging of PHP Alternatives to provision script.

* Composer is now installed with the HHVM binary instead of PHP.

* Composer Install is run with the HHVM binary instead of PHP.

* The Travis trusty Ubuntu image has been downgraded from `sugilite` to `connie`.

* Updated run_tests.sh to have write permissions to settings.ini

* Set run_tests.sh to use localhost for DB and MC.

* HHVM 3.18+ enforces \HH\FormatString - invariant calls are now of \HH\FormatString type.  All `invariant()` calls that pass variable arguments have been updated to use literal strings for the format string.  `invariant()` passes the second (and subsequent) arguments to `sprintf()`, so the second parameter of `invariant()` must be a literal string, containing placeholders when needed.  More information can be found here:  hhvm/user-documentation#448.  This change ensures the code is strict compliant in HHVM versions 3.18+.
@punkdis
Author

punkdis commented Aug 4, 2017

[screenshot]

I ran another event, and mysqld was off the chart and caused the game to become unresponsive. Any suggestions? Or any immediate commands I could issue to free up the CPU?

git rev-parse HEAD
51e06a7

@wisco24
Contributor

wisco24 commented Aug 4, 2017

Exactly what I was seeing.... The best way I found was to offload the DB to a different server. In my case, I moved the DB to AWS Aurora RDS and ran the rest on a rather large EC2 instance....

I tried restarting mysqld and hhvm when I had this issue before, but it caused the scoring system to get messed up and award double points for the same question.

@punkdis
Author

punkdis commented Aug 4, 2017

@wisco24 do you have any documentation on how you setup the DB on a different server? I have another event scheduled for September

@stevcoll

stevcoll commented Aug 4, 2017

@punkdis The multi-server changes were just merged into the dev branch. Depending on how many servers you want to separate across, here are the general instructions for production. Note that the additional IP-address options you see tell the installer where the hhvm, mysql, and cache servers are located. So generally you would provision in this order:

Update: See Justin's comment below for correct commands

@punkdis
Author

punkdis commented Aug 4, 2017

@stevcoll Do you know when the changes will be merged into production?

@stevcoll

stevcoll commented Aug 4, 2017

The last I heard, master and dev were in conflict, so it could be months - but I'll let @justinwray answer that one.

@justinwray
Contributor

@punkdis / @stevcoll dev and master are currently out of sync.

However, given the number of improvements, including performance updates, in dev it might be a good time to merge that work into master.

Merging dev into master would also bring the new server/container separation to master as well.

In the meantime, you can certainly pull down the dev branch, to do so you need to do the following:

  • git clone https://github.com/facebook/fbctf
  • cd fbctf
  • git checkout dev

From that point forward all other documentation applies and you will now be using the dev branch.

Additionally, documentation for the new installation options can be found in the commit log here: b487fc1

You can use the individual provision.sh executions as @stevcoll provided, or the quick_install option for multiple direct installations:

  • Database Server (MySQL)

    • source ./extra/lib.sh
    • quick_setup install_multi_mysql <dev/prod>
  • Cache Server (Memcached)

    • source ./extra/lib.sh
    • quick_setup install_multi_nginx <dev/prod>
  • HHVM Server (HHVM)

    • source ./extra/lib.sh
    • quick_setup install_multi_hhvm <dev/prod> <IP of MySQL Server> <IP of Memcached Server>
  • Web Server (Nginx)

    • source ./extra/lib.sh
    • quick_setup install_multi_nginx <dev/prod> <IP of HHVM Server>

@fredemmott
Contributor

fredemmott commented Aug 11, 2017

Restart HHVM service every 5 minutes.

Test this very carefully - as HHVM is a JIT with a high warmup cost, this will usually make things much worse.

In general, when looking at HHVM's CPU usage, the best tool to use is `perf` from the Linux perf tools; because HHVM generates a perf PID map, this will show both the native C++ and the JITed PHP functions in the same stack traces. If you get lots of memory addresses without names in the stack traces, the PID map option is probably off or there's a permission issue reading it; if you get no PHP functions or only raw memory addresses, HHVM probably isn't warmed up yet and there's no JITed code.

You can also use perf to profile the entire system, not just HHVM - so if, for example, there's a problem with MySQL, this will show you the PHP call to MySQL, the C++ HHVM builtin talking to MySQL, the C functions in MySQL, and ultimately the kernel calls that MySQL is spending its time in.

@5hubh4m

5hubh4m commented Sep 25, 2017

HEAD: 1f236bb on an EC2 m4.16xlarge with a separate AWS Aurora SQL database.

The platform (coupled with our naiveté) caused us a lot of misery, and possibly cost us the reputation of our yearly CTF event, CodeFest '17 CTF.

[screenshot: CPU utilisation graph]

As soon as the contest began, the website slowed to a crawl. Error 504s everywhere. I shifted the MySQL database to AWS Aurora and upgraded the m4.xlarge to an m4.16xlarge. Still virtually no difference.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ wc -l error.log
6853801 error.log

Mid-way into the contest, the game board stopped opening at all as GET-ing /data/teams.php would always fail. It's still not working, even after stopping the game and dropping all entries from sessions.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ cat error.log | grep -v mysql | grep -v translation | tail
[Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac5bff700:735:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac13ff700:1561:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:736:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1562:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:737:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1563:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1564:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1565:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/command-line.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:738:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/configuration.php(7): SessionUtils::enforceLogin()\n#1 {main}
[Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1566:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}

There were around ~500 team registrations and around 276 teams managed to make a submission. There were as many as 160 sessions open simultaneously.
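When triaging a log like the one above, a quick tally of which endpoints are failing can show whether the errors are spread evenly or concentrated. A sketch, with a few sample lines standing in for the real /var/log/hhvm/error.log:

```shell
# Sketch: count IndexRedirectException hits per endpoint. The printf lines
# are samples modeled on the paste above, not the real 1 GB log.
counts=$(printf '%s\n' \
  '#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()' \
  '#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()' \
  '#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()' |
  grep -o 'data/[a-z-]*\.php' | sort | uniq -c | sort -rn | awk '{print $1, $2}')
echo "$counts"
```

Against the real file, the `printf` pipeline would be replaced by `grep 'IndexRedirectException' /var/log/hhvm/error.log`.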

@justinwray
Contributor

@5hubh4m:

Thank you for submitting your report (and using the existing issue thread)! Could you please provide the quick_setup or provision details/options you used?

It sounds like your game has already concluded, but if you do still have the system running, and are experiencing performance issues, please take a look at https://github.com/facebook/hhvm/wiki/Profiling#linux-perf-tools. The reports from perf, sorted by both Self and Children, captured during the performance degradation will be beneficial for further troubleshooting.

Even if you are unable to obtain the perf reports, did you happen to capture the output from top or ps during the performance degradation? The output from such tools may help track down which process was using an excessive amount of CPU.

The entries in your error.log show requests that require a valid session but were sent unauthenticated. In other words, these "errors" are simply requests where someone was not logged in. Given that sessions do time out, this frequently happens when people leave their browser windows open on the scoreboard but otherwise inactive.

@5hubh4m / All:

We have been dedicating much focus on the overall performance of the platform. One area of significant improvement has been in the caching of database requests. Nearly all requests are now cached, and the results are stored for subsequent usage unless the cached data is invalidated on specific actions. As a result, the total number of database queries is very low.

Another major area of performance improvement has been the ability, through the quick_setup or provision process, to separate the various components of the platform onto different servers/instances/containers. In total, you can install the platform components across four servers; one for Nginx, one for HHVM, one for MySQL, and one for Memcached. Furthermore, on platforms like Amazon Web Services (or equivalent cloud solutions) you can use services such as RDS for MySQL and ElastiCache for Memcached, to provide more purpose-tailored performance.

While we continue to do load and stress testing of our own, we, unfortunately, have been unable to reproduce the issue(s) detailed in this thread. Our testing, with hundreds of users, on an even smaller resource footprint (2 GHz of CPU and 2 GBs of RAM) shows negligible latency for the end-user. One thing that has become apparent through our testing is that MySQL, given the caching, is not experiencing a significant amount of load. Instead, it becomes more and more likely that HHVM is the cause of excessive CPU utilization, system load, and subsequent latency.

We are finalizing testing that will migrate the platform from Ubuntu 14.04 and HHVM 3.18 to Ubuntu 16.04 and HHVM 3.21. While this will not ensure a performance boost, it does include performance improvements to HHVM itself.

One commonality we see in the performance reports is the usage of Amazon Web Services. While the platform is certainly capable of running on AWS (and some of our testing takes place on AWS as well), there is one important aspect of AWS T2 instances to consider: CPU Credits. When using a t2 instance on AWS, you are given a set number of CPU credits and a baseline CPU level. For example, on a t2.medium you are given 24 CPU credits with a CPU baseline of 40%. A single CPU credit is worth 1 minute of 100% CPU usage. As credits are earned and stored (for 24 hours), you could burn through your entire balance of credits sequentially. For a t2.medium you can store a total of 576 credits (earned over 24 hours). In other words, with a t2.medium you are given 40% of the CPU resources across the board, with the ability to boost (on average) to 100% for 24 minutes per hour. Learn More about CPU Credits here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html#t2-instances-cpu-credits
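The t2.medium arithmetic above can be sanity-checked with a little shell math (the inputs are the numbers quoted in this comment, taken as assumptions):

```shell
# Back-of-envelope check of the quoted t2.medium figures: 24 credits earned
# per hour, 1 credit = 1 minute of 100% CPU, credits accrue for 24 hours.
credits_per_hour=24
accrual_hours=24
max_credits=$((credits_per_hour * accrual_hours))  # maximum storable balance
burst_minutes_per_hour=$credits_per_hour           # sustainable 100% burst
echo "max stored credits: $max_credits"
echo "burst minutes per hour: $burst_minutes_per_hour"
```

So a CTF that pegs the CPU for more than ~24 minutes per hour on a t2.medium will exhaust its credits and be throttled to the 40% baseline.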

We will continue to dedicate effort to improving the performance of the platform and continue to test and troubleshoot in an attempt to recreate the aforementioned issue(s).

When experiencing a performance issue, please include the following in your reports:

  • Platform Version (HEAD commit used)
  • Installation Mode (Production or Development)
  • Installation Method (Quick Setup or Provision with Options)
  • Linux Distribution
  • Linux Kernel
  • Nginx Version
  • HHVM Version
  • MySQL Version
  • Memcached Version
  • Top Output (during performance degradation - On each server when using multiple servers)
  • PS Output (during performance degradation - ps aux - On each server when using multiple servers)
  • Relevant HHVM Errors (/var/log/hhvm/error.log)
  • Profiling Reports (during performance degradation - https://github.com/facebook/hhvm/wiki/Profiling#linux-perf-tools - Sorted by Self and Children)
  • Number of Teams
  • Average Number of Sessions (during performance degradation)

@justinwray justinwray changed the title Performance issue for over 150 sessions Performance Issue(s) Sep 25, 2017
@5hubh4m

5hubh4m commented Sep 26, 2017

@justinwray

That is easy. hhvm was constantly on top with 5000-6000% CPU consumption. The snippet of error.log I posted didn't contain the Resource Not Available errors because I wasn't able to locate them in the 1 GB text file. The ones I did mention are CERTAINLY not due to unauthenticated access, as they were generated after the game was over and the only active session was mine. perf, when used during the game to look at syscalls, showed it riddled with EAGAINs.

I installed the game using provision.sh with the letsencrypt options and a single-server setup. The DB was then dumped using mysqldump and moved to Aurora, and I edited settings.ini to reflect the change.

Interestingly, neither the disk nor RDS was the bottleneck. RDS saw 400 connections at peak, but its CPU usage remained at 20-25%. The disk (a provisioned-IOPS volume on AWS) also didn't max out its provisioned IOPS.

ubuntu@ip-172-31-20-69:~$ uname -a
Linux ip-172-31-20-69 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ip-172-31-20-69:~$ memcached -help
memcached 1.4.14

ubuntu@ip-172-31-20-69:~$ hhvm --version
HipHop VM 3.18.5 (rel)
Compiler: tags/HHVM-3.18.5-0-g61f6a1f9a199c929980408aff866f36a7b4a1515
Repo schema: 514949365dd9d370d84ea5a6db4a3dd3b619e484

ubuntu@ip-172-31-20-69:~$ nginx -v
nginx version: nginx/1.4.6 (Ubuntu)

I don't have the top or ps outputs unfortunately.

Also, a comment was made on CTFTime about fbctf's performance.

illustris – Sept. 22, 2017, 5:45 p.m.
You should have tested the platform well before using it. We considered using it for BITSCTF last year, but a simple load test with 50 people made it crawl on a fairly powerful VPS

@stevcoll

@5hubh4m Regarding those errors, they are generated when somebody does not have a session but, for instance, leaves the scoreboard open. Until a hard refresh is performed, they are not redirected to the login page, and the requests come in without a session. These look to be the typical requests performed during an AJAX refresh, which happens every ~5 seconds on each platform window.

For instance, even if you removed all users, disabled login, or invalidated all sessions in the platform, any user who was previously authenticated but still sits on the gameboard page will generate these errors. They also come up on a hard refresh that does redirect to the login page (before the redirect, invalid requests are still coming in).

As @justinwray pointed out, we are heavily focused on reproducing these issues and finding a resolution. Unfortunately the biggest problem has been duplicating these live environments with hundreds or thousands of users, and ensuring our test provision and configuration matches exactly what the game admins are utilizing.

This morning I set up a stress test where I provisioned on a VM with only 2 GB of memory and 2 cores. After setting up various flags and quizzes with different configurations, I hammered the platform for over 6 hours with constant logins and subsequent endpoint requests of all types, with hundreds of sessions and an estimated typical load of a few hundred users.

I saw the memory for HHVM go up about 8% during the event, and HHVM CPU was very consistent at 100% (ranging between 50% and 150% or about 1.5 cores worth of utilization). NGINX and Memcached showed CPU at around 10-30% on spikes, and MySQL easily handled the requests without even coming up on top. The bottleneck seemed to be HHVM, and yet even after 6 hours I was able to capture levels, bring up countries, and perform typical platform actions in multiple browsers without any noticeable latency.

The perf command showed that requests to Memcached were using up the most CPU in the HHVM process, which probably makes sense considering that a majority of MySQL queries were cached, and very few queries are actually sent to MySQL. Once a user captures a level or other actions take place, Memcached is invalidated, MySQL gets the queries once, then they are again cached into Memcached.
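That invalidation flow can be reduced to a toy cache-aside sketch (not FBCTF code: a shell variable stands in for Memcached, and a hard-coded team list stands in for the MySQL result):

```shell
# Toy cache-aside flow: miss -> query once and store; hit -> serve from
# cache; a capture invalidates the entry so the next read re-queries.
cache=""
fetch_teams() {
  if [ -z "$cache" ]; then
    cache="team_a team_b"   # "MySQL" queried once, result cached
    last="miss"
  else
    last="hit"              # served from "Memcached", no query
  fi
}
fetch_teams; first=$last
fetch_teams; second=$last
cache=""                    # a level capture invalidates the cache
fetch_teams; third=$last
echo "$first $second $third"
```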

The next step on our end is now to try to reproduce the problem on AWS itself on instance types such as the ones you used, and see if that environment is involved in the issue.

@fredemmott
Contributor

@mpomarole : do you remember why this switched from the mcrouter extension to memcached in d5846c5 ?

@fredemmott
Contributor

fredemmott commented Sep 26, 2017

not that I think that we should need an async mysql extension, just that the mcrouter extension is extremely thoroughly tested for perf problems, whereas the memcached extension basically exists for PHP compatibility.

@wisco24
Contributor

wisco24 commented Sep 26, 2017 via email

@wisco24
Contributor

wisco24 commented Sep 26, 2017 via email

@stevcoll

@5hubh4m @fredemmott @justinwray @wisco24

Last night I attempted to duplicate the @5hubh4m scenario with some load testing scripts. The general setup started with an AWS EC2 m4.large (2 cores, 8 GB memory): a standard production provision with all services running on the one server. I added 10 quizzes and 10 flags with text-heavy descriptions, some with attachments, hints, different point values, etc., then turned on registration and started the event.

Each Perl script "client" was run from one of two separate m4.large instances and these scripts registered on the platform, logged in, then went through an endless cycle of loading the gameboard page then performing AJAX requests every 5 seconds (3 times). The total cycle was 20 seconds. After each cycle, the script had a 2% randomized chance of submitting a flag. When that happened, the script then had a 20% chance of getting the flag answer correct.

Started off with 50 script sessions running. This brought HHVM to about 125% utilization out of a total 200%. The platform was usable, but I did notice some throttling of the HHVM access log every few seconds, where it would "jam up" before continuing. Essentially the CPU on HHVM ranged from 50% to 150% every few seconds. Memory on HHVM and the entire platform was a non-issue.

I increased to 150 script sessions, and that brought the CPU average up to around 170%. Platform still usable, but now seeing cases where Modals were taking 10 seconds to pop up.

At 300 script sessions, the CPUs were maxed out on HHVM at 200%. The platform at this point became completely unusable as the HHVM access log crawled and froze up constantly. The script sessions started dying from timeouts. I ran perf for the first time with these results:

[screenshots: perf reports]

I then stopped the instance and upgraded to a m4.16xlarge (64 Cores, 256GB memory). Resumed the event starting with 350 sessions. HHVM immediately ate up a significant amount of CPU on all 64 cores. The platform was usable but again Modals were taking seconds to popup.

I started seeing this error all over the place (which I had not seen even once on the m4.large):

Error executing AsyncMysql operation: Failed (mysql error: 1040: Too many connections)

This caused many of the scripts to timeout and/or die. Increased the number of connections to a million in the MySQL configuration and restarted that. This seemed to solve the issue for a few minutes, before I instead got this error:

Error executing AsyncMysql operation: Failed (mysql error: 2004: Can't create TCP/IP socket

Finally at around 400 users I started getting this error message on the scripts which was very strange:

Error GETing /data/teams.php: hphp_invoke
Error GETing /data/country-data.php: hphp_invoke

The admin sessions page would now not even load, with the following error message:

[Tue Sep 26 07:36:35 2017] [hphp] [1817:7f8f25bff700:1194:000001] [] \nFatal error: Uncaught exception 'HH\\InvariantException' with message 'all_teams should of type Team and not null' in /var/www/fbctf/src/models/MultiTeam.php:72\nStack trace:\n#0 /var/www/fbctf/src/models/MultiTeam.php(72): HH\\invariant_violation()\n#1 /var/www/fbctf/src/controllers/AdminController.php(3749): MultiTeam::genTeam()\n#2 /var/www/fbctf/src/controllers/AdminController.php(4091): AdminController->genRenderSessionsContent()\n#3 /var/www/fbctf/src/controllers/AdminController.php(4104): AdminController->genRenderPage()\n#4 /var/www/fbctf/src/controllers/Controller.php(37): AdminController->genRenderBody()\n#5 /var/www/fbctf/src/Router.php(73): Controller->genRender()\n#6 /var/www/fbctf/src/Router.php(20): Router::genRouteNormal()\n#7 /var/www/fbctf/src/index.php(7): Router::genRoute()\n#8 (): genInit()\n#9

Somewhere between 400 and 500 users the platform was again unusable. The HHVM access log again crawling and unable to even remotely keep up with requests. Could not load the platform page. I performed another perf and got the resource graphs from AWS:

image

image

image

The timeline on the graph is that the event started at 6:30 GMT, and the instance was upgraded to the m4.16xlarge at 7:15 GMT. The reason the CPU isn't pegged some of the time is my inability to keep the session scripts running while MySQL had too many connections or the scripts were dying from timeouts.

Overall, Memcached does seem to be a primary bottleneck inside of HHVM, as far as I can read perf. I don't understand how HHVM could utilize so many cores on the m4.16xlarge. Neither MySQL nor NGINX had any issues whatsoever in terms of resource usage. Memcached ended up utilizing 1.5 cores towards the end of my test with ~450 sessions.

@fredemmott
Copy link
Contributor

HHVM defaults to 2x the number of cores, and you probably actually want higher: each concurrent request needs a thread.

A few things:

  • please use the perf map - this might require running perf report as root. This will replace the addresses with Hack function names
  • given how unserialize is showing up, we should look at exactly what’s been stored in memcached and if we can store something cheaper to serialize/unserialize instead

@stevcoll
Copy link

@fredemmott Yeah, I did run it as root (that was required for it to even function). I ran exactly this command:

sudo ./perf record -g --call-graph dwarf -p 5076

The report generated with this:

sudo ./perf report -g

@stevcoll
Copy link

@fredemmott Also, I installed perf from apt-get, then used the binary from here, as the documentation specified. Perhaps I need the entire install from a newer source?

http://dl.hhvm.com/resources/perf.gz

@fredemmott
Copy link
Contributor

Should be able to get it from your distribution package manager; sounds like the instructions are outdated.

We should also consider optimizing for the common single-web-server case and use APC instead by default - if value forms such as dict are used, there's no serialization or unserialization. The caching will need to change to be pluggable, but that would be good anyway.

@justinwray
Copy link
Contributor

@wisco24 That is an excellent example to illustrate what our usage of Memcached is and is not doing.

If you have 30 teams and 30 flags, and every team captures all of the flags, some of the Memcached data (related to scoring) would be invalidated a total of 900 times. However, only a subset of the data in the cache would be invalidated through these actions. Furthermore, while this will cause some queries to execute 900 times, the previous implementation, before the data was cached, would have had these requests being sent on every scoreboard request, or thousands upon thousands of times. Not to mention the other data that isn't invalidated from the scoring action.

To be clear, previously every user/session would be generating a query for the values within the database. They would be querying for teams, levels, scores, hints, etc. The queries for such data would be happening on every scoreboard request (including the AJAX refreshes) and on any new page or modal load. With the new implementation, the queries happen once the cached data is invalidated or missing and only once, with all users sharing the results. So instead of all 30 teams making tens of thousands of requests throughout the event, you are down to a few thousand across all of the teams.

Memcached is certainly not intended to replace the database, or entirely shield the database from the workload. Specifically, Memcached is implemented to store and reuse database results for as long as they are valid.

However, you are right; the more "busy" a game is, the more often queries will take place.

@stevcoll

Great job with the testing and reproducing the issue!

You cut off part of this error line:

Error executing AsyncMysql operation: Failed (mysql error: 2004: Can't create TCP/IP socket

Directly after the word socket should be a number; this number would indicate why the sockets could not be created. Most likely you ran out of file handles or TCP/IP sockets at the kernel level.

Take a look at ulimit on that instance to see some of the userspace limitations, and if you are hitting a kernel limit, take a look at net.core.somaxconn or the backlog values (net.core.netdev_max_backlog, net.ipv4.tcp_max_syn_backlog).
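
For anyone chasing the same limits, here is a quick way to inspect them from a script (a diagnostic sketch in Python, nothing platform-specific; the sysctl paths are the standard Linux `/proc/sys` locations):

```python
import resource

# Soft/hard limits on open file descriptors for this process --
# each MySQL connection and each client socket consumes one.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

def read_sysctl(name: str) -> str:
    """Read a sysctl value from /proc/sys (Linux); 'unavailable' elsewhere."""
    path = "/proc/sys/" + name.replace(".", "/")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "unavailable"

# Kernel-level accept/SYN backlogs; small values here cap how many
# pending TCP connections a busy listener can queue.
for key in ("net.core.somaxconn",
            "net.core.netdev_max_backlog",
            "net.ipv4.tcp_max_syn_backlog"):
    print(key, "=", read_sysctl(key))
```

If the soft limit is low (1024 is a common default), raising it for the HHVM and MySQL service users is usually the first step.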

As we can see from your error:

all_teams should of type Team and not null

We are not getting valid results from MySQL, which is to be expected if the server is unable to fulfill the request.

We may want to look into further MySQL, Memcached, and system-level tuning for the production installation mode. While there is still optimization to be achieved within the platform code, the default configurations for these services are unlikely to be solid enough for larger events.

@fredemmott

given how unserialize is showing up, we should look at exactly what’s been stored in memcached and if we can store something cheaper to serialize/unserialize instead

Currently, the result Maps are being stored in Memcached.

@fredemmott
Copy link
Contributor

Maps should be relatively efficient; I'm wondering if:

  • we're storing a larger map than we should
  • we're simply fetching stuff from the cache more often than we should - e.g. will we do the same memcached request twice in the same http request? This is especially likely given whatever PHP code is at 0x50050fe is responsible for most of this (though that could be part of systemlib)

@stevcoll : can you make sure hhvm.perf_pid_map is on in your hhvm settings?

@justinwray
Copy link
Contributor

justinwray commented Sep 26, 2017

@fredemmott

Maps should be relatively efficient

I would think so as well.

we're storing a larger map than we should

Less likely, as it's usually just a simple set of keys/values from the database row (so they are fairly small, in some cases just a single int).

we're simply fetching stuff from the cache more often than we should

This is more likely an area for improvement. The MC calls are taking place in the various class methods, and while there is not a major amount of overlap, there is certainly some.

Which leads to the option of potentially building a Cache class that handles the fetching of MC results and stores them in an object - so if a subsequent cache request is made, the data is retrieved from HHVM's memory instead of reaching out to MC.
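
That idea can be sketched roughly like this (Python for illustration; `RequestLocalCache` and the counting backend are hypothetical names, not platform code):

```python
class RequestLocalCache:
    """Memoize Memcached lookups in per-request local memory, so a
    second lookup for the same key never goes back over the network."""

    def __init__(self, backend):
        self.backend = backend      # e.g. a real Memcached client
        self.local = {}             # lives only for this request

    def get(self, key):
        if key in self.local:       # local hit: no network round trip
            return self.local[key]
        value = self.backend.get(key)
        self.local[key] = value
        return value

    def invalidate(self, key):
        self.local.pop(key, None)
        self.backend.delete(key)

class CountingBackend:
    """Fake backend that counts network hits, for demonstration."""
    def __init__(self, data):
        self.data, self.hits = data, 0
    def get(self, key):
        self.hits += 1
        return self.data.get(key)
    def delete(self, key):
        self.data.pop(key, None)

backend = CountingBackend({"ALL_TEAMS": ["alpha", "bravo"]})
cache = RequestLocalCache(backend)
cache.get("ALL_TEAMS")
cache.get("ALL_TEAMS")              # served from local memory
print(backend.hits)                 # -> 1
```

The win comes from the second and subsequent lookups within one request costing a dictionary access instead of a Memcached round trip.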

@fredemmott
Copy link
Contributor

Which leads to the option of potentially building a Cache class that handles the fetching of MC results and stores them in an object - so if a subsequent cache request is made, the data is retrieved from HHVM's memory instead of reaching out to MC.

I'll be putting some more thoughts on this over in #572

@justinwray
Copy link
Contributor

Performance Testing and Improvements Update:

We continued to perform various load, stress, and scalability testing and wanted to provide an update to everyone here.

Improvements

First, let's cover the improvements. We have submitted three (3) Pull Requests that provide performance improvements:

The first PR, Facebook and Google Login Integration & LiveSync Update, is primarily a feature improvement, providing integration with Facebook and Google (for registration and login); however, this PR also includes a few minor performance enhancements, primarily around the use of caching. The performance difference from this PR alone should be relatively minimal, but it is one of the changes involved in the code we are testing.

The second PR, Local Caching of Database and Cache Results, is the primary performance update of the set. This new local caching is a major improvement on performance and scalability. From the PR message:

  • Results from the database (MySQL) are stored and queried from the Cache service (Memcached). Results from Memcached are now stored locally within HHVM memory as well.

  • Through the new Cache class a global cache object is created and used throughout a single HHVM execution thread (user request).

  • This change results in a 99.9995% reduction in Memcached requests (at scale), providing a major improvement in scalability and performance.

As you can see, this code change dramatically lowers the number of queries to Memcached, which based on our recent testing has shown to be a major area of performance degradation. While this is by no means a resolution to issue #572, it is a stop-gap until further rewrites of the models can be completed. Regardless, the performance and scalability implications of this new local cache are massively beneficial.

The last PR, Blocking AJAX Requests, is a further improvement on PR #565. From the PR:

  • AJAX requests for the gameboard are now individually blocking. A new request will not be dispatched until the previous request has completed.

  • AJAX requests will individually stop subsequent requests on a hard-error.

While this PR will not prevent performance issues, it does keep them from being exacerbated by non-stop user requests. The particular implementation ensures that a user does not request the same (AJAX) data more than once at a time - that is, it will wait until the previous request completes before requesting the refresh again. Again, this will not solve an existing performance degradation situation, but it will prevent the problem from getting worse.
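
The blocking behavior amounts to a simple in-flight guard; a rough sketch (in Python for illustration, though the real implementation lives in the gameboard's JavaScript):

```python
class RefreshGuard:
    """Skip a new refresh while the previous one is still in flight,
    and stop entirely on a hard error."""

    def __init__(self):
        self.in_flight = False
        self.stopped = False

    def try_refresh(self, do_request):
        if self.stopped or self.in_flight:
            return False            # previous request still pending
        self.in_flight = True
        try:
            do_request()
        except Exception:
            self.stopped = True     # hard error: stop further requests
            return False
        finally:
            self.in_flight = False
        return True

guard = RefreshGuard()
results = []

def overlapping_tick():
    # A nested tick fires while this request is still pending.
    results.append(guard.try_refresh(overlapping_tick))

guard.try_refresh(overlapping_tick)
print(results)                      # -> [False]
```

The nested call is rejected because the outer request has not yet completed, which is exactly the compounding the PR is trying to avoid.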

Testing

Next, let's cover the new testing results, based on the new code mentioned above. We utilized the existing dev branch code, with the new PRs, referred to above, merged. We also included a modified version of PR #564 (based on the discussion in the PR thread - we took just the schema indexing improvements). You can grab a copy of the fully merged code here: https://github.com/WraySec/fbctf/tree/merge. If you wish to test this code yourself, you can do the following:

  • git clone https://github.com/WraySec/fbctf.git
  • git checkout merge
  • Complete the provisioning process

For our testing, we utilized the recommended production installation: Production Mode with Multi Server Setup.

As we have covered in recent testing and posts to this thread, HHVM has been the major bottleneck, especially given the previous reliance on, and absurdly high number of hits to, Memcached from HHVM. As such, our testing has focused on the HHVM server (as we are using separate servers for each service, HHVM is on its own server). In fact, during our testing, we never hit the limitations of a properly configured and tuned Nginx, Memcached, or MySQL server. The limitations always came from the HHVM server exclusively.

Results

Now, onto the results. With the new code improvements, we were able to support 75-150 simultaneous sessions per CPU core. That gives us 75 sessions per core, optimally, with the platform still being usable at up to 150 sessions per core. Past 150 sessions, while the platform would work, the latency was increasingly unbearable. The HHVM system would finally fail completely at approximately 1000 user sessions per core. Furthermore, each server, regardless of CPU cores or general resources, has a maximum upper limit of 2000 sessions. That is, no matter the hardware level, you will not successfully exceed 2000 sessions - this limitation was hit due to constraints of the networking stack (sockets, client ports, backlogs, network speed/capacity, etc.).

One of the issues we were previously seeing, before the new improvements, was a lack of scalability with additional resources. Increasing the amount of processing power (or servers) provided little to no improvement in performance. With these new changes, that is no longer the case; instead, the performance of the platform scales linearly with the system resources. Providing additional processing power will increase the performance or supportable load of the platform. This change is a significant improvement and should allow people to dynamically scale the platform based on their level of users, sessions, and load.

To further test the scalability, we implemented load-balancing of the HHVM server instances. This load-balancing allowed us to deploy multiple HHVM servers to distribute the load across and increase the number of users supported. The numbers are scalable from the original numbers provided: 75-150 sessions per core, with a slight increase when utilizing a cluster of HHVM servers, due to the networking relief each individual server experiences.

In fact, we deployed two HHVM instances with eight cores each and hit an upper bound of approximately 3500 users.

Resources

To summarize the resource requirements of the system, based on these new improvements, we have the following:

Minimum

1 CPU Core per 150 Users

Recommended

1 CPU Core per 75 Users

Over 2000 Users

HHVM, regardless of CPU cores, must be separated across multiple servers
Each HHVM server can only handle a maximum of 2000 users

Note: Memory usage was never a concern, as we never hit any upper limits of memory utilization. For transparency, we had approximately 2GB of memory per core in our testing. However, again, we stayed well in the single digits on memory utilization throughout our testing.
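
For planning purposes, the numbers above can be turned into a rough sizing calculation (an illustrative sketch only, not official guidance):

```python
import math

# Figures from this thread: 75 users/core recommended, 150 users/core
# minimum, and a hard cap of ~2000 users per HHVM server.
PER_CORE_RECOMMENDED = 75
PER_CORE_MINIMUM = 150
PER_SERVER_CAP = 2000

def plan(users: int) -> dict:
    """Estimate HHVM servers and cores needed for a given user count."""
    servers = max(1, math.ceil(users / PER_SERVER_CAP))
    return {
        "servers": servers,
        "total_cores_recommended": math.ceil(users / PER_CORE_RECOMMENDED),
        "total_cores_minimum": math.ceil(users / PER_CORE_MINIMUM),
    }

print(plan(3500))
# -> {'servers': 2, 'total_cores_recommended': 47, 'total_cores_minimum': 24}
```

Note that the server count comes from the per-server session cap, not from core count, which is why an event past 2000 users needs multiple HHVM servers no matter how large each one is.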

Conclusion

The new improvements have provided a massive boost to the performance and scalability of the project. We will continue to perform load testing and make improvements where possible. Our direct next steps will be a review of each AJAX endpoint utilized on the game board for further possible performance improvements.

justinwray added a commit that referenced this issue Nov 17, 2017
* Major Performance Enhancements and Bug Fixes

* **NOTICE**:  _This PR is extremely large.  Due to the interdependencies, these code changes are being included as a single PR._

* **Local Caching of Database and Memcached Results**

  * Replaced PR #574

  * Results from the database (MySQL) are stored and queried from the Cache service (Memcached). Results from Memcached are now stored locally within HHVM memory as well.

  * New class `Cache` has been added, with four methods:

    * `setCache()`
    * `getCache()`
    * `deleteCache()`
    * `flushCache()`

  * The new `Cache` class object is included as a static property of the `Model` class and all children.

  * Through the new `Cache` object, all `Model` sub-classes will automatically utilize the temporary HHVM-memory cache for results that have already been retrieved.

  * The `Cache` object is created and used throughout a single HHVM execution thread (user request).

  * This change results in a massive reduction in Memcached requests (over 99% - at scale), providing a significant improvement in scalability and performance.

  * The local cache can be deleted or flushed from the `Model` class or any child class with the usage of `deleteLocalCache()`.  The `deleteLocalCache()` method works like `invalidateMCRecords()`.  When called from a child class, a `$key` can be passed in, matching the MC key identifier used elsewhere, or if no `$key` is specified all of the local caches for that class will be deleted.  If the calling class is `Model` then the entire local cache is flushed.

  * Some non-HTTP code has local cache values explicitly deleted, or the local cache completely flushed, as the execution thread is continuous:

    * Prevent `autorun.php` from storing timestamps in the local cache, forever (the script runs continuously).

    * Flush the local cache before the next cycle of `bases.php` to ensure the game is still running and the configuration of the bases has not changed (the script runs continuously).

    * Flush the local cache before the next import cycle of `liveimport.php` to ensure we get the up-to-date team and level data (the script runs continuously).

  * The `Cache` class is specifically separate from `Model` (as an independent class) so that other code may instantiate and utilize a temporary (request-exclusive) local-memory-based caching solution, with a common interface.  The usage provides local caching without storing the data in MySQL, Memcached, or exposing it to other areas of the application. (For example, this is being utilized in some `Integration` code already.)

  * Implemented CR from PR #574.

  * Relevant:  Issue #456 and Comment #456 (comment)

* **Blocking AJAX Requests**

  * Replaced PR #575

  * Expansion and Bug Fixes of PR #565

  * AJAX requests for the gameboard are now individually blocking.  A new request will not be dispatched until the previous request has completed.

  * AJAX requests will individually stop subsequent requests on a hard-error.

  * The blocking of continuous AJAX requests, when the previous has not yet returned, or on a hard error, provides a modest performance benefit by not compounding the issue with more unfulfillable requests.

  * Forceful refreshes are still dispatched every 60 seconds, regardless of the blocking state on those requests.

  * Relevant:  Issue #456 and Comment #456 (comment)

* **AJAX Endpoint Optimization**

  * Removed nested loops within multiple AJAX endpoints:

    * `map-data.php`

    * `country-data.php`

    * `leaderboard.php`

  * All Attachments, including Link and Filename, are now cached and obtained through:  `Attachment::genAllAttachmentsFileNamesLinks()`.

  * All Team names of those who have completed a level, are now cached and obtained through `MultiTeam::genCompletedLevelTeamNames()`.

  * All Levels and Country for map displays are cached and obtained through `Level::genAllLevelsCountryMap()` and `Country::genAllEnabledCountriesForMap()`.

  * Relevant:  Issue #456

* **Memcached Cluster Support**

  * The platform now supports a cluster of Memcached nodes.

  * Configuration for the `MC_HOST` within the `settings.ini` file is now an array, instead of a single value:

    * `MC_HOST[] = 127.0.0.1`

  * Multiple Memcached servers can be configured by providing additional `MC_HOST` lines:

    * `MC_HOST[] = 1.2.3.4`

    * `MC_HOST[] = 5.6.7.8`

  * The platform uses a Write-Many Read-Once approach to the Memcached Cluster.  Specifically, data is written to all of the configured Memcached nodes and then read from a single node at random.  This approach ensures that all of the nodes stay in sync and up-to-date while providing a vital performance benefit to the more expensive and frequent operation of reading.

  * The existing `Model` methods (`setMCRecords()` and `invalidateMCRecords()`) all call and utilize the new cluster methods:

    * `writeMCCluster()`

    * `invalidateMCCluster()`

  * The flushing of Memcached has also been updated to support the multi-cluster approach:  `flushMCCluster()`.

  * Note that the usage of a Memcached Cluster is seamless for administrators and users, and works in conjunction with the Local Cache.  Also note, the platform works identically, for administrators and users, for both single-node and multi-node Memcached configurations.

  * The default configuration remains a single-node configuration.  The utilization of a Memcached Cluster requires the following:

    * The configuration and deployment of multiple Memcached nodes (the `quick_setup install_multi_cache` or Memcached specific provision, will work).

    * The modification of `settings.ini` to include all of the desired Memcached hosts.

    * All Memcached hosts must be identically configured.

  * Usage of a Memcached Cluster is only recommended in the Multi-Server deployment modes.

  * Relevant:  Issue #456
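
The Write-Many Read-Once approach described above can be sketched like so (Python for illustration; plain dicts stand in for the real Memcached clients, and the method names are hypothetical):

```python
import random

class MCCluster:
    """Write to every node; read from one node picked at random."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        for node in self.nodes:           # write-many: keep all nodes in sync
            node[key] = value

    def read(self, key):
        node = random.choice(self.nodes)  # read-once: spread the read load
        return node.get(key)

    def invalidate(self, key):
        for node in self.nodes:           # invalidation must also hit every node
            node.pop(key, None)

cluster = MCCluster([{}, {}, {}])
cluster.write("ALL_TEAMS", ["alpha", "bravo"])
print(cluster.read("ALL_TEAMS"))          # -> ['alpha', 'bravo'] from any node
```

Because every node holds every key, the random read always returns current data, and read throughput scales with the number of nodes while writes pay the cost of fanning out.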

* **Load Balancing of Application Servers (HHVM)**

  * The platform now supports the ability to load balance multiple HHVM servers.

  * To facilitate the load balancing of the HHVM servers, the following changes were made:

    * Scripts (`autorun`, `progressive`, etc.) are now tracked on a per-server level, preventing multiple copies of the scripts from being executed on the HHVM servers.

    * Additional database verification on scoring events to prevent multiple captures.

  * Load Balancing of HHVM is only recommended in the Multi-Server deployment modes.

  * Relevant:  Issue #456

* **Leaderboard Limits**

  * `MultiTeam::genLeaderboard()` now limits the total number of teams returned based on a configurable setting.

  * A new argument has been added to `MultiTeam::genLeaderboard()`: `limit`.  This value, either `true` or `false`, indicates where the limit should be enforced, and defaults to `true`.

  * When the data is not cached, `MultiTeam::genLeaderboard()` will only build, cache, and return the number of teams needed to meet the limit.

  * When the data is already cached, `MultiTeam::genLeaderboard()` will ensure the limit value has not changed and return the cached results.  If the configured limit value has changed, `MultiTeam::genLeaderboard()` will build, cache, and return the results based on the new limit.

  * The "Scoreboard" modal (found from the main gameboard) is a special case where all teams should be displayed.  As such, the Scoreboard modal sets the `limit` value to `false` retuning all teams.  This full leaderboard will be cached, but all other display limits are still enforced based on the configured limit.  Once invalidated, the cached data will return to the limited subset.

  *  Because a full leaderboard is not always cached, this does result in the first hit to the Scoreboard modal requiring a database hit.

  * A user, whose rank is above the limit, will have their rank shown to them as `$limit+`.  For example, if the limit is set to `50` and the user's rank is above `50`, they would see:  `51+` as their rank.

  * Overall, the caching of the Leaderboard, one of the more resource-intensive and frequent queries, resulted in significant performance gains.

  * The Leaderboard limit is configurable by administrators within the administrative interface.  The default value is `50`.

  * Relevant:  Issue #456
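
The rank-capping rule above amounts to the following (a hypothetical helper in Python, not the platform's Hack code):

```python
def display_rank(rank: int, limit: int = 50) -> str:
    """Ranks beyond the configured leaderboard limit display as '<limit+1>+'."""
    if rank > limit:
        return f"{limit + 1}+"
    return str(rank)

print(display_rank(12))    # -> 12
print(display_rank(73))    # -> 51+
```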

* **Activity Log Limits**

  * The Activity Log is now limited to the most recent `100` log entries.  The value is not configurable.

  * The activity log is continually queried and contains a large amount of data, as such, it is a very resource-intensive request.

  *  The limit on the results built, cached, and returned for the activity log provides a notable improvement in performance.

  * Relevant:  Issue #456

* **Database Optimization**

  * Expansion of PR #564

  * Added additional indexing of the database tables in the schema.

  * The additional indexing provides further performance improvements to the platform queries, especially those found in `MultiTeam` and those queries continually utilized as a result of the AJAX calls.

  * Relevant:  Issue #456 and Comment #456 (comment)

* **Team and MultiTeam Performance Improvements**

  * Updated numerous `Team::genTeam()` calls to use the cached version: `MultiTeam::genTeam()`.

  * Optimized the database query within `MultiTeam::genFirstCapture()` to return the `team_id` and build the `Team` from the cache.

  * Optimized the database query within `MultiTeam::genCompletedLevel()` to return the `team_id` and build the `Team` from the cache.

  * Optimized the database query within `MultiTeam::genAllCompletedLevels()` to return the `team_id` and build the `Team` from the cache.

  * A full invalidation of the `MultiTeam` cache is no longer executed when a new team is created.  Newly created teams will not have any valid scoring activity.  Delaying the rebuild of the scoring-related cache provides a modest performance improvement.  The new team will not show up in certain areas (namely the full scoreboard) until they or someone else performs a scoring action.  To ensure the team is properly functioning, the following caches are specifically invalidated on a new team creation:

    * `ALL_TEAMS`
    * `ALL_ACTIVE_TEAMS`
    * `ALL_VISIBLE_TEAMS`
    * `TEAMS_BY_LOGO`

  * Fixed an extremely rare race condition within `MultiTeam::genFirstCapture()`.

  * Relevant:  Issue #456

* **Combined Awaitables**

  * Combined Awaitables which were not in nested loops.

  * Combined Awaitables found in some nested loops, where existing code provided a streamlined approach.

  * Given the lack of support for concurrent queries to a single database connection, some queries were combined via `multiQuery()` (in the case where the queries were modifying data within the database).  TODO:  Build and utilize additional `AsyncMysqlConnection` within the pool for suitable concurrent queries.

  * Annotated Awaitables within a nested loop for future optimization.

  * Relevant:  Issue #577

* **Facebook and Google Login Integration**

  * Replaced PR #573

  * The platform now supports Login via OAuth2 for Facebook and Google. When configured and enabled, users will have the option to link and login to their existing account with a Facebook or Google account.

  * Automated registration through Facebook or Google OAuth2 is now supported. When configured and enabled, users will have the option to register an account by using and linking an existing account with Facebook or Google.

  * New `configuration` options added to the database schema:

    * Added `facebook_login`. This configuration option is a toggleable setting to enable or disable login via Facebook.

    * Added `google_login`. This configuration option is a toggleable setting to enable or disable login via Google.

    * Added `facebook_registration`. This configuration option is a toggleable setting to enable or disable registration via Facebook.

    * Added `google_registration`. This configuration option is a toggleable setting to enable or disable registration via Google.

    * Added `registration_prefix`. This configuration option is a string that sets the prefix for the randomly generated username/team name for teams registered via (Facebook or Google) OAuth.

  * New Integration section within the Administrative interface allows for control over the Facebook and Google Login, Registration, and the automatic team name prefix option.

  * Overhauled the Login page to support the new Login buttons.  Login page now displays appropriate messages based on the configuration of login.

  * Login form is dynamically generated, based on the configuration options and settings.

  * Overhauled the Registration page to support the new Registration buttons.  The registration page now displays appropriate messages based on the configuration of registration.

  * The registration form is dynamically generated, based on the configuration options and settings.

  * Account Linking for Facebook sets both the Login OAuth values and the LiveSync values (single step for both).

  * Account Linking for Google sets both the Login OAuth values and the LiveSync values (single step for both).

  * Facebook Account linkage option has been added to the Account modal.

  * The Account modal now shows which accounts are already linked.

  * The Account modal will color-code the buttons on an error (red) and success (green).

  * New table "teams_oauth" has been added to handle the OAuth data for Facebook and Google account linkage.

  * New class `Integration` handles the linkage of Facebook or Google accounts with an FBCTF account (both Login OAuth values and the LiveSync values). The Integration class also includes the underlying methods for authentication in both the linkage and login routines and the OAuth registration process.

  * New URL endpoints have been created and simplified for the `Integration` actions:

    * New data endpoint `data/integration_login.php`. This endpoint accepts a type argument, currently supporting types of `facebook` and `google`. Through this endpoint, the login process is handled in conjunction with the Integration class.

    * The new callback URL for Facebook Login: `/data/integration_login.php?type=facebook`

    * The new callback URL for Google Login: `/data/integration_login.php?type=google`

    * New data endpoint `data/integration_oauth.php`. This endpoint accepts a type argument, currently supporting types of `facebook` and `google`. Through this endpoint, the OAuth account linkage is handled in conjunction with the Integration class.

    * The new callback URL for Facebook linkage: `/data/integration_login.php?type=facebook`

    * The new callback URL for Google linkage: `/data/integration_login.php?type=google`

    * Old Google-specific endpoint (`data/google_oauth.php`) has been removed.

  * New Team class methods: `genAuthTokenExists()`, `genTeamFromOAuthToken()`, `genSetOAuthToken()`.

    * `Team::genAuthTokenExists()` allows an OAuth token to be verified.

    * `Team::genTeamFromOAuthToken()` returns a Team object based on the OAuth token supplied.

    * `Team::genSetOAuthToken()` sets the OAuth token for a team.

  * The `settings.ini` (including the packaged example file) and `Configuration` have methods to verify and return Facebook and Google API settings.

    * `Configuration::getFacebookOAuthSettingsExists()` verifies the Facebook API _App ID_ and _App Secret_ are set in the `settings.ini` file.

    * `Configuration::getFacebookOAuthSettingsAppId()` returns the Facebook API _App ID_.

    * `Configuration::getFacebookOAuthSettingsAppSecret()` returns the Facebook API _App Secret_.

    * `Configuration::getGoogleOAuthFileExists()` verifies the Google API JSON file is set and exists in the `settings.ini` file.

    * `Configuration::getGoogleOAuthFile()` returns the filename for the Google API JSON file.

    * All of the Facebook and Google API configuration values are cached (in Memcached) to prevent the repeated loading, reading, and parsing of the `settings.ini` file.

  * To use the new Facebook or Google integration the following must be completed:

    * A Facebook and/or Google Application must be created, and OAuth2 API keys must be obtained.

    * The API keys must be provided in the `settings.ini` file.

    * Desired settings must be configured from within the administrative interface (Configuration) - the default has all integration turned off.

  * The Facebook OAuth code provides CSRF protection through the Graph SDK.

  * The Google OAuth code provides CSRF protection through the usage of the `integration_csrf_token` cookie and API state value.

  * Note: Facebook Login/Integration will not work in development mode - this is due to a pending issue in the Facebook Graph SDK (facebookarchive/php-graph-sdk#853); utilization of the pending PR (facebookarchive/php-graph-sdk#854) resolves this issue. Alternatively, the Integration with Facebook will work in production mode, the recommended mode for a live game.

  * Implemented CR from PR #573.

  * Relevant:  PR #591 and PR #459.

* **LiveSync API and LiveImport Script Update**

  * LiveSync has been updated to support and supply Facebook and Google OAuth output. All of the user's LiveSync integrations (FBCTF, Facebook, and Google) are now provided through the API. As a result, so long as one of the three LiveSync methods is configured by the user (which happens automatically when linking an account to Facebook or Google), the data will become available through the LiveSync API.

  * LiveSync now includes a "general" type. The new `general` type output includes the scoring information using the local team name on the FBCTF instance. This new type is not for importation on another FBCTF instance but does provide the opportunity for third-parties to use the data for score tracking, metric collections, and displays. As such, this new LiveSync data allows the scoring data for a single FBCTF instance to be tracked.

  * The `liveimport.sh` script, used to import LiveSync API data, will ignore the new `general` LiveSync type.

  * Updated `Team::genLiveSyncKeyExists()` and `Team::genTeamFromLiveSyncKey()` to use the new Integration class methods:  `Integration::genFacebookThirdPartyExists()` and `Integration::genFacebookThirdPartyEmail()`.

  * Within the `liveimport.sh` script: when the type is `facebook_oauth`, `Team::genLiveSyncKeyExists()` and `Team::genTeamFromLiveSyncKey()` properly use the Facebook `third_party_id`.

  * `Integration::genFacebookThirdPartyExists()` and `Integration::genFacebookThirdPartyEmail()` query the Facebook API for the corresponding user, storing the results in a temporary HHVM-memory cache, via the `Cache` class.

  * Given that `liveimport.sh` now needs to query the Facebook API for any `facebook_oauth` typed items, the script will utilize the HHVM-memory cache of `Integration` to limit the number of hits to the Facebook API.

  * The `liveimport.sh` script now includes the `Cache` class and the Facebook Graph SDK.

  * Relevant:  PR #459.

* **Error and Exception Handling**

  * All Exceptions, including Redirect Exceptions, are now caught.

  * The NGINX configuration has been updated to catch errors from HHVM (FastCGI) and return `error.php`.

  * The `error.php` page has been updated with a themed error page.

  * The `error.php` page will redirect to `index.php?page=error` so long as `index.php?page=error` is not generating any HTTP errors.  If an error is detected on `index.php?page=error` then no redirect will occur.  The verification of the HTTP status ensures no redirect loops occur.

  * The `DataController` class now includes a `sendData()` method to catch errors and exceptions.  `DataController` children classes now utilize `sendData()` instead of outputting their results directly.

  * On Exception within an AJAX request, an empty JSON array is returned.  This empty array prevents client-side errors.

  * The `ModuleController` class now includes a `sendRender()` method to catch errors and exceptions.  `ModuleController` children classes now utilize `sendRender()` instead of outputting their results directly.

  * On Exception within a Module request, an empty string is returned.  This empty string prevents client-side and front-end errors.

  * A new AJAX endpoint has been added:  `/data/session.php`.  The response of the endpoint is used to determine if the user's session is still active.  If a user's session is no longer active, they will be redirected from the gameboard to the login page.  This redirection ensures that they do not continually perform AJAX requests.

  * Custom HTTP headers are used to monitor AJAX responses:

    * The Login page now includes a custom HTTP header: `Login-Page`.

    * The Error page now includes a custom HTTP header:  `Error-Page`.

  * The custom HTTP headers are used client-side (JS) to determine if a request or page rendered an error or requires authentication.

  * Exception log outputs now include additional information on which Exception was thrown.

  * Users should no longer directly receive an HTTP 500.

  * These Exception changes prevent the error logs from being filled with unauthenticated requests.  The changes also provide a user-friendly experience when things malfunction or a user needs to reauthenticate.

  * Relevant:  Issue #563
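
The fallback behavior of `sendData()` and `sendRender()` can be sketched as follows (a minimal Python sketch; the real classes are Hack, and the exception logging this PR adds is omitted here):

```python
import json

def send_data(render) -> str:
    """Sketch of the described sendData() behavior: on any exception,
    return an empty JSON array so AJAX callers get valid JSON instead
    of an HTTP 500."""
    try:
        return json.dumps(render())
    except Exception:
        return json.dumps([])  # empty array prevents client-side errors

def send_render(render) -> str:
    """Sketch of sendRender(): module rendering falls back to an empty
    string, keeping the front end intact on failure."""
    try:
        return render()
    except Exception:
        return ""
```

Client-side code can then handle an empty array or string as "nothing to display" rather than parsing an error page.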

* **Team Account Modal Update**

  * Users can now change their team name from within the Account modal.

  * The Account modal now contains the following options:

    * Team Name

    * Facebook Account Linkage

    * Google Account Linkage

    * FBCTF LiveSync Authentication

  * Relevant:  PR #459.

* **Non-Visible/Inactive Team Update**

  * Ensure that non-visible or inactive teams do not show up for any other users.

  * Non-Visible/Inactive teams are not awarded as the "first capture."

  * Non-Visible/Inactive teams do not show in the "captured by" list.

  * Countries will not show as captured (for other teams) if only captured by a Non-Visible/Inactive team.

  * Activity Log entries for a Non-Visible or Inactive team are not included in the activity log for other users.

  * Updated `ScoreLog::genAllPreviousScore()` and `ScoreLog::genPreviousScore()` to only include Visible and Active teams, or the user's own team.

  * Teams who are Non-Visible or Inactive will have a rank of "N/A."

  * Relevant:  PR #513
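
The visibility rules above amount to filtering before ranking: only visible, active teams are ranked, hidden teams are omitted for other viewers, and a hidden team still sees itself with a rank of "N/A". A sketch (Python for illustration; the data model is simplified and hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Team:
    name: str
    points: int
    visible: bool = True
    active: bool = True

def ranked_leaderboard(teams, viewer: str):
    """Return (name, rank) pairs as seen by `viewer`, applying the
    Non-Visible/Inactive rules described above."""
    # Only visible AND active teams compete for numeric ranks.
    ranked = sorted((t for t in teams if t.visible and t.active),
                    key=lambda t: -t.points)
    positions = {t.name: str(i + 1) for i, t in enumerate(ranked)}
    out = []
    for t in teams:
        if t.name in positions:
            out.append((t.name, positions[t.name]))
        elif t.name == viewer:
            out.append((t.name, "N/A"))  # own team always visible to itself
        # hidden/inactive teams are omitted entirely for other viewers
    return out
```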

* **Mobile Page Update**

  * The mobile page is shown when a user's window is narrower than `960px`.  While this is geared toward mobile users, it can also happen when the window size on a non-mobile device is too small.

  * The mobile page now includes a "Refresh" button, which will reload the page.

  * The mobile page will refresh, attempting to re-render correctly, after `2` seconds.

  * If a user resizes their window to a larger size, they should reload into a properly displayed screen, and not the mobile warning.

* **Login and Registration JS Fixes**

  * Corrected inconsistent usage of `teamname` and `team_name` across PHP and JS code.

  * Ensured that all JavaScript is using `team_name`.

  * Ensured that all PHP is using `team_name` when interacting with JS.

  * Updated the input filters within PHP when retrieving input for the team name, using `team_name`.

  * Updated Login errors to highlight the username and password fields.

  * Relevant:  Issue #571, Issue #558, Issue #521, PR #592, and PR #523

* **System Statistics JSON Endpoint**

  * A new administrative-only JSON endpoint has been added that provides statistical data about the platform and game.

  * The endpoint is found at `/data/stata.php`.  Access to the endpoint requires an authenticated administrative session.

  * The endpoint provides the following information:

    * Number of Teams (`teams`)

    * Number of Sessions (`sessions`)

    * Total Number of Levels (`levels`)

    * Number of Active Levels (`active_levels`)

    * Number of Hints (`hints`)

    * Number of Captures (`captures`)

    * `AsyncMysqlConnectionPool` Statistics (`database`)

      * Created Connections (`created_pool_connections`)

      * Destroyed Connections (`destroyed_pool_connections`)

      * Connection Requests (`connections_requested`)

      * Pool Hits (`pool_hits`)

      * Pool Misses (`pool_misses`)

    * `Memcached` Statistics (`memcached`)

      * Node Address

        * Node Address:Port

          * Process ID (`pid`)

          * Uptime (`uptime`)

          * Threads (`threads`)

          * Timestamp (`time`)

          * Size of Pointer (`pointer_size`)

          * Total User Time for Memcached Process (`rusage_user_seconds`)

          * Total User Time for Memcached Process (`rusage_user_microseconds`)

          * Total System Time for Memcached Process (`rusage_system_seconds`)

          * Total System Time for Memcached Process (`rusage_system_microseconds`)

          * Current Items in Cache (`curr_items`)

          * Total Items in Cache (`total_items`)

          * Max Bytes Limit (`limit_maxbytes`)

          * Number of Current Connections (`curr_connections`)

          * Number of Total Connections (`total_connections`)

          * Number of Current Connection Structures Allocated (`connection_structures`)

          * Number of Bytes Used (`bytes`)

          * Total Number of Cache Get Requests (`cmd_get`)

          * Total Number of Cache Set Requests (`cmd_set`)

          * Total Number of Successful Cache Retrievals (`get_hits`)

          * Total Number of Unsuccessful Cache Retrievals (`get_misses`)

          * Total Number of Cache Evictions (`evictions`)

          * Total Number of Bytes Read (`bytes_read`)

          * Total Number of Bytes Written (`bytes_written`)

          * Memcached Version (`version`)

    * System Load Statistics (`load`)

      * One Minute Average (`0`)

      * Five Minute Average (`1`)

      * Fifteen Minute Average (`2`)

    * System CPU Utilization (`load`)

      * Userspace Utilization Percentage (`user`)

      * Nice Utilization Percentage (`nice`)

      * System Utilization Percentage (`sys`)

      * Idle Percentage (`idle`)

  * The endpoint provides current data and can be polled/ingested for historical data reporting.

  * For more information on the `AsyncMysqlConnectionPool` statistics, please see:  https://docs.hhvm.com/hack/reference/class/AsyncMysqlConnection/ and https://docs.hhvm.com/hack/reference/class/AsyncMysqlConnectionPool/getPoolStats/

  * For more information on the `Memcached` statistics, please see:  https://github.com/memcached/memcached/blob/master/doc/protocol.txt and https://secure.php.net/manual/en/memcached.getstats.php
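
Since the endpoint only reports current values, historical reporting requires periodic polling. A sketch of an ingester (Python for illustration; the sample payload below is fabricated, only a subset of fields is kept, and a real poller would fetch the endpoint over HTTP with an authenticated administrative session):

```python
import json

def ingest_stats(payload: str, history: list) -> dict:
    """Parse one response from the stats endpoint and append a trimmed
    sample to `history` for later trend reporting. Key names follow the
    field list documented above."""
    stats = json.loads(payload)
    sample = {
        "teams": stats["teams"],
        "sessions": stats["sessions"],
        "load_1m": stats["load"][0],   # one-minute load average
        "load_15m": stats["load"][2],  # fifteen-minute load average
    }
    history.append(sample)
    return sample

# Fabricated example of (part of) an endpoint response.
sample_payload = json.dumps({
    "teams": 150,
    "sessions": 142,
    "load": [0.42, 0.37, 0.30],
})
```

Running the ingester on a schedule (e.g. every minute via cron) yields a time series suitable for graphing game load over an event.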

* **Miscellaneous Changes**

  * Added `Announcement` and `ActivityLog` to `autorun.php`.

  * Added `Announcement` and `ActivityLog` to `bases.php`.

  * Added/Updated UTF-8 encoding on various user-controlled values, such as team name.

  * Changed the "Sign Up" link to a button on the login page.

  * Allow any Logo to be re-used once all logos are in use.

  * Invalidate Scores and Hint cache when a Team is deleted.

  * Reverify the game status (running or stopped) before the next cycle of base scoring in `bases.php`.

  * Allow the game to stop even if some scripts (`autorun`, `bases`, `progressive`, etc.) are not running.

  * Fixed a bug where teams with a custom logo could not be edited by an administrator.

  * Added "Reset Schedule" button to administrative interface to completely remove a previously set game schedule.  The game schedule can only be reset if the game is not running.  Otherwise, the existing schedule must be modified.

  * Moved "Begin Game," "Pause Game," and "End Game" outside of the scrollable admin list into a new fixed pane below the navigation list.

  * Formatted all code files as part of this PR.

  * Updated ActivityLog to delete entries when deleting a team.

  * Updated PHPUnit tests based on the new changes.
iliushin-a added a commit to iliushin-a/fbctf that referenced this issue May 16, 2018
* Major Performance Enhancements and Bug Fixes

* **NOTICE**:  _This PR is extremely large.  Due to the interdependencies, these code changes are being included as a single PR._

* **Local Caching of Database and Memcached Results**

  * Replaced PR facebookarchive#574

  * Results from the database (MySQL) are stored and queried from the Cache service (Memcached). Results from Memcached are now stored locally within HHVM memory as well.

  * New class `Cache` has been added, with four methods:

    * `setCache()`
    * `getCache()`
    * `deleteCache()`
    * `flushCache()`

  * The new `Cache` class object is included as a static property of the `Model` class and all children.

  * Through the new `Cache` object, all `Model` sub-classes will automatically utilize the temporary HHVM-memory cache for results that have already been retrieved.

  * The `Cache` object is created and used throughout a single HHVM execution thread (user request).

  * This change results in a massive reduction in Memcached requests (over 99% - at scale), providing a significant improvement in scalability and performance.

  * The local cache can be deleted or flushed from the `Model` class or any child class with the usage of `deleteLocalCache()`.  The `deleteLocalCache()` method works like `invalidateMCRecords()`.  When called from a child class, a `$key` can be passed in, matching the MC key identifier used elsewhere, or if no `$key` is specified all of the local caches for that class will be deleted.  If the calling class is `Model` then the entire local cache is flushed.

  * Some non-HTTP code has local cache values explicitly deleted, or the local cache completely flushed, as the execution thread is continuous:

    * Prevent `autorun.php` from storing timestamps in the local cache, forever (the script runs continuously).

    * Flush the local cache before the next cycle of `bases.php` to ensure the game is still running and the configuration of the bases has not changed (the script runs continuously).

    * Flush the local cache before the next import cycle of `liveimport.php` to ensure we get the up-to-date team and level data (the script runs continuously).

  * The `Cache` class is specifically separate from `Model` (as an independent class) so that other code may instantiate and utilize a temporary (request-exclusive) local-memory-based caching solution, with a common interface.  The usage provides local caching without storing the data in MySQL, Memcached, or exposing it to other areas of the application. (For example, this is being utilized in some `Integration` code already.)

  * Implemented CR from PR facebookarchive#574.

  * Relevant:  Issue facebookarchive#456 and Comment facebookarchive#456 (comment)

* **Blocking AJAX Requests**

  * Replaced PR facebookarchive#575

  * Expansion and Bug Fixes of PR facebookarchive#565

  * AJAX requests for the gameboard are now individually blocking.  A new request will not be dispatched until the previous request has completed.

  * AJAX requests will individually stop subsequent requests on a hard-error.

  * The blocking of continuous AJAX requests, when the previous has not yet returned, or on a hard error, provides a modest performance benefit by not compounding the issue with more unfulfillable requests.

  * Forceful refreshes are still dispatched every 60 seconds, regardless of the blocking state on those requests.

  * Relevant:  Issue facebookarchive#456 and Comment facebookarchive#456 (comment)

* **AJAX Endpoint Optimization**

  * Removed nested loops within multiple AJAX endpoints:

    * `map-data.php`

    * `country-data.php`

    * ` leaderboard.php`

  * All Attachments, including Link and Filename, are now cached and obtained through:  `Attachment::genAllAttachmentsFileNamesLinks()`.

  * All Team names of those who have completed a level, are now cached and obtained through `MultiTeam::genCompletedLevelTeamNames()`.

  * All Levels and Country for map displays are cached and obtained through `Level::genAllLevelsCountryMap()` and `Country::genAllEnabledCountriesForMap()`.

  * Relevant:  Issue facebookarchive#456

* **Memcached Cluster Support**

  * The platform now supports a cluster of Memcached nodes.

  * Configuration for the `MC_HOST` within the `settings.ini` file is now an array, instead of a single value:

    * `MC_HOST[] = 127.0.0.1`

  * Multiple Memcached servers can be configured by providing additional `MC_HOST` lines:

    * `MC_HOST[] = 1.2.3.4`

    * `MC_HOST[] = 5.6.7.8`

  * The platform uses a Write-Many Read-Once approach to the Memcached Cluster.  Specifically, data is written to all of the configured Memcached nodes and then read from a single node at random.  This approach ensures that all of the nodes stay in sync and up-to-date while providing a vital performance benefit to the more expensive and frequent operation of reading.

  * The existing `Model` methods (`setMCRecords()` and `invalidateMCRecords()`) all call and utilize the new cluster methods:

    * `writeMCCluster()`

    * `invalidateMCCluster()`

  * The flushing of Memcached has also been updated to support the multi-cluster approach:  `flushMCCluster()`.

  * Note that the usage of a Memcached Cluster is seamless for administrators and users, and works in conjunction with the Local Cache.  Also note, the platform works identically, for administrators and users, for both single-node and multi-node Memcached configurations.

  * The default configuration remains a single-node configuration.  The utilization of a Memcached Cluster requires the following:

    * The configuration and deployment of multiple Memcached nodes (the `quick_setup install_multi_cache` or Memcached specific provision, will work).

    * The modification of `settings.ini` to include all of the desired Memcached hosts.

    * All Memcached hosts must be identically configured.

  * Usage of a Memcached Cluster is only recommended in the Multi-Server deployment modes.

  * Relevant:  Issue facebookarchive#456

* **Load Balancing of Application Servers (HHVM)**

  * The platform now supports the ability to load balance multiple HHVM servers.

  * To facilitate the load balancing of the HHVM servers, the following changes were made:

    * Scripts (`autorun`, `progressive`, etc.) are now tracked on a per-server level, preventing multiple copies of the scripts from being executed on the HHVM servers.

    * Additional database verification on scoring events to prevent multiple captures.

  * Load Balancing of HHVM is only recommended in the Multi-Server deployment modes.

  * Relevant:  Issue facebookarchive#456

* **Leaderboard Limits**

  * `MultiTeam::genLeaderboard()` now limits the total number of teams returned based on a configurable setting.

  * A new argument has been added to `MultiTeam::genLeaderboard()`: `limit`.  This value, either `true` or `false`, indicates where the limit should be enforced, and defaults to `true`.

  * When the data is not cached, `MultiTeam::genLeaderboard()` will only build, cache, and return the number of teams needed to meet the limit.

  * When the data is already cached, `MultiTeam::genLeaderboard()` will ensure the limit value has not changed and returned the cached results.  If the configured limit value has been changed, `MultiTeam::genLeaderboard()` will build, cache, and return the number based on the new limit.

  * The "Scoreboard" modal (found from the main gameboard) is a special case where all teams should be displayed.  As such, the Scoreboard modal sets the `limit` value to `false` retuning all teams.  This full leaderboard will be cached, but all other display limits are still enforced based on the configured limit.  Once invalidated, the cached data will return to the limited subset.

  *  Because a full leaderboard is not always cached, this does result in the first hit to the Scoreboard modal requiring a database hit.

  * A user, whose rank is above the limit, will have their rank shown to them as `$limit+`.  For example, if the limit is set to `50` and the user's rank is above `50`, they would see:  `51+` as their rank.

  * Overall, the caching of the Leaderboard, one of the more resource-intensive and frequent queries, resulted in significant performance gains.

  * The Leaderboard limit is configurable by administrators within the administrative interface.  The default value is `50`.

  * Relevant:  Issue facebookarchive#456

* **Activity Log Limits**

  * The Activity Log is now limited to the most recent `100` log entries.  The value is not configurable.

  * The activity log is continually queried and contains a large amount of data, as such, it is a very resource-intensive request.

  *  The limit on the results built, cached, and returned for the activity log provides a notable improvement in performance.

  * Relevant:  Issue facebookarchive#456

* **Database Optimization**

  * Expansion of PR facebookarchive#564

  * Added additional indexing of the database tables in the schema.

  * The additional indexing provides further performance improvements to the platform queries, especially those found in `MultiTeam` and those queries continually utilized as a result of the AJAX calls.

  * Relevant:  Issue facebookarchive#456 and Comment facebookarchive#456 (comment)

* **Team and MultiTeam Performance Improvements**

  * Updated numerous `Team::genTeam()` calls to used the cached version: `MultiTeam::genTeam()`.

  * Optimized the database query within `MultiTeam::genFirstCapture()` to return the `team_id` and build the `Team` from the cache.

  * Optimized the database query within `MultiTeam::genCompletedLevel()` to return the `team_id` and build the `Team` from the cache.

  * Optimized the database query within `MultiTeam::genAllCompletedLevels()` to return the `team_id` and build the `Team` from the cache.

  * A full invalidation of the `MultiTeam` cache is no longer executed when a new team is created.  Newly created teams will not have any valid scoring activity.  Delaying the rebuild of the scoring related cache provides a modest performance improvement.  The new team will not show up in certain areas (namely the full scoreboard) until they or someone else perform a scoring action.  To ensure the team is properly functioning, following cache is specifically invalided on a new team creation:

    * `ALL_TEAMS`
    * `ALL_ACTIVE_TEAMS`
    * `ALL_VISIBLE_TEAMS`
    * `TEAMS_BY_LOGO`

  * Fixed an extremely rare race condition within `MultiTeam::genFirstCapture()`.

  * Relevant:  Issue facebookarchive#456

* **Combined Awaitables**

  * Combined Awaitables which were not in nested loops.

  * Combined Awaitables found in some nested loops, where existing code provided a streamlined approach.

  * Given the lack of support for concurrent queries to a single database connection, some queries were combined via `multiQuery()` (in the case where the queries were modifying data within the database).  TODO:  Build and utilize additional `AsyncMysqlConnection` within the pool for suitable concurrent queries.

  * Annotated Awaitables within a nested loop for future optimization.

  * Relevant:  Issue facebookarchive#577

* **Facebook and Google Login Integration**

  * Replaced PR facebookarchive#573

  * The platform now supports Login via OAuth2 for Facebook and Google. When configured and enabled, users will have the option to link and login to their existing account with a Facebook or Google account.

  * Automated registration through Facebook or Google OAuth2 is now supported. When configured and enabled, users will have the option to register an account by using and linking an existing account with Facebook or Google.

  * New `configuration` options added to the database schema:

    * Added `facebook_login`. This configuration option is a toggleable setting to enable or disable login via Facebook.

    * Added `google_login`. This configuration option is a toggleable setting to enable or disable login via Google.

    * Added `facebook_registration`. This configuration option is a toggleable setting to enable or disable registration via Facebook.

    * Added `google_registration`. This configuration option is a toggleable setting to enable or disable registration via Google.

    * Added `registration_prefix`. This configuration option is a string that sets the prefix for the randomly generated username/team name for teams registered via (Facebook or Google) OAuth.

  * New Integration section within the Administrative interface allows for control over the Facebook and Google Login, Registration, and the automatic team name prefix option.

  * Overhauled the Login page to support the new Login buttons.  Login page now displays appropriate messages based on the configuration of login.

  * Login form is dynamically generated, based on the configuration options and settings.

  * Overhauled the Registration page to support the new Registration buttons.  The registration page now displays appropriate messages based on the configuration of registration.

  * The registration form is dynamically generated, based on the configuration options and settings.

  * Account Linking for Facebook sets both the Login OAuth values and the LiveSync values (single step for both).

  * Account Linking for Google sets both the Login OAuth values and the LiveSync values (single step for both).

  * Facebook Account linkage option has been added to the Account modal.

  * The Account modal now shows which accounts are already linked.

  * The Account modal will color-code the buttons on an error (red) and success (green).

  * New table "teams_oauth" has been added to handle the OAuth data for Facebook and Google account linkage.

  * New class `Integration` handles the linkage of Facebook or Google accounts with an FBCTF account (both Login OAuth values and the LiveSync values). The Integration class also includes the underlying methods for authentication in both the linkage and login routines and the OAuth registration process.

  * New URL endpoints have been created and simplified for the `Integration` actions:

    * New data endpoint `data/integration_login.php`. This endpoint accepts a type argument, currently supporting types of `facebook` and `google`. Through this endpoint, the login process is handled in conjunction with the Integration class.

    * The new callback URL for Facebook Login: `/data/integration_login.php?type=facebook`

    * The new callback URL for Google Login: `/data/integration_login.php?type=google`

    * New data endpoint data/integration_oauth.php. This endpoint accepts a type argument, currently supporting types of `facebook'`and `google`. Through this endpoint, the OAuth account linkage is handled in conjunction with the Integration class.

    * The new callback URL for Facebook linkage: /data/integration_login.php?type=facebook

    * The new callback URL for Google linkage: /data/integration_login.php?type=google

    * Old Google-specific endpoint (data/google_oauth.php) has been removed.

  * New Team class methods: `genAuthTokenExists()`, `genTeamFromOAuthToken()`, `genSetOAuthToken()`.

    * `Team::genAuthTokenExists()` allows an OAuth token to be verified.

    * `Team::genTeamFromOAuthToken()` returns a Team object based on the OAuth token supplied.

    * `Team::genSetOAuthToken()` sets the OAuth token for a team.

  * The `settings.ini` (including the packaged example file) and `Configuration` have methods to verify and return Facebook and Google API settings.

    * `Configuration::getFacebookOAuthSettingsExists()` verifies the Facebook API _App ID_ and _APP Secret_ are set in the `settings.ini` file.

    * `Configuration::getFacebookOAuthSettingsAppId()` returns the Facebook API _App ID_.

    * `Configuration::getFacebookOAuthSettingsAppSecret()` returns the Facebook API _App Secret_.

    * `Configuration::getGoogleOAuthFileExists()` verifies the Google API JSON file is set and exists in the `settings.ini` file.

    * `Configuration::getGoogleOAuthFile()` returns the filename for the Google API JSON file.

    * All of Facebook and Google API configuration values are cached (in Memcached) to prevent the repeated loading, reading, and parsing of the `settings.ini` file.

  * To use the new Facebook or Google integration the following must be completed:

    * A Facebook and/or Google Application must be created, and OAuth2 API keys must be obtained.

    * The API keys must be provided in the Settings.ini file.

    * Desired settings must be configured from within the administrative interface (Configuration) - the default has all integration turned off.

  * The Facebook OAuth code provides CSRF protection through the Graph SDK.

  * The Google OAuth code provides CSRF protection through the usage of the `integration_csrf_token` cookie and API state value.

  * Note: Facebook Login/Integration will not work in development mode - this is due to a pending issue in the Facebook Graph SDK (facebookarchive/php-graph-sdk#853) utilization of the pending PR (facebookarchive/php-graph-sdk#854) resolves this issue. Alternatively, the Integration with Facebook will work in production mode, the recommended mode for a live game.

  * Implemented CR from PR facebookarchive#573.

  * Relevant:  PR facebookarchive#591 and PR facebookarchive#459.

* **LiveSync API and LiveImport Script Update**

  * LiveSync has been updated to support and supply Facebook and Google OAuth output. All of the users LiveSync integrations (FBCTF, Facebook, and Google) are now provided through the API. As a result, so long as one of the three LiveSync methods are configured by the user (which happens automatically when linking an account to Facebook or Google) the data will become available through the LiveSync API.

  * LiveSync now includes a "general" type. The new `general` type output includes the scoring information using the local team name on the FBCTF instance. This new type is not for importation on another FBCTF instance but does provide the opportunity for third-parties to use the data for score tracking, metric collections, and displays. As such, this new LiveSync data allows the scoring data for a single FBCTF instance to be tracked.

  * The `liveimport.sh` script, used to import LiveSync API data, will ignore the new `general` LiveSync type.

  * Updated `Team::genLiveSyncKeyExists()` and `Team::genTeamFromLiveSyncKey()` to use the new Integration class methods:  `Integration::genFacebookThirdPartyExists)` and `Integration::genFacebookThirdPartyEmail()`.

  * Within the `liveimport.sh` script: when the type is `facebook_oauth`, `Team::genLiveSyncKeyExists()` and `Team::genTeamFromLiveSyncKey()` properly use the Facebook `third_party_id`.

  * `Integration::genFacebookThirdPartyExists()` and `Integration::genFacebookThirdPartyEmail()` query the Facebook API for the coorosponding user, storing the results in a temporary HHVM-memory cache, via  the `Cache` class.

  * Given that `liveimport.sh` now needs to query the Facebook API for any `facebook_oauth` typed items, the script will utilize the HHVM-memory cache of `Integration` to limit the number of hits to the Facebook API.

  * The `liveimport.sh` script now includes the `Cache` class and the Facebook Graph SDK.

  * Relevant:  PR facebookarchive#459.

* **Error and Exception Handling**

  * All Exceptions, including Redirect Exceptions, are now caught.

  * The NGINX configuration has been updated to catch errors from HHVM (FastCGI) and return `error.php`.

  * The `error.php` page has been updated with a themed error page.

  * The `error.php` page will redirect to `index.php?page=error` so long as `index.php?page=error` is not generating any HTTP errors.  If an error is detected on `index.php?page=error` then no redirect will occur.  The verification of the HTTP status ensures no redirect loops occur.

  * The `DataController` class now includes a `sendData()` method to catch errors and exceptions.  `DataController` children classes now utilize `sendData()` instead of outputing their results directly.

  * On Exception within an AJAX request, an empty JSON array is returned.  This empty array prevents client-side errors.

  * The `ModuleController` class now includes a `sendRender()` method to catch errors and exceptions.  `ModuleController` children classes now utilize `sendRender()` instead of outputing their results directly.

  * On Exception within a Module request, an empty string is returned.  This empty string prevents client-side and front-end errors.

  * A new AJAX endpoint has been added:  `/data/session.php`.  The response of the endpoint is used to determine if the user's session is still active.  If a user's session is no longer active, they will be redirected from the gameboard to the login page.  This redirection ensures that they do not continually perform AJAX requests.

  * Custom HTTP headers are used to monitor AJAX responses:

    * The Login page now includes a custom HTTP header: `Login-Page`.

    * The Error page now includes a custom HTTP header:  `Error-Page`.

  * The custom HTTP headers are used client-side (JS) to determine if a request or page rendered an error or requires authentication.

  * Exception log outputs now include additional information on which Exception was thrown.

  * Users should no longer directly receive an HTTP 500.

  * These Exception changes prevent the error logs from being filled with unauthenticated requests.  The changes also provide a user-friendly experience when things malfunction or a user needs to reauthenticate.

  *   Relevant:  facebookarchive#563

* **Team Account Modal Update**

  * Users can now change their team name from within the Account modal.

  * The account Modal now contains the following options:

    * Team Name

    * Facebook Account Linkage

    * Google Account Linkage

    * FBCTF LiveSync Authentication

  * Relevant:  PR facebookarchive#459.

* **Non-Visible/Inactive Team Update**

  * Ensure that non-visible or inactive team do not show up for any other users.

  * Non-Visible/Inactive teams are not awarded as the "first capture."

  * Non-Visible/Inactive teams do not show in the "captured by" list.

  * Countries will not show as captured (for other teams) if only captured by a Non-Visible/Inactive team.

  * Activity Log entries for a Non-Visible or Inactive team are not included in the activity log for other users.

  * Updated `ScoreLog::genAllPreviousScore()` and `ScoreLog::genPreviousScore()` to only include Visible and Active teams, or the user's own team.

  * Teams who are Non-Visible or Inactive will have a rank of "N/A."

  * Relevant:  PR facebookarchive#513
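The filtering rule above (visible and active teams, or the viewer's own team) can be sketched as follows. This is a hedged Python illustration, not the actual Hack implementation; `Team` and `visible_score_log` are hypothetical stand-ins for the real classes and queries:

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    visible: bool
    active: bool


def visible_score_log(entries, own_team):
    """Keep entries for visible, active teams, or the viewer's own team."""
    return [
        e for e in entries
        if (e.visible and e.active) or e.name == own_team
    ]


teams = [
    Team("alpha", True, True),
    Team("ghost", False, True),   # non-visible: hidden from other teams
    Team("idle", True, False),    # inactive: hidden from other teams
]

# The "idle" team still sees itself, but "ghost" stays hidden.
print([t.name for t in visible_score_log(teams, own_team="idle")])
# -> ['alpha', 'idle']
```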

* **Mobile Page Update**

  * The mobile page is shown when a user's window has a resolution under `960px`.  While this is geared towards mobile users, it can happen when the window size on a non-mobile device is too small.

  * The mobile page now includes a "Refresh" button, which will reload the page.

  * The mobile page will refresh, attempting to re-render correctly, after `2` seconds.

  * If a user resizes their window to a larger size, they should reload into a properly displayed screen, and not the mobile warning.

* **Login and Registration JS Fixes**

  * Consistently corrected the usage of `teamname` and `team_name` across PHP and JS code.

  * Ensured that all JavaScript is using `team_name`.

  * Ensured that all PHP is using `team_name` when interacting with JS.

  * Updated the input filters within PHP when retrieving input for the team name, using `team_name`.

  * Updated Login errors to highlight the username and password fields.

  * Relevant:  Issue facebookarchive#571, Issue facebookarchive#558, Issue facebookarchive#521, PR facebookarchive#592, and PR facebookarchive#523

* **System Statistics JSON Endpoint**

  * A new administrative-only JSON endpoint has been added that provides statistical data about the platform and game.

  * The endpoint is found at `/data/stata.php`.  Access to the endpoint requires an authenticated administrative session.

  * The endpoint provides the following information:

    * Number of Teams (`teams`)

    * Number of Sessions (`sessions`)

    * Total Number of Levels (`levels`)

    * Number of Active Levels (`active_levels`)

    * Number of Hints (`hints`)

    * Number of Captures (`captures`)

    * `AsyncMysqlConnectionPool` Statistics (`database`)

      * Created Connections (`created_pool_connections`)

      * Destroyed Connections (`destroyed_pool_connections`)

      * Connection Requests (`connections_requested`)

      * Pool Hits (`pool_hits`)

      * Pool Misses (`pool_misses`)

    * `Memcached` Statistics (`memcached`)

      * Node Address

        * Node Address:Port

          * Process ID (`pid`)

          * Uptime (`uptime`)

          * Threads (`threads`)

          * Timestamp (`time`)

          * Size of Pointer (`pointer_size`)

          * Total User Time for Memcached Process, Seconds (`rusage_user_seconds`)

          * Total User Time for Memcached Process, Microseconds (`rusage_user_microseconds`)

          * Total System Time for Memcached Process, Seconds (`rusage_system_seconds`)

          * Total System Time for Memcached Process, Microseconds (`rusage_system_microseconds`)

          * Current Items in Cache (`curr_items`)

          * Total Items in Cache (`total_items`)

          * Max Bytes Limit (`limit_maxbytes`)

          * Number of Current Connections (`curr_connections`)

          * Number of Total Connections (`total_connections`)

          * Number of Current Connection Structures Allocated (`connection_structures`)

          * Number of Bytes Used (`bytes`)

          * Total Number of Cache Get Requests (`cmd_get`)

          * Total Number of Cache Set Requests (`cmd_set`)

          * Total Number of Successful Cache Retrievals (`get_hits`)

          * Total Number of Unsuccessful Cache Retrievals (`get_misses`)

          * Total Number of Cache Evictions (`evictions`)

          * Total Number of Bytes Read (`bytes_read`)

          * Total Number of Bytes Written (`bytes_written`)

          * Memcached Version (`version`)

    * System Load Statistics (`load`)

      * One Minute Average (`0`)

      * Five Minute Average (`1`)

      * Fifteen Minute Average (`2`)

    * System CPU Utilization (`load`)

      * Userspace Utilization Percentage (`user`)

      * Nice Utilization Percentage (`nice`)

      * System Utilization Percentage (`sys`)

      * Idle Percentage (`idle`)

  * The endpoint provides current data and can be polled/ingested for historical data reporting.

  * For more information on the `AsyncMysqlConnectionPool` statistics, please see:  https://docs.hhvm.com/hack/reference/class/AsyncMysqlConnection/ and https://docs.hhvm.com/hack/reference/class/AsyncMysqlConnectionPool/getPoolStats/

  * For more information on the `Memcached` statistics, please see:  https://github.com/memcached/memcached/blob/master/doc/protocol.txt and https://secure.php.net/manual/en/memcached.getstats.php
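A monitoring script could poll this endpoint and derive metrics such as the cache hit rate. The sketch below assumes the key layout listed above (the exact payload shape is an assumption based on those key names) and works on an already-fetched payload, so no network call is shown:

```python
def cache_hit_rate(memcached_node):
    """Fraction of cache get requests served from cache for one node."""
    hits = memcached_node["get_hits"]
    misses = memcached_node["get_misses"]
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical payload fragment mirroring the documented key names.
stats = {
    "memcached": {"127.0.0.1:11211": {"get_hits": 900, "get_misses": 100}},
    "load": [0.42, 0.35, 0.30],  # 1, 5, and 15 minute load averages
}

node = stats["memcached"]["127.0.0.1:11211"]
print(round(cache_hit_rate(node), 2))  # -> 0.9
print(stats["load"][0])                # one-minute load average -> 0.42
```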

* **Miscellaneous Changes**

  * Added `Announcement` and `ActivityLog` to `autorun.php`.

  * Added `Announcement` and `ActivityLog` to `bases.php`.

  * Added/Updated UTF-8 encoding on various user-controlled values, such as team name.

  * Changed the "Sign Up" link to a button on the login page.

  * Allow any Logo to be re-used once all logos are in use.

  * Invalidate Scores and Hint cache when a Team is deleted.

  * Reverify the game status (running or stopped) before the next cycle of base scoring in `bases.php`.

  * Allow the game to stop even if some scripts (`autorun`, `bases`, `progressive`, etc.) are not running.

  * Fixed a bug where teams with a custom logo could not be edited by an administrator.

  * Added "Reset Schedule" button to administrative interface to completely remove a previously set game schedule.  The game schedule can only be reset if the game is not running.  Otherwise, the existing schedule must be modified.

  * Moved "Begin Game," "Pause Game," and "End Game" outside of the scrollable admin list into a new fixed pane below the navigation list.

  * Formatted all code files as part of this PR.

  * Updated ActivityLog to delete entries when deleting a team.

  * Updated PHPUnit tests based on the new changes.