From bc68802d8a93f9cbe3520b73b621e5dd53bac849 Mon Sep 17 00:00:00 2001
From: Syphax bouazzouni
Date: Tue, 16 Jan 2024 19:08:38 +0100
Subject: [PATCH] Sync: bring OntoPortal up-to-date with BioPortal releases 5.26.0 and onward (#2)

* add a script to eradicate (delete data+ files) submissions of an ontology
* Auto stash before merge of "development" and "master"
* omit logs link file
* update the eradicator to support the eradication of non-archived submissions if wanted
* fix the delete submission files to not leave behind empty directories
* do not remove the submission directory because it's already done by submission.delete
* Update Gemfile.lock
* Reset branch specifier to develop
* extract do_ontology_pull function
* some simple code refactor in the ontology_pull
* simple code refactor of test_ontology_pull
* add a script to do an ontology pull on an ontology on demand
* set the name of the new script in $0
* extract new_file_exists? method from do_ontology_pull
* save the submission in the RemoteFileException
* some automatic code refactor/lint
* use the new do_ontology_pull in the old do_remote_ontology_pull
* fixed an API call mentioned by @syphax-bouazzouni in ncbo/bioportal-project#254
* fixed an API call mentioned by @syphax-bouazzouni in ncbo/bioportal-project#254
* Gemfile.lock update
* bump up version of actions/checkout from v2->v3
* Gemfile.lock update
* Merge branch 'develop'
* remove forgotten variables
* GH Actions unit test workflow refactor - add ruby versioning via docker-compose.yml file - bump up ruby v2.6 -> v2.7 - add AllegroGraph backend - add code coverage
* Remove extra space
* fix for #61 - create contact instance if it doesn't exist - changed --from-api to --from-apikey - minor linting
* Restore branch specifier to develop
* Optimization - remove repeated query
* Gemfile.lock update
* Gemfile.lock update
* Gemfile.lock update
* Gemfile had references to develop branch
* implemented #64 - ability to generate labels independently of RDF processing (and vice versa)
* Gemfile.lock update
* fixed a bug in #64
* Relocate docker-compose file and update default configs
* Add GH workflow for publishing docker images
* use ruby native method for listing files instead of a git function. Resolves warning messages when we exclude .git directory from docker image
* remove comment
* capitalize argument in order to be consistent with other scripts
* add arm/64 platform
* additional error handling for SPAM deletion script, #60
* additional error handling for SPAM deletion script, #60
* implemented #67 - improved corrupt data and error handling
* Gemfile.lock update
* exclude test/data/dictionary.txt from git commits
* update version of solr-ut
* Gemfile.lock update
* Restore branch specifier to master
* fixed configuration for the analytics module
* Gemfile.lock update
* implemented #69 - scheduled annotator dictionary file generation should be a configurable option instead of the default
* Gemfile.lock update
* gem update
* create new rake tasks for updating purls for all ontologies, moved from ontologies_api/fix_purls.rb
* initial implementation of #70 - Google Analytics v4 Update Compatibility Issue
* added the /data folder to ignore
* update gems
* Gemfile.lock update
* Gemfile.lock update
* Gemfile.lock update
* use patched version of agraph v7.3.1
* unpin faraday gem
* A change to reference Analytics Redis from LinkedData block
* Gemfile.lock update
* Gemfile.lock update
* Gemfile.lock update
* Gemfile.lock update
* use assert_operator instead of assert minitest style guide
adherence. encountered an intermittent unit test failure so assert_operator will provide better failure feedback than assert * fixed ncbo_ontology_archive_old_submissions error output * Gemfile.lock update * Gemfile.lock update * Gemfile update * Gemfile update * fixes to the analytics script and a new script to generate UA analytics for documentation * Gemfile.lock update * Gemfile.lock update * implemented the first pass at bmir-radx/radx-project#37 * implemented the first pass at bmir-radx/radx-project#37 * set bundler version to be comptatible with ruby 2.7 + AG v8 * Gemfile.lock update * Gemfile.lock update --------- Co-authored-by: Jennifer Vendetti Co-authored-by: mdorf Co-authored-by: Alex Skrenchuk --- .dockerignore | 9 +- .github/workflows/docker-image.yml | 42 +++ .github/workflows/ruby-unit-tests.yml | 18 +- .gitignore | 5 + Dockerfile | 28 +- Gemfile | 14 +- Gemfile.lock | 177 +++++++----- bin/generate_ua_analytics_file.rb | 126 ++++++++ bin/ncbo_cron | 71 +---- bin/ncbo_ontology_annotate_generate_cache | 2 +- bin/ncbo_ontology_archive_old_submissions | 112 +++++++- bin/ncbo_ontology_import | 54 ++-- bin/ncbo_ontology_process | 11 +- bin/ncbo_ontology_pull | 42 +++ bin/ncbo_ontology_submissions_eradicate | 107 +++++++ config/config.rb.sample | 91 ++++-- config/config.test.rb | 76 +++-- dip.yml | 54 ++++ docker-compose.yml | 139 +++++++++ lib/ncbo_cron.rb | 1 + lib/ncbo_cron/config.rb | 20 +- lib/ncbo_cron/ontologies_report.rb | 2 +- lib/ncbo_cron/ontology_analytics.rb | 269 ++++++++++++------ lib/ncbo_cron/ontology_helper.rb | 185 ++++++++++++ lib/ncbo_cron/ontology_pull.rb | 139 +-------- lib/ncbo_cron/ontology_rank.rb | 7 +- .../ontology_submission_eradicator.rb | 39 +++ lib/ncbo_cron/ontology_submission_parser.rb | 61 ++-- lib/ncbo_cron/spam_deletion.rb | 12 +- ncbo_cron.gemspec | 4 +- rakelib/purl_management.rake | 28 ++ test/docker-compose.yml | 38 --- test/run-unit-tests.sh | 10 +- test/test_case.rb | 26 +- test/test_ontology_pull.rb | 39 ++- test/test_scheduler.rb | 2 +- 36 files changed, 1500 insertions(+), 560 deletions(-) create mode 100644 .github/workflows/docker-image.yml create mode 100755 bin/generate_ua_analytics_file.rb create mode 100755 bin/ncbo_ontology_pull create mode 100755 bin/ncbo_ontology_submissions_eradicate create mode 100644 dip.yml create mode 100644 docker-compose.yml create mode 100644 lib/ncbo_cron/ontology_helper.rb create mode 100644 lib/ncbo_cron/ontology_submission_eradicator.rb create mode 100644 rakelib/purl_management.rake delete mode 100644 test/docker-compose.yml diff --git a/.dockerignore b/.dockerignore index c712142f..96c8053c 100644 --- a/.dockerignore +++ b/.dockerignore @@ -1,5 +1,6 @@ # Git -#.git +.git +.github .gitignore # Logs log/* @@ -8,3 +9,9 @@ tmp/* # Editor temp files *.swp *.swo +coverage +create_permissions.log +# Ignore generated test data +test/data/dictionary.txt +test/data/ontology_files/repo/**/* +test/data/tmp/* diff --git a/.github/workflows/docker-image.yml b/.github/workflows/docker-image.yml new file mode 100644 index 00000000..6105c1d8 --- /dev/null +++ b/.github/workflows/docker-image.yml @@ -0,0 +1,42 @@ +name: Docker Image CI + +on: + release: + types: [published] + +jobs: + push_to_registry: + name: Push Docker image to Docker Hub + runs-on: ubuntu-latest + steps: + - name: Check out the repo + uses: actions/checkout@v3 + + - name: Set up QEMU + uses: docker/setup-qemu-action@v2 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + + - name: Log in to Docker Hub + uses: 
docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Extract metadata (tags, labels) for Docker + id: meta + uses: docker/metadata-action@v4 + with: + images: bioportal/ncbo_cron + + - name: Build and push Docker image + uses: docker/build-push-action@v4 + with: + context: . + platforms: linux/amd64,linux/arm64 + build-args: | + RUBY_VERSION=2.7 + push: true + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} diff --git a/.github/workflows/ruby-unit-tests.yml b/.github/workflows/ruby-unit-tests.yml index 192774d1..b61ce745 100644 --- a/.github/workflows/ruby-unit-tests.yml +++ b/.github/workflows/ruby-unit-tests.yml @@ -6,15 +6,25 @@ on: jobs: test: + strategy: + fail-fast: false + matrix: + backend: ['ncbo_cron', 'ncbo_cron-agraph'] # ruby runs tests with 4store backend and ruby-agraph runs with AllegroGraph backend runs-on: ubuntu-latest steps: - - uses: actions/checkout@v2 + - uses: actions/checkout@v3 - name: copy config.rb file from template run: cp config/config.test.rb config/config.rb - name: Build docker-compose - working-directory: ./test run: docker-compose build - name: Run unit tests - working-directory: ./test - run: docker-compose run unit-test wait-for-it solr-ut:8983 -- rake test TESTOPTS='-v' + run: | + ci_env=`bash <(curl -s https://codecov.io/env)` + docker-compose run $ci_env -e CI --rm ${{ matrix.backend }} bundle exec rake test TESTOPTS='-v' + - name: Upload coverage reports to Codecov + uses: codecov/codecov-action@v3 + with: + flags: unittests + verbose: true + fail_ci_if_error: false # optional (default = false) diff --git a/.gitignore b/.gitignore index a7b2058f..ccf97ea0 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,8 @@ config/config.rb config/config_*.rb config/*.p12 +config/*.json +data/ projectFilesBackup/ .ruby-version repo* @@ -11,6 +13,9 @@ repo* .DS_Store tmp +# Code coverage reports +coverage* + # Ignore eclipse .project .project .pmd diff --git a/Dockerfile b/Dockerfile index 1c463704..73e1379c 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,13 +1,29 @@ -FROM ruby:2.6 +ARG RUBY_VERSION +ARG DISTRO_NAME=bullseye -RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends openjdk-11-jre-headless raptor2-utils wait-for-it +FROM ruby:$RUBY_VERSION-$DISTRO_NAME + +RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \ + openjdk-11-jre-headless \ + raptor2-utils \ + && rm -rf /var/lib/apt/lists/* -# The Gemfile Caching Trick -# we install gems before copying the code in its own layer so that gems would not have to get -# installed every single time code is updated RUN mkdir -p /srv/ontoportal/ncbo_cron +RUN mkdir -p /srv/ontoportal/bundle COPY Gemfile* *.gemspec /srv/ontoportal/ncbo_cron/ + WORKDIR /srv/ontoportal/ncbo_cron -RUN gem install bundler -v "$(grep -A 1 "BUNDLED WITH" Gemfile.lock | tail -n 1)" + +# set rubygem and bundler to the last version supported by ruby 2.7 +# remove version after ruby v3 upgrade +RUN gem update --system '3.4.22' +RUN gem install bundler -v '2.4.22' +RUN gem update --system +RUN gem install bundler +ENV BUNDLE_PATH=/srv/ontoportal/bundle RUN bundle install + COPY . 
/srv/ontoportal/ncbo_cron +RUN cp /srv/ontoportal/ncbo_cron/config/config.rb.sample /srv/ontoportal/ncbo_cron/config/config.rb + +CMD ["/bin/bash"] diff --git a/Gemfile b/Gemfile index 8d9bd46c..bcf5f137 100644 --- a/Gemfile +++ b/Gemfile @@ -2,13 +2,17 @@ source 'https://rubygems.org' gemspec -gem 'faraday', '~> 1.9' gem 'ffi' + +# This is needed temporarily to pull the Google Universal Analytics (UA) +# data and store it in a file. See (bin/generate_ua_analytics_file.rb) +# The ability to pull this data from Google will cease on July 1, 2024 gem "google-apis-analytics_v3" + +gem 'google-analytics-data' gem 'mail', '2.6.6' -gem 'minitest', '< 5.0' gem 'multi_json' -gem 'oj', '~> 2.0' +gem 'oj', '~> 3.0' gem 'parseconfig' gem 'pony' gem 'pry' @@ -28,6 +32,8 @@ gem 'sparql-client', github: 'ncbo/sparql-client', branch: 'master' group :test do gem 'email_spec' + gem 'minitest', '< 5.0' + gem 'simplecov' + gem 'simplecov-cobertura' # for codecov.io gem 'test-unit-minitest' end - diff --git a/Gemfile.lock b/Gemfile.lock index eb1dffec..99d242db 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -1,6 +1,6 @@ GIT remote: https://github.com/ncbo/goo.git - revision: fd7d45cb862c5c2c1833b64a5c8c14154384edc2 + revision: 75436fe8e387febc53e34ee31ff0e6dd837a9d3f branch: master specs: goo (0.0.2) @@ -15,7 +15,7 @@ GIT GIT remote: https://github.com/ncbo/ncbo_annotator.git - revision: ed325ae9f79e3b0a0061b1af0b02f624de1d0eef + revision: 1170a94d266d3e469bfb034a3aa3c4852bd0de82 branch: master specs: ncbo_annotator (0.0.1) @@ -26,7 +26,7 @@ GIT GIT remote: https://github.com/ncbo/ontologies_linked_data.git - revision: 8196bf34b45c75f8104bb76dfcba1db0f2c048e4 + revision: ee0013f0ee23876076bff9d9258b46371ec3b248 branch: master specs: ontologies_linked_data (0.0.1) @@ -46,7 +46,7 @@ GIT GIT remote: https://github.com/ncbo/sparql-client.git - revision: fb4a89b420f8eb6dda5190a126b6c62e32c4c0c9 + revision: d418d56a6c9ff5692f925b45739a2a1c66bca851 branch: master specs: sparql-client (1.0.1) @@ -60,7 +60,7 @@ PATH ncbo_cron (0.0.1) dante goo - google-apis-analytics_v3 + google-analytics-data mlanett-redis-lock multi_json ncbo_annotator @@ -74,48 +74,49 @@ GEM activesupport (3.2.22.5) i18n (~> 0.6, >= 0.6.4) multi_json (~> 1.0) - addressable (2.8.0) - public_suffix (>= 2.0.2, < 5.0) - bcrypt (3.1.18) + addressable (2.8.6) + public_suffix (>= 2.0.2, < 6.0) + base64 (0.2.0) + bcrypt (3.1.20) + bigdecimal (3.1.5) builder (3.2.4) coderay (1.1.3) - concurrent-ruby (1.1.10) + concurrent-ruby (1.2.2) + connection_pool (2.4.1) cube-ruby (0.0.3) dante (0.2.0) declarative (0.0.20) - domain_name (0.5.20190701) - unf (>= 0.0.5, < 1.0.0) + docile (1.4.0) + domain_name (0.6.20240107) email_spec (2.1.1) htmlentities (~> 4.3.3) launchy (~> 2.1) mail (~> 2.6) - faraday (1.10.0) - faraday-em_http (~> 1.0) - faraday-em_synchrony (~> 1.0) - faraday-excon (~> 1.1) - faraday-httpclient (~> 1.0) - faraday-multipart (~> 1.0) - faraday-net_http (~> 1.0) - faraday-net_http_persistent (~> 1.0) - faraday-patron (~> 1.0) - faraday-rack (~> 1.0) - faraday-retry (~> 1.0) + faraday (2.8.1) + base64 + faraday-net_http (>= 2.0, < 3.1) ruby2_keywords (>= 0.0.4) - faraday-em_http (1.0.0) - faraday-em_synchrony (1.0.0) - faraday-excon (1.1.0) - faraday-httpclient (1.0.1) - faraday-multipart (1.0.4) - multipart-post (~> 2) - faraday-net_http (1.0.1) - faraday-net_http_persistent (1.2.0) - faraday-patron (1.0.0) - faraday-rack (1.0.0) - faraday-retry (1.0.3) - ffi (1.15.5) - google-apis-analytics_v3 (0.10.0) - google-apis-core (>= 0.7, < 2.a) - 
google-apis-core (0.7.0) + faraday-net_http (3.0.2) + faraday-retry (2.2.0) + faraday (~> 2.0) + ffi (1.16.3) + gapic-common (0.21.1) + faraday (>= 1.9, < 3.a) + faraday-retry (>= 1.0, < 3.a) + google-protobuf (~> 3.18) + googleapis-common-protos (>= 1.4.0, < 2.a) + googleapis-common-protos-types (>= 1.11.0, < 2.a) + googleauth (~> 1.9) + grpc (~> 1.59) + google-analytics-data (0.4.0) + google-analytics-data-v1beta (>= 0.7, < 2.a) + google-cloud-core (~> 1.6) + google-analytics-data-v1beta (0.11.1) + gapic-common (>= 0.21.1, < 2.a) + google-cloud-errors (~> 1.0) + google-apis-analytics_v3 (0.13.0) + google-apis-core (>= 0.11.0, < 2.a) + google-apis-core (0.11.2) addressable (~> 2.5, >= 2.5.1) googleauth (>= 0.16.2, < 2.a) httpclient (>= 2.8.1, < 3.a) @@ -124,13 +125,37 @@ GEM retriable (>= 2.0, < 4.a) rexml webrick - googleauth (1.2.0) - faraday (>= 0.17.3, < 3.a) + google-cloud-core (1.6.1) + google-cloud-env (>= 1.0, < 3.a) + google-cloud-errors (~> 1.0) + google-cloud-env (2.1.0) + faraday (>= 1.0, < 3.a) + google-cloud-errors (1.3.1) + google-protobuf (3.25.2) + google-protobuf (3.25.2-x86_64-darwin) + google-protobuf (3.25.2-x86_64-linux) + googleapis-common-protos (1.4.0) + google-protobuf (~> 3.14) + googleapis-common-protos-types (~> 1.2) + grpc (~> 1.27) + googleapis-common-protos-types (1.11.0) + google-protobuf (~> 3.18) + googleauth (1.9.1) + faraday (>= 1.0, < 3.a) + google-cloud-env (~> 2.1) jwt (>= 1.4, < 3.0) - memoist (~> 0.16) multi_json (~> 1.11) os (>= 0.9, < 2.0) signet (>= 0.16, < 2.a) + grpc (1.60.0) + google-protobuf (~> 3.25) + googleapis-common-protos-types (~> 1.0) + grpc (1.60.0-x86_64-darwin) + google-protobuf (~> 3.25) + googleapis-common-protos-types (~> 1.0) + grpc (1.60.0-x86_64-linux) + google-protobuf (~> 3.25) + googleapis-common-protos-types (~> 1.0) htmlentities (4.3.4) http-accept (1.7.0) http-cookie (1.0.5) @@ -138,48 +163,50 @@ GEM httpclient (2.8.3) i18n (0.9.5) concurrent-ruby (~> 1.0) - json (2.6.2) - json_pure (2.6.2) - jwt (2.4.1) - launchy (2.5.0) - addressable (~> 2.7) - libxml-ruby (3.2.3) - logger (1.5.1) + json (2.7.1) + json_pure (2.7.1) + jwt (2.7.1) + launchy (2.5.2) + addressable (~> 2.8) + libxml-ruby (5.0.2) + logger (1.6.0) macaddr (1.7.2) systemu (~> 2.6.5) mail (2.6.6) mime-types (>= 1.16, < 4) - memoist (0.16.2) method_source (1.0.0) - mime-types (3.4.1) + mime-types (3.5.2) mime-types-data (~> 3.2015) - mime-types-data (3.2022.0105) - mini_mime (1.1.2) + mime-types-data (3.2023.1205) + mini_mime (1.1.5) minitest (4.7.5) mlanett-redis-lock (0.2.7) redis multi_json (1.15.0) - multipart-post (2.2.3) net-http-persistent (2.9.4) netrc (0.11.0) - oj (2.18.5) + oj (3.16.3) + bigdecimal (>= 3.0) omni_logger (0.1.4) logger os (1.1.4) parseconfig (1.1.2) pony (1.13.1) mail (>= 2.0) - pry (0.14.1) + pry (0.14.2) coderay (~> 1.1) method_source (~> 1.0) - public_suffix (4.0.7) - rack (2.2.4) - rack-test (2.0.2) + public_suffix (5.0.4) + rack (3.0.8) + rack-test (2.1.0) rack (>= 1.3) - rake (13.0.6) + rake (13.1.0) rdf (1.0.8) addressable (>= 2.2) - redis (4.7.1) + redis (5.0.8) + redis-client (>= 0.17.0) + redis-client (0.19.1) + connection_pool representable (3.2.0) declarative (< 0.1.0) trailblazer-option (>= 0.1.1, < 0.2.0) @@ -190,7 +217,7 @@ GEM mime-types (>= 1.16, < 4.0) netrc (~> 0.8) retriable (3.1.2) - rexml (3.2.5) + rexml (3.2.6) rsolr (2.5.0) builder (>= 2.1.2) faraday (>= 0.9, < 3, != 2.0.0) @@ -199,26 +226,32 @@ GEM rubyzip (2.3.2) rufus-scheduler (2.0.24) tzinfo (>= 0.3.22) - signet (0.17.0) + signet (0.18.0) addressable (~> 
2.8) faraday (>= 0.17.5, < 3.a) jwt (>= 1.5, < 3.0) multi_json (~> 1.10) - sys-proctable (1.2.6) - ffi + simplecov (0.22.0) + docile (~> 1.1) + simplecov-html (~> 0.11) + simplecov_json_formatter (~> 0.1) + simplecov-cobertura (2.1.0) + rexml + simplecov (~> 0.19) + simplecov-html (0.12.3) + simplecov_json_formatter (0.1.4) + sys-proctable (1.3.0) + ffi (~> 1.1) systemu (2.6.5) test-unit-minitest (0.9.1) minitest (~> 4.7) trailblazer-option (0.1.2) - tzinfo (2.0.4) + tzinfo (2.0.6) concurrent-ruby (~> 1.0) uber (0.1.0) - unf (0.1.4) - unf_ext - unf_ext (0.0.8.2) uuid (2.3.9) macaddr (~> 1.0) - webrick (1.7.0) + webrick (1.8.1) PLATFORMS ruby @@ -228,16 +261,16 @@ PLATFORMS DEPENDENCIES cube-ruby email_spec - faraday (~> 1.9) ffi goo! + google-analytics-data google-apis-analytics_v3 mail (= 2.6.6) minitest (< 5.0) multi_json ncbo_annotator! ncbo_cron! - oj (~> 2.0) + oj (~> 3.0) ontologies_linked_data! parseconfig pony @@ -245,9 +278,11 @@ DEPENDENCIES rake redis rest-client + simplecov + simplecov-cobertura sparql-client! sys-proctable test-unit-minitest BUNDLED WITH - 2.3.14 + 2.4.22 diff --git a/bin/generate_ua_analytics_file.rb b/bin/generate_ua_analytics_file.rb new file mode 100755 index 00000000..0a432a92 --- /dev/null +++ b/bin/generate_ua_analytics_file.rb @@ -0,0 +1,126 @@ +require 'logger' +require 'google/apis/analytics_v3' +require 'google/api_client/auth/key_utils' + +module NcboCron + module Models + + class OntologyAnalyticsUA + + def initialize(logger) + @logger = logger + end + + def run + redis = Redis.new(:host => NcboCron.settings.redis_host, :port => NcboCron.settings.redis_port) + ontology_analytics = fetch_ontology_analytics + File.open(NcboCron.settings.analytics_path_to_ua_data_file, 'w') do |f| + f.write(ontology_analytics.to_json) + end + end + + def fetch_ontology_analytics + google_client = authenticate_google + aggregated_results = Hash.new + start_year = Date.parse(NcboCron.settings.analytics_start_date).year || 2013 + ont_acronyms = LinkedData::Models::Ontology.where.include(:acronym).all.map {|o| o.acronym} + # ont_acronyms = ["NCIT", "ONTOMA", "CMPO", "AEO", "SNOMEDCT"] + filter_str = (NcboCron.settings.analytics_filter_str.nil? || NcboCron.settings.analytics_filter_str.empty?) ? 
"" : ";#{NcboCron.settings.analytics_filter_str}" + + ont_acronyms.each do |acronym| + max_results = 10000 + num_results = 10000 + start_index = 1 + results = nil + + loop do + results = google_client.get_ga_data( + ids = NcboCron.settings.analytics_profile_id, + start_date = NcboCron.settings.analytics_start_date, + end_date = Date.today.to_s, + metrics = 'ga:pageviews', + { + dimensions: 'ga:pagePath,ga:year,ga:month', + filters: "ga:pagePath=~^(\\/ontologies\\/#{acronym})(\\/?\\?{0}|\\/?\\?{1}.*)$#{filter_str}", + start_index: start_index, + max_results: max_results + } + ) + results.rows ||= [] + start_index += max_results + num_results = results.rows.length + @logger.info "Acronym: #{acronym}, Results: #{num_results}, Start Index: #{start_index}" + @logger.flush + + results.rows.each do |row| + if aggregated_results.has_key?(acronym) + # year + if aggregated_results[acronym].has_key?(row[1].to_i) + # month + if aggregated_results[acronym][row[1].to_i].has_key?(row[2].to_i) + aggregated_results[acronym][row[1].to_i][row[2].to_i] += row[3].to_i + else + aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i + end + else + aggregated_results[acronym][row[1].to_i] = Hash.new + aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i + end + else + aggregated_results[acronym] = Hash.new + aggregated_results[acronym][row[1].to_i] = Hash.new + aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i + end + end + + if num_results < max_results + # fill up non existent years + (start_year..Date.today.year).each do |y| + aggregated_results[acronym] = Hash.new if aggregated_results[acronym].nil? + aggregated_results[acronym][y] = Hash.new unless aggregated_results[acronym].has_key?(y) + end + # fill up non existent months with zeros + (1..12).each { |n| aggregated_results[acronym].values.each { |v| v[n] = 0 unless v.has_key?(n) } } + break + end + end + end + + @logger.info "Completed Universal Analytics pull..." + @logger.flush + + aggregated_results + end + + def authenticate_google + Google::Apis::ClientOptions.default.application_name = NcboCron.settings.analytics_app_name + Google::Apis::ClientOptions.default.application_version = NcboCron.settings.analytics_app_version + # enable google api call retries in order to + # minigate analytics processing failure due to occasional google api timeouts and other outages + Google::Apis::RequestOptions.default.retries = 5 + # uncoment to enable logging for debugging purposes + # Google::Apis.logger.level = Logger::DEBUG + # Google::Apis.logger = @logger + client = Google::Apis::AnalyticsV3::AnalyticsService.new + key = Google::APIClient::KeyUtils::load_from_pkcs12(NcboCron.settings.analytics_path_to_ua_key_file, 'notasecret') + client.authorization = Signet::OAuth2::Client.new( + :token_credential_uri => 'https://accounts.google.com/o/oauth2/token', + :audience => 'https://accounts.google.com/o/oauth2/token', + :scope => 'https://www.googleapis.com/auth/analytics.readonly', + :issuer => NcboCron.settings.analytics_service_account_email_address, + :signing_key => key + ).tap { |auth| auth.fetch_access_token! 
} + client + end + end + end +end + +require 'ontologies_linked_data' +require 'goo' +require 'ncbo_annotator' +require 'ncbo_cron/config' +require_relative '../config/config' +ontology_analytics_log_path = File.join("logs", "ontology-analytics-ua.log") +ontology_analytics_logger = Logger.new(ontology_analytics_log_path) +NcboCron::Models::OntologyAnalyticsUA.new(ontology_analytics_logger).run diff --git a/bin/ncbo_cron b/bin/ncbo_cron index 8d212382..3b7aa063 100755 --- a/bin/ncbo_cron +++ b/bin/ncbo_cron @@ -111,19 +111,9 @@ opt_parser = OptionParser.new do |opts| opts.on("--disable-update-check", "disable check for updated version of Ontoportal (for VMs)", "(default: #{options[:enable_update_check]})") do |v| options[:enable_update_check] = false end - - - - - opts.on("--disable-dictionary-generation", "disable mgrep dictionary generation job", "(default: #{options[:enable_dictionary_generation]})") do |v| - options[:enable_dictionary_generation] = false + opts.on("--enable-dictionary-generation-cron-job", "ENABLE mgrep dictionary generation JOB and DISABLE dictionary generation during ontology processing. If this is not passed in, dictionary is generated every time an ontology is processed.", "(default: Dictionary is generated on every ontology processing, CRON job is DISABLED)") do |v| + options[:enable_dictionary_generation_cron_job] = true end - - - - - - opts.on("--disable-obofoundry_sync", "disable OBO Foundry synchronization report", "(default: #{options[:enable_obofoundry_sync]})") do |v| options[:enable_obofoundry_sync] = false end @@ -160,18 +150,10 @@ opt_parser = OptionParser.new do |opts| opts.on("--obofoundry_sync SCHED", String, "cron schedule to run OBO Foundry synchronization report", "(default: #{options[:cron_obofoundry_sync]})") do |c| options[:cron_obofoundry_sync] = c end - - - - - opts.on("--dictionary-generation SCHED", String, "cron schedule to run mgrep dictionary generation job", "(default: #{options[:cron_dictionary_generation]})") do |c| - options[:cron_dictionary_generation] = c + opts.on("--dictionary-generation-cron-job SCHED", String, "cron schedule to run mgrep dictionary generation job (if enabled)", "(default: #{options[:cron_dictionary_generation_cron_job]})") do |c| + options[:cron_dictionary_generation_cron_job] = c end - - - - # Display the help screen, all programs are assumed to have this option. 
opts.on_tail('--help', 'Display this screen') do puts opts @@ -484,49 +466,27 @@ runner.execute do |opts| end end - - - - - - - - # temporary job to generate mgrep dictionary file + # optional job to generate mgrep dictionary file # separate from ontology processing due to # https://github.com/ncbo/ncbo_cron/issues/45 - - if options[:enable_dictionary_generation] + if options[:enable_dictionary_generation_cron_job] dictionary_generation_thread = Thread.new do dictionary_generation_options = options.dup - dictionary_generation_options[:job_name] = "ncbo_cron_dictionary_generation" + dictionary_generation_options[:job_name] = "ncbo_cron_dictionary_generation_cron_job" dictionary_generation_options[:scheduler_type] = :cron - dictionary_generation_options[:cron_schedule] = dictionary_generation_options[:cron_dictionary_generation] - logger.info "Setting up mgrep dictionary generation job with #{dictionary_generation_options[:cron_dictionary_generation]}"; logger.flush + dictionary_generation_options[:cron_schedule] = dictionary_generation_options[:cron_dictionary_generation_cron_job] + logger.info "Setting up mgrep dictionary generation job with #{dictionary_generation_options[:cron_dictionary_generation_cron_job]}"; logger.flush NcboCron::Scheduler.scheduled_locking_job(dictionary_generation_options) do - logger.info "Starting mgrep dictionary generation..."; logger.flush + logger.info "Starting mgrep dictionary generation CRON job..."; logger.flush t0 = Time.now annotator = Annotator::Models::NcboAnnotator.new annotator.generate_dictionary_file() - logger.info "mgrep dictionary generation job completed in #{Time.now - t0} sec."; logger.flush - logger.info "Finished mgrep dictionary generation"; logger.flush + logger.info "mgrep dictionary generation CRON job completed in #{Time.now - t0} sec."; logger.flush + logger.info "Finished mgrep dictionary generation CRON job"; logger.flush end end end - - - - - - - - - - - - - - # Print running child processes require 'sys/proctable' at_exit do @@ -549,12 +509,5 @@ runner.execute do |opts| mapping_counts_thread.join if mapping_counts_thread update_check_thread.join if update_check_thread obofoundry_sync_thread.join if obofoundry_sync_thread - - - - dictionary_generation_thread.join if dictionary_generation_thread - - - end diff --git a/bin/ncbo_ontology_annotate_generate_cache b/bin/ncbo_ontology_annotate_generate_cache index 07286e7c..18399bea 100755 --- a/bin/ncbo_ontology_annotate_generate_cache +++ b/bin/ncbo_ontology_annotate_generate_cache @@ -49,7 +49,7 @@ opt_parser = OptionParser.new do |opts| options[:generate_dictionary] = true end - options[:logfile] = "logs/annotator_cache.log" + options[:logfile] = STDOUT opts.on('-l', '--logfile FILE', "Write log to FILE (default is 'logs/annotator_cache.log').") do |filename| options[:logfile] = filename end diff --git a/bin/ncbo_ontology_archive_old_submissions b/bin/ncbo_ontology_archive_old_submissions index 3dc5c87c..1b2268a5 100755 --- a/bin/ncbo_ontology_archive_old_submissions +++ b/bin/ncbo_ontology_archive_old_submissions @@ -11,31 +11,125 @@ require_relative '../lib/ncbo_cron' config_exists = File.exist?(File.expand_path('../../config/config.rb', __FILE__)) abort("Please create a config/config.rb file using the config/config.rb.sample as a template") unless config_exists require_relative '../config/config' +require 'optparse' -logfile = 'archive_old_submissions.log' +options = { delete: false } +opt_parser = OptionParser.new do |opts| + # Set a banner, displayed at the top of the help 
screen. + opts.banner = "Usage: #{File.basename(__FILE__)} [options]" + + options[:logfile] = STDOUT + opts.on( '-l', '--logfile FILE', "Write log to FILE (default is STDOUT)" ) do |filename| + options[:logfile] = filename + end + + # Delete submission if it contains bad data + opts.on( '-d', '--delete', "Delete submissions that contain bad data" ) do + options[:delete] = true + end + + # Display the help screen, all programs are assumed to have this option. + opts.on( '-h', '--help', 'Display this screen' ) do + puts opts + exit + end +end + +opt_parser.parse! +logfile = options[:logfile] if File.file?(logfile); File.delete(logfile); end logger = Logger.new(logfile) -options = { process_rdf: false, index_search: false, index_commit: false, - run_metrics: false, reasoning: false, archive: true } +process_actions = { process_rdf: false, generate_labels: false, index_search: false, index_commit: false, + process_annotator: false, diff: false, run_metrics: false, archive: true } onts = LinkedData::Models::Ontology.all onts.each { |ont| ont.bring(:acronym, :submissions) } -onts.sort! { |a,b| a.acronym <=> b.acronym } +onts.sort! { |a, b| a.acronym <=> b.acronym } +bad_submissions = {} onts.each do |ont| latest_sub = ont.latest_submission - if not latest_sub.nil? + + unless latest_sub.nil? id = latest_sub.submissionId subs = ont.submissions - old_subs = subs.reject { |sub| sub.submissionId >= id } - old_subs.sort! { |a,b| a.submissionId <=> b.submissionId } + + old_subs = subs.reject { |sub| + begin + sub.submissionId >= id + rescue => e + msg = "Invalid submission ID detected (String instead of Integer): #{ont.acronym}/#{sub.submissionId} - #{e.class}:\n#{e.backtrace.join("\n")}" + puts msg + logger.error(msg) + + if options[:delete] + sub.delete if options[:delete] + msg = "Deleted submission #{ont.acronym}/#{sub.submissionId} due to invalid Submission ID" + puts msg + logger.error(msg) + end + bad_submissions["#{ont.acronym}/#{sub.submissionId}"] = "Invalid Submission ID" + true + end + } + old_subs.sort! { |a, b| a.submissionId <=> b.submissionId } old_subs.each do |sub| - if not sub.archived? + unless sub.archived? msg = "#{ont.acronym}: found un-archived old submission with ID #{sub.submissionId}." puts msg logger.info msg - NcboCron::Models::OntologySubmissionParser.new.process_submission(logger, sub.id.to_s, options) + + begin + NcboCron::Models::OntologySubmissionParser.new.process_submission(logger, sub.id.to_s, process_actions) + rescue => e + if e.class == Goo::Base::NotValidException + if sub.valid? 
+ msg = "Error archiving submission #{ont.acronym}/#{sub.submissionId} - #{e.class}:\n#{e.backtrace.join("\n")}" + puts msg + logger.error(msg) + bad_submissions["#{ont.acronym}/#{sub.submissionId}"] = "Submission passes valid check but cannot be saved" + else + msg = "Error archiving submission #{ont.acronym}/#{sub.submissionId}:\n#{JSON.pretty_generate(sub.errors)}" + puts msg + logger.error(msg) + + if options[:delete] + sub.delete if options[:delete] + msg = "Deleted submission #{ont.acronym}/#{sub.submissionId} due to invalid data" + puts msg + logger.error(msg) + end + bad_submissions["#{ont.acronym}/#{sub.submissionId}"] = "Submission is not valid to be saved" + end + else + msg = "Error archiving submission #{ont.acronym}/#{sub.submissionId} - #{e.class}:\n#{e.backtrace.join("\n")}" + puts msg + logger.error(msg) + + if options[:delete] && (e.class == Net::HTTPBadResponse || e.class == Errno::ECONNREFUSED) + sub.delete + msg = "Deleted submission #{ont.acronym}/#{sub.submissionId} due to a non-working pull URL" + puts msg + logger.error(msg) + end + bad_submissions["#{ont.acronym}/#{sub.submissionId}"] = "#{e.class} - Runtime error" + end + end end end end end +puts + +if bad_submissions.empty? + msg = "No errored submissions found" + puts msg + logger.info(msg) +else + msg = JSON.pretty_generate(bad_submissions) + puts msg + logger.error(msg) + msg = "Number of errored submissions: #{bad_submissions.length}" + puts msg + logger.error(msg) +end \ No newline at end of file diff --git a/bin/ncbo_ontology_import b/bin/ncbo_ontology_import index db2e90c5..57d63aa1 100755 --- a/bin/ncbo_ontology_import +++ b/bin/ncbo_ontology_import @@ -20,27 +20,27 @@ require 'net/http' require 'optparse' ontologies_acronyms = '' ontology_source = '' -source_api = '' +source_apikey = '' username = '' opt_parser = OptionParser.new do |opts| opts.banner = 'Usage: ncbo_ontology_import [options]' - opts.on('-o', '--ontology ACRONYM', 'Ontologies acronyms which we want to import (separated by comma)') do |acronym| + opts.on('-o', '--ontologies ACRONYM1,ACRONYM2', 'Comma-separated list of ontologies to import') do |acronym| ontologies_acronyms = acronym end - opts.on('--from url', 'The ontoportal api url source of the ontology') do |url| + opts.on('--from URL', 'The ontoportal api url source of the ontology') do |url| ontology_source = url.to_s end - opts.on('--from-api api', 'An apikey to acces the ontoportal api') do |api| - source_api = api.to_s + opts.on('--from-apikey APIKEY', 'An apikey to acces the ontoportal api') do |apikey| + source_apikey = apikey.to_s end - opts.on('--admin-user username', 'The target admin user that will submit the ontology') do |user| + opts.on('--admin-user USERNAME', 'The target admin user that will submit the ontology') do |user| username = user.to_s end # Display the help screen, all programs are assumed to have this option. - opts.on( '-h', '--help', 'Display this screen') do + opts.on('-h', '--help', 'Display this screen') do puts opts exit end @@ -48,9 +48,8 @@ end opt_parser.parse! 
# URL of the API and APIKEY of the Ontoportal we want to import data FROM -SOURCE_API = ontology_source -SOURCE_APIKEY = source_api - +SOURCE_API = ontology_source +SOURCE_APIKEY = source_apikey # The username of the user that will have the administration rights on the ontology on the target portal TARGETED_PORTAL_USER = username @@ -58,17 +57,15 @@ TARGETED_PORTAL_USER = username # The list of acronyms of ontologies to import ONTOLOGIES_TO_IMPORT = ontologies_acronyms.split(',') || [] - def get_user(username) user = LinkedData::Models::User.find(username).first raise "The user #{username} does not exist" if user.nil? + user.bring_remaining end - # A function to create a new ontology (if already Acronym already existing on the portal it will return HTTPConflict) def create_ontology(ont_info) - new_ontology = LinkedData::Models::Ontology.new new_ontology.acronym = ont_info['acronym'] @@ -97,23 +94,30 @@ def upload_submission(sub_info, ontology) # Build the json body # hasOntologyLanguage options: OWL, UMLS, SKOS, OBO # status: alpha, beta, production, retired - attr_to_reject = %w[id submissionStatus hasOntologyLanguage metrics ontology @id @type contact] - to_copy = sub_info.select do |k,v| + attr_to_reject = %w[id submissionStatus hasOntologyLanguage metrics ontology @id @type contact uploadFilePath diffFilePath] + to_copy = sub_info.select do |k, v| !v.nil? && !v.is_a?(Hash) && !v.to_s.empty? && !attr_to_reject.include?(k) end to_copy["ontology"] = ontology - to_copy["contact"] = [LinkedData::Models::Contact.where(email: USER.email).first] - to_copy["hasOntologyLanguage"] = LinkedData::Models::OntologyFormat.where(acronym: sub_info["hasOntologyLanguage"]).first + + contact = LinkedData::Models::Contact.where(email: USER.email).first + unless contact + contact = LinkedData::Models::Contact.new(name: USER.username, email: USER.email).save + puts "created a new contact; name: #{USER.username}, email: #{USER.email}" + end + + to_copy["contact"] = [contact] + to_copy["hasOntologyLanguage"] = LinkedData::Models::OntologyFormat.where(acronym: sub_info["hasOntologyLanguage"]).first to_copy.each do |key, value| attribute_settings = new_submission.class.attribute_settings(key.to_sym) if attribute_settings - if attribute_settings[:enforce]&.include?(:date_time) + if attribute_settings[:enforce]&.include?(:date_time) value = DateTime.parse(value) elsif attribute_settings[:enforce]&.include?(:uri) && attribute_settings[:enforce]&.include?(:list) value = value.map { |v| RDF::IRI.new(v) } - elsif attribute_settings[:enforce]&.include?(:uri) + elsif attribute_settings[:enforce]&.include?(:uri) value = RDF::IRI.new(value) end end @@ -124,12 +128,11 @@ def upload_submission(sub_info, ontology) new_submission end - USER = get_user username -#get apikey for admin user +# get apikey for admin user TARGET_APIKEY = USER.apikey -SOURCE_APIKEY == '' && abort('--from-api has to be set') +SOURCE_APIKEY == '' && abort('--from-apikey has to be set') SOURCE_API == '' && abort('--from has to be set') def result_log(ressource, errors) @@ -143,10 +146,11 @@ end # Go through all ontologies acronym and get their latest_submission informations ONTOLOGIES_TO_IMPORT.each do |ont| sub_info = JSON.parse(Net::HTTP.get(URI.parse("#{SOURCE_API}/ontologies/#{ont}/latest_submission?apikey=#{SOURCE_APIKEY}&display=all"))) - puts "Import #{ont} " , + puts "Import #{ont} ", "From #{SOURCE_API}" # if the ontology is already created then it will return HTTPConflict, no consequences raise "The ontology #{ont} does not exist" if 
sub_info['ontology'].nil? + new_ontology = create_ontology(sub_info['ontology']) errors = nil if new_ontology.valid? @@ -159,6 +163,7 @@ ONTOLOGIES_TO_IMPORT.each do |ont| new_ontology ||= LinkedData::Models::Ontology.where(acronym: ont).first new_submission = upload_submission(sub_info, new_ontology) + if new_submission.valid? new_submission.save errors = nil @@ -167,6 +172,3 @@ ONTOLOGIES_TO_IMPORT.each do |ont| end result_log(sub_info["id"], errors) end - - - diff --git a/bin/ncbo_ontology_process b/bin/ncbo_ontology_process index d96f0d87..879e749d 100755 --- a/bin/ncbo_ontology_process +++ b/bin/ncbo_ontology_process @@ -31,9 +31,14 @@ opt_parser = OptionParser.new do |opts| end options[:tasks] = NcboCron::Models::OntologySubmissionParser::ACTIONS - opts.on('-t', '--tasks process_rdf,index_search,run_metrics', "Optional comma-separated list of processing tasks to perform. Default: #{NcboCron::Models::OntologySubmissionParser::ACTIONS.keys.join(',')}") do |tasks| - t = tasks.split(",").map {|t| t.strip.sub(/^:/, '').to_sym} - options[:tasks].each {|k, _| options[:tasks][k] = false unless t.include?(k)} + opts.on('-t', '--tasks process_rdf,generate_labels=false,index_search,run_metrics', "Optional comma-separated list of processing tasks to perform (or exclude). Default: #{NcboCron::Models::OntologySubmissionParser::ACTIONS.keys.join(',')}") do |tasks| + tasks_obj = {} + tasks.split(',').each { |t| + t_arr = t.gsub(/\s+/, '').gsub(/^:/, '').split('=') + tasks_obj[t_arr[0].to_sym] = (t_arr.length <= 1 || t_arr[1].downcase === 'true') + } + tasks_obj[:generate_labels] = true if tasks_obj[:process_rdf] && !tasks_obj.has_key?(:generate_labels) + options[:tasks].each {|k, _| options[:tasks][k] = false unless tasks_obj[k]} end options[:logfile] = STDOUT diff --git a/bin/ncbo_ontology_pull b/bin/ncbo_ontology_pull new file mode 100755 index 00000000..be3e08de --- /dev/null +++ b/bin/ncbo_ontology_pull @@ -0,0 +1,42 @@ +#!/usr/bin/env ruby + +$0 = "ncbo_ontology_pull" + +# Exit cleanly from an early interrupt +Signal.trap("INT") { exit 1 } + +# Setup the bundled gems in our environment +require 'bundler/setup' +# redis store for looking up queued jobs +require 'redis' + +require_relative '../lib/ncbo_cron' +require_relative '../config/config' +require 'optparse' + +ontology_acronym = '' +opt_parser = OptionParser.new do |opts| + opts.on('-o', '--ontology ACRONYM', 'Ontology acronym to pull if new version exist') do |acronym| + ontology_acronym = acronym + end + + # Display the help screen, all programs are assumed to have this option. + opts.on( '-h', '--help', 'Display this screen') do + puts opts + exit + end +end +opt_parser.parse! 
+ +logger = Logger.new($stdout) +logger.info "Starting ncbo pull"; logger.flush +puller = NcboCron::Models::OntologyPull.new +begin + puller.do_ontology_pull(ontology_acronym, logger: logger, enable_pull_umls: true) +rescue StandardError => e + logger.error e.message + logger.flush +end +logger.info "Finished ncbo pull"; logger.flush + + diff --git a/bin/ncbo_ontology_submissions_eradicate b/bin/ncbo_ontology_submissions_eradicate new file mode 100755 index 00000000..ef2c7a19 --- /dev/null +++ b/bin/ncbo_ontology_submissions_eradicate @@ -0,0 +1,107 @@ +#!/usr/bin/env ruby + +$0 = 'ncbo_cron' + +# Exit cleanly from an early interrupt +Signal.trap('INT') { exit 1 } + +# Setup the bundled gems in our environment +require 'bundler/setup' +# redis store for looking up queued jobs +require 'redis' + +require_relative '../lib/ncbo_cron' +require_relative '../config/config' +require 'optparse' +ontology_acronym = '' +submission_id_from = 0 +submission_id_to = 0 + +opt_parser = OptionParser.new do |opts| + opts.banner = 'Usage: ncbo_ontology_sumissions_eradicate [options]' + opts.on('-o', '--ontology ACRONYM', 'Ontology acronym which we want to eradicate (remove triples+files) specific submissions') do |acronym| + ontology_acronym = acronym + end + + opts.on('--from id', 'Submission id to start from deleting (included)') do |id| + submission_id_from = id.to_i + end + + opts.on('--to id', 'Submission id to end deleting (included)') do |id| + submission_id_to = id.to_i + end + # Display the help screen, all programs are assumed to have this option. + opts.on( '-h', '--help', 'Display this screen') do + puts opts + exit + end +end +opt_parser.parse! + + + + + +def ontology_exists?(ontology_acronym) + ont = LinkedData::Models::Ontology.find(ontology_acronym) + .include(submissions: [:submissionId]) + .first + if ont.nil? + logger.error "ontology not found: #{options[:ontology]}" + exit(1) + end + ont.bring(:submissions) if ont.bring?(:submissions) + ont +end + + +def get_submission_to_delete(submissions, from, to) + min, max = [from, to].minmax + submissions.select { |s| s.submissionId.between?(min, max) }.sort { |s1, s2| s1.submissionId <=> s2.submissionId} +end + +def eradicate(ontology_acronym, submissions , logger) + logger ||= Logger.new($stderr) + submissions.each do |submission| + begin + logger.info "Start removing submission #{submission.submissionId.to_s}" + NcboCron::Models::OntologySubmissionEradicator.new.eradicate submission + logger.info"Submission #{submission.submissionId.to_s} deleted successfully" + rescue NcboCron::Models::OntologySubmissionEradicator::RemoveNotArchivedSubmissionException + logger.info "Submission #{submission.submissionId.to_s} is not archived" + ask? logger, 'Do you want to force remove ? 
(Y/n)' + NcboCron::Models::OntologySubmissionEradicator.new.eradicate submission, true + logger.info"Submission #{submission.submissionId.to_s} deleted successfully" + rescue NcboCron::Models::OntologySubmissionEradicator::RemoveSubmissionFileException => e + logger.error "RemoveSubmissionFileException in submission #{submission.submissionId.to_s} : #{e.message}" + rescue NcboCron::Models::OntologySubmissionEradicator::RemoveSubmissionDataException => e + logger.error "RemoveSubmissionDataException in submission #{submission.submissionId.to_s} : #{e.message}" + rescue Exception => e + logger.error "Error in submission #{submission.submissionId.to_s} remove: #{e.message}" + end + end +end + +def ask?(logger, prompt) + logger.info prompt + choice = gets.chomp.downcase + exit(1) if choice.eql? 'n' +end + +begin + logger = Logger.new($stderr) + + logger.info 'Start of NCBO ontology submissions eradicate' + + ont = ontology_exists? ontology_acronym + + submissions = ont.submissions + submissions_to_delete = get_submission_to_delete submissions, submission_id_from, submission_id_to + + logger.info "You are attempting to remove the following submissions of #{ontology_acronym} : #{submissions_to_delete.map{ |s| s.submissionId }.join(', ')}" + logger.info 'They will be deleted from the triple store and local files' + ask? logger, 'Do you confirm ? (Y/n)' + + eradicate ontology_acronym , submissions_to_delete, logger + exit(0) +end \ No newline at end of file diff --git a/config/config.rb.sample b/config/config.rb.sample index 15125224..668c7a0c 100644 --- a/config/config.rb.sample +++ b/config/config.rb.sample @@ -1,16 +1,42 @@ -LinkedData.config do |config| - config.enable_monitoring = false - config.cube_host = "localhost" - config.goo_host = "localhost" - config.goo_port = 8080 - config.search_server_url = "http://localhost:8983/solr/term_search_core1" - config.property_search_server_url = "http://localhost:8983/solr/prop_search_core1" - config.repository_folder = "./test/data/ontology_files/repo" - config.http_redis_host = "localhost" - config.http_redis_port = 6379 - config.goo_redis_host = "localhost" - config.goo_redis_port = 6379 +# This file is designed to be used for unit testing with docker-compose + +GOO_BACKEND_NAME = ENV.include?("GOO_BACKEND_NAME") ? ENV["GOO_BACKEND_NAME"] : "4store" +GOO_HOST = ENV.include?("GOO_HOST") ? ENV["GOO_HOST"] : "localhost" +GOO_PATH_DATA = ENV.include?("GOO_PATH_DATA") ? ENV["GOO_PATH_DATA"] : "/data/" +GOO_PATH_QUERY = ENV.include?("GOO_PATH_QUERY") ? ENV["GOO_PATH_QUERY"] : "/sparql/" +GOO_PATH_UPDATE = ENV.include?("GOO_PATH_UPDATE") ? ENV["GOO_PATH_UPDATE"] : "/update/" +GOO_PORT = ENV.include?("GOO_PORT") ? ENV["GOO_PORT"] : 9000 +MGREP_HOST = ENV.include?("MGREP_HOST") ? ENV["MGREP_HOST"] : "localhost" +MGREP_PORT = ENV.include?("MGREP_PORT") ? ENV["MGREP_PORT"] : 55555 +MGREP_DICT_PATH = ENV.include?("MGREP_DICT_PATH") ? ENV["MGREP_DICT_PATH"] : "./test/data/dictionary.txt" +REDIS_GOO_CACHE_HOST = ENV.include?("REDIS_GOO_CACHE_HOST") ? ENV["REDIS_GOO_CACHE_HOST"] : "localhost" +REDIS_HTTP_CACHE_HOST = ENV.include?("REDIS_HTTP_CACHE_HOST") ? ENV["REDIS_HTTP_CACHE_HOST"] : "localhost" +REDIS_PERSISTENT_HOST = ENV.include?("REDIS_PERSISTENT_HOST") ? ENV["REDIS_PERSISTENT_HOST"] : "localhost" +REDIS_PORT = ENV.include?("REDIS_PORT") ? ENV["REDIS_PORT"] : 6379 +REPORT_PATH = ENV.include?("REPORT_PATH") ? ENV["REPORT_PATH"] : "./test/tmp/ontologies_report.json" +REPOSITORY_FOLDER = ENV.include?("REPOSITORY_FOLDER") ? 
ENV["REPOSITORY_FOLDER"] : "./test/data/ontology_files/repo" +REST_URL_PREFIX = ENV.include?("REST_URL_PREFIX") ? ENV["REST_URL_PREFIX"] : "http://localhost:9393" +SOLR_PROP_SEARCH_URL = ENV.include?("SOLR_PROP_SEARCH_URL") ? ENV["SOLR_PROP_SEARCH_URL"] : "http://localhost:8983/solr/prop_search_core1" +SOLR_TERM_SEARCH_URL = ENV.include?("SOLR_TERM_SEARCH_URL") ? ENV["SOLR_TERM_SEARCH_URL"] : "http://localhost:8983/solr/term_search_core1" +LinkedData.config do |config| + config.goo_backend_name = GOO_BACKEND_NAME.to_s + config.goo_host = GOO_HOST.to_s + config.goo_port = GOO_PORT.to_i + config.goo_path_query = GOO_PATH_QUERY.to_s + config.goo_path_data = GOO_PATH_DATA.to_s + config.goo_path_update = GOO_PATH_UPDATE.to_s + config.goo_redis_host = REDIS_GOO_CACHE_HOST.to_s + config.goo_redis_port = REDIS_PORT.to_i + config.http_redis_host = REDIS_HTTP_CACHE_HOST.to_s + config.http_redis_port = REDIS_PORT.to_i + config.ontology_analytics_redis_host = REDIS_PERSISTENT_HOST.to_s + config.ontology_analytics_redis_port = REDIS_PORT.to_i + config.repository_folder = REPOSITORY_FOLDER.to_s + config.search_server_url = SOLR_TERM_SEARCH_URL.to_s + config.property_search_server_url = SOLR_PROP_SEARCH_URL.to_s +# config.replace_url_prefix = false +# config.rest_url_prefix = REST_URL_PREFIX.to_s # Email notifications. config.enable_notifications = true config.email_sender = "sender@domain.com" # Default sender for emails @@ -19,35 +45,38 @@ LinkedData.config do |config| config.smtp_user = nil config.smtp_password = nil config.smtp_auth_type = :none - config.smtp_domain = "localhost.localhost" + config.smtp_domain = "localhost.localhost" end Annotator.config do |config| - config.mgrep_dictionary_file ||= "./test/tmp/dict" - config.stop_words_default_file ||= "./config/default_stop_words.txt" config.mgrep_host ||= "localhost" - config.mgrep_port ||= 55555 - config.annotator_redis_host ||= "localhost" - config.annotator_redis_port ||= 6379 + config.annotator_redis_host = REDIS_PERSISTENT_HOST.to_s + config.annotator_redis_port = REDIS_PORT.to_i + config.mgrep_host = MGREP_HOST.to_s + config.mgrep_port = MGREP_PORT.to_i + config.mgrep_dictionary_file = MGREP_DICT_PATH.to_s end NcboCron.config do |config| - config.redis_host ||= "localhost" - config.redis_port ||= 6379 + config.redis_host = REDIS_PERSISTENT_HOST.to_s + config.redis_port = REDIS_PORT.to_i + # Ontologies Report config + config.ontology_report_path = REPORT_PATH + + # do not deaemonize in docker + config.daemonize = false + config.search_index_all_url = "http://localhost:8983/solr/term_search_core2" config.property_search_index_all_url = "http://localhost:8983/solr/prop_search_core2" - # Ontologies Report config - config.ontology_report_path = "./test/reports/ontologies_report.json" - - # Google Analytics config - config.analytics_service_account_email_address = "123456789999-sikipho0wk8q0atflrmw62dj4kpwoj3c@developer.gserviceaccount.com" - config.analytics_path_to_key_file = "config/bioportal-analytics.p12" - config.analytics_profile_id = "ga:1234567" - config.analytics_app_name = "BioPortal" - config.analytics_app_version = "1.0.0" - config.analytics_start_date = "2013-10-01" - config.analytics_filter_str = "ga:networkLocation!@stanford;ga:networkLocation!@amazon" + # Google Analytics GA4 config + config.analytics_path_to_key_file = "config/your_analytics_key.json" + config.analytics_property_id = "123456789" + # path to the Universal Analytics data, which stopped collecting on June 1st, 2023 + config.analytics_path_to_ua_data_file = 
"data/your_ua_data.json" + # path to the file that will hold your Google Analytics data + # this is in addition to storing it in Redis + config.analytics_path_to_ga_data_file = "data/your_ga_data.json" # this is a Base64.encode64 encoded personal access token # you need to run Base64.decode64 on it before using it in your code diff --git a/config/config.test.rb b/config/config.test.rb index 97eaf1f7..84a621ac 100644 --- a/config/config.test.rb +++ b/config/config.test.rb @@ -1,49 +1,69 @@ # This file is designed to be used for unit testing with docker-compose -# -GOO_PATH_QUERY = ENV.include?('GOO_PATH_QUERY') ? ENV['GOO_PATH_QUERY'] : '/sparql/' -GOO_PATH_DATA = ENV.include?('GOO_PATH_DATA') ? ENV['GOO_PATH_DATA'] : '/data/' -GOO_PATH_UPDATE = ENV.include?('GOO_PATH_UPDATE') ? ENV['GOO_PATH_UPDATE'] : '/update/' -GOO_BACKEND_NAME = ENV.include?('GOO_BACKEND_NAME') ? ENV['GOO_BACKEND_NAME'] : 'localhost' -GOO_PORT = ENV.include?('GOO_PORT') ? ENV['GOO_PORT'] : 9000 -GOO_HOST = ENV.include?('GOO_HOST') ? ENV['GOO_HOST'] : 'localhost' -SOLR_HOST = ENV.include?('SOLR_HOST') ? ENV['SOLR_HOST'] : 'localhost' -REDIS_HOST = ENV.include?('REDIS_HOST') ? ENV['REDIS_HOST'] : 'localhost' -REDIS_PORT = ENV.include?('REDIS_PORT') ? ENV['REDIS_PORT'] : 6379 -MGREP_HOST = ENV.include?('MGREP_HOST') ? ENV['MGREP_HOST'] : 'localhost' -MGREP_PORT = ENV.include?('MGREP_PORT') ? ENV['MGREP_PORT'] : 55555 + +GOO_BACKEND_NAME = ENV.include?("GOO_BACKEND_NAME") ? ENV["GOO_BACKEND_NAME"] : "4store" +GOO_HOST = ENV.include?("GOO_HOST") ? ENV["GOO_HOST"] : "localhost" +GOO_PATH_DATA = ENV.include?("GOO_PATH_DATA") ? ENV["GOO_PATH_DATA"] : "/data/" +GOO_PATH_QUERY = ENV.include?("GOO_PATH_QUERY") ? ENV["GOO_PATH_QUERY"] : "/sparql/" +GOO_PATH_UPDATE = ENV.include?("GOO_PATH_UPDATE") ? ENV["GOO_PATH_UPDATE"] : "/update/" +GOO_PORT = ENV.include?("GOO_PORT") ? ENV["GOO_PORT"] : 9000 +MGREP_HOST = ENV.include?("MGREP_HOST") ? ENV["MGREP_HOST"] : "localhost" +MGREP_PORT = ENV.include?("MGREP_PORT") ? ENV["MGREP_PORT"] : 55555 +MGREP_DICT_PATH = ENV.include?("MGREP_DICT_PATH") ? ENV["MGREP_DICT_PATH"] : "./test/data/dictionary.txt" +REDIS_GOO_CACHE_HOST = ENV.include?("REDIS_GOO_CACHE_HOST") ? ENV["REDIS_GOO_CACHE_HOST"] : "localhost" +REDIS_HTTP_CACHE_HOST = ENV.include?("REDIS_HTTP_CACHE_HOST") ? ENV["REDIS_HTTP_CACHE_HOST"] : "localhost" +REDIS_PERSISTENT_HOST = ENV.include?("REDIS_PERSISTENT_HOST") ? ENV["REDIS_PERSISTENT_HOST"] : "localhost" +REDIS_PORT = ENV.include?("REDIS_PORT") ? ENV["REDIS_PORT"] : 6379 +REPORT_PATH = ENV.include?("REPORT_PATH") ? ENV["REPORT_PATH"] : "./test/tmp/ontologies_report.json" +REPOSITORY_FOLDER = ENV.include?("REPOSITORY_FOLDER") ? ENV["REPOSITORY_FOLDER"] : "./test/data/ontology_files/repo" +REST_URL_PREFIX = ENV.include?("REST_URL_PREFIX") ? ENV["REST_URL_PREFIX"] : "http://localhost:9393" +SOLR_PROP_SEARCH_URL = ENV.include?("SOLR_PROP_SEARCH_URL") ? ENV["SOLR_PROP_SEARCH_URL"] : "http://localhost:8983/solr/prop_search_core1" +SOLR_TERM_SEARCH_URL = ENV.include?("SOLR_TERM_SEARCH_URL") ? 
ENV["SOLR_TERM_SEARCH_URL"] : "http://localhost:8983/solr/term_search_core1" LinkedData.config do |config| + config.goo_backend_name = GOO_BACKEND_NAME.to_s config.goo_host = GOO_HOST.to_s config.goo_port = GOO_PORT.to_i - config.goo_redis_host = REDIS_HOST.to_s + config.goo_path_query = GOO_PATH_QUERY.to_s + config.goo_path_data = GOO_PATH_DATA.to_s + config.goo_path_update = GOO_PATH_UPDATE.to_s + config.goo_redis_host = REDIS_GOO_CACHE_HOST.to_s config.goo_redis_port = REDIS_PORT.to_i - config.http_redis_host = REDIS_HOST.to_s + config.http_redis_host = REDIS_HTTP_CACHE_HOST.to_s config.http_redis_port = REDIS_PORT.to_i - config.ontology_analytics_redis_host = REDIS_HOST.to_s + config.ontology_analytics_redis_host = REDIS_PERSISTENT_HOST.to_s config.ontology_analytics_redis_port = REDIS_PORT.to_i - config.search_server_url = "http://#{SOLR_HOST}:8983/solr/term_search_core1".to_s - config.property_search_server_url = "http://#{SOLR_HOST}:8983/solr/prop_search_core1".to_s + config.repository_folder = REPOSITORY_FOLDER.to_s + config.search_server_url = SOLR_TERM_SEARCH_URL.to_s + config.property_search_server_url = SOLR_PROP_SEARCH_URL.to_s +# config.replace_url_prefix = false +# config.rest_url_prefix = REST_URL_PREFIX.to_s # Email notifications. config.enable_notifications = true - config.email_sender = 'sender@domain.com' # Default sender for emails - config.email_override = 'test@domain.com' # By default, all email gets sent here. Disable with email_override_disable. - config.smtp_host = 'smtp-unencrypted.stanford.edu' + config.email_sender = "sender@domain.com" # Default sender for emails + config.email_override = "test@domain.com" # By default, all email gets sent here. Disable with email_override_disable. + config.smtp_host = "smtp-unencrypted.stanford.edu" config.smtp_user = nil config.smtp_password = nil config.smtp_auth_type = :none - config.smtp_domain = 'localhost.localhost' + config.smtp_domain = "localhost.localhost" end Annotator.config do |config| - config.annotator_redis_host = REDIS_HOST.to_s - config.annotator_redis_port = REDIS_PORT.to_i - config.mgrep_host = MGREP_HOST.to_s - config.mgrep_port = MGREP_PORT.to_i - config.mgrep_dictionary_file = './test/data/dictionary.txt' + config.annotator_redis_host = REDIS_PERSISTENT_HOST.to_s + config.annotator_redis_port = REDIS_PORT.to_i + config.mgrep_host = MGREP_HOST.to_s + config.mgrep_port = MGREP_PORT.to_i + config.mgrep_dictionary_file = MGREP_DICT_PATH.to_s end +# LinkedData::OntologiesAPI.config do |config| +# config.http_redis_host = REDIS_HTTP_CACHE_HOST.to_s +# config.http_redis_port = REDIS_PORT.to_i +# end +# NcboCron.config do |config| - config.redis_host = REDIS_HOST.to_s + config.daemonize = false + config.redis_host = REDIS_PERSISTENT_HOST.to_s config.redis_port = REDIS_PORT.to_i - config.ontology_report_path = './test/ontologies_report.json' + config.ontology_report_path = REPORT_PATH end diff --git a/dip.yml b/dip.yml new file mode 100644 index 00000000..3bbe4444 --- /dev/null +++ b/dip.yml @@ -0,0 +1,54 @@ +version: '7.1' + +# Define default environment variables to pass +# to Docker Compose +#environment: +# RAILS_ENV: development + +compose: + files: + - docker-compose.yml + # project_name: ncbo_cron + +interaction: + # This command spins up a ncbo_cron container with the required dependencies (solr, 4store, etc), + # and opens a terminal within it. 
+ runner: + description: Open a Bash shell within a ncbo_cron container (with dependencies up) + service: ncbo_cron + command: /bin/bash + + # Run a container without any dependent services + bash: + description: Run an arbitrary script within a container (or open a shell without deps) + service: ncbo_cron + command: /bin/bash + compose_run_options: [ no-deps ] + + # A shortcut to run Bundler commands + bundle: + description: Run Bundler commands within ncbo_cron container (with depencendies up) + service: ncbo_cron + command: bundle + + # A shortcut to run unit tests + test: + description: Run unit tests with 4store triplestore + service: ncbo_cron + command: bundle exec rake test TESTOPTS='-v' + + test-ag: + description: Run unit tests with AllegroGraph triplestore + service: ncbo_cron-agraph + command: bundle exec rake test TESTOPTS='-v' + + 'redis-cli': + description: Run Redis console + service: redis-ut + command: redis-cli -h redis-ut + +#provision: + #- dip compose down --volumes + #- dip compose up -d solr 4store + #- dip bundle install + #- dip bash -c bin/setup diff --git a/docker-compose.yml b/docker-compose.yml new file mode 100644 index 00000000..5f4e9307 --- /dev/null +++ b/docker-compose.yml @@ -0,0 +1,139 @@ +x-app: &app + build: + context: . + args: + RUBY_VERSION: '2.7' + # Increase the version number in the image tag every time Dockerfile or its arguments is changed + image: ncbo_cron:0.0.2 + environment: &env + BUNDLE_PATH: /srv/ontoportal/bundle + # default bundle config resolves to /usr/local/bundle/config inside of the container + # we are setting it to local app directory if we need to use 'bundle config local' + BUNDLE_APP_CONFIG: /srv/ontoportal/ncbo_cron/.bundle + COVERAGE: 'true' + GOO_REDIS_HOST: redis-ut + REDIS_GOO_CACHE_HOST: redis-ut + REDIS_HTTP_CACHE_HOST: redis-ut + REDIS_PERSISTENT_HOST: redis-ut + REDIS_PORT: 6379 + SOLR_TERM_SEARCH_URL: http://solr-ut:8983/solr/term_search_core1 + SOLR_PROP_SEARCH_URL: http://solr-ut:8983/solr/prop_search_core1 + MGREP_HOST: mgrep-ut + MGREP_PORT: 55556 + stdin_open: true + tty: true + command: "bundle exec rackup -o 0.0.0.0 --port 9393" + volumes: + # bundle volume for hosting gems installed by bundle; it helps in local development with gem udpates + - bundle:/srv/ontoportal/bundle + # ncbo_cron code + - .:/srv/ontoportal/ncbo_cron + # mount directory containing development version of the gems if you need to use 'bundle config local' + #- /Users/alexskr/ontoportal:/Users/alexskr/ontoportal + depends_on: &depends_on + solr-ut: + condition: service_healthy + redis-ut: + condition: service_healthy + mgrep-ut: + condition: service_healthy + +services: + ncbo_cron: + <<: *app + environment: + <<: *env + GOO_BACKEND_NAME: 4store + GOO_PORT: 9000 + GOO_HOST: 4store-ut + GOO_PATH_QUERY: /sparql/ + GOO_PATH_DATA: /data/ + GOO_PATH_UPDATE: /update/ + profiles: + - 4store + depends_on: + <<: *depends_on + 4store-ut: + condition: service_started + + ncbo_cron-agraph: + <<: *app + environment: + <<: *env + GOO_BACKEND_NAME: ag + GOO_PORT: 10035 + GOO_HOST: agraph-ut + GOO_PATH_QUERY: /repositories/bioportal_test + GOO_PATH_DATA: /repositories/bioportal_test/statements + GOO_PATH_UPDATE: /repositories/bioportal_test/statements + profiles: + - agraph + depends_on: + <<: *depends_on + agraph-ut: + condition: service_healthy + + redis-ut: + image: redis + healthcheck: + test: redis-cli ping + interval: 10s + timeout: 3s + retries: 10 + + 4store-ut: + image: bde2020/4store + platform: linux/amd64 + #volume: fourstore:/var/lib/4store 
+ command: > + bash -c "4s-backend-setup --segments 4 ontoportal_kb + && 4s-backend ontoportal_kb + && 4s-httpd -D -s-1 -p 9000 ontoportal_kb" + profiles: + - 4store + + solr-ut: + image: ontoportal/solr-ut:0.0.2 + healthcheck: + test: ["CMD-SHELL", "curl -sf http://localhost:8983/solr/term_search_core1/admin/ping?wt=json | grep -iq '\"status\":\"OK\"}' || exit 1"] + start_period: 3s + interval: 10s + timeout: 5s + retries: 5 + + mgrep-ut: + image: ontoportal/mgrep:0.0.2 + platform: linux/amd64 + healthcheck: + test: ["CMD", "nc", "-z", "-v", "localhost", "55556"] + start_period: 3s + interval: 10s + timeout: 5s + retries: 5 + + agraph-ut: + image: franzinc/agraph:v8.0.0 + platform: linux/amd64 + environment: + - AGRAPH_SUPER_USER=test + - AGRAPH_SUPER_PASSWORD=xyzzy + shm_size: 1g + # ports: + # - 10035:10035 + command: > + bash -c "/agraph/bin/agraph-control --config /agraph/etc/agraph.cfg start + ; agtool repos create bioportal_test + ; agtool users add anonymous + ; agtool users grant anonymous root:bioportal_test:rw + ; tail -f /agraph/data/agraph.log" + healthcheck: + test: ["CMD-SHELL", "agtool storage-report bioportal_test || exit 1"] + start_period: 20s + interval: 60s + timeout: 5s + retries: 3 + profiles: + - agraph + +volumes: + bundle: diff --git a/lib/ncbo_cron.rb b/lib/ncbo_cron.rb index 309b15db..884e6b33 100644 --- a/lib/ncbo_cron.rb +++ b/lib/ncbo_cron.rb @@ -6,6 +6,7 @@ require 'ncbo_annotator' require_relative 'ncbo_cron/config' require_relative 'ncbo_cron/ontology_submission_parser' +require_relative 'ncbo_cron/ontology_submission_eradicator' require_relative 'ncbo_cron/ontology_pull' require_relative 'ncbo_cron/scheduler' require_relative 'ncbo_cron/query_caching' diff --git a/lib/ncbo_cron/config.rb b/lib/ncbo_cron/config.rb index 49db0fb4..6d3db51e 100644 --- a/lib/ncbo_cron/config.rb +++ b/lib/ncbo_cron/config.rb @@ -40,16 +40,8 @@ def config(&block) @settings.enable_spam_deletion ||= true # enable update check (vor VMs) @settings.enable_update_check ||= true - - - - # enable mgrep dictionary generation job - @settings.enable_dictionary_generation ||= true - - - - + @settings.enable_dictionary_generation_cron_job ||= false # UMLS auto-pull @settings.pull_umls_url ||= "" @@ -85,17 +77,9 @@ def config(&block) @settings.cron_obofoundry_sync ||= "0 8 * * 1,2,3,4,5" # 00 3 * * * - run daily at 3:00AM @settings.cron_update_check ||= "00 3 * * *" - - - - # mgrep dictionary generation schedule # 30 3 * * * - run daily at 3:30AM - @settings.cron_dictionary_generation ||= "30 3 * * *" - - - - + @settings.cron_dictionary_generation_cron_job ||= "30 3 * * *" @settings.log_level ||= :info unless (@settings.log_path && File.exists?(@settings.log_path)) diff --git a/lib/ncbo_cron/ontologies_report.rb b/lib/ncbo_cron/ontologies_report.rb index 43f0505f..99463a0a 100644 --- a/lib/ncbo_cron/ontologies_report.rb +++ b/lib/ncbo_cron/ontologies_report.rb @@ -345,7 +345,7 @@ def good_classes(submission, report) page_size = 1000 classes_size = 10 good_classes = Array.new - paging = LinkedData::Models::Class.in(submission).include(:prefLabel, :synonym, metrics: :classes).page(page_num, page_size) + paging = LinkedData::Models::Class.in(submission).include(:prefLabel, :synonym, submission: [metrics: :classes]).page(page_num, page_size) cls_count = submission.class_count(@logger).to_i # prevent a COUNT SPARQL query if possible paging.page_count_set(cls_count) if cls_count > -1 diff --git a/lib/ncbo_cron/ontology_analytics.rb b/lib/ncbo_cron/ontology_analytics.rb index e06fcd77..c5a4de00 
100644 --- a/lib/ncbo_cron/ontology_analytics.rb +++ b/lib/ncbo_cron/ontology_analytics.rb @@ -1,117 +1,223 @@ require 'logger' -require 'google/apis/analytics_v3' -require 'google/api_client/auth/key_utils' +require 'json' +require 'benchmark' +require 'google/analytics/data' + module NcboCron module Models class OntologyAnalytics - ONTOLOGY_ANALYTICS_REDIS_FIELD = "ontology_analytics" + ONTOLOGY_ANALYTICS_REDIS_FIELD = 'ontology_analytics' + UA_START_DATE = '2013-10-01' + GA4_START_DATE = '2023-06-01' def initialize(logger) @logger = logger end def run - redis = Redis.new(:host => NcboCron.settings.redis_host, :port => NcboCron.settings.redis_port) + redis = Redis.new(:host => LinkedData.settings.ontology_analytics_redis_host, :port => LinkedData.settings.ontology_analytics_redis_port) ontology_analytics = fetch_ontology_analytics + File.open(NcboCron.settings.analytics_path_to_ga_data_file, 'w') do |f| + f.write(ontology_analytics.to_json) + end redis.set(ONTOLOGY_ANALYTICS_REDIS_FIELD, Marshal.dump(ontology_analytics)) end def fetch_ontology_analytics - google_client = authenticate_google - aggregated_results = Hash.new - start_year = Date.parse(NcboCron.settings.analytics_start_date).year || 2013 - ont_acronyms = LinkedData::Models::Ontology.where.include(:acronym).all.map {|o| o.acronym} - # ont_acronyms = ["NCIT", "ONTOMA", "CMPO", "AEO", "SNOMEDCT"] - filter_str = (NcboCron.settings.analytics_filter_str.nil? || NcboCron.settings.analytics_filter_str.empty?) ? "" : ";#{NcboCron.settings.analytics_filter_str}" - - ont_acronyms.each do |acronym| + @logger.info "Starting Google Analytics refresh..." + @logger.flush + full_data = nil + + time = Benchmark.realtime do max_results = 10000 - num_results = 10000 - start_index = 1 - results = nil - - loop do - results = google_client.get_ga_data( - ids = NcboCron.settings.analytics_profile_id, - start_date = NcboCron.settings.analytics_start_date, - end_date = Date.today.to_s, - metrics = 'ga:pageviews', - { - dimensions: 'ga:pagePath,ga:year,ga:month', - filters: "ga:pagePath=~^(\\/ontologies\\/#{acronym})(\\/?\\?{0}|\\/?\\?{1}.*)$#{filter_str}", - start_index: start_index, - max_results: max_results - } - ) - results.rows ||= [] - start_index += max_results - num_results = results.rows.length - @logger.info "Acronym: #{acronym}, Results: #{num_results}, Start Index: #{start_index}" - @logger.flush - - results.rows.each do |row| - if aggregated_results.has_key?(acronym) - # year - if aggregated_results[acronym].has_key?(row[1].to_i) - # month - if aggregated_results[acronym][row[1].to_i].has_key?(row[2].to_i) - aggregated_results[acronym][row[1].to_i][row[2].to_i] += row[3].to_i + aggregated_results = Hash.new + + @logger.info "Fetching all ontology acronyms from backend..." + @logger.flush + ont_acronyms = LinkedData::Models::Ontology.where.include(:acronym).all.map {|o| o.acronym} + # ont_acronyms = ["NCIT", "SNOMEDCT", "MEDDRA"] + @logger.info "Authenticating with the Google Analytics Endpoint..." 
+ @logger.flush + google_client = authenticate_google + + date_range = Google::Analytics::Data::V1beta::DateRange.new( + start_date: GA4_START_DATE, + end_date: Date.today.to_s + ) + metrics_page_views = Google::Analytics::Data::V1beta::Metric.new( + name: "screenPageViews" + ) + dimension_path = Google::Analytics::Data::V1beta::Dimension.new( + name: "pagePath" + ) + dimension_year = Google::Analytics::Data::V1beta::Dimension.new( + name: "year" + ) + dimension_month = Google::Analytics::Data::V1beta::Dimension.new( + name: "month" + ) + string_filter = Google::Analytics::Data::V1beta::Filter::StringFilter.new( + match_type: Google::Analytics::Data::V1beta::Filter::StringFilter::MatchType::FULL_REGEXP + ) + filter = Google::Analytics::Data::V1beta::Filter.new( + field_name: "pagePath", + string_filter: string_filter + ) + filter_expression = Google::Analytics::Data::V1beta::FilterExpression.new( + filter: filter + ) + order_year = Google::Analytics::Data::V1beta::OrderBy::DimensionOrderBy.new( + dimension_name: "year" + ) + orderby_year = Google::Analytics::Data::V1beta::OrderBy.new( + desc: false, + dimension: order_year + ) + order_month = Google::Analytics::Data::V1beta::OrderBy::DimensionOrderBy.new( + dimension_name: "month" + ) + orderby_month = Google::Analytics::Data::V1beta::OrderBy.new( + desc: false, + dimension: order_month + ) + @logger.info "Fetching GA4 analytics for all ontologies..." + @logger.flush + + ont_acronyms.each do |acronym| + start_index = 0 + string_filter.value = "^(\\/ontologies\\/#{acronym})(\\/?\\?{0}|\\/?\\?{1}.*)$" + + loop do + request = Google::Analytics::Data::V1beta::RunReportRequest.new( + property: "properties/#{NcboCron.settings.analytics_property_id}", + metrics: [metrics_page_views], + dimension_filter: filter_expression, + dimensions: [dimension_path, dimension_year, dimension_month], + date_ranges: [date_range], + order_bys: [orderby_year, orderby_month], + offset: start_index, + limit: max_results + ) + response = google_client.run_report request + + response.rows ||= [] + start_index += max_results + num_results = response.rows.length + @logger.info "Acronym: #{acronym}, Results: #{num_results}, Start Index: #{start_index}" + @logger.flush + + response.rows.each do |row| + row_h = row.to_h + year_month_hits = row_h[:dimension_values].map.with_index { + |v, i| i > 0 ? 
v[:value].to_i.to_s : row_h[:metric_values][0][:value].to_i + }.rotate(1) + + if aggregated_results.has_key?(acronym) + # year + if aggregated_results[acronym].has_key?(year_month_hits[0]) + # month + if aggregated_results[acronym][year_month_hits[0]].has_key?(year_month_hits[1]) + aggregated_results[acronym][year_month_hits[0]][year_month_hits[1]] += year_month_hits[2] + else + aggregated_results[acronym][year_month_hits[0]][year_month_hits[1]] = year_month_hits[2] + end else - aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i + aggregated_results[acronym][year_month_hits[0]] = Hash.new + aggregated_results[acronym][year_month_hits[0]][year_month_hits[1]] = year_month_hits[2] end else - aggregated_results[acronym][row[1].to_i] = Hash.new - aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i + aggregated_results[acronym] = Hash.new + aggregated_results[acronym][year_month_hits[0]] = Hash.new + aggregated_results[acronym][year_month_hits[0]][year_month_hits[1]] = year_month_hits[2] end - else - aggregated_results[acronym] = Hash.new - aggregated_results[acronym][row[1].to_i] = Hash.new - aggregated_results[acronym][row[1].to_i][row[2].to_i] = row[3].to_i end - end + break if num_results < max_results + end # loop + end # ont_acronyms + @logger.info "Refresh complete" + @logger.flush + full_data = merge_and_fill_missing_data(aggregated_results) + end # Benchmark.realtime + @logger.info "Completed Google Analytics refresh in #{(time/60).round(1)} minutes." + @logger.flush + full_data + end - if num_results < max_results - # fill up non existent years - (start_year..Date.today.year).each do |y| - aggregated_results[acronym] = Hash.new if aggregated_results[acronym].nil? - aggregated_results[acronym][y] = Hash.new unless aggregated_results[acronym].has_key?(y) + def merge_and_fill_missing_data(ga4_data) + ua_data = {} + + if File.exists?(NcboCron.settings.analytics_path_to_ua_data_file) && + !File.zero?(NcboCron.settings.analytics_path_to_ua_data_file) + @logger.info "Merging GA4 and UA data..." + @logger.flush + ua_data_file = File.read(NcboCron.settings.analytics_path_to_ua_data_file) + ua_data = JSON.parse(ua_data_file) + ua_ga4_intersecting_year = Date.parse(GA4_START_DATE).year.to_s + ua_ga4_intersecting_month = Date.parse(GA4_START_DATE).month.to_s + + # add up hits for June of 2023 (the only intersecting month between UA and GA4) + ua_data.each do |acronym, _| + if ga4_data.has_key?(acronym) + if ga4_data[acronym][ua_ga4_intersecting_year].has_key?(ua_ga4_intersecting_month) + ua_data[acronym][ua_ga4_intersecting_year][ua_ga4_intersecting_month] += + ga4_data[acronym][ua_ga4_intersecting_year][ua_ga4_intersecting_month] + # delete data for June of 2023 from ga4_data to avoid overwriting when merging + ga4_data[acronym][ua_ga4_intersecting_year].delete(ua_ga4_intersecting_month) end - # fill up non existent months with zeros - (1..12).each { |n| aggregated_results[acronym].values.each { |v| v[n] = 0 unless v.has_key?(n) } } - break end end end - @logger.info "Completed ontology analytics refresh..." + # merge ua and ga4 data + merged_data = ua_data.deep_merge(ga4_data) + # fill missing years and months + @logger.info "Filling in missing years data..." @logger.flush + fill_missing_data(merged_data) + # sort acronyms, years and months + @logger.info "Sorting final data..." 
+ @logger.flush + sort_ga_data(merged_data) + end + + def fill_missing_data(ga_data) + # fill up non existent years + start_year = Date.parse(UA_START_DATE).year + + ga_data.each do |acronym, _| + (start_year..Date.today.year).each do |y| + ga_data[acronym] = Hash.new if ga_data[acronym].nil? + ga_data[acronym][y.to_s] = Hash.new unless ga_data[acronym].has_key?(y.to_s) + end + # fill up non existent months with zeros + (1..12).each { |n| ga_data[acronym].values.each { |v| v[n.to_s] = 0 unless v.has_key?(n.to_s) } } + end + end - aggregated_results + def sort_ga_data(ga_data) + ga_data.transform_values { |value| + value.transform_values { |val| + val.sort_by { |key, _| key.to_i }.to_h + }.sort_by { |k, _| k.to_i }.to_h + }.sort.to_h end def authenticate_google - Google::Apis::ClientOptions.default.application_name = NcboCron.settings.analytics_app_name - Google::Apis::ClientOptions.default.application_version = NcboCron.settings.analytics_app_version - # enable google api call retries in order to - # minigate analytics processing failure due to ocasional google api timeouts and other outages - Google::Apis::RequestOptions.default.retries = 5 - # uncoment to enable logging for debugging purposes - # Google::Apis.logger.level = Logger::DEBUG - # Google::Apis.logger = @logger - client = Google::Apis::AnalyticsV3::AnalyticsService.new - key = Google::APIClient::KeyUtils::load_from_pkcs12(NcboCron.settings.analytics_path_to_key_file, 'notasecret') - client.authorization = Signet::OAuth2::Client.new( - :token_credential_uri => 'https://accounts.google.com/o/oauth2/token', - :audience => 'https://accounts.google.com/o/oauth2/token', - :scope => 'https://www.googleapis.com/auth/analytics.readonly', - :issuer => NcboCron.settings.analytics_service_account_email_address, - :signing_key => key - ).tap { |auth| auth.fetch_access_token! } - client + Google::Analytics::Data.analytics_data do |config| + config.credentials = NcboCron.settings.analytics_path_to_key_file + end end - end + end # class + + end +end + +class ::Hash + def deep_merge(second) + merger = proc { |key, v1, v2| Hash === v1 && Hash === v2 ? 
v1.merge(v2, &merger) : v2 } + self.merge(second, &merger) end end @@ -120,7 +226,8 @@ def authenticate_google # require 'ncbo_annotator' # require 'ncbo_cron/config' # require_relative '../../config/config' -# ontology_analytics_log_path = File.join("logs", "ontology-analytics.log") -# ontology_analytics_logger = Logger.new(ontology_analytics_log_path) +# # ontology_analytics_log_path = File.join("logs", "ontology-analytics.log") +# # ontology_analytics_logger = Logger.new(ontology_analytics_log_path) +# ontology_analytics_logger = Logger.new(STDOUT) # NcboCron::Models::OntologyAnalytics.new(ontology_analytics_logger).run # ./bin/ncbo_cron --disable-processing true --disable-pull true --disable-flush true --disable-warmq true --disable-ontologies-report true --disable-mapping-counts true --disable-spam-deletion true --ontology-analytics '14 * * * *' diff --git a/lib/ncbo_cron/ontology_helper.rb b/lib/ncbo_cron/ontology_helper.rb new file mode 100644 index 00000000..42534768 --- /dev/null +++ b/lib/ncbo_cron/ontology_helper.rb @@ -0,0 +1,185 @@ +require 'logger' + +module NcboCron + module Helpers + module OntologyHelper + + REDIS_SUBMISSION_ID_PREFIX = "sub:" + PROCESS_QUEUE_HOLDER = "parseQueue" + PROCESS_ACTIONS = { + :process_rdf => true, + :generate_labels => true, + :index_search => true, + :index_properties => true, + :run_metrics => true, + :process_annotator => true, + :diff => true, + :remote_pull => false + } + + class RemoteFileException < StandardError + attr_reader :submission + + def initialize(submission) + super + @submission = submission + end + end + + def self.do_ontology_pull(ontology_acronym, enable_pull_umls = false, umls_download_url = '', logger = nil, + add_to_queue = true) + logger ||= Logger.new($stdout) + ont = LinkedData::Models::Ontology.find(ontology_acronym).include(:acronym).first + new_submission = nil + raise StandardError, "Ontology #{ontology_acronym} not found" if ont.nil? + + last = ont.latest_submission(status: :any) + raise StandardError, "No submission found for #{ontology_acronym}" if last.nil? + + last.bring(:hasOntologyLanguage) if last.bring?(:hasOntologyLanguage) + if !enable_pull_umls && last.hasOntologyLanguage.umls? + raise StandardError, "Pull umls not enabled" + end + + last.bring(:pullLocation) if last.bring?(:pullLocation) + raise StandardError, "#{ontology_acronym} has no pullLocation" if last.pullLocation.nil? + + last.bring(:uploadFilePath) if last.bring?(:uploadFilePath) + + if last.hasOntologyLanguage.umls? && umls_download_url && !umls_download_url.empty? 
+ last.pullLocation = RDF::URI.new(umls_download_url + last.pullLocation.split("/")[-1]) + logger.info("Using alternative download for umls #{last.pullLocation.to_s}") + logger.flush + end + + if last.remote_file_exists?(last.pullLocation.to_s) + logger.info "Checking download for #{ont.acronym}" + logger.info "Location: #{last.pullLocation.to_s}"; logger.flush + file, filename = last.download_ontology_file + file, md5local, md5remote, new_file_exists = self.new_file_exists?(file, last) + + if new_file_exists + logger.info "New file found for #{ont.acronym}\nold: #{md5local}\nnew: #{md5remote}" + logger.flush() + new_submission = self.create_submission(ont, last, file, filename, logger, add_to_queue) + else + logger.info "There is no new file found for #{ont.acronym}" + logger.flush() + end + + file.close + new_submission + else + raise self::RemoteFileException.new(last) + end + end + + def self.create_submission(ont, sub, file, filename, logger = nil, add_to_queue = true, new_version = nil, + new_released = nil) + logger ||= Kernel.const_defined?("LOGGER") ? Kernel.const_get("LOGGER") : Logger.new(STDOUT) + new_sub = LinkedData::Models::OntologySubmission.new + + sub.bring_remaining + sub.loaded_attributes.each do |attr| + new_sub.send("#{attr}=", sub.send(attr)) + end + + submission_id = ont.next_submission_id() + new_sub.submissionId = submission_id + file_location = LinkedData::Models::OntologySubmission.copy_file_repository(ont.acronym, submission_id, file, filename) + new_sub.uploadFilePath = file_location + + unless new_version.nil? + new_sub.version = new_version + end + + if new_released.nil? + new_sub.released = DateTime.now + else + new_sub.released = DateTime.parse(new_released) + end + new_sub.submissionStatus = nil + new_sub.creationDate = nil + new_sub.missingImports = nil + new_sub.metrics = nil + full_file_path = File.expand_path(file_location) + + # check if OWLAPI is able to parse the file before creating a new submission + owlapi = LinkedData::Parser::OWLAPICommand.new( + full_file_path, + File.expand_path(new_sub.data_folder.to_s), + logger: logger) + owlapi.disable_reasoner + parsable = true + + begin + owlapi.parse + rescue Exception => e + logger.error("The new file for ontology #{ont.acronym}, submission id: #{submission_id} did not clear OWLAPI: #{e.class}: #{e.message}\n#{e.backtrace.join("\n\t")}") + logger.error("A new submission has NOT been created.") + logger.flush + parsable = false + end + + if parsable + if new_sub.valid? + new_sub.save() + + if add_to_queue + self.queue_submission(new_sub, { all: true }) + logger.info("OntologyPull created a new submission (#{submission_id}) for ontology #{ont.acronym}") + end + else + logger.error("Unable to create a new submission for ontology #{ont.acronym} with id #{submission_id}: #{new_sub.errors}") + logger.flush + end + else + # delete the bad file + File.delete full_file_path if File.exist? full_file_path + end + new_sub + end + + def self.queue_submission(submission, actions={:all => true}) + redis = Redis.new(:host => NcboCron.settings.redis_host, :port => NcboCron.settings.redis_port) + + if actions[:all] + actions = PROCESS_ACTIONS.dup + else + actions.delete_if {|k, v| !PROCESS_ACTIONS.has_key?(k)} + end + actionStr = MultiJson.dump(actions) + redis.hset(PROCESS_QUEUE_HOLDER, get_prefixed_id(submission.id), actionStr) unless actions.empty? 
+ end + + def self.get_prefixed_id(id) + "#{REDIS_SUBMISSION_ID_PREFIX}#{id}" + end + + def self.last_fragment_of_uri(uri) + uri.to_s.split("/")[-1] + end + + def self.acronym_from_submission_id(submissionID) + submissionID.to_s.split("/")[-3] + end + + def self.new_file_exists?(file, last) + file = File.open(file.path, "rb") + remote_contents = file.read + md5remote = Digest::MD5.hexdigest(remote_contents) + + if last.uploadFilePath && File.exist?(last.uploadFilePath) + file_contents = open(last.uploadFilePath) { |f| f.read } + md5local = Digest::MD5.hexdigest(file_contents) + new_file_exists = (not md5remote.eql?(md5local)) + else + # There is no existing file, so let's create a submission with the downloaded one + new_file_exists = true + end + return file, md5local, md5remote, new_file_exists + end + + end + end +end \ No newline at end of file diff --git a/lib/ncbo_cron/ontology_pull.rb b/lib/ncbo_cron/ontology_pull.rb index ac6da70e..c554c95e 100644 --- a/lib/ncbo_cron/ontology_pull.rb +++ b/lib/ncbo_cron/ontology_pull.rb @@ -1,18 +1,11 @@ -require 'open-uri' require 'logger' -require_relative 'ontology_submission_parser' +require_relative 'ontology_helper' module NcboCron module Models class OntologyPull - class RemoteFileException < StandardError - end - - def initialize() - end - def do_remote_ontology_pull(options = {}) logger = options[:logger] || Logger.new($stdout) logger.info "UMLS auto-pull #{options[:enable_pull_umls] == true}" @@ -23,65 +16,26 @@ def do_remote_ontology_pull(options = {}) ontologies.select! { |ont| ont_to_include.include?(ont.acronym) } unless ont_to_include.empty? enable_pull_umls = options[:enable_pull_umls] umls_download_url = options[:pull_umls_url] - ontologies.sort! {|a, b| a.acronym.downcase <=> b.acronym.downcase} + ontologies.sort! { |a, b| a.acronym.downcase <=> b.acronym.downcase } new_submissions = [] ontologies.each do |ont| begin - last = ont.latest_submission(status: :any) - next if last.nil? - last.bring(:hasOntologyLanguage) if last.bring?(:hasOntologyLanguage) - if !enable_pull_umls && last.hasOntologyLanguage.umls? - next - end - last.bring(:pullLocation) if last.bring?(:pullLocation) - next if last.pullLocation.nil? - last.bring(:uploadFilePath) if last.bring?(:uploadFilePath) - - if last.hasOntologyLanguage.umls? && umls_download_url - last.pullLocation= RDF::URI.new(umls_download_url + last.pullLocation.split("/")[-1]) - logger.info("Using alternative download for umls #{last.pullLocation.to_s}") + begin + new_submissions << NcboCron::Helpers::OntologyHelper.do_ontology_pull(ont.acronym, + enable_pull_umls: enable_pull_umls, + umls_download_url: umls_download_url, + logger: logger, add_to_queue: true) + rescue NcboCron::Helpers::OntologyHelper::RemoteFileException => error + logger.info "RemoteFileException: No submission file at pull location #{error.submission.pullLocation.to_s} for ontology #{ont.acronym}." 
logger.flush + LinkedData::Utils::Notifications.remote_ontology_pull(error.submission) end - - if last.remote_file_exists?(last.pullLocation.to_s) - logger.info "Checking download for #{ont.acronym}" - logger.info "Location: #{last.pullLocation.to_s}"; logger.flush - file, filename = last.download_ontology_file() - file = File.open(file.path, "rb") - remote_contents = file.read - md5remote = Digest::MD5.hexdigest(remote_contents) - - if last.uploadFilePath && File.exist?(last.uploadFilePath) - file_contents = open(last.uploadFilePath) { |f| f.read } - md5local = Digest::MD5.hexdigest(file_contents) - new_file_exists = (not md5remote.eql?(md5local)) - else - # There is no existing file, so let's create a submission with the downloaded one - new_file_exists = true - end - - if new_file_exists - logger.info "New file found for #{ont.acronym}\nold: #{md5local}\nnew: #{md5remote}" - logger.flush() - new_submissions << create_submission(ont, last, file, filename, logger) - end - - file.close - else - begin - raise RemoteFileException - rescue RemoteFileException - logger.info "RemoteFileException: No submission file at pull location #{last.pullLocation.to_s} for ontology #{ont.acronym}." - logger.flush - LinkedData::Utils::Notifications.remote_ontology_pull(last) - end - end - rescue Exception => e - logger.error "Problem retrieving #{ont.acronym} in OntologyPull:\n" + e.message + "\n" + e.backtrace.join("\n\t") - logger.flush() - next end + rescue Exception => e + logger.error "Problem retrieving #{ont.acronym} in OntologyPull:\n" + e.message + "\n" + e.backtrace.join("\n\t") + logger.flush() + next end if options[:cache_clear] == true @@ -93,70 +47,7 @@ def do_remote_ontology_pull(options = {}) new_submissions end - def create_submission(ont, sub, file, filename, logger=nil, - add_to_pull=true,new_version=nil,new_released=nil) - logger ||= Kernel.const_defined?("LOGGER") ? Kernel.const_get("LOGGER") : Logger.new(STDOUT) - new_sub = LinkedData::Models::OntologySubmission.new - - sub.bring_remaining - sub.loaded_attributes.each do |attr| - new_sub.send("#{attr}=", sub.send(attr)) - end - - submission_id = ont.next_submission_id() - new_sub.submissionId = submission_id - file_location = LinkedData::Models::OntologySubmission.copy_file_repository(ont.acronym, submission_id, file, filename) - new_sub.uploadFilePath = file_location - unless new_version.nil? - new_sub.version = new_version - end - if new_released.nil? - new_sub.released = DateTime.now - else - new_sub.released = DateTime.parse(new_released) - end - new_sub.submissionStatus = nil - new_sub.creationDate = nil - new_sub.missingImports = nil - new_sub.metrics = nil - full_file_path = File.expand_path(file_location) - - # check if OWLAPI is able to parse the file before creating a new submission - owlapi = LinkedData::Parser::OWLAPICommand.new( - full_file_path, - File.expand_path(new_sub.data_folder.to_s), - logger: logger) - owlapi.disable_reasoner - parsable = true - - begin - owlapi.parse - rescue Exception => e - logger.error("The new file for ontology #{ont.acronym}, submission id: #{submission_id} did not clear OWLAPI: #{e.class}: #{e.message}\n#{e.backtrace.join("\n\t")}") - logger.error("A new submission has NOT been created.") - logger.flush - parsable = false - end - - if parsable - if new_sub.valid? 
- new_sub.save() - - if add_to_pull - submission_queue = NcboCron::Models::OntologySubmissionParser.new - submission_queue.queue_submission(new_sub, {all: true}) - logger.info("OntologyPull created a new submission (#{submission_id}) for ontology #{ont.acronym}") - end - else - logger.error("Unable to create a new submission in OntologyPull: #{new_sub.errors}") - logger.flush - end - else - # delete the bad file - File.delete full_file_path if File.exist? full_file_path - end - new_sub - end + private def redis_goo Redis.new(host: LinkedData.settings.goo_redis_host, port: LinkedData.settings.goo_redis_port, timeout: 30) diff --git a/lib/ncbo_cron/ontology_rank.rb b/lib/ncbo_cron/ontology_rank.rb index b60c2740..64de8844 100644 --- a/lib/ncbo_cron/ontology_rank.rb +++ b/lib/ncbo_cron/ontology_rank.rb @@ -1,5 +1,6 @@ require 'logger' require 'benchmark' +require_relative 'ontology_helper' module NcboCron module Models @@ -66,7 +67,7 @@ def umls_scores(ontologies) ontologies.each do |ont| if ont.group && !ont.group.empty? - umls_gr = ont.group.select {|gr| acronym_from_id(gr.id.to_s).include?('UMLS')} + umls_gr = ont.group.select {|gr| NcboCron::Helpers::OntologyHelper.last_fragment_of_uri(gr.id.to_s).include?('UMLS')} scores[ont.acronym] = umls_gr.empty? ? 0 : 1 else scores[ont.acronym] = 0 @@ -75,10 +76,6 @@ def umls_scores(ontologies) scores end - def acronym_from_id(id) - id.to_s.split("/")[-1] - end - def normalize(x, xmin, xmax, ymin, ymax) xrange = xmax - xmin yrange = ymax - ymin diff --git a/lib/ncbo_cron/ontology_submission_eradicator.rb b/lib/ncbo_cron/ontology_submission_eradicator.rb new file mode 100644 index 00000000..40f8ef4d --- /dev/null +++ b/lib/ncbo_cron/ontology_submission_eradicator.rb @@ -0,0 +1,39 @@ +module NcboCron + module Models + + class OntologySubmissionEradicator + class RemoveSubmissionFileException < StandardError + end + + class RemoveSubmissionDataException < StandardError + end + + class RemoveNotArchivedSubmissionException < StandardError + end + + def initialize() + end + + def eradicate(submission, force=false) + submission.bring(:submissionStatus) if submission.bring?(:submissionStatus) + if submission.archived? || force + delete_submission_data submission + else
+ raise RemoveNotArchivedSubmissionException, "Submission #{submission.submissionId} is not an archived submission" + end + + end + + private + def delete_submission_data(submission) + begin + submission.delete + rescue Exception => e + raise RemoveSubmissionDataException, e.message + end + end + + + end + end +end diff --git a/lib/ncbo_cron/ontology_submission_parser.rb b/lib/ncbo_cron/ontology_submission_parser.rb index fe7a3e06..8d33f89d 100644 --- a/lib/ncbo_cron/ontology_submission_parser.rb +++ b/lib/ncbo_cron/ontology_submission_parser.rb @@ -1,38 +1,22 @@ require 'multi_json' +require_relative 'ontology_helper' module NcboCron module Models class OntologySubmissionParser - QUEUE_HOLDER = "parseQueue" - IDPREFIX = "sub:" - - ACTIONS = { - :process_rdf => true, - :index_search => true, - :index_properties => true, - :run_metrics => true, - :process_annotator => true, - :diff => true - } + QUEUE_HOLDER = NcboCron::Helpers::OntologyHelper::PROCESS_QUEUE_HOLDER + ACTIONS = NcboCron::Helpers::OntologyHelper::PROCESS_ACTIONS def initialize() end - def queue_submission(submission, actions={:all => true}) - redis = Redis.new(:host => NcboCron.settings.redis_host, :port => NcboCron.settings.redis_port) - - if actions[:all] - actions = ACTIONS.dup - else - actions.delete_if {|k, v| !ACTIONS.has_key?(k)} - end - actionStr = MultiJson.dump(actions) - redis.hset(QUEUE_HOLDER, get_prefixed_id(submission.id), actionStr) unless actions.empty? + def queue_submission(submission, actions={ :all => true }) + NcboCron::Helpers::OntologyHelper.queue_submission(submission, actions) end - def process_queue_submissions(options = {}) + def process_queue_submissions(options={}) logger = options[:logger] logger ||= Kernel.const_defined?("LOGGER") ? Kernel.const_get("LOGGER") : Logger.new(STDOUT) redis = Redis.new(:host => NcboCron.settings.redis_host, :port => NcboCron.settings.redis_port) @@ -43,6 +27,20 @@ def process_queue_submissions(options = {}) realKey = process_data[:key] key = process_data[:redis_key] redis.hdel(QUEUE_HOLDER, key) + + # if :remote_pull is one of the actions, pull the ontology and halt if no new submission is found + # if a new submission is found, replace the submission ID with the new one and proceed with + # processing the remaining actions on the new submission + if actions.key?(:remote_pull) && actions[:remote_pull] + acronym = NcboCron::Helpers::OntologyHelper.acronym_from_submission_id(realKey) + new_submission = NcboCron::Helpers::OntologyHelper.do_ontology_pull(acronym, enable_pull_umls: false, + umls_download_url: '', logger: logger, + add_to_queue: false) + return unless new_submission + realKey = new_submission.id.to_s + actions.delete(:remote_pull) + end + begin process_submission(logger, realKey, actions) rescue Exception => e @@ -55,7 +53,7 @@ def process_queue_submissions(options = {}) def queued_items(redis, logger=nil) logger ||= Kernel.const_defined?("LOGGER") ? 
Kernel.const_get("LOGGER") : Logger.new(STDOUT) all = redis.hgetall(QUEUE_HOLDER) - prefix_remove = Regexp.new(/^#{IDPREFIX}/) + prefix_remove = Regexp.new(/^#{NcboCron::Helpers::OntologyHelper::REDIS_SUBMISSION_ID_PREFIX}/) items = [] all.each do |key, val| begin @@ -75,10 +73,6 @@ def queued_items(redis, logger=nil) items end - def get_prefixed_id(id) - "#{IDPREFIX}#{id}" - end - def zombie_classes_graphs query = "SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o }}" class_graphs = [] @@ -165,7 +159,7 @@ def process_submission(logger, submission_id, actions=ACTIONS) # Check to make sure the file has been downloaded if sub.pullLocation && (!sub.uploadFilePath || !File.exist?(sub.uploadFilePath)) - multi_logger.debug "Pull location found, but no file in the upload file path. Retrying download." + multi_logger.debug "Pull location found (#{sub.pullLocation}), but no file in the upload file path (#{sub.uploadFilePath}). Retrying download." file, filename = sub.download_ontology_file file_location = sub.class.copy_file_repository(sub.ontology.acronym, sub.submissionId, file, filename) file_location = "../" + file_location if file_location.start_with?(".") # relative path fix @@ -190,6 +184,10 @@ def process_submission(logger, submission_id, actions=ACTIONS) end end + def get_prefixed_id(id) + NcboCron::Helpers::OntologyHelper.get_prefixed_id(id) + end + private def archive_old_submissions(logger, sub) @@ -219,10 +217,11 @@ def process_annotator(logger, sub) begin annotator = Annotator::Models::NcboAnnotator.new annotator.create_term_cache_for_submission(logger, sub) - # commenting this action out for now due to a problem with hgetall in redis + # this action only occurs if the CRON dictionary generation job is disabled + # if the CRON dictionary generation job is running, + # the dictionary will NOT be generated on each ontology parsing # see https://github.com/ncbo/ncbo_cron/issues/45 for details - # mgrep dictionary generation will occur as a separate CRON task - # annotator.generate_dictionary_file() + annotator.generate_dictionary_file() unless NcboCron.settings.enable_dictionary_generation_cron_job rescue Exception => e logger.error(e.message + "\n" + e.backtrace.join("\n\t")) logger.flush() diff --git a/lib/ncbo_cron/spam_deletion.rb b/lib/ncbo_cron/spam_deletion.rb index 8db5568b..e2ec64f8 100644 --- a/lib/ncbo_cron/spam_deletion.rb +++ b/lib/ncbo_cron/spam_deletion.rb @@ -25,8 +25,18 @@ def initialize(logger=nil) end def run - auth_token = Base64.decode64(NcboCron.settings.git_repo_access_token) + auth_token = NcboCron.settings.git_repo_access_token res = `curl --header 'Authorization: token #{auth_token}' --header 'Accept: application/vnd.github.v3.raw' --location #{FULL_FILE_PATH}` + + begin + error_json = JSON.parse(res) + msg = "\nError while fetching the SPAM user list from #{FULL_FILE_PATH}: #{error_json}" + @logger.error(msg) + puts msg + exit + rescue JSON::ParserError + @logger.info("Successfully downloaded the SPAM user list from #{FULL_FILE_PATH}") + end usernames = res.split(",").map(&:strip) delete_spam(usernames) end diff --git a/ncbo_cron.gemspec b/ncbo_cron.gemspec index 821881d1..c8faa03d 100644 --- a/ncbo_cron.gemspec +++ b/ncbo_cron.gemspec @@ -8,7 +8,7 @@ Gem::Specification.new do |gem| gem.summary = %q{} gem.homepage = "https://github.com/ncbo/ncbo_cron" - gem.files = `git ls-files`.split($\) + gem.files = Dir['**/*'] gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) } gem.test_files = gem.files.grep(%r{^(test|spec|features)/}) gem.name =
"ncbo_cron" @@ -16,7 +16,7 @@ Gem::Specification.new do |gem| gem.add_dependency("dante") gem.add_dependency("goo") - gem.add_dependency("google-apis-analytics_v3") + gem.add_dependency("google-analytics-data") gem.add_dependency("mlanett-redis-lock") gem.add_dependency("multi_json") gem.add_dependency("ncbo_annotator") diff --git a/rakelib/purl_management.rake b/rakelib/purl_management.rake new file mode 100644 index 00000000..58cfadd7 --- /dev/null +++ b/rakelib/purl_management.rake @@ -0,0 +1,28 @@ +# Task for updating and adding missing purl for all ontologies +# +desc 'Purl Utilities' +namespace :purl do + require 'bundler/setup' + # Configure the process for the current cron configuration. + require_relative '../lib/ncbo_cron' + config_exists = File.exist?(File.expand_path('../../config/config.rb', __FILE__)) + abort('Please create a config/config.rb file using the config/config.rb.sample as a template') unless config_exists + require_relative '../config/config' + + desc 'update purl for all ontologies' + task :update_all do + purl_client = LinkedData::Purl::Client.new + LinkedData::Models::Ontology.all.each do |ont| + ont.bring(:acronym) + acronym = ont.acronym + + if purl_client.purl_exists(acronym) + puts "#{acronym} exists" + purl_client.fix_purl(acronym) + else + puts "#{acronym} DOES NOT exist" + purl_client.create_purl(acronym) + end + end + end +end diff --git a/test/docker-compose.yml b/test/docker-compose.yml deleted file mode 100644 index 5bdb51f5..00000000 --- a/test/docker-compose.yml +++ /dev/null @@ -1,38 +0,0 @@ -version: '3.8' - -services: - unit-test: - build: ../. - environment: - - GOO_BACKEND_NAME=4store - - GOO_PORT=9000 - - GOO_HOST=4store-ut - - REDIS_HOST=redis-ut - - REDIS_PORT=6379 - - SOLR_HOST=solr-ut - - MGREP_HOST=mgrep-ut - - MGREP_PORT=55555 - depends_on: - - solr-ut - - redis-ut - - 4store-ut - - mgrep-ut - #command: "bundle exec rake test TESTOPTS='-v' TEST='./test/parser/test_owl_api_command.rb'" - command: "wait-for-it solr-ut:8983 -- bundle exec rake test TESTOPTS='-v'" - - solr-ut: - image: ontoportal/solr-ut:0.1 - - redis-ut: - image: redis - - mgrep-ut: - image: ontoportal/mgrep-ncbo:0.1 - - 4store-ut: - image: bde2020/4store - command: > - bash -c "4s-backend-setup --segments 4 ontoportal_kb - && 4s-backend ontoportal_kb - && 4s-httpd -D -s-1 -p 9000 ontoportal_kb" - diff --git a/test/run-unit-tests.sh b/test/run-unit-tests.sh index 385898e6..b2c119da 100755 --- a/test/run-unit-tests.sh +++ b/test/run-unit-tests.sh @@ -3,10 +3,10 @@ # # add config for unit testing [ -f ../config/config.rb ] || cp ../config/config.test.rb ../config/config.rb -docker-compose build +docker compose build # wait-for-it is useful since solr container might not get ready quick enough for the unit tests -docker-compose run --rm unit-test wait-for-it solr-ut:8983 -- rake test TESTOPTS='-v' -#docker-compose run --rm unit-test wait-for-it solr-ut:8983 -- bundle exec rake test TESTOPTS='-v' TEST='./test/controllers/test_annotator_controller.rb' -#docker-compose up --exit-code-from unit-test -docker-compose kill +docker compose run --rm ruby bundle exec rake test TESTOPTS='-v' +#docker compose run --rm ruby-agraph bundle exec rake test TESTOPTS='-v' +#docker-compose run --rm ruby bundle exec rake test TESTOPTS='-v' TEST='./test/controllers/test_annotator_controller.rb' +docker compose kill diff --git a/test/test_case.rb b/test/test_case.rb index 81a10aa6..75bb0454 100644 --- a/test/test_case.rb +++ b/test/test_case.rb @@ -1,3 +1,21 @@ +# Start simplecov if this is a 
coverage task or if it is run in the CI pipeline +if ENV['COVERAGE'] == 'true' || ENV['CI'] == 'true' + require 'simplecov' + require 'simplecov-cobertura' + # https://github.com/codecov/ruby-standard-2 + # Generate HTML and Cobertura reports which can be consumed by codecov uploader + SimpleCov.formatters = SimpleCov::Formatter::MultiFormatter.new([ + SimpleCov::Formatter::HTMLFormatter, + SimpleCov::Formatter::CoberturaFormatter + ]) + SimpleCov.start do + add_filter '/test/' + add_filter 'app.rb' + add_filter 'init.rb' + add_filter '/config/' + end +end + require 'ontologies_linked_data' require_relative '../lib/ncbo_cron' require_relative '../config/config' @@ -7,7 +25,7 @@ require 'test/unit' # Check to make sure you want to run if not pointed at localhost -safe_host = Regexp.new(/localhost|-ut|ncbo-dev*|ncbo-unittest*/) +safe_host = Regexp.new(/localhost|-ut/) unless LinkedData.settings.goo_host.match(safe_host) && LinkedData.settings.search_server_url.match(safe_host) && NcboCron.settings.redis_host.match(safe_host) @@ -38,7 +56,7 @@ def count_pattern(pattern) return 0 end - def backend_4s_delete + def backend_triplestore_delete raise StandardError, 'Too many triples in KB, does not seem right to run tests' unless count_pattern('?s ?p ?o') < 400000 @@ -71,7 +89,7 @@ def _run_suites(suites, type) end def _run_suite(suite, type) - backend_4s_delete + backend_triplestore_delete suite.before_suite if suite.respond_to?(:before_suite) super(suite, type) rescue Exception => e @@ -80,7 +98,7 @@ def _run_suite(suite, type) puts 'Traced from:' raise e ensure - backend_4s_delete + backend_triplestore_delete suite.after_suite if suite.respond_to?(:after_suite) end end diff --git a/test/test_ontology_pull.rb b/test/test_ontology_pull.rb index 57fa9f47..ca3c6130 100644 --- a/test/test_ontology_pull.rb +++ b/test/test_ontology_pull.rb @@ -41,14 +41,14 @@ def self.after_suite @@redis.del NcboCron::Models::OntologySubmissionParser::QUEUE_HOLDER end - def test_remote_ontology_pull() + def test_remote_ontology_pull ontologies = init_ontologies(1) ont = LinkedData::Models::Ontology.find(ontologies[0].id).first ont.bring(:submissions) if ont.bring?(:submissions) assert_equal 1, ont.submissions.length pull = NcboCron::Models::OntologyPull.new - pull.do_remote_ontology_pull() + pull.do_remote_ontology_pull # check that the pull creates a new submission when the file has changed ont = LinkedData::Models::Ontology.find(ontologies[0].id).first @@ -72,7 +72,33 @@ def test_remote_ontology_pull() ont = LinkedData::Models::Ontology.find(ontologies[0].id).first ont.bring(:submissions) if ont.bring?(:submissions) assert_equal 2, ont.submissions.length - pull.do_remote_ontology_pull() + pull.do_remote_ontology_pull + assert_equal 2, ont.submissions.length + end + + def test_remote_pull_parsing_action + ontologies = init_ontologies(1, process_submissions: true) + ont = LinkedData::Models::Ontology.find(ontologies[0].id).first + ont.bring(:submissions) if ont.bring?(:submissions) + assert_equal 1, ont.submissions.length + + # add this ontology to submission queue with :remote_pull action enabled + parser = NcboCron::Models::OntologySubmissionParser.new + actions = NcboCron::Models::OntologySubmissionParser::ACTIONS.dup + actions[:remote_pull] = true + parser.queue_submission(ont.submissions[0], actions) + parser.process_queue_submissions + + # make sure there are now 2 submissions present + ont = LinkedData::Models::Ontology.find(ontologies[0].id).first + ont.bring(:submissions) if ont.bring?(:submissions) + 
assert_equal 2, ont.submissions.length + + # verify that no new submission is created when the file has not changed + parser.queue_submission(ont.submissions[0], actions) + parser.process_queue_submissions + ont = LinkedData::Models::Ontology.find(ontologies[0].id).first + ont.bring(:submissions) if ont.bring?(:submissions) assert_equal 2, ont.submissions.length end @@ -164,15 +190,16 @@ def test_no_pull_location private - def init_ontologies(submission_count) - ont_count, acronyms, ontologies = LinkedData::SampleData::Ontology.create_ontologies_and_submissions(ont_count: 1, submission_count: submission_count, process_submission: false) + def init_ontologies(submission_count, process_submissions = false) + ont_count, acronyms, ontologies = LinkedData::SampleData::Ontology.create_ontologies_and_submissions( + ont_count: 1, submission_count: submission_count, process_submission: process_submissions) ontologies[0].bring(:submissions) if ontologies[0].bring?(:submissions) ontologies[0].submissions.each do |sub| sub.bring_remaining() sub.pullLocation = RDF::IRI.new(@@url) sub.save() rescue binding.pry end - return ontologies + ontologies end end diff --git a/test/test_scheduler.rb b/test/test_scheduler.rb index bac2f842..58808ea5 100644 --- a/test/test_scheduler.rb +++ b/test/test_scheduler.rb @@ -39,7 +39,7 @@ def test_scheduler sleep(5) finished_array = listen_string.split("\n") - assert finished_array.length >= 4 + assert_operator 4, :<=, finished_array.length assert job1_thread.alive? job1_thread.kill