-
Notifications
You must be signed in to change notification settings - Fork 4
Helper Scripts
There are no doubt multiple ways of doing this, but the most straightforward I have found is to use the underlying RSolr connection to construct the query.
As an example, the following code will query for all documents that have any value in the license_tesim
field and paginate through the results, 100 at a time, returning just the document id and the value in the license_tesim
field:
# Get first page of results
docs = ActiveFedora::SolrService.instance.conn.paginate(1, 100, 'select', :params => {:q => "license_tesim:['' TO *]", fl:'id,license_tesim'}).dig('response', 'docs')
rows = 2
# Starting with the second page, paginate until we get no more
while (next_page = ActiveFedora::SolrService.instance.conn.paginate(rows, 10, 'select', :params => {:q => "license_tesim:['' TO *]", fl:'id,license_tesim'}).dig('response', 'docs')) != []
docs += next_page
rows += 1
end
Keeping this a script to be run in the console in case we pursue GA integration later. If not, this could become a rake task.
Before running, move the CSV exports from Google Analytics/GA 4 to /opt/scholarspace/scholarspace-tmp
inside the app-server
container.
# Google Analytics 4 data -- excluding metadata the precedes CSV header
ga4 = CSV.read("/opt/scholarspace/scholarspace-tmp/data-export.csv")[9..]
# Google Analytics legacy data
ga = CSV.read("/opt/scholarspace/scholarspace-tmp/views.csv")[6..]
def extract_ids(ga_stats)
# Given a 2-D array, expects the first column of each row to contain a GWSS page relative URL, and the second, the number of page views
# extract work ID
concern_pattern = /.+\/(gw_works|gw_etds)\/([a-zA-Z0-9]{9})\/?.*/
return ga_stats.each_with_object(Hash.new { |h,k| h[k] = [] }) do |line, h|
match = concern_pattern.match(line[0])
if match
h[match[2]] << line[1]
end
h
end
end
def sum_views(ga_hash)
# Given a hash mapping work ID's to arrays of page views (as strings), returns a numeric sum of views for each ID
return ga_hash.transform_values do |v|
v.map { |n| n.sub(",", "").to_i }.sum
end
end
SOLR_KEYS = ["id", "title_tesim", "gw_affiliation_tesim", "resource_type_tesim"]
def get_solr_docs(ga_hash)
# Given a hash of work IDs and page views, retrieve Solr metadata for each id
# Returns an array of hashes, one per work, with select metadata fields + page views
solr_docs = ga_hash.keys.map do |key|
Hyrax::SolrQueryService.new.with_ids(ids: [key]).solr_documents
end.map do |doc|
# Extract values from desired Solr fields
SOLR_KEYS.map { |k| [k, doc[0][k]] }.to_h
end
return solr_docs.map do |doc|
id = doc["id"]
views = ga_hash[id]
# Extract strings from Solr array fields; expects each array to have only one value
new_doc = doc.transform_values { |v| if v.kind_of?(Array) then v[0] else v end }
new_doc["views"] = views
new_doc
end
end
ga = sum_views(extract_ids(ga))
ga4 = sum_views(extract_ids(ga4))
# Merge the two hashes, summing values for duplicate keys
ga_merged = ga.merge(ga4) { |key, ga_val, ga4_val| ga4_val + ga_val }
# Get additional metadata from Solr
ga_merged_csv = get_solr_docs(ga_merged)
headers = ga_merged_csv.first.keys
CSV.open("/opt/scholarspace/scholarspace-tmp/ga-stats-merged.csv", "w") do |csv|
csv << headers
ga_merged_csv.each do |row|
csv << row
end
end
# Need to do this because the related_url_tesim is indexed as a TextField and as such does not support string wildcard queries
data = ActiveFedora::SolrService.query("related_url_tesim:['' TO *]", rows: 10000)
# Filter all records with non-null values in related_url_tesim to find those with at least one DOI
with_dois = data.select do |record|
record["related_url_tesim"].any? { |url| url.starts_with? 'https://doi' }
end
# Get work ID'sRAIL
ids = with_dois.map {|d| d.id}
ids.each do |id|
# Get ActiveRecord object for that work
work = GwWork.find(id)
# Exchange URL and DOI fields
doi = work.related_url
work.doi = doi
work.related_url = []
work.save
end
# In this example, we add a variant of a keyword to all works having a given keyword
data = ActiveFedora::SolrService.query("keyword_tesim:Eckles", rows: 10000)
old_keyword = "The Eckles Prize for Freshman Research Excellence"
new_keyword = "The Eckles Prize for First Year Research Excellence"
with_old_keyword = data.select do |d|
d["keyword_tesim"].any? { |k| k == old_keyword }
end
ids = with_old_keyword.map { |r| r.id }
ids.each do |id|
work = GwWork.find(id)
# Note that we cannot add a value directly to the ActiveFedora field; it will not persist
# Instead, we have to replace the entire field with a new array
keywords = work.keyword.map { |keyword| keyword }
keywords << new_keyword
work.keyword = keywords
work.save
end
In the scholarspace-hyrax-app-server
container, open the Rails console (using RAILS_ENV=production rails c
) and run the following code:
CSV.open("./all_users_#{Time.now.to_s.gsub(' ', '_')}.csv", "wb") do |csv|
csv << User.attribute_names
User.find_each do |user|
csv << user.attributes.values
end
end
This will place a file called all_users_{current timestamp}.csv
in the /opt/scholarspace/scholarspace-hyrax
directory of the container. Exit the container and use this command to copy it out of the container volume:
docker cp scholarspace-hyrax-app-server-1:/opt/scholarspace/scholarspace-hyrax/all_users_{current timestamp}.csv ./
.