Skip to content

Helper Scripts

Dolsy Smith edited this page Dec 2, 2024 · 9 revisions

Running Solr queries in the console (with pagination)

There are no doubt multiple ways of doing this, but the most straightforward I have found is to use the underlying RSolr connection to construct the query.

As an example, the following code will query for all documents that have any value in the license_tesim field and paginate through the results, 100 at a time, returning just the document id and the value in the license_tesim field:

# Get first page of results
docs = ActiveFedora::SolrService.instance.conn.paginate(1, 100, 'select', :params => {:q => "license_tesim:['' TO *]", fl:'id,license_tesim'}).dig('response', 'docs')
rows = 2
# Starting with the second page, paginate until we get no more
while (next_page =  ActiveFedora::SolrService.instance.conn.paginate(rows, 10, 'select', :params => {:q => "license_tesim:['' TO *]", fl:'id,license_tesim'}).dig('response', 'docs')) != []
     docs += next_page
     rows += 1
end

Merging GA views with work metadata

Keeping this a script to be run in the console in case we pursue GA integration later. If not, this could become a rake task.

Before running, move the CSV exports from Google Analytics/GA 4 to /opt/scholarspace/scholarspace-tmp inside the app-server container.

# Google Analytics 4 data -- excluding metadata the precedes CSV header
ga4 = CSV.read("/opt/scholarspace/scholarspace-tmp/data-export.csv")[9..]
# Google Analytics legacy data
ga = CSV.read("/opt/scholarspace/scholarspace-tmp/views.csv")[6..] 

def extract_ids(ga_stats)
    # Given a 2-D array, expects the first column of each row to contain a GWSS page relative URL, and the second, the number of page views
    # extract work ID
    concern_pattern = /.+\/(gw_works|gw_etds)\/([a-zA-Z0-9]{9})\/?.*/
    return ga_stats.each_with_object(Hash.new { |h,k| h[k] = [] }) do |line, h|
        match = concern_pattern.match(line[0])
        if match
            h[match[2]] << line[1]
        end
        h
    end
end

def sum_views(ga_hash) 
    # Given a hash mapping work ID's to arrays of page views (as strings), returns a numeric sum of views for each ID
    return ga_hash.transform_values do |v|
        v.map { |n| n.sub(",", "").to_i }.sum
    end
end

SOLR_KEYS = ["id", "title_tesim", "gw_affiliation_tesim", "resource_type_tesim"]

def get_solr_docs(ga_hash)
    # Given a hash of work IDs and page views, retrieve Solr metadata for each id
    # Returns an array of hashes, one per work, with select metadata fields + page views
    solr_docs = ga_hash.keys.map do |key|
        Hyrax::SolrQueryService.new.with_ids(ids: [key]).solr_documents
    end.map do |doc|
        # Extract values from desired Solr fields
        SOLR_KEYS.map { |k| [k, doc[0][k]] }.to_h
    end
    return solr_docs.map do |doc|
        id = doc["id"]
        views = ga_hash[id]
        # Extract strings from Solr array fields; expects each array to have only one value
        new_doc = doc.transform_values { |v| if v.kind_of?(Array) then v[0] else v end }
        new_doc["views"] = views
        new_doc
     end
end

ga = sum_views(extract_ids(ga))
ga4 = sum_views(extract_ids(ga4))

# Merge the two hashes, summing values for duplicate keys
ga_merged = ga.merge(ga4) { |key, ga_val, ga4_val| ga4_val + ga_val }

# Get additional metadata from Solr
ga_merged_csv = get_solr_docs(ga_merged)

headers = ga_merged_csv.first.keys

CSV.open("/opt/scholarspace/scholarspace-tmp/ga-stats-merged.csv", "w") do |csv|
    csv << headers
    ga_merged_csv.each do |row|
        csv << row
    end
end

Migrating values from one metadata field to another

Migrating DOI's from the related_url to doi field

# Need to do this because the related_url_tesim is indexed as a TextField and as such does not support string wildcard queries
data = ActiveFedora::SolrService.query("related_url_tesim:['' TO *]", rows: 10000)
# Filter all records with non-null values in related_url_tesim to find those with at least one DOI
with_dois = data.select do |record|
        record["related_url_tesim"].any? { |url| url.starts_with? 'https://doi' }
end
# Get work ID'sRAIL
ids = with_dois.map {|d| d.id}
ids.each do |id|
        # Get ActiveRecord object for that work
	work = GwWork.find(id)
        # Exchange URL and DOI fields
	doi = work.related_url
	work.doi = doi
	work.related_url = []
	work.save
end

Adding a keyword (or adding to another array field)

# In this example, we add a variant of a keyword to all works having a given keyword
data = ActiveFedora::SolrService.query("keyword_tesim:Eckles", rows: 10000)
old_keyword = "The Eckles Prize for Freshman Research Excellence"
new_keyword = "The Eckles Prize for First Year Research Excellence"
with_old_keyword = data.select do |d| 
  d["keyword_tesim"].any? { |k| k == old_keyword }
end
ids = with_old_keyword.map { |r| r.id }
ids.each do |id|
  work = GwWork.find(id)
  # Note that we cannot add a value directly to the ActiveFedora field; it will not persist
  # Instead, we have to replace the entire field with a new array
  keywords = work.keyword.map { |keyword| keyword }
  keywords << new_keyword
  work.keyword = keywords
  work.save
end

Getting a database extract of users

In the scholarspace-hyrax-app-server container, open the Rails console (using RAILS_ENV=production rails c) and run the following code:

CSV.open("./all_users_#{Time.now.to_s.gsub(' ', '_')}.csv", "wb") do |csv|
  csv << User.attribute_names
  User.find_each do |user|
    csv << user.attributes.values
  end
end

This will place a file called all_users_{current timestamp}.csv in the /opt/scholarspace/scholarspace-hyrax directory of the container. Exit the container and use this command to copy it out of the container volume: docker cp scholarspace-hyrax-app-server-1:/opt/scholarspace/scholarspace-hyrax/all_users_{current timestamp}.csv ./.