Skip to content

Transform Shopify stores into a local high-performance JSON database

License

Notifications You must be signed in to change notification settings

tehaksbrid/shop-databaser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Shop Databaser (SD)

Your store, your data; whenever you need it. A data science tool for Shopify stores.

This desktop application efficiently copies the entirety of any number of connected Shopify stores. The results are stored in a custom database in a highly compressed format. Data access is provided via queries, JSON exports, or direct terminal access.

Cold queries can return 1 million orders per minute.

tehaksbrid/shop-databaser status

Goals

  1. Remake Shopify admin search with orders of magnitude more power, accuracy, and speed
  2. Decouple the analysis of Shopify data sets from the production of Shopify data sets

I have spent the last few years working with a number of very large Shopify stores. I would periodically get asked questions along the lines of:

  • What fulfillments contained item X today?
  • How has AOV changed over the history of the store?
  • What is our overall refund rate?
  • What are the actual delivery times for Fedex/2day?

The analysis involved in these queries is very simple, but just collecting the data accounted for 90% of the time cost. I would have to write some code to read the right data from Shopify, and usually end up throwing everything away when I was done.

I can't imagine I'm the only person frustrated by this, so I have written SD in an attempt to solve this problem as generally as possible.

Table of Contents

  1. Quick start guide
  2. Example queries
  3. Query syntax and features
  4. Example analysis
  5. Data structure
  6. FAQs
  7. Future work
  8. Disclaimer

Quick start guide

  1. Download and install the latest release of SD
  2. Create a private app token (your-store.myshopify.com/admin/apps/private) with read access on the following permissions:
    • Customers, discounts, inventory, orders, price rules, and products
  3. From the Stores tab in SD, click Add a store. Enter a nickname + the myshopify URL/key/password.
  4. Click "Connect". SD will validate the connection & permissions, then save the info. It will immediately begin copying all store data.

Once running on your machine, press tab or CTRL+3 to bring up queries. Queries work by taking a list of your Shopify data types (orders, fulfillments, etc) and filtering them based on input you provide. For example, inputting orders : shipping_address [ country_code = US ] will return every order in the database where order.shipping_address.country_code is equal to "US".

Example queries

Customers that had an order delivered on or after January 1st 2020
customers : orders : fulfillments : events [ status = delivered ] [ happenedAt > 1-1-20 ]
Free orders
orders [ subtotal_price = 0 ]
Unpaid orders that have fulfillments
orders [ financial_status = pending ]
orders : fulfillments
Orders where a specific item was refunded
orders :  refunds : refund_line_items : line_item [ sku = ABC ]
Orders with refunds that were created on or after January 1st 2020
orders : refunds [ created_at > 1-1-20 ]
Products with an image exceeding 1MP
products : images [ height > 1000 ] [ width > 1000 ]
Products with no HTS code
products : variants : inventory [ harmonized_system_code = null ]
Customers who have used a specific discount code
customers : orders : discount_codes [ code = ABC ]

Query syntax and features

This query language is inspired by xpath and has the following general structure
root_type [ field = value ] [ ... ] : subtype1 [ field = value ] [ ... ] : subtype2 ...
  • root_type must always be one of orders, fulfillments, customers, products, discounts, inventory

Fields and values

Example. orders [ name = #1001 ]

Returns orders whose name field is equal to the value "#1001".

  • Fields have a number of comparison functions
    • = Downcased string/number equality
      customers [ first_name = zach ] → Customers whose first name equals "zach" (or "Zach")
    • ~ Downcased string includes
      customers [ first_name ~ zach ] → Customers whose first name includes "zach" (also matches "Zachary")
    • ! Downcased string not equals / integer not equals
    • >, <, >=, <= Number/date inequalities
      customers [ orders_count > 1 ] → Customers with more than 1 order
      orders [ created_at > 1-1-20 ] → Orders placed after January 1st 2020, based on your system time zone
  • Field types and field values are automatically coerced to strings/integers/dates if and only if:
    • The field and the input both follow the same transformation (String → date, String → Integer)
    • The operator makes sense for the resulting type (eg ~ cannot act on integers or dates)
  • You can execute serial field comparisons
    • fulfillments : events [ status = delivered ] [ happenedAt > 1-1-20 ] → Fulfillments that were marked as delivered after January 1st 2020

Subtypes

Example. orders : shipping_address [ country_code = US ]

Returns orders where the following conditions are met:

  • The order has a field called "shipping_address"
  • "shipping_address" has a field called "country_code", whose value is "US"

Subtypes can be objects or arrays.

  • Subtypes have a number of access quantifiers
    • : Existential some
      orders : line_items [ sku = ABC ] → Orders where some line item has SKU ABC
    • & Existential every
      orders & line_items [ sku = ABC ] → Orders where every line item has SKU ABC
    • * None
      orders * line_items [ sku = ABC ] → Orders where no line item has SKU ABC

Misc

  • New lines are fed the results of the previous line
    • orders : line_items : properties [ name = engraving ]
      orders [ created_at > 1-1-20 ]
      

      This finds orders where some line items have a property with name=engraving, then filters that result for order creation after January 1st 2020

  • SD will join different data sets if it can determine a relationship between them. The joined result is always an array. Those relationships are:
    • orders: fulfillments, orders: customers
    • fulfillments: orders
    • line_items: products
    • customers : orders
    • variants: inventory

Example analysis

What fulfillments contained item X today?

fulfillments [ created_at > 1-1-21 ]
fulfillments : line_items [ sku = X ]

How has AOV changed over the history of the store?
orders

(Send results to console)

let aovSet = results.map(o => { // Reducing each order to its relevant data: revenue & date
  return {
    revenue: +o.subtotal_price,
    date: new Date(o.created_at).toLocaleDateString('en-US')
  }
});

let aovPerDay = aovSet.reduce((entries, o) => { // Total revenue on each date
  entries[o.date] = entries[o.date] ? {...o, count: entries[o.date].count + 1, revenue: entries[o.date].revenue + o.revenue} : {...o, count: 1, revenue: o.revenue};
  return entries;
}, {});

// Calculating average revenue on each date
Object.keys(aovPerDay).forEach(date => aovPerDay[date].aov = aovPerDay[date].revenue / aovPerDay[date].count);

// For each date, calculate average revenue for all days prior
Object.keys(aovPerDay).forEach(date => {
    let previousDays = Object.values(aovPerDay).filter(e => new Date(e.date) < new Date(date));
    let previousDaysRevenue = previousDays.reduce((sum, e) => sum += e.aov, 0);
    aovPerDay[date].rolling_average = previousDaysRevenue / previousDays.length || 0;
});

// Output as CSV
Object.keys(aovPerDay).reduce((csv, date) => {
    csv += `${date},${aovPerDay[date].aov},${aovPerDay[date].rolling_average}\n`;
    return csv;
}, 'Date,AOV,AOV_AC\n');

What is our overall refund rate?
orders

(Send results to console)

// (# orders with > 0 refunds) / (# orders)
let frequency = results.filter(o => o.refunds.length > 0).length / results.length;
`1 in ${Math.round(1/frequency)} orders have had a refund of any kind`;
What are the actual delivery times for Fedex/2day?
fulfillments [ tracking_company = fedex ]
fulfillments : events [ status = delivered ]
fulfillments : orders : shipping_lines [ code ~ 2day ]
// The shipping line code is store-dependent!

(Send results to console)

let total = results.reduce((total_delivery_time, f) => {
  return total_delivery_time + (new Date(f.events.find(e => e.status === "DELIVERED").happenedAt).getTime() - new Date(f.created_at)).getTime();
}, 0);

// Average days per shipment from fulfillment creation to "delivered" event
total / 8.64e7 / results.length;

Plottable delivery times of all fulfillments
fulfillments : events [ status = delivered ]

(Send results to console)

results.map(f => (new Date(f.events.find(e => e.status === "DELIVERED").happenedAt) - new Date(f.created_at)) / 8.64e7);
Units sold per day of a specific SKU
orders : line_items [ sku = ABC ]

(Send results to console)

results.map(o => {
  return {
    quantity: o.line_items.find(l => l.sku === "ABC").quantity
    date: new Date(o.created_at).toLocaleDateString('en-US')
  }
}).reduce((entries, o) => {
  entries[o.date] = entries[o.date] ? entries[o.date] + o.quantity : o.quantity;
  return entries;
}, {});

Data structure

The data generally follows Shopify's definitions for each root type (orders, fulfillments, customers, products, discounts, inventory), with a few notable exceptions:
  • order.fulfillments is replaced with an array of fulfillment IDs, and those fulfillments are stored separately. This allows querying fulfillments without referencing the parent orders.
  • order.customer is replaced with an array whose sole member is the customer ID. This is done to avoid storing redundant or conflicting copies of the data.
  • fulfillment.events is added to eligible (fulfillment.shipment_status non-null) fulfillments. This is gathered from Shopify's GraphQL API and is an array of {status, happenedAt}.

The fastest way to explore each data structure is to simply query the type and send the results to a javascript console.

FAQs

Q. Is MacOS supported
No. During development, I did not have Apple hardware to test and build on. I do have a new Mac on the way, so I'll be doing a dmg release in mid-January.
Q. Is this an "open source" project?
No. This project is public-source freeware. The source code is public and the releases are code-signed so that users can ease potential security concerns. I, personally, would not enter powerful Shopify tokens into an application whose source I could not inspect. While I hope this project can help you enhance the profits of your store, I retain the rights to sell or monetize the project/code itself.
Q. Why does this use private tokens?
Private tokens give their bearer access to the entire history of your store. App tokens can only access the past 60 days worth of data. I chose private tokens to make SD more powerful.
Q. Does SD collect usage data?
At this time, no. I have interest in sampling specific datasets in the future to support some AI projects. If this happens, it will be an explicit opt-in. The only network requests the current version makes are to your store, to gather data on your behalf.
Q. Can you add...
I will absolutely consider it and discuss the request. Please submit a feature request via the Issues tab.
Q. Is SD self-updating?
Beta version 1 does not self-update.
Q. This query language is dumb
Yup sure is, that's why I added the "Send to console" / "Send to file" buttons.
Q. How much storage space will this consume?
My test stores (around 250k orders per store) were each around 200 MB on disk. Your stats may vary. Under the hood, we collect orders/products/whatever into chunks and gzip the binary objects, then store that. This method gets us >90% compression.
Q. How long does it take to download everything?
Stores with 250k orders currently finish in about 4 hours. Your stats may vary. After this, SD will periodically check in with the store to get recent changes.
Q. My order/product/whatever is missing!
I expect SD to be slightly out of date (~10 minutes) at any time. Some types of events/changes are difficult to track in real-time. Fulfillments, for example, can be created on very old orders and Shopify sometimes will not emit events for this. If your data has not appeared after waiting an hour or so, I recommend forcing a re-sync.

Shopify is not able to reliably gather or parse tracking events from all carriers. Shippo USPS is notable here for never having events data.

Q. What kind of system resource usage should I expect?
When all historical data has finished downloading, <1% CPU and <200MB of RAM.
When downloading historical data, my Ryzen 3700x would occasionally spike up to ~10% due to large data compression tasks.
During very complex queries (nested queries with millions of date comparisons), I typically see 30% CPU.
Q. Can I export my data?
Yes. Query the type your want to export (eg "orders"), then click the "Send results to file" button. Just be aware that the JSON string will be >10X larger than the compressed data. Orders will export to ~1GB / 100k orders.
Q. Can I query inventory levels with this?
Yes, to the extent Shopify makes inventory information available. If your goal is to inspect or analyze inventory history, you would need to reconstruct it from fulfillment events or interface with a WMS that tracks history on your behalf.

Future work

Demand pending, I am considering the following feature additions:
  • Finer store permissions. For example, excluding the customers permission would cause SD to not attempt downloading that data.
  • A server-friendly version
  • Saving and replaying queries
  • More robust parent/child relationships. Would be nice to be able to link children to grandparents (eg line_item → variant)

Disclaimer

This is application is a novelty not intended to support production software workloads and is not guaranteed to be reliable, accurate, or safe. See licensing.

About

Transform Shopify stores into a local high-performance JSON database

Resources

License

Stars

Watchers

Forks

Packages

No packages published