Your store, your data; whenever you need it. A data science tool for Shopify stores.
This desktop application efficiently copies the entirety of any number of connected Shopify stores. The results are stored in a custom database in a highly compressed format. Data access is provided via queries, JSON exports, or direct terminal access.
Cold queries can return 1 million orders per minute.
- Remake Shopify admin search with orders of magnitude more power, accuracy, and speed
- Decouple the analysis of Shopify data sets from the production of Shopify data sets
I have spent the last few years working with a number of very large Shopify stores. I would periodically get asked questions along the lines of:
- What fulfillments contained item X today?
- How has AOV changed over the history of the store?
- What is our overall refund rate?
- What are the actual delivery times for Fedex/2day?
The analysis involved in these queries is very simple, but just collecting the data accounted for 90% of the time cost. I would have to write some code to read the right data from Shopify, and usually end up throwing everything away when I was done.
I can't imagine I'm the only person frustrated by this, so I have written SD in an attempt to solve this problem as generally as possible.
- Quick start guide
- Example queries
- Query syntax and features
- Example analysis
- Data structure
- FAQs
- Future work
- Disclaimer
- Download and install the latest release of SD
- Create a private app token (your-store.myshopify.com/admin/apps/private) with read access on the following permissions:
- Customers, discounts, inventory, orders, price rules, and products
- From the Stores tab in SD, click Add a store. Enter a nickname + the myshopify URL/key/password.
- Click "Connect". SD will validate the connection & permissions, then save the info. It will immediately begin copying all store data.
Once running on your machine, press tab or CTRL+3 to bring up queries. Queries work by taking a list of your Shopify data types (orders, fulfillments, etc) and filtering them based on input you provide. For example, inputting orders : shipping_address [ country_code = US ]
will return every order in the database where order.shipping_address.country_code
is equal to "US".
customers : orders : fulfillments : events [ status = delivered ] [ happenedAt > 1-1-20 ]
orders [ subtotal_price = 0 ]
orders [ financial_status = pending ]
orders : fulfillments
orders : refunds : refund_line_items : line_item [ sku = ABC ]
orders : refunds [ created_at > 1-1-20 ]
products : images [ height > 1000 ] [ width > 1000 ]
products : variants : inventory [ harmonized_system_code = null ]
customers : orders : discount_codes [ code = ABC ]
root_type [ field = value ] [ ... ] : subtype1 [ field = value ] [ ... ] : subtype2 ...
root_type
must always be one oforders, fulfillments, customers, products, discounts, inventory
Example. orders [ name = #1001 ]
Returns orders whose name field is equal to the value "#1001".
- Fields have a number of comparison functions
=
Downcased string/number equality
customers [ first_name = zach ]
→ Customers whose first name equals "zach" (or "Zach")~
Downcased string includes
customers [ first_name ~ zach ]
→ Customers whose first name includes "zach" (also matches "Zachary")!
Downcased string not equals / integer not equals>
,<
,>=
,<=
Number/date inequalities
customers [ orders_count > 1 ]
→ Customers with more than 1 orderorders [ created_at > 1-1-20 ]
→ Orders placed after January 1st 2020, based on your system time zone
- Field types and field values are automatically coerced to strings/integers/dates if and only if:
- The field and the input both follow the same transformation (String → date, String → Integer)
- The operator makes sense for the resulting type (eg
~
cannot act on integers or dates)
- You can execute serial field comparisons
fulfillments : events [ status = delivered ] [ happenedAt > 1-1-20 ]
→ Fulfillments that were marked as delivered after January 1st 2020
Example. orders : shipping_address [ country_code = US ]
Returns orders where the following conditions are met:
- The order has a field called "shipping_address"
- "shipping_address" has a field called "country_code", whose value is "US"
Subtypes can be objects or arrays.
- Subtypes have a number of access quantifiers
:
Existential someorders : line_items [ sku = ABC ]
→ Orders where some line item has SKU ABC&
Existential everyorders & line_items [ sku = ABC ]
→ Orders where every line item has SKU ABC*
Noneorders * line_items [ sku = ABC ]
→ Orders where no line item has SKU ABC
- New lines are fed the results of the previous line
-
orders : line_items : properties [ name = engraving ] orders [ created_at > 1-1-20 ]
This finds orders where some line items have a property with name=engraving, then filters that result for order creation after January 1st 2020
-
- SD will join different data sets if it can determine a relationship between them. The joined result is always an array. Those relationships are:
orders: fulfillments
,orders: customers
fulfillments: orders
line_items: products
customers : orders
variants: inventory
fulfillments [ created_at > 1-1-21 ]
fulfillments : line_items [ sku = X ]
orders
(Send results to console)
let aovSet = results.map(o => { // Reducing each order to its relevant data: revenue & date
return {
revenue: +o.subtotal_price,
date: new Date(o.created_at).toLocaleDateString('en-US')
}
});
let aovPerDay = aovSet.reduce((entries, o) => { // Total revenue on each date
entries[o.date] = entries[o.date] ? {...o, count: entries[o.date].count + 1, revenue: entries[o.date].revenue + o.revenue} : {...o, count: 1, revenue: o.revenue};
return entries;
}, {});
// Calculating average revenue on each date
Object.keys(aovPerDay).forEach(date => aovPerDay[date].aov = aovPerDay[date].revenue / aovPerDay[date].count);
// For each date, calculate average revenue for all days prior
Object.keys(aovPerDay).forEach(date => {
let previousDays = Object.values(aovPerDay).filter(e => new Date(e.date) < new Date(date));
let previousDaysRevenue = previousDays.reduce((sum, e) => sum += e.aov, 0);
aovPerDay[date].rolling_average = previousDaysRevenue / previousDays.length || 0;
});
// Output as CSV
Object.keys(aovPerDay).reduce((csv, date) => {
csv += `${date},${aovPerDay[date].aov},${aovPerDay[date].rolling_average}\n`;
return csv;
}, 'Date,AOV,AOV_AC\n');
orders
(Send results to console)
// (# orders with > 0 refunds) / (# orders)
let frequency = results.filter(o => o.refunds.length > 0).length / results.length;
`1 in ${Math.round(1/frequency)} orders have had a refund of any kind`;
fulfillments [ tracking_company = fedex ]
fulfillments : events [ status = delivered ]
fulfillments : orders : shipping_lines [ code ~ 2day ]
// The shipping line code is store-dependent!
(Send results to console)
let total = results.reduce((total_delivery_time, f) => {
return total_delivery_time + (new Date(f.events.find(e => e.status === "DELIVERED").happenedAt).getTime() - new Date(f.created_at)).getTime();
}, 0);
// Average days per shipment from fulfillment creation to "delivered" event
total / 8.64e7 / results.length;
fulfillments : events [ status = delivered ]
(Send results to console)
results.map(f => (new Date(f.events.find(e => e.status === "DELIVERED").happenedAt) - new Date(f.created_at)) / 8.64e7);
orders : line_items [ sku = ABC ]
(Send results to console)
results.map(o => {
return {
quantity: o.line_items.find(l => l.sku === "ABC").quantity
date: new Date(o.created_at).toLocaleDateString('en-US')
}
}).reduce((entries, o) => {
entries[o.date] = entries[o.date] ? entries[o.date] + o.quantity : o.quantity;
return entries;
}, {});
orders, fulfillments, customers, products, discounts, inventory
), with a few notable exceptions:
order.fulfillments
is replaced with an array of fulfillment IDs, and those fulfillments are stored separately. This allows querying fulfillments without referencing the parent orders.order.customer
is replaced with an array whose sole member is the customer ID. This is done to avoid storing redundant or conflicting copies of the data.fulfillment.events
is added to eligible (fulfillment.shipment_status
non-null) fulfillments. This is gathered from Shopify's GraphQL API and is an array of{status, happenedAt}
.
The fastest way to explore each data structure is to simply query the type and send the results to a javascript console.
No. During development, I did not have Apple hardware to test and build on. I do have a new Mac on the way, so I'll be doing a dmg release in mid-January. No. This project is public-source freeware. The source code is public and the releases are code-signed so that users can ease potential security concerns. I, personally, would not enter powerful Shopify tokens into an application whose source I could not inspect. While I hope this project can help you enhance the profits of your store, I retain the rights to sell or monetize the project/code itself. Private tokens give their bearer access to the entire history of your store. App tokens can only access the past 60 days worth of data. I chose private tokens to make SD more powerful. At this time, no. I have interest in sampling specific datasets in the future to support some AI projects. If this happens, it will be an explicit opt-in. The only network requests the current version makes are to your store, to gather data on your behalf. I will absolutely consider it and discuss the request. Please submit a feature request via the Issues tab. Beta version 1 does not self-update. Yup sure is, that's why I added the "Send to console" / "Send to file" buttons. My test stores (around 250k orders per store) were each around 200 MB on disk. Your stats may vary. Under the hood, we collect orders/products/whatever into chunks and gzip the binary objects, then store that. This method gets us >90% compression. Stores with 250k orders currently finish in about 4 hours. Your stats may vary. After this, SD will periodically check in with the store to get recent changes. I expect SD to be slightly out of date (~10 minutes) at any time. Some types of events/changes are difficult to track in real-time. Fulfillments, for example, can be created on very old orders and Shopify sometimes will not emit events for this. If your data has not appeared after waiting an hour or so, I recommend forcing a re-sync.Shopify is not able to reliably gather or parse tracking events from all carriers. Shippo USPS is notable here for never having events
data.
When downloading historical data, my Ryzen 3700x would occasionally spike up to ~10% due to large data compression tasks.
During very complex queries (nested queries with millions of date comparisons), I typically see 30% CPU. Yes. Query the type your want to export (eg "orders"), then click the "Send results to file" button. Just be aware that the JSON string will be >10X larger than the compressed data. Orders will export to ~1GB / 100k orders. Yes, to the extent Shopify makes inventory information available. If your goal is to inspect or analyze inventory history, you would need to reconstruct it from fulfillment events or interface with a WMS that tracks history on your behalf. Demand pending, I am considering the following feature additions:
- Finer store permissions. For example, excluding the
customers
permission would cause SD to not attempt downloading that data. - A server-friendly version
- Saving and replaying queries
- More robust parent/child relationships. Would be nice to be able to link children to grandparents (eg line_item → variant)