Skip to content
Richard Lin edited this page Jan 28, 2021 · 37 revisions

DataSpread logo

DataSpread is a spreadsheet-database hybrid system with a spreadsheet front end and a database backend. Thus, DataSpread inherits the flexibility and ease-of-use of spreadsheets, as well as the scalability and power of databases. DataSpread is a multi-year project supported by the National Science Foundation via award number 1633755.

You can directly use DataSpread via our cloud-hosted site (Temporarily offline).

Motivation

Spreadsheet software is the tool of choice for ad-hoc tabular data management with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. DataSpread is a system that holistically unifies databases and spreadsheets with the goal of interactively operating on large datasets.

DataSpread retains all of the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the expressiveness, scalability and collaboration abilities of traditional relational databases. DataSpread has a spreadsheet front-end and a regular relational database back-end.

Features

DataSpread is built using PostgreSQL and ZKSpreadsheet, an open-source web-based spreadsheet tool.

DataSpread's version 0.1 enables users to scale to billions of cells and return results for common spreadsheet operations within seconds. It does so via on-demand loading of spreadsheet data. DataSpread's version 0.3 introduces the ability to declare and operate on tables, along with the ability to employ relational operators and SQL, all within the spreadsheet interface, enabling users to perform interactive tabular data management.

Like traditional spreadsheet software, DataSpread supports standard spreadsheet book and sheet operations like Load, Rename, Delete, and Import (via XLS and XLSX, and CSV). Any updates to the spreadsheets are automatically reflected in the underlying database.

Also like traditional spreadsheet software, DataSpread supports the use of 225+ spreadsheet functions, along with formatting and styling operations. It supports row and column operations like insert, delete, cut, copy, and paste; during insertion and deletion, formulae are updated, as is the case in traditional spreadsheet software.

It supports all of these operations while scaling to arbitrarily large spreadsheets.

Starting from version 0.3, DataSpread supports tables management and SQL and relational algebra, all within the spreadsheet interface. That is, DataSpread allows users to declare and use tables from the spreadsheet front-end or the relational database back-end, and reuse the same tables multiple times across the same or different spreadsheets. Moreover, DataSpread enables users to express SQL queries or relational algebra within the spreadsheet interface, allowing users to compose complex relational expressions beyond what spreadsheets can currently offer. For example, expressing joins is now a simple matter of a few clicks on the frontend.

Also starting from version 0.3, DataSpread no longer requires users to wait for complex or large formulae to be evaluated, by displaying placeholder values, and filling these values asynchronously via our asynchronous formula evaluation engine.

Starting from version 0.5, users can navigate tabular spreadsheet data using a navigation panel on the left. The navigation panel enables the users explore large datasets without having to tediously scroll the data. Users can issue traditional spreadsheet formulae to view aggregate statistics on the data. Moreover, users can customize the navigation panel to accommodate their purpose of navigation. Furthermore, starting from version 0.5, users can view the progress of the asynchronous computation via progressive horizontal bars.

Key Design Innovations

  • DataSpread employs a flexible hybrid data model to represent spreadsheet data within a database.
  • DataSpread uses positional indexing techniques to both locate data by position and keeps it up-to-date as the data is updated.
  • DataSpread employs an LRU caching mechanism to retrieve and keep in memory data from the database on demand.
  • DataSpread employs speculative fetching to fetch additional data beyond the user's current spreadsheet window.
  • DataSpread enables users to associate spreadsheet regions with relational tables and allows simple table manipulation operations.
  • DataSpread evaluates formulae asynchronously thereby not requiring the users to wait for long running operations to complete.
  • DataSpread introduces front-end SQL and relational operations to enable users to issue SQL queries to the underlying database.
  • DataSpread introduces a navigation panel which enables the users to explore tabular spreadsheet data and obtain additional details on demand via aggregation operations.

Future Plan

  • Optimize the computation of spreadsheet formulae by using joint formula evaluation and optimization.
  • Support create multiple views for displayed tables using different orderings.
  • Allow for multi-user sharing/collaboration of spreadsheets.

Learning More

Use the user guide on the side to learn more about setup instructions, architecture, and other information.

Clone this wiki locally