DataFusion does not support wasm32-unknown-unknown target #177

Closed
alamb opened this issue Apr 26, 2021 · 15 comments · Fixed by #7633
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11615

The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does not support the wasm32-unknown-unknown target.

Try out the repository at https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a.

{code}
error[E0433]: failed to resolve: could not find unix in os
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
|
41 | use std::os::unix::ffi::OsStringExt;
| ^^^^ could not find unix in os

error[E0432]: unresolved import unix
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
|
6 | use unix;
| ^^^^ no unix in the root

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9
|
98 | sys::duplicate(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9
|
101 | sys::allocated_size(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9
|
104 | sys::allocate(self, len)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9
|
107 | sys::lock_shared(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9
|
110 | sys::lock_exclusive(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9
|
113 | sys::try_lock_shared(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9
|
116 | sys::try_lock_exclusive(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9
|
119 | sys::unlock(self)
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5
|
126 | sys::lock_error()
| ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
--> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5
|
169 | sys::statvfs(path.as_ref())
| ^^^ use of undeclared crate or module sys

Compiling num-rational v0.3.2
error: aborting due to 10 previous errors
{code}

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@alippai
Contributor

alippai commented Apr 26, 2021

Polars proof of concept (shows that arrow-rs and datafusion like API can work): https://github.com/ritchie46/polars/blob/master/js-polars/app.js

@alippai
Contributor

alippai commented Apr 29, 2021

#218 is a great step helping this

@seddonm1
Contributor

@alamb @Dandandan @jorgecarleitao
I have done some digging into this. There are two blockers:

  1. dirs-rs
    problem: dirs-rs is pulled in by prettyprint-r, which is pulled in for pretty printing dataframes https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/execution/dataframe_impl.rs#L34. dirs does not have a wasm32-unknown implementation and the owner seems reluctant to add one: wasm-pack support missing.  dirs-dev/dirs-rs#17.
    solution: The arrow crate has a feature flag which enables pretty printing. If we agree, we can add the same flag to datafusion (enabled by default) to avoid this first problem.

  2. lz4
    problem: the parquet crate depends on lz4, which is backed by lz4-sys (not native Rust) and does not compile to wasm32-unknown-unknown. I tested ripping parquet out of datafusion, and everything compiles.
    solution: we could either put anything parquet behind another feature flag or swap out lz4 for a pure Rust implementation like the redox one: https://lib.rs/crates/lz4-compress.

It would be relatively easy for me to do the two feature flags unless someone has an objection?
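The feature-flag idea can be sketched in Rust roughly as follows. This is only an illustration, not DataFusion's code: the feature name `pretty` and the function `format_rows` are hypothetical, and the crate's Cargo.toml is assumed to declare the feature and list it under `default`. When the feature is disabled, the heavier formatting path (and whatever dependencies it would pull in) is not compiled at all.

```rust
/// Aligned table output, compiled only when the hypothetical `pretty`
/// feature is enabled (it would be on by default in Cargo.toml).
#[cfg(feature = "pretty")]
pub fn format_rows(rows: &[(&str, i64)]) -> String {
    let width = rows.iter().map(|(k, _)| k.len()).max().unwrap_or(0);
    rows.iter()
        .map(|(k, v)| format!("| {:<width$} | {} |", k, v, width = width))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Plain fallback used when `pretty` is disabled, e.g. for a wasm build
/// that cannot compile the pretty-printing dependency chain.
#[cfg(not(feature = "pretty"))]
pub fn format_rows(rows: &[(&str, i64)]) -> String {
    rows.iter()
        .map(|(k, v)| format!("{},{}", k, v))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A wasm consumer would then depend on the crate with `default-features = false`, while everyone else keeps the current behavior.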

@alamb
Contributor Author

alamb commented Aug 25, 2021

dirs-rs

Note that apache/arrow-rs#656 from @PsiACE has removed the pretty-table dependency in arrow-rs upstream. This will be included in the 6.0 arrow release (in 2ish months); I am not sure if/how this affects your decision

lz4:

I think lz4 is an optional dependency of parquet: https://github.com/apache/arrow-rs/blob/master/parquet/Cargo.toml#L40 thus perhaps we could just have a lz4 feature flag for datafusion?

@seddonm1
Contributor

Thanks @alamb .
I will do some experiments but seems like a good solution.

@ivanceras

I think the biggest hurdle is the use of tokio-specific types in the underlying implementation of datafusion, as I've found in this code.
I understand this is for performance reasons, but there is also no easy way to abstract this rather than using executor-specific types. In wasm, we can use wasm-bindgen-futures as a way to run async functions.

@alamb
Contributor Author

alamb commented Nov 8, 2021

@ivanceras I am not sure if we are using tokio specific futures stuff due to performance or just convenience

It might be worth trying to replace tokio specific structs with things from the futures crate (if that works on wasm32) and see what it looks like

@seddonm1
Contributor

seddonm1 commented Nov 8, 2021

@ivanceras what are you experiencing? I have managed to compile to WASM with very slight code modifications.

@roee88

roee88 commented Nov 9, 2021

@seddonm1 compile and run?

I experimented with that yesterday. I tried wasm32-wasi first, and a simple sample works in single-threaded mode after disabling some parquet features. See this gist for the example: https://gist.github.com/roee88/91f2b67c3e180fa0dfb688ba8d923dae

For wasm32-unknown-unknown, adding getrandom with the js feature as a dependency of the sample makes it compile IIRC, but actually running it is a different story. I tried to get a sample working with wasm-pack and it stops execution on the datafusion context creation. I suspect that it uses some sync primitives that are unsupported in wasm32-unknown-unknown, but I didn't investigate further.

I didn't try wasm32-unknown-emscripten yet since my local rust version is incompatible with my installed emcc version (both latest at the time of this writing).

Edit: re tokio, the sample above worked on wasm32-wasi with other executors in single threaded mode including futures 0.3, https://github.com/richardanaya/executor, and async-global-executor.
As long as you don't hit code paths that use things like tokio::spawn (used in hash aggregate) it might be fine to use another executor. I'm not sure what's the best approach for library code that needs to spawn tasks. I have seen opinions for 1) a library should never spawn, 2) futures should be universally supported, 3) a library should accept an executor trait (as implemented in https://github.com/najamelan/async_executors). I didn't check the state of futures and WebAssembly recently.
I didn't try wasm-bindgen-futures because it's officially no longer compatible with wasi and emscripten and, as I said, I couldn't get anything running with wasm32-unknown-unknown.
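To make option 3 (a library accepts an executor trait) concrete, here is a minimal std-only sketch. Everything in it is hypothetical and for illustration only: the `Spawn` trait, the toy `LocalQueue` spawner and `sum_partitions` are not taken from DataFusion or from async_executors. The point is only that library code calls `spawner.spawn(...)` rather than `tokio::spawn`, so the caller decides which executor runs the tasks.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

/// A boxed, non-Send task, which is enough for a single-threaded target.
pub type LocalTask = Pin<Box<dyn Future<Output = ()>>>;

/// Hypothetical trait: library code asks the caller how to spawn.
pub trait Spawn {
    fn spawn(&self, task: LocalTask);
}

/// Waker that does nothing; `run_all` simply re-polls pending tasks.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy single-threaded spawner: queues tasks and runs them on demand.
#[derive(Default)]
pub struct LocalQueue {
    tasks: RefCell<VecDeque<LocalTask>>,
}

impl Spawn for LocalQueue {
    fn spawn(&self, task: LocalTask) {
        self.tasks.borrow_mut().push_back(task);
    }
}

impl LocalQueue {
    /// Polls queued tasks round-robin until every one has completed.
    pub fn run_all(&self) {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        loop {
            // Pop in its own statement so the RefCell borrow ends here.
            let next = self.tasks.borrow_mut().pop_front();
            match next {
                Some(mut task) => {
                    if task.as_mut().poll(&mut cx).is_pending() {
                        self.tasks.borrow_mut().push_back(task);
                    }
                }
                None => break,
            }
        }
    }
}

/// "Library" code: sums each partition in its own task, using whatever
/// spawner the caller hands in instead of a hard-coded runtime.
pub fn sum_partitions(
    spawner: &impl Spawn,
    partitions: Vec<Vec<i64>>,
    total: Rc<Cell<i64>>,
) {
    for partition in partitions {
        let total = Rc::clone(&total);
        spawner.spawn(Box::pin(async move {
            let sum: i64 = partition.iter().sum();
            total.set(total.get() + sum);
        }));
    }
}
```

A real implementation would need `Send` bounds and a proper waker for multi-threaded runtimes; the busy re-polling here is only acceptable because the toy tasks never actually wait on anything.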

@roee88

roee88 commented Nov 9, 2021

I got the basic sample from the gist in the previous message working with wasm-pack (wasm32-unknown-unknown) on single threaded tokio after:

  1. Removing the "lz4" and "zstd" features from parquet dependency. This is the same change I had to do for running in wasm32-wasi.
  2. Disabling the default time feature from tokio-stream by changing the dependency to: tokio-stream = { version = "0.1", default-features = false }
  3. Removing the use of Instant::now() which is unsupported in wasm32-unknown-unknown and panics. For the basic sample that I'm running I just removed some lines from ScopedTimerGuard, but it's used in other places too if you run more complex stuff. Maybe https://github.com/sebcrozet/instant can be used instead.
  4. Enabling js feature of getrandom: getrandom = { version = "0.2", features = ["js"] }. I did this in the sample application itself and not in datafusion (as it breaks other targets like wasm32-wasi).
  5. Enabling wasmbind feature of chrono: chrono = {version = "0.4", features = ["wasmbind"]}. I did this in the sample application itself and not in datafusion (as it breaks other targets like wasm32-wasi).

Tested in chrome and seems to work. Again, there are definitely some code paths that lead to panic as not all of std is supported in wasm32 targets and I only tested something basic. Also, for multi-threading to work in the browser some parts of tokio can't be used directly from the datafusion codebase (this is a more complex topic).
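As an illustration of the `Instant::now()` point in step 3, one possible shape for a target-gated timing helper is sketched below. The function `timed` is hypothetical and is not DataFusion's ScopedTimerGuard; it simply skips the measurement on wasm32-unknown-unknown instead of substituting another clock, which is the simplest way to avoid the panic.

```rust
use std::time::Duration;

/// Native and WASI targets have a monotonic clock, so measure for real.
#[cfg(not(all(target_arch = "wasm32", target_os = "unknown")))]
pub fn timed<T>(work: impl FnOnce() -> T) -> (T, Option<Duration>) {
    let start = std::time::Instant::now();
    let value = work();
    (value, Some(start.elapsed()))
}

/// wasm32-unknown-unknown: run the work but report no duration, so that
/// `Instant::now()` is never called on this target.
#[cfg(all(target_arch = "wasm32", target_os = "unknown"))]
pub fn timed<T>(work: impl FnOnce() -> T) -> (T, Option<Duration>) {
    (work(), None)
}
```

Callers that record metrics would then treat `None` as "no timing available" rather than as zero elapsed time.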

@milesrichardson

Good news, fellow WebAssembly enthusiasts! It looks like the stars are finally aligning, and with relatively minimal patching, I successfully compiled the code from the gist (create, insert and query a MemTable) to wasm32-wasi and wasm32-unknown-unknown, and ran it in wasmedge and the browser (via wasm-pack):

❯ docker run --rm -it -v $(pwd)/target/wasm32-wasi/debug:/app wasmedge/slim:0.11.2-rc.1 wasmedge --reactor dfwasm.wasm _start
+---+----+
| a | b  |
+---+----+
| b | 10 |
| c | 10 |
+---+----+
0


I pushed the proof-of-concept to a public repository at splitgraph/experimental-datafusion-webassembly. There are two branches:

  • wasm32-wasi
    • This is the target I got working first. The readme on this branch contains all the details and you should be able to reproduce it yourself.
  • wasm32-unknown-unknown
    • This is branched from wasm32-wasi and the diff of wasm32-wasi..wasm32-unknown-unknown shows the changes
    • The top of the readme includes instructions for running this in the browser, but the patch is still very messy and might not be easily reproducible. Make sure you check Cargo.toml for any patched crates that you need to have checked out at a local path.

In the near future, I intend to clean up these changes and submit a PR to DataFusion feature-flagging WebAssembly support.

In general, the summary of requirements for wasm32-wasi:

For wasm32-unknown-unknown, in addition to all those requirements, it was also necessary to:

  • Replace usage of std::time with Instant, in both datafusion and arrow
  • Make sure every library that calls getrandom is also passing it the js feature flag, which I did by just patching getrandom and making that the default

To get it to run (without a runtime error related to std::time being unreachable), a few more changes were made:

  • Don't run the demo code in a Tokio main runtime, even with flavor = current-thread. Instead, use wasm-bindgen-futures to await a future that performs the asynchronous task that calls datafusion

This is all very messy. I will clean it up and submit a PR to DataFusion once I have a better sense of the most minimal changes required and the proper way to feature flag them. Also, general disclaimer that I'm new to Rust and YMMV, especially on the wasm32-unknown-unknown patch. After all, I barely got it to run. But it does compile and create and query a small in-memory table, which is pretty good!

@alamb
Contributor Author

alamb commented Nov 1, 2022

This sounds very cool @milesrichardson - DataFusion should be upgraded to arrow 26.0.0 shortly: #4039. I think @jimexist is in the process of making bzip support optional #3993

In terms of being messy / submitting a PR -- if it is possible I suggest trying to do it incrementally -- like for example we can probably sort out the calls to spawn_blocking in a separate PR

But all in all this is pretty exciting

@REASY

REASY commented Mar 15, 2023

Hello, folks.

I'm trying to add WASM support to DataFusion's dependencies. Started with bzip2-rs trifectatechfoundation/bzip2-rs#93

@REASY

REASY commented Apr 19, 2023

Posting an update on trifectatechfoundation/bzip2-rs#93, had a discussion with @alexcrichton

Compiling C for the web works with Emscripten and can work with WASI since there's a libc, but in general it doesn't work with wasm32-unknown-unknown because there's no libc. I would not recommend this as a viable approach of porting a project to wasm.

Not sure how to go from here...

@milesrichardson

@REASY In my experiment (the one linked above), I put bzip behind a configuration flag and disabled it for the wasm targets. Datafusion still compiled. I don't know enough about DF to say how important bzip is, or which parts of DF would be broken without it, however. It seemed limited in scope, since it should only affect files that are encoded with bzip.
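A rough sketch of what "limited in scope" could look like: only the bzip2 path is rejected on wasm, and everything else is unaffected. The `Codec` enum and `check_codec` function are hypothetical names for illustration, not DataFusion's API, and the sketch gates on the target rather than on a Cargo feature.

```rust
/// Hypothetical set of file compression codecs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Codec {
    Uncompressed,
    Bzip2,
}

/// Returns an error only for codecs that are assumed to be compiled out
/// on the current target; all other codecs keep working.
pub fn check_codec(codec: Codec) -> Result<(), String> {
    match codec {
        Codec::Bzip2 if cfg!(target_arch = "wasm32") => {
            Err("bzip2 support is not compiled in for wasm32 targets".to_string())
        }
        _ => Ok(()),
    }
}
```

With this shape, a query over bzip-compressed files fails with a clear error on wasm, while queries over other files behave the same on every target.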
