
Getting Started with Rust

Why the developers who use Rust love it so much - from the Stack Overflow survey; really good quotes

Some links on Rust

cheats.rs - Awesome quick ref.

Speed without wizardry - how using Rust is safer and better than using hacks in Javascript

Dealing with strings is confusing in Rust, because there are two types: the heap-allocated String, and &str, a borrowed slice of string bytes. Knowing which to use, and how to define structures containing them, immediately exposes the steep learning curve of ownership.

See the Guide to Strings for some help.
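A minimal sketch of the distinction: String owns heap-allocated bytes, while &str is a borrowed view into them (names here are illustrative):

```rust
fn main() {
    let owned: String = "hello world".to_string(); // heap-allocated, growable, owned
    let view: &str = &owned[0..5];                 // borrowed slice of the same bytes, no copy
    assert_eq!(view, "hello");

    // Functions should usually take &str so they accept both forms:
    fn shout(s: &str) -> String {
        s.to_uppercase()
    }
    assert_eq!(shout(view), "HELLO");
    assert_eq!(shout(&owned), "HELLO WORLD"); // &String coerces to &str
}
```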

Online resources and help:

Specific topics:

Cool Rust Projects

CLI tools:

  • XSV - a fast CSV parsing and analysis tool
  • Ripgrep - insanely fast grep utility, great for code searches. Shows off power of Rust regex library
  • Bat - A super cat with syntax highlighting, git integration, other features
  • Bottom - Cross-platform fancy top in Rust - process/sys mon with graphs, very useful!
  • ht - HTTPie clone / much better curl alternative
  • Dust - a faster, friendlier, graphical-text version of du, in Rust
  • fd - Rust CLI, friendlier and faster replacement for find
  • Nushell - Rust shell that turns all output into tabular data. Pretty cool!
  • imagecli - CLI for image batch processing
  • Hyperfine - Rust performance benchmarking CLI
  • Alacritty - GPU accelerated terminal emulator
  • jql - Rust version of popular jq JSON CLI processor, though not as powerful
  • Starship - "The minimal, blazing-fast, and infinitely customizable prompt for any shell!"

Wasm:

  • Wasmer - general purpose WASM runtime
  • Krustlet - WebAssembly (instead of containers) runtime on Kubernetes!! Use Rust + wasm + WASI for a truly portable k8s-based deploy!

Data/Others:

  • Sled - an embedded database engine that combines a latch-free Bw-tree with a latch-free pagecache for speed
  • IOx - New in-memory columnar InfluxDB engine using Arrow, DataFusion, rust! Persists using parquet. Super awesome stuff.
  • IndraDB - Graph database/library written in Rust! and inspired by Facebook's TAO.
  • TabNine - an ML-based autocompleter, written in Rust
  • async-std - the standard library with async APIs
  • MinSQL - interesting POC on lightweight SQL-based log search, with automatic field parsing etc.
  • Timely Dataflow - distributed data-parallel compute engine in Rust!!!
  • Toshi - ElasticSearch written in Rust using Tantivy as the engine
  • Convey - Layer 4 load balancer

Rust Error Handling

Error handling survey - really good summary of the Rust error library landscape as of late 2019.

  • Anyhow - streamlined error handling with context
  • Snafu - adding context to errors

Rust Concurrency

Shared Data Across Multiple Threads

Sometimes one needs to share a large data structure across threads and several of them must access it.

The most general way to share a data structure is Arc<RwLock<...>> or Arc<Mutex<...>>. The Arc keeps the data alive for as long as any thread holds a handle, letting different threads live for different lengths of time, and it is inexpensive since it is typically cloned just once per thread spawn. The Mutex or RwLock lets the threads mutate the data safely, which is needed when the data structure itself is not thread-safe.

A thread-safe data structure could be used in place of the RwLock or Mutex.
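A minimal std-only sketch of the Arc<Mutex<...>> pattern — each thread clones the Arc once at spawn time, then locks only to mutate:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(Vec::<u64>::new()));

    let handles: Vec<_> = (0..4u64)
        .map(|i| {
            let shared = Arc::clone(&shared); // one cheap clone per thread
            thread::spawn(move || {
                shared.lock().unwrap().push(i); // Mutex makes the mutation safe
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(shared.lock().unwrap().len(), 4);
}
```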

Scoped threads can be used when a single owner mutates the data structure and only wants to share immutable refs with other threads for reading. Plain spawned threads don't work here: rustc by itself has no way of proving when a thread will be joined, so immutable refs borrowed from the owner thread fail the lifetime checks and won't compile. Scoped threads, such as those in the Crossbeam crate, get around that by guaranteeing to rustc that the threads are joined before the owner goes away.
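Since this was written, scoped threads have also landed in std (std::thread::scope, Rust 1.63), with the same shape as Crossbeam's. A sketch of read-only sharing from an owner thread:

```rust
use std::thread;

fn main() {
    let data = vec![1u32, 2, 3, 4, 5, 6];
    let (left, right) = data.split_at(3);

    // The scope guarantees both threads are joined before `data` can go away,
    // so plain immutable borrows of `data` are allowed inside.
    let (a, b) = thread::scope(|s| {
        let t1 = s.spawn(|| left.iter().sum::<u32>());
        let t2 = s.spawn(|| right.iter().sum::<u32>());
        (t1.join().unwrap(), t2.join().unwrap())
    });
    assert_eq!(a + b, 21);
}
```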

Arc-swap could potentially help too.

Also see beef - a leaner version of Cow.

Data Processing and Data Structures

  • Are we learning yet? - list of ML Rust crates

  • Timely Dataflow - distributed data-parallel compute engine in Rust!!!

  • DataFusion - a Rust query engine which is part of Apache Arrow!

  • Weld - Stanford's high-performance runtime for data analytics

  • Toshi - ElasticSearch written in Rust using Tantivy as the engine

  • MeiliDB - fast full-text search engine

  • Vector - unified client side collection agent for logs, metrics, events

  • Tremor - a simple event processing / log and metric processing and forwarding system, with scripting and streaming query support. Much more capable than Telegraf.

  • Clepsydra - Graydon Hoare working on distributed database protocol - in Rust!

JSON Processing

For JSON DOM (IR) processing, using the mimalloc allocator gave me a 2x speedup with serde-json. Switching to json-rust then provided another 1.8x speedup - completely unreal, much faster than the JVM. The main reason, I'd guess, is that json-rust has a Short DOM variant for short strings, which needs no heap allocation.

  • simdjson-rs - SIMD-enabled JSON parser. NOTE: no writing of JSON.

Cool Data Structures

  • dashmap - "Blazing fast concurrent HashMap for Rust"

  • radix-trie

  • Patricia Tree - Radix-tree based map for more compact storage

  • Using Finite State Automata and Rust to quickly index and find data amongst HUGE amount of strings

  • ahash - this seems to be the fastest hash algo for hash keys

  • Metrohash - a really fast hash algorithm

  • IndexMap - map with O(1) access by index and iteration in insertion order

  • FM-Index, a neat structure that allows for fast exact string indexing and counting while compressing original string data at the same time. There is a Rust crate

  • Rstar - n-dimensional R*-Tree for geospatial indexing and nearest-neighbor

  • Heapless - static data structures with fixed size; Vec, heap, map, set, queues

  • Petgraph - Graph data structure for Rust, considered perhaps most mature right now

  • Easy Persistent Data Structures in Rust - replacing Box with Rc

  • VecMap - map for small integer keys, may use less space

String Performance

Rust has native UTF8 string processing, which is AWESOME for performance. However, there are two concerns usually:

  1. Small-string memory efficiency. The native String type uses three words (pointer, length, capacity) just for its header, which might be longer than the string itself;
  2. Minimizing number of heap allocations

Here are some solutions:

  • String - string type with configurable byte storage, including stack byte arrays!
  • Inlinable String - stores strings up to 30 bytes inline, with automatic promotion to a heap string if needed.
  • Also see smallstr
  • kstring - intended for map keys: immutable, inlined for small keys, and has Ref/Cow types to allow efficient sharing. :)
  • nested - reduce Vec type structures to just two allocations, probably more memory efficient too.
  • tinyset - space efficient sets and maps, can be combined with nested perhaps
  • bumpalo can do really cheap group allocations in a Bump and has custom String and Vec versions. At least lowers allocation overhead.
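The header cost is easy to verify with std alone (on a 64-bit target, three words = 24 bytes before a single character is stored):

```rust
use std::mem::size_of;

fn main() {
    // String = pointer + length + capacity, each one machine word.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>()); // 24 bytes on 64-bit
    // &str is a fat pointer: pointer + length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
    // So a key like "id" pays 24+ bytes of header for 2 bytes of data,
    // which is exactly what the inline/small-string crates above avoid.
}
```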

Rust and Scala/Java

  • Rust for Java Developers

  • 5 Rust Reflections from Java

  • The presence of true unsigned types is really nice for low-level work. I hit a bug in Scala where I used >> instead of >>>. In Rust you declare a type as unsigned and don't have to worry about this.

  • Immutable byte slices and reference types again are awesome for low-level work.

  • Trait monomorphisation is awesome for ensuring trait methods can be inlined. JVM cannot do this when there is more than one implementation of a trait.

  • Being able to examine assembly directly from compiler output is super nice for low level perf work (compared to examining bytecode and not knowing the final output until runtime)

  • OTOH, rustc is definitely much much stricter (IMO) compared to scalac. Much of this is for good reason though, for example lack of integer/primitive coercion, ownership, etc. gives safety guarantees.
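A small illustration of the >> vs >>> point: in Rust the shift behavior follows the type, so an unsigned type can never surprise you with sign extension:

```rust
fn main() {
    let u: u32 = 0x8000_0000;
    assert_eq!(u >> 1, 0x4000_0000); // logical shift: zero-filled, like Java's >>>

    let s: i32 = -2;
    assert_eq!(s >> 1, -1); // arithmetic shift: sign-extended, like Java's >>
}
```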

Rust-OtherLanguage Integration / Rust FFI

CLI and Misc

  • Structopt - define CLI options using a struct!

IDE/Editor/Tooling

  • EVCXR - a Rust REPL!!! With deps, and tab-completion for methods!!

  • comby-rust - rewrite Rust code using comby

  • no-panics-whatsoever - crate to detect and ensure at compile time there aren't panics in your code

  • RustAnalyzer - LSP-based plugin/server for IDE functionality in Sublime/VSCode/EMacs/etc

  • Cargo-play - run Rust scripts without needing to set up a project

    • Also see cargo-eval and runner for diff ways of easily running scripts without projects

Testing and CI/CD

The two standard property-testing crates are Quickcheck and proptest. Personally I prefer proptest due to much better control over input generation (via composable strategies, without having to implement the Arbitrary trait for your own types).
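The core idea, sketched with std only — proptest and Quickcheck automate the input generation below and add shrinking of failing cases; the hand-rolled generator here is just for illustration:

```rust
fn main() {
    // Property: reversing a Vec twice yields the original.
    fn prop_double_reverse(v: &[i32]) -> bool {
        let mut r: Vec<i32> = v.to_vec();
        r.reverse();
        r.reverse();
        r.as_slice() == v
    }

    // Hand-rolled pseudo-random inputs (proptest strategies would generate these).
    let mut seed: u64 = 0x9e37_79b9_7f4a_7c15;
    for _ in 0..100 {
        let len = (seed % 32) as usize;
        let v: Vec<i32> = (0..len)
            .map(|_| {
                seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
                (seed >> 33) as i32
            })
            .collect();
        assert!(prop_double_reverse(&v));
    }
}
```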

Cross-compilation

A common concern - how do I build different versions of my Rust lib/app for say OSX and also Linux?

  • Easiest way now seems to be to use cross - I tried it and literally as easy as cargo install cross and cross build --target ... as long as you have Docker.
    • NOTE: crates with non-Rust code (eg jemalloc, mimalloc) often have trouble
  • Also see rust-musl-builder, another Docker-based solution
  • musl is the best target for Linux as it removes the glibc dependency and its versioning issues. Musl creates a single static binary for super easy deploys.
  • For automation, maybe better to create a single Docker image which combines crossbuild (which has a recipe for OSXCross + other targets) with a rustup container like abronan/rust-circleci which allows building both nightly and stable. Use Docker multi-stage builds to make combining multiple images easier

Finally, the Taking Rust everywhere with Rustup blog post has a good guide on using rustup to install cross toolchains, but the above steps to install OS-specific linkers are still important.

Performance and Low-Level Stuff

A big part of the appeal of Rust for me is super fast, SAFE, built in UTF8 string processing, access to detailed memory layout, things like SIMD. Basically, to be able to idiomatically, safely, and beautifully (functionally?) do super fast and efficient data processing.

Rust nightly now has a super slick asm! inline assembly feature. The way that it integrates Rust variables/expressions with auto register assignment is super awesome.

NOTE: simplest way to increase perf may be to enable certain CPU instructions: set -x RUSTFLAGS "-C target-feature=+sse3,+sse4.2,+lzcnt,+avx,+avx2"

NOTE2: lazy_static accesses are not cheap. Don't use it in hot code paths.

Perf profiling:

NEW: I've created a Docker image for Linux perf profiling, super easy to use. The best combo is cargo flamegraph followed by perf and asm analysis.

  • cargo-flamegraph -- this is now the easiest way to get a FlameGraph on OSX and profile your Rust binaries. To make it work with bench and Criterion:

    • First run cargo bench to build your bench executable
    • If you haven't already, cargo install flamegraph (recommend at least v0.1.13)
    • sudo flamegraph target/release/bench-aba573ea464f3f67 --profile-time 180 <filter> --bench (replace bench-aba* with the name of your bench executable)
      • The --profile-time is needed for flamegraph to collect enough stats
    • open -a Safari flamegraph.svg
    • NOTE: you need to turn on debug = true in release profile for symbols
    • This method works better for apps than small benchmarks btw, as inlined methods won't show up in the graph.
  • Rust Performance: Perf and Flamegraph - including finding hot assembly instructions

  • Top-down Microarchitecture Analysis Method - TMAM is a formal microprocessor perf analysis method from Intel, works with perf to find out what CPU-level bottlenecks are (mem IO? branch predictions? etc.)

  • Rust Profiling with DTrace and FlameGraphs on OSX - probably the best bet (besides Instruments), can handle any native executable too

    • From @blaagh: though the predicate should be "/pid == $target/" rather than using execname.
    • DTrace Guide is probably pretty useful here
  • Hyperfine - Rust performance benchmarking CLI

  • Tools for Profiling Rust - cpuprofiler might possibly work on OSX. It does compile. The cpuprofiler crate requires surrounding blocks of your code though.

  • Rust Performance Profiling on Travis CI

  • Rust Profiling talk - discusses both OSX and Linux, as well as Instruments and Intel VTune

  • 2017 RustConf - Improving Rust Performance through Profiling

  • Flamer - an alternative to generating FlameGraphs if one is willing to instrument code. Warning: might require nightly Rust features.

  • Rust Profiling with Instruments on OSX - but apparently cannot export CSV to FlameGraph :(

  • cargo-profiler - only works in Linux :(

  • coz and its Cargo plugin, coz-rs -- "a new kind of profiler that unlocks optimization opportunities missed by traditional profilers. Coz employs a novel technique we call causal profiling that measures optimization potential"

For heap profiling try memory-profiler - written in Rust by the Nokia team!

  • stats_alloc can dump out incremental stats about allocation. Or just use jemalloc-ctl.
  • deepsize - macro to recursively find size of an object
  • Measuring Memory Usage in Rust - thoughts on working around the fact we don't have a GC to track deep memory usage

cargo-asm can dump out assembly or LLVM/IR output from a particular method. I have found this useful for really low level perf analysis. NOTE: if the method is generic, you need to give a "monomorphised" or filled out method. Also, methods declared inline won't show up.

  • What I like to do with asm output: check if rustc has inlined certain methods. Also you can clearly see where dynamic dispatch happens and how complicated generated code seems. More complicated code usually == slower.
  • llvm-mca - really detailed static analysis and runtime prediction at the machine instruction level

What I've found that works (but see cargo flamegraph above for easier way):

```shell
sudo dtrace -c './target/release/bench-2022f41cf9c87baf --profile-time 120' \
  -o out.stacks \
  -n 'profile-997 /pid == $target/ { @[ustack(100)] = count(); }'
~/src/github/FlameGraph/stackcollapse.pl out.stacks | ~/src/github/FlameGraph/flamegraph.pl > rust-bench.svg
open -a Safari rust-bench.svg
```

where -c bench.... is the executable output of cargo bench.

I was hoping cargo-with would allow us to run above dtrace command with the name of the bench output, but alas it doesn't seem to work with bench. (NOTE: they are working on a PR to fix this! :)

NOTE: The built-in cargo bench requires nightly Rust; it doesn't work on stable! For benchmarking I highly recommend criterion, which works on stable and has extra features such as gnuplot output, parameterized benchmarking, and run-to-run comparisons, as well as being able to run for a longer time to work with profiling such as dtrace.

Fast String Parsing

  • nom - a direct parser using macros, commonly accepted as fastest generic parser
  • pest is a PEG parser using an external, easy to understand syntax file. Not quite as fast but might be easier to understand and debug. There is also a book.
  • combine is a parser combinator library, supposedly just as fast as nom, syntax seems slightly easier

Bitpacking, Binary Structures, Serialization

  • bitpacking - insanely fast integer bitpacking library
  • packed_struct - bitfield packing/unpacking; can also pack arrays of bitfields; mixed endianness, etc.

The ideal performance-wise is to not need serialization at all; ie be able to read directly from portions of a binary byte slice. There are some libraries for doing this, such as flatbuffers, or flatdata for which there is a Rust crate; or Cap'n Proto. However, there may be times when you want more control or things like Cap'n Proto are not good enough.

How do we perform low-level byte/bit twiddling and precise memory access? Unfortunately, all structs in Rust basically need to have known sizes. There's something called dynamically sized types basically like slices where you can have the last element of a struct be an array of unknown size; however, they are virtually impossible to create and work with, and this only covers some cases anyhow. So we will unfortunately need a combination of techniques. In order of preference:
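For simple fixed-width fields, std alone already goes a long way before reaching for the crates below — from_le_bytes/from_be_bytes handle endianness explicitly (the record layout here is made up for illustration):

```rust
use std::convert::TryInto;

fn main() {
    // A 6-byte record: u32 little-endian id, then u16 little-endian flags.
    let buf: [u8; 6] = [0x2a, 0x00, 0x00, 0x00, 0x01, 0x00];

    let id = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let flags = u16::from_le_bytes(buf[4..6].try_into().unwrap());

    assert_eq!(id, 42);
    assert_eq!(flags, 1);
}
```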

  • Overall scroll is the best general-purpose struct serialization crate; it helps with reading integers and other fields too, and takes care of endianness. It generates pretty efficient code. It is a bit of a pain working with numeric enums however.
    • num_enum - a way to derive TryFrom for numeric enums helps a little bit.
  • I have found plain works really well. Mark your structs with #[repr(C)]. It only helps with size and alignment, not endianness - so maybe more for in-memory structures or when you are sure you don't need code to work across endianness platforms. If your structures are not aligned then use #[repr(C, packed)] or #[align(1)].
  • Use a crate such as bytes or scroll to help extract and write structs and primitives to/from buffers. Might need extra copying though. Also see iobuf
  • rel-ptr - small library for relative pointers/offsets, should be super useful for custom file formats and binary/persistent data structures
  • arrayref might help extract fixed size arrays from longer ones.
  • bytemuck for casts
  • bitmatch could be great for bitfield parsing
  • Or use the pod crate to help with some of the above conversions. However pod seems to no longer be maintained. nue and its macros can also help with struct alignment.
  • Allocate a Vec::<u8> and transmute specific portions to/from structs of known size, or convert pointers within regions back to references (note as_mut_ptr, since we write through the pointer):
    ```rust
    let foobar: *mut Foobar = mybytes[..].as_mut_ptr() as *mut Foobar;
    let foobar_ref: &mut Foobar = unsafe { foobar.as_mut() }.expect("Cannot convert foobar to ref");
    ```
  • Or structview which offers types for unaligned integers etc.
  • There are some DST crates worth checking out: slice-dst, thin-dst
  • As a last resort, work with raw pointer math using the add/sub/offset methods, but this is REALLY UNSAFE.
    ```rust
    let foobar: *mut Foobar = mybytes[..].as_mut_ptr() as *mut Foobar;
    unsafe {
        (*foobar).foo = 17;
        (*foobar).bar = -1;
    }
    ```

Want to zero memory quickly? Use slice::fill (in std since Rust 1.50, and recognized as a memset), or the slice_fill crate on older compilers.
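A quick std-only check that slice::fill (stable since Rust 1.50) zeroes a buffer:

```rust
fn main() {
    let mut buf = vec![0xffu8; 1024];
    buf.fill(0); // lowered to a memset by the compiler
    assert!(buf.iter().all(|&b| b == 0));
}
```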

Also check out the crazy number of crates available under compression - including various interesting radix and trie data structures, and more compression algorithms that one has never heard of.

SIMD

There is this great article on Towards fearless SIMD, about why SIMD is hard and how to make it easier, with pointers to many interesting crates doing SIMD. (There is a built-in module, std::simd, but it is really lacking; packed_simd was slated to be merged into it.)

Learning SIMD with Rust by finding planets is another great article. SIMD is really about parallelism: it is better to do multiple operations in a parallel (vertical) fashion, vector on vector, than horizontal operations where the different components of a wide register depend on one another.
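The vertical-vs-horizontal point shows up even without explicit intrinsics: a plain element-wise (vertical) loop over slices is exactly the shape the auto-vectorizer turns into SIMD adds, while a cross-lane reduction is not:

```rust
fn main() {
    // Vertical: lane i of the result depends only on lane i of the inputs,
    // so several f32 adds can happen in one wide register (auto-vectorizable).
    fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
        for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
            *o = x + y;
        }
    }

    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [10.0f32, 20.0, 30.0, 40.0];
    let mut out = [0.0f32; 4];
    add(&a, &b, &mut out);
    assert_eq!(out, [11.0, 22.0, 33.0, 44.0]);

    // Horizontal: summing across lanes creates a dependency chain and is slower.
    let total: f32 = out.iter().sum();
    assert_eq!(total, 110.0);
}
```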

  • ssimd - an effort to bring std::simd/packed_simd to Rust stable, with auto vectorization (meaning auto detect and implement code paths and fallbacks for when SIMD not available!)

  • faster - "SIMD for Humans" -- probably my favorite one, very high level translation of numeric map loops into SIMD

  • fearless_simd, the blog post author's crate. Runtime CPU detection and use of the most optimal code, no need for unsafe, but only focused on f32.

  • SIMDeez - abstracts intrinsic SIMD instructions over different instruction sets & vector widths, runtime detection

  • simd_aligned and simd_aligned_rust - work with SIMD and packed_simd using vectors which have guaranteed alignment

  • aligned - newtype with byte alignment, for stack or heap!

  • https://www.rustsim.org/blog/2020/03/23/simd-aosoa-in-nalgebra/

NOTE: shuffle in packed_simd is not very fast. Replace with native instructions if possible.