Stars
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
DeepEP: an efficient expert-parallel communication library
thunlp / Seq1F1B
Forked from NVIDIA/Megatron-LM. Sequence-level 1F1B schedule for LLMs.
Zotero completion source for nvim-cmp using zotcite as backend.
Efficient Training (including pre-training and fine-tuning) for Big Models
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Zero Bubble Pipeline Parallelism
MayDomine / Seq1F1B
Forked from NVIDIA/Megatron-LM. Sequence-level 1F1B schedule for LLMs.
Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Automated authentication script for the BUPT (Beijing University of Posts and Telecommunications) campus network gateway. Supports wired and wireless networks, Portal authentication with parameters, AC redirect, and automatic reconnection after dropouts. Cross-platform. BUPT Network Login.
Development repository for the Triton language and compiler
Tool Learning for Big Models, Open-Source Solutions for ChatGPT-Plugins
MayDomine / flash-attention
Forked from Dao-AILab/flash-attention. Add attention mask for flash attention.
MayDomine / BMTrain
Forked from OpenBMB/BMTrain. Efficient Training (including pre-training and fine-tuning) for Big Models.
Elegant and Powerful. Powered by OpenAI and Vercel.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Real Transformer TeraFLOPS on various GPUs