Building real-time speech-text foundation models capable of understanding and generating speech has garnered significant attention; notable examples include GPT-4o and Moshi. However, training such models remains challenging for the research community. We introduce RSTnet, a new open-source platform for developing real-time speech-text foundation models. RSTnet offers a comprehensive framework for data processing, pre-training, and fine-tuning, aimed at helping researchers build their real-time speech-text models efficiently. It builds upon previous works, such as the real-time spoken dialogue model Moshi and the universal audio generation model UniAudio. RSTnet consists of the following key components: (1) data preparation; (2) streaming audio codec models; (3) speech-text foundation models; (4) benchmark and evaluation.
- 2024.10.7: We released the first version of RSTnet.
The project is still ongoing. If you are interested in RSTnet, you are welcome to contribute. You can:
- [1] Open an issue or PR to fix bugs
- [2] Propose new ideas about data collection, streaming codecs, and speech-text foundation models
- [3] Join us as an author of this project (contact me at [email protected])
```bash
conda create -n RSTnet python=3.12
conda activate RSTnet
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tqdm librosa==0.9.1 matplotlib omegaconf einops
pip install vector_quantize_pytorch tensorboard deepspeed peft
```
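After installation, a quick sanity check like the one below can confirm that the environment is working; note that the cu121 wheels above assume a CUDA 12.1-compatible driver, and the check simply falls through on CPU-only machines.

```python
# Sanity check for the RSTnet environment (run after the install steps above).
import torch
import torchaudio
import librosa

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"torchaudio {torchaudio.__version__}, librosa {librosa.__version__}")

# Verify that a tensor round-trips through the GPU (skipped on CPU-only machines).
if torch.cuda.is_available():
    x = torch.randn(1, 24000, device="cuda")
    print("GPU tensor OK:", x.mean().item())
```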
You can find our technical report at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf
More details will be added soon. For now, please refer to the DataPipeline part.
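To give a flavor of the kind of preprocessing a data pipeline performs, here is a minimal sketch that downmixes raw audio to mono and resamples it to the codec sample rate. This is illustrative rather than the actual DataPipeline scripts: the directory names are hypothetical, and the 24 kHz target matches Mimi (adjust it for other codecs).

```python
# Minimal preprocessing sketch: mono downmix + resample to the codec rate.
from pathlib import Path

import torchaudio

TARGET_SR = 24000  # Mimi operates on 24 kHz mono audio

def prepare_wav(in_path: str, out_path: str) -> None:
    wav, sr = torchaudio.load(in_path)       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)      # downmix to mono
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    torchaudio.save(out_path, wav, TARGET_SR)

# "raw_audio" and "processed" are placeholder directory names.
Path("processed").mkdir(exist_ok=True)
for f in Path("raw_audio").glob("*.wav"):
    prepare_wav(str(f), f"processed/{f.name}")
```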
We plan to support more SOTA streaming audio codecs. Currently, we have reproduced MimiCodec.
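To illustrate what "streaming" means for a codec, the runnable sketch below feeds audio to a codec one frame at a time, as a real-time system would. The stub class is a stand-in for a real codec such as MimiCodec (see the recipe in this repo for the actual model and checkpoint loading); the 1920-sample frame corresponds to 80 ms at 24 kHz, i.e. Mimi's 12.5 Hz token rate.

```python
# Frame-by-frame streaming interface sketch (stub codec, illustrative only).
import torch

FRAME_SIZE = 1920  # 80 ms at 24 kHz -> 12.5 Hz token rate, as in Mimi

class StubStreamingCodec:
    """Stand-in for a streaming codec such as MimiCodec."""
    n_codebooks = 8

    def encode(self, frame: torch.Tensor) -> torch.Tensor:
        # A real codec returns discrete codebook indices for each frame.
        return torch.zeros(frame.shape[0], self.n_codebooks, 1, dtype=torch.long)

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        # A real codec reconstructs the waveform frame from the codes.
        return torch.zeros(codes.shape[0], 1, FRAME_SIZE)

codec = StubStreamingCodec()
wav = torch.randn(1, 1, 24000 * 5)  # (batch, channels, samples): 5 s of audio

# Feed the codec one frame at a time, as a real-time system would.
for start in range(0, wav.shape[-1] - FRAME_SIZE + 1, FRAME_SIZE):
    frame = wav[..., start:start + FRAME_SIZE]
    codes = codec.encode(frame)   # (batch, n_codebooks, 1) discrete tokens
    audio = codec.decode(codes)   # reconstructed frame, emitted with low latency
```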
We have released the fine-tuning code for Moshi. Next, we will release the full recipes for pre-training and post-training. Furthermore, we are considering adding other SOTA speech-text foundation paradigms.
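As a rough illustration of parameter-efficient fine-tuning with peft (which the environment above installs), here is a minimal sketch that wraps a model with LoRA adapters. The toy model and the module names in `target_modules` are hypothetical; with a real speech-text checkpoint, the target module names must match the model's own attention projections.

```python
# Minimal LoRA fine-tuning sketch with peft (toy model, illustrative only).
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class ToySpeechTextLM(nn.Module):
    """Stand-in for a speech-text foundation model."""
    def __init__(self, dim: int = 256, vocab: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.q_proj = nn.Linear(dim, dim)  # attention projections are the
        self.v_proj = nn.Linear(dim, dim)  # usual LoRA targets
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens)
        h = self.q_proj(h) + self.v_proj(h)
        return self.head(h)

config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # names depend on the actual model
)
model = get_peft_model(ToySpeechTextLM(), config)
model.print_trainable_parameters()  # only the adapters require gradients
```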
The implementations of the streaming audio codec and the speech-text language models are based on previous codebases: https://github.com/kyutai-labs/moshi and https://github.com/yangdongchao/UniAudio
```bibtex
@techreport{RSTnet,
  title={RSTnet: Real-time Speech-Text Foundation Model Toolkit},
  author={RSTnet team},
  note={Technical report},
  year={2024}
}
```
```bibtex
@techreport{kyutai2024moshi,
  author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
          Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  title={Moshi: a speech-text foundation model for real-time dialogue},
  institution={Kyutai},
  year={2024},
  month={September},
  url={http://kyutai.org/Moshi.pdf},
}
```
```bibtex
@article{yang2023uniaudio,
  title={UniAudio: An Audio Foundation Model Toward Universal Audio Generation},
  author={Dongchao Yang and Jinchuan Tian and Xu Tan and Rongjie Huang and Songxiang Liu and Xuankai Chang and Jiatong Shi and Sheng Zhao and Jiang Bian and Xixin Wu and Zhou Zhao and Helen Meng},
  journal={arXiv preprint arXiv:2310.00704},
  year={2023}
}
```