Commit

update Languagecodec_v2
novateurjsp committed Apr 25, 2024
1 parent 670e370 commit 68e855a
Showing 80 changed files with 443 additions and 66 deletions.
50 changes: 28 additions & 22 deletions README.md
@@ -1,8 +1,19 @@
# Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
# Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

[Audio samples](https://languagecodec.github.io) |
Paper [[abs]](https://arxiv.org/abs/2402.12208) [[pdf]](https://arxiv.org/pdf/2402.12208.pdf)

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/pdf/2402.12208.pdf)
[![demo](https://img.shields.io/badge/Languagecodec-Demo-red)](https://languagecodec.github.io)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20Languagecodec-Models-blue)](https://huggingface.co/amphion/naturalspeech3_facodec)


# 🔥 News
- *2024.04*: We updated Languagecodec and released a more powerful checkpoint.
- *2024.02*: We released Languagecodec on arXiv.

![result](result.png)


## Installation

@@ -20,49 +31,49 @@ pip install -r requirements.txt

```python

from encodec.utils import convert_audio
from languagecodec_encoder.utils import convert_audio
import torchaudio
import torch
from vocos.pretrained import Vocos
from languagecodec_decoder.pretrained import Vocos

device=torch.device('cpu')

config_path = "xxx/languagecodec/configs/languagecodec.yaml"
model_path = "xxx/xxx.ckpt"
audio_outpath = "xxx"
vocos = Vocos.from_pretrained0802(config_path, model_path)
vocos = vocos.to(device)
languagecodec = Vocos.from_pretrained0802(config_path, model_path)
languagecodec = languagecodec.to(device)

wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)
bandwidth_id = torch.tensor([0])
wav=wav.to(device)
features,discrete_code= vocos.encode(wav, bandwidth_id=bandwidth_id)
audio_out = vocos.decode(features, bandwidth_id=bandwidth_id)
features,discrete_code= languagecodec.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = languagecodec.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save(audio_outpath, audio_out, sample_rate=24000, encoding='PCM_S', bits_per_sample=16)
```


### Part2: Generating discrete codecs
```python

from encodec.utils import convert_audio
from languagecodec_encoder.utils import convert_audio
import torchaudio
import torch
from vocos.pretrained import Vocos
from languagecodec_decoder.pretrained import Vocos

device=torch.device('cpu')

config_path = "xxx/languagecodec/configs/languagecodec.yaml"
model_path = "xxx/xxx.ckpt"
vocos = Vocos.from_pretrained0802(config_path, model_path)
vocos = vocos.to(device)
languagecodec = Vocos.from_pretrained0802(config_path, model_path)
languagecodec = languagecodec.to(device)

wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)
bandwidth_id = torch.tensor([0])
wav=wav.to(device)
_,discrete_code= vocos.encode(wav, bandwidth_id=bandwidth_id)
_,discrete_code= languagecodec.encode_infer(wav, bandwidth_id=bandwidth_id)
print(discrete_code)
```

@@ -71,24 +82,19 @@ print(discrete_code)
### Part3: Audio reconstruction through codecs
```python
# audio_tokens [n_q,1,t]/[n_q,t]
features = vocos.codes_to_features(audio_tokens)
features = languagecodec.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([0])
audio_out = vocos.decode(features, bandwidth_id=bandwidth_id)
audio_out = languagecodec.decode(features, bandwidth_id=bandwidth_id)
```
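
The README snippets in this diff move the imports from `encodec.utils` and `vocos.pretrained` to `languagecodec_encoder.utils` and `languagecodec_decoder.pretrained`. Scripts written against the old package names could be updated mechanically. The following is a minimal sketch, assuming only the top-level package names changed; the `PACKAGE_RENAMES` table and the `migrate_imports` helper are illustrative and not part of the repository.

```python
import re

# Old top-level package -> new top-level package, as seen in this commit's diff.
# Assumption: only the package prefix changed; submodule names stayed the same.
PACKAGE_RENAMES = {
    "encodec": "languagecodec_encoder",
    "vocos": "languagecodec_decoder",
}

# Matches "from <pkg>" or "import <pkg>" at the start of a line. The lookahead
# requires "." or whitespace (or end of line) after the package name, so
# unrelated names such as "vocos_extra" are left alone.
_IMPORT_RE = re.compile(
    r"^(\s*(?:from|import)\s+)(" + "|".join(PACKAGE_RENAMES) + r")(?=[.\s]|$)",
    flags=re.MULTILINE,
)


def migrate_imports(source: str) -> str:
    """Rewrite old package names in import statements only."""
    return _IMPORT_RE.sub(
        lambda m: m.group(1) + PACKAGE_RENAMES[m.group(2)], source
    )


if __name__ == "__main__":
    old_script = (
        "from encodec.utils import convert_audio\n"
        "from vocos.pretrained import Vocos\n"
    )
    print(migrate_imports(old_script))
```

Only import statements are rewritten, so variable names such as `vocos` and the class name `Vocos` are untouched. The `class_path` entries in the YAML config are not covered by this helper and would need the same rename by hand.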




## Pre-trained models

Currently, we have only released the results from our paper, and we plan to release additional checkpoints trained on a larger training dataset within the next two months.

Notice: We will release a better language-codec checkpoint before 5.15, and further revise the paper.

| Model Name | Dataset | Training Iterations
-------------------------------------------------------------------------------------|---------------|---------------------
| [languagecodec_paper_8nq](https://drive.google.com/file/d/109ectu4NJWFCpmrqc31wdXvkTI6U2nMA/view?usp=drive_link) | 3W Hours | 2.0 M
| [languagecodec_chinese_8nq](https://drive.google.com/file/d/18JpINstfF2YrbFg6nqs3BVn0oxdLsuUm/view?usp=drive_link) | 2W Chinese Hours | 2.0 M
| [languagecodec_paper_8nq](https://drive.google.com/file/d/109ectu4NJWFCpmrqc31wdXvkTI6U2nMA/view?usp=drive_link) | 5W Hours | 2.0 M

## Training

@@ -99,7 +105,7 @@ Notice: We will release a better language-codec checkpoint before 5.15, and furt

### Step2: Modifying configuration files
```python
# xxx/languagecodec/configs/languagecodec.yaml
# xxx/languagecodec/configs/languagecodec_mm.yaml
# Modify the values of parameters such as batch_size, filelist_path, save_dir, device
```
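
Step2 asks for values such as `batch_size` and `filelist_path` to be edited in the config, and the config has separate `train_params` and `val_params` blocks that reuse the same key names. The following is a rough sketch of a block-aware edit on the YAML text using only the standard library. The `set_param` helper is hypothetical, and a proper YAML library would be more robust for anything beyond simple `key: value` lines.

```python
import re


def set_param(yaml_text: str, block: str, key: str, value: str) -> str:
    """Set `key: value` inside the first mapping named `block`.

    Works line by line and relies on indentation to detect where the
    block ends; it does not parse YAML in general.
    """
    out = []
    in_block = False
    block_indent = -1
    done = False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if not done and stripped == f"{block}:":
            # Entering the requested block, e.g. "train_params:".
            in_block, block_indent = True, indent
        elif in_block and stripped and indent <= block_indent:
            # A non-blank line at the same or lower indentation ends the block.
            in_block = False
        elif in_block and re.match(rf"{re.escape(key)}\s*:", stripped):
            # Replace the value, keeping the original indentation.
            line = f"{line[:indent]}{key}: {value}"
            in_block, done = False, True
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    cfg = (
        "data:\n"
        "  init_args:\n"
        "    train_params:\n"
        "      filelist_path: xxx/xxx\n"
        "      batch_size: 100\n"
        "\n"
        "    val_params:\n"
        "      filelist_path: xxx/xxx\n"
        "      batch_size: 10\n"
    )
    # "/data/val_filelist.txt" is a made-up path for illustration.
    print(set_param(cfg, "val_params", "filelist_path", "/data/val_filelist.txt"))
```

Keeping machine-specific paths out of the committed config and applying them this way at launch time is one option for avoiding hard-coded local directories.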

@@ -109,7 +115,7 @@ training pipeline.

```bash
cd xxx/languagecodec
python train.py fit --config xxx/languagecodec/configs/languagecodec.yaml
python train.py fit --config xxx/languagecodec/configs/languagecodec_mm.yaml
```


24 changes: 12 additions & 12 deletions configs/languagecodec.yaml → configs/languagecodec_mm.yaml
@@ -1,24 +1,24 @@
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  class_path: languagecodec_decoder.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: xxx/xxx
      filelist_path: /home/jovyan/honor/big-disk/speech/code/languagecodec/data/train/languagecodec_ch_en
      sampling_rate: 24000
      num_samples: 24000
      batch_size: 100
      num_workers: 8

    val_params:
      filelist_path: xxx/xxx
      filelist_path: /home/jovyan/honor/big-disk/speech/code/languagecodec/data/train/languagecodec_large_val
      sampling_rate: 24000
      num_samples: 24000
      batch_size: 10
      num_workers: 8

model:
  class_path: vocos.experiment.VocosEncodecExp
  class_path: languagecodec_decoder.experiment.VocosEncodecExp
  init_args:
    sample_rate: 24000
    initial_learning_rate: 2e-4
@@ -33,27 +33,27 @@ model:
    evaluate_periodicty: true

    resume: false
    resume_config: xxx/config.yaml
    resume_model: xxx/xxxx.ckpt
    resume_config: /home/jovyan/honor/big-disk/speech/code/languagecodec/result/train/languagecodec_mm/lightning_logs/version_1/config.yaml
    resume_model: /home/jovyan/honor/big-disk/speech/code/languagecodec/result/train/languagecodec_mm/lightning_logs/version_1/checkpoints/vocos_checkpoint_epoch=7_step=1268768_val_loss=2.9373.ckpt

    feature_extractor:
      class_path: vocos.feature_extractors.EncodecFeatures
      class_path: languagecodec_decoder.feature_extractors.EncodecFeatures
      init_args:
        encodec_model: encodec_24khz
        bandwidths: [6.6, 6.6, 6.6, 6.6]
        train_codebooks: true

    backbone:
      class_path: vocos.models.VocosBackbone
      class_path: languagecodec_decoder.models.VocosBackbone
      init_args:
        input_channels: 128
        dim: 384
        intermediate_dim: 1152
        num_layers: 8
        num_layers: 12
        adanorm_num_embeddings: 4 # len(bandwidths)

    head:
      class_path: vocos.heads.ISTFTHead
      class_path: languagecodec_decoder.heads.ISTFTHead
      init_args:
        dim: 384
        n_fft: 1280
@@ -76,7 +76,7 @@ trainer:
        filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f}
        save_top_k: 50
        save_last: true
    - class_path: vocos.helpers.GradNormCallback
    - class_path: languagecodec_decoder.helpers.GradNormCallback

  # Lightning calculates max_steps across all optimizer steps (rather than number of batches)
  # This equals to 1M steps per generator and 1M per discriminator
@@ -85,5 +85,5 @@ trainer:
  limit_val_batches: 100
  accelerator: gpu
  strategy: ddp
  devices: [4,5,6,7]
  devices: [0,1,2,3,4,5,6,7]
  log_every_n_steps: 1000
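
The trainer comment notes that Lightning counts `max_steps` across all optimizer steps, and this commit widens `devices` from four GPUs to eight. The arithmetic can be sketched as below. The `max_steps` line itself is not shown in this diff, so the value `2_000_000` is an assumption inferred from the "1M steps per generator and 1M per discriminator" comment. The `TrainBudget` and `training_budget` names are illustrative. The sketch also assumes `batch_size` is applied per device, which is how Lightning's DDP strategy treats the DataLoader batch size.

```python
from dataclasses import dataclass


@dataclass
class TrainBudget:
    steps_per_optimizer: int
    clips_per_step: int
    audio_seconds_per_step: float


def training_budget(
    max_steps: int,
    num_optimizers: int,
    batch_size: int,
    num_devices: int,
    num_samples: int,
    sampling_rate: int,
) -> TrainBudget:
    # Under DDP every device draws its own batch of `batch_size` clips.
    clips = batch_size * num_devices
    return TrainBudget(
        # max_steps is shared between the generator and discriminator optimizers.
        steps_per_optimizer=max_steps // num_optimizers,
        clips_per_step=clips,
        # Each clip is num_samples / sampling_rate seconds long.
        audio_seconds_per_step=clips * num_samples / sampling_rate,
    )


# max_steps=2_000_000 is assumed; the other values come from the config diff.
new = training_budget(2_000_000, 2, 100, 8, 24_000, 24_000)  # devices: [0..7]
old = training_budget(2_000_000, 2, 100, 4, 24_000, 24_000)  # devices: [4..7]
print(new)
print(old)
```

Under these assumptions, doubling the device count doubles the audio seen per optimizer step (from 400 to 800 one-second clips) while the number of steps per optimizer stays the same.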
Binary file not shown.
4 changes: 4 additions & 0 deletions languagecodec_decoder/__init__.py
@@ -0,0 +1,4 @@
from languagecodec_decoder.pretrained import Vocos


__version__ = "0.0.3"
Binary file not shown.
5 changes: 4 additions & 1 deletion vocos/dataset.py → languagecodec_decoder/dataset.py
@@ -7,6 +7,7 @@
from torch.utils.data import Dataset, DataLoader

import soundfile
# import librosa

torch.set_num_threads(1)

@@ -54,7 +55,9 @@ def __len__(self) -> int:
    def __getitem__(self, index: int) -> torch.Tensor:
        audio_path = self.filelist[index]
        # y, sr = torchaudio.load(audio_path)
        # print(audio_path,"111")
        y1, sr = soundfile.read(audio_path)
        # y1, sr = librosa.load(audio_path,sr=None)
        y = torch.tensor(y1).float().unsqueeze(0)
        # if y.size(0) > 1:
        #     # mix to mono
@@ -78,4 +81,4 @@ def __getitem__(self, index: int) -> torch.Tensor:
            # During validation, take always the first segment for determinism
            y = y[:, : self.num_samples]

        return y[0]
        return y[0]
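
The last hunk shows the validation branch of `__getitem__`, which always takes the first `num_samples` samples so that validation is deterministic. The training branch and the handling of short clips fall outside the lines shown. The sketch below illustrates the general fixed-length cropping pattern on a plain Python list. The random training crop and the zero-padding of short clips are assumptions, not taken from the diff, and `crop_segment` is a hypothetical name.

```python
import random
from typing import List, Optional


def crop_segment(
    samples: List[float],
    num_samples: int,
    train: bool,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return exactly `num_samples` samples from a mono waveform."""
    if len(samples) < num_samples:
        # Assumption: short clips are zero-padded on the right.
        return samples + [0.0] * (num_samples - len(samples))
    if train:
        # Assumption: training uses a random start offset.
        picker = rng or random
        start = picker.randint(0, len(samples) - num_samples)  # inclusive bounds
        return samples[start:start + num_samples]
    # Validation: always the first segment, matching the comment in the diff.
    return samples[:num_samples]


if __name__ == "__main__":
    wav = [float(i) for i in range(10)]
    print(crop_segment(wav, 4, train=False))
    print(crop_segment(wav, 4, train=True, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the training crop reproducible in tests without touching the global random state.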