Last thought: training the discriminator without updating G should actually work (does it?)
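A minimal sketch of that check, with `generator`, `discriminator`, and `d_optimizer` as assumed placeholders (not the repo's actual API); only the discriminator's variables receive gradients, so G stays frozen:

```python
import tensorflow as tf

# Sketch only: all names (generator, discriminator, d_optimizer) are assumptions.
@tf.function
def discriminator_step(batch, generator, discriminator, d_optimizer):
  # Run G outside the tape and in inference mode: it receives no updates.
  audio_gen = generator(batch, training=False)
  with tf.GradientTape() as tape:
    logits_real = discriminator(batch['audio'])
    logits_gen = discriminator(audio_gen)
    # LSGAN-style MSE objective: real -> 1, generated -> 0.
    d_loss = (tf.reduce_mean((logits_real - 1.0) ** 2)
              + tf.reduce_mean(logits_gen ** 2))
  grads = tape.gradient(d_loss, discriminator.trainable_variables)
  d_optimizer.apply_gradients(zip(grads, discriminator.trainable_variables))
  return d_loss
```

If `d_loss` drops while G's outputs stay fixed, the discriminator is learning on its own.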
# Next steps:
- TIMBRE INTERPOLATION
  Target: specify a model and two monophonic wav files
  Preprocess the samples, collect dataset statistics
  Copy the relevant parts from the timbre transfer notebook
- move the timbre interpolation notebook to Colab
- add a script to download from S3
- test the GAN:
  - train the discriminator only to see if it works (cf. the sketch above)
  - verify that the decoder actually changes when trained as a GAN
- implement a gated residual block (see the sketch after this list)
- 04_hierarchical_timbre_painting
- 05_multi_room_reverb
- implement a reverb module that takes latent parameters
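For the gated residual block item above, a WaveNet/ParallelWaveGAN-style sketch (channel counts and layer layout are my assumptions):

```python
import tensorflow as tf

class GatedResidualBlock(tf.keras.layers.Layer):
  """Gated residual block sketch: dilated conv -> tanh*sigmoid gate ->
  1x1 convs for the residual and skip paths."""

  def __init__(self, channels=64, kernel_size=3, dilation=1, **kwargs):
    super().__init__(**kwargs)
    # One dilated conv producing both the filter and the gate halves.
    self.dilated_conv = tf.keras.layers.Conv1D(
        2 * channels, kernel_size, dilation_rate=dilation, padding='same')
    self.res_conv = tf.keras.layers.Conv1D(channels, 1)   # residual path
    self.skip_conv = tf.keras.layers.Conv1D(channels, 1)  # skip path

  def call(self, x):
    # x: [batch, time, channels]; the residual add assumes the input
    # already has `channels` channels.
    h = self.dilated_conv(x)
    filt, gate = tf.split(h, num_or_size_splits=2, axis=-1)
    h = tf.tanh(filt) * tf.sigmoid(gate)
    return x + self.res_conv(h), self.skip_conv(h)
```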
# Later:
- set the correct learning rate and optimizer for the discriminator
- timbre painting training stages will not be handled inside the trainer loop; instead, the trainer is called once per stage, and we use some generic way to create a model from pretrained modules (see the sketch after this list)
- add hyperparams for d_optimizer to trainer
- add hyperparams for d_optimizer to timbrepainting.gin
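A sketch of that per-stage trainer invocation; `build_model`, `get_trainer`, the sample-rate schedule, and `steps_per_stage` are all hypothetical placeholders:

```python
# Hypothetical stage-wise loop: the trainer is called once per stage, and
# each stage is built from the previous stage's pretrained modules.
steps_per_stage = 10000  # placeholder
pretrained_modules = None
for stage, sample_rate in enumerate([2000, 4000, 8000, 16000]):
  model = build_model(stage=stage,
                      sample_rate=sample_rate,
                      pretrained=pretrained_modules)
  trainer = get_trainer(model)            # fresh trainer per stage
  trainer.train(num_steps=steps_per_stage)
  pretrained_modules = model.modules      # hand over to the next stage
```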
# Training jobs:
With harmonic audio (urmp-mono) only:
Target: find a reverb module that can learn to handle arbitrary room acoustics; find the best configuration for the standard DDSP Autoencoder
- Standard DDSP Autoencoder with trainable reverb for single room
- Standard DDSP Autoencoder with custom reverb
Target: evaluate extensions for Timbre Painting (multiple instruments, reverb)
- Timbre Painting on a single instrument (original)
- Timbre Painting with multiple instruments and custom reverb
- Evaluate Timbre Painting with harmonic sine waves and random noise as input channels
Target: DDSP GAN vs. extended Timbre Painting GAN
- Train DDSP GAN
Target: find the effect of the adversarial loss
- Best model with adversarial loss only
With percussion:
- Timbre Painting + z vector + reverb vector
- DDSP (GAN/AE) + z vector + reverb vector
With singing:
- Timbre Painting + z vector + reverb vector
- DDSP (GAN/AE) + z vector + reverb vector
# Questions
- Why is scale_fn included in, e.g., the Additive synth? Why wouldn't I just scale the output in the network that produces the input? (See the sketch below.)
- Where is this scale_fn actually used (by which applications)? Loudness is scaled by default, but isn't it computed by a fixed algorithm?
- What is the difference between loudness and amplitude?
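For context on the scale_fn questions: in ddsp, the synth processors apply a scale_fn to the raw network outputs, so the valid range (positive values, soft upper bound) is enforced by the synth itself rather than by whichever network produced the controls. The default is ddsp.core.exp_sigmoid, roughly:

```python
import tensorflow as tf

# Roughly ddsp.core.exp_sigmoid, the default scale_fn of the synth
# processors: maps unconstrained outputs into (threshold, max_value + threshold)
# with an exponential-ish response curve.
def exp_sigmoid(x, exponent=10.0, max_value=2.0, threshold=1e-7):
  return max_value * tf.nn.sigmoid(x) ** tf.math.log(exponent) + threshold
```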
# Random ideas
- learn an encoding of voice vs. lyrics (see the sketch below):
  the voice component is constant over time for a training sample
  the lyrics component is not averaged over time (i.e., as it is now)
  there is some cost to using the lyrics encoder (L2 loss + noise)
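One possible instantiation of that idea; the mean/residual split, penalty weight, and noise level are all my assumptions:

```python
import tensorflow as tf

# Sketch: z_voice is pooled to one time-constant vector per sample; z_lyrics
# stays time-varying but is made "expensive" via an L2 penalty plus noise.
def split_voice_lyrics(z, noise_stddev=0.1, l2_weight=1e-3):
  # z: [batch, time, depth] from the shared encoder.
  z_voice = tf.reduce_mean(z, axis=1, keepdims=True)  # constant over time
  z_lyrics = z - z_voice                              # time-varying residual
  # Using the lyrics channel costs something: an L2 penalty plus noise.
  lyrics_cost = l2_weight * tf.reduce_mean(z_lyrics ** 2)
  z_lyrics += tf.random.normal(tf.shape(z_lyrics), stddev=noise_stddev)
  return z_voice, z_lyrics, lyrics_cost
```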
# New Classes:
synths.py:
  BasicUpsampler
  - like the Additive synth, but without harmonics
decoders.py:
  TimbrePaintingDecoder(Decoder) <- BasicUpsampler, ParallelWaveGANUpsampler
  - combines a BasicUpsampler with a stack of ParallelWaveGANUpsamplers
  Upsampler
  - upsamples all conditioning features, including the input audio, to the target sample rate, then applies dilated conv stacks to them (see the sketch below)
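A sketch of that Upsampler, assuming linear interpolation along time and hypothetical channel/dilation settings:

```python
import tensorflow as tf

def upsample(f, target_length):
  """Linearly interpolate [batch, time, depth] to [batch, target_length, depth]."""
  f = f[:, :, tf.newaxis, :]                  # -> [batch, time, 1, depth]
  f = tf.image.resize(f, [target_length, 1])  # bilinear along the time axis
  return f[:, :, 0, :]

class Upsampler(tf.keras.layers.Layer):
  """Sketch: bring every conditioning feature (f0, loudness, z, input audio)
  to the target sample rate, concatenate, then run a dilated conv stack."""

  def __init__(self, channels=64, dilations=(1, 2, 4, 8), **kwargs):
    super().__init__(**kwargs)
    self.convs = [tf.keras.layers.Conv1D(channels, 3, dilation_rate=d,
                                         padding='same', activation='relu')
                  for d in dilations]

  def call(self, features, target_length):
    # features: dict of [batch, time_i, depth_i] tensors at various rates;
    # audio should be shaped [batch, n_samples, 1].
    x = tf.concat([upsample(f, target_length) for f in features.values()],
                  axis=-1)
    for conv in self.convs:
      x = conv(x)
    return x
```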
discriminator.py:
  Discriminator
  - takes a dict with the controls and the target audio
  - can decide by itself which of these inputs to use
  - the evaluated audio is always called "discriminate_audio"
  - with this setup, we can model any kind of conditioning information (see the sketch below)
  ParallelWaveGANDiscriminator(Discriminator)
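The dict-based Discriminator interface above might look like this (everything beyond the 'discriminate_audio' key is an assumption):

```python
import tensorflow as tf

class Discriminator(tf.keras.Model):
  """Sketch of the dict-based interface. The audio to be judged always
  arrives under 'discriminate_audio'; subclasses pick whichever conditioning
  keys they care about (f0_hz, loudness_db, z, ...) and ignore the rest."""

  def call(self, inputs):
    audio = inputs['discriminate_audio']  # the one required key
    return self.score(audio, inputs)

  def score(self, audio, conditioning):
    raise NotImplementedError

# Usage: real and generated audio flow through the same interface, e.g.
#   logits_real = disc({**controls, 'discriminate_audio': batch['audio']})
#   logits_gen  = disc({**controls, 'discriminate_audio': audio_gen})
```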
gan.py:
  implements the GAN train step; otherwise behaves like an autoencoder
losses.py:
  AdversarialMSELoss(Loss) <- Discriminator (see the sketch below)
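A sketch of AdversarialMSELoss as an LSGAN-style generator loss wrapped around the dict interface above; written as a layer here, mirroring how ddsp implements losses as modules:

```python
import tensorflow as tf

class AdversarialMSELoss(tf.keras.layers.Layer):
  """LSGAN-style generator loss around a dict-based discriminator (sketch)."""

  def __init__(self, discriminator, **kwargs):
    super().__init__(**kwargs)
    self.discriminator = discriminator

  def call(self, controls, audio_gen):
    # Generator objective: the discriminator should score fakes as real (1).
    logits = self.discriminator({**controls, 'discriminate_audio': audio_gen})
    return tf.reduce_mean((logits - 1.0) ** 2)
```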
trainer.py:
  __init__:
  - create the discriminator optimizer
  - define a learning schedule for the different steps and the number of generators to use in each step
  scheduler:
  - create a tf.function with:
    - control flow to execute the G or D step
    - control flow to set the number of upsamplers
    - control flow that depends only on tf.Tensor objects (see the sketch after this outline)
step functions of the different models:
  Autoencoder:
  - classical train_step
  ParallelWaveGANUpsampler:
  - can take another upsampler from which to copy the initial weights
  - downsamples the target to the output sample rate
  Discriminator:
  - classical train_step
  TimbrePainting:
  - sub-models: ParallelWaveGANUpsampler, Discriminator
  - upsampler.train_step(batch)
  - split the batch and reconstructions into audio_real, audio_gen
  - discriminator.train_step(audio_real, audio_gen)
  - instead of creating a new discriminator and copying the weights, we just keep using the one we already have
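A sketch of the scheduler idea: one tf.function whose branches depend only on tensor values, so alternating G/D steps and growing the number of upsamplers doesn't trigger retracing. Names and the stage schedule are assumptions:

```python
import tensorflow as tf

@tf.function
def scheduled_step(batch, step, g_step_fn, d_step_fn,
                   steps_per_stage=10000, max_upsamplers=4):
  # Number of active upsamplers grows with the tensor-valued step counter.
  num_upsamplers = tf.minimum(step // steps_per_stage + 1, max_upsamplers)
  # Alternate updates: even steps train G, odd steps train D.
  return tf.cond(step % 2 == 0,
                 lambda: g_step_fn(batch, num_upsamplers),
                 lambda: d_step_fn(batch, num_upsamplers))
```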
model.py:
  the losses dict gets an additional entry: 'discriminator_loss'
# Components:
MonoEncoder: Audio -> (f0, loudness_db, z_timbre, z_reverb)
- Pretrained CREPE + MFCC-RNN
- Pretrained CREPE + Dilated Gated Conv Architecture
MonoDecoder: (f0, loudness_db, z_timbre, z_reverb) -> Audio
- NN and Harmonic+noise
- TP
MonoDiscriminator: (f0, loudness_db, z_timbre, z_reverb, Audio) -> [0, 1]
PolyEncoder: Audio -> (f0, loudness_db, z_timbre, z_reverb)^num_tracks
PolyDecoder: (f0, loudness_db, z_timbre, z_reverb)^num_tracks -> Audio
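A minimal interface sketch of the mono components above (class names follow the outline; the dict keys are assumptions):

```python
import tensorflow as tf
from typing import Dict

class MonoEncoder(tf.keras.Model):
  def call(self, audio: tf.Tensor) -> Dict[str, tf.Tensor]:
    # audio [batch, n_samples] ->
    # {'f0_hz', 'loudness_db', 'z_timbre', 'z_reverb'}
    raise NotImplementedError

class MonoDecoder(tf.keras.Model):
  def call(self, controls: Dict[str, tf.Tensor]) -> tf.Tensor:
    # (f0, loudness_db, z_timbre, z_reverb) -> audio [batch, n_samples]
    raise NotImplementedError

class MonoDiscriminator(tf.keras.Model):
  def call(self, inputs: Dict[str, tf.Tensor]) -> tf.Tensor:
    # controls plus 'discriminate_audio' -> realness score in [0, 1]
    raise NotImplementedError
```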
# Random old notes
MonoAutoencoder
  MonoEncoder:
    (MFCC) => (z, loudness)
  f0 Encoder:
    (MFCC) => (f0)
    CREPE
  FeatureDecoder:
    (f0, z) => ('amps', 'harmonic_distribution', 'noise_magnitudes')
  Synthesizer: Processor
    ('f0_hz', 'amps', 'harmonic_distribution', 'noise_magnitudes') => (audio)
MonoTimbreUpsamplingDecoder(GanDecoder): (f0, loudness, z) => (audio)
  InitialSampler: Processor
    (f0, loudness) => (audio)
  ParallelWaveGANUpsampler:
    (f0, loudness, z, audio) => (audio)
PolyAutoEncoder
  PolyEncoder:
    (MFCC) => (z: (N_synth, z_dims, time), f0: (N_synth, time), loudness: (N_synth, time))
  PolyDecoder: (f0, loudness, z) => audio
    stacked applications of MonoDecoder, with shared weights