Generating good audio samples #47

ibab opened this issue Sep 19, 2016 · 86 comments

@ibab
Owner

ibab commented Sep 19, 2016

Let's discuss strategies for producing audio samples.
When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.

Some ideas I've had to improve on this:

  • We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset. We could also try overfitting the dataset a little, which should result in the network reproducing pieces of the train dataset.
  • Remove silence from the recordings. Many of the recordings have periods of recording noise before and after the speakers. It might be worth removing these with librosa.
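
A minimal sketch of the silence-trimming idea (assuming librosa; the top_db threshold is illustrative and would need tuning against the VCTK recordings):

import librosa

def load_trimmed(path, sample_rate=16000, top_db=30):
    # Load the clip and strip leading/trailing audio quieter than top_db below peak.
    audio, _ = librosa.load(path, sr=sample_rate, mono=True)
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    return trimmed
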
@jyegerlehner
Contributor

jyegerlehner commented Sep 19, 2016

I've got a laundry list of things I'd like to try and plan to explore the space of possibilities. Here are a few that come to mind.

  1. The paper mentions a ~300 ms receptive field for speech generation at one point. Given our current params, we get closer to 63 ms, if my arithmetic is correct.
  2. Maybe we were a bit too draconian in cutting back dilation channels and residual channels. Bump those up?
  3. Does anyone else feel weird about performing a convolution on a time series where each element of the series is a one-hot vector, like we do at the input? I had thought their quantization into one-hot softmax output was only for the output, as the rationale involved avoiding having the learnt distribution putting probability mass outside the range of possible values. Encoding to one-hot on the input has to at the very least add quantization noise. I'd rather feed the input signal as a single floating point channel. Then the filters would be more like the digital filters we've always dealt with since the days of yore.
  4. That last 1x1 conv before the output has #channels = average of the dilation channels and the quantization levels. I'd rather make that a configurable number of channels, and bump it up to > quantization levels (e.g. 1024). We're trying to go from a small dense representation to choice from amongst 256 quantization levels, so it's almost like a classification problem where we need to create a complicated decision surface and thus maybe need more decision boundaries from more units, and maybe a bit deeper too.
  5. Issue 48.
  6. PR 39.
  7. Something else I can't remember at the moment.

@ibab
Owner Author

ibab commented Sep 19, 2016

  1. The ~300ms sounds like 4 or 5 stacks of dilation layers ranging from 1-512.
  2. I've also been thinking that the number of channels can now vary throughout the convolutional stack, as we're not tied to the number of input channels that we have when combining the outputs.
  3. That would conflict with the presence of the gated activation unit and the 1x1 convolution inside each of the layers, right? But I can see how one might build something useful with this idea. Basically, you would extract time-dependent activations at different time scales from the layers, and then feed them through several 1x1 layers to make sense of what you've seen at the different time scales. But you probably wouldn't want to add up the outputs of each layer, as in this architecture.
  4. Yeah, that makes perfect sense. The "postprocessing" layers probably aren't doing a lot at the moment.

@jyegerlehner
Contributor

jyegerlehner commented Sep 19, 2016

2. I've also been thinking that the number of channels can now vary throughout the convolutional stack

Perhaps you're seeing something I'm not, but I don't see how the channels can vary. The res block's summation forces the input shape to equal the output shape, so num channels can't change. Oh, or perhaps you are saying within a single block, the channels can vary, so long as we end up with the same shape at input and output of any single block?

3. That would conflict with the presence of the gated activation unit and the 1x1 convolution inside each of the layers, right?

Err.. Sorry I probably wasn't expressing clearly what I intended. I don't see any conflict at all. Everything would be exactly as it is now. Except one thing: the input to the net would be n samples x 1 scalar amplitude, instead of n samples x 256 one-hot encoding. The initial causal conv filter will still produce the same number of channels it does now, so nothing downstream would see any impact.
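
To make the comparison concrete, here is a TF 1.x-style sketch (the variable names and shapes are illustrative, not the repo's actual code); only the first causal convolution changes:

import tensorflow as tf

filter_width, quantization_channels, residual_channels = 2, 256, 32

# Current code: one-hot encoded input, shape [batch, time, 256].
one_hot_input = tf.placeholder(tf.float32, [1, 100000, quantization_channels])
w_one_hot = tf.Variable(tf.truncated_normal(
    [filter_width, quantization_channels, residual_channels], stddev=0.05))
h_one_hot = tf.nn.conv1d(one_hot_input, w_one_hot, stride=1, padding='VALID')

# Proposed: one scalar amplitude per sample, shape [batch, time, 1].
# Everything downstream (gated units, 1x1 convs, softmax output) is unchanged.
scalar_input = tf.placeholder(tf.float32, [1, 100000, 1])
w_scalar = tf.Variable(tf.truncated_normal(
    [filter_width, 1, residual_channels], stddev=0.05))
h_scalar = tf.nn.conv1d(scalar_input, w_scalar, stride=1, padding='VALID')
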

@ibab
Owner Author

ibab commented Sep 19, 2016

2: Yeah, I was thinking about changing the number that's referred to as dilation_channels in the config file on a per-block basis, but got confused. This would require #48.

4: Ah, so basically we would allow the network to learn its own encoding of the floating point samples. Wouldn't that make the quantization procedure unnecessary?

@lemonzi
Collaborator

lemonzi commented Sep 19, 2016

In your one-hot proposal, we would be assuming that the network tries to learn how to perform a quantization. In the current implementation, the network learns how to encode a random variable with a multinomial distribution that has temporal dependencies across trials.

The obvious model that we would all like is using the float value with "classical" filters, but then we need to choose a loss function. The authors said that most loss functions on floats assume a particular distribution of the possible values (I think the squared loss corresponds to a normal distribution), while the multinomial from the one-hot encoding makes no assumptions at the expense of having a finite set of possible values. Apparently, lifting this constraint gives better results despite the quantization noise.

The downside is that the quantization noise will always be kinda high because of the 8-bit resolution, so at some point we should be able to find a better model -- or scale it up to a one-hot encoding with 60k+ categories.


@jyegerlehner
Contributor

jyegerlehner commented Sep 19, 2016

@ibab

Wouldn't that make the quantization procedure unnecessary?

No, we'd still be using 1) softmax at the output to provide a discrete probability distribution over the 256 quantized values, 2) the quantization procedure for producing target output during training, and 3) sampling from the discrete prob distribution of the softmax in order to produce the output (followed by the inverse of the companding quantization to get it back to a scalar amplitude).
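
For reference, here is a sketch of the mu-law companding and quantization math from the paper (mu = 255, 256 levels); the repo has its own implementation, this just illustrates step 2) and the inverse used in 3):

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # Map float amplitudes in [-1, 1] to integer bins in [0, quantization_channels).
    mu = quantization_channels - 1
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, quantization_channels=256):
    # Inverse companding: integer bins back to float amplitudes in [-1, 1].
    mu = quantization_channels - 1
    signal = 2.0 * np.asarray(bins, dtype=np.float64) / mu - 1.0
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu
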

@lemonzi

In your one-hot proposal...

I sincerely think what I'm proposing is being misunderstood. It's probably my fault :). I understand and like the rationale given for the softmax output in section 2.2 of the paper; I wasn't proposing getting rid of it. I don't have a one-hot proposal; the current code uses a one-hot encoding as an input.

But no matter; at some point perhaps I'll get around to trying what I propose, and bring it back here if it works any better. It does seem more likely, based on my reading of the paper, that it's what they are doing than what the current code does. I don't have high confidence though, and of course could easily be wrong.

@lemonzi
Collaborator

lemonzi commented Sep 19, 2016

This is what I understood:

- floating-point input with raw audio (no mu-law)
- convolution with 1 input channel and N output channels
- N-to-M channel layers
- layer that aggregates the skip connections with 256 outputs
- softmax
- cross-entropy against a mu-law + quantization encoding of the input

It's worth a shot, but the network is no longer an auto-encoder.

BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them, which is closer to feeding it a given raw audio seed.

@ibab, how about a tag for these "strategical issues" and one issue per idea?

@lemonzi
Collaborator

lemonzi commented Sep 19, 2016

@jyegerlehner Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?

@jyegerlehner
Contributor

This is what I understood

OK you understood me well then. Perhaps I was misunderstanding you.

It's worth a shot, but the network is no longer an auto-encoder.

I'm trying to see in what sense it was ever an auto-encoder. I don't think it is/was.

BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them

Not sure I follow your alternative to the softmax. I was mostly trying to stick to figuring out what the authors had most likely done in their implementation. I bet we all have ideas about different approaches we'd like to try out.

Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?

No I never thought of that.

how about a tag for these "strategical issues" and one issue per idea?

Right, I feel bad about hijacking ibab's thread. I like the strategic issues tag idea. I prefer not to clutter this thread any more with this one topic.

@ibab
Owner Author

ibab commented Sep 20, 2016

As the topic of this issue is just a general "What else should we try?" I think the discussion is perfectly fine 👍
But feel free to open new issues to discuss strategies. I can tag them with a "strategy" label.

@ibab ibab added the strategy label Sep 20, 2016
@jyegerlehner
Contributor

We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset.

Right. I also think when we train on multiple speakers we need to shuffle the clips. I fear we may be experiencing catastrophic forgetting. That little sine-wave unit test I wrote shows how quickly it can learn a few frequencies, which makes me think once it starts getting sentences from a single speaker, it forgets about the pitch and other characteristics of the previous speaker(s).

But single-speaker training is less ambitious and an easier first step.
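
A minimal sketch of the shuffling idea (illustrative only; the repo's audio reader would need to be adapted to consume the shuffled list):

import random
from glob import glob

def shuffled_clip_paths(corpus_dir, epochs=10):
    # Yield clip paths in a fresh random order each epoch, interleaving speakers.
    files = sorted(glob(corpus_dir + '/**/*.wav', recursive=True))
    for _ in range(epochs):
        random.shuffle(files)
        for path in files:
            yield path
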

@woodshop
Contributor

My two cents:

Scalar input: WaveNet treats audio generation as an autoregressive classification task. The model requires the last step's output to be provided at the input. I don't think there's much to be gained by providing scalar floating point values at the input. They would still need to be reduced to 8-bit resolution (or, as @lemonzi mentions, you'd be asking the model to learn quantization). You might save some computational cycles at the first layer. However, I think the scale of the input would then need to be considered more closely.

Input shuffling: this would probably be very useful.

Silence trimming: Shouldn't the model be allowed to see strings of silent samples? Otherwise it will learn to generate waveforms having more discontinuities. I suggest that the degree of trimming be decided as a function of the size of the receptive field, e.g. truncate silences to no less than 75% of the receptive field.

@lemonzi
Collaborator

lemonzi commented Sep 21, 2016

Oh, that makes sense. It's classifying the next sample, not encoding the sequence as a whole.

The trimming is currently applied to the beginning and end of the samples, not to the gaps in between speech. If there are long silence chunks in the samples, what could make sense is to split them in two rather than stripping out the silence.
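
A sketch of that splitting idea (assuming librosa; the thresholds are illustrative): cut the clip at long internal gaps rather than deleting the silence, so shorter pauses stay in the training data.

import librosa
import numpy as np

def split_on_silence(audio, sample_rate=16000, top_db=30, max_silence_sec=0.5):
    # Non-silent [start, end) intervals, relative to top_db below peak.
    intervals = librosa.effects.split(audio, top_db=top_db)
    cut_points = []
    for (_, prev_end), (next_start, _) in zip(intervals[:-1], intervals[1:]):
        if (next_start - prev_end) / sample_rate > max_silence_sec:
            cut_points.append((prev_end + next_start) // 2)
    return np.split(audio, cut_points)
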

@ibab
Owner Author

ibab commented Sep 21, 2016

I've just managed to generate a sample that I think sounds pretty decent:
https://soundcloud.com/user-952268654/wavenet-28k-steps-of-100k-samples

This is using Tanh instead of ReLU to avoid the issue where the ReLU activations eventually cut off the network.
I stopped it at one point to reduce the learning rate from 0.02 to 0.01, but it doesn't look like that had a large impact.
I started generating when the curve was at about 28k steps.

[screenshot: training loss curve around 28k steps]

I used only two stacks of 9 dilation layers each:

{
    "filter_width": 2,
    "quantization_steps": 256,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256],
    "residual_channels": 32,
    "dilation_channels":16,
    "use_biases": false
}
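
(For reference, those dilations sum to 2 x 511 = 1022, so the receptive field is roughly 1022 samples, or about 64 ms at 16 kHz.)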

@woodshop
Contributor

Nice!

@ibab
Owner Author

ibab commented Sep 21, 2016

I've noticed that generating from the same model doesn't always produce interesting output.
But if I start off with an existing recording, it seems to work very reliably:
https://soundcloud.com/user-952268654/bootstrapping-wavenet-with-an-existing-recording

Considering that the receptive field of this network is only ~1000 samples, I think the results sound quite promising.

@lemonzi
Collaborator

lemonzi commented Sep 21, 2016

Can you test with argmax instead of random.choice?


@ibab
Owner Author

ibab commented Sep 21, 2016

@lemonzi: After swapping out random.choice with argmax, it always returns the same value. I think that makes sense, as staying at the same amplitude is the most likely thing to happen at the resolution we work with.

@lemonzi
Collaborator

lemonzi commented Sep 21, 2016

Interesting...


@ibab
Owner Author

ibab commented Sep 21, 2016

When calculating the mean amplitude with

sample = np.int32(np.sum(np.arange(quantization_steps) * prediction))

it just produces noise for me.
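
For reference, a sketch contrasting the three decoding strategies discussed above (the function name is illustrative; prediction is assumed to be the 256-way softmax output for the next sample):

import numpy as np

def next_sample(prediction, mode='sample'):
    # prediction: 1-D array over the quantization bins, summing to 1.
    bins = np.arange(len(prediction))
    if mode == 'sample':   # what generate.py does: draw from the multinomial
        return np.random.choice(bins, p=prediction)
    if mode == 'argmax':   # deterministic; tends to get stuck on one amplitude
        return int(np.argmax(prediction))
    if mode == 'mean':     # expected value; blurs the distribution and tends to produce noise
        return int(np.sum(bins * prediction))
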

@ghenter

ghenter commented Sep 23, 2016

Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead.

As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner's list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them.

I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:

  1. This turns the problem from a regression task to a classification task. For some reason, DNNs have seen greater success in classification than in regression. (This has motivated the research into generative adversarial networks, which is another hot topic at the moment.) Up until now, most DNN-based waveform/audio generation approaches were formulated as regression problems.
  2. A softmax output layer allows a flexible representation of the distribution of possible output values, from which the next value is generated by sampling. Empirically, this worked better than parametrising the output distribution using GMMs (i.e., a mixture density network).

Note that both of these key concerns are relevant only at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:

  • Scalar inputs have lower dimensionality, requiring fewer parameters in the network. (They are compact and dense instead of a factor 256 larger and sparse.)
  • Using floats does not introduce (additional) quantisation noise.
  • Applying convolutions to floating point values is interpretable as a filter, as @jyegerlehner said. The effect of applying convolutions to one-hot vectors, in contrast, is opaque.
  • Finally, and most importantly, the actual waveform sample values are numerical, so they have both a magnitude and an internal ordering. These properties matter hugely. Feeding in a categorical representation (one-hot vectors) would essentially force the network to learn the relative values and ordering associated with each input node, in order to make sense of the input. Since there are something like 256 values x 300 ms x 16 kHz = 1.2 million one-hot input nodes, this is a formidable learning task that is entirely avoided by using a floating point representation.

Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNNs on GitHub, but I haven't looked sufficiently deeply into it to tell how they encode their input.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

Has everyone been reproducing ibab's results? I got a result similar to his, but I think it sounds a bit smoother; I'm guessing because the receptive field is a little bigger than his.

2 seconds:
https://soundcloud.com/user-731806733/speaker-p280-from-vctk-corpus-1

10 seconds:
https://soundcloud.com/user-731806733/speaker-280-from-vctk-corpus-2

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels":32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

[Edit] After mortont's comment below: I used learning_rate=0.001.

I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab's example. Loss was a bit lower than his, around 2.0-2.1.

I think to get pauses between words and such we need a wider receptive field. That's my next step.

By the way, does anyone know how to make SoundCloud loop the playback, instead of playing music at the end of the clip, like ibab did? Is a Pro account needed for that?

[Edit] Here's one from a model that has about a 250 ms receptive field, trained for about 16 hours:
https://soundcloud.com/user-731806733/generated-larger-1

@ibab
Owner Author

ibab commented Sep 25, 2016

Those results sound great!
We should consider linking to them from the README.md to demonstrate what the network can do.
It seems likely that we will be able to reproduce the quality of the DeepMind samples with a larger receptive field.

On soundcloud, you can set an audio clip to repeat in the bar at the bottom, but I don't think this will affect other listeners. Not sure why my clip was on repeat by default for you.

@mortont
Contributor

mortont commented Sep 25, 2016

This is definitely the best result yet! What commit did you use to achieve this @jyegerlehner? I tried reproducing it using the same hyperparameters and only speaker p280 from the corpus, but my model hasn't gone under a loss of 5 after 26k steps.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

@mortont
I'm not sure exactly which commit to this branch it was:
https://github.com/jyegerlehner/tensorflow-wavenet/tree/single-speaker
But most are trivial and frankly I don't think it matters. I haven't observed it breaking at any point.

I've started training a newer model with latest updates from master and it is working fine. I don't have any "special sauce" or changes to the code relative to master that I can think of. The only reason for a separate branch for it is to allow me to change the .json file and add shell scripts, and be able to switch back to master without losing files.

I'm trying to imagine why you would have loss stuck at 5 and... can't think of a good reason. Perhaps compare the train.sh and resume.sh in my branch to the command-line arguments you are supplying and see if there's an important difference? Learning rate perhaps?

[Edit]: I observe the loss to start dropping right away, within the first 50 steps. Loss drops to < 3 rapidly well before 1K steps. So if you don't see that, I think something is wrong.

@mortont
Contributor

mortont commented Sep 25, 2016

Looks like it was the learning rate; I changed it from 0.02 to 0.001 and it's now steadily dropping, thanks!

@robinsloan
Contributor

robinsloan commented Oct 25, 2016

@neale, this is cool! It would be interesting, with this output as a baseline, to now take a bundle of pieces from the same composer -- doesn't have to be a lot -- and train the network on those alone, with the same settings, same procedure, etc. I'd be very curious to hear that output alongside what you've got.

My sense is that training this WaveNet implementation on a large, diverse corpus is going to be tricky until we have a method for "conditioning" and telling the network that A is supposed to sound like B and C, but not as much like Y and Z, etc. Otherwise it just tries to generalize across the entire breadth of what it's hearing, and that's a lot to ask.

Question for everyone: what's the math to compute the length of the receptive field for a given set of params/dilations? I know I should know this but… I do not 😬

@jyegerlehner
Contributor

jyegerlehner commented Oct 25, 2016

@Nyrt those sound pretty good. What was your loss approximately when you generated those?

And regarding the optimizer: are you using the SGD/momentum optimizer for any particular reason? I haven't seen it learn as fast as Adam or RMSProp, but your results are hard to argue with.

@neale

neale commented Oct 25, 2016

@robinsloan I've reread the paper and realized that classical music is absolutely beyond the ability of the model. With a 300ms receptive field, the multiple instruments probably sound just like what I posted.
A more homogeneous dataset is needed; I'm getting together as much solo piano as I can find.

Also, the receptive field size is just the size of the convolution window. In a regular CNN people usually use 3x3 or 5x5 two-dimensional receptive fields. We use 1D conv layers, and with the dilations the receptive fields get sparse by multiplying the length by [1, 2, 4, 8, ...] and zeroing portions of the filter.
There's no math for that :); the math would be calculating the parameters introduced by each convolution.

@neale

neale commented Oct 25, 2016

Also, can someone edify me as to why anything but RMSProp is being used? I thought it would perform the best here.

@robinsloan
Contributor

robinsloan commented Oct 25, 2016

@neale Er, I guess I mean, how do you know your receptive field is 300ms long? I get that it depends on sample rate and dilations, but I don't understand the arithmetic.

@mortont
Contributor

mortont commented Oct 25, 2016

@robinsloan the receptive field length (in seconds) is just the sum of your dilation layers (receptive field in unit-less numbers) divided by your sample rate (1/seconds), so in the case of a wavenet_params.json that looks like this:

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

your receptive field would be the sum of the dilations list (1564) divided by the sampling frequency (16000), giving you ~98 ms of receptive field.
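
A throwaway helper for that arithmetic (a sketch; strictly speaking the filter width adds a little extra context per layer, but the sum of the dilations is a good approximation):

def receptive_field_ms(dilations, sample_rate=16000):
    # Receptive field in samples and in milliseconds.
    samples = sum(dilations)
    return samples, 1000.0 * samples / sample_rate

dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256] * 3 + [1, 2, 4, 8, 16]
print(receptive_field_ms(dilations))  # (1564, ~97.75)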

Somewhat related, @neale it may be worth trying with a longer receptive field than 100ms since the best speech samples have been in the 300ms+ range. The paper mentions fields >1 second for the music generation, but I think we could get results better than static if we upped the dilations to 6 blocks of 512 or so.

@neale

neale commented Oct 25, 2016

@mortont I would love to, but I can't even fit a full [1..512] stack because I only have a GTX 970 :(

@Nyrt

Nyrt commented Oct 26, 2016

@neale I believe it was speaker 266-- I can confirm this once I get back to my big machine.

@jyegerlehner The loss was still oscillating a bit, but the minimum it hit was something around 1.5-1.6. Most of the time it was closer to 1.7-1.8, peaking at 2.

The reason I was using the SGD/momentum optimizer was that it avoids an issue I was having early on where the loss would suddenly explode and start generating white noise. The slower training wasn't really an issue because it was running overnight anyway. I haven't tried RMSprop yet, though.

@neale

neale commented Nov 1, 2016

In the quest for ever better audio I made some samples that sound a lot better than anything I could get before.

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1014, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64],

    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

unmentioned: rmsprop and initial LR of 0.001

Soundcloud Link

I used a receptive field of 2 s, on some 20 hours of solo piano that I scraped off YouTube. I trained this out to 100k steps, and those are still generating. Unfortunately I had to decrease the sample size to 32000 for lack of available memory.

I just grabbed an 8GB 1070, so in a few days I'll triple my model and try it out.

@nakosung
Contributor

nakosung commented Nov 11, 2016

I just grabbed some P100s and have trained a large WaveNet. :) (Trained with the current default wavenet_params.json.)

https://soundcloud.com/nako-sung/test

@Cortexelus

Nako, what was the dataset size (# of files, length of files)? For how long did you train?


@nakosung
Contributor

@Cortexelus 158 files, 200K bytes each. ~50k iterations with learning rate annealing. :)

@Cortexelus

@nakosung at what sample rate?

@fehiepsi
Contributor

In the paper, the receptive field is about 300 ms and the sample rate is 16000. So the current default setting is good in my opinion, except for things like sample size, number of iterations, learning rate, and optimizer, which we can tweak through experiments to avoid overfitting and get better convergence. To generate good sound, we should consider local conditioning on text, as mentioned in Section 3.1 of the paper.

@nakosung
Contributor

@Cortexelus The default setting (16 kHz).

New sample with silence. https://soundcloud.com/nako-sung/test-3-wav

@nakosung
Contributor

I've just added dropout and concat-elu (#184). I hope dropout will help the generator's quality. :)

@willjhenry

Hi all, I am using the default settings on an EC2 p2.xlarge and am getting 2.7 sec/step. I am pretty certain I saw a post on one of the issues describing 0.5 sec steps. Just wondering if anyone has some tips.

@nakosung
Contributor

@willjhenry In my case I got 3.5 sec/step using the KAIST Korean corpus (IBM PowerPC, P100).

@Whytehorse

Is the goal here to produce random speech? If so, it sounds like nakosung nailed it. What about text-to-speech? I'd love to start using this in OpenAssistant for our TTS. Please help us? Festival is so poor... https://github.com/vavrek/openassistant

@jyegerlehner
Contributor

@Whytehorse

Generating random speech is a step on the way to generating speech conditioned on what you want it to say (i.e. generating non-random speech). Please see the WaveNet paper.

I don't think this project, at least in its current form, is a candidate for your TTS solution: 1) there's no local conditioning on the desired speech implemented in the repo, at least yet, and 2) even if there were, it doesn't generate audio in real time. It takes M seconds to generate N seconds of audio, where M >> N.

@jyegerlehner
Contributor

@willjhenry @nakosung On the time-per-step: the code in the master of this repo produces a training batch by pasting together subsequent audio clips from the corpus until the length is at least sample_size=100000 samples long. I think this is wrong because it is training the net to predict a discontinuity where it transitions from the first to second clip. This is fixed in both the global condition PR and koz4k's PR. Since most of the VCTK corpus clips are much shorter than 100000 samples, these branches produce faster step times as the number of samples in the train batch tends to be smaller (than what you get with the ibab master).

Having said all that, I was getting about 0.5-1.0 second steps on my branch, on Titan XP.

@belevtsoff
Contributor

belevtsoff commented Dec 23, 2016

@nakosung That's a really cool sample you've got there: https://soundcloud.com/nako-sung/test-3-wav. I noticed that your model produces very smooth-sounding vowels. I trained my model with 50k+ steps (current default parameters) and still have some considerable tremor in the vowels: https://soundcloud.com/belevtsoff/wavenet_54k_audiobook. My training corpus is an audiobook with 1+ hours of clean speech. Also, I use RandomShuffleQueue for feeding input data. Can you think about a possible reason for this poor quality?

@Whytehorse

Can't you just use Google or Siri to produce the corpus? Maybe through an API you could send a word as text to them and get back a sound. There's your training data. Eventually you could get longer and longer sentences until it can handle anything.

@AlvinChen13

@nakosung Have you trained on multiple GPUs with multiple nodes? The default training just runs on one GPU. Do you mind sharing how to train it on multiple nodes?

@nakosung
Contributor

@AlvinChen13 Multi-node training didn't scale well; I think it has a bottleneck in network bandwidth (although I didn't test ASGD). I switched to a Titan XP, which has larger memory, so I stopped the multi-node/GPU experiments.

@AlvinChen13

@nakosung Would you mind sharing your distributed code? Our lab has 8 nodes with 2 M40s each, and 4 nodes with 2 K40s each, all connected to a 40G switch.

I believe multi-node training is necessary if we train on a huge dataset with bigger receptive fields. It is worth some effort to investigate. Google claimed that TF 1.0 can achieve a 58x performance improvement with 64 GPUs for Inception v3.
