Generating good audio samples #47
Comments
I've got a laundry list of things I'd like to try and plan to explore the space of possibilities. Here are a few that come to mind.
Perhaps you're seeing something I'm not, but I don't see how the channels can vary. The res block's summation forces the input shape to equal the output shape, so num channels can't change. Oh, or perhaps you are saying within a single block, the channels can vary, so long as we end up with the same shape at input and output of any single block?
Err... sorry, I probably wasn't expressing clearly what I intended. I don't see any conflict at all. Everything would be exactly as it is now, except one thing: the input to the net would be n samples x 1 scalar amplitude, instead of n samples x 256 one-hot encoding. The initial causal conv filter would still produce the same number of channels it does now, so nothing downstream would see any impact.
2: Yeah, I was thinking about changing the number that's referred to as …
4: Ah, so basically we would allow the network to learn its own encoding of the floating-point samples. Wouldn't that make the quantization procedure unnecessary?
In your one-hot proposal, we would be assuming that the network tries to … The obvious model that we would all like is using the float value with … The downside is that the SNR will always be kinda high because of the 8-bit …
No, we'd still be using 1) softmax at the output to provide a discrete probability distribution over the 256 quantized values, 2) the quantization procedure for producing the target output during training, and 3) sampling from the discrete probability distribution of the softmax in order to produce the output (followed by the inverse of the companding quantization to get it back to a scalar amplitude).
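For readers following along, here is a minimal sketch of the mu-law companding and quantization being referred to (mu = 255, as in the WaveNet paper); the function names are illustrative assumptions, not necessarily the ones used in this repo:

```python
import numpy as np

MU = 255  # 8-bit mu-law, as in the WaveNet paper

def mu_law_encode(audio):
    """Map float audio in [-1, 1] to integer classes in [0, 255]."""
    # Companding: compress amplitude non-linearly before quantizing.
    magnitude = np.log1p(MU * np.abs(audio)) / np.log1p(MU)
    signal = np.sign(audio) * magnitude
    # Quantize [-1, 1] into 256 discrete bins.
    return ((signal + 1) / 2 * MU + 0.5).astype(np.int32)

def mu_law_decode(quantized):
    """Invert the quantization and companding back to a float amplitude."""
    signal = 2 * (quantized.astype(np.float32) / MU) - 1
    magnitude = (1 / MU) * ((1 + MU) ** np.abs(signal) - 1)
    return np.sign(signal) * magnitude
```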
I sincerely think what I'm proposing is being misunderstood. It's probably my fault :). I understand and like the rationale given for the softmax output in section 2.2 of the paper; I wasn't proposing getting rid of it. I don't have a one-hot proposal; the current code uses a one-hot encoding as an input. But no matter; at some point perhaps I'll get around to trying what I propose, and I'll bring it back here if it works any better. Based on my reading of the paper, it does seem more likely that this is what they are doing than what the current code does. I don't have high confidence, though, and of course could easily be wrong.
This is what I understood:
It's worth a shot, but the network is no longer an auto-encoder. BTW, I don't like sampling from the multinomial in … @ibab, how about a tag for these "strategic issues" and one issue per idea?
@jyegerlehner Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?
OK you understood me well then. Perhaps I was misunderstanding you.
I'm trying to see in what sense it was ever an auto-encoder. I don't think it is/was.
Not sure I follow your alternative to the softmax. I was mostly trying to stick to figuring out what the authors had most likely done in their implementation. I bet we all have ideas about different approaches we'd like to try out.
No, I never thought of that.
Right, I feel bad about hijacking ibab's thread. I like the strategic issues tag idea. I prefer not to clutter this thread any more with this one topic.
As the topic of this issue is just a general "What else should we try?", I think the discussion is perfectly fine 👍
Right. I also think when we train on multiple speakers we need to shuffle the clips. I fear we may be experiencing catastrophic forgetting. That little sine-wave unit test I wrote shows how quickly it can learn a few frequencies, which makes me think once it starts getting sentences from a single speaker, it forgets about the pitch and other characteristics of the previous speaker(s). But single-speaker training is a less ambitious and easier first step.
My two cents:
Scalar input: WaveNet treats audio generation as an autoregressive classification task. The model requires the last step's output to be provided at the input. I don't think there's much to be gained by providing scalar floating-point values at the input. They would still need to be reduced to 8-bit resolution (or, as @lemonzi mentions, you'd be asking the model to learn the quantization). You might save some computational cycles at the first layer, but I think the scale of the input would then need to be considered more closely.
Input shuffling: this would probably be very useful.
Silence trimming: shouldn't the model be allowed to see strings of silent samples? Otherwise it will learn to generate waveforms with more discontinuities. I suggest that the degree of trimming be decided as a function of the size of the receptive field, e.g. truncate silences to no less than 75% of the receptive field.
Oh, that makes sense. It's classifying the next sample, not encoding the sequence as a whole. The trimming is currently applied to the beginning and end of the samples, not to the gaps in between speech. If there are long silence chunks in the samples, what could make sense is to split them in two rather than stripping out the silence.
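As a sketch of the capping idea above (shortening internal silences relative to the receptive field rather than stripping them out), something like the following could work; the `cap_silences` helper and the `top_db` threshold are illustrative assumptions, using `librosa.effects.split` to locate non-silent regions:

```python
import numpy as np
import librosa

def cap_silences(audio, receptive_field, top_db=30, max_fraction=0.75):
    """Shorten long silences instead of removing them entirely."""
    max_gap = int(max_fraction * receptive_field)
    intervals = librosa.effects.split(audio, top_db=top_db)  # (start, end) of non-silent regions
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = start - prev_end
        # Keep the silent gap, but cap it at max_gap samples.
        pieces.append(audio[prev_end:prev_end + min(gap, max_gap)])
        pieces.append(audio[start:end])
        prev_end = end
    # Trailing silence, also capped.
    pieces.append(audio[prev_end:prev_end + max_gap])
    return np.concatenate(pieces)
```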
I've just managed to generate a sample that I think sounds pretty decent: This is using Tanh instead of ReLU to avoid the issue that the ReLU activations eventually cut off the network. I used only two stacks of 9 dilation layers each:
{
    "filter_width": 2,
    "quantization_steps": 256,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256],
    "residual_channels": 32,
    "dilation_channels": 16,
    "use_biases": false
}
Nice!
I've noticed that generating from the same model doesn't always produce interesting output. Considering that the receptive field of this network is only ~1000 samples, I think the results sound quite promising.
Can you test with argmax instead of random.choice?
@lemonzi: After swapping out …
Interesting...
When calculating the mean amplitude with … it just produces noise for me.
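For context, the three decoding strategies being compared in the last few comments (sampling from the softmax distribution, taking the argmax, and using the mean amplitude) differ only in how the next quantized value is picked from the network's output. A small illustrative sketch, with a hypothetical `next_sample` helper:

```python
import numpy as np

def next_sample(probs, strategy="sample"):
    """Pick the next quantized sample from the softmax output `probs` (length 256)."""
    if strategy == "sample":
        # Draw from the predicted distribution.
        return int(np.random.choice(len(probs), p=probs))
    if strategy == "argmax":
        # Always take the most likely bin (deterministic).
        return int(np.argmax(probs))
    if strategy == "mean":
        # Expected bin index under the distribution ("mean amplitude").
        return int(np.round(np.sum(np.arange(len(probs)) * probs)))
    raise ValueError(strategy)
```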
Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead. As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner's list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them. I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:
Note that both of these key concerns are only relevant at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:
Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNNs on GitHub, but I haven't looked sufficiently deeply into it to tell how they encode their input.
Has everyone been reproducing ibab's results? I got a result similar to his, but I think it sounds a bit smoother; I'm guessing because the receptive field is a little bigger than his. 2 seconds: 10 seconds:
[Edit] After mortont's comment below: I used learning_rate=0.001. I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab's example. Loss was a bit lower than his, around 2.0-2.1. I think to get pauses between words and such we need a wider receptive field. That's my next step. By the way, anyone know how to make soundcloud loop the playback, instead of playing music at the end of the clip, like ibab did? Pro account needed for that? [Edit] Here's one from a model that has about a 250 ms receptive field, trained for about 16 hours:
Those results sound great! On soundcloud, you can set an audio clip to repeat in the bar at the bottom, but I don't think this will affect other listeners. Not sure why my clip was on repeat by default for you.
This is definitely the best result yet! What commit did you use to achieve this, @jyegerlehner? I tried reproducing it using the same hyperparameters and only speaker p280 from the corpus, but my model hasn't gone under a loss of 5 after 26k steps.
@mortont I've started training a newer model with the latest updates from master and it is working fine. I don't have any "special sauce" or changes to the code relative to master that I can think of. The only reason for a separate branch is to allow me to change the .json file and add shell scripts, and be able to switch back to master without losing files. I'm trying to imagine why you would have loss stuck at 5 and... can't think of a good reason. Perhaps compare the train.sh and resume.sh in my branch to the command-line arguments you are supplying and see if there's an important difference? Learning rate perhaps? [Edit]: I observe the loss to start dropping right away, within the first 50 steps. Loss drops to < 3 rapidly, well before 1K steps. So if you don't see that, I think something is wrong.
Looks like it was the learning rate; I changed it from 0.02 to 0.001 and it's now steadily dropping, thanks!
@neale, this is cool! It would be interesting, with this output as a baseline, to now take a bundle of pieces from the same composer -- doesn't have to be a lot -- and train the network on those alone, with the same settings, same procedure, etc. I'd be very curious to hear that output alongside what you've got. My sense is that training this WaveNet implementation on a large, diverse corpus is going to be tricky until we have a method for "conditioning" and telling the network that A is supposed to sound like B and C, but not as much like Y and Z, etc. Otherwise it just tries to generalize across the entire breadth of what it's hearing, and that's a lot to ask. Question for everyone: what's the math to compute the length of the receptive field for a given set of params/dilations? I know I should know this but… I do not 😬
@Nyrt those sound pretty good. What was your loss approximately when you generated those? And regarding the optimizer: are you using SGD/momentum optimizer for any particular reason? I haven't seen it learning as fast as adam or rmsprop, but your results are hard to argue with.
@robinsloan I've reread the paper and realized that classical music is absolutely beyond the ability of the model. With a 300ms receptive field, the multiple instruments probably sound just like what I posted. Also, the receptive field size is just the size of the convolution window. So in a regular CNN people usually use 3x3 or 5x5, two-dimensional receptive fields. We use 1D conv layers, and with the dilations the receptive fields get sparse by multiplying the length by [1, 2, 4, 8, ...] and zeroing portions of the filter.
Also, can someone edify me as to why anything but rmsprop is being used? I thought it would perform the best here.
@neale Er, I guess I mean, how do you know your receptive field is 300ms long? I get that it depends on sample rate and dilations, but I don't understand the arithmetic.
@robinsloan the receptive field length (in seconds) is just the sum of your dilation layers (the receptive field in unit-less samples) divided by your sample rate (1/seconds), so in the case of a … your receptive field would be the sum of the …
Somewhat related, @neale it may be worth trying a longer receptive field than 100ms, since the best speech samples have been in the 300ms+ range. The paper mentions fields >1 second for the music generation, but I think we could get results better than static if we upped the dilations to 6 blocks of 512 or so.
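Applying that estimate to the config posted earlier in this thread (two stacks of dilations 1..256 at 16 kHz) gives roughly the ~1000-sample figure mentioned above; a quick sketch of the arithmetic, treated as the rough estimate described here rather than the repo's exact formula:

```python
SAMPLE_RATE = 16000
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256,
             1, 2, 4, 8, 16, 32, 64, 128, 256]

# Rough receptive field in samples: sum of the dilations (filter_width == 2).
receptive_field_samples = sum(dilations)
receptive_field_seconds = receptive_field_samples / SAMPLE_RATE

print(receptive_field_samples, receptive_field_seconds)  # 1022 samples, ~0.064 s
```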
@mortont I would love to, but I can't even have a full [1..512] stack because I only have a gtx 970 :(
@neale I believe it was speaker 266 -- I can confirm this once I get back to my big machine. @jyegerlehner The loss was still oscillating a bit, but the minimum it hit was something around 1.5-1.6. Most of the time it was closer to 1.7-1.8, peaking at 2. The reason I was using the SGD/momentum optimizer was that it avoids an issue I was having early on where the loss would suddenly explode and start generating white noise. The slower training wasn't really an issue because it was running overnight anyway. I haven't tried RMSprop yet, though.
In the quest for ever better audio I made some samples that sound a lot better than anything I could get before.
unmentioned: rmsprop and an initial LR of 0.001. I used a receptive field of 2s, on some 20 hours of solo piano that I scraped off youtube. I trained this out to 100k steps, and those are still generating. Unfortunately I had to decrease the sample size to 32000 for lack of available memory. I just grabbed an 8GB 1070, so in a few days I'll triple my model and try it out.
I just grabbed some P100's and have trained a large wavenet. :) (Trained with the current default wavenet_params.json)
Nako, what was the dataset size (# of files, length of files)? For how long did you train?
@Cortexelus 158 files, 200K bytes each. ~50k iterations with learning rate annealing. :)
@nakosung at what sample rate?
In the paper, the receptive field is about 300ms and the sample rate is 16000. So the current default setting is good in my opinion, except for things like sample size, number of iterations, learning rate, and optimizer, which we can tweak through experiments to avoid overfitting and get better convergence. To generate good sound, we should also consider local conditioning, e.g. on text (as mentioned in Section 3.1 of the paper).
@Cortexelus default setting (16K). New sample with silence. https://soundcloud.com/nako-sung/test-3-wav
I've just added dropout and concat-elu (#184). I hope dropout will help the generator's quality. :)
Hi all, I am using the default settings on an EC2 p2.xlarge and am getting 2.7 sec/step. I am pretty certain I saw a post on one of the issues describing 0.5 sec steps. Just wondering if anyone had some tips.
@willjhenry In my case I got 3.5 sec/step using the KAIST Korean corpus (IBM PowerPC, P100).
Is the goal here to produce random speech? If so, it sounds like nakosung nailed it. What about text-to-speech? I'd love to start using this in OpenAssistant for our TTS. Please help us? Festival is so poor... https://github.com/vavrek/openassistant
Generating random speech is a step on the way to generating speech conditioned on what you want it to say (i.e. generating non-random speech). Please see the WaveNet paper. I don't think this project, at least in its current form, is a candidate for your TTS solution: 1) there's no local conditioning on the desired speech implemented in the repo, at least yet, and 2) even if there were, it doesn't generate audio in real time. It takes M seconds to generate N seconds of audio, where M >> N.
@willjhenry @nakosung On the time-per-step: the code in the master of this repo produces a training batch by pasting together subsequent audio clips from the corpus until the length is at least sample_size=100000 samples long. I think this is wrong because it is training the net to predict a discontinuity where it transitions from the first to second clip. This is fixed in both the global condition PR and koz4k's PR. Since most of the VCTK corpus clips are much shorter than 100000 samples, these branches produce faster step times as the number of samples in the train batch tends to be smaller (than what you get with the ibab master). Having said all that, I was getting about 0.5-1.0 second steps on my branch, on Titan XP.
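A rough sketch of the difference being described (not the repo's actual code; `clips` is assumed to be an iterable of 1-D NumPy arrays): concatenating clips up to sample_size can place a clip boundary inside a training window, while cutting windows per clip never does.

```python
import numpy as np

SAMPLE_SIZE = 100000  # value mentioned above

def batches_concatenated(clips):
    """Sketch of the behaviour described for master: paste clips together until
    at least SAMPLE_SIZE samples, so a window can straddle a clip boundary."""
    buffer = np.array([], dtype=np.float32)
    for clip in clips:
        buffer = np.concatenate([buffer, clip])
        while len(buffer) >= SAMPLE_SIZE:
            yield buffer[:SAMPLE_SIZE]
            buffer = buffer[SAMPLE_SIZE:]

def batches_per_clip(clips):
    """Per-clip alternative: windows never cross a clip boundary, so the net is
    never trained to predict the artificial jump between two unrelated clips."""
    for clip in clips:
        for start in range(0, len(clip), SAMPLE_SIZE):
            yield clip[start:start + SAMPLE_SIZE]
```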
@nakosung That's a really cool sample you've got there: https://soundcloud.com/nako-sung/test-3-wav. I noticed that your model produces very smooth-sounding vowels. I trained my model with 50k+ steps (current default parameters) and still have some considerable tremor in the vowels: https://soundcloud.com/belevtsoff/wavenet_54k_audiobook. My training corpus is an audiobook with 1+ hours of clean speech. Also, I use a RandomShuffleQueue for feeding input data. Can you think of a possible reason for this poor quality?
Can't you just use Google or Siri to produce the corpus? Maybe through an API you could send a word as text to them and get back a sound. There's your training data. Eventually you could get longer and longer sentences until it can handle anything.
@nakosung Have you trained on multiple GPUs with multiple nodes? The default training just runs on one GPU. Do you mind sharing how to train it on multiple nodes?
@AlvinChen13 Multi-node training didn't scale well. I think it has a bottleneck in network bandwidth (although I didn't test ASGD). I switched to a TitanXP, which had larger memory, so I stopped the multi-node/GPU experiments.
@nakosung Would you mind sharing your distributed code? Our lab has 8 nodes with 2 M40s each, and 4 nodes with 2 K40s each, and all are connected to a 40G switch. I believe multi-node training is necessary if we train on huge datasets with bigger receptive fields. It is worth some effort to investigate. Google claimed that TF 1.0 can achieve a 58x performance improvement with 64 GPUs for Inception v3.
Let's discuss strategies for producing audio samples.
When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.
Some ideas I've had to improve on this:
librosa