Generating good audio samples #47
Comments
I've got a laundry list of things I'd like to try and plan to explore the space of possibilities. Here are a few that come to mind.
Perhaps you're seeing something I'm not, but I don't see how the channels can vary. The res block's summation forces the input shape to equal the output shape, so num channels can't change. Oh, or perhaps you are saying within a single block, the channels can vary, so long as we end up with the same shape at input and output of any single block?
Err... sorry, I probably wasn't expressing clearly what I intended. I don't see any conflict at all. Everything would be exactly as it is now, except one thing: the input to the net would be n samples x 1 scalar amplitude, instead of n samples x 256 one-hot encoding. The initial causal conv filter would still produce the same number of channels it does now, so nothing downstream would see any impact.
2: Yeah, I was thinking about changing the number that's referred to as …
4: Ah, so basically we would allow the network to learn its own encoding of the floating-point samples. Wouldn't that make the quantization procedure unnecessary?
In your one-hot proposal, we would be assuming that the network tries to … The obvious model that we would all like is using the float value with … The downside is that the SNR will always be kinda high because of the 8-bit …
No, we'd still be using 1) softmax at the output to provide a discrete probability distribution over the 256 quantized values, 2) the quantization procedure for producing the target output during training, and 3) sampling from the discrete probability distribution of the softmax in order to produce the output (followed by the inverse of the companding quantization to get it back to a scalar amplitude).
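For readers following along, here is a minimal sketch of the mu-law companding and quantization being referred to (mu = 255, as in the WaveNet paper); the function names are illustrative assumptions, not necessarily the ones used in this repo:

```python
import numpy as np

MU = 255  # 8-bit mu-law, as in the WaveNet paper

def mu_law_encode(audio):
    """Map float audio in [-1, 1] to integer classes in [0, 255]."""
    # Companding: compress amplitude non-linearly before quantizing.
    magnitude = np.log1p(MU * np.abs(audio)) / np.log1p(MU)
    signal = np.sign(audio) * magnitude
    # Quantize [-1, 1] into 256 discrete bins.
    return ((signal + 1) / 2 * MU + 0.5).astype(np.int32)

def mu_law_decode(quantized):
    """Invert the quantization and companding back to a float amplitude."""
    signal = 2 * (quantized.astype(np.float32) / MU) - 1
    magnitude = (1 / MU) * ((1 + MU) ** np.abs(signal) - 1)
    return np.sign(signal) * magnitude
```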
I sincerely think what I'm proposing is being misunderstood. It's probably my fault :). I understand and like the rationale given for the softmax output in section 2.2 of the paper; I wasn't proposing getting rid of it. I don't have a one-hot proposal; the current code uses a one-hot encoding as an input. But no matter; at some point perhaps I'll get around to trying what I propose, and I'll bring it back here if it works any better. Based on my reading of the paper, it does seem more likely that this is what they are doing than what the current code does. I don't have high confidence, though, and of course could easily be wrong.
This is what I understood:
It's worth a shot, but the network is no longer an auto-encoder. BTW, I don't like sampling from the multinomial in … @ibab, how about a tag for these "strategic issues" and one issue per idea?
@jyegerlehner Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?
OK you understood me well then. Perhaps I was misunderstanding you.
I'm trying to see in what sense it was ever an auto-encoder. I don't think it is/was.
Not sure I follow your alternative to the softmax. I was mostly trying to stick to figuring out what the authors had most likely done in their implementation. I bet we all have ideas about different approaches we'd like to try out.
No, I never thought of that.
Right, I feel bad about hijacking ibab's thread. I like the strategic issues tag idea. I prefer not to clutter this thread any more with this one topic.
As the topic of this issue is just a general "What else should we try?", I think the discussion is perfectly fine 👍
Right. I also think when we train on multiple speakers we need to shuffle the clips. I fear we may be experiencing catastrophic forgetting. That little sine-wave unit test I wrote shows how quickly it can learn a few frequencies, which makes me think once it starts getting sentences from a single speaker, it forgets about the pitch and other characteristics of the previous speaker(s). But single-speaker training is a less ambitious and easier first step.
My two cents:
Scalar input: WaveNet treats audio generation as an autoregressive classification task. The model requires the last step's output to be provided at the input. I don't think there's much to be gained by providing scalar floating-point values at the input. They would still need to be reduced to 8-bit resolution (or, as @lemonzi mentions, you'd be asking the model to learn the quantization). You might save some computational cycles at the first layer, but I think the scale of the input would then need to be considered more closely.
Input shuffling: this would probably be very useful.
Silence trimming: shouldn't the model be allowed to see strings of silent samples? Otherwise it will learn to generate waveforms with more discontinuities. I suggest that the degree of trimming be decided as a function of the size of the receptive field, e.g. truncate silences to no less than 75% of the receptive field.
Oh, that makes sense. It's classifying the next sample, not encoding the sequence as a whole. The trimming is currently applied to the beginning and end of the samples, not to the gaps in between speech. If there are long silence chunks in the samples, what could make sense is to split them in two rather than stripping out the silence.
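As a sketch of the capping idea above (shortening internal silences relative to the receptive field rather than stripping them out), something like the following could work; the `cap_silences` helper and the `top_db` threshold are illustrative assumptions, using `librosa.effects.split` to locate non-silent regions:

```python
import numpy as np
import librosa

def cap_silences(audio, receptive_field, top_db=30, max_fraction=0.75):
    """Shorten long silences instead of removing them entirely."""
    max_gap = int(max_fraction * receptive_field)
    intervals = librosa.effects.split(audio, top_db=top_db)  # (start, end) of non-silent regions
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = start - prev_end
        # Keep the silent gap, but cap it at max_gap samples.
        pieces.append(audio[prev_end:prev_end + min(gap, max_gap)])
        pieces.append(audio[start:end])
        prev_end = end
    # Trailing silence, also capped.
    pieces.append(audio[prev_end:prev_end + max_gap])
    return np.concatenate(pieces)
```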
I've just managed to generate a sample that I think sounds pretty decent: This is using Tanh instead of ReLU to avoid the issue that the ReLU activations eventually cut off the network. I used only two stacks of 9 dilation layers each:
{
    "filter_width": 2,
    "quantization_steps": 256,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256],
    "residual_channels": 32,
    "dilation_channels": 16,
    "use_biases": false
}
Nice!
I've noticed that generating from the same model doesn't always produce interesting output. Considering that the receptive field of this network is only ~1000 samples, I think the results sound quite promising.
Can you test with argmax instead of random.choice?
@lemonzi: After swapping out …
Interesting...
When calculating the mean amplitude with … it just produces noise for me.
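For context, the three decoding strategies being compared in the last few comments (sampling from the softmax distribution, taking the argmax, and using the mean amplitude) differ only in how the next quantized value is picked from the network's output. A small illustrative sketch, with a hypothetical `next_sample` helper:

```python
import numpy as np

def next_sample(probs, strategy="sample"):
    """Pick the next quantized sample from the softmax output `probs` (length 256)."""
    if strategy == "sample":
        # Draw from the predicted distribution.
        return int(np.random.choice(len(probs), p=probs))
    if strategy == "argmax":
        # Always take the most likely bin (deterministic).
        return int(np.argmax(probs))
    if strategy == "mean":
        # Expected bin index under the distribution ("mean amplitude").
        return int(np.round(np.sum(np.arange(len(probs)) * probs)))
    raise ValueError(strategy)
```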
Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead. As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner's list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them. I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:
Note that both of these key concerns are only relevant at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:
Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNNs on GitHub, but I haven't looked sufficiently deeply into it to tell how they encode their input.
Has everyone been reproducing ibab's results? I got a result similar to his, but I think it sounds a bit smoother; I'm guessing because the receptive field is a little bigger than his. 2 seconds: 10 seconds:
[Edit] After mortont's comment below: I used learning_rate=0.001. I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab's example. Loss was a bit lower than his, around 2.0-2.1. I think to get pauses between words and such we need a wider receptive field. That's my next step. By the way, anyone know how to make soundcloud loop the playback, instead of playing music at the end of the clip, like ibab did? Pro account needed for that? [Edit] Here's one from a model that has about a 250 ms receptive field, trained for about 16 hours:
Those results sound great! On soundcloud, you can set an audio clip to repeat in the bar at the bottom, but I don't think this will affect other listeners. Not sure why my clip was on repeat by default for you.
This is definitely the best result yet! What commit did you use to achieve this, @jyegerlehner? I tried reproducing it using the same hyperparameters and only speaker p280 from the corpus, but my model hasn't gone under a loss of 5 after 26k steps.
@mortont I've started training a newer model with the latest updates from master and it is working fine. I don't have any "special sauce" or changes to the code relative to master that I can think of. The only reason for a separate branch is to allow me to change the .json file and add shell scripts, and be able to switch back to master without losing files. I'm trying to imagine why you would have loss stuck at 5 and... can't think of a good reason. Perhaps compare the train.sh and resume.sh in my branch to the command-line arguments you are supplying and see if there's an important difference? Learning rate perhaps? [Edit]: I observe the loss to start dropping right away, within the first 50 steps. Loss drops to < 3 rapidly, well before 1K steps. So if you don't see that, I think something is wrong.
Looks like it was the learning rate; I changed it from 0.02 to 0.001 and it's now steadily dropping, thanks!
@neale, this is cool! It would be interesting, with this output as a baseline, to now take a bundle of pieces from the same composer -- doesn't have to be a lot -- and train the network on those alone, with the same settings, same procedure, etc. I'd be very curious to hear that output alongside what you've got. My sense is that training this WaveNet implementation on a large, diverse corpus is going to be tricky until we have a method for "conditioning" and telling the network that A is supposed to sound like B and C, but not as much like Y and Z, etc. Otherwise it just tries to generalize across the entire breadth of what it's hearing, and that's a lot to ask. Question for everyone: what's the math to compute the length of the receptive field for a given set of params/dilations? I know I should know this but… I do not 😬
@Nyrt those sound pretty good. What was your loss approximately when you generated those? And regarding the optimizer: are you using SGD/momentum optimizer for any particular reason? I haven't seen it learning as fast as adam or rmsprop, but your results are hard to argue with.
@robinsloan I've reread the paper and realized that classical music is absolutely beyond the ability of the model. With a 300ms receptive field, the multiple instruments probably sound just like what I posted. Also, the receptive field size is just the size of the convolution window. So in a regular CNN people usually use 3x3 or 5x5, two-dimensional receptive fields. We use 1D conv layers, and with the dilations the receptive fields get sparse by multiplying the length by [1, 2, 4, 8, ...] and zeroing portions of the filter.
Also, can someone edify me as to why anything but rmsprop is being used? I thought it would perform the best here.
@neale Er, I guess I mean, how do you know your receptive field is 300ms long? I get that it depends on sample rate and dilations, but I don't understand the arithmetic.
@robinsloan the receptive field length (in seconds) is just the sum of your dilation layers (the receptive field in unit-less samples) divided by your sample rate (1/seconds), so in the case of a … your receptive field would be the sum of the …
Somewhat related, @neale it may be worth trying a longer receptive field than 100ms, since the best speech samples have been in the 300ms+ range. The paper mentions fields >1 second for the music generation, but I think we could get results better than static if we upped the dilations to 6 blocks of 512 or so.
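Applying that estimate to the config posted earlier in this thread (two stacks of dilations 1..256 at 16 kHz) gives roughly the ~1000-sample figure mentioned above; a quick sketch of the arithmetic, treated as the rough estimate described here rather than the repo's exact formula:

```python
SAMPLE_RATE = 16000
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256,
             1, 2, 4, 8, 16, 32, 64, 128, 256]

# Rough receptive field in samples: sum of the dilations (filter_width == 2).
receptive_field_samples = sum(dilations)
receptive_field_seconds = receptive_field_samples / SAMPLE_RATE

print(receptive_field_samples, receptive_field_seconds)  # 1022 samples, ~0.064 s
```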
@mortont I would love to, but I can't even have a full [1..512] stack because I only have a gtx 970 :(
@neale I believe it was speaker 266 -- I can confirm this once I get back to my big machine. @jyegerlehner The loss was still oscillating a bit, but the minimum it hit was something around 1.5-1.6. Most of the time it was closer to 1.7-1.8, peaking at 2. The reason I was using the SGD/momentum optimizer was that it avoids an issue I was having early on where the loss would suddenly explode and start generating white noise. The slower training wasn't really an issue because it was running overnight anyway. I haven't tried RMSprop yet, though.
In the quest for ever better audio I made some samples that sound a lot better than anything I could get before.
unmentioned: rmsprop and an initial LR of 0.001. I used a receptive field of 2s, on some 20 hours of solo piano that I scraped off youtube. I trained this out to 100k steps, and those are still generating. Unfortunately I had to decrease the sample size to 32000 for lack of available memory. I just grabbed an 8GB 1070, so in a few days I'll triple my model and try it out.
I just grabbed some P100's and have trained a large wavenet. :) (Trained with the current default wavenet_params.json)
Nako, what was the dataset size (# of files, length of files)? For how long did you train?
@Cortexelus 158 files, 200K bytes each. ~50k iterations with learning rate annealing. :)
@nakosung at what sample rate?
In the paper, the receptive field is about 300ms and the sample rate is 16000. So the current default setting is good in my opinion, except for things like sample size, number of iterations, learning rate, and optimizer, which we can tweak through experiments to avoid overfitting and get better convergence. To generate good sound, we should also consider local conditioning, e.g. on text (as mentioned in Section 3.1 of the paper).
@Cortexelus default setting (16K). New sample with silence. https://soundcloud.com/nako-sung/test-3-wav
I've just added dropout and concat-elu (#184). I hope dropout will help the generator's quality. :)
Hi all, I am using the default settings on an EC2 p2.xlarge and am getting 2.7 sec/step. I am pretty certain I saw a post on one of the issues describing 0.5 sec steps. Just wondering if anyone had some tips.
@willjhenry In my case I got 3.5 sec/step using the KAIST Korean corpus (IBM PowerPC, P100).
Is the goal here to produce random speech? If so, it sounds like nakosung nailed it. What about text-to-speech? I'd love to start using this in OpenAssistant for our TTS. Please help us? Festival is so poor... https://github.com/vavrek/openassistant
Generating random speech is a step on the way to generating speech conditioned on what you want it to say (i.e. generating non-random speech). Please see the WaveNet paper. I don't think this project, at least in its current form, is a candidate for your TTS solution: 1) there's no local conditioning on the desired speech implemented in the repo, at least yet, and 2) even if there were, it doesn't generate audio in real time. It takes M seconds to generate N seconds of audio, where M >> N.
@willjhenry @nakosung On the time-per-step: the code in the master of this repo produces a training batch by pasting together subsequent audio clips from the corpus until the length is at least sample_size=100000 samples long. I think this is wrong because it is training the net to predict a discontinuity where it transitions from the first to second clip. This is fixed in both the global condition PR and koz4k's PR. Since most of the VCTK corpus clips are much shorter than 100000 samples, these branches produce faster step times as the number of samples in the train batch tends to be smaller (than what you get with the ibab master). Having said all that, I was getting about 0.5-1.0 second steps on my branch, on Titan XP.
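A rough sketch of the difference being described (not the repo's actual code; `clips` is assumed to be an iterable of 1-D NumPy arrays): concatenating clips up to sample_size can place a clip boundary inside a training window, while cutting windows per clip never does.

```python
import numpy as np

SAMPLE_SIZE = 100000  # value mentioned above

def batches_concatenated(clips):
    """Sketch of the behaviour described for master: paste clips together until
    at least SAMPLE_SIZE samples, so a window can straddle a clip boundary."""
    buffer = np.array([], dtype=np.float32)
    for clip in clips:
        buffer = np.concatenate([buffer, clip])
        while len(buffer) >= SAMPLE_SIZE:
            yield buffer[:SAMPLE_SIZE]
            buffer = buffer[SAMPLE_SIZE:]

def batches_per_clip(clips):
    """Per-clip alternative: windows never cross a clip boundary, so the net is
    never trained to predict the artificial jump between two unrelated clips."""
    for clip in clips:
        for start in range(0, len(clip), SAMPLE_SIZE):
            yield clip[start:start + SAMPLE_SIZE]
```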
@nakosung That's a really cool sample you've got there: https://soundcloud.com/nako-sung/test-3-wav. I noticed that your model produces very smooth-sounding vowels. I trained my model with 50k+ steps (current default parameters) and still have some considerable tremor in the vowels: https://soundcloud.com/belevtsoff/wavenet_54k_audiobook. My training corpus is an audiobook with 1+ hours of clean speech. Also, I use a RandomShuffleQueue for feeding input data. Can you think of a possible reason for this poor quality?
Can't you just use Google or Siri to produce the corpus? Maybe through an API you could send a word as text to them and get back a sound. There's your training data. Eventually you could get longer and longer sentences until it can handle anything.
@nakosung Have you trained on multiple GPUs with multiple nodes? The default training just runs on one GPU. Do you mind sharing how to train it on multiple nodes?
@AlvinChen13 Multi-node training didn't scale well. I think it has a bottleneck in network bandwidth (although I didn't test ASGD). I switched to a TitanXP, which had larger memory, so I stopped the multi-node/GPU experiments.
@nakosung Would you mind sharing your distributed code? Our lab has 8 nodes with 2 M40s each, and 4 nodes with 2 K40s each, and all are connected to a 40G switch. I believe multi-node training is necessary if we train on huge datasets with bigger receptive fields. It is worth some effort to investigate. Google claimed that TF 1.0 can achieve a 58x performance improvement with 64 GPUs for Inception v3.
Let's discuss strategies for producing audio samples.
When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.
Some ideas I've had to improve on this:
librosa