
Audio #32

Closed
faiface opened this issue Jun 2, 2017 · 29 comments

@faiface
Owner

faiface commented Jun 2, 2017

Audio is a major missing feature in Pixel. The time has come to fix this. This issue serves as a place for design and implementation discussion, as well as for progress reporting.

Requirements

Here, I'll summarize the most important requirements I have for the implementation of an audio system in Pixel.

  • Playing sound and music files
  • Playing arbitrary generated streams of PCM samples
  • Support for, and preference of, streaming, even from files
  • Ability to stop the playback at any time
  • Easily play an arbitrary number of sounds simultaneously
  • Playback must be smooth, without glitches, and as precise in timing as possible
  • Ability to create arbitrary stream-based audio effects both in the library and by the user of the library (= defining 1 or 2 interfaces for this)

Design

Let's define two abstract interfaces with no specific definitions yet:

  1. Wave generator - an object of this type is capable of generating a stream of PCM waves
  2. Wave receiver - an object of this type is capable of receiving a stream of PCM waves and acting accordingly

Examples: A loaded sound file is a wave generator. A speaker is a wave receiver, since it receives the waves and plays them. An audio effect is both receiver and generator: it receives waves, modifies them and generates the modified waves.

Having defined these interfaces, here are a few examples of how the user would chain them together to create the final audio result. Objects sending to the right are generators and objects receiving from the left are receivers.

[ clap.mp3 ] -> [ volume adjustment, 0.5x ] -> [ speaker ]
[ sine wave generator ] -> [ distortion effect ] -> [ phaser effect ] -> [ speaker ]
[ music.wav ] -> [ speaker ]

The "wave generator" and "wave receiver" names can and probably will change.

Implementation

This is not decided yet. Here are a few examples of possible implementations, all of which are a bit problematic.

1.

type WaveGenerator interface {
    Generate(format Format, p []byte) (n int, err error)
}

type WaveReceiver interface {
    Receive(format Format, p []byte) (n int, err error)
}

I don't like this implementation. For a stream of PCM waves, it requires the user to actively feed the speaker with new data each frame. This would be very hard to use; we would probably end up implementing more abstractions on top of it.

2.

type WaveGenerator interface {
    Generate(format Format, p []byte) (n int, err error)
}

type WaveReceiver interface {
    Receive(WaveGenerator)
    Update()
}

Here's how you would use this implementation:

sound := loadSound("hello.mp3") // WaveGenerator
effect.Receive(sound)
speaker.Receive(effect)

for !win.Closed() {
    speaker.Update()
}

This way, it'd be easy to create arbitrary chains of generators/receivers. However, questions and problems arise when we ask: How do we play multiple sounds together? How do we stop the playback? How do we play two sounds one after another?

What needs to be done

First, we need to find an actual design that meets all of the requirements. Then we need to implement it.

Low level stuff

For the actual audio playback, we'll use the awesome oto package by @hajimehoshi, which supports audio playback on all major platforms.

@faiface faiface added the feature label Jun 2, 2017
@faiface faiface added this to the 0.7 milestone Jun 2, 2017
@rawktron

rawktron commented Jun 3, 2017

Some comments from the Gitter channel on the above questions:

  • Playing multiple sounds at once - well… you probably need some concept of a "mix" or "mux" bus object at some point. So this would be another input/output thing, and it would usually have N inputs; every frame it reads from those inputs and mixes the waveforms together. Now, this is likely handled already at the driver level - but I think the point stands that you probably don't want to be writing to separate 'speaker' objects for every sound - there should be one Master Bus that mixes sounds and then outputs to some speaker configuration (which I guess is always stereo?)

  • How to stop/track sounds - so, there are a number of ways to do this, but often, in many engines, if you Start Playback on a sound, it returns a handle to the currently playing object. So in Go, that could either be a pointer to the object, OR, maybe a channel that is attached to that object? Then you can send it Stop/Status/Pause/Start commands. Status is super important for precise playback.

  • Queuing sounds. You might need a higher-level abstraction to deal with this. You need the concept of a "Sound" or an "Event" differentiated from a "Sample". The primary way to do this is to have some kind of thing that can either take a list of samples and play them sequentially, OR take a timeline where you can specify start times that might not be immediately sequential. Alternately, to remain consistent with the chaining idea here, you could create a special kind of "Sequential" mix-bus, so if you wrote samples to that, they would play sequentially, whereas the regular kind of mix-bus would play anything you sent to it simultaneously.

  • Passing byte arrays around has the benefit that you can relatively easily track the precise sample-accurate position, which means you can have precise timing information. Including precise musical time.

  • Also for clarity I'm using 2 different definitions of 'Samples' here - in the Queuing example when I say Sample - I mean like in the sense of a sample used in a Sampler - so an individual .wav file. In the timing example, I mean a sample in the sense of 1 32-bit or 16-bit chunk of audio data in the audio buffer.

  • Also something you likely want to add to your requirements that is fundamental: streaming vs. memory. On PC this is sort of not a huge deal, but on any other memory constrained platform like console or mobile, you will almost always want to stream large ambiences/music from a URL (which could be network or disk) vs. just storing the audio data in memory.

@faiface
Owner Author

faiface commented Jun 5, 2017

OK, after a few days of thinking I've come up with a pretty decent API design, I guess.

Interfaces

type Streamer interface {
    Stream(samples []float32) (n int, err error)
}

type Player interface {
    Play(Streamer) Handle
}

type Handle interface {
    Stop()
    Time() time.Duration
    // maybe a few more methods
}

Now, let's describe them in detail.

Streamer

The Streamer interface is very similar to io.Reader; its Stream method does basically the same thing as io.Reader's Read method: it fills the provided samples slice with data and returns the number of written samples plus a potential error, such as io.EOF. It also advances the streamer by the returned number of samples.

Initially, I thought that the Stream method would also take an additional format argument, which would specify the format of the samples. Feel free to argue, but eventually I've come to the conclusion that having a standardized sample format is better. Internally, a streamer can use whatever format it wants. But once the samples are requested, it needs to provide them in the standard format (such as 44100 samples per second, 32-bit floats, two channels; we'll need to figure out the exact format).

So, here's an example implementation of a streamer that would produce a simple sine wave:

type SineWaveStreamer struct {
    position int
}

func (sws *SineWaveStreamer) Stream(samples []float32) (n int, err error) {
    for i := range samples {
        samples[i] = float32(math.Sin(float64(sws.position + i) / 100))
    }
    sws.position += len(samples)
    return len(samples), nil
}

Simple, right?

Player

Now, the Player interface does not mimic the io.Writer interface, because that would be very cumbersome to use (think about why). Instead, it provides a Play method which takes a Streamer and returns a Handle. So, how should it work?

Calling the Play method should add/append/set (depending on the concrete Player) the provided Streamer to the Player. The Player should then pull data from this streamer when necessary. Let's see an example.

Here's a very primitive example. This Player is a Streamer at the same time, so it's an audio effect. What it does is lower the volume by 50% for a single streamer.

type LowerVolumeEffect struct {
    s Streamer
}

func (lve *LowerVolumeEffect) Play(s Streamer) Handle {
    lve.s = s
    return &lveHandle{lve}
}

func (lve *LowerVolumeEffect) Stream(samples []float32) (n int, err error) {
    if lve.s == nil {
        // no source streamer, so output silence
        for i := range samples {
            samples[i] = 0
        }
        return len(samples), nil
    }
    n, err = lve.s.Stream(samples)
    for i := range samples[:n] {
        samples[i] /= 2 // half volume
    }
    return n, err
}

type lveHandle struct {
    lve *LowerVolumeEffect
}

func (lh *lveHandle) Stop() {
    lh.lve.s = nil
}

// Time method omitted for simplicity

A more complicated player would do more advanced stuff. For example, a sequencer would append the provided streamer to its list of streamers and would stream those streamers one after another. A mixer would add the provided streamer to its list of streamers and would mix them when streaming. A speaker player would regularly pull new data from the provided streamers and play them through the actual speaker.
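
For illustration, here's a rough, untested sketch of what such a mixer could look like (not a final API; drained streamers aren't removed and the returned Handle is left out for brevity):

type Mixer struct {
    streamers []Streamer
}

func (m *Mixer) Play(s Streamer) Handle {
    m.streamers = append(m.streamers, s)
    return nil // a real mixer would return a handle controlling s
}

func (m *Mixer) Stream(samples []float32) (n int, err error) {
    for i := range samples {
        samples[i] = 0 // start from silence
    }
    tmp := make([]float32, len(samples))
    for _, s := range m.streamers {
        sn, _ := s.Stream(tmp)
        for i := range tmp[:sn] {
            samples[i] += tmp[i] // mix by adding the samples together
        }
    }
    return len(samples), nil
}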

Handle

Handle allows us to control and monitor audio playback. I guess it doesn't require more explanation.

Sound files

We might be tempted to treat a sound file as a streamer, but that would be wrong. A Streamer gets drained; a sound file can be played multiple times. So, when playing a sound file, I imagine an API like this:

sound := loadSound("clap.mp3")
sequencer.Play(sound.Streamer())
sequencer.Play(sound.Streamer())
sequencer.Play(sound.Streamer())
speaker.Play(sequencer)

Each sound.Streamer() call returns a new streamer that streams the sound file. The above code would play the clapping sound three times through the speaker.
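
To make the distinction concrete, here's a rough sketch of how such a sound type could work (names are placeholders, not a final API; uses io.EOF from the standard io package):

type Sound struct {
    samples []float32 // decoded audio data kept in memory
}

// Each call returns a fresh streamer over the same data.
func (s *Sound) Streamer() Streamer {
    return &soundStreamer{sound: s}
}

type soundStreamer struct {
    sound    *Sound
    position int
}

func (ss *soundStreamer) Stream(samples []float32) (n int, err error) {
    if ss.position >= len(ss.sound.samples) {
        return 0, io.EOF // this streamer is drained, the Sound itself is not
    }
    n = copy(samples, ss.sound.samples[ss.position:])
    ss.position += n
    return n, nil
}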

Speaker

The speaker needs to regularly pull new data from the streamers provided via the speaker.Play method. When to pull it? One way would be in a different goroutine, but that would require synchronization. I think the best way would be to have a speaker.Update method. In each call, the speaker would pull the appropriate amount of data from the streamers. The user of the library would be required to call this Update method each frame.

for !win.Closed() {
    // stuff, stuff, stuff
    speaker.Update() // pulls new data from streamers and plays them
}
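
Internally, I imagine Update doing roughly this (just a sketch, not real code; encodeToBytes is a hypothetical float32 → PCM byte conversion helper, and out could be, for example, an oto player):

type speakerImpl struct {
    streamer Streamer
    out      io.Writer // the underlying audio output, e.g. an oto player
    buf      []float32 // one tick's worth of samples
}

func (s *speakerImpl) Update() {
    if s.streamer == nil {
        return // nothing is playing
    }
    n, err := s.streamer.Stream(s.buf)
    s.out.Write(encodeToBytes(s.buf[:n])) // hypothetical float32 -> PCM bytes conversion
    if err != nil {
        s.streamer = nil // the streamer is drained (e.g. io.EOF)
    }
}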

Conclusion

I think this API is pretty good. I'd appreciate any comments on it. We need to figure out a few things:

  • What sample format to use?
  • What methods to include in Handle?

Then we need to split up the work and start working! ;)

@faiface
Owner Author

faiface commented Jun 5, 2017

Maybe we could even use float64 in the Stream method. The streaming would always happen in small chunks, so float64 would probably not be so bad memory-wise. Of course, a sound file would internally store its data in a more memory-friendly format.

@hajimehoshi

How does loadSound detect the file type?

@faiface
Owner Author

faiface commented Jun 5, 2017

@hajimehoshi No idea, it's just a placeholder in the code, so far.

@joeblew99

I added an example of recording audio to the oto lib too.

It's here: ebitengine/oto#8

@faiface
Owner Author

faiface commented Jun 6, 2017

It'd also be possible to get rid of Handle and move those methods to Streamer instead. This would probably make more sense, because a Streamer (usually) has to keep track of the time anyway. Calling the Stop method on a Streamer would cause all of the following Stream calls to return EOF.

type Streamer interface {
    Stream(samples []float64) (n int, err error)
    Time() time.Duration
    Stop()
}

type Player interface {
    Play(Streamer)
}
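
A streamer honouring Stop could look roughly like this (just a sketch; the sample rate used in Time is an assumption for illustration):

type sineStreamer struct {
    position int
    stopped  bool
}

func (s *sineStreamer) Stream(samples []float64) (n int, err error) {
    if s.stopped {
        return 0, io.EOF // once stopped, every Stream call reports EOF
    }
    for i := range samples {
        samples[i] = math.Sin(float64(s.position+i) / 100)
    }
    s.position += len(samples)
    return len(samples), nil
}

func (s *sineStreamer) Time() time.Duration {
    // assuming 44100 samples per second for illustration
    return time.Duration(s.position) * time.Second / 44100
}

func (s *sineStreamer) Stop() {
    s.stopped = true
}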

@faiface
Owner Author

faiface commented Jun 6, 2017

How to rewind? I think something like this would work, without changing any interface API. Arbitrary streamers cannot be rewound anyway.

currentStreamer.Stop()
currentStreamer = sound.StreamerAt(time)
player.Play(currentStreamer)

StreamerAt would simply return a streamer that streams the given file from the given time. So what we did is stop the current playback and replace it with a new one starting at the correct time.
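
For an in-memory sound, StreamerAt could be as simple as this (building on the hypothetical Sound type sketched earlier; sampleRate stands for whatever rate the data was decoded at):

func (s *Sound) StreamerAt(t time.Duration) Streamer {
    offset := int(t.Seconds() * float64(sampleRate)) // time -> sample index
    if offset > len(s.samples) {
        offset = len(s.samples)
    }
    return &soundStreamer{sound: s, position: offset}
}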

@faiface
Owner Author

faiface commented Jun 6, 2017

So, one thing that was bugging me is that the Time and Stop methods of Streamer/Handle would always be implemented the same way, which would cause a lot of boilerplate code. Then I realized that these methods are usually only used with a very small number of streamers, such as the actual sound files. They will rarely be used with effect streamers, for example. So it might actually not be a bad idea to implement them separately, in a 'wrapper type' for controlling audio. This is how it would look:

type Streamer interface {
	Stream(samples []float64) (n int, err error)
}

type Ctrl struct {
	Streamer Streamer
	Paused   bool
	Time     time.Duration
}

func (c *Ctrl) Stream(samples []float64) (n int, err error) {
	if c.Streamer == nil {
		return 0, io.EOF
	}
	if c.Paused {
		// output silence while paused
		for i := range samples {
			samples[i] = 0
		}
		return len(samples), nil
	}
	n, err = c.Streamer.Stream(samples)
	c.Time += time.Duration(n) * SampleDuration // SampleDuration = duration of a single sample
	return n, err
}

So, with this, if we want to control the playback of a streamer, we wrap it inside a Ctrl, like this:

music := &audio.Ctrl{Streamer: musicFile.Streamer()}
speaker.Play(music)

Then, when we want to pause the playback, we just set the Paused flag:

music.Paused = true

When we want to check the current playing time, we just check the Time field:

fmt.Printf("Currently at: %v\n", music.Time)

To fully stop the music playback, we set the streamer to nil:

music.Streamer = nil

@hajimehoshi

Sounds nice, but the length of the slice given to Stream is determined by the caller, and I think this implementation implicitly assumes a kind of limited size of that slice. If there is no limitation, it would be legal for the caller to give an arbitrarily long slice to Stream, but this could get out of sync with the Paused state.

@faiface
Owner Author

faiface commented Jun 7, 2017

@hajimehoshi That is true, but think about when this could possibly become unsynchronized. The speaker would always pull only as much data as it needs, so no problem there. Let's think about a speed effect, which would sometimes pull more or less depending on its speed. But it would still only take as much as it needs. Saving sound to a file? Here, the encoder is free to take an arbitrary number of samples, but pausing makes no sense there. So, I think that once we require that everyone always takes only as many samples as they need, your concern will only reflect the fact that pausing only makes sense in real time. Hope this wasn't too confusing.

@hajimehoshi

Speaker is the user of Streamer? So in fact there is an assumption about the size of the slice, right? I think it's ok as long as this is clarified :-)

@hajimehoshi

Speaker would always pull only as much data as it needs

The number of samples pulled must be exactly as much as needed, or the delay can accumulate...

@faiface
Owner Author

faiface commented Jun 7, 2017

Yeah, there is an assumption that samples requested by Stream are used immediately and are not buffered.

@faiface
Owner Author

faiface commented Jun 7, 2017

And now that oto has implemented a buffer size, delay can't accumulate if the speaker is implemented properly ;)

@faiface
Owner Author

faiface commented Jun 7, 2017

And btw, Streamer is the only way to play anything here, so yeah, the speaker uses a streamer, because that's the way to tell it what to play.

@hajimehoshi

Ah, so data can be discarded when oto's buffer overflows, and then the delay is reduced. That makes sense.

@faiface
Owner Author

faiface commented Jun 7, 2017

Sort of. Not really discarded, but postponed and pushed to oto's Player later.

@faiface
Owner Author

faiface commented Jun 8, 2017

We had a long and productive discussion which resulted in several modifications and clarifications to the API I proposed earlier. I will sum those up here, plus I'll introduce one little idea of my own which hasn't been discussed yet.

Unified sample format

The first thing we need to understand is that the unified sample format only affects samples transferred through streaming (by a Streamer). Internally, a file can be loaded in whatever sample format, but once it is to be streamed, the streamed samples need to follow the unified format.

@rawktron objected that one single unified sample rate is a very bad idea, because it would usually result in resampling, which can reduce the audio quality. The solution is to introduce a global SampleRate variable, which can be set by the user. This variable sets the sample rate of the unified sample format and everyone has to follow it.

var SampleRate = 48000

Other than that, the sample format is two channels in float64. It's float64 because this format will only be used for streaming. When streaming, only small chunks of data are ever in use at any moment, so memory is not too much of a worry. Using float64 makes it compatible with the standard "math" library, which makes it easier to use. This choice may be revised if it ends up causing performance issues.
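
As an example of working with the unified format, converting between real time and sample counts would look something like this (just a sketch, helper names are placeholders):

func numSamples(d time.Duration) int {
    return int(d.Seconds() * float64(SampleRate))
}

func duration(n int) time.Duration {
    return time.Duration(n) * time.Second / time.Duration(SampleRate)
}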

Streamer

Here's the new form of the Streamer interface.

type Streamer interface {
    Stream(samples [][2]float64) (n int, ok bool)
}

Replacing err with ok is my idea here, which we haven't discussed yet. We were discussing error handling a little bit and figured that the most common and probably only possible error is the expected io.EOF signalling the end of the stream. If this is actually true, we don't need to return an error; ok is sufficient, where false indicates the end of the stream.

The samples argument is now [][2]float64. Value samples[i][0] is the i-th sample on the left channel, samples[i][1] is the i-th sample on the right channel.

There are three possible return patterns of the Stream method:

  1. n == len(samples) && ok - Whole slice of samples successfully streamed.
  2. n > 0 && n < len(samples) && ok - In this case, streamer streamed some samples, but reached the end of the stream. All following calls to Stream should result in the third (following) case.
  3. n == 0 && !ok - The streamer is already drained. No more samples will ever come.
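
So, given a Streamer s, fully draining it would look like this (sketch; process stands for whatever consumes the samples):

buf := make([][2]float64, 512)
for {
    n, ok := s.Stream(buf)
    process(buf[:n]) // hypothetical consumer of the streamed samples
    if !ok {
        break // case 3: the streamer is drained
    }
}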

Regarding the Player interface: I don't think it's really necessary. Who would accept it as an argument anyway? As we'll see, most of the 'effects' and 'compositors' won't take the form of the formerly proposed Player anyway.

Basic compositors

We can define several types for composing other streamers in various ways. These types will of course be streamers too, so composition can reach arbitrary depth. Here are the three basic compositors I think are most useful. Since they're so useful, I think we can shorten their names to avoid too much typing.

Seq

type Seq []Streamer

This type will stream the streamers in the slice one by one with perfect precision (i.e. zero silence between two consecutive streamers).
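
A rough, untested sketch of how Seq could stream (a pointer receiver is one way to drop drained streamers from the front between calls):

func (s *Seq) Stream(samples [][2]float64) (n int, ok bool) {
    for len(samples) > 0 && len(*s) > 0 {
        sn, sok := (*s)[0].Stream(samples)
        n += sn
        samples = samples[sn:]
        if !sok {
            *s = (*s)[1:] // the head streamer is drained, move on to the next one
        }
    }
    return n, n > 0 || len(*s) > 0
}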

Mix

type Mix []Streamer

This type will stream the streamers in the slice simultaneously, adding (with +) samples together and thus mixing the sounds.
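
And a similarly rough sketch of Mix (untested; drained streamers are simply skipped here rather than removed):

func (m Mix) Stream(samples [][2]float64) (n int, ok bool) {
    for i := range samples {
        samples[i] = [2]float64{} // start from silence
    }
    tmp := make([][2]float64, len(samples))
    for _, st := range m {
        sn, sok := st.Stream(tmp)
        if !sok {
            continue
        }
        for i := range tmp[:sn] {
            samples[i][0] += tmp[i][0] // add the samples together
            samples[i][1] += tmp[i][1]
        }
        if sn > n {
            n = sn
        }
    }
    return n, n > 0
}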

Sched

type Sched []struct {
    After time.Duration
    S     Streamer
}

With Sched, each streamer will start streaming after the specified time. Note that the time is not real time; instead, it's determined by the number of streamed samples. This type is actually a generalization of the former two types, but they should be implemented separately for convenience and performance.

@faiface
Owner Author

faiface commented Jul 3, 2017

Ok, guys, time to get some real work done! Here's what needs to be done:

  • audio
    • Streamer interface
    • Buffer struct, stores audio data in a specific format, has Streamer(from, until time.Duration) method which creates a streamer, can be created from a streamer (by accumulating)
    • Ctrl
    • Compositors
      • Seq
      • Mix
      • Sched
    • speaker playback package
    • wav (decoder package)
      • Decoder, has Streamer(from, until time.Duration) method, but reads lazily on demand using seeking
    • ogg (decoder package)
      • Decoder, same as with wav
    • mp3 (decoder package)
      • Decoder, same as with wav
    • Audio effects
      • Volume
      • Positional
      • Equalizer
      • we'll figure out more

The overall structure can change after discussion.

Now, we need to get to work, so, anyone who wants to contribute, please:

  1. Respond to this issue with one thing (at a time) you want to be working on, the minimum time it will take you to finish, and the maximum time it will take you to finish.
  2. Start working, feel free to ask questions and discuss the design. Please, discuss before making any significant design decisions.
  3. Make sure to finish somewhere between the minimum and maximum time.

@faiface
Owner Author

faiface commented Jul 3, 2017

To start, I am taking the Streamer interface.
Minimum time: 10 minutes.
Maximum time: 4 hours.

@faiface
Owner Author

faiface commented Jul 3, 2017

Streamer interface done.

@alistanis
Contributor

alistanis commented Jul 3, 2017

I'll take the initial Speaker implementation (or playback).

Minimum time: probably 30 minutes
Maximum time: 2-4 hours to get it performant

I'll be starting this tomorrow afternoon, 7/4/2017 (I have the day off)

@faiface
Owner Author

faiface commented Jul 4, 2017

I'm taking compositors (Seq, Mix, Sched).
Minimum time: 1 day
Maximum time: 4 days

@faiface
Owner Author

faiface commented Jul 4, 2017

@alistanis The Speaker API as I imagine it:

speaker.Play(streamer) // starts playing streamer, called when starting to play a streamer
speaker.Update()       // pulls data from the playing streamer / streamers, called every tick

Initially, a speaker can play one streamer at a time. But eventually, we want to be able to play an arbitrary number of streamers through a speaker.

The reason for a separate Update() method is to avoid race conditions, but this may be revisited later.

@faiface
Owner Author

faiface commented Jul 8, 2017

Ok, I did Mix, Seq, Ctrl and speaker and I'll leave Sched for later, because implementing it efficiently is non-trivial, plus it's not as important yet.

@faiface
Owner Author

faiface commented Jul 10, 2017

I'm taking WAV decoder. This will also serve as a reference API for other decoders.
Min: 1 day
Max: 5 days

@alistanis
Contributor

@faiface have we determined which decoders we want to support yet? I'm thinking wav/ogg/aiff/mp3 is a decent combination to start with, with mp3 being the most difficult (and also the one I have the most experience with now)

Right now I'm dedicating this week to implementing low-latency playback for macOS with the Audio HAL in oto, and due to lack of free time I expect that to take up the rest of this week and perhaps some of the weekend, even though I have some C prototypes already working - one plays a wav file and the other plays a sine wave. It's just a lot of C code, and the interface isn't as compatible with ours/oto's as ALSA is.

Min: 5 days
Max: 2 weeks

@faiface faiface mentioned this issue Jul 14, 2017
@faiface
Owner Author

faiface commented Jul 15, 2017

The discussion is moving to faiface/beep#1. Hope to see you there ;)
