How it works: Linux audio explained


There's a problem with the state of Linux audio, and it's not that it doesn't always work. The issue is that it's overcomplicated. This soon becomes evident if you sit down with a piece of paper and try to draw the relationships between the technologies involved with taking audio from a music file to your speakers: the diagram soon turns into a plate of knotted spaghetti. This is a failure because there's nothing intrinsically more complicated about audio than any other technology. It enters your Linux box at one point and leaves at another.

If you've had enough of this mess and want to understand just how all the bits fit together, we're here to help - read on to learn exactly how Linux audio works!

If we were drawing the OSI model used to describe the networking framework that connects your machine to every other machine on the network, we'd find clear strata, each with its own domain of processes and functionality. There's very little overlap in layers, and you certainly don't find end-user processes in layer seven messing with the electrical impulses of the raw bitstreams in layer one.

Yet this is exactly what can happen with the Linux audio framework. There isn't even a clearly defined bottom level, with several audio technologies messing around with the kernel and your hardware independently. Linux's audio architecture is more like the layers of the Earth's crust than the network model, with lower levels occasionally erupting on to the surface, causing confusion and distress, and upper layers moving to displace the underlying technology that was originally hidden.

The Open Sound System, for example, used to be found at the kernel level talking to your hardware directly, but it's now a compatibility layer that sits on top of ALSA. ALSA itself has a kernel-level stack and a higher-level API for programmers to use, mixing drivers and hardware properties with the ability to play back surround sound or an MP3 codec. When most distributions stick PulseAudio and GStreamer on top, you end up with a melting pot of instability with as much potential for destruction as the San Andreas fault.

Here's a simplified view of the audio layers typically used in Linux. The deeper the layer, the closer to the hardware it is.


ALSA

INPUTS: PulseAudio, Jack, GStreamer, Xine, SDL, ESD

OUTPUTS: Hardware, OSS

As Maria von Trapp said, "Let's start at the very beginning." When it comes to modern Linux audio, the beginning is the Advanced Linux Sound Architecture, or ALSA. This connects to the Linux kernel and provides audio functionality to the rest of the system. But it's also far more ambitious than a normal kernel driver; it can mix, provide compatibility with other layers, create an API for programmers and work at such a low and stable latency that it can compete with the ASIO and CoreAudio equivalents on the Windows and OS X platforms.

ALSA was designed to replace OSS. However, OSS isn't really dead, thanks to a compatibility layer in ALSA designed to enable older, OSS-only applications to run. It's easiest to think of ALSA as the device driver layer of the Linux sound system. Your audio hardware needs a corresponding kernel module, prefixed with snd_, and this needs to be loaded and running for anything to happen. This is why you need an ALSA kernel driver for any sound to be heard on your system, and why your laptop was mute for so long before someone thought of creating a driver for it. Fortunately, most distros will configure your devices and modules automatically.
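You can see this layer for yourself from a terminal. The following is a sketch of the usual checks (the exact module names and card numbers depend entirely on your hardware, and `aplay` comes from the alsa-utils package):

```shell
# List the snd_ kernel modules ALSA has loaded for your hardware
lsmod | grep '^snd'

# Ask the kernel driver which soundcards it detected
cat /proc/asound/cards

# List the playback devices ALSA exposes to applications
aplay -l
```

If `/proc/asound/cards` comes back empty, no ALSA driver has claimed your hardware, which is exactly the "mute laptop" situation described above.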

ALSA is responsible for translating your audio hardware's capabilities into a software API that the rest of your system uses to manipulate sound. It was designed to tackle many of the shortcomings of OSS (and most other sound drivers at the time), the most notable of which was that only one application could access the hardware at a time. This is why a software component in ALSA needs to manage audio requests and understand your hardware's capabilities.

If you want to play a game while listening to music from Amarok, for example, ALSA needs to be able to take both of these audio streams and mix them together in software, or use a hardware mixer on your soundcard to the same effect. ALSA can also manage up to eight audio devices and sometimes access the MIDI functionality on hardware, although this depends on the specifications of your hardware's audio driver and is becoming less important as computers get more powerful.

This screenshot of Alsa Mixer shows off everything that's wrong with Linux audio - it really doesn't need to be this complicated.

Where ALSA does differ from the typical kernel module/device driver is in the way it's partly user-configurable. This is where the complexity in Linux audio starts to appear, because you can alter almost anything about your ALSA configuration by creating your own config file - from how streams of audio are mixed together and which outputs they leave your system from, to the sample rate, bit-depth and real-time effects.
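As a taste of that configurability, here's a minimal sketch of a user-level ~/.asoundrc that routes the default device through ALSA's dmix software mixer, so several applications can play at once even on hardware with no mixing of its own. The card address hw:0,0 and the 48kHz rate are assumptions you'd adjust for your system:

```
pcm.!default {
    type plug
    slave.pcm "dmixed"
}

pcm.dmixed {
    type dmix
    ipc_key 1024          # any unique integer, shared by apps using this mix
    slave {
        pcm "hw:0,0"      # first device on the first card - adjust to suit
        rate 48000
    }
}
```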

ALSA's relative transparency, efficiency and flexibility have helped to make it the standard for Linux audio, and the layer that almost every other audio framework has to go through in order to communicate with the audio hardware.


PulseAudio

INPUTS: GStreamer, Xine, ALSA


If you're thinking that things are going to get easier with ALSA safely behind us, you're sadly mistaken. ALSA covers most of the nuts and bolts of getting audio into and out of your machine, but you must navigate another layer of complexity. This is the domain of PulseAudio - an attempt to bridge the gap between hardware and software capabilities, local and remote machines, and the contents of audio streams. It does for networked audio what ALSA does for multiple soundcards, and has become something of a standard across many Linux distros because of its flexibility.

As with ALSA, this flexibility brings complexity, but the problem is compounded by PulseAudio because it's more user-facing. This means normal users are more likely to get tangled in its web. Most distros keep its configuration at arm's length; with the latest release of Ubuntu, for example, you might not even notice that PulseAudio is installed. If you click on the mixer applet to adjust your soundcard's audio level, you get the ALSA panel, but what you're really seeing is ALSA going to PulseAudio, then back to ALSA - a virtual device.

At first glance, PulseAudio doesn't appear to add anything new to Linux audio, which is why it faces so much hostility. It doesn't simplify what we have already or make audio more robust, but it does add several important features. It's also the catch-all layer for Linux audio applications, regardless of their individual capabilities or the specification of your hardware.

PulseAudio is powerful, but often derided for making Linux audio even more complicated.

If all applications used PulseAudio, things would be simple. Developers wouldn't need to worry about the complexities of other systems, because PulseAudio brings cross-platform compatibility. But this is one of the main reasons why there are so many other audio solutions. Unlike ALSA, PulseAudio can run on multiple operating systems, including other POSIX platforms and Microsoft Windows. This means that if you build an application to use PulseAudio rather than ALSA, porting that application to a different platform should be easy.

But there's a symbiotic relationship between ALSA and PulseAudio because, on Linux systems, the latter needs the former to survive. PulseAudio configures itself as a virtual device connected to ALSA, like any other piece of hardware. This makes PulseAudio more like Jack, because it sits between ALSA and the desktop, piping data back and forth transparently. It also has its own terminology. Sinks, for instance, are the final destination. These could be another machine on the network or the audio outputs on your soundcard courtesy of the virtual ALSA. The parts of PulseAudio that fill these sinks are called 'sources' - typically audio-generating applications on your system, audio inputs from your soundcard, or a network audio stream being sent from another PulseAudio machine.
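You can poke at this terminology directly with PulseAudio's pactl tool. The commands below are a sketch that assumes a running PulseAudio daemon, and the hostname is a placeholder for another machine on your network running PulseAudio:

```shell
# Show PulseAudio's sinks (final destinations) and sources (producers)
pactl list short sinks
pactl list short sources

# Create a network sink that pipes local audio to a remote PulseAudio machine
pactl load-module module-tunnel-sink server=studio-pc.local
```

Once the tunnel module is loaded, the remote machine simply appears as another sink that any local application can be moved to.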

Unlike Jack, applications aren't directly responsible for adding and removing sources, and you get a finer degree of control over each stream. Using the PulseAudio mixer, for example, you can adjust the relative volume of every application passing through PulseAudio, regardless of whether that application features its own slider or not. This is a great way of curtailing noisy websites.


GStreamer

INPUTS: Phonon

OUTPUTS: ALSA, PulseAudio, Jack, ESD

With GStreamer, Linux audio starts to look even more confusing. This is because, like PulseAudio, GStreamer doesn't seem to add anything new to the mix. It's another multimedia framework and gained a reasonable following of developers in the years before PulseAudio, especially on the Gnome desktop. It's one of the few ways to install and use proprietary codecs easily on the Linux desktop. It's also the audio framework of choice for GTK developers, and you can even find a version handling audio on the Palm Pre.

GStreamer slots into the audio layers above PulseAudio (which it uses for sound output on most distributions), but below the application level. GStreamer is unique because it's not designed solely for audio - it supports several formats of streaming media, including video, through the use of plugins.

MP3 playback, for example, is normally added to your system through an additional codec download that attaches itself as a GStreamer plugin. The commercial Fluendo MP3 decoder, one of the only officially licensed codecs available for Linux, is supplied as a GStreamer plugin, as are its other proprietary codecs, including MPEG-2 and H.264.
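You can see the plugin architecture at work by building a pipeline by hand with the gst-launch tool (called gst-launch-1.0 on current systems, plain gst-launch on the older 0.10 series; the file path is a placeholder):

```shell
# Chain GStreamer plugins manually: read a file, let decodebin pick
# whichever installed decoder plugin handles it (Fluendo's, for instance),
# convert and resample the audio, then hand it to PulseAudio
gst-launch-1.0 filesrc location=/path/to/song.mp3 ! decodebin \
    ! audioconvert ! audioresample ! pulsesink
```

Swapping pulsesink for alsasink would bypass PulseAudio and talk to ALSA directly, which neatly illustrates how the layers stack.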


Jack

INPUTS: PulseAudio, GStreamer, ALSA


Despite the advantages of open configurations such as PulseAudio, they all pipe audio between applications with the assumption that it will proceed directly to the outputs. Jack is the middle layer - the audio equivalent of remote procedure calls in programming, enabling audio applications to be built from a variety of components.

The best example is a virtual recording studio, where one application is responsible for grabbing the audio data and another for processing the audio with effects, before finally sending the resulting stream through a mastering processor to be readied for release. A real recording studio might use a web of cables, sometimes known as jacks, to build these connections. Jack does the same in software.

Jack is an acronym for 'Jack Audio Connection Kit'. It's built to be low-latency, which means there's no undue processing performed on the audio that might impede its progress. But for Jack to be useful, an audio application has to be specifically designed to handle Jack connections. As a result, it's not a simple replacement for the likes of ALSA and PulseAudio, and needs to be run on top of another system that will generate the sound and provide the physical inputs.

With Jack, you can connect the audio output from applications to the audio input of others manually - just like in a real recording studio.

With most Jack-compatible applications, you're free to route the audio and inputs in whichever way you please. You could take the output from VLC, for example, and pipe it directly into Audacity to record the stream as it plays back.
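From the command line, that patching looks something like the sketch below. The client:port names are illustrative - every Jack application registers its own, and jack_lsp shows you what's actually available:

```shell
# List every Jack port currently registered by running applications
jack_lsp

# Patch VLC's output into Audacity's input for recording, and Audacity's
# monitor output into the soundcard's playback port
jack_connect "vlc:out_1" "audacity:in_1"
jack_connect "audacity:out_1" "system:playback_1"
```

Graphical patchbays such as QjackCtl do the same thing by drawing cables between ports, which is usually friendlier than typing port names.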

Or you could send it through JackRack, an application that enables you to build a tower of real-time effects, including pinging delays, cavernous reverb and voluptuous vocoding.

This versatility is fantastic for digital audio workstations. Ardour uses Jack for internal and external connections, for instance, and the Jamin mastering processor can only be used as part of a chain of Jack processes. It's the equivalent of having full control over how your studio is wired. Its implementation has been so successful on the Linux desktop that you can find Jack being put to similar use on OS X.



FFADO

OUTPUTS: Audio hardware

In the world of professional and semi-professional audio, many audio interfaces connect to their host machine using a FireWire port. This approach has many advantages. FireWire is fast and devices can be bus powered. Many laptop and desktop machines have FireWire ports without any further modification, and the standard is stable and mostly mature. You can also take FireWire devices on the road for remote recording with a laptop and plug them back into your desktop machine when you get back to the studio.

But unlike USB, where there's a standard for handling audio without additional drivers, FireWire audio interfaces need their own drivers. The complexities of the FireWire protocol mean these can't easily create an ALSA interface, so they need their own layer. Originally, this work fell to a project called FreeBOB. This took advantage of the fact that many FireWire audio devices were based on the same hardware.

FFADO is the successor to FreeBOB, and opens the driver platform to many other types of FireWire audio interface. Version 2 was released at the end of 2009, and includes support for many units from the likes of Alesis, Apogee, ART, CME, Echo, Edirol, Focusrite, M-Audio, Mackie, Phonic and Terratec. Which devices do and don't work is rather random, so you need to check before investing in one, but many of these manufacturers have helped driver development by providing devices for the developers to use and test.

Another neat feature in FFADO is that some of the DSP mixing features of the hardware have been integrated into the driver, complete with a graphical mixer for controlling the balance of the various inputs and outputs. This is different to the ALSA mixer, because it means audio streams can be controlled on the hardware with zero latency, which is exactly what you need if you're recording a live performance.

Unlike other audio layers, FFADO will only shuffle audio between Jack and your audio hardware. There's no back door to PulseAudio or GStreamer, unless you run those against Jack. This means you can't use FFADO as a general audio layer for music playback or movies unless you're prepared to mess around with installation and Jack. But it also means that the driver isn't overwhelmed by support for various different protocols, especially because most serious audio applications include Jack support by default. This makes it one of the best choices for a studio environment.
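In practice that means starting Jack against the FFADO backend rather than the ALSA one. A typical invocation looks like this - the sample rate and period size are common studio settings, not requirements:

```shell
# Run Jack on the FFADO FireWire driver at 48kHz with 256-frame periods
jackd -d firewire -r 48000 -p 256
```

Smaller period sizes lower the latency but raise the risk of dropouts, so the value is usually tuned per machine.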


Xine

INPUTS: Phonon


We're starting to get into the niche geology of Linux audio. Xine is a little like the chalk downs; it's what's left after many other audio layers have been washed away. Most users will recognise the name from the very capable DVD movie and media player that most distributions still bundle, despite its age, and this is the key to Xine's longevity.

When Xine was created, the developers split it into a back-end library to handle the media, and a front-end application for user interaction. It's the library that's persisted, thanks to its ability to play numerous containers, including AVI, Matroska and Ogg, and dozens of the formats they contain, such as AAC, Flac, MP3, Vorbis and WMA. It does this by harnessing the powers of many other libraries. As a result, Xine can act as a catch-all framework for developers who want to offer the best range of file compatibility without worrying about the legality of proprietary codecs and patents.

Xine can talk to ALSA and PulseAudio for its output, and there are still many applications that can talk to Xine directly. The most popular are the Gxine front-end and Totem, but Xine is also the default back-end for KDE's Phonon, so you can find it locked to everything from Amarok to Kaffeine.


Phonon

INPUTS: Qt and KDE applications

OUTPUTS: GStreamer, Xine

Phonon was designed to make life easier for developers and users by removing some of the system's increasing complexity. It started life as another level of audio abstraction for KDE 4 applications, but it was considered such a good idea that Qt developers made it their own, pulling it directly into the Qt framework that KDE itself is based on.

This had great advantages for developers of cross-platform applications. It made it possible to write a music player on Linux with Qt and simply recompile it for OS X and Windows without worrying about how the music would be played back, the capabilities of the sound hardware being used, or how the destination operating system would handle audio. This was all done automatically by Qt and Phonon, passing the audio to the CoreAudio API on OS X, for example, or DirectSound on Windows. On the Linux platform (and unlike the original KDE version of Phonon), Qt's Phonon passes the audio to GStreamer, mostly for its transparent codec support.

Phonon support is being quietly dropped from the Qt framework. There have been many criticisms of the system, the most common being that it's too simplistic and offers nothing new, although it's likely that KDE will hold on to the framework for the duration of the KDE 4 lifecycle.

The rest of the bunch

There are many other audio technologies, including ESD, SDL and PortAudio. ESD is the Enlightenment Sound Daemon, and for a long time it was the default sound server for the Gnome desktop. Eventually, Gnome was ported to use libcanberra (which itself talks to ALSA, GStreamer, OSS and PulseAudio) and ESD was dropped as a requirement in April 2009. Then there's aRts, the KDE equivalent of ESD, although it wasn't as widely supported and seemed to cause more problems than it solved. Most people have now moved to KDE 4, so it's no longer an issue.

SDL, on the other hand, is still thriving as the audio output component in the SDL library, which is used to create hundreds of cross-platform games. It supports plenty of features, and is both mature and stable.

PortAudio is another cross-platform audio library that adds SGI, Unix and Beos to the mix of possible destinations. The most notable application to use PortAudio is the Audacity audio editor, which may explain its sometimes unpredictable sound output and the quality of its Jack support.

And then there's OSS, the Open Sound System. It hasn't been a core Linux audio technology since version 2.4 of the kernel, but there's just no shaking it. This is partly because so many older applications are dependent on it and, unlike ALSA, it works on systems other than Linux. There's even a FreeBSD version. It was a good system for 1992, but ALSA is nearly always recommended as a replacement.

OSS defined how audio would work on Linux, and in particular, the way audio devices are accessed through the ioctl tree, as with /dev/dsp, for example. ALSA features an OSS compatibility layer to enable older applications to stick to OSS without abandoning the current ALSA standard.

The OSS project has experimented with open source and proprietary development, and is still being actively developed as a commercial endeavour by 4 Front Technologies. Build 2002 of OSS 4.2 was released in November 2009.

First published in Linux Format



Your comments

reply to some of the above

@Paul Davis:
Yes, I think the sound architecture on my Android and your high-end studio machine should be the same - just not the same amount of hardware involved, and therefore more or fewer slots and processing power.
If an audio channel is just what its name suggests - a stream of pressure samples that gets piped through different filters, combiners and I/O - I can see no reason why limited hardware that can feed only three inputs (say, an alarm, messages from the phone signalling system such as a second inbound call, and your voice) to the speaker should have different management from a machine with lots of instruments connected and a 64-core DSP board plus hardware FFTs in FPGAs.
It's all straightforward to describe in what was called reJackD above, especially if you can pipe filter/generator output to the control of other channels as well - even explaining the 're' in 'Recursive Jack Daemon'. Thinking of the possible effects this creates, it may even be a pretty cool synthesiser ;-)

I agree that it can easily be a user-space process like many other daemons - moving it to the kernel may just give it realtime priority automatically. Since mixing is so fundamental, it should be a service of reJackD and not a separate layer. Sample quality and frequencies were mentioned: to make the architecture useful for different hardware, the sampling strategy (interpolation/merging to a high/low intermediate) should be a parameter.

@BeOS Got it Right:
I programmed audio filters on DSP cards in the 90s that worked with microsecond latency,
and have no idea about BeOS. Please enlighten us:
* what's included in "everything", exhaustively
* how the copy from BeOS should be carried out
* what parts of the current system(s) can be reused

Kind of, only kind of, interesting to read some of the 6x5x4x3x2x1 combinations of possible reasons for my having two SILENT OSes for a month now. KIND of interesting. Grrrrr.


Why is that so complicated...
Thanks for the article!


ALSA was a pretty good idea and implementation, but it is now lacking a few features. Instead of improving on ALSA, somebody decided to make PulseAudio. Now, PulseAudio provides some great functions, but it does not need its own system. ALSA should be re-written with ideas from Jack and include the features that PulseAudio provided. Basically, we would make a Frankenstein audio system. Then, devs should work on making kick ass wrappers for different sound systems, so that applications that sent audio to OSS or Jack or PulseAudio could just send to ALSA via the wrapper. Sound would then have some latency, but anyone who was serious about sound would just send audio directly to ALSA.

TL;DR Make ALSA2 from Jack's core and then make a shit ton of wrappers.

P.S. Sorry for the French

Thank You!

Thank you for that great article, giving a clear overview of not-so-clear things, simply and briefly!

All I want to do is listen to my headphones...

Well, I stumbled on this article after poking around for a few days trying to figure out why I can't get decent sound from my Ubuntu-ized laptop.

I have a very simple objective - copy music from my CD's to my laptop, plug in my headphones, listen to music. When I was running Windoze it was super simple... use EAC to copy discs to laptop, run Foobar to play music, plug in headphones. With Foobar it was easy to add some enhancements, like headphone crossfeed to improve imaging on headphones. If I downloaded 24/96 or 24/192 tracks from HDTracks they sounded great!

Now, I've installed Ubuntu (12.10) with the Unity desktop. I've tried a *LOT* of different audio players, and they *ALL* sound like sheet. I couldn't figure out what the heck the problem was - same hardware, same chipset, same source, but what sounded really, really good on the system before Ubuntu (and by really, really good I mean the sound compared very closely to playing the same CD's on a high-end CD player and dedicated headphone amp) now sounded like it was playing over a cheap radio headset. No dynamic range, fuzzy bass, hard treble, no soundstage, just really bad. Thanks to this article I'm starting to grasp how complex the audio chain actually is under Ubuntu, but I don't have a clear idea as to how to fix this. I'm using Audacious right now (mostly because it got decent user reviews *AND* it includes the Bauer stereo to binaural crossfeed plugin, which is wonderful for headphones) and I have a choice of PulseAudio, ALSA, OSS4, SDL, FileWriter and Jack listed under "output", each of which has a bunch of configuration options. I have no idea where to start to try and get this thing to play music with the fidelity I *KNOW* it's capable of, having been listening to it using a different OS for, like, 3 years. Right now, I can't tell the difference between two versions of a song, one an uncompressed WAV file and one a highly compressed MP3 (128k bitrate) - they both sound equally poor.

I've been in IT for 30 years, and an audiophile for longer than that, but it's only recently the two paths have crossed. (Never had any use for music on computers until I started working from a home office and wanted to play tunes on my laptop.) I can't *BELIEVE* how hard it is to do something as simple as get decent music playing on my headphones. I don't even want to think about what's going to be involved if I try moving to some hi-rez downloads or try to hook up and external D/A to this thing.

Just a pointless rant from a Windoze hating open source guy who just wants to play some tunes...

Thanks a lot

It's really nice. Much needed for beginners like me.

No Pain No gain!

I'm starting to see the light, thank you for putting it all in perspective. I recently started running my recording studio on a total Linux audio platform. I have to say, it is not an easy transition, considering the business of finding the right sound cards and other variables, but at last I am beginning to feel the freedom that the Linux audio architecture has to offer. And let us not forget: true sampling rates and high-definition audio are the way to go beyond the dreadful-sounding MP3.

Thank you for the good writeup.


oldie but goldie

thx for the interesting insight.

