Note: This work thread is a part of the PECSS Project, advised by Dr. Rosa I. Arriaga, and funded by the National Science Foundation (Award Number: 1915504)

The PECSS App was envisioned to use all the sensors available in a regular smartphone to extract crucial insights about the patient while they work on their in-vivo or imaginal exposure exercises. One of the fundamental input streams we wanted the app to incorporate was audio, and that is exactly what I contributed in this work thread: I developed the “Digiscape Audio Module” for Android, which extracts important audio features from the patient’s smartphone in a completely non-intrusive, offline fashion.

Code Repository: https://github.com/DiptarkBose/DigiScape

On this page, I explain how I developed the audio module for Android.

Ambient Audio

In school, we were shown pictures of perfect sinusoidal waves whenever the topic of sound came up. But real-world audio is hardly that simple. The sound waves around us are a composition of multiple frequencies. Imagine you are sitting in a park. Children laughing produce high-frequency waves, whereas a lawnmower produces lower-frequency sound. Birds chirping around you sit towards the higher frequencies, while the sound of a football being kicked is a low-frequency 'thud'.

Thus, ambient audio is vastly complex and consists of a combination of multiple sinusoids. In reality, a sound wave around you would look something like this:

[Figure: a real-world audio waveform, a messy combination of many overlapping frequencies]
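To make that concrete, here is a small standalone sketch (not part of the app) that builds such a "messy" wave by summing a few sinusoids. The 16 kHz sample rate and the particular frequencies are arbitrary choices for illustration:

// Standalone sketch (not part of the app): build a "squiggly" wave by
// summing a few sinusoids at different frequencies, sampled at 16 kHz.
public class CompositeWaveDemo {
    public static void main(String[] args) {
        int sampleRate = 16000;                        // samples per second (assumed)
        double[] freqs = {220.0, 880.0, 2500.0};       // low, mid, and high frequency components (illustrative)
        short[] samples = new short[sampleRate];       // one second of audio

        for (int n = 0; n < samples.length; n++) {
            double t = (double) n / sampleRate;
            double value = 0;
            for (double f : freqs) {
                value += Math.sin(2 * Math.PI * f * t); // superpose each sinusoid
            }
            // Scale the sum into the 16-bit PCM range
            samples[n] = (short) (value / freqs.length * Short.MAX_VALUE);
        }

        // `samples` now looks like the messy waveform above, not a clean sine
        System.out.println("First few samples: " + samples[0] + ", " + samples[1] + ", " + samples[2]);
    }
}

Of course, the app doesn't synthesize audio; it has to capture this kind of wave from the microphone, which is what the next section covers.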

How to get the weird squiggly line?

In an Android environment, the mic captures the sound wave and the system hands it to us as an array of numerical values (PCM samples). The AudioRecord class and its methods help us achieve that. So essentially, audio, in our context, is just an array of numbers.

// Polls the AudioRecord object and stores audio discrete values in buffer
int nread = recorder.read(buffer, 0, buffer.length);

Something like [126, -57, 73, 56, 73, 452, ...]

That's it! That's all the sound information we have to work with!
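For reference, here is a minimal sketch of how the recorder behind that read() call could be set up. The 16 kHz, mono, 16-bit PCM configuration and the MicCapture class are illustrative assumptions, not necessarily Digiscape's exact settings:

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Sketch only: capture one chunk of raw PCM samples from the mic.
// Requires the RECORD_AUDIO permission to be granted at runtime.
public class MicCapture {
    private static final int SAMPLE_RATE = 16000; // assumed sample rate

    public short[] captureChunk() {
        // Ask Android for the smallest workable buffer for this configuration
        int minBuf = AudioRecord.getMinBufferSize(
                SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT);

        AudioRecord recorder = new AudioRecord(
                MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT,
                minBuf);

        short[] buffer = new short[minBuf];
        recorder.startRecording();

        // Polls the AudioRecord object and stores audio discrete values in buffer
        int nread = recorder.read(buffer, 0, buffer.length);

        recorder.stop();
        recorder.release();

        // Return only the samples that were actually read, e.g. [126, -57, 73, ...]
        return java.util.Arrays.copyOf(buffer, Math.max(nread, 0));
    }
}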

Deriving useful information from this array of numbers

At first glance, this array of numbers gives no indication about anything in our surroundings. But we need these numbers to paint a picture of those surroundings. Is there an engine idling near the mic? Is the television switched on? Is someone speaking?

The main challenge in developing the Digiscape audio module is understanding how this array of numerical values can help us paint a picture of the smartphone's environment.
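As a first, very rough example of what that can look like, here is a sketch of one of the simplest features you can derive from such a buffer: the RMS energy, which tells you roughly how loud the surroundings are. The class, method names, and threshold below are illustrative only, not Digiscape's actual feature set:

// Illustrative sketch: compute the RMS (root-mean-square) energy of a PCM buffer,
// a crude "how loud is it right now?" signal.
public final class SimpleFeatures {

    // buffer holds 16-bit PCM samples, like the array shown above;
    // nread is the number of valid samples returned by recorder.read()
    public static double rms(short[] buffer, int nread) {
        if (nread <= 0) return 0.0;
        double sumSquares = 0.0;
        for (int i = 0; i < nread; i++) {
            double sample = buffer[i] / (double) Short.MAX_VALUE; // normalize to [-1, 1]
            sumSquares += sample * sample;
        }
        return Math.sqrt(sumSquares / nread);
    }

    // Example use: a very rough "is something making noise nearby?" check
    public static boolean isNoisy(short[] buffer, int nread) {
        return rms(buffer, nread) > 0.1; // threshold chosen arbitrarily for illustration
    }
}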

Why can't we simply run one of the many ready-made Python sound-detection scripts on the user's audio?

There are plenty of sound-detection projects already implemented in Python, which simply take in a .wav file and churn out the most probable category the sound belongs to. So why aren't we using something of this sort? Because Digiscape's audio is highly sensitive, all computations, classifications, and processing need to be done on the device itself, totally offline. No network/API calls are allowed. Hence, sending the audio file to an external server where a Python script can do its magic is not really an option for us.