Audio classification using image classification techniques

Both image classification and audio classification were challenging tasks for machines until AI and neural networks came onto the scene.

Research on both problems started decades ago, and fruitful results began to appear once artificial intelligence and neural networks took off. Classification is always an easy task for a human. I hope you all agree. But I can't agree with that statement 100%. Why am I contradicting myself? I am not mad; there are simply cases where it breaks down. It's really hard to tell who is Jiswin and who is Jeswin (my friends), because they are identical twins and we can't judge correctly by sight alone. Leave the twins aside: suppose we order a veg burger and a chicken burger. If nobody tells us which is chicken and which is veg, will we be able to tell them apart correctly? So there are cases where human intelligence also fails. These are simple examples from daily life, and the takeaway is that human intelligence can fail sometimes. Keep this in mind.

Nowadays artificial intelligence can outperform humans on many tasks. A classification model trained on a lot of images of veg and chicken burgers will be able to tell a chicken burger from a veg burger precisely. If you don't agree, go and try it out; here is the link
Google's Inception model, which is trained on the ImageNet dataset and can classify 1000 classes of objects, has been open-sourced by Google. Anyone can use the trained model, retrain its last layer for new classes, or build their own classification model. They have a really nice tutorial to start with Inception
If you haven't gone through it, please click on the above link, do a quick reading, and come back. This is the simplest tutorial and a good model that I found on Google for image classification. We will be talking a lot about it in the coming paragraphs.

Inception of TensorFlow Inception v3

Inception is the only model I found that gives accurate predictions in little time and is easy to use, meaning it is well documented. It uses deep convolutional neural networks. Inception models have already shown better performance than humans in some visual tasks. I hope everyone has read about Inception and understood how we can retrain it.
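
To make "retrain the last layer" a bit more concrete, here is a minimal sketch in plain NumPy. It is not the retraining script from the TensorFlow tutorial. It assumes the frozen Inception layers have already turned every image into a fixed feature vector, and random vectors stand in for those features here. The function names, the feature size of 8, and the learning rate are all made up for illustration. Only the new softmax layer gets trained.

```python
import numpy as np


def train_last_layer(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax layer on fixed feature vectors with gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of the mean cross-entropy
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the most likely class index for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)


# Stand-in data (assumption): two classes whose "features" cluster around
# +1.5 and -1.5. Real features would come from the frozen network.
rng = np.random.default_rng(0)
centers = np.array([np.full(8, 1.5), np.full(8, -1.5)])
labels = rng.integers(0, 2, size=200)
features = centers[labels] + rng.normal(size=(200, 8))

weights, bias = train_last_layer(features, labels, num_classes=2)
accuracy = float(np.mean(predict(features, weights, bias) == labels))
print(f"training accuracy: {accuracy:.2f}")
```

With real data you would swap the stand-in arrays for the vectors the pretrained network produces. The frozen layers themselves never change, which is why retraining is so quick.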

Audio classification using TensorFlow Inception

Seeing this heading, you might be confused. How can we train a model with audio files for classification? Actually, that's not possible with Inception. Your understanding is 100% right so far: Inception can only be trained on images, and it can only do image classification. So how are we going to solve this problem?
What if we could convert audio to images? Is there any pictorial representation of audio? During my research I came across two pictorial representations of audio files: spectrograms and chromagrams. A spectrogram is a visual representation of the spectrum of frequencies of a sound or other signal as they vary with time or some other variable (Wikipedia). A chromagram is a chroma-time representation similar to a spectrogram: it shows how the energy in each of the 12 pitch classes changes over time. You can try generating both using pyAudioAnalysis in Python. Here is a link to how to do it
Below is a chromagram taken from the above link.
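
If you want to see what goes into these two pictures, here is a rough sketch in plain NumPy. It is not the pyAudioAnalysis API; the function names, the frame size, and the 440 Hz test tone are my own assumptions for illustration. The spectrogram is a short-time Fourier transform, and the chromagram folds the frequency bins into 12 pitch classes.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_size=2048, hop=512):
    """Return (magnitudes, freqs): one row per time frame, one column per frequency bin."""
    window = np.hanning(frame_size)
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.stack([signal[s:s + frame_size] * window for s in starts])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    return magnitudes, freqs


def chromagram(magnitudes, freqs, ref_freq=440.0):
    """Fold frequency bins into 12 pitch classes (index 0 is the pitch class of ref_freq)."""
    audible = freqs > 20.0  # skip the DC bin and very low bins, where log2 breaks down
    audible_mags = magnitudes[:, audible]
    semitones = np.round(12 * np.log2(freqs[audible] / ref_freq)).astype(int)
    pitch_class = semitones % 12
    chroma = np.zeros((magnitudes.shape[0], 12))
    for pc in range(12):
        chroma[:, pc] = audible_mags[:, pitch_class == pc].sum(axis=1)
    return chroma


def to_grayscale(matrix):
    """Log-scale a matrix and map it to 0..255 so it can be stored as an image."""
    log_m = np.log1p(matrix)
    scaled = 255 * (log_m - log_m.min()) / (log_m.max() - log_m.min() + 1e-12)
    # Transpose so time runs along x, and flip so low frequencies sit at the bottom.
    return scaled.astype(np.uint8).T[::-1]


# Test signal (assumption): one second of a pure 440 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

mags, freqs = spectrogram(tone, sample_rate)
chroma = chromagram(mags, freqs)
image = to_grayscale(mags)
print(image.shape, chroma.shape)
```

The `image` array is what you would save as a picture with whatever image library you like, one picture per audio clip.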

So, after generating the spectrograms or chromagrams, can we train an Inception model with those images? Yes, we can. For human eyes it might be difficult to find a pattern in a set of spectrogram images, but the way Inception sees an image is really different, and it gives astonishing results.
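
Here is a toy sketch of the whole idea, under heavy assumptions. Synthetic noisy tones stand in for real recordings, the two classes and their frequencies are invented, and a simple nearest-centroid classifier stands in for Inception. It only shows that different kinds of sound leave different patterns in their spectrograms, and that a program can pick those patterns up.

```python
import numpy as np


def spectrogram(signal, frame_size=512, hop=256):
    """Log-magnitude short-time Fourier transform, one row per time frame."""
    window = np.hanning(frame_size)
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.stack([signal[s:s + frame_size] * window for s in starts])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def make_clip(freq, rng, sample_rate=8000, seconds=0.5):
    """A noisy tone with a random phase, standing in for a real recording."""
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    phase = rng.uniform(0, 2 * np.pi)
    return np.sin(2 * np.pi * freq * t + phase) + 0.5 * rng.normal(size=t.size)


rng = np.random.default_rng(42)
class_freqs = {0: 200.0, 1: 2000.0}  # hypothetical classes: low hum vs. high whistle


def dataset(clips_per_class):
    """Build flattened spectrogram 'images' and their labels."""
    images, labels = [], []
    for label, base in class_freqs.items():
        for _ in range(clips_per_class):
            freq = base * rng.uniform(0.98, 1.02)  # small pitch variation per clip
            images.append(spectrogram(make_clip(freq, rng)).ravel())
            labels.append(label)
    return np.array(images), np.array(labels)


train_x, train_y = dataset(20)
test_x, test_y = dataset(10)

# Nearest-centroid classifier (a stand-in for Inception, not the real thing):
# each class is represented by the average of its training spectrograms.
centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
distances = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
predictions = distances.argmin(axis=1)
accuracy = float(np.mean(predictions == test_y))
print(f"test accuracy: {accuracy:.2f}")
```

Real-world sounds are far messier than two clean tones, which is exactly where a deep network like Inception earns its place over a simple classifier like this one.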
Try solving the urban sound classification problem using Inception and be amazed by the results.
Image source:  Internet

Happy coding.. ;-)


  1. How about if I use this to classify spoken language?


