Speech Controlled Music Player

Introduction

This project uses the Google Speech-to-Text API and the Dialogflow (formerly API.AI) API to decode a command spoken in natural speech and recorded on a node.js client (in this case a terminal interface), then sends the decoded command to an mbed, either through a server set up by an ESP8266 WiFi module (TCP socket) or through the USB serial port. The mbed, which is connected to a microSD breakout board (where the songs are stored), a uLCD, and a speaker, receives the decoded command, parses it, and executes it. A command either starts playing a song (named or unnamed in the command), plays the next/previous song, or stops the music.

/media/uploads/georgekamar/set_up.jpg

Implementation

Setup

Notes for storing songs on the microSD card:

All files on the SD card must be in WAV format with a sample rate of no more than 22,500 Hz, and must be organized as SD/Wave/Artist/Song.wav. Song names must be written so that the Speech API's output will match the song entry in Dialogflow. For example, the Speech API won't decode the song title “Watch’a Sayin’”; it will decode “What You’re Saying”. It is up to you to either name the song “What You’re Saying” in Dialogflow and on the SD card, or add “What You’re Saying” as a synonym of “Watch’a Sayin’” in Dialogflow so that it is detected as such. This is discussed in more detail later.
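For example, with hypothetical artist and song names, the card would be organized as follows:

SD/Wave/Avril Lavigne/Skater Boy.wav
SD/Wave/Avril Lavigne/Complicated.wav
SD/Wave/John Lennon/Imagine.wav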

Pinout

/media/uploads/georgekamar/pin_out.jpg

If the TRRS 3.5mm Jack Breakout board is used instead of a speaker, the pinout is as below.

/media/uploads/georgekamar/pin_out_trrs.jpg

In our example, the audio is sampled for a mono output, so the left and right outputs are connected together.

The node.js Client

Setting up the Google SDK

Before writing the client code, we must first download the Google Cloud SDK, set up an account and a project on the online console, and authenticate our application. Instructions for all of this can be found at https://cloud.google.com/speech/docs/getting-started.

Downloading the GitHub Project

In our example, we use the project found at https://github.com/googleapis/nodejs-speech to detect speech using the Google Speech API. Our music player modifies the streamingMicRecognize() function found in the recognize.js file of that project ( https://github.com/googleapis/nodejs-speech/blob/master/samples/recognize.js ). Instructions to set up the project can be found here: https://codelabs.developers.google.com/codelabs/cloud-speech-nodejs/index.html?index=..%2F..%2Findex#0. In part 2, step 4, just replace the GitHub link with the first one mentioned above.

Creating Dialogflow Agent

Sign up for an account at https://console.dialogflow.com/api-client/ and create a new agent. Our example uses V1 of the API and English as the language. Go to Export and Share, choose “Import from zip”, and upload this zip file: http://www.filehosting.org/file/details/714069/Music_Player.zip

This is our agent for this project. It includes four intents: Play Song, Play Next, Play Previous, and Stop, and uses two entities: our custom @Song entity, created by uploading a .csv file listing all the songs on the SD card, and the built-in @sys.music-artist entity (already included in Dialogflow). When Play Song is detected, the agent returns one of four responses to be parsed by the mbed. Depending on the entities it detects in the command (which song and/or which artist), it will either include them in the response or leave them out.

/media/uploads/georgekamar/text_response.jpg

Note: For song titles that contain non-standard English words (e.g. “Watch’a Sayin’”, “Sittin’ On Top Of the World”, “Sk8ter Boi”, etc.), synonyms can be added in the corresponding row so that, for example, “What You’re Saying” is detected as “Watch’a Sayin’” and “Skater Boy” as “Sk8ter Boi”. This is useful since we’re using the Speech API, which will only return proper English words in its response.

Modifying streamingMicRecognize()

Our project modifies streamingMicRecognize() in the recognize.js file of the GitHub project. It adds a call to the Dialogflow (API.AI) API and creates either a TCP socket or a SerialPort connection to send the Dialogflow response to the mbed. Below is the modified version of the function from recognize.js, along with some global variable declarations. _Note: This is node.js code._

streamingMicRecognize() Function

var net = require('net');
var SerialPort = require('serialport');
var apiai = require('apiai');
var aiapp = apiai("Dialogflow-client-access-token");  // change this string to your Dialogflow agent's client access token
var apiaiRequest;

function streamingMicRecognize (encoding, sampleRateHertz, languageCode) {
  // [START speech_streaming_mic_recognize]
  const record = require('node-record-lpcm16');

  // Imports the Google Cloud client library
  const Speech = require('@google-cloud/speech');

  // Instantiates a client
  const speech = Speech();

  // The encoding of the audio file, e.g. 'LINEAR16'
  // const encoding = 'LINEAR16';

  // The sample rate of the audio file in hertz, e.g. 16000
  // const sampleRateHertz = 16000;

  // The BCP-47 language code to use, e.g. 'en-US'
  // const languageCode = 'en-US';

  const request = {
    config: {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode
    }
  };

  // Create a recognize stream
  const recognizeStream = speech.createRecognizeStream(request)
    .on('error', console.error)
    .on('data', function(data) {
      console.log(data.results);

      // Send the transcribed text to the Dialogflow (API.AI) agent
      apiaiRequest = aiapp.textRequest(data.results, {
        sessionId: '123'
      });

      apiaiRequest.on('response', function(apiResponse) {
        var apiResult = apiResponse["result"];
        var apiFul = apiResult["fulfillment"];
        // JSON.stringify wraps the text in double quotes, which the
        // mbed uses as start/end delimiters for the command
        var speechText = JSON.stringify(apiFul["speech"]);

        /***** For a SerialPort connection, comment out this block ****/

        /***** Open TCP Socket ****/

        var client = new net.Socket();

        client.connect(80, '172.20.10.3', function() {  // IP address the server is set up on
          console.log('Connected');
          client.write(speechText);
          console.log('Speech Sent');
          client.destroy();
        });

        /***** For a SerialPort connection, uncomment this block ****/

        /***** Open SerialPort ****/
        /*
        var port = new SerialPort('/dev/tty.usbmodem14312', {  // replace with your port
          baudRate: 9600
        });

        port.on('open', function() {
          port.write(speechText);
          port.close();
        });
        */

      });

      apiaiRequest.on('error', function(error) {
        console.log(error);
      });

      apiaiRequest.end();

    });

  // Start recording and send the microphone input to the Speech API
  record.start({
    sampleRateHertz: sampleRateHertz,
    threshold: 0
  }).pipe(recognizeStream);

  console.log('Listening, press Ctrl+C to stop.');
  // [END speech_streaming_mic_recognize]
}

This code can be run by navigating to the file's directory in a terminal and running "node recognize.js listen".

The Server

The server is set up by sending commands from the mbed to the ESP8266 chip over a serial connection. The chip connects to an access point (we used a mobile hotspot, but a hosted network on a laptop or a local secure network will also work), sets up a server, and returns an IP address. The server then listens for input on port 80 and transfers the data it receives to the mbed over the serial connection. The characters transferred are surrounded by double quotes.
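As an illustration, here is a minimal sketch of the mbed-side setup, assuming the ESP8266 runs the standard AT-command firmware; the pins, SSID, password, and delays are placeholders to adapt to your hardware and network.

#include "mbed.h"

Serial esp(p28, p27);   // serial link to the ESP8266 (hypothetical tx, rx pins)

// Join the access point and start a TCP server on port 80.
void setupServer() {
    esp.baud(9600);
    esp.printf("AT+CWMODE=1\r\n");                          // station mode
    wait(1);
    esp.printf("AT+CWJAP=\"mySSID\",\"myPassword\"\r\n");   // join the access point
    wait(8);
    esp.printf("AT+CIFSR\r\n");                             // query the assigned IP address
    wait(1);
    esp.printf("AT+CIPMUX=1\r\n");                          // allow multiple connections
    wait(1);
    esp.printf("AT+CIPSERVER=1,80\r\n");                    // listen on port 80
    wait(1);
}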

Receiving & Parsing the Command

The mbed code is set up to wait for an interrupt from either the ESP8266 chip or the USB port. When interrupted, it executes a function that stores the received characters in a buffer. It skips the first character (a double quote) and waits for the closing double quote at the end of the received string before setting a flag so the main function can begin parsing the command. The string is then broken down into simple commands: play, stop, play next, play previous. In the case of play, two inputs may follow: song _ and/or artist _. Whatever follows “song” or “artist” is stored in a variable and used to find and play the chosen song or artist. If no input follows, a random song is played.
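As a rough sketch of this mechanism (buffer size and names are hypothetical), the receive interrupt could look like this:

#include "mbed.h"

Serial esp(p28, p27);            // serial link to the ESP8266, as above

char cmdBuffer[64];              // holds the characters between the double quotes
volatile int cmdIndex = 0;
volatile bool cmdReady = false;  // flag checked by main()

// Attached with esp.attach(&rxInterrupt, Serial::RxIrq);
// main() resets cmdIndex to 0 after parsing a completed command.
void rxInterrupt() {
    while (esp.readable()) {
        char c = esp.getc();
        if (c == '"') {
            if (cmdIndex > 0) {              // closing quote: command complete
                cmdBuffer[cmdIndex] = '\0';
                cmdReady = true;
            }                                // opening quote: skip it
        } else if (cmdIndex < 63) {
            cmdBuffer[cmdIndex++] = c;
        }
    }
}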

Stop sets the “playing” variable to 0 and closes the file. Play next increments the song directory index and plays the song at that index. Play previous decrements the index and plays the song at the new index.

Play on its own plays a random song, using a random number seeded from a scaled analogue reading of a voltage divider. A real-time clock is normally used to seed the generator, but the mbed does not include one, and without a varying seed the generator would produce the same sequence of numbers on every run.
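A minimal sketch of this trick, assuming the voltage divider sits on a hypothetical analog pin p20:

#include "mbed.h"

AnalogIn noise(p20);    // voltage divider used as an entropy source

// Seed rand() from analog noise so each run produces a different sequence.
int randomSongIndex(int numSongs) {
    srand((unsigned int)(noise.read() * 65535.0f));   // scale the 0.0-1.0 reading to a seed
    return rand() % numSongs;
}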

Play song ...... followed by the song name searches each artist directory for a song matching the input song name. The song is then played, and the current directory index is logged for potential play next / play previous commands.

Play artist ...... followed by the artist name opens the artist's directory and plays its first song. The directory index is logged as 0 for potential play next / play previous commands.

Play artist …… song ...... including both artist name and song name simply opens the file Wave/artist/song.wav. The song is then played, and the current directory index is logged for potential play next / play previous commands.
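For this last case, here is a minimal sketch of building and opening the file path, assuming the SD card is mounted at /sd and the artist and song strings have already been extracted from the command:

#include <cstdio>

FILE *openSong(const char *artist, const char *song) {
    char path[128];
    snprintf(path, sizeof(path), "/sd/Wave/%s/%s.wav", artist, song);
    return fopen(path, "r");    // NULL if the artist or song was not found
}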

uLCD

The uLCD displays the current artist and song, as directed by the song and artist variables. On a new command, the screen is cleared and the new data written. There is also code to display that the SD card is not present on start-up, or that a song/artist was not found.
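A minimal sketch of the display update, assuming the uLCD_4DGL library and hypothetical pins:

#include "uLCD_4DGL.h"

uLCD_4DGL uLCD(p9, p10, p11);   // tx, rx, reset (hypothetical pins)

// Clear the screen and show the current artist and song.
void displayNowPlaying(const char *artist, const char *song) {
    uLCD.cls();
    uLCD.locate(0, 0);
    uLCD.printf("Artist: %s\n", artist);
    uLCD.printf("Song: %s\n", song);
}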

Wave_player

The wave_player library plays the .wav files opened from the SD file system. The wave file is output as a PWM signal to the class D audio amplifier, which boosts the signal to the speaker or audio-jack breakout board.
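A minimal sketch of playback using the stock wave_player API, which drives the mbed's AnalogOut pin (a PWM-based variant would be constructed similarly):

#include "mbed.h"
#include "SDFileSystem.h"
#include "wave_player.h"

SDFileSystem sd(p5, p6, p7, p8, "sd");   // mosi, miso, sck, cs
AnalogOut DACout(p18);
wave_player waver(&DACout);

// Open a .wav file from the SD card and play it to completion.
void playFile(const char *path) {
    FILE *wavefile = fopen(path, "r");
    if (wavefile != NULL) {
        waver.play(wavefile);   // blocks until the file finishes playing
        fclose(wavefile);
    }
}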

Video Demo

Source Code

Import program: Speech_Controlled_Music_Player

A Speech Controlled Music Player that uses Google Speech API and Dialogflow API to understand natural speech.

