### Amazon AWS
Used for hosting the web server in Japan.

### Tornado
Used as the web server for serving the webapp and delivering data to the browser via websockets.

### Microsoft Cognitive Services/Azure
Used for hosting the Bing Speech API server instance.

### Hark SaaS
Used for analyzing the audio files.

### Speech Recognition
Used for transcribing the audio file via a wrapper around the Bing Speech API, as the Google Speech API is no longer available.

### d3.js
Used for creating real-time data visualizations in the browser.

### crossfilter.js
Used for n-dimensional filtering of multivariate datasets across D3 charts.

### c3.js
A wrapper around D3.js for building charts quickly.
```python
STAGING_AREA = '/tmp/'
STATIC_PATH = 'static'
HTML_TEMPLATE_PATH = 'templates'
LISTEN_PORT = 80
LANGUAGE = 'ja-JP'
```
- STAGING_AREA is the working space for processing audio files.
- STATIC_PATH is the location of the JavaScript and CSS files to be served.
- HTML_TEMPLATE_PATH is the location of the HTML files to be rendered by the Tornado web server.
- LISTEN_PORT is the port on which the HTTP server and websocket listen.
- LANGUAGE is the locale used by the Bing Speech API for speech recognition.
```python
settings = {
    'static_path': os.path.join(os.path.dirname(__file__), STATIC_PATH),
    'template_path': os.path.join(os.path.dirname(__file__), HTML_TEMPLATE_PATH)
}
```
These settings are passed to the Tornado application instance to map the static file and template paths.
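For illustration, here is how these paths resolve; a hypothetical install directory stands in for os.path.dirname(__file__), which in the real server makes the paths relative to the module rather than the current working directory:

```python
import os

STATIC_PATH = 'static'
HTML_TEMPLATE_PATH = 'templates'

# Hypothetical install location; the real server derives this from
# os.path.dirname(__file__) so paths follow the module, not the CWD.
app_dir = '/opt/harkvisualizer'

settings = {
    'static_path': os.path.join(app_dir, STATIC_PATH),
    'template_path': os.path.join(app_dir, HTML_TEMPLATE_PATH),
}

print(settings['static_path'])    # /opt/harkvisualizer/static
print(settings['template_path'])  # /opt/harkvisualizer/templates
```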
```python
default_hark_config = {
    'processType': 'batch',
    'params': {
        'numSounds': 2,
        'roomName': 'sample room',
        'micName': 'dome',
        'thresh': 21
    },
    'sources': [
        {'from': 0, 'to': 180},
        {'from': -180, 'to': 0},
    ]
}
```
This is the default configuration metadata that the Hark session is initialized with. It presumes the uploaded audio file is an 8-channel file with two unique speakers.
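As a sketch of how the two sources entries partition the microphone's field by direction of arrival, one range per speaker (source_for_azimuth is a hypothetical helper for illustration, not part of the app):

```python
default_hark_config = {
    'processType': 'batch',
    'params': {'numSounds': 2, 'roomName': 'sample room',
               'micName': 'dome', 'thresh': 21},
    'sources': [
        {'from': 0, 'to': 180},    # first speaker's direction range
        {'from': -180, 'to': 0},   # second speaker's direction range
    ]
}

def source_for_azimuth(config, azimuth):
    # Return the index of the source range containing an azimuth in degrees
    for i, src in enumerate(config['sources']):
        if src['from'] <= azimuth <= src['to']:
            return i
    return None

print(source_for_azimuth(default_hark_config, 90))   # 0 (first speaker)
print(source_for_azimuth(default_hark_config, -90))  # 1 (second speaker)
```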
The handler for HTTP requests sent to the server. GET requests are asynchronous and can be handled concurrently. The POST request handles the upload of the audio file in a subprocess via a coroutine, which is non-blocking. However, the websocket that writes Hark and speech recognition data back to the browser uses the same port, so HTTP requests may be queued when the port is very busy. Future work is to set up an Nginx server behind a load balancer to reduce contention for the port.
```python
def async_upload(file):
```

A function called asynchronously for non-blocking uploads.
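A minimal sketch of this pattern, using a thread pool to keep a blocking upload off the main loop; upload_to_hark is a hypothetical stand-in for the real Hark SaaS call, not the app's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def upload_to_hark(path):
    # Placeholder for the blocking Hark SaaS upload call
    return 'uploaded:' + path

def async_upload(path):
    # Schedule the blocking work on a worker thread so the
    # event loop thread stays free to serve other requests
    return executor.submit(upload_to_hark, path)

future = async_upload('/tmp/sample.flac')
print(future.result())  # blocks only when the result is actually needed
```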
- A simple wrapper around PyHarkSaas to inject logging and additional logic
- A wrapper around the Speech Recognition module to inject additional logic. Speech Recognition is a module that supports multiple recognition APIs; Hark Visualizer uses the Bing Speech API, backed by an Azure instance hosting the API in the West US region (the only region where it was available).
This is where the main websocket work is done. A websocket is initiated in JavaScript by the browser when the user navigates to visualization.html:

```javascript
// Connect to the remote websocket server
var connection = new WebSocket("ws://harkvisualizer.com/websocket");
```
This triggers the Tornado web server to send the analysis results from Hark to the browser via this socket:
```python
# Invoked when socket is opened by browser
def open(self):
    log.info('Web socket connection established')
    # Do not hold packets for bandwidth optimization
    self.set_nodelay(True)
    # Schedule the ioloop to wait before attempting to send data
    tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1),
                                                 self.send_data)

def send_data(self, utterances_memo=[]):
    if hark.client.getSessionID():
        results = hark.client.getResults()
        utterances = results['context']
        # If result contains more utterances than memo
        if len(utterances) > len(utterances_memo):
            # Must iterate since new utterances
            # could be anywhere in the result
            for utterance in utterances:
                utterance_id = utterance['srcID']
                # If utterance is new
                if utterance_id not in utterances_memo:
                    # Memoize the srcID
                    utterances_memo.append(utterance_id)
                    self.write_message(json.dumps(utterance))
                    log.info("Utterance %d written to socket", utterance_id)
        if hark.client.isFinished():
            # If we have all the utterances, transcribe, then close the socket
            if sum(results['scene']['numSounds'].values()) == len(utterances_memo):
                for srcID in range(len(utterances_memo)):
                    random_string = ''.join(choice(ascii_uppercase) for i in range(10))
                    file_name = '{0}{1}_part{2}.flac'.format(STAGING_AREA, random_string, srcID)
                    hark.get_audio(srcID, file_name)
                    transcription = speech.translate(file_name)
                    utterance = utterances[srcID]
                    seconds, milliseconds = divmod(utterance['startTimeMs'], 1000)
                    minutes, seconds = divmod(seconds, 60)
                    self.write_message(json.dumps(
                        '{0} at ({1}:{2}:{3}):'.format(utterance['guid'], minutes, seconds, milliseconds)))
                    self.write_message(json.dumps(transcription, ensure_ascii=False))
                del utterances_memo[:]
                self.close()
        else:
            tornado.ioloop.IOLoop.instance().add_timeout(timedelta(seconds=1), self.send_data)
```
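The divmod arithmetic near the end converts each utterance's millisecond start offset into the minutes:seconds:milliseconds label written to the socket. Isolated as a small helper for illustration (format_offset is hypothetical, not part of the app):

```python
def format_offset(start_time_ms):
    # Split a millisecond offset into minutes, seconds, and milliseconds,
    # matching the label format used in send_data above
    seconds, milliseconds = divmod(start_time_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return '({0}:{1}:{2})'.format(minutes, seconds, milliseconds)

print(format_offset(83450))  # 83.45 seconds in -> (1:23:450)
```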
