My Data: a breathlessness data collection and visualisation project

Project background

This is the extended documentation for Jeremy Lo Ying Ping’s UCL Computer Science summer research project, which ran from June to August 2021 and was supervised by Prof Joseph Connor, Prof Dean Mohamedally and Prof Graham Roberts.

The principal aim of the project was to develop a proof-of-concept progressive web app for patients to collect and visualise both speech data – using an offline, on-device speech recognition model – and additional self-reported health data, as may be useful for analysing and tracking symptoms of breathlessness. This data may then be shared – only at patients’ behest to ensure they maintain control over their data – with their associated senior responsible officer to support research.

The hope is that this could provide a more ethical framework for respiratory disease researchers or clinicians to gather invaluable speech and self-reported data from consenting patients.

Demo

The app is currently live at https://mydata.jezz.me/. There is an anonymous login mode whereby anyone can test out the core functionality of the progressive web app without having been referred by a senior responsible officer of the demo system.

Showcase

App

Home (visualisation overview) and results pages

Note: in the graph below, I was testing the app in quick succession, whereas real-world data is more likely to be spread out over time.

Creating a new submission

Settings

Login, registration, activation and account recovery

Research

Measures of utterance fluency

While tangential to the project, I have included this section out of my own interest and for the benefit of any interested readers.

Terminology

syllable: an organisational unit of a sequence of speech sounds, from which words are formed, e.g. the word ‘computer’ is formed from three syllables (com-pu-ter).

pause: a momentary silent gap in someone’s speech

duration: the total time for which someone speaks, including any pauses

phonation time or speaking time: the duration excluding any silent pauses

It stands to reason that patients who experience greater degrees of breathlessness may find speaking more difficult, so it would be interesting to collect some data to test this hypothesis.

There are several different aspects to fluency [1]:

For this project, I am most concerned with utterance fluency (although there is certainly some overlap between the categories).

Note: it is generally preferable to consider syllables rather than whole words when trying to evaluate how fast someone is speaking. As an example, the phrase “thoroughly thoroughly thoroughly” takes considerably more effort to say than the phrase “two two two”, but while the former is 9 syllables and the latter only 3, they both contain precisely 3 words.

In their investigation of fluency in long-term second-language speakers of English, Lahmann, Steinkrauss and Schmid used the speech analysis software Praat to calculate the following measures [2] of speed fluency (related to the speed at which one speaks):

Here, they defined a pause as lasting a minimum of 250ms for the purposes of calculating phonation time. This was also the definition used in Jong et al [3].

Speed is not the only characteristic of fluency; linguists also study breakdown fluency (related to pauses that interrupt the flow of speech) and repair fluency (related to hesitation and corrective reformulations speakers make as they speak) [1]. Examples of these measures include the following:

Furthermore, there are also composite measures that combine elements of all three characteristics of utterance fluency outlined above, e.g.

Side-note: in dialogues, measures such as the number of turns, pauses between turns and interruptions, as well as conversational overlap may be studied as well!

For my project, considering the limitations of the low-power mobile devices upon which my progressive web app is expected to run, my priority will be to measure speech rate (number of syllables divided by the duration); however, I would love to see articulation rate and a measure of breakdown fluency, such as mean pause rate as defined above, implemented in the future!
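
As a rough illustration of how these measures relate, here is a small JavaScript sketch (illustrative only, not project code) that derives speech rate, articulation rate and a simple pause-based measure from a syllable count, the total duration and a list of silent pause lengths, using the 250ms pause threshold mentioned above. Note that “mean pause rate” is given one common definition here; exact definitions vary in the literature.

```javascript
// Illustrative only – not part of the app's codebase.
// Computes utterance fluency measures from a syllable count, the total
// duration of the recording (in seconds) and a list of silent pause lengths
// (in seconds), using the 250ms minimum pause threshold described above.
function fluencyMeasures(syllableCount, durationSeconds, pauseLengthsSeconds) {
  const MIN_PAUSE = 0.25; // pauses shorter than 250ms are ignored
  const pauses = pauseLengthsSeconds.filter((p) => p >= MIN_PAUSE);
  const totalPauseTime = pauses.reduce((sum, p) => sum + p, 0);
  const phonationTime = durationSeconds - totalPauseTime; // speaking time only

  return {
    // speed fluency
    speechRate: syllableCount / durationSeconds,     // syllables per second overall
    articulationRate: syllableCount / phonationTime, // syllables per second of speaking time
    // breakdown fluency (one common definition: pauses per second of speech)
    meanPauseRate: pauses.length / durationSeconds,
    meanPauseDuration: pauses.length > 0 ? totalPauseTime / pauses.length : 0,
  };
}

// Example: 50 syllables over 10 seconds, with pauses of 0.4s and 0.6s
// (the 0.1s gap is below the 250ms threshold and so is ignored)
console.log(fluencyMeasures(50, 10, [0.4, 0.6, 0.1]));
```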

References

Comparison of on-device speech recognition models

As part of my project, I considered several different speech recognition libraries, among which were the following:

PocketSphinx.js

To summarise my thoughts: based on the performance of the pre-trained PocketSphinx.js model used in the demo, while it was lightweight and fairly decent at spotting single, specific keywords, it was not sufficiently accurate for my use case – even just for counting the digits zero to nine as in the demo.

There were scenarios where I would count from zero to nine and the numbers would appear in a pseudorandom-seeming order; however, this could also be a function of my (reasonably mild but pronounced) Lancashire accent!

Ideally, it would support a larger lexicon with greater accuracy so that I could garner a measure of word or syllable rate. Nevertheless, I did appreciate that it supports swapping out different finite state grammars in JavaScript at runtime, so it would be possible to write out my own words in their phonetic representations, which would then be recognised (so far as possible) by the model.
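
For illustration, the snippet below sketches roughly how swapping in a custom grammar looks, based on my recollection of the pocketsphinx.js recognizer-worker message API – the exact message shapes, file name and phonetic transcriptions are assumptions and may differ between versions.

```javascript
// Rough sketch only – message formats are based on my recollection of the
// pocketsphinx.js worker API and may not match the version you are using.
const recognizer = new Worker('recognizer.js'); // worker shipped with pocketsphinx.js

recognizer.postMessage({ command: 'initialize' });

// Register custom words with their phonetic representations...
recognizer.postMessage({
  command: 'addWords',
  data: [
    ['BREATHE', 'B R IY DH'],
    ['SLOWLY', 'S L OW L IY'],
  ],
});

// ...then swap in a small finite state grammar over those words at runtime.
recognizer.postMessage({
  command: 'addGrammar',
  data: {
    numStates: 3,
    start: 0,
    end: 2,
    transitions: [
      { from: 0, to: 1, word: 'BREATHE' },
      { from: 1, to: 2, word: 'SLOWLY' },
    ],
  },
});
```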

Vosk

Many of the collaborators who worked on PocketSphinx have gone on to work on a new, more modern speech recognition toolkit called Vosk. It certainly looks promising and I look forward to seeing where it goes!

While it does have early-days support for Python, Java, Node.js, C# and C++, among other platforms, it unfortunately does not currently run in-browser or offer a Flutter integration, whereas I would prefer a more cross-platform solution.

Vosk uses Kaldi internally, so I shifted my attention to seeing whether Kaldi may be usable directly in my project.

Mozilla DeepSpeech

As previously alluded to, I had a good experience using Mozilla DeepSpeech in IBM FISE v2 AskBob (repo) on low-power Windows and Linux desktop devices as part of a voice assistant project. I was therefore pleased to discover that there is a pre-existing Flutter binding for Mozilla DeepSpeech 0.9.3.

One downside, however, is the size of the models: while there is a TFLite model (46MiB) already reduced in size compared to the standard PBMM model (184MiB), a significantly larger external scorer (931MiB) – although optional – is needed to improve accuracy to a level I found useful in my tests.

Kaldi in WASM

Finally, I was very pleased to discover the work of a group of researchers at Inria to compile Kaldi to web assembly so that it may be used in-browser. They had successfully pruned the kaldi-generic-en-tdnn_250 English Kaldi speech model from the Zamia Speech project (gooofy/zamia-speech) using KenLM down from 108MiB to a more reasonable 66MiB, which brings it into the realm of possibly being used within a progressive web app.

In testing, I found that the accuracy was significantly better than that of PocketSphinx.js, not least for a much wider vocabulary. At first glance, it seemed comparable to the performance I had previously achieved with Mozilla DeepSpeech and its much larger model and external scorer, despite only having a model size of 66MiB.

As a result, I decided to explore building upon this use of Kaldi with web assembly for this project. Rather than building a native app, or using a cross-platform technology like Flutter, I would be able to make a progressive web app instead, which would allow everything to run in-browser on patients’ mobile devices, no installation required.

Progressive web apps certainly do have their disadvantages, particularly related to the maturity of web APIs (e.g. for scheduled notifications, audio and so on) and the difficulty of long-term storage; however, the ease of installation (i.e. none beyond adding the app to the homescreen) and much wider pool of devices that it could support (including being compatible with desktop devices as well) – when coupled with my increased familiarity with JavaScript – persuaded me in its favour.

Moreover, more intensive tasks could be offloaded to web workers (potentially further leveraging web assembly) to achieve performance closer to that of a native app, and creating a web app would leave open the possibility of using TensorFlow.js for training and inference, depending on the future direction of the project!

System architecture design

The project is formed of the following main components:

In the diagram below, “SRO” refers to a senior responsible officer, who receives submissions from their associated patients (the users of the app).

system architecture diagram

Metrics gathered

The focus of this project has been to devise a way to gather the following metrics:

Implementation

A few notable implementation aspects or features are summarised below.

App speech processing & syllable rate calculations

I have structured the majority of the speech recording and transcription code within a class called the SpeechService. It uses RecordRTC to record a 16kHz 16-bit signed integer PCM .WAV file, which is then converted into an audio buffer (in the main thread) and passed on to kaldi-wasm, which runs the web assembly code in a web worker.
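
The snippet below is a simplified sketch of that recording flow. The RecordRTC options shown are standard, but the worker file name and message format here are illustrative assumptions rather than the exact protocol spoken by the kaldi-wasm worker.

```javascript
import RecordRTC from 'recordrtc';

// Simplified sketch – the worker file name and message shape are illustrative.
const transcriptionWorker = new Worker('kaldi.worker.js');

async function recordAndTranscribe(durationMs) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Record mono 16kHz 16-bit PCM WAV, as expected by the speech model.
  const recorder = new RecordRTC(stream, {
    type: 'audio',
    mimeType: 'audio/wav',
    recorderType: RecordRTC.StereoAudioRecorder,
    desiredSampRate: 16000,
    numberOfAudioChannels: 1,
  });

  recorder.startRecording();
  await new Promise((resolve) => setTimeout(resolve, durationMs));
  await new Promise((resolve) => recorder.stopRecording(resolve));

  // Decode the WAV blob into an audio buffer on the main thread...
  const blob = recorder.getBlob();
  const arrayBuffer = await blob.arrayBuffer();
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // ...and hand the samples over to the web worker running the speech model.
  transcriptionWorker.postMessage({ samples: audioBuffer.getChannelData(0) });
}
```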

Results are calculated in calculateResults.js using the transcription of the audio and the duration of the speaking test. The syllable NPM module is used to estimate the number of syllables forming each spoken word.
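
A rough sketch of that calculation is shown below. It is illustrative rather than the real calculateResults.js, and note that the export shape of the syllable package differs between versions (recent versions expose a named syllable function, as assumed here).

```javascript
import { syllable } from 'syllable'; // named export in recent versions of the package

// Illustrative only – the real logic lives in calculateResults.js.
// Estimates per-word syllable counts from the transcription and derives rates.
function calculateResults(transcription, durationSeconds) {
  const words = transcription.trim().split(/\s+/).filter(Boolean);
  const syllableCount = words.reduce((total, word) => total + syllable(word), 0);

  return {
    wordCount: words.length,
    syllableCount,
    wordsPerMinute: (words.length / durationSeconds) * 60,
    syllablesPerMinute: (syllableCount / durationSeconds) * 60,
    duration: durationSeconds,
    transcription,
  };
}

console.log(calculateResults('hello this is a short test', 10));
```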

Submission storage

Within the app

Within the app, submissions are stored within a per-user IndexedDB database, wherein audio files are stored as binary large objects (BLOBs) alongside the rest of the submission metadata.
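
A minimal sketch of this approach using the raw IndexedDB API is shown below; the database and object store names are illustrative rather than the exact ones used in the app.

```javascript
// Minimal sketch using the raw IndexedDB API – database/store names are illustrative.
function openSubmissionsDatabase(userId) {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(`submissions-${userId}`, 1);
    request.onupgradeneeded = () => {
      // One object store holding each submission's metadata plus its audio blob.
      request.result.createObjectStore('submissions', { keyPath: 'id', autoIncrement: true });
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

async function saveSubmission(userId, audioBlob, metadata) {
  const db = await openSubmissionsDatabase(userId);
  return new Promise((resolve, reject) => {
    const tx = db.transaction('submissions', 'readwrite');
    // The audio is stored as a BLOB alongside the rest of the submission data.
    tx.objectStore('submissions').add({ ...metadata, audio: audioBlob, createdAt: new Date() });
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```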

On the server

Submission audio files are stored as binary large objects (BLOBs, specifically of the MEDIUMBLOB type) within the submissions table of the MariaDB database, in rows alongside data tied to each submission, such as the user’s ID, outward postcode and submission creation time; the audio column may be NULL should the patient have declined to attach the audio to their submission. Submission metadata, such as speech rate and duration, is stored in a separate table. All of this data is encrypted at rest.

Submission audio files may be listened to directly in the dashboard or included within a ZIP file export of the submission data received from patients associated with a given senior responsible officer.

App & dashboard data visualisations

Submission results are visualised using Victory, a charting and data visualisation library by Formidable. The specific visualisations in question are explained in further detail in the next section.

User accounts

App sessions

Users log in via session-based authentication, with a remember-me cookie set to act essentially as a refresh token. Given that one of the key features of the app is its ability to share data at patients’ behest with their associated senior responsible officers, it was judged that, for this project, session-based authentication would suffice rather than taking a fully offline-first approach.

Nevertheless, in a future iteration of this app, this would be possible through a modification to the authenticator.js file and potentially some additional logic in the service worker to handle requests that fail due to the device being offline. This is likely to be more feasible when the background sync API is more widely supported across mobile browsers.
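
Returning to the remember-me cookie mentioned above, the sketch below outlines how such a flow typically works. It assumes Express with cookie-parser, and the db.query helper and cookie layout are hypothetical; it does not mirror the app’s actual authenticator.js.

```javascript
const crypto = require('crypto');

// Hedged sketch of a typical remember-me flow – `db.query`, the cookie layout
// and the Express/cookie-parser setup are assumptions, not the app's own code.
const sha256 = (value) => crypto.createHash('sha256').update(value).digest('hex');

async function issueRememberMeToken(res, userId) {
  const token = crypto.randomBytes(32).toString('hex');
  // Only a hash of the token is persisted (cf. the remember_me table in the schema below).
  await db.query('INSERT INTO remember_me (user_id, token) VALUES (?, ?)', [userId, sha256(token)]);
  res.cookie('remember_me', `${userId}:${token}`, {
    httpOnly: true,
    secure: true,
    maxAge: 30 * 24 * 60 * 60 * 1000, // e.g. 30 days
  });
}

async function redeemRememberMeToken(req) {
  if (!req.cookies.remember_me) return null;
  const [userId, token] = req.cookies.remember_me.split(':');
  const rows = await db.query(
    'SELECT user_id FROM remember_me WHERE user_id = ? AND token = ?',
    [userId, sha256(token)]
  );
  // If the hashed token matches, a fresh session can be established for this user.
  return rows.length > 0 ? rows[0].user_id : null;
}
```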

Dashboard sessions

SROs log in using session-based authentication; however, no remember-me functionality is provided, so that access to potentially sensitive patient data is restricted to only those who definitely have the correct credentials.

Use of JSON web tokens

JSON web tokens (JWTs) are used for email confirmation and password reset tokens with a twist: a SHA256 hash of the user’s old email or old password Bcrypt hash is included within the secret so that any issued tokens expire as soon as the email is confirmed or the password reset is successful. They are also used as SRO invite tokens.
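
A minimal sketch of that single-use-token idea using the jsonwebtoken package is shown below; the exact payload and secret composition used in the project may differ.

```javascript
const crypto = require('crypto');
const jwt = require('jsonwebtoken');

// Minimal sketch – the exact payload and secret composition in the project may differ.
const sha256 = (value) => crypto.createHash('sha256').update(value).digest('hex');

// The secret mixes in a hash of the value being changed, so any issued token is
// automatically invalidated once that value changes (e.g. after a successful reset).
const passwordResetSecret = (user) => process.env.JWT_SECRET + sha256(user.passwordHash);

function issuePasswordResetToken(user) {
  return jwt.sign({ sub: user.id, purpose: 'password-reset' }, passwordResetSecret(user), {
    expiresIn: '1h',
  });
}

function verifyPasswordResetToken(user, token) {
  // Throws if the token has expired, was tampered with, or was issued before the
  // user's password (and hence the secret) last changed.
  return jwt.verify(token, passwordResetSecret(user));
}
```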

MariaDB

The full database schema written in SQL may be found here: schema.sql

Database tables containing potentially sensitive information are encrypted at rest with AES-256 using the MariaDB file key management encryption plugin.

Database schema

database schema

remember_me

| Column | Type | Notes |
| --- | --- | --- |
| id | int(11) unsigned Auto Increment | |
| user_id | int(10) unsigned | |
| token | varchar(255) | This is only a hash of the user’s remember-me token. |
| created_at | datetime [current_timestamp()] | |

sros (encrypted)

| Column | Type |
| --- | --- |
| id | int(11) unsigned Auto Increment |
| name | varchar(255) NULL |
| email | varchar(255) |
| password | varchar(255) NULL |
| status | tinyint(3) unsigned [0] |
| trusted | tinyint(3) unsigned [0] |
| created_at | datetime [current_timestamp()] |
| updated_at | datetime NULL |

submissions (encrypted)

| Column | Type |
| --- | --- |
| id | int(11) unsigned Auto Increment |
| user_id | int(10) unsigned |
| outward_postcode | varchar(10) |
| audio | mediumblob NULL |
| test_type_id | int(11) unsigned |
| created_at | datetime |
| received_at | datetime [current_timestamp()] |

submission_metadata (encrypted)

| Column | Type |
| --- | --- |
| id | int(11) unsigned Auto Increment |
| submission_id | int(11) unsigned |
| metadata_key | varchar(255) |
| metadata_value | text |

test_types

| Column | Type |
| --- | --- |
| id | int(11) unsigned Auto Increment |
| title | varchar(255) |
| instruction | text |
| possible_durations | varchar(255) |
| active | tinyint(3) unsigned [1] |

users (encrypted)

| Column | Type |
| --- | --- |
| id | int(10) unsigned Auto Increment |
| reference_id | varchar(32) |
| email | varchar(255) NULL |
| password | varchar(255) NULL |
| outward_postcode | varchar(10) NULL |
| sharing | tinyint(3) unsigned [1] |
| sro_id | int(11) unsigned |
| status | tinyint(4) [0] |
| created_at | datetime [current_timestamp()] |
| updated_at | datetime NULL |
| extra | varchar(255) NULL |

Redis

Redis is an incredibly fast in-memory data store used in this project to store session data for both the app and the dashboard to avoid cluttering the database. In the future, it could also handle remember-me tokens (with an EXPIRE set on keys so that expired tokens are cleared out automatically).
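
For illustration, the sketch below shows how such a token could be stored with an automatic expiry. It assumes node-redis v4 (where the EX option on SET is the equivalent of calling EXPIRE), and the key naming is illustrative.

```javascript
const { createClient } = require('redis');

// Illustrative only – assumes node-redis v4; the key naming is an assumption.
// The EX option sets an expiry (in seconds) so Redis removes the key once the
// remember-me token has expired, with no clean-up job required.
async function storeRememberMeToken(tokenHash, userId) {
  const client = createClient();
  await client.connect();
  await client.set(`remember_me:${tokenHash}`, String(userId), {
    EX: 30 * 24 * 60 * 60, // 30 days
  });
  await client.quit();
}
```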

Data

Export format

The dashboard may be used to export the data stored within the MariaDB database. It produces a ZIP file containing the following contents:

For example, the ZIP file export comprising a single submission with an ID of 1 would contain both a submission_1.wav audio file, as well as a submissions.json file with the following contents:

{
    "submission_id": 1,
    "reference_id": "000000000001",
    "outward_postcode": "SW1",
    "test_type_id": 1,
    "created_at": "2021-08-16T14:15:32.585Z",
    "received_at": "2021-08-16T14:19:07.766Z",
    "extra": "Additional information tied to this user supplied by their SRO.",

    "speech.syllableCount": "50",
    "speech.syllablesPerMinute": "300",
    "speech.wordCount": "31",
    "speech.wordsPerMinute": "186",
    "speech.duration": "10",
    "speech.transcription": "A record of what was said in the audio clip.",
    "sputum": "1",
    "wellbeing": "7",
    "dyspnoea": "1"
}

The first grouping of entries above contains submission and patient data common to all submission types, whereas the second grouping contains submission metadata. These entries may either be generated by the app (e.g. those whose keys start with speech.) or be optional self-reported scores input by the user.

This format was designed to be flexible enough so that new analysis measures or more self-reported parameters may be more easily added in the future.

Visualisation of results

Time series plots

Patients are shown a series of zoomable graphs for each of the metadata points gathered (e.g. syllable rate, sputum colour, MRC dyspnoea score and wellbeing), of which a selection is shown below. These graphs are also visible to SROs where patients have shared the submissions.

The graphs show a scatter plot of the metadata values, joined by a solid grey line plotted using the ‘bundle’ curve interpolation option in d3, which produces a straightened cubic basis spline.

Then, there is a solid green line showing the cumulative mean of the data (plotted using the cubic Catmull-Rom spline interpolation option available in d3). The idea of this line is to represent a ‘ground truth’ for each variable established as patients regularly submit data through the app over time.

Where appropriate, values representing one cumulative standard deviation above or below the cumulative mean (depending on the variable) are shown as an orange dashed line, plotted in a similar way to the mean. It is interesting to see how this line narrows in on the mean over time as more data is collected to better establish the ‘ground truth’.

Similarly, where appropriate, values representing three cumulative standard deviations above or below the cumulative mean are shown as a red line (with longer dashes).

This could potentially provide a foundation for future, more explicit nudging behaviour from the app.
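
As a sketch of how these series could be derived and drawn with Victory, the component below computes a cumulative mean and standard deviation and overlays them on the scatter plot. It assumes a React component and points of the form { x: Date, y: number }; it is illustrative rather than the app’s exact charting code.

```javascript
import React from 'react';
import { VictoryChart, VictoryLine, VictoryScatter } from 'victory';

// Derive cumulative mean and standard deviation series from the submissions,
// using Welford's online algorithm. Illustrative only – not the app's code.
function cumulativeStats(points) {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations
  return points.map(({ x, y }) => {
    count += 1;
    const delta = y - mean;
    mean += delta / count;
    m2 += delta * (y - mean);
    const sd = count > 1 ? Math.sqrt(m2 / (count - 1)) : 0;
    return { x, mean, sd };
  });
}

function MetadataTimeSeries({ points }) {
  const stats = cumulativeStats(points);
  return (
    <VictoryChart scale={{ x: 'time' }}>
      {/* raw values, joined by a straightened cubic basis spline */}
      <VictoryScatter data={points} />
      <VictoryLine data={points} interpolation="bundle" style={{ data: { stroke: 'grey' } }} />
      {/* cumulative mean – the emerging 'ground truth' */}
      <VictoryLine
        data={stats.map((s) => ({ x: s.x, y: s.mean }))}
        interpolation="catmullRom"
        style={{ data: { stroke: 'green' } }}
      />
      {/* one cumulative standard deviation above the mean */}
      <VictoryLine
        data={stats.map((s) => ({ x: s.x, y: s.mean + s.sd }))}
        interpolation="catmullRom"
        style={{ data: { stroke: 'orange', strokeDasharray: '4,4' } }}
      />
    </VictoryChart>
  );
}

export default MetadataTimeSeries;
```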

The following graphs are from my own local testing (explaining the haphazard data):

time series graphs

Syllable rate comparison graphs

While more in-depth data analysis would be better performed outside the dashboard using the JSON data exported from it, I thought it would nonetheless be useful to provide zoomable visualisations in the dashboard of the various metadata plotted against syllable rate, along with a basic linear regression displayed over the points, to more easily determine how strongly the variables correlate with each other.
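
The regression line itself is just an ordinary least-squares fit; a small sketch of the calculation involved is shown below (illustrative, not the dashboard’s exact code, and the example values are hypothetical).

```javascript
// Ordinary least-squares fit of y = slope * x + intercept, of the kind used to
// overlay a basic regression line on the comparison scatter plots. Illustrative only.
function linearRegression(points) {
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;

  let covariance = 0;
  let varianceX = 0;
  for (const { x, y } of points) {
    covariance += (x - meanX) * (y - meanY);
    varianceX += (x - meanX) ** 2;
  }

  const slope = covariance / varianceX;
  const intercept = meanY - slope * meanX;
  return { slope, intercept, predict: (x) => slope * x + intercept };
}

// Hypothetical example: syllables per minute plotted against MRC dyspnoea score.
const fit = linearRegression([
  { x: 1, y: 310 },
  { x: 2, y: 280 },
  { x: 3, y: 255 },
]);
console.log(fit.slope, fit.intercept);
```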

comparison graphs

Metadata distribution graphs

In a similar vein, I also produce zoomable histograms showing the distribution of the numeric metadata produced, along with their corresponding box plots to aid data analysis.

metadata distribution graphs

Ideas for the future direction of the project

I have thoroughly enjoyed working on this project and while my student research internship was limited to only ten weeks over the summer, I thought I would list a few ideas as to where this project could go, should anyone be interested in contributing to it.