The human voice has been studied for centuries, not just for the few years or decades that computer voice processing has been around. My work -- and the science of profiling -- is truly built upon centuries of observation, hard work, and brilliant, meticulous studies conducted by hundreds of scientists in the past. Some of them lost their lives in the quest to understand the mysteries of the human voice. One name that comes to mind is that of Klatt, who conducted x-ray studies on himself to understand the voice production mechanism better, and lost his life to throat cancer that may have resulted from over-exposure to x-rays. My work stands on the shoulders of giants. Truly.

My contribution
Myriad studies in the past, in multiple fields of research, have found positive correlations between isolated human parameters and different aspects of voice. Many of these aspects were not quantifiable. There have been very few attempts to actively predict different categories of human factors from voice, with the exception of emotion. Of those that were made, many used the content of speech for prediction.

Profiling differs from prior work in human-parameter deduction in two important respects -- a) it is based on quantifying previously unquantifiable entities and building discovery and prediction mechanisms on them, and b) it is not based on the content of speech, and does not analyze language or words. It is based on the voice signal and the sounds produced in the human vocal tract, through phonation, articulation or other means. My work draws upon different aspects of the fundamental bio-mechanical process of voice production. It is agnostic to language.

From a broader perspective, in what I seek to do now, the basic observations that voice is related to some human parameters are not mine. Their bold extensions are mine. A hypothesis that enables profiling for many, many more entities than were thought possible before, and a slew of ensuing methodologies that conform to it, are my contribution.

I was the first person in the world to articulate the possibility (now a fact) that the human face could be reconstructed from voice, and that in fact the entire human body form could be reconstructed from voice. I spoke about this on television in 2017, and have spoken about it to various people in scientific circles and the news media since early 2015. Before I did that, I had begun working -- with anthropologist Mark Shriver at Penn State University, and computational geneticist Peter Claes from Leuven University, using data collected by them -- to build the methodologies needed to enable such reconstructions. The initial work did not reach fruition because we lacked sufficient data and were still too early in developing the methodologies. Some foundational concepts needed to be developed further. The idea that voices could be reconstructed from facial structure (to some extent, as will always be the case) emerged from a conversation I had with people from J Walter Thompson Inc. and the Rijksmuseum of Holland in the summer of 2018.

A word about my colleagues Mark Shriver and Peter Claes: Mark has been making groundbreaking discoveries for a while now. His latest one is that the shapes of human noses are largely determined by climate. This has been extensively covered in the media and has appeared in Science. (Mark's latest work). Peter was the first person in the world to show that human faces could be reconstructed from DNA. His work was also covered extensively by the media worldwide. (Peter's work).

The bold extensions I speak of are tied to my hypothesis: "If any factor influences a speaker's body or mind, and if a biological pathway can be established to link that influence to the voice production mechanism of the speaker, then there must exist an influence on the voice signal produced, and it must be possible to measure that influence."

Building on this hypothesis, I devise mechanisms, based on artificial intelligence, machine learning, statistics, signal processing or other methodologies (not everything is data-driven), to discover these "micro-signatures" in voice. I then try to find ways to map them to their causal parameters.

The extensions are the concurrent deduction of bio-relevant parameters such as physical stature, height, weight, age, facial structure, body structure, mental and physical health conditions etc. Many (though not all) of these are new, and before I spoke of them, there was no mention of such deductions (or the possibility thereof) in the literature.

Profiling is rife with many, many challenges. Two very significant ones are disambiguation and profiling accurately under voice disguise. The challenge of disambiguation relates to the accurate identification of the influences of specific bio-parameters in the presence of the thousands of other influences that myriad factors exert on our voice on a daily basis. I probably don't have to explain what the second challenge -- profiling accurately in the presence of voice disguise -- means. The range of variability of the human voice has not been charted in its entirety, at least through quantifiable mathematical relations that machines can use. No one really knows the true depths of the human voice. That remains to be discovered.

My group at CMU crossed two milestones in the last year:


Book about this technology: This is a technical book, written primarily to educate and inform students who wish to do further research on this technology.

Profiling humans from their voice
432 pages.
Author: Rita Singh
Publisher: Springer-Nature
Release date: June 15, 2019

My group is now in the process of writing up our research. So far we have been too engaged in building demonstrable systems, which were needed to get us funded and keep us going. Publications will be listed here very soon... some exist but I have to collate them in the coming days.

I'll soon put up here some examples of 3D faces generated by our live system last year.

Here's something to consider:

The 21st century Mona Lisa, created by an AI system.

Here's a sampler of some 2D faces reconstructed from voice -- entirely. These are older results. Can you tell the reference from the reconstruction? Why do these have backgrounds? Why do they smile? Why are some in profile? Why do they have hair? How is hair even related to voice? All of these obvious questions have interesting answers that challenge our assumptions and expectations of AI... our paper answers all of them. Reconstruction goes far beyond this. Being able to reconstruct images in high definition is not an end in itself. The whys must be answered.

This is the work of my PhD student, Yandong Wen.