EB-5: Hume - Transcript - Part 2
π©βπ€ Ate-A-Pi: Yeah, I mean, it's pretty intense, right? Because you see politicians, right? There are politicians who try to win you over with logic, and there are politicians who just play to your emotions. And you see some who are very good on the emotional aspect, and you also see some who are very good on the logical aspect but who don't get elected.
π§βπ¬ Alan Cowen: Exactly.
π©βπ€ Ate-A-Pi: And then people get very annoyed when the ones who are good on the logical, policymaking aspect don't get elected, but the ones who are good at the emotional presentation, the ones who are able to empathize with the audience, do well. And there's a lot of confusion about...
you know, whether that's fair. I think there's a lot of questioning of whether that's fair. There's also the sense that technology enables the transmission of emotion, right, from the day John F. Kennedy goes on TV for that debate. He comes across as young and energetic, and that emotionally gives the populace a push to elect him. And...
And here we finally have computers that are able to detect those things, which they were not able to do before. So it's another layer of technology, much more sophisticated than the TV, but now you have another way of detecting and transmitting that emotion,
which you have with, I think, the voice assistant, right? Now, from the voice, you can tell those emotions and you can also transmit those emotions, which was definitely not possible before your launch a week or two ago, right?
π§βπ¬ Alan Cowen: Right, exactly. And some of that is... I think there is some tension between emotion and reason. Historically, philosophers have argued about that. There was one philosopher, David Hume, who said, actually, you can't possibly have a tension between reason and emotion, because at the end of the day, everything you're reasoning toward is some emotional state. So, you know, at the end of the day, you're saying,
π©βπ€ Ate-A-Pi: Yeah.
π§βπ¬ Alan Cowen: I'm doing this because this makes me money and money can be used on things that make me happy. It all ends up supporting some emotional state. But he also talked about how, you know,
there are flaws in the way that moral philosophers were thinking in his time, where moral philosophers were taking examples and thought experiments and responding to those thought experiments emotionally. And he realized that, at the end of the day, they were post hoc trying to justify the emotional reactions that they had. And that made him very controversial in the philosophy world. So he quit his job and kind of became a public speaker, but now he's probably the most
influential philosopher of his time. So there's this tension, but if you can align those things and be angry about the things that you should be angry about, that enables you to think in the way that is actually better for you. So anger has a kind of functional purpose: when you're angry about something, it means that you've been treated unfairly.
If that motivates the right thought process, one that allows you to identify how you've been treated unfairly and to remedy the situation, it is the best emotion to be feeling at that time. And there's no way to just take that thought process and have it happen in the absence of anger. Anger is really the driver of the thought process that enables you to dig
deeper into that experience. And so there is a tension between emotions and reason when the emotions are irrational, but when the emotions are rational, I like to think that that's actually the ideal situation for somebody, where they're kind of enlightened. You could say that's enlightenment: when your emotions are rational and they're guiding you in the right directions. So sometimes politicians can use that in a way that is
π§βπ¬ Alan Cowen: contrary to our well-being; sometimes they can use that in a way that's in support of our well-being. If it's contrary to our well-being, if they're manipulating our emotions in a way that makes us do something that's not good for us, I would call that manipulation. If it's in support of our well-being, if they're bringing up emotions that motivate behaviors that are helpful to us and allow us to support policies that are helpful to us, then I would say there's nothing wrong with that. That's actually the ideal.
π©βπ€ Ate-A-Pi: Mm-hmm. Mm-hmm.
π©βπ€ Ate-A-Pi: Pro-social manipulation. Pro-social encouragement, not manipulation. I'm showing you the light. I'm showing you the light, right? You know?
π§βπ¬ Alan Cowen: I wouldn't call it that, yeah.
π§βπ¬ Alan Cowen: Yeah, in the context of technology, I would call it personalization. If the algorithm is optimized for what we want it to be optimized for, then the emotions it evokes are the emotions that we want to evoke.
π©βπ€ Ate-A-Pi: Personalization.
π©βπ€ Ate-A-Pi: Indeed, indeed.
So I noted a couple of design decisions on the demo, for example. The demo uses a male voice. And it's interesting because a lot of tech companies have traditionally chosen female voices. Alexa, Siri, they've traditionally chosen at least the default voice to be a female voice. So what drove your kind of like, OK, we're going to use a male voice? That's like a design decision that someone had to make at the back.
π§βπ¬ Alan Cowen: So, you know, we can use any voice. We started with this male voice. It's actually our creative producer, Matt Forte, who does content for us and hosts our podcast. And he's just very multi-talented. He has a great voice. And we were looking at the kinds of voices we could use. And we decided that a podcast voice was a good place to start because...
In general, it's already somebody who is trying to stimulate positive conversation and not being reactive on their... you know, I think that at the end of the day, we optimize the algorithm for the listener, right, and not for the assistant to feel better, because the assistant doesn't have feelings. In a conversation, we pre-train on lots of conversations to determine how
the language and the text-to-speech sort of mix together. And in a conversation, people do selfish things. They get angry and supportive of their point of view. And sometimes it's something that's satisfying for them to do, but it's not good for the listener; it doesn't help them. There are various things people do, and podcast hosts are less likely to do that. And we like Matt Forte as a person, we think he has a good personality. So it was a good place to start, and then we continued to optimize it for the listener's responses.
π©βπ€ Ate-A-Pi: Indeed. Would you call yourself, would you say you have a foundation model for emotions at this point?
π§βπ¬ Alan Cowen: Yeah, I think it's a foundation model for what I would call cross -channel, yeah, like a cross -channel language expression foundation model. It's a language model, the first language model that understands expression.
π©βπ€ Ate-A-Pi: Okay, I'm gonna ask a bunch of like weird questions now. Have you ever like, you know, taken the face model and pointed it at a chimpanzee and seen, you know, whether you have some commonality of expressions there, like.
π§βπ¬ Alan Cowen: Heheheheh
π§βπ¬ Alan Cowen: So we have... I mean, it's a little bit messy right now, because internally, even though we don't tell it to, it just sort of understands human facial structure and its variation. And then when you point it at something that doesn't have a human facial structure, it can be a little messy. But we do have primatologists who have tried this, and to some extent they do see, okay, what we already sort of knew, which is that
animals laugh, like the open mouth smile exists in different animals, including chimps, but also monkeys and seals and dogs and a lot of other things. If they look like they're laughing, sometimes that's good. Chimps also do something that looks like a smile that's actually very negative. When they do that, it's like this, it looks like this, and it's a grimace and it's a fear expression.
π©βπ€ Ate-A-Pi: Grimace.
Yeah. Yeah.
π§βπ¬ Alan Cowen: But to the human eye, it can look like a smile. So in movies, sometimes when you see a chimp smiling, it's actually making a grimace expression. It's like they threatened it, and it's acting out a smile because of that. So it probably ruins some movies, but yeah. So there's similarities and differences. Yeah.
π©βπ€ Ate-A-Pi: Mm-hmm.
π©βπ€ Ate-A-Pi: Um
Let's talk about human outliers, right? So one of the things I noted as I used the webcam emotion detection was that I found myself making faces in order to hit the emotions there, right? Like, you make an angry face to hit the anger expression, or you make a smiling face. So I found myself kind of training myself over time to hit each of those emotional notes, to see whether I could hit them.
So are there other people who are very natural at hitting sharp and clear emotional notes on face and voice?
π§βπ¬ Alan Cowen: Yeah, some people are more expressive than others. Some people just have a more monotonous voice. I think I have a more monotonous voice, which is funny given that I study this stuff. Some people have more monotonous voices and some people are a lot more expressive. And...
π©βπ€ Ate-A-Pi: Autism?
π§βπ¬ Alan Cowen: Yeah, autism definitely contributes to that. I would say, well, you know.
In the studies, people with ASD tend to have more difficulty deciphering the meaning of expressions in other people. And it's not so much that they don't understand what expressions mean, but they just don't naturally pay as much attention. So if you force them to consciously think about it, then they do understand most of the time what's going on. But if they're not, they tend to be more fixated on functional
elements of their environment, which makes them very good at many things. It's a different kind of intelligence, but less focused on the social aspects of interactions.
π©βπ€ Ate-A-Pi: So it's not like a disability per se, but a disattention, kind of.
π§βπ¬ Alan Cowen: Yeah, it's almost motivational. And in fact, I would say a lot of aspects of our intelligence are more motivational than people think. Like, people who aren't good at math, a lot of times, just have no interest in it. And they never did, even as kids, which is why they didn't really learn math. They just have no interest in spending time with numbers. Yeah.
π©βπ€ Ate-A-Pi: Indeed. How did the... Because you've been in the field for a significant period of time, how did the introduction of transformers change things in the field? What happened? I think the OpenAI guys have this thing where, the moment transformers came out, Ilya said, this is it, and everything changed. They all switched immediately. How did that introduction of transformers change your field?
When did it change your field? What was the trajectory there?
π§βπ¬ Alan Cowen: Yeah, I mean, in affective computing, it took a while, I would say. And in psychology, let's just say they don't really use algorithms at all. But in affective computing, there was a fixation on classification tasks that didn't really require transformers, and on smaller datasets that you couldn't really train transformers on. And so, maybe...
π©βπ€ Ate-A-Pi: So it was more kind of like tabular, almost like tabular classification tasks. Like, yeah.
π§βπ¬ Alan Cowen: Yeah, well, not exactly. Vision and audio were in there, but they were being interpreted sort of in isolation. So there's kind of a separation, where you have these affective computing datasets, and then you have these other datasets. And if you're only going to train on this small affective computing dataset, you can't really train a transformer.
π©βπ€ Ate-A-Pi: Original audio.
π§βπ¬ Alan Cowen: So it took a while for that to take effect. In language, and in anything that had a lot of data already existing in the world, so captions for videos and transcriptions of audio, transformers started affecting those areas very rapidly because the data was already there. In affective computing, you only had really small-scale measures of expressive behavior, which is what we had to correct. So...
We first trained our models on relatively smaller datasets. Now we train models on huge datasets with millions of people involved and hundreds of thousands of hours. And on those we can train transformers from scratch. We can also repurpose transformers that have been trained for other tasks and use them for that.
And then we also train transformers for everything. Everything has some element of a transformer at this point, even if you add a diffusion model. So we use transformers in pretty much everything. But, you know, what makes transformers really amazing is how they scale, I would say. They can learn at scale. They can just keep learning at an almost predictable rate, whereas other models get saturated pretty fast.
π©βπ€ Ate-A-Pi: Right, right, right, right, right, right, at this point.
π©βπ€ Ate-A-Pi: Um, so.
You formed the Hume Initiative at one point. There's an initiative and there's a nonprofit. What does the Hume Initiative do? Why did you set it up?
π§βπ¬ Alan Cowen: The Hume Initiative is a nonprofit that I set up when I left Google to start both Hume and the Hume Initiative. And one of the things that was top of mind when I left Google was, how are people going to misuse this? It was sort of part of the discussion at the time, and I took it very seriously, I still do, that emotions are sensitive, that if you augment data with emotional data, you can unlock more things that are private, although I think this is true of language too.
I think language is also very sensitive. Sometimes people consider emotion data to be more sensitive. I actually consider language data to be more sensitive. Emotion augments it. But regardless, we put together a nonprofit to set up what were kind of the first guidelines for empathic AI. And I would say like the most concrete guidelines for AI generally.
and those are live now on the Hume Initiative's website. I helped draft them along with the committee, which was ethicists, cyber law experts, social scientists, a pretty broad range of talents there. And then the members of the committee, who were independent of Hume AI, the for-profit, voted on them. And we enforce those guidelines in our terms of use. The core
principle was this: this technology should be used to optimize for people's well-being, and it shouldn't be used to optimize for things that could be contrary to people's well-being. Emotion data can give you signs of, like, whether somebody is engaged, but you shouldn't use it to optimize for engagement, because that's where, I think, even without the specifics of what we're doing...
But generally speaking, if you optimize too much for engagement at the expense of understanding how it's affecting the person's well-being, you're gonna end up ruining their life if they spend too much time on an app, basically. Like, that can happen. So.
π§βπ¬ Alan Cowen: It was very top of mind for me to address this, especially as these generative language models were coming out and nobody knew exactly what they were capable of. Internally, I had been playing around with language models since, like, early 2020. And Google had, at that time, the best language models that existed in the world, and maybe OpenAI did internally too, but nobody knew about this stuff. And it wasn't really clear what they were capable of, except that you could talk to them and they talked back to you like humans in some sense,
and were creative, and it was just very different. Nobody really knew what the risk would be. And I felt that, at the time, if this technology was just optimized the way that social media news feeds are optimized, then it could be very dangerous, because the expressiveness of it is much higher. It can create anything. At some point, I felt it would create images and audio, and that's ended up happening, more slowly.
So that was my motivation in starting the Hume Initiative. And I do think the guidelines have held up pretty well, even though we finalized those guidelines back in 2022, before a lot of the generative AI fervor was brought to the surface. I think that they've held up pretty well. And the core principle, that we should be optimizing for measurements of people's well-being, which is enabled by understanding people's expressive behavior, I think is still the way forward.
π©βπ€ Ate-A-Pi: Yeah. So you mentioned generating video. At some point, are you going to be able to generate video, like a face talking to you, with kind of authentic emotion? Is that on the cards at some point?
π§βπ¬ Alan Cowen: Yeah, we can pretty much do that internally. I'm not sure, yeah. We have a lot of things on our plate. I think the majority of interfaces will move to audio first, and then video, once it really provides enough addition to what's going on in the audio.
And I think that there's some features we need to add to make it worthwhile, even though video can help in various ways already. We have a better sense of when you're done speaking if we have video of your face. And so it's a better conversational experience. But I don't think it's worth people turning their camera on if the application doesn't already require that. So we feel we need to add some more core features to make that worthwhile. And then we'll release it.
π©βπ€ Ate-A-Pi: Right on. That is amazing. I mean, my personal opinion is that the next few years are the gauntlet. I call it the gauntlet period, because I think you guys are very ethical, but I don't think some of the players, especially in China, are going to be that ethical. And so I think we're going to have to go through this period of what I call engrossment.
Everyone just getting completely engrossed in their little engagement-hacking things. And that's really the difficult period. If we don't get to AGI, we're just going to be stuck in this zombie era of engagement for a long period of time, which is the worst possible outcome, right? So, you know.
π§βπ¬ Alan Cowen: Yeah, totally. Well, I mean, kids are already spending six hours a day on TikTok, right? How many hours more? And this is using technology that's still improving so rapidly; it's nowhere near the ceiling. And I don't know how we could allow kids to spend 10 hours a day on TikTok or more. Some already are. But that starts to dig into
the ability to educate our children, right? Like you have to fight with this engagement hacking app. So I think that there's already pushback and there will be more against those strategies almost at the nation state level too.
π©βπ€ Ate-A-Pi: Yeah.
π©βπ€ Ate-A-Pi: Is anyone working on a tutor, like a tutor using the Hume technology?
π§βπ¬ Alan Cowen: Yeah, there are developers, and we haven't really rolled out access to our voice-to-voice API yet. We have an API for measuring expression, there are already developers using it, and an API for predicting custom kinds of outcomes. And there's a big interest in education, personal tutors, that kind of thing. I think one of the main things that a tutor does is they make sure you're paying attention. And so
if somebody was able to turn that on for themselves, then that's huge. And asking the right questions at the same time, paying attention, asking the right questions, live feedback, yes, detecting confusion and explaining things when you're confused, huge, right? And there's other developers working in health and wellness and customer service and a lot of different spaces where this stuff is important.
π©βπ€ Ate-A-Pi: Detecting confusion, right? Detecting confusion.
π©βπ€ Ate-A-Pi: Yeah. Yeah.
π§βπ¬ Alan Cowen: But I think that your personal tutor in the future is gonna be a personal AI that is using an app. Hopefully it's powered by Hume, but it's still your personal AI and you know that it's optimized for you and understands you better, understands what kinds of explanations work for you.
It understands your background competencies and knowledge and how you like to learn and all of those things. And it's in your control. And if that is optimized for your well-being, I think the future will be very bright. If it's optimized for you to buy what somebody is selling you, or to spend all your time in an app, then that could be dark, right? So.
π©βπ€ Ate-A-Pi: Yeah.
π©βπ€ Ate-A-Pi: What have been the challenges moving from research to product? Because commercializing research is one of the big challenges of our era because there's a lot of research out there. How do you make it into a product? What were the big challenges that you faced?
π§βπ¬ Alan Cowen: So, before we had good generative models, and it was actually relatively recently that we realized we had everything we needed to train really good generative models, we were providing this measurement API, and we had lots and lots of apps using it. But I think that there is a certain amount of...
well, it's a little bit technical to understand how to use that data. We were just giving people tons and tons of data. For every single word that was spoken, we have tons of outputs from our prosody model and facial expression model and so forth. And then we left it up to the developers to decide what to do with that data. And some developers were sophisticated enough that they were already doing really amazing things, but many weren't. So I think with our newest product, EVI, the Empathic Voice Interface,
it's much easier to get started. With just a few lines of code, you can build it into any product, and you have a voice interface and it can use tools. So I'm really excited about how fast people can build things on that. We already built a widget on our website, very fast to build, that can guide you to different pages and answer your questions. It can say, this is where the information is, but I'll give you a quick summary that speaks to exactly the question that you asked me. And then you can
scroll the page to find more information. I think that can be built so quickly into any app that the possibilities are endless. And then it can start to do any function that the app is capable of doing: any API that it calls, you can specify as something the model can do. So you can navigate to different pages yourself, or you can say, hey, model, do this, buy this thing,
or add this to my calendar. It'll be able to do any number of things with the context that it has. What surprised me too is that even without those tool-use applications, people really love talking to it. So people are talking to it for half an hour sometimes.
π§βπ¬ Alan Cowen: A significant proportion, like not quite 10% but close. So that tells me that we've...
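To make the "few lines of code" integration concrete, here is a minimal illustrative sketch of a tool-using voice session in Python. The endpoint, message fields, and tool names are assumptions made up for this example, not Hume's actual EVI API; the real SDK and docs define the actual interface.

```python
# Illustrative sketch only: the endpoint, message schema, and tool names below are
# assumptions for this example, not Hume's actual EVI API.
import asyncio
import json

import websockets  # pip install websockets

EVI_URL = "wss://example.invalid/evi"  # hypothetical placeholder endpoint


def add_to_calendar(title: str, when: str) -> str:
    """A developer-defined tool the voice model may ask to call (hypothetical)."""
    return f"Added '{title}' to the calendar at {when}"


TOOLS = {"add_to_calendar": add_to_calendar}


async def run_session() -> None:
    async with websockets.connect(EVI_URL) as ws:
        # Configure the session: a system prompt plus the tools we expose.
        await ws.send(json.dumps({
            "type": "session_settings",
            "system_prompt": "You are a helpful guide for this website.",
            "tools": list(TOOLS),
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "tool_call":
                # The model asked to use one of our tools; run it and send the result back.
                result = TOOLS[msg["name"]](**msg.get("arguments", {}))
                await ws.send(json.dumps({
                    "type": "tool_result",
                    "call_id": msg.get("call_id"),
                    "content": result,
                }))
            elif msg.get("type") == "assistant_message":
                # Text of the reply that is also being spoken back to the user.
                print("assistant:", msg.get("text", ""))


if __name__ == "__main__":
    asyncio.run(run_session())
```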
π©βπ€ Ate-A-Pi: Yeah, my biggest complaint was actually that the demo was too soft. So in a noisy room, and often you'd be out someplace and say, hey guys, I want to show you something, and you'd have like six people crowd over and there was just no way to hear it. So that was probably the biggest complaint. But besides that, yeah. Yeah.
π§βπ¬ Alan Cowen: That's another great one. With traditional text-based apps, you can use ChatGPT Voice, but it's not as ideal: you get a response, and then you have to relay it to your friends. But if you're in a room with people and you just want to answer a question, it's actually best for it to play out loud. Right now, the mobile version of our demo is not great. But we'll hopefully have a mobile app soon, and
it will be specialized to have the right volume and everything.
π©βπ€ Ate-A-Pi: Yeah. I mean, in the Zoom era we have this thing about face-to-face versus Zoom, right? And a lot of people complain about bandwidth, the amount of information that you lose when you're on a Zoom call versus face-to-face. There's this perception that face-to-face, you have so much more information.
Right. And even more so when it's just a ChatGPT voice, which has much less content, right? So how much more information is there? Emotions in the voice and in the facial expressions are one part. How much more information are we losing when we move from face-to-face to a Zoom call? Are there other things, like inhalation,
air movement, stuff the subconscious kind of detects and that gives us a feel, that we don't get on that Zoom call? How much more information are we losing?
π§βπ¬ Alan Cowen: I think that there's a huge element of being in the same environment and knowing what somebody is looking at that is really important. So when people are normally talking with each other, you kind of look away while you're talking and you look at each other and you make eye contact and then you kind of look away again.
Whereas with the zoom, you can't make eye contact, really. I mean, there's ways to correct for that. But you can't really look away from your computer because if you do, you could be looking at something else. Like the person doesn't know what you're looking at. So it's not a shared environment where they can say, OK, they're just looking off into space. We're looking at the same thing right now. It looks like they're distracted. So that's not a good sign.
And so that makes the whole conversation more taxing because you have to focus on one point visually for the whole conversation. And the latency is really important. If you're with somebody, there's backchanneling that happens that says this person has something to say.
and you kind of go, um, uh-huh, yeah, and then you kind of want to enter the conversation, and the other person is signaling in subtle ways that they're finishing up what they're saying, or that maybe the rest of what they're going to say is unimportant. And so they're willing to let you interject, and all of that is happening at a sub-200-millisecond latency. So it's pretty disruptive when you have even 50 milliseconds back and forth of network latency,
because that cuts into that 200 milliseconds where all this stuff happens by 25%. And it's actually typically worse than that. So there's that. There's a lot that's happening. And antiphonal laughter, laughing together, is very important. So if you're in person, it's much more satisfying to laugh together, because there's something that happens where you...
π§βπ¬ Alan Cowen: It's not necessarily the same exact timing, but there's some timing to it, some rhythm to it. People have studied this pretty deeply: if you meet somebody for the first time and you have this, then you're much more likely to be friends with them later. It is that important, yeah. It predicts things like divorce rates and stuff.
π©βπ€ Ate-A-Pi: Wow.
π§βπ¬ Alan Cowen: So, and that's like old research. I don't know if that replicates that well, but I think there is something to it. But.
π©βπ€ Ate-A-Pi: Ha ha ha ha ha ha ha
π©βπ€ Ate-A-Pi: That's the other thing, like the replication crisis is something that you would have to face up to every day. Every time you try something, you'll be like, all right, it's basically like every single piece of old research, we have to like reinvestigate because they didn't have the data, they didn't have the tools, we're gonna have to remeasure everything.
π§βπ¬ Alan Cowen: Yeah.
π§βπ¬ Alan Cowen: Yeah, exactly. And I think this time around, we can let the AI learn it first and then go and figure out what the AI already knows, rather than doing as much hypothesis-driven research. People might yell at me for that, but I think it's gonna happen. So.
π©βπ€ Ate-A-Pi: Do you get the sense that you're cleaning up the mistakes of the 20th century?
π§βπ¬ Alan Cowen: Well, we're operating a different level of analysis. There's a lot of questions that are still unanswered where we're just saying, okay, these are the inputs. Psychology hadn't been studying the right inputs because they were studying such a narrow range of emotional behaviors, but these are the inputs. Let's not ascribe any meaning to them yet. Just put them in the model and the model will learn what they predict and how to predict them.
And the model starts to learn things like when people are done speaking, and what laughs mean, and what makes people confused, and all of that stuff. And you know the model learns it because it can predict these expressions. But we don't operate at the level of analysis of trying to explain what the model has learned yet. And I think that task is going to be an interesting one for psychologists, if they can pick up this technology and run with it.
π©βπ€ Ate-A-Pi: So when you look going forward, how do you improve the model over time? Is it getting user feedback? Is it a march of the nines, nailing down specific areas where the model fails, analyzing those, and collecting data specifically in those areas? Or is it expanding the number of languages? I've heard you say before
that you need a different model for every language. I think I've heard you say that before. And so what is the process forward? How do you improve going forward?
π§βπ¬ Alan Cowen: I wouldn't say you need a different model for every language, but you need some extra data, basically. We can use multilingual data and train a model that understands multiple languages. It also takes in multiple languages, and because of the way we're doing it, which is data-driven, it can understand the interplay between the language and the prosody and the facial expression and all that. So, yes, to answer your question: all of the above.
Learning from feedback: because we can learn from expressions, we have a way of learning from feedback at scale. Learning from more data that we can license or get our hands on, across different countries, across different kinds of people, different kinds of applications. But we actually want to empower application developers to learn from their data. So instead of everything being centralized, where we aggregate the data
and we use it all to train one model, application developers can say, okay, in my dataset, this is what people's expressions are when they react to the model; let's fine-tune the model for my application, which could be customer service, could be health and wellness. And that's, I think, gonna be really powerful: the continuous learning at inference time to deploy models for users in applications.
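As a rough sketch of the idea of learning from expression-based feedback per application, the loop below treats per-response expression scores as a reward and prefers the response style that scores best. The expression labels, reward formula, and style names are assumptions for illustration; a real deployment would score reactions with an expression-measurement model and update a served model, not an in-memory selector.

```python
# Toy sketch: choose a response "style" per application and update it from
# expression-based feedback. All labels and numbers here are made up for
# illustration; they are not Hume outputs.
import random
from collections import defaultdict


class StyleSelector:
    """Epsilon-greedy choice over response styles, scored by expression feedback."""

    def __init__(self, styles, epsilon=0.1):
        self.styles = list(styles)
        self.epsilon = epsilon                 # exploration rate
        self.reward_sum = defaultdict(float)   # total reward per style
        self.count = defaultdict(int)          # times each style was used

    def choose(self) -> str:
        if not self.count or random.random() < self.epsilon:
            return random.choice(self.styles)  # explore
        return max(self.styles,
                   key=lambda s: self.reward_sum[s] / max(self.count[s], 1))

    def update(self, style: str, expressions: dict) -> None:
        # Collapse expression measurements into a scalar reward:
        # reward apparent satisfaction, penalize apparent confusion (assumed labels).
        reward = expressions.get("satisfaction", 0.0) - expressions.get("confusion", 0.0)
        self.reward_sum[style] += reward
        self.count[style] += 1


# Example: a customer-service app adapting to its own users' simulated reactions.
selector = StyleSelector(["concise", "step_by_step", "empathetic"])
for _ in range(200):
    style = selector.choose()
    fake_reaction = {"satisfaction": random.random(),
                     "confusion": random.random() * 0.5}
    selector.update(style, fake_reaction)
print("currently preferred style:", selector.choose())
```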
π©βπ€ Ate-A-Pi: Amazing. So over time, for specific applications, specific developers are fine-tuning those models and getting better at serving their customers emotively, getting emotionally closer to being able to respond to their customers over time.
π§βπ¬ Alan Cowen: Exactly: knowing what their customers want and how to bring that about, understanding the preferences specific to that application and specific to its users. I think that, long term, that's what's going to differentiate applications, and we enable that.
π©βπ€ Ate-A-Pi: How do you separate yourself out from the likes of Claude or Bing or OpenAI? I feel like we have various levels of intelligence or analysis. You have the web search component, in Bing or a browser-based thing. Then you have the large language model with a big dataset, which is also able to access web search. And then you have your model, which is kind of like
deciding, it's almost like a task allocator which decides when to use a Claude or OpenAI on the back end. Would you say that that's going on? So you have these two or three layers there. How do those decisions get made? Is it a prompt, like, when this happens, use Claude, or something like that?
π§βπ¬ Alan Cowen: You can prompt it. So we put a lot of the decisions in the hands of developers, but we enable them. We say, you can prompt our model on when it can do things, similar to how you would prompt OpenAI, but our model has more context. It understands whether the user is confused and what word they're emphasizing, so what they're confused about, and it can kind of take from the response of a web search what the relevant piece of context is to report back on. But we don't see ourselves as competitors to Claude or OpenAI. In fact, we use Claude in our
demo, and we use OpenAI on our widget. And the developer can bring a lot of capabilities to our API by prompting those large language models. In fact, if they already have a prompt for the large language models, they can just hook that up into our API, and we provide the conversationality. Our model is really good at talking and understanding users' preferences and shaping the response that way. But it can utilize the capabilities of larger models, and tools that the larger models use, that we can call. It's...
π©βπ€ Ate-A-Pi: Mm-hmm.
π§βπ¬ Alan Cowen: To some extent, some of it's provided out of the box. But the more powerful thing is that applications can customize how these different tools are used for their own needs. And that's ultimately, I think, going to be the future. And a lot of it will happen in the application backends. I think that, like, OpenAI will
provide the frontier reasoning model, right? But I think people will want their own data to be on their server. And there will be a service that is able to utilize that data really effectively and send it to the reasoning model, or communicate with the reasoning model to tell it
what it wants and be able to retrieve that data, something more sophisticated than RAG, like recursive search. And Hume won't be involved with any of that. The developers can build that into Hume, and Hume will take on the interface part: understanding the user's voice, getting the information, and delivering it back to the user. So Hume is kind of operating distinctly at that interface level.
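A rough sketch of the pattern described above, in which a developer keeps their existing LLM prompt and plugs it in as a supplemental reasoning model behind the conversational interface. The config shape, keys, and model names are assumptions for illustration, not a documented Hume schema.

```python
# Illustrative config sketch: how an application might route its existing prompt
# and tools through an interface layer. All field names here are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SupplementalLLM:
    provider: str       # e.g. "anthropic" or "openai", the developer's choice
    model: str          # placeholder model identifier
    system_prompt: str  # the prompt the developer already uses today


@dataclass
class VoiceSessionConfig:
    greeting: str
    supplemental_llm: SupplementalLLM
    tools: List[str] = field(default_factory=list)  # hypothetical tool names


config = VoiceSessionConfig(
    greeting="Hi! Ask me anything about the docs.",
    supplemental_llm=SupplementalLLM(
        provider="anthropic",
        model="claude-placeholder",
        system_prompt="You are our existing support assistant. Answer from the docs.",
    ),
    tools=["search_docs", "open_page", "add_to_calendar"],
)

if __name__ == "__main__":
    # The interface layer would handle voice, turn-taking, and expression, and
    # forward reasoning-heavy requests to the supplemental model described here.
    print(config)
```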
π©βπ€ Ate-A-Pi: Indeed. Segue: have you had GPU troubles? Have you had to hunt around for GPUs? Has that been a thing?
π§βπ¬ Alan Cowen: Um, we did for a while. We're good now. We actually use the Andromeda cluster, which Daniel Gross and Nat Friedman provide through AI Grant, and that's been good. We also have other ways of getting GPUs now. I think it's gotten a little bit easier than it was last year. Last year there was a real crisis. You couldn't even get A100s. It was just crazy. Um,
π©βπ€ Ate-A-Pi: Uh-huh.
π©βπ€ Ate-A-Pi: Ha ha ha.
π§βπ¬ Alan Cowen: And now there's more options and I think Nvidia is going to be providing even more options. So I'm not as worried about that going forward.
π©βπ€ Ate-A-Pi: Awesome, awesome. Looking forward, five years in the future, where do you see Hume? What is your hope? Where do you want to get to?
π§βπ¬ Alan Cowen: In five years, I mean, we want to be the interface that people use to interact with AI. So if you want something from an AI model, the most important thing is first you need a model that understands what you want, extracts your preferences.
then relays that to other AI tools, AI models. It won't be a one-model world, where the same model that understands your preferences based on your voice and language also knows how to code. I don't think that's going to be the case. I think there's going to be a model that takes your voice, your language, understands your preferences, and uses the code model to generate the right code by prompting it in a more expanded way.
The code model needs a kind of expanded prompt. OpenAI kind of does this with Sora: they expand the prompt and use that expanded prompt to generate the video. The video model is not actually interacting directly with the user; there's actually a language model in between. I think Hume could be in between too, and it would expand the prompt not just based on your language but also based on your voice, and that indicates a lot more. So Hume will be the interface for a lot of applications.
But beyond just being the interface and the middleman, we also have the ability to say, this is optimized for users. We can measure that. We can look at people's reactions and say, okay, people are happier for having used this application with this interface. We can report back to the user. And I think in some cases we want to have visibility to the end user, and the end user will want that from us, because they'll want to
understand what model is being used and have some guarantees about that. So ideally we're at an interface layer for many, many AI applications. That's the goal.
π©βπ€ Ate-A-Pi: Amazing. Alan, thank you so much for spending some time with me today. I mean, it's one of those things that, once you hear about it, you know that someone should have done it, and then you're just so happy that someone competent and motivated and also pro-social has decided to do it. So I'm so happy to meet you.