Episode Transcript
0:05
Welcome back to No Priors. We're excited
0:07
to talk to Karan Goel and
0:10
Albert Gu, the co-founders of Cartesia
0:12
and authors behind such revolutionary models
0:14
as S4 and Mamba. They're
0:16
leading a rebellion against the dominant architecture
0:18
of Transformers, so we're excited to talk
0:20
to them about that and their company
0:22
today. Welcome, Karan, Albert.
0:25
Thank you. Nice to be here. And
0:27
Karan, tell us a little bit more about Cartesia,
0:29
the product, what people can do with it today,
0:31
some of the use cases. Yeah, definitely. We launched
0:33
Sonic. Sonic is a really fast text-to-speech engine, so
0:36
some of the places I think that we've
0:38
seen people be really excited about using
0:41
Sonic is where they want to do
0:43
interactive low-latency voice generation. So I think
0:45
the two places we've really kind of
0:48
had a lot of excitement is one
0:50
in gaming where folks are really just
0:53
interested in powering characters
0:56
and roles and NPCs. The dream is to
0:58
have a game where you have millions of
1:00
players and they're able to just interact with
1:04
these models and get back responses on
1:06
the fly. And I think that's sort of where
1:09
we've seen a lot of excitement and uptake. And
1:11
then the other end is voice agents and being
1:13
able to power them, and again, low latency really matters there.
1:15
And even with what we've done with Sonic, we're
1:17
already kind of shaving off like 150 milliseconds off
1:20
of you
1:22
know, what they typically use. And so, you know,
1:24
the roadmap is let's get to the next 600
1:26
milliseconds and try to shave those off over
1:28
the course of the year. That's been the place
1:30
where it's been pretty exciting. Love to
1:33
talk a little bit just about backgrounds and how
1:35
you ended up starting Cartigia. Maybe you can start
1:37
with the research journey and like what kinds of
1:39
problems you were both working on. Karan and I both
1:41
came from the same PhD group at Stanford. I did
1:43
a pretty long PhD and I worked on a bunch
1:45
of problems, but I ended up sort of
1:47
working on a bunch of problems around sequence
1:49
modeling. It came out of kind of these problems I
1:52
started working on actually at DeepMind during the internship. And
1:54
then I started working on sequence modeling around the same
1:56
time actually that Transformers got popular. I actually
1:58
instead of working on that, got really interested
2:01
in these alternate recurrent models, which I thought were
2:03
really elegant for other reasons, and which felt fundamental
2:05
in a sense. So I was just really interested
2:07
in them and I worked on them for a
2:10
few years. A couple of years ago, Karan and I
2:12
worked together on this model called S4, which
2:14
got popular for showing that some form
2:16
of recurrent model called a state-space model
2:19
was really effective in some applications. And
2:21
I've continued pushing in that
2:23
direction. Recently, I proposed
2:26
a model called Mamba, which
2:29
brought these to language modeling and
2:31
showed really good results there. So people have been
2:33
really interested. We've been using
2:36
them for applications
2:38
and other domains and so on.
2:40
So yeah, it's really exciting. Personally,
2:42
I just started as a professor at CMU
2:45
this year. My research lab there is working
2:47
on the academic side of these questions while
2:49
at Cartesia, we're putting them into production. Yeah,
2:51
I guess my story was that I grew
2:53
up in India, so I came from an
2:56
engineering family. All my ancestors
2:58
were engineers. So I
3:00
actually was trying to be a doctor
3:02
in high school, but my aptitude for
3:04
biology was very low, so I
3:07
abandoned it in favor of being an engineer. So
3:10
I took a fairly typical
3:12
path, went to an IIT, came to grad school,
3:14
and then ended up at
3:16
Stanford. Actually, started out working on reinforcement learning
3:18
back in 2017, 18,
3:21
and then once I got into Stanford, I
3:23
started working with Chris who was somewhat
3:26
skeptical about reinforcement learning as
3:28
a field. This is
3:30
Chris Ré. Yes, Chris Ré was our
3:33
PhD advisor. So I had a very
3:35
interesting transition period when I started a
3:37
PhD, where I had no idea what
3:39
I was working on, and so
3:41
I was just exploring. Then
3:43
ended up actually, we did our first project together
3:45
too. Oh yeah, it was good times. Actually, we
3:47
knew each other before that, and I think then
3:49
we started working together on that
3:51
first project. We would
3:53
hang out socially and then start working together. The
3:56
only memory I have of that project was we-
3:58
I kept- But I kept
4:01
filling up this disk on G Cloud and expanding
4:03
it by one terabyte every time. And then it
4:05
would keep filling up. And I would insist on
4:07
only adding a terabyte to it, which
4:11
he was very mad about for a while. Well,
4:13
by the end of the project, it was like
4:15
running a bunch of experiments. And the logs would
4:17
get filled up faster than the other ones. Basically,
4:20
I would be there tracking the
4:22
experiments. And Karan would be there deleting logs in real
4:24
time so that our runs didn't crash.
4:26
It was a really interesting way to get
4:29
started working together. Yeah, so we started working
4:31
together then. And then I eventually started working
4:33
with Albert on the S4 push when he
4:35
was pushing for NeurIPS. And I think he
4:37
was working on it alone and then needed
4:39
help. I got recruited in to
4:42
help out because I was just not
4:44
doing anything for that NeurIPS deadline. So
4:46
I ended up spending about three weeks on that,
4:48
two or three weeks, something like that. And then
4:51
we really pushed hard. And that's kind of how
4:53
I got interested in it, because he had been
4:55
working on this stuff for a while. And nobody
4:57
really knew what he was doing. To
4:59
be honest, in the lab, he was just over
5:01
in the corner, scribbling away, talking to himself. We
5:03
didn't really know what was going on. Could
5:06
you actually tell us more about SSMs? And how
5:08
are they different from transformer-based architectures? And what are
5:10
some of the main areas that people are applying
5:12
them right now? Because I think it's really interesting,
5:14
as sort of another approach. It really kind of
5:16
got started from work on RNNs
5:19
or recurrent neural networks, that I was working on
5:21
before as an intern in 2019. It
5:24
kind of felt like the right thing to do for
5:26
sequential modeling because the basic premise of this is
5:28
that if you want to model a sequence of data, you
5:31
want to kind of process the sequence one at
5:33
a time. If you think about the way that
5:35
you will kind of process information, you're taking it
5:38
sequentially and kind of encoding it into your
5:40
representation of the information that you know. And
5:43
then you get new information, and you update
5:45
your belief or your state or whatever with
5:47
the new information that you have. You can
5:49
basically say almost any model actually is doing
5:51
this. And then there were some connections to
5:54
other dynamical systems and other things that I
5:56
found really interesting mathematically. And I
5:58
just thought this kind of felt like... a
6:01
fundamental way to do this. It just felt
6:03
right in some ways. You can think of
6:05
these models as doing something. There's
6:08
some loose inspiration from the brain even, where
6:10
you think of the model as encoding all
6:12
the information it's seen into
6:15
a compressed state. It could be fuzzy
6:17
compression, but that's actually powerful in some
6:19
ways because it's a way of stripping
6:21
out unnecessary information and just trying to
6:23
focus on the things that matter, encode
6:25
those and process those, and then work
6:27
with that. We can get more into the
6:29
technical details, but at a high level, it's just
6:31
this thing. It's just representing this
6:34
idea of this fuzzy compression and fast
6:36
updating. So you're just keeping this
6:38
state in memory, that's just always updating as you see
6:41
new information.
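(For intuition, a minimal sketch of that state-update idea. This is illustrative only, not how S4 or Mamba are actually parameterized or computed; A, B, and C are hypothetical stand-ins for learned parameters.)

```python
import numpy as np

# Minimal discrete-time state-space recurrence, for intuition only.
# A, B, C are placeholder matrices; real SSMs learn and structure these
# very differently, and compute the recurrence far more efficiently.
d_state = 16
A = 0.9 * np.eye(d_state)              # how the compressed state carries over
B = 0.1 * np.random.randn(d_state, 1)  # how a new input is written into the state
C = 0.1 * np.random.randn(1, d_state)  # how an output is read out of the state

def run_ssm(inputs):
    h = np.zeros((d_state, 1))          # fixed-size "fuzzy" memory of everything seen so far
    outputs = []
    for x in inputs:                    # constant work per step, regardless of history length
        h = A @ h + B * x               # fold the new observation into the state
        outputs.append((C @ h).item())  # read an output out of the current state
    return outputs

print(run_ssm([0.5, -1.0, 0.25, 0.0]))
```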
6:43
Is there a better architecture for certain types of data, or did you
6:45
have applications in
6:47
mind besides the general
6:49
architectural concept? Yeah. So it really
6:51
can be applied to pretty much everything. So just like
6:54
transformers are applied to everything, so
6:57
can these models. Over the course
6:59
of research over a few
7:01
years, we realized that there are different advantages
7:03
for different types of data, and lots of
7:06
different variants of these models are
7:09
better at some types of data than others.
7:11
So the first type of model we worked
7:13
on was really good at modeling perceptual signals.
7:15
So you can think of text data as
7:18
a representation that's already been
7:20
really compressed and tokenized. Pre-
7:23
cooked. Yes, sure. It's
7:25
very dense. Every
7:27
token in text already has a meaning, it's dense
7:29
information. Now, if you look at a video or
7:32
an audio signal, it's highly compressible. For
7:34
example, if you sample at a really high rate,
7:36
it's very continuous and so that means
7:39
it's compressible.
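(Rough, illustrative numbers, assumed rather than taken from the conversation: one second of raw audio is tens of thousands of sequence steps, while the same second of speech is only a few text tokens once it has been tokenized.)

```python
# Back-of-the-envelope sequence lengths (illustrative assumptions only).
sample_rate_hz = 16_000                # a common sample rate for speech audio
seconds = 1.0
raw_audio_steps = int(sample_rate_hz * seconds)   # one sequence step per raw sample

words_per_second = 2.5                 # rough conversational speaking rate
tokens_per_word = 1.3                  # rough average for a subword tokenizer
text_tokens = round(words_per_second * tokens_per_word * seconds)

# ~16,000 raw samples vs ~3 tokens for the same second of speech: the raw
# waveform is a far longer, more redundant sequence than tokenized text.
print(raw_audio_steps, text_tokens, raw_audio_steps // text_tokens)
```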
7:41
It turns out that different types of
7:43
models just have different inductive biases
7:46
or strengths at modeling these
7:48
things. The first types of models
7:50
we were looking at were really good, actually, at
7:52
modeling these raw waveforms. Raw
7:55
pixels, things like that, but not as
7:57
good at modeling text, where transformers are way better. Newer
8:00
versions of these models like Mamba, which was
8:02
the most recent one that's been out for
8:04
a few months, that's a lot better at
8:07
modeling the same types of data as Transformers.
8:09
Even there, there are subtler kinds of trade-offs. But
8:12
yeah, so one thing we kind of learned
8:14
is that in general, there's no free lunch there. So people
8:16
think that you can throw a Transformer at anything and it
8:18
just works. Actually, it doesn't really.
8:20
If you tried to throw it at
8:23
the raw pixel level or the raw sample
8:25
level in audio waveforms, I think it doesn't
8:27
work nearly as well. So you have to
8:29
be a little more deliberate about this. They
8:31
really evolved hand in hand with the whole
8:34
ecosystem of the whole training pipeline. So it's
8:36
like the places that people use Transformers, the
8:39
data has already kind of been processed in a way
8:41
that helps the model.
8:44
For example, people have been talking
8:46
a lot about tokenization and how it's both
8:49
extremely important, but also very counterintuitive, unnatural,
8:51
and has its own issues. That's an
8:53
example of something that's kind of developed
8:55
hand in hand with the Transformer architecture.
8:57
And then when you kind of break
8:59
away from these assumptions, then some
9:01
of your modeling assumptions no longer hold, and then
9:03
some of these other models actually work
9:06
better. Do you think of the
9:08
advantages as, like, a natural fit that translates
9:10
to quality for certain data types?
9:13
At least if we think about, let's say, perceptual
9:15
data, or I don't know, rich or raw, precooked,
9:18
not precooked data. Or how
9:21
do you think about efficiency or the other
9:23
dimensions of comparing the architectures? Yeah,
9:25
so I guess so far we talked kind of about
9:28
the inductive bias or the fit for the data. Now,
9:30
the other reason why we really cared about these is
9:32
because of efficiency. So yeah, maybe we should have led
9:34
with that even. So people have
9:36
yelled for a long time about
9:38
this quadratic scaling of Transformers. One
9:40
of the big advantages of these
9:42
alternatives is the linear scaling. So
9:45
it just means that basically the
9:47
time it takes to process any new
9:49
token is basically constant time for a
9:51
recurrent model. But for a transformer,
9:53
it scales with the history that you've seen.
9:55
This is obviously a huge advantage when you're
9:57
really scaling to collect lots of data.
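(A toy sketch of that scaling argument, illustrative only and not real model code: a recurrent model does a fixed amount of work per new token, while a plain transformer attends over everything produced so far, so its per-token work grows with the history.)

```python
# Toy per-token cost comparison during generation (illustrative only).

def recurrent_step(state: float, token: float) -> float:
    # Fixed-size state update: cost is constant regardless of history length.
    return 0.9 * state + token

def attention_step(history: list, token: float) -> float:
    # Attend over the whole history: work at step t grows with t,
    # so generating T tokens costs on the order of T^2 overall.
    return sum(token * past for past in history)

state, history = 0.0, []
for tok in [0.1, 0.2, 0.3, 0.4]:
    state = recurrent_step(state, tok)  # O(1) work at this step
    _ = attention_step(history, tok)    # O(len(history)) work at this step
    history.append(tok)
```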
22:00
There's still so much more you can do in this
22:02
area. Can you actually talk about that? Because I
22:04
think a lot of people would say, that feels
22:06
a lot more solved in the last year, which
22:08
is text to audio generation. Like what's left between
22:10
here and the ceiling in terms of thinking about
22:12
the application experience? Yeah, I think the way I
22:14
think about it is would I want to talk
22:16
to this thing for more than 30 seconds? And
22:19
if the answer is no, then it's not
22:22
solved. And if the answer is yes, then
22:24
it is solved. And I think most text
22:26
to speech systems... Karan's audio Turing test, yeah.
22:28
Are not that interesting yet. You
22:30
don't feel as engaged as you do when you're talking
22:32
to a human. I know there's other,
22:34
obviously other reasons you talk to humans, which is,
22:36
you know, sorry, I don't want to come across
22:39
as crazy here, but yeah, there's a
22:41
society that we live in. So we
22:44
want to talk to people for that reason, obviously.
22:46
But I do think the engagement that you have
22:48
with these systems is not that high. When you're
22:50
trying to build these things, you really kind of get
22:53
so into the weeds on like, oh, I can't
22:55
say this thing this way. And it's like so boring
22:57
when it says it that way. And how do I
22:59
control this part of it to say it like
23:01
this? You know, the intonation. Are there specific dimensions that
23:03
you look at from an eval perspective that you think
23:06
are most important in terms of how you think
23:08
about? Yeah, evals for, you know, generation are generally challenging
23:10
because they're qualitative
23:12
and based on sort of, you
23:14
know, the general perception of someone
23:16
who looks at something and says,
23:18
this is more interesting than this.
23:21
And so there is some dimension to that. But I think for
23:23
speech, like, you know, emotion is something that
23:25
matters a lot because you want to be able to
23:27
kind of control, you know, the way in which things
23:29
are said. And I think the other piece that's really
23:31
interesting is how speech is used
23:33
to embody kind of the roles people play
23:35
in society. So like different people speak in
23:37
different ways because they have, you know, different
23:39
jobs or work in different, you know, areas
23:41
or live in different parts of the world.
23:44
And that's sort of the nuance that I
23:46
don't think any models really capture well, which
23:48
is like, you know, if you're a nurse,
23:50
you need to talk in a different way
23:52
than if you're a lawyer or if you're
23:54
a judge or if you're a venture capitalist,
23:56
you know, very different forms of speech. The
23:58
highest form of voice. So
24:01
those are all very challenging, I would say.
24:03
So it's not solved, is my claim. There's
24:05
also an interesting point, which is kind of like,
24:07
even just for basic evaluations of like, can
24:10
your ASR system recognize these words or
24:12
can your TTS system
24:15
say this word? Even that is actually
24:17
not quite a local problem. For
24:19
a lot of hard things, you actually need to
24:21
really have the language understanding in order to process
24:24
and figure out what is the right way of pronouncing
24:26
this and so on. So actually to really get perfect,
24:31
even just TTS or speech-to-speech, you
24:33
actually really need to have a model that has
24:36
more understanding at least of the language, but
24:38
it's not really an isolated component anymore. So
24:40
you have to start getting into these multimodal
24:42
models just to even do one modality
24:45
well. So that's something that we
24:47
were eyeing from the beginning as well.
24:50
We were using this as an entry
24:52
point into building out the stack toward
24:54
all of that, and hopefully that's all
24:56
going to help the audio as well,
24:58
but also start getting other modalities. That's
25:01
really cool. I mean, I guess you've
25:03
done so much pioneering key work on
25:05
the SSM side. How has multimodality or
25:07
speech really impacted how you thought about the
25:09
broader problem or has it? It's more just
25:11
the generic solutions that are the ones that make sense.
25:13
I don't think multimodality by itself has been
25:16
a driving motivation for this work because I think
25:18
of these basic
25:20
models I've been working on as basic generic
25:23
building blocks that can be used anywhere. So
25:26
they certainly can be used in multimodal systems
25:28
to good effect, I think. Different
25:30
modalities have presented different challenges, which has influenced
25:32
the design of these. But
25:35
I always look for the most general
25:37
purpose fundamental building block that
25:39
can be used everywhere. So multimodality
25:41
is more of
25:43
a different set of challenges in terms of how
25:47
are you applying the building
25:49
blocks to that, but you still use the same
25:51
techniques and they mostly work. Given
25:54
that versatility of model architecture, generality
25:56
as a building block, what do
25:58
you do next for Cartesia? nice
30:00
solutions to hard problems. But
30:02
it's not always possible. So at Cartesia, we,
30:05
of course, need to solve the actual engineering
30:07
challenges. And there's always going to be hairy
30:09
things. But as
30:11
much as I can, I'm always trying to
30:13
strive to kind of make everything as simple and unified
30:15
as possible. That's great. Yeah, I remember. I
30:17
can't remember. Is it Erdős or somebody who used
30:21
to talk about certain theorems coming out
30:23
of God's book or something? Or so elegant?
30:25
Yeah, I very much adhere to that idea.
30:28
So it's called proofs from the book,
30:30
is what he would say. And
30:33
that's actually the kind of thing that guides
30:35
a lot of the way that
30:37
I like picking, choosing problems. And what you're referring
30:39
to is, of course, in pure
30:42
math. Sometimes you see proofs or
30:45
ideas that just feel like
30:47
this is obviously just the right way of doing
30:49
things. It's so elegant. It's so correct. Things are
30:51
not like that in the machine learning world. Things are often
30:54
not nearly that clean. But
30:56
you still can have still the same kind
30:58
of concept, just maybe a different level of
31:00
abstraction. But sometimes certain approaches
31:02
or something just seems like the right way
31:05
of doing things. Unfortunately,
31:07
this thing is also kind of like, it
31:09
can be subjective. Yeah, sometimes I
31:12
tell people this is just
31:14
the right way of doing it. And I can't explain
31:16
why. But maybe we should kind of have like one
31:18
of our pillars be about the book, so
31:21
I can start saying this. Let's
31:23
see the demo. Yeah, I'd love to show you.
31:27
Cool. Yeah,
31:29
I have our model running
31:31
on our standard issue Mac here.
31:34
Basically, this is our text-to-speech model, Sonic.
31:36
And our playground is running in the
31:38
cloud. And so part of what I
31:40
talked about earlier was how do you
31:42
kind of bring this closer to on-device
31:44
and edge. And I think the first
31:46
place to start is your laptop. And
31:48
then hopefully shrink it
31:50
down and bring it closer and closer to a
31:52
smaller footprint. So let me start running this. It's
31:55
great to be on the No Priors podcast today.
31:58
We have the same feature set that's in
32:00
the cloud but running on this and... Prove
32:02
it's real time and not copes. Say, you don't have to
32:04
believe in God, but you have to believe in the book.
32:06
I think that's the Erdős quote. Was that the
32:08
quote? Let me grab an
32:11
interesting voice for this one. Erdős
32:13
is, where's Erdős from? Hungary.
32:15
Hungary. I mean, that's a default guess
32:17
for any mathematician from America. Oh yeah, sure, he's just
32:19
the same. All right, I'm gonna press enter. You
32:23
don't have to believe in God, you have
32:25
to believe in the book. That's
32:27
pretty good. Latency
32:29
is pretty good. Yeah, it works really fast
32:31
and I think that's part of what I
32:33
think gets me really excited, which is like,
32:35
you know, it streams out audio
32:37
instantly, so yeah. I would talk to Erdős on
32:40
my laptop. Yeah, yeah, me too. That'd
32:43
be a great way to get inspired every
32:45
morning. Yeah, I know. Yeah. Yeah,
32:47
that'd be great. Your team is now, how many
32:49
people? We are 15 people now.
32:51
And eight interns. Sarah
32:54
always gives me shit for this. It's a big
32:56
intern class, yeah. That's amazing. We have
32:58
a lot of interns. I really like
33:00
interns. They're great. They're excited, they wanna
33:02
do cool things. And
33:04
are there specific roles that you're currently hiring for, adding
33:07
up? Yeah, we are hiring
33:09
for modeling roles specifically.
33:12
We're hiring across the engineering stack, but really
33:14
wanna kind of build out our modeling team
33:17
deeper, so always looking for great
33:19
folks to come to Team SSM
33:22
and help us build the future. The rebellion.
33:24
Yeah, the rebellion. Yeah, we used to actually
33:26
call it. Yeah, it's, what do you call
33:29
it? Overthrowing the empire. Yeah, yeah,
33:31
yeah. That was the theme during
33:33
our PhDs. And yeah, I would
33:35
love to continue to have folks inbound
33:37
us and chat with us
33:39
if they're excited about this technology and the
33:42
use cases. A lot of exciting work
33:44
to do, both research and bringing it
33:46
to people. Yep. Find
33:49
us on Twitter at NoPriorsPod. Subscribe to
33:51
our YouTube channel if you wanna see
33:54
our faces. Follow the show on
33:56
Apple Podcasts, Spotify, or wherever you listen. That
33:58
way you get a new episode every week.
34:01
And sign up for emails or
34:03
find transcripts for every episode at
34:05
no-priors.com.