State Space Models and Real-time Intelligence with Karan Goel and Albert Gu from Cartesia

Released Thursday, 27th June 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.

0:05

Welcome back to No Priors. We're excited

0:07

to talk to Karan Goel and

0:10

Albert Gu, the co-founders of Cartesia

0:12

and authors behind such revolutionary models

0:14

as S4 and Mamba. They're

0:16

leading a rebellion against the dominant architecture

0:18

of Transformers, so we're excited to talk

0:20

to them about that and their company

0:22

today. Welcome, Karan, Albert.

0:25

Thank you. Nice to be here. And

0:27

Karan, tell us a little bit more about Cartesia,

0:29

the product, what people can do with it today,

0:31

some of the use cases. Yeah, definitely. We launched

0:33

Sonic. Sonic is a really fast text-to-speech engine, so

0:36

some of the places I think that we've

0:38

seen people be really excited about using

0:41

Sonic is where they want to do

0:43

interactive low-latency voice generation. So I think

0:45

the two places we've really kind of

0:48

had a lot of excitement is one

0:50

in gaming where folks are really just

0:53

interested in powering characters

0:56

and roles and NPCs. The dream is to

0:58

have a game where you have millions of

1:00

players and they're able to just interact with

1:04

these models and get back responses on

1:06

the fly. And I think that's sort of where

1:09

we've seen a lot of excitement and uptake. And

1:11

then the other end is voice agents and being

1:13

able to power them, and again, low latency really matters there.

1:15

And even with what we've done with Sonic, we're

1:17

already kind of shaving off like 150 milliseconds off

1:20

of you

1:22

know, what they typically use. And so, you know,

1:24

the roadmap is let's get to the next 600

1:26

milliseconds and try to shave those off over

1:28

the course of the year. That's been the place

1:30

where it's been pretty exciting. Love to

1:33

talk a little bit just about backgrounds and how

1:35

you ended up starting Cartesia. Maybe you can start

1:37

with the research journey and like what kinds of

1:39

problems you were both working on. Karan and I both

1:41

came from the same PhD group at Stanford. I did

1:43

a pretty long PhD and I worked on a bunch

1:45

of problems, but I ended up sort of

1:47

working on a bunch of problems around sequence

1:49

modeling. It came out of kind of these problems I

1:52

started working on actually at DeepMind during the internship. And

1:54

then I started working on sequence modeling around the same

1:56

time actually that Transformers got popular. Actually,

1:58

instead of working on that, I got really interested

2:01

in these alternate recurrent models, which I thought were

2:03

really elegant for other reasons and, I felt, fundamental

2:05

in a sense. So I was just really interested

2:07

in them and I worked on them for a

2:10

few years. A couple of years ago, Karan and I

2:12

worked together on this model called S4, which

2:14

got popular for showing that some form

2:16

of recurrent model called a state-space model

2:19

was really effective in some applications. And

2:21

I've been continuing to push in that

2:23

direction. Recently, I proposed

2:26

a model called Mamba, which

2:29

brought these to language modeling and

2:31

showed really good results there. So people have been

2:33

really interested. We've been using

2:36

them for applications

2:38

in other domains and so on.

2:40

So yeah, it's really exciting. Personally,

2:42

I just started as a professor at CMU

2:45

this year. My research lab there is working

2:47

on the academic side of these questions while

2:49

at Cartesia, we're putting them into production. Yeah,

2:51

I guess my story was that I grew

2:53

up in India, so I came from an

2:56

engineering family. All my ancestors

2:58

were engineers. So I

3:00

actually was trying to be a doctor

3:02

in high school, but my aptitude for

3:04

biology was very low, so I

3:07

abandoned it in favor of being an engineer. So

3:10

I took a fairly typical

3:12

path, went to an IIT, came to grad school,

3:14

and then ended up at

3:16

Stanford. Actually, started out working on reinforcement learning

3:18

back in 2017, 18,

3:21

and then once I got into Stanford, I

3:23

started working with Chris who was somewhat

3:26

skeptical about reinforcement learning as

3:28

a field. This is

3:30

Chris Ré. Yes, Chris Ré was our

3:33

PhD advisor. So I had a very

3:35

interesting transition period where I started a

3:37

PhD because I had no idea what

3:39

I wanted to work on, and so

3:41

I was just exploring. Then

3:43

ended up actually, we did our first project together

3:45

too. Oh yeah, good times. Actually, we

3:47

knew each other before that, and I think then

3:49

we started working together on that

3:51

first project. We would

3:53

hang out socially and then started working together. The

3:56

only memory I have of that project was that we

3:58

kept

4:01

filling up this disk on G Cloud and expanding

4:03

it by one terabyte every time. And then it

4:05

would keep filling up. And I would insist on

4:07

only adding a terabyte to it, which

4:11

he was very mad about for a while. Well,

4:13

by the end of the project, we were

4:15

running a bunch of experiments. And the logs would

4:17

fill the disk up faster than we could expand it. Basically,

4:20

I would be there tracking the

4:22

experiments. And Karan would be there deleting logs in real

4:24

time so that our runs didn't crash.

4:26

It was a really interesting way to get

4:29

started working together. Yeah, so we started working

4:31

together then. And then I eventually started working

4:33

with Albert on the S4 push when he

4:35

was pushing for NeurIPS. And I think he

4:37

was working on it alone and then needed

4:39

help. I got recruited in to

4:42

help out because I was just not

4:44

doing anything for that NeurIPS deadline. So

4:46

I ended up spending about three weeks on that,

4:48

two or three weeks, something like that. And then

4:51

we really pushed hard. And that's kind of how

4:53

I got interested in it, because he had been

4:55

working on this stuff for a while. And nobody

4:57

really knew what he was doing. To

4:59

be honest, in the lab, he was just over

5:01

in the corner, scribbling away, talking to himself. We

5:03

didn't really know what was going on. Could

5:06

you actually tell us more about SSMs? And how

5:08

are they different from transformer-based architectures? And what are

5:10

some of the main areas that people are applying

5:12

them right now? Because I think it's really interesting,

5:14

is sort of another approach. It really kind of

5:16

got started from work on RNNs

5:19

or recurrent neural networks, that I was working on

5:21

before as an intern in 2019. It

5:24

kind of felt like the right thing to do for

5:26

sequential modeling because the basic premise of this is

5:28

that if you want to model a sequence of data, you

5:31

want to kind of process the sequence one at

5:33

a time. If you think about the way that

5:35

you will kind of process information, you're taking it

5:38

sequentially and kind of encoding it into your

5:40

representation of the information that you know. And

5:43

then you get new information, and you update

5:45

your belief or your state or whatever with

5:47

the new information that you have. You can

5:49

basically say almost any model actually is doing

5:51

this. And then there were some connections to

5:54

other dynamical systems and other things that I

5:56

found really interesting mathematically. And I

5:58

just thought this kind of felt like... a

6:01

fundamental way to do this. It just felt

6:03

right in some ways. You can think of

6:05

these models as doing something. There's

6:08

some loose inspiration from the brain even, where

6:10

you think of the model as encoding all

6:12

the information it's seen into

6:15

a compressed state. It could be fuzzy

6:17

compression, but that's actually powerful in some

6:19

ways because it's a way of stripping

6:21

out unnecessary information and just trying to

6:23

focus on the things that matter, encode

6:25

those and process those, and then work

6:27

with that. We can get more into the

6:29

technical details, but at a high level, it's just

6:31

this thing. It's just representing this

6:34

idea of this fuzzy compression and fast

6:36

updating. So you're just keeping this

6:38

state in memory, that's just always updating as you see

6:41

new information. Is it a better architecture for

6:43

certain types of data, or did you

6:45

have applications in

6:47

mind besides the general

6:49

architectural concept? Yeah. So it really

6:51

can be applied to pretty much everything. So just like

6:54

transformers are applied to everything, so

6:57

can these models be. Over the course

6:59

of research over a few

7:01

years, we realized that there are different advantages

7:03

for different types of data, and lots of

7:06

different variants of these models are

7:09

better at different types of data than others.

7:11

So the first type of model we worked

7:13

on was really good at modeling perceptual signals.

7:15

So you can think of text data as

7:18

a representation that's already been

7:20

really compressed and tokenized. Pre-

7:23

cooked. Yes, sure. It's

7:25

very dense. Every

7:27

token in text already has a meaning, it's dense

7:29

information. Now, if you look at a video or

7:32

an audio signal, it's highly compressible. For

7:34

example, if you sample at a really high rate,

7:36

it's very continuous and so that means

7:39

it's compressible. It

7:41

turns out that different types of

7:43

models just have different inductive biases

7:46

or strengths at modeling these

7:48

things. The first types of models

7:50

we were looking at were really good, actually, at

7:52

modeling these raw waveforms. Raw

7:55

pixels, things like that, but not as

7:57

good at modeling text, and transformers are way better there. Newer

8:00

versions of these models like Mamba, which was

8:02

the most recent one that's been out for

8:04

a few months, that's a lot better at

8:07

modeling the same types of data as Transformers.

8:09

Even there, there are subtler kinds of trade-offs. But

8:12

yeah, so one thing we kind of learned

8:14

is that in general, there's no free lunch there. So people

8:16

think that you can throw a Transformer at anything and it

8:18

just works. Actually, it doesn't really.

8:20

If you tried to throw it at

8:23

the raw pixel level or the raw sample

8:25

level in audio waveforms, I think it doesn't

8:27

work nearly as well. So you have to

8:29

be a little more deliberate about this. They

8:31

really evolved hand in hand with the whole

8:34

ecosystem of the whole training pipeline. So it's

8:36

like the places that people use Transformers, the

8:39

data has already kind of been processed in a way

8:41

that helps the model.

8:44

For example, people have been talking

8:46

a lot about tokenization and how it's both

8:49

extremely important, but also very counterintuitive, unnatural,

8:51

and has its own issues. That's an

8:53

example of something that's kind of developed

8:55

hand in hand with the Transformer architecture.
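
As a rough, back-of-the-envelope illustration of the point about text arriving "pre-cooked" (the token estimate and durations below are assumptions for illustration, not figures from the episode), here is a small Python sketch comparing how long the same content is as tokens versus as raw audio samples:

```python
# Rough sketch: the same spoken sentence as dense text tokens
# versus raw audio samples.
words = 12                      # a short sentence
tokens_per_word = 1.3           # rough BPE average (an assumption)
text_tokens = round(words * tokens_per_word)     # ~16 dense tokens

seconds = 3                     # roughly how long it takes to say aloud
sample_rate_hz = 16_000         # a common sample rate for speech audio
audio_samples = seconds * sample_rate_hz         # 48,000 raw samples

print(text_tokens, audio_samples, audio_samples // text_tokens)
# -> 16 48000 3000  (the raw waveform is ~3,000x longer for the same content)
```

The length gap is the point: modeling 48,000 highly redundant samples is a very different problem from modeling 16 dense tokens, which is part of why the earlier SSM variants stood out on raw waveforms.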

8:57

And then when you kind of break

8:59

away from these assumptions, then some

9:01

of your modeling assumptions no longer hold, and then

9:03

some of these other models actually work

9:06

better. Do you think of the

9:08

advantages as, like, a natural fit that translates

9:10

to quality for certain data types?

9:13

At least if we think about, let's say, perceptual

9:15

data, or I don't know, rich or raw, precooked,

9:18

not precooked data. Or how

9:21

do you think about efficiency or the other

9:23

dimensions of comparing the architectures? Yeah,

9:25

so I guess so far we talked kind of about

9:28

the inductive bias or the fit for the data. Now,

9:30

the other reason why we really cared about these is

9:32

because of efficiency. So yeah, maybe we should have led

9:34

with that even. So people have

9:36

yelled for a long time about

9:38

this quadratic scaling of Transformers. One

9:40

of the big advantages of these

9:42

alternatives is the linear scaling. So

9:45

it just means that basically the

9:47

time it takes to process any new

9:49

token is basically constant time for a

9:51

recurrent model. But for a transformer,

9:53

it scales with the history that you've seen.
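
To make that concrete, here is a minimal, illustrative sketch (not S4 or Mamba themselves; a generic linear state-space recurrence in NumPy, with arbitrary sizes and names) of why a recurrent state update is constant time per token while attention revisits the whole history:

```python
import numpy as np

# Illustrative only: a linear state-space recurrence
#   h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t
# Each step touches a fixed-size state, so per-token work is constant.
# The toy attention step, by contrast, rescans everything seen so far.

d_state, d_in = 16, 4
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state))

def ssm_step(h, x):
    """O(1) per token: fold the new input into the compressed state."""
    h = A @ h + B @ x
    return h, C @ h

def attention_step(history, x):
    """O(t) per token: score the new input against the whole history."""
    history = np.vstack([history, x])
    scores = history @ x                   # cost grows with sequence length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return history, weights @ history

h = np.zeros(d_state)
history = np.empty((0, d_in))
for x in rng.normal(size=(8, d_in)):       # a toy sequence of 8 "tokens"
    h, y_recurrent = ssm_step(h, x)
    history, y_attention = attention_step(history, x)
```

The only point of the sketch is the asymptotics described here: `ssm_step` does the same amount of work at the millionth token as at the tenth, while `attention_step`'s work keeps growing with everything it has already seen.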

9:55

This is obviously a huge advantage when you're

9:57

really scaling to collect lots of data. There's

22:00

still so much more you can do in this

22:02

area. Can you actually talk about that? Because I

22:04

think a lot of people would say, that feels

22:06

a lot more solved in the last year, which

22:08

is text to audio generation. Like what's left between

22:10

here and the ceiling in terms of thinking about

22:12

the application experience? Yeah, I think the way I

22:14

think about it is would I want to talk

22:16

to this thing for more than 30 seconds? And

22:19

if the answer is no, then it's not

22:22

solved. And if the answer is yes, then

22:24

it is solved. And I think most text

22:26

to speech systems. Karan's audio Turing test, yeah.

22:28

Are not that interesting yet. You

22:30

don't feel as engaged as you do when you're talking

22:32

to a human. I know there's other,

22:34

obviously other reasons you talk to humans, which is,

22:36

you know, sorry, I don't want to come across

22:39

as crazy here, but yeah, there's a

22:41

society that we live in. So we

22:44

want to talk to people for that reason, obviously.

22:46

But I do think the engagement that you have

22:48

with these systems is not that high. When you're

22:50

trying to build these things, you really kind of get

22:53

so into the weeds on like, oh, I can't

22:55

say this thing this way. And it's like so boring

22:57

when it says it that way. And how do I

22:59

control this part of it to say it like

23:01

this? You know, the intonation. Are there specific dimensions that

23:03

you look at from an eval perspective that you think

23:06

are most important in terms of how you think

23:08

about it? Yeah, evals for, you know, generation are generally challenging

23:10

because they're qualitative

23:12

and based on sort of, you

23:14

know, the general perception of someone

23:16

who looks at something and says,

23:18

this is more interesting than this.

23:21

And so there is some dimension to that. But I think for

23:23

speech, like, you know, emotion is something that

23:25

matters a lot because you want to be able to

23:27

kind of control, you know, the way in which things

23:29

are said. And I think the other piece that's really

23:31

interesting is how speech is used

23:33

to embody kind of the roles people play

23:35

in society. So like different people speak in

23:37

different ways because they have, you know, different

23:39

jobs or work in different, you know, areas

23:41

or live in different parts of the world.

23:44

And that's sort of the nuance that I

23:46

don't think any models really capture well, which

23:48

is like, you know, if you're a nurse,

23:50

you need to talk in a different way

23:52

than if you're a lawyer or if you're

23:54

a judge or if you're a venture capitalist,

23:56

you know, very different forms of speech. The

23:58

highest form of voice. So

24:01

those are all very challenging, I would say.

24:03

So it's not solved, is my claim. There's

24:05

also an interesting point, which is kind of like,

24:07

even just for basic evaluations of like, can

24:10

your ASR system recognize these words or

24:12

can your TTS system

24:15

say this word? Even that is actually

24:17

not quite a local problem. For

24:19

a lot of hard things, you actually need to

24:21

really have the language understanding in order to process

24:24

and figure out what is the right way of pronouncing

24:26

this and so on. So actually to really get perfect,

24:31

even just TTS or speech-to-speech, you

24:33

actually really need to have a model that has

24:36

more understanding at least of the language, but

24:38

it's not really an isolated component anymore. So

24:40

you have to start getting into these multimodal

24:42

models just to even do one modality

24:45

well. So that's somewhere that we

24:47

were eyeing from the beginning as well.

24:50

We were using this as an entry

24:52

point into building out the stack toward

24:54

all of that, and hopefully that's all

24:56

going to help the audio as well,

24:58

but also start getting into other modalities. That's

25:01

really cool. I mean, I guess you've

25:03

done so much pioneering key work on

25:05

the SSM side. How has multimodality or

25:07

speech really impacted how you thought about the

25:09

broader problem, or has it? Or is it more just

25:11

that the generic solutions are the ones that make sense?

25:13

I don't think multimodality by itself has been

25:16

a driving motivation for this work because I think

25:18

of these basic

25:20

models I've been working on as basic generic

25:23

building blocks that can be used anywhere. So

25:26

they certainly can be used in multimodal systems

25:28

to good effect, I think. Different

25:30

modalities have presented different challenges, which has influenced

25:32

the design of these. But

25:35

I always look for the most general

25:37

purpose fundamental building block that

25:39

can be used everywhere. So multimodality

25:41

is more of

25:43

a different set of challenges in terms of how

25:47

are you applying the building

25:49

blocks to that, but you still use the same

25:51

techniques and they mostly work. Given

25:54

that versatility of model architecture, generality

25:56

as a building block, what do

25:58

you do next for Cartesia? nice

30:00

solutions to hard problems. But

30:02

it's not always possible. So at Cartesia, we,

30:05

of course, need to solve the actual engineering

30:07

challenges. And there's always going to be hairy

30:09

things. But as

30:11

much as I can, I'm always trying to

30:13

strive to kind of make everything simple, unified

30:15

as possible. That's great. Yeah, I remember. I

30:17

can't remember. Is it Erdős or somebody who used

30:21

to talk about certain theorems coming out

30:23

of God's book or something? Or so elegant?

30:25

Yeah, I very much adhere to that idea.

30:28

So it's called proofs from the book,

30:30

is what he would say. And

30:33

that's actually the kind of thing that guides

30:35

a lot of the way that

30:37

I like picking and choosing problems. And what you're referring

30:39

to is, of course, in pure

30:42

math. Sometimes you see proofs or

30:45

ideas that just feel like

30:47

this is obviously just the right way of doing

30:49

things. It's so elegant. It's so correct. In

30:51

the machine learning world, things are often

30:54

not nearly that clean. But

30:56

you can still have the same kind

30:58

of concept, just maybe at a different level of

31:00

abstraction. But sometimes certain approaches

31:02

or something just seems like the right way

31:05

of doing things. Unfortunately,

31:07

this thing is also kind of like, it

31:09

can be subjective. Yeah, sometimes I

31:12

tell people this is just

31:14

the right way of doing it. And I can't explain

31:16

why. But maybe one

31:18

of our pillars should be about the book so

31:21

I can start saying this. Let's

31:23

see the demo. Yeah, I'd love to show you.

31:27

Cool. Yeah,

31:29

I have our model running

31:31

on our standard issue Mac here.

31:34

Basically, this is our text-to-speech model, Sonic.

31:36

And our playground is running in the

31:38

cloud. And so part of what I

31:40

talked about earlier was how do you

31:42

kind of bring this closer to on-device

31:44

and edge. And I think the first

31:46

place to start is your laptop. And

31:48

then hopefully shrink it

31:50

down and bring it closer and closer to a

31:52

smaller footprint. So let me start running this. It's

31:55

great to be on the No Priors podcast today.

31:58

We have the same feature set that's in

32:00

the cloud, but running on this, and... Prove

32:02

it's real time and not canned. Say, you don't have to

32:04

believe in God, but you have to believe in the book.

32:06

I think that's the Erdős quote. Was that the

32:08

quote? Let me grab an

32:11

interesting voice for this one. Erdős

32:13

is, where's Erdős from? Hungary.

32:15

Hungary. I mean, that's the default guess

32:17

for any mathematician from America. Oh yeah, sure, he's just

32:19

the same. All right, I'm gonna press enter. You

32:23

don't have to believe in God, you have

32:25

to believe in the book. That's

32:27

pretty good. Latency

32:29

is pretty good. Yeah, it works really fast

32:31

and I think that's part of what I

32:33

think gets me really excited, which is like,

32:35

you know, it streams out audio

32:37

instantly, so yeah. I would talk to Erdős on

32:40

my laptop. Yeah, yeah, me too. That'd

32:43

be a great way to get inspired every

32:45

morning. Yeah, I know. Yeah. Yeah,

32:47

that'd be great. Your team is now, how many

32:49

people? We are 15 people now.

32:51

And eight interns. Sarah

32:54

always gives me shit for this. It's a big

32:56

intern class, yeah. That's amazing. We have

32:58

a lot of interns. I really like

33:00

interns. They're great. They're excited, they wanna

33:02

do cool things. And

33:04

are there specific roles that you're currently hiring for, adding

33:07

up? Yeah, we are hiring

33:09

for modeling roles specifically.

33:12

We're hiring across the engineering stack, but really

33:14

wanna kind of build out our modeling team

33:17

deeper, so always looking for great

33:19

folks to come to Team SSM

33:22

and help us build the future. The rebellion.

33:24

Yeah, the rebellion. Yeah, we used to actually

33:26

call it. Yeah, it's, what do you call

33:29

it? Overthrowing the empire. Yeah, yeah,

33:31

yeah. That was the theme during

33:33

our PhDs. And yeah, I would

33:35

love to continue to have folks inbound

33:37

us and chat with us

33:39

if they're excited about this technology and the

33:42

use cases. A lot of exciting work

33:44

to do, both research and bringing it

33:46

to people. Yep. Find

33:49

us on Twitter at NoPriorsPod. Subscribe to

33:51

our YouTube channel if you wanna see

33:54

our faces. Follow the show on

33:56

Apple Podcasts, Spotify, or wherever you listen. That

33:58

way you get a new episode every week.

34:01

And sign up for emails or

34:03

find transcripts for every episode at

34:05

no-priors.com.
