Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
Use Ctrl + F to search
0:00
Welcome to the Real Python Podcast.
0:02
This is episode 201. What
0:05
are the benefits of using a decoupled
0:08
data processing system? And how
0:10
do you write reusable queries for a
0:12
variety of backend data platforms? This
0:15
week on the show, Philip Cloud,
0:17
the lead maintainer of Ibis, will
0:19
discuss this portable Python DataFrame library.
0:22
Philip contrasts Ibis's workflow with
0:24
other Python DataFrame libraries. We
0:26
discuss how getting close to
0:28
the data speeds things up
0:30
and conserves memory. He
0:32
describes the different approaches Ibis provides
0:34
for querying data and how to select
0:37
a specific backend. We discuss
0:39
ways to get started with the library
0:41
and how to access example data sets
0:43
to experiment with the platform. Philip
0:45
discovered Ibis while looking for a tool
0:47
that allowed him to reuse SQL queries
0:49
written for a specific data platform on
0:51
a different one. He also recounts how
0:54
he got involved with the Ibis project,
0:56
sharing his background in open source, and
0:58
learning how to contribute to a first
1:00
project. This episode
1:02
is sponsored by Mailtrap, an
1:04
email delivery platform that developers love.
1:08
Try for free at
1:10
mailtrap.io. All right, let's get
1:12
started. Is
1:33
a weekly conversation about using Python in
1:35
the real world. My name is
1:37
Christopher Bailey, your host. Each week
1:39
we feature interviews with experts in
1:41
the community and discussions about the
1:43
topics, articles, and courses found at
1:45
realpython.com. After the podcast, join us
1:47
and learn real-world Python skills with
1:50
a community of experts at realpython.com. Hey,
1:52
Philip, welcome to the show. Hey,
1:54
Chris, great to be here. Yeah,
1:57
so Wes McKinney hooked us up to...
2:00
talk a little deeper about Ibis. I mentioned
2:02
multiple times that I'm very interested in that
2:04
project and we had so many
2:06
other things to talk about when he came on. So he
2:08
gave me your name
2:10
and kind of showed me not only the
2:12
things happening with the project but you have
2:15
a detailed YouTube channel going there which I
2:17
think is nice. But maybe we
2:19
can start with this. How did you get involved
2:21
with Ibis to begin with? Yeah, so
2:23
in 2016 I was working at,
2:26
well it's now called Meta, Facebook
2:28
then. And I was
2:31
in data engineering. The job there is,
2:33
that job there anyway is writing a
2:35
lot of SQL code. And
2:38
Facebook has a dizzying array
2:40
of infrastructure. Data engineering deals mostly with
2:42
Hive, or at least at the time
2:44
it was mostly Hive. Presto was like
2:46
sort of the new kid on the
2:48
block and it was getting
2:50
a lot of internal like sort of hype
2:53
and use and whatever. Hive was
2:55
like super hard to use for
2:57
building a pipeline
2:59
because when you're working with like
3:02
a data engineering pipeline, you often are iterating,
3:05
right? You don't know necessarily exactly what your
3:07
code is going to look like right
3:09
away. So you need something that's going to give
3:11
you somewhat reasonable feedback,
3:14
a somewhat reasonable feedback loop. Like you're
3:16
not going to be waiting like 30
3:18
seconds to run a count-star query or
3:20
something like that. Okay, yeah. So just
3:22
to kind of break it down even
3:24
a little bit more there, like when
3:26
you talk about pipelines, I'm guessing
3:28
there's a variety. They could be the ingestion
3:31
of data pipeline, but there also
3:33
could be like just the transformation
3:35
layer sort of stuff. Yeah,
3:38
I mean I can give you kind of a whirlwind
3:40
tour of how this whole system works. Basically
3:43
all of Facebook's apps like sort of emit data
3:45
at some probably alarming
3:47
rate and it's
3:50
going into a message bus,
3:52
which is essentially like a giant queue, right?
3:54
It's just like a bunch of like in-memory
3:56
things. The apps are all kind of forwarding
3:58
the data through this pipeline. And
4:00
then that pipe splits off into like
4:03
a bunch of different things. So
4:06
you can sort of hook into that pipe
4:08
with PHP and do like arbitrary programmatic
4:11
transformations. You can run
4:13
like streaming SQL on that. Or
4:16
you can just kind of write it
4:18
directly to Hive. In that
4:20
case, it would go
4:22
into essentially a file system using
4:25
Facebook's file format called DWRF, which
4:27
is a derivative of another file
4:29
format called ORC. And
4:32
then our job. I love
4:34
the name, sorry. Yeah, they're funny names. And
4:40
so once they hit
4:42
disk, data engineers could start
4:45
building transformations. And
4:47
then those transformations would of course be like written
4:50
to a disk somewhere else and to another
4:52
table. And there's this just
4:54
gigantic, you know, directed graph of like
4:57
each transformation and, you know,
5:00
being run daily effectively. Okay.
5:03
And that's sort of how the whole thing works. So
5:06
basically what we did was
5:08
write a lot of very complicated
5:10
select statements. Okay, yeah. We
5:12
were almost never writing like insert or
5:14
create table that was automatically done by
5:17
other processes. Okay, you needed
5:19
to pull information out
5:21
in some categorical way, like,
5:23
you know, like narrow it in some way. Yeah,
5:26
yeah. Just sort of whatever we were, whatever
5:29
sort of things we wanted to know about the product.
5:32
I worked on the search, the Facebook search
5:34
product. Okay. That's sort
5:37
of, yeah, we would, I don't know, we did a bunch
5:39
of different stuff, a lot of counting stuff, a lot of
5:41
summing stuff, not a lot of like super fancy math or
5:43
anything like that. But a lot of sort
5:45
of how are people moving throughout the app, that kind of
5:47
thing. Yeah, okay. Yeah. So,
5:50
sorry, let's see, the original question, how I got involved
5:52
in Ibis. Yeah, yeah. Well, you're doing
5:55
all this like really intense stuff with SQL and
5:57
having lots of these statements and then having.
6:00
and not wanting
6:02
to necessarily run it across everything. Maybe
6:04
you want like an early return of
6:06
like, what is this gonna even look like? Is
6:09
that kind of where you were headed? Yeah, totally.
6:11
So you could access the data
6:13
from either Hive or Presto. And the thing with
6:15
Hive is that Hive is built on this like
6:17
idea, like it's originally built on Hadoop and
6:20
it was a way of turning SQL
6:22
statements into MapReduce jobs. And MapReduce
6:24
is like a technology that is
6:27
designed to survive the apocalypse, right?
6:29
Nothing will take it down. Okay.
6:34
And that was largely the case.
6:36
And the trade-off is that while
6:38
the apocalypse, you know, may not
6:40
end Hive, you may not
6:42
be able to get an answer to your immediate
6:44
question in any sort of like
6:46
interactive amount of time. Okay. Presto
6:49
was designed to like
6:52
minimize that, the trade-off
6:54
there. Okay. And being
6:56
able to scale to sort of Facebook scale
6:58
as well as like give you back interactive
7:00
queries where possible, or give you back, give
7:03
you interactive speed where possible. And
7:07
the dialects were not the same, right? So
7:09
Hive has its thing, it's like
7:11
its own SQL dialect and then Presto has
7:13
its own SQL dialect. And so I wanted
7:15
a way to like write something, some code,
7:17
Python preferably, that
7:19
I could write it sort of once. And then I could
7:21
like, when I wanted to go to production,
7:24
I could just say, hey, like give me the Hive
7:26
SQL for that. And then when I'm like interactively, you
7:28
know, when I'm like iterating on an analysis, like run
7:30
it against Presto. Okay. And
7:32
so I started looking around for that. And then I
7:35
saw, I saw what Wes was doing with Ibis. And I
7:37
was like, this looks like the thing that I kind of
7:39
want. All right. So that's sort
7:41
of how I got involved. Okay, so you saw
7:43
it being demonstrated by Wes in
7:46
some capacity? Yeah, I think so.
7:48
You know, it's been almost 10
7:51
years. And so I don't remember
7:53
exactly how like, like the exact causal chain
7:55
of how I got there, but yeah,
7:58
I think I saw that. He had announced it maybe
8:00
on his blog and then I was like, oh,
8:03
this is cool. It seems like exactly
8:05
what I need or what I want anyway.
8:08
Yeah, yeah, that sounds cool. Yeah. So then you kind
8:10
of jumped in and I always
8:12
wonder about this sort of, uh, process
8:15
of getting involved in a project.
8:18
And I've had a
8:20
few people on talking about open source and avenues
8:24
in for, you know, a
8:26
lot of my audience is going to be beginner, intermediate,
8:28
and then, I don't know, I wonder
8:30
how many advanced people I have on the show, you
8:33
know, since we're kind of a learning
8:35
website of Python, but it definitely
8:37
varies. But I think a lot of them wonder, you
8:39
know, like, how do I get involved in a project
8:41
like that? And so I wonder, well,
8:43
what was your experience as far as like, you
8:46
thought this is an interesting tool. Did
8:48
you then say, Hey, I'd like to become
8:50
more involved and contribute, or what was the
8:52
process there? By that time
8:54
I had already been actively contributing to
8:56
a couple of open source projects. So
8:59
I guess I
9:01
can, I can convey a story from when
9:04
I got involved in open source, like the first
9:06
time, if that might help. Sure. Yeah. That's always,
9:08
I love that stuff. Cause I think it's interesting for
9:10
people to like, you know, give them
9:12
a encouragement, but also like
9:14
maybe warn them if there's a potential, you
9:16
know, things that they need to be aware
9:18
of getting involved in this world.
9:21
Yeah. Totally. So I got
9:23
the first, the first like
9:25
major open source project that I contributed to
9:27
was pandas, but it was, it
9:29
was around, like, the 0.13 release
9:33
or something. I mean, it was years ago.
9:37
And I was
9:40
in grad school. I studied neuroscience in
9:42
grad school, computational neuroscience. And
9:44
so I needed some,
9:46
or I wanted pandas to
9:48
do something a specific way. The thing
9:50
I was interested in was cross correlation.
9:53
Okay. And, and like, I think
9:55
pandas at the time was doing this sort
9:57
of like, there's a naive way to do
9:59
it that's, like, very, very slow, and then
10:01
there's a way using like fast Fourier transforms
10:03
to do it much faster
10:06
And so I was like, cool, I want
10:08
this in pandas. I
10:10
want this to be the cross correlation
10:13
algorithm. And so
10:15
what I did was open a GitHub
10:17
issue and paste the code that I
10:19
had written to do this with pandas
10:21
into the issue. And I was
10:24
like, here's the code, here's
10:26
how you do it. Accept my... I mean, I
10:28
wasn't demanding they accept my contribution, but I
10:30
sort of, you know, went about it
10:32
like, well, I don't know
10:34
how to use this thing called GitHub, like how
10:36
do I do this, right? What do I do? It's...
10:38
I was just like, I'm gonna put the information
10:40
out there and, you know,
10:42
hopefully somebody is either gonna
10:45
say, you know, this sucks, do
10:47
it this way, or, you know,
10:49
you're doing it wrong, here's how to do it. And
10:51
so the community was largely, like,
10:54
the community bit there was very helpful,
10:56
and they're like, hey, you know, that's good, but this
10:58
isn't the way to do this, like,
11:01
pull requests, etc. And so
11:03
I guess
11:05
I would just say, like, if
11:07
you have an idea or you want to contribute,
11:10
you know, open a GitHub issue and
11:13
put the information there, and
11:15
if the project is gonna
11:18
be worth contributing to, they'll help you out. Okay.
11:20
They'll say, hey, this
11:22
is the path. Yeah, right. Like,
11:24
have you seen our contributing docs, etc., that sort
11:26
of thing. Nice. Because I think a
11:29
lot of people, a lot of us, have been
11:32
down a similar path where we didn't really know
11:34
what we were doing, and then, yeah,
11:36
sure, you know, then we did, once somebody
11:39
was like, you know, here's how to do it.
11:41
But it's a whole other thing, a whole
11:43
other organization that's
11:46
got its own... like you said, this one happened
11:48
to be a pretty friendly community and so forth, and
11:51
you never know what's behind that. So
11:53
that's kind of a fun
11:55
way of getting in. So was it something
11:57
similar with Ibis then? So
11:59
with Ibis, I looked through GitHub
12:01
issues that Wes had created and
12:03
I picked one that I thought
12:07
I understood what needed to be done. I don't
12:10
remember if I asked any
12:12
clarifying questions, but if there's any ambiguity,
12:14
it's always a good idea
12:16
to ask. And then worked on it. I
12:19
think the issue was the Postgres backend.
12:22
I guess we'll get into what a
12:24
backend is later. Yeah, there's
12:26
lots of backends to talk about. Yeah.
12:30
And so I contributed the
12:32
Postgres backend. I think it's my first major
12:34
PR. I don't remember if
12:36
I did anything smaller before
12:38
that. That
12:41
was the first major contribution that I submitted. So
12:45
then that eventually moved into you at
12:48
this point, you're involved directly
12:50
with Voltron Data, right? Yeah.
12:52
So Wes and I overlapped at Two Sigma,
12:55
and they were big supporters of Ibis.
12:57
So we worked on some Ibis-related
12:59
stuff there. I spent
13:02
a good amount of time working
13:04
on Ibis, like the open source
13:06
project, in addition to doing whatever
13:08
Two Sigma-specific things we were doing there. Okay.
13:11
Yeah. And then I
13:13
dropped out of the world
13:15
of Python analytics tools and
13:18
went to work on just
13:20
something totally different, like Rust,
13:22
semi real-time,
13:26
video machine learning things.
13:28
And I was like Rust infrastructure.
13:30
And then I came back
13:33
up for air. You're
13:38
down in the lower depths there. Yeah,
13:40
that was an interesting time, but
13:42
perhaps that's maybe for another conversation.
13:45
Yeah, we hinted a little bit at
13:47
what Ibis is. And of course, I
13:49
talked to Wes about it with a
13:52
little detail. We didn't have a
13:54
ton of time because we were talking about lots of
13:56
different stuff. But maybe we could just start
13:58
with like, you. We're
14:00
interested in it because of what it could
14:03
do for you in reusing
14:06
your Python code with
14:08
these SQL statements and not having to
14:11
rewrite these things that
14:13
you've created. I worked at a
14:15
bank for a while and that was a big job
14:17
I did. It
14:19
was a mortgage company and they were sunsetting
14:22
a platform and then basically starting a
14:24
new platform. They
14:26
had all these reports and all these things
14:29
that they still wanted to generate in basically the
14:31
same kind of style if
14:33
they needed to from this old
14:35
data. They
14:38
just gave me this job of
14:40
like, all right, rebuild all
14:42
these reports. I'm like, do you have a schema
14:44
for the database? No. Oh boy.
14:47
It's just a raw table. Okay. Do
14:49
you know what the relationships are or whatever? Not
14:52
really. Can you give me some
14:54
existing reports that came from this thing? I
14:58
just reverse engineered it all. It was my
15:00
first job learning SQL and working in the
15:03
industry. I was like heads down,
15:05
learning all this sort of stuff. I understand the
15:07
idea of wanting to take ... People
15:11
spend a lot of time building
15:13
queries and they're very, very detailed
15:15
and they're something you may want to be able to reuse.
15:18
It's interesting to me that this
15:21
is maybe kind of this way in
15:23
for people that are interested
15:25
in a tool like this, that
15:28
they are involved in a lot of SQL
15:30
stuff or maybe their business has it. Maybe
15:32
we can just talk about what's fundamentally
15:35
different about what's happening with this DataFrame
15:37
library compared to pandas or polars. Yeah.
15:41
I think there's
15:43
a few fundamental differences. When
15:46
you're working with pandas, whenever you take
15:49
an action like you call a method
15:51
or you add
15:53
two series or DataFrames together,
15:56
it's happening right away. execute
16:00
like pandas itself well really
16:05
numpy but you know for purposes of this question
16:07
we could just say it's pandas.
16:09
Yeah, pandas is going to, you know, allocate
16:12
memory for the output. Let's say we're adding
16:14
two series together, so it's going to
16:16
allocate, you know, memory, and it's
16:19
going to do the addition, you know, element by element, and
16:21
then fill in the allocated memory.
16:23
Let's say you do another
16:25
addition. Well, that's going
16:27
to allocate another output and fill it in,
16:30
and so forth. So every time,
16:32
you're generating this tree
16:34
of allocations. Okay, and each
16:36
one of them is taking up its own separate space, right,
16:38
in a way. Exactly. And then every
16:42
intermediate addition, you're kind of
16:44
wasting memory in a sense, right? Because it's
16:46
just going to be thrown away to get
16:48
that final output. Sure, okay.
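The eager-allocation pattern being described can be sketched in plain Python. This is an illustration of the idea, not pandas internals: each eager step materializes its own intermediate result, while a fused pass allocates only the final output.

```python
# Toy illustration: eager, step-by-step evaluation materializes every
# intermediate result; a "fused" single pass allocates only the output.

def add_eager(a, b):
    # Allocates a brand-new list for this one addition.
    return [x + y for x, y in zip(a, b)]

def add_fused(*columns):
    # One pass over all columns, one output allocation, no intermediates.
    return [sum(values) for values in zip(*columns)]

a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]

# Eager style: (a + b) is materialized, then thrown away after + c.
intermediate = add_eager(a, b)
eager_result = add_eager(intermediate, c)

# Fused style: the whole expression a + b + c in a single allocation.
fused_result = add_fused(a, b, c)

assert eager_result == fused_result == [111, 222, 333]
```

Both styles produce the same answer; the difference is only in how many throwaway allocations happen along the way.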
16:51
And with a system like
16:53
Ibis, and Polars has
16:56
an expression-based API as well, so there's
16:58
some overlap there conceptually.
17:01
Yeah. But with an expression-
17:03
based API, you're describing,
17:06
you're writing that addition expression, you know, A
17:08
plus B plus C, and then
17:11
it's being compiled into something else, and
17:13
in Ibis's case it's often SQL. Okay. And
17:15
then you're handing that off to
17:17
the database engine, which is almost certainly not
17:20
going to... it's not
17:22
going to evaluate that expression in the way I
17:24
just described, because it has more information
17:26
about what it's going to
17:28
do. Pandas, for example, doesn't know
17:31
that you're doing A plus B plus C. All
17:33
it sees is two series and a function call
17:35
to add them together. Okay, it has
17:38
no look-ahead at all. Yeah, it
17:40
can't
17:42
see what the global computation
17:44
is you're trying to express. Okay. Whereas
17:47
a SQL database can, right? It's
17:49
got the whole query available to
17:51
it, so it can parse it and turn it into
17:54
more structured information, like a tree, that it
17:56
can then analyze and say, oh, I know
17:59
I'm doing an addition of A, B, and C. I
18:02
just need to allocate that one output array and
18:04
then call the function on every element of A,
18:06
B, and C at a time. I only do
18:08
that one allocation. So
18:11
that's a big difference. So I
18:13
guess the way you can express it, it's like
18:16
it's doing the entire computation at once. It's
18:18
not evaluating every intermediate
18:20
step. So in some
18:22
ways, I've heard
18:24
the term being used, especially with folders,
18:27
maybe sometimes called lazy evaluation. And I don't
18:29
know if that's the exact same terminology we're
18:32
thinking of here. Yeah.
18:34
So there's some
18:37
specific technical details around the
18:39
difference between lazy
18:41
and deferred. Lazy
18:44
tends to kind of... There's a specific...
18:47
It comes from the world of functional
18:49
programming, where in a
18:52
lazily evaluated language, using the sort
18:54
of technical definition, the only
18:56
things that are ever evaluated
18:58
are the things that get used. And
19:01
it's sort of by construction of
19:04
whatever interpreter or programming language you're using
19:06
that things will be lazily evaluated. Nowadays
19:09
people use the word lazy to mean a
19:11
slightly different thing,
19:13
but overlapping concept. Yeah, I was thinking
19:16
that. Where you're not
19:18
evaluating things when you write them
19:20
necessarily. It's
19:23
such an interesting approach, the idea that
19:25
you're taking this set of instructions and
19:27
looking at them as a whole versus
19:30
just a recipe list: do this, do this, do this,
19:32
do this, do this. And I can see how
19:35
that can create a lot of efficiency within
19:37
a system like that. It can say, okay, well, we don't
19:40
need to grab everything. We can grab just
19:43
what we need for this particular operation or query
19:45
or what have you. Yeah,
19:48
and then there's just decades
19:50
of research poured into SQL
19:52
databases in particular and newer
19:55
systems like DuckDB are sort of extending
19:58
that tradition into the... Yeah. to
20:00
the analytics world and bringing a
20:02
lot of cutting edge research
20:05
and deep expertise in designing
20:07
these systems. This
20:13
episode is sponsored by Mailtrap, an
20:15
email delivery platform that developers love.
20:18
Mailtrap is an email
20:20
sending solution with industry-best analytics,
20:23
SMTP, and email
20:25
APIs for major programming
20:27
languages. And it
20:30
includes 24-7 human support. Try
20:33
it out for free at mailtrap.io. That's
20:38
m-a-i-l-t-r-a-p.io. I
20:44
think that kind of leads us a little bit into this idea of
20:46
construction-wise, like how this
20:49
library kind of a little
20:51
bit thinks differently than, and as we already
20:53
mentioned, the functionality difference. But one
20:57
of the things I found fascinating, like, alright, let
20:59
me just play with this thing, is
21:01
it's sort of like, okay, well, what backend do
21:03
you want? And I was like, oh, okay, well, that's
21:06
not a choice that I had to really think about
21:08
so much right away. And I
21:10
think that fundamental difference
21:14
is interesting. What are you doing when
21:16
you're choosing a backend? The default is
21:19
typically DuckDB, I think. I
21:22
think probably for performance reasons, that's why you kind of
21:24
favor it in some ways. But
21:26
maybe you can talk about that a little bit.
21:28
Like, what are you doing as you're choosing this
21:30
backend database type tool? Yeah.
21:34
So when you're choosing
21:36
a backend, you're opting into
21:38
some assumptions about more or
21:41
less the, I guess I would say the maximum
21:43
scale at which you can operate. So
21:47
if you're like, I'm opting
21:49
into DuckDB, let's say.
21:53
And yes, we do sort of implicitly
21:55
opt people into DuckDB. That's
21:57
because it's kind of the easiest one
22:03
of our backends to get started with.
22:05
It has, generally, almost all
22:08
of the functionality that Ibis supports. Okay.
22:08
It tends to be low memory. It's,
22:10
you know, parallel, etc. It's got all
22:12
these like goodies. Yeah, yeah,
22:14
yeah. We've talked about that
22:16
quite a bit. Yeah, exactly. Yeah. So,
22:19
and when you're opting into DuckDB, you're
22:21
saying, okay, DuckDB has a
22:23
maximum scale that it can operate at, which
22:25
because it's, and it's by design,
22:27
right? It's not like saying, oh, you know, everybody
22:30
needs whatever petabyte scale. Most
22:32
people don't. Right. Then
22:34
DuckDB, it's basically like, I
22:37
have data that at most fits
22:39
on my hard drive, right? And if I need
22:41
to go any bigger than that, you might want
22:43
to choose a different backend. But if you, but
22:45
most people don't have a terabyte of
22:48
data that they need to analyze. Maybe, maybe now
22:50
that's less true than it used to be. Well,
22:53
yeah, it depends. Like, I feel like that's something that comes
22:55
up on the show is that I'm
22:58
often talking to, again, you know, these kind
23:00
of beginner, intermediate, or people getting going and
23:02
are interested in trying stuff out. And
23:05
you're right, they don't have a petabyte. They don't
23:07
hardly probably even have a terabyte of data. And
23:10
so they want to just experiment and try things out. But
23:12
it's like a lot of
23:15
talk and conversations are about
23:17
these like huge scale things.
23:19
And it's like, well, I want to like
23:22
introduce people to the idea of it. And
23:24
I feel like the scaling part can come
23:26
later, you know, and also it's also expensive
23:28
to even play in that realm, you know?
23:30
Exactly. Yeah. Yeah. So
23:33
I guess like one of the things
23:35
that we strive for with Ibis
23:37
is to make that transition as seamless
23:39
as possible. So there's a
23:41
lot of setup for a lot of
23:43
these bigger systems like Snowflake and Spark
23:46
and BigQuery and so forth. And,
23:48
you know, assuming
23:51
you have sort of the same data
23:53
in each system, like the Ibis code
23:55
shouldn't change very much. Okay. Maybe you'll
23:58
have to connect to something differently. But
24:01
once you have that, that same code
24:03
that you wrote to do all your analysis can
24:05
kind of run on both. And so,
24:07
you know, with Ibis, we
24:10
don't really like to talk about Ibis itself scaling,
24:12
because it's not the
24:14
size of the data is not like a scaling
24:16
factor that's super relevant for Ibis. We're actually just
24:18
like, hey, people have built these amazing systems. We're
24:20
just going to hand you the SQL and like
24:23
we know you're going to like crush
24:25
it. Yeah. Yeah. And
24:28
it's kind of like that whole idea of it
24:30
being sort of disconnected. I forget the
24:32
word... decoupled. Yeah. The decoupling of
24:34
everything when having these sort
24:37
of separate systems that that are
24:40
repurposeable or reusable, which
24:42
is great, you know, because that helps you also
24:45
as an engineer, like, as you mentioned, people
24:47
move around from job to job and so
24:49
forth. So like these tools that you're familiar
24:51
with can maybe come with you and
24:54
the techniques that you've developed and so forth. So that's kind of nice
24:56
to have a system that can do that also.
24:58
Yeah, totally. And like one
25:00
of the reasons we might, we sort
25:03
of support a large number of
25:05
backends. In addition, sometimes people just, like,
25:07
asked for it. We're trying to come
25:09
up with like a better, you know, more sort of,
25:12
I guess, transparent rationale for
25:14
implementing, or not
25:16
implementing, support for a specific back
25:18
end. But one
25:21
of the reasons is so that just people who
25:23
are in various like settings that they may or
25:26
may not have control over can use
25:28
the tool, right? Like somebody might be, management
25:31
at some large org might be like, we're using BigQuery
25:33
or we're using whatever, and like we want them
25:35
to be able to use Ibis. And like one
25:37
of the things where Ibis excels is
25:40
like taking your development code into
25:42
production with like minimal code changes. So
25:44
that same person who has
25:46
to use BigQuery for production can
25:49
take a sample of that data, put
25:51
it in DuckDB, or just like download a sample
25:53
as a Parquet file from BigQuery. And
25:56
then on their, you know, on their whatever, their
25:58
laptop, they can sit there and do analysis
26:00
with DuckDB, build that using Ibis,
26:02
right, and then run that same thing
26:04
against the BigQuery backend. All
26:06
those experiments, exactly, are going
26:09
to translate. Yeah, exactly. Yeah, cool.
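The write-once, run-on-either-backend workflow described here can be sketched in plain Python. The standard library's sqlite3 stands in for both DuckDB (local dev) and BigQuery (production); the `top_categories` function, the `sales` table, and the data are made up for illustration, and the real Ibis API differs.

```python
# Sketch of the portability idea: define the analysis once, run it
# against whichever connection you're handed. sqlite3 is a stand-in.
import sqlite3

def top_categories(con):
    # The "analysis" is written once, independent of which backend runs it.
    cur = con.execute(
        "SELECT category, SUM(amount) AS total "
        "FROM sales GROUP BY category ORDER BY total DESC"
    )
    return cur.fetchall()

def make_backend(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (category TEXT, amount INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

# "Dev" holds a small downsampled copy; "prod" holds the full data.
dev = make_backend([("a", 1), ("b", 5)])
prod = make_backend([("a", 100), ("b", 500), ("b", 1)])

assert top_categories(dev) == [("b", 5), ("a", 1)]
assert top_categories(prod) == [("b", 501), ("a", 100)]
```

The point is that the experiment done against the small dev copy translates unchanged to the production backend.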
26:11
Yeah, that seems like the essence
26:13
of data science in some ways. Like,
26:15
you kind of want to
26:18
allow the data scientist a chance
26:20
to be able to play
26:22
with things and mess around with things, and
26:24
the ability of that portability lets them,
26:26
you know, do that in
26:29
a circumstance that's, you know,
26:31
less intense, if you will, and less
26:33
costly. Right, like every time you run,
26:35
you know, a query on Snowflake, or...
26:37
I mean, BigQuery has
26:39
some stuff where you can have a
26:41
fixed, you know, amount of compute and you
26:43
just, you know, pay up front for that.
26:45
But fancier sort of
26:47
pricing models aside, it will be
26:49
costly for you to run your query
26:52
against all of prod. Yeah, you know,
26:54
as opposed to downsampling it, and
26:57
then you can do whatever you want with it as many
26:59
times as you want. Right, yeah, that makes sense. So
27:01
that ends up being a good... we've
27:03
heard from our users that
27:06
they do this. What's interesting
27:08
to me about this idea of adding
27:11
the support, or generally supporting
27:13
lots of these back-
27:15
ends, is that, if
27:19
I'm not calculating this wrong, I
27:21
feel like you
27:23
would normally need a bunch of
27:25
third-party Python libraries to build
27:27
those connections to the databases, and they
27:30
don't then have a
27:32
robust way for
27:34
the data frame library to directly connect
27:36
to it. And so that's kind of why
27:38
you guys are going this extra
27:41
mile of, like, well, we're going to support
27:43
the backend and have our own
27:45
connection to it, as opposed to it
27:47
being an additional component that has
27:49
to be added in. Is that part of the thinking
27:51
there? Yeah, so the way that we...
27:54
typically, for SQL
27:57
backends anyway, most
27:59
of the backends have what's
28:02
called the, I
28:04
forget what the name of it is, the DB
28:06
API, which is, there's a PEP, a
28:08
Python PEP for this. It's
28:10
a set of classes and methods and
28:12
exception types that a library
28:14
needs to implement if it wants to say
28:17
that it's kind of Python DB API compatible.
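The interface being described is Python's DB API (specified in PEP 249). The standard library's sqlite3 module implements it, so its basic shape, connect, cursor, execute, fetch, plus a standard exception hierarchy, can be shown without any third-party driver.

```python
# The DB API (PEP 249) shape shared by these driver libraries:
# connect() -> connection -> cursor() -> execute() / fetchall().
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
con.commit()

cur.execute("SELECT SUM(x) FROM t")
rows = cur.fetchall()  # list of row tuples, per the spec

# The spec also standardizes an exception hierarchy rooted at Error,
# so generic code can handle driver failures uniformly.
try:
    cur.execute("SELECT * FROM missing_table")
    caught = False
except sqlite3.Error:
    caught = True
con.close()

assert rows == [(6,)]
assert caught
```

A vendor driver like the Snowflake or BigQuery client exposes this same surface, which is what lets one tool talk to many databases through it.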
28:20
Okay. And so we use the
28:22
various like vendor libraries or open
28:24
source libraries for these things. You
28:26
know, like snowflake has a thing,
28:28
BigQuery has a thing. Okay.
28:31
You know, there's pyodbc, which we use
28:33
for MS SQL and
28:35
so forth. And yeah, so we
28:37
don't write the
28:39
thing that encodes like, you
28:42
know, whatever the database protocol that sends,
28:44
you know, the query and the data
28:46
using, whatever, the MySQL wire format.
28:49
We don't write that. Okay. We
28:51
use off the shelf open source tools
28:53
to handle like connections and so forth.
28:55
All right. Good. Yeah. What we
28:57
built is the, the sort of the SQL
28:59
kind of compilers that take the data frame
29:01
API and turn it into the
29:04
SQL code. Okay. And some
29:07
of that is then the flavoring, if
29:09
you will, of those different types of
29:12
databases. Well, funny story.
29:14
We used to write, we
29:16
used to have this sort of hybrid chimera
29:19
world where some of our translation was done
29:21
using SQLAlchemy and some of it was
29:23
like handwritten, like we would actually write the
29:25
strings. Okay. In the
29:27
next release, we've kind of gutted all
29:30
of that and unified our compilers
29:32
around another library called SQLGlot, which
29:34
has support for all of the dialects
29:37
that we use. Okay. Like polyglot, that's
29:39
where the name's coming from. Yeah. Yeah.
29:41
So we're taking
29:43
our Ibis expressions and then turning
29:46
them into SQLGlot things. And
29:48
then that turns into the correct SQL dialect.
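As a rough illustration of what an expression-to-SQL compiler does, here is a toy sketch. This is not Ibis's or SQLGlot's actual implementation; the expression classes are invented, and the "dialect difference" shown is just identifier quoting.

```python
# Toy expression tree compiled into two SQL "dialects": one expression,
# many backends. Real compilers handle vastly more than quoting.

class Expr:
    def __add__(self, other):
        return Add(self, other)

class Col(Expr):
    def __init__(self, name):
        self.name = name

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

def compile_expr(node, dialect):
    if isinstance(node, Col):
        # Hypothetical dialect difference: identifier quoting style.
        if dialect == "postgres":
            return f'"{node.name}"'
        return f"`{node.name}`"
    return f"{compile_expr(node.left, dialect)} + {compile_expr(node.right, dialect)}"

expr = Col("a") + Col("b") + Col("c")

assert compile_expr(expr, "postgres") == '"a" + "b" + "c"'
assert compile_expr(expr, "hive") == "`a` + `b` + `c`"
```

The same tree compiles to different strings per dialect, which is the shape of the Ibis-to-SQLGlot-to-dialect pipeline described above.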
29:50
Okay. Cool. Yeah. So this
29:53
is kind of like you, this
29:55
is where you sort of jumped into in the sense
29:57
that you were doing this for, you said
29:59
Postgres, right? That's right. Yeah,
30:02
yeah. You can imagine
30:04
the sort of approaches of, like, well, how
30:06
are we going to handle this stuff. One
30:10
of the approaches that we talked about
30:12
right away was like this idea like
30:14
I have existing sequel commands and I
30:16
wanna use these queries in
30:19
a library that supports that
30:21
methodology along with kind
30:23
of typical data frame stuff. Also,
30:25
like, maybe we can talk about that
30:28
a little bit. What's
30:30
the difference there? Like, what's involved in
30:32
running a normal standard SQL command, and what
30:34
kind of output do you get? Sure.
30:37
So there's generally, like,
30:39
two, I think, ways in
30:42
which people use what
30:44
are lovingly called raw SQL
30:47
statements. Okay. The first one
30:49
is to run commands that don't produce
30:52
output, like creating a table. When
30:54
you create a table, it's just like
30:56
mutating some state somewhere on disk, maybe
30:58
writing the name to a catalog,
31:01
etc. But it
31:03
doesn't produce a thing, right? On the
31:05
client side, you just run the statement, and
31:07
if it runs, it runs, or you
31:10
get an exception. So
31:12
we have a method literally
31:14
called raw_sql, and you give
31:16
it a string and it's gonna give you back
31:18
like, whatever the DBAPI
31:20
would give you back. That's pretty
31:22
bare. It's pretty low level. Pretty bare
31:24
bones. You can manage everything yourself. That's
31:27
there as like an escape hatch, and,
31:29
just as an aside, we
31:32
have a few, we have like a
31:34
few tiers of escape hatch, because people
31:36
wanna do things with different levels
31:38
of abstraction. Sure. So raw_sql
31:40
is the lowest level of abstraction. You're
31:43
like, right, here's a single string, do
31:45
the thing with the driver. Mm-hmm.
31:47
And then, and
31:49
you can run select statements, but you're going to
31:51
be, you're gonna manage it, like pulling back
31:53
the list of rows and all that stuff
31:55
yourself. Okay, so it's popping
31:57
back an object of sorts?
32:00
Yeah, it's going to give you back like
32:02
some kind of capital R result thingy or
32:05
it's sort of it's very
32:07
backend specific because the drivers are
32:09
necessarily returning backend specific objects. Okay.
32:13
The next I guess level up of
32:15
abstraction is like
32:17
you handing the connection a
32:20
select statement. So now like
32:22
we've restricted the level like the
32:24
SQL statements you can run because
32:27
if you give us a select statement, we can actually
32:29
just build an IBIS expression from that. Yeah.
32:32
Okay. All we need are the
32:35
column names and the types and then you've got this
32:37
sort of opaque blob. It's like this is going to
32:39
be the first thing. It's a table, you
32:41
know, and you can run your query that
32:43
way. You get back a
32:46
tape, an IBIS expression, like a table expression, and
32:48
then you can start working with that thing as
32:50
if it were just a regular old IBIS table.
32:54
Just a use case where you're like, I've
32:56
got a huge pile of existing SQL, bunch
32:59
of select statements and I
33:01
want to start like using IBIS, but like all
33:03
the stuff to set up my existing tables and
33:06
so forth exists. I don't want to rewrite that
33:08
in IBIS yet. Maybe you do later,
33:10
but you don't now. So
33:12
that's like the dot SQL method on
33:15
the backend object. We
33:17
have one more SQL escape hatch, which
33:20
is definitely our sort of like fanciest
33:23
escape hatch. Okay.
33:26
And this is a SQL method
33:28
on the table expression itself where
33:30
you can actually run SQL
33:33
against the
33:35
IBIS expression that precedes it. Okay.
33:39
Which is kind of nutty, right? Like you're somehow
33:41
taking this Python code and getting it into
33:43
the database and then you can mix and
33:45
match too. So you can go into SQL
33:47
and out of SQL and back to
33:49
Ibis, et cetera. Okay. And
33:51
that escape hatch is for the use
33:54
case when the IBIS doesn't have an
33:56
API to do what you want, but
33:58
the database does. It's
34:00
something in the database you know you need.
34:03
So you would use that as a patch
34:05
for that use case. So
34:07
then the whole other approach of
34:09
working with it is
34:12
in much more of a
34:15
data frame centric methodology,
34:17
is that right? Yep, yep.
34:19
So things sort of, I mean, they,
34:22
I would say
34:24
they look and feel
34:26
pandas-esque, you know, it's not really. Sure,
34:29
yeah. There's a bunch of stuff that we like
34:31
don't implement from pandas and there's a bunch of
34:33
places where the APIs differ and so forth, but
34:35
it's got the flavor of like calling
34:37
methods on a table object. Yeah.
34:41
So, you know, group by
34:43
join. Ibis was very
34:45
inspired by an R library called dplyr.
34:47
And so we take a lot of
34:49
the sort of the words and verbs
34:52
and nouns from dplyr like mutate and
34:54
select. So that's
34:56
quite a divergence, I think, from
34:58
pandas. I'm a fan too, because
35:01
that's my other weird like jaunt into like programming
35:03
that I kind of got into late in life
35:05
is I worked in a marketing job and they
35:07
were like a dual-language house and they
35:10
hired me on to be like a
35:12
Python like automation person. And
35:15
they had a bunch of R stuff running
35:17
too. And I was like, I'll learn it.
35:20
Sure, yeah. And so I loved the
35:22
whole concept of the tidyverse. I love the
35:24
concept of dplyr and I was able to
35:26
start writing the
35:28
sort of connected statement sort of stuff
35:31
piping that that made
35:33
sense in my mind so clearly,
35:36
especially the stuff I was working with.
35:39
And so that's kind of one of those things I think
35:41
is very interesting that you guys have almost like, where
35:44
are you coming from? Welcome to Ibis.
35:46
Right, right, right. Well, that's sort
35:49
of that's exactly what we're going for. We're definitely
35:51
going for like that. That's sort of like the
35:53
piping kind of experience that dplyr
35:55
has where I mean, you know, like
35:57
R has the sort of native pipe
35:59
operator now, but before they used to have
36:01
just the like percent, you know, angle
36:04
percent thing. Yeah, it's like a greater than
36:06
sign or whatever. Right, right. And
36:08
so in Python, we already, we
36:10
have the dot operator, right? And so instead
36:13
of piping, like we have dot and so
36:15
we're definitely going for that like, you know,
36:17
fluent design API where you
36:19
can just chain stuff and then you build up these
36:21
big chains and it gets all sort of compiled into
36:23
SQL, very heavily inspired
36:26
by dplyr. We have
36:28
like pivot_wider and pivot_longer; like, we
36:31
have a feature called selectors, which is 100%
36:34
like stolen, like, you know, not
36:37
stolen, I mean, it was, anyway,
36:39
very heavily inspired. Like I
36:42
implemented that. And when I implemented that,
36:44
I actually ported the test suite from
36:46
the selectors test into
36:48
Python. So I could be like,
36:50
this does this behaves like the
36:53
exact same way in Python. Cool.
36:56
Yeah, yeah. I and I was a big fan of
36:58
the mutate. I
37:00
just like, it was like such
37:02
a pain in Python to do that, at least
37:04
at the time when I was playing with it. And so
37:07
that was one of those things where like, it
37:09
just seemed like a lot of overhead to do something
37:11
where I'm just and I was working with a lot
37:13
of text, which again,
37:15
pandas talking to Wes about it,
37:18
definitely came from like finance,
37:20
if you will. Right, right, right. And then, you
37:22
know, somewhat, you know, numbers and, you know,
37:25
kind of dealing with that stuff, and pandas
37:27
tied to the back end of NumPy. And so like,
37:29
text was always kind of like, yeah, you can do
37:31
it. So
37:33
I kind of appreciate that. And it's definitely
37:35
gotten better and better. But it's definitely something
37:37
that I see right away. And
37:40
I guess it's nice. Yeah, we try to, I
37:43
guess, one of the different main
37:45
differences between like the database world
37:47
and like NumPy comes from like
37:49
numerical and scientific computing, which, right,
37:52
maybe nowadays is dealing with a lot more strings. But you
37:54
know, back in the day, strings
37:56
were kind of an afterthought. Right, right. And like,
37:58
it's coming out of a tradition of
38:00
tools like Matlab where they're
38:03
very heavily focused on matrix
38:05
math. Everything's an
38:07
array, et cetera. Optimized for that. Yeah, exactly.
38:10
Yeah. And so, but in the database world,
38:12
like strings have been a thing from day
38:14
one because, you know, you work for a
38:16
bank or you work for a law firm
38:18
and look for these things like we're dealing
38:20
with lots of texts and
38:22
descriptions of things. And yeah. Yeah. And so
38:25
anyway, yeah. We try to do right by the
38:27
string. That's
38:30
great. One of the things that's
38:32
interesting about this whole process is that, and I don't
38:35
know where I saw the statement, but I know it's somewhere in
38:37
the, either having talked
38:39
about it or kind of,
38:41
you know, discussing it is this idea of
38:43
getting close to the data as possible. And
38:45
I feel like, is that something that by
38:49
kind of recreating these functions and so
38:53
forth, like this, this functionality of like, you can
38:55
write these statements, chain them all
38:57
together, and then it's going to again, rewrite it
38:59
and at least process it in a
39:01
way that it's now like a SQL statement. Is that part
39:03
of that? Like this idea of like, I want to be
39:05
able to get in and work with data and anybody
39:08
who's worked with SQL for a long time, like having
39:11
to have an abstraction layer is, it's
39:13
always kind of hard as a transition. And
39:15
I feel like that's something that,
39:18
you know, you're obviously, we talked about three different
39:20
methodologies of ways that people can approach it, but
39:22
is that part of like what you mean by
39:24
like getting close to the data as possible or
39:27
what exactly do you mean by that? Yeah. So
39:29
getting close to the data is really about making
39:32
sure that you're computing
39:35
in the most efficient way. Okay.
39:38
So I think traditionally or at
39:40
least like I've definitely done this in the
39:42
past where I just
39:45
ran like pandas.read_sql, I
39:48
gave like, I gave it a select star
39:50
and then you're, you're like
39:52
pulling however many whatever bytes
39:54
back to your local machine. Right.
40:00
you're doing a computation with pandas for better
40:02
or worse. When we talk about like
40:05
being close to the data, we're talking
40:07
about like the computation occurring on the
40:09
engine sort of that knows how to
40:12
do that best. And optimize it
40:14
already. Exactly. So let's
40:16
take Snowflake, for example. Snowflake
40:20
is the one that knows how to operate on
40:22
tables and snowflake the best, right? So
40:24
okay, pulling a table back from
40:26
snowflake and then doing your computation and pandas
40:28
if it can be expressed in SQL is
40:31
pretty inefficient, right? You're gonna pay egress costs
40:34
from and yeah so and
40:37
you know if data it's like new data
40:39
arrives like now you're gonna have to pull
40:41
that back again and anyway it's just it's
40:43
sort of it becomes both
40:45
prohibitive in time, space and
40:47
dollars. Yeah it's interesting.
40:49
I feel like it's a related conversation to
40:51
you know what's
40:53
happening with with Arrow and the
40:56
idea of like let's not have
40:59
to go through a translation layer each
41:01
time to look at this information if we can kind
41:03
of all agree and and that's
41:05
definitely part of this platform also, right?
41:07
Yep totally and the idea
41:10
like one of the things that IBIS it
41:13
makes it possible to do this because we're
41:15
just saying hey database like
41:17
here's the query like take care of it
41:19
just give me
41:21
the results. Okay. So we don't
41:23
have to we don't have to pull anything back
41:26
we don't need to bring anything into
41:28
memory until it's like the final result
41:30
that you asked for and even then
41:32
like you actually
41:34
have to opt explicitly into doing that
41:36
by calling a method like it's like
41:38
let's say somebody
41:41
wants to pull back, you know, a billion rows.
41:44
it's possible with IBIS you
41:46
have to kind of like opt into it you have to
41:48
call a method that says like hey give me back all
41:50
the data. This
41:56
week I want to shine a spotlight on
41:58
another real Python video course. It
42:01
covers how to create interactive geographic
42:03
visualizations that you can share as
42:05
a website. The course
42:07
is based on a real Python tutorial by
42:10
previous guest, Martin Breuss. It's
42:13
titled Creating Web Maps from Your
42:15
Data with Python Folium, and
42:17
it's presented by video instructor Kimberly
42:20
Fessel. And she shows you how
42:22
to create an interactive map using
42:24
Folium and save it as an HTML
42:26
file, how to choose from
42:28
different web map tiles, how
42:31
to anchor your map to a specific
42:33
geolocation, and bind data to
42:35
a GeoJSON layer to create
42:37
a choropleth map, and then
42:39
how to style that choropleth map. She
42:42
also shows you how to add points of interest
42:44
and other features. Learning how
42:46
to build interactive visualizations is a worthy
42:48
investment of your time, and sharing
42:50
standalone web pages is a great way
42:53
to get your users to understand and
42:55
dig into the data. And
42:57
like most of the video courses on
42:59
real Python, this course is broken into
43:02
easily consumable sections. Each
43:04
lesson has a transcript, including closed captions.
43:07
And you'll have access to code samples
43:09
for the techniques shown, in this case,
43:11
a complete interactive Jupyter notebook. Check
43:14
out the video course. You can find a link in the show
43:16
notes, or you can find it
43:18
using the search tool on realpython.com. So
43:25
you've kind of dug pretty deep into the
43:27
functionality and kind of the background of maybe
43:29
where people are coming from in different
43:32
libraries and so forth. And
43:34
it's always hard in an
43:36
audio podcast to explain a lot
43:38
of this stuff. One of the things I think is
43:41
interesting is you've created this YouTube series,
43:43
which I don't know if it's IBIS
43:45
specific, but your series is what,
43:47
Philip in the Cloud, right? Philip in the Cloud. I
43:49
love the name. Because my last name is Cloud. Yeah,
43:53
exactly. Quite
43:56
apropos for somebody who works
43:58
in data these days. Yeah,
44:00
so what are the types of things that you
44:02
cover in the your YouTube channel? Definitely
44:06
all Ibis right now. Okay, let's
44:09
see So we've covered we've
44:11
covered like some integrations with
44:13
other tools We've
44:16
covered various Ibis features
44:19
I've done a couple of like live
44:22
like early early on when I when I
44:24
started it I've done I did
44:26
a couple like sort of live debugging
44:28
sessions or like I Was
44:31
like I'll demo this feature and then it's like oh
44:33
it didn't work in this way for this reason So
44:35
I would like sit there and try and figure out
44:37
what was happening. Okay, that's always
44:39
an interesting one. Yes,
44:41
that's been fun and
44:44
then you know newer newer features
44:47
Yeah, it's sort of like a
44:49
grab bag of Ibis, you know,
44:51
functionality, new stuff.
44:54
Okay. Yeah mix of stuff. Cool. One
44:57
of the things I think about especially with our
44:59
that I thought was interesting is
45:01
that it came with, you know, at
45:03
least some of the basic tools had like
45:05
example data in it. And I feel like
45:08
this definitely is in the same boat there, in
45:10
my thinking of that, right? That you have some
45:12
stuff that people can kind of play around with, just
45:15
the library with a few built-in
45:17
sort of data points. Or
45:19
do you have to download those separately? Ish.
45:24
Like, it's sort of a mix of yes to
45:26
all those answers. Okay. All
45:28
right. To all those questions. We'll
45:31
provide links to a guide We
45:34
have we have like on our landing page
45:36
Ivis project org a way that
45:39
you can can get started like right away with examples
45:41
And it's got like rubble
45:43
like you stuff that if you follow it the
45:45
sort of one-line install You
45:47
should be able to copy paste and run that
45:49
code. We Forget exactly when
45:52
we add this but a while ago we
45:54
added like an Ivis dot examples module Okay,
45:57
and like we again borrowed shamelessly
46:00
from R, and literally we
46:02
like have an R script that like pulls
46:05
the data out from like a few
46:07
packages and like puts it into like
46:10
a bucket a cloud bucket okay
46:13
and so when you call
46:15
like ibis.examples.penguins.fetch it's gonna pull
46:17
down that example from the
46:19
cloud bucket and give
46:21
you back an ibis expression. Okay interesting
46:24
so it's kind of a little convoluted
46:26
but it's doing the work for you as long
46:28
as you have the internet connection. Yep you need
46:30
the internet connection and that's
46:32
only because we didn't want to ship data
46:35
in our package. Yeah no no it's
46:37
gonna be bigger. Yeah we have a
46:40
couple bigger datasets up there
46:42
as well, like a subset of the IMDB data.
46:44
Yeah yeah that's the one I see that's
46:46
interesting. And then some of those are in
46:49
parquet I believe because they're just so
46:51
much smaller than if they
46:53
were in like TSV or whatever. But
46:56
yeah you can get started with those we've got
46:58
a variety of different data sets we've
47:00
got sort of the the R classics
47:02
like mtcars and Palmer penguins. Yeah.
47:04
Then we've got some more we've got
47:07
some like World of Warcraft data up
47:09
there as well. Okay. Like gaming data
47:11
there's a bunch. Yeah it's
47:13
nice it's always fun to kind of get
47:15
to playing with things that have
47:18
weight to them that you can kind of actually play around
47:20
with it's not like randomly generate
47:22
a bunch of numbers for me which I've
47:24
seen a lot of demonstration stuff and it's
47:26
like all right my eyes are glazing over
47:28
sorry. Yeah we want people to be able
47:30
to interact with like a real data set
47:32
in like with as little
47:34
initial friction as possible right. So we're not
47:37
gonna hand we're not gonna be like oh
47:39
download this like example 3 terabyte data sets
47:41
like, okay, you know, it's like
47:43
10% of the people who would use it
47:45
is gonna be able to like store that on disk. So
47:47
we're like we
47:49
like to use the Palmer penguins, so you know, shout out
47:51
to the the authors of that
47:53
paper who have generously provided this
47:56
data. It's like it's like a small
47:58
data set but it's interesting. Yeah, yeah.
48:00
And then it's got like, you know, it's got- Lots of
48:03
interesting fields. Exactly. So yeah, and there's, there's
48:05
just a, it's a
48:07
rich enough dataset that we can say,
48:09
we can demo a lot of features
48:11
of IBIS. Right. Using
48:14
that. And then, you know, when
48:16
you want to get into some fancier stuff,
48:18
like with arrays and structs, like maybe you
48:20
switch over to IMDB dataset, cause you know,
48:22
they've got sort of, they've
48:25
got some stuff where you can, yeah, process
48:27
a field into an array and start, you
48:29
know, messing around with like unnest and
48:31
other kind of more advanced features
48:34
of IBIS. That's definitely
48:36
a database that would have the many to
48:38
many relationship kind of stuff happening. Oh
48:41
yeah, yeah, no, it's the, and the
48:43
way they encode the relationships is sort
48:45
of interesting because everything's got a key,
48:48
but then some of the, some of the
48:50
things that are, there's some
48:52
pre-joins that happen. Okay. I
48:55
don't, I mean, I don't know exactly how
48:58
that data's generated, whatever. Some
49:00
engineer at IMDB doing it. Yeah,
49:03
I really had to think about it, yeah, totally.
49:05
Yeah, there's definitely some fields where like, I
49:08
forget, I think it's like roles, the
49:11
roles that a particular person took
49:13
on in, Right. in a
49:15
given movie, like there's, you know, that can be like
49:17
sort of turned into an array and
49:20
you can imagine that, yeah. Yeah,
49:22
it's kind of funny cause like, yeah, that person could
49:24
be, you know, have multiple roles
49:26
in a particular movie, you know, or it
49:28
could be played by a different person. Yeah,
49:30
it's like, oh, there's lots of interesting things,
49:33
like, they're at different ages. There's a lot
49:35
of weird stuff to think about,
49:37
like laying out a database like that. So it's just
49:39
a fun one to look at to like say, oh,
49:42
I don't know if I'd model it,
49:44
like exactly like that, but. Yes, yes. And
49:47
it's also full of, I guess what I
49:49
would consider like junk, but interesting junk because,
49:51
Okay. People's birth dates are like,
49:53
you know, year 40 or something like that. Like that
49:55
is, it's sort of like stuff that doesn't
49:57
really make a whole lot of sense. Okay.
50:01
But it's nonetheless interesting to poke around and see
50:03
if you can kind of figure
50:05
out what went wrong there or guess,
50:07
you know, it's like data detective kind
50:09
of thing. Yeah, yeah, exactly. Yeah. That
50:12
sounds like there's some pretty good resources there.
50:14
You mentioned the landing page
50:16
for that. Are there, along
50:19
with the YouTube series where you're kind of doing
50:21
live demonstrations of working with the library and
50:24
working with data and trying things out, interacting
50:27
with people in that. I've
50:29
seen you've had a few guests also. What
50:31
else would you suggest for somebody who's interested
50:33
in checking out the library? Like what are
50:35
other resources for them? Let's see.
50:38
I would say, I mean, the best resource,
50:40
and we've put a
50:42
lot of hours into this, is our
50:44
website, which is also our documentation. Yeah,
50:47
the API stuff on there is great. Yeah.
50:49
Yeah, I would also suggest, like we've also put
50:52
a good amount of effort into getting like a
50:54
GitHub, like a working
50:56
GitHub code space set up so that somebody
50:59
can say, create a code
51:01
space. That'll just put you into
51:03
a VS code, a browser,
51:05
like VS code that has all the
51:07
dependencies installed and you can start running
51:09
Ibis examples right away directly from
51:11
the shell. Like you just
51:13
fire up Python, copy paste the code
51:15
from the website and you're off to the races. Nice.
51:18
Yeah, maybe we can share some links at the end then. Yep.
51:22
Yeah, which I'll definitely include. We're also
51:24
looking to, no promises, but we're potentially
51:26
looking at like, being able to give
51:28
like a, you know, an in browser,
51:31
like interactive Ibis shell. So somebody wouldn't even
51:33
have to fire up a code space or
51:35
install anything. They could just like run
51:38
our examples or some of
51:40
them, like in their browser. Using
51:42
something like WASI or... Yeah, Pyodide,
51:45
which is like the in browser
51:47
Python interpreter. It's
51:51
frankly magic, but
51:54
it's awesome. Yeah, we're living in
51:57
interesting times. Yeah. Yeah. Yeah,
51:59
I'm interested in that for a lot
52:01
of reasons. I've had Brett Cannon on
52:03
the show a while back to talk about it, and
52:05
he's been very involved in trying
52:07
to make it a supported target
52:10
tier for Python. And I mean,
52:13
I keep kind of watching the
52:15
space and seeing what's gonna
52:17
happen next. His updates, it'll be like
52:20
maybe a quarter of a window of, like,
52:23
text, but it's all links. He's
52:25
like, here's where to
52:27
go look to learn more and so forth. So
52:29
it's not narrative, but there's a lot of effort. And, well,
52:31
yeah. But yeah,
52:33
there's a lot of work happening there. Yep, yeah.
52:36
Okay, so we mentioned the website, we mentioned
52:38
YouTube. We said we'd get some links for
52:40
people to experiment and try things out on. There's
52:43
lots of those cool examples that people can kind
52:45
of try out, and it will download the data for
52:47
them to work with, unless they want to
52:49
go and find a bunch of data. Maybe we
52:52
can talk about, and I know
52:54
it's like a laundry list, but maybe to start,
52:56
some of what are the backends that it does
52:58
support? Like, we mentioned DuckDB
53:00
and Postgres and Spark, and
53:02
I'm trying to remember all the ones we mentioned so far,
53:04
but it's quite a few. We have
53:07
like a development command, I
53:09
suspect it's called like list-backends,
53:11
and it literally just prints out a
53:13
list of them, because it changes, and
53:15
I use it sometimes. Yes. This
53:17
does come up from time to time,
53:19
and I won't go through all of
53:21
them. The one that's always
53:23
in my mind is DuckDB,
53:25
of course, because it's the one we
53:27
use and interact with a lot. But like,
53:29
there's BigQuery; ClickHouse is one
53:31
that I think we've got a number
53:33
of people using; Dask, DataFusion, Druid,
53:36
Exasol, Flink,
53:38
where we've sort of dabbled in the
53:41
streaming world; Impala, which is sort
53:43
of like the original backend,
53:45
that was like the primary
53:47
backend that Ibis
53:49
was developed for. Microsoft
53:53
SQL Server, MySQL, Oracle,
53:56
the usual suspects
53:58
there. Yeah. The Polars
54:00
backend, if you can believe it.
54:02
Okay. I'm happy to
54:04
talk about Polars if
54:07
you want to, but there's the PySpark,
54:09
the Snowflake, Trino, SQLite...
54:12
a bunch. Yes! And we talked
54:14
about lots of these different entry
54:16
ways into the platform. People that are coming
54:18
from R should
54:21
have a fairly friendly experience, and have
54:23
kind of like a guide for them,
54:25
like, here's what you should expect. A lot of
54:27
that was written by, yes, yeah,
54:29
awesome, great, one of our
54:31
colleagues at Voltron Data. Love that.
54:34
So you also have a version for
54:36
people that are much more Python-based,
54:38
and then people that are maybe coming from
54:41
straight SQL. Those are the
54:43
three major ones, I think, if I'm not
54:45
wrong. I think we're thinking about
54:47
adding one for people coming from PySpark as
54:50
well, since that's
54:52
another place where
54:54
people have spent a lot of time,
54:56
and so there'll be like a way to
54:58
come to Ibis from that. We
55:01
haven't really talked about the project itself.
55:03
Are you being supported to work on
55:05
this? It is an
55:07
open source, yep, tool. Is that entirely
55:10
through Voltron Data or
55:12
through something else? Yes, Voltron
55:14
Data is like the primary financial supporter
55:16
of Ibis. Okay. We have, at
55:20
last count, it's not that many, I just
55:22
don't remember, but there
55:24
are, I think, six or
55:26
seven full-time people working on
55:28
different aspects of
55:30
Ibis. And then
55:32
we've got a few people
55:35
from outside of Ibis that
55:37
contribute. We've got a person at
55:39
Google. We've got a person
55:41
who is just a very
55:43
enthusiastic user we recently, like,
55:45
made into a committer. And
55:49
so, at its core, it's
55:51
supported by Voltron Data, and
55:53
then we have, like, we're trying
55:55
to grow the developer community, and so we
55:57
wanna see more contributors from
56:00
outside Voltron Data who are
56:02
interested in contributing, especially for
56:05
backends that some of us may
56:07
not know a lot about. Yeah,
56:09
I can imagine that can be tricky depending on
56:12
the history of the backend. There's
56:14
just one of the unique
56:17
development, let's call
56:19
it experiences that one may have when
56:21
working on Ibis is having
56:24
to deal with the idiosyncrasies of
56:26
20 execution engines,
56:30
especially around all the fun,
56:33
but not really that fun edge
56:35
cases of null handling. There's
56:38
just a lot of different stuff there, how
56:40
they happen to do floating
56:42
point rounding. That
56:45
differs among each of these. There's
56:49
a lot of interesting details there, but yeah,
56:52
it can be quite tiring. At
56:55
the end, you end up with some knowledge about
56:57
how 20 systems work, but you're like, where
56:59
am I going to use this except for Ibis? If
57:03
I'm an Ibis developer, it's useful. Yeah,
57:06
hardly anyone has 20 unique
57:08
databases in production. I
57:11
have an odd duck question that I
57:13
wondered about, and I didn't dig
57:15
deep into the documentation, but you
57:18
talk about this idea of it taking
57:20
what you've written and it
57:23
generating the SQL that then is used
57:25
on that backend. Is there a
57:27
way to have it output it
57:30
also as that actual SQL
57:32
query? Absolutely. Great.
57:35
So there's a couple
57:37
of ways that you can do that.
57:40
We have this top-level function that's like
57:42
ibis.to_sql. You give it an
57:44
Ibis expression and optionally a dialect that
57:46
you want it to generate, and
57:49
it gives you back a SQL string. If
57:54
you're in an IPython or a Jupyter
57:56
Notebook, it will actually syntax
57:58
highlight that output. And you
58:00
can see it in a little bit
58:02
more readable way. Yeah,
58:05
so adding on to
58:07
the portability. Yep. And
58:11
the idea with that is you can get
58:13
something that can be used as a SQL string, but then if
58:16
you just want to look at your SQL, you also
58:18
get the syntax-highlighted thing. You
58:20
can turn it to whatever dialect Ibis
58:23
supports. I think that
58:25
is maybe a form of debugging, too, potentially. Oh, we
58:27
all, all of us Ibis developers, use it all the
58:30
time in that way. Okay,
58:32
yeah, cool. We
58:34
also have a compile method. So
58:37
the to_sql one is like, in
58:40
some ways, it's very aesthetics-focused, right? It's
58:43
going to do pretty printing of the SQL. It'll
58:45
indent it and all this stuff. Sure. The
58:48
compile method is a little bit more raw. It doesn't
58:51
do any pretty printing. It's not
58:53
very readable. But
58:55
if you want to get exactly what's going on in
58:57
the database, that's what you would print out.
59:00
I guess a little bit like how whatever
59:03
CSS files or HTML files could
59:05
be all space removed. Right,
59:08
it's not quite that level of craziness,
59:10
like where you're JavaScript minification is
59:12
not at that level of insanity,
59:14
but it's towards that direction. Cool.
59:24
So I have these questions I'd like to ask everybody who
59:26
comes on the show. The first one is, what's something that
59:28
you're excited about that's happening in the world of Python? There's
59:30
a few things. Okay. So
59:33
I know Pyodide is not particularly new,
59:35
but I am definitely just very
59:38
interested in that. I'm excited about where
59:40
it's heading. Yeah, yeah. I know Peter
59:43
Wang from Anaconda, has he
59:45
been on the show? I
59:49
invited him literally moments
59:51
after he walked off the stage at PyCon,
59:53
and we still have yet to connect, and
59:55
so I've got to try again. Yeah, he's
59:57
awesome. a
1:00:00
character, hilarious guy. Anyway,
1:00:03
I know he was like
1:00:05
a long time ago. He's like, why can't
1:00:07
we run Python in the browser and then
1:00:09
whatever fast forward a decade or two and
1:00:11
now you can. So that's pretty exciting to
1:00:13
me. SQLGlot. I'm
1:00:16
somewhat biased there just because we're heavy users
1:00:18
of it. No, no, it's helping you guys
1:00:20
out. It's a
1:00:22
pretty exciting project. I think a
1:00:24
lot of us working on Ibis were like, it would
1:00:27
be great if like, we didn't have to
1:00:29
write all this translation layer
1:00:31
and like somebody else would do it. And,
1:00:33
uh, and somebody else did
1:00:37
independent of us, you know, trying to control or
1:00:39
anything like that. It just, it showed up one
1:00:41
day and we were like, wow, this is really
1:00:43
something. Yeah,
1:00:46
that's cool. PyCon US is coming
1:00:48
up. I think a bunch of the
1:00:51
Ibis team are going to be there. We're giving
1:00:53
a tutorial. Nice. Some of us
1:00:55
are giving a talk in Spanish at
1:00:57
the Charlas track. Yeah. Yeah.
1:00:59
One of my coworkers is very
1:01:01
involved in that. Okay. Yeah.
1:01:04
So that's great. And then as
1:01:06
usual, there's always some exciting
1:01:09
new stuff in the world of
1:01:11
Python package management, like uv and
1:01:13
Pixi. Yeah. Something
1:01:15
to watch. Yeah. Yeah, exactly. So I
1:01:19
spent a lot of time working
1:01:21
on package management tools in
1:01:24
various capacities, um, okay.
1:01:26
Either like for an application at a job
1:01:30
or just like working with complex
1:01:32
development environments, but you can imagine
1:01:34
Ibis has a lot of optional
1:01:36
dependencies and so like
1:01:38
we need environments. Should I bring you
1:01:40
back on to do a survey with me? Maybe
1:01:42
we can bring a handful of people in. We
1:01:44
can talk about it. Oh man. I think that
1:01:47
would just erupt into, I don't know, violence
1:01:49
or something, because it's just that kind
1:01:51
of topic. Yeah.
1:01:53
Yeah. It's very, very, uh,
1:01:55
opinionated, uh, very much so.
1:01:57
Yeah. So, but like.
1:02:00
I see things like uv. Have you
1:02:02
had Charlie Marsh on the show? No,
1:02:04
no, he's somebody else who's on the list I've
1:02:07
been thinking about. I've been kind of watching
1:02:09
Ruff, too, and him forming the company,
1:02:11
and that's been interesting to kind of
1:02:14
watch too, because it's sort of a similar journey to
1:02:16
a few others. Yeah, I
1:02:18
don't want to call them smaller, but like individuals who said, I
1:02:20
want to make a company and let's turn this into a
1:02:22
thing, and that's hard.
1:02:25
Totally. So I wonder what the struggles are
1:02:27
there. I might actually approach it from that
1:02:29
angle, too. Yeah, no, Charlie's great, talk to
1:02:31
him for sure. And then Pixi, which
1:02:35
is like... It's
1:02:38
like, you know, an analogous
1:02:40
sort of tool, but working, you know, more
1:02:42
closely with the conda ecosystem. Yeah,
1:02:44
yeah, if you don't know
1:02:46
Wolf Vollprecht, I mean, I'm
1:02:48
happy to put you in touch. Yeah, yeah,
1:02:50
he's in the mamba world and all that stuff. Yeah.
1:02:53
Yeah, I'm sure he would be a
1:02:55
good person to talk to as well So like I'm
1:02:57
I'm kind of watching both those tools to see where
1:03:00
things go I mean, I think the
1:03:02
Python community has had some struggles
1:03:05
with various like standards around
1:03:07
package management and just trying
1:03:10
to get some consensus coalescing
1:03:12
on various things,
1:03:14
and it's such a wide target to
1:03:16
hit. Yep. So that's the problem:
1:03:18
it's used in so many different fields
1:03:21
and all these different backgrounds and you
1:03:23
literally have the immediate division of data
1:03:26
science, and, you know, everything happening
1:03:28
with Anaconda and, you know, conda
1:03:31
and all that sort of stuff, versus
1:03:33
yeah, and I think
1:03:35
these tools are coming from a
1:03:38
few decades of learning what
1:03:40
is good, what works, and what doesn't work.
1:03:42
And so they have the
1:03:44
benefit of the hindsight of all
1:03:47
the things that we wish we could change but that
1:03:49
we can't change. And so, like,
1:03:51
a programming language like Rust comes along, and Cargo,
1:03:53
and people are like, oh my god,
1:03:56
this is really how the thing should be.
1:03:58
But they can stand on those shoulders,
1:04:00
man. Right. And so it's like,
1:04:02
tools like uv and Pixi, like,
1:04:05
have all that history to build
1:04:08
on, which I think, you know, somewhat
1:04:11
speaks to their ability to
1:04:13
succeed. Yeah. So, yeah.
1:04:16
Yeah, that's awesome. That's a whole bunch of stuff.
1:04:18
I will definitely add links for all
1:04:20
those items. And I'm very interested in
1:04:22
when people suggest guests because I'm
1:04:25
always looking to add more people to
1:04:27
the roster. So. Sure. What's
1:04:29
something that you want to learn next? Again, this doesn't have to
1:04:31
be about programming. Right now, I'm
1:04:33
currently learning Spanish. Okay.
1:04:36
I live with two native speakers
1:04:38
and one that speaks English. And so I'm
1:04:41
just trying to go like, you know, as deep as
1:04:43
possible as I can there. Okay.
1:04:45
It's an immersion. Yeah.
1:04:50
It's sort of, it's
1:04:52
tough, but it's like, it's very
1:04:55
rewarding. I'm
1:04:57
using a platform called
1:04:59
LearnCraft Spanish, which takes a
1:05:01
different approach than other attempts
1:05:04
that I've made. Okay. I
1:05:06
think a lot of, like a lot of the, a
1:05:08
lot of these sort of app based things. Right.
1:05:11
Right. The Duolingos and such. Yeah. They
1:05:13
don't give you, they don't focus
1:05:15
on fundamentals like grammar.
1:05:17
They focus a lot on vocabulary. So like,
1:05:20
how do I say dog and milk and
1:05:22
whatever? Right. Right. And they can have these
1:05:24
pretty little icons and so forth to trigger
1:05:27
you. They're kind of designed to
1:05:29
like keep you in the app. And then, you know,
1:05:31
I don't know. I mean, I've, I don't want to say
1:05:33
anything like negative, but.
1:05:36
No, no, it's, it's almost the same complaint
1:05:38
people have about tutorials in the Python world.
1:05:40
It's like, maybe you should go
1:05:42
build something. You know, maybe you should go have
1:05:44
an actual conversation. Yeah. Yeah. Exactly. It's kind of
1:05:47
like a different approach. Yeah. It's
1:05:49
like, LearnCraft Spanish takes a very
1:05:51
different approach in that they teach you a
1:05:53
lot of the hardest grammar first. So also
1:05:58
just, I didn't know a lot of the
1:06:00
names for grammatical structures,
1:06:02
like direct and indirect object
1:06:05
pronouns and so forth. If
1:06:07
you'd asked me yesterday, hey, what
1:06:09
would an indirect object pronoun be in
1:06:11
English, I couldn't have told you. Now I can
1:06:13
tell you, but it's only because I learned it
1:06:15
in the context of learning Spanish. So, preference-wise,
1:06:17
like, getting into that,
1:06:19
getting into the harder stuff first
1:06:21
is a whole lot more rewarding, because
1:06:24
you can build the
1:06:26
tools you need to ask for the vocabulary,
1:06:28
right? That's sort of their
1:06:31
whole angle. It's like, whoa, if you just don't know
1:06:33
the word for table, then, like, you
1:06:35
just describe the table, right? Okay,
1:06:38
that's cool. Essentially a new way
1:06:40
of approaching it. Yeah.
1:06:43
So, because you mentioned the
1:06:45
PyCon talk, are you involved
1:06:47
in that, then, trying to use it to your
1:06:49
advantage, or no? Yes, I am.
1:06:51
It's TBD how
1:06:53
involved I'll be. Nothing has
1:06:55
to be delivered yet. My
1:06:58
colleague, who is a native Spanish
1:07:00
speaker, is leading
1:07:02
the charge on that. You
1:07:05
know, hopefully they won't ask
1:07:07
me to say anything complicated.
1:07:10
That's fun!
1:07:13
What's the best way that people can follow the work
1:07:15
that you do online? GitHub.
1:07:17
I don't... I do
1:07:19
a little bit of tweeting, but
1:07:22
mostly it's a
1:07:24
joke. Like, the things I
1:07:26
say are not that serious. So that's
1:07:29
your less-serious networking, then? Yeah,
1:07:31
lots of other stuff. Yeah, the
1:07:33
last major thing I did on Twitter
1:07:35
was an April Fools joke related to
1:07:38
Ibis. Okay. I don't
1:07:40
take
1:07:42
those interactions, I guess, too seriously, man,
1:07:44
but I definitely spend most of
1:07:47
my time online doing the sort of work type
1:07:49
of stuff on GitHub. You
1:07:51
were a convert after your
1:07:53
first experience there that we
1:07:55
discussed? Yep, yep. I've been on
1:07:57
GitHub for a long time.
1:08:00
Yeah, that's great. Well, Philip, it's fantastic
1:08:02
to talk to you. Thanks for coming on the show.
1:08:04
Yeah, thanks, Christopher. Thanks for inviting me. Glad we got
1:08:06
to chat. And
1:08:11
don't forget, this episode was brought to
1:08:14
you by MailTrap, an email
1:08:16
delivery platform that developers love. Try
1:08:19
it out for free at mailtrap.io. I
1:08:24
want to thank Philip Cloud for coming on the show this
1:08:26
week. And
1:08:28
I want to thank you for listening to the
1:08:30
Real Python Podcast. Make sure that you click that
1:08:33
follow button in your podcast player. And if you
1:08:35
see a subscribe button somewhere, remember
1:08:37
that the Real Python Podcast is free. If
1:08:40
you like the show, please leave us a review. You
1:08:42
can find show notes with links to all
1:08:44
the topics we spoke about inside your podcast
1:08:47
player or at realpython.com
1:08:49
podcast. And while you're there, you can leave us
1:08:52
a question or a topic idea. I've
1:08:54
been your host, Christopher Bailey, and look forward to
1:08:56
talking to you soon.