Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
Use Ctrl + F to search
0:00
Welcome to the Real Python Podcast.
0:02
This is episode 201. What
0:05
are the benefits of using a decoupled
0:08
data processing system? And how
0:10
do you write reusable queries for a
0:12
variety of backend data platforms? This
0:15
week on the show, Philip Cloud,
0:17
the lead maintainer of Ibis, will
0:19
discuss this portable Python DataFrame library.
0:22
Philip contrasts Ibis's workflow with
0:24
other Python DataFrame libraries. We
0:26
discuss how getting close to
0:28
the data speeds things up
0:30
and conserves memory. He
0:32
describes the different approaches Ibis provides
0:34
for querying data and how to select
0:37
a specific backend. We discuss
0:39
ways to get started with the library
0:41
and how to access example data sets
0:43
to experiment with the platform. Philip
0:45
discovered Ibis while looking for a tool
0:47
that allowed him to reuse SQL queries
0:49
written for a specific data platform on
0:51
a different one. He also recounts how
0:54
he got involved with the Ibis project,
0:56
sharing his background in open source, and
0:58
learning how to contribute to a first
1:00
project. This episode
1:02
is sponsored by Mailtrap, an
1:04
email delivery platform that developers love.
1:08
Try for free at
1:10
mailtrap.io. All right, let's get
1:12
started. Is
1:33
a weekly conversation about using Python in
1:35
the real world. My name is
1:37
Christopher Bailey, your host. Each week
1:39
we feature interviews with experts in
1:41
the community and discussions about the
1:43
topics, articles, and courses found at
1:45
realpython.com. After the podcast, join us
1:47
and learn real-world Python skills with
1:50
a community of experts at realpython.com. Hey,
1:52
Philip, welcome to the show. Hey,
1:54
Chris, great to be here. Yeah,
1:57
so Wes McKinney hooked us up to...
2:00
talk a little deeper about Ibis. I mentioned
2:02
multiple times that I'm very interested in that
2:04
project and we had so many
2:06
other things to talk about when he came on. So he
2:08
gave me your name
2:10
and kind of showed me not only the
2:12
things happening with the project but you have
2:15
a detailed YouTube channel going there which I
2:17
think is nice. But maybe we
2:19
can start with this. How did you get involved
2:21
with Ibis to begin with? Yeah, so
2:23
in 2016 I was working at,
2:26
well it's now called Meta, Facebook
2:28
then. And I was
2:31
in data engineering. The job there is,
2:33
that job there anyway is writing a
2:35
lot of SQL code. And
2:38
Facebook has a dizzying array
2:40
of infrastructure. Data engineering deals mostly with
2:42
Hive, or at least at the time
2:44
it was mostly Hive. Presto was like
2:46
sort of the new kid on the
2:48
block and it was getting
2:50
a lot of internal like sort of hype
2:53
and use and whatever. Hive was
2:55
like super hard to use for
2:57
building a pipeline
2:59
because when you're working with like
3:02
a data engineering pipeline, you often are iterating,
3:05
right? You don't know necessarily exactly what your
3:07
code is going to look like right
3:09
away. So you need something that's going to give
3:11
you somewhat reasonable feedback,
3:14
a somewhat reasonable feedback loop. Like you're
3:16
not going to be waiting like 30
3:18
seconds to run a count-star query or
3:20
something like that. Okay, yeah. So just
3:22
to kind of break it down even
3:24
a little bit more there, like when
3:26
you talk about pipelines, I'm guessing
3:28
there's a variety. They could be the ingestion
3:31
of data pipeline, but there also
3:33
could be like just the transformation
3:35
layer sort of stuff. Yeah,
3:38
I mean I can give you kind of a whirlwind
3:40
tour of how this whole system works. Basically
3:43
all of Facebook's apps like sort of emit data
3:45
at some probably alarming
3:47
rate and it's
3:50
going into a message bus,
3:52
which is essentially like a giant queue, right?
3:54
It's just like a bunch of like in-memory
3:56
things. The apps are all kind of forwarding
3:58
the data through this pipeline. And
4:00
then that pipe splits off into like
4:03
a bunch of different things. So
4:06
you can sort of hook into that pipe
4:08
with PHP and do like arbitrary programmatic
4:11
transformations. You can run
4:13
like streaming SQL on that. Or
4:16
you can just kind of write it
4:18
directly to Hive. In that
4:20
case, it would go
4:22
into essentially a file system using
4:25
Facebook's file format called DWRF, which
4:27
is a derivative of another file
4:29
format called ORC. And
4:32
then our job. I love
4:34
the name, sorry. Yeah, they're funny names. And
4:40
so once they hit
4:42
disk, data engineers could start
4:45
building transformations. And
4:47
then those transformations would of course be like written
4:50
to a disk somewhere else and to another
4:52
table. And there's this just
4:54
gigantic, you know, directed graph of like
4:57
each transformation and, you know,
5:00
being run daily effectively. Okay.
5:03
And that's sort of how the whole thing works. So
5:06
basically what we did was
5:08
write a lot of very complicated
5:10
select statements. Okay, yeah. We
5:12
were almost never writing like insert or
5:14
create table that was automatically done by
5:17
other processes. Okay, you needed
5:19
to pull information out
5:21
in some categorical way, like,
5:23
you know, like narrow it in some way. Yeah,
5:26
yeah. Just sort of whatever we were, whatever
5:29
sort of things we wanted to know about the product.
5:32
I worked on the search, the Facebook search
5:34
product. Okay. That's sort
5:37
of, yeah, we would, I don't know, we did a bunch
5:39
of different stuff, a lot of counting stuff, a lot of
5:41
summing stuff, not a lot of like super fancy math or
5:43
anything like that. But a lot of sort
5:45
of how are people moving throughout the app, that kind of
5:47
thing. Yeah, okay. Yeah. So,
5:50
sorry, let's see, the original question, how I got involved
5:52
in Ibis. Yeah, yeah. Well, you're doing
5:55
all this like really intense stuff with SQL and
5:57
having lots of these statements and then having.
6:00
and not wanting
6:02
to necessarily run it across everything. Maybe
6:04
you want like an early return of
6:06
like, what is this gonna even look like? Is
6:09
that kind of where you were headed? Yeah, totally.
6:11
So you could access the data
6:13
from either Hive or Presto. And the thing with
6:15
Hive is that Hive is built on this like
6:17
idea, like it's originally built on Hadoop and
6:20
it was a way of turning SQL
6:22
statements into MapReduce jobs. And MapReduce
6:24
is like a technology that is
6:27
designed to survive the apocalypse, right?
6:29
Nothing will take it down. Okay.
6:34
And that was largely the case.
6:36
And the trade-off is that while
6:38
the apocalypse, you know, may not
6:40
end Hive, you may not
6:42
be able to get an answer to your immediate
6:44
question in any sort of like
6:46
interactive amount of time. Okay. Presto
6:49
was designed to like
6:52
minimize that, the trade-off
6:54
there. Okay. And being
6:56
able to scale to sort of Facebook scale
6:58
as well as like give you back interactive
7:00
queries where possible, or give you back, give
7:03
you interactive speed where possible. And
7:07
the dialects were not the same, right? So
7:09
Hive has its thing, it's like
7:11
its own SQL dialect and then Presto has
7:13
its own SQL dialect. And so I wanted
7:15
a way to like write something, some code,
7:17
Python preferably, that
7:19
I could write it sort of once. And then I could
7:21
like, when I wanted to go to production,
7:24
I could just say, hey, like give me the Hive
7:26
SQL for that. And then when I'm like interactively, you
7:28
know, when I'm like iterating on an analysis, like run
7:30
it against Presto. Okay. And
7:32
so I started looking around for that. And then I
7:35
saw, I saw what Wes was doing with Ibis. And I
7:37
was like, this looks like the thing that I kind of
7:39
want. All right. So that's sort
7:41
of how I got involved. Okay, so you saw
7:43
it being demonstrated by Wes in
7:46
some capacity? Yeah, I think so.
7:48
You know, it's been almost 10
7:51
years. And so I don't remember
7:53
exactly how like, like the exact causal chain
7:55
of how I got there, but yeah,
7:58
I think I saw that. He had announced it maybe
8:00
on his blog and then I was like, oh,
8:03
this is cool. It seems like exactly
8:05
what I need or what I want anyway.
8:08
Yeah, yeah, that sounds cool. Yeah. So then you kind
8:10
of jumped in and I always
8:12
wonder about this sort of, uh, process
8:15
of getting involved in a project.
8:18
And I've had a
8:20
few people on talking about open source and avenues
8:24
in for, you know, a
8:26
lot of my audience is going to be beginner, intermediate,
8:28
and then, I don't know, I wonder
8:30
how many advanced people I have on the show, you
8:33
know, since we're kind of a learning
8:35
website of Python, but it definitely
8:37
varies. But I think a lot of them wonder, you
8:39
know, like, how do I get involved in a project
8:41
like that? And so I wonder, well,
8:43
what was your experience as far as like, you
8:46
thought this is an interesting tool. Did
8:48
you then say, Hey, I'd like to become
8:50
more involved and contribute, or what was the
8:52
process there? By that time
8:54
I had already been actively contributing to
8:56
a couple of open source projects. So
8:59
I guess I
9:01
can, I can convey a story from when
9:04
I got involved in open source, like the first
9:06
time, if that might help. Sure. Yeah. That's always,
9:08
I love that stuff. Cause I think it's interesting for
9:10
people to like, you know, give them
9:12
a encouragement, but also like
9:14
maybe warn them if there's a potential, you
9:16
know, things that they need to be aware
9:18
of getting involved in this world.
9:21
Yeah. Totally. So I got
9:23
the first, the first like
9:25
major open source project that I contributed to
9:27
was pandas, but it was, it
9:29
was around, like, the 0.13 release
9:33
or something. I mean, it was years ago.
9:37
And I was
9:40
in grad school. I studied neuroscience in
9:42
grad school, computational neuroscience. And
9:44
so I needed some,
9:46
or I wanted pandas to
9:48
do something a specific way. The thing
9:50
I was interested in was cross correlation.
9:53
Okay. And, and like, I think
9:55
pandas at the time was doing this sort
9:57
of like, there's a naive way to do
9:59
it that's, like, very, very slow, and then
10:01
there's a way using like fast Fourier transforms
10:03
to do it much faster
10:06
And so I was like, cool, I want
10:08
this in pandas. I
10:10
want this to be the cross correlation
10:13
algorithm. And so
10:15
what I did was open a GitHub
10:17
issue and paste the code that I
10:19
had written to do this with pandas
10:21
into the issue. And I was
10:24
like, here's the code, here's
10:26
how you do it. Accept my... I mean, I
10:28
wasn't demanding they accept my contribution, but I
10:30
sort of, you know, went about it
10:32
like, well, I don't know
10:34
how to use this thing called GitHub, like how
10:36
do I do this, right? What do I do? It's...
10:38
I was just like, I'm gonna put the information
10:40
out there and, you know,
10:42
hopefully somebody is either gonna
10:45
say, you know, this sucks, do
10:47
it this way, or, you know,
10:49
you're doing it wrong, here's how to do it. And
10:51
so the community was largely, like,
10:54
the community bit there was very helpful,
10:56
and they're like, hey, you know, that's good, but this
10:58
isn't the way to do this, like,
11:01
pull requests, etc. And so
11:03
I guess
11:05
I would just say, like, if
11:07
you have an idea or you want to contribute,
11:10
you know, open a GitHub issue and
11:13
put the information there, and
11:15
if the project is gonna
11:18
be worth contributing to, they'll help you out. Okay.
11:20
They'll say, hey, this
11:22
is the path. Yeah, right. Like,
11:24
have you seen our contributing docs, etc., that sort
11:26
of thing. Nice. Because I think a
11:29
lot of people, a lot of us, have been
11:32
down a similar path where we didn't really know
11:34
what we were doing, and then, yeah,
11:36
sure, you know, then we did, once somebody
11:39
was like, you know, here's how to do it.
11:41
But it's a whole other thing, a whole
11:43
other organization that's
11:46
got its own... like you said, this one happened
11:48
to be a pretty friendly community and so forth, and
11:51
you never know what's behind that. So
11:53
that's kind of a fun
11:55
way of getting in. So was it something
11:57
similar with Ibis then? So
11:59
with Ibis, I looked through GitHub
12:01
issues that Wes had created and
12:03
I picked one that I thought
12:07
I understood what needed to be done. I don't
12:10
remember if I asked any
12:12
clarifying questions, but if there's any ambiguity,
12:14
it's always a good idea
12:16
to ask. And then worked on it. I
12:19
think the issue was the Postgres backend.
12:22
I guess we'll get into what a
12:24
backend is later. Yeah, there's
12:26
lots of backends to talk about. Yeah.
12:30
And so I contributed the
12:32
Postgres backend. I think it's my first major
12:34
PR. I don't remember if
12:36
I did anything smaller before
12:38
that. That
12:41
was the first major contribution that I submitted. So
12:45
then that eventually moved into you at
12:48
this point, you're involved directly
12:50
with Voltron Data, right? Yeah.
12:52
So Wes and I overlapped at Two Sigma,
12:55
and they were big supporters of Ibis.
12:57
So we worked on some Ibis-related
12:59
stuff there. I spent
13:02
a good amount of time working
13:04
on Ibis, like the open source
13:06
project, in addition to doing whatever
13:08
Two Sigma-specific things we were doing there. Okay.
13:11
Yeah. And then I
13:13
dropped out of the world
13:15
of Python analytics tools and
13:18
went to work on just
13:20
something totally different, like Rust,
13:22
semi real-time,
13:26
video machine learning things.
13:28
And I was like Rust infrastructure.
13:30
And then I came back
13:33
up for air. You're
13:38
down in the lower depths there. Yeah,
13:40
that was an interesting time, but
13:42
perhaps that's maybe for another conversation.
13:45
Yeah, we hinted a little bit at
13:47
what Ibis is. And of course, I
13:49
talked to Wes about it with a
13:52
little detail. We didn't have a
13:54
ton of time because we were talking about lots of
13:56
different stuff. But maybe we could just start
13:58
with like, you. We're
14:00
interested in it because of what it could
14:03
do for you in reusing
14:06
your Python code with
14:08
these SQL statements and not having to
14:11
rewrite these things that
14:13
you've created. I worked at a
14:15
bank for a while and that was a big job
14:17
I did. It
14:19
was a mortgage company and they were sunsetting
14:22
a platform and then basically starting a
14:24
new platform. They
14:26
had all these reports and all these things
14:29
that they still wanted to generate in basically the
14:31
same kind of style if
14:33
they needed to from this old
14:35
data. They
14:38
just gave me this job of
14:40
like, all right, rebuild all
14:42
these reports. I'm like, do you have a schema
14:44
for the database? No. Oh boy.
14:47
It's just a raw table. Okay. Do
14:49
you know what the relationships are or whatever? Not
14:52
really. Can you give me some
14:54
existing reports that came from this thing? I
14:58
just reverse engineered it all. It was my
15:00
first job learning SQL and working in the
15:03
industry. I was like heads down,
15:05
learning all this sort of stuff. I understand the
15:07
idea of wanting to take ... People
15:11
spend a lot of time building
15:13
queries and they're very, very detailed
15:15
and they're something you may want to be able to reuse.
15:18
It's interesting to me that this
15:21
is maybe kind of this way in
15:23
for people that are interested
15:25
in a tool like this, that
15:28
they are involved in a lot of SQL
15:30
stuff or maybe their business has it. Maybe
15:32
we can just talk about what's fundamentally
15:35
different about what's happening with this DataFrame
15:37
library compared to pandas or polars. Yeah.
15:41
I think there's
15:43
a few fundamental differences. When
15:46
you're working with pandas, whenever you take
15:49
an action like you call a method
15:51
or you add
15:53
two series or DataFrames together,
15:56
it's happening right away. execute
16:00
like pandas itself well really
16:05
numpy but you know for purposes of this question
16:07
we could just say it's pandas.
16:09
Yeah, pandas is going to, you know, allocate
16:12
memory for the output. Let's say we're adding
16:14
two series together, so it's going to
16:16
allocate, you know, memory, and it's
16:19
going to do the addition, you know, element by element, and
16:21
then fill in the allocated memory.
16:23
Let's say you do another
16:25
addition. Well, that's going
16:27
to allocate another output and fill it in,
16:30
and so forth. So every time,
16:32
you're generating this tree
16:34
of allocations. Okay, and each
16:36
one of them is taking up its own separate space, right,
16:38
in a way. Exactly. And then every
16:42
intermediate addition, you're kind of
16:44
wasting memory in a sense, right? Because it's
16:46
just going to be thrown away to get
16:48
that final output. Sure, okay.
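The eager-allocation pattern being described can be sketched in plain Python. This is an illustration of the idea, not pandas internals: each eager step materializes its own intermediate result, while a fused pass allocates only the final output.

```python
# Toy illustration: eager, step-by-step evaluation materializes every
# intermediate result; a "fused" single pass allocates only the output.

def add_eager(a, b):
    # Allocates a brand-new list for this one addition.
    return [x + y for x, y in zip(a, b)]

def add_fused(*columns):
    # One pass over all columns, one output allocation, no intermediates.
    return [sum(values) for values in zip(*columns)]

a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]

# Eager style: (a + b) is materialized, then thrown away after + c.
intermediate = add_eager(a, b)
eager_result = add_eager(intermediate, c)

# Fused style: the whole expression a + b + c in a single allocation.
fused_result = add_fused(a, b, c)

assert eager_result == fused_result == [111, 222, 333]
```

Both styles produce the same answer; the difference is only in how many throwaway allocations happen along the way.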
16:51
And with a system like
16:53
Ibis, and Polars has
16:56
an expression-based API as well, so there's
16:58
some overlap there conceptually.
17:01
Yeah. But with an expression-
17:03
based API, you're describing,
17:06
you're writing that addition expression, you know, A
17:08
plus B plus C, and then
17:11
it's being compiled into something else, and
17:13
in Ibis's case it's often SQL. Okay. And
17:15
then you're handing that off to
17:17
the database engine, which is almost certainly not
17:20
going to... it's not
17:22
going to evaluate that expression in the way I
17:24
just described, because it has more information
17:26
about what it's going to
17:28
do. Pandas, for example, doesn't know
17:31
that you're doing A plus B plus C. All
17:33
it sees is two series and a function call
17:35
to add them together. Okay, it has
17:38
no look-ahead at all. Yeah, it
17:40
can't
17:42
see what the global computation
17:44
is you're trying to express. Okay. Whereas
17:47
a SQL database can, right? It's
17:49
got the whole query available to
17:51
it, so it can parse it and turn it into
17:54
more structured information, like a tree, that it
17:56
can then analyze and say, oh, I know
17:59
I'm doing an addition of A, B, and C. I
18:02
just need to allocate that one output array and
18:04
then call the function on every element of A,
18:06
B, and C at a time. I only do
18:08
that one allocation. So
18:11
that's a big difference. So I
18:13
guess the way you can express it, it's like
18:16
it's doing the entire computation at once. It's
18:18
not evaluating every intermediate
18:20
step. So in some
18:22
ways, I've heard
18:24
the term being used, especially with folders,
18:27
maybe sometimes called lazy evaluation. And I don't
18:29
know if that's the exact same terminology we're
18:32
thinking of here. Yeah.
18:34
So there's some
18:37
specific technical details around the
18:39
difference between lazy
18:41
and deferred. Lazy
18:44
tends to kind of... There's a specific...
18:47
It comes from the world of functional
18:49
programming, where in a
18:52
lazily evaluated language, using the sort
18:54
of technical definition, the only
18:56
things that are ever evaluated
18:58
are the things that get used. And
19:01
it's sort of by construction of
19:04
whatever interpreter or programming language you're using
19:06
that things will be lazily evaluated. Nowadays
19:09
people use the word lazy to mean a
19:11
slightly different thing,
19:13
but overlapping concept. Yeah, I was thinking
19:16
that. Where you're not
19:18
evaluating things when you write them
19:20
necessarily. It's
19:23
such an interesting approach, the idea that
19:25
you're taking this set of instructions and
19:27
looking at them as a whole versus
19:30
just a recipe list: do this, do this, do this,
19:32
do this, do this. And I can see how
19:35
that can create a lot of efficiency within
19:37
a system like that. It can say, okay, well, we don't
19:40
need to grab everything. We can grab just
19:43
what we need for this particular operation or query
19:45
or what have you. Yeah,
19:48
and then there's just decades
19:50
of research poured into SQL
19:52
databases in particular and newer
19:55
systems like DuckDB are sort of extending
19:58
that tradition into the... Yeah. to
20:00
the analytics world and bringing a
20:02
lot of cutting edge research
20:05
and deep expertise in designing
20:07
these systems. This
20:13
episode is sponsored by Mailtrap, an
20:15
email delivery platform that developers love.
20:18
Mailtrap is an email
20:20
sending solution with industry-best analytics,
20:23
SMTP, and email
20:25
APIs for major programming
20:27
languages. And it
20:30
includes 24-7 human support. Try
20:33
it out for free at mailtrap.io. That's
20:38
m-a-i-l-t-r-a-p.io. I
20:44
think that kind of leads us a little bit into this idea of
20:46
construction-wise, like how this
20:49
library kind of a little
20:51
bit thinks differently than, and as we already
20:53
mentioned, the functionality difference. But one
20:57
of the things I found fascinating, like, alright, let
20:59
me just play with this thing, is
21:01
it's sort of like, okay, well, what backend do
21:03
you want? And I was like, oh, okay, well, that's
21:06
not a choice that I had to really think about
21:08
so much right away. And I
21:10
think that fundamental difference
21:14
is interesting. What are you doing when
21:16
you're choosing a backend? The default is
21:19
typically DuckDB, I think. I
21:22
think probably for performance reasons, that's why you kind of
21:24
favor it in some ways. But
21:26
maybe you can talk about that a little bit.
21:28
Like, what are you doing as you're choosing this
21:30
backend database type tool? Yeah.
21:34
So when you're choosing
21:36
a backend, you're opting into
21:38
some assumptions about more or
21:41
less the, I guess I would say the maximum
21:43
scale at which you can operate. So
21:47
if you're like, I'm opting
21:49
into DuckDB, let's say.
21:53
And yes, we do sort of implicitly
21:55
opt people into DuckDB. That's
21:57
because it's kind of the easiest one
22:03
of our backends to get started with.
22:05
It has, generally, almost all
22:08
of the functionality that Ibis supports. Okay.
22:08
It tends to be low memory. It's,
22:10
you know, parallel, etc. It's got all
22:12
these like goodies. Yeah, yeah,
22:14
yeah. We've talked about that
22:16
quite a bit. Yeah, exactly. Yeah. So,
22:19
and when you're opting into DuckDB, you're
22:21
saying, okay, DuckDB has a
22:23
maximum scale that it can operate at, which
22:25
because it's, and it's by design,
22:27
right? It's not like saying, oh, you know, everybody
22:30
needs whatever petabyte scale. Most
22:32
people don't. Right. Then
22:34
DuckDB, it's basically like, I
22:37
have data that at most fits
22:39
on my hard drive, right? And if I need
22:41
to go any bigger than that, you might want
22:43
to choose a different backend. But if you, but
22:45
most people don't have a terabyte of
22:48
data that they need to analyze. Maybe, maybe now
22:50
that's less true than it used to be. Well,
22:53
yeah, it depends. Like, I feel like that's something that comes
22:55
up on the show is that I'm
22:58
often talking to, again, you know, these kind
23:00
of beginner, intermediate, or people getting going and
23:02
are interested in trying stuff out. And
23:05
you're right, they don't have a petabyte. They don't
23:07
hardly probably even have a terabyte of data. And
23:10
so they want to just experiment and try things out. But
23:12
it's like a lot of
23:15
talk and conversations are about
23:17
these like huge scale things.
23:19
And it's like, well, I want to like
23:22
introduce people to the idea of it. And
23:24
I feel like the scaling part can come
23:26
later, you know, and also it's also expensive
23:28
to even play in that realm, you know?
23:30
Exactly. Yeah. Yeah. So
23:33
I guess like one of the things
23:35
that we strive for with Ibis
23:37
is to make that transition as seamless
23:39
as possible. So there's a
23:41
lot of setup for a lot of
23:43
these bigger systems like Snowflake and Spark
23:46
and BigQuery and so forth. And,
23:48
you know, assuming
23:51
you have sort of the same data
23:53
in each system, like the Ibis code
23:55
shouldn't change very much. Okay. Maybe you'll
23:58
have to connect to something differently. But
24:01
once you have that, that same code
24:03
that you wrote to do all your analysis can
24:05
kind of run on both. And so,
24:07
you know, with Ibis, we
24:10
don't really like to talk about Ibis itself scaling,
24:12
because it's not the
24:14
size of the data is not like a scaling
24:16
factor that's super relevant for Ibis. We're actually just
24:18
like, hey, people have built these amazing systems. We're
24:20
just going to hand you the SQL and like
24:23
we know you're going to like crush
24:25
it. Yeah. Yeah. And
24:28
it's kind of like that whole idea of it
24:30
being sort of disconnected. I forget the
24:32
word... decoupled. Yeah. The decoupling of
24:34
everything when having these sort
24:37
of separate systems that that are
24:40
repurposeable or reusable, which
24:42
is great, you know, because that helps you also
24:45
as an engineer, like, as you mentioned, people
24:47
move around from job to job and so
24:49
forth. So like these tools that you're familiar
24:51
with can maybe come with you and
24:54
the techniques that you've developed and so forth. So that's kind of nice
24:56
to have a system that can do that also.
24:58
Yeah, totally. And like one
25:00
of the reasons we might, we sort
25:03
of support a large number of
25:05
backends. In addition, sometimes people just, like,
25:07
asked for it. We're trying to come
25:09
up with like a better, you know, more sort of,
25:12
I guess, transparent rationale for
25:14
implementing, or not
25:16
implementing, support for a specific back
25:18
end. But one
25:21
of the reasons is so that just people who
25:23
are in various like settings that they may or
25:26
may not have control over can use
25:28
the tool, right? Like somebody might be, management
25:31
at some large org might be like, we're using BigQuery
25:33
or we're using whatever, and like we want them
25:35
to be able to use Ibis. And like one
25:37
of the things where Ibis excels is
25:40
like taking your development code into
25:42
production with like minimal code changes. So
25:44
that same person who has
25:46
to use BigQuery for production can
25:49
take a sample of that data, put
25:51
it in DuckDB, or just like download a sample
25:53
as a Parquet file from BigQuery. And
25:56
then on their, you know, on their whatever, their
25:58
laptop, they can sit there and do analysis
26:00
with DuckDB, build that using Ibis,
26:02
right, and then run that same thing
26:04
against the BigQuery backend. All
26:06
those experiments, exactly, are going
26:09
to translate. Yeah, exactly. Yeah, cool.
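The write-once, run-on-either-backend workflow described here can be sketched in plain Python. The standard library's sqlite3 stands in for both DuckDB (local dev) and BigQuery (production); the `top_categories` function, the `sales` table, and the data are made up for illustration, and the real Ibis API differs.

```python
# Sketch of the portability idea: define the analysis once, run it
# against whichever connection you're handed. sqlite3 is a stand-in.
import sqlite3

def top_categories(con):
    # The "analysis" is written once, independent of which backend runs it.
    cur = con.execute(
        "SELECT category, SUM(amount) AS total "
        "FROM sales GROUP BY category ORDER BY total DESC"
    )
    return cur.fetchall()

def make_backend(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (category TEXT, amount INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

# "Dev" holds a small downsampled copy; "prod" holds the full data.
dev = make_backend([("a", 1), ("b", 5)])
prod = make_backend([("a", 100), ("b", 500), ("b", 1)])

assert top_categories(dev) == [("b", 5), ("a", 1)]
assert top_categories(prod) == [("b", 501), ("a", 100)]
```

The point is that the experiment done against the small dev copy translates unchanged to the production backend.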
26:11
Yeah, that seems like the essence
26:13
of data science in some ways. Like,
26:15
you kind of want to
26:18
allow the data scientist a chance
26:20
to be able to play
26:22
with things and mess around with things, and
26:24
the ability of that portability lets them,
26:26
you know, do that in
26:29
a circumstance that's, you know,
26:31
less intense, if you will, and less
26:33
costly. Right, like every time you run,
26:35
you know, a query on Snowflake, or...
26:37
I mean, BigQuery has
26:39
some stuff where you can have a
26:41
fixed, you know, amount of compute and you
26:43
just, you know, pay up front for that.
26:45
But fancier sort of
26:47
pricing models aside, it will be
26:49
costly for you to run your query
26:52
against all of prod. Yeah, you know,
26:54
as opposed to downsampling it, and
26:57
then you can do whatever you want with it as many
26:59
times as you want. Right, yeah, that makes sense. So
27:01
that ends up being a good... we've
27:03
heard from our users that
27:06
they do this. What's interesting
27:08
to me about this idea of adding
27:11
the support, or generally supporting
27:13
lots of these back-
27:15
ends, is that, if
27:19
I'm not calculating this wrong, I
27:21
feel like you
27:23
would normally need a bunch of
27:25
third-party Python libraries to build
27:27
those connections to the databases, and they
27:30
don't then have a
27:32
robust way for
27:34
the data frame library to directly connect
27:36
to it. And so that's kind of why
27:38
you guys are going this extra
27:41
mile of, like, well, we're going to support
27:43
the backend and have our own
27:45
connection to it, as opposed to it
27:47
being an additional component that has
27:49
to be added in. Is that part of the thinking
27:51
there? Yeah, so the way that we...
27:54
typically, for SQL
27:57
backends anyway, most
27:59
of the backends have what's
28:02
called the, I
28:04
forget what the name of it is, the DB
28:06
API, which is, there's a PEP, a
28:08
Python PEP for this. It's
28:10
a set of classes and methods and
28:12
exception types that a library
28:14
needs to implement if it wants to say
28:17
that it's kind of Python DB API compatible.
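The interface being described is Python's DB API (specified in PEP 249). The standard library's sqlite3 module implements it, so its basic shape, connect, cursor, execute, fetch, plus a standard exception hierarchy, can be shown without any third-party driver.

```python
# The DB API (PEP 249) shape shared by these driver libraries:
# connect() -> connection -> cursor() -> execute() / fetchall().
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
con.commit()

cur.execute("SELECT SUM(x) FROM t")
rows = cur.fetchall()  # list of row tuples, per the spec

# The spec also standardizes an exception hierarchy rooted at Error,
# so generic code can handle driver failures uniformly.
try:
    cur.execute("SELECT * FROM missing_table")
    caught = False
except sqlite3.Error:
    caught = True
con.close()

assert rows == [(6,)]
assert caught
```

A vendor driver like the Snowflake or BigQuery client exposes this same surface, which is what lets one tool talk to many databases through it.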
28:20
Okay. And so we use the
28:22
various like vendor libraries or open
28:24
source libraries for these things. You
28:26
know, like snowflake has a thing,
28:28
BigQuery has a thing. Okay.
28:31
You know, there's pyodbc, which we use
28:33
for MS SQL and
28:35
so forth. And yeah, so we
28:37
don't write the
28:39
thing that encodes like, you
28:42
know, whatever the database protocol that sends,
28:44
you know, the query and the data
28:46
using, whatever, the MySQL wire format.
28:49
We don't write that. Okay. We
28:51
use off the shelf open source tools
28:53
to handle like connections and so forth.
28:55
All right. Good. Yeah. What we
28:57
built is the, the sort of the SQL
28:59
kind of compilers that take the data frame
29:01
API and turn it into the
29:04
SQL code. Okay. And some
29:07
of that is then the flavoring, if
29:09
you will, of those different types of
29:12
databases. Well, funny story.
29:14
We used to write, we
29:16
used to have this sort of hybrid chimera
29:19
world where some of our translation was done
29:21
using SQLAlchemy and some of it was
29:23
like handwritten, like we would actually write the
29:25
strings. Okay. In the
29:27
next release, we've kind of gutted all
29:30
of that and unified our compilers
29:32
around another library called SQLGlot, which
29:34
has support for all of the dialects
29:37
that we use. Okay. Like polyglot, that's
29:39
where the name's coming from. Yeah. Yeah.
29:41
So we're taking
29:43
our Ibis expressions and then turning
29:46
them into SQLGlot things. And
29:48
then that turns into the correct SQL dialect.
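As a rough illustration of what an expression-to-SQL compiler does, here is a toy sketch. This is not Ibis's or SQLGlot's actual implementation; the expression classes are invented, and the "dialect difference" shown is just identifier quoting.

```python
# Toy expression tree compiled into two SQL "dialects": one expression,
# many backends. Real compilers handle vastly more than quoting.

class Expr:
    def __add__(self, other):
        return Add(self, other)

class Col(Expr):
    def __init__(self, name):
        self.name = name

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

def compile_expr(node, dialect):
    if isinstance(node, Col):
        # Hypothetical dialect difference: identifier quoting style.
        if dialect == "postgres":
            return f'"{node.name}"'
        return f"`{node.name}`"
    return f"{compile_expr(node.left, dialect)} + {compile_expr(node.right, dialect)}"

expr = Col("a") + Col("b") + Col("c")

assert compile_expr(expr, "postgres") == '"a" + "b" + "c"'
assert compile_expr(expr, "hive") == "`a` + `b` + `c`"
```

The same tree compiles to different strings per dialect, which is the shape of the Ibis-to-SQLGlot-to-dialect pipeline described above.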
29:50
Okay. Cool. Yeah. So this
29:53
is kind of like you, this
29:55
is where you sort of jumped into in the sense
29:57
that you were doing this for, you said
29:59
Postgres, right? That's right. Yeah,
30:02
yeah. You can imagine
30:04
the sort of approaches of, like, well, how
30:06
are we going to handle this stuff. One
30:10
of the approaches that we talked about
30:12
right away was like this idea like
30:14
I have existing sequel commands and I
30:16
wanna use these queries in
30:19
a library that supports that
30:21
methodology along with kind
30:23
of typical data frame stuff. Also,
30:25
like, maybe we can talk about that
30:28
a little bit. What's
30:30
the difference there? Like, what's involved in
30:32
running a normal standard SQL command, and what
30:34
kind of output do you get? Sure.
30:37
So there's generally, like,
30:39
two, I think, ways in
30:42
which people use what
30:44
are lovingly called raw SQL
30:47
statements. Okay. The first one
30:49
is to run commands that don't produce
30:52
output, like creating a table. When
30:54
you create a table, it's just like
30:56
mutating some state somewhere on disk, maybe
30:58
writing the name to a catalog,
31:01
etc. But it
31:03
doesn't produce a thing, right? On the
31:05
client side, you just run the statement, and
31:07
if it runs, it runs, or you
31:10
get an exception. So
31:12
we have a method literally
31:14
called raw_sql, and you give
31:16
it a string and it's gonna give you back
31:18
like, whatever the DBAPI
31:20
would give you back. That's pretty
31:22
bare. It's pretty low level. Pretty bare
31:24
bones. You can manage everything yourself. That's
31:27
there as like an escape hatch, and,
31:29
just as an aside, we
31:32
have a few, we have like a
31:34
few tiers of escape hatch, because people
31:36
wanna do things with different levels
31:38
of abstraction. Sure. So raw_sql
31:40
is the lowest level of abstraction. You're
31:43
like, right, here's a single string, do
31:45
the thing with the driver. Mm-hmm.
31:47
And then, and
31:49
you can run select statements, but you're going to
31:51
be, you're gonna manage it, like pulling back
31:53
the list of rows and all that stuff
31:55
yourself. Okay, so it's popping
31:57
back an object of sorts?
32:00
Yeah, it's going to give you back like
32:02
some kind of capital R result thingy or
32:05
it's sort of it's very
32:07
backend specific because the drivers are
32:09
necessarily returning backend specific objects. Okay.
32:13
The next I guess level up of
32:15
abstraction is like
32:17
you handing the connection a
32:20
select statement. So now like
32:22
we've restricted the level like the
32:24
SQL statements you can run because
32:27
if you give us a select statement, we can actually
32:29
just build an IBIS expression from that. Yeah.
32:32
Okay. All we need are the
32:35
column names and the types and then you've got this
32:37
sort of opaque blob. It's like this is going to
32:39
be the first thing. It's a table, you
32:41
know, and you can run your query that
32:43
way. You get back a
32:46
tape, an IBIS expression, like a table expression, and
32:48
then you can start working with that thing as
32:50
if it were just a regular old IBIS table.
32:54
Just a use case where you're like, I've
32:56
got a huge pile of existing SQL, bunch
32:59
of select statements and I
33:01
want to start like using IBIS, but like all
33:03
the stuff to set up my existing tables and
33:06
so forth exists. I don't want to rewrite that
33:08
in IBIS yet. Maybe you do later,
33:10
but you don't now. So
33:12
that's like the dot SQL method on
33:15
the backend object. We
33:17
have one more SQL escape hatch, which
33:20
is definitely our sort of like fanciest
33:23
escape hatch. Okay.
33:26
And this is a SQL method
33:28
on the table expression itself where
33:30
you can actually run SQL
33:33
against the
33:35
IBIS expression that precedes it. Okay.
33:39
Which is kind of nutty, right? Like you're somehow
33:41
taking this Python code and getting it into
33:43
the database and then you can mix and
33:45
match too. So you can go into SQL
33:47
and out of SQL and back to
33:49
Ibis, et cetera. Okay. And
33:51
that escape hatch is for the use
33:54
case when the IBIS doesn't have an
33:56
API to do what you want, but
33:58
the database does. It's
34:00
something in the database you know you need.
34:03
So you would use that as a patch
34:05
for that use case. So
34:07
then the whole other approach of
34:09
working with it is
34:12
in much more of a
34:15
data frame centric methodology,
34:17
is that right? Yep, yep.
34:19
So things sort of, I mean, they,
34:22
I would say
34:24
they look and feel
34:26
pandas-esque, you know, it's not really. Sure,
34:29
yeah. There's a bunch of stuff that we like
34:31
don't implement from pandas and there's a bunch of
34:33
places where the APIs differ and so forth, but
34:35
it's got the flavor of like calling
34:37
methods on a table object. Yeah.
34:41
So, you know, group by
34:43
join. Ibis was very
34:45
inspired by an R library called dplyr.
34:47
And so we take a lot of
34:49
the sort of the words and verbs
34:52
and nouns from dplyr like mutate and
34:54
select. So that's
34:56
quite a divergence, I think, from
34:58
pandas. I'm a fan too, because
35:01
that's my other weird like jaunt into like programming
35:03
that I kind of got into late in life
35:05
is I worked in a marketing job and they
35:07
were like a dual-language house and they
35:10
hired me on to be like a
35:12
Python like automation person. And
35:15
they had a bunch of R stuff running
35:17
too. And I was like, I'll learn it.
35:20
Sure, yeah. And so I loved the
35:22
whole concept of the tidyverse. I love the
35:24
concept of dplyr and I was able to
35:26
start writing the
35:28
sort of connected statement sort of stuff
35:31
piping that that made
35:33
sense in my mind so clearly,
35:36
especially the stuff I was working with.
35:39
And so that's kind of one of those things I think
35:41
is very interesting that you guys have almost like, where
35:44
are you coming from? Welcome to Ibis.
35:46
Right, right, right. Well, that's sort
35:49
of that's exactly what we're going for. We're definitely
35:51
going for like that. That's sort of like the
35:53
piping kind of experience that dplyr
35:55
has where I mean, you know, like
35:57
R has the sort of native pipe
35:59
operator now, but before they used to have
36:01
just the like percent, you know, angle
36:04
percent thing. Yeah, it's like a greater than
36:06
sign or whatever. Right, right. And
36:08
so in Python, we already, we
36:10
have the dot operator, right? And so instead
36:13
of piping, like we have dot and so
36:15
we're definitely going for that like, you know,
36:17
fluent design API where you
36:19
can just chain stuff and then you build up these
36:21
big chains and it gets all sort of compiled into
36:23
SQL, very heavily inspired
36:26
by dplyr. We have
36:28
like pivot_wider and pivot_longer; like, we
36:31
have a feature called selectors, which is 100%
36:34
like stolen, like, you know, not
36:37
stolen, I mean, it was, anyway,
36:39
very heavily inspired. Like I
36:42
implemented that. And when I implemented that,
36:44
I actually ported the test suite from
36:46
the selectors test into
36:48
Python. So I could be like,
36:50
this does this behaves like the
36:53
exact same way in Python. Cool.
36:56
Yeah, yeah. I and I was a big fan of
36:58
the mutate. I
37:00
just like, it was like such
37:02
a pain in Python to do that, at least
37:04
at the time when I was playing with it. And so
37:07
that was one of those things where like, it
37:09
just seemed like a lot of overhead to do something
37:11
where I'm just and I was working with a lot
37:13
of text, which again,
37:15
pandas talking to Wes about it,
37:18
definitely came from like finance,
37:20
if you will. Right, right, right. And then, you
37:22
know, somewhat, you know, numbers and, you know,
37:25
kind of dealing with that stuff, and pandas
37:27
tied to the back end of NumPy. And so like,
37:29
text was always kind of like, yeah, you can do
37:31
it. So
37:33
I kind of appreciate that. And it's definitely
37:35
gotten better and better. But it's definitely something
37:37
that I see right away. And
37:40
I guess it's nice. Yeah, we try to, I
37:43
guess, one of the different main
37:45
differences between like the database world
37:47
and like NumPy comes from like
37:49
numerical and scientific computing, which, right,
37:52
maybe nowadays is dealing with a lot more strings. But you
37:54
know, back in the day, strings
37:56
were kind of an afterthought. Right, right. And like,
37:58
it's coming out of a tradition of
38:00
tools like Matlab where they're
38:03
very heavily focused on matrix
38:05
math. Everything's an
38:07
array, et cetera. Optimized for that. Yeah, exactly.
38:10
Yeah. And so, but in the database world,
38:12
like strings have been a thing from day
38:14
one because, you know, you work for a
38:16
bank or you work for a law firm
38:18
and look for these things like we're dealing
38:20
with lots of texts and
38:22
descriptions of things. And yeah. Yeah. And so
38:25
anyway, yeah. We try to do right by the
38:27
string. That's
38:30
great. One of the things that's
38:32
interesting about this whole process is that, and I don't
38:35
know where I saw the statement, but I know it's somewhere in
38:37
the, either having talked
38:39
about it or kind of,
38:41
you know, discussing it is this idea of
38:43
getting close to the data as possible. And
38:45
I feel like, is that something that by
38:49
kind of recreating these functions and so
38:53
forth, like this, this functionality of like, you can
38:55
write these statements, chain them all
38:57
together, and then it's going to again, rewrite it
38:59
and at least process it in a
39:01
way that it's now like a SQL statement. Is that part
39:03
of that? Like this idea of like, I want to be
39:05
able to get in and work with data and anybody
39:08
who's worked with SQL for a long time, like having
39:11
to have an abstraction layer is, it's
39:13
always kind of hard as a transition. And
39:15
I feel like that's something that,
39:18
you know, you're obviously, we talked about three different
39:20
methodologies of ways that people can approach it, but
39:22
is that part of like what you mean by
39:24
like getting close to the data as possible or
39:27
what exactly do you mean by that? Yeah. So
39:29
getting close to the data is really about making
39:32
sure that you're computing
39:35
in the most efficient way. Okay.
39:38
So I think traditionally or at
39:40
least like I've definitely done this in the
39:42
past where I just
39:45
ran like pandas.read_sql, I
39:48
gave like, I gave it a select star
39:50
and then you're, you're like
39:52
pulling however many whatever bytes
39:54
back to your local machine. Right.
40:00
you're doing a computation with pandas for better
40:02
or worse. When we talk about like
40:05
being close to the data, we're talking
40:07
about like the computation occurring on the
40:09
engine sort of that knows how to
40:12
do that best. And optimize it
40:14
already. Exactly. So let's
40:16
take Snowflake, for example. Snowflake
40:20
is the one that knows how to operate on
40:22
tables and snowflake the best, right? So
40:24
okay, pulling a table back from
40:26
snowflake and then doing your computation and pandas
40:28
if it can be expressed in SQL is
40:31
pretty inefficient, right? You're gonna pay egress costs
40:34
from and yeah so and
40:37
you know if data it's like new data
40:39
arrives like now you're gonna have to pull
40:41
that back again and anyway it's just it's
40:43
sort of it becomes both
40:45
prohibitive in time, space and
40:47
dollars. Yeah it's interesting.
40:49
I feel like it's a related conversation to
40:51
you know what's
40:53
happening with with Arrow and the
40:56
idea of like let's not have
40:59
to go through a translation layer each
41:01
time to look at this information if we can kind
41:03
of all agree and and that's
41:05
definitely part of this platform also, right?
41:07
Yep totally and the idea
41:10
like one of the things that IBIS it
41:13
makes it possible to do this because we're
41:15
just saying hey database like
41:17
here's the query like take care of it
41:19
just give me
41:21
the results. Okay. So we don't
41:23
have to we don't have to pull anything back
41:26
we don't need to bring anything into
41:28
memory until it's like the final result
41:30
that you asked for and even then
41:32
like you actually
41:34
have to opt explicitly into doing that
41:36
by calling a method like it's like
41:38
let's say somebody
41:41
wants to pull back, you know, a billion rows.
41:44
it's possible with IBIS you
41:46
have to kind of like opt into it you have to
41:48
call a method that says like hey give me back all
41:50
the data. This
41:56
week I want to shine a spotlight on
41:58
another real Python video course. It
42:01
covers how to create interactive geographic
42:03
visualizations that you can share as
42:05
a website. The course
42:07
is based on a real Python tutorial by
42:10
previous guest, Martin Breuss. It's
42:13
titled Creating Web Maps from Your
42:15
Data with Python Folium, and
42:17
it's presented by video instructor Kimberly
42:20
Fessel. And she shows you how
42:22
to create an interactive map using
42:24
Folium and save it as an HTML
42:26
file, how to choose from
42:28
different web map tiles, how
42:31
to anchor your map to a specific
42:33
geolocation, and bind data to
42:35
a GeoJSON layer to create
42:37
a choropleth map, and then
42:39
how to style that choropleth map. She
42:42
also shows you how to add points of interest
42:44
and other features. Learning how
42:46
to build interactive visualizations is a worthy
42:48
investment of your time, and sharing
42:50
standalone web pages is a great way
42:53
to get your users to understand and
42:55
dig into the data. And
42:57
like most of the video courses on
42:59
real Python, this course is broken into
43:02
easily consumable sections. Each
43:04
lesson has a transcript, including closed captions.
43:07
And you'll have access to code samples
43:09
for the techniques shown, in this case,
43:11
a complete interactive Jupyter notebook. Check
43:14
out the video course. You can find a link in the show
43:16
notes, or you can find it
43:18
using the search tool on realpython.com. So
43:25
you've kind of dug pretty deep into the
43:27
functionality and kind of the background of maybe
43:29
where people are coming from in different
43:32
libraries and so forth. And
43:34
it's always hard in an
43:36
audio podcast to explain a lot
43:38
of this stuff. One of the things I think is
43:41
interesting is you've created this YouTube series,
43:43
which I don't know if it's IBIS
43:45
specific, but your series is what,
43:47
Philip in the Cloud, right? Philip in the Cloud. I
43:49
love the name. Because my last name is Cloud. Yeah,
43:53
exactly. Quite
43:56
apropos for somebody who works
43:58
in data these days. Yeah,
44:00
so what are the types of things that you
44:02
cover in the your YouTube channel? Definitely
44:06
all Ibis right now. Okay, let's
44:09
see So we've covered we've
44:11
covered like some integrations with
44:13
other tools We've
44:16
covered various Ibis features
44:19
I've done a couple of like live
44:22
like early early on when I when I
44:24
started it I've done I did
44:26
a couple like sort of live debugging
44:28
sessions or like I Was
44:31
like I'll demo this feature and then it's like oh
44:33
it didn't work in this way for this reason So
44:35
I would like sit there and try and figure out
44:37
what was happening. Okay, that's always
44:39
an interesting one. Yes,
44:41
that's been fun and
44:44
then you know newer newer features
44:47
Yeah, it's sort of like a
44:49
grab bag of Ibis, you know,
44:51
functionality, new stuff.
44:54
Okay. Yeah mix of stuff. Cool. One
44:57
of the things I think about especially with our
44:59
that I thought was interesting is
45:01
that it came with, you know, at
45:03
least some of the basic tools had like
45:05
example data in it. And I feel like
45:08
this definitely is in the same boat there, in
45:10
my thinking of that, right? That you have some
45:12
stuff that people can kind of play around with, just
45:15
the library with a few built-in
45:17
sort of data points. Or
45:19
do you have to download those separately? Ish.
45:24
Like, it's sort of a mix of yes to
45:26
all those answers. Okay. All
45:28
right. To all those questions. We'll
45:31
provide links to a guide We
45:34
have we have like on our landing page
45:36
Ivis project org a way that
45:39
you can can get started like right away with examples
45:41
And it's got like rubble
45:43
like you stuff that if you follow it the
45:45
sort of one-line install You
45:47
should be able to copy paste and run that
45:49
code. We Forget exactly when
45:52
we add this but a while ago we
45:54
added like an Ivis dot examples module Okay,
45:57
and like we again borrowed shamelessly
46:00
from R, and literally we
46:02
like have an R script that like pulls
46:05
the data out from like a few
46:07
packages and like puts it into like
46:10
a bucket a cloud bucket okay
46:13
and so when you call
46:15
like ibis.examples.penguins.fetch it's gonna pull
46:17
down that example from the
46:19
cloud bucket and give
46:21
you back an ibis expression. Okay interesting
46:24
so it's kind of a little convoluted
46:26
but it's doing the work for you as long
46:28
as you have the internet connection. Yep you need
46:30
the internet connection and that's
46:32
only because we didn't want to ship data
46:35
in our package. Yeah no no it's
46:37
gonna be bigger. Yeah we have a
46:40
couple bigger datasets up there
46:42
as well, like a subset of the IMDB data.
46:44
Yeah yeah that's the one I see that's
46:46
interesting. And then some of those are in
46:49
parquet I believe because they're just so
46:51
much smaller than if they
46:53
were in like TSV or whatever. But
46:56
yeah you can get started with those we've got
46:58
a variety of different data sets we've
47:00
got sort of the the R classics
47:02
like mtcars and Palmer penguins. Yeah.
47:04
Then we've got some more we've got
47:07
some like World of Warcraft data up
47:09
there as well. Okay. Like gaming data
47:11
there's a bunch. Yeah it's
47:13
nice it's always fun to kind of get
47:15
to playing with things that have
47:18
weight to them that you can kind of actually play around
47:20
with it's not like randomly generate
47:22
a bunch of numbers for me which I've
47:24
seen a lot of demonstration stuff and it's
47:26
like all right my eyes are glazing over
47:28
sorry. Yeah we want people to be able
47:30
to interact with like a real data set
47:32
in like with as little
47:34
initial friction as possible right. So we're not
47:37
gonna hand we're not gonna be like oh
47:39
download this like example 3 terabyte data sets
47:41
like, okay, you know, it's like
47:43
10% of the people who would use it
47:45
is gonna be able to like store that on disk. So
47:47
we're like we
47:49
like to use the Palmer penguins, so you know, shout out
47:51
to the the authors of that
47:53
paper who have generously provided this
47:56
data. It's like it's like a small
47:58
data set but it's interesting. Yeah, yeah.
48:00
And then it's got like, you know, it's got- Lots of
48:03
interesting fields. Exactly. So yeah, and there's, there's
48:05
just a, it's a
48:07
rich enough dataset that we can say,
48:09
we can demo a lot of features
48:11
of IBIS. Right. Using
48:14
that. And then, you know, when
48:16
you want to get into some fancier stuff,
48:18
like with arrays and structs, like maybe you
48:20
switch over to IMDB dataset, cause you know,
48:22
they've got sort of, they've
48:25
got some stuff where you can, yeah, process
48:27
a field into an array and start, you
48:29
know, messing around with like unnest and
48:31
other kind of more advanced features
48:34
of IBIS. That's definitely
48:36
a database that would have the many to
48:38
many relationship kind of stuff happening. Oh
48:41
yeah, yeah, no, it's the, and the
48:43
way they encode the relationships is sort
48:45
of interesting because everything's got a key,
48:48
but then some of the, some of the
48:50
things that are, there's some
48:52
pre-joins that happen. Okay. I
48:55
don't, I mean, I don't know exactly how
48:58
that data's generated, whatever. Some
49:00
engineer at IMDB doing it. Yeah,
49:03
I really had to think about it, yeah, totally.
49:05
Yeah, there's definitely some fields where like, I
49:08
forget, I think it's like roles, the
49:11
roles that a particular person took
49:13
on in, Right. in a
49:15
given movie, like there's, you know, that can be like
49:17
sort of turned into an array and
49:20
you can imagine that, yeah. Yeah,
49:22
it's kind of funny cause like, yeah, that person could
49:24
be, you know, have multiple roles
49:26
in a particular movie, you know, or it
49:28
could be played by a different person. Yeah,
49:30
it's like, oh, there's lots of interesting things,
49:33
like, they're at different ages. There's a lot
49:35
of weird stuff to think about,
49:37
like laying out a database like that. So it's just
49:39
a fun one to look at to like say, oh,
49:42
I don't know if I'd model it,
49:44
like exactly like that, but. Yes, yes. And
49:47
it's also full of, I guess what I
49:49
would consider like junk, but interesting junk because,
49:51
Okay. People's birth dates are like,
49:53
you know, year 40 or something like that. Like that
49:55
is, it's sort of like stuff that doesn't
49:57
really make a whole lot of sense. Okay.
50:01
But it's nonetheless interesting to poke around and see
50:03
if you can kind of figure
50:05
out what went wrong there or guess,
50:07
you know, it's like data detective kind
50:09
of thing. Yeah, yeah, exactly. Yeah. That
50:12
sounds like there's some pretty good resources there.
50:14
You mentioned the landing page
50:16
for that. Are there, along
50:19
with the YouTube series where you're kind of doing
50:21
live demonstrations of working with the library and
50:24
working with data and trying things out, interacting
50:27
with people in that. I've
50:29
seen you've had a few guests also. What
50:31
else would you suggest for somebody who's interested
50:33
in checking out the library? Like what are
50:35
other resources for them? Let's see.
50:38
I would say, I mean, the best resource,
50:40
and we've put a
50:42
lot of hours into this, is our
50:44
website, which is also our documentation. Yeah,
50:47
the API stuff on there is great. Yeah.
50:49
Yeah, I would also suggest, like we've also put
50:52
a good amount of effort into getting like a
50:54
GitHub, like a working
50:56
GitHub code space set up so that somebody
50:59
can say, create a code
51:01
space. That'll just put you into
51:03
a VS code, a browser,
51:05
like VS code that has all the
51:07
dependencies installed and you can start running
51:09
Ibis examples right away directly from
51:11
the shell. Like you just
51:13
fire up Python, copy paste the code
51:15
from the website and you're off to the races. Nice.
51:18
Yeah, maybe we can share some links at the end then. Yep.
51:22
Yeah, which I'll definitely include. We're also
51:24
looking to, no promises, but we're potentially
51:26
looking at like, being able to give
51:28
like a, you know, an in browser,
51:31
like interactive Ibis shell. So somebody wouldn't even
51:33
have to fire up a code space or
51:35
install anything. They could just like run
51:38
our examples or some of
51:40
them, like in their browser. Using
51:42
something like WASI or... Yeah, Pyodide,
51:45
which is like the in browser
51:47
Python interpreter. It's
51:51
frankly magic, but
51:54
it's awesome. Yeah, we're living in
51:57
interesting times. Yeah. Yeah. Yeah,
51:59
I'm interested in that for a lot
52:01
of reasons. I've had Brett Cannon on
52:03
the show a while back to talk about it, and
52:05
he's been very involved in trying
52:07
to make it a supported target
52:10
tier for Python. And I mean,
52:13
I keep kind of watching the
52:15
space and seeing what's gonna
52:17
happen next. His updates, it'll be like
52:20
maybe a quarter of a window of, like,
52:23
text, but it's all links. He's
52:25
like, here's where to
52:27
go look to learn more and so forth. So
52:29
it's not narrative, but there's a lot of effort. And, well,
52:31
yeah. But yeah,
52:33
there's a lot of work happening there. Yep, yeah.
52:36
Okay, so we mentioned the website, we mentioned
52:38
YouTube. We said we'd get some links for
52:40
people to experiment and try things out on. There's
52:43
lots of those cool examples that people can kind
52:45
of try out, and it will download the data for
52:47
them to work with, unless they want to
52:49
go and find a bunch of data. Maybe we
52:52
can talk about, and I know
52:54
it's like a laundry list, but maybe to start,
52:56
some of what are the backends that it does
52:58
support? Like, we mentioned DuckDB
53:00
and Postgres and Spark, and
53:02
I'm trying to remember all the ones we mentioned so far,
53:04
but it's quite a few. We have
53:07
like a development command, I
53:09
suspect it's called like list-backends,
53:11
and it literally just prints out a
53:13
list of them, because it changes, and
53:15
I use it sometimes. Yes. This
53:17
does come up from time to time,
53:19
and I won't go through all of
53:21
them. The one that's always
53:23
in my mind is DuckDB,
53:25
of course, because it's the one we
53:27
use and interact with a lot. But like,
53:29
there's BigQuery; ClickHouse is one
53:31
that I think we've got a number
53:33
of people using; Dask, DataFusion, Druid,
53:36
Exasol, Flink,
53:38
where we've sort of dabbled in the
53:41
streaming world; Impala, which is sort
53:43
of like the original backend,
53:45
that was like the primary
53:47
backend that Ibis
53:49
was developed for. Microsoft
53:53
SQL Server, MySQL, Oracle,
53:56
the usual suspects
53:58
there. Yeah. The Polars
54:00
backend, if you can believe it.
54:02
Okay. I'm happy to
54:04
talk about Polars if
54:07
you want to, but there's the PySpark,
54:09
the Snowflake, Trino, SQLite...
54:12
a bunch. Yes! And we talked
54:14
about lots of these different entry
54:16
ways into the platform. People that are coming
54:18
from R should
54:21
have a fairly friendly experience, and have
54:23
kind of like a guide for them,
54:25
like, here's what you should expect. A lot of
54:27
that was written by, yes, yeah,
54:29
awesome, great, one of our
54:31
colleagues at Voltron Data. Love that.
54:34
So you also have a version for
54:36
people that are much more Python-based,
54:38
and then people that are maybe coming from
54:41
straight SQL. Those are the
54:43
three major ones, I think, if I'm not
54:45
wrong. I think we're thinking about
54:47
adding one for people coming from PySpark as
54:50
well, since that's
54:52
another place where
54:54
people have spent a lot of time,
54:56
and so there'll be like a way to
54:58
come to Ibis from that. We
55:01
haven't really talked about the project itself.
55:03
Are you being supported to work on
55:05
this? It is an
55:07
open source, yep, tool. Is that entirely
55:10
through Voltron Data or
55:12
through something else? Yes, Voltron
55:14
Data is like the primary financial supporter
55:16
of Ibis. Okay. We have, at
55:20
last count, it's not that many, I just
55:22
don't remember, but there
55:24
are, I think, six or
55:26
seven full-time people working on
55:28
different aspects of
55:30
Ibis. And then
55:32
we've got a few people
55:35
from outside of Ibis that
55:37
contribute. We've got a person at
55:39
Google. We've got a person
55:41
who is just a very
55:43
enthusiastic user we recently, like,
55:45
made into a committer. And
55:49
so, at its core, it's
55:51
supported by Voltron Data, and
55:53
then we have, like, we're trying
55:55
to grow the developer community, and so we
55:57
wanna see more contributors from
56:00
outside Voltron Data who are
56:02
interested in contributing, especially for
56:05
backends that some of us may
56:07
not know a lot about. Yeah,
56:09
I can imagine that can be tricky depending on
56:12
the history of the backend. There's
56:14
just one of the unique
56:17
development, let's call
56:19
it experiences that one may have when
56:21
working on Ibis is having
56:24
to deal with the idiosyncrasies of
56:26
20 execution engines,
56:30
especially around all the fun,
56:33
but not really that fun edge
56:35
cases of null handling. There's
56:38
just a lot of different stuff there, how
56:40
they happen to do floating
56:42
point rounding. That
56:45
differs among each of these. There's
56:49
a lot of interesting details there, but yeah,
56:52
it can be quite tiring. At
56:55
the end, you end up with some knowledge about
56:57
how 20 systems work, but you're like, where
56:59
am I going to use this except for Ibis? If
57:03
I'm an Ibis developer, it's useful. Yeah,
57:06
hardly anyone has 20 unique
57:08
databases in production. I
57:11
have an odd duck question that I
57:13
wondered about, and I didn't dig
57:15
deep into the documentation, but you
57:18
talk about this idea of it taking
57:20
what you've written and it
57:23
generating the SQL that then is used
57:25
on that backend. Is there a
57:27
way to have it output it
57:30
also as that actual SQL
57:32
query? Absolutely. Great.
57:35
So there's a couple
57:37
of ways that you can do that.
57:40
We have this top-level function that's like
57:42
ibis.to_sql. You give it an
57:44
Ibis expression and optionally a dialect that
57:46
you want it to generate, and
57:49
it gives you back a SQL string. If
57:54
you're in an IPython or a Jupyter
57:56
Notebook, it will actually syntax
57:58
highlight that output. And you
58:00
can see it in a little bit
58:02
more readable way. Yeah,
58:05
so adding on to
58:07
the portability. Yep. And
58:11
the idea with that is you can get
58:13
something that can be used as a SQL string, but then if
58:16
you just want to look at your SQL, you also
58:18
get the syntax-highlighted thing. You
58:20
can turn it to whatever dialect Ibis
58:23
supports. I think that
58:25
is maybe a form of debugging, too, potentially. Oh, we
58:27
all, all of us Ibis developers, use it all the
58:30
time in that way. Okay,
58:32
yeah, cool. We
58:34
also have a compile method. So
58:37
the to_sql one is like, in
58:40
some ways, it's very aesthetics-focused, right? It's
58:43
going to do pretty printing of the SQL. It'll
58:45
indent it and all this stuff. Sure. The
58:48
compile method is a little bit more raw. It doesn't
58:51
do any pretty printing. It's not
58:53
very readable. But
58:55
if you want to get exactly what's going on in
58:57
the database, that's what you would print out.
59:00
I guess a little bit like how whatever
59:03
CSS files or HTML files could
59:05
be all space removed. Right,
59:08
it's not quite that level of craziness,
59:10
like where you're JavaScript minification is
59:12
not at that level of insanity,
59:14
but it's towards that direction. Cool.
59:24
So I have these questions I'd like to ask everybody who
59:26
comes on the show. The first one is, what's something that
59:28
you're excited about that's happening in the world of Python? There's
59:30
a few things. Okay. So
59:33
I know Pyodide is not particularly new,
59:35
but I am definitely just very
59:38
interested in that. I'm excited about where
59:40
it's heading. Yeah, yeah. I know Peter
59:43
Wang from Anaconda, has he
59:45
been on the show? I
59:49
invited him literally moments
59:51
after he walked off the stage at PyCon,
59:53
and we still have yet to connect, and
59:55
so I've got to try again. Yeah, he's
59:57
awesome. a
1:00:00
character, hilarious guy. Anyway,
1:00:03
I know he was like
1:00:05
a long time ago. He's like, why can't
1:00:07
we run Python in the browser and then
1:00:09
whatever fast forward a decade or two and
1:00:11
now you can. So that's pretty exciting to
1:00:13
me. SQLGlot. I'm
1:00:16
somewhat biased there just because we're heavy users
1:00:18
of it. No, no, it's helping you guys
1:00:20
out. It's a
1:00:22
pretty exciting project. I think a
1:00:24
lot of us working on Ibis were like, it would
1:00:27
be great if like, we didn't have to
1:00:29
write all this translation layer
1:00:31
and like somebody else would do it. And,
1:00:33
uh, and somebody else did
1:00:37
independent of us, you know, trying to control or
1:00:39
anything like that. It just, it showed up one
1:00:41
day and we were like, wow, this is really
1:00:43
something. Yeah,
1:00:46
that's cool. PyCon US is coming
1:00:48
up. I think a bunch of the
1:00:51
Ibis team are going to be there. We're giving
1:00:53
a tutorial. Nice. Some of us
1:00:55
are giving a talk in Spanish at
1:00:57
the Charlas track. Yeah. Yeah.
1:00:59
One of my coworkers is very
1:01:01
involved in that. Okay. Yeah.
1:01:04
So that's great. And then as
1:01:06
usual, there's always some exciting
1:01:09
new stuff in the world of
1:01:11
Python package management, like uv and
1:01:13
Pixi. Yeah. Something
1:01:15
to watch. Yeah. Yeah, exactly. So I
1:01:19
spent a lot of time working
1:01:21
on package management tools in
1:01:24
various capacities, um, okay.
1:01:26
Either like for an application at a job
1:01:30
or just like working with complex
1:01:32
development environments, but you can imagine
1:01:34
Ibis has a lot of optional
1:01:36
dependencies and so like
1:01:38
we need environments. Should I bring you
1:01:40
back on to do a survey with me? Maybe
1:01:42
we can bring a handful of people in. We
1:01:44
can talk about it. Oh man. I think that
1:01:47
would just erupt into, I don't know, violence
1:01:49
or something, because it's just that kind
1:01:51
of topic. Yeah.
1:01:53
Yeah. It's very, very, uh,
1:01:55
opinionated, uh, very much so.
1:01:57
Yeah. So, but like.
1:02:00
I see things like uv. Have you
1:02:02
had Charlie Marsh on the show? No,
1:02:04
no, he's somebody else who's on the list I've
1:02:07
been thinking about. I've been kind of watching
1:02:09
Ruff, too, and him forming the company,
1:02:11
and that's been interesting to kind of
1:02:14
watch too, because it's sort of a similar journey to
1:02:16
a few others. Yeah, I
1:02:18
don't want to call them smaller, but like individuals who said, I
1:02:20
want to make a company and let's turn this into a
1:02:22
thing, and that's hard.
1:02:25
Totally. So I wonder what the struggles are
1:02:27
there. I might actually approach it from that
1:02:29
angle, too. Yeah, no, Charlie's great, talk to
1:02:31
him for sure. And then Pixi, which
1:02:35
is like... It's
1:02:38
like, you know, an analogous
1:02:40
sort of tool, but working, you know, more
1:02:42
closely with the conda ecosystem. Yeah,
1:02:44
yeah, if you don't know
1:02:46
Wolf Vollprecht, I mean, I'm
1:02:48
happy to put you in touch. Yeah, yeah,
1:02:50
he's in the mamba world and all that stuff. Yeah.
1:02:53
Yeah, I'm sure he would be a
1:02:55
good person to talk to as well So like I'm
1:02:57
I'm kind of watching both those tools to see where
1:03:00
things go I mean, I think the
1:03:02
Python community has had some struggles
1:03:05
with various like standards around
1:03:07
package management and just trying
1:03:10
to get some consensus coalescing
1:03:12
on various things,
1:03:14
and it's such a wide target to
1:03:16
hit. Yep. So that's the problem:
1:03:18
it's used in so many different fields
1:03:21
and all these different backgrounds and you
1:03:23
literally have the immediate division of data
1:03:26
science, and, you know, everything happening
1:03:28
with Anaconda and, you know, conda
1:03:31
and all that sort of stuff, versus
1:03:33
yeah, and I think
1:03:35
these tools are coming from a
1:03:38
few decades of learning what
1:03:40
is good, what works, and what doesn't work.
1:03:42
And so they have the
1:03:44
benefit of the hindsight of all
1:03:47
the things that we wish we could change but that
1:03:49
we can't change. And so, like,
1:03:51
a programming language like Rust comes along, and Cargo,
1:03:53
and people are like, oh my god,
1:03:56
this is really how the thing should be.
1:03:58
But they can stand on those shoulders,
1:04:00
man. Right. And so it's like,
1:04:02
tools like uv and Pixi, like,
1:04:05
have all that history to build
1:04:08
on, which I think, you know, somewhat
1:04:11
speaks to their ability to
1:04:13
succeed. Yeah. So, yeah.
1:04:16
Yeah, that's awesome. That's a whole bunch of stuff.
1:04:18
I will definitely add links for all
1:04:20
those items. And I'm very interested in
1:04:22
when people suggest guests because I'm
1:04:25
always looking to add more people to
1:04:27
the roster. So. Sure. What's
1:04:29
something that you want to learn next? Again, this doesn't have to
1:04:31
be about programming. Right now, I'm
1:04:33
currently learning Spanish. Okay.
1:04:36
I live with two native speakers
1:04:38
and one that speaks English. And so I'm
1:04:41
just trying to go like, you know, as deep as
1:04:43
possible as I can there. Okay.
1:04:45
It's an immersion. Yeah.
1:04:50
It's sort of, it's
1:04:52
tough, but it's like, it's very
1:04:55
rewarding. I'm
1:04:57
using a platform called
1:04:59
LearnCraft Spanish, which takes a
1:05:01
different approach than other attempts
1:05:04
that I've made. Okay. I
1:05:06
think a lot of, like a lot of the, a
1:05:08
lot of these sort of app based things. Right.
1:05:11
Right. The Duolingos and such. Yeah. They
1:05:13
don't give you, they don't focus
1:05:15
on fundamentals like grammar.
1:05:17
They focus a lot on vocabulary. So like,
1:05:20
how do I say dog and milk and
1:05:22
whatever? Right. Right. And they can have these
1:05:24
pretty little icons and so forth to trigger
1:05:27
you. They're kind of designed to
1:05:29
like keep you in the app. And then, you know,
1:05:31
I don't know. I mean, I've, I don't want to say
1:05:33
anything like negative, but.
1:05:36
No, no, it's, it's almost the same complaint
1:05:38
people have about tutorials in the Python world.
1:05:40
It's like, maybe you should go
1:05:42
build something. You know, maybe you should go have
1:05:44
an actual conversation. Yeah. Yeah. Exactly. It's kind of
1:05:47
like a different approach. Yeah. It's
1:05:49
like, LearnCraft Spanish takes a very
1:05:51
different approach in that they teach you a
1:05:53
lot of the hardest grammar first. So also
1:05:58
just, I didn't know a lot of the
1:06:00
names for grammatical structures,
1:06:02
like direct and indirect object
1:06:05
pronouns and so forth. If
1:06:07
you'd asked me yesterday, hey, what
1:06:09
would an indirect object pronoun be in
1:06:11
English, I couldn't have told you. Now I can
1:06:13
tell you, but it's only because I learned it
1:06:15
in the context of learning Spanish. So, preference-wise,
1:06:17
like, getting into that,
1:06:19
getting into the harder stuff first
1:06:21
is a whole lot more rewarding, because
1:06:24
you can build the
1:06:26
tools you need to ask for the vocabulary,
1:06:28
right? That's sort of their
1:06:31
whole angle. It's like, whoa, if you just don't know
1:06:33
the word for table, then, like, you
1:06:35
just describe the table, right? Okay,
1:06:38
that's cool. Essentially a new way
1:06:40
of approaching it. Yeah.
1:06:43
So, because you mentioned the
1:06:45
PyCon talk, are you involved
1:06:47
in that, then, trying to use it to your
1:06:49
advantage, or no? Yes, I am.
1:06:51
It's TBD how
1:06:53
involved I'll be. Nothing has
1:06:55
to be delivered yet. My
1:06:58
colleague, who is a native Spanish
1:07:00
speaker, is leading
1:07:02
the charge on that. You
1:07:05
know, hopefully they won't ask
1:07:07
me to say anything complicated.
1:07:10
That's fun!
1:07:13
What's the best way that people can follow the work
1:07:15
that you do online? GitHub.
1:07:17
I don't... I do
1:07:19
a little bit of tweeting, but
1:07:22
mostly it's a
1:07:24
joke. Like, the things I
1:07:26
say are not that serious. So that's
1:07:29
your less-serious networking, then? Yeah,
1:07:31
lots of other stuff. Yeah, the
1:07:33
last major thing I did on Twitter
1:07:35
was an April Fools joke related to
1:07:38
Ibis. Okay. I don't
1:07:40
take
1:07:42
those interactions, I guess, too seriously, man,
1:07:44
but I definitely spend most of
1:07:47
my time online doing the sort of work type
1:07:49
of stuff on GitHub. You
1:07:51
were a convert after your
1:07:53
first experience there that we
1:07:55
discussed? Yep, yep. I've been on
1:07:57
GitHub for a long time.
1:08:00
Yeah, that's great. Well, Philip, it's fantastic
1:08:02
to talk to you. Thanks for coming on the show.
1:08:04
Yeah, thanks, Christopher. Thanks for inviting me. Glad we got
1:08:06
to chat. And
1:08:11
don't forget, this episode was brought to
1:08:14
you by MailTrap, an email
1:08:16
delivery platform that developers love. Try
1:08:19
it out for free at mailtrap.io. I
1:08:24
want to thank Philip Cloud for coming on the show this
1:08:26
week. And
1:08:28
I want to thank you for listening to the
1:08:30
Real Python Podcast. Make sure that you click that
1:08:33
follow button in your podcast player. And if you
1:08:35
see a subscribe button somewhere, remember
1:08:37
that the Real Python Podcast is free. If
1:08:40
you like the show, please leave us a review. You
1:08:42
can find show notes with links to all
1:08:44
the topics we spoke about inside your podcast
1:08:47
player or at realpython.com
1:08:49
podcast. And while you're there, you can leave us
1:08:52
a question or a topic idea. I've
1:08:54
been your host, Christopher Bailey, and look forward to
1:08:56
talking to you soon.