Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
Do you have data that you pull
0:02
from external sources, or that is
0:04
generated and appears at your digital
0:06
doorstep, and that data needs to be
0:08
processed, filtered, transformed, distributed, and much
0:10
more? One of the biggest tools
0:12
to create these data pipelines with
0:14
Python is Dagster. And we're
0:17
fortunate to have Pedram Navid on the
0:19
show to tell us about it. Pedram is
0:21
the head of Data Engineering and DevRel
0:23
at Dagster Labs, and we're talking data
0:25
pipelines this week here at Talk
0:27
Python. This is Talk Python To
0:29
Me, episode 454,
0:31
recorded January 11th, 2024.
0:48
Welcome to Talk Python To Me,
0:50
a weekly podcast on Python. This is
0:52
your host, Michael Kennedy. Follow me on
0:54
Mastodon, where I'm @mkennedy, and
0:57
follow the podcast using @talkpython,
0:59
both on fosstodon.org. Keep up
1:01
with the show and listen to over seven years
1:03
of past episodes at talkpython.fm.
1:06
We've started streaming most of our episodes
1:08
live on YouTube. Subscribe to our
1:10
YouTube channel over at talkpython.fm/youtube
1:13
to get notified about upcoming shows and
1:15
be part of that episode.
1:19
This episode is sponsored by Posit Connect,
1:21
from the makers of Shiny. Publish,
1:24
share, and deploy all of your data
1:26
projects that you're creating using Python: Streamlit,
1:28
Dash, Shiny, Bokeh, FastAPI,
1:30
Flask, Quarto, reports, dashboards,
1:33
and APIs. Posit Connect
1:35
supports all of them. Try Posit
1:37
Connect for free by going to
1:39
talkpython.fm/posit. P-O-S-I-T.
1:42
And it's also brought to you by
1:44
us over at Talk Python Training. Did
1:46
you know that we have over two
1:48
hundred and fifty hours of Python courses?
1:51
Yeah, that's right. Check them out at
1:53
talkpython.fm/courses. Last
1:56
week I told you about our new
1:58
course, Build an AI Audio App with
2:00
Python. Well, I have
2:02
another brand new and amazing course to tell
2:04
you about. This time, it's all
2:07
about Python's typing system and how to
2:09
take the most advantage of it. It's
2:11
a really awesome course called Rock Solid
2:13
Python with Python Typing. This is one of
2:15
my favorite courses that I've created in
2:18
the last couple of years. Python
2:20
type hints are really starting
2:22
to transform Python, especially from
2:24
the ecosystem's perspective. Think FastAPI,
2:26
Pydantic, beartype,
2:28
etc. This course shows you
2:30
the ins and outs of Python typing
2:33
syntax, of course, but it also gives you
2:35
guidance on when and how to use type
2:37
hints. Check out this four and a
2:39
half hour in-depth course at
2:41
talkpython.fm/courses.
2:44
Now, onto those data pipelines. Pedram,
2:48
welcome to Talk Python To
2:50
Me. It's amazing to have you here. Michael,
2:52
great to be here. Yeah,
2:54
we can talk about data, data
2:56
pipelines, automation, and all of it.
2:58
How about you? I have been
3:00
in the DevOps-y side
3:03
of things this week, and I
3:05
have a special, special appreciation
3:07
of it, I can tell already,
3:09
so I'm excited, as you
3:11
can tell. Before we
3:13
get to that, though,
3:15
before we talk about Dagster and
3:17
data pipelines and orchestration more broadly,
3:19
let's just go over your background.
3:21
Introduce yourself for people. How did you
3:23
get into Python and data orchestration and
3:26
all those things, for us here?
3:28
So my name is Pedram Navid. I'm
3:30
the head of Data Engineering and DevRel
3:32
at Dagster; that's a mouthful. I've
3:35
been a longtime Python user, since 2.7,
3:37
and I got started with Python
3:39
out of, I would say,
3:41
sheer laziness. I was working
3:44
at a bank, and there was this rote
3:46
task, something involving going into servers, opening
3:48
up a text file, and seeing if
3:51
a patch was applied to a server. A
3:53
nightmare scenario when there's a hundred servers
3:55
to check and fifteen different patches to confirm.
3:57
Yeah, so this kind of predates, like,
3:59
the cloud and all that automation and
4:01
stuff, right? It does; this is before
4:03
cloud. This is, like, right between Python
4:06
2 and Python 3, me trying
4:08
to figure out how to use print
4:10
statements correctly. I actually learned Python as,
4:12
there's gotta be a better way, and honestly I
4:14
haven't looked back. I think if
4:16
you look at my career trajectory, you'll
4:18
see it's shaped by finding
4:20
ways to be more lazy, in
4:23
many ways, in many aspects. Yeah, who was it,
4:25
I think it was Matthew Rocklin, that
4:27
had a phrase, something like productive laziness,
4:29
or... Yes. And I like that, like, I'm
4:32
going to find a way to leverage my
4:34
laziness to force me to build automation so
4:36
I never, ever have to do this sort
4:38
of thing again. I find that sort of
4:40
thing very motivating, to not have to
4:42
do something, and I'll do anything to not have to do it.
4:45
It's incredible. Like that
4:47
DevOps stuff I was talking about: just
4:49
one command, and
4:51
maybe eight or nine new apps with
4:53
all their tiers redeploy, update, etc.
4:55
It took me a lot
4:57
of work to get there, but now
4:59
I never have to think about it again, at
5:01
least not for a few years. And
5:03
it's amazing, I can be productive.
5:05
It's right in line with
5:07
this. So what are some of the
5:09
Python projects you've worked on? Talk
5:12
about the different ways you've applied this over
5:14
the years. Oh yes, so it started
5:16
with internal, just like, Python projects to
5:18
automate some rote task that
5:20
I had, and that accidentally becomes, you know,
5:22
a bigger project. People see it and they're like,
5:24
oh, I want that too, and suddenly
5:26
you have to build a GUI interface,
5:28
because most people don't speak Python. And
5:30
so that got me into PyGUI,
5:32
I think it was called, way back
5:34
when. That was a fun journey, and
5:36
then from there it's really taken off.
5:38
A lot of it has been mostly
5:40
personal projects, but open
5:42
source was a really big learning
5:45
path for me as well, really being
5:47
absorbed by things like SQLAlchemy
5:49
and Requests back when they were coming
5:51
up. That eventually led to more data
5:53
engineering over time, where I got involved
5:55
with tools like Airflow and tried to
5:57
automate data pipelines instead of patches
5:59
on servers. That one day led to,
6:01
I guess, making a long story short, a
6:03
role at Dagster, where now I contribute a
6:06
little bit to Dagster. I work on Dagster,
6:08
the core project itself, but I also use
6:10
Dagster internally to build our own data pipelines.
6:13
I'm sure it's interesting to see
6:15
how you all both build Dagster
6:17
and then consume Dagster. Yeah, it's
6:19
been wonderful. I think there's a
6:21
lot of great things about it.
6:24
One is like getting access to
6:26
Dagster before it's fully released, right?
6:28
So internally, we dogfood new
6:30
features, new concepts, and we work with the
6:32
product team, the engineering team to say, hey,
6:34
this makes sense, this works really well, that
6:36
doesn't. And that feedback loop is so
6:39
fast and so iterative that for me personally, being
6:42
able to see that come to fruition is really,
6:44
really compelling. But at the same time, I get
6:46
to work at a place that's building a tool
6:48
for me. You don't often
6:50
get that luxury. Yeah. I've
6:52
worked in ads, I've worked in insurance,
6:54
I've worked in banking; these are nice things, but
6:56
it's not built for me, right? And
6:59
so for me, that's probably been the biggest benefit,
7:01
I would say. Right. If you
7:03
work in some marketing thing, you're like, you know,
7:05
I retargeted myself so well today, you wouldn't believe
7:07
it. I really enjoyed it. Yeah,
7:10
I've seen the ads that I've created before.
7:12
So it's a little fun, but it's not
7:14
the same. Yeah, I've heard of people who
7:16
are really, really good at ad targeting
7:19
and finding groups where they
7:21
like pranked their wife or something or just had
7:23
an ad that would only show up for their
7:25
wife by running it. It's like so
7:28
specific and you know, freak them out a little bit. That's
7:30
pretty clever. Yeah, maybe
7:32
it wasn't appreciated, but it is clever. Who
7:34
knows? All right. Well,
7:38
before we jump in, you said that of
7:40
course you built GUIs with PyGui and those
7:42
sorts of things because people don't
7:45
speak Python back then, in the 2.7 days and
7:47
whatever. Is that different now? Not
7:49
that people speak Python, but is it different in the
7:51
sense that like, hey, I could give them a Jupyter
7:53
Notebook or I could give them Streamlit
7:56
or one of these things, right? Like a little
7:58
more, or less of you building, just... plug it
8:00
in? I think so. I mean, yeah, like you
8:02
said, it's not different in that most people probably
8:04
still, to this day, don't speak Python. I know
8:06
we had this like movement a little bit back
8:08
where everyone was going to learn like SQL and
8:11
everyone was going to learn to code. I
8:13
was never that bullish on that trend
8:15
because like if I'm a marketing person, I've got
8:17
10,000 things to do and learning
8:19
to code isn't going to be the priority ever.
8:22
So I think building interfaces for people that
8:24
are easy to use and speak well to
8:27
them is always useful. That never has gone
8:29
away. But I think the tooling around
8:31
it has been better, right? I don't think I'll
8:33
ever want to use PyGUI again, and nothing wrong
8:36
with the platform. It's just like not fun to
8:38
write. Streamlit makes it so easy to do that.
8:40
And there's something like Retool, and there's like
8:42
a thousand other ways now that you can bring
8:44
these tools in front of your stakeholders and your
8:47
users that just wasn't possible before. I think it's
8:49
a pretty exciting time. There are a lot of
8:51
pretty polished tools. Yeah, it's gotten so good. Yeah.
8:53
There are some interesting ones like OpenBB. Do you
8:56
know that the financial dashboard thing? I've heard of
8:58
this. I haven't seen it. Yeah, it's
9:00
basically for traders, but it's like a terminal
9:03
type thing that has a bunch of
9:05
Matplotlib and other interactive stuff that pops
9:07
up kind of compared to say Bloomberg
9:09
dashboard things. But yeah, that's one sense
9:11
where like maybe traders go and learn
9:14
Python because it's like, all right, there's
9:16
enough value here. But in general, I
9:18
don't think people are going to stop
9:20
what they're doing to learn to code.
9:22
So these new UI things are nice.
9:24
All right, let's dive in and talk
9:26
about this general category
9:29
first of data pipelines, data
9:31
orchestration, all those things. We'll talk about
9:33
Daxter and some of the trends and
9:36
that. So I just grabbed some random internet
9:38
search result for what a
9:40
data pipeline might look like, but you know,
9:42
people out there listening who don't necessarily live
9:44
in that space, which I think is honestly
9:46
many of us, maybe we should, but maybe
9:49
in our minds, we don't think we live
9:51
in data pipeline land. Like tell them about
9:53
it. Yeah, for sure. It is hard to
9:55
think about if you haven't done or built
9:57
one before. In many ways, a data pipeline
9:59
is just a series. of steps that you
10:01
apply to some data set that you have
10:04
in order to transform it to something a
10:06
little bit more valuable at the very end.
10:08
That's a simplified version, the devil's in the
10:10
details, but really, at the end of
10:12
the day, you're in a business, the production of data happens by
10:14
the very nature of operating that
10:17
business. It tends to be the core thing
10:19
that all businesses have in common. And then
10:21
the other output is you have people within
10:23
a business who are trying to understand how
10:25
the business is operating. And this used to
10:27
be easy when all we had was a single
10:30
spreadsheet that we could look at once a month. I
10:32
think the systems have gotten a little bit more
10:34
complex these days, with computers and automation. And expectations,
10:36
like, they expect to be able to see almost
10:38
real time, not, I'll see it at the end
10:40
of the month, sort of. That's right. Yeah. I
10:42
think people have gotten used to getting data too,
10:44
which is both good and bad: good in the
10:46
sense that now people are making better decisions; bad,
10:49
in that there's more work for us to do.
10:51
And we can't just sit on our feet for
10:53
half a day, half a month, waiting for the
10:55
next request to come in. There's just an endless
10:57
stream that never ends. So that's what
10:59
a data pipeline is really all about. It's like
11:01
taking this data and making it consumable in
11:03
a way that users' tools will understand, that
11:05
helps people make decisions at the very end
11:08
of the day. That's sort of the nuts
11:10
and bolts of it. In your mind, does
11:12
data acquisition live in this
11:14
land? So for example, maybe we have
11:16
a scheduled job that goes and does
11:18
web scraping, calls an API once an
11:20
hour, and that might kick off a
11:23
whole pipeline of processing. Or we watch
11:25
a folder for people to upload
11:28
over FTP, like a CSV
11:31
file or something horrible like that. It's
11:33
unspeakable. But something like that where
11:35
you say, Oh, a new CSV has arrived for
11:37
me to get. Right? Yeah, I think that's
11:40
the beginning of all data pipeline journeys in
11:42
my mind very much. And actually, as much
11:44
as we hate it, it's not terrible. I
11:46
mean, there
11:48
are worse ways to transfer files. But it's
11:51
still very much in use today. And
11:53
every data pipeline journey at some point
11:55
has to begin with consumption of data
11:58
from somewhere. Hopefully, it's SFTP, not
12:00
just straight FTP; like, the encrypted one. Don't just
12:03
send your password in
12:05
plain text. Oh well, I've
12:07
seen that go wrong. That's a story for
12:10
another day, honestly. All
12:12
right, well, let's talk about the project that you work
12:14
on. We've been talking about it in general, but
12:16
let's talk about Dagster. Like, where does it fit in this
12:18
world? Yes. Dagster to me
12:21
is a way to build a data
12:23
platform. It's also a different way of
12:25
thinking about how you build data pipelines.
12:27
Maybe it's good to compare it with
12:29
kind of what the world was like, I think,
12:31
before Dagster and how it came
12:33
about to be. So if you think
12:35
of Airflow, I think it's probably the
12:37
most canonical orchestrator out there. But there
12:39
are other ways which people used to
12:41
orchestrate these data pipelines. They
12:44
were often task-based, right? Like, I would
12:46
download file, I would unzip file, I
12:48
would upload file. These are sort of
12:51
the words we use to describe the
12:53
various steps within a pipeline. Some
12:55
of those little steps might be Python functions that
12:58
you write. Maybe there's some pre-built other ones. Yeah,
13:00
there might be Python, could be a bash script,
13:02
could be logging into a server and downloading a
13:04
file, could be hitting requests and
13:07
downloading something from the internet, unzipping it. Just
13:09
a various hodgepodge of commands that would run.
13:11
That's typically how we thought about it. For
13:13
more complex scenarios where your data is bigger,
13:16
maybe it's running against a Hadoop cluster or
13:18
a Spark cluster. The compute's been offloaded somewhere
13:20
else. But the sort of conceptual way you
13:22
tended to think about these things is in
13:25
terms of tasks, right? Process this thing, do
13:27
this massive data dump, run a bunch of
13:29
things, and then your job is
13:31
complete. With Airflow, or I'm sorry, with Dagster,
13:33
we kind of flip it around a little
13:35
bit on our heads and we say, instead
13:38
of thinking about tasks, what if we flipped
13:40
that around and thought about the actual underlying
13:42
assets that you're creating? What if you told
13:44
us not the steps that you're going to
13:47
take, but the thing that you produce? Because
13:49
it turns out that people and data people
13:51
and stakeholders really, we don't care about the
13:53
task. We just assume that you're going
13:55
to do it. What we care about is that table, that model,
13:57
that file, that
14:00
Jupyter Notebook. And if we model our
14:02
pipeline through that, then we get a
14:04
whole bunch of other benefits. And that's
14:06
sort of Dagster's pitch,
14:08
right? Like, if you want to understand
14:10
the things that are being produced by
14:12
these tasks, tell us about the underlying
14:14
assets. And then when a stakeholder comes
14:16
to you and says, how old is this table?
14:19
Has it been refreshed lately? Well, you don't have
14:21
to go look at a specific task and remember
14:23
that task ABC had modeled XYZ. You just go
14:25
and look up model XYZ directly there and it's
14:27
there for you. And because you've defined things in
14:30
this way, you get other nice things like a
14:32
lineage graph, you get to understand how fresh your
14:34
data is, you can do event based orchestration and
14:36
all kinds of nice things that are a lot
14:39
harder to do in a task world. Yeah,
14:41
more declarative, less imperative,
14:44
I suppose. Yeah, it's been the trend, I
14:46
think, in lots of tooling. React, I think
14:48
was famous for this as well, right? In
14:50
many ways, it was a hard framework, I
14:52
think, for people to sort of get their
14:55
heads around initially, because you were so used
14:57
to the imperative jQuery
14:59
style of doing things. Yeah, how do I hook
15:01
the event that makes the thing happen? Right. And
15:04
React said, let's think about it a little bit
15:06
differently. Let's do this declaratively, really. And
15:08
I think the proof is in the pudding: React is
15:10
everywhere now, and jQuery, not so much. Yeah,
15:13
there's still a lot of jQuery out there, but there's not
15:15
a lot of action. Not a lot
15:17
of active jQuery, but I imagine there's some,
15:19
just because people are like,
15:21
you know what, don't touch that, that works. Which
15:24
is probably the smartest thing people can do,
15:26
I think. Yeah, honestly, even though new frameworks
15:29
are shiny. And if there's
15:31
any ecosystem that loves to chase the shiny
15:33
new idea, it's the JavaScript web world. Oh,
15:35
yeah, there's no shortage of new frameworks coming
15:37
out every time. Yeah, I mean, we
15:40
do too, but not as much as like,
15:42
that's six months old. That's so old, we
15:44
can't possibly do that anymore. We rewrite now. We're
15:46
gonna do the big rewrite again. Yep.
15:49
Okay, so Dagster is the company,
15:52
but it's also open source. What's the
15:54
story on, like, can I use it for free? Is it
15:57
open source? Do I pay for it? Okay: company,
16:00
Dagster; open source is the
16:02
product, 100% free. We're very
16:04
committed to the open source model. I
16:07
would say 95% of the things you can get
16:09
out of Dagster are available through open source and
16:11
we tend to try to release everything through that
16:13
model. You can run
16:15
very complex pipelines and you
16:17
can deploy it all on your own if you
16:19
wish. There is a Dagster Cloud product, which is
16:21
really the hosted version of Dagster. If you want
16:23
a hosted control plane, we can do that for you through
16:25
Dagster Cloud, but it all runs on the same
16:27
code base, and the modeling and the files all
16:30
essentially look the same. Okay,
16:32
so obviously you could get
16:34
like I talked about at the beginning, you could
16:36
go down the DevOps side, get your own open
16:38
source Dagster, set it up, schedule it, run it on
16:40
servers, all those things. But if we just wanted
16:42
something real simple, we could just go to you
16:45
guys and say, hey, I built this with Dagster.
16:47
Will you run it for me? Pretty much, yeah,
16:49
right? So there's two options there. You can do
16:51
the serverless model, which says, you know, Dagster, just
16:53
run it, we take care of the compute, we
16:55
take care of the execution for you and you
16:58
just write the code and upload it to GitHub
17:00
or, you know, repository of your
17:02
choice and we'll sync to that and then run
17:04
it. The other option is to do the hybrid
17:06
model. So you basically do the CI CD aspect,
17:09
you just say you push to name
17:11
your branch, if you push that branch, that
17:13
means we're just going to deploy a new
17:15
version and whatever happens after that, it'll be
17:17
in production, right? Exactly. Yeah. And we offer
17:20
some templates that you can use in GitHub
17:22
for workflows in order to accommodate that. Excellent.
17:25
Then I cut you off, you're saying something about hybrid. Hybrid
17:27
is the other option. For those of you who want to
17:29
run your own compute, you don't want the data
17:31
leaving your ecosystem. You can say we've got
17:33
this Kubernetes cluster, this ECS cluster, but we
17:35
still want to use the Dagster Cloud product
17:38
to sort of manage the control plane. Dagster
17:40
Cloud will do that. And then you can
17:42
go off and execute things on your own
17:44
environment if that's something you wish to do.
17:46
Oh, yeah, that's pretty clever. Because running stuff
17:48
in containers isn't too bad. But running container
17:50
clusters, all of a sudden, you're
17:52
back, back doing a lot of work, right? Exactly.
17:55
Yeah. Okay, well, let's maybe talk
17:57
about Dagster for a bit. Then I want to talk
17:59
about some of the trends as well that
18:01
we touched on. Maybe setting up a
18:03
pipeline? What does it look like?
18:05
We talked about it in general: less imperative,
18:08
more declarative. But what does it look
18:10
like? Be careful, we can't have him read code on
18:12
audio, you know? Oh yes. Give us
18:14
a sense of what the programming model feels
18:17
like, as much as possible. It
18:19
really feels like just writing Python. It's pretty
18:21
easy: you add a decorator on top
18:24
of your existing Python function that does something,
18:26
a simple decorator called asset, and then
18:28
your Python function becomes the
18:31
data asset that's represented in the Dagster
18:33
UI. So you could imagine you've got
18:35
a pipeline that gets, like, maybe Slack analytics,
18:37
and for
18:40
your first pipeline you'd write a function that
18:42
fetches Slack data, and that would be your
18:44
asset. That function is where you do
18:46
all the transforms, the downloading of the data, until
18:49
you've really created that fundamental data asset
18:51
you care about. And it can be stored
18:53
either, you know, in a data warehouse or
18:55
S3. How you want to persist it, that's really up to you.
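To make that concrete, here is a minimal sketch of an asset, assuming a recent Dagster release; the endpoint and the filtering logic are invented for illustration:

    import dagster as dg
    import requests

    @dg.asset
    def slack_analytics():
        # An ordinary Python function becomes a data asset via the decorator.
        resp = requests.get("https://example.com/slack/analytics")  # hypothetical API
        resp.raise_for_status()
        data = resp.json()
        # Transform/filter, then return the fundamental data you care about.
        return [row for row in data if row.get("active")]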
18:58
And then the resources are sort of where the
19:00
power, I think, of a lot of Dagster
19:02
comes in. The assets are
19:04
a lot like the declaration of the thing
19:06
I'm going to create; a resource is how I'm
19:08
going to operate on that. Because sometimes
19:10
you might want to have, say, a
19:12
DuckDB instance locally, because it's easier
19:14
and faster to operate, while when you're moving
19:16
to the cloud you want to have,
19:18
say, a BigQuery or a Snowflake. You
19:20
can swap resources based on environments, and your
19:23
asset can reference that resource, and as
19:25
long as it has the same sort of interface,
19:27
you can really easily change
19:29
between where that data is going
19:31
to
19:33
be persisted.
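A rough sketch of that environment swap, assuming Dagster's configurable resources; the class, field, and connection strings are invented:

    import dagster as dg

    class WarehouseResource(dg.ConfigurableResource):
        # Same interface everywhere; the connection string picks the backend.
        conn_string: str

    @dg.asset
    def slack_analytics_table(warehouse: WarehouseResource):
        ...  # write the data via warehouse.conn_string

    defs = dg.Definitions(
        assets=[slack_analytics_table],
        # DuckDB locally; point at Snowflake/BigQuery in production instead.
        resources={"warehouse": WarehouseResource(conn_string="duckdb://local.db")},
    )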
19:35
Does Dagster know how to talk to those different platforms? Does it natively understand
19:37
DuckDB and Snowflake? Yes, it's interesting. People
19:39
often look at Dagster like, oh, does it
19:41
do X? And the answer is, it's as capable as
19:44
anything you can do Python with, which is
19:46
most things. So I think if
19:48
you come from the Airflow world, you're
19:50
very much used to, like, these Airflow providers,
19:52
and if you know what you want,
19:54
yeah, if you want to read from Postgres, it's
19:56
easy to find the Postgres provider you
19:58
want. With Dagster, you don't need to find
20:01
a Dagster provider for X. We'll say
20:03
it up front: if you want to
20:05
use, say for example, Snowflake, take the
20:07
Python connector package from Snowflake, use that
20:09
as a resource directly, and then you just
20:11
run your SQL that way. There are
20:13
some places where we do have integrations that
20:15
help you. When you get into the pieces
20:17
like IO managers, that's where we
20:20
persist data on your behalf, and so for
20:22
S3, for Snowflake, for example, there are other
20:24
ways we can persist that data for
20:26
you. But if you just want to run a
20:28
query, to try to execute something, sort
20:30
of save something somewhere, you don't have to
20:32
use that system at all. You can just
20:35
use whatever Python package you would
20:37
use anyway to do that. So maybe
20:39
some data is expensive for us to get
20:41
as a company, like maybe we're charged
20:44
on a usage basis, or it's super slow or something.
20:46
I could write Python code that
20:48
says, well, look at my local database first; if
20:50
it's already there, use that as cached
20:53
data, or otherwise then actually go get
20:55
it, but put it there and then
20:57
get it back. And, like, that kind of
20:59
stuff would be up to me to set
21:02
up, then? Yep.
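Something like this plain-Python sketch of that caching idea; the table, endpoint, and function names are all invented:

    import sqlite3
    import requests

    def get_rates(day: str) -> str:
        # Check the local cache first; only hit the expensive API on a miss.
        db = sqlite3.connect("cache.db")
        db.execute("CREATE TABLE IF NOT EXISTS rates (day TEXT PRIMARY KEY, payload TEXT)")
        row = db.execute("SELECT payload FROM rates WHERE day = ?", (day,)).fetchone()
        if row:
            return row[0]  # cache hit, no usage charge
        payload = requests.get(f"https://example.com/rates/{day}").text  # hypothetical API
        db.execute("INSERT INTO rates VALUES (?, ?)", (day, payload))
        db.commit()
        return payload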
21:04
And the nice thing, as I see it, is you're not really limited by, like, anyone's data
21:06
model or worldview here on how data
21:09
has to be retrieved, saved, augmented. You get
21:11
a couple of ways. You could say, whenever I'm
21:13
working locally, use this persistent data store that we're
21:15
just going to use for development purposes. A fancy
21:18
database called SQLite, something like that? Exactly,
21:20
yes. Wonderful, battle-tested, it
21:22
would work really, really well. And then you
21:24
say, when I'm in a different environment, when I'm
21:27
in production, swap out the SQLite
21:29
resource for, name your favorite cloud
21:31
data warehouse or data source, and go
21:33
fetch that data from there. Or, only
21:35
use my IO manager locally, and
21:37
S3 when deployed. It's very
21:39
simple to swap these. Oh okay, yeah,
21:41
so it looks like you build up
21:43
these assets, as you call these pieces
21:45
of data, and then
21:47
you have a
21:49
nice UI that lets you go
21:52
and build those out, workflow style,
21:54
right? Yeah, exactly. This is where we
21:56
get into the wonderful world of DAGs,
21:58
which stands for Directed Acyclic Graph. I
22:00
think that means, basically, a bunch of things
22:03
that are not connected in a circle but
22:05
are connected in some way. You can't have
22:07
a loop, because then you never know
22:09
where to start; it has to be acyclic.
22:11
But it's not a single element either:
22:13
there's a path through this dataset, with a beginning
22:15
and an end. Then we can kind of
22:17
model this
22:19
connected graph of things, and
22:21
then we know how to execute it right.
22:23
We can say, well, this is the first
22:25
asset, and we have to run it first, for
22:27
all dependencies to start, and then we can either
22:29
branch off in parallel or we continue linearly
22:31
until everything is complete, and if something breaks
22:33
in the middle we can resume from that
22:35
broken spot.
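For reference, a hedged sketch of how those edges are declared; in Dagster, referencing an upstream asset as a function parameter draws the dependency (asset names invented):

    import dagster as dg

    @dg.asset
    def users():
        # First node in the graph.
        return ["alice", "bob"]

    @dg.asset
    def orders(users):
        # The parameter name wires the edge: users runs first,
        # and its output is passed in here.
        return [f"order-for-{u}" for u in users]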
22:37
Okay, excellent. And is that the recommended way? Like, if I write all this
22:39
Python code that works on the pieces, then
22:41
is the next recommendation to fire up
22:43
the UI and start building? Or do you
22:45
say, no, really write it in code, and
22:47
then you can just visualize it, or
22:49
monitor everything? In Dagster, it's written in code; the UI
22:52
reads that code, and it's interpreted as a
22:54
DAG, and then it displays it for
22:56
you. There are some things you can do in
22:58
the UI: you can materialize assets, you
23:00
can make them run, you can do
23:02
backfills, you can view metadata, you
23:04
can sort of enable and disable schedules.
23:06
But the core,
23:08
the core declaration
23:10
of how things are done, is always
23:12
done through code. Okay, and when we
23:14
say materialize: maybe I have an
23:16
asset, which is really a Python function
23:18
I wrote, that goes and pulls
23:20
down a CSV file. To materialize it, maybe I
23:23
want to see kind of representative data,
23:25
and I see this in the UI, and so
23:27
I could look at it and think, this
23:29
is right, let's keep passing it down. Is
23:31
that what it means? Materialize really means just run
23:34
this function: make this asset new again,
23:36
fresh again. As part of that
23:38
materialization, we sometimes have metadata. You
23:40
can see this on the right, if you're
23:43
looking at the screen here, where we talk
23:45
about what the timestamp was, and you
23:47
are shown a graph of, like,
23:49
number of rows over time. All that
23:51
metadata is stuff you can emit,
23:53
and we emit some ourselves by default
23:55
with the framework, and as you materialize
23:58
assets, as you run the asset over and
24:00
over again, over time we capture all that,
24:02
and then you can really get a nice
24:04
overview of this asset's lifetime, essentially.
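A hedged sketch of emitting that metadata from an asset (assumes a recent Dagster API; the helper and metric are invented):

    import dagster as dg

    @dg.asset
    def users() -> dg.MaterializeResult:
        rows = fetch_users()  # hypothetical helper
        # Dagster stores this per materialization and plots it over time.
        return dg.MaterializeResult(metadata={"num_rows": len(rows)})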
24:07
I think the metadata is really pretty excellent. Over
24:09
time, you can see how the data
24:11
has grown and changed. Yeah, the metadata
24:13
is really powerful, and it's one of
24:15
the nice benefits of being in this
24:18
asset world, because you don't really want
24:20
metadata on this task that ran;
24:22
you want to know, this table that
24:24
I created, how many rows does it
24:26
have every single time it's run. If that
24:28
number drops by, like, 50 percent, that's
24:30
a big problem. Conversely, if the runtime is
24:32
slowly increasing every single day, you might not
24:34
notice it, but over a month or two
24:36
it went from a 30-second pipeline to 30
24:38
minutes, maybe there's a great place to
24:41
start optimizing that one specific asset. Right.
24:43
What's cool is if it's just Python
24:45
code, you know how to optimize that
24:47
probably, right? Hopefully, yes. Well,
24:49
as much as you're going to... You have
24:52
all the power of Python and you
24:54
should be able to as opposed to it's
24:56
deep down inside some framework that you don't
24:58
really... Exactly. Yeah. It's Python, you can benchmark
25:00
it. You probably knew you didn't write it
25:03
that well when you first started and you can
25:05
always find ways to improve it. So
25:07
this UI is something that you can just
25:09
run locally kind of like Jupiter. 100 percent.
25:11
Just type dagster dev and then you get
25:13
the full UI experience. You get to see
25:16
the runs, all your assets. Is it a
25:18
web app? It is, yeah. It's a web
25:20
app. There's a Postgres backend and then there's
25:22
a couple of services that run the web
25:24
server, the GraphQL and then the workers. Nice.
25:26
Yeah. So pretty serious web app, it sounds
25:28
like. But you
25:30
probably just run it all, yeah, something you
25:32
run, probably containers
25:35
or something you just fire up when you
25:37
download it, right? Locally, it doesn't even use
25:39
containers. It's just all pure Python for
25:42
that. But once you deploy, yeah, I think you
25:44
might want to go down the container route. But
25:46
it's nice not having to have Docker just to
25:48
like run a simple test deployment. Yeah, I guess
25:50
not everyone's machine has that for
25:53
sure. So question from the audience here,
25:55
Jazzy asked, does it hook into
25:57
AWS in particular? Is it compatible
26:00
with existing pipelines, like ingestion
26:02
Lambdas or transforms? Yes, you
26:04
can hook into AWS, so
26:06
we have some native integrations
26:08
built in. Like I mentioned before, there's
26:10
nothing stopping you from importing boto3
26:12
and doing anything, really, you
26:14
want. So a very simple use case:
26:16
let's say you already have an
26:19
existing transformation that's triggered in AWS
26:21
through some Lambda. You can model that
26:23
with Dagster and say, go trigger that
26:25
Lambda. Okay, and the asset
26:27
itself is really the representation of that
26:29
pipeline. Without you rewriting that code
26:31
within Dagster itself, it's still occurring on
26:34
the AWS side. And it's a really
26:36
simple way to start adding a little bit
26:38
of observability and orchestration to existing pipelines.
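A hedged sketch of wrapping an existing Lambda as an asset; the function name and region are invented, and the invoke call is boto3's standard one:

    import boto3
    import dagster as dg

    @dg.asset
    def nightly_transform():
        # The real work still runs in AWS; Dagster triggers it and records
        # the run, adding observability without rewriting the pipeline.
        client = boto3.client("lambda", region_name="us-east-1")
        client.invoke(FunctionName="nightly-transform")  # hypothetical Lambda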
26:40
Okay, that's pretty cool, because now you have this nice
26:43
UI and this metadata and history,
26:45
but it's in someone else's cloud. Exactly. Yes,
26:47
and now there's a path forward from where you're sitting,
26:49
and over time maybe you decide, you
26:52
know, this Lambda that I had, it's
26:54
sort of getting out of hand. I
26:56
want it broken apart into multiple assets, I want
26:58
to sort of optimize it. Dagster can help
27:00
you along that path now. Excellent. How do you
27:03
set up, like, triggers or observability
27:05
inside Dagster? Jazzy asked
27:07
about AWS, sorry, but like in general,
27:09
right? If a row is entered into
27:11
a database, something's dropped in a blob
27:13
storage, or data changes that you
27:15
know of... Yes, there's a lot
27:18
of options. In Dagster, we do model every
27:20
asset with a couple of little signals. I
27:22
think they're really useful. Think
27:24
about it: one is whether the code of
27:26
that particular asset has changed, and
27:28
the other one is whether the
27:30
upstream of the asset has changed. Those
27:33
things really power a lot of automation functionality
27:35
that we can get downstream. So let's
27:37
start with your S3 example: the
27:39
need to understand, there's a bucket, and there is
27:41
a file that gets uploaded every day.
27:43
You don't know what time the file gets uploaded,
27:46
or know when it'll be uploaded, but
27:48
you know at some point it will be. In
27:50
Dagster we have a thing called the sensor,
27:52
which you can just point at an S3
27:54
location. You can define how it looks into
27:56
that file or into a folder, and then
27:59
you just poll every 30 seconds
28:01
until something happens. When that something
28:03
happens, that triggers an event. And
28:06
that event can trickle, at your will,
28:08
downstream to everything that depends on it, everything
28:10
connected to these things. So it gets you away
28:12
from this, like, oh, I'm going to schedule
28:14
something to run every hour. Maybe the data
28:16
will be there, but maybe it won't. And you
28:19
can have a much more event-based workflow. When
28:21
this file arrives, I want everything downstream to
28:23
know that this data has changed. And as
28:25
data flows through the systems, everything will sort of
28:27
work its way down. Yeah, I like it.
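A minimal sensor sketch along those lines (hedged: the key-listing helper is invented, and a real one would go through boto3):

    import dagster as dg

    process_job = dg.define_asset_job("process_job")  # runs the downstream assets

    @dg.sensor(job=process_job, minimum_interval_seconds=30)
    def new_file_sensor(context):
        # Poll the bucket; one run per new file, keyed so reruns dedupe.
        for key in list_new_s3_keys(since=context.cursor):  # hypothetical helper
            yield dg.RunRequest(run_key=key)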
28:31
This portion of Talk Python to me is
28:33
brought to you by Posit, the makers of
28:36
Shiny, formerly RStudio, and especially
28:38
Shiny for Python. Let
28:40
me ask you a question. Are you building
28:42
awesome things? Of course you are. You're a
28:44
developer or data scientist. That's what we do.
28:46
And you should check out Posit Connect. Posit
28:49
Connect is a way for you to
28:51
publish, share, and deploy all the data
28:53
products that you're building using Python. People
28:56
ask me the same question all the time. Michael,
28:58
I have some cool data science project or notebook
29:01
that I built. How do I
29:03
share it with my users, stakeholders, teammates?
29:05
Or I need to learn FastAPI or
29:08
Flask or maybe Vue or ReactJS? Hold
29:10
on now. Those are cool technologies, and I'm sure
29:13
you benefit from them. But maybe stay focused on
29:15
the data project. Let Posit Connect handle
29:17
that side of things. With Posit
29:19
Connect, you can rapidly and
29:21
securely deploy the things you
29:23
build in Python: Streamlit, Dash,
29:25
Shiny, Bokeh, FastAPI, Flask, Quarto
29:28
reports, dashboards, and APIs. Posit
29:30
Connect supports all of them. And Posit
29:32
Connect comes with all the bells and
29:35
whistles to satisfy IT and other enterprise
29:37
requirements. Make deployment the easiest
29:39
step in your workflow with Posit
29:41
Connect. For a limited time, you
29:43
can try Posit Connect for free
29:45
for three months by going to
29:47
talkpython.fm/posit. That's talkpython.fm slash
29:50
posit. The link is in your podcast
29:52
player show notes. Thank you
29:54
to the team at Posit for supporting Talk Python To Me.
29:57
The sensor
29:59
concept is really cool
30:01
because I'm sure that there's a ton of
30:03
cloud machines people provisioned just because
30:05
this thing runs every 15 minutes,
30:08
that runs every 30 minutes and
30:10
you add them up and in
30:12
aggregate we need eight machines just
30:14
to handle the automation rather
30:16
than – because they're hoping to catch something
30:18
without too much latency but maybe that actually
30:20
only changes once a week. Exactly. And
30:23
I think that's where we have to like sometimes
30:25
step away from the way we're so used to
30:27
thinking about things and I'm guilty of this. When
30:29
I create a data pipeline, my natural inclination is
30:31
to create a schedule where I can say, is
30:33
this a daily one? Is this weekly? Is this
30:35
monthly? But what I'm finding more and more is
30:37
when I'm creating my pipelines, I'm not adding a
30:39
schedule. I'm using Dagster's auto-materialize
30:41
policies and I'm just telling it, you figure
30:44
it out. I don't have to think about
30:46
schedules. Just figure out when this thing should
30:48
be updated. When parents have been updated, you
30:50
run. When the data has changed, you
30:52
run. And then just like figure it out and leave
30:54
me alone. Works pretty well
30:56
for me so far.
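A rough sketch of opting an asset into that behavior, using the auto-materialize policy Dagster had around this time (the API has since evolved; names invented):

    import dagster as dg

    @dg.asset(auto_materialize_policy=dg.AutoMaterializePolicy.eager())
    def slack_summary(slack_analytics):
        # No schedule: re-materialized when upstream data or code changes.
        return summarize(slack_analytics)  # hypothetical helper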
30:59
I think it's great. I have a job that refreshes
31:02
the search index on the various podcast pages, and it runs every
31:04
hour, but the podcast ships weekly, right? But
31:06
I don't know which hour it is and
31:08
so it seems like that's enough latency but
31:10
it would be way better to put just
31:12
a little bit of smarts, like, what was
31:15
the last date that anything changed? Was that
31:17
since the last time you saw it? Maybe
31:19
we'll just leave that alone. You're
31:22
starting to inspire me to go write
31:24
more code but pretty cool. All
31:26
right. So on the homepage at
31:28
dagster.io, you've got a nice graphic
31:30
that shows you both how to write
31:33
the code, like some examples of the
31:35
code as well as how that looks
31:37
in the UI. And one of them
31:39
says to launch backfills. What is this
31:42
backfill thing? Oh, this is my favorite
31:44
thing. Okay. So when you
31:46
first start your data journey as a data
31:48
engineer, you sort of have a
31:50
pipeline and you build it and it just
31:52
runs on a schedule and that's fine. What
31:54
you soon find is you might have to
31:56
go back in time. You might say, I've
31:59
got this. data set that updates monthly.
32:01
Here's a great example, AWS cost
32:04
reporting, right? AWS will send
32:06
you some data around, you know, all your
32:08
instances and your S3 bucket, all that. And
32:10
it'll update that data every day or every
32:12
month or whatever have you. Due to some
32:14
reason, you got to go back in time
32:16
and refresh data that AWS updated due to
32:18
some, like, discrepancy. Backfill is sort of how
32:20
you do that. And it worked hand in
32:23
hand with this idea of a partition. A
32:25
partition is sort of how your data is
32:27
naturally organized. And it's like a nice way
32:29
to represent that natural organization. It has nothing
32:31
to do with like the fundamental way how
32:33
often you want to run it. It's more
32:35
around like, I've got a data set that
32:38
comes in once a month is represented monthly,
32:40
it might be updated daily, but the representation
32:42
of the data is monthly. So I will
32:44
partition it by month. It doesn't have to
32:47
be dates. It could be strings, it could
32:49
be a list, you could have a partition
32:51
for every company, or every client, or you
32:53
know, every domain you have, whatever you sort
32:56
of think is a natural way to think
32:58
about breaking apart that pipeline. And
33:00
once you do that partition, you can do
33:02
these nice things called backfills, which says, instead
33:04
of running this entire pipeline on all my
33:06
data, I want you to pick that one
33:08
month where your data went wrong, or that
33:10
one month where data was missing, and just
33:12
run the partition on that range. And so
33:15
you limit compute, you save resources and get
33:17
a little bit more efficient. It's just easier
33:19
to like, think about your pipelines because you've
33:21
got this natural, built-in partitioning. Excellent.
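A hedged sketch of a monthly-partitioned asset (start date and helper invented):

    import dagster as dg

    monthly = dg.MonthlyPartitionsDefinition(start_date="2023-01-01")

    @dg.asset(partitions_def=monthly)
    def aws_costs(context: dg.AssetExecutionContext):
        # Each run handles one month; a backfill re-runs only the bad months.
        return load_costs_for(context.partition_key)  # hypothetical helper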
33:24
So maybe you missed some
33:26
important event, maybe your automation went down
33:28
for a little bit came back up, you're
33:30
like, Oh, no, we've missed it. Right.
33:33
But you don't want to start over from
33:35
years back. So maybe we could just go and
33:37
run the last day or two's worth. Exactly.
33:40
Okay. Another one would be, your vendor says,
33:42
hey, by the way, we actually screwed up:
33:44
we uploaded this file from two months ago,
33:46
but the numbers were all wrong, and we've
33:49
uploaded a new version to that destination. Can
33:51
you update your data set? One way is
33:53
to recompute the entire universe from scratch. But
33:55
If you've partitioned things, you can say,
33:58
no, limit that to just that one particular
34:00
month, and then for that month and that partition,
34:02
you can trickle that down to all your other
34:04
assets that depend on it. Do we have
34:06
to decide, do we have to think about
34:09
this partitioning beforehand, or can you do it
34:11
retroactively? You can do it retroactively, and I have
34:13
done that before as well. It really depends
34:15
on where you're at. I think if it's
34:18
your first asset ever, probably don't bother with partitions,
34:20
but it really isn't a lot of work to
34:22
get them started. Okay, yeah,
34:24
really nice. I like a lot of the
34:26
ideas here, like that it's got this
34:29
visual component, that you can
34:31
see what's going on, inspect it. Can you debug runs,
34:33
or what happens there? Like, obviously when you're
34:35
pulling data from many different sources, maybe it's
34:38
not your data you're taking in: fields could
34:40
vanish, could be the wrong type, systems go
34:42
down. I'm sure they do. Working with this seems
34:44
interesting. So, it looks a little bit,
34:47
I don't know, web browser debug
34:49
dev tools-like to me. For the record, my
34:51
code never fails. I've never had a bug
34:53
in my life before. Then you're lucky;
34:56
yeah, I've got mine. I
34:58
only do it to make an
35:00
example in my UI, of course. Yes,
35:02
I do it intentionally, of course, just
35:05
to humble myself a little bit. Exactly.
35:07
This view is, I think, one of
35:09
my favorite views.
35:11
This is, it's actually really fun
35:13
to watch while you
35:16
execute the pipeline. But really, let's
35:18
go back to, you know, what the world was
35:20
before: for procedures we used cron, right? We
35:22
had a bash script that would do something, and
35:24
we had a cron job that said, make
35:27
sure this thing runs, and then hopefully it
35:29
was successful. But sometimes it wasn't, and
35:31
sometimes it was. That's always
35:33
been the problem, right? It's like, well,
35:35
you know, I don't know why it
35:37
failed, or when it failed, or at
35:39
what point, or what subset of it failed.
35:42
That's really hard to do with cron. This
35:44
debugger really is a structured log
35:46
of every step that's been going on through
35:48
your pipeline, right? And in this view
35:50
there's three assets, and you can kind of see
35:52
here: one is called users, one is called
35:54
orders, and one is run dbt.
35:57
Presumably there are two, you know, tables that
35:59
are being updated, and then a dbt job, it
36:01
looks like, that's being updated at the very
36:03
end. Once you execute this pipeline, all the
36:05
logs are captured from each of those assets.
36:07
So you can manually write your own logs,
36:09
you have access to a Python logger, and
36:11
you can use your info, your error, whatever
36:14
have you, and log output that way, and
36:16
it'll be captured in a structured way. But
36:18
it also captures logs from
36:20
your integrations. So using dbt, we capture
36:22
those logs as well, you can see
36:24
it processing every single asset. So if
36:26
anything does go wrong, you can filter
36:28
down and understand at what step,
36:31
at what point, does something go wrong.
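For example, a short sketch of that structured logging from inside an asset (messages invented):

    import dagster as dg

    @dg.asset
    def users(context: dg.AssetExecutionContext):
        # context.log behaves like a standard Python logger; Dagster captures
        # these lines and attaches them to this asset's run.
        context.log.info("Fetched users from the upstream API")
        context.log.error("Example of an error-level message")
        return []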
36:33
That's awesome. And just the historical
36:36
aspect, could just go in through logs,
36:38
especially multiple systems can be really, really
36:40
tricky to figure out what's the problem,
36:42
what actually caused this to go wrong,
36:44
but come back and say, Oh, it
36:46
crashed, pull up the UI and see,
36:48
all right, well, show me, show
36:51
me what this run did, and show me what this job did.
36:53
And it seems like it's a lot easier to debug than your
36:56
standard web API or something like that. Exactly. You can
36:58
click on to any of these assets that get metadata
37:00
that we had earlier as well. If you know,
37:02
one step failed, and it's kind of flaky, flaky,
37:04
you can just click on that one step and
37:06
say just rerun this, everything else is fine, we
37:09
don't need to restart from scratch. Okay, and it'll
37:11
keep like the data from before.
37:13
So you don't have to rerun that. Yeah,
37:15
I mean, it depends on how you built
37:17
the pipeline. We like to build idempotent
37:19
pipelines, is how we sort of talk about
37:21
it in the data engineering landscape, right? So you should
37:23
be able to run something multiple times and
37:26
not break anything in a perfect world. That's
37:28
not always possible. But ideally, yes. And
37:30
so we can presume that if users completed
37:32
successfully, then we don't have to run that
37:34
again, because that data was persisted, you know,
37:36
a database, S3, somewhere. And if orders was
37:39
the one that was broken, we can just only
37:41
run orders and not have to worry about rewriting
37:43
the whole thing from scratch. Excellent. So
37:46
idempotent, for people who maybe don't know:
37:48
you run it once, or you perform the operation once
37:50
or you perform it 20 times, same
37:53
outcome, and it shouldn't have side effects,
37:55
right? That's the idea.
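A tiny sketch of that property using SQLite (table and columns invented): an upsert keyed on a unique id leaves the same end state whether it runs once or twenty times:

    import sqlite3

    def save_order(db: sqlite3.Connection, order_id: str, total: float):
        db.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, total REAL)")
        # Keyed on order_id, so re-running never duplicates rows.
        db.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)", (order_id, total))
        db.commit()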
37:57
Yeah, that's the idea. It's an ideal. It sure
37:59
is. It's an ideal that
38:01
can be very hard to hit, but the more you
38:03
can build pipelines that way, the
38:05
easier your life becomes immediately. Generally; not
38:07
always, but generally true for programming as
38:09
well. If you talk to the functional programming
38:12
people, they'll say, like, it's an absolute.
38:14
But yes, functional programmers love
38:16
this kind of stuff, and it
38:18
does lend itself really well to
38:20
data pipelines. I find, unlike
38:22
maybe some of the software engineering stuff,
38:24
it's a little bit different in that
38:26
the data changing is what causes, often,
38:28
most of the headaches, right? It's less
38:30
so the actual code you write, but
38:32
more that the data tends to change so
38:34
frequently, and so often in new and
38:37
novel and interesting ways that you would
38:39
often never expect. And so the more
38:41
you can sort of make that function
38:43
pure, so that you can provide any
38:45
sort of dataset and really test
38:47
these expectations when they occur,
38:49
the easier it is to sort of debug
38:51
these things and improve them in the
38:53
future. Yeah, and test them as well.
38:56
Yes, absolutely. So, speaking of
38:58
that kind of stuff, what's the
39:00
scalability story? I've got some
39:02
big, huge, complicated data pipeline. Can
39:05
I parallelize it and have it run
39:07
multiple pieces, like the
39:09
different branches, or something like that?
39:11
Yes, exactly. That's one of the
39:13
key benefits, I think, of writing
39:15
your assets in this DAG way:
39:17
anything that can be parallelized
39:19
will be parallelized. Now, there are some
39:21
limits on that, for when
39:23
too much parallelism is bad, if your
39:25
poor little database can't handle it.
39:28
You can say, set a concurrency limit
39:30
on this one, just run five at a time
39:32
or something. And an API for an
39:34
external vendor, they might not appreciate ten
39:36
thousand requests a second, so on that one
39:38
maybe you want to slow down. There's also
39:40
rate limiting, right, that you can run into,
39:43
you hit your limits, and then your
39:45
stuff crashes. Then there's retrying; there can
39:47
be all kinds of things to be concerned about.
39:49
But if the world is simple, anything
39:51
that can be parallelized will be, through
39:53
Dagster, and that's really the benefit of writing
39:55
these DAGs: there is a nice algorithm
39:57
for determining what that looks like. Now I
40:00
guess if you have a diamond shape or any sort
40:02
of splits, those two things now
40:04
become just acyclic. They can't turn around
40:06
and then eventually depend on each other
40:08
again. So that's a perfect chance to
40:10
just go fork it out. Exactly. And
40:12
that's been where partitions are also kind
40:14
of interesting. If you have a partitioned
40:16
asset, you could take your data set,
40:18
partition it into five buckets, and run
40:20
all five partitions at once, knowing full
40:22
well that because you've written this in
40:24
an idempotent and partitioned way, that the
40:26
first pipeline will only operate on Apple
40:29
and the second one only operates on bananas. And
40:32
there is no commingling of apples and bananas anywhere
40:34
in the pipeline. Oh, that's interesting.
40:36
I hadn't really thought about using the partitions for
40:38
parallelism, but of course. Yeah. It's
40:41
a fun little way to break things apart. So
40:43
if we run this on the Dagster Cloud
40:46
or even on our own, this is pretty
40:48
much automatic. We don't have to do anything.
40:50
Like Dagster just looks at it and says,
40:52
this looks parallelizable, and it will go. That's
40:55
right. Yeah. As long as you've got the
40:57
full deployment, whether it's OSS or cloud, Dagster
40:59
will basically parallelize it for you, where
41:01
it's possible. Excellent. You can set global
41:03
concurrency limits. So you might say, 64
41:05
is more than enough parallelization
41:08
that I need. Or maybe I want
41:10
less because I'm worried about overloading systems,
41:12
but it's really up to you. Yeah.
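For instance, a hedged sketch of capping parallelism on one job via the multiprocess executor's run config (the value is arbitrary, and exact config keys have varied across Dagster versions):

    import dagster as dg

    # At most four assets of this job execute at once.
    capped_job = dg.define_asset_job(
        "capped_job",
        config={"execution": {"config": {"multiprocess": {"max_concurrent": 4}}}},
    )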
41:14
I'm putting this on a $10 server.
41:17
Please don't kill me. Just
41:19
respect that it's somewhat wimpy, but that's OK. Yeah.
41:21
But it'll get the job done. It'll get the
41:23
job done. All right. I want to talk about
41:25
some of the tools and some of the tools
41:27
that are maybe at play here when working with
41:29
Dagster and some of the trends and stuff. But
41:31
before that, it maybe speaks to where
41:34
you could see people adopt a tool
41:36
like Dagster, but they generally don't.
41:38
They don't realize, like, oh, actually, there's
41:40
a whole framework for this. I
41:43
could, sure, I could go and
41:45
build just an HTTP server and
41:48
hook into the requests and start writing to it. But
41:50
maybe I should use Flask or FastAPI. There's
41:53
these frameworks that we really
41:55
naturally adopt for certain situations
41:57
like APIs and others.
42:00
background jobs, data pipelines, where I think there's
42:02
probably a good chunk of people who could
42:04
benefit from stuff like this, but they just
42:06
don't think they need a framework for it.
42:09
Like, cron is enough. Yeah, it's funny because sometimes
42:11
cron is enough. I don't want
42:13
to encourage people not to use cron, but
42:16
think twice, at least, is what I would
42:18
say. So probably the first
42:20
trigger for me of thinking of, you know, is
42:22
that actually a good choice is like, am I
42:24
trying to ingest data from somewhere? That's
42:27
something that fails. Like, I think we just can accept
42:29
that, you know, if you're moving data around, the
42:31
data source will break, the expectations will
42:33
change, you'll need to debug it, you'll
42:35
need to run it, and doing that
42:37
in cron is a nightmare. So I
42:39
would say definitely start to think about
42:41
an orchestration system if you're ingesting data.
42:44
If you have a simple cron job that sends one
42:46
email, like, you're probably fine. I don't think you need
42:48
to implement all of the tags just to do that.
42:51
But the more closer you get
42:53
to data pipelining, I think the
42:55
better your life will be if
42:57
you are not trying to debug
42:59
an obtuse process that no one really
43:02
understands six months from now. Excellent.
43:04
All right, maybe we could touch on some
43:07
of the tools that are interesting. I see
43:09
people using, you talked about DuckDB and DBT,
43:11
a lot of Ds starting here, but give
43:14
us a sense of like some of the
43:16
supporting tools you see a lot of folks
43:18
using that are interesting. Yeah, for sure. I
43:20
think in the data space, probably DBT is
43:23
one of the most popular choices. And
43:26
dbt, in many ways, is nothing more
43:28
than a command line tool that
43:31
runs a bunch of SQL in a
43:33
DAG as well. So there's actually a
43:35
nice fit with Dagster and dbt together.
43:37
DBT is really used by people who
43:39
are trying to model that business process
43:42
using SQL against typically a
43:44
data warehouse. So if you
43:46
have your data in, for
43:48
example, Postgres, a Snowflake, Databricks,
43:50
Microsoft SQL, these types of
43:52
data warehouses, generally, you're
43:54
trying to model some type of
43:56
business process. And typically, people use
43:58
SQL to do that. Now you can
44:01
do this without dbt, but dbt has
44:03
provided nice clean interface to doing so
44:06
It makes it very easy to connect these models
44:08
together to run them to have a development workflow
44:10
That works really well and then you can push
44:12
it to prod and have things run again in
44:14
production. So that's dbt. We
44:17
find it works really well, and a lot of
44:19
our customers are actually using dbt as well.
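For the curious, a hedged sketch of that fit using the dagster-dbt package (the manifest path is project-specific):

    from pathlib import Path
    import dagster as dg
    from dagster_dbt import DbtCliResource, dbt_assets

    @dbt_assets(manifest=Path("target/manifest.json"))  # dbt's compiled manifest
    def my_dbt_models(context: dg.AssetExecutionContext, dbt: DbtCliResource):
        # Each dbt model appears as its own asset in the Dagster graph.
        yield from dbt.cli(["build"], context=context).stream()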
44:21
Then there's DuckDB, which is great;
44:24
it's like the SQLite for
44:26
columnar databases, right? Yeah, it's in-process,
44:28
it's fast, it's written by
44:30
the Dutch or something. What's not to like about
44:32
it? It's free, we love that, it feels very
44:34
comfortable in Python itself. So
44:37
easy. Yes, exactly, the Dutch
44:39
have given us so much and
44:41
they've asked nothing of us. So I'm
44:44
always very thankful for them. It's fast.
44:46
It's so fast. It's like, if
44:48
you've ever used pandas for processing large
44:50
volumes of data, you will occasionally hit
44:53
memory limits or inefficiencies in doing
44:55
these large aggregates I won't go
44:57
into all the reasons of why that is, but DuckDB sort
45:00
of changes that, because it's a fast,
45:02
serverless, sort of C++-written tooling
45:05
to do really fast vectorized work and
45:07
by that I mean like it works
45:09
on columns. Typically in,
45:11
like, SQLite, you're doing transactions,
45:13
you're doing single-row updates, writes,
45:16
inserts, and SQLite is great at
45:18
that. Where typical transactional databases fail,
45:20
or aren't as powerful, is
45:22
when you do aggregates when you're looking at
45:24
an entire column, right? Just the way they're
45:26
architected. If you want to know the average
45:28
or the median, the sum of some
45:31
large number of columns and you want to group that by
45:33
a whole bunch of things You want
45:35
to know the first date someone did something
45:37
and the last one those types of vectorized
45:39
operations, DuckDB is really, really fast at
45:41
doing, and it's a great alternative
45:44
to, for example, pandas, which can
45:46
often hit memory limits and be
45:48
a little bit slow in that
45:50
regard Yeah, it looks like you
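As a taste of the kind of group-by being described here, this is a minimal DuckDB sketch in Python. The file name and columns are made up; duckdb.sql and the .df() handoff to pandas are the library's standard API.

```python
import duckdb  # pip install duckdb

# Query a CSV file in place: no server to run, no load step first.
# "sightings.csv" and its columns are stand-ins for your own data.
result = duckdb.sql("""
    SELECT species,
           count(*)         AS sightings,
           avg(wingspan_cm) AS avg_wingspan,
           min(observed_at) AS first_seen,
           max(observed_at) AS last_seen
    FROM 'sightings.csv'
    GROUP BY species
    ORDER BY sightings DESC
""")

result.show()     # pretty-print the relation
df = result.df()  # or hand the result to pandas when you need it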
45:52
Yeah, it looks like it has some pretty cool aspects. Transactions,
45:54
of course, but it also says
45:56
direct Parquet, CSV, and JSON querying.
45:58
So if you've got a CSV
46:00
file hanging around and you wanna ask questions
46:03
about it, or JSON, or some of the
46:05
data science stuff through Parquet: turn
46:07
a proper indexed query engine against it.
46:09
Don't just use a dictionary or something,
46:11
right? Yeah, it's great for reading
46:13
a CSV, zip files, tar
46:16
files, Parquet, partitioned Parquet files, all
46:18
that stuff that usually was really
46:20
annoying to do and operate on.
46:22
You can now install DuckDB. It's
46:24
kind of a great CLI too. So
46:26
before you go and program your
46:28
entire pipeline, you just run duckdb
46:30
and you start writing SQL against CSV files
46:32
and all this stuff to really understand your
46:35
data and just really see how quick it
46:37
is. I used it on a bird dataset
46:39
that I had as an example project and
46:41
there were millions of rows and
46:43
I was joining them together and doing massive group
46:45
bys and it was done in seconds. And it
46:48
was just hard for me to believe that it
46:50
was even correct, because it was so quick. So
46:52
it is wonderful. I must have done that
46:54
wrong somehow, because it's
46:56
done and it shouldn't be done yet. Yeah. The
46:58
fact it's in process means there's not
47:01
a server for you to
47:03
babysit, patch, make sure it's still running.
47:05
It's accessible but not too accessible, all
47:07
that, right? It's a pip install
47:09
away, which is always, we love that,
47:12
right? Yeah, absolutely. You mentioned, I guess
47:14
I mentioned Parquet, but also Apache Arrow seems
47:16
like it's making its way into a lot
47:18
of different tools and sort
47:21
of foundational, high-
47:23
performance, in-memory processing. Have you
47:25
used this at all? I've used it, especially through
47:28
working through different languages. So moving
47:30
data between Python and R is where I
47:33
last used this. I think Arrow's
47:35
great at that. I believe Arrow is underneath
47:37
some of the Rust-to-Python tooling
47:39
as well. It's at work there.
47:41
So typically I don't use Arrow directly
47:43
myself, but it's in many of the
47:46
tools I use.
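For a quick look at what Arrow provides under the hood, here is a tiny pyarrow sketch. The table contents are invented, and whether the pandas handoff is truly zero-copy depends on the column types involved.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory columnar table: this is the format tools like
# Polars, pandas, and DuckDB can pass around without re-serializing.
table = pa.table({"species": ["osprey", "heron"], "count": [3, 5]})

# Persist it as Parquet and read it back.
pq.write_table(table, "birds.parquet")
roundtrip = pq.read_table("birds.parquet")

# Hand it to pandas; for many numeric types this is close to zero-copy.
df = roundtrip.to_pandas()
print(df)
```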
47:48
All right. So, great product, and so much of the ecosystem is now
47:50
built on Arrow. Yeah, I think a lot of it,
47:52
I feel like the first time I heard about it
47:55
was through Polars. I'm
47:57
pretty sure, which is another Rust story,
48:00
kind of like pandas, but with
48:02
a little bit more fluent, lazy API. Yes.
48:04
We live at such great times, to be
48:06
honest. So, Polars is Python
48:08
bindings for Rust, I believe, is kind of
48:11
how I think about it. It does all
48:13
the transformation in Rust, but you have this
48:15
Python interface to it and it
48:17
makes things again, incredibly fast. I
48:19
would say similar in speed to
48:21
DuckDB. They both are quite comparable
48:23
sometimes. Yeah, it also claims
48:26
to have vectorized and columnar processing and all
48:28
that kind of stuff. Yeah, it's pretty incredible.
48:30
So, not a drop-in replacement for pandas, but
48:32
if you have the opportunity to use it
48:34
and you don't need to use the full
48:36
breadth of what pandas offers, because pandas is
48:38
quite a huge package. There's a lot it
48:40
does. But if you're just using simple transforms,
48:42
I think polars is a great option to
48:44
explore. Now, I talked to Ritchie Vink,
48:47
who was part of that. And I think
48:49
they explicitly chose to not try to make
48:51
it a drop-in replacement for pandas, but try
48:54
to choose an API that would allow the
48:56
engine to be smarter: I see you're asking
48:58
for this, but in the step before, you
49:00
wanted this other thing, so let me do
49:03
that transformation all in one shot. A little
49:05
bit like a query optimization engine.
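Here is a minimal sketch of that lazy, optimizer-friendly style in Polars. The file and column names are invented, and some API names have shifted across Polars versions (for example, group_by was once groupby).

```python
import polars as pl  # pip install polars

# Lazy mode: nothing executes until .collect(), so Polars sees the whole
# query and can optimize it, e.g. push the filter down into the CSV scan.
lazy = (
    pl.scan_csv("sightings.csv")
      .filter(pl.col("species") == "osprey")
      .group_by("observer")
      .agg(pl.len().alias("n_sightings"))
)

print(lazy.explain())  # inspect the optimized plan before running it
df = lazy.collect()    # execute the whole pipeline in one shot
```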
49:07
What else is out there? We've got time for just
49:09
a couple more, if there's anything there, like,
49:12
"oh yeah, people use this all the time."
49:14
Obviously the databases, you've said Postgres, Snowflake, etc.
49:16
Yeah, there's so much. So, another little one
49:19
I like is called dlt, from dltHub. It's
49:21
getting a lot of traction as well. And
49:23
what I like about it is how lightweight
49:25
it is. I'm such a big fan of
49:28
lightweight tooling that's not a massive framework. Loading data is,
49:30
I think, still kind of yucky in many
49:32
ways. It's not fun. And dlt makes it
49:34
a little bit simpler and easier to do
49:36
so. So, that's what I would recommend people
49:38
just look into if you've got to
49:40
ingest data from some API,
49:43
some website, some CSV file. It's
49:45
a great way to do that.
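For flavor, here is a minimal dlt sketch. The pipeline name, dataset, and rows are all made up; dlt.pipeline and pipeline.run are the library's core entry points, and DuckDB is a convenient local destination.

```python
import dlt  # pip install dlt

# A pipeline needs a destination and a dataset name; DuckDB works well locally.
pipeline = dlt.pipeline(
    pipeline_name="bird_ingest",
    destination="duckdb",
    dataset_name="raw",
)

# Any iterable of dicts works; dlt infers and evolves the schema for you.
rows = [
    {"id": 1, "species": "osprey", "seen_at": "2024-01-11"},
    {"id": 2, "species": "heron", "seen_at": "2024-01-12"},
]

info = pipeline.run(rows, table_name="sightings")
print(info)  # summary of what was loaded where
```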
49:47
It claims it's the Python library
49:49
for data teams loading data into
49:51
unexpected places. Very interesting. Yes, that's
49:53
great. Yeah, this looks cool. All
49:55
right. Well, I
49:58
guess maybe let's talk about, before
50:00
we talk about what's next, a little
50:02
business model time. I'm always fascinated. I think there's
50:04
starting to be a bit of a
50:06
blueprint for this, but companies that take
50:08
a thing, they make it and
50:10
give it away, and have a company around
50:12
it. And congratulations to you all for
50:14
doing that. And a lot of
50:16
it seems to kind of center around
50:18
the open core model, which I don't
50:20
know if that's exactly how you would
50:23
characterize yourselves. Tell us about
50:25
the business side. I've noticed many successful
50:27
open source projects, they don't necessarily result
50:29
in full-time jobs or companies for the people
50:31
who work on them. Yeah, it really is
50:33
a tough one. I don't think it's one
50:35
that anyone has truly figured out. Well,
50:37
I can't say this is the way forward
50:39
for everyone, but it is something we're trying
50:41
to figure out for Dagster. I think it's working
50:43
pretty well, and what I think is really
50:45
powerful about Dagster is, like, the open
50:47
source project is really, really good, and it
50:49
hasn't really been limited in many
50:51
ways in order to drive, like, cloud products.
50:54
We genuinely believe that there's actual
50:56
value in the separation that exists. There
50:58
are some things that we just can't do
51:00
in the open source platform: for example,
51:02
hosting pipelines in the cloud that involve, you
51:04
know, ingesting data into external systems and the
51:06
rest of them, access
51:08
to alerting on your pipelines or system.
51:11
Dagster, for the most part, the
51:13
guts of it are open source. I really believe,
51:16
though, that getting it in the hands of
51:18
others is the best way to prove
51:20
the value of it, and if we can
51:22
build a business on top of that, we're
51:24
super happy to do so. It's nice
51:26
that we get to try both sides
51:28
of it. To me, that's one of the
51:30
amazing parts. A lot of
51:32
the development that we do, and that's open
51:34
source, is driven by people who are paid
51:36
through, you know, what happens on the cloud.
51:38
And I think, from what I can tell,
51:40
there is still no better way to build robust
51:43
software than to have people who
51:45
are specifically paid to develop a product. Otherwise
51:47
it can be a labor of love, but one that
51:49
doesn't last for very long. And whenever I
51:51
think about building software, there's the eighty percent of
51:53
it that's super exciting and fun, and then
51:55
there's that little sliver of, like, really final
51:57
polish that, if it's not just your job
51:59
to make that thing polished, you're just, for
52:01
the most part, just not going to polish
52:03
that bit, right? Good stuff.
52:06
UI, design, support. There's all these
52:08
things that go into making software
52:10
really extraordinary. That's really, really tough
52:12
to do. And I think
52:14
I really like the open source business model.
52:16
I think for me, being able to just
52:19
try something, not having to talk to sales
52:21
and being able to just deploy locally and
52:23
test it out and see if this works.
52:25
And if I choose to do so, deploy
52:27
it in production. Or, if I buy the
52:29
cloud product and don't like the direction it's
52:31
going, I can leave and go open source as
52:33
well. That's pretty compelling to me. Yeah, for sure
52:35
it is. And I
52:37
think the more moving pieces of infrastructure,
52:39
the more uptime you want and all
52:42
those types of things, the more somebody
52:44
who's maybe a programmer, but not a
52:46
DevOps infrastructure person, but needs to have
52:48
it there, right? Like that's an opportunity
52:50
as well, right? For you to say,
52:52
look, you can write the code. We
52:55
made it cool for you to write the code, but
52:57
you don't have to get notified when the server's down
52:59
or whatever. We'll just take care of that for
53:01
you. That's pretty awesome. Yeah, and it's efficient through
53:03
scale as well, right? We've learned from the
53:06
same mistakes over and over again, so you don't have
53:08
to, which is nice. I don't know many people
53:10
who want to maintain servers, but people do, and they're
53:12
more than welcome to if that's how they choose to
53:14
do so. Yeah, for sure. All
53:16
right, just about out of time. Let's close
53:19
up our conversation with where are
53:21
things going for Dagster? What's
53:23
on the roadmap? What are you excited about? Oh,
53:25
that's a good one. I think we've actually published
53:28
our roadmap online somewhere; if you search Dagster
53:30
roadmap. It's probably out there. I think for the
53:32
most part, that hasn't changed much going into 2024,
53:34
though we may update it.
53:37
There it is. We're really just doubling down on
53:40
what we've built already. I think there's a lot
53:42
of work we can do on the product itself
53:44
to make it easier to use, easier to understand.
53:46
Dagster specifically is really focused around the education piece.
53:49
We launched Dagster University's first module,
53:51
which helps you really understand the
53:53
core concepts around Dagster. Our next
53:55
module is coming up in a couple months, and
53:58
that'll be around using Dagster with dbt, which
54:00
is our most popular integration. We're building out more
54:02
integrations as well. So I built
54:04
a little integration called Embedded ELT that makes
54:06
it easy to ingest data. But I want
54:08
to actually build an integration with the ELT
54:11
as well, ELT Hub. So we'll be doing
54:13
that. And there's more
54:15
coming down the pipe, but I don't know how much I can say.
54:17
Look out for an event in April
54:20
where we'll have a launch event on
54:22
all that's coming. Nice. Is it an online
54:24
thing people can attend, or something like
54:26
that? Yeah, there'll be some announcements there
54:28
on the Dagster website on that. Maybe
54:30
I will call out one thing that's actually
54:33
really fun. It's called Dagster Open Platform. It's
54:35
a GitHub repo that we launched a couple
54:37
months ago, I want to say. We
54:39
took our internal, I should go back
54:42
one more. Sorry. It's, like, GitHub: Dagster
54:44
Open Platform on GitHub. I
54:46
have it somewhere. Yeah. It's
54:49
up here in another organization. Yes,
54:51
it should be somewhere here. There
54:53
it is. Dagster Open Platform on
54:55
GitHub. And it's really a clone
54:57
of our production pipelines. For the
54:59
most part, there's some things we've chosen to
55:01
ignore because they're sensitive. But as much as
55:04
possible, we've defaulted to making it public and
55:06
open. And the whole reason behind this was
55:08
because as data engineers, it's often hard to
55:10
see how other data engineers write code. We
55:12
get to see how software engineers write code
55:14
quite often, but most people don't want to
55:16
share their platforms for various
55:18
good reasons. They're also often
55:20
smaller teams, or maybe just
55:22
one person. And then those
55:24
pipelines are so integrated into
55:27
your specific infrastructure.
55:30
It's not like, well, here's a web framework to
55:32
share. Here's how we integrate into that one weird
55:34
API that we have that no one else has.
55:36
There is no point in publishing it to you.
55:38
That's typically how it goes. Or they're so large
55:40
that they're afraid that there's something sensitive in there that
55:42
they just don't want to take the risk on.
55:44
And then we built something that's in the middle
55:46
where we've taken as much as we can and
55:48
we publicized it. And you can't run this on
55:50
your own. That's not the point. The point is
55:52
to look at the code and see how does
55:54
Dagster use Dagster and what does that look like?
55:56
Nice. Okay. All right. Well, I'll put a link
55:58
to that in the show notes and people can
56:00
check it out. Yeah, I guess let's
56:03
wrap it up with the final call to action.
56:05
People are interested in Dagster. How do they
56:07
get started? What do you tell them? Oh
56:10
yeah. dagster.io is probably the greatest place to
56:12
start. You can try the cloud product. We
56:14
have free self-serve or you can try the
56:16
local install as well. If you
56:18
get stuck, a great place to join is our Slack
56:20
channel, which is up on our website. There's even a
56:23
Ask AI channel where you can just talk
56:25
to a Slack bot that's been trained on
56:27
all our GitHub issues and discussions. Surprisingly
56:30
good at walking you through any debugging, any issues
56:32
or even advice. That's pretty excellent actually. Yeah, it's
56:34
real fun. It's really fun, and it really does
56:36
work. We're also there in the community where
56:39
you can just chat to us as well. Cool.
56:42
All right. Pedram, thank you for being on the show. Thanks
56:45
for all the work on Dagster and sharing it with us. Thank
56:47
you Michael. You bet. See you later.
56:49
This has been another episode of Talk Python to Me. Thank
56:52
you to our sponsors. Be sure to check out what
56:54
they're offering. It really helps support the show. This
56:58
episode is sponsored by
57:00
Posit Connect from the
57:02
makers of Shiny. Publish,
57:04
share and deploy all
57:06
of your data projects
57:08
that you're creating using
57:10
Python. Streamlit, Shiny, Bokeh, FastAPI, Flask, Quarto. Reports, dashboards, and APIs.
57:13
Posit Connect supports all of them. Try
57:15
Posit Connect for free by going to
57:17
talkpython.fm slash posit.
57:19
P-O-S-I-T. Want
57:22
to level up your Python? We have one of
57:24
the largest catalogs of Python video courses over at
57:27
Talk Python. Our content ranges from
57:29
true beginners to deeply advanced topics like
57:31
memory and async. And best of all,
57:33
there's not a subscription in sight. Check
57:35
it out for yourself at training.talkpython.fm. Be
57:39
sure to subscribe to the show, open your favorite
57:41
podcast app, and search for Python. We should be
57:44
right at the top. You can also
57:46
find the iTunes feed at slash iTunes,
57:48
the Google Play feed at slash Play,
57:50
and the Direct RSS feed at
57:52
slash RSS on talkpython.fm. We're
57:54
live streaming most of our recordings these days. If you
57:56
want to be part of the show and have your
57:59
comments featured on the... Be sure
58:01
to subscribe to our YouTube channel
58:03
at talkpython.fm slash YouTube. This
58:05
is your host Michael Kennedy. Thanks so much for
58:07
listening. I really appreciate it. Now get out there
58:09
and write some Python code. Thanks
58:30
for watching.