Podchaser Logo
Home
Data Pipelines with Dagster

Data Pipelines with Dagster

Released Thursday, 21st March 2024
Good episode? Give it some love!
Data Pipelines with Dagster

Data Pipelines with Dagster

Data Pipelines with Dagster

Data Pipelines with Dagster

Thursday, 21st March 2024
Good episode? Give it some love!
Rate Episode

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

Use Ctrl + F to search

0:00

Do. You have data that you pull

0:02

from external sources or that is

0:04

generated in appears that your digital

0:06

doorstep and but that data needs

0:08

processed, filtered, transformed, distributed and much

0:10

more. One. Of the biggest tools

0:12

to create these data pipelines with

0:14

Python is Baxter. And we're

0:17

fortunate to have Headroom Naveed on the

0:19

showed. Tell us about it. Headroom is

0:21

the head of Did Engineering and Devereaux

0:23

at Baxter Labs and we're talking data

0:25

pipelines this week here at Talk By

0:27

Than. This is hop by The

0:29

New Me Episode Four Hundred and Fifty Four

0:31

Recorded January Eleventh. Two Thousand And Twenty Four.

0:48

Welcome. To Talk By The Enemy

0:50

a weekly podcast on Python. This is

0:52

your host Michael Kennedy. Follow me on

0:54

Mastodon where I'm at Him Kennedy and

0:57

follow the podcast using at Tuck Python

0:59

both on Busted on.org. Keep. Up

1:01

with the show and listened over seven years

1:03

of pass episode at top I thought.of him.

1:06

We. Started streaming most of our episodes

1:08

live on you tube. Subscribe to our

1:10

youtube channel over at Talk Python.fm/you Tube

1:13

to get notified about upcoming shows and

1:15

be part of that episode. This.

1:19

This episode is sponsored by Posit Connect

1:21

from the makers of Shiny. Publish,

1:24

share and deploy all of your data

1:26

projects that you're creating using Python. Streamlet,

1:28

Dash, Shiny, Bokeh, FastAPI,

1:30

Flas, Quattro, Reports, Dashboards

1:33

and APIs. Pause. It connects

1:35

supports all of them. Try Posit

1:37

connect for free by going to

1:39

talk Python that Fm/posit Pos Id.

1:42

And. It's also brought to you by

1:44

us over at Tuck Python Training. Did

1:46

you know that we have over two

1:48

hundred and fifty hours of Python courses?

1:51

Yeah. That's right, check him out at Talk

1:53

by thought that Fm/courses. Last.

1:56

Week I told you about our new

1:58

course build an Ai audio. With

2:00

Python. Well. I have a

2:02

nother brand new an amazing course to tell

2:04

you about. This. Time. It's all

2:07

about Pythons typing system and how to

2:09

take the most advantage of it. It's.

2:11

A really awesome course called Rock Solid

2:13

Python with Python typing. This one of

2:15

my favorite courses that I've created in

2:18

the last couple of years. Python.

2:20

Type hints are really starting

2:22

to transform Python, especially from

2:24

the ecosystems perspective. Think. Fast

2:26

a pie i didn't a bear type,

2:28

etc. This. Course shows you

2:30

the ins and outs of Python type in

2:33

syntax of course, but it also gives you

2:35

guidance on when and how to use type

2:37

hints. Check out this for

2:39

half hour in depth course at

2:41

tuck by found out of him/courses.

2:44

South. Onto. Those data pipelines. Headroom

2:48

A welcome to talk by the

2:50

me it's amazing the heavier Michael

2:52

get have to could be here

2:54

yeah can talk about data data

2:56

pipelines, automation and will boil of

2:58

it's how you have. I been

3:00

in the Dev ops he side

3:03

of things this week and I

3:05

would have a special special appreciation

3:07

of it I can tell already

3:09

so that excited as or we

3:11

could also consistent in the so

3:13

for we get to that though

3:15

for we talk about Dexter and

3:17

data pipelines and. Orchestration more broadly

3:19

was just go over to background on

3:21

you, introduce yourself for people, had you

3:23

get into Python and beta orchestration and

3:26

all those things for us here. So

3:28

my name is Petromin of Eat, I'm

3:30

the head of State Engineering and overall

3:32

attack sir that's a mouthful and I've

3:35

been longtime Python user since two Point

3:37

Seven and I got started with Python

3:39

with a D with me. Things she

3:41

said of sheer laziness. I was working

3:44

at a bank and their asses grow

3:46

tasks, something involving going into servers. Opening

3:48

up a text file and seeing of

3:51

a patch was apply to server. A

3:53

nightmare scenario when there's a hundred service

3:55

attack and fifteen different patches to confirm.

3:57

Yeah so this kind of predates like

3:59

the cloud and all that automobile and

4:01

staff writer. so does this a before

4:03

cloud. This is like rate between Python

4:06

to in Python. three of you trying

4:08

to figure out how to use pin

4:10

statements correctly actually learn Python as they

4:12

this gotta be better way and honestly

4:14

have not looked back at. Think: if

4:16

you automate or create tread trajectory you'll

4:18

see it says on shared by finding

4:20

ways to be more lazy or in

4:23

many ways aspects. Yeah, who was A

4:25

I think it was Matthew Rockland that

4:27

had the phrase something like productive laziness

4:29

Her: yes. And like that, like. I'm

4:32

going to find a way to leverage my

4:34

laziness to force me to bill automation so

4:36

I never, ever have to do this sort

4:38

of thing again. I got that sort of

4:40

print is very motivating to not have to

4:42

do something and I'll do anything to notice

4:45

of the down there. it's incredible. Unlike that

4:47

of Up Stuff I was talking about. just

4:49

one command and. As

4:51

maybe eight or nine new apps with

4:53

all their tears redeploy update is received

4:55

in It's It took me a lot

4:57

of work to get there, but now

4:59

I. Never after. Think about it again, at

5:01

least not for a few years. and to

5:03

and it's It's amazing I can be productive.

5:05

It's like right and line and line with

5:07

us. So or what are some of the

5:09

Python projects he been he worked on talked

5:12

about a different way as supply this over

5:14

the years. Oh yes so it's sort of

5:16

in with internal just like Python projects turn

5:18

automate with his had some wrote task that

5:20

I had and that accidently becomes you know

5:22

a bigger project. People see it and nilly

5:24

oh I want that to and seven when

5:26

they have to build a cool green interface

5:28

because most people don't speak Python. and

5:30

so i got me into i agree

5:32

i think it was called way back

5:34

when that was the fun journey and

5:36

then from there it's really taken off

5:38

a lot of it has been mostly

5:40

personal projects by to understand of the

5:42

source with are really pick learning a

5:45

path for me as well really being

5:47

absorbed by things like see cloth me

5:49

and requests back when they were coming

5:51

up eventually lead to more but it

5:53

engineering or temporal where i got involved

5:55

with tools like airflow and try to

5:57

automate it up i find incentives patches

5:59

and server that one day led to,

6:01

I guess, making a long story short, a

6:03

role at Daxter where now I contribute a

6:06

little bit to Daxter. I work on Daxter,

6:08

the core project itself, but I also use

6:10

Daxter internally to build our own data pipelines.

6:13

I'm sure it's interesting to see

6:15

how you all both build Daxter

6:17

and then consume Daxter. Yeah, it's

6:19

been wonderful. I think there's a

6:21

lot of great things about it.

6:24

One is like getting access to

6:26

Daxter before it's fully released, right?

6:28

So internally, we dog food, new

6:30

features, new concepts, and we work with the

6:32

product team, the engineering team to say, hey,

6:34

this makes sense, this works really well, that

6:36

doesn't. And that feedback loop is so

6:39

fast and so iterative that for me personally, being

6:42

able to see that come to fruition is really,

6:44

really compelling. But at the same time, I get

6:46

to work at a place that's building a tool

6:48

for me. You don't often

6:50

get that luxury. Yeah. I've

6:52

worked in ads, I've worked in insurance,

6:54

it's banking, these are nice things, but

6:56

it's not built for me, right? And

6:59

so for me, that's probably been the biggest benefit,

7:01

I would say. Right. If you

7:03

work in some marketing thing, you're like, you know,

7:05

I retargeted myself so well today, you wouldn't believe

7:07

it. I really enjoyed it. Yeah,

7:10

I've seen the ads that I've created before.

7:12

So it's a little fun, but it's not

7:14

the same. Yeah, I've heard of people who

7:16

are really, really good at ad targeting

7:19

and finding groups where they

7:21

like pranked their wife or something or just had

7:23

an ad that would only show up for their

7:25

wife by running it. It's like so

7:28

specific and you know, freak them out a little bit. That's

7:30

pretty clever. Yeah, maybe

7:32

it wasn't appreciated, but it is clever. Who

7:34

knows? All right. Well,

7:38

before we jump in, you said that of

7:40

course you built GUIs with PyGui and those

7:42

sorts of things because people don't

7:45

speak Python back then, two, seven days and

7:47

whatever. Is that different now? Not

7:49

that people speak Python, but is it different in the

7:51

sense that like, hey, I could give them a Jupyter

7:53

Notebook or I could give them Streamlit

7:56

or one of these things, right? Like a little

7:58

more or less you building just... like plug it

8:00

in? I think so. I mean, yeah, like you

8:02

said, it's not different in that most people probably

8:04

still to the stage don't speak Python. I know

8:06

we had this like movement a little bit back

8:08

where everyone was going to learn like SQL and

8:11

everyone was going to learn to code. I

8:13

was never that bullish on that trend

8:15

because like if I'm a marketing person, I've got

8:17

10,000 things to do and learning

8:19

to code isn't going to be the priority ever.

8:22

So I think building interfaces for people that

8:24

are easy to use and speak well to

8:27

them is always useful. That never has gone

8:29

away. But I think the tooling around

8:31

it has been better, right? I don't think I'll

8:33

ever want to use POGUI again and nothing wrong

8:36

with the platform. It's just like not fun to

8:38

write streamlet makes it so easy to do that.

8:40

So it's like something like retool and there's like

8:42

a thousand other ways now that you can bring

8:44

these tools in front of your stakeholders and your

8:47

users that just wasn't possible before. I think it's

8:49

a pretty exciting time. There are a lot of

8:51

pretty polished tools. Yeah, it's gone so good. Yeah.

8:53

There are some interesting ones like OpenBB. Do you

8:56

know that the financial dashboard thing? I've heard of

8:58

this. I haven't seen it. Yeah, it's

9:00

basically for traders, but it's like a terminal

9:03

type thing that has a bunch of

9:05

Matplotlib and other interactive stuff that pops

9:07

up kind of compared to say Bloomberg

9:09

dashboard things. But yeah, that's one sense

9:11

where like maybe traders go and learn

9:14

Python because it's like, all right, there's

9:16

enough value here. But in general, I

9:18

don't think people are going to stop

9:20

what they're doing to learning the code.

9:22

So these new UI things are not.

9:24

All right, let's dive in and talk

9:26

about this general category

9:29

first of data pipelines, data

9:31

orchestration, all those things. We'll talk about

9:33

Daxter and some of the trends and

9:36

that. So just grab some random internet

9:38

search for what is a

9:40

data pipeline maybe look like, but you know,

9:42

people out there listening who don't necessarily live

9:44

in that space, which I think is honestly

9:46

many of us, maybe we should, but maybe

9:49

in our minds, we don't think we live

9:51

in data pipeline land. Like tell them about

9:53

it. Yeah, for sure. It is hard to

9:55

think about if you haven't done or built

9:57

one before. In many ways, a data pipeline

9:59

is just a series. of steps that you

10:01

apply to some data set that you have

10:04

in order to transform it to something a

10:06

little bit more valuable at the very end.

10:08

That's a simplified version, the devil's in the

10:10

details, but really, at the end of

10:12

the day, you're in a business, the production of data happens by

10:14

the very nature of operating that

10:17

business. It tends to be the core thing

10:19

that all businesses have in common. And then

10:21

the other output is you have people within

10:23

a business who are trying to understand how

10:25

the business is operating. And this used to

10:27

be easy when all we was a single

10:30

spreadsheet that we can look at once a month. I

10:32

think is the system gone a little bit more

10:34

complex than these days, computer and automation? And expectations,

10:36

like they expect to be able to see almost

10:38

real time, not I'll see it at the end

10:40

of the month, sort of. That's right. Yeah. I

10:42

think people have gotten used to getting data too,

10:44

which is both good and bad good in the

10:46

sense that now people are making better decisions, bad,

10:49

and then there's more work for us to do.

10:51

And we can't just sit in our feet for

10:53

half a day, half a month waiting for the

10:55

next request to come in. There's just an endless

10:57

stream that seems to never end. So that's what

10:59

really is pipeline is all about. It's like

11:01

taking these data and making it consumable in

11:03

a way that users tools will understand that

11:05

helps people make decisions at the very end

11:08

of the day. That's sort of the nuts

11:10

and bolts of it. In your mind, does

11:12

data acquisition live in this

11:14

land? So for example, maybe we have

11:16

a scheduled job that goes and does

11:18

web scraping, calls an API once an

11:20

hour, and that might kick off a

11:23

whole pipeline of processing. Or we watch

11:25

a folder for people to upload

11:28

over FTP, like a CSV

11:31

file or something horrible like that. It's

11:33

unspeakable. But something like that where

11:35

you say, Oh, a new CSV has arrived for

11:37

me to get. Right? Yeah, I think that's

11:40

the beginning of all data pipeline journeys in

11:42

my mind very much. And actually, as much

11:44

as we hate it, it's not terrible. I

11:46

mean, there

11:48

are worse ways to transfer files. But it's

11:51

still very much in use today. And

11:53

every data pipeline journey at some point

11:55

has to begin with consumption of data

11:58

from somewhere. Hopefully, it's SFT. not

12:00

just straight FTP, like the encrypted, don't just

12:03

send your password in the

12:05

plain text. Oh well, I've

12:07

seen that go wrong. That's a story for

12:10

another day, honestly. All

12:12

right, well, let's talk about the project that you work

12:14

on. We've been talking about it in general, but

12:16

let's talk about Baxter. Like, where does it fit in this

12:18

world? Yes. Baxter to me

12:21

is a way to build a data

12:23

platform. It's also a different way of

12:25

thinking about how you build data pipelines.

12:27

Maybe it's good to compare it with

12:29

kind of what the world was like, I think,

12:31

before Dijkstra and how it came

12:33

about to be. So if you think

12:35

of Airflow, I think it's probably the

12:37

most canonical orchestrator out there. But there

12:39

are other ways which people used to

12:41

orchestrate these data pipelines. They

12:44

were often task-based, right? Like, I would

12:46

download file, I would unzip file, I

12:48

would upload file. These are sort of

12:51

the words we use to describe the

12:53

various steps within a pipeline. Some

12:55

of those little steps might be Python functions that

12:58

you write. Maybe there's some pre-built other ones. Yeah,

13:00

there might be Python, could be a bash script,

13:02

could be logging into a server and downloading a

13:04

file, could be hitting request and

13:07

downloading something from the internet, unzipping it. Just

13:09

a various hodgepodge of commands that would run.

13:11

That's typically how we thought about it. For

13:13

more complex scenarios where your data is bigger,

13:16

maybe it's running against a Hadoop cluster or

13:18

a Spark cluster. The compute's been offloaded somewhere

13:20

else. But the sort of conceptual way you

13:22

ended to think about these things is in

13:25

terms of tasks, right? Process this thing, do

13:27

this massive data dump, run a bunch of

13:29

things, and then your job is

13:31

complete. With Airflow, or I'm sorry, with DAGSAR,

13:33

we kind of flip it around a little

13:35

bit on our heads and we say, instead

13:38

of thinking about tasks, what if we flipped

13:40

that around and thought about the actual underlying

13:42

assets that you're creating? What if you told

13:44

us not the steps that you're going to

13:47

take, but the thing that you produce? Because

13:49

it turns out that people and data people

13:51

and stakeholders really, we don't care about the

13:53

task. We just assume that you're going

13:55

to do it. What we care about is that table, that model,

13:57

that file. that

14:00

Jupyter Notebook. And if we model our

14:02

pipeline through that, then we get a

14:04

whole bunch of other benefits. And that's

14:06

sort of the Daxter's sort of pitch,

14:08

right? Like, if you want to understand

14:10

the things that are being produced by

14:12

these tasks, tell us about the underlying

14:14

assets. And then when a stakeholder says and comes

14:16

to you and says, how old is this table?

14:19

Has it been refreshed lately? Well, you don't have

14:21

to go look at a specific task and remember

14:23

that task ABC had modeled XYZ. You just go

14:25

and look up model XYZ directly there and it's

14:27

there for you. And because you've defined things in

14:30

this way, you get other nice things like a

14:32

lineage graph, you get to understand how fresh your

14:34

data is, you can do event based orchestration and

14:36

all kinds of nice things that are a lot

14:39

harder to do in a task world. Yeah,

14:41

more declarative, less imperative,

14:44

I suppose. Yeah, it's been the trend, I

14:46

think, in lots of tooling. React, I think

14:48

was famous for this as well, right? In

14:50

many ways, it was a hard framework, I

14:52

think, for people to sort of get their

14:55

heads around initially, because you were so used

14:57

to the jQuery declared or jQuery

14:59

style of doing things. Yeah, how do I hook

15:01

the event that makes the thing happen? Right. And

15:04

React said, let's think about it a little bit

15:06

differently. Let's do this event based orchestration. Really. And

15:08

I think the proof is putting React

15:10

everywhere now and jQuery would be not so much. Yeah,

15:13

there's still a lot of jQuery out there, but there's not

15:15

a lot of action. Not a lot

15:17

of active jQuery, but I imagine there's some

15:19

there's just because people like,

15:21

you know what, don't touch that, that works. Which

15:24

is probably the smartest thing people can do,

15:26

I think. Yeah, honestly, even though new frameworks

15:29

are shiny. And if there's

15:31

any ecosystem that loves to chase the shiny

15:33

new idea to the JavaScript web world. Oh,

15:35

yeah, there's no shortage of new frameworks coming

15:37

out every time. Yeah, I mean, we

15:40

do too, but not as much as like,

15:42

that's six months old. That's so old, we

15:44

can't possibly do that anymore. We're rewrite now. We're

15:46

gonna do the big rewrite again. Yep.

15:49

Okay, so Daxter is the company,

15:52

but also is open source. What's the

15:54

story on like, can I use it for free? Is it

15:57

open source? I pay for it. Okay, company,

16:00

Daxter open source is the

16:02

product 100% free, we're very

16:04

committed to the open source model. I

16:07

would say 95% of the things you can get

16:09

out of Daxter are available through open source and

16:11

we tend to try to release everything through that

16:13

model. You can run

16:15

very complex pipelines and you

16:17

can deploy it all on your own if you

16:19

wish. There is a Daxter cloud product, which is

16:21

really the hosted version of Daxter. If you want

16:23

hosted plane, we can do that for you through

16:25

Daxter cloud, but it all runs on the same

16:27

code base and the modeling and the files all

16:30

essentially look the same. Okay,

16:32

so obviously you could get

16:34

like I talked about at the beginning, you could

16:36

go down the DevOps side, get your own open

16:38

source Daxter, set up, schedule it, run it on

16:40

servers, all those things. But if we just wanted

16:42

something real simple, we could just go to you

16:45

guys and say, hey, I built this with Daxter.

16:47

Will you run it for me? Pretty much, yeah,

16:49

right? So there's two options there. You can do

16:51

the serverless model, which says, you know, Daxter just

16:53

run it, we take care of the compute, we

16:55

take care of the execution for you and you

16:58

just write the code and upload it to GitHub

17:00

or, you know, repository of your

17:02

choice and we'll sync to that and then run

17:04

it. The other option is to do the hybrid

17:06

model. So you basically do the CI CD aspect,

17:09

you just say you push to name

17:11

your branch, if you push that branch, that

17:13

means we're just going to deploy a new

17:15

version and whatever happens after that, it'll be

17:17

in production, right? Exactly. Yeah. And we offer

17:20

some templates that you can use in GitHub

17:22

for workflows in order to accommodate that. Excellent.

17:25

Then I cut you off, you're saying something about hybrid. Hybrid

17:27

is the other option. For those of you who want to

17:29

run your own compute, you don't want the data

17:31

leaving your ecosystem. You can say we've got

17:33

this Kubernetes cluster, this ECs cluster, but we

17:35

still want to use the Daxter cloud product

17:38

to sort of manage the control plane. Daxter

17:40

cloud will do that. And then you can

17:42

go off and execute things on your own

17:44

environment if that's something you wish to do.

17:46

Oh, yeah, that's pretty clever. Because running stuff

17:48

in containers isn't too bad. But running container

17:50

clusters, all of a sudden, you're

17:52

back, back doing a lot of work, right? Exactly.

17:55

Yeah. Okay, well, let's maybe talk

17:57

about Daxter for a bit that I Want to talk

17:59

about some of. The trend as well. that

18:01

was you talk to, maybe setting up a

18:03

pipeline? I could. What is it look like

18:05

need talked about in general. I'm less imperative,

18:08

more declared of. but where does it look

18:10

like? Be careful. Other him a code on

18:12

audio that you know? Oh yes. give us

18:14

a sense of what the programming model feels

18:17

like for us as much as possible. It

18:19

really feels like to spreading Python. It's pretty

18:21

easy you out of their creator on top

18:24

of your existing Python function that does something

18:26

as a simple decorator called asset and then

18:28

you are. I find that function becomes. The

18:31

did assets house represented in the ducks are

18:33

you I so you could imagine you've got

18:35

a pipeline that gets like maybe slack analytics

18:37

and upon set to such as part by

18:40

your first pipeline a functional it because I

18:42

feel excited data and that would be your

18:44

asset in that function is where you do

18:46

all the transform the download the data until

18:49

you've really created that fundamental data as if

18:51

you care about and I can be stored

18:53

either you know in a data warehouse to

18:55

as three hours. So to what a process

18:58

that that's really up to you. And

19:00

then the resources to sort of where the

19:02

power i think of a lot of dice

19:04

it comes in the ass at its are

19:06

a lot like declaration of the thing and

19:08

going to create a resource is how I'm

19:10

going to operate on that bit because sometimes

19:12

you might want to have a a same

19:14

a ducky be instance locally because it's easier

19:16

and faster to operate. What will you Moving

19:18

to the cloud you want to have a

19:20

it a break sweater or hub snowflake? You

19:23

can swap resources based on environments and you

19:25

are sick and reference that resource and as

19:27

long as a hazard. same sort of if

19:29

you are you. Can really sucks to be

19:31

changed between were that data as going to

19:33

be persistent. Does Dexter know how to talk

19:35

to those different platforms? Does it natively understand

19:37

the Tv and snowflake? Yes it's interesting. People

19:39

often look the dachshund like oh does it

19:41

to Axe and question is like accurate as

19:44

anything you can do Python with which is

19:46

most things yeah visible thing so I think

19:48

you to come from the airflow world you're

19:50

very much used to like these airflow provide

19:52

hers and if ya know what I think

19:54

and yeah yeah you read a post guess

19:56

easy to find the post with provider you

19:58

want He said there you need to find

20:01

Esther provider attacks Are you going to say

20:03

it off today that if you want to

20:05

use know say for example it's ah the

20:07

Stasi connector package from Snowflake A use that

20:09

as a resource directly and then you just

20:11

run you're sick or that way There are

20:13

some places where we do have innovations that

20:15

out you when you get into the bees

20:17

of i owe manager is where we are

20:20

sister data on your behalf and sell for

20:22

as three for snowflake for example there's other

20:24

ways with can persist that they to for

20:26

you but issued to turn it on a

20:28

query to try to execute something. Sort

20:30

of say something somewhere you don't have to

20:32

you that system at all. You can just

20:35

use whatever Python package you give. you would

20:37

use any way to do that. So maybe

20:39

some data as expensive for us to get

20:41

as a company like mean were charged i

20:44

am usage basis or super slow or something

20:46

I could write as Python code that doesn't

20:48

say well look at my local database if

20:50

it's already there. use that as had to

20:53

stay or otherwise than to actually go get

20:55

it but it put it there and then

20:57

get it back and like that kind of

20:59

stuff would be up to meet us at

21:02

against. yep and as that I see his.

21:04

You're not really limited by like anyone's data

21:06

model or or world's here on how did

21:09

has to be retrieve saved augmented. You get

21:11

a couple ways you could say whenever I'm

21:13

working locally uses persisted a restore that we're

21:15

just going to use for develop purposes. Fancy

21:18

database called super Light some like that exactly

21:20

Yes. Wonderful listed of it's an Axis Yeah

21:22

Hill worked really really well. And then you

21:24

say when I'm in different environment when I'm

21:27

in production swap out likes to call it

21:29

resource for a. Name your favorite Cod

21:31

were has been source and co fit

21:33

fast that data from there are only

21:35

use money I owe locally on E

21:37

as three on on pod. It's very

21:39

simple to sweaty six. Oh okay, yeah

21:41

so it looks like you build up

21:43

these assets as you call these pieces

21:45

of data. I don't care that access

21:47

is an and then you have a

21:49

nice you I that lets you go

21:52

and build those out snow workflows style

21:54

right? Yeah, exactly. This is where we

21:56

get into the wonderful world of tags

21:58

which stands for Directed A So. I

22:00

think it says for a bunch of things

22:03

that are not connected in a circle but

22:05

are connected in some way. for the cabin

22:07

a loop wake of the you never know

22:09

where to start, a way to had kiddo

22:11

you guys and but not a know that

22:13

of this is not a single element like

22:15

a path through this dataset with the beginning

22:17

and end. Then we can kind of so

22:19

tomorrow this. Connected. Graph of things and

22:21

then we know how to execute the right.

22:23

We can say well this is the first

22:25

and we have to run to test for

22:27

all dependencies. start and then we can either

22:29

branch off in parallel or we continue linearly

22:31

until everything is complete and if something breaks

22:33

in the middle we can resume from that

22:35

broken spot. Okay, excellent and is that the

22:37

recommended way? Like if I write all this

22:39

Python code that works on the pieces then

22:41

the next recommendation would be to fire up

22:43

the you I and start building or he

22:45

say eyes, really write it and code and

22:47

then you can just visualize it or or

22:49

monitor everything. Indexers written code the you I

22:52

read psychos and it interpreted as a

22:54

tag and then it displays or for

22:56

you. There are some things to do.

22:58

The violence: You can materialize assets. You

23:00

can make them run, You can do

23:02

back fills, You can view meditator. You

23:04

can sort of enable and disable schedules.

23:06

But the Core: We really do it

23:08

as a dyke show. the core declaration

23:10

of how things are done. It's always

23:12

done through code. Okay, access to we

23:14

say materialize Maybe I have a and

23:16

asset which is really a Python function

23:18

I read that goes and pulls. Down

23:20

a Csv file the materialize a be I

23:23

want to see kind of representative data and

23:25

this as in the you I and so

23:27

I could go or at and think this

23:29

is right. Let's keep passing it down that

23:31

what that means materialise really means just run

23:34

this pretty thoracic. make this asset new, again,

23:36

fresh, again rare as part of that material

23:38

is Haitian we sometimes have at meditator. You

23:40

can see this on the right if you're

23:43

looking at the screen here where we talk

23:45

about what the time Sep was that you

23:47

are our business of a graph of like

23:49

number of rows. over time all that

23:51

mattered data is the you can emit

23:53

and we emit some ourselves by default

23:55

with the framework and as you materialized

23:58

assets as you run the ass and

24:00

over again over time we capture all that

24:02

and then you can really get a nice

24:04

overview of this asset's lifetime essentially. I think

24:07

the metadata is really pretty excellent. Over

24:09

time, you can see how the data

24:11

has grown and changed. Yeah, the metadata

24:13

is really powerful and it's one of

24:15

the nice benefits of being in this

24:18

asset world because you don't really want

24:20

to metadata on this task that run.

24:22

You want to know this table that

24:24

I created, how many rows does it

24:26

have every single time it's run. That

24:28

number drops by like 50 percent. That's

24:30

a big problem. Conversely, if the runtime is

24:32

slowly increasing every single day, you might not

24:34

notice it, but over a month or two

24:36

it went from a 30-second pipeline to 30

24:38

minutes, maybe there's a great place to

24:41

start optimizing that one specific asset. Right.

24:43

What's cool is if it's just Python

24:45

code, you know how to optimize that

24:47

probably, right? Hopefully, yes. Well,

24:49

as much as you're going to... You have

24:52

all the power of Python and you

24:54

should be able to as opposed to it's

24:56

deep down inside some framework that you don't

24:58

really... Exactly. Yeah. It's Python, you can benchmark

25:00

it. You probably knew you didn't write it

25:03

that well when you first started and you can

25:05

always find ways to improve it. So

25:07

this UI is something that you can just

25:09

run locally kind of like Jupiter. 100 percent.

25:11

Just type Dijkstra dev and then you get

25:13

the full UI experience. You get to see

25:16

the runs, all your assets. Is it a

25:18

web app? It is, yeah. It's a web

25:20

app. There's a Postgres backend and then there's

25:22

a couple of services that run the web

25:24

server, the GraphQL and then the workers. Nice.

25:26

Yeah. So pretty serious web app, it sounds

25:28

like. But you

25:30

probably just run it all. Yeah. Something you

25:32

run all probably containers

25:35

or something you just fire up when you

25:37

download it, right? Locally, it doesn't even use

25:39

containers. It's just all pure Python for

25:42

that. But once you deploy, yeah, I think you

25:44

might want to go down the container route. But

25:46

it's nice not having to have Docker just to

25:48

like run a simple test deployment. Yeah, I guess

25:50

not everyone's machine has that for

25:53

sure. So question from the audience here,

25:55

Jazzy asked, does it hook into

25:57

AWS in particular? Is It compatible?

26:00

The ball with existing pipelines like ingestion

26:02

Lambda as are transformed My unless you

26:04

can look into the of us so

26:06

we have some it of your inhibitions

26:08

built in. Like I mentioned before, there's

26:10

nothing stopping you from importing boat or

26:12

three and and doing anything really you

26:14

want. So a very simple use case

26:16

like let's say you already have an

26:19

existing transformation in triggered in it Lvs

26:21

through some lambda he sips model that

26:23

with indexer and say who trigger that

26:25

lambda or three Okay and the acid

26:27

itself is really that repetition of that

26:29

pipeline. Weight off your wedding that code

26:31

within banks yourself at still occurring on the

26:34

to be a swimmer And it's a really

26:36

simple way to start adding a limit of

26:38

of livability orchestration to existing pipelines. Okay as

26:40

pretty cool because now you have this nice

26:43

you I and these meditate in Us history

26:45

but it's someone else Have club exactly Yes

26:47

now precursor to fall born from a sitting

26:49

there and over time you buy decides. you

26:52

know this in Atlanta that I had it's

26:54

are in the get out of hand. I

26:56

wanted broken apart to multiple assets by one

26:58

to serve optimizers. always dykes. Can help

27:00

you along That has now. Excellent howdy

27:03

set up Like triggers are observe ability

27:05

inside Dax Her eyes does he asked

27:07

about us? Sorry, but like in general

27:09

right? if a row is entered into

27:11

a database, something's dropped in a blob

27:13

storage or that a changes that are

27:15

no yes those requests. It's a lot

27:18

of options. Indexer: We do model every

27:20

asset with a couple little size. I

27:22

think that a really useful A Think

27:24

about what is whether the code of

27:26

that particular asset has changed me and

27:28

the other one is whether. and the

27:30

upstream of the asset has changed in those

27:33

things really power a lot of automation functionality

27:35

that we can get and stream so let's

27:37

start with are things a street samples the

27:39

eager to understand either bucket and there is

27:41

you or file the gets uploaded every day

27:43

you know what time the followed it uploaded

27:46

it or know when it'll be uploaded but

27:48

you know at some point it will be

27:50

indexer we have a thing called the sensor

27:52

which you can just an ex tuna three

27:54

location you can define how it looks into

27:56

their file or into a folder and then

27:59

you just pull every 30 seconds

28:01

until something happens. When that something

28:03

happens, that triggers an event. And

28:06

that, if I can trickle at your will

28:08

downstream to everything that depends on it, I do

28:10

connect to these things. So it gets you awake

28:12

from this, like, oh, I'm going to schedule

28:14

something to run every hour. Maybe the data

28:16

will be there, but maybe it won't. And you

28:19

can have a much more event-based workflow. When

28:21

this file runs, I want everything downstream to

28:23

know that this data has changed. And as

28:25

data flows through the systems, everything will sort of

28:27

work its way down. Yeah, I like it.

28:31

This portion of Talk Python to me is

28:33

brought to you by Posit, the makers of

28:36

Shiny, formerly RStudio, and especially,

28:38

Shiny for Python. Let

28:40

me ask you a question. Are you building

28:42

awesome things? Of course you are. You're a

28:44

developer or data scientist. That's what we do.

28:46

And you should check out Posit Connect. Posit

28:49

Connect is a way for you to

28:51

publish, share, and deploy all the data

28:53

products that you're building using Python. People

28:56

ask me the same question all the time. Michael,

28:58

I have some cool data science project or notebook

29:01

that I built. How do I

29:03

share it with my users, stakeholders, teammates?

29:05

Or I need to learn FastAPI or

29:08

Flask or maybe Vue or ReactJS? Hold

29:10

on now. Those are cool technologies, and I'm sure

29:13

you benefit from them. But maybe stay focused on

29:15

the data project. Let Posit Connect handle

29:17

that side of things. With Posit

29:19

Connect, you can rapidly and

29:21

securely deploy the things you

29:23

build in Python. Streamlet, Dash,

29:25

Shiny, Bokeh, FastAPI, Flask, Quarto,

29:28

Ports, Dashboards, and APIs. Posit

29:30

Connect supports all of them. And Posit

29:32

Connect comes with all the bells and

29:35

whistles to satisfy IT and other enterprise

29:37

requirements. Make deployment the easiest

29:39

step in your workflow with Posit

29:41

Connect. For a limited time, you

29:43

can try Posit Connect for free

29:45

for three months by going to

29:47

talkbython.fm slash posit. That's talkbython.fm slash

29:50

posit. The link is in your podcast

29:52

player show notes. Thank you

29:54

to the team at Posit for supporting TalkByThon. The sensor

29:57

comes with a link

29:59

to the webinar. And that's it for today. concept is really cool

30:01

because I'm sure that there's a ton of

30:03

cloud machines people provisioned just because

30:05

this thing runs every 15 minutes,

30:08

that runs every 30 minutes and

30:10

you add them up and in

30:12

aggregate we need eight machines just

30:14

to handle the automation rather

30:16

than – because they're hoping to catch something

30:18

without too much latency but maybe that actually

30:20

only changes once a week. Exactly. And

30:23

I think that's where we have to like sometimes

30:25

step away from the way we're so used to

30:27

thinking about things and I'm guilty of this. When

30:29

I create a data pipeline, my natural inclination is

30:31

to create a schedule where I can say, is

30:33

this a daily one? Is this weekly? Is this

30:35

monthly? But what I'm finding more and more is

30:37

when I'm creating my pipelines, I'm not adding a

30:39

schedule. I'm using DAGSAR's auto materialized

30:41

policies and I'm just telling it, you figure

30:44

it out. I don't have to think about

30:46

schedules. Just figure out when this thing should

30:48

be updated. When parents have been updated, you

30:50

run. When the data has changed, you

30:52

run. And then just like figure it out and leave

30:54

me alone. Work pretty well

30:56

for me so far. I think it's great. I have a

30:59

refresh the search index on the various

31:02

podcast pages that runs and it runs every

31:04

hour but the podcast ships weekly, right? But

31:06

I don't know which hour it is and

31:08

so it seems like that's enough latency but

31:10

it would be way better to put just

31:12

a little bit of smart like what was

31:15

the last date that anything changed? Was that

31:17

since the last time you saw it? Maybe

31:19

we'll just leave that alone. You're

31:22

starting to inspire me to go write

31:24

more code but pretty cool. All

31:26

right. So on the homepage at

31:28

dagsar.io, you've got a nice graphic

31:30

that shows you both how to write

31:33

the code, like some examples of the

31:35

code as well as how that looks

31:37

in the UI. And one of them

31:39

says to launch backfills. What is this

31:42

backfill thing? Oh, this is my favorite

31:44

thing. Okay. So when you

31:46

first start your data journey as a data

31:48

engineer, you sort of have a

31:50

pipeline and you build it and it just

31:52

runs on a schedule and that's fine. What

31:54

you soon find is you might have to

31:56

go back in time. You might say, I've

31:59

got this. data set that updates monthly.

32:01

Here's a great example, AWS cost

32:04

reporting, right? AWS will send

32:06

you some data around, you know, all your

32:08

instances and your S3 bucket, all that. And

32:10

it'll update that data every day or every

32:12

month or whatever have you. Due to some

32:14

reason, you got to go back in time

32:16

and refresh data that AWS updated due to

32:18

some like discrepancy. Backsell is sort of how

32:20

you do that. And it worked hand in

32:23

hand with this idea of a partition. A

32:25

partition is sort of how your data is

32:27

naturally organized. And it's like a nice way

32:29

to represent that natural organization. It has nothing

32:31

to do with like the fundamental way how

32:33

often you want to run it. It's more

32:35

around like, I've got a data set that

32:38

comes in once a month is represented monthly,

32:40

it might be updated daily, but the representation

32:42

of the data is monthly. So I will

32:44

partition it by month. It doesn't have to

32:47

be dates. It could be strings, it could

32:49

be a list, you could have a partition

32:51

for every company, or every client, or you

32:53

know, every domain you have, whatever you sort

32:56

of think is a natural way to think

32:58

about breaking apart that pipeline. And

33:00

once you do that partition, you can do

33:02

these nice things called backfills, which says, instead

33:04

of running this entire pipeline on all my

33:06

data, I want you to pick that one

33:08

month where your data went wrong, or that

33:10

one month where data was missing, and just

33:12

run the partition on that range. And so

33:15

you limit compute, you save resources and get

33:17

a little bit more efficient. It's just easier

33:19

to like, think about your pipelines because you've

33:21

got this natural built in partitioning Excellent.

33:24

So maybe you missed some

33:26

important event, maybe your automation went down

33:28

for a little bit came back up, you're

33:30

like, Oh, no, we've we've missed it. Right.

33:33

But you want to start over, For

33:35

years. So Maybe we could just go and

33:37

run the last day to worth of exactly.

33:40

Okay. Another One would be your vendor says,

33:42

Hey, by the way, we actually screwed up.

33:44

We Uploaded this file from two months ago,

33:46

but the numbers were all wrong. Yeah, we've

33:49

uploaded a new version to that destination. Can

33:51

You update your data set? One Way is

33:53

to recompute the entire universe from scratch. But

33:55

If you've partitioned things, and you can say

33:58

no limit that to just miss one particular

34:00

then for that month and that what British

34:02

you can trickle down always all your other

34:04

assets that depend on that, we do have

34:06

to free decide the have to think about

34:09

this partitioning beforehand or can you do it

34:11

retroactively he could are effectively and I have

34:13

done that before as well. It really depends

34:15

on on where you're at. I think it's

34:18

your first as ever probably and bother partitions,

34:20

but he really is a lotta work to

34:22

get them to get them started. Okay, yeah,

34:24

really nice. I like a lot of the

34:26

ideas here are like that. It's got this

34:29

visual component. That you can and

34:31

see what's going on inspected.c can debug runs

34:33

or what happens there like obviously when your

34:35

polling data from many different sources maybe it's

34:38

not your data your taken in fields could

34:40

vanish can be the wrong type system singer

34:42

down I'm sure sure they do working as

34:44

interesting so what's it looks a little bit

34:47

I know I go web browser to bug

34:49

dev tools thing to. For the record my

34:51

code never fails. I've never had a bug

34:53

in my life before. the let you have

34:56

this is yeah I bought mine. doesn't he

34:58

only do it to make a. And

35:00

example and from Miami of our other yes

35:02

it's I do it's intention of course yet

35:05

a humble myself a little bit as exactly

35:07

is the first few is ice one of

35:09

my favorite I've been some is a bit

35:11

views but this is it's actually really fun

35:13

to watch y sus of the runway you

35:16

execute this by the but really like was

35:18

go back to you know what the world

35:20

before or procedures we use Crime rate we

35:22

have a basket that would do something and

35:24

we have a cronje up it said make

35:27

sure this thing runs and then hopefully it

35:29

was successful but sometimes. it was it and

35:31

it's a sometimes it was it that's always

35:33

been the problem right it's like well what

35:35

were you know i don't know why it

35:37

failed i was when there's sale know what

35:39

a what point of a subset of hell

35:42

that's really hard to do with this the

35:44

bugger really is is is a structured lot

35:46

of every step that's been going on through

35:48

your i find right to and this view

35:50

there's three assets and been kind to see

35:52

here when it's called users when it's hot

35:54

orders and one is to run tbt the

35:57

presumably there is to you know tables that

35:59

are being updated and then dbt job it

36:01

looks like this being updated at the very

36:03

end. Once you execute this pipeline, all the

36:05

logs are captured from each of those assets.

36:07

So you can manually write your own logs,

36:09

you have access to a Python logger, and

36:11

you can use your info, your error, whatever

36:14

have you, and log output that way, and

36:16

it'll be captured in a structured way. But

36:18

it also captured logs from

36:20

your integrations. So using dbt, we capture

36:22

those logs as well, you can see

36:24

it processing every single asset. So if

36:26

anything does go wrong, you can filter

36:28

down and understand at what step,

36:31

at what point, does something go wrong.

36:33

That's awesome. And just the historical

36:36

aspect, could just go in through logs,

36:38

especially multiple systems can be really, really

36:40

tricky to figure out what's the problem,

36:42

what actually caused this to go wrong,

36:44

but come back and say, Oh, it

36:46

crashed, pull up the UI and see,

36:48

all right, well, show me, show

36:51

me what this run did, and show me what this job did.

36:53

And it seems like it's a lot easier to debug than your

36:56

standard web API or something like that. Exactly. You can

36:58

click on to any of these assets that get metadata

37:00

that we had earlier as well. If you know,

37:02

one step failed, and it's kind of flaky, flaky,

37:04

you can just click on that one step and

37:06

say just rerun this, everything else is fine, we

37:09

don't need to restart from scratch. Okay, and it'll

37:11

keep like the data from before.

37:13

So you don't have to rerun that. Yeah,

37:15

I mean, it depends on how you built

37:17

the pipeline. We like to build item potent

37:19

pipelines is how we sort of talk about

37:21

it, data engineering landscape, right? So you should

37:23

be able to run something multiple times and

37:26

not break anything in a perfect world. That's

37:28

not always possible. But ideally, yes. And

37:30

so we can presume that if users completed

37:32

successfully, then we don't have to run that

37:34

again, because that data was persisted, you know,

37:36

database s3 somewhere. And if orders was

37:39

the one that was broken, we can just only

37:41

run orders and not have to worry about rewriting

37:43

the whole thing from scratch. Excellent. So

37:46

item potent for people who maybe don't know,

37:48

you run it once or you perform the operation once

37:50

or you perform it 20 times, same

37:53

outcome should have side effects,

37:55

right? That's the idea. Yeah,

37:57

that's the idea. We Use your stuff. It sure

37:59

is. I'm at a D V that

38:01

I'm very hard but the more you

38:03

can build path i sat way the

38:05

easier your life becomes immediately. Generally not

38:07

always for generally true for programming as

38:09

well Are a few doctor functional programming

38:12

people they'll say like it's an absolute

38:14

but yes a personal programmers love love

38:16

is kind of stuff and it as

38:18

it does lend itself for the wealthy

38:20

data pipelines. If I find on like

38:22

maybe some of the suffering during stuff

38:24

it's a little bit different in that

38:26

the data changing is what causes often

38:28

most of the headaches rate. Is less

38:30

so the actual code you right but

38:32

more this citizens and to change so

38:34

frequently and so often in new and

38:37

novel an interesting way that you would

38:39

often never expect. And so the more

38:41

you can sort of make that function

38:43

so pure that you can provide any

38:45

sort of dataset and really tests you'll

38:47

easily these expectations when they at quicken

38:49

the easier it is to serve depot

38:51

these things and below them in the

38:53

future yeah and cast them as well

38:56

yes of Asus Via. So speaking of

38:58

that kind of stuff like what's the.

39:00

Scale Ability story. I've got some

39:02

big huge complicated data pipeline. can

39:05

I parallelism and have them run

39:07

multiple pieces like the of the

39:09

different branches are some like that

39:11

Excesses to. That's one of the

39:13

key benefits I think in reading

39:15

your assets in this dag way

39:17

Race: Anything that is paralyzed, people

39:19

will be paralyzed. Know some of

39:21

the limits on that. For the

39:23

too much prohibition is bad. you

39:25

poor little database can handle it.

39:28

He can say media concurrency limit.

39:30

on this one just for today is worth

39:32

putting or something and eighty eye for an

39:34

external bend her they might not appreciate ten

39:36

thousand requests the second on that was to

39:38

maybe our he was slow down but another

39:40

rate limiting right you can run into to

39:43

me a class and than their than your

39:45

stuff crashes than investor as they i can

39:47

be all thing creole davis nebula concerned spots

39:49

have returned the world is is simple anything

39:51

that can be paralyzed will be a through

39:53

dexter and that's really the benefit of reading

39:55

these tags is it is a nice algorithm

39:57

for determining whether she looks like now I

40:00

guess if you have a diamond shape or any sort

40:02

of splits, those two things now

40:04

become just ascyclical. They can't turn around

40:06

and then eventually depend on each other

40:08

again. So that's a perfect chance to

40:10

just go fork it out. Exactly. And

40:12

that's been where partitions are also kind

40:14

of interesting. If you have a partitioned

40:16

asset, you could take your data set,

40:18

partition it into five buckets, and run

40:20

all five partitions at once, knowing full

40:22

well that because you've written this in

40:24

a idempotent and partitioned way, that the

40:26

first pipeline will only operate on Apple

40:29

and the second one only operates on bananas. And

40:32

there is no commingling of apples and bananas anywhere

40:34

in the pipeline. Oh, that's interesting.

40:36

I hadn't really thought about using the partitions for

40:38

parallelism, but of course. Yeah. It's

40:41

a fun little way to break things apart. So

40:43

if we run this on the Daxter cloud

40:46

or even on our own, this is pretty

40:48

much automatic. We don't have to do anything.

40:50

Like Daxter just looks at it and says,

40:52

this looks parallelizable, and it will go. That's

40:55

right. Yeah. As long as you've got the

40:57

full deployment, whether it's OSS or cloud, Daxter

40:59

will basically parallelize it for you, which

41:01

is possible. Excellent. You can set global

41:03

currency limits. So you might say, 64

41:05

is more than enough parallelization

41:08

that I need. Or maybe I want

41:10

less because I'm worried about overloading systems,

41:12

but it's really up to you. Yeah.

41:14

I'm putting this on a $10 server.

41:17

Please don't kill me. Just

41:19

respect that it's somewhat wimpy, but that's OK. Yeah.

41:21

But it'll get the job done. It'll get the

41:23

job done. All right. I want to talk about

41:25

some of the tools and some of the tools

41:27

that are maybe at play here when working with

41:29

Daxter and some of the trends and stuff. But

41:31

before that, it maybe speaks to where

41:34

you could see people adopt a tool

41:36

like Daxter, but they generally don't.

41:38

They don't realize, like, oh, actually, there's

41:40

a whole framework for this. I

41:43

could, sure, I could go and

41:45

build just on HTTP server and

41:48

hook into the request and start writing to it. But

41:50

maybe I should use fast or fast API. There's

41:53

these frameworks that we really

41:55

naturally adopt for certain situations

41:57

like APIs and others.

42:00

background jobs, data pipelines, where I think there's

42:02

probably a good chunk of people who could

42:04

benefit from stuff like this, but they just

42:06

don't think they need a framework for it.

42:09

Like, cron is enough. Yeah, it's funny because sometimes

42:11

cron is enough. I don't want

42:13

to encourage people not to use cron, but

42:16

think twice, at least, is what I would

42:18

say. So probably the first

42:20

trigger for me of thinking of, you know, is

42:22

that actually a good choice is like, am I

42:24

trying to ingest data from somewhere? That's

42:27

something that fails. Like, I think we just can accept

42:29

that, you know, if you're moving data around, the

42:31

data source will break, the expectations will

42:33

change, you'll need to debug it, you'll

42:35

need to run it, and doing that

42:37

in cron is a nightmare. So I

42:39

would say definitely start to think about

42:41

an orchestration system if you're ingesting data.

42:44

If you have a simple cron job that sends one

42:46

email, like, you're probably fine. I don't think you need

42:48

to implement all of the tags just to do that.

42:51

But the more closer you get

42:53

to data pipelining, I think the

42:55

better your life will be if

42:57

you are not trying to debug

42:59

a obtuse process that no one really

43:02

understands six months from now. Excellent.

43:04

All right, maybe we could touch on some

43:07

of the tools that are interesting. I see

43:09

people using, you talked about DuckDB and DBT,

43:11

a lot of Ds starting here, but give

43:14

us a sense of like some of the

43:16

supporting tools you see a lot of folks

43:18

using that are interesting. Yeah, for sure. I

43:20

think in the data space, probably DBT is

43:23

one of the most popular choices. And

43:26

DBT, in many ways, it's nothing more

43:28

than a command line tool that

43:31

runs a bunch of SQL in a

43:33

DAG as well. So there's actually a

43:35

nice fit with Dagster and dbt together.

43:37

DBT is really used by people who

43:39

are trying to model that business process

43:42

using SQL against typically a

43:44

data warehouse. So if you

43:46

have your data in, for

43:48

example, Postgres, Snowflake, Databricks,

43:50

Microsoft SQL, these types of

43:52

data warehouses, generally, you're

43:54

trying to model some type of

43:56

business process. And typically, people use

43:58

SQL to do that. Now you can

44:01

do this without dbt, but dbt has

44:03

provided a nice, clean interface for doing so.

44:06

It makes it very easy to connect these models

44:08

together, to run them, and to have a development workflow

44:10

that works really well, and then you can push

44:12

it to prod and have things run again in

44:14

production. So that's dbt. We

44:17

find it works really well, and a lot of

44:19

our customers are actually using dbt as well.
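For context, wiring dbt into Dagster usually goes through the dagster-dbt package, roughly like the sketch below; the project path and function name are placeholders for a real dbt project.

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# Path to the manifest that dbt generates for the project;
# "my_dbt_project" is a made-up project name.
DBT_MANIFEST = Path("my_dbt_project/target/manifest.json")


# Every dbt model in the manifest shows up as an asset in Dagster's DAG.
@dbt_assets(manifest=DBT_MANIFEST)
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream its events back as asset materializations.
    yield from dbt.cli(["build"], context=context).stream()
```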

44:21

There's DuckDB, which is great.

44:24

It's like the SQLite for

44:26

columnar databases, right? Yeah, it's in

44:28

process, it's fast, it's written by

44:30

the Dutch or something. What's not to like about

44:32

it? It's free. We love that. It feels very

44:34

comfortable in Python itself. So

44:37

easy. Yes, exactly. The Dutch

44:39

have given us so much and

44:41

they've asked nothing of us, so I'm

44:44

always very thankful for them. It's fast.

44:46

It's so fast. It's like, if

44:48

you've ever used pandas for processing large

44:50

volumes of data, you will occasionally hit

44:53

memory limits or inefficiencies in doing

44:55

these large aggregates. I won't go

44:57

into all the reasons why that is, but DuckDB sort

45:00

of changes that, because it's a fast,

45:02

serverless, C++-written tool

45:05

to do really fast vectorized work, and

45:07

by that I mean, like, it works

45:09

on columns. Typically, in

45:11

something like SQLite, you're doing transactions.

45:13

You're doing single-row updates, writes,

45:16

inserts, and SQLite is great at

45:18

that. Where typical transactional databases fail,

45:20

or aren't as powerful, is

45:22

when you do aggregates, when you're looking at

45:24

an entire column, right? Just the way they're

45:26

architected. If you want to know the average

45:28

or the median, the sum of some

45:31

large number of columns, and you want to group that by

45:33

a whole bunch of things, you want

45:35

to know the first date someone did something

45:37

and the last one, those types of vectorized

45:39

operations DuckDB is really, really fast at

45:41

doing, and it's a great alternative

45:44

to, for example, pandas, which can

45:46

often hit memory limits and be

45:48

a little bit slow in that

45:50

regard. Yeah, it looks like you

45:52

have some pretty cool aspects: transactions,

45:54

of course, but it also says

45:56

direct Parquet, CSV, and JSON querying.

45:58

So if you've got a CSV

46:00

file hanging around and you wanna ask questions

46:03

about it, or JSON or some of the

46:05

data science stuff through Parquet. Turn

46:07

a proper indexed query engine against it.

46:09

Don't just use a dictionary or something,

46:11

right? Yeah, it's great for reading

46:13

a CSV, zip files, tar

46:16

files, Parquet, partitioned Parquet files, all

46:18

that stuff that usually was really

46:20

annoying to do and operate on.
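A small sketch of that direct-file querying with DuckDB; the file name and columns are invented for illustration.

```python
import duckdb

# An in-memory DuckDB database; no server to run or schema to load first.
con = duckdb.connect()

# Query the CSV file in place; DuckDB infers the schema on the fly.
result = con.execute(
    """
    SELECT species,
           COUNT(*)    AS sightings,
           AVG(mass_g) AS avg_mass
    FROM 'birds.csv'
    GROUP BY species
    ORDER BY sightings DESC
    """
).df()  # hand the result back as a pandas DataFrame

print(result.head())
```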

46:22

You can now install DuckDB. It's

46:24

got a great CLI too. So

46:26

before you go and program your

46:28

entire pipeline, you just run duckdb

46:30

and you start writing SQL against CSV files

46:32

and all this stuff to really understand your

46:35

data and just really see how quick it

46:37

is. I used it on a bird dataset

46:39

that I had as an example project and

46:41

there was millions of rows and

46:43

I was joining them together and doing massive group

46:45

bys, and it was done in seconds. And it

46:48

was just hard for me to believe that it

46:50

was even correct, because it was so quick. So

46:52

it is wonderful. I must have done that

46:54

wrong somehow. Because it's

46:56

done, and it shouldn't be done yet. Yeah. The

46:58

fact it's in process means there's not

47:01

a server for you to

47:03

babysit, patch, make sure it's still running.

47:05

It's accessible but not too accessible, all

47:07

that, right? It's a pip install

47:09

away, which is always, we love that,

47:12

right? Yeah, absolutely. You mentioned, I guess

47:14

I mentioned Parquet, but also Apache Arrow seems

47:16

like it's making its way into a lot

47:18

of different tools and sort

47:21

of foundational, sort of high-performance,

47:23

in-memory processing. Have you

47:25

used this at all? I've used it, especially through

47:28

working through different languages. So moving

47:30

data between Python and R is where I

47:33

last used this. I think Arrow's

47:35

great at that. I believe Arrow is underneath

47:37

some of the Rust-to-Python tooling

47:39

as well. It's working there.

47:41

So typically I don't use Arrow directly

47:43

myself, but it's in many of the

47:46

tooling I use.
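As a tiny sketch of Arrow acting as that shared in-memory format, here the same table moves between pyarrow, Polars, and pandas; the toy data is invented.

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Build an Arrow table; this columnar buffer is the common currency.
table = pa.table({"species": ["robin", "crow"], "mass_g": [77.0, 450.0]})

pl_df = pl.from_arrow(table)  # Polars wraps the Arrow data directly
pd_df = table.to_pandas()     # pandas consumes the same columnar buffers

print(pl_df.shape, pd_df.shape)
```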

47:48

All right. So, great product, and so much of the ecosystem is now

47:50

built on Arrow. Yeah, I think a lot of it,

47:52

I feel like the first time I heard about it

47:55

was through Polars. I'm

47:57

pretty sure, which is another Rust story,

48:00

kind of like pandas, but with

48:02

a little bit more fluent, lazy API. Yes.

48:04

We live in such great times, to be

48:06

honest. So, Polars is Python

48:08

bindings for Rust, I believe is kind of

48:11

how I think about it. It does all

48:13

the transformation in Rust, but you have this

48:15

Python interface to it and it

48:17

makes things again, incredibly fast. I

48:19

would say similar in speed to

48:21

DuckDB. They both are quite comparable

48:23

sometimes. Yeah, it also claims

48:26

to have vectorized and columnar processing and all

48:28

that kind of stuff. Yeah, it's pretty incredible.

48:30

So, not a drop-in replacement for pandas, but

48:32

if you have the opportunity to use it

48:34

and you don't need to use the full

48:36

breadth of what pandas offers, because pandas is

48:38

quite a huge package. There's a lot it

48:40

does. But if you're just using simple transforms,

48:42

I think polars is a great option to

48:44

explore. Now, I talked to Ritchie Vink,

48:47

who was part of that. And I think

48:49

they explicitly chose to not try to make

48:51

it a drop-in replacement for pandas, but try

48:54

to choose an API that would allow the

48:56

engine to be smarter. I see you're asking

48:58

for this, but the step before you

49:00

wanted this other thing. So, let me do

49:03

that transformation all in one shot. A little

49:05

bit like a query optimization engine.
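A short sketch of that lazy, optimizer-friendly style in recent Polars versions; the file and column names are illustrative.

```python
import polars as pl

# Build up a lazy query; nothing is read or computed yet.
lazy = (
    pl.scan_csv("birds.csv")
    .filter(pl.col("mass_g") > 100)
    .group_by("species")
    .agg(pl.col("mass_g").mean().alias("avg_mass"))
)

# collect() lets the query optimizer fuse the steps and run them in one shot.
df = lazy.collect()
print(df)
```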

49:07

What else is out there? We've got time for just

49:09

a couple more. If there's anything there, like,

49:12

oh yeah, people use this all the time.

49:14

Obviously, the databases, you've said, Postgres, Snowflake, etc.

49:16

Yeah, there's so much. So, another little one

49:19

I like is called dlt, from dltHub. It's

49:21

getting a lot of traction as well. And

49:23

what I like about it is how lightweight

49:25

it is. I'm such a big fan of

49:28

lightweight tooling that's not a massive framework. Loading data is,

49:30

I think, still kind of yucky in many

49:32

ways. It's not fun. And dlt makes it

49:34

a little bit simpler and easier to do

49:36

so. So, that's what I would recommend people

49:38

just look into if you've got to

49:43

ingest data from some API,

49:43

some website, some CSV file. It's

49:45

a great way to do that.
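A minimal sketch of what loading with dlt can look like; the endpoint URL, pipeline name, and table name are placeholders, and DuckDB is just one possible destination.

```python
import dlt
import requests

# Declare where the data should land; dlt handles schema inference and state.
pipeline = dlt.pipeline(
    pipeline_name="bird_sightings",
    destination="duckdb",
    dataset_name="raw",
)

# Pull some JSON from an API (made-up URL) and load it as a table.
data = requests.get("https://example.com/api/sightings", timeout=30).json()
load_info = pipeline.run(data, table_name="sightings")
print(load_info)
```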

49:47

It claims it's the Python library

49:49

for data teams loading data into

49:51

unexpected places. Very interesting. Yes, that's

49:53

great. Yeah, this looks cool. All

49:55

right. Well, I

49:58

guess maybe let's talk about this before we

50:00

get to what's next. I'm always

50:02

fascinated. I think there's

50:04

starting to be a bit of a

50:06

blueprint for this, by companies that take

50:08

a thing, they make it and

50:10

give it away, and have a company around

50:12

it. And congratulations to you all for

50:14

doing that. A lot of

50:16

it seems to kind of center around

50:18

the open core model, which I don't

50:20

know if that's exactly how you would

50:23

characterize yourselves. Tell me a bit about

50:25

the business side. I notice many successful

50:27

open source projects don't necessarily result

50:29

in full-time jobs or companies for the people

50:31

behind them. Yeah, it really is

50:33

hard. I don't think it's one

50:35

that anyone has truly figured out. Well,

50:37

I can't say this is the way forward

50:39

for everyone, but it is something we're trying

50:41

to figure out for Dagster, and it's working

50:43

pretty well. What I think is really

50:45

powerful about it is that the open

50:47

source project is really, really good, and it

50:49

hasn't really been limited in many

50:51

ways in order to drive the cloud product.

50:54

We genuinely believe that there's actual

50:56

value in that separation. There

50:58

are some things that we just can't do

51:00

in the open source platform, for example,

51:02

running pipelines on our cloud that involve, you

51:04

know, ingesting data into hosted systems and

51:06

the rest of the access

51:08

that requires on your own systems. But

51:11

Dagster, for the most part, the vast majority of the

51:13

source, is open. We really believe

51:16

that getting it into the hands of

51:18

others is the best way to prove

51:20

the value of it, and if we can

51:22

build a business on top of that, we're

51:24

super happy to do so. It's nice

51:26

that we get to try both sides

51:28

of it. To me, that's one of the

51:30

amazing parts. A lot of

51:32

the development that we do, and that's open

51:34

source, is driven by people who are paid

51:36

through, you know, what happens on the cloud,

51:38

and I think, from what I can tell,

51:40

there is still no better way to build robust

51:43

software than to have people who

51:45

are specifically paid to develop a product. Otherwise,

51:47

it can be a labor of love, but one that

51:49

doesn't last for very long. And whenever I

51:51

think about building software, there's the eighty percent of

51:53

it that's super exciting and fun, and then

51:55

there's that little sliver of really fine

51:57

polish that, if it's not just your job

51:59

to make that thing polished, you're just, for

52:01

the most part, just not going to polish

52:03

that bit, right? Good stuff.

52:06

UI, design, support. There's all these

52:08

things that go into making software

52:10

really extraordinary. That's really, really tough

52:12

to do. And I think

52:14

I really like the open source business model.

52:16

I think for me, being able to just

52:19

try something, not having to talk to sales

52:21

and being able to just deploy locally and

52:23

test it out and see if this works.

52:25

And if I choose to do so, deploy

52:27

it in production, or if I bought the

52:29

cloud product and I don't like the direction it's

52:31

going, I can leave and go open source as

52:33

well. That's pretty compelling to me. Yeah, for sure

52:35

it is. And I

52:37

think the more moving pieces of infrastructure,

52:39

the more uptime you want and all

52:42

those types of things, the more somebody

52:44

who's maybe a programmer, but not a

52:46

DevOps infrastructure person, but needs to have

52:48

it there, right? Like that's an opportunity

52:50

as well, right? For you to say,

52:52

look, you can write the code. We

52:55

made it cool for you to write the code, but

52:57

you don't have to get notified when the server's down

52:59

or whatever. We'll just take care of that for

53:01

you. That's pretty awesome. Yeah, and it's efficient through

53:03

scale as well, right? We've learned the

53:06

same mistakes over and over again, so you don't have

53:08

to, which is nice. I don't know how many people

53:10

want to maintain servers, but people do, and they're

53:12

more than welcome to if that's how they choose to

53:14

do so. Yeah, for sure. All

53:16

right, just about out of time. Let's close

53:19

up our conversation with where are

53:21

things going for Dagster? What's

53:23

on the roadmap? What are you excited about? Oh,

53:25

that's a good one. I think we've actually published

53:28

our roadmap online somewhere; if you search Dagster

53:30

roadmap. It's probably out there. I think for the

53:32

most part, that hasn't changed much going into 2024,

53:34

though we may update it.

53:37

There it is. We're really just doubling down on

53:40

what we've built already. I think there's a lot

53:42

of work we can do on the product itself

53:44

to make it easier to use, easier to understand.

53:46

Dagster specifically is really focused around the education piece.

53:49

We launched Dagster University's first module,

53:51

which helps you really understand the

53:53

core concepts around Dagster. Our next

53:55

module is coming up in a couple months, and

53:58

that'll be around using Dagster with dbt, which

54:00

is our most popular integration. We're building out more

54:02

integrations as well. So I built

54:04

a little integration called Embedded ELT that makes

54:06

it easy to ingest data. But I want

54:08

to actually build an integration with dlt

54:11

as well, dltHub. So we'll be doing

54:13

that. And there's more

54:15

coming down the pipe, but I don't know how much I can say.

54:17

Look forward to an event in April

54:20

where we'll have a launch event on

54:22

all that's coming. Nice. Is it an online

54:24

thing people can attend or something like

54:26

that? Yeah, there'll be some announcements there

54:28

on the Dagster website on that. Maybe

54:30

I will call out one thing that's actually

54:33

really fun. It's called Dagster Open Platform. It's

54:35

a GitHub repo that we launched a couple

54:37

months ago, I want to say. We

54:39

took our internal, I should go back

54:42

one more. Sorry. It's, like, GitHub, Dagster

54:44

Open Platform, on GitHub. I

54:46

have it somewhere. Yeah. It's

54:49

up here in another organization. Yes,

54:51

it should be somewhere here. There

54:53

it is. Dagster Open Platform on

54:55

GitHub. And it's really a clone

54:57

of our production pipelines. For the

54:59

most part, there's some things we've chosen to

55:01

ignore because they're sensitive. But as much as

55:04

possible, we've defaulted to making it public and

55:06

open. And the whole reason behind this was

55:08

because as data engineers, it's often hard to

55:10

see how other data engineers write code. We

55:12

get to see how software engineers write code

55:14

quite often, but most people don't want to

55:16

share their platforms for various

55:18

good reasons. They're also usually

55:20

smaller teams or maybe just

55:22

one person. And then those

55:24

pipelines are so integrated into

55:27

your specific infrastructure.

55:30

It's not like, well, here's a web framework to

55:32

share. Here's how we integrate into that one weird

55:34

API that we have that no one else has.

55:36

There is no point in publishing it to you.

55:38

That's typically how it goes. Or they're so large

55:40

that they're afraid that there's some important situation that

55:42

they just don't want to take the risk on.

55:44

And then we built something that's in the middle

55:46

where we've taken as much as we can and

55:48

we publicized it. And you can't run this on

55:50

your own. That's not the point. The point is

55:52

to look at the code and see how does

55:54

Dagster use Dagster, and what does that look like?

55:56

Nice. Okay. All right. Well, I'll put a link

55:58

to that in the show notes, and people can

56:00

check it out. Yeah, I guess let's

56:03

wrap it up with the final call to action.

56:05

People are interested in Dagster. How do they

56:07

get started? What do you tell them? Oh

56:10

yeah. dagster.io is probably the greatest place to

56:12

start. You can try the cloud product. We

56:14

have free self-serve or you can try the

56:16

local install as well. If you

56:18

get stuck, a great place to join is our Slack

56:20

channel, which is up on our website. There's even a

56:23

Ask AI channel where you can just talk

56:25

to a Slack bot that's been trained on

56:27

all our GitHub issues and discussions. Surprisingly

56:30

good at walking you through any debugging, any issues

56:32

or even advice. That's pretty excellent actually. Yeah, it's

56:34

real fun. It's really fun. And it tends

56:36

to work. We're also there in the community where

56:39

you can just chat to us as well. Cool.

56:42

All right. Pedram, thank you for being on the show. Thanks

56:45

for all the work on Dagster and sharing it with us. Thank

56:47

you Michael. You bet. See you later.

56:49

This has been another episode of Talk Python to Me. Thank

56:52

you to our sponsors. Be sure to check out what

56:54

they're offering. It really helps support the show. This

56:58

episode is sponsored by

57:00

Posit Connect from the

57:02

makers of Shiny. Publish,

57:04

share and deploy all

57:06

of your data projects

57:08

that you're creating using

57:10

Python. Streamlit, Shiny, Bokeh, FastAPI, Flask, Quarto. Reports, dashboards, and APIs.

57:13

Posit Connect supports all of them. Try

57:15

Posit Connect for free by going to

57:17

talkpython.fm slash posit.

57:19

P-O-S-I-T. Want

57:22

to level up your Python? We have one of

57:24

the largest catalogs of Python video courses over at

57:27

Talk Python. Our content ranges from

57:29

true beginners to deeply advanced topics like

57:31

memory and async. And best of all,

57:33

there's not a subscription in sight. Check

57:35

it out for yourself at training.talkpython.fm. Be

57:39

sure to subscribe to the show, open your favorite

57:41

podcast app, and search for Python. We should be

57:44

right at the top. You can also

57:46

find the iTunes feed at slash iTunes,

57:48

the Google Play feed at slash Play,

57:50

and the Direct RSS feed at

57:52

slash RSS on talkpython.fm. We're

57:54

live streaming most of our recordings these days. If you

57:56

want to be part of the show and have your

57:59

comments featured on the... Be sure

58:01

to subscribe to our YouTube channel

58:03

at talkpython.fm slash YouTube. This

58:05

is your host Michael Kennedy. Thanks so much for

58:07

listening. I really appreciate it. Now get out there

58:09

and write some Python code. Thanks

58:30

for watching.
