Podchaser Logo
Home
Data Science Panel at PyCon 2024

Data Science Panel at PyCon 2024

Released Thursday, 20th June 2024
Good episode? Give it some love!
Data Science Panel at PyCon 2024

Data Science Panel at PyCon 2024

Data Science Panel at PyCon 2024

Data Science Panel at PyCon 2024

Thursday, 20th June 2024
Good episode? Give it some love!
Rate Episode

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

Use Ctrl + F to search

0:00

I have a special episode for you this time around. We're

0:02

coming to you live from PyCon 2024.

0:06

I had the chance to sit

0:08

down with some amazing people from

0:10

the data science side of things,

0:12

Jordy Burchill, Maria Jose, Molina Contreras,

0:14

and Jessica Green. We cover a

0:16

whole set of recent topics from a data

0:18

science perspective, though we did have to cut

0:20

the conversation a bit short as they were

0:22

coming from and going to talks they were

0:24

all giving, but it's still a pretty deep

0:26

conversation. I know you'll enjoy it. This

0:29

is TalkPython.me, episode 467 recorded on

0:32

location in Pittsburgh on May 18th,

0:34

2024. Are you

0:36

ready for your host, please? You're

0:39

listening to Michael Kennedy on TalkPython to

0:42

me, live from Portland, Oregon,

0:44

and this segment was made with Python. Welcome

0:50

to TalkPython to me, a weekly

0:52

podcast on Python. This is your

0:54

host, Michael Kennedy. Follow me on

0:56

Mastodon, where I'm at M Kennedy

0:58

and follow the podcast using at

1:00

TalkPython, both on boston.org. Keep

1:02

up with the show and listen to

1:04

over seven years of past episodes at

1:06

TalkPython.fm. We've started streaming most

1:09

of our episodes live on YouTube.

1:11

Subscribe to our YouTube channel over at

1:13

TalkPython.fm slash YouTube to get notified about

1:15

upcoming shows and be part of that

1:18

episode. This episode is

1:20

brought to you by Sentry. Don't let

1:22

those errors go unnoticed. Sentry like we

1:24

do here at TalkPython. Sign up at

1:27

TalkPython.fm slash Sentry. And it's

1:29

brought to you by Code Comments, an

1:31

original podcast from Red Hat. This podcast

1:33

covers stories from technologists who've been through

1:36

tough tech transitions and share

1:38

how their teams survive the journey.

1:41

Episodes are available everywhere you listen

1:43

to your podcasts and at TalkPython.fm

1:45

slash code dash comments. Hello

1:48

from PyCon. Hello, Jessica. Jodie, Maria, welcome

1:50

to TalkPython to me. It's awesome to

1:52

have you all here and I'm looking

1:54

forward to talking about data science, some

1:57

fun LLM questions, maybe some

1:59

controversial questions. some data science

2:01

tools, all sorts of good things. Of course, before

2:03

we get to that, Jodie, you've

2:05

been on the show a time or two and

2:08

people may know you, but maybe not. So how

2:10

about a quick introduction, what you all are into?

2:12

Maria, you wanna start? Oh, okay. Well,

2:15

my name is Maria. I am

2:17

originally from Barcelona, but I am

2:19

based in Berlin. I work as

2:21

a data scientist in a small

2:24

startup that will try to

2:27

solve some sustainability problems.

2:29

And yeah, that is new. Excellent. So

2:32

my name's Jodie and I am a data

2:34

science developer advocate. Been working in data science

2:36

for about eight years. And yeah, I'm

2:38

currently working at JetBrains as you can see from the shirt. And

2:41

in the background. And the background. And

2:44

so I say my interest at the

2:46

moment is natural language processing because

2:48

I worked in that a big chunk of

2:50

my career, but the core statistics will always

2:52

be my love. So tabular data, I'm there

2:55

for you always. Beautiful. Yeah,

2:57

my name is Jessica. So I'm an ML

2:59

engineer at KOSA, which is the search engine

3:02

for a better planet. I

3:04

am actually a career changer. So I used to

3:06

roast coffee for a living and I really just

3:09

got into this field in the last six years.

3:11

So I don't have like any formal

3:13

training. I'm a community slash self-taught engineer.

3:16

And I went through more of a

3:18

like a backend focused path. And

3:20

now I've started to work in the ML realm.

3:22

So really exciting. Yeah, very, very

3:25

interesting. Another thing I absolutely love

3:27

is coffee. Oh

3:29

my gosh. I

3:31

think we're running on it at PyCon. Pretty much

3:33

we are. Yeah, we're getting farther

3:35

into the show and more coffee

3:37

is needed. But I do want to

3:39

ask you, you know, what do you

3:41

think about being in the data science space?

3:43

That's a really different world that interacting with

3:46

people all day and working with your hands

3:48

more or whatever. Yeah, like how

3:50

has it been with this switch? There is

3:52

a lot of synergies actually. When you're still behind

3:55

the espresso machine and you're getting all the orders

3:57

in and you need to like problem solve. Right.

4:00

get everyone their correct order to the way

4:02

that they like it. So

4:04

there was a lot of transferable skills, I will say.

4:07

But I think what I found really powerful,

4:10

especially maybe learning at this

4:12

specific period of time, is

4:14

how accessible a lot of the tools

4:16

are today. So I won't say easy

4:18

because I put a lot of hard

4:20

work into it, but how possible it

4:22

is, even with a background like mine

4:24

to get into the field. Awesome.

4:26

I switched. I didn't have a formal

4:29

education either. I took two computer college

4:31

courses just because they match, I need

4:33

it for something else. I

4:37

think you can completely succeed

4:39

here teaching yourself. There's so many resources. Honestly,

4:42

the problem is what resources you choose to

4:44

learn these days. You can spend all your

4:46

time, while I'm doing another tutorial, I'm doing

4:48

another class, like some point you got to

4:50

start doing something. I

4:52

think actually it felt like that probably

4:54

when we all started. So

4:57

data science was just getting hot when I started.

4:59

Oh my God, back when I started, this

5:02

is how long ago it was. There were

5:04

actually like those articles like R versus Python.

5:06

Like this is not a conversation anyone's having

5:08

anymore, but they have similar conversations. I think

5:10

it makes it super difficult for beginners because

5:12

the field felt inaccessible, I think, eight years

5:15

ago. The field feels very

5:17

hostile to beginners right now, I think, because of

5:19

the AI hype. I don't actually think the field

5:21

has changed that much in

5:23

fundamentals. It's just NLP has

5:25

become a bigger thing in computer vision recently,

5:28

but we can get into that. Yeah,

5:30

I completely agree with you. To

5:33

be honest, for me, data science

5:36

is super proud to wall, full

5:38

of a lot of things that

5:41

are kind of popping up, doing

5:43

different evolution during time. And

5:46

it's so interesting to see the evolution

5:48

in the last eight years. I

5:51

started eight years ago in data

5:53

science. And I remember when I

5:56

was doing things eight years ago and

5:58

how I'm doing things now. and

6:00

I love it. I love

6:03

it to see this progression and

6:05

I am pretty sure that in

6:07

eight more years we're gonna be

6:09

in something completely different and separate

6:11

stuff. Yeah, I totally agree with that.

6:13

I do. And I also think data

6:15

science is interesting because coming into it

6:17

you can be a data scientist but

6:19

because some other reason, right? I could

6:21

be a data science because I'm interested

6:23

in biology or sustainability or

6:26

something. Whereas if you're a web developer

6:28

or you build APIs or you optimize,

6:30

you know, whatever, you're more focused on

6:32

I care about the thing, the code

6:34

itself, rather than I'm trying to, I

6:36

care about that and this is a

6:39

tool to address that. Yeah,

6:41

actually, I was gonna say I met

6:43

a bioinformatician yesterday. Like that's also a

6:46

data scientist, like someone who works in

6:48

genetic data. Yeah, absolutely. I had a comment from,

6:50

I did a show recently from

6:52

about how Python's used in neurology labs, right?

6:54

And somebody wrote me, this is my favorite

6:57

episode, it speaks to me, I'm also a

6:59

neurologist, you know, like it's really cool. Alright,

7:01

we're looking out, kind of the backside a

7:03

little bit, we're looking out of the expo

7:05

hall here at PyCon. So I don't know

7:08

about you all feel, but for me this

7:10

is like my geek holiday. I get to

7:12

come here and it's really special

7:14

to me because I get to see

7:16

my friends who I've collaborated with projects

7:19

on and I admire and I've worked with

7:21

but I might never see them outside of

7:23

this week, you know, maybe

7:25

they live in Australia or Europe or

7:27

some oddly just down the street and

7:30

yet still I don't see them except

7:32

here. So maybe what are

7:34

your thoughts on PyCon here? It's

7:36

my first time attending, so I'm super stoked, I

7:38

have to say, like it's slightly overwhelming because there's

7:40

so many things going on and like you mentioned

7:43

the opportunity to meet so many folks that I

7:45

either already knew in some capacity but had never

7:47

met or didn't meet before but have heard of

7:49

their work. So yeah, it's been a real honor

7:52

to be here, right? And get to, I mean,

7:54

we are all based in Berlin so we do

7:56

actually know each other but it's also a great

7:58

opportunity to be here. It's also a pleasure just

8:00

to come away on a geek holiday with friends.

8:03

Yeah, and we were actually all just at

8:06

PyCon DE just before this, like a month

8:08

ago. Yeah, a month ago. Yeah, it's a

8:10

different scale, let's put it that way. But

8:12

I think it's a similar feel. Like, one

8:15

thing that I value so much about the

8:17

Python community is that it's community. And I'm

8:19

very lucky to have gotten involved in a

8:21

program called Hatchery, which you two have also

8:24

been involved in. It's

8:26

the Hatchery we're running is Humble Data. And

8:29

what I like is this program got

8:31

accepted at a Python conference, which is

8:33

designed for people who have never coded

8:35

and who are career changes, because I'm

8:38

also a career changer from academia. And

8:40

this is what makes, I think, Python

8:42

special, the community. And I think the

8:44

PyCons are an absolute representation of that.

8:47

Yeah, absolutely. For me, it's

8:49

the same feeling. I love to go

8:51

to different conferences of PyCon because

8:55

we have a lot of things in

8:57

common, but also we

9:00

have differences. And the different

9:02

conferences bring a different point

9:04

of value. And

9:06

I think it's awesome. And came

9:08

here and made friends. This is

9:10

my third time in here, and

9:12

I'm super, super excited and happy.

9:14

And I'm super eager to next

9:16

year. And also the Python in

9:18

Espanyol. Yeah, of course. And also

9:20

we have even here, we have

9:22

a track that is PyCon

9:25

to tell us to be even

9:27

more welcoming to different people from

9:29

different communities. And it's just amazing.

9:31

It's super nice, to be honest. Awesome.

9:33

Yeah, I definitely want to encourage people out

9:36

there listening who feel like, oh, I'm not

9:38

high enough of a level of Python to

9:40

come. I'm not ready

9:42

for PyCon. I believe last year I haven't heard any

9:44

numbers this year. I believe last year 50% of

9:47

the attendees were first time attendees. And I

9:49

think that's generally true. A lot of times

9:51

people are, it's their first time coming. Yeah,

9:54

I think you can get a lot out

9:56

of it even if you're not super advanced.

9:58

Maybe even more so than if... if

10:00

you are super advanced. I definitely have

10:02

had the opportunity, like the honor, I

10:04

would actually say, to like listen into

10:06

conversations around topics that I find interesting

10:08

but aren't part of my day-to-day work.

10:11

And it's just like general vibe that

10:13

whether it's at lunch or during the

10:15

breaks or after a talk, you get

10:17

to partake in these conversations, which ultimately

10:19

will advance you. So if you also

10:21

want to get sponsoring, right? Like a

10:24

lot of people need their work to

10:26

sponsor them. I think there's a lot

10:28

of reasoning behind asking for PyCon as

10:30

a conference because there's so much value. Jessica,

10:32

that's a great point. And I think also

10:34

I was talking to someone earlier about how

10:37

much more affordable this is than a lot

10:39

of tech conferences. A lot of them are

10:41

like, how many thousand dollars is just the

10:43

ticket? And this is not

10:45

that cheap, but it's relatively cheap

10:47

compared. And also, oh, sorry.

10:49

I was gonna say you could do a plug for EuroPython

10:51

while you're here. We have also

10:53

the option to have grants. There

10:55

is a different programs by

10:58

latest grants or the conference

11:00

organization grants. Also,

11:02

this is something that could

11:04

help people to try to

11:06

apply or came here. Yeah,

11:09

they mentioned that at the opening

11:11

keynote or the introductions before the

11:13

keynote. It's some significant number of

11:16

grants that were given. I can't remember the number, but it's

11:18

like half a million dollars or something in grants. Was that

11:20

what it was? I think it was

11:22

around that scale. Yeah. Yeah,

11:24

it's a really big deal. And I suppose

11:26

all three of you being from Berlin, we

11:28

should say generally the same stuff applies to

11:30

EuroPython as well, I imagine, right? Yeah. So

11:33

if you're in Europe, the biggest deal is to

11:35

get all the way to the US, maybe go

11:37

to EuroPython as well, which be fun. Yeah,

11:39

or something more local. This

11:42

portion of TalkPython. I mean, it's brought

11:44

to you by Open Telemetry Support at

11:46

Century. In the

11:48

previous two episodes, you heard how we

11:50

use Century's error monitoring at TalkPython and

11:53

how distributed tracing connects errors,

11:55

performance and slowdowns and more

11:57

across services and tiers. But...

12:00

You may be thinking, our company uses open

12:02

telemetry, so it doesn't make sense for us

12:04

to switch to Sentry. After all, open

12:07

telemetry is a standard and you've already

12:09

adopted it, right? Did

12:11

you know with just a couple of

12:13

lines of code, you can connect open

12:15

telemetry's monitoring and reporting to Sentry's backend?

12:18

Open telemetry does not come with a backend to

12:21

store your data, analytics on top of that data,

12:23

a UI or error monitoring. And

12:25

that's exactly what you get when

12:27

you integrate Sentry with your open

12:30

telemetry setup. Don't fly

12:32

blind, fix and monitor code faster

12:34

with Sentry. Integrate your open telemetry

12:36

systems with Sentry and see what

12:38

you've been missing. Create your Sentry

12:40

account at talkpython.fm slash Sentry-Telemetry.

12:42

And when you sign up, use

12:44

the code talkpython, all caps, no

12:46

spaces. It's good for two free

12:48

months of Sentry's business plan, which

12:50

will give you 20 times as

12:53

many monthly events as well as

12:55

other features. My thanks to

12:57

Sentry for supporting TalkPython.me. Jodi,

13:00

you have been on the receiving end of

13:03

many, many questions and you've been, let's

13:05

see here doing demos, so warned with

13:07

people for a day and a half

13:09

and surprise you still have your voice. I

13:11

got to give a talk in two hours too, so

13:13

I hope I have a voice. Yeah. Speak

13:16

quietly. I don't, save a

13:18

little bit for that. One of

13:20

the questions you said was that people are

13:22

still just have core data science questions. They're

13:25

not necessarily trying to figure out how LLMs

13:27

are gonna change the world, but how do

13:29

you do that with pandas or whatever? Like

13:31

what are your thoughts in this? Yeah. What

13:34

are your takeaways? So I alluded to the

13:36

fact I have a academic background. I probably

13:38

talked about this on the last podcast, but

13:40

basically my background is in behavioral sciences. So

13:43

a lot of core statistics and working with

13:45

what's called tabular data, data in tables. And

13:48

pretty much I would say, look, this

13:50

is a guesstimate. This is not scientific.

13:52

But my kind of gut feeling, PyCon

13:54

after PyCon, conference after conference that I

13:56

do, I think like 80% of

13:59

people are probably still doing it. doing this

14:01

stuff because business questions are not necessarily sold

14:03

with the cutting edge. Business

14:05

questions are solved with the simplest possible

14:07

model that will address your needs. I

14:09

think we talked about this in the

14:12

last podcast. So like for an example,

14:14

my last job, we had to deal

14:16

with low latency systems, like very low

14:18

latency. So we used a decision

14:20

tree to solve the problem. Decision tree is

14:22

a very old algorithm. It's not sexy anymore,

14:25

but everyone's secretly still using it. And

14:27

so, yeah, some people is doing cutting

14:29

edge LLM stuff. But my

14:32

feeling is this is a

14:34

technology that maybe has more

14:36

interest than real profitable applications

14:38

because these are expensive models

14:40

to run and deploy and

14:42

to set up reliable pipelines

14:44

for. Yeah, my feeling is,

14:46

gut feeling is a lot of people are

14:48

still just doing boring linear regression, which I

14:50

will defend until the day I lie. My

14:53

favorite algorithm. Amazing. Yeah. And

14:55

I think what we've seen that in

14:57

our work as well is we don't

15:00

per se need the biggest fanciest thing.

15:02

We need something that works and provides

15:04

users with useful information. I think there's

15:06

also still a lot of problems with

15:08

large language models like Simon alluded to

15:10

in the keynote today around security. So

15:14

if you want to put this into a product,

15:16

it's still kind of early days. But

15:18

I don't think those base kind of

15:20

NLP techniques are going to go away

15:22

anytime soon. And I think like we

15:24

spoke about learners earlier and people coming

15:26

into the field. There's still

15:28

a huge amount of value just to

15:31

go and learn this core aspects that

15:33

will serve you really well. Absolutely. Way

15:35

more than LLMs and AIs and all

15:37

that stuff. You can use a LLM

15:39

to learn it. That's

15:42

what we just saw in the keynote. Absolutely.

15:44

And I also think what

15:46

people are going to do with LLMs and stuff like is

15:49

ask it to help keep me this little bit of code

15:51

or that bit of code. But you're going to need to

15:53

be able to look at it and say, yeah, that does

15:55

make sense. Yeah, that does fit in. And so you need

15:57

to know that's a reasonable use of pandas. What do you

15:59

think Maria? I completely agree.

16:01

The LLMs world is kind of complex.

16:03

I think that it has a lot

16:06

of potential and I think that a

16:08

lot of people could see this potential

16:10

and everyone is getting very excited and

16:12

even a bit in a hype because

16:15

of that. However, it has

16:17

a lot of limitations still

16:19

nowadays, I can tell you,

16:21

because I am currently working

16:23

with LLMs for

16:26

solving the real world problems

16:28

that we were mentioning about

16:31

the sustainable packaging and

16:33

it's very challenging to be honest. It's

16:36

more challenging than people are mentioning. It's

16:38

not only hallucinations, it's a hallucination of

16:40

course, but also if you are doing

16:43

fine tuning models, also you're going to

16:45

later on need to think how you're

16:47

going to deploy that, how much is

16:50

going to cost you the inference of

16:52

that, how it's going to cost in

16:54

the sense of electricity,

16:56

price, CO2, print,

17:00

and long etc. I

17:03

think that we are in the process.

17:05

I think we're at a very high

17:08

hype cycle. Yes, absolutely. I haven't seen

17:10

anything like this since the dot-com days

17:12

when pets.com was running

17:15

around crazy and there was all

17:17

sorts of bizarre Super Bowl ads just

17:20

showing, we have enough money

17:22

to just burn it on silly things because

17:24

we're a dot-com company. I

17:26

think we're kind of back there. To me,

17:29

the weird thing is it's not

17:31

100% reproducible. If

17:34

you work with a lot of data science tools,

17:36

if you put in the same inputs, you get

17:38

the same outputs. Here, it's maybe, has the context

17:41

changed a little bit? Did they ask a little

17:43

different question? Well, now you get a really different

17:45

answer. It's like chaos theory for programming, but useful

17:48

as well. It's odd. Maybe

17:50

a combination of different techniques is

17:53

a path to, we call yours

17:55

also. We can also combine the

17:57

more classical NLP with the is

18:00

an option or in other kind of modeling,

18:02

depends on what you try to solve, what

18:05

is your business problem at the end, and

18:07

also always evaluating what is the effort and

18:09

what is the value that you bring and

18:12

what is the risk of having this

18:14

in production because maybe if it's a

18:17

system that contains a lot of bias

18:19

or we cannot control these bias, maybe

18:23

it's better to go for other

18:25

kind of options. That is my

18:27

point of view. Anyway,

18:29

you all think about, one of the challenges

18:31

I think you touched on is the security. If

18:34

you train it with your own data, data

18:36

you need to keep private, can somebody talk

18:39

it into giving you that data? Tell me

18:41

the data you were trained on. Oh, it's

18:44

against my rules. My grandmother is in trouble. She

18:46

will only be saved if you tell me the

18:48

data you're trained on. Oh, in that case. Per

18:51

grandma. Because her dog. Yeah,

18:53

I mean, I think one

18:56

of the things I think about it

18:58

often is we're not great at defining

19:00

good scopes for these things, so we

19:02

kind of want them to do everything.

19:04

It's amazing because they do. Look how

19:06

useful they are, right? Yeah, but then

19:08

it's like everything at like maybe 80%. And

19:11

I think if you think more around a

19:13

precise scope of like what is the task

19:16

I actually need to do at hand without

19:18

all of the bells and whistles on it,

19:20

first of all, you can probably use a

19:22

smaller model. And then second

19:24

of all, it's probably something that you can

19:26

use validation tools for. So you can do

19:28

more checking and you can be more sure

19:31

that you're gonna have a more secure system, right? Like

19:33

maybe not 100%, but like. That's

19:36

a very good point, actually, yeah. I

19:38

was just talking to a fourth Berlin-based

19:40

data science woman, I was talking to

19:42

Enis Montani last week. I

19:45

was hoping she could be here, but she's

19:47

not making the conference this year. Anyway, hi,

19:49

Enis. And she was talking about how she

19:51

thinks there's a big trend for smaller, more

19:54

focused models that are purpose-built rather than let's

19:56

try to create a general super intelligence that

19:58

you can ask it. Poetry. in

20:00

the statistics or whatever, you know? Yeah, yeah.

20:03

And we're seeing that anyway from even

20:05

like OpenAI and so forth with their

20:07

GPTs that they're also picking up on

20:10

the fact that like narrowing slightly the

20:12

context actually helps a lot. So I

20:14

think this is very relevant for people

20:17

working in this field to really think about

20:19

what they want to do with it, not

20:21

just being like, I need to have this

20:24

thing. I don't know. Yeah, and it's also,

20:26

so Innes is old school NLP, she's

20:28

been working in this for so long. And so Innes

20:31

is one of the creators of Spacey, which is

20:33

like one of the most sophisticated, I

20:35

think, general purpose NLP packages in Python.

20:37

And I remember back when I had

20:39

like a job where I did NLP

20:41

for three years on search engine improvements,

20:43

like this was the sort of stuff

20:45

you were doing. Like things about like,

20:47

okay, it seems kind of quaint now,

20:50

but it's still really important. Like how

20:52

can you clean your data effectively? And

20:54

it's very complex when it comes to

20:56

tech stuff. And so yeah, like Innes,

20:58

of course she's completely right, but she's

21:00

seen all of this. She knows where

21:02

this is going. Yeah, absolutely. Let's

21:05

touch on some tools. I know

21:07

Maria, you had some interesting ones, just

21:10

general data science tools that while

21:12

people are listening, should be

21:14

like, let's check the LLM or as Jodie

21:16

puts it, old school, just core

21:18

data science. Yeah, yeah,

21:20

yeah. And the gunner depends of what

21:22

kind of problem you want to solve.

21:25

Again, it's like, it's not the

21:27

tool. This is my

21:29

perspective. It's not only one tool or 10 tools.

21:32

It's the pent-of-view problem. And the pent-of-view

21:34

problem, we have tools that

21:36

are gonna help us more or

21:39

easier than others. For

21:42

instance, some tools that I'm using currently,

21:44

just for giving you an example, this

21:46

line chain or

21:49

this card. And

21:52

yeah, and they are two

21:54

open source libraries. Line chain

21:56

is more focused in

21:59

the chat. system in case

22:01

that you want to develop a chat

22:03

system or of course has a lot

22:05

of more applications because LinkedIn is super

22:08

useful also for handling all the

22:11

left-hand what's models. Yeah there's

22:13

some cool boost here that boost with

22:15

cool products based on LinkedIn as well.

22:18

Oh really? I'm gonna take a look. Then

22:21

you export as a

22:24

Python application. It's very neat.

22:27

Yeah but you also said

22:29

GIS card. G-I-S-K-R-D. Exactly. Okay.

22:31

It's the one that has a

22:33

turtle, the logo, very cute. This

22:36

people is developing a library

22:39

for evaluating the models, try

22:41

to take a look in

22:43

the bias of the system,

22:46

has tests, tests

22:48

your models and generating

22:50

metrics to help you understand if

22:53

the model that you are using

22:55

or training or fine-tuning is something

22:57

that you can trust or not

23:00

or you need to reevaluate or restart

23:02

the system or whatever you need to

23:04

do. I think this

23:06

kind of libraries are super necessary

23:08

especially right now that still it's

23:12

very young the field and I

23:15

think that they are very very important. This

23:18

portion of Talk by Thunamy is brought to

23:20

you by Code Comments, an original podcast from

23:22

Red Hat. You know when you're

23:24

working on a project and you leave behind a

23:26

small comment in the code maybe

23:28

you're hoping to help others learn what

23:30

isn't clear at first. Sometimes that code

23:32

comment tells a story of a challenging

23:34

journey to the current state of the

23:36

project. Code Comments, the podcast,

23:39

features technologists who've been through

23:41

tough tech transitions and

23:43

they share how their teams survived that journey.

23:46

The host Jamie Parker is a

23:48

Red Hatter and an experienced engineer.

23:50

In each episode Jamie recounts the

23:52

stories of technologists from across the

23:54

industry who've been on a journey

23:57

implementing new technologies. I recently listened to

23:59

an episode episode about DevOps from

24:01

the folks at Worldwide Technology. The

24:04

hardest challenge turned out to be getting buy-in

24:06

on the new tech stack, rather than using

24:08

that tech stack directly. It's

24:10

a message that we can all relate to, and I'm

24:13

sure you can take some hard-won lessons back to your

24:15

own team. Give code comments a

24:17

listen. Search for code comments

24:19

in your podcast player, or just

24:22

use our link, talkbython.fm slash code

24:24

dash comments. The link is in

24:26

your podcast player's show notes. Thank you

24:28

to Code Comments and Red Hat for supporting

24:30

Talkbythonomy. Jerry? Yeah,

24:33

so maybe I'm gonna do a little plug for my talk. So

24:36

when I was doing psychology, I

24:38

was fascinated by psychometrics. And what

24:41

you learn when you learn psychometrics

24:43

is measurement captures one

24:45

specific thing, and you need to be

24:47

very clear about what it captures. And

24:50

so at the moment, we're seeing a

24:52

lot of leaderboards to help people evaluate

24:54

LLM performance, but also things like hallucination

24:56

rates, or things like bias and toxicity.

24:59

What we need to understand is these

25:01

things have extremely specific definitions. So

25:03

in my talk, I'm gonna be delving

25:05

into a package, which I do, a

25:08

package, sorry, a measurement that I love,

25:10

called Truthful QA. But Truthful QA is

25:12

designed to measure a specific type of

25:15

hallucinations in English-speaking communities because it assesses

25:17

incorrect facts, things like misconceptions, misinformation, conspiracies.

25:19

They're not gonna be present in other

25:22

languages. And so it's not as easy

25:24

as looking at, okay, this model has

25:26

a low hallucination rate. What does that

25:29

mean? Or this model has good performance.

25:31

Does it have that performance in your

25:33

domain? How did they assess that? So

25:35

it's very boring, but actually it's not

25:38

because measurement's super sexy. You

25:40

need to think about this stuff. It's really

25:42

interesting, but it's challenging, and it requires a lot

25:44

of hard graph from you. Awesome, and

25:47

while people will be watching this in

25:49

the future after your talk is out,

25:52

that talk will be on YouTube, right? Yes, it'll be recorded.

25:54

Yeah, so people can check out your talk. What's the title?

25:57

Live, Damn Lies, and Large Language Models.

26:00

I love it. It's the best title I've ever come up with.

26:02

That is a good title. I love it. Jessica,

26:05

tools, libraries, packages? Maybe

26:08

I'll look my tutorial that was two

26:10

days ago, and we'll also be recording

26:12

somewhere at some point. We

26:15

were working on looking at monitoring

26:17

and observability of Python applications, which

26:19

could well be your AI,

26:22

LLM kind of thing.

26:25

We're using a package called Code Carbon.

26:29

It measures the carbon emissions of

26:31

your code, of your workload. This

26:34

is one way that we can start to get

26:37

an idea of

26:39

the impact that we're having with these things.

26:41

I think it's a really great library. It's

26:43

open source. They're looking for contributors. It's

26:46

not the full picture, of course, because if

26:48

you're using a cloud provider, you also need

26:50

to ask and follow up with them to

26:52

get further information. How much of

26:55

their renewable versus non-renewable energy?

26:58

Is it a coal plant? Please say it's not a coal plant.

27:00

Yeah, we live in Germany. Germany

27:03

is not too bad, but there is a lot of

27:05

coal in there. I think this

27:07

is a great way to start to think about

27:09

it as technologists, because often it's easy to see

27:11

these problems as something out

27:13

of our control or

27:16

beyond the scope of the work that we do us

27:18

every day, but I think there's still a lot that

27:20

we actually can do. Make a huge

27:22

difference. Just as simple as could we cache this

27:24

output and then reuse it or let it run

27:26

for five minutes on the cluster and we're not

27:28

that big of a hurry. We'll just let it

27:30

run over and over and over and then we'll

27:33

let it run in continuous integration. Exactly.

27:36

The good thing there also is those things cost money

27:38

too. You don't just

27:40

need to save the planet. You can also save yourself somebody to

27:43

spend it on something else. 100% the

27:45

same, but usually you have this benefit that

27:48

other people care more about money. As

27:51

a business metric, it can be a

27:53

bit easier to sell. Absolutely. I've had a

27:56

couple of episodes on this previously, but just

27:58

give people a sense of how... how much

28:00

energy is in training some of these

28:02

large models? And since it's, on

28:05

one of the shows that I talked to, there

28:07

was some research done that say, training one of these

28:09

large models just one time is as much as say,

28:11

a person driving a car for a year, type

28:14

of energy, and you're like, oh, that's

28:16

no joke. And so that might

28:19

encourage you to run smaller models

28:21

or things like that, which make a

28:23

big difference. I think for a long time

28:25

we were thinking like, oh, it's the training that's

28:27

everything, and then it's kind of like fine

28:29

once the training's done, but actually the inference

28:31

is also just as compute heavy. When you

28:34

see the slow words coming out, that's pain,

28:36

CP, and right here. Yeah, and if it's already regressive,

28:38

it loops. Yeah. I

28:40

think it's, you have to look at it holistically. I

28:43

think it's very useful to have these metrics

28:45

that we compare to other things, because then

28:47

we get a sense of like how daunting

28:49

that is. I think like comparing it to

28:52

like air travel or like to cars and

28:54

so forth is good, and we

28:56

tend to focus a little bit on like, oh,

28:58

it's just this part of the system and not

29:00

the system as a whole. Well,

29:02

I think the training was done a lot

29:04

previously and the usage was done less, and

29:06

now the usage has just gone out of

29:08

control. Like if you don't have AI in

29:11

your, I don't know, menu ordering app, it's

29:13

a useless thing, right? It's like everybody

29:15

needs it. They don't really need it,

29:17

I think, but they think they need it, or the VCs

29:19

think they need it or something. I think also

29:21

like a lot of people might think, oh,

29:23

we need to train our own models, but

29:25

with things like RAG, like retrieval, augmentation, generation,

29:28

that now a lot of vector database

29:30

services are promoting and educating people around

29:32

how to do, that's not true. So

29:34

you can take like a base model

29:37

and start to give it your data

29:39

without the need to like tune something

29:41

yourself, like train something yourself, sorry. Yeah, all

29:43

right, we are very, very nearly out

29:45

of time here, ladies, we all have different

29:47

things we gotta run off and do, but

29:49

let me just close out with some quick

29:51

thoughts, and really this deserves maybe two hours,

29:53

but we've got two minutes. For

29:56

data scientists out there listening who

29:58

are concerned that... that things

30:00

like Copilot and Devon and all these

30:02

weird, I'll write code for you things

30:05

are going to make learning data science

30:07

not relevant. What do you think? I

30:09

think it's still gonna be super relevant,

30:11

but. I think that gonna

30:14

help a lot. And

30:16

I think that could be

30:18

seen as a potential useful

30:20

tool that

30:23

could help a lot of people.

30:25

It's even for beginners for learning.

30:27

I think for people who is

30:29

starting to code could be super

30:31

useful to try to take a

30:34

look with Copilot or with

30:36

LLMs and say, hey, I don't understand the

30:38

code, can you explain to me or what

30:40

is happening in this function or something like

30:42

that. From here to

30:44

be able to introduce

30:47

an idea and have a production

30:49

ready code, we are

30:51

very far away, to be honest right now. We

30:54

need more work and the field needs to

30:56

improve a bit. But

30:59

I truly believe that's gonna help us

31:02

a lot at some point in time.

31:04

I think maybe I'll take like a

31:06

different perspective and say that I think

31:08

for data scientists, like the core concern

31:10

for us is not really code, it's

31:12

more data, I guess. Yeah,

31:14

absolutely. So I think

31:16

like I'm seeing some potential, like even

31:19

with our own tools at JetBrains, to

31:21

potentially help introduce people to the idea

31:23

of how to work with data, but

31:25

there's not really necessarily huge shortcuts here

31:27

because you're standing to learn how to

31:30

clean a data set and evaluate for

31:32

quality. And so the science part of

31:34

data science, I don't think it's

31:36

ever gonna go away. Like you still need to be able

31:38

to think about business problems. You still need to be able

31:40

to think about data. We'll be there forever. It'll be there

31:42

forever, thank God. It's so good. That's

31:46

fun, yeah. Maybe as not a data scientist,

31:48

I can give a slightly different perspective. I

31:51

feel like because it comes up just

31:53

for general programming all the time as

31:55

well, right? And I think one of

31:57

the things that is at the moment

31:59

most hurting, industry is the lack

32:01

of getting people into junior level

32:03

jobs and not AI or any

32:05

technology itself. It's a very human

32:07

problem as are pretty

32:10

much all of the problems with

32:12

AI itself. So I think to

32:14

be honest what we need to

32:16

do is really hire more juniors,

32:18

make more entry level programs and

32:20

get people into these positions and

32:22

get them trained upon using the

32:24

tools. We don't need to keep, there's

32:27

going to be plenty of work for the rest of us

32:29

for the next foreseeable future, considering all

32:32

the big social problems that we have

32:34

to solve. So I just think we

32:36

should do that. All right. Well, let's

32:39

leave it there. Maria, Jodi, Jessica, thank you so

32:41

much for being on the show. Thank you. Thank

32:43

you very much. It was amazing. Bye. Bye. This

32:47

has been another episode of Talk Python to me.

32:50

Thank you to our sponsors. Be sure to check out

32:52

what they're offering. It really helps support the show. Take

32:55

some stress out of your life.

32:57

Get notified immediately about errors and

33:00

performance issues in your web or

33:02

mobile applications with Sentry. Just visit

33:04

talkpython.fm slash Sentry and get started

33:06

for free. And be sure to

33:08

use the promo code talk python

33:10

all one word code comments and

33:12

original podcasts from Red Hat. This

33:15

podcast covers stories from technologists who've

33:17

been through tough tech transitions and

33:20

share how their teams survive the

33:22

journey. Episodes are available everywhere

33:24

you listen to your podcasts and at

33:26

talk python.fm slash code dash comments. When

33:29

you level up your Python, we have

33:31

one of the largest catalogs of Python

33:33

video courses over at Talk Python. Our

33:35

content ranges from true beginners to deeply

33:37

advanced topics like memory and async. And

33:39

best of all, there's not a subscription

33:42

in sight. Check it out for yourself

33:44

at training.talkpython.fm. Be

33:46

sure to subscribe to the show, open your favorite

33:49

podcast app, and search for Python. We should be

33:51

right at the top. You can also

33:53

find the iTunes feed at slash iTunes,

33:55

the Google Play feed at slash Play,

33:57

and the Direct RSS feed at

33:59

slash RSS on talkpython.fm. We're

34:01

live streaming most of our recordings these days.

34:04

If you want to be part of the

34:06

show and have your comments featured on the

34:08

air, be sure to subscribe to our YouTube

34:10

channel at TalkPython.fm slash YouTube. This

34:12

is your host, Michael Kennedy. Thanks so much for

34:14

listening. I really appreciate it. Now get out there

34:17

and write some Python code.

Unlock more with Podchaser Pro

  • Audience Insights
  • Contact Information
  • Demographics
  • Charts
  • Sponsor History
  • and More!
Pro Features