Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
Use Ctrl + F to search
0:00
I have a special episode for you this time around. We're
0:02
coming to you live from PyCon 2024.
0:06
I had the chance to sit
0:08
down with some amazing people from
0:10
the data science side of things,
0:12
Jordy Burchill, Maria Jose, Molina Contreras,
0:14
and Jessica Green. We cover a
0:16
whole set of recent topics from a data
0:18
science perspective, though we did have to cut
0:20
the conversation a bit short as they were
0:22
coming from and going to talks they were
0:24
all giving, but it's still a pretty deep
0:26
conversation. I know you'll enjoy it. This
0:29
is TalkPython.me, episode 467 recorded on
0:32
location in Pittsburgh on May 18th,
0:34
2024. Are you
0:36
ready for your host, please? You're
0:39
listening to Michael Kennedy on TalkPython to
0:42
me, live from Portland, Oregon,
0:44
and this segment was made with Python. Welcome
0:50
to TalkPython to me, a weekly
0:52
podcast on Python. This is your
0:54
host, Michael Kennedy. Follow me on
0:56
Mastodon, where I'm at M Kennedy
0:58
and follow the podcast using at
1:00
TalkPython, both on boston.org. Keep
1:02
up with the show and listen to
1:04
over seven years of past episodes at
1:06
TalkPython.fm. We've started streaming most
1:09
of our episodes live on YouTube.
1:11
Subscribe to our YouTube channel over at
1:13
TalkPython.fm slash YouTube to get notified about
1:15
upcoming shows and be part of that
1:18
episode. This episode is
1:20
brought to you by Sentry. Don't let
1:22
those errors go unnoticed. Sentry like we
1:24
do here at TalkPython. Sign up at
1:27
TalkPython.fm slash Sentry. And it's
1:29
brought to you by Code Comments, an
1:31
original podcast from Red Hat. This podcast
1:33
covers stories from technologists who've been through
1:36
tough tech transitions and share
1:38
how their teams survive the journey.
1:41
Episodes are available everywhere you listen
1:43
to your podcasts and at TalkPython.fm
1:45
slash code dash comments. Hello
1:48
from PyCon. Hello, Jessica. Jodie, Maria, welcome
1:50
to TalkPython to me. It's awesome to
1:52
have you all here and I'm looking
1:54
forward to talking about data science, some
1:57
fun LLM questions, maybe some
1:59
controversial questions. some data science
2:01
tools, all sorts of good things. Of course, before
2:03
we get to that, Jodie, you've
2:05
been on the show a time or two and
2:08
people may know you, but maybe not. So how
2:10
about a quick introduction, what you all are into?
2:12
Maria, you wanna start? Oh, okay. Well,
2:15
my name is Maria. I am
2:17
originally from Barcelona, but I am
2:19
based in Berlin. I work as
2:21
a data scientist in a small
2:24
startup that will try to
2:27
solve some sustainability problems.
2:29
And yeah, that is new. Excellent. So
2:32
my name's Jodie and I am a data
2:34
science developer advocate. Been working in data science
2:36
for about eight years. And yeah, I'm
2:38
currently working at JetBrains as you can see from the shirt. And
2:41
in the background. And the background. And
2:44
so I say my interest at the
2:46
moment is natural language processing because
2:48
I worked in that a big chunk of
2:50
my career, but the core statistics will always
2:52
be my love. So tabular data, I'm there
2:55
for you always. Beautiful. Yeah,
2:57
my name is Jessica. So I'm an ML
2:59
engineer at KOSA, which is the search engine
3:02
for a better planet. I
3:04
am actually a career changer. So I used to
3:06
roast coffee for a living and I really just
3:09
got into this field in the last six years.
3:11
So I don't have like any formal
3:13
training. I'm a community slash self-taught engineer.
3:16
And I went through more of a
3:18
like a backend focused path. And
3:20
now I've started to work in the ML realm.
3:22
So really exciting. Yeah, very, very
3:25
interesting. Another thing I absolutely love
3:27
is coffee. Oh
3:29
my gosh. I
3:31
think we're running on it at PyCon. Pretty much
3:33
we are. Yeah, we're getting farther
3:35
into the show and more coffee
3:37
is needed. But I do want to
3:39
ask you, you know, what do you
3:41
think about being in the data science space?
3:43
That's a really different world that interacting with
3:46
people all day and working with your hands
3:48
more or whatever. Yeah, like how
3:50
has it been with this switch? There is
3:52
a lot of synergies actually. When you're still behind
3:55
the espresso machine and you're getting all the orders
3:57
in and you need to like problem solve. Right.
4:00
get everyone their correct order to the way
4:02
that they like it. So
4:04
there was a lot of transferable skills, I will say.
4:07
But I think what I found really powerful,
4:10
especially maybe learning at this
4:12
specific period of time, is
4:14
how accessible a lot of the tools
4:16
are today. So I won't say easy
4:18
because I put a lot of hard
4:20
work into it, but how possible it
4:22
is, even with a background like mine
4:24
to get into the field. Awesome.
4:26
I switched. I didn't have a formal
4:29
education either. I took two computer college
4:31
courses just because they match, I need
4:33
it for something else. I
4:37
think you can completely succeed
4:39
here teaching yourself. There's so many resources. Honestly,
4:42
the problem is what resources you choose to
4:44
learn these days. You can spend all your
4:46
time, while I'm doing another tutorial, I'm doing
4:48
another class, like some point you got to
4:50
start doing something. I
4:52
think actually it felt like that probably
4:54
when we all started. So
4:57
data science was just getting hot when I started.
4:59
Oh my God, back when I started, this
5:02
is how long ago it was. There were
5:04
actually like those articles like R versus Python.
5:06
Like this is not a conversation anyone's having
5:08
anymore, but they have similar conversations. I think
5:10
it makes it super difficult for beginners because
5:12
the field felt inaccessible, I think, eight years
5:15
ago. The field feels very
5:17
hostile to beginners right now, I think, because of
5:19
the AI hype. I don't actually think the field
5:21
has changed that much in
5:23
fundamentals. It's just NLP has
5:25
become a bigger thing in computer vision recently,
5:28
but we can get into that. Yeah,
5:30
I completely agree with you. To
5:33
be honest, for me, data science
5:36
is super proud to wall, full
5:38
of a lot of things that
5:41
are kind of popping up, doing
5:43
different evolution during time. And
5:46
it's so interesting to see the evolution
5:48
in the last eight years. I
5:51
started eight years ago in data
5:53
science. And I remember when I
5:56
was doing things eight years ago and
5:58
how I'm doing things now. and
6:00
I love it. I love
6:03
it to see this progression and
6:05
I am pretty sure that in
6:07
eight more years we're gonna be
6:09
in something completely different and separate
6:11
stuff. Yeah, I totally agree with that.
6:13
I do. And I also think data
6:15
science is interesting because coming into it
6:17
you can be a data scientist but
6:19
because some other reason, right? I could
6:21
be a data science because I'm interested
6:23
in biology or sustainability or
6:26
something. Whereas if you're a web developer
6:28
or you build APIs or you optimize,
6:30
you know, whatever, you're more focused on
6:32
I care about the thing, the code
6:34
itself, rather than I'm trying to, I
6:36
care about that and this is a
6:39
tool to address that. Yeah,
6:41
actually, I was gonna say I met
6:43
a bioinformatician yesterday. Like that's also a
6:46
data scientist, like someone who works in
6:48
genetic data. Yeah, absolutely. I had a comment from,
6:50
I did a show recently from
6:52
about how Python's used in neurology labs, right?
6:54
And somebody wrote me, this is my favorite
6:57
episode, it speaks to me, I'm also a
6:59
neurologist, you know, like it's really cool. Alright,
7:01
we're looking out, kind of the backside a
7:03
little bit, we're looking out of the expo
7:05
hall here at PyCon. So I don't know
7:08
about you all feel, but for me this
7:10
is like my geek holiday. I get to
7:12
come here and it's really special
7:14
to me because I get to see
7:16
my friends who I've collaborated with projects
7:19
on and I admire and I've worked with
7:21
but I might never see them outside of
7:23
this week, you know, maybe
7:25
they live in Australia or Europe or
7:27
some oddly just down the street and
7:30
yet still I don't see them except
7:32
here. So maybe what are
7:34
your thoughts on PyCon here? It's
7:36
my first time attending, so I'm super stoked, I
7:38
have to say, like it's slightly overwhelming because there's
7:40
so many things going on and like you mentioned
7:43
the opportunity to meet so many folks that I
7:45
either already knew in some capacity but had never
7:47
met or didn't meet before but have heard of
7:49
their work. So yeah, it's been a real honor
7:52
to be here, right? And get to, I mean,
7:54
we are all based in Berlin so we do
7:56
actually know each other but it's also a great
7:58
opportunity to be here. It's also a pleasure just
8:00
to come away on a geek holiday with friends.
8:03
Yeah, and we were actually all just at
8:06
PyCon DE just before this, like a month
8:08
ago. Yeah, a month ago. Yeah, it's a
8:10
different scale, let's put it that way. But
8:12
I think it's a similar feel. Like, one
8:15
thing that I value so much about the
8:17
Python community is that it's community. And I'm
8:19
very lucky to have gotten involved in a
8:21
program called Hatchery, which you two have also
8:24
been involved in. It's
8:26
the Hatchery we're running is Humble Data. And
8:29
what I like is this program got
8:31
accepted at a Python conference, which is
8:33
designed for people who have never coded
8:35
and who are career changes, because I'm
8:38
also a career changer from academia. And
8:40
this is what makes, I think, Python
8:42
special, the community. And I think the
8:44
PyCons are an absolute representation of that.
8:47
Yeah, absolutely. For me, it's
8:49
the same feeling. I love to go
8:51
to different conferences of PyCon because
8:55
we have a lot of things in
8:57
common, but also we
9:00
have differences. And the different
9:02
conferences bring a different point
9:04
of value. And
9:06
I think it's awesome. And came
9:08
here and made friends. This is
9:10
my third time in here, and
9:12
I'm super, super excited and happy.
9:14
And I'm super eager to next
9:16
year. And also the Python in
9:18
Espanyol. Yeah, of course. And also
9:20
we have even here, we have
9:22
a track that is PyCon
9:25
to tell us to be even
9:27
more welcoming to different people from
9:29
different communities. And it's just amazing.
9:31
It's super nice, to be honest. Awesome.
9:33
Yeah, I definitely want to encourage people out
9:36
there listening who feel like, oh, I'm not
9:38
high enough of a level of Python to
9:40
come. I'm not ready
9:42
for PyCon. I believe last year I haven't heard any
9:44
numbers this year. I believe last year 50% of
9:47
the attendees were first time attendees. And I
9:49
think that's generally true. A lot of times
9:51
people are, it's their first time coming. Yeah,
9:54
I think you can get a lot out
9:56
of it even if you're not super advanced.
9:58
Maybe even more so than if... if
10:00
you are super advanced. I definitely have
10:02
had the opportunity, like the honor, I
10:04
would actually say, to like listen into
10:06
conversations around topics that I find interesting
10:08
but aren't part of my day-to-day work.
10:11
And it's just like general vibe that
10:13
whether it's at lunch or during the
10:15
breaks or after a talk, you get
10:17
to partake in these conversations, which ultimately
10:19
will advance you. So if you also
10:21
want to get sponsoring, right? Like a
10:24
lot of people need their work to
10:26
sponsor them. I think there's a lot
10:28
of reasoning behind asking for PyCon as
10:30
a conference because there's so much value. Jessica,
10:32
that's a great point. And I think also
10:34
I was talking to someone earlier about how
10:37
much more affordable this is than a lot
10:39
of tech conferences. A lot of them are
10:41
like, how many thousand dollars is just the
10:43
ticket? And this is not
10:45
that cheap, but it's relatively cheap
10:47
compared. And also, oh, sorry.
10:49
I was gonna say you could do a plug for EuroPython
10:51
while you're here. We have also
10:53
the option to have grants. There
10:55
is a different programs by
10:58
latest grants or the conference
11:00
organization grants. Also,
11:02
this is something that could
11:04
help people to try to
11:06
apply or came here. Yeah,
11:09
they mentioned that at the opening
11:11
keynote or the introductions before the
11:13
keynote. It's some significant number of
11:16
grants that were given. I can't remember the number, but it's
11:18
like half a million dollars or something in grants. Was that
11:20
what it was? I think it was
11:22
around that scale. Yeah. Yeah,
11:24
it's a really big deal. And I suppose
11:26
all three of you being from Berlin, we
11:28
should say generally the same stuff applies to
11:30
EuroPython as well, I imagine, right? Yeah. So
11:33
if you're in Europe, the biggest deal is to
11:35
get all the way to the US, maybe go
11:37
to EuroPython as well, which be fun. Yeah,
11:39
or something more local. This
11:42
portion of TalkPython. I mean, it's brought
11:44
to you by Open Telemetry Support at
11:46
Century. In the
11:48
previous two episodes, you heard how we
11:50
use Century's error monitoring at TalkPython and
11:53
how distributed tracing connects errors,
11:55
performance and slowdowns and more
11:57
across services and tiers. But...
12:00
You may be thinking, our company uses open
12:02
telemetry, so it doesn't make sense for us
12:04
to switch to Sentry. After all, open
12:07
telemetry is a standard and you've already
12:09
adopted it, right? Did
12:11
you know with just a couple of
12:13
lines of code, you can connect open
12:15
telemetry's monitoring and reporting to Sentry's backend?
12:18
Open telemetry does not come with a backend to
12:21
store your data, analytics on top of that data,
12:23
a UI or error monitoring. And
12:25
that's exactly what you get when
12:27
you integrate Sentry with your open
12:30
telemetry setup. Don't fly
12:32
blind, fix and monitor code faster
12:34
with Sentry. Integrate your open telemetry
12:36
systems with Sentry and see what
12:38
you've been missing. Create your Sentry
12:40
account at talkpython.fm slash Sentry-Telemetry.
12:42
And when you sign up, use
12:44
the code talkpython, all caps, no
12:46
spaces. It's good for two free
12:48
months of Sentry's business plan, which
12:50
will give you 20 times as
12:53
many monthly events as well as
12:55
other features. My thanks to
12:57
Sentry for supporting TalkPython.me. Jodi,
13:00
you have been on the receiving end of
13:03
many, many questions and you've been, let's
13:05
see here doing demos, so warned with
13:07
people for a day and a half
13:09
and surprise you still have your voice. I
13:11
got to give a talk in two hours too, so
13:13
I hope I have a voice. Yeah. Speak
13:16
quietly. I don't, save a
13:18
little bit for that. One of
13:20
the questions you said was that people are
13:22
still just have core data science questions. They're
13:25
not necessarily trying to figure out how LLMs
13:27
are gonna change the world, but how do
13:29
you do that with pandas or whatever? Like
13:31
what are your thoughts in this? Yeah. What
13:34
are your takeaways? So I alluded to the
13:36
fact I have a academic background. I probably
13:38
talked about this on the last podcast, but
13:40
basically my background is in behavioral sciences. So
13:43
a lot of core statistics and working with
13:45
what's called tabular data, data in tables. And
13:48
pretty much I would say, look, this
13:50
is a guesstimate. This is not scientific.
13:52
But my kind of gut feeling, PyCon
13:54
after PyCon, conference after conference that I
13:56
do, I think like 80% of
13:59
people are probably still doing it. doing this
14:01
stuff because business questions are not necessarily sold
14:03
with the cutting edge. Business
14:05
questions are solved with the simplest possible
14:07
model that will address your needs. I
14:09
think we talked about this in the
14:12
last podcast. So like for an example,
14:14
my last job, we had to deal
14:16
with low latency systems, like very low
14:18
latency. So we used a decision
14:20
tree to solve the problem. Decision tree is
14:22
a very old algorithm. It's not sexy anymore,
14:25
but everyone's secretly still using it. And
14:27
so, yeah, some people is doing cutting
14:29
edge LLM stuff. But my
14:32
feeling is this is a
14:34
technology that maybe has more
14:36
interest than real profitable applications
14:38
because these are expensive models
14:40
to run and deploy and
14:42
to set up reliable pipelines
14:44
for. Yeah, my feeling is,
14:46
gut feeling is a lot of people are
14:48
still just doing boring linear regression, which I
14:50
will defend until the day I lie. My
14:53
favorite algorithm. Amazing. Yeah. And
14:55
I think what we've seen that in
14:57
our work as well is we don't
15:00
per se need the biggest fanciest thing.
15:02
We need something that works and provides
15:04
users with useful information. I think there's
15:06
also still a lot of problems with
15:08
large language models like Simon alluded to
15:10
in the keynote today around security. So
15:14
if you want to put this into a product,
15:16
it's still kind of early days. But
15:18
I don't think those base kind of
15:20
NLP techniques are going to go away
15:22
anytime soon. And I think like we
15:24
spoke about learners earlier and people coming
15:26
into the field. There's still
15:28
a huge amount of value just to
15:31
go and learn this core aspects that
15:33
will serve you really well. Absolutely. Way
15:35
more than LLMs and AIs and all
15:37
that stuff. You can use a LLM
15:39
to learn it. That's
15:42
what we just saw in the keynote. Absolutely.
15:44
And I also think what
15:46
people are going to do with LLMs and stuff like is
15:49
ask it to help keep me this little bit of code
15:51
or that bit of code. But you're going to need to
15:53
be able to look at it and say, yeah, that does
15:55
make sense. Yeah, that does fit in. And so you need
15:57
to know that's a reasonable use of pandas. What do you
15:59
think Maria? I completely agree.
16:01
The LLMs world is kind of complex.
16:03
I think that it has a lot
16:06
of potential and I think that a
16:08
lot of people could see this potential
16:10
and everyone is getting very excited and
16:12
even a bit in a hype because
16:15
of that. However, it has
16:17
a lot of limitations still
16:19
nowadays, I can tell you,
16:21
because I am currently working
16:23
with LLMs for
16:26
solving the real world problems
16:28
that we were mentioning about
16:31
the sustainable packaging and
16:33
it's very challenging to be honest. It's
16:36
more challenging than people are mentioning. It's
16:38
not only hallucinations, it's a hallucination of
16:40
course, but also if you are doing
16:43
fine tuning models, also you're going to
16:45
later on need to think how you're
16:47
going to deploy that, how much is
16:50
going to cost you the inference of
16:52
that, how it's going to cost in
16:54
the sense of electricity,
16:56
price, CO2, print,
17:00
and long etc. I
17:03
think that we are in the process.
17:05
I think we're at a very high
17:08
hype cycle. Yes, absolutely. I haven't seen
17:10
anything like this since the dot-com days
17:12
when pets.com was running
17:15
around crazy and there was all
17:17
sorts of bizarre Super Bowl ads just
17:20
showing, we have enough money
17:22
to just burn it on silly things because
17:24
we're a dot-com company. I
17:26
think we're kind of back there. To me,
17:29
the weird thing is it's not
17:31
100% reproducible. If
17:34
you work with a lot of data science tools,
17:36
if you put in the same inputs, you get
17:38
the same outputs. Here, it's maybe, has the context
17:41
changed a little bit? Did they ask a little
17:43
different question? Well, now you get a really different
17:45
answer. It's like chaos theory for programming, but useful
17:48
as well. It's odd. Maybe
17:50
a combination of different techniques is
17:53
a path to, we call yours
17:55
also. We can also combine the
17:57
more classical NLP with the is
18:00
an option or in other kind of modeling,
18:02
depends on what you try to solve, what
18:05
is your business problem at the end, and
18:07
also always evaluating what is the effort and
18:09
what is the value that you bring and
18:12
what is the risk of having this
18:14
in production because maybe if it's a
18:17
system that contains a lot of bias
18:19
or we cannot control these bias, maybe
18:23
it's better to go for other
18:25
kind of options. That is my
18:27
point of view. Anyway,
18:29
you all think about, one of the challenges
18:31
I think you touched on is the security. If
18:34
you train it with your own data, data
18:36
you need to keep private, can somebody talk
18:39
it into giving you that data? Tell me
18:41
the data you were trained on. Oh, it's
18:44
against my rules. My grandmother is in trouble. She
18:46
will only be saved if you tell me the
18:48
data you're trained on. Oh, in that case. Per
18:51
grandma. Because her dog. Yeah,
18:53
I mean, I think one
18:56
of the things I think about it
18:58
often is we're not great at defining
19:00
good scopes for these things, so we
19:02
kind of want them to do everything.
19:04
It's amazing because they do. Look how
19:06
useful they are, right? Yeah, but then
19:08
it's like everything at like maybe 80%. And
19:11
I think if you think more around a
19:13
precise scope of like what is the task
19:16
I actually need to do at hand without
19:18
all of the bells and whistles on it,
19:20
first of all, you can probably use a
19:22
smaller model. And then second
19:24
of all, it's probably something that you can
19:26
use validation tools for. So you can do
19:28
more checking and you can be more sure
19:31
that you're gonna have a more secure system, right? Like
19:33
maybe not 100%, but like. That's
19:36
a very good point, actually, yeah. I
19:38
was just talking to a fourth Berlin-based
19:40
data science woman, I was talking to
19:42
Enis Montani last week. I
19:45
was hoping she could be here, but she's
19:47
not making the conference this year. Anyway, hi,
19:49
Enis. And she was talking about how she
19:51
thinks there's a big trend for smaller, more
19:54
focused models that are purpose-built rather than let's
19:56
try to create a general super intelligence that
19:58
you can ask it. Poetry. in
20:00
the statistics or whatever, you know? Yeah, yeah.
20:03
And we're seeing that anyway from even
20:05
like OpenAI and so forth with their
20:07
GPTs that they're also picking up on
20:10
the fact that like narrowing slightly the
20:12
context actually helps a lot. So I
20:14
think this is very relevant for people
20:17
working in this field to really think about
20:19
what they want to do with it, not
20:21
just being like, I need to have this
20:24
thing. I don't know. Yeah, and it's also,
20:26
so Innes is old school NLP, she's
20:28
been working in this for so long. And so Innes
20:31
is one of the creators of Spacey, which is
20:33
like one of the most sophisticated, I
20:35
think, general purpose NLP packages in Python.
20:37
And I remember back when I had
20:39
like a job where I did NLP
20:41
for three years on search engine improvements,
20:43
like this was the sort of stuff
20:45
you were doing. Like things about like,
20:47
okay, it seems kind of quaint now,
20:50
but it's still really important. Like how
20:52
can you clean your data effectively? And
20:54
it's very complex when it comes to
20:56
tech stuff. And so yeah, like Innes,
20:58
of course she's completely right, but she's
21:00
seen all of this. She knows where
21:02
this is going. Yeah, absolutely. Let's
21:05
touch on some tools. I know
21:07
Maria, you had some interesting ones, just
21:10
general data science tools that while
21:12
people are listening, should be
21:14
like, let's check the LLM or as Jodie
21:16
puts it, old school, just core
21:18
data science. Yeah, yeah,
21:20
yeah. And the gunner depends of what
21:22
kind of problem you want to solve.
21:25
Again, it's like, it's not the
21:27
tool. This is my
21:29
perspective. It's not only one tool or 10 tools.
21:32
It's the pent-of-view problem. And the pent-of-view
21:34
problem, we have tools that
21:36
are gonna help us more or
21:39
easier than others. For
21:42
instance, some tools that I'm using currently,
21:44
just for giving you an example, this
21:46
line chain or
21:49
this card. And
21:52
yeah, and they are two
21:54
open source libraries. Line chain
21:56
is more focused in
21:59
the chat. system in case
22:01
that you want to develop a chat
22:03
system or of course has a lot
22:05
of more applications because LinkedIn is super
22:08
useful also for handling all the
22:11
left-hand what's models. Yeah there's
22:13
some cool boost here that boost with
22:15
cool products based on LinkedIn as well.
22:18
Oh really? I'm gonna take a look. Then
22:21
you export as a
22:24
Python application. It's very neat.
22:27
Yeah but you also said
22:29
GIS card. G-I-S-K-R-D. Exactly. Okay.
22:31
It's the one that has a
22:33
turtle, the logo, very cute. This
22:36
people is developing a library
22:39
for evaluating the models, try
22:41
to take a look in
22:43
the bias of the system,
22:46
has tests, tests
22:48
your models and generating
22:50
metrics to help you understand if
22:53
the model that you are using
22:55
or training or fine-tuning is something
22:57
that you can trust or not
23:00
or you need to reevaluate or restart
23:02
the system or whatever you need to
23:04
do. I think this
23:06
kind of libraries are super necessary
23:08
especially right now that still it's
23:12
very young the field and I
23:15
think that they are very very important. This
23:18
portion of Talk by Thunamy is brought to
23:20
you by Code Comments, an original podcast from
23:22
Red Hat. You know when you're
23:24
working on a project and you leave behind a
23:26
small comment in the code maybe
23:28
you're hoping to help others learn what
23:30
isn't clear at first. Sometimes that code
23:32
comment tells a story of a challenging
23:34
journey to the current state of the
23:36
project. Code Comments, the podcast,
23:39
features technologists who've been through
23:41
tough tech transitions and
23:43
they share how their teams survived that journey.
23:46
The host Jamie Parker is a
23:48
Red Hatter and an experienced engineer.
23:50
In each episode Jamie recounts the
23:52
stories of technologists from across the
23:54
industry who've been on a journey
23:57
implementing new technologies. I recently listened to
23:59
an episode episode about DevOps from
24:01
the folks at Worldwide Technology. The
24:04
hardest challenge turned out to be getting buy-in
24:06
on the new tech stack, rather than using
24:08
that tech stack directly. It's
24:10
a message that we can all relate to, and I'm
24:13
sure you can take some hard-won lessons back to your
24:15
own team. Give code comments a
24:17
listen. Search for code comments
24:19
in your podcast player, or just
24:22
use our link, talkbython.fm slash code
24:24
dash comments. The link is in
24:26
your podcast player's show notes. Thank you
24:28
to Code Comments and Red Hat for supporting
24:30
Talkbythonomy. Jerry? Yeah,
24:33
so maybe I'm gonna do a little plug for my talk. So
24:36
when I was doing psychology, I
24:38
was fascinated by psychometrics. And what
24:41
you learn when you learn psychometrics
24:43
is measurement captures one
24:45
specific thing, and you need to be
24:47
very clear about what it captures. And
24:50
so at the moment, we're seeing a
24:52
lot of leaderboards to help people evaluate
24:54
LLM performance, but also things like hallucination
24:56
rates, or things like bias and toxicity.
24:59
What we need to understand is these
25:01
things have extremely specific definitions. So
25:03
in my talk, I'm gonna be delving
25:05
into a package, which I do, a
25:08
package, sorry, a measurement that I love,
25:10
called Truthful QA. But Truthful QA is
25:12
designed to measure a specific type of
25:15
hallucinations in English-speaking communities because it assesses
25:17
incorrect facts, things like misconceptions, misinformation, conspiracies.
25:19
They're not gonna be present in other
25:22
languages. And so it's not as easy
25:24
as looking at, okay, this model has
25:26
a low hallucination rate. What does that
25:29
mean? Or this model has good performance.
25:31
Does it have that performance in your
25:33
domain? How did they assess that? So
25:35
it's very boring, but actually it's not
25:38
because measurement's super sexy. You
25:40
need to think about this stuff. It's really
25:42
interesting, but it's challenging, and it requires a lot
25:44
of hard graph from you. Awesome, and
25:47
while people will be watching this in
25:49
the future after your talk is out,
25:52
that talk will be on YouTube, right? Yes, it'll be recorded.
25:54
Yeah, so people can check out your talk. What's the title?
25:57
Live, Damn Lies, and Large Language Models.
26:00
I love it. It's the best title I've ever come up with.
26:02
That is a good title. I love it. Jessica,
26:05
tools, libraries, packages? Maybe
26:08
I'll look my tutorial that was two
26:10
days ago, and we'll also be recording
26:12
somewhere at some point. We
26:15
were working on looking at monitoring
26:17
and observability of Python applications, which
26:19
could well be your AI,
26:22
LLM kind of thing.
26:25
We're using a package called Code Carbon.
26:29
It measures the carbon emissions of
26:31
your code, of your workload. This
26:34
is one way that we can start to get
26:37
an idea of
26:39
the impact that we're having with these things.
26:41
I think it's a really great library. It's
26:43
open source. They're looking for contributors. It's
26:46
not the full picture, of course, because if
26:48
you're using a cloud provider, you also need
26:50
to ask and follow up with them to
26:52
get further information. How much of
26:55
their renewable versus non-renewable energy?
26:58
Is it a coal plant? Please say it's not a coal plant.
27:00
Yeah, we live in Germany. Germany
27:03
is not too bad, but there is a lot of
27:05
coal in there. I think this
27:07
is a great way to start to think about
27:09
it as technologists, because often it's easy to see
27:11
these problems as something out
27:13
of our control or
27:16
beyond the scope of the work that we do us
27:18
every day, but I think there's still a lot that
27:20
we actually can do. Make a huge
27:22
difference. Just as simple as could we cache this
27:24
output and then reuse it or let it run
27:26
for five minutes on the cluster and we're not
27:28
that big of a hurry. We'll just let it
27:30
run over and over and over and then we'll
27:33
let it run in continuous integration. Exactly.
27:36
The good thing there also is those things cost money
27:38
too. You don't just
27:40
need to save the planet. You can also save yourself somebody to
27:43
spend it on something else. 100% the
27:45
same, but usually you have this benefit that
27:48
other people care more about money. As
27:51
a business metric, it can be a
27:53
bit easier to sell. Absolutely. I've had a
27:56
couple of episodes on this previously, but just
27:58
give people a sense of how... how much
28:00
energy is in training some of these
28:02
large models? And since it's, on
28:05
one of the shows that I talked to, there
28:07
was some research done that say, training one of these
28:09
large models just one time is as much as say,
28:11
a person driving a car for a year, type
28:14
of energy, and you're like, oh, that's
28:16
no joke. And so that might
28:19
encourage you to run smaller models
28:21
or things like that, which make a
28:23
big difference. I think for a long time
28:25
we were thinking like, oh, it's the training that's
28:27
everything, and then it's kind of like fine
28:29
once the training's done, but actually the inference
28:31
is also just as compute heavy. When you
28:34
see the slow words coming out, that's pain,
28:36
CP, and right here. Yeah, and if it's already regressive,
28:38
it loops. Yeah. I
28:40
think it's, you have to look at it holistically. I
28:43
think it's very useful to have these metrics
28:45
that we compare to other things, because then
28:47
we get a sense of like how daunting
28:49
that is. I think like comparing it to
28:52
like air travel or like to cars and
28:54
so forth is good, and we
28:56
tend to focus a little bit on like, oh,
28:58
it's just this part of the system and not
29:00
the system as a whole. Well,
29:02
I think the training was done a lot
29:04
previously and the usage was done less, and
29:06
now the usage has just gone out of
29:08
control. Like if you don't have AI in
29:11
your, I don't know, menu ordering app, it's
29:13
a useless thing, right? It's like everybody
29:15
needs it. They don't really need it,
29:17
I think, but they think they need it, or the VCs
29:19
think they need it or something. I think also
29:21
like a lot of people might think, oh,
29:23
we need to train our own models, but
29:25
with things like RAG, like retrieval, augmentation, generation,
29:28
that now a lot of vector database
29:30
services are promoting and educating people around
29:32
how to do, that's not true. So
29:34
you can take like a base model
29:37
and start to give it your data
29:39
without the need to like tune something
29:41
yourself, like train something yourself, sorry. Yeah, all
29:43
right, we are very, very nearly out
29:45
of time here, ladies, we all have different
29:47
things we gotta run off and do, but
29:49
let me just close out with some quick
29:51
thoughts, and really this deserves maybe two hours,
29:53
but we've got two minutes. For
29:56
data scientists out there listening who
29:58
are concerned that... that things
30:00
like Copilot and Devon and all these
30:02
weird, I'll write code for you things
30:05
are going to make learning data science
30:07
not relevant. What do you think? I
30:09
think it's still gonna be super relevant,
30:11
but. I think that gonna
30:14
help a lot. And
30:16
I think that could be
30:18
seen as a potential useful
30:20
tool that
30:23
could help a lot of people.
30:25
It's even for beginners for learning.
30:27
I think for people who is
30:29
starting to code could be super
30:31
useful to try to take a
30:34
look with Copilot or with
30:36
LLMs and say, hey, I don't understand the
30:38
code, can you explain to me or what
30:40
is happening in this function or something like
30:42
that. From here to
30:44
be able to introduce
30:47
an idea and have a production
30:49
ready code, we are
30:51
very far away, to be honest right now. We
30:54
need more work and the field needs to
30:56
improve a bit. But
30:59
I truly believe that's gonna help us
31:02
a lot at some point in time.
31:04
I think maybe I'll take like a
31:06
different perspective and say that I think
31:08
for data scientists, like the core concern
31:10
for us is not really code, it's
31:12
more data, I guess. Yeah,
31:14
absolutely. So I think
31:16
like I'm seeing some potential, like even
31:19
with our own tools at JetBrains, to
31:21
potentially help introduce people to the idea
31:23
of how to work with data, but
31:25
there's not really necessarily huge shortcuts here
31:27
because you're standing to learn how to
31:30
clean a data set and evaluate for
31:32
quality. And so the science part of
31:34
data science, I don't think it's
31:36
ever gonna go away. Like you still need to be able
31:38
to think about business problems. You still need to be able
31:40
to think about data. We'll be there forever. It'll be there
31:42
forever, thank God. It's so good. That's
31:46
fun, yeah. Maybe as not a data scientist,
31:48
I can give a slightly different perspective. I
31:51
feel like because it comes up just
31:53
for general programming all the time as
31:55
well, right? And I think one of
31:57
the things that is at the moment
31:59
most hurting, industry is the lack
32:01
of getting people into junior level
32:03
jobs and not AI or any
32:05
technology itself. It's a very human
32:07
problem as are pretty
32:10
much all of the problems with
32:12
AI itself. So I think to
32:14
be honest what we need to
32:16
do is really hire more juniors,
32:18
make more entry level programs and
32:20
get people into these positions and
32:22
get them trained upon using the
32:24
tools. We don't need to keep, there's
32:27
going to be plenty of work for the rest of us
32:29
for the next foreseeable future, considering all
32:32
the big social problems that we have
32:34
to solve. So I just think we
32:36
should do that. All right. Well, let's
32:39
leave it there. Maria, Jodi, Jessica, thank you so
32:41
much for being on the show. Thank you. Thank
32:43
you very much. It was amazing. Bye. Bye. This
32:47
has been another episode of Talk Python to me.
32:50
Thank you to our sponsors. Be sure to check out
32:52
what they're offering. It really helps support the show. Take
32:55
some stress out of your life.
32:57
Get notified immediately about errors and
33:00
performance issues in your web or
33:02
mobile applications with Sentry. Just visit
33:04
talkpython.fm slash Sentry and get started
33:06
for free. And be sure to
33:08
use the promo code talk python
33:10
all one word code comments and
33:12
original podcasts from Red Hat. This
33:15
podcast covers stories from technologists who've
33:17
been through tough tech transitions and
33:20
share how their teams survive the
33:22
journey. Episodes are available everywhere
33:24
you listen to your podcasts and at
33:26
talk python.fm slash code dash comments. When
33:29
you level up your Python, we have
33:31
one of the largest catalogs of Python
33:33
video courses over at Talk Python. Our
33:35
content ranges from true beginners to deeply
33:37
advanced topics like memory and async. And
33:39
best of all, there's not a subscription
33:42
in sight. Check it out for yourself
33:44
at training.talkpython.fm. Be
33:46
sure to subscribe to the show, open your favorite
33:49
podcast app, and search for Python. We should be
33:51
right at the top. You can also
33:53
find the iTunes feed at slash iTunes,
33:55
the Google Play feed at slash Play,
33:57
and the Direct RSS feed at
33:59
slash RSS on talkpython.fm. We're
34:01
live streaming most of our recordings these days.
34:04
If you want to be part of the
34:06
show and have your comments featured on the
34:08
air, be sure to subscribe to our YouTube
34:10
channel at TalkPython.fm slash YouTube. This
34:12
is your host, Michael Kennedy. Thanks so much for
34:14
listening. I really appreciate it. Now get out there
34:17
and write some Python code.
Podchaser is the ultimate destination for podcast data, search, and discovery. Learn More