Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
0:00
Welcome to the Real Python Podcast.
0:03
This is episode 197. How is Python being used to automate
0:08
processes in the laboratory? How
0:10
can it speed up scientific work with
0:12
DNA sequencing? This
0:15
week on the show, chemical
0:17
engineering PhD student Parsa Ghadermazi
0:19
is here to discuss Python
0:21
and bioinformatics. Parsa
0:23
provides background on his research
0:26
and the bioinformatic techniques used
0:28
to discover gut microbes' role
0:30
in human health and diseases.
0:32
We talk about automating lab
0:34
experiments with liquid handling robots
0:36
in Python. We dig
0:38
into libraries to shatter and reassemble
0:40
DNA sequences. Parsa also shares
0:43
current projects from the Chan Lab
0:45
at Colorado State University and his
0:47
GitHub repository. All
0:49
right, let's get started. The Real Python Podcast is
1:11
a weekly conversation about using Python in
1:13
the real world. My name is
1:15
Christopher Bailey, your host. Each week,
1:17
we feature interviews with experts in
1:19
the community and discussions about the
1:21
topics, articles, and courses found at
1:23
realpython.com. After the podcast, join us
1:26
and learn real-world Python skills with
1:28
a community of experts at realpython.com. Hey,
1:31
Parsa, welcome to the show. Hi, Christopher. Thanks for
1:33
having me on the show. I'm
1:36
really excited to talk to you. You
1:38
reached out and had a bunch of
1:40
interesting things that you wanted to talk
1:42
about. A lot of them have to
1:44
do with real world applications of Python
1:48
in the laboratory and experiments.
1:50
Maybe you could give me a
1:52
little background on where
1:54
you're at. You're currently in your PhD program.
1:56
Maybe you could explain a little bit about
1:58
what you're currently doing. Yeah,
2:02
so right now I'm in
2:04
my PhD program, fifth year
2:06
at Colorado State University, and
2:09
I'm in chemical and biological
2:11
engineering department. So my background,
2:14
I'm coming from an engineering
2:16
background, and honestly for my
2:18
undergraduate studies, I never did
2:21
any biology. And
2:23
in my doctoral studies, I got
2:25
interested in biological systems.
2:29
And somehow they're very similar to systems that
2:31
we study right now. They are
2:33
maybe more complex, but the concepts
2:35
behind them are very similar to
2:38
classical chemical engineering like factories.
2:41
We actually treat the cells like
2:43
factories, and the analogy goes beyond
2:45
that even. Like we have, for
2:47
example, piping systems. They have some
2:49
sort of analog in the cellular
2:52
and biological. Okay. So
2:54
I find this system really interesting, and
2:57
a lot of programming is also involved
2:59
in this process. I think that's
3:01
really cool that you kind of almost sort
3:03
of shifted direction into your PhD
3:05
program. That's pretty cool. Were
3:08
you doing any programming before that in your
3:10
other, was it an engineering course then before
3:13
that? Yes. So mainly
3:15
we were using MATLAB for
3:18
anything, and it was mostly
3:20
computer simulations. Okay. And
3:23
the fun part is that in my first
3:25
year in the PhD, everything was in MATLAB.
3:28
But maybe over a year,
3:30
everything in our lab just shifted
3:32
to Python. Okay. Because
3:34
we soon realized that the Python
3:38
offers... One
3:40
thing is the packages. There are so many
3:42
good packages that we can use in Python,
3:45
and also how easy it is for
3:47
someone to get started with Python and
3:49
become better soon. So that's why we
3:51
shifted to Python, I think, after
3:54
the first year. And we have been using
3:57
Python for maybe four years.
4:00
something like that. Yeah. Yeah, yeah.
4:02
I would imagine that a lot of the tooling,
4:05
I don't know, these last four years have been
4:07
extremely productive as far
4:09
as, you know, the scientific community and
4:11
the adoption of Python there. I'm not
4:13
saying that wasn't before that, but I
4:15
feel like the tooling
4:18
has gotten easier
4:20
and, like you said, there's the ability to
4:22
kind of build on top of other people's
4:24
work as opposed to having to build everything
4:26
from scratch. Has that been your experience? Yes.
4:28
And one funny thing, I'm coming
4:30
back from a seminar today and
4:33
these seminars happen like weekly and
4:35
it's been four weeks in a
4:37
row that everybody's saying, we
4:39
were using MATLAB and suddenly
4:41
we shifted to Python. It seems
4:43
like it's something that's really happening
4:46
at a faster pace recently. Yeah.
4:49
Yeah, that's interesting because I think a
4:51
lot of universities, I mean,
4:53
it depends on the professor's background and the
4:55
tooling that they've been using and maybe the
4:58
funding. MATLAB is
5:00
a thing where it's, you know, they have
5:02
to be purchased, right? Seats for it
5:04
and things like that. So I think
5:06
that might be an attractive thing for
5:08
a university too. Yeah, yeah, exactly. And
5:10
I know that some of the programs
5:12
are thinking about replacing this entirely. So
5:15
one issue is that the courses are
5:17
based on MATLAB and already there's
5:19
a lot of syllabus development, and you
5:21
need to, like, prepare
5:23
new materials. And that's why maybe
5:25
in the education part, we still
5:27
have MATLAB, but
5:30
I think that will also change. Like
5:32
even the freshmen and undergraduate students will
5:34
be trained on Python
5:36
instead of MATLAB. That's not just
5:38
my guess. Yeah. And that's
5:40
the pattern that I'm seeing right now. Yeah. Okay,
5:43
cool. So one
5:45
of the areas that you said
5:48
that you wanted to speak about when you sent
5:50
me the email was you wanted
5:52
to talk about using Python in the lab
5:55
and then how Python's being used
5:57
in the field of bioinformatics. And
5:59
I... immediately had to go
6:01
online and go, all right, what is bioinformatics?
6:03
So I don't know if you're comfortable. Could
6:05
you explain to the audience, like, you know,
6:08
generally what is bioinformatics? Yeah,
6:11
sure. I think it's a very
6:13
general term and, and the way
6:15
I use it is, it's just
6:17
like procedures to process biological data,
6:19
the data that relates to biological
6:21
systems. This could go
6:23
even to like hospitals and it could
6:25
be as broad as that, like data
6:27
from hospitals could be in
6:30
this realm as well. But what
6:32
I am working on is specifically
6:34
data from microbial systems. So these
6:37
cells are living organisms and
6:39
we kind of get information,
6:42
different type of information, the
6:44
way we process that, the science that
6:47
is behind processing this information into useful
6:49
information that could be used for
6:51
next step actions, all of
6:53
that falls into bioinformatics. Okay.
6:57
So it could be, you mentioned
6:59
later, like working with
7:01
sequences of DNA, or it
7:03
might be looking at the
7:05
information that you're doing through
7:07
repeated studies, just sort of
7:10
managing the information about,
7:12
if you will, the biology field. Yeah,
7:16
exactly. Because there's really
7:18
a big
7:21
amount of data being generated that is
7:24
DNA sequences. And it
7:27
goes from how do we store
7:29
this data? How do we use
7:31
databases? How do we use different
7:34
algorithms? How can we process
7:36
this information into useful outputs?
7:38
All of that requires
7:40
computer knowledge, from software engineering
7:43
to algorithms. And it goes even
7:45
beyond these topics. Yeah, yeah.
7:48
That's a lot of data to manage. Yeah. It's
7:50
kind of one of these like areas you always hear of
7:52
the world of big data. And
7:55
you think of like banks and
7:57
financial data and lots of
7:59
documentation. and so forth, but you're dealing
8:01
with just raw, huge
8:04
amounts of data, looking at, like you
8:06
said, the genomes and things like that, which is
8:08
a pretty, pretty intense amount of data. So
8:11
a couple of things that we wanted
8:13
to dig into based upon our back and forth
8:15
through email was to kind of think about like,
8:17
what are the different places where
8:20
Python fits into your role as
8:22
a researcher? And I
8:25
thought one of the cool ones was this idea
8:27
of how Python has
8:29
helped you in the lab itself
8:32
and doing your experiments. You
8:34
sent me a video about
8:36
how you
8:39
would manually dilute a
8:42
bacterial sample, was the example they gave there,
8:44
and how it was
8:47
like, okay, the
8:49
beginning of the day, wipe down this surface.
8:51
Okay, start here. And it was just like
8:53
so much manual stuff. And then like literally
8:56
the next day, you're only like maybe five
8:58
steps into the process. It was
9:00
kind of wild. And so you sent me a link
9:02
to a company, Opentrons,
9:04
this manufacturer who was creating
9:07
a liquid handling robot. Do you want to talk
9:09
a little bit about that and then how Python
9:12
intersects with that as far as helping you
9:14
in the lab? Yeah,
9:16
sure. So as a researcher,
9:18
my work splits into two
9:20
parts. One is the wet lab projects,
9:22
which we actually go to the lab
9:25
and plan and do experiments in the
9:27
lab. And the other
9:29
part is more like computational work,
9:31
where we develop algorithms, do data
9:33
analysis and that. So yeah,
9:36
that robot is really something
9:39
that has changed the way we do experiments.
9:41
And it falls into the wet lab part.
9:44
So we used to do things by
9:46
hand, like hand pipettes. And
9:49
it becomes really hard because in many of
9:51
the experiments that we do, it's just you
9:53
have to go manually. And some of these
9:55
plates that we work with, they have like
9:58
96 very similar wells
10:00
that you need to pipette something from
10:02
one of them and drop it into
10:05
the other one. And it becomes really
10:07
confusing and error prone, I
10:09
guess, maybe. Very
10:11
error prone and also tedious
10:14
because you have to do something
10:16
repetitious. And it's
10:18
just like if that part could
10:20
be automated, it saves a
10:22
lot of time and probably a
10:24
lot of error and in
10:27
the long term, maybe money for the
10:29
lab. And for this reason,
10:31
we use these machines in the lab to
10:33
automate the process. Do you know the
10:36
age of the use of those types of machines
10:38
in the lab? Is it
10:40
recent development? In our
10:42
lab or in general? Yeah, maybe your
10:44
lab or just generally. So we
10:46
started about, I think, five years ago
10:49
in the lab with
10:51
these robots. And I think
10:53
at that point, this company was very new. So
10:56
things were already at the second generation,
10:58
but we were one of
11:00
the first labs on our campus to
11:02
use such robotics. So I think it
11:04
wasn't that common, even if it was
11:06
like the company existed before. It wasn't
11:08
that common. But after some time, right
11:11
now, I know of at least four labs in
11:14
our department that use such robotics. So I
11:16
think like people are moving towards that point.
11:19
Yeah, yeah. Actually, the fun
11:21
fact is we used to perform a
11:23
lot of COVID tests on campus during
11:25
the COVID years. Oh, okay. And because
11:27
the numbers were huge on campus,
11:30
they used these machines to speed up
11:32
the process and make the testing
11:34
part really fast so we could get
11:37
back our results from our
11:39
tests very soon and they could
11:41
quickly isolate people who potentially had the
11:44
virus. They could see that soon
11:46
and essentially avoid spreading it. So
11:48
this is one of the places that
11:50
it was used. Nice. So
11:53
how is Python used in there?
11:55
I was able to go to the site. I don't
11:57
know if the company Opentrons, is that the name?
12:00
Yeah. Okay. Yeah. And so
12:02
I kind of dug into a
12:04
little bit and found the Python protocol
12:06
API and looking at it. So
12:10
where does it fit in?
12:12
What are the controls that it's allowing
12:14
you to do with Python? So
12:16
right now you can go to the
12:18
website and build a protocol just using
12:20
the graphical user interface, without any
12:22
knowledge of Python. But what goes on
12:24
behind the scenes is that Python code
12:27
is generated and given to a
12:29
computer that is inside the robot, which
12:31
is a Raspberry Pi, I think. And
12:33
then this code gets executed and
12:36
it gets transformed into robot
12:38
actions like go up this much and
12:40
pick up that much liquid from
12:43
this coordinate and move it to that coordinate. So
12:45
as I said, right now there
12:48
is a great graphical user interface
12:50
that they provide. But for more
12:52
advanced protocols, we usually have to
12:54
write the protocol ourselves. So it
12:56
would become a Python script that you have
12:58
to write. And, you know, for
13:00
that example that I mentioned,
13:02
let's say
13:04
you have 96-well plates
13:06
that have 12 different columns
13:09
and you have to do repetitive
13:11
things. These concepts fit really nicely
13:13
with something like loops and programming.
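[Editor's sketch] To illustrate why plate layouts map so naturally onto loops, here is a minimal example in plain Python. It does not use the Opentrons API. The function name `plate_wells` and the transfer plan are hypothetical, and the only assumption is the standard 96-well layout of 8 rows (A to H) by 12 columns.

```python
from string import ascii_uppercase


def plate_wells(rows=8, columns=12):
    """Return well names column by column: A1, B1, ..., H1, A2, ..."""
    return [
        f"{ascii_uppercase[r]}{c}"
        for c in range(1, columns + 1)
        for r in range(rows)
    ]


wells = plate_wells()

# Hypothetical plan: move liquid from each well in column 1
# to the well in the same row of column 2.
transfers = [(f"{row}1", f"{row}2") for row in ascii_uppercase[:8]]

for source, destination in transfers:
    print(f"transfer {source} -> {destination}")
```

A real protocol would hand each source and destination pair to the robot's pipetting commands instead of printing it, but the looping structure would look much the same.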
13:15
Yeah. So instead
13:17
of doing it yourself,
13:19
you just write a simple for
13:22
loop to do that and avoid
13:24
possible mistakes. And one other interesting
13:27
programming concept that comes in here
13:29
is exceptions and
13:31
errors. So it can
13:34
run a simulation of the experiment, the
13:36
protocol that you give it and it
13:38
raises exceptions if there are
13:40
any logical issues in your
13:43
code. For example, this well
13:45
has 200 microliters of liquid
13:47
in it,
13:50
but if you're
13:52
pipetting more than that out of it, then that doesn't
13:54
make sense because there's not that much
13:56
liquid in that well. So. Okay. So
13:59
it. It's really good to
14:01
know these upfront because if you are
14:03
doing those by hand in the lab
14:05
notebook, you might not notice some of
14:07
the miscalculations that you might have for
14:10
your experiment, which is really great. Yeah.
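[Editor's sketch] The kind of up-front check described here can be imitated with a toy model. This is not how the Opentrons simulator is implemented. The `Well` class and `dry_run` function are hypothetical names, and the volumes are made up for illustration.

```python
class Well:
    """A toy model of a well that tracks its liquid volume in microliters."""

    def __init__(self, name, volume_ul):
        self.name = name
        self.volume_ul = volume_ul

    def aspirate(self, amount_ul):
        # Refuse to remove more liquid than the well currently holds.
        if amount_ul > self.volume_ul:
            raise ValueError(
                f"{self.name}: cannot aspirate {amount_ul} uL, "
                f"only {self.volume_ul} uL available"
            )
        self.volume_ul -= amount_ul
        return amount_ul


def dry_run(well, steps_ul):
    """Apply each aspiration to the model and collect any errors."""
    errors = []
    for amount in steps_ul:
        try:
            well.aspirate(amount)
        except ValueError as exc:
            errors.append(str(exc))
    return errors


# The well starts with 200 uL; the second step asks for more than is left.
problems = dry_run(Well("A1", 200), [150, 100])
print(problems)
```

Running the plan against a model first means the miscalculation surfaces as an error message rather than as a failed experiment on the bench.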
14:13
Yeah. It's sort of
14:15
pre-checking your work before you run it.
14:17
It's like a running a, almost
14:19
like a test run pass on it. Yeah.
14:23
You might do like a PyTest or something like that on it. Yes.
14:27
Yeah. I mean, like there are some
14:29
stock tools that are built into
14:31
the graphical user interface. Are
14:34
you able to take what one
14:36
of those would generate as like a
14:38
script and just modify the existing script
14:40
and add the additional kind of controls
14:42
you want or the exceptions
14:44
you're mentioning? Yeah. Yes,
14:46
exactly. Finally, I think it just
14:48
gives you a .py file. Okay.
14:51
From the interface. If you don't want to change
14:53
it, just import it on the desktop
14:56
computer, which is
14:58
connected to the robot. Finally, that Python code
15:00
gets interpreted. If you go the Python
15:02
way, you can make any change that you
15:04
want. Again, before running
15:06
anything, even if you change that code,
15:09
before running it, it will run a test
15:12
to make sure that nothing bad is going on
15:14
with the protocol.
15:16
Nice. Yeah. Yeah.
15:19
It's kind of like its own simulation. Yeah. Yeah.
15:22
When you're developing the code
15:24
for that, you kind of mentioned the word script a couple
15:26
of times. What does your personal
15:28
development environment look like?
15:31
Are you using like a
15:33
laptop and working with a particular code editor?
15:35
What are the types of tools that you
15:37
use in the lab for Python coding there?
15:39
So you can both use something
15:42
like any text editor, but
15:44
you also can use Jupyter, which is something
15:47
that I haven't used in my editor. Jupyter
15:49
always uses and for other projects, I haven't
15:51
used Jupyter, but the good thing about Jupyter
15:53
is that you can run one
15:56
part of your experiment and then stop and then
15:58
run the next instead of like
16:00
running the whole experiment at once. That's
16:02
one nice thing that Jupyter Notebooks gives
16:05
you. I always personally work in VS
16:07
Code and that's
16:09
my preferred editor that I go
16:11
to. When
16:14
I use Jupyter Notebooks, I always
16:16
use the Jupyter extension inside VS Code.
16:18
Yeah, yeah. How's that flexibility? Yeah,
16:21
it's a really nice thing to have.
16:24
Are there other techniques that are
16:26
involved with using these liquid handling
16:29
robots? So techniques? What like what?
16:31
I'm trying to think of like
16:33
are there other, we talked
16:35
about it can be used for these
16:37
dilution experiments and things like that. Are
16:40
there other types of experiments that they
16:42
are well suited for,
16:45
for what you're doing currently? Yes,
16:47
well, I think people are doing really interesting
16:49
types of experiments with it. For example, there
16:52
are types of experiments where we want
16:54
to pick a colony of microbes and
16:57
inoculate something like another tube with that
16:59
specific colony. So really you need to
17:01
be very precise. For example, you have
17:03
to pick that colony very
17:06
precisely if you're doing it by hand.
17:08
Okay. But I've seen people using
17:10
a camera that sits on
17:12
top of this robot, so it can
17:15
actually use some image analysis
17:17
to go to that place and take
17:19
that colony and then drop it into a
17:22
destination well. It was just really fascinating
17:25
because it's really accurate. I
17:27
see a lot of applications for this. Why
17:31
I'm laughing is I did watch
17:33
that video on serial dilution and
17:35
I thought to myself, not
17:37
only the hand-eye coordination that you were mentioning
17:39
before of like all these different experiments going
17:42
and pipetting things and so forth, but there's
17:44
the second phase where you're getting
17:46
the Petri dishes out and
17:48
not only labeling everything, but
17:50
then having to spread stuff
17:52
in these three little areas
17:55
around the thing. And I'm like, oh my
17:57
God, that would be so like, it's not only tedious,
17:59
but difficult to do because you're
18:01
supposed to only take a certain amount
18:03
to this other area. So I could
18:05
see how maybe computer vision and
18:08
this robot could maybe do that kind
18:10
of thing where it's laying it out inside of a Petri
18:12
dish. Is that something it does too?
18:14
Because we talked about the dilution, but I don't know
18:16
if it does the plating also. Yeah. So I think
18:19
these things don't come by default. It's
18:21
just the creativity of the users. Okay.
18:24
Yeah. Because there's a Python going
18:26
on in the backend and you
18:29
have access to all these really
18:31
cool image analysis libraries. And finally,
18:33
everything gets converted to protocol. So
18:36
I think that's why so
18:39
many people can be creative and create
18:41
really cool things with the robot. But finally,
18:43
what happens inside that code is that
18:46
you tell the robot to go to
18:48
a specific destination. And
18:50
that could be hacked
18:53
to go to a destination and do something
18:55
that we want out of that maybe it
18:57
wasn't designed to do that, but it could
19:00
be cool in a lab setting. Sure. So
19:03
kind of moving beyond that and kind
19:05
of switching gears into,
19:07
okay, now you've run
19:09
these experiments and you've got your
19:11
results. Now you're looking
19:14
at doing these techniques for these
19:16
bioinformatic techniques and you're like, okay, I want to
19:19
sequence data. There's a couple of things
19:21
that you were talking about, a couple of different
19:23
experiments that you were running, like you were doing
19:25
some stuff, kind of looking at the role of
19:27
gut microbes and human health. You want to talk
19:29
a little bit about, I don't know what to
19:31
call it, a project or to call it a
19:33
study. Like I don't have the terminology in my
19:36
head. Sorry. No worries.
19:38
Yeah, sure. So actually
19:40
we can start with the robot, how
19:42
that happens. So we have these tiny
19:45
microbes in our gut that help us
19:47
stay healthy. They extract a
19:49
lot of nutrition from the food
19:51
that we eat, and finally, they
19:53
circulate back to our bloodstream. So
19:56
when something happens to this community of microbes,
19:58
and this community of microbes
20:00
is a very complex
20:02
combination of different microbes, microbes
20:04
from different taxonomic branches. What
20:07
happens is that when you take a sample
20:09
from the gut environment, you're
20:11
not left with one single
20:13
organism. You have thousands of
20:15
different species. It becomes
20:17
a really hard problem how we
20:19
can understand what they're doing. The
20:21
goal of that dilution-to-extinction
20:24
experiment is to break down
20:26
that community by dilution. Every
20:28
time that you dilute, you leave something
20:30
out from the previous community and
20:32
introduce a more simplified sub-community
20:35
into the new one, and then
20:37
again dilute until you reach
20:39
a point where you have two
20:41
or three different microorganisms that you can
20:43
actually work with. You can
20:45
understand them better. That's the whole
20:48
point of doing serial dilution experiments.
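[Editor's sketch] The arithmetic behind dilution to extinction can be written out in a few lines. The starting density, the tenfold dilution factor, and the number of steps below are assumptions chosen for illustration, not values from the guest's experiments.

```python
def serial_dilution(start_cells_per_ml, dilution_factor, steps):
    """Expected cell density after each dilution step."""
    densities = []
    density = start_cells_per_ml
    for _ in range(steps):
        density = density / dilution_factor
        densities.append(density)
    return densities


# Assumed values: 100 million cells per mL, diluted tenfold nine times.
densities = serial_dilution(1e8, 10, 9)
for step, density in enumerate(densities, start=1):
    print(f"step {step}: ~{density:g} cells/mL")
```

Once the expected density drops to around one cell per volume transferred, each new tube is likely to receive only a handful of organisms, which is the simplified community the experiment is after.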
20:51
From that point, we can, for
20:53
example, compare all the communities
20:55
that have two or three microbes and
20:57
see, for example, this one produces more
21:00
of that compound. That
21:02
compound is good for health. How we
21:04
can improve the whole community is by
21:06
making the environment more suitable for the
21:08
ones that make that specific compound that
21:10
we are after. Finally,
21:13
for example, there are many diseases
21:15
that are linked to specific type
21:18
of dysbiosis or
21:20
imbalance in the gut microbiome
21:23
community. By comparing
21:25
samples from these patients to
21:27
those of healthy individuals, we can see which
21:29
microbes are different, or for the ones that
21:31
are different, what they are doing differently,
21:34
and use this for therapeutic applications. Yes,
21:38
we isolate all
21:40
these simplified communities, and then
21:42
what we do is usually
21:44
we either measure the chemicals
21:46
that are produced by these
21:48
organisms or sequence the DNA.
21:51
Through that DNA sequencing and
21:53
the bioinformatics technique, we can
21:55
say, this organism
21:57
probably does this, and that's why we can
22:00
maybe use it as a probiotic,
22:02
or if it has a bad
22:04
effect, just remove it, or do
22:06
something so that it cannot grow so fast
22:09
and cause those adverse health effects.
22:11
Okay. Yeah. So then, you
22:14
talked about the dilution getting down
22:16
to, like, maybe only seeing two
22:18
or three things in a sample
22:20
as opposed to, like, the whole
22:22
wide gamut of everything that's there.
22:24
And then you talked about measuring
22:26
compounds that are there,
22:28
whether they're proteins or what have you. What are
22:30
the tools you use there? I'm
22:32
sorry, I'm kind of going really basic
22:35
here, but what techniques do you use to
22:37
look at that? So
22:39
there are two techniques that we use, but
22:41
there are other ones. Gas
22:43
chromatography and liquid chromatography are two
22:46
common techniques that we
22:48
use, and when you take
22:50
blood samples, they also use similar instruments
22:53
to measure different compounds. What they
22:55
finally do is give you
22:57
an approximate concentration of each
22:59
of those components that are identified in
23:01
the samples. So those are
23:03
the two important ones. NMR is
23:05
another one that is common, but
23:07
we don't use it. These two are
23:09
really common because
23:12
the instrument in itself is not
23:14
cheap, but once you have the instrument,
23:16
it can be cheap to run a
23:18
sample and see
23:20
what components are in your
23:22
samples. That's another thing that we use
23:24
on a daily basis. To kind of
23:26
break it down to
23:28
how that information is
23:30
pulled out and turned
23:33
into data: are you inserting
23:36
that small sample into that machine, and
23:39
it's doing the measurement and then it's outputting,
23:41
like, a data file for you? Yeah,
23:43
exactly. But
23:45
the only thing that you
23:48
need to do before is to
23:50
do some sort of preparation. For example,
23:52
you have to filter the microbes
23:54
out, because those microbes
23:56
are larger particles and they could interfere
23:58
with the machine. First, you do something
24:01
like a centrifuge to keep the
24:03
bacteria and larger particles out. Then
24:05
when you have a very, maybe,
24:08
a well-behaved liquid, I don't know
24:10
for lack of a better word,
24:14
but when you have that, you
24:16
can run it. When you give it to
24:18
the machine, it will output some sort of
24:20
diagrams. These diagrams, based
24:23
on the peak intensity and where
24:25
that diagram happens, because it's like
24:27
a spectrum, it's like those earthquake
24:29
type of graphs. I don't know if you've seen them. They're
24:32
very noisy. We
24:35
have the same thing here, and what matters is
24:37
where that peak is happening and how big
24:39
that peak is. The chromatogram
24:41
is what we call
24:43
that graph. In
24:45
the chromatogram, where that peak is happening
24:48
tells you what component it is, and how
24:50
intense that peak is tells you how
24:52
much of that component is there. Okay.
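[Editor's sketch] The idea that peak position identifies a component and peak height reflects its amount can be shown with a simple local-maximum search. The readings below are made up, `find_peaks` is a hypothetical helper written for this note, and real chromatography software typically integrates peak areas and relies on calibration standards rather than raw heights.

```python
def find_peaks(signal, threshold):
    """Return (index, height) for local maxima above a noise threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_local_max and signal[i] > threshold:
            peaks.append((i, signal[i]))
    return peaks


# Made-up detector readings sampled at evenly spaced time points.
signal = [1, 2, 1, 3, 9, 30, 12, 2, 1, 2, 14, 55, 20, 3, 1]
peaks = find_peaks(signal, threshold=10)
for position, height in peaks:
    print(f"peak at sample {position}, intensity {height}")
```

The small bump early in the signal is ignored as noise because it stays below the threshold, while the two larger peaks are reported with their positions and intensities.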
24:55
Just to do a quick analogy on that
24:57
specific thing, as a person
25:00
who's into photography, there's a setting that
25:02
you can use called a histogram that
25:04
looks at the overall light of the
25:06
image. It shows peaks
25:08
and valleys showing like, okay,
25:10
this part of the image had this much
25:12
light and was clipping or
25:14
was too bright and it's all
25:16
white or this was too dark
25:18
or whatever. I'm guessing that chromatogram
25:20
is a similar thing and you're
25:22
able to see the different levels
25:24
of components there. Yes, exactly. Very
25:26
similar, except that in pictures, you
25:28
only have intensity and light is
25:30
always the same, but here, different
25:32
components reach that receptor that is
25:34
at the end of the machine
25:37
and each peak is for different components.
25:40
That's the only difference. Yeah. Yeah,
25:42
yeah. Okay. Those
25:44
graphs that you get out of
25:46
there then can be turned into the raw data
25:48
that you're going to use for the next step.
25:51
Yes. And then the machine comes
25:53
with software that turns
25:55
those into tabular data. For example, you can
25:57
say this is the concentration of component A,
26:00
this is the concentration of component
26:02
B, and so on. And it gives
26:04
you like a spreadsheet that has that
26:07
information in it. And then
26:09
that's something where we can take
26:11
that information and use it in
26:13
statistical analysis to compare the
26:15
samples. This
26:20
week, I want to shine a
26:22
spotlight on another Real Python video course. It's
26:25
titled Building Python Project
26:27
Documentation With MkDocs. The
26:29
course is based on a Real Python step-by-step
26:31
project by frequent guest Martin
26:34
Breuss. And in the video
26:36
course, instructor Darren Jones shows
26:38
you how to work with
26:40
MkDocs to produce static pages
26:42
from Markdown, pull in code
26:44
documentation from docstrings using
26:46
mkdocstrings, follow best practices
26:48
for project documentation,
26:50
use the Material for MkDocs theme
26:52
to make your documentation look great,
26:54
and host your documentation
26:56
on GitHub Pages. I think using
26:58
tools like this can make what
27:00
seems like a daunting task so
27:02
much easier. And I think it's
27:04
a worthy investment of your time
27:06
to learn how to automate production
27:08
of your project's documentation. Your users
27:10
will truly appreciate it. Real Python video
27:12
courses are broken into easily consumable
27:15
sections and where needed include
27:17
code examples for the technique shown. All
27:19
lessons have a transcript including closed captions. Check
27:21
out the video course. You can find a
27:23
link in the show notes or
27:25
you can find it using the enhanced
27:28
search tool on realpython.com. And
27:33
so is that machine connected in a way
27:35
that the data, I know I'm being really
27:37
microscopic in my analysis of how we're talking
27:39
about this stuff, but like is it coming
27:41
out as a CSV file or is it,
27:43
how are you getting that tabular data? Usually
27:46
I think it's an Excel format. Like
27:50
usually it's like that. And you can just save
27:53
that Excel file finally into a CSV.
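[Editor's sketch] Once the concentrations are saved as CSV, a first comparison between groups needs only the standard library. The column names, sample labels, and numbers below are invented for illustration, and `group_means` is a hypothetical helper.

```python
import csv
import io
from statistics import mean

# Made-up export: one row per sample, concentrations in arbitrary units.
csv_text = """sample,group,acetate,butyrate
s1,healthy,12.0,8.0
s2,healthy,14.0,10.0
s3,patient,9.0,3.0
s4,patient,11.0,5.0
"""


def group_means(text, column):
    """Average one concentration column within each group."""
    values = {}
    for row in csv.DictReader(io.StringIO(text)):
        values.setdefault(row["group"], []).append(float(row[column]))
    return {group: mean(vals) for group, vals in values.items()}


butyrate = group_means(csv_text, "butyrate")
print(butyrate)
```

With a real file, `io.StringIO(text)` would be replaced by an opened file handle. For larger studies, a dedicated data analysis library would be the more usual choice.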
27:55
One thing is, for researchers,
27:58
Excel may be
28:00
a more familiar term than
28:02
CSV. Sometimes that's the
28:04
case. And Excel is
28:06
a pretty common tool to use and
28:08
sometimes you don't even need to get
28:10
out of Excel to do everything that
28:12
is related to your research. But when
28:14
you scale things up that's where Python
28:17
becomes maybe more efficient in
28:20
doing the data analysis. Yeah,
28:22
I recently had some of the people
28:24
working on Python in Excel on
28:26
the show, and it was interesting to
28:28
talk to them. And it's interesting to
28:30
hear that because that sounds like that might be
28:32
yet another way to again maybe
28:35
avoid having to hop through multiple
28:37
layers to get somewhere. Yes. At
28:39
least to do the initial analytic
28:42
research and kind of looking at what you have
28:44
to make sure like this is worthwhile we're gonna
28:47
take this to the next step. Yeah,
28:49
that sounds really exciting. I
28:52
haven't had a chance to try that out yet
28:54
but looks very interesting. Yeah, it's
28:56
still kind of in a beta
28:58
phase where you have to be part of their
29:01
sort of developer 365 program
29:04
and have to sign up for things and so
29:06
forth and I think it's only on Windows. I'm
29:09
intrigued by it. It's very interesting to see what
29:11
they're gonna do with it and it's there's
29:14
a lot of stuff in it. They preload a lot of
29:16
data science stuff ready to go in it. Do
29:20
you want to talk a little bit about the
29:22
DNA sequencing technologies or did we cover most of
29:24
the things you wanted to cover on this first
29:27
section? Yes, and I think this
29:29
is a good point because we have
29:31
our samples now and we have
29:33
sequenced them. But why
29:36
do we do sequencing? It's because
29:38
the idea is we can
29:40
infer all the biological information from DNA,
29:42
because we think that DNA is the
29:45
blueprint of living organisms. Every
29:47
kind of information that is required
29:49
for biological functions is encoded in DNA.
29:51
So the assumption is if we
29:53
can sequence DNA and understand that
29:55
sequence, we can say a lot
29:57
of things about the biology
29:59
that is going on in the
30:01
samples that we got that information
30:03
from. So
30:05
what happens is that, using
30:08
different techniques, we remove different components
30:10
that we don't need. We
30:12
want to purify the DNA that is
30:14
inside a sample in a process called
30:16
DNA extraction. We don't want to have
30:19
components that we don't need for
30:21
the DNA analysis, because they could interfere
30:23
with the process. So we just try,
30:25
using chemical treatment, to take
30:28
those components out. When
30:30
we do DNA extraction, finally, we
30:33
have our DNA and this DNA
30:35
can be sequenced in different facilities.
30:38
And finally, what you get out
30:40
of these sequencing machines is just
30:42
a bunch of A,
30:44
T, C, G letters
30:46
that are really
30:48
large, like surprisingly large. Yeah.
30:52
Well, not surprisingly, because if
30:54
we assume that everything is happening, using
30:56
the information inside this DNA, it won't
30:58
be as surprising. But
31:00
those files could be large, even for
31:02
very small cells
31:04
that have simpler DNA than other
31:07
ones. Can you give
31:09
a size that would
31:11
be comparable on computer terms? Like
31:13
is that... Yeah, sure. Gigabytes
31:16
or something larger? So
31:19
it depends. It depends on, in
31:21
the microbial world, we are usually
31:24
bound to 10 megabytes on the
31:26
high end. And on the low end,
31:28
we are... Usually it's
31:30
half a megabyte, I would say. The
31:34
entire DNA for one cell would
31:37
be around that. But for human
31:39
cells, it's in the gigabyte. And
31:42
everything changes in between. For example,
31:44
you have some sort of maybe
31:46
more complex microbe like yeast will
31:48
have bigger and more complex
31:51
DNA that you could use in
31:53
the bakeries or in the breweries
31:55
for making beer. Those are slightly
31:57
more complex than bacterial cells. Okay.
32:00
They should
32:02
be not still in the gigabytes,
32:04
but definitely larger than a bacterial
32:06
cell. Yeah, one of the techniques
32:09
you talked about is this idea
32:11
of, I think it
32:13
was, was it called shattering to
32:16
break apart the DNA to focus
32:18
on like the very specific
32:20
sequence. Because I'm guessing in
32:23
the types of things you study, like
32:25
if you're looking at bacteria, there's
32:28
probably a huge, well, a
32:30
large amount of repetition that
32:32
all of them have this sort of structural stuff
32:34
and then you want to focus on certain areas.
32:36
Am I getting that part right? Well,
32:39
the reason for shattering
32:41
DNA is not to focus
32:44
on a specific part. It's just because
32:46
the sequencing facilities cannot sequence DNA that
32:48
are longer than a specific length. They
32:51
have this limitation. Okay. They
32:53
rely on optical signals and if
32:55
they continue to longer pieces
32:57
of DNA, the error finally
32:59
becomes so high that the data is basically
33:01
not useful. So
33:04
what we have to do is
33:06
before that using mechanical forces, we
33:09
have to break down the DNA,
33:11
big DNA molecules into smaller pieces,
33:14
like 300, we call it
33:16
base pairs, so 300 A, T, C, G letters
33:19
or something like that. That's the usual. And
33:23
then we can sequence those smaller pieces. But
33:25
now that we have solved one
33:28
problem, we have created many more
33:30
because the problem becomes then how
33:32
do you know how to
33:35
fit these pieces together? It becomes
33:37
a big puzzle. And the
33:39
way it happens is based on
33:42
the overlap that these sequences
33:44
might have, there are different
33:46
algorithms called assembly algorithms that
33:48
make a sort of
33:50
graph that connects these sequences
33:52
based on the overlap between
33:54
these sequences. And then finally
33:56
they find the longest path that
33:58
you can find between these pieces
34:00
in the graph that will kind of resolve
34:02
that piece of DNA in that region. And
34:04
it's a really open-ended problem
34:06
and a very complex algorithm
34:09
that these tools achieve.
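The overlap idea described here can be illustrated with a toy sketch. It assumes error-free reads and uses a simple greedy merge on suffix/prefix overlaps, which is a simplification and not the longest-path search described above or what production assemblers do (MEGAHIT, for instance, builds a de Bruijn graph). The function names `overlap` and `greedy_assemble` and the example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        # Score every ordered pair: this is the "graph" of overlaps.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for n, c in enumerate(contigs) if n not in (best_i, best_j)
        ] + [merged]


# Three overlapping reads taken from the made-up sequence "ATGCGTACGTTAGC".
reads = ["ACGTTAGC", "ATGCGTAC", "CGTACGTT"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

With real data, sequencing errors and repeated regions make the overlaps ambiguous, which is part of why this is handled by dedicated tools written in faster languages.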
34:11
And these, because of the performance
34:14
considerations, they are usually not implemented
34:17
in Python, usually in other
34:19
languages like C, C++. Yeah,
34:21
yeah, that makes sense. But
34:24
they usually have an interface in Python. So
34:26
finally, for example, the CLI is in Python.
34:30
That connects to different modules that
34:32
are written in other languages. Yeah,
34:34
you provided a bunch of links to these
34:37
libraries. Is
34:40
the one, I think it's called
34:42
MEGAHIT, is that kind of in this realm
34:44
that we're talking about? Okay. Yes, it's a tool
34:47
that I use and it's designed to get
34:50
those, what we call short reads,
34:52
and it's an assembler. So it
34:54
assembles those short reads into longer
34:56
pieces. And the goal here is
34:58
to recreate those pieces
35:00
that we shattered. The
35:02
reason is if we have small
35:04
pieces, we don't have that much
35:07
statistical significance to say things for
35:09
sure. For example, if it's too
35:11
small, it could be happening by
35:13
just chance, by random. However,
35:16
if we make it really long,
35:18
then if there is a very
35:20
similar match to this long piece in a
35:22
database that we have, we can say things
35:24
more certainly. Okay. Yeah, this is
35:27
now significant or what have you. Yeah,
35:29
yes. Yeah. Okay. You shared
35:31
that project with me, which I think it's
35:33
kind of interesting because it says
35:35
here like a copyright of 2015, the University
35:37
of Hong Kong, their kind of initial license
35:40
of it up here on GitHub. I
35:43
find that fascinating. Is that common
35:45
in universities that across
35:47
these different communities that they are sharing
35:49
their code? Is that like a pretty
35:52
common thing that you've found within
35:54
this field? Yes, especially in bioinformatics. One
35:56
thing that I'm really grateful for is
35:58
that being open source is kind
36:01
of the theme. When you publish
36:03
a paper or wherever you
36:05
mention your package name
36:08
or wherever you want to present on,
36:10
I think usually you provide that
36:12
in an open source like as
36:14
a GitHub repository and places
36:16
like that that everybody can use
36:19
and it's really a common
36:21
theme, as I said, in this
36:23
field. Good. So yes, I think
36:25
yes. Yeah, yeah, that's great. So
36:28
one of the other projects you mentioned is looking
36:31
at the prediction of anaerobic
36:33
digestion metabolism. Yes. And
36:36
that one is, I think it's
36:38
called AD Toolbox. Do you want to talk
36:40
about that project? Yes. So the other packages
36:42
that you mentioned, those are by other labs,
36:44
but we are starting to write
36:47
our own packages and publish them.
36:49
So sometimes these packages stand like
36:52
they call different tools that exist
36:54
in other languages or from
36:56
other projects. So AD Toolbox is
36:58
a project that we started for
37:00
modeling the anaerobic digestion system. Anaerobic
37:02
digestion system is just to
37:05
explain that quickly. It's a system
37:07
that has been used traditionally for
37:10
making use of waste, especially
37:12
organic waste, something like foods
37:15
from cafeteria, restaurants. These all
37:17
go to waste and if we don't
37:19
do something about them, they get converted
37:21
to methane, which goes to atmosphere. We
37:23
lose a lot of energy and also
37:26
it's a greenhouse gas.
37:28
So it has a really high
37:30
global warming potential. So the goal
37:32
of this project is to somehow
37:35
manage that anaerobic digestion process
37:37
to break down these waste
37:39
components into useful products. And this
37:41
is happening by microbes. So the type
37:44
of microbes that exist in this environment
37:46
matter. For example, if you put more
37:48
of those microbes that are more useful
37:51
for the process to produce the
37:53
product that we are after, there's
37:55
a good chance that we improve the efficiency
37:57
of this process. So since this is a
37:59
microbial process we need to take
38:01
into account the information that is coming
38:04
from the DNA of those microbes. And
38:06
this is the goal of this tool.
38:08
For example, it takes the DNA information,
38:10
processes them, and finally it feeds
38:12
them through a mathematical model and in
38:14
future a machine learning model to predict
38:17
the behavior of this anaerobic digestion system,
38:19
what you can do to improve them
38:21
and applications like this. So
38:23
this is a project that other members
38:26
of your doctoral program are
38:28
working on together? This
38:30
is mainly led by me
38:33
and we have some undergraduate
38:35
students that are trained
38:37
on Python and finally they contribute
38:39
to this project. So
38:41
yes. Nice. It's like
38:43
a code fiber. Yeah, yeah.
38:46
What are some of the other libraries that
38:48
you're able to leverage to do the work
38:50
inside of this package? So most of the
38:52
things that I use to do, so for
38:54
example, it has different modules. So at one
38:56
point where we use the
38:59
DNA sequences, we use packages
39:01
outside of Python. There's this
39:03
really cool sequence alignment tool
39:05
that matches a sequence of
39:07
DNA into a known database.
39:10
It's called MMseqs2. Okay. This
39:12
is a very, very cool tool
39:15
that is written in, I think, in
39:17
C++ and it's really fast for that
39:19
kind of... So this code actually calls
39:21
that MMseqs2 and
39:23
then collects information. We use pandas
39:26
for any kind of data manipulation,
39:28
for example, getting the alignment results
39:30
and using that information to
39:33
draw any sort of
39:35
conclusions. And then finally we connect
39:37
it to a Dash app, like
39:39
that in the Plotly world. Yeah,
39:41
yeah, sure. And it
39:43
finally creates a Dash application that shows
39:46
the simulation results and this is an
39:48
interactive web page. So for
39:50
example, different parameters could be changed. What
39:52
happens if we increase the temperature? What
39:54
is the effect of increasing temperature
39:57
on methane production? So when you
39:59
change that... parameter of temperature
40:01
it will change the results and
40:03
will show the results like
40:05
it updates the page basically. Yeah
40:08
I found that should be a very
40:10
good complement to this project. Yeah I'm
40:13
a big fan of visualizations and that's
40:15
a great project because it includes
40:17
so much of the underlying work
40:19
that you can kind of again host it and
40:22
get it posted there. I'm wondering a
40:24
little bit about the data that is involved
40:26
there. Are you using, you know, what
40:28
kind of database? Where is
40:30
all this data stored that you're accessing
40:32
and running through the system. Yes
40:35
so as I said it
40:37
has different modules for each module it could
40:39
be different. Okay. Most
40:41
of the databases that we
40:43
talk about here in this
40:46
project they're usually small we
40:48
intentionally kept them small. Okay.
40:50
To be fast because since we are
40:52
only focusing on anaerobic digestion we may
40:54
not need all
40:57
the information from different ecosystems and
40:59
because of that we are just
41:01
using a flat file which is
41:03
it's a common format called FASTA
41:05
in bioinformatics which is a fancy
41:07
text file, again in the key-value
41:09
format. So you
41:12
have a key and then you have a
41:14
value. So your keys are just a line that
41:16
starts with a specific character, like a greater-than
41:19
sign and then your sequence
41:21
starts right below that line again so
41:23
the key will be that line that
41:25
starts with the character and whatever comes
41:27
underneath will be the information and that's how
41:29
we store these data. Okay
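The key/value layout described here can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard FASTA convention of header lines beginning with `>`; the function name `parse_fasta` and the two example records are hypothetical and are not taken from the AD Toolbox code.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without the '>') to its joined sequence lines."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # the key is the header line
            records[header] = ""
        elif header is not None:
            records[header] += line  # the value is everything underneath
    return records


# Made-up records for illustration only.
example = """>seq1 hypothetical enzyme A
ATGCGTACGT
TAGC
>seq2 hypothetical enzyme B
GGCATTA
"""

records = parse_fasta(example)
print(records["seq1 hypothetical enzyme A"])  # ATGCGTACGTTAGC
```

A sequence may wrap over several lines, so the parser keeps appending until it reaches the next header.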
41:32
you
41:34
have another project that you
41:36
say that's still under heavy development and
41:38
will become public soon is
41:40
that the AD toolbox or is that the
41:42
next one the SPAM-DFBA? No no
41:45
the AD toolbox is still something that
41:47
we are working on especially these days
41:49
and it will be out very
41:51
soon I think in a matter of
41:53
weeks. Okay. But my next
41:56
project is completely published and it's on
41:58
github and there's a Try
42:00
it, Arbus, you have a good documentation
42:03
website for it. Cool. Available,
42:05
yes. So the next one. What's that
42:07
project do? So that
42:10
project is more like an
42:12
AI project. Okay. So
42:15
one thing is when you have these
42:17
pieces of DNA, you have some information,
42:19
but the problem is you still cannot
42:21
predict the behavior of cells because even
42:24
given that information, there are so
42:27
many different ways that microbes can
42:29
behave given their DNA. For example,
42:31
if you consider a complex piping system,
42:33
which valves they should turn, that
42:36
information is not in
42:38
the DNA. So well, I
42:40
mean, at least it's not easy to
42:42
extract those information. So how
42:44
microbes regulate their behavior is
42:46
something that is a really
42:49
open problem in this field.
42:51
Okay. This tool is
42:53
something that tries a technique called
42:56
reinforcement learning where all
42:58
different trajectories for behavior of
43:00
a microbe is tried. And
43:03
then based on trial and error, these
43:05
microbes try to improve their behaviors. And
43:07
the reason that you think that this will
43:10
work is because, well, microbes evolve really fast
43:12
in the last, like you can see the
43:14
microbes evolve in a few generations.
43:16
Okay. So what happens is
43:18
the microbes just evolve, they adapt their strategies
43:20
and maybe something that we are
43:22
all familiar with is different, for example, strains
43:24
of the COVID virus. You
43:27
see that at some point, some strain
43:29
comes out that acts a little different,
43:31
maybe more contagious. It's just because they're
43:33
rapidly changing and that change gets reflected.
43:35
I mean, microbes are more complex and
43:37
so the
43:39
problem becomes more complicated. But this
43:41
tool is basically some artificial intelligence
43:43
technique to find how the behavior
43:45
of these microbes will converge
43:47
to a specific point that is determined
43:50
by the evolution of that organism.
43:53
And here we use a lot
43:55
of neural network packages like PyTorch
43:57
and also the Ray library for
43:59
parallelization. Okay. Drilling
44:02
into like things like the hardware and how you're
44:04
running these things. I mentioned
44:07
time to time on the show that I tried
44:09
to run projects that I want to feature on
44:12
the show to kind of showcase them. Say, oh,
44:14
this seems like a really cool project and it
44:16
might use something like PyTorch or use some other
44:18
big library like that. And I have a hard
44:21
time getting them set up very often. And so
44:24
I feel like it works best sometimes to
44:26
like have it as a container or some
44:28
other kind of environment. So I'm wondering like
44:30
how are you running those types of experiments
44:32
and what type of machine is it running
44:35
on? So for this one,
44:38
for some of the test cases, it depends
44:40
on the test case, how complex and hard
44:42
it is to make those
44:44
simulations. If it's for the test cases
44:46
that I have on the documentation website,
44:49
you can run it on a simple machine. I
44:51
have a Mac M1 machine, which is great.
44:54
And it suffices for those kind
44:56
of applications. But for bigger projects,
44:58
we move to a supercomputer that
45:00
we have in Colorado that is
45:02
shared between the universities here. It's
45:04
called Alpine. Okay. It's shared between
45:06
CSU and Colorado University at Boulder.
45:09
And I think one more, which
45:11
I don't recall right now. These
45:13
are sometimes you can really get
45:15
big resources from this supercomputer. And
45:17
then, for example, for our assembly,
45:20
what we do is I usually
45:22
request for two terabytes of random
45:24
access memory, which is really high
45:27
and could not be done. Yeah,
45:29
definitely not on my machine.
45:31
So and then what
45:34
you do there is it's a Linux
45:36
system. You create your virtual environment, install
45:38
the packages that you want. And then
45:40
the code for me, I use this
45:42
approach where VS Code can connect to a
45:44
remote server. And I just type in
45:46
the code that I want and you
45:49
can debug that in a remote
45:51
server. And finally, when it's ready, I just
45:53
run the project on the cluster and
45:55
get the results back and do
45:58
the simpler data analysis on my
46:00
own personal computer. Okay,
46:02
yeah, I was wondering about that. Having
46:05
these resources that are, again, university
46:07
scale, which is kind of fun.
46:09
So that one I could link
46:11
for all these different projects.
46:13
You mentioned a couple times about
46:15
the documentation of that particular project.
46:18
What are tools that you're using to help
46:20
you document that? Ah yes,
46:22
I use MkDocs
46:25
for building the documentation website
46:27
and site, to have
46:30
the, for example, doctest and
46:32
also, yeah, all the docstrings
46:34
for every function and class in
46:36
the script to be
46:38
as clear as possible. And yeah,
46:41
I love MkDocs. I think
46:43
it makes the whole documentation
46:45
a lot easier and also fun.
46:47
So that's really good. Yeah,
46:50
yeah, we have a couple courses
46:52
that touch on that. It's
46:54
a nice way to kind of get documentation
46:56
going and it definitely assists a
46:58
lot of that process. Again, you have
47:01
to become like a web developer. record
47:03
activities and site which is nice. Yeah,
47:06
So I have these questions I like to ask everybody who
47:08
comes on the show, and the first one is, what's
47:10
something that you're excited about that's happening in the
47:12
world of Python right now? So
47:15
for me, there's this package called
47:17
scikit-bio. I look at
47:19
it, it's coming up, and I
47:21
think it's not version one yet. I
47:24
think it could help
47:26
for all the Python users
47:29
in our field because most
47:31
of these tools exist in
47:33
R. And it's really
47:35
good to have that toolbox
47:37
in Python as well, because
47:39
we have everything in Python. So
47:41
it's just that sometimes, for these statistical
47:43
tests for example, we need to
47:45
go to R, and it could
47:47
be like the time that we
47:49
need to spend to learn a
47:51
new programming language could be something,
47:55
maybe, more efficient. And I think
47:57
these packages help a lot. And
47:59
from what I see, a lot of the things
48:01
that have been missing have been added to
48:04
the scikit-bio package and I'm really excited
48:06
about it being released. Yeah, cool.
48:09
What's something that you want to learn next? Again, this
48:11
doesn't have to be programming, but
48:13
is there something that you're interested in learning? Yeah.
48:16
So for me, maybe
48:18
something that I don't have that
48:20
much experience with, and I like
48:22
to learn more about it
48:24
is how to work on different
48:27
parts of the project in a team, because
48:29
mostly what I have been doing as a
48:31
researcher has been working alone on my script.
48:34
And of course we use GitHub, but,
48:36
but it's different when it comes to
48:38
multiple people collaborating on the same project
48:41
and as an open source project. And
48:43
I think this is something that I really
48:46
want to get into, to contribute
48:48
to open source projects, at least
48:50
ones that are in our field.
48:52
And I think I can maybe
48:54
positively contribute there. Yeah. Yeah.
48:57
I had a couple of shows recently about sort
48:59
of inroads and ways to kind of get involved.
49:02
And I wonder if certain conferences might
49:04
be a chance to be able to sit down
49:06
with some other people and look
49:08
at collaborating on it. That's great. Yeah.
49:10
You already got kind of a good resume going with,
49:12
with, uh, what you're, what you're working on. So, so
49:16
how can people follow the work that you do online?
49:19
Anything related to the code,
49:21
we usually publish the code
49:23
on GitHub. So my GitHub
49:25
repository, I usually
49:27
post them on my GitHub as well. So
49:29
we have this GitHub page or
49:32
account for our lab that we use.
49:35
And then all the projects are on that.
49:37
But when it is published, I also post
49:39
it on my own account as well. I
49:41
pin it on my account. So
49:44
that's how the new project, but also on
49:46
LinkedIn and other social media
49:48
and some of them I'm active, especially
49:50
LinkedIn, I announce all the new projects
49:53
there as well. Nice. I'll
49:55
include all the links for all
49:57
those repositories and your LinkedIn. Well
50:00
thanks Parsa, it's been really fantastic to talk to you
50:02
about all this stuff. Thank you Chris, it's been really
50:04
fun to talk to you as well. I
50:11
want to thank Parsa Ghadermazi for coming on the
50:13
show this week. And
50:15
I want to thank you for listening to the Real
50:17
Python Podcast. Make sure that you click
50:20
that follow button in your podcast player, and if
50:22
you see a subscribe button somewhere, remember
50:24
that the Real Python Podcast is free. If
50:27
you like the show, please leave us a review. You
50:29
can find show notes with links to all
50:31
the topics we spoke about inside your podcast
50:34
player or at realpython.com/podcast. And
50:36
while you're there, you can leave us
50:38
a question or a topic idea. I've
50:41
been your host, Christopher Bailey, and I look
50:43
forward to talking to you soon.