Using Python in Bioinformatics and the Laboratory

Released Friday, 22nd March 2024

Episode Transcript

0:00

Welcome to the Real Python Podcast.

0:03

This is episode 197. How is Python being used to automate

0:08

processes in the laboratory? How

0:10

can it speed up scientific work with

0:12

DNA sequencing? This

0:15

week on the show, chemical

0:17

engineering PhD student Parsa Ghadermazi

0:19

is here to discuss Python

0:21

and bioinformatics. Parsa

0:23

provides background on his research

0:26

and the bioinformatic techniques used

0:28

to discover gut microbes' role

0:30

in human health and diseases.

0:32

We talk about automating lab

0:34

experiments with liquid handling robots

0:36

in Python. We dig

0:38

into libraries to shatter and reassemble

0:40

DNA sequences. Parsa also shares

0:43

current projects from the Chan Lab

0:45

at Colorado State University and his

0:47

GitHub repository. All

0:49

right, let's get started. The Real Python Podcast is

1:11

a weekly conversation about using Python in

1:13

the real world. My name is

1:15

Christopher Bailey, your host. Each week

1:17

we feature interviews with experts in

1:19

the community and discussions about the

1:21

topics, articles and courses found at

1:23

realpython.com. After the podcast, join us

1:26

and learn real world Python skills with

1:28

a community of experts at realpython.com. Hey,

1:31

Parsa, welcome to the show. Hi, Christopher. Thanks for

1:33

having me on the show. I'm

1:36

really excited to talk to you. You

1:38

reached out and had a bunch of

1:40

interesting things that you wanted to talk

1:42

about. A lot of them have to

1:44

do with real world applications of Python

1:48

in the laboratory and experiments.

1:50

Maybe you could give me a

1:52

little background on where

1:54

you're at. You're currently in your PhD program.

1:56

Maybe you could explain a little bit about

1:58

what you're currently doing. Yeah,

2:02

so right now I'm in

2:04

my PhD program, fifth year

2:06

at Colorado State University, and

2:09

I'm in chemical and biological

2:11

engineering department. So my background,

2:14

I'm coming from an engineering

2:16

background, and honestly for my

2:18

undergraduate studies, I never did

2:21

any biology. And

2:23

in my doctoral studies, I got

2:25

interested in biological systems.

2:29

And somehow they're very similar to systems that

2:31

we study right now. They are

2:33

maybe more complex, but the concepts

2:35

behind them are very similar to

2:38

classical chemical engineering like factories.

2:41

We actually treat the cells like

2:43

factories, and the analogy goes beyond

2:45

that even. Like we have, for

2:47

example, piping systems. They have some

2:49

sort of analog in the cellular

2:52

and biological. Okay. So

2:54

I find this system really interesting, and

2:57

a lot of programming is also involved

2:59

in this process. I think that's

3:01

really cool that you kind of almost sort

3:03

of shifted direction into your PhD

3:05

program. That's pretty cool. Were

3:08

you doing any programming before that in your

3:10

other, was it an engineering course then before

3:13

that? Yes. So mainly

3:15

we were using MATLAB for

3:18

anything, and it was mostly

3:20

computer simulations. Okay. And

3:23

the fun part is that in my first

3:25

year in the PhD, everything was in MATLAB.

3:28

But maybe over a year,

3:30

everything in our lab just shifted

3:32

to Python. Okay. Because

3:34

we soon realized that the Python

3:38

offers... One

3:40

thing is the package. There are so many

3:42

good packages that we can use in Python,

3:45

and also how easy it is for

3:47

someone to get started with Python and

3:49

become better soon. So that's why we

3:51

shifted to Python, I think, after

3:54

the first year. And we have been using

3:57

Python for maybe four years.

4:00

something like that. Yeah. Yeah, yeah.

4:02

I would imagine that a lot of the tooling,

4:05

I don't know, these last four years have been

4:07

extremely productive as far

4:09

as, you know, the scientific community and

4:11

the adoption of Python there. I'm not

4:13

saying that wasn't before that, but I

4:15

feel like the tooling

4:18

has gotten easier

4:20

and, like you said, there's the ability to

4:22

kind of build on top of other people's

4:24

work as opposed to having to build everything

4:26

from scratch. Has that been your experience? Yes.

4:28

And one funny thing, I'm coming

4:30

back from a seminar today and

4:33

these seminars happen like weekly and

4:35

it's been four weeks in a

4:37

row that everybody's saying, we

4:39

were using MATLAB and suddenly

4:41

we shifted to Python. It seems

4:43

like it's something that's really happening

4:46

at the more speed recently. Yeah.

4:49

Yeah, that's interesting because I think a

4:51

lot of universities, I mean,

4:53

it depends on the professor's background and the

4:55

tooling that they've been using and maybe the

4:58

funding. MATLAB is

5:00

a thing where it's, you know, they have

5:02

to be purchased, right? Seats for it

5:04

and things like that. So I think

5:06

that might be an attractive thing for

5:08

a university too. Yeah, yeah, exactly. And

5:10

I know that some of the programs

5:12

are thinking about replacing this entirely. So

5:15

one issue is that the courses are

5:17

based on MATLAB and already there's

5:19

a lot of syllabus development and you

5:21

need to like prepare

5:23

new materials. And that's why maybe

5:25

in the education part, we still

5:27

have MATLAB taught, but

5:30

I think that will also change. Like

5:32

even the freshmen and undergraduate students will

5:34

be trained on Python

5:36

instead of MATLAB. That's not just

5:38

my guess. Yeah. And that's

5:40

the pattern that I'm seeing right now. Yeah. Okay,

5:43

cool. So one

5:45

of the areas that you said

5:48

that you wanted to speak about when you sent

5:50

me the email was you wanted

5:52

to talk about using Python in the lab

5:55

and then how Python's being used

5:57

in the field of bioinformatics. And

5:59

I... immediately had to go

6:01

online and go, all right, what is bioinformatics?

6:03

So I don't know if you're comfortable. Could

6:05

you explain to the audience, like, you know,

6:08

generally what is bioinformatics? Yeah,

6:11

sure. I think it's a very

6:13

general term and, and the way

6:15

I use it is, it's just

6:17

like procedures to process biological data,

6:19

the data that relates to biological

6:21

systems. This could go

6:23

even to like hospitals and it could

6:25

be as broad as that, like data

6:27

from hospitals could be in

6:30

this realm as well. But what

6:32

I am working on is specifically

6:34

data from microbial systems. So these

6:37

cells are living organisms and

6:39

we kind of get information,

6:42

different type of information, the

6:44

way we process that, the science that

6:47

is behind processing this information into useful

6:49

information that could be used for

6:51

next step actions, all of

6:53

that falls into bioinformatics. Okay.

6:57

So it could be, you mentioned

6:59

later, like working with

7:01

sequences of DNA, or it

7:03

might be looking at the

7:05

information that you're doing through

7:07

repeated studies, just sort of

7:10

managing the information about,

7:12

if you will, the biology field. Yeah,

7:16

exactly. Because DNA sequences are

7:18

really, like it's a big

7:21

amount of data is being generated that

7:24

are DNA sequences. And it

7:27

goes from how do we store

7:29

these data? How do we use

7:31

databases? How do we like different

7:34

algorithms? How we can process

7:36

this information into useful outputs?

7:38

All of that requires

7:40

computer knowledge from software engineering

7:43

algorithms. And it goes even

7:45

beyond these topics. Yeah, yeah.

7:48

That's a lot of data to manage. Yeah. It's

7:50

kind of one of these like areas you always hear of

7:52

the world of big data. And

7:55

you think of like banks and

7:57

financial data and lots of

7:59

documentation. and so forth, but you're dealing

8:01

with just raw, huge

8:04

amounts of data, looking at, like you

8:06

said, the genomes and things like that, which is

8:08

a pretty, pretty intense amount of data. So

8:11

a couple of things that we wanted

8:13

to dig into based upon our back and forth

8:15

through email was to kind of think about like,

8:17

what are the different places where

8:20

Python fits into your role as

8:22

a researcher? And I

8:25

thought one of the cool ones was this idea

8:27

of how Python has

8:29

helped you in the lab itself

8:32

and doing your experiments. You

8:34

sent me a video about

8:36

how you

8:39

would manually dilute a

8:42

bacterial sample, was the example they gave there,

8:44

and how it was

8:47

like, okay, the

8:49

beginning of the day, wipe down this surface.

8:51

Okay, start here. And it was just like

8:53

so much manual stuff. And then like literally

8:56

the next day, you're only like maybe five

8:58

steps into the process. It was

9:00

kind of wild. And so you sent me a link

9:02

to a company, Opentrons,

9:04

this manufacturer who was creating

9:07

a liquid handling robot. Do you want to talk

9:09

a little bit about that and then how Python

9:12

intersects with that as far as helping you

9:14

in the lab? Yeah,

9:16

sure. So as a researcher,

9:18

my work splits into two

9:20

parts. One is the wet lab projects,

9:22

which we actually go to the lab

9:25

and plan and do experiments in the

9:27

lab. And the other

9:29

part is more like computational work,

9:31

where we develop algorithms, do data

9:33

analysis and that. So yeah,

9:36

that robot is really something

9:39

that has changed the way we do experiments.

9:41

And it falls into the wet lab part.

9:44

So we used to do things by

9:46

hand, like hand pipettes. And

9:49

it becomes really hard because in many of

9:51

the experiments that we do, it's just you

9:53

have to go manually. And some of these

9:55

plates that we work with, they have like

9:58

96 very similar wells

10:00

that you need to pipette something from

10:02

one of them and drop it into

10:05

the other one. And it becomes really

10:07

confusing and error prone, I

10:09

guess, maybe. Very

10:11

error prone and also tedious

10:14

because you have to do something

10:16

repetitious. And it's

10:18

just like if that part could

10:20

be automated, it saves a

10:22

lot of time and probably a

10:24

lot of error and in

10:27

the long term, maybe money for the

10:29

lab. And for this reason,

10:31

we use these machines in the lab to

10:33

automate the process. Do you know the

10:36

age of the use of those types of machines

10:38

in the lab? Is it

10:40

recent development? In our

10:42

lab or in general? Yeah, maybe your

10:44

lab or just generally. So we

10:46

started about, I think, five years ago

10:49

in the lab with

10:51

these robots. And I think

10:53

at that point, this company was very new. So

10:56

things were already at the second generation,

10:58

but we were one of

11:00

the first labs on our campus to

11:02

use such robotics. So I think it

11:04

wasn't that common, even if it was

11:06

like the company existed before. It wasn't

11:08

that common. But after some time, right

11:11

now, I know at least four labs in

11:14

our department that use such robotics. So I

11:16

think like people are moving towards that point.

11:19

Yeah, yeah. Actually, the fun

11:21

fact is we used to perform a

11:23

lot of COVID tests on campus during

11:25

the COVID years. Oh, okay. And because

11:27

the numbers were huge on campus,

11:30

they used these machines to speed up

11:32

the process and make the testing

11:34

part really fast so we could get

11:37

back our results from our

11:39

tests very soon and they could

11:41

soon isolate potential people with the

11:44

virus. They could say that soon

11:46

and essentially avoid spreading it. So

11:48

this is one of the places that

11:50

it was used. Nice. So

11:53

how is Python used in there?

11:55

I was able to go to the site. I don't

11:57

know if the company Opentrons, is that the name?

12:00

Yeah. Okay. Yeah. And so

12:02

I kind of dug into a

12:04

little bit and found the Python protocol

12:06

API and looking at it.

12:10

Where does it fit in?

12:12

What are the controls that it's allowing

12:14

you to do with Python? So

12:16

right now you can go to the

12:18

website, build a protocol just using

12:20

the graphical user interface and without any

12:22

knowledge of Python. But what goes on

12:24

behind the scene is the Python code

12:27

is generated and it's given to a

12:29

computer that is inside the robot, which

12:31

is a Raspberry Pi, I think. And

12:33

then this code gets executed and

12:36

it gets transformed into robot

12:38

actions like go up this much and

12:40

pick up that much liquid from

12:43

this coordinate and move it to that coordinate. So

12:45

as I said, right now there

12:48

is a great graphical user interface

12:50

that they provide. But for more

12:52

advanced protocols, we usually have to

12:54

write the protocol ourselves. So it

12:56

would become a Python script that you have

12:58

to write. And, you know, for

13:00

that example that I said,

13:02

like you have wells that are let's say

13:04

you have 96-well plates

13:06

that are like 12 different columns

13:09

and you have to do repetitious

13:11

things. These concepts fit really nicely

13:13

with something like loops in programming.
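
The repetition described here can be sketched in plain Python. This is only an illustration of the loop idea, not the Opentrons API: the names `well_name` and `column_transfer_plan`, the row and column layout constants, and the 20 uL volume are all assumptions made for the example.

```python
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]   # rows "A" through "H"
COLUMNS = range(1, 13)       # columns 1 through 12


def well_name(row: str, column: int) -> str:
    """Build a conventional well label such as "A1" or "H12"."""
    return f"{row}{column}"


def column_transfer_plan(volume_ul: float) -> list[tuple[str, str, float]]:
    """List (source, destination, volume) steps moving liquid one column right.

    Every row gets the same 11 transfers, which is the repetition a person
    would otherwise do by hand with a pipette.
    """
    steps = []
    for row in ROWS:
        for column in COLUMNS[:-1]:  # the last column has no neighbor to its right
            source = well_name(row, column)
            destination = well_name(row, column + 1)
            steps.append((source, destination, volume_ul))
    return steps


plan = column_transfer_plan(20)
print(len(plan))   # 8 rows x 11 transfers per row
print(plan[:2])
```

A real protocol would hand each step to the robot's own pipetting calls; the point of the sketch is only that two nested loops replace dozens of manual transfers.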

13:15

Yeah. So instead

13:17

of doing it yourself,

13:19

just a simple for

13:22

loop and do that and avoid

13:24

possible mistakes. And one other interesting

13:27

programming concept that comes here

13:29

are the exceptions and

13:31

errors. So it can

13:34

run a simulation of the experiment, the

13:36

protocol that you give it and it

13:38

raises exceptions based on if there is

13:40

some sort of logical issues in your

13:43

code. For example, this well

13:45

has 200 microliters of liquid

13:47

in it, but if you're

13:50

pipetting out

13:52

more than that, then that doesn't

13:54

mean anything because there's not that much

13:56

liquid in that well. Okay. So

13:59

it's really good to

14:01

know these upfront because if you are

14:03

doing those by hand in the lab

14:05

notebook, you might not notice some of

14:07

the miscalculations that you might have for

14:10

your experiment, which is really great. Yeah.
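
The kind of up-front check described here can be sketched as a small dry run. This is a hypothetical stand-in, not how the Opentrons simulator is implemented: `InsufficientVolumeError` and `dry_run` are names invented for the example, and the 200 uL starting volume mirrors the figure mentioned in the conversation.

```python
class InsufficientVolumeError(ValueError):
    """Raised when a step asks for more liquid than a well holds."""


def dry_run(initial_volumes_ul, steps):
    """Simulate (source, destination, volume) transfers without any hardware.

    Returns the final volume in each well, or raises on the first
    impossible step.
    """
    volumes = dict(initial_volumes_ul)  # copy so the caller's data is untouched
    for number, (source, destination, volume_ul) in enumerate(steps, start=1):
        available = volumes.get(source, 0)
        if volume_ul > available:
            raise InsufficientVolumeError(
                f"step {number}: asked for {volume_ul} uL from {source}, "
                f"but only {available} uL is there"
            )
        volumes[source] = available - volume_ul
        volumes[destination] = volumes.get(destination, 0) + volume_ul
    return volumes


print(dry_run({"A1": 200}, [("A1", "A2", 50)]))

try:
    dry_run({"A1": 200}, [("A1", "A2", 250)])
except InsufficientVolumeError as error:
    print(f"protocol rejected: {error}")
```

Raising an exception before anything moves is what makes the mistake cheap: the miscalculation shows up on the screen instead of in the plate.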

14:13

Yeah. It's sort of

14:15

pre-checking your work before you run it.

14:17

It's like a running a, almost

14:19

like a test run pass on it. Yeah.

14:23

You might do like a pytest or something like that on it. Yes.

14:27

Yeah. I mean, like there are some

14:29

stock tools that are built into

14:31

the graphical user interface. Are

14:34

you able to take what one

14:36

of those would generate as like a

14:38

script and just modify the existing script

14:40

and add the additional kind of controls

14:42

you want or the exceptions

14:44

you're mentioning? Yeah. Yes,

14:46

exactly. Finally, I think it just

14:48

gives you a .py file. Okay.

14:51

The interface. If you don't want to change

14:53

it, just import it in the desktop

14:56

computer, which is

14:58

connected to the robot. Finally, that Python code

15:00

gets interpreted. If you go to that Python

15:02

way, you can make any change that you

15:04

want. Again, before running

15:06

anything, even if you change that code,

15:09

before running it, it will run a test

15:12

to make sure that nothing bad is going on

15:14

with the protocol.

15:16

Nice. Yeah. Yeah.

15:19

It's kind of like its own simulation. Yeah. Yeah.

15:22

When you're developing the code

15:24

for that, you kind of mentioned the word script a couple

15:26

of times. What does your personal

15:28

development environment look like?

15:31

Are you using like a

15:33

laptop and working with a particular code editor?

15:35

What are the types of tools that you

15:37

use in the lab for Python coding there?

15:39

So you can both use something

15:42

like any text editor, but

15:44

you also can use Jupyter, which is something

15:47

that I haven't used in my editor. Jupyter

15:49

always uses and for other projects, I haven't

15:51

used Jupyter, but the good thing about Jupyter

15:53

is that you can run one

15:56

part of your experiment and then stop and then

15:58

run the next instead of like

16:00

running the whole experiment at once. That's

16:02

one nice thing that Jupyter Notebooks gives

16:05

you. I always personally work in VS

16:07

Code and that's

16:09

my preferred editor that I go

16:11

to. I

16:14

always when I use Jupyter Notebooks, I

16:16

use the Jupyter extension inside VS Code.

16:18

Yeah, yeah. How's that flexibility? Yeah,

16:21

it's a really nice thing to have.

16:24

Are there other techniques that are

16:26

involved with using these liquid handling

16:29

robots? So techniques? What like what?

16:31

I'm trying to think of like

16:33

are there other, we talked

16:35

about it can be used for these

16:37

dilution experiments and things like that. Are

16:40

there other types of experiments that they

16:42

are well suited for it

16:45

for what you're doing currently? Yes,

16:47

well, I think people are doing really interesting

16:49

types of experiments with it. For example, there

16:52

are types of experiments where we want

16:54

to pick a colony of microbes and

16:57

inoculate something like another tube with that

16:59

specific colony. So really you need to

17:01

be very precise. For example, you have

17:03

to pick that colony very

17:06

precisely if you're doing it by hand.

17:08

Okay. But I've seen people that using

17:10

a camera that exists on

17:12

top of this robot, it could

17:15

actually use some image analysis

17:17

to go to that place and take

17:19

that colony and then drop it into a

17:22

destination well. It was just really fascinating

17:25

because it's really accurate. I

17:27

see a lot of applications for this. Why

17:31

I'm laughing is I did watch

17:33

that video on serial dilution and

17:35

I thought to myself, not

17:37

only the hand-eye coordination that you were mentioning

17:39

before of like all these different experiments going

17:42

and pipetting things and so forth, but there's

17:44

the second phase where you're getting

17:46

the Petri dishes out and

17:48

not only labeling everything, but

17:50

then having to spread stuff

17:52

in these three little areas

17:55

around the thing. And I'm like, oh my

17:57

God, that would be so like, it's not only tedious,

17:59

but difficult to do because you're

18:01

supposed to only take a certain amount

18:03

to this other area. So I could

18:05

see how maybe computer vision and

18:08

this robot could maybe do that kind

18:10

of thing where it's laying it out inside of a Petri

18:12

dish. Is that something it does too?

18:14

Because we talked about the dilution, but I don't know

18:16

if it does the plating also. Yeah. So I think

18:19

these things don't come by default. It's

18:21

just the creativity of the users. Okay.

18:24

Yeah. Because there's a Python going

18:26

on in the backend and you

18:29

have access to all these really

18:31

cool image analysis libraries. And finally,

18:33

everything gets converted to protocol. So

18:36

I think that's why so

18:39

many people can be creative and create

18:41

really cool things with the robot. But finally,

18:43

what happens inside that code is that

18:46

you tell the robot to go to

18:48

a specific destination. And

18:50

that could be hacked

18:53

to go to a destination and do something

18:55

that we want out of that maybe it

18:57

wasn't designed to do that, but it could

19:00

be cool in a lab setting. Sure. So

19:03

kind of moving beyond that and kind

19:05

of switching gears into,

19:07

okay, now you've run

19:09

these experiments and you've got your

19:11

results. Now you're looking

19:14

at doing these techniques for these

19:16

bioinformatic techniques and you're like, okay, I want to

19:19

sequence data. There's a couple of things

19:21

that you were talking about, a couple of different

19:23

experiments that you were running, like you were doing

19:25

some stuff, kind of looking at the role of

19:27

gut microbes and human health. You want to talk

19:29

a little bit about, I don't know what to

19:31

call it, a project or to call it a

19:33

study. Like I don't have the terminology in my

19:36

head. Sorry. No worries.

19:38

Yeah, sure. So actually

19:40

we can start with the robot, how

19:42

that happens. So we have these tiny

19:45

microbes in our gut that help us

19:47

stay healthy. They extract a

19:49

lot of nutrition from the food

19:51

that we eat and they finally, they

19:53

circulate back to our bloodstream. So

19:56

when something happens to this community of microbes,

19:58

and this community of microbes

20:00

is a very complex

20:02

combination of different microbes, microbes

20:04

from different taxonomic branches. What

20:07

happens is that when you take a sample

20:09

from the gut environment, you're

20:11

not left with one single

20:13

organism. You have thousands of

20:15

different species. It becomes

20:17

a really hard problem how we

20:19

can understand what they're doing. The

20:21

goal of that dilution to extinction

20:24

experiment is to break down

20:26

that community by dilution, every

20:28

time that you dilute, you leave something

20:30

out from the previous community and

20:32

introduce a more simplified sub-community

20:35

into the new one, and then

20:37

again dilute until you reach

20:39

to a point where you have two

20:41

or three different microorganisms that you can

20:43

actually work with. You can

20:45

understand them better. That's the whole

20:48

point of doing serial dilution experiments.
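
The arithmetic behind repeated dilution can be sketched in a few lines. This is a simplified illustration with made-up numbers: it tracks an expected concentration under a fixed dilution factor, as a rough proxy for how the community thins out, and the names `dilution_series` and `dilutions_needed` are assumptions for the example rather than anything from the lab's code.

```python
def dilution_series(start_per_ml: float, factor: float, steps: int) -> list[float]:
    """Expected concentration after 0, 1, ..., `steps` dilutions by `factor`."""
    return [start_per_ml / factor**step for step in range(steps + 1)]


def dilutions_needed(start_per_ml: float, factor: float, target_per_ml: float) -> int:
    """Count dilutions until the expected concentration is at or below the target."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1 for the series to shrink")
    concentration = start_per_ml
    count = 0
    while concentration > target_per_ml:
        concentration /= factor
        count += 1
    return count


# Hypothetical starting point: 1000 cells per mL, diluted tenfold three times.
print(dilution_series(1000, 10, 3))

# Hypothetical dense sample: how many tenfold dilutions until about 3 per mL?
print(dilutions_needed(1e9, 10, 3))
```

In practice which organisms survive each step is a matter of chance, so the expected value is only a guide for choosing how many dilution steps to plate.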

20:51

From that point, we can, for

20:53

example, compare all the communities

20:55

that have two or three microbes and

20:57

see, for example, this one produces more

21:00

of that compound. That

21:02

compound is good for health. How we

21:04

can improve the whole community is by

21:06

making the environment more suitable for the

21:08

ones that make that specific compound that

21:10

we are after. Finally,

21:13

for example, there are many diseases

21:15

that are linked to specific type

21:18

of dysbiosis or

21:20

imbalance in the gut microbiome

21:23

community. By comparing

21:25

samples from these patients to

21:27

those healthy individuals, we can see which

21:29

microbes are different or the ones that

21:31

are different, what they are doing differently,

21:34

and using this for therapeutic applications. Yes,

21:38

we isolate all

21:40

these simplified communities, and then

21:42

what we do is usually

21:44

we either measure the chemicals

21:46

that are produced by these

21:48

organisms or sequence the DNA.

21:51

Through that DNA sequencing and

21:53

the bioinformatics technique, we can

21:55

say, this organism

21:57

probably did this, and that's why we can

22:00

maybe use it as a probiotic,

22:02

or if it has a bad

22:04

effect, just remove it, or do

22:06

something so that it cannot grow so fast

22:09

that it has unhealthy effects.

22:11

Okay. Yeah. So, you

22:14

talked about the dilution getting down

22:16

to like, maybe only seeing two

22:18

or three things in a sample

22:20

as opposed to like the whole

22:22

wide gamut of everything that's there.

22:24

And then you talked about measuring

22:26

compounds that are there. Are they

22:28

just proteins or what have you? What are

22:30

the tools you use there? I'm

22:32

sorry, I'm kind of going really deep

22:35

here, but what techniques do you use to

22:37

look at that? So

22:39

There are two techniques that we use, but

22:41

there are other ones. So gas

22:43

chromatography and liquid chromatography are two

22:46

common techniques that are

22:48

used. And mainly, when you take

22:50

blood samples, they use similar instruments

22:53

to measure different things. So what they

22:55

finally do is give a

22:57

good approximate concentration of each

22:59

of those components that are identified in

23:01

the samples. So those are

23:03

the two important ones. NMR is

23:05

another one, but

23:07

we don't use it. But these two are

23:09

really common, because once you have

23:12

the instrument, I mean, the instrument itself is not

23:14

cheap, but once you have the instrument

23:16

it could be cheap to run a

23:18

sample and see what you have, like

23:20

what components are in your

23:22

samples. And that's another thing that we use

23:24

on a daily basis. Okay. Can

23:26

you break it down to

23:28

how that information is

23:30

pulled out and turned

23:33

into data? Are you inserting

23:36

that small sample into the machine and

23:39

it's doing the measurement and then it's outputting

23:41

like a data file for you? Yeah,

23:43

exactly. But, well,

23:45

the only thing that you

23:48

need to do before is to

23:50

do some sort of preparation. For example,

23:52

you have to filter the microbes

23:54

out, because those microbes

23:56

are larger particles and they could interfere

23:58

with the machine. First, you do something

24:01

like a centrifuge to keep the

24:03

bacteria and larger particles out. Then

24:05

when you have a very, maybe,

24:08

a well-behaved liquid, I don't know

24:10

for lack of a better word,

24:14

but when you have that, you

24:16

can run it. When you give it to

24:18

the machine, it will output some sort of

24:20

diagrams. These diagrams, based

24:23

on the peak intensity and where

24:25

that diagram happens, because it's like

24:27

a spectrum, it's like those earthquake

24:29

type of graphs. I don't know if you've seen them. They're

24:32

very noisy. We

24:35

have the same thing here, but depending on

24:37

where that peak is happening and how big

24:39

that peak is, so where that peak is

24:41

happening in the chromatogram, we call it the

24:43

name of that graph. In

24:45

the chromatogram, where that peak is happening is

24:48

telling you what component it is and how

24:50

intense that peak is. It's telling you how

24:52

much of that component is there. Okay.
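
The two readings described here, where a peak sits and how tall it is, can be sketched with a toy signal. This is a minimal illustration, not the instrument vendor's software: the data points, the `min_height` threshold, the expected peak times for "component A" and "component B", and the function names are all made up for the example.

```python
def find_peaks(times, intensities, min_height):
    """Return (time, intensity) for each local maximum at or above min_height."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        here = intensities[i]
        is_local_max = here > intensities[i - 1] and here >= intensities[i + 1]
        if is_local_max and here >= min_height:
            peaks.append((times[i], here))
    return peaks


def label_peak(peak_time, known_times, tolerance):
    """Name the known component whose expected time is closest, if close enough."""
    name, expected = min(
        known_times.items(), key=lambda item: abs(item[1] - peak_time)
    )
    return name if abs(expected - peak_time) <= tolerance else "unknown"


times = [0, 1, 2, 3, 4, 5, 6, 7, 8]
signal = [0, 1, 8, 1, 0, 2, 15, 3, 0]
known = {"component A": 2.0, "component B": 6.2}  # hypothetical expected times

for peak_time, height in find_peaks(times, signal, min_height=5):
    # Position identifies the component; height stands in for its amount.
    print(peak_time, height, label_peak(peak_time, known, tolerance=0.5))
```

Real chromatograms are noisy, so production software usually smooths the signal and integrates the area under each peak rather than reading a single height.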

24:55

Just to do a quick analogy on that

24:57

specific thing, as a person

25:00

who's into photography, there's a setting that

25:02

you can use called a histogram that

25:04

looks at the overall light of the

25:06

image. It shows peaks

25:08

and valleys showing like, okay,

25:10

this part of the image had this much

25:12

light and was clipping or

25:14

was too bright and it's all

25:16

white or this was too dark

25:18

or whatever. I'm guessing that chromatogram

25:20

is a similar thing and you're

25:22

able to see the different levels

25:24

of components there. Yes, exactly. Very

25:26

similar, except that in pictures, you

25:28

only have intensity and light is

25:30

always the same, but here, different

25:32

components reach that receptor that is

25:34

at the end of the machine

25:37

and each peak is for different components.

25:40

That's the only difference. Yeah. Yeah,

25:42

yeah. Okay. Those

25:44

graphs that you get out of

25:46

there then can be turned into the raw data

25:48

that you're going to use for the next step.

25:51

Yes. And then the machine comes

25:53

with a software that turns

25:55

those into tabular data. For example, you can

25:57

say this is the concentration of component A,

26:00

this is the concentration of component

26:02

B, and so on. And it gives

26:04

you like a spreadsheet that has that

26:07

information in it. And then

26:09

that's something where we can take

26:11

that information and use it in

26:13

statistical analysis to compare the

26:15

samples. This

26:20

week, I want to shine a

26:22

spotlight on another Real Python video course. It's

26:25

titled Building Python Project

26:27

Documentation with MkDocs. The

26:29

course is based on a Real Python step-by-step

26:31

project by frequent guest Martin

26:34

Breuss. And in the video

26:36

course, instructor Darren Jones shows

26:38

you how to work with

26:40

MkDocs to produce static pages

26:42

from Markdown, pull in code

26:44

documentation from docstrings using

26:46

mkdocstrings, follow best practices

26:48

for project documentation, and

26:50

use the Material for MkDocs theme

26:52

to make your documentation look great,

26:54

and how to host your documentation

26:56

on GitHub Pages. I think using

26:58

tools like this can make what

27:00

seems like a daunting task so

27:02

much easier. And I think it's

27:04

a worthy investment of your time

27:06

to learn how to automate production

27:08

of your project's documentation. Your users

27:10

will truly appreciate it. Real Python video

27:12

courses are broken into easily consumable

27:15

sections and where needed include

27:17

code examples for the technique shown. All

27:19

lessons have a transcript including closed captions. Check

27:21

out the video course. You can find a

27:23

link in the show notes or

27:25

you can find it using the enhanced

27:28

search tool on realpython.com. And

27:33

so is that machine connected in a way

27:35

that the data, I know I'm being really

27:37

microscopic in my analysis of how we're talking

27:39

about this stuff, but like is it coming

27:41

out as a CSV file or is it,

27:43

how are you getting that tabular data? Usually

27:46

I think it's an Excel format. Like

27:50

usually it's like that. And you can just save

27:53

that Excel file finally into a CSV.
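
Once the table is saved as CSV, comparing groups of samples takes only the standard library. The sketch below assumes a made-up export: the column names `sample`, `group`, `component_a`, and `component_b`, the values, and the helper `mean_by_group` are all hypothetical and stand in for whatever the instrument software actually writes.

```python
import csv
import io
from statistics import mean

# Made-up export: one row per sample, concentrations in arbitrary units.
exported = """sample,group,component_a,component_b
s1,healthy,4.0,1.0
s2,healthy,6.0,3.0
s3,patient,1.0,5.0
s4,patient,3.0,7.0
"""


def mean_by_group(csv_text, column):
    """Average one numeric column separately for each value of 'group'."""
    values = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        values.setdefault(row["group"], []).append(float(row[column]))
    return {group: mean(numbers) for group, numbers in values.items()}


print(mean_by_group(exported, "component_a"))
print(mean_by_group(exported, "component_b"))
```

With a real file, `open(path, newline="")` would replace the `io.StringIO` wrapper; for larger studies a dataframe library and a proper statistical test would be the natural next step.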

27:55

One thing is, so for researchers,

27:58

maybe Excel would be

28:00

a more familiar term than

28:02

a CSV sometimes that's the

28:04

case. And Excel is

28:06

a pretty common tool to use and

28:08

sometimes you don't even need to get

28:10

out of Excel to do everything that

28:12

is related to your research. But when

28:14

you scale things up that's where Python

28:17

becomes maybe more efficient in

28:20

doing the data analysis. Yeah,

28:22

I recently had some of the people

28:24

working on the Python in Excel on

28:26

the show, and it was interesting to

28:28

talk to them, and it's interesting to

28:30

hear that because that sounds like that might be

28:32

yet another way to again maybe

28:35

avoid having to hop through multiple

28:37

layers to get somewhere. Yes. At

28:39

least to do the initial analytic

28:42

research and kind of looking at what you have

28:44

to make sure like this is worthwhile we're gonna

28:47

take this to the next step. Yeah,

28:49

that sounds really exciting. Hey, I

28:52

haven't had a chance to try that out yet

28:54

but looks very interesting. Yeah, it's

28:56

still kind of in a beta

28:58

phase where you have to be part of their

29:01

sort of developer 365 program

29:04

and have to sign up for things and so

29:06

forth and I think it's only on Windows. I'm

29:09

intrigued by it. It's very interesting to see what

29:11

they're gonna do with it, and there's

29:14

a lot of stuff in it. They preload a lot of

29:16

data science stuff ready to go in it. Do

29:20

you want to talk a little bit about the

29:22

DNA sequencing technologies or did we cover most of

29:24

the things you wanted to cover on this first

29:27

section? Yes, and I think this

29:29

is a good point because we have

29:31

our samples now and we have

29:33

sequenced it but why

29:36

do we do sequencing is because we

29:38

can and the idea is we can

29:40

infer all the biological information from DNA

29:42

because we think that DNA is the

29:45

blueprint to living organisms. Every

29:47

kind of information that is required

29:49

for biological functions is encoded in DNA.

29:51

So the assumption is if we

29:53

can sequence DNA and understand that

29:55

sequence we can say a lot

29:57

of things about the biology

29:59

that is going on in the

30:01

samples that we got that information

30:03

from. So

30:05

what happens is that using

30:08

different techniques, we have removed different components

30:10

that we don't need. And somehow we

30:12

want to purify that DNA that is

30:14

inside a sample in a process called

30:16

DNA extraction. So we don't wanna have

30:19

different components that we don't need for

30:21

the DNA analysis and they could interfere

30:23

with the process. So we just try

30:25

to, using chemical treatment, just take

30:28

those components out. When

30:30

we do DNA extraction, finally, we

30:33

have our DNA and this DNA

30:35

can be sequenced in different facilities.

30:38

And finally, what you get out

30:40

of these sequencing machines is just

30:42

a bunch of A,

30:44

T, C, G letters

30:46

that are really

30:48

large, like surprisingly large. Yeah.

30:52

Well, not surprisingly, because if

30:54

we assume that everything is happening, using

30:56

the information inside this DNA, it won't

30:58

be as surprising. But

31:00

those files could be large, even for

31:02

very small cells

31:04

that have simpler DNA than other

31:07

ones. Can you give

31:09

a size that would

31:11

be comparable on computer terms? Like

31:13

is that... Yeah, sure. Gigabytes

31:16

or something larger? So

31:19

it depends. It depends on, in

31:21

the microbial world, we are usually

31:24

bound to 10 megabytes on the

31:26

high end. And on the low end,

31:28

we are... Usually it's

31:30

half a megabyte, I would say. The

31:34

entire DNA for one cell would

31:37

be around that. But for human

31:39

cells, it's in the gigabyte. And

31:42

everything changes in between. For example,

31:44

you have some sort of maybe

31:46

more complex microbes, like yeast, that will

31:48

have bigger and more complex

31:51

DNA that you could use in

31:53

the bakeries or in the breweries

31:55

for making beer. Those are slightly

31:57

more complex. Okay. Than

32:00

bacterial cells and they should

32:02

be not still in the gigabytes,

32:04

but definitely larger than bacterial

32:06

cells. Yeah, one of the techniques

32:09

you talked about is this idea

32:11

of, I think it

32:13

was, was it called shattering to

32:16

break apart the DNA to focus

32:18

on like the very specific

32:20

sequence. Because I'm guessing in

32:23

the types of things you study, like

32:25

if you're looking at bacteria, there's

32:28

probably a huge, well, a

32:30

large amount of repetition that

32:32

all of them have this sort of structural stuff

32:34

and then you want to focus on certain areas.

32:36

Am I getting that part right? Well,

32:39

the reason for shattering

32:41

DNA is not to focus

32:44

on a specific part. It's just because

32:46

the sequencing facilities cannot sequence DNA that

32:48

are longer than a specific length. They

32:51

have this limitation. Okay. They

32:53

kind of focus on optical signals, and if

32:55

they continue to longer pieces

32:57

of DNA, they finally, the error

32:59

becomes so high that the data is basically

33:01

not useful. So

33:04

what we have to do is

33:06

before that using mechanical forces, we

33:09

have to break down the DNA,

33:11

these DNA molecules into smaller pieces,

33:14

like 300, we call it

33:16

base pairs, so 300 A, T, C, G letters

33:19

or something like that. That's the usual. And

33:23

then we can sequence those smaller pieces. But

33:25

now that we have solved one

33:28

problem, we have created many more

33:30

because the problem becomes then how

33:32

do you know how to

33:35

fit these pieces together? It becomes

33:37

a big puzzle. And the

33:39

way it happens is based on

33:42

the overlap that these sequences

33:44

might have, there are different

33:46

algorithms called assembly algorithms that

33:48

make a sort of

33:50

graph that connects these sequences

33:52

based on the overlap between

33:54

these sequences. And then finally

33:56

finding the longest path that

33:58

you can find between these pieces

34:00

in the graph that will kind of resolve

34:02

that piece of DNA in that region. And

34:04

it's a really open-ended problem

34:06

and a very complex algorithm

34:09

that these tools achieve.
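To make the puzzle idea concrete, here is a toy Python sketch of shattering a sequence and stitching it back together from overlaps. It is only an illustration under simplifying assumptions: the sequence is invented, the fragments are cut at regular intervals rather than randomly, there are no sequencing errors, and the greedy merge shown here is a stand-in for the much more sophisticated graph methods real assemblers use.

```python
from itertools import permutations


def shatter(genome, read_len=8, step=4):
    """Cut a sequence into overlapping windows (real fragmentation is random)."""
    reads = []
    for start in range(0, len(genome), step):
        reads.append(genome[start:start + read_len])
        if start + read_len >= len(genome):
            break
    return reads


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicates, keep order
    while len(reads) > 1:
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # nothing overlaps any more; leave the pieces separate
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_n:])
    return reads


toy_genome = "ATGGCGTACGTTAGCCA"  # invented sequence for the demo
pieces = shatter(toy_genome)
print(pieces)
print(greedy_assemble(pieces))
```

With repeated regions or read errors, a greedy merge like this can join the wrong pieces, which is part of why the real problem is so hard.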

34:11

And these, because of the performance

34:14

considerations, they are usually not implemented

34:17

in Python, usually in other

34:19

languages like C, C++. Yeah,

34:21

yeah, that makes sense. But

34:24

they usually have an interface in Python. So

34:26

finally, for example, the CLI is in Python.
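A minimal sketch of that pattern, a thin Python command-line layer in front of a compiled program, might look like the following. The binary name assembler_core and its flags are hypothetical, invented only for this illustration; they do not belong to any real tool.

```python
import argparse
import shlex


def parse_args(argv=None):
    """Parse the user-facing options in Python."""
    parser = argparse.ArgumentParser(description="Toy front end for a compiled assembler")
    parser.add_argument("reads", help="path to the input reads file")
    parser.add_argument("out_dir", help="directory for the results")
    parser.add_argument("--threads", type=int, default=4)
    return parser.parse_args(argv)


def build_command(reads, out_dir, threads=4):
    """Translate the parsed options into a call to the (hypothetical) compiled binary."""
    return ["assembler_core", "--reads", reads, "--out", out_dir, "--threads", str(threads)]


args = parse_args(["reads.fastq", "results", "--threads", "8"])
command = build_command(args.reads, args.out_dir, args.threads)
# A real wrapper would hand this list to subprocess.run(command, check=True).
print(shlex.join(command))
```

Keeping the argument handling in Python makes the tool easy to script, while the heavy computation stays in the faster compiled code.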

34:30

That connects to different modules that

34:32

are written in other languages. Yeah,

34:34

you provided a bunch of links to these

34:37

libraries. Is

34:40

the one, I think it's called

34:42

MEGAHIT, is that kind of in this realm

34:44

that we're talking about? Okay. Yes, it's a tool

34:47

that I use and it's designed to get

34:50

those, what we call short reads,

34:52

and it's an assembler. So it

34:54

assembles those short reads into longer

34:56

pieces. And the goal here is

34:58

to recreate those pieces

35:00

that we shattered. The

35:02

reason is if we have small

35:04

pieces, we don't have that much

35:07

statistical significance to say things for

35:09

sure. For example, if it's too

35:11

small, it could be happening by

35:13

just chance, by random. However,

35:16

if we make it really long,

35:18

then if there is a very

35:20

similar match to this long piece in a

35:22

database that we have, we can say things

35:24

more certainly. Okay. Yeah, this is

35:27

now significant or what have you. Yeah,

35:29

yes. Yeah. Okay. You shared

35:31

that project with me, which I think it's

35:33

kind of interesting because it says

35:35

here like a copyright of 2015, the University

35:37

of Hong Kong, their kind of initial license

35:40

of it up here on GitHub. I

35:43

find that fascinating. Is that common

35:45

in universities that across

35:47

these different communities that they are sharing

35:49

their code? Is that like a pretty

35:52

common thing that you've found within

35:54

this field? Yes, especially in bioinformatics. One

35:56

thing that I'm really grateful for is

35:58

that being open source is kind

36:01

of the theme. When you publish

36:03

a paper or wherever you

36:05

mention your package name

36:08

or wherever you want to present on,

36:10

I think usually you provide that

36:12

in an open source like as

36:14

a GitHub repository and places

36:16

like that that everybody can use

36:19

and it's really a common

36:21

theme as I said. Good. This

36:23

field. So yes, I think

36:25

yes. Yeah, yeah, that's great. So

36:28

one of the other projects you mentioned is looking

36:31

at the prediction of anaerobic

36:33

digestion metabolism. Yes. And

36:36

that one is, I think it's

36:38

called AD Toolbox. Do you want to talk

36:40

about that project? Yes. So the other packages

36:42

that you mentioned, those are by other labs,

36:44

but we are starting to write

36:47

our own packages and publish them.

36:49

So sometimes these packages stand like

36:52

they call different tools that exist

36:54

in other languages or from

36:56

other projects. So AD Toolbox is

36:58

a project that we started for

37:00

modeling the anaerobic digestion system. Anaerobic

37:02

digestion system is just to

37:05

explain that quickly. It's a system

37:07

that has been used traditionally for

37:10

making use of waste, especially

37:12

organic waste, something like foods

37:15

from cafeteria, restaurants. These all

37:17

go to waste and if we don't

37:19

do something about them, they get converted

37:21

to methane, which goes to atmosphere. We

37:23

lose a lot of energy and also

37:26

it's a greenhouse gas.

37:28

So it has a really high

37:30

global warming potential. So the goal

37:32

of this project is to somehow

37:35

manage that anaerobic digestion process

37:37

to break down these waste

37:39

components into useful products. And this

37:41

is happening by microbes. So the type

37:44

of microbes that exist in this environment

37:46

matter. For example, if you put more

37:48

of those microbes that are more useful

37:51

for the process to produce the

37:53

product that we are after, there's

37:55

a good chance that we improve the efficiency

37:57

of this process. So since this is a

37:59

microbial process, we need to take

38:01

into account the information that is coming

38:04

from the DNA of those microbes. And

38:06

this is the goal of this tool.

38:08

For example, it takes the DNA information,

38:10

processes them, and finally it feeds

38:12

them through a mathematical model and in

38:14

future a machine learning model to predict

38:17

the behavior of this anaerobic digestion system,

38:19

what you can do to improve them

38:21

and applications like this. So

38:23

this is a project that other members

38:26

of your doctoral program are

38:28

working on together? This

38:30

is mainly led by me

38:33

and we have some undergraduate

38:35

students that are trained

38:37

on Python and finally they contribute

38:39

to this project. So

38:41

yes. Nice. It's like

38:43

a code fiber. Yeah, yeah.

38:46

What are some of the other libraries that

38:48

you're able to leverage to do the work

38:50

inside of this package? So most of the

38:52

things that I use to do, so for

38:54

example, it has different modules. So at one

38:56

point where we use the

38:59

DNA sequences, we use packages

39:01

outside of Python. There's this

39:03

really cool sequence alignment tool

39:05

that matches a sequence of

39:07

DNA into a known database.
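Alignment tools of this kind commonly report their matches as a BLAST-style tab-separated table with twelve columns. As a hedged sketch of what working with such results can look like, the Python below keeps the best-scoring database match for each query. The rows are invented for the example, and the column names are labels chosen here for readability.

```python
import csv
import io

# Labels for the twelve columns of the common BLAST-style tabular layout.
COLUMNS = [
    "query", "target", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bitscore",
]

# Invented hits: read_1 matches two references, read_2 matches one.
ROWS = [
    ("read_1", "ref_A", "98.5", "200", "3", "0", "1", "200", "10", "209", "1e-90", "350"),
    ("read_1", "ref_B", "91.0", "200", "18", "0", "1", "200", "5", "204", "1e-70", "280"),
    ("read_2", "ref_B", "99.0", "150", "1", "0", "1", "150", "40", "189", "1e-75", "290"),
]
TSV_TEXT = "\n".join("\t".join(row) for row in ROWS) + "\n"


def best_hits(tsv_text):
    """Return {query: (target, bitscore)} keeping the highest bit score per query."""
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        if row["query"] not in best or score > best[row["query"]][1]:
            best[row["query"]] = (row["target"], score)
    return best


print(best_hits(TSV_TEXT))
```

With a data-frame library the same step is usually a sort followed by a group-by, but the idea is the same.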

39:10

It's called MMseqs. Okay. This

39:12

is a very, very cool tool

39:15

that is written in, I think, in

39:17

C++ and it's really fast for that

39:19

kind of... So this code actually calls

39:21

that MMseqs and

39:23

then collects information. We use pandas

39:26

for any kind of data manipulation,

39:28

for example, getting the alignment results

39:30

and using that information to

39:33

draw any sort of

39:35

conclusions. And then finally we just connect

39:37

it to a Dash app, like

39:39

that in the Plotly world. Yeah,

39:41

yeah, sure. And it

39:43

finally creates a Dash application that shows

39:46

the simulation results and this is an

39:48

interactive web page. So for

39:50

example, different parameters could be changed. What

39:52

happens if we increase the temperature? What

39:54

is the effect of increasing temperature

39:57

on methane production? So when you

39:59

change that parameter of temperature,

40:01

it will change the results and

40:03

will show the results like

40:05

it updates the page, basically. Yeah,

40:08

I found that should be a very

40:10

good complement to this project. Yeah I'm

40:13

a big fan of visualizations and that's

40:15

a great project because it includes

40:17

so much of the underlying work

40:19

that you can kind of again host it and

40:22

get it posted there. I'm wondering a

40:24

little bit about the data that is involved

40:26

there. Are you using, you know, what

40:28

kind of database? Where is

40:30

all this data stored that you're accessing

40:32

and running through the system? Yes,

40:35

so as I said it

40:37

has different modules. For each module it could

40:39

be different. Okay. Most

40:41

of the databases that we

40:43

talk about here in this

40:46

project, they're usually small. We

40:48

intentionally kept them small. Okay.

40:50

To be fast because since we are

40:52

only focusing on anaerobic digestion we may

40:54

not need all

40:57

the information from different ecosystems and

40:59

because of that we are just

41:01

using a flat file which is

41:03

a common format called FASTA

41:05

in bioinformatics which is a fancy

41:07

text file again in the key

41:09

value format. So you

41:12

have a key and then you have a

41:14

value. So your keys are just a line that

41:16

starts with a specific character, like a greater-than

41:19

sign, and then your sequence

41:21

starts right below that line again so

41:23

the key will be that line that's

41:25

there with the character and whatever comes

41:27

underneath will be the information and that's how

41:29

we store these data. Okay.
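The key and value layout described here can be read with a few lines of Python. This is a minimal sketch, assuming the usual FASTA convention that a header line begins with ">" and the sequence may wrap over several lines; the two records are invented for the example.

```python
# Two invented records; the first sequence is wrapped over two lines.
FASTA_TEXT = """>seq1 example protein
ATGC
GGTA
>seq2
TTAA
"""


def parse_fasta(text):
    """Return {header: sequence}, joining wrapped sequence lines together."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # the key is the header line without the ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # the value is everything underneath
    return {key: "".join(parts) for key, parts in records.items()}


print(parse_fasta(FASTA_TEXT))
```

For files too large to hold in memory, the same loop can be turned into a generator that yields one record at a time.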

41:32

You

41:34

have another project that you

41:36

say that's still under heavy development and

41:38

will become public soon is

41:40

that the AD Toolbox, or is that the

41:42

next one, SPAM-DFBA? No, no,

41:45

the AD Toolbox is still something that

41:47

we are working on, especially these days,

41:49

and it will be out very

41:51

soon I think in a matter of

41:53

weeks. Okay. But my next

41:56

project is completely published and it's on

41:58

GitHub, and I tried

42:00

to have a good documentation

42:03

website for it. Cool. Available,

42:05

yes. So the next one. What's that

42:07

project do? So that

42:10

project is more like an

42:12

AI project. Okay. So

42:15

one thing is when you have these

42:17

pieces of DNA, you have some information,

42:19

but the problem is you still cannot

42:21

predict the behavior of cells because even

42:24

given that information, there are so

42:27

many different ways that microbes can

42:29

behave given their DNA. For example,

42:31

if you consider a complex piping system,

42:33

which valves they should turn, that

42:36

information is not in

42:38

the DNA. So well, I

42:40

mean, at least it's not easy to

42:42

extract that information. So how

42:44

microbes regulate their behavior is

42:46

something that is a really

42:49

open problem in this field.

42:51

Okay. This tool is

42:53

something that tries a technique called

42:56

reinforcement learning where all

42:58

different trajectories for behavior of

43:00

a microbe is tried. And

43:03

then based on trial and error, these

43:05

microbes try to improve their behaviors. And

43:07

the reason that you think that this will

43:10

work is because, well, microbes evolve really fast.

43:12

Like, you can see the

43:14

microbes evolve in a few generations.

43:16

Okay. So what happens is

43:18

the microbes just evolve, they adapt their strategies

43:20

and maybe something that we are

43:22

all familiar with is different, for example, strains

43:24

of the COVID virus. You

43:27

see that at some point, some strain

43:29

comes out that acts a little different,

43:31

maybe more contagious. It's just because they're

43:33

rapidly changing and that change gets reflected.

43:35

I mean, microbes are more complex and

43:37

so the

43:39

problem becomes more complicated. But this

43:41

tool is basically some artificial intelligence

43:43

technique to find how the behavior

43:45

of these microbes will converge

43:47

to a specific point that is determined

43:50

by the evolution of that organism.
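The trial-and-error idea can be shown with a deliberately tiny Python sketch. This is not the method the project uses, which involves neural networks; it is a much simpler stand-in (an epsilon-greedy choice among a few fixed options) meant only to show behavior converging on whatever pays off best. The strategy names and payoffs are hypothetical.

```python
import random


def learn_by_trial_and_error(payoffs, rounds=200, epsilon=0.1, seed=0):
    """Try strategies, track the average payoff of each, and favor the best one.

    payoffs maps a strategy name to a fixed reward (hypothetical values).
    Returns the strategy judged best and how often each one was tried.
    """
    rng = random.Random(seed)
    strategies = list(payoffs)
    estimates = {name: 0.0 for name in strategies}
    counts = {name: 0 for name in strategies}
    for step in range(rounds):
        if step < len(strategies):
            choice = strategies[step]  # try every strategy once first
        elif rng.random() < epsilon:
            choice = rng.choice(strategies)  # occasionally explore at random
        else:
            choice = max(strategies, key=estimates.get)  # otherwise exploit
        counts[choice] += 1
        # Running average of the payoff observed for this strategy.
        estimates[choice] += (payoffs[choice] - estimates[choice]) / counts[choice]
    return max(strategies, key=estimates.get), counts


best, tries = learn_by_trial_and_error(
    {"strategy_a": 0.2, "strategy_b": 1.0, "strategy_c": 0.5}
)
print(best, tries)
```

In a real setting the payoff would depend on the environment and on what the other microbes are doing, which is what makes the full problem hard.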

43:53

And here we use a lot

43:55

of neural network packages like PyTorch

43:57

and also the Ray library for

43:59

parallelization. Okay. Drilling

44:02

into like things like the hardware and how you're

44:04

running these things. I mentioned

44:07

time to time on the show that I tried

44:09

to run projects that I want to feature on

44:12

the show to kind of showcase them. Say, oh,

44:14

this seems like a really cool project and it

44:16

might use something like PyTorch or use some other

44:18

big library like that. And I have a hard

44:21

time getting them set up very often. And so

44:24

I feel like it works best sometimes to

44:26

like have it as a container or some

44:28

other kind of environment. So I'm wondering like

44:30

how are you running those types of experiments

44:32

and what type of machine is it running

44:35

on? So for this one,

44:38

for some of the test cases, it depends

44:40

on the test case, how complex and hard

44:42

it is to make those

44:44

simulations. If it's for the test cases

44:46

that I have on the documentation website,

44:49

you can run it on a simple machine. I

44:51

have a Mac M1 machine, which is great.

44:54

And it suffices for those kind

44:56

of applications. But for bigger projects,

44:58

we move to a supercomputer that

45:00

we have in Colorado that is

45:02

shared between the universities here. It's

45:04

called Alpine. Okay. It's shared between

45:06

CSU and Colorado University at Boulder.

45:09

And I think one more, which

45:11

I don't recall right now. These

45:13

are sometimes you can really get

45:15

big resources from this supercomputer. And

45:17

then, for example, for our assembly,

45:20

what we do is I usually

45:22

request for two terabytes of random

45:24

access memory, which is really high

45:27

and could not be done. Yeah,

45:29

definitely not not on my machine.

45:31

So and then then what

45:34

you do there is it's a Linux

45:36

system. You create your virtual environment, install

45:38

the packages that you want. And then

45:40

the code for me, I use this

45:42

approach that VS Code can connect to a

45:44

remote server. And I just type in

45:46

the code that I want and you

45:49

can debug that in a remote

45:51

server. And finally, when it's ready, I just

45:53

run the project on the cluster and

45:55

get the results back and do

45:58

the simpler data analysis on my

46:00

own personal computer. Okay,

46:02

yeah, I was wondering about that. Having

46:05

these resources that are, again, university

46:07

scale, which is kind of fun.

46:09

So I'll include links

46:11

for all these different projects.

46:13

You mentioned a couple times

46:15

the documentation of that particular project.

46:18

What are the tools that you're using to help

46:20

you document that? Ah, yes,

46:22

I use MkDocs

46:25

for building the documentation website,

46:27

and I tried to have

46:30

the, for example, doctests and

46:32

also, yeah, all the docstrings

46:34

for every function and class in

46:36

the script to be

46:38

as clear as possible. And, yeah,

46:41

I love MkDocs. I think

46:43

it makes the whole documentation

46:45

a lot easier and also fun.

46:47

Yeah, that's really good. So, yeah,

46:50

we have a couple courses

46:52

that touch on that, and it's

46:54

a nice way to kind of get it

46:56

going, and it definitely assists a

46:58

lot of that process. Again, you don't have

47:01

to become like a web developer to

47:03

build the docs site, which is nice. Yeah.

47:06

So, Parsa, I have these questions I ask everybody

47:08

who comes on the show, and the first one is:

47:10

what's something that you're excited about that's happening in the

47:12

world of Python right now? So,

47:15

for me, there's this package called

47:17

scikit-bio. I look at

47:19

it, it's coming up, and I

47:21

think it's not at version one yet. I

47:24

think it could help

47:26

all the Python users

47:29

in our field because most

47:31

of these tools exist in

47:33

R, and it's really

47:35

good to have that toolbox

47:37

in Python as well, because

47:39

we have everything in Python. So

47:41

it's just sometimes, for these statistical

47:43

tests, for example, we need to

47:45

go to R, and it could

47:47

be like the time that we

47:49

need to spend to learn a

47:51

new programming language could be spent on something

47:55

maybe more efficient. And I think

47:57

these packages help a lot. And

47:59

from what I see, a lot of the things

48:01

that have been missing have been added to

48:04

the scikit-bio package, and I'm really excited

48:06

about it being released. Yeah, cool.

48:09

What's something that you want to learn next? Again, this

48:11

doesn't have to be programming, but

48:13

is there something that you're interested in learning? Yeah.

48:16

So for me, maybe

48:18

something that I don't have that

48:20

much experience with, and I like

48:22

to learn more about it

48:24

is how to work on different

48:27

parts of the project in a team, because

48:29

mostly what I have been doing as a

48:31

researcher has been working alone on my script.

48:34

And of course we use GitHub, but,

48:36

it's different when it comes to

48:38

multiple people collaborating on the same project

48:41

and as an open source project. And

48:43

I think this is something that I really

48:46

want to get into, to contribute

48:48

to open source projects, at least

48:50

ones that are in our field.

48:52

And I think I can maybe

48:54

positively contribute there. Yeah. Yeah.

48:57

I had a couple of shows recently about sort

48:59

of inroads and ways to kind of get involved.

49:02

And I wonder if certain conferences might

49:04

be a chance to be able to sit down

49:06

with some other people and look

49:08

at collaborating on it. That's great. Yeah.

49:10

You already got kind of a good resume going with,

49:12

what you're working on. So,

49:16

how can people follow the work that you do online?

49:19

Anything related to the code,

49:21

we usually publish the code

49:23

on GitHub. So my GitHub

49:25

repository, I usually

49:27

post them on my GitHub as well. So

49:29

we have this GitHub page or

49:32

account for our lab that we use.

49:35

And then all the projects are on that.

49:37

But when it is published, I also post

49:39

it on my own account as well. I

49:41

pin it on my account. So

49:44

that's how the new project, but also on

49:46

LinkedIn and other social media

49:48

and some of them I'm active, especially

49:50

LinkedIn, I announce all the new projects

49:53

there as well. Nice. I'll

49:55

include all the links for all

49:57

those repositories and your LinkedIn. Well

50:00

thanks Parsa, it's been really fantastic to talk to you

50:02

about all this stuff. Thank you Chris, it's been really

50:04

fun to talk to you as well. I

50:11

want to thank Parsa Ghadermazi for coming on the

50:13

show this week. And

50:15

I want to thank you for listening to the Real

50:17

Python Podcast. Make sure that you click

50:20

that follow button in your podcast player, and if

50:22

you see a subscribe button somewhere, remember

50:24

that the Real Python Podcast is free. If

50:27

you like the show, please leave us a review. You

50:29

can find show notes with links to all

50:31

the topics we spoke about inside your podcast

50:34

player or at realpython.com/podcast. And

50:36

while you're there, you can leave us

50:38

a question or a topic idea. I've

50:41

been your host, Christopher Bailey, and I look

50:43

forward to talking to you soon.
