Practical Use for AI and Machine Learning in Science

Released Monday, 24th June 2024

Episode Transcript
0:00

It feels like everything today is AI: artificial intelligence this, artificial intelligence that, large language models, ChatGPT. It's just unending, and yet I think people are having a hard time finding really practical uses, on a day-to-day basis, for how they can actually take advantage of this technology. I mean, there are only so many AI-generated recipes or cat memes that a person needs. But I think that the real promise of this technology is in science. You can take sciences that generate enormous amounts of data, you can feed them into this transformer architecture, and then you can start to make predictions that will be useful to scientists. And we're really at the nascent stages of this. Everybody's too busy building chatbots and not busy enough creating science bots, and hopefully that will change. So my guest today is Dr. Michael Smith. He's a researcher at Aspia Space and UniverseTBD, and he is heading up the AstroPT and EarthPT projects. And these are exactly what I said: you take enormous amounts of climate data, Earth data, astronomy data, you feed them into a transformer architecture, and you then generate next tokens. But the next tokens are things like the weather, or galaxies, and it's a really interesting, fascinating project. And they need help. They need more people to get involved, especially people who are programmers. So enjoy this fascinating interview with Dr. Michael Smith.

1:40

Michael, we're all pretty familiar at this point with the promises of ChatGPT: large language models that can produce text like fancy autocomplete, or can create images. But as an astronomer, as a scientist, how do you look at the technology of these large language models, or the transformer architecture, and think about what kinds of possibilities are out there for science?

2:10

So I think the exciting thing for me is finding a nice way to compress all of the data out there. You can see this with LLMs, where they're managing to compress text in a useful way for people with the autoregressive architecture. For science, we can do something similar by getting a load of scientific information and feeding it into a similar architecture to the LLM, compressing it in that way, and then using that compressed embedding space to make scientific progress. So that's the thing that excites me most: taking similar architectures that have been proven in the textual domain and then applying them in a similar but adjacent way for scientific use cases.

2:54

But the experience that we've all had using, say, ChatGPT, where we ask it to write us a poem about, I don't know, your morning commute to work or whatever: we know that it's been fed just an enormous amount of data from the internet, which has been run through this transformer architecture, which then makes this predictive next-token model that tries to bring up the next text that it's likely to see. And somehow, magically, this creates poetry or whatever. It's not great, but at least it's doing it, which is kind of amazing. But so, if the raw material for ChatGPT is the text on the internet, what is the raw material for, say, an astronomy-based large language model?

model? So I

3:47

would say it can be anything

3:49

astronomy-based. So we've been playing around

3:51

with the Astra-PT model by feeding

3:53

in galaxy observations taken straight from

3:55

the telescope, taken

3:58

straight from the Dark Energy Survey. telescope

4:00

and you just take the data from this, you put

4:02

it in, you try and predict the next token

4:05

or patch in a series of patches and

4:09

the model learns some

4:12

underlying physical properties from this. But we don't

4:14

have to stop at galaxies, we can feed

4:16

in time series of stellar

4:18

events. So

4:21

if you have a strange stellar object

4:23

that's doing a strange time series of

4:26

brightness over time, you can feed this into

4:29

the model as well and it will learn

4:31

something fundamental about how the

4:33

star is performing because it

4:35

needs to learn to predict the next token in the

4:39

time series. So

4:42

any observation that would

4:44

be difficult to

4:46

predict is useful to feed into a

4:49

model like this because in predicting this

4:52

next step, it's learning something fundamental about

4:54

the physics behind the scenes, which I

4:56

find very cool. I
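To make the next-token idea concrete for images, here is a minimal sketch of autoregressive next-patch prediction with PyTorch. This is not the AstroPT code; the patch size, model width, and the random stand-in cutout are all placeholder assumptions.

```python
# Minimal sketch of next-patch prediction on an image, in the spirit of
# AstroPT (not the project's actual code; sizes and data are placeholders).
import torch
import torch.nn as nn

PATCH, DIM = 16, 128                      # 16x16 pixel patches, embedding width

def patchify(img):                        # img: (C, H, W) -> (num_patches, C*PATCH*PATCH)
    c, h, w = img.shape
    p = img.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    return p.reshape(c, -1, PATCH * PATCH).permute(1, 0, 2).reshape(-1, c * PATCH * PATCH)

embed = nn.Linear(3 * PATCH * PATCH, DIM)        # project raw pixels to tokens
decoder = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
head = nn.Linear(DIM, 3 * PATCH * PATCH)         # regress the next patch's raw pixels
opt = torch.optim.Adam([*embed.parameters(), *decoder.parameters(), *head.parameters()], lr=1e-4)

img = torch.rand(3, 64, 64)                      # stand-in for one galaxy cutout
seq = patchify(img).unsqueeze(0)                 # (1, 16, 768): the patch "sentence"
tokens = embed(seq[:, :-1])                      # feed patches 0..n-2
mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
out = decoder(tokens, src_mask=mask)             # causal: patch i sees only patches <= i
loss = nn.functional.mse_loss(head(out), seq[:, 1:])   # predict patch i+1 from the prefix
loss.backward(); opt.step()
print(float(loss))
```

The point is only the shape of the setup: patches become tokens, a causal mask keeps each patch from seeing its future, and the loss is a regression on the next patch's pixels.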

4:59

I know there's been similar work done in weather modeling, where the traditional method is that you'd have to come up with these enormous, very complicated supercomputers that would read in all of the butterfly wings flapping across various places and then try to predict the future rainfall in some other part of the world. And now they just take a bunch of pictures of the world, and the AI just predicts what the next frame, the next token, is going to be. If the tokens are pictures of planet Earth, then it knows how to predict future tokens, and it's much more energy efficient and so on. So let's go back to this idea then. You've got DESI, the Dark Energy Spectroscopic Instrument, and you've got millions of images of galaxies that have been taken. What specifically are you doing to prepare this data for AstroPT?

5:53

So that's the exciting thing: we don't have to do that much preparation. We can take the... Yeah, it's cool, right? ...take the raw observations, do a little bit of a quality cut, so you only take the good, high-quality, low-noise galaxies, for example, or low-noise time series or whatever, and then you can feed them straight into the model, and the model can separate out the useful information from the not-useful information without us having to do it manually. So we're cutting out a lot of the time-consuming manual process here by feeding it straight into the model and asking it to sort out the problem itself.
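As a rough illustration of the kind of quality cut he mentions, here is a toy sketch; the column names and threshold are invented, not taken from the AstroPT pipeline.

```python
# Toy quality cut: keep only low-noise objects before training.
# Column names and the SNR threshold are illustrative, not from the project.
import numpy as np

catalog = np.array([(1, 12.3), (2, 3.1), (3, 25.0), (4, 0.8)],
                   dtype=[("object_id", "i8"), ("snr", "f8")])

SNR_MIN = 5.0                                  # arbitrary threshold for "good" data
good = catalog[catalog["snr"] >= SNR_MIN]      # boolean-mask selection
print(good["object_id"])                       # -> [1 3]
```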

6:32

Yeah, and just to go back to your comment about the Earth observation climate models that you were just talking about: we actually applied the AstroPT model, the large observation model, to Earth observation time series as well, and it works there too. That's also a publication; we call it EarthPT, and it's exactly the same model under the hood, just on a different modality. So you can throw whatever you like at this model, because it will work it out automatically, which is very neat.

7:09

Now, with Earth it makes sense, because Earth goes through this cycle: you know that you're going to have weather patterns form and evolve and turn into other things around planet Earth, and different features, and you're not going to get too far out of the norm. But with galaxies: say you feed two million galaxies into this system. How useful is it to predict the next galaxy?

it so this is something

7:36

i found quite interesting in doing

7:38

this work and it was a little bit surprising to me so i

7:41

in in the paper there's a figure and it shows

7:43

the loss per token fed

7:46

and it asymptotes towards a flat line at some

7:48

point and it's earlier than i would expect and

7:51

i think this is because galaxies are

7:53

less information dense than earth observation for

7:56

example so in our earth observation model

7:58

the line asymptoted lower so it's It

8:00

learns more from the dataset. But

8:03

I think in astronomy, you can overcome this

8:05

with the inherent multimodality of it. So if

8:07

you have a galaxy observation in RGB bands,

8:10

GRZ, and you

8:12

would also have a corresponding spectra, which is

8:15

like a 1D description of the galaxy, you

8:17

can feed this in together and then you

8:19

would get more information compared

8:21

to just the GRZ bands. And

8:24

you can also go away and get some other modalities

8:26

too. And this would give the model more to learn

8:29

physically about the universe. So

8:32

it needs to do more. So the loss

8:34

should continue decreasing as it

8:36

learns it. So it should have

8:38

more to learn, yeah. So

8:41

So you could give it, say, a million galaxy and spectroscopy pairs, and then you give it a galaxy that you don't know the spectroscopy data on and ask it to guess what it thinks the chemical signature, the absorption lines, and so forth of that galaxy are going to be, and then find out how right it was.

9:03

Yeah, that's one scientific use case we're trying to get started now. And there's also some upcoming work on a giant dataset of all modalities of astronomy. Hopefully it'll be out in the next couple of weeks or so. I can, I don't know, show it to you, send it to you. But yeah, it's very exciting stuff.

9:24

But what's the gist of that? What are you hoping to do? Or are you under embargo still?

9:29

No, not under embargo. It's just a huge dataset of galaxies, time series of stars, time series of any astronomy observations, galaxies from several different instruments, plus some tabular data too. And we're calling it the Multimodal Universe. It's just a huge dataset. If you know EleutherAI's Pile from the text domain, it's kind of like that, but for astronomy. It's going to be all out in the open so people can go and play with it. I'm very excited about this; I think it should supercharge research into these large astronomy models.

models. So just to sort

10:05

of explain this, you've

10:07

gone into every available data

10:10

source, pictures from DESI,

10:13

from other telescopes, spectroscopy data,

10:15

things from Gaia, everything

10:17

you can get your hands on, normalized

10:20

it, put it into one big

10:22

database that

10:24

can then be fed into various large

10:26

language models and people want to try

10:28

to come up with

10:32

scientific uses of that. Exactly

10:34

the case, but not a large

10:36

language model. You need to do some tokenization

10:39

to make it work for a large language model,

10:41

but a large observation model like Earth PT, you

10:43

can see it straight in, yeah. Like a

10:45

transformer model? Yeah, a transformer model, yeah. It'd be

10:48

perfect, yeah. It should just work. Right, right, okay.

10:50

So in the AstroPT paper, which people should definitely check out, you showed some examples. I saw you had a grid of 15 out of 16 tiles of a galaxy, and then you had the software try to draw the last piece. So how would that be useful scientifically?

11:14

So that's the surrogate task, and the reason behind it is just to give the model something to learn that's easy to define. If we predict the next square in the sequence, the model needs to learn something about the underlying galaxy and the underlying physics to predict that next square. It's not so useful in itself scientifically, but once you train the model on this, you can take the embeddings in the transformer, from an intermediate layer, and then train a linear probe, which is just a linear regression, and try to predict some scientifically useful properties of the galaxy. This could be the color, or the magnitude in certain bands, or the redshift (how far away it is), or something like the morphology of the galaxy, which is also scientifically useful. And the exciting thing we found here is that if you have a bigger model, it's better at predicting these downstream tasks, even though it's only trained on predicting the next square in the galaxy sequence. And this is something that we also see in large language modeling. So I found it very exciting to see this connection between large language modeling and this large transformer model: it's the same model under the hood, but a completely different modality.
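Since a linear probe really is just a linear regression fitted on frozen embeddings, a minimal sketch looks like this; the embeddings and redshifts below are random stand-ins for what would actually come out of the trained model and a catalog.

```python
# Linear probe sketch: fit a plain linear regression from frozen transformer
# embeddings to a physical property. Embeddings and redshifts are random
# stand-ins here; in practice they'd come from the trained model and a catalog.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 128))          # one 128-d embedding per galaxy
redshift = embeddings[:, 0] * 0.1 + 0.5 + rng.normal(scale=0.01, size=5000)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, redshift, random_state=0)
probe = LinearRegression().fit(X_tr, y_tr)        # the transformer itself stays frozen
print("probe R^2:", probe.score(X_te, y_te))      # how much physics the embedding holds
```

The probe's quality is the measure here: if a one-layer regression can read a property out of the embedding, the pretraining must have encoded it.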

12:34

Right, right, right. I mean, we see scaling laws; we see how, with just more compute, better data, and more time spent actually training the model, it starts not only performing the task that you expect, but actually developing meta-skills that underlie the task itself. And I can think of other examples. JWST only has so much time to image galaxies at really high resolution, but then you've got a better background coming from, say, DESI or Euclid, where you're going to get these galaxies, but they're going to look pretty crappy, because those instruments just don't have the same kind of capability as Webb. And so, theoretically, you could match them up and say: here's the galaxy by Euclid, here's the one by Webb; here's the one by Euclid, here's the one by Webb. Okay, here's the one by Euclid. What would the Webb one look like? Zoom in and enhance.

13:39

Yeah, yeah, yeah, you could definitely do this. This is something I've been working on these past couple of weeks: figuring out a way to match these pairs together, and then maybe training a diffusion model on the tokens learned from the matched pairs, and then diffusing into Webb, or whichever one you want to diffuse into. It would be very cool to implement this.
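The pair-matching step he mentions is commonly done by cross-matching sky coordinates; here is a small sketch using astropy's catalog matching, with invented coordinates and an arbitrary match radius.

```python
# Sketch of matching observations of the same objects across two surveys by
# sky position, one way to build the Euclid/Webb-style pairs described above.
# Coordinates and the match radius are invented for illustration.
from astropy import units as u
from astropy.coordinates import SkyCoord

survey_a = SkyCoord(ra=[10.0, 45.2, 120.5] * u.deg, dec=[-5.0, 2.1, 33.3] * u.deg)
survey_b = SkyCoord(ra=[10.0001, 120.5001] * u.deg, dec=[-5.0001, 33.3002] * u.deg)

idx, sep2d, _ = survey_b.match_to_catalog_sky(survey_a)   # nearest neighbour in A for each B
pairs = [(b, int(a)) for b, (a, s) in enumerate(zip(idx, sep2d))
         if s < 1.0 * u.arcsec]                           # keep only tight matches
print(pairs)   # -> [(0, 0), (1, 2)]
```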

14:03

Wow. Yeah. Diffuse into spectra. So, like Midjourney or whatever, where you ask it to draw you a picture of, I don't know, whatever: it would be diffusing an image that it predicts based on that underlying data. And obviously people are going to say, well, it could be hallucinating. So what is the scientific value? If there's a potential that it's hallucinating, then the error starts to creep in, right?

14:44

So I think the scientific value here is maybe to predict rare objects, or to build a database of rare objects. Even if it's hallucinating, if you've only got one example of an object, say a green bean galaxy, then you can diffuse a thousand green bean galaxies to train a different model, so you can go and search for more green bean galaxies. And it doesn't have to be green bean galaxies; it works for any rare object. That's probably one scientific use case for a model like this.

15:15

Yeah, right. Or you feed it millions of galaxies, and it's properly drawing nice galaxies, and then it draws a Super Star Destroyer, and you're like: wait a minute, we should maybe image that galaxy, see what's up with that. You know, is that a Dyson sphere? Is that a galaxy far, far away? It's really interesting. Once you start to see the capability, the potential of this, it really sounds to me like we don't even know what's possible, and so now you just look for data to consume. So is there a lot of unused data out there, do you think?

think? For astronomy, definitely, which

15:55

is one of the reason we've been doing this multimodal unit of

15:57

the world. universe

16:00

project is so that we can get this data

16:02

nicely packaged so people can easily take it and

16:04

use it. At the

16:06

moment, there's different

16:08

groups releasing data all openly, which is very nice,

16:10

but it's more difficult to access if you're not

16:12

in astronomy and you don't know exactly how to

16:16

go and get it. But it's

16:18

multimodal universe project, it's going to be on hugging face, so

16:20

you can just go on there and access it. So that's

16:22

one of the things I'd like

16:24

to do as well is to democratize access

16:26

to astronomy data because I think it would

16:29

be a very nice thing to do and

16:31

it would be beneficial both for

16:33

astronomy and for machine learning research. So I think

16:35

it's a very good source

16:37

of multimodal data that maybe

16:40

isn't out there yet in a similar

16:42

format. And so what work needs to
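Once the dataset is on Hugging Face, access should look like the standard `datasets` workflow; a sketch follows, with a hypothetical repository name that may not match the final release.

```python
# Sketch of pulling an astronomy dataset from the Hugging Face Hub with the
# standard `datasets` library. The repository name is hypothetical; check the
# Multimodal Universe release for the real identifier.
from datasets import load_dataset

ds = load_dataset("MultimodalUniverse/galaxies", split="train", streaming=True)
for example in ds.take(3):        # stream a few records without downloading everything
    print(example.keys())
```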

16:44

be done to put this into

16:47

a situation that maybe scientists

16:49

could get access to it? So

16:56

So I suppose astronomers have access to it already, but if you don't have astronomy training, you don't know exactly how to use and process this data. So I think maybe a lot of documentation needs to be written, just packaging it up nicely, that kind of thing.

kind of thing. Well, yeah,

17:15

but let's say that I'm an astronomer.

17:17

I've been working on, I don't

17:20

know, I write Python scripts. I've been

17:23

accessing Gaia data and pulling

17:25

down tens of thousands of

17:28

white dwarf observations and I'm looking for

17:31

some kind of needle

17:33

in a haystack to try and sort of do

17:35

some scientific study on. And let's

17:38

say I wanted to say, well, what if I fed

17:40

all that into astro PT and then

17:43

looked for some kind of, where

17:45

would I go to actually start to do this work

17:47

if I want to try and play around with these

17:49

models? So

17:51

So I suppose the first step would be going to the open GitHub; the code is all open source, and it's ready to go. You can go there, download the GitHub code, feed in your dataset, and it should just work. There's also a Discord server, where we're all chatting about AstroPT and large observation models, and other deep learning for astronomy stuff. So yeah.

18:19

Yeah, we'll put some links to that once you've sent them to us. Yeah, yeah. So then, would I be fine-tuning on existing data? Would I be training on just my 70,000 white dwarf observations and nothing else in the system, or would I be looking to fine-tune an existing model?

18:40

Ideally, I'd like to get to the point where you just take the AstroPT model and fine-tune it on your specific dataset, so it learns a little bit more astro knowledge on your specifics, kind of like you would do with a Llama model. Right now it's only been trained on galaxy observations, but the next step is going to be multimodal training, so we're going to have a model that knows about other astro modalities, which you can then take and fine-tune on your specific dataset. That's in the works, but it's not quite there yet. So right now you'd probably be training from scratch on your dataset.
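The fine-tuning flow he describes would presumably follow the usual PyTorch pattern: load pretrained weights, optionally freeze the backbone, and continue training on your own observations. Here is a sketch with stand-ins; the commented-out checkpoint path, the tiny backbone, and the data are all hypothetical.

```python
# Fine-tuning pattern sketch: freeze a pretrained backbone, train a new head
# on your own data. The model, checkpoint path, and data are stand-ins; the
# real AstroPT weights and architecture come from the project's GitHub release.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(768, 128), nn.GELU())   # stand-in for the pretrained trunk
# backbone.load_state_dict(torch.load("astropt_base.pt"))  # hypothetical checkpoint path
for p in backbone.parameters():
    p.requires_grad = False                                # freeze the pretrained knowledge

head = nn.Linear(128, 1)                                   # new task head, e.g. a stellar property
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.rand(64, 768)                                    # stand-in for your white dwarf inputs
y = torch.rand(64, 1)                                      # stand-in labels
for _ in range(10):                                        # tiny training loop
    loss = nn.functional.mse_loss(head(backbone(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```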

19:13

Right. And so when you say multimodality, you're not thinking, oh, I want text and pictures. You're thinking: I want spectroscopic data, I want X-ray observations, I want neutrino observations, gravitational wave observations, and I'll just keep feeding that data in until interesting insights start to pop out as you query the data.

19:37

That's exactly it. Yeah, just feeding in all the data. It learns the physics, and at the end of it you have this nice model that knows all of this astro knowledge, which you can then extract with linear probes or whatever you want to do. Yeah, that's the dream. That's where I'd like to get to, which is why this is an open source project where everyone can join, and we can train this model together with data that you might have.

20:00

There's been a lot of interesting work; like, the SETI researchers have been able to get a pipeline from radio observations to pull out the data that is interesting to them, to search through for any evidence of a technological signature in the radio observations. Is getting data from these instruments and observatories difficult right now, or is it relatively straightforward to take it all and feed it in? I mean, we're seeing all kinds of legal issues over in the regular world, as the machine learning companies are having to face copyright issues and so on. Is it a lot simpler in the scientific realm?

20:41

I would say it's simpler, yeah, because you don't have the problem of copyright infringement, or other things you might have in the textual realm. So it's nicer in that regard. Yeah, I would say it's a much nicer playground for developing these large models. There are fewer ethical questions in my head when I'm using these models, so that's what I like about them: you can just use astronomy data like this.

21:16

So, in your perfect world, there will be pipelines of data coming from all of the telescopes, all of the rovers, all of the... I don't know, everything: all the gravitational wave observatories, it's all just being uploaded into this model. And then you are running some kind of training on a regular basis, kind of like how ChatGPT does. Do you see it as some kind of continual process, or more like: okay, now we're training version three with all the data up to, you know, this point?

21:50

think, uh, realistically,

21:53

it's probably going to be a version by version thing.

21:55

I don't think it'll be continual just because we're a

21:58

small group at universe TVD. But

22:01

it would be nice to get some

22:03

continual pre-training going as well, where you

22:05

could – there's a lot of research

22:07

on about this, about continual pre-training in

22:09

the language world. So it'd

22:11

be nice to be able to do something like this

22:13

for these large observation models, these large transformers that feed

22:15

in astronomy. If

22:17

we can do this, we could also have

22:20

a base model. You take it, you do

22:22

some continual pre-training on it, on your dataset

22:24

that you might have, your astro dataset. And

22:27

then I can imagine you pushing it back to the base

22:30

model repository or wherever we host the model.

22:32

You're like, oh, we've added this extra dataset,

22:34

we push it back. And it's

22:37

a nice way to collaborate and improve the

22:39

model continually. So that would be very cool

22:41

to do, I think. Yeah. Yeah.

22:43

Yeah. Yeah. And

22:46

And so, one of the really interesting things that's happened with large language models is various kinds of emergent behavior, where you train it enough and it starts to develop a theory of mind, where it can sort of put itself in the mind of another person and think what they're thinking. Are you seeing any glimmers of any kind of emergent behavior in understanding underlying physics and astronomy?

23:15

Yes. This is one of the cool things that was in the paper. Like I said before, we have these linear probes that take the embedding space and project it onto some scientific problem. And we found that certain galaxy morphology classifications emerge at a certain number of parameters; I think it was around 13 million parameters. You start seeing these questions being answered by the network. So if we can continue pushing this, it'd be interesting to see which questions it can and can't answer, at what point they start being answered by the model, and whether there's some sort of correlation there. We only tested it on maybe two dozen questions, but if we can get a load of these, maybe there's a correlation between the more difficult questions and the easier questions, where knowledge about the more difficult questions suddenly emerges in a predictable way.

24:09

So give me an example of a question. When you say a question... I mean, if it's just a bunch of pictures, can it only respond in pictures? Can it only respond in spectroscopic data? So give me an example of how you might ask a question and how it might give you the answer.

24:22

So we have some questions about galaxy morphology. These are from the Galaxy Zoo project, a citizen science project where citizen scientists look at lots of galaxies and answer questions about them, like: this galaxy has so many spiral arms, or this one is heavily barred, or this one has an artifact in it, which is some strange occurrence in the image. And we take these questions and encode the answers as a float value between zero and one, depending on how strong the answer is from the Galaxy Zoo citizen scientists. And we ask the linear probe model to answer these questions. So you can imagine the linear probe is taking the galaxy as an input, and then it's trying to predict how likely it is to be heavily barred, or how likely it is to have n spiral arms, or how likely it is to have a certain redshift, or a certain color, or a certain magnitude. These are the questions we ask it. We're not asking it in text with it answering in text; we ask it as a supervised question, like in classical machine learning.
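As a rough illustration of that encoding, here is a toy sketch, with invented vote counts, of turning Galaxy Zoo style answers into float targets in [0, 1] for a probe.

```python
# Toy encoding of Galaxy Zoo-style answers: each question becomes a float in
# [0, 1], the fraction of volunteers who answered "yes". Counts are invented.
votes = {"heavily_barred": (42, 58),      # (yes_votes, total_votes)
         "has_spiral_arms": (91, 100),
         "has_artifact": (3, 97)}

targets = {q: yes / total for q, (yes, total) in votes.items()}
print(targets)   # e.g. {'heavily_barred': 0.724...} -> regression targets for the probe
```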

25:48

Right, right. But essentially you're saying: what is the likelihood that a galaxy is going to have a thick bar? And then it's going to give you a distribution between zero and one.

40:00

...something about the underlying science, because otherwise it wouldn't be very good at predicting it. And if it's perfect at predicting it, it's going to be better at scientific analysis than a model that's very bad at predicting it. This interleaved, paired, multimodal dataset that we're talking about: would you do that with pre-training? Like, would you take your data, have a really smart but not necessarily great-at-astronomy LLM prepare the data in a format that an astronomy-related one could then feed on, to have it all nicely in the same structure?

40:42

It'd be nice. I imagine it would be, yeah; it'd be very nice to be able to do this. I think that's a good route to go down. But first we need to build the textual domain into these models. But yeah, in a perfect world, yeah.

40:56

It's been really interesting. Like, Microsoft released their Phi model, and it's surprisingly good for how small a model it is. And the answer appeared to be obsessing over the quality of the data: you don't just feed it all of Reddit, you don't just feed it all of Twitter. In fact, you have to go through, line by line, sentence by sentence, and make sure that you've got good stuff before you actually train, because otherwise it's garbage in, garbage out. But the question I was asking was more like this: you're having to scavenge from Gaia, from DESI, from Webb, and shortly from Vera Rubin. But if there was, say, the kind of data that would be really useful to learn from, could you sort of imagine a perfect dataset, and then we'll figure out how to get that data? What would be the ideal data?

41:52

So I would say the ideal data to train the model would be something that has a very high signal-to-noise ratio: a lot of important physics in the data, and not a lot of cruft, so not a lot of blank space or noise or whatever. That would be the perfect dataset to feed this. And I think we have so much data coming in in astronomy that we could do some pretty aggressive cuts to get this good, high-quality data if we had some people looking at it. And I think that's what they did with Phi as well: they cut it aggressively. Or no, I think they actually took synthetic data from some large language model and trained the Phi model on that. I don't think we can do that in our case. But we have all of this raw data from these telescopes that we can then cut, so that we have high signal and low noise. It would be much better for the model to train on this, so it's not learning things that aren't particularly relevant for the physics.

42:50

Yeah, right. So you're going three steps forward, one step back with high noise; it'd be better to just go three steps forward and no steps back with low noise. That's exactly it. And I guess the problem is always that all telescopes are going to be trying to observe at the very edge of what they are capable of observing. So at a certain point the noise makes the images unusable, and you're always hitting high noise, because you're trying to eke out the most possible science from your data.

43:23

So I guess if it's new, like a new type of galaxy, or the edge of our observable universe, and it still has high telescopic noise, I would argue that there's more information in that new galaxy compared to just another spiral galaxy. So: high signal to noise, where the signal is important or useful information compared to just blank space, but the noise maybe isn't telescopic noise. Yeah. Now, that's interesting.

43:57

Michael, what are you obsessed with right now? And...

46:01

...you know, like, I have a lot of people who listen to what I do, and they are in the computer field: they're working in machine learning, or they're working in various stuff, and they love astronomy, and they'd love to contribute. So, for the people who are listening to this interview who want to try and put all of that technology knowledge to science: what is the best way that they can get involved?

46:33

So they can join the UniverseTBD Discord, and I can give you the link for that. It's a Discord server that's open to all, like EleutherAI's, and we're trying to develop astronomy machine learning models there: not just AstroPT, we've also been trying to develop other models for astronomy, plus large language models that are trained on astro data, like arXiv papers and astro knowledge. So I think the best thing to do would be to join that Discord, play around a bit with our projects, see what's going on, and have a look at our code, because we're astronomers, we're not machine learning scientists, so our code is probably not the best.

47:16

Great. So this is the thing, right? There are all these machine learning and astronomy fans who can't wait to get involved in a project, and this sounds like a great project to get involved in. Well, Michael, good luck on unleashing this technology to come up with new theories about the universe. I look forward to the new discoveries.

47:38

Thank you. It was great talking.

47:40

I hope you enjoyed that interview with Dr. Michael Smith. Now I'm going to give you some more thoughts and feedback. But first, I'd like to thank our patrons. Thanks to Abe Kingston, Adam Schaeffer, Andrew Growth, David Gilton and David Matt's, Dennis Alberti, Dustin Cable, Jeremy Madder, Jim Burke, Jordan Young, Josh Schultz, Mods, Paul Robach, Stephen Krosocke, Stephen Fowler, Monday, and Vlad Shippelan, who support us at the Master of the Universe level, and all of our other supporters on Patreon. I've been waiting for something like this to come along. There's so much data; I talk about this all the time. You've got the Vera Rubin Observatory, which is going to be generating petabytes of data dumped directly onto the internet; all of this data coming from Gaia; all of this information coming from Euclid, from DESI. There's just too much data. And yet there are all of these mysteries, these questions that we have: What is dark matter? What is dark energy? Why is there more matter than antimatter in the universe? How did the first galaxies form? All of these questions. Data meets questions. And hopefully, finally, we're going to have some form of tool, some kind of machine learning system, that can actually help us make sense of all of this data that's being gathered by all these new observatories. And this sounds like the right direction to me. So, I don't know if you got the gist in the conversation, but Michael is definitely looking for more people to get involved in the project, especially the kinds of people who have technical knowledge: programming experience, machine learning, that kind of thing. And if that's you, and you've always wanted to get involved in an astronomy project, this is your chance. I'm going to put a link to the AstroPT paper in the show notes. I'm also going to put a link to the Discord chat that Michael was mentioning, so that you can go and join, say hi, and see how you can help out. And maybe we can have a computer help us solve some of the biggest problems in the universe. All right. We'll see you next time.
