Episode Transcript
0:00
It feels like everything today is AI
0:02
artificial intelligence this, artificial intelligence that, large
0:04
language models, ChatGPT. It's just
0:07
like unending and yet I think people
0:09
are having a hard time finding really
0:11
practical uses on a day to day
0:14
basis for how they can actually take
0:16
advantage of this technology. And
0:18
I mean there's so many artificial intelligence
0:20
recipes that a person needs or cat
0:23
memes. But I
0:25
think that the real promise
0:28
of this technology is in science.
0:30
You can take sciences that generate
0:32
enormous amounts of data. You can
0:35
feed them into this transformer architecture
0:37
and then you can start to
0:39
make predictions that will be useful
0:41
to scientists. And we're really at
0:44
the nascent stages of this. Everybody's
0:46
too busy building chat bots and
0:48
not busy enough creating science
0:52
bots and hopefully
0:54
that will change. So my guest
0:56
today is Dr. Michael Smith. He's
0:58
a researcher at Aspia Space and
1:00
Universe TBD and he is
1:03
heading up the Astro PT
1:05
and Earth PT projects.
1:07
And these are exactly what I
1:09
said that you are taking enormous
1:12
amounts of climate data, earth data,
1:14
astronomy data and you are feeding
1:16
them into a transformer
1:19
architecture and you
1:21
are then generating next tokens. But the
1:24
next tokens are things like the weather
1:26
or galaxies and it's sort
1:29
of a really interesting fascinating project. And
1:31
they need help. They need more
1:33
people to get involved, especially people who are
1:35
programmers. So enjoy this
1:37
fascinating interview with Dr. Michael
1:40
Smith. Michael, we're all pretty familiar
1:42
at this point with the promises of ChatGPT,
1:45
large language models that can
1:47
produce text like fancy
1:50
auto complete or can create
1:53
images. But as
1:56
an astronomer, as a scientist, how
1:58
do you look at the
2:00
technology of these large language models
2:02
or like the transformer architecture and
2:04
think about what kinds of possibilities
2:06
are out there for science. So
2:10
I think the exciting thing for me is
2:12
finding a nice way to compress
2:15
all of the data out there. So you can
2:17
see this with LLMs where they're managing to compress
2:19
text in a useful way for people with
2:23
the autoregressive architecture. For science, we can do
2:25
something similar by getting
2:27
a load of scientific information and feeding it
2:29
into a similar architecture to the LLM and
2:33
then compressing it in that way and
2:35
then using that compressed embedding
2:38
space to make scientific progress on
2:40
that. So that's the thing that
2:42
excites me most, taking similar
2:44
architectures that have been proven in the textual domain
2:46
and then applying them in
2:48
a similar but like
2:50
adjacent way for scientific use cases.
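To make that concrete, here is a minimal sketch (not the actual Astro PT code; shapes and patch size are illustrative) of the core idea: chop an observation into a sequence of patch "tokens" and set up the autoregressive task where the model sees every token except the last and must predict the sequence shifted by one.

```python
import numpy as np

def patchify(image, patch=8):
    """Split a square image into a sequence of flattened patches ("tokens")."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    tokens = []
    for r in range(rows):
        for c in range(cols):
            tokens.append(image[r*patch:(r+1)*patch, c*patch:(c+1)*patch].ravel())
    return np.stack(tokens)  # shape: (num_patches, patch*patch)

# A fake 64x64 "observation" stands in for a real galaxy cutout.
obs = np.random.rand(64, 64)
seq = patchify(obs)

# Autoregressive setup: the model is given tokens[:-1] and trained to
# predict tokens[1:], i.e. each position predicts the next patch.
inputs, targets = seq[:-1], seq[1:]
print(seq.shape, inputs.shape, targets.shape)  # (64, 64) (63, 64) (63, 64)
```

The same recipe works for a light curve or any other series: only the `patchify` step changes per modality.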
2:54
But the experience that we've
2:56
all had using, say, chat GPT where we
3:00
ask it to write us a poem about,
3:02
I don't
3:04
know, your morning commute to work or
3:06
whatever. And we know that it's been
3:09
fed on just an enormous amount of
3:11
data from the internet and then has
3:13
been run through this transformer architecture that
3:15
then makes this predictive next-token model to
3:18
be able to try and bring up
3:21
the next text that it's likely to
3:23
see. And somehow magically, this creates poetry
3:25
or whatever. It's not great, but at
3:28
least it's doing it, which is kind
3:30
of amazing. But so if
3:32
the raw material for chat
3:35
GPT is the
3:37
text on the internet, what is the
3:39
raw material for, like,
3:42
say, an astronomy-based large-language
3:44
model? So I
3:47
would say it can be anything
3:49
astronomy-based. So we've been playing around
3:51
with the Astro PT model by feeding
3:53
in galaxy observations taken straight from
3:55
the telescope, taken
3:58
straight from the Dark Energy Survey telescope,
4:00
and you just take the data from this, you put
4:02
it in, you try and predict the next token
4:05
or patch in a series of patches and
4:09
the model learns some
4:12
underlying physical properties from this. But we don't
4:14
have to stop at galaxies, we can feed
4:16
in time series of stellar
4:18
events. So
4:21
if you have a strange stellar object
4:23
that's doing a strange time series of
4:26
brightness over time, you can feed this into
4:29
the model as well and it will learn
4:31
something fundamental about how the
4:33
star is performing because it
4:35
needs to learn to predict the next token in the
4:39
time series. So
4:42
any observation that would
4:44
be difficult to
4:46
predict is useful to feed into a
4:49
model like this because in predicting this
4:52
next step, it's learning something fundamental about
4:54
the physics behind the scenes, which I
4:56
find very cool. I
4:59
know there's been similar work done in weather
5:01
modeling where the traditional method is you'd have
5:03
to come up with these enormous,
5:06
very complicated supercomputers that would
5:09
read in all of the butterfly wings
5:11
flapping across various places and then
5:13
try to predict the future rainfall in some
5:15
other part of the world. And now they
5:17
just take a bunch of pictures of
5:20
the world and then the AI just predicts
5:22
what the next frame is going to be,
5:24
the next token. If the tokens are pictures
5:27
of planet Earth, then it knows how
5:29
to predict future tokens and it's much
5:31
more energy efficient and so on. So let's
5:33
go back to this idea then. So you've
5:35
got DESI, the Dark Energy Spectroscopic Instrument,
5:39
and you've got millions
5:42
of images of galaxies that have
5:44
been taken. And so what
5:47
specifically are you doing to kind of prepare
5:49
this data for Astro PT? So
5:53
that's the exciting thing. We don't have to do
5:56
that much preparation. We can take the... Yeah,
5:58
it's cool, right? take the
6:01
raw observations. You do a
6:03
little bit of like a quality
6:05
cut so you only take the
6:07
good like high quality low
6:10
noise galaxies for example or low noise time
6:12
series or whatever and then you
6:14
can feed them straight into the model, and the model
6:17
can separate out the useful
6:19
information from the not-useful information
6:21
without us having to do it manually so
6:24
we're cutting out a lot of the time consuming manual
6:26
process here by feeding it straight into the model and
6:28
asking it to solve the problem. Yeah,
6:32
and just just to go back to
6:34
your comment about
6:36
the earth observation climate
6:41
models that you were
6:43
just talking about we actually applied the
6:45
Astro PT model the large observation model
6:48
there to earth observation time
6:51
series as well and it works there too
6:53
so that's also a publication
6:55
earth PT we call
6:57
it and it's exactly the same model under the
6:59
hood, just on a different modality,
7:02
so you can throw whatever you want at this model,
7:04
because it will sort it out automatically,
7:06
which is very neat. Now,
7:09
with earth it makes sense because earth goes
7:11
through this cycle that you know that you're
7:13
going to have weather patterns form and evolve
7:15
and they turn into other things around planet
7:17
earth and different features, and it is,
7:19
you know, you're not
7:21
going to get too far out of the norm. But
7:24
with galaxies like say you feed in
7:26
two million galaxies into this system it's
7:29
gonna... how useful is it
7:32
to predict the next galaxy?
7:34
So this is something
7:36
i found quite interesting in doing
7:38
this work, and it was a little bit surprising to me. So I,
7:41
in the paper there's a figure, and it shows
7:43
the loss per token fed
7:46
and it asymptotes towards a flat line at some
7:48
point, and it's earlier than I would expect, and
7:51
I think this is because galaxies are
7:53
less information dense than earth observation for
7:56
example so in our earth observation model
7:58
the line asymptoted lower, so it
8:00
learns more from the dataset. But
8:03
I think in astronomy, you can overcome this
8:05
with the inherent multimodality of it. So if
8:07
you have a galaxy observation in RGB bands,
8:10
GRZ, and you
8:12
would also have a corresponding spectrum, which is
8:15
like a 1D description of the galaxy, you
8:17
can feed this in together and then you
8:19
would get more information compared
8:21
to just the GRZ bands. And
8:24
you can also go away and get some other modalities
8:26
too. And this would give the model more to learn
8:29
physically about the universe. So
8:32
it needs to do more. So the loss
8:34
should continue decreasing as it
8:36
learns it. So it should have
8:38
more to learn, yeah. So
8:41
you could give it say a million
8:43
galaxies and then spectroscopy
8:46
pairs. And then you
8:48
give it a galaxy that you don't know
8:50
the spectroscopy data on it and ask it
8:52
to guess what it thinks is gonna
8:54
be the chemical signature, the absorption line, so on
8:57
and so forth of that galaxy and then find
8:59
out how right it
9:01
was. Yeah, that's one scientific
9:03
use case we're trying to get started
9:05
now. And there's also some upcoming work
9:08
of a giant dataset of all
9:11
modalities of astronomy. Hopefully it'll
9:13
be out in the next couple of weeks or so. I can,
9:16
I don't know, do something, show
9:18
it, send it to you. But
9:21
yeah, yeah, it's just so very exciting stuff.
9:24
But what's the gist of that? What are you
9:26
hoping to do? Or are you under embargo still?
9:29
No, not under embargo. So it's
9:31
just a huge dataset of galaxies,
9:34
time series of stars, time series
9:36
of any astronomy observations,
9:39
galaxies from several different instruments plus
9:42
some tabular data too. And
9:46
we're calling it the Multimodal Universe.
9:48
And yeah, it's just a huge dataset.
9:50
If you know EleutherAI's Pile from the textual
9:52
domain, it's kind of like this, but for astronomy. It's
9:54
gonna be all out in the open so people can
9:56
go and play with it. I'm
9:58
very excited about this. I think it should supercharge
10:01
research into these large astronomy
10:03
models. So just to sort
10:05
of explain this, you've
10:07
gone into every available data
10:10
source, pictures from DESI,
10:13
from other telescopes, spectroscopy data,
10:15
things from Gaia, everything
10:17
you can get your hands on, normalized
10:20
it, put it into one big
10:22
database that
10:24
can then be fed into various large
10:26
language models and people want to try
10:28
to come up with
10:32
scientific uses of that. Exactly
10:34
the case, but not a large
10:36
language model. You need to do some tokenization
10:39
to make it work for a large language model,
10:41
but a large observation model like Earth PT, you
10:43
can feed it straight in, yeah. Like a
10:45
transformer model? Yeah, a transformer model, yeah. It'd be
10:48
perfect, yeah. It should just work. Right, right, okay.
10:50
So in the
10:52
Astro PT paper, which people should definitely
10:55
check out, you showed some examples. I saw like
10:57
you had a grid of 15 out
10:59
of 16 tiles of
11:03
a galaxy and then you had
11:06
the software try to draw the last
11:09
piece. So how
11:11
would that be useful scientifically? So
11:14
that's the surrogate task. And the
11:16
reason behind it is just
11:19
to give the model something to
11:21
learn that's easy to define. So
11:25
if we predict the next square in the time series,
11:27
it needs to learn something about the underlying galaxy and
11:30
the underlying physics to predict the next square. It's
11:33
not so useful in itself scientifically, but once
11:35
you train the model on this, you can
11:37
take the
11:39
embeddings in the transformer like an
11:42
intermediate layer and then train
11:45
a linear probe, which is
11:48
just a linear regression and try and predict
11:50
some scientifically useful properties of the galaxy. So
11:53
this could be the color
11:55
or the magnitude in certain bands
11:57
or the redshift, how far away it is, or
12:01
something like the morphology of the galaxy, which
12:03
is also scientifically useful. And
12:05
the exciting thing we found here is, if you
12:07
have a bigger model, it's
12:10
better at predicting these downstream tasks,
12:12
even though it's only trained on the next square
12:15
in the time series, in
12:17
the galaxy prediction.
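The linear-probe step described here can be sketched in a few lines. This is not the paper's code: the embeddings below are synthetic stand-ins for vectors taken from an intermediate layer of the pretrained model, and the "redshift" is a made-up target that is, by construction, encoded in them. The probe itself is just ridge regression on the frozen embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen transformer embeddings: in practice one vector per
# galaxy, taken from an intermediate layer of the pretrained model.
n_galaxies, dim = 500, 32
embeddings = rng.normal(size=(n_galaxies, dim))

# Pretend a property (say, redshift) is a noisy linear function of the
# embedding -- the probe can only succeed if the embedding encodes it.
true_w = rng.normal(size=dim)
redshift = embeddings @ true_w + 0.1 * rng.normal(size=n_galaxies)

# "Linear probe" = ridge regression fitted on top of the frozen embeddings.
lam = 1e-3
A = embeddings.T @ embeddings + lam * np.eye(dim)
w = np.linalg.solve(A, embeddings.T @ redshift)

pred = embeddings @ w
r2 = 1 - np.var(redshift - pred) / np.var(redshift)
print("probe R^2:", r2)
```

If the probe's R^2 is high, the information was already sitting in the embedding space; the probe adds no capacity of its own, which is the point of the test.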
12:21
And this is something that we also see
12:23
in large language modeling. So I found that
12:25
very exciting to see this connection between large
12:27
language modeling and this large transformer model and
12:30
that's, it's the same model under the hood, but it's a
12:32
completely different modality. Right,
12:34
right, right. I mean, we see scaling laws, we
12:37
see how just more compute,
12:39
better data, more
12:42
time spent actually training
12:44
the model, it
12:46
starts to develop not
12:48
only sort of performing the task that you
12:50
expect, but it's actually starting to come
12:53
up with meta skills that
12:55
underlie the task itself. So,
12:58
and I
13:01
can think of other examples like, JWST
13:03
only has so much time to be
13:05
able to image galaxies at
13:08
really high resolution, but then you've got
13:10
a better background of
13:12
say, coming from DESI or from Euclid,
13:14
where you're gonna get these galaxies, but
13:16
they're gonna look pretty crappy because
13:19
it just doesn't have the same kind
13:21
of capability as Webb. And so
13:23
theoretically, you could match
13:25
them up and say, here's the galaxy
13:27
by Euclid, here's
13:30
the one by Webb, here's the one by Euclid, here's
13:32
the one by Webb. Okay, here's the one by Euclid.
13:34
What would the Webb one look like? Zoom
13:37
in and enhance. Yeah, yeah,
13:39
yeah, you could definitely do this. This is something
13:42
I've been working on these past couple of weeks,
13:44
just to figure
13:46
out a way so
13:49
that we can match these pairs together and then we
13:51
can say maybe train a diffusion
13:53
model on the tokens learned from some of
13:55
the matched pairs and then diffuse into Webb
13:57
or whichever you want, diffuse into.
14:01
like so that that would be very cool to implement this.
14:03
Wow. Yeah. Diffuse
14:06
into spectra. So like Midjourney or
14:08
whatever, where, you
14:10
know, you ask it to draw you a picture of, I
14:12
don't know, whatever.
14:14
You would be getting
14:18
it would be diffusing an image
14:22
that it would be predicting would be based
14:24
on that underlying data. Yeah,
14:27
obviously, people are
14:29
going to say, well, it could be hallucinating.
14:31
So where do you think
14:34
the scientific like is there? What is the scientific
14:37
value? If there's a potential that it's hallucinating,
14:39
you know, then the error starts to creep in, right?
14:44
So I think the scientific
14:46
value here is
14:48
maybe to predict or
14:50
to build a database of
14:52
rare objects. So even if it's hallucinating, if you've only
14:54
got one example of an object, say, a green
14:57
pea galaxy, and
15:00
then you can diffuse 1,000 green pea
15:02
galaxies to train a different model, so you can go and search
15:04
for more green peas. It doesn't have to be a green pea
15:07
galaxy, just any rare object. That's probably
15:09
one scientific use case for a model like this. Yeah,
15:15
right. Or you
15:17
feed it millions of galaxies, and it's
15:19
properly drawing nice galaxies, and then it draws
15:21
a Super Star
15:24
Destroyer, you're like, wait a minute, we should maybe
15:28
image that galaxy and see what's up with that. You know, is
15:30
that a Dyson sphere? Is that? Yeah, is that a galaxy far
15:34
far away? It's really interesting.
15:36
So once you sort of
15:40
start to see the capability of the potential of this, I mean,
15:42
it really sounds to me like we don't even know what's possible.
15:44
And so now you just look for data to consume. And
15:51
so is there a lot of unused data out there? Do you
15:53
think? For astronomy, definitely, which
15:55
is one of the reasons we've been doing this Multimodal
15:57
Universe
16:00
project is so that we can get this data
16:02
nicely packaged so people can easily take it and
16:04
use it. At the
16:06
moment, there's different
16:08
groups releasing data all openly, which is very nice,
16:10
but it's more difficult to access if you're not
16:12
in astronomy and you don't know exactly how to
16:16
go and get it. But with the
16:18
Multimodal Universe project, it's going to be on Hugging Face, so
16:20
you can just go on there and access it. So that's
16:22
one of the things I'd like
16:24
to do as well is to democratize access
16:26
to astronomy data because I think it would
16:29
be a very nice thing to do and
16:31
it would be beneficial both for
16:33
astronomy and for machine learning research. So I think
16:35
it's a very good source
16:37
of multimodal data that maybe
16:40
isn't out there yet in a similar
16:42
format. And so what work needs to
16:44
be done to put this into
16:47
a situation that maybe scientists
16:49
could get access to it? So
16:56
I suppose astronomers
16:58
have access to it already, but if
17:01
you don't have astronomy training, you don't
17:03
know exactly how to use and process
17:05
this data. So I think maybe a
17:07
lot of documentation needs to be done,
17:10
just packaging up nicely, that
17:13
kind of thing. Well, yeah,
17:15
but let's say that I'm an astronomer.
17:17
I've been working on, I don't
17:20
know, I write Python scripts. I've been
17:23
accessing Gaia data and pulling
17:25
down tens of thousands of
17:28
white dwarf observations and I'm looking for
17:31
some kind of needle
17:33
in a haystack to try and sort of do
17:35
some scientific study on. And let's
17:38
say I wanted to say, well, what if I fed
17:40
all that into astro PT and then
17:43
looked for some kind of, where
17:45
would I go to actually start to do this work
17:47
if I want to try and play around with these
17:49
models? So
17:51
I suppose the first step would be
17:53
going to, there's an open
17:56
GitHub, the code is all open source, it's
17:58
ready to go. You can go there. You
18:00
can download the GitHub code, feed
18:02
in your dataset, and
18:05
it should just work. There's also a Discord
18:09
server as well, where we're all chatting on
18:11
there about Astro PT and large observation models,
18:13
and other deep learning for astronomy stuff. Yeah.
18:17
So yeah. Yeah, we'll put some links
18:19
to that when you've shown us. Yeah, yeah.
18:21
So then, would I
18:23
be fine-tuning on existing data,
18:26
or would I be training on just my 70,000 white
18:28
dwarf observations
18:33
and nothing else in the system, or
18:35
would I be looking to fine-tune an
18:37
existing dataset? Ideally,
18:40
I'd like to get to the point where you
18:42
just take the Astro PT model and fine-tune
18:44
it on your specific dataset, just so it learns
18:46
a little bit more Astro knowledge on your specifics,
18:49
kind of like you would do with a Llama
18:51
model. Right
18:53
now, it's only been trained on galaxy
18:55
observations, but the next
18:57
step is going to be multimodal training. So we're
18:59
going to have a model that knows of other
19:02
astromodalities that you can then take and fine-tune on
19:04
your specific dataset. Right.
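The fine-tuning pattern being described, keep the pretrained weights fixed and train only a small new head on your own data, can be sketched as below. Everything here is a stand-in: the "frozen backbone" is a fixed random projection rather than the real pretrained Astro PT encoder, and the labels are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained, frozen encoder. In real fine-tuning this would
# be the pretrained transformer's forward pass with its weights held fixed.
W_frozen = np.random.default_rng(42).normal(size=(8, 16))

def frozen_backbone(x):
    return np.tanh(x @ W_frozen)

# Your own small labelled dataset (e.g. features of white dwarf observations).
X = rng.normal(size=(200, 8))
direction = rng.normal(size=16)
y = (frozen_backbone(X) @ direction > 0).astype(float)  # toy binary labels

# Fine-tune: only the new linear head is trained; the backbone never changes.
feats = frozen_backbone(X)
head = np.zeros(16)
lr = 0.5
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-np.clip(feats @ head, -30, 30)))  # sigmoid
    head -= lr * feats.T @ (p - y) / len(y)                    # logistic-loss step

acc = ((feats @ head > 0) == (y > 0.5)).mean()
print("train accuracy:", acc)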
19:07
So that's in the works, but it's not quite there yet.
19:09
So right now, you'd be training from scratch, probably, on your
19:11
dataset. Right. And so
19:13
when you say multimodality, you're
19:15
not thinking, oh, I want text
19:17
and pictures. You're like, I
19:20
want spectroscopic data. I want
19:22
x-ray observations. I want neutrino
19:24
observations, gravitational wave observations, and
19:27
just keep feeding that data in
19:30
until interesting insights start
19:33
to pop out as you query the
19:35
data. That's exactly it. Yeah, just feeding
19:37
all the data. It learns the physics,
19:39
and then at the end of
19:41
it, you have this nice model that knows all of
19:44
this Astro knowledge that you can then extract with linear
19:46
probes or whatever you want to do. Yeah,
19:49
that's the dream. That's where I'd like to get
19:51
to, which is why this is an open source
19:53
project where everyone joins, we can
19:55
train this model together with data that you might
19:57
have. A lot
20:00
of interesting work like the SETI researchers
20:02
have been able to get
20:04
a pipeline from radio observations to be
20:07
able to pull data that is interesting
20:09
to them to be able to search
20:11
through for any evidence of a, you
20:13
know, technological signature in the
20:15
radio observations. Is
20:18
getting data from these
20:20
instruments and observatories difficult
20:22
right now? Or is it relatively straightforward to
20:24
kind of to take it all and feed
20:26
it in? I mean, we're seeing all kinds
20:28
of legal issues over in the regular world
20:30
as the machine
20:33
learning companies are having to face copyright
20:36
issues and so on. Is it a lot
20:38
simpler in the scientific realm? I
20:41
would say it's simpler, yeah, because
20:43
you don't
20:46
have the problem of like copyright
20:48
infringement or other things you might
20:50
run into in the textual realm.
20:53
So it's, it's nicer
20:55
in that regard. Yeah,
20:58
yeah, I would say it's a
21:00
much nicer playground for developing these
21:02
large models, you don't have to.
21:04
There are fewer ethical questions in my
21:07
head when I'm using these models.
21:09
So that's what I like about
21:11
them. You can just use astronomy
21:13
data like this. So
21:16
in your perfect world, there will be
21:20
pipelines of data coming from all of the
21:22
telescopes, all of the rovers,
21:24
all of the I don't know, you
21:27
know, everything, all the gravitational wave observatories,
21:29
it's all just being uploaded into this
21:31
model. And then you are running some
21:33
kind of training kind of like how
21:36
ChatGPT happens on a regular
21:38
basis. Do you see it like
21:40
some kind of continual process or like, okay,
21:42
now we're training version
21:44
three with all the data
21:46
up to, you know, this point? I
21:50
think, uh, realistically,
21:53
it's probably going to be a version by version thing.
21:55
I don't think it'll be continual just because we're a
21:58
small group at Universe TBD. But
22:01
it would be nice to get some
22:03
continual pre-training going as well, where you
22:05
could – there's a lot of research
22:07
about this, about continual pre-training in
22:09
the language world. So it'd
22:11
be nice to be able to do something like this
22:13
for these large observation models, these large transformers that feed
22:15
in astronomy. If
22:17
we can do this, we could also have
22:20
a base model. You take it, you do
22:22
some continual pre-training on it, on your dataset
22:24
that you might have, your astro dataset. And
22:27
then I can imagine you pushing it back to the base
22:30
model repository or wherever we host the model.
22:32
You're like, oh, we've added this extra dataset,
22:34
we push it back. And it's
22:37
a nice way to collaborate and improve the
22:39
model continually. So that would be very cool
22:41
to do, I think. Yeah. Yeah.
22:43
Yeah. Yeah. And
22:46
so one of the
22:48
really interesting things that's happened with
22:50
large language models is various
22:53
kinds of emergent behavior, where
22:55
you train it enough times and it starts
22:57
to develop a theory of mind that it
23:00
can sort of put itself in the mind
23:02
of another person to think what they're thinking.
23:05
Are you seeing any glimmers of
23:07
any kind of emergent behavior in
23:09
understanding underlying physics and
23:12
astronomy? Yes. This is one of
23:15
the cool things that was
23:17
in the paper. So we have,
23:19
like I said before, we have these linear probes
23:22
that take the embedding space and project it onto
23:24
some scientific problem. And we
23:26
found that certain galaxy
23:28
morphology classifications emerge at a certain
23:30
number of parameters. So I think
23:32
it was like 13 million
23:36
parameters. You
23:39
start seeing these questions start being answered by
23:41
the network. So if we can continue pushing
23:43
this, it'd be interesting to see what questions
23:45
it can and can't answer, and
23:48
at what point they start being answered
23:50
by the model. And if there's some sort
23:52
of correlation there, we only tested it on
23:54
maybe two dozen questions. But if we can get
23:56
a load of these, maybe
23:59
there's a correlation between more difficult
24:01
questions and easier questions, it suddenly emerges
24:03
knowledge about the more difficult questions in
24:06
a predictable way. So give me an example of a question.
24:09
Like when you say a question, I mean, you know, if
24:11
it's just a bunch of pictures, can it only respond in
24:13
pictures? Can it only respond in spectroscopic
24:15
data? So
24:18
give me an example of how you might ask a question and how it
24:20
might give you the answer to the question. So
24:22
we have some questions
24:25
about the galaxy morphology. These are from
24:28
the Galaxy Zoo project. So this is a
24:30
citizen science project where citizen scientists, they look
24:32
at lots of galaxies and they answer questions
24:35
about it. Like this galaxy has so many
24:37
spiral arms or this is heavily barred. Or
24:40
this has an artifact in it, which
24:43
is like a strange occurrence
24:45
in the galaxy. And
24:48
we take these questions, we
24:50
encode the answers as
24:52
a float value between zero and one, depending
24:55
on how strong the answer
24:57
is from the Galaxy Zoo citizen scientists.
25:01
And we asked the model,
25:03
the linear probe model to
25:07
answer these questions. So you can
25:09
imagine the linear probe is taking
25:12
the galaxy as an input. And
25:16
then it's trying to predict how likely
25:18
it is to be heavily barred or
25:20
how likely it is to have n
25:22
spiral arms or how likely it is to
25:25
have
25:28
a certain redshift or have a certain color
25:31
or a certain magnitude. And
25:34
these are the questions we ask it. We're not asking
25:36
it like text and it answers in text. We ask
25:38
it like a
25:41
supervised question like
25:44
in classical machine learning.
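The encoding step described here, turning Galaxy Zoo volunteer answers into float targets between zero and one, is simple enough to show directly. The question labels and vote counts below are illustrative, not real Galaxy Zoo data; the encoded fractions are what a linear probe would then be trained to predict from the embedding.

```python
# Galaxy Zoo-style answers arrive as vote counts from citizen scientists.
# Encode each answer as a float in [0, 1]: the fraction of volunteers who
# gave that answer for this galaxy. (Answer names here are illustrative.)
votes = {
    "smooth": 12,
    "features_or_disk": 30,
    "artifact": 3,
}
total = sum(votes.values())
targets = {answer: n / total for answer, n in votes.items()}
print(targets)  # fractions over all answers sum to 1.0
```

Each target is then a supervised regression label for the probe, rather than text in and text out.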
25:48
Right, right. But essentially you're saying, how,
25:52
what is the likelihood that a galaxy is gonna have
25:54
a thick bar? And
25:56
then it's gonna give you a distribution
25:59
between zero. something
40:00
about the underlying science because otherwise it wouldn't be very
40:02
good at predicting it. And if it's perfect
40:04
at predicting it, it's
40:07
going to be better at scientific analysis
40:10
than a model that's very bad at predicting
40:12
this interleaved,
40:14
multimodal
40:17
data set that we're talking about. So
40:19
would would you do that with pre
40:21
training? Like would you take your
40:23
data have a really
40:25
smart but not necessarily great-at-astronomy
40:27
LLM, prepare
40:31
the data in a format that
40:33
then some astronomy related one could
40:35
then feed on that data, to have it
40:38
all nicely in the same structure? It'd
40:42
be nice. I imagine it would be. Yeah,
40:44
it'd be very nice to be able to
40:46
do this. I think that's a good route
40:48
to go down. But first we need to
40:50
build in the textual domain into these models.
40:53
But yeah, yeah, perfect world. Yeah, yeah.
40:56
It's been really interesting, like Microsoft
40:58
released their Phi model. And
41:01
it's surprisingly good for how small of
41:03
a model it is. And the
41:05
answer appeared to be obsessing
41:08
about the quality of the data that
41:10
you don't just feed it all of Reddit, you
41:13
don't just feed it all of Twitter. In fact,
41:15
you have to go through line by line sentence
41:17
by sentence and make sure that you've got good
41:19
stuff, because otherwise it's garbage in, garbage out.
41:22
But but the question that I was asking was more
41:25
like if there was a data
41:27
set, like you're having to scavenge
41:29
from Gaia, from DESI, from Webb,
41:32
and shortly from
41:34
Vera Rubin. But
41:36
if there was like, the kind
41:38
of data that would be really useful
41:40
to learn from, could you sort of
41:43
imagine a perfect
41:45
data set? And then, you know, then we'll figure
41:47
out how to get that data? What would be
41:49
the ideal data? So I
41:52
would say the ideal data to train the model
41:54
would be something that has a very high signal
41:57
to noise ratio. So it's, it's got a lot
41:59
of important physics in the data and not
42:02
a lot of cruft, so not a lot
42:04
of blank space or noise or whatever. So
42:06
that would be the perfect data set to
42:08
feed this. And I think we have so
42:10
much data coming in astronomy. We could do
42:12
some pretty aggressive cuts to get this good
42:14
high-quality data if we had some
42:16
people looking at this. And I think
42:19
that's what they did with Phi as well. They
42:21
cut it aggressively. Or
42:23
no, I think they actually took
42:26
synthetic data from
42:29
some large-language model and
42:32
trained the Phi model on this. I
42:34
don't think we can do this in our case. But
42:36
we have all of this data, this raw data from
42:38
these telescopes that we can then cut. And we have
42:40
high signal, low noise. And it
42:42
would be much better for the model to train on
42:44
this. So it's not learning things that
42:47
aren't particularly relevant for the physics. Yeah.
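The "aggressive cuts" idea here is easy to sketch: compute a signal-to-noise ratio per object and keep only what clears a threshold. The catalogue below is synthetic and the threshold value is illustrative, not anything from the actual pipelines discussed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy catalogue: a mean flux and a noise estimate per object.
flux = rng.uniform(0.5, 50.0, size=1000)
noise = rng.uniform(0.5, 5.0, size=1000)
snr = flux / noise

# Aggressive quality cut: keep only high signal-to-noise objects, so the
# model spends its capacity on physics rather than on noise or blank space.
SNR_MIN = 10.0
keep = snr >= SNR_MIN
print(f"kept {keep.sum()} of {len(snr)} objects")
```

The interesting caveat from the conversation applies: a cut like this is on instrumental noise only, and shouldn't be allowed to throw away rare objects that are informative precisely because they look unusual.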
42:50
Right. So you're going three steps forward,
42:52
one step back with high noise. It'd
42:54
be better to just go three steps
42:56
forward and no steps back with low
42:58
noise. That's exactly it. And
43:01
are there good? And I guess the problem is
43:03
always all telescopes are going to be trying to
43:05
observe at the very edge of what they are
43:07
capable of observing. And so you're
43:09
always going to just, at a
43:12
certain point, the noise makes the images
43:14
unusable. And so you're always hitting high
43:16
noise. Yeah.
43:18
Because you're trying to eke out the most
43:20
possible science from your data. So
43:23
I guess if it's new, like
43:26
a new type of galaxy or the edge
43:28
of our observable universe, and it's still
43:31
high, like, telescopic noise,
43:34
I would argue that there's more information in
43:36
that new galaxy compared to just
43:41
another spiral galaxy. So high
43:44
signal to noise, but maybe not telescope
43:46
noise. Important
43:48
information or useful information compared
43:51
to just blank space. Yeah. Now,
43:54
that's interesting. Michael,
43:57
what are you obsessed with right now? And,
46:01
you know, like, I have a lot of people who listen
46:03
to what I do, and they are in the computer
46:06
field, they're, you know, they're working in,
46:09
in machine learning, or they're working in, in
46:12
various stuff, and they love astronomy, and they'd
46:14
love to contribute. So
46:17
what would be the best way that people who
46:19
are listening to this interview, and they're really okay,
46:21
yeah, they want to try and put
46:25
all of that technology knowledge
46:27
to science? What
46:29
is the best way that they can get involved? So
46:33
they can join the universe TBD discord,
46:36
which I can give you the link for this.
46:38
And it's a discord server that's open to all
46:40
like EleutherAI's, and we're trying to
46:42
develop astronomy,
46:47
like machine learning models, not just
46:49
Astro PT, we've also tried to
46:51
develop member models on there for
46:53
astronomy, plus large language
46:55
models that are trained in Astro, like
46:57
arXiv papers and astro knowledge. So
47:00
I think the best thing to do would be to
47:02
join that Discord, play around a bit with our projects, see
47:05
what's going on, like, have
47:07
a look at our code,
47:09
because we're astronomers, we're not machine learning
47:11
scientists. So our code is probably
47:14
not the best. Great. So this is the
47:16
thing, right? There's all these machine learning, astronomy
47:19
fans who can't wait to get
47:21
involved in a project. And this sounds like a great
47:23
project to get involved in. Well,
47:26
Michael, good luck on
47:29
unleashing this
47:31
technology to come up with new theories
47:33
about the universe. I look forward to
47:36
the new discoveries. Thank
47:38
you. It was great talking. I hope
47:40
you enjoyed that interview with Dr.
47:42
Michael Smith. Now I'm going to give you some
47:45
more thoughts and feedback. But first, I'd
47:47
like to thank our patrons. Thanks to Abe
47:49
Kingston, Adam Schaeffer, Andrew growth, David Gilton
47:51
and David Matt's, Dennis Alberti, Dustin cable,
47:53
Jeremy Madder, Jim Burke, Jordan Young, Josh
47:55
Schultz, mods, Paul Robach, Stephen Krosocke, Stephen
47:58
Fowler, Monday. and Vlad Shippelan who support
48:00
us at the master of the universe
48:02
level and all of our other supporters
48:04
on Patreon. I've been waiting for something
48:06
like this to come along, which is that, you
48:09
know, there's so much data. I, you know, I talk
48:11
about this all the time that you've got the Vera Rubin
48:13
Observatory. It's going to be generating petabytes of data dumped
48:15
directly onto the internet. All
48:17
of this data coming from Gaia, all of
48:20
this information coming from Euclid, from DESI, like
48:22
there's just, there's too much data. And
48:25
yet there are all of these mysteries, these questions
48:27
that we have. What is dark matter? What is
48:29
dark energy? Why is there more matter than antimatter
48:31
in the universe? How did the first galaxies form?
48:33
Where did, like all of these questions, data
48:36
meet questions. And
48:38
hopefully, finally, we're going to
48:41
have some form of tool,
48:44
some kind of machine learning
48:46
system that can actually help
48:48
us make sense from all of this
48:51
data that's being gathered by all these
48:53
new observatories. And this sounds like the
48:55
right direction to me. And
48:57
so I don't know if you got the gist in the
48:59
conversation, but Michael is
49:01
definitely looking for more people to
49:03
get involved in the project, especially
49:06
the kinds of people who have
49:08
technical knowledge, programming experience, machine
49:11
learning, that kind of thing. And if
49:13
that's you and you've like
49:15
always wanted to get involved in an astronomy project,
49:18
this is your chance. So I'm going to
49:20
put a link to the Astro PT paper
49:22
in the show notes. I'm also going to
49:24
put a link to that Discord chat that
49:27
Michael was mentioning so that you can go
49:29
and join. Say hi and see
49:31
how you can help out. And maybe we
49:34
can have a computer help us solve some of the
49:36
biggest problems in the universe. All right.
49:38
We'll see you next time.