How to Read Leaked Datasets Like a Journalist

Released Friday, 26th January 2024

Episode Transcript


0:00

It's a Unix

0:04

system. I know this. It's

0:06

how the files all

0:08

work. It tells you everything.

0:19

Sir, he's uploading the virus.

0:22

Eagle One, the package is being delivered. So

0:29

I live with, my wife is

0:31

a software engineer. Everybody

0:33

I interact with every day is

0:35

a software engineer, like in my normal life.

0:38

And they were pulling

0:40

through this book being like, Oh, this is

0:42

really good. Yeah, you have to internalize

0:44

all of this so we can stop answering your questions.

0:48

You know, write your own Python scripts, etc,

0:51

etc. Will you introduce

0:54

yourself and tell us about the book

0:56

we're here to discuss today? Yeah,

0:59

I am Micah Lee.

1:01

I work as the Director of Information Security

1:03

at The Intercept. And I

1:06

just published my first book called

1:08

Hacks, Leaks, and Revelations: The Art

1:11

of Analyzing Hacked and Leaked Data.

1:13

It's basically a,

1:15

it's like a technical book. And the

1:17

goal is to teach journalists, but also

1:20

researchers and activists and people who are

1:22

looking for a new hobby or whatever,

1:24

how to analyze the floods of hacked

1:26

and leaked data that are getting leaked

1:28

on the internet every day. Yeah,

1:31

you make it sound as if there is just

1:33

a flood of stuff that there's not enough people to

1:37

train people to sort through properly. You say

1:39

that's accurate? Yeah, that's

1:41

definitely accurate. Like I, I

1:44

only download and look at

1:46

like a small fraction of the data sets that

1:48

I hear about just because I'm too busy. I

1:50

If I'm, like, working on a project,

1:53

I just ignore everything else. And

1:56

I think that this is the case for, you

1:58

know, the few other data

2:01

journalists that are doing this type of data journalism,

2:04

there's not nearly enough of us and so that's one of

2:06

the goals of the book is to basically you know

2:09

raise an army make a lot more

2:11

people who are able to have

2:14

the skills that they need to analyze

2:16

data sets like this. Yeah, exactly. So

2:18

without having your book as your own

2:20

guide how did you get to the point where

2:22

you're like you know what I've been

2:25

doing this for a long enough time, it's time for me

2:27

to raise my own army. How did you get there? I mean

2:30

so this is kind of the a

2:32

lot of the work that I've

2:34

been doing at The Intercept over the last ten years

2:37

and I, I

2:39

come from a background of computer

2:41

science and programming and then

2:43

really actually like web development and

2:47

I had never been trained in journalism or

2:49

anything like that but because

2:51

I was working at The Intercept and I'm

2:53

running into these data sets I just kind

2:56

of you know used all my technical

2:58

skills and learn more technical skills along the

3:00

way in order to figure out how things

3:02

work and I think that there

3:05

were a few big data sets

3:07

that I spent a lot of

3:09

time on and that I was

3:11

realizing that not enough people at

3:14

all are are doing this stuff

3:17

and they really inspired me to write this book.

3:20

What does it mean to be the director

3:22

of information security for a news organization? What

3:24

does that job look like day to day?

3:27

So it's a very interesting job. My job

3:29

might be a little bit different than some

3:31

others because my job is

3:33

like also split between doing a

3:36

lot of traditional infosec work but

3:38

also doing investigative journalism myself. But

3:40

yeah I do a mix

3:43

of traditional information security stuff like

3:45

I make sure that our you

3:48

know website infrastructure is secure, I manage some

3:50

vendors, I make sure that none of the

3:52

endpoints that people use get

3:54

hacked and do phishing training and that

3:56

sort of thing. But then

3:58

I also do a lot of journalism-specific work

4:01

involving source protection

4:04

and figuring

4:06

out how to secure sensitive data.

4:09

There's a lot of decisions around

4:11

when it's appropriate to use cloud

4:14

services and when we have to keep

4:16

stuff just on our laptops or

4:18

occasionally when we have to keep stuff on the air

4:20

gap computers. So it's a

4:22

bit different because of that, because

4:24

of all the journalism security work.

4:27

Can you give us, is there a good anecdote about the

4:29

job that you can give us without getting anybody into trouble

4:32

or putting them in danger? So I was thinking about this

4:34

back several years ago when

4:37

we were reporting on Snowden documents. We

4:39

went to extreme measures to keep the

4:42

Snowden archive safe. We would use air

4:44

gap computers where we'd actually unscrew

4:47

the cases and remove the networking

4:49

hardware and stuff. Whenever

4:51

we needed

4:53

to move a file from one air gap computer

4:55

to another air gap computer or we were getting

4:57

ready to publish and we had to move it

5:00

to a computer that's not air gapped, we didn't

5:02

trust USB sticks. We didn't want

5:04

USB sticks to be involved. So we

5:06

actually burned CDs and then we shredded the

5:08

CDs when we were done with them. And

5:11

we had separate USB CD

5:14

drives that are like, these are the air gap

5:16

ones and these are the not air gap ones

5:18

and things like that. But

5:21

every time we published a story based

5:23

on NSA documents that were top secret,

5:25

as is standard

5:27

journalistic practice, we needed to reach out to the NSA

5:29

press office and ask them for comment. And

5:31

we would also give them a chance to tell their

5:34

side of the story and tell them what we're accusing

5:36

them of doing basically. And we also wanted to show

5:38

them the documents we were planning to publish to see

5:40

if they had any arguments for why we shouldn't. We

5:43

never actually didn't publish something because of something they

5:45

said. But basically the

5:47

NSA was like, okay, just email

5:50

us these documents. And

5:52

We were just like, you mean, like, just plain

5:54

text email? Like, just copy them to our computers

5:56

that we never put them on otherwise and then just

5:58

like attach them. Well, I

6:00

yeah, and it took years of

6:02

Snowden journalism before we finally

6:05

got them to make a PGP key.

6:07

Even then, they were basically, like,

6:10

wanting to have very strict rules

6:12

around it, and they're like: encrypt

6:15

each document separately and just attach

6:17

the encrypted files, don't actually put the

6:19

contents in the email itself, and

6:22

things like that. Basically, the

6:24

NSA is, like, terrified of the

6:26

press office, because either the

6:28

people who are security-cleared have access

6:30

to the documents, or they're talking to

6:33

journalists all the time, and they really don't

6:35

want the people talking to journalists

6:37

to also have top secret documents. So yeah,

6:39

it was all pretty fascinating. Do

6:42

the public affairs officials sometimes have it easy, then?

6:44

Like, how often

6:46

does the NSA talk to the press

6:48

anyway, about anything, really? Every time you

6:51

publish a story you, you know, request

6:53

comment from them, and every time they have

6:55

no comment. So at the very least they

6:57

may wanna know what's coming down

6:59

the pipe? Yeah, it's funny. Thinking about

7:01

it now, there is this tension.

7:05

With these three-letter organizations, you have

7:07

this weird thing they've been doing, like, the last

7:09

ten years, where they're releasing

7:11

old information from, like, the Cold War era

7:14

onto, like, their reading rooms, right?

7:16

There's fascinating stuff in there, and

7:18

they have podcasts where they talk

7:20

about all this old stuff, where they go

7:23

through the old material, but it's very much

7:25

on their terms. They're controlling the

7:27

narrative of their back catalogue. That's something

7:29

worth pointing out, I'm sure.

7:33

We can set you up for a rant on that, yeah.

7:35

Yeah. I feel like it's

7:37

even worse given, and we'll talk

7:40

about this later on in our

7:42

discussion, that it's been

7:44

a bad week for the

7:46

journalism world. At

7:48

least, one

7:50

of the common refrains

7:52

that, you know, people will

7:54

use when journalists

7:56

announce they're being laid off is, you know, learn

7:58

to code. So there is a learn-to-code joke

8:00

somewhere in all of here, but

8:03

obviously data journalism

8:05

is so

8:08

important and has only

8:10

become more important. As the

8:12

years have gone on, we've seen newsrooms

8:14

really spin up these investigative data teams,

8:17

etc. How

8:20

important have you seen these skills becoming

8:24

in your time in journalism? I know you said

8:26

that you started off on the computer

8:28

science programming side of things. Back

8:31

when I started big archives of data, like

8:35

big data sets like the Snowden archive or

8:37

like the you know the Chelsea Manning

8:39

leaks, those were very rare.

8:42

They happened sometimes but they weren't common

8:44

at all and now it is literally

8:46

like pretty much every day. There's like,

8:49

like if you follow ransomware groups, you

8:51

could just go to their websites and

8:53

just download data from like dozens of

8:55

companies they hacked and you know some

8:58

of them might have journalistic value. So

9:01

yeah, it's really, really, really common.

9:04

And so I think that this

9:06

type of data journalism skills, they're

9:09

more important than they've ever been and I think

9:11

that that's just going to increase over

9:14

time. And yeah, like

9:16

the book does teach you to learn

9:18

to code, but I want to like make it

9:23

clear that it doesn't require any prior experience

9:25

at all. It's like designed to be

9:27

really accessible and really friendly. All you

9:29

need is a computer, an internet

9:32

connection, a hard drive with about a terabyte

9:34

of free space and then just enough curiosity

9:36

and willingness to learn new skills. Only a

9:38

terabyte? That's all we need?

9:41

Yeah, about a terabyte. Yeah, that's a lot of space.

9:43

It's because you have to download BlueLeaks, which is

9:45

like 250 gigabytes

9:47

and then you have to extract BlueLeaks, which

9:49

is like, you know, doubles in size because it's

9:51

all zipped up and stuff. And then there's a

9:53

few other data sets, but that's the big one.

9:57

But yeah, like, like a lot of people.

10:00

do find a lot of this stuff kind of

10:02

intimidating, like typing commands

10:04

into terminals and writing Python code and stuff.

10:06

But the book walks you through the process

10:08

from the very beginning and I like hold

10:10

your hand the entire way and try to

10:13

be as accessible and as friendly as I can. Yeah,

10:16

and I feel like what a lot

10:18

of people miss is that data journalism

10:20

isn't replacing, you know, the classic, you

10:22

know, shoes on the ground, boots

10:24

on the ground journalism that we

10:27

see in like, you know, 1950s,

10:30

60s, you know, cop investigative

10:32

movies and stuff like that. One

10:35

of the things that I have found interesting,

10:37

you know, throughout my career, especially, you

10:39

know, working on the tech side of

10:41

journalism is just how

10:44

you can go from these massive

10:47

data sets, these massive leaks and

10:49

databases that are either leaked to

10:51

you or just leaked publicly. And

10:53

then, you know, you spend

10:56

time within them, and you're

10:58

able to find these stories that are basically hidden in

11:00

plain sight. How does

11:02

that process work? How do you know what to look for?

11:05

Yeah, it can be challenging. It depends. My

11:07

book is full of all these hands on

11:09

projects where you download real

11:11

data sets to work with. And

11:13

so I mentioned BlueLeaks. So

11:16

BlueLeaks is, you know, hundreds

11:18

of gigabytes. It was data that

11:20

was hacked from hundreds of different

11:23

US law enforcement websites in

11:25

the summer of 2020 in the

11:27

middle of the Black Lives Matter uprising, and it's

11:29

full of evidence of police misconduct.

11:31

And basically, like one of the tools that's

11:34

really helpful is search

11:38

tools. So Aleph is an example.

11:40

Aleph is this tool that was

11:43

developed by the Organized Crime and Corruption

11:45

Reporting Project. And so you

11:49

could take BlueLeaks, you could index the entire

11:51

thing in Aleph, and what it does is it

11:53

looks through every single file, it extracts all the

11:55

text, it does

11:59

entity extraction. So it pulls out all

12:01

the email addresses and phone numbers and like social

12:03

security numbers and whatever else it finds and it

12:06

lets you search the entire thing and it also does

12:08

OCR so optical

12:10

character recognition, so it

12:13

works with, like, scanned documents and images.
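
[A rough illustration of what that entity extraction step does: a minimal Python sketch that pulls email addresses and US-style phone numbers out of a folder of extracted text. The patterns are deliberately simplified and the folder name is invented; this is a sketch of the idea, not how Aleph is actually implemented.]

    import re
    from pathlib import Path

    # Simplified patterns; real entity extraction is far more thorough.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")

    emails, phones = set(), set()
    for path in Path("extracted-text").rglob("*.txt"):  # hypothetical folder
        text = path.read_text(errors="ignore")
        emails.update(EMAIL.findall(text))
        phones.update(PHONE.findall(text))

    print(f"{len(emails)} unique email addresses, {len(phones)} phone numbers")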

12:15

Yeah, it's the whole

12:17

needle in the haystack thing.

12:20

The way that I would typically start this is

12:23

I would search for some things that I'm interested

12:25

in So like maybe I would search for the

12:27

city that I live in or the name of

12:29

a politician or something. And then, using

12:32

search tools like that. You kind

12:34

of narrow the field of what you start focusing

12:36

on. And

12:39

yeah, there's a chapter on how to

12:41

use Aleph, how to set it up on your own computer

12:43

and do it with any

12:45

data sets you want. And there's also, like, a lot

12:47

of other things, like

12:50

grep. It's a command line tool.

12:52

It's incredibly useful for being able to

12:54

filter data. But

12:56

yeah, you're never gonna get away

12:58

from

13:00

just, like, manually looking through things.

13:04

I mean, I think people, like, think that maybe

13:06

AI can do it for you. I kind of don't

13:08

agree. But yeah,

13:11

in the end, like, you're definitely gonna

13:14

spend many, many, many hours of clicking

13:16

through, reading documents, taking notes. And then,

13:18

based on what you find, you know,

13:20

that's where your investigation goes.
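
[As an illustration of that first triage pass, here is a minimal Python sketch that flags every text file in an extracted data set mentioning a few terms you care about. The folder name and the search terms are invented for the example.]

    import re
    from pathlib import Path

    # Terms you might care about: your city, a politician, an agency...
    TERMS = re.compile(r"oakland|fusion center|protest", re.IGNORECASE)

    # Print every extracted text file that mentions any of the terms.
    for path in Path("blueleaks-extracted").rglob("*.txt"):  # hypothetical folder
        if TERMS.search(path.read_text(errors="ignore")):
            print(path)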

13:23

And where do you find data sets? One

13:25

of my favorite things to do is read

13:29

through thousand page Pentagon budgets because there's

13:31

always, there's always stories in there, buried

13:33

in weird places, and I'm Ctrl+F-ing

13:35

and I'm looking for F-35 or whatever

13:39

Department of energy stuff is a good one. Where

13:42

do you where do you find data and what

13:44

is? DDoS Secrets

13:47

or DDoS. I'm

13:50

saying it wrong. I call it DDoS secret. Yes

13:54

Distributed denial of secrets

13:56

so distributed denial of secrets is

13:58

this nonprofit transparency collective, I work

14:01

really closely with them. They're

14:03

kind of like a public library of

14:05

hacked and leaked data sets and it's

14:07

specifically curated for journalists. And

14:09

it's great. It's

14:13

ddosecrets.com and you go there and

14:15

you can see all

14:17

of the data sets that they've released and

14:20

you can download them. A lot of

14:22

the data is available for everyone. Some of

14:24

the data is called limited distribution which means

14:26

you have to request access for it and

14:28

that's basically to protect privacy. So a lot

14:30

of these data sets have tons

14:32

of personal information. And so

14:36

DDoSecrets, like, have relationships with journalists

14:38

and sometimes with like academic researchers

14:40

and then they share it and

14:42

this way they don't end up

14:45

just publishing like tons of private

14:47

information from innocent people. So DDoSecrets

14:49

is like a really good source

14:52

for data sets and that's

14:54

where I get a lot of

14:56

the data that I work with myself and all the

14:58

data sets in the book that are examples are all

15:00

downloaded from DDoSecrets. But

15:03

this is just a tiny sample

15:05

of the data that's actually out

15:07

there. Like you

15:09

were just talking about thousand

15:12

page government documents. But also

15:15

some of the data is totally public then

15:17

you can just scrape it from the internet. Like

15:20

a good example. I mean, DDoSecrets made this

15:22

a lot easier for people but the

15:25

Parler scrapes. So January

15:27

6, 2021 when Trump supporters stormed

15:31

the Capitol they all

15:34

have phones on them and they all recorded

15:36

themselves doing all of this stuff on

15:38

their phones and then they posted

15:40

these videos in real time to

15:43

the social network called Parler and

15:45

a lot of these videos included

15:47

like metadata like the GPS coordinates

15:49

on their phones. And so

15:52

after January 6 Parler was

15:56

basically kicked off of Google Play and kicked off

15:58

of the Apple App Store. because

16:02

basically for violating their terms, like they

16:04

were refusing to moderate content that incites

16:07

violence. And then AWS also announced

16:09

that they were going to kick them off. But

16:12

they gave them a few days. And so someone

16:15

whose handle was donk_enby,

16:18

during this few-

16:20

day period while the Parler data was still

16:22

there, like, worked to download it all. It was something

16:24

like 54 terabytes of videos. So basically everything

16:27

that had been uploaded to Parler. It was

16:29

over a million videos. And

16:31

the ironic thing is it was just downloading it

16:33

from AWS and then taking it

16:36

to a different S3 bucket.

16:38

So anyway, yeah, that data, there's

16:42

a whole chapter on like working with that

16:44

data and figuring out how to, you know,

16:46

like take this million videos and write a

16:48

Python script that like looks through all the

16:50

metadata and finds the ones that have GPS

16:52

coordinates in Washington, D.C. and that were filmed

16:52

on January 6, and then how to map

16:54

them, and all of that stuff.
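
[To give a flavor of that kind of script, here is a minimal Python sketch. It assumes each video's metadata has already been dumped to a JSON file with a tool like exiftool; the folder, the field names, and the rough bounding box around Washington, D.C. are illustrative, not the book's actual code.]

    import json
    from pathlib import Path

    # Rough bounding box around Washington, D.C. (illustrative values).
    LAT_RANGE = (38.7, 39.1)
    LON_RANGE = (-77.2, -76.8)

    for path in Path("parler-metadata").glob("*.json"):  # hypothetical metadata dump
        meta = json.loads(path.read_text())
        lat, lon = meta.get("GPSLatitude"), meta.get("GPSLongitude")
        created = meta.get("CreateDate", "")
        # Keep videos with GPS data that were filmed on January 6, 2021, in D.C.
        if lat is None or lon is None or not created.startswith("2021:01:06"):
            continue
        if LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]:
            print(path.stem, lat, lon, created)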

16:58

So yeah, that's an example of, like, totally public

17:00

data. But then there's also, yeah, hacker

17:02

groups a lot of times have Telegram channels

17:04

where they just post data that they steal,

17:06

and ransomware groups a lot of

17:08

times run, like, Tor onion

17:10

services that have all the data you

17:12

can download from them. And

17:14

then there's also just like so much

17:17

misconfigured data out there,

17:19

like S3 buckets that

17:21

are totally open that people just discover

17:23

sometimes. And there's also

17:26

like a good example, the

17:28

American College of Pediatricians. So this

17:30

is a group that the Southern

17:32

Poverty Law Center calls an anti-LGBTQ

17:34

hate group. They wrote like an

17:36

amicus brief in the case that

17:39

overturned Roe v. Wade. They had a Google

17:41

Drive link that was like open to anyone

17:43

and someone found it and then downloaded 20

17:46

gigabytes of documents. And there has been

17:48

some journalism based on that data. And

17:51

so, so yeah, the data sets are

17:53

everywhere. The data, there's just so

17:55

much data. And if you just poke around

17:57

a little bit, you can find it. There's

18:01

a lot of really interesting revelations in there and there's

18:03

like nobody looking at it. How

18:05

do you know when you've found a story?

18:10

I mean, so okay, so there's a

18:12

lot of data sets out there that are just

18:14

completely like not

18:17

interesting in terms of journalism. There's

18:19

like, you know, like, here's a

18:21

list of all of

18:23

the customers of some company or something and

18:25

it's like, like, unless there's, you know, something

18:27

that you think is in the public interest,

18:29

then like, okay, those

18:32

customers data was breached or whatever.

18:36

I think that the stories are really

18:38

like when you find, you

18:41

know, evidence of

18:43

corruption or evidence of

18:45

crimes or, you

18:48

know, like, like if you have internal

18:50

chats and you find people being like

18:52

really racist or really sexist or things

18:55

like that. So yeah, I

18:58

mean, a lot of times a data set comes and

19:00

you might be really excited about it and then you

19:02

spend a bunch of time looking through it and nothing

19:04

really comes from it. Actually

19:07

an example of this is I looked

19:09

at the data from Oakland, the city

19:11

of Oakland. The city of Oakland

19:13

was hit with ransomware and I guess

19:16

they didn't pay their ransom and the data was

19:18

put online. And

19:20

so I downloaded a copy of it

19:22

and I don't think that, I think there

19:24

definitely still might be stories in there and

19:26

I just didn't spend enough time thoroughly looking.

19:29

But basically, like, you know, there's a lot

19:31

of information about all the lawsuits against the

19:33

city of Oakland, but not like a lot

19:35

of like, not like their internal deliberations or

19:37

anything like that. And I just spent a

19:39

while and didn't really find much even though,

19:41

you know, there's all stuff about the Oakland

19:43

police. There's like some potentially interesting

19:45

stuff, but yeah, so I

19:48

don't know. It's subjective and I think

19:50

that it's really about like what's news,

19:52

what people will, are really

19:54

interested in knowing about and also,

19:56

you know, personally I like finding

19:59

stories that are really gonna have some

20:01

sort of impact. We were

20:04

talking earlier about having to send

20:06

things to the NSA

20:08

to get comment on them. Not

20:11

every organization is the NSA. Not

20:13

every organization's gonna be somewhere that you

20:15

could reach out to with that process.

20:17

How do you authenticate these data

20:19

sets? Theoretically, you've got this Google

20:21

Drive, and it could be someone spoofing it.

20:23

How do you know? That

20:26

entirely depends on the type of data set.

20:28

Every data set is different, and generally

20:30

authenticating each one is a different story.

20:33

But I've found that the

20:36

safest way in general is to

20:38

use OSINT to kind of

20:40

compare publicly available information with information

20:43

that is in the data set

20:45

to confirm it, and if it

20:47

matches, the data is real. And

20:50

so, here's an example

20:52

of this. One of the case

20:54

studies that's in the book is, there's

20:57

this anti-vax group called America's Front-

20:59

line Doctors. And

21:01

so, during the

21:03

pandemic they made, like, millions of

21:06

dollars essentially telling people, you

21:08

know, to be opposed to masks and vaccines

21:10

and that ivermectin was the

21:12

only way to save

21:14

yourself from COVID, ivermectin

21:17

and hydroxychloroquine,

21:19

and they worked with some

21:22

private, somewhat small telehealth

21:24

companies. And basically a

21:26

hacker had contacted me on Signal

21:28

and said that they had

21:30

hacked, like, the

21:32

companies handling all of this, and

21:34

they had data for me if I

21:36

wanted it, and I was only

21:38

expecting, like, one hundred megabytes.

21:41

But when I accepted, it turned out

21:43

to be hundreds. Hundreds of megabytes of data.

21:45

And it was all

21:47

patient records and prescription records, like,

21:49

medical records of America's Frontline Doctors patients.

21:52

And I did all

21:54

this research. I figured out, like, you know,

21:56

like, how much money they really were

21:59

making, how the whole

22:01

scam worked and everything like that. This actually led

22:03

to a congressional investigation. But I didn't know

22:05

if the data was real or not,

22:07

and so the way that I ended up

22:09

authenticating it, I figured out, was, in addition

22:12

to the, like, you know, anti-vax

22:14

stuff, there's also, like, America's

22:16

Frontline Doctors is very, like,

22:18

it was, like, kind of started as

22:21

part of the Trump 2020 campaign,

22:23

and the person who started it, Simone Gold,

22:25

was arrested for January 6,

22:27

the insurrection. So there

22:30

was, like, a MAGA, like, an

22:32

anti-democracy overlap with probably some of

22:34

these patients. And

22:37

there's this right-wing social network, there

22:39

was a totally separate data set

22:42

that included, like, thirty thousand emails from

22:44

that network. So I made a list of

22:46

all the email addresses in the America's Frontline Doctors

22:48

database and all the email addresses

22:50

in that data set and compared them,

22:53

and I found a bunch of overlap,

22:55

like, okay, here's

22:57

some people who are in one that

22:59

also are allegedly in the other. So

23:03

I searched for them all,

23:05

and then I found a handful

23:07

of them that were talking about

23:09

their purchases, like paying the ninety

23:11

dollars and whatever, and the, like,

23:13

dates lined up. And

23:15

from that I was really confident that this data

23:17

was real. So,

23:19

as an example, they were publicly

23:21

posting about horse paste? Exactly. I

23:23

mean, actually, one of the conversations

23:25

was specifically talking about, like,

23:28

buying it from animal supply stores, or, like,

23:30

"what am I gonna do" and

23:32

that kind of thing. Finally, I got

23:34

what I was after. Wild. I

23:37

have one more curveball question

23:40

that may be very stupid. But

23:43

it occurred to me, so I thought I would

23:45

throw it out there. So, in the

23:47

days of Napster, aging

23:49

myself, dating myself, one

23:51

of the ways the music

23:53

industry would handle the

23:55

piracy problem is that they

23:57

would flood the zone with

24:01

fake versions of the songs. So you would

24:03

download, you'd download something you thought was

24:05

Madonna's new single, and it was Madonna berating you

24:07

for downloading her music. In

24:10

this way they fought back. Do you

24:12

see, do you ever see a situation: if

24:14

I were somebody that was

24:16

sitting on large data sets as part of my job,

24:18

say I was a Fortune 500 company, I

24:20

would perhaps, in

24:22

the event of a breach, train a

24:24

large language model to generate fakes, fake

24:27

data, and then also flood the zone. Have you

24:30

ever thought about anything like that? I

24:32

know that's kind of a weird, far

24:34

flung hypothetical question. I mean, yeah,

24:36

I definitely think that there,

24:39

there is, like, disinformation

24:41

everywhere, and I

24:43

think that when you are

24:46

reporting on something like this,

24:48

there's always a bit of, like,

24:50

okay, I have confirmed some of the

24:52

information in the data that I have is

24:55

authentic, and that makes me feel

24:57

more confident that it's authentic, but I

24:59

didn't confirm every single piece. Maybe I can

25:01

confirm that an email, you know,

25:04

actually went to the person

25:06

it says it did. That doesn't mean

25:08

the rest of the dump is real. And so I've

25:11

found a good thing to do is report

25:13

on what you've confirmed. So

25:16

yes, if you're going to be publishing,

25:18

you know, if you're going to publish the

25:21

emails, make sure that what

25:23

you're publishing is real, what

25:26

you've verified. And so even if

25:28

there is, like, some fake stuff, like, actually, I

25:30

remember, I,

25:33

I might be getting the details

25:35

wrong, but WikiLeaks published,

25:37

I think, this data set from

25:39

Syria, and it later turned out

25:42

that the original data set included some

25:44

information about big bank

25:46

transfers between people in Syria and

25:48

people in Russia, and that information was, like,

25:50

deleted from the data set. And

25:52

so, like, it was a real, a real

25:54

data set, but, like, some of the

25:57

information was, like, quietly deleted

25:59

before publishing it. And

26:02

yeah, so I don't know. Like, I

26:05

think that this definitely has happened in

26:07

the past. I think that especially with

26:09

LLMs, with like AI, it's going

26:11

to get a lot worse.

26:15

But it is just true with everything,

26:18

like just the entire zone is flooded with nonsense.

26:20

And I think that that's true with with data

26:22

sets too. And so I think it's just really

26:24

important to do the work to authenticate

26:27

everything that you're going to publish. All

26:30

right. Cyber listeners, we're going to pause there for a break.

26:32

We'll be right back after this. All

26:59

right. Cyber listeners, we're back on with Micah

27:01

Lee talking about Hacks, Leaks, and Revelations. What's

27:04

the most interesting exploited data set you've

27:06

seen? Okay. So one

27:09

that I find really interesting is the

27:13

Epik hack. So Epik is spelled E-P-I-K.

27:15

In 2021,

27:18

Anonymous hacked Epik and they called

27:20

this hack Epik Fail. And

27:23

basically Epik is a hosting provider,

27:25

it's run by like Christian

27:27

nationalists, it's used by a

27:29

lot of like

27:31

really far right organizations and groups

27:33

and websites and stuff. They do

27:36

domain name registration. And so a

27:38

lot of the places where like mass

27:41

shooters have posted their manifestos have

27:43

been websites hosted by Epik. And so that's

27:46

how they're able to stay online, like like

27:49

8chan and

27:52

things like that. And like

27:54

the Oathkeepers actually. This

27:56

was probably like the biggest pwn of

27:59

a company that I've ever seen. And

28:01

the reason is because they had,

28:05

so the data

28:07

that Anonymous released was, you

28:10

know, hundreds of gigabytes of MySQL databases full

28:12

of tons of information. And like the really

28:14

interesting stuff in there was, Epik

28:16

ran, so they were a domain name

28:18

registrar, they ran a WHOIS privacy

28:20

service. And you can look behind,

28:23

you can peek behind the WHOIS privacy

28:25

for all of those domains. So I like,

28:27

in the chapter on SQL in

28:31

my book, it shows you how to

28:34

like go and like look up oathkeepers.org

28:36

and figure out the, like

28:38

if you do a public who is search on it,

28:40

it says like, this was protected by privacy service. But

28:43

then you can run the like MySQL

28:45

queries and discover like, okay, Stewart Rhodes

28:48

is the owner of this and here's his like

28:50

address and phone number and stuff. But then, you

28:52

know, here is the technical contact of it. And

28:55

there's more information. And so I think that

28:58

that's really interesting.
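
[As an illustration of the kind of lookup involved, here is a minimal Python sketch using the pymysql library against a local MySQL server where a dump has been imported. The database, table, and column names are hypothetical stand-ins, not Epik's actual schema.]

    import pymysql

    # Connect to a local MySQL server holding the imported dump.
    conn = pymysql.connect(host="localhost", user="root",
                           password="", database="leaked_dump")
    with conn.cursor() as cur:
        # Hypothetical tables: find the registrant hidden behind WHOIS privacy.
        cur.execute(
            """SELECT c.name, c.email, c.phone
               FROM domains d
               JOIN whois_contacts c ON c.id = d.registrant_id
               WHERE d.domain = %s""",
            ("oathkeepers.org",),
        )
        for row in cur.fetchall():
            print(row)
    conn.close()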

29:00

But also this whole Epik hack, like, it

29:02

included the

29:04

Texas GOP website, it's like a WordPress site.

29:06

It included like a SQL dump of it and

29:08

all the files for it. And so actually

29:10

like when I was looking into it, I

29:13

recreated it in Docker containers and like spun

29:15

up the website and then like changed the

29:17

app and password and logged in. And I

29:19

was just like, look around the backend

29:21

of the Texas GOP website. This

29:24

whole hack was like in response to

29:26

the Texas Heartbeat Act, which was like

29:29

the biggest restriction of abortion rights in the

29:32

US before Roe v. Wade was overturned. But

29:35

it also, this hack also included entire

29:37

like VM images. So it included

29:39

like the images of hard drives of

29:42

the virtual machines that were running their software.

29:44

And so like one of them was

29:46

like GitLab. So it's basically like an

29:48

open source version of GitHub and it

29:51

included like all their source code repositories,

29:53

but also all of their like issues

29:55

and pull requests and all the continuous

29:57

integration. So like when they merge code

29:59

to production. like all of the secrets

30:01

that actually connect to the production servers. I

30:03

don't know. It was wild. So I think

30:05

that that was probably one of

30:08

the most fascinating data sets that

30:10

I had seen. My real reaction to that

30:12

is like, damn, are people like that

30:15

stupid? Like, like,

30:17

not to, you know, victim-

30:20

blame the people like this, they're, you know,

30:22

victims of a crime here. It

30:26

just seems like this is like a nightmare for

30:28

anyone who is on the opposite end of this.

30:32

Yeah. Should a regular person,

30:35

aside from, you know, doing

30:37

password managers, etc., etc. What

30:41

should we be concerned about? You know, who's

30:44

phishing us that we should be afraid of? I

30:47

mean, the problems like like this,

30:49

when the company gets hacked and all their

30:51

data gets breached, that

30:53

is not really something that regular people can

30:55

handle. That's like the responsibility of the company.

30:58

And it's like, I mean,

31:00

I don't think that Epic

31:02

was especially competent, but even

31:05

for competent companies, it's really

31:07

hard. Like defending from

31:09

hackers is a very, very difficult situation.

31:11

It's like much easier to find a

31:13

single flaw and like hack something than

31:16

it is to find every

31:18

single flaw that anyone might find and defend against them

31:20

all. But in terms of just ordinary

31:22

people, like what you can do, like, yeah, I think

31:24

that the best you can really do is use a

31:28

password manager, have really good

31:30

passwords, try and like, like

31:33

a lot of times there's data breaches that

31:35

aren't actually like the whole

31:37

service provider gets hacked, but instead individual

31:39

accounts get hacked. So the way to

31:41

make sure your account doesn't get hacked

31:43

is use two-factor authentication. You

31:47

know, like post less information on the internet, like

31:49

don't store all of your or if you do

31:51

put it on the internet, put it, you

31:53

know, in places that are encrypted. So like,

31:55

instead of using,

31:57

storing all your stuff in Google Drive.

32:01

you know, maybe store it in like Proton

32:03

Drive or something like that. So then, you

32:05

know, if Proton Mail gets hacked or gets

32:07

law enforcement requests or whatever, they won't be

32:09

able to just hand over all of your

32:11

files. And if you are using Google

32:13

Drive, I mean, that's fine. Google is actually very secure.

32:15

But like turn on Google

32:18

Advanced Protection, which is a way of like

32:20

really locking down your Google account. It makes

32:22

it a lot harder to hack. I think

32:24

that's the best that ordinary people can do

32:26

is just use good, strong

32:28

passwords, use 2FA and

32:32

yeah, like don't have all your conversations on Discord,

32:34

have them on Signal. Kind

32:36

of piggybacking off of that, what's the most

32:38

creative intrusion you've seen? I'm

32:41

especially interested in any stories of

32:43

really wild social engineering. Let's

32:46

see. I mean, I'm not sure. So

32:48

with the data sets that I get,

32:50

generally, like, they're not always from

32:52

hackers, but when they are,

32:54

I have no idea how the intrusion happened. I just

32:56

have the data. So

32:59

yeah, so I

33:02

actually like was thinking

33:04

about this and I thought of an intrusion, but it's

33:06

not social engineering at all. But

33:08

really wild social engineering. I

33:11

mean, I just remember like,

33:14

so I've worked with a lot of like

33:19

NGOs and like human rights activists and

33:21

stuff. And I remember like

33:23

hearing about phishing emails sent to someone that looked

33:26

super convincing that was basically

33:28

like inviting you to a conference and being

33:30

like, we'll pay for it. Like,

33:32

like, you know, I know that this conference, like

33:34

in Europe or, you know, somewhere, somewhere they'll be

33:37

really fun to go to and it's really expensive.

33:39

Would be, you know, it's perfect

33:42

for you. We would just want you to like come

33:44

and attend and like maybe be on a panel

33:46

or something. And you know,

33:48

we have full funding for your flight and

33:50

for your per diem and for everything. And

33:53

that can be very enticing for, like,

33:55

nonprofit workers, especially ones that are, like,

33:57

targets of interest and stuff. But

34:00

like, so what I was thinking is

34:02

that, like, a wild

34:05

intrusion that isn't social

34:07

engineering was actually the

34:10

American Frontline Doctors, like,

34:12

telehealth companies. And I actually

34:14

I was talking to the hacker because they reached out

34:16

to me directly. So I asked them, like,

34:19

how this hack happened. And they

34:21

said it was hilariously easy to hack. And

34:23

it actually like, like, it's

34:25

kind of it's kind of funny how

34:27

incredibly simple this is, but also how

34:30

incredibly impactful it was. So

34:33

the two companies that were hacked were Cadence

34:35

Health and Ravkoo Pharmacy. Cadence

34:40

Health basically did

34:42

like the telehealth consultation. So when

34:44

someone's like, I want ivermectin,

34:46

they would pay $90 and have a doctor call

34:50

them on the phone, like a doctor appointment

34:53

basically. And those cost $90 and anyone can

34:55

make accounts. So basically, the hacker

34:57

went to Cadence Health, made an

34:59

account, and then just was

35:01

like, looking at the

35:03

HTTP requests, their browser made as they were

35:06

like, clicking around in their account. And they

35:08

noticed that one of the requests had their

35:10

account ID on it. And

35:12

it was like, get account info slash

35:14

ID. And it just included

35:16

all their information, in like a

35:18

JSON object; it also included, like, their password hash.

35:22

And they changed the ID to a different

35:24

ID. And it included

35:26

all of a different patient's information. So

35:28

they just wrote a like little script

35:30

that just iterated through the IDs and

35:34

downloaded the patient data from 255,000 patients. And

35:38

that was that hack. And then the

35:40

other one was Ravkoo Pharmacy. So

35:42

this was like the main pharmacy that

35:45

after they prescribed ivermectin and hydroxychloroquine, they

35:47

would like go to the pharmacy to fill it. And

35:50

with Ravkoo, anyone can create an account with this

35:52

pharmacy. And basically, I don't know how they did it,

35:54

but the hacker said that they discovered a special

35:57

URL, it was like the super admin

35:59

interface. And as long

36:02

as you're logged into an account, any account

36:04

at all, you have access to it. So

36:06

if you're not logged in, it like forces you to

36:08

log in. If you're logged in, you just have access

36:10

to all of this stuff. And it includes like a

36:13

list of all of the prescriptions that they had ever

36:15

filled. And so,

36:17

yeah, they just like

36:19

scraped all that information. And actually, when

36:21

I was doing this story, the Ravkoo

36:24

CEO, I like found

36:26

his phone number and called him. And he didn't

36:28

actually believe that they were hacked because they were

36:31

like, no, that's impossible. We're HIPAA compliant. We're really

36:33

secure. And then I like emailed him a screenshot

36:35

of the super admin interface. And he was just like,

36:37

oh, God, I have to call my vendor and

36:39

like hung up the phone. It's

36:41

really deflating. I don't know why I want there

36:43

to be more of a romance to all of

36:45

this, but it's

36:48

really deflating just how simple and

36:53

just ignorant a lot of this is to me. Like

36:56

it's just plain, just

36:59

plain not having your shit together. Right?

37:02

Yeah. Yeah. But I

37:04

mean, like, it really is. But

37:07

like, also, I don't know, it's hard

37:09

to do everything well. Like you

37:11

probably all have Google accounts. How

37:13

many of your Google Docs or

37:16

Google or folders in Google Drive

37:19

have shared settings that are open to anyone with the

37:21

link? Like, and do you

37:23

actually want them to be open to everyone, or

37:25

did you just not feel like typing in everyone's email

37:27

addresses to share them with? So it's like, I

37:30

think that everybody does this. And so

37:32

so it's like, it's like you have

37:34

to be kind of digitally vigilant

37:38

with your security practices to

37:41

not end up doing stuff like

37:43

this. But also, like,

37:46

like, yeah, if you're if you're having patient

37:48

data, if you're having like health care records

37:50

and stuff, you absolutely have to have

37:52

like some sort of access control, you know.

37:56

Yeah, it is deflating how easy it is.

37:58

But not everything is. I

38:00

think that that's why we're seeing these

38:02

data sets, because these are the easy

38:04

ones. Yeah, no, you're right. It's like,

38:07

you know, some people should definitely

38:09

do it. But then there

38:11

are some organizations that really, really

38:13

should be doing it. In

38:18

terms of, you know, making mistakes,

38:20

what are the kinds of mistakes

38:22

that you see when journalists who

38:24

aren't as versed in the world

38:26

of hacking reporting

38:28

on stuff like this? I mean, I

38:30

think that like, really,

38:32

it's just believing what companies

38:34

tell them and believing what

38:37

like billionaires say. Like,

38:40

like, there's just so many stories,

38:43

you know, from the last several years

38:45

about like crypto, and

38:48

how it could, you know, solve all of the

38:50

problems and how it's really secure and all sorts of

38:53

stuff. And then that's just like, turned out to not

38:55

be the case at all. So I think that

38:57

like, yeah,

39:01

just like believing a lot

39:03

of hype and not actually

39:05

verifying what companies say.

39:07

So this doesn't really like have to do

39:10

with my book or with data sets, really.

39:12

But in 2020, I worked on

39:14

the story about Zoom,

39:17

and how Zoom

39:20

basically was misleading all

39:23

of its customers claiming that it had end-to-end

39:25

encryption, and it didn't have end-to-end encryption. And

39:27

this actually led to like an FTC settlement

39:29

and where the FTC forced Zoom to implement

39:31

real end-to-end encryption. And it led to like,

39:34

I forget how many millions of dollars class action

39:36

lawsuits, which was pretty cool. But

39:38

basically, like me and Yael Grauer, the

39:41

other journalist that worked on it, we were

39:43

just looking through Zoom's like privacy policy and

39:45

asking them questions about how their end-to-end encryption

39:47

work. And we got them to kind of

39:50

admit that, well, actually, the like, keys that

39:52

protect the Zoom meetings are generated on Zoom

39:54

servers, and we do have copies of them.

39:56

And then we're like, that's not an end-to-end

39:59

encryption. And they're like, oh, well, we're just

40:01

using a different definition of end-to-end encryption. And

40:05

I think that that's probably true for

40:07

companies everywhere. They all just say whatever

40:09

the marketing people think sounds good. And

40:12

then you really have to look into

40:14

it in detail and ask them questions.

40:17

And ideally, don't even ask them questions. Reverse

40:19

engineer, how does stuff work? And

40:21

if you can figure that out, then yeah. I

40:24

feel like that's the big mistake.

40:26

It's just believing people without, like, any skepticism.

40:29

The idea that you would just believe

40:31

that Zoom is end-to-end encr... Anyway,

40:33

I'm gonna let that go before it makes my

40:35

brain, the blood shoot out of my

40:37

ears. So we mentioned Discord earlier.

40:41

And I've been kind of fascinated

40:43

by this little chat room thing that was meant

40:45

for gaming, coming to

40:47

take on this weird oversized importance. I

40:51

think maybe in ways that people don't really realize. I

40:55

think the big story from the end of last

40:57

year was Jack Teixeira, the DOD leaker, was

41:00

sharing things with the Discord group that

41:03

was supposed to, like, that he'd squirreled out of a

41:05

skiff. It's just strange stuff. Has

41:08

Discord become a part of your daily life? Is

41:10

it important to your work? And what are the

41:12

payrolls there? You

41:15

know, I actually don't use Discord all that

41:17

much. I use it, like, a little bit

41:19

now. There's a few different servers that I'm

41:21

part of. But I mean, I think that

41:24

really, like, what Discord is, is it's

41:26

like this massive, semi-private place

41:28

for people to communicate online. And so

41:30

because there's, like, millions and millions of

41:33

people that are talking in it, that

41:36

totally makes sense that

41:38

there's gonna be leaks coming from it.

41:40

And the whole, yeah, the Jack Teixeira

41:42

thing where he's posting, you know, top

41:44

secret documents about the Russia-Ukraine war, basically,

41:46

like, for clout in front of his

41:48

friends. Yeah, that's fascinating.

41:51

I mean, so actually, like, one

41:54

of the other case studies in my book does

41:56

involve a lot of leaked Discord chats. And

41:59

this is from... a bit older, this

42:01

was from like 2017 when the

42:03

people who were like really, really

42:05

using like this little

42:07

gaming chat thing

42:09

were Neo-Nazis. So the

42:13

organizers of the Unite the Right rally

42:16

in Charlottesville, Virginia, that whole thing was

42:18

organized on Discord. And then so were,

42:21

like, there were, like,

42:24

15 other Discord servers.

42:27

And so yeah, like I talk about

42:29

how, like, anti-fascists infiltrated these servers and then just

42:31

used some software to, like, once you're in

42:33

a server to just go and scrape all

42:36

of the chat history that they have access

42:38

to for like the entire server since since

42:40

it started. And I think that

42:43

this is one of

42:45

the reasons why Discord

42:48

is such a big deal because it's actually really

42:50

easy for any person in a Discord server to

42:52

just grab everything that has ever been posted to

42:55

that server. And then it's also easy because these

42:57

are, like, mostly not

42:59

public, but they're not really

43:01

that private either. It's like there's this Discord.gg link

43:03

that if you find one you can join another

43:05

server. And so, I

43:07

think this is what a lot

43:10

of the, like, infiltrators of these, like, neo-Nazi

43:12

chat rooms did, is they'd,

43:14

like, get their way into

43:16

one and then they would scrape it all

43:18

and search for Discord.gg and then they'd find

43:20

like seven others and they'd just join those

43:22

and scrape all of those.
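
[That search step is easy to picture: a minimal Python sketch that pulls discord.gg invite links out of a folder of already-exported chat logs. The folder name is invented.]

    import re
    from pathlib import Path

    # Discord invite links follow a discord.gg/<code> pattern.
    INVITE = re.compile(r"discord\.gg/[A-Za-z0-9-]+")

    invites = set()
    for path in Path("exported-chats").rglob("*.txt"):  # hypothetical export folder
        invites.update(INVITE.findall(path.read_text(errors="ignore")))

    for link in sorted(invites):
        print(link)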

43:25

And I think that, yeah,

43:27

that's one of the reasons why it's

43:29

not like a signal group. Yeah, and

43:32

there's an illusion of privacy in them

43:34

that doesn't quite actually exist, right? Yeah,

43:36

yeah, absolutely. And in fact, I mean,

43:38

I think that also it's

43:40

important to like, I don't know, I always have

43:42

in the back of my mind that anything that

43:44

is not end-to-end encrypted,

43:47

like the company has access

43:49

to it. So there's an illusion of privacy in

43:52

your Google Docs too. It's a lot more private,

43:54

I think, than Discord channel where like, you know,

43:56

a bunch of strangers might join and you might

43:58

not know them. But yeah.

44:02

Do you draft copy in Google Docs or do

44:04

you use something else? It

44:06

depends on the story. If it's a

44:08

story where it's just like, doesn't

44:10

matter, like it's totally, I'm not

44:13

like, don't have any source protection

44:15

things, then I can sometimes

44:17

do Google Docs. But

44:21

otherwise, I actually use Word a lot. Yeah,

44:25

if it has anything to do with secret

44:30

information or source protection or whatever, then like

44:32

Intercept policy is like, we don't use Google

44:35

Docs for any of that. All right, this

44:37

one is from one of my friends who

44:40

wanted me to ask this. What is the

44:42

worst kind of data set

44:44

format to work with, and why is

44:46

it XML? So,

44:48

okay, so XML can be obnoxious.

44:51

But one good thing about XML is that

44:54

it's actually like an open format and there's

44:56

libraries that can work with it. What I

44:58

find even worse than XML is

45:00

like weird proprietary crap. So

45:03

like once someone

45:06

sent me, like, some surveillance videos

45:09

that were from some like, I

45:11

don't know, some like surveillance camera company

45:13

and the videos weren't in a

45:15

normal video format. The only way to watch them is

45:17

to, like, start up

45:19

a Windows VM and install the

45:21

company's like software and then you could open

45:24

them from there. And you

45:26

could, or you could maybe spend like hours

45:28

and hours and hours trying to figure out

45:30

how to like get an MP4 out of

45:32

this. So like something like that, it's just

45:34

obnoxious. And then even just like, I

45:37

remember I worked on, all right, I helped with that

45:39

story where it was

45:41

a leaked Oracle

45:43

database of

45:46

like Chinese police stuff that

45:48

was like involved in surveilling

45:50

Uyghurs and it was

45:53

an Oracle database and Oracle is like

45:55

a proprietary database thing. And so it'd

45:57

be so much easier if it would

45:59

just like. like MySQL or Postgres or

46:01

something. And none of the people that were working

46:03

on this, the tech people,

46:05

were that familiar with Oracle and you have

46:07

to buy a license. And eventually we managed

46:10

to figure out how to convert it into

46:12

Postgres so that we could actually work with

46:14

it. But yeah, just weird proprietary stuff is

46:16

really obnoxious at that stage. What's

46:19

the most common stuff you work with? Usually

46:21

SQL and that kind of thing? Is

46:23

it mostly that? Yeah, so just

46:26

collections of office documents are

46:28

really common. Like

46:30

a PDF and like Word files and

46:32

Excel files and things like that. Email

46:35

is really common. So normally, sometimes

46:37

it's folders full of EML files, which

46:39

is, like, the standard format for a single

46:42

email. But then there's

46:44

also, like, mbox files and PST Outlook files,

46:46

there's a whole chapter called reading other people's

46:48

email that teaches you how to deal with

46:51

all of this stuff and how to import

46:53

it into Thunderbird and things like that.
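
[For a taste of what working with those files looks like, here is a minimal Python sketch using the standard library's email module to print the date, sender, and subject of every .eml file in a folder; the folder name is illustrative.]

    from email import policy
    from email.parser import BytesParser
    from pathlib import Path

    # Walk a folder of .eml files and print basic headers from each message.
    for path in Path("email-dump").rglob("*.eml"):  # hypothetical dump folder
        with open(path, "rb") as f:
            msg = BytesParser(policy=policy.default).parse(f)
        print(msg["Date"], msg["From"], msg["Subject"])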

46:56

But then for

46:58

structured data, like JSON files,

47:00

like JSON data and CSV spreadsheets are

47:03

like really, really common. Like

47:05

the American Frontline Doctors, it

47:08

was just nothing but

47:10

JSON files and CSV files and that's

47:12

it. And then yeah, SQL is

47:14

really common too. Bringing

47:17

us back kind of to where we are in

47:19

the present, you

47:21

know, not a great

47:23

week for jobs in

47:26

the world of journalism, but

47:28

at the same time, this is after we've had a

47:31

couple of years of a

47:33

lot of OSINT journalists

47:35

that are just, you know, guys

47:38

on the internet figuring things out. How

47:42

are you feeling about this industry right now? In

47:46

terms of like really

47:48

bad OSINT that

47:51

doesn't necessarily really like mean what people

47:53

think it means, I mostly

47:55

just ignore all of that stuff, and it

47:57

sort of mixes in with other... kind

48:00

of like, I don't know, like

48:02

the internet is full of things,

48:05

of websites full of like bad reporting or

48:07

misinformation or spam or like a mix of

48:09

all of them. And so, um,

48:12

mostly don't really look like

48:15

I mostly just ignore that stuff. Um, I,

48:17

although I do think that, that OSINT

48:19

done well can be, like, really exciting

48:21

and interesting, especially if you like really just

48:24

narrow it down to like, okay, I've connected

48:26

these two things and here's, here's my proof.

48:29

Um, uh, but in terms of

48:31

the industry, I don't

48:34

know. I mean, things are

48:36

grim. Um, I, I'm actually,

48:38

uh, very happy about the,

48:40

the like kind of recent new direction of

48:42

The Intercept though, where, where it, uh, split

48:45

off from first look media, which is its

48:47

parent company. And so now it's just a

48:49

completely independent nonprofit and it's, you know, um,

48:52

uh, like, like it seems

48:54

like The Intercept is in a good place. So

48:56

I'm happy about that. Um, the, the whole industry

49:00

as a whole, I don't know. I really

49:03

hope that it doesn't get sucked into too

49:05

much AI. I

49:07

think it's going to in the short term, like it's just

49:09

gonna, we're just going to have to suffer through that, I

49:11

think, um, until

49:13

it like collapses in on itself.

49:16

Uh, I think that's, yeah, I think

49:18

we just, you're right. Yeah. But you

49:20

think about like, you get to live

49:22

through interesting times. Isn't

49:24

that wonderful? Yeah.

49:27

But maybe the AI is not going to

49:29

parse the data sets as well. I

49:31

don't know. Yeah. Is that

49:33

it? So one thing that, that

49:35

I found that, like, ChatGPT

49:38

is really good at is helping

49:40

you write code. So if

49:42

you're, if you're new to, to this stuff, if

49:44

you want to like follow along with my book

49:46

and you're like very intimidated by the Python stuff

49:48

and you're like, okay, I need to write a

49:50

script that like opens a CSV file with millions

49:53

of rows and then loops through them. You

49:55

can just ask ChatGPT, hey, write Python code to

49:57

open a CSV file and loop through the rows,

49:59

and it'll give you a little snippet of

50:01

code. And so that sort of thing I think could

50:03

actually be really helpful. But in terms of

50:05

actually like finding the stuff for

50:08

you or like writing stuff for

50:10

you, no. Yeah, but for writing

50:12

code, I'm actually a big fan of it

50:14

for helping you write code faster.
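
[That kind of snippet really is only a few lines. For example, a minimal sketch that streams through a huge CSV one row at a time without loading it all into memory; the filename and column name are made up.]

    import csv

    count = 0
    with open("huge-dataset.csv", newline="") as f:
        # csv.DictReader reads lazily, so millions of rows won't exhaust memory.
        for row in csv.DictReader(f):
            if row.get("state") == "CA":
                count += 1

    print(count, "matching rows")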

50:17

I love that. How

50:20

often do you mess with GIS data, if

50:22

at all? A

50:24

little bit. Like I, so

50:27

the whole like Parler data that

50:30

has GPS coordinates, I actually,

50:33

while I was writing the book, I was, like, spending a

50:35

lot more time with the Parler data than I had,

50:37

like, you know, when it came out, I

50:40

was learning a lot about GIS software

50:42

too. But I basically like, you

50:46

know, figured out,

50:48

like, various

50:50

options to map GPS

50:52

coordinates, which are pretty cool. One

50:55

of the things with the American Frontline Doctors data

50:57

was, I had patient data and I had everyone's

50:59

addresses, but I didn't want to like, you know,

51:02

publish any one of the addresses, but I was

51:04

really curious like what states have the most people

51:06

and what cities have the most people. And so

51:08

I wrote some code

51:10

that basically like, put the list of like

51:13

all the patients in each city and

51:17

geocoded those cities. So I had GPS coordinates

51:19

for the cities and then I like mapped

51:21

it all. And so the article we published

51:23

actually had like an interactive map where you

51:25

can see, where you can like scroll around

51:27

and see the cities that have the most

51:29

and the least people

51:32

who are really into ivermectin and

51:34

and hydroxychloroquine and probably are anti-vax and

51:36

into Trump themselves.
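
[A minimal sketch of that aggregate-then-geocode approach, using the geopy library. The input file and column names are illustrative, and a real script would cache and rate-limit the lookups.]

    import csv
    from collections import Counter
    from geopy.geocoders import Nominatim

    # Count patients per (city, state) so no individual address is ever published.
    counts = Counter()
    with open("patients.csv", newline="") as f:  # hypothetical input
        for row in csv.DictReader(f):
            counts[(row["city"], row["state"])] += 1

    geolocator = Nominatim(user_agent="leak-mapping-example")
    with open("cities.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "state", "patients", "lat", "lon"])
        for (city, state), n in counts.most_common():
            location = geolocator.geocode(f"{city}, {state}")  # one lookup per city
            if location:
                writer.writerow([city, state, n,
                                 location.latitude, location.longitude])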

51:38

Everyone loves a good map. Yeah. So

51:42

Emily, do you want to take this last one? Yeah,

51:44

I mean, so many of our

51:46

listeners, you know, we definitely

51:49

have listeners, I'll say, who

51:52

have access to data sets and

51:54

might at some point, maybe

51:56

that point is now, want to become a

51:58

source for a journalist, for whatever

52:00

reason, what kind of

52:03

advice would you have for them

52:05

about managing risk and doing, you

52:07

know, risk assessment on their end

52:09

before, you know, becoming a source?

52:12

So I would say, first of all,

52:14

if you're thinking about this at all,

52:16

don't do any of this stuff on

52:18

your work devices. Like, don't

52:21

search for information about, like, how do

52:23

you leak to a newsroom from your,

52:25

like, work computer. Don't

52:27

use your work computer or your work phones as much

52:29

as possible, but generally that's not possible. So

52:32

generally, if you have access to the data set, it's only from

52:34

your work device. If

52:37

you are thinking of leaking something,

52:39

it's really good to think about how many people

52:42

have access to the thing that you're leaking. It's

52:44

a big difference if you're going to, like, leak

52:46

an email that was sent out to your whole

52:48

company than it is if you're going to leak

52:50

an email that was sent out to three people.

52:54

And so I think that it's always important

52:56

to just think like the leak investigator. So,

52:58

like, after, you know,

53:00

let's say you become a source, you

53:03

leak some stuff, a journalist publishes

53:05

an article, like, at

53:07

that point when this is public, that's

53:09

when they're going to start a leak

53:11

investigation. And so, like, think about what they have

53:13

access to. They're going to try and, like, and

53:16

come up with a suspect list and narrow

53:18

it as much as possible. And so, yeah,

53:20

like, think about all of the

53:22

things that you're going to do. The systems that you

53:24

use keep logs. Like, did you know that every

53:26

single time you open any single Google Doc, there

53:29

is a log of that in Google admin?

53:31

So, like, the administrator of your Google workspace

53:33

could go in and just, like, look at

53:36

a document and see, like, oh, this person

53:38

loaded it at these specific times and then

53:40

they loaded it, you know, a few times

53:42

every day. And then, like, a week

53:44

later, this document was published in the news.

53:47

Right. So, like, all that stuff is logged. And I

53:50

think that, like, thinking about that is

53:54

the most helpful thing to do.

53:57

Yeah, just, like, be aware that everything

54:00

that you do is basically under

54:02

surveillance. Everything leaves a trail, and

54:06

if you want to try and do this as

54:10

safely as possible, then, like, either try to

54:12

leave as little of a trail as you can, or try

54:14

to make sure that like your trail is mixed

54:16

up with, like, thousands of other people's trails. Micah

54:19

Lee, thank you so much for coming on to Cyber

54:21

and walking us through this. The book is Hacks, Leaks,

54:23

and Revelations: The Art of Analyzing Hacked

54:26

and Leaked Data. And it's out

54:28

now, yes? It's out now. And

54:30

you can go to hacksandleaks.com. That's the

54:32

book's website. Thank you so much. Thank

54:35

you for having me. This has been great.
