CORECURSIVE #110

briffa_sep98_e.pro

The File That Sparked a Storm


Can a single line of code change the way we see science, policy, and trust?

In this episode we explore the “Climategate” scandal that erupted from leaked emails and code snippets, fueling doubts about climate science. What starts as an investigation into accusations of fraud leads to an unexpected journey through the messy reality of data science, legacy code struggles, and the complex pressures scientists face every day.

Along the way, we uncover stories of hidden errors and misunderstood phrases taken out of context, revealing a world where science, software engineering, and human complexity intertwine. This story doesn’t just challenge assumptions—it shows the power and importance of transparency in science and technology.

Join Adam as he digs deep into Climategate, uncovering what really happened when code got thrust into the spotlight, and what it means for trust, truth, and open science.

Transcript

Note: This podcast is designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emphasis that’s not on the page

Athens

Adam: In Athens, 2013, Maria sits in a cafe. She’s 32, and she’s been jobless for three years. Like many Greeks, she feels stuck; each month tougher than the one before. Newspapers drone on about austerity. There are cuts, there are layoffs, there are pensions slashed. Life is closing in on her.

That morning, though, she spots an unusual headline. It’s a strange story from across the ocean about two Harvard economists, Carmen Reinhart and Kenneth Rogoff. Their 2010 paper claimed that when a country’s debt tops 90% of its GDP, economic growth takes a hit. This idea became a key reason for austerity policies worldwide, including those imposed on Greece.

But the news article Maria reads is an update. A graduate student, with his professors, found a critical error in Reinhart and Rogoff’s analysis: a simple spreadsheet mistake.

A miscalculated formula left out significant data, leading to factual inaccuracies. Instead of economies shrinking when debt tops 90% of GDP, as the original paper had claimed, the corrected figures show average growth rates of around 2%. This wasn’t just an academic blunder; it had real-world fallout.

Governments misled by the flawed study, where the authors had simply not selected all the rows, rolled out austerity measures, and because of that, there were prolonged recessions, soaring unemployment, and social unrest. In Greece, it was a national crisis: unemployment over 27%, public services falling apart.

The lives of people like Maria were in chaos. It’s unsettling, this idea that a simple spreadsheet error, a coding mistake, could steer global economic policy and change the lives of millions of people.

Maria’s story is, in fact, fictional, just a composite, but the people affected by this error were real. And it makes you wonder what other unseen mistakes or unintentional deviations in code are quietly shaping our world, and what happens when those lines of code are thrust into the spotlight.

Intro

Adam: Welcome to CoRecursive. I’m Adam Gordon Bell. Today we’re exploring the invisible code that quietly shapes the world around us. Code most of us never think about at all. Here’s an example. My mom doesn’t own a computer unless you count her flip phone.

She’s nowhere near the Silicon Valley bro stereotype. But she did write code once in university. She studied psychometrics, measuring intelligence and cognitive skills. And back then, writing her research up meant running statistical calculations, which meant writing programs on punch cards and submitting them to batch processors to calculate correlations.

A lot of the world’s most important code is like that or like that GDP spreadsheet. It’s just some simple calculations tucked away somewhere in academia, sitting on a coauthor’s machine that’s only pulled up when a diagram needs to be regenerated or a constant needs to be tweaked, or when somebody requests it. It’s invisible code, but it’s powerful. It affects policies. Often this hidden code stays unnoticed until something goes wrong or until a single line out of context gets thrust into the spotlight.

That’s what today’s story is about. A story about more than just scientific data, a story about the human side of data analysis and the pressures on those who do it.

I’m talking about climategate.

The Accusation

Adam: Does anyone remember Climategate? It was all over the news back in 2009, 2010—15 years ago—and I vaguely recall it being a really big deal. I remembered something about leaked emails and climate change. It was one of those scandals that just happened in my past, and if I thought hard about it, I recall hearing about scientists getting caught red-handed fudging data to make global warming look worse than it was.

They were supposed to be truth seekers, but they were twisting the numbers to fit their agenda. I think there was a hack or a leak, and someone in climate science lied, and I kind of get why they would. Right? They were up against powerful oil lobbyists, and they were trying to win; they were trying to convince the world about global warming.

At least that’s what I remember.

But then I looked into it, and I realized it all boiled down to a single file, a single piece of code. That is briffa_sep98_e.pro — rolls off the tongue, doesn’t it? Climategate was like that spreadsheet error, but on a massive scale because it shifted how people saw scientists, and it sparked distrust in science for some, maybe for many.

And that trend continues today. But, but here’s the thing: we can find the truth ourselves. Today, I’m gonna go download the actual leaked Climategate files, open up the controversial code, and dig through it step by step, all to answer one big question.

Was Climategate evidence of scientific fraud, or was it something else entirely? To answer this, we’re gonna take some detours. We’ll explore strange files with cryptic names, decipher obscure programming languages like IDL, venture into unrelated scandals like the Alzheimer’s research scandal. And at times, it might feel like I’m getting lost in the details, but trust me, we’re always chasing the same goal: uncovering exactly what happened in Climategate, because I think it matters. I think we live in a world where science itself is increasingly under attack.

Where misinformation spreads faster than actual explanations, and where trust in experts is super, super fragile. So no matter what we uncover, the act of careful investigation itself is an essential skill. It will help us figure out what and who to trust in a moment when it feels like the stakes on the truth have never been higher.

The Back Story

Adam: It all started on November 17th, 2009. Something was wrong at the Climate Research Unit at the University of East Anglia in Norwich, England, a city of about 150,000 people. A backup server holding years of emails and research data had been breached. The university called it a sophisticated and carefully orchestrated attack.

160 megabytes of data were copied: emails, documents, code, everything.

And then there were some whispers online and a curious upload to Real Climate, and then anonymous posts hinting at secrets, suggesting that climate science was too important to be kept under wraps. By November 19th, whispers became a roar. An archive file with everything was copied across the internet, spreading fast.

Suddenly, thousands of private emails and documents were out there. Climate change denial blogs jumped on it, claiming the truth was finally being revealed.

In just a few days, the media picked up the story, and headlines ran about leaked emails and a brewing scandal. This was all just weeks before the Copenhagen Climate Summit. The University of East Anglia confirmed a breach, and the police got involved. The world watched as Climategate erupted.

No one knew the full impact yet, but it was clear: something big had just happened, and the world of climate science was about to be shaken.

When these files first leaked, James Delingpole at The Telegraph reported that global warming was based on one massive lie. Now, I just want to say I believe in global warming, and I believe that it’s caused by humans. My intent here is not to give a platform to the science deniers, but I do want to explore how we can move beyond just trusting the experts. How we can look at things ourselves, how we can investigate what is the truth, using our own minds and, you know, some effort.

Downloading the Leak

Adam: That’s why I found these leaked files and downloaded them. It’s a zip file, FOIA.zip, and it’s packed with documents.

It’s split into two folders: documents and mail. The mail folder has like 11,060 text files with names like 125423285.txt. If you open one, it’s just a plain-text email between two researchers, usually talking about a paper they’re working on.

The key file of our story, briffa_sep98_e.pro, is in the documents folder, in a directory called harris-tree. This file is considered the smoking gun that triggered the controversy and led to an entire university lab being investigated by the UK House of Commons, which led to eight official inquiries, articles in the New York Times, and articles in the Washington Post that claimed the climate scientists were lying. They claimed that they were hiding things, that the world was actually getting cooler.

And it’s just one file, a small file: 150 lines of what turns out to be the IDL programming language, which is kind of like MATLAB or NumPy, but Fortran-based. IDL is mainly used in science for number crunching and graphing.

It’s imperative code: set this variable, then load this one, loop over these. And it’s pretty heavily commented, although in IDL the comments start with a semicolon, which I find a bit confusing, but I got used to it.

;
; PLOTS 'ALL' REGION MXD timeseries from age banded and from hugershoff
; standardised datasets.
; Reads Harry's regional timeseries and outputs the 1600-1992 portion
; with missing values set appropriately.  Uses mxd, and just the
; "all band" timeseries
;****** APPLIES A VERY ARTIFICIAL CORRECTION FOR DECLINE*********
;
yrloc=[1400,findgen(19)*5.+1904]
valadj=[0.,0.,0.,0.,0.,-0.1,-0.25,-0.3,0.,-0.1,0.3,0.8,1.2,1.7,2.5,2.6,2.6,$
  2.6,2.6,2.6]*0.75         ; fudge factor
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'
;
loadct,39
def_1color,20,color='red'
plot,[0,1]
multi_plot,nrow=4,layout='large'
if !d.name eq 'X' then begin
  window, ysize=800
  !p.font=-1
endif else begin
  !p.font=0
  device,/helvetica,/bold,font_size=18
endelse
;
; Get regional tree lists and rbar
;
restore,filename='reglists.idlsave'
harryfn=['nwcan','wnam','cecan','nweur','sweur','nsib','csib','tib',$
  'esib','allsites']
;
rawdat=fltarr(4,2000)
for i = nreg-1 , nreg-1 do begin
  fn='mxd.'+harryfn(i)+'.pa.mean.dat'
  print,fn
  openr,1,fn
  readf,1,rawdat
  close,1
  ;
  densadj=reform(rawdat(2:3,*))
  ml=where(densadj eq -99.999,nmiss)
  densadj(ml)=!values.f_nan
  ;
  x=reform(rawdat(0,*))
  kl=where((x ge 1400) and (x le 1992))
  x=x(kl)
  densall=densadj(1,kl)     ; all bands
  densadj=densadj(0,kl)     ; 2-6 bands
  ;
  ; Now normalise w.r.t. 1881-1960
  ;
  mknormal,densadj,x,refperiod=[1881,1960],refmean=refmean,refsd=refsd
  mknormal,densall,x,refperiod=[1881,1960],refmean=refmean,refsd=refsd
;
; APPLY ARTIFICIAL CORRECTION
;
yearlyadj=interpol(valadj,yrloc,x)
densall=densall+yearlyadj
  ;
  ; Now plot them
  ;
  filter_cru,20,tsin=densall,tslow=tslow,/nan 
  cpl_barts,x,densall,title='Age-banded MXD from all sites',$ 
    xrange=[1399.5,1994.5],xtitle='Year',/xstyle,$ 
    zeroline=tslow,yrange=[-7,3]
  oplot,x,tslow,thick=3 
  oplot,!x.crange,[0.,0.],linestyle=1
  ;
endfor
;
; Restore the Hugershoff NHD1 (see Nature paper 2)
;
xband=x
restore,filename='../tree5/densadj_MEAN.idlsave'
; gets: x,densadj,n,neff
;
; Extract the post 1600 part
;
kl=where(x ge 1400)
x=x(kl)
densadj=densadj(kl)
;
; APPLY ARTIFICIAL CORRECTION
;
yearlyadj=interpol(valadj,yrloc,x)
densadj=densadj+yearlyadj
;
; Now plot it too
;
filter_cru,20,tsin=densadj,tslow=tshug,/nan 
cpl_barts,x,densadj,title='Hugershoff-standardised MXD from all sites',$ 
  xrange=[1399.5,1994.5],xtitle='Year',/xstyle,$ 
  zeroline=tshug,yrange=[-7,3],bar_color=20
oplot,x,tshug,thick=3,color=20
oplot,!x.crange,[0.,0.],linestyle=1
;
; Now overplot their bidecadal components
;
plot,xband,tslow,$
  xrange=[1399.5,1994.5],xtitle='Year',/xstyle,$ 
  yrange=[-6,2],thick=3,title='Low-pass (20-yr) filtered comparison'
oplot,x,tshug,thick=3,color=20
oplot,!x.crange,[0.,0.],linestyle=1
;
; Now overplot their 50-yr components
;
filter_cru,50,tsin=densadj,tslow=tshug,/nan 
filter_cru,50,tsin=densall,tslow=tslow,/nan 
plot,xband,tslow,$
  xrange=[1399.5,1994.5],xtitle='Year',/xstyle,$ 
  yrange=[-6,2],thick=3,title='Low-pass (50-yr) filtered comparison'
oplot,x,tshug,thick=3,color=20
oplot,!x.crange,[0.,0.],linestyle=1
;
; Now compute the full, high and low pass correlations between the two
; series
;
perst=1400.
peren=1992.
;
openw,1,'corr_age2hug.out'
thalf=[10.,30.,50.,100.]
ntry=n_elements(thalf)
printf,1,'Correlations between timeseries'
printf,1,'Age-banded vs. Hugershoff-standardised'
printf,1,'     Region    Full   <10   >10   >30   >50  >100'
;
kla=where((xband ge perst) and (xband le peren))
klh=where((x ge perst) and (x le peren))
ts1=densadj(klh)
ts2=densall(kla)
;
r1=correlate(ts1,ts2)
rall=fltarr(ntry)
for i = 0 , ntry-1 do begin
  filter_cru,thalf(i),tsin=ts1,tslow=tslow1,tshigh=tshi1,/nan
  filter_cru,thalf(i),tsin=ts2,tslow=tslow2,tshigh=tshi2,/nan
  if i eq 0 then r2=correlate(tshi1,tshi2)
  rall(i)=correlate(tslow1,tslow2)
endfor
;
printf,1,'ALL SITES',r1,r2,rall,$
  format='(A11,2X,6F6.2)'
;
printf,1,' '
printf,1,'Correlations carried out over the period ',perst,peren
;
close,1
;
end

Anyways, in this file, right at the top, in all caps with asterisks before and after to set it off as a heading, it says: applies a very artificial correction for decline. Artificial correction: it’s right there in the code, just two lines down from the top of the file, followed by a list of values. And those values are labeled “fudge factor.” Fudge factor, artificial correction. This wasn’t sophisticated climate modeling jargon, right? This sounds like they were just making stuff up.

But to get to why this artificial correction stirred things up, you kind of need to know what was going on at the time: what was happening in the nineties, the late nineties, and the early two thousands—and about the hockey stick graph.

The Hockey Stick

Adam: In the late nineties, climate scientist Michael Mann, along with some others, Raymond Bradley and Malcolm Hughes, introduced the hockey stick graph.

It showed global temperatures holding steady for a thousand years and then shooting up sharply in the late 19th and early 20th centuries. Picture a hockey stick lying flat on the ground, and then suddenly, at the end, the blade curving upward. That’s the shape of temperature over time for the globe and for the Northern Hemisphere.

The graph wasn’t just scientific trivia, right? It exploded into the public view. It became this shorthand for the urgency of climate change. Al Gore held it up. It was a big moment in An Inconvenient Truth, and suddenly this image, this shape, was everywhere, a symbol of the crisis that was going on.

But that power made it a target. Almost immediately, it faced fierce scrutiny. Skeptics didn’t just question it; they attacked, claiming the data was manipulated to exaggerate warming. To them, this wasn’t science revealed; it was a political weapon forged to institute drastic policies. So when Climategate erupted and phrases like artificial correction and fudge factor popped up in the leaked code, skeptics thought they’d hit the jackpot, right?

They had proof of fraud. They had proof that they didn’t have to worry.

Here’s the critical question. Was the hockey stick graph genuinely compromised? Was somebody misrepresenting things, or was this controversy more about misunderstanding? Were people misunderstanding the scientific process? Thankfully, we have the code. Now I just need to figure out how IDL works, and I need to find the data and understand what’s going on here.

Eric Raymond

Adam: The fudge factor is actually pretty straightforward. It’s a series of numbers from 1400 to 1992. It starts at zero, so we have a zero value from 1400 to 1904. Then it dips negative into the thirties, and then it shoots up in the fifties all the way through the seventies. Finally, it levels off.

I couldn’t actually figure out how to run IDL, so I did what any developer would do. I just converted it to Python.

If you graph those values, you see a long flat line, the shaft of a hockey stick, and then from 1950 onwards a blade that tilts sharply upward.

The code does more than just graph that fudge factor, though. It reads in climate data and applies a low-pass filter to it, basically smoothing it out, and then it applies that fudge factor over top. So I did the same thing. I made up random climate data from 1400 to now, and then I applied the very artificial correction.

Then I can graph both, with the correction and without. Without it, it’s a very flat line; with it, it turns into a hockey stick. The fudge factor completely overshadows the real data. I can see why the skeptics were concerned.
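For the curious, here’s a rough sketch of what that Python conversion looks like. The yrloc and valadj values are copied verbatim from the leaked file; np.interp stands in for IDL’s interpol, and the flat series is my made-up stand-in data, not real climate data.

```python
import numpy as np

# yrloc and valadj copied verbatim from briffa_sep98_e.pro:
# 1400, then every 5 years from 1904 to 1994.
yrloc = np.concatenate(([1400.0], np.arange(19) * 5.0 + 1904))
valadj = np.array([0., 0., 0., 0., 0., -0.1, -0.25, -0.3, 0., -0.1,
                   0.3, 0.8, 1.2, 1.7, 2.5, 2.6, 2.6, 2.6, 2.6, 2.6]) * 0.75

assert len(yrloc) == len(valadj), 'Oooops!'  # mirrors the IDL sanity check

# IDL's interpol(valadj, yrloc, x) is linear interpolation, like np.interp.
years = np.arange(1400, 1993)
yearlyadj = np.interp(years, yrloc, valadj)

# Apply the "very artificial correction" to flat stand-in data: the
# correction alone produces the hockey stick shape.
flat = np.zeros_like(years, dtype=float)
adjusted = flat + yearlyadj
print(f"1400: {adjusted[0]:+.2f}  1950: {adjusted[years == 1950][0]:+.2f}  "
      f"1992: {adjusted[-1]:+.2f}")
```

Graphing `adjusted` shows what the episode describes: zero until the early 1900s, a small dip, then a steep rise that levels off near 1.95.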

When this surfaced, Eric Raymond, a well-known open source advocate, the guy who wrote The Cathedral and the Bazaar, and also a well-known social conservative, saw it too. He did the same process and found some of the same issues.

“This is blatant data cooking, plain and simple. It flattens the warm temperatures of the 1930s and forties. See those negative coefficients? Then it adds a positive multiplier to create a dramatic hockey stick. This isn’t just a smoking gun. This is a siege cannon with the barrel still hot.”

Eric Raymond’s a vivid blogger, right? “A siege cannon with the barrel still hot” is powerful imagery, and it was coming from an expert in software, so it was hard to dismiss. He wasn’t just some random internet crank. Eric at the time had a big book out. He was a respected figure in the tech world, and he definitely understood code.

His vivid take on the situation helped shape how people first saw the code. He posited that this was an error cascade: the people at CRU had manipulated climate data with this hockey stick fudge factor, and that had led many people to believe a false narrative about the climate. Soon, the world was buying into this big lie, until the leak happened and the deception came to light.

Some claimed that climate change was fake, and this fudge factor in this code was proof. Climate change, of course, wasn’t fake, but that didn’t mean that scientists weren’t nudging the numbers. Both could be true. What was really happening? With any good investigation, you can’t stop at the first piece of evidence that fits the narrative. You have to keep digging, especially when the accusations are this big.

And the deeper I dug, the more I kept seeing another infamous phrase that kept coming up, one that seems like a direct confession: “hide the decline.” There was an email in this leak that talked about “Mike’s Nature trick,” in reference to the original hockey stick graph published in Nature. To bloggers and to the mainstream media, this felt like a confession. Some thought the hack was an inside job: maybe someone at CRU was fed up with all the lies, and so they leaked this data out.

But before jumping to conclusions, we need to understand what this code is really doing. You see, climate science is actually pretty complicated. You can’t just read the file. You need to understand the context.

So heads up, we are about to do a deep dive, but stick with it. I think it’s worth it.

Earth’s Corrupted Log Files

Adam: All right. Imagine this: you’re on call. It’s 2:00 AM, and the pager goes off. The main transaction system is throwing errors; latency is spiking. You dive in, but something’s off. The detailed performance logs—the granular stuff you need—only go back six hours; before that, you just have daily averages. Nothing useful for debugging this spike. You can see the system is acting weird now, but the crucial question is: is this spike completely unprecedented? Or is this just Tuesday, when the batch jobs run and the system always throws alerts like this that you can safely ignore? Without that historical context, without those older logs, you’re flying blind trying to figure out the root cause.

Climate science is exactly like this, but the system is planet Earth, and the stakes are considerably higher. We have solid, detailed data on the Earth’s climate: thermometer readings from weather stations, from ships, from lots of places, going back roughly 150 years. This is the instrumental record, and it tells a clear story: the planet’s average surface temperature has risen by about one and a half degrees Celsius over the last century.

But just like that production system with only six hours of logs, we have a huge blind spot before the late 1800s. Is that degree and a half of warming normal for the Earth, or is it abnormal? We just don’t know.

So, to answer this question, scientists need to become data detectives. They need to find ways to reconstruct climate history from before the time of widespread measurement. But this isn’t like restoring logs from archives. Nature doesn’t keep clean, standardized, you know, JSON files. The data logs scientists have to work with are things like the width of tree rings, the chemistry of ancient ice layers drilled up from Greenland, the skeletons of corals, or even the temperature profiles found deep underground in boreholes. These are called climate proxies, and they’re imperfect; they’re noisy and measure climate indirectly. They’re sparsely located around the globe, and they sometimes record things other than temperature. They have gaps, and they come in completely different formats.

Piecing together the Earth’s climate history from fragmented and messy data is a huge challenge. Climate science is actually a lot like data archaeology. You’re using complex statistical modeling and a painstaking process to try to figure out if the picture you’re assembling is an accurate representation using all this proxy data.

The Proxies

Adam: So let’s look at some of the main types of data used. It’s really the only way to get an understanding of what’s happening in that file. The most famous temperature proxies are the tree rings, and this is essential to the story because this is actually what CRU focuses on. Trees grow a new layer each year, and how thick or dense that layer is often depends on the conditions during the growing season, maybe how warm the summer was or how much rain fell.

So you find some really old trees, drill a core, and count the rings back in time, measuring their properties. It sounds simple, but it’s actually messy. Trees only grow in the mid-latitudes of the globe, and you won’t find any trees in the ocean or in Antarctica. And even where they do grow, it’s not just climate affecting them. Younger trees grow faster; trees get diseases. Maybe a nearby tree falls, giving the tree you’re measuring more sunlight. It’s like a performance metric that’s being affected by random GC pauses, network hiccups that you weren’t tracking, the amount of work available to do, and a million other factors. But there are actually so many trees. You get lots of data, and hopefully with that much data, the individual noise can cancel out, and you can find the signal—the aggregate growth rate year upon year for an area where the trees are, going back as far as those trees do.

And actually even further, we’ll get to that. Next up are ice cores. You drill deep into an ice sheet on Greenland or Antarctica, or in a high mountain glacier, and you can get a lot of data out because as snow falls and compresses into ice year after year, it traps tiny bubbles of atmosphere from that specific year. Scientists can measure the CO2 concentration from hundreds or thousands of years ago. Ice cores are how we know that today’s CO2 levels are unprecedented. The ice itself, the frozen water molecules, also hold clues. The ratio of heavy oxygen isotopes to light ones changes depending on the temperature when the snow originally formed. So that’s another proxy.

But it’s not perfect. The isotope ratio can be thrown off by where the snow came from, not just the local temperature. And the deeper you go, the more the ice layers get compressed together, so the yearly resolution gets fuzzier and fuzzier. It’s like a log file where older entries are being aggressively compressed.

For oceans, and especially in the tropics, scientists look at corals. Corals build skeletons out of calcium, and they add layers year by year, sort of like tree rings. So corals give us this precious data that we were missing from the vast ocean areas where trees don’t grow. And then there are other types of proxies. There’s layers of sediment washed into lakes each year that can tell you about the levels of snow melt, and you can use that to infer temperature. Fossils and deep ocean mud give clues about temperature over millennia. They’ll often be really fuzzy in terms of what year it is, and you can even measure temperature down in boreholes that are drilled deep into the earth’s crust.

I don’t totally get how that one works.

The Data Integration Nightmare

Adam: But the point is, you’ve got all these different types of proxies. Tree rings measure summer temperatures in North America. Coral skeletons record sea surface temperatures in the tropical Pacific. Ice cores log polar temperatures. Lake mud will tell you about the spring snow melt.

So they’re all recording something about climate, but they’re all indirect and they’re all noisy, and they all have different time resolutions. Some annual, some spanning decades, some spanning centuries. And the dating isn’t always perfect. Someone is piecing this data together by hand.

Also, they all cover different parts of the globe and different seasons. Some stop abruptly, some end up with weird glitches in them. So how do you take this mix of messy, scattered, imperfect data and turn it into a clear picture of the climate over time?

How do you pull together data from systems that are so different and that are barely documented, and that are sometimes reliable, and get out of that a reliable view of the system’s past, of the Earth’s past?

Adam: The first problem you have to overcome is uneven data distribution. You might have hundreds of tree ring records from North America, but only a few crucial ice core records from the Arctic, and also a few coral records from the tropics. So if you pick a year, you have hundreds of values from different proxies and locations, but most of it’s tree rings.

If you just toss all this raw data into a model, the tree rings would dominate, skewing the results to reflect only the mid-latitude forests and ignoring all this vital polar and ocean data. That’s not ideal for a global temperature view.

So before we put together a model, we need to pre-process the data. We have to transform that chaotic mix of raw proxy measurements into smaller, more structured, and representative sets of features. We do this with principal component analysis.

It works like this. Imagine you’re monitoring, again, a massive microservice deployment. You’ve got hundreds, maybe thousands of metrics streaming in: CPU, load, memory usage, request latency, error counts, database connections for every single service instance.

So at one moment, you capture a snapshot. You got 500 CPU metrics from your web tier. You have 10 latency metrics from your database cluster, and five error rate metrics from your authentication service. So you have 515 numbers describing your system state at one particular moment in time. But looking at all 500 of these raw numbers is overwhelming and not helpful.

And many of those 500 CPU metrics are probably telling you the exact same thing. If the cluster’s under heavy load, most of these CPUs will be high. In other words, they’re all highly correlated variables, and you don’t necessarily care about tiny variations between CPU 101 and CPU 102. You care about the overall pattern of the load on that web tier.

So, principal component analysis (PCA) is an algorithm that spots these patterns or themes in your sea of metrics. It would scan all 500 CPU metrics and say the biggest variation is here. The main signal is whether the whole group is generally high or low, and we’ll call that PC1 (principal component one) for the web tier.

It might capture another pattern, like front-end servers are busy but backend servers are idle, as principal component two. PCA creates these new synthetic variables, principal components, which are each a weighted mix of the underlying metrics. The cool thing about principal component analysis is it figures out patterns without needing to know what’s what.

It’s an unsupervised learning method that extracts correlated information from the data. And crucially, these principal components are ordered by how much of the total variation in the original data they explain, and each principal component is uncorrelated with the others.

So back to the climate data: for a given year, you have these 500 tree ring measurements and a few ice cores and coral values. Instead of tossing all 500 noisy correlated tree ring values in the main model, you first extract the principal components.

PCA finds the main shared patterns of tree growth across that network. The first few principal components might capture 80 or 90% of the meaningful variation. The first component could literally represent the overall good growing conditions of the season, while the hundredth might just reflect something like rainfall in one very small area of North America.

PCA allows you to zero in on the big consistent patterns in tree growth, cutting through the noise of the individual trees.

So, PCA doesn’t give us a final temperature map from our tree rings. Instead, it gives us a neat, simplified data set. It gives us just a couple of data points to look at.
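To make that concrete, here’s a toy PCA, my own illustration rather than anything from the leak: numpy’s SVD applied to synthetic correlated series standing in for tree ring records, all driven by one hidden shared signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for proxy data: 200 yearly snapshots of 50 correlated
# "tree ring" series, each driven by one shared signal plus its own noise.
shared = rng.normal(size=200)                # the hidden common signal
loadings = rng.uniform(0.5, 1.5, size=50)    # per-series sensitivity
data = np.outer(shared, loadings) + 0.3 * rng.normal(size=(200, 50))

# PCA via SVD: center the data, decompose, project onto the components.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T                # one row per year, one column per PC
explained = S**2 / np.sum(S**2)         # fraction of variance per component

# The shared signal should dominate the first principal component.
print(f"PC1 explains {explained[0]:.0%} of the variance")
print(f"|corr(PC1, shared signal)| = "
      f"{abs(np.corrcoef(scores[:, 0], shared)[0, 1]):.2f}")
```

Run it and the first component recovers the hidden signal almost exactly, while the remaining 49 components mop up noise, which is the whole point: a few components in place of 50 noisy series.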

And the cool thing is, it’s all here in the data leak. While many climate models mix various metrics together for the most accuracy, our model is just based on tree ring data. If you look around, it’s not too hard to find the PCA file.

It’s in documents/osborn-tree6, in a file whose name starts with pca. It’s another IDL file, but getting that data ready for principal component analysis is no small feat either, because there’s another file in that directory, rd_all_mxd1.pro, that does a lot of the heavy lifting to process this raw data.

It is nice, though, that it’s all here. Now that I kind of am starting to understand IDL and how these climate models work, I can look through the files and kind of see what they’re doing.

Calibration

Adam: So now that we’ve got refined processed proxy features for each year, we can focus on calibration. Calibration depends upon the overlap in time between when we have actual temperature readings and when we have tree core measurements in our data. This overlap period is from 1856 to 1990.

That’s when our tree rings overlap with temperature data. Although that’s not quite true, and you’ll see why as we go. But yeah, that is the period where we both have processed proxy features and reliable thermometer temperatures that overlap; this is our ground truth for our climate model.

We’re building a statistical model to link patterns in our proxy features with those in the known temperature records from this overlap period.

Think of it like training a machine learning model. I mean, in this case it’s actually not a machine learning model; it’s simpler statistics, but the idea is the same. You give it the processed proxy features as inputs and the instrumental temperatures as known outputs. The algorithm figures out the correlations and the weights; the best way to map from those inputs to the output temperatures during that time period.
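Here's a minimal sketch of that calibration step in Python, with invented numbers standing in for the real data: two fake proxy "principal components" as inputs, a fake instrumental temperature series over the 1856–1990 overlap as the known output, and an ordinary least-squares fit finding the weights that map one to the other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical overlap period: years with both proxy PCs and thermometer data.
years = np.arange(1856, 1991)
true_temp = 0.003 * (years - 1856) + 0.2 * np.sin(years / 7.0)

# Two made-up proxy "principal components" that partially track temperature.
pc1 = true_temp + 0.05 * rng.normal(size=years.size)
pc2 = 0.3 * true_temp + 0.1 * rng.normal(size=years.size)

# Calibration = ordinary least squares from proxy PCs to temperature.
X = np.column_stack([np.ones(years.size), pc1, pc2])
weights, *_ = np.linalg.lstsq(X, true_temp, rcond=None)

predicted = X @ weights
r = np.corrcoef(predicted, true_temp)[0, 1]
print(f"calibration correlation: {r:.3f}")
```

The learned weights are the whole point: once you have them, you can apply them to proxy values from any year, including years with no thermometer at all.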

In our data leak, this process is done alongside the principal component analysis. Ian Harris, known throughout this leak as “Harry,” checks the principal components that are extracted against rainfall records, rainfall being the strongest non-temperature signal that we have records for.

This lets him extract the temperature component, the non-rainfall component, which is then used in the graph in the briffa_sep98 file in question.

Now, here’s where it gets interesting. I feel like this is the part that the skeptics missed.

Predicting The Past

Adam: Harry calibrated his statistical model using the overlapping data, and the PCA helped him pull out the signal. So when you feed your trained model only the proxy data from before thermometers existed, from around 1000 AD to the start of our measurement era, the model, using the relationships it learned during calibration, gives its best estimate of the temperature for those years.

And just like that, you have a curve stretching back centuries, showing the estimated ups and downs of the past temperature. You might ask, as I did, and I had to look into this, how can you have tree rings that go back to 1000 AD?

Well, this tree ring dataset is the MXD dataset, and it uses very old living trees, but also dead, preserved trees that can be exactly dated via their correlation to the living trees. It’s more detective work, but basically, high-altitude, very old dead wood can be found and precisely dated.

But yeah, building and running the algorithm is just the start. The next question is, does this work? Is this reconstruction solid, or did we just create a complicated statistical illusion? That’s where the verification step comes in.

The Hold Out

Adam: The verification step uses holdout validation. Remember that overlap period where we have both proxy data and thermometer readings? Instead of using all of that to train the statistical model, you deliberately hold back a chunk of the thermometer data, and then you can test against that to see if your model’s working.

If the reconstruction can successfully predict the temperatures in the period that you held out, it boosts your confidence that the relationships it learned are real. It’s like using a separate validation data set in machine learning; model validation is the key.
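Here's what that holdout check looks like as a Python sketch, again with made-up series rather than the CRU's data: fit the calibration only on years before a hypothetical 1950 cutoff, then score the predictions against the thermometer readings that were held back.

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1856, 1991)
temp = 0.004 * (years - 1856) + 0.15 * rng.normal(size=years.size)
proxy = temp + 0.1 * rng.normal(size=years.size)

# Hold out everything from 1950 on for verification.
train = years < 1950
test = ~train

# Fit a simple linear calibration on the training years only.
coeffs = np.polyfit(proxy[train], temp[train], 1)
pred = np.polyval(coeffs, proxy[test])

rmse = np.sqrt(np.mean((pred - temp[test]) ** 2))
print(f"holdout RMSE: {rmse:.3f} degrees")
```

A small error on the held-out years is the evidence that the fitted relationship generalizes, rather than just memorizing the training period.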

And we have a lot of files in this data leak for this, like calibrate_band_temp.pro and its variants, and so on and so forth. Many files in this leak are aimed at validating the data.

And it’s actually in this validation step that we find the answer to hide the decline,

the controversial phrase that led to the reporting that climate scientists were hiding the truth.

Before The Average

Adam: But before we dive into those emails and what hiding the decline is, there’s another layer to consider, because the past climate data isn’t just about pinning down a single global temperature. It is a complicated web.

The earth’s climate isn’t a simple thermostat that slowly goes up or down. It’s a chaotic system that’s fluctuating on multiple timescales, that are all layered on top of each other. You have events like El Nino and La Nina that pop up every few years, and they warm or cool big parts of the Pacific and shake up weather patterns around the world.

You have big volcanic eruptions that send aerosols into the atmosphere, and these particles reflect sunlight, and they cause global cooling for a year or two, and that’s just two of the timescales at play. There’s many more, and the big challenge for climate scientists is pulling apart all these overlapping signals.

It’s much more complicated than just a global yearly average temperature.

Decoding the code

Adam: But, okay. Alright. We’ve circled back; hopefully, you made it through all my background, with all that proxy data, with all those proxies, and with all that data complexity in mind. Let’s tackle these infamous phrases. Let’s break them down. First, let’s break down Mike’s Nature Trick. This sparked huge controversy, right? Was Mike Mann publishing something incorrect? Was he hiding things? Then we’ll cover Hide the Decline, the so-called smoking gun that caused ABC, CBC, the New York Times, and the Washington Post all to accuse climate scientists of misleading the public.

But yeah, first Mike’s Nature Trick. Mike Mann is the man behind the iconic climate change graph. He’s the one behind the original hockey stick graph, the one from Al Gore’s Inconvenient Truth.

And while Mike’s Nature Trick sounds like something from a spy novel, it’s not about secret manipulation; it’s about taking all this complex data and turning it into a simple graph. Mike had these reconstructions from the proxy data, right? The proxies and what they implied. And he also had real temperature data, thermometer readings, the straightforward stuff, where no crazy stats are needed.

You just check the thermometer. His trick was to put both types of data on one graph.

Mike used two separate lines: one for real measured temperatures from 1862 to today, and another for proxy temperatures reaching far back in time, to which he also added error bars. The proxy data is complex, but it’s the real temperature data, shooting up as the hockey stick’s blade, that gives the graph its punch.

The thing is that blade was never in doubt. It’s just the yearly average temperature. Any weather station could tell you that. Now, the folks at the CRU made a somewhat intentional misleading choice. Instead of using two separate lines, they combined them into one line: the instrumental and the projections.

Now, climatologists would understand that when the line hits modern times and the error bars go to zero, it’s showing real data and not a projection. But not everybody would understand that.

So that is a little bit misleading, but there are no lies involved.
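The splice itself is almost trivially simple. Here's a hedged Python sketch of the graphical choice being described, with fabricated series and dates standing in for the real ones: a proxy reconstruction that stops partway, and instrumental readings appended so the plot continues to the present as one unbroken curve.

```python
import numpy as np

# Made-up series: a proxy reconstruction ending in 1960,
# and an instrumental record running to 1998.
recon_years = np.arange(1000, 1961)
recon = 0.1 * np.sin(recon_years / 50.0)

inst_years = np.arange(1856, 1999)
inst = 0.01 * (inst_years - 1856)

# The "trick": append instrumental values after the reconstruction ends,
# so the plotted line reads as a single continuous curve.
combined_years = np.concatenate([recon_years, inst_years[inst_years > 1960]])
combined = np.concatenate([recon, inst[inst_years > 1960]])

print(combined_years[0], combined_years[-1])  # 1000 1998
```

Plotted as one line, a casual reader can't tell where reconstruction ends and measurement begins, which is exactly the "somewhat intentional misleading choice" described above; two clearly labeled lines would convey the same data honestly.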

But the real kicker, the real thing that upset people, was emails that said hide the decline. You know, you would have a cold winter or a snowstorm, and politicians would show up trying to cast a suspicious light on global warming with snowballs. Where’s global warming now? So when somebody said, hide the decline, they’re like, yes, I get it.

They were hiding the fact that it’s actually getting cold.

But as I said, it’s easy to verify that the world wasn’t getting colder. The world was, in fact, warming. The year 1999, from which this data came, was the hottest year on record.

Hide The Decline

Adam: So here’s the deal. Hiding the decline wasn’t about covering up a drop in global temperatures. It was about a decision to leave out unreliable post-1960s data.

You see, for centuries, tree ring data matched up well with temperature; warmer conditions meant denser wood formed late in the growing season, but around 1960, this relationship broke down.

This is known as the divergence problem, but it does seem like a real issue. We have this temperature data. This tree ring data is being used as a proxy to project backwards and tell the temperature a thousand years back.

But yet it doesn’t even work in known periods like from 1960 to present. How solid is our past reconstruction if these proxies seem flawed?

And, uh, here’s the thing: I actually found an answer for that. Me, just somebody who downloaded this data leak, started poking through, and read a book or two to fill in some information; I figured it out. It’s pretty exciting for me, and it involved reading lots of this IDL code.

Adam: But first, before I share what I find, I wanna say, you know, that questioning this data, looking carefully at this code, even if I assume that climate change is a given, is still a good thing. It’s not anti-science to check their work; critical examination is vital. That impulse that I feel to look closer, it’s a vital thing. Even when it’s uncomfortable. No field is immune to bad intentions. Sometimes even foundational work warrants a second look. Somebody needs to check it, and a big reminder of this is a major ongoing investigation in a completely different field: Alzheimer’s research.

So, before I tell you what I found in the data, lemme tell you about Alzheimer’s research. The dominant theory for decades was the amyloid hypothesis: the idea that sticky amyloid-beta plaques in the brain were what caused the disease. In 2006, Sylvain Lesné and his team published a paper in Nature that seemed to back the amyloid hypothesis. They identified a protein, Aβ*56, and suggested that it caused memory issues in mice, and this paper became a cornerstone. It was cited thousands of times, and it ended up directing billions of dollars in research funding and drug development towards targeting these amyloid plaques.

However, over the years, things didn’t quite add up. Top Alzheimer’s labs tried to replicate his findings, but often they couldn’t do it consistently, and that was a big warning sign. Yet, some labs managed to replicate the results, which led to more research and drug development based on those findings.

Then enter Matthew Schrag. He wasn’t digging through emails or private messages. He wasn’t trying to read IDL files like I was. He was focused on the science. He was scrutinizing published papers in Alzheimer’s research, and he spotted anomalies, especially in the images included in the papers. It started with some offshoot papers, but the more he dug, the more it led back to Lesné’s 2006 Nature paper. Basically, he was able to tell that the images had been Photoshopped. Somebody had used a cloning tool, and you could see mismatched backgrounds or lines that appeared too clean.

And this wasn’t just online talk posted on his blog; no. He was a serious investigator, and his work led to a major investigation that was published in Science magazine in 2022. It wasn’t just misunderstood jargon or internal debates. In this case, it was actually the integrity of visual evidence in peer-reviewed studies. It had a huge fallout, and the fallout is actually still ongoing. Lesné’s university launched an investigation. Nature issued a cautionary editor’s note on the original paper. All these things feel pretty mild.

But what’s now known is that these results don’t hold up. This was fraud. The process of retraction is messy and slow because no one wants to admit they’ve been chasing a lie. There’s huge damage done to the field, but there’s also a chance for science to self-correct. Scientists are human, right? And some will cheat. And Schrag’s investigation shows the danger of a real error cascade. That 2006 paper wasn’t just a study, right? It was a foundation. Thousands of studies were built on it. Billions of dollars in funding followed; patients took drugs based on faulty research, drugs that were costly, drugs that had side effects and that even led to deaths, drugs that ultimately failed to cure or help with Alzheimer’s. An entire field poured resources down a path that led nowhere, all because of some fraud.

I mention this because this investigation reminds us that skepticism is vital. Questioning findings, even influential ones, is crucial. The impulse to dig deeper is sound. That’s why I think I need to apply this same skeptical spirit to Climategate and to briffa_sep98_e.pro.

But yeah, I think we can now understand what’s happening in that file. The startling comment that caused such a stir, the one about applying a very artificial correction for the decline, followed by the fudge factor array: we can now explain what those are.

You know, skeptics like Eric Raymond said that this was a siege cannon, and it seemed super damaging. It looked like clear evidence of data manipulation to force that hockey stick shape. But now we know the decline is not about global temperatures dropping. It’s about certain proxies like the tree ring data no longer being reliable indicators.

Here’s how I know; here’s what I found. Remember those calibration files I mentioned, like calibrate_band_temp.pro? They’re really crucial when you run the whole process. PCA correlate and then validate. On this tree ring data, the predictions that come out are pretty noisy. There’s something in the data, especially from the overlap period, that’s causing noise and making the predictions inaccurate.

So Harry, or the team, or whoever, dug into the data, and the issue became clear: the post-1960 tree data. For centuries, these rings matched up with the temperature readings; warmer summers meant denser rings. But after 1960, that link broke. The thermometers showed warming, but the rings suggested cooling.

Something changed. Something changed with how trees were growing on Earth. Maybe the extra CO2 from global warming, maybe the trees just don’t grow the same forever, maybe pollution, maybe chemicals; we don’t know. But the trees weren’t matching predictions, but they found a way to overcome this.

They would skip the post-1960s data for principal component analysis. By focusing on the data before 1960, they could better extract the signal. If they removed that 1960s data, they could better estimate the temperatures going backwards.

So that gave them a better ability to project backwards, but it led to a problem right when they fed that data forward to the post-1960s. The model predicted lower temperatures. So if the global temperature was 14 degrees in 1972, the model would say 12.

They found a way to build a model that predicts past temperatures well but shows a decline just as the world heats up. That is the divergence, right? That’s the failure of this specific proxy data post-1960. That’s the decline that they are hiding. The reason it diverges is the way they built the model, to ignore whatever changed post-1960, and it’s actually all in the leak. If you look through the calibration attempts, you can find it.

In performing these calibrations, they used the data from 1911 to 1960 to build a model and then verified it backwards using data from 1856 to 1910; that worked better than using 1911 to the present day. This wasn’t a secret. They, in fact, published a paper on the divergence problem in Nature back in 1998.
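You can sketch that whole situation in a few lines of Python, with entirely fabricated data standing in for the real MXD series: a steadily warming "world," a proxy that tracks it until 1960 and then drifts downward, a calibration on 1911–1960 that backcasts well onto 1856–1910, and a predicted decline when you run it forward past 1960.

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1856, 1995)
temp = 0.005 * (years - 1856)  # a steadily warming world (made up)

# Proxy tracks temperature until 1960, then diverges downward.
proxy = temp + 0.05 * rng.normal(size=years.size)
proxy[years >= 1960] -= 0.01 * (years[years >= 1960] - 1960)

calib = (years >= 1911) & (years <= 1960)
verify = (years >= 1856) & (years <= 1910)
post = years > 1960

# Calibrate on 1911-1960, verify backwards on 1856-1910.
coeffs = np.polyfit(proxy[calib], temp[calib], 1)
backcast_err = np.mean(np.abs(np.polyval(coeffs, proxy[verify]) - temp[verify]))
post_bias = np.mean(np.polyval(coeffs, proxy[post]) - temp[post])

print(f"pre-1911 backcast error: {backcast_err:.3f}")
print(f"post-1960 bias: {post_bias:.3f}")  # negative: a spurious decline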

It was a known issue, but it’s fascinating to me that you can dive into the code and see how they derived this. It doesn’t clear up everything, right? As I said, when a key proxy method goes wonky, just as we have better tools to check it, it does raise real questions about how reliable the method is.

But the puzzle here is about the limits of this specific proxy. It’s not about a lie.

Adam: And then going further, if we look at our file, briffa_sep98_e.pro, the file name is telling; the underscore e is actually some old-school version control. There are versions _a through _d as well. And these are all found in a personal folder named harris-tree, for Ian “Harry” Harris, the programmer.

And that fudge factor, those hard-coded numbers that look like a hockey stick graph in the context of the divergence problem, it’s pretty clear.

This is actually Harry manually mixing in the instrumental data, the real world temperature data. As I said, ideally you’d show these as two separate lines. But Harry was just trying to manually hack in the instrumental data into his graph.

But here’s the real kicker, this wasn’t the code that was used for the paper in the linked files. There’s a whole different directory where the actual published data is. There’s briffa_sep98_decline1.pro, and briffa_sep98_decline2.pro. These files are quite similar, but they tackle the divergence problem differently. They don’t have a hardcoded fudge factor. They don’t mention an artificial decline. Instead, they read the actual instrumental data from files.

There’s no fudge factor. There’s just reading in the temperature and adding it to the graph. The actual methods used later just use temperature data from a real public source. So the core accusation that scientists were literally inventing numbers to fake warming doesn’t hold up when you actually look at what the files were.

The Bigger Picture

Adam: It’s also crucial to just zoom out, and remember what this data set is. This is the CRU high latitude tree ring density data, and this is the stuff with the divergence problem. It’s just one single thread in a vast tapestry of climate science. The overall conclusion that the Earth is warming, and that humans are the primary cause, doesn’t rest on this file or, in fact, on this leak.

It comes from the convergence of many independent lines of evidence gathered and analyzed by all kinds of scientists worldwide. In fact, the graph that Al Gore used was based on ice core samples, not this data at all. So there’s no error cascade here. The CRU data matters, especially for reconstructing detailed temperature maps of the Northern Hemisphere, land temperatures over the last millennium, but that’s just a part of the story.

The attackers who leaked these files, and the bloggers who spread the story, weren’t actually doing a thorough review of the CRU’s work. Perhaps that’s not surprising. Likely they just ran keyword searches for terms like trick or hide or artificial. And in this massive dump of emails and files, they found some juicy snippets in one file that was never used for a published paper, and they took them out of context and claimed that they found a lie and that they found a conspiracy.

Here’s where the Climategate story stands out as being quite different, right? Matthew Schrag wasn’t sifting through stolen emails for dirt. He was carefully examining published scientific evidence, paper by paper. He was questioning its integrity through painstaking visual analysis. This was skepticism aimed at the science itself, leading to potential corrections for the field. In fact, he did it because he wanted to get the field back on track.

Climategate was driven by a specific code file. It used out of context chatter. It used experimental code to target scientists and to sow doubt rather than engaging with the full body of the published work. In fact, it was timed for this all to happen right before the Copenhagen Climate Conference. So there’s some pretty strong hints that there was a political agenda here: find a lie, and then you can say that they’re lying about everything.

Adam: But here’s the cool part. Maybe the real story of the Climategate files isn’t about conspiracy or fraud at all. Maybe it’s about something far more mundane, yet I think profoundly important: the unglamorous, often frustrating reality of being a programmer trying to make sense of messy scientific data. Because Ian “Harry” Harris, the CRU programmer whose name is on that harris-tree folder in the leak, left another file: a massive text document, 15,000 lines long, called HARRY_READ_ME.txt. It’s basically Harry’s personal log, stretching over years, documenting his day-to-day struggles to maintain, update, and debug these climate data sets and to work on the code that’s used to process them.

And reading it is like, well, if you’ve ever worked on a legacy code base or if you’ve ever tried to integrate data from dozens of different inconsistent sources, I think you can feel a deep sense of empathy for Harry. Harry wasn’t writing about grand conspiracy; he was writing about the grind of data wrangling and the challenges of software archeology. He writes about an Australian dataset being a complete mess, that so many stations have been introduced, and he can’t keep track of it. He complains a lot about Tim and Tim’s code, and I assume that Tim is somebody who came before him and didn’t sufficiently document what he did.

Sometimes he just writes, “oh, fuck this,” all in caps, as in, “oh, fuck this, it’s Sunday night, and I’ve worked all weekend, and just when I thought it was done, I’m hitting yet another problem.” He writes about the hopeless state of their databases, about there being no data integrity; a catalog of issues that just keeps growing. Reading Harry’s log, you don’t see a cunning manipulator working to hide inconvenient truths. You see an overworked programmer, likely under-resourced, grappling with complex, messy, real-world data and imperfect legacy code.

And he leaves all these exasperated comments, and they don’t sound like admissions of fraud, just like the slightly cynical remarks of someone deep in the trenches of doing the difficult work of climate change. Maybe the real story of Climategate is not a scientific scandal, but a human one: a story about the immense and often invisible technical labor required to turn noisy observations into scientific understanding, and the pressures faced by those tasked with doing it, often without recognition or even the resources they need.

And then, after all that, they get attacked, and their private work files become the hot topic on ABC News.

Clearing The Air

Adam: So where does all this leave us, after all the sound and the fury and the investigations and the accusations? I mean, what did Climategate really reveal? At first, the media jumped on the idea that this was a smoking gun. Nobody wanted to deal with global warming. I mean, nobody still wants to deal with it.

Al Gore called it an Inconvenient Truth. So there was hope. There was hope that it was all a mistake or a fraud, and people ran with that. Newspapers churned out stories of deception for weeks after the leak, and the investigations came much slower. But there were eight official inquiries. Yes, eight. And all came to the same conclusion.

No fraud, no scientific misconduct; climate science’s core findings stood firm. The hockey stick graph can be debated for some of its statistical details, and you can debate the limits of some of these proxies, but it’s backed by many other studies that use different methods and different data.

The trick wasn’t a deception; it was just a graphical choice. Hiding the decline wasn’t hiding a global cooling trend. It was about dealing with a known issue.

Climategate wasn’t proof that climate change was a hoax. It was more like a case study in how internal scientific discussions and informal language and experimental messy code can be twisted when leaked into a charged climate where people are looking to create doubt.

Transparency

Adam: If I were to take a lesson from the Climategate saga, it would be about the necessity of transparency in science, especially things like climate science. What if, from the start, all the raw data and code and statistical methods were out there? What if they were publicly accessible to begin with? I imagine them on GitHub, ready for anyone to run and critique.

And actually, as a result of all this, CRU now has the instrumental data available under an open government license.

And while Eric Raymond’s initial take on the code file is what caused the big stir, he was right about one thing: he demanded that they open source the data, and that’s a principle I can agree with him on. Climate science, with its global stakes and complexities, should embrace open source and open access as much as possible.

Science isn’t always neat. It’s a human process full of debate and messy data and evolving methods. But like software development, it gets stronger and more robust and more trustworthy when the process is open, when the data is shared, when the code is available for review. That’s my takeaway from the whole affair.

It’s not about a conspiracy reveal, but a powerful argument for doing science in the open. We live in a world in which science is more than ever under attack and underfunded and being questioned and being politicized.

I think the best defense against that is to be open.

Outro

Adam: That was the show. How many people made it this far? I don’t know. Honestly, I started by diving into this Climategate code and it got more interesting as I went along, but I’m still pretty unsure about how interesting it is for others.

There’s like a lot of interesting tangents I went on that I had to cut as well, but I came away with one big idea. Climate science is kind of interesting and it’s a little bit like data science, except in climate science you’re dealing with messier data and you often have to gather it and label it yourself, but you get to work with a community that’s striving for shared knowledge.

Climategate makes it sound like it’s all about global warming models and politics, but really it’s more about diving deep into specific issues like how the layers of sediment and this certain dataset can affect the feedback cycle in the Atlantic ocean temperatures.

Harry’s exasperated, cynical grievances notwithstanding, it actually sounds pretty interesting.

But yeah, let me know what you think of this episode. And until next time, thank you so much for listening.

Appendix

Here are some books that helped me fill in the back story:

The leaked files are easy to find by googling for “ClimateGate FOIA.zip,” and WikiLeaks also hosts a copy:

Support CoRecursive

Hello,
I make CoRecursive because I love it when someone shares the details behind some project, some bug, or some incident with me.

No other podcast was telling stories quite like I wanted to hear.

Right now this is all done by just me and I love doing it, but it's also exhausting.

Recommending the show to others and contributing to this patreon are the biggest things you can do to help out.

Whatever you can do to help, I truly appreciate it!

Thanks! Adam Gordon Bell
