“The Big Deal about Big Data” with Dr. Gary King


[MUSIC PLAYING] Welcome. My name is Ziyad Marar. I’m the executive vice
president of SAGE, and I’m responsible for our
global publishing strategy and really delighted to be here
and to introduce Gary King. The event is in
the context of us wanting to look hard at research
methods and I’ll come to that in a second. But it’s great to see people
representing policymaking and the learned
societies and media and SAGE colleagues
in one place. I hope you will actively
engage with Gary when he’s up and running. The format will be I’ll do
just a few minutes of intro, and then Gary will talk
for about 30 minutes and then take questions
and answers, after which we will go for drinks. But he reassures
me that if you want to interrupt him
along the way, you’re more than welcome to do that. But also, I’d like to
say a quick thank you to the American
Political Science Association and the
American Statistical Association, both of whom have
helped make this event happen. So thank you to
everyone involved there. So it’s often said that data
is to the 21st century what oil was to the 20th century. And that extraordinary fact
has almost become mundane in our lives today. I think none of us
are startled anymore to hear these gigantic
numbers being bandied around, 1.5 billion people on
Facebook or, as I looked up this morning, 40 thousand
Google searches per second, where they become sort of the
hum-drum backdrop of our lives. Some people are positive about
them, others not so much, the negative being
inclined to say, the business model
of the internet is surveillance versus on
the other hand people saying, never has there been
before such an opportunity to improve human
flourishing with all that’s available through big data. And there’s no surprise that
organizations of various kinds have responded to
this with alacrity. These organizations,
many of them are commercial
organizations, are looking to get close
to their customers and to send messages out
as timely as possible. And that’s big and
understandable stuff. And then
in the research community, the
natural scientists and the physical
scientists have similarly been responding incredibly well. But I think with the
social scientists, we get a bit of a mixed story. And that’s while there’s
huge amounts of uptake, there’s also a fair amount
of caution expressed and doubt expressed as well. And I think one of the reasons
we wanted this event was really to help push the story of
data-intensive social science a step along. And so the expressions
of concern range from the nature of social
science, problem domains, and issues around
methodology and ethics. A researcher studying
social inequality will worry about the people who
are left off the digital grid more than a market researcher might. And people in
social sciences will tend to worry more about
privacy and the idea of informed consent, which isn’t really
an issue for high energy physicists. So we’ve got particular
dynamics that apply to the social sciences. Nevertheless, I think there’s a huge
opportunity for social science to reinvent itself in
the light of big data. And this is particularly
important for us at SAGE because our history has been as
innovators in research methods since our founding. We were founded 51 years ago
by Sara Miller McCune, who thought of research methods as
a really profound connective tissue across the
research communities, across geographies, across
fields, across levels from students to professors. And we’ve innovated, whether
it’s through the Little Green Books or the rise of qualitative
research or mixed methods or evaluation throughout
history and even today are still innovating by launching our
Big Data & Society open access journal and more
ways to talk about. What next? Well, I think data-intensive
social science is what’s next. And we feel that we can play a
role in helping bring together these communities. And so to our speaker, who
is the very acme, I think, of what’s– the emblem, I
think, of data-intensive social science in my head. And not only is
Professor Gary King an incredibly eminent
academic on many fronts, but he is such a pioneer
in this particular area. I know he needs no
introduction, but as you have seen in the
documentation, he is the Albert
J. Weatherhead University Professor at Harvard University. And that university
professor role is held by 24 professors
in the university. It’s an incredibly elite
and impressive role that he occupies. But it’s not just
about the contribution he makes as an
individual academic. It’s the fact that
he’s also the director of the Institute for
Quantitative Social Science, which has
led the way, I think, for many researchers to be
able to take on this opportunity. And more than that,
I think Gary has been very good at
explaining to institutions how they could reconfigure
themselves to take advantage of this opportunity as well. And so I think it
requires massive amounts of collaboration to
do it really well. And I think [INAUDIBLE]
is one of the best things we’ve got to turn to. So we’re short of
time, so I’m going to just say it’s an
incredibly important moment in social science. SAGE is very committed
to supporting that. And we couldn’t think of
a better keynote speaker to have than
Professor Gary King. Please welcome
him with me today. Thank you. [APPLAUSE] Thanks. That’s good enough. I only wanted the applause, so
we can just end now, I think. And thanks for the introduction. I really appreciate it. If anybody has any questions,
like after I start talking, you should feel free to
either yell them out or scream obscenities or whatever
it is you choose. That’s totally fine. So that’s me. These are SAGE’s slides. Thanks to the ASA
and APSA and SAGE. And is there another one? Why is there no one other? No, no, thank you all for
coming and for sponsoring this. I really appreciate it. And our members of Congress. So I’m going to talk about
the big deal about big data. And I already gave the
answer, which is it’s not about the data, OK? So that’s not the
innovation in big data. So what’s the innovation? Well, let me explain. So data is easy to come by. It’s a free byproduct
of IT improvements in like every organization. If you buy it, it’s
becoming commoditized, so it’s becoming cheaper
and cheaper and cheaper. If you ignore it, wherever you
look at the end of the year, you’re going to have
more data than you did at the beginning of
the year, if you ignore it. If you pay any
attention at all, you’re going to have tons more data
without putting in much effort at all, OK? However, what are you going
to do with all that data? It’s not very helpful
by itself, right? Because you have to manage it. It’s valuable, so you
have to keep it, right? So basically it’s an expense. It doesn’t actually do
anything good for you. Hold on. There’s more to the story, OK? [LAUGHTER] The value is the analytics. The revolution is the analytics. The revolution
that we didn’t know how to do– the
thing that we did not know how to do before but
we know how to do now, what we’re learning
how to do now, is how to make the
data actionable. That’s really the secret. I love the phrase “big data,”
not because it describes anything accurately, but because
the media has figured out a way to describe what we
do and what SAGE publishes and what many of you do in
a way that my mom now thinks she understands what I do. [LAUGHTER] OK? And really, the public
genuinely gets this now. Well, they don’t completely get
it, but they get it much more than they did previously. And basically we failed to
communicate this to them, but somebody in the
media or the media collectively has
figured that out. And that’s really valuable. But the value is not data. It’s not the big. It’s the analytics. You can customize the output
for exactly what you want. I like to contrast it
with Moore’s law. So you know what Moore’s law is? It’s a prediction, empirically
accurate prediction, it turns out, that the
speed and power of computers will double every
18 months, which has produced a lot of value. That’s a really good thing, OK? However, Moore’s
law is like nothing compared to like one
student data scientist who works on algorithms,
who in an afternoon can increase the speed of the
computer plus the algorithm by a factor of 1,000. That’s like not that big a deal. That happens quite regularly. Moore’s law never got
close to that, OK? No objection to Moore’s law, OK? We need that, all right? But don’t miss where
the real value is here. I’ll give one more example. And it won’t be curing
the common cold. We’ll work on that. A colleague had lots
of data coming in and ran a particular
analysis every year. And the data got bigger, right? There was more data this year. There’s more data the next year. There’s more data the next year. And all of a sudden,
there was more data than could fit on his computer. And so he called down
to the IT shop and said, spec me out a new
computer, right? I need a bigger computer
to be able to do this. And so my guys at the Institute
for Quantitative Social Science, which I run– thank
you for mentioning that. My guys and one woman–
because if you ever find a female sysop, you hire her. That’s my instructions to them. And we have one, OK? And maybe she is the
only one, but she’s one. In any event, my guys specced
out the cost of this computer. And the cost of the
computer was $2 million. Now, that’s a
beefy computer, OK? It is possible to raise money
to buy a $2 million computer. It’s a real thing, the $2
million computer. We don’t really want to
buy a $2 million computer, but that’s what it
would cost to do. And so we intercepted this,
a graduate student and I, and we worked on it
for almost two hours. And now this professor
runs this algorithm on his laptop in 20 minutes. So this is actually– the
most amazing thing about this story is that
it’s not even amazing. It happens like all
the time, right? The innovation is the analytics. OK, it’s low-cost. It’s low-infrastructure. All you need to do is
hire my students, OK? [LAUGHTER] It’s mostly human capital. That’s really what it is. It’s important to
understand also that if you go from
no analytics to just ordinary, off-the-shelf
analytics, you have a huge improvement. If you go, however, it’s
important to understand, from ordinary,
off-the-shelf analytics to innovative analytics
tuned to your problem, it’s orders of magnitude better. So it’s worth remembering that. And what’s ordinary is
drastically changing over time. Because that’s really
the revolution. We’ve learned more about
causal inference, the effect of something on something
else, and prediction, what’s going to happen in the
future, more in the last 20 or 30 years than at any
point in human history. It’s quite amazing, OK? And the cool thing is we all
get to be a part of this. It’s happening in the
social sciences, OK? So let me give you some context. There’s exciting data,
but without the analytics, it’s quite useless. So an example is exercise. The way they used
to measure exercise is they would do
a survey, and they would say, how many times in
the last week did you exercise? How many times did
you get on a treadmill? You know the answer
to that question, but that’s not
necessarily related to how much energy
you output, right? So how could you measure it? Well, you all have cell
phones in your pockets, I bet. We could put a little piece of
software, if you don’t mind. There’s also an accelerometer
in your cell phone, which says how much you move. And the little piece
of software would just send us the information,
to be used only for scientific
purposes, I promise, OK? Now we have, let’s
say, 500,000 people with this little piece of
software on their cell phone. And we have continuous time
information on how much they’re moving. Think of how much
more valuable that is than how many times in the
last week did you exercise. It’s hugely more valuable. However, what are
you going to do with this trace of information
about how much you exercise second by second for 500,000
people over the course of, I don’t know, a month. What are you going to do
with that information, right? Moreover, how are you going
to distinguish the couch potato sitting sleeping
on the train like this compared to the person on
a stationary bike doing an all-out sprint who’s
not moving according to the accelerometer with her
cell phone next door, right? How do you actually
distinguish that? You need the analytics in
order to interpret the data appropriately. It’s not just the data. Another example is in order
for democracy to work, you need activists. You need people pursuing office. If no one pursues office,
Democrats, Republicans, whatever, then
there really isn’t any point in having democracy. So political scientists
track activists. It’s important to all of us. So the way they used to do it
was a few thousand interviews. In fact, there’s a famous
book by some colleagues of mine, where they took a
random sample of the United States public of 15,000 people. And they asked them screener
questions, weeded it down to maybe the 2,000 that are
actually activists and then asked a long series of questions
and wrote a big book on the one snapshot of 2,000 people. And I realized that today, there
are 650 million social media posts that are available
publicly for us to analyze– 650 million. And people write about
everything, including basically everything
about politics, things that would
drive you crazy and things that you
would like and everything in between and everything else. That’s enormously
more information than the information
we used to get, OK? Only, what are
you supposed to do with 650 million social
media posts, right? Quite a lot of them are
about cat videos, right? Like what are you
supposed to do? How do you analyze that data? Once you have the
analytics, then it turns out we can turn that
into meaningful information. Or take social contacts. The way they would do this in
many, many high quality surveys is, please give me a list
of your five best friends. And we’d ask just
for the first names so that we can then ask you
more questions about them. Alternatively, if you
give us permission, and we will protect
it, we can have a continuous record of
phone calls and e-mails and text messages and Bluetooths
and social media connections and you name it, right? So the amount of information
could be enormous. But what are you supposed
to do with that information? It requires detailed and
sophisticated analytic tools to be able to extract
information from it. Or economic development
in developing countries– one option is you
get the information from the government. I once traced the source
of all of the information about AIDS in Africa. Like where does that
come from, right? Because they don’t really have
the institutional structure to be able to estimate this. There’s villages off far from–
like where this is actually coming from? It turns out that where
it was coming from was a guy in the World Health
Organization named Alan. He actually– [LAUGHTER] Really. And what Alan would do
is he’d get the data from the governments,
who would give them numbers when he would ask. And then he’d look at them and
he’d say, that’s not right. And he’d cross it off, and
he’d write the other number. And then that became the official
record of AIDS in Africa, OK? Alan, by the way,
was really good. You wanted his numbers
rather than the government’s. [LAUGHTER] But still, that’s not
an empirical method, OK? Or not the empirical
method you want. So now if you want to
measure, let’s say, GDP, we can get satellite images of
human-generated light at night. You know those
photographs, right? Or road networks
or infrastructure or things like that. An undergraduate of mine studied
Chinese investment in Africa and used differences
in satellite imagery for where the investment was and
whether the investment actually had any effect
over a 10-year period. You could actually see it,
rather than just talking to somebody and
asking them what they remember was here 10 years ago. There’s many, many more things. But in each, without
the new analytics, the data are useless, OK? Let me just put this in
context of the social sciences. In 1995, Science magazine
asked 60 scientists about the future of their field. They said, what’s going to
happen over the next 25 years? And they asked for these
half-column descriptions from 60 scientists. Every one of the natural
and physical scientists said, we’re going to make
the most amazing discoveries and inventions, and we’re going
to cure these amazing diseases. And every single one
of the smaller number of social scientists that
answered the question said, well, we used to
be studying this, and it’s really exciting
because now we’re going to be studying this. And that pissed
me off. [LAUGHING] Right? It is important to be
studying this and that. But it’s also important to
be actually solving problems. And the change in
the social sciences is we’re changing
from studying them to actually solving problems. So I’m going to try to give you
some of those examples today. So here’s my first. How do you like this sentence? How to Read a Trillion Social
Media Posts and Classify Deaths Without Physicians. Now, I use this sentence
mostly because it’s never been uttered before in the
history of humanity. [LAUGHING] But also because I’m
going to show you the same underlying
methods enable you to solve completely
different problems. So that’s the power of
social science methodology, that you can solve problems
that you couldn’t have solved otherwise, and you can also
solve problems that you never even realized existed, OK? So here’s some
examples, basically examples of bad analytics. First one is verbal autopsies. What the heck is that, OK? If somebody dies in
the United States, there’s a death certificate. And someone sees the body,
a medical personnel, doctor sees the body, signs the
death certificate that says the person died from lung
cancer, whatever it was, OK? In most of the
developing world, that is, in most countries in the
world, when someone dies, they basically go
off to the bush and are never heard from again. No one sees the body. Autopsies are culturally
prohibited, right? You basically don’t know, OK? Moreover, these are
the places where we have to really figure out
what people are dying from because that’s where diseases
emerge that might affect us, OK? And actually, it’s going
to have disastrous effects in these places without
hospital infrastructure and things like that, OK? So how do you measure
the prevalence of different
diseases in countries without this kind of data? So what they do is
verbal autopsies. Well, what’s a verbal autopsy? You go to a household
where someone’s died, and you find the next
of kin or the caretaker, and you ask a series of sort
of uncomfortable questions. Did the person have stomach
pain before they died? Did they have bleeding
from their eyes? Did they have tire
tracks across their back? [LAUGHTER] That’s a joke. [LAUGHING] Right? You ask them questions
to try to figure out what the cause of death was. And then you give the answers,
maybe 50 answers to the 50 questions to a physician. And the physician says,
ah, that was tuberculosis. And when some smart-aleck
scientist came along and said– social scientist, by the
way– came along and said, let’s give that to
another physician. You give it to another physician
and they say, ah, malaria. And it turned out that when
you did this systematically with lots of examples,
the physicians were useless in this context. Physicians to be useful
needed to see the body and needed to do tests. Verbal autopsies, just the
physicians were useless. So what do you do? Well, let’s think
about that, OK? Put that over here, OK? And think about what
they’re trying to do. What they’re trying to do
is classify individuals into categories of death, OK? Now hold that thought, OK? Another thing people do
is sentiment analysis by word counts. So you see this a lot, OK? In the media, you
see people investing on the basis of Twitter
and things like that. And what they’re
doing is they’re counting the number of
tweets with certain words. Let me give you an example
of the kinds of things that can happen when you do this. There is a group that was trying
to predict US unemployment rates. And they thought, well, let’s
do this with social media. We can do this better
than the government. Let’s count the number
of times people say “classified” or “jobs”
or “unemployment” or all these things, right? And we’ll count a number of
tweets that have these words. And they did. And they plotted them
over time, and they correlated with official
government US unemployment rates. And they preceded them. So they were able to
predict US unemployment rates with the word count. What happens, though, is you
can do a little of this well, but then it fails
catastrophically, OK? As a side point, it’s
quite unlike heart surgery. You don’t do that
like a little bit. You either do heart surgery
or you don’t do heart surgery. You don’t feel unwell at
dinner and ask someone to pass a steak knife
to fix something. OK, let me go back to the story. OK, so this is something
people do a little, and actually, they were
predicting unemployment rates to some degree from this. And then they
noticed one day there was a spike in the number
of these words used. And they thought, let’s invest. And they start
investing and investing. And what happened? What they didn’t notice was that
was the day Steve Jobs died. And we didn’t notice
that Steve’s last name– [LAUGHTER] –was one of the words
that they were counting, OK? So catastrophic failure. Both of these fail. Now, the interesting
thing is they’re completely unrelated
substantive problems, but they have the same solution. And the reason why is
the key to both methods, the reason both
methods were failing, is they were classifying. As it happens in public
health, people in public health are not your doctor. People in public health
don’t care about anyone. They only care about everyone, OK? So that’s interesting. They don’t care what you die of. They only care about the
proportion of people dying from a particular disease. And when we study
Twitter, no one cares what
stapumpkin222 says, OK? It doesn’t matter what
anybody has to say. It only matters what
everybody has to say, OK? If you think about that, the
individual classification decisions are not the
quantity of interest. What’s the quantity of interest? What do we actually care about? What we actually care about is
the percent in the category. We care about the percent
of people dying from malaria in the United States. That’s zero– zero. That’s an effective
number, zero. We need to devote no
resources to malaria. OK, good. OK? In other parts of the
world, it’s not zero, so we have to devote
more resources. We care about the percentage. We don’t care who
died from malaria. Well, you and I might, but
as public health professionals, we wouldn’t. Similarly, for estimating
sentiment or US unemployment rates from word
counts, we don’t really care about any one
of these tweets. We only care about
the percentage that fall in a category. Now, you’re probably
thinking, wait a second. How do you get the percentage? You just put them all in
the categories, add them up, and you get the percentage. And if you were
thinking that, you’d be absolutely right,
except you just assumed that every time
you put it in the category, you got it right. If you don’t get it
right all the time, those are two different
things, right? So let’s take all pairs
of countries every year since World War II. 1,000 of those million pairs
of countries were at war. So now let’s come up with the
prediction of the proportion of pairs of countries at war. Well, a really good prediction
that will be right nearly 100% of the time is there’s
never been any war, right? So that’s useless
information, right? The percent correctly
predicted is really high, and we don’t get a good
estimate of the thing that we really care about. So classification, although it
seemed like that was the way to get there, only gets
us there if we’re perfect, and we’re never perfect, OK? OK. So what does that do? So what that does
is all of a sudden we realize that
methodologically, we care about the second
thing, not the first thing. And if there’s another way of
getting to the second thing, estimating the percentages
in the categories, we don’t even care whether we do
a worse job in the first thing. And that’s what we did. We developed a set
of methods that gets us estimates of the
percent in a category that’s accurate on average. It’s called unbiased,
is the statistical term. It doesn’t even classify
at all individually. I realize it sounds like magic. If anybody wants, I’m happy
to explain all the math. In fact, if anybody
flinches, I will. [LAUGHING] But basically, we came up with
a way of estimating the percent in a category. Once we realized what the
quantity of interest is, we used the tools that
we’ve all developed. We developed new
statistical procedures to be able to do this. And the really cool thing
was it solved these two completely unrelated problems. Actually, I was working on the
first problem for the World Health Organization, and
we solved this problem. And for a year, I was trying
to solve the sentiment analysis problem. And I tried every method
that the computer scientists had come up with. And everything was
a complete disaster. It just was nowhere near close. And so one day–
and I’d try things, and I’d give them to
my graduate students. And I’d say, here, I found this
new thing in the literature. Why don’t you try this? And like every morning they
would roll their eyes at me, like, this one’s
not going to work. None of the others worked. And then one day I realized
mathematically these two problems are exactly the
same, and our method works very well for verbal autopsies,
so it will work here. And I’d tell my graduate
students, they will work, and they’re rolling their eyes. But nonetheless, I said
mathematically this– and it actually works. So the consequence of this
is modern-day analytics lead to us developing
these algorithms. Actually Harvard patented
them and licensed them to a start-up company
called Crimson Hexagon. They have around 200 employees. They go around the world. They collect social media posts. They have about a trillion
social media posts, so that’s actually
a real number there. And they do brand
monitoring for people. I helped found this company. I put this up here. Crimson Hexagon,
I’m very proud, was named number seven of the top 10
most innovative companies on the web, which
I mention primarily because my brother
works at Microsoft, and Microsoft was number nine. [LAUGHTER] And also, the same exact
method, in fact the same code, is used by the World Health
Organization and others to estimate the prevalence by
cause of death in countries all over the world. So that’s the cool
stuff that we get to be involved in in
social science methodology. OK. Now I’m going to completely
switch subjects, OK? So if you fall asleep,
it’s best to fall asleep right between slides and then– [LAUGHTER] OK? So these are more examples. OK, we’re in Washington, right? Just so you know. [LAUGHING]
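The war example above can be put in numbers. What follows is a minimal sketch, not the method described in the talk: the function names and the error rates of the second classifier (`tpr`, `fpr`) are hypothetical, and the correction shown is the standard adjusted-count formula, used here only to illustrate that a category percentage can be estimated well even when individual classifications are often wrong.

```python
def accuracy_and_estimate(n_total, n_positive, tpr, fpr):
    """Expected accuracy and classify-and-count share for a classifier.

    tpr: share of true positives that the classifier labels positive.
    fpr: share of true negatives that the classifier labels positive.
    """
    n_negative = n_total - n_positive
    labeled_positive = n_positive * tpr + n_negative * fpr
    correct = n_positive * tpr + n_negative * (1 - fpr)
    return correct / n_total, labeled_positive / n_total


def adjusted_count(estimate, tpr, fpr):
    """Correct a classify-and-count share using known error rates."""
    return (estimate - fpr) / (tpr - fpr)


# Figures from the talk: 1,000 of a million country pairs were at war.
N_PAIRS = 1_000_000
N_AT_WAR = 1_000
true_share = N_AT_WAR / N_PAIRS

# "There's never been any war": no pair is ever labeled as at war.
acc_never, est_never = accuracy_and_estimate(N_PAIRS, N_AT_WAR, tpr=0.0, fpr=0.0)

# Hypothetical imperfect classifier (error rates are assumptions).
acc_noisy, est_noisy = accuracy_and_estimate(N_PAIRS, N_AT_WAR, tpr=0.9, fpr=0.05)
corrected = adjusted_count(est_noisy, tpr=0.9, fpr=0.05)

print(f"true share at war:        {true_share:.5f}")
print(f"'never war' accuracy:     {acc_never:.5f}, estimated share {est_never:.5f}")
print(f"noisy classifier accuracy {acc_noisy:.5f}, estimated share {est_noisy:.5f}")
print(f"adjusted-count estimate:  {corrected:.5f}")
```

Both classifiers look accurate pair by pair, yet adding up their labels gives a share that is either zero or roughly fifty times too large. The corrected figure recovers the true share only because the error rates are assumed known here; in practice they would have to be estimated from hand-labeled examples.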
So the United States Social Security Administration is
the incarnation of the Social Security program, right? The single largest
government program, lifted a whole generation out
of poverty, extremely popular, highly partisan, the third
rail of American politics, it’s called, right? In essence, the program
provides benefits to retirees and people who are
just disabled and families. In order for the program
to succeed, to survive, the Social Security
Administration must forecast how much is in
the Social Security Trust Fund. That’s what everything
depends upon, OK? These forecasts are used by the
Social Security Administration to make the whole thing work. So if retirees draw benefits out
too long– if someone invents a pill and we live
to 200, we’ll be cheering until we
realize there’s not enough money in Social
Security to pay for us all. And the trust fund
will go insolvent, and we won’t be able to have
enough money for retirement. So the forecasts are
essential to this program and to keeping people
out of poverty. It really matters. Moreover, many other United
States government programs depend upon these forecasts,
and those programs are run based upon
what the forecasts say. More than half of United States
government expenditures depend upon these forecasts–
more than half. OK, so we looked
at these forecasts. The interesting thing
is these forecasts have been made for
85 years, right? And they weren’t forecasting
85 years ahead of time, so we had this amazing
scientific opportunity to evaluate forecasts, real,
out-of-sample forecasts, because the year they were
forecasting actually occurred. So we can actually
see how well they did. So we used a very
complicated statistical method to compare the truth
and the estimate. The method is known
as subtraction– [LAUGHTER] –to evaluate these forecasts. The methods, as it turns
out, had been little changed in 85 years. They’re mostly qualitative. And these 85 years are
not just any 85 years in the history of the
country or the world. These are the 85 years in
which we’ve learned more about forecasting,
as I mentioned a minute ago, than in any
time in human history, and this is the time that
the United States Social Security
update the methods by which they ensure the solvency
of our retirement, OK? So this seemed
like it was worthy of us paying somewhat
more attention to. OK, so we did quite
an extensive study. What are the results of this? Well, until about the
year 2000, the forecasts were about unbiased, which
is what you would want. It’s not fair to ask
of the actuaries who make the forecast that their
forecasts should be spot-on. They are forecasting
the future, after all. The future is uncertain. You heard it here. Until 2000, for 75 years,
or however many that is, they made forecasts. Sometimes they were too high. Sometimes they were too low. But on average, they
were about right. That’s what we seek. Since 2000, however, they
became systematically biased. And it’s not just one forecast. It’s a whole bunch of forecasts. And every forecast was
biased in the same direction. Every forecast was
biased in the direction of making the system look
healthier than it really is– every single one. Why is that? As a side point, why is that? Well, as– you
were looking ahead. I noticed it. OK. [LAUGHTER] Why is that? Well, as social scientists,
we do quantitative work. We’re quantifying things. But we never leave behind
the qualitative evidence. Because it’s impossible to
quantify all information. There’s always going to
be essential information that humans beings
have that we’re not going to be able to
completely quantify. And so we also did
interview people. And we figured out
what the reason was. Like how come this
all was biased? Is it just some partisan scheme? No, actually. That’s not what it is. What happened was
Social Security became much more partisan. That’s absolutely true. Both the Democrats
and the Republicans are at fault. They
both pushed very hard. They’re not actually
at fault, but they changed the environment. The actuaries did what we
would want good public servants to do. They hunkered down and
protected themselves from the Democrats
and Republicans, who might like the
forecast to go this way or go this way, right? Because it would be convenient
for their political arguments in changing public policy. The actuaries resisted
that, just like we wanted. But the actuaries also resisted
pretty much everything else, including the data. And as it happens– [LAUGHTER] And they insulated
themselves basically from the facts and the data. Since about 2000, as
it happens, Americans started living
unexpectedly longer lives, not in every category,
not every person, but on average,
they started living unexpectedly longer lives. If you’re taking statins,
keep taking them, OK? Although don’t
take medical advice from a political scientist. [LAUGHTER] But we started
living longer lives. Like these kinds of
innovations actually mattered, and you can see it in the data. And that’s a really
terrific thing. But it changed the forecast. Like you have different inputs. You should change
the outputs, OK? People smoked less. That’s good. But they ate more. That’s bad. The smoking less was better than
the eating more was bad, and that wound up working out
so that mortality was decreasing at a rate faster than expected. That’s not necessarily
going to continue to happen, but they ignored the
data, and as a result of ignoring the data,
they just missed what some people
estimated from our data was a trillion-dollar error, OK? OK, so in addition to
this complicated method of subtraction, we
actually created an actual complicated
method of forecasting. We came up with a better
method of forecasting. How should the actuaries
actually forecast? We developed new social
science statistical methods that can forecast
much more accurately. They are, for example,
logically consistent, like, unfortunately, older
people have higher mortality rates than younger people. You would expect that. The methods the Social Security
Administration is using even to this day don’t require
that. There’s quite a number of other logical consistencies. They also produce much
more accurate forecasts. The last time we ran
this, the trust fund needed $800 billion more
than the actuaries actually indicated. So you know, bummer, but– [LAUGHTER] At last I think we have
better information. This, by the way,
doesn’t say anything about what public
policy should be, right? Our elected politicians get to
make these decisions somehow, or not make the decisions. That’s also a decision. But we hope to give them
much better facts on which to make the decisions, and
I think we’ve done that, OK? Deep breath, fall asleep,
wake up, new subject. [LAUGHTER] OK, gerrymandering–
redrawing legislative district boundaries, the boundaries
of a legislative district. This is a really troubling
area, a hugely troubling area. It is the most conflictual form of
politics in the United States, the most predictable form
of conflictual politics in the United States,
short of violence. And in many areas, it doesn’t
stop short of violence. There’s a lot of examples,
which we collect, of fist fights on the floor
of legislatures, almost all over redistricting. This is a big deal. For incumbents,
there are hardly any things that are more frustrating
than redistricting. Because what happens is
somebody in a basement, who looks a little like me, is
redrawing district boundaries and deciding who gets to
keep their job or not, right? That’s a hard job these
folks have, right? But fortunately they don’t do
this for professors, right? And for voters, it’s the
same kind of frustration. They’re not going
to lose their job as a result of
redistricting, but they’ll lose their representative. And they will be in
a different district. And they may have
had information about who they would
vote for, and now all of a sudden they
don’t because there’s a different configuration
of candidates. So this is a really
difficult area. It’s very, very partisan. It produces a train wreck
of litigation on cue. Like every 10 years
after the census, they allocate congressional
seats to states, and then every state
has to redistrict, and then every state
has to have litigation. It’s a complete train wreck. There’s huge amounts
of money wasted. Both sides do whatever
they can to get advantage. So analytically from a
social science perspective, what are the problems here? Well, first of all, there
was no agreed-upon standard of partisan fairness,
all Right how do you know what a fair
redistricting plan is? And secondly,
there was no method by which we could estimate
whether a plan met the standard. So we needed standards,
and we needed methods. Secondly, in order to figure
out whether the Voting Rights Act applies, the courts
required estimates of whether blacks and whites
and Asians and Hispanics were voting the same way. But we actually have this thing
in the United States called the secret ballot. And so although the law
says, you must tell us how different people are
voting, the law also says, you may not know. [LAUGHTER] OK, that’s where the
social scientists come in. Because we can
estimate these things. And that’s– I’m
getting ahead of myself. So we have solutions, OK? So the solutions– the solutions
were not about the data. The data have been
available for years, right? There is census data, and
there’s election data, and there’s where
the districts are. And there’s data at the precinct
level or the census block level. We have lots of data. First we needed a standard. So I and some co-authors
developed a standard for partisan
fairness, for what it means for a set of
legislative districts to be fair to both parties. So we developed the standard. It is agreed to pretty much by
all, that is, by both parties in almost all
major redistricting litigation, including cases
that have been decided by the Supreme Court this
term and many other terms by Supreme Court
justices that have written about it and many other
courts all across the land. So they basically agree with the
standard of partisan fairness. We also developed a sequence
of statistical methods, each one of which better
than the previous, or at least so we claimed in
order to get it published. But at least that
was the theory. And the methods estimate
on the basis of the data and the plans, the
different plans offered by the different parties
and citizen groups, et cetera, how
fair each plan was. And these basically are agreed
to pretty much by everybody. The parties who hate
each other, they all use the same methods,
which is really terrific. We’ve also developed what are
now very widely used methods for estimating
individual or group behavior from aggregate data. So if you only know the
percentage of African Americans in this district, and you know
the percentage of people voting for the Democrats, and there’s
lots of African Americans and lots of people
voting for the Democrats, it might be that the
African Americans are voting for the Democrats,
but it could actually be that lots of whites
that live in districts with African Americans are the
ones voting for the Democrats. Well, then that’s known as an
ecological inference problem. So we’ve come up with methods
that enable us to get around these problems. They’re uncertain, as you
might imagine, because there’s this information lost. But they are now used in
pretty much every court by experts on all sides. So these are just some ways that
social scientists have actually made some contributions. Are you ready for a
different example? [LAUGHING] OK, keywords. [LAUGHTER] It’s the obvious next
example, don’t you think? Humans, I’m going
to make the case, are horrible at
choosing keywords. Now, wait a second. That seems ridiculous. We all do Google
searches every day. I’m going to convince you
that you are horrible at doing Google searches. Well, you’re horrible
at thinking of keywords. Here’s an experiment, OK? You can play along if you want. We actually ran this experiment
with 43 undergraduates at a nice college in
Northeastern United States in Massachusetts,
sort of around Cambridge, Massachusetts. [LAUGHTER] So we gave this to
43 undergraduates. We said we have 10,000 Twitter
posts, each containing the word “Boston”– so the 10,000 posts
each used the word “Boston”– from the time period around
the Boston Marathon bombings, when all of us up there were
hunkered down in our houses because they closed the
roads and they said, you can’t go out. Please list any keywords– this
is the task I want you to do. Please list any
keywords which come to mind that select posts
in this set of tweets related to the bombings
but won’t pick posts that are unrelated to the bombing. So don’t pick the word
“the,” please, right? Because it’s in both. You have to pick words that are
just related to the bombings, OK? So you think about what that is. How many words can you think of? Just think of it, OK? OK. I won’t pick anybody
out from the crowd. I promise. OK, so here’s some examples, OK? Explosion, all right? Lockdown. Tsar– however you
say this guy’s name. Terrorist. These are the kinds of
words that people chose. Maybe some of you
chose these words. My guess is very few
chose these words. Am I right? I’ll show you why, OK? The median number
of keywords thought of when we gave them plenty
of time for our undergraduates was eight. Each person came up with–
some more, some less, but on average they came
up with about eight. The number of unique
keywords chosen or thought of across all 43 undergraduates
was 139 unique keywords, OK? Each person could only
think of about eight. So let’s ask the question. For each unique keyword,
how many of them were thought of
by one person and, when given a chance,
42 other people, every single one of them,
failed to think of it? That’s pretty dramatic, right? How many? Turns out 2/3 of the
time– 2/3 of the time. So what does that mean exactly? That means humans
recognize keywords well. If I show you a
keyword and I ask you, is this relevant to the
bombings and not relevant to the not-bombings,
you’d know right away. I gave you these
examples right away. You knew they were
relevant, right? So we’re very good
at recognition. We suck at recall, OK? In fact, we’re so bad at
recall that I can convince you that all brains have an
inhibitory process that causes us to forget things, which
is sort of a cool fact, OK? Why would that happen? I can explain why
it happens, but let me make it happen to
you right now, OK? Do you know the idea of it’s
on the tip of my tongue? That’s an inhibitory process,
something in your brain causing you to not remember something. So no one speak, OK? Think of your bank password. That’s why I asked
you not to speak. Now think of your
previous bank password. Now let’s assume you
don’t rotate them because the bank tells
you not to do that. Now think of the bank
password before that, OK? My guess is almost nobody
can think of it, OK? But if I showed it
to you– that would be a really cool magic trick. [LAUGHTER] And that is how I
paid for college. But if I showed it
to you, don’t you agree that you
would recognize it? OK, so that’s the thing. We’re good at recognition. We suck at recall. Surprising, but it’s true. OK, so what do we do about that? We’ve developed
some new technology. We call it thresher. And this technology, like
most really good technologies, does not fully automate
the human away. Because fully automated
technologies usually do stupid things,
like a driverless car without anyone to tell the
driverless car where to go. It wouldn’t be very useful. It would drive around
in circles, right? Fully automated doesn’t
really do anything. Fully human is inadequate. I just showed you. You are inadequate. We are all inadequate
because we’re humans, right? So what this technology does
is it suggests words to you so that you can do recognition,
which you’re good at, rather than recall,
which you’re bad at, OK? OK, so let me give
you an example. We’re going to find those
hiding in plain sight. We were studying China
and Chinese censorship. And I’ll give you an example of
Chinese censorship in a minute. But we use this technology, this
thresher technology, in order to find the following. So first of all, I’ll
give you an example. OK, so anybody speak Chinese? Do you know what that is? Freedom, that’s right. That’s the word “freedom.” If you use the word “freedom”
in China on certain websites but not others, certain
social media websites, it will be filtered out, OK? And you write it in
your social media post and you use that word,
you won’t see it. It will be filtered out– not
every one, but a lot of them, OK? So what do people do, right? Well, if you had to rewrite
a sentence without a word, it would be no problem, right? You can find some way of
writing around it, OK? What do they do? They use this word. Those who speak Chinese,
do you know what this is? First of all, look
closely, for those– this is not the same as this. It’s close, but it’s
not the same, OK? This means “eye field,” an eye
and a big field of daisies. It has no meaning at all, OK? So what people say is, we need
more “eye field” in China. And if you speak Chinese
and you see that, then you recognize that
the writer means that. It’s very clever,
don’t you think? And this is known
as a homograph. Here’s another example. This means harmonious
society, which is the official slogan
of the Communist party. This is filtered out. They don’t want you
to talk about that. So what do they do? Well, they don’t use that. They use this. What is that? Well, it turns out this
means “river crab,” OK? And they say the policy of
river crab totally sucks, OK? Well, then why are
they doing that? The reason why is because if
you say these two words, which I have tried so many
times, and anybody who speaks Chinese
in the audience doesn’t think I came
anywhere near close. But this is, of
course, irrelevant. But this is known
as a homophone. These two sound almost the same. Can you say them? Yeah. Say it. [SPEAKING CHINESE] That’s right. [LAUGHTER] Well, here’s another example. This is Bo Guagua. He was the son of Bo
Xilai, and Bo Xilai was the guy that came down in
the biggest scandal in China in 20 years. So you used the word Bo Guagua,
and they censor it out, OK? So how did people
write about this? This was the most
fascinating event in China. If you were in China, you
would want to write about it and read about it and
figure it out, OK? And we in the
United States study China for all these cases. We wanted to follow the
thread of the conversation. When the people we
were studying were incredibly creative and
innovative and making up new things, we still
wanted to follow the thread of the conversation. If we had to sit there
and think of words, like in the Boston
bombings example, we would fail because we
had these inadequacies. But in this case, we would fail
even if we were incredibly good because how do you think of
something like that, right? OK, Bo Guagua. Instead, they changed
the first Chinese letter to the English letter B.
OK, I can sort of get that. They leave off the next letter. That sort of makes sense. OK, they use the second two. OK. Then use B melon. [LAUGHTER] Why is that? Well, one of these
characters by itself actually means melon, OK? Or ABB. Why would that be? We had to research this. Turns out that the princelings,
the sons and daughters of the Politburo people,
for whatever reason have names that are like Bo
Guagua, like ABB. And so ABB is an
abbreviation for princeling. And so these are ABB, right? So these are basically versions
of slang, some of which you might have actually
thought of if you had a really good day, OK? But the Chinese people are
doing this continuously. How would you be
able to follow this? We use this technology to
follow this kind of thing, OK? Let me tell you about
our China study. Because it’s sort of fun. This is a more general thing. OK, the previous approach to
studying censorship in China was talk to somebody that
had a post taken down. Like somebody would
write a post and notice that it was taken down. That’s what they do in China. They read the posts,
and they take down the ones they don’t like. It’s sort of amazing because
they have millions and millions of posts, right? Well, the problem is that
one person watching that would be like an
ant on an anthill. They don’t get to see
the big picture, right? So what we did is we
noticed that we were able to surprisingly–
actually we were very surprised–
that we figured out how to download all
Chinese-language social media posts before the Chinese
government could read and censor them. So we had the entire corpus of
Chinese-language social media posts that the Chinese people
couldn’t read because they weren’t allowed to. But we could, not because
we were allowed to, but because we had them, OK? And so then we developed
a network of computers around the world to check
on each of the posts to see whether it
was taken down, OK? And then so now
we have two piles of posts, one censored,
one uncensored. And we used methods of
automated text analysis that we developed in
social science methodology to try to figure out what the
Chinese government is after. OK, about 13% are censored
overall, just to orient you. Now, I’ll tell
you what we found. Overall, everybody knows
the goal of censorship. The goal of censorship is to
stop criticism and protest about the state, its
leaders, and their policies. So we knew that goal, and so
we went in with that goal. And we analyzed the data. And the first
thing we learned is that it was utterly
and completely wrong. Like we got nothing remotely
related to this goal. So we thought, well, OK,
that’s a good starting point. It would be nice if we
had an ending point also. And so we started to look at
the data in many different ways until we finally hit upon
a way that made some sense. We asked, what
could be the goal? What we did is we
broke up the question. And we said, well,
maybe the goal is to stop criticism of
the state, its leaders, and their policies. Or maybe the goal is to
stop collective action, stop protest. And we separated the two. Once we separated the two–
we had the theoretical idea to separate the two– all of
a sudden everything incredibly clarified. It’s like when you’re
at the eye doctor and they go, is it
one or is it two? It’s two, right? Because all of a sudden
you can see that there’s someone across the room. OK, what happened? The first was wrong. The second was right. Let me explain what that means. In China, what everybody
thought was you couldn’t criticize the government. What is actually the case is you
can criticize the government. You can say whatever you
want about the government. You can say, the leaders of this
town are all stealing money. Here’s how much. These are the bank
accounts they have the money hidden in overseas. And by the way, they
all have mistresses, and here are their names. That won’t be censored. But if you say, and
let’s go protest, that will be censored, OK? In fact, if you say,
the leaders of this town are doing such a great
job, let’s have a big rally in their favor– censored. They don’t care what
you think about them. They’re a bunch of dictators. What should you
think of them, right? They only care what you can do. They’re not afraid of the
United States government and our big military power. They have nuclear weapons. What are we going
to do, attack them? Right? It’s not going to happen. What are they afraid of? They’re afraid of their
own people, right? So their own people– how could
their own people affect them? Their own people could
affect them if they rise up and they have another
big Tiananmen Square and it spreads contagiously
across the country, OK? So they stop collective action. The implications of this
are really interesting. The implication is that social
media, then, is not merely something cool to study. It becomes actionable. It’s actionable for
the Chinese leaders, and, therefore, it’s
actionable for us. For the Chinese leaders,
they measure criticism to judge local officials. If you’re in charge of
China, what do you want? Well, you have more
power than Barack Obama and Bill Gates combined. You want to keep a
good thing going, OK? How do you keep a
good thing going? Well, you make sure there’s
no collective action. How do you do that? Well, there’s between 50,000
and 700,000 governmental units across China, depending
on how you count, OK? You appoint all of your best
friends to run them, OK? And there are a whole lot
of other people to run them. And then what happens if one
of them isn’t doing a good job? Well, then maybe
people protest, and that protest may spread contagiously
across the country, and then you might
lose your job. How do you watch
all of those people? Well, actually social
media is a really good way. You could see why
they’re being criticized. And it turns out we can
use how much they’re being criticized to predict
whether or not they’re going to lose their job. Because they’re using that. And of course, they
censor to stop events with collective
action potential. So we all thought they
won’t allow criticism, but they allow criticism. It’s useful for them. We as academics can use the
criticism and the censorship to predict officials in trouble
and likely to be replaced. Or dissidents,
like before they’re arrested, if censorship
goes up very substantially on a particular dissident,
they need to duck, because they get arrested
four or five days later. Or a peace treaty–
a peace treaty between Vietnam and
China, they found oil in the South China Sea
between those two countries. And both countries started
taking a real liking to that body of water. And the media is talking
about potential conflict between the two countries. And they’re sort of shooting
across each other’s bow or the equivalent. And so what happened? Censorship was soaring. And then one day we noticed–
no one else did because nobody else had the data. We noticed it just stopped. They stopped
censoring completely. What’s the media talking
about at that time? They’re talking about maybe
there’s going to be a war. What’s going to happen? This is going to be some
conflagration, yes. But we noticed
censorship had stopped. Four or five days later,
they signed a peace treaty with Vietnam. We don’t know for
sure, but we think what happened was
they made a deal, and they told the censors,
don’t worry about it, and they signaled. What’s happening
here is that you have a giant operation within
the Chinese government designed to slow the flow of
information, but it’s so large that it conveys a lot
about the intentions and the goals of
the Chinese leaders. It’s like a big elephant
tiptoeing around. It leaves big footprints. And if you look at scale
with big data social science methods, you can
actually see these things ahead of time, emerging
scandals and things like that, disagreements
between central leaders and local leaders. We can also see those things. I’m going to– let’s see. I am going to not be
able to see the time. OK, I’m going to skip
two slides and give you one final– agree with
everything on these slides, right? [LAUGHTER] OK. All right. Let me tell you about
one final message, which is written up there, OK? So let me ask you the
following question, OK? What university research
has had the biggest impact on your lives personally? Now let’s think about what
the biggest kinds of things universities have done. Well, first of all, all the
progress in the last 400 years has come from science. And most of that, not
all of it by any means, but quite a lot of it has
come from university research. You can think of this as
just research if you want. University was gratuitous. What research has had the
biggest impact on you, what university research? Maybe the genetics
revolution– we spend a lot of money on
the genetics revolution. The general genetics
revolution has taught us an enormous amount
about the basic biology of what’s going on. But it’s sort of been
a failure up until now in terms of curing
diseases, at least relative to its predictions. We think that’s going to
change, but we’re not there yet. Well, how about the
Higgs-like particle or gravity waves more recently? Actually, to me those are
just like the coolest things that you could possibly see. But how much does it have
an effect on your lives personally? Some, some. Just the wonder is totally
worth the cost and way more that we pay for
these things, OK? I’m totally in favor. But I just want to
put it in context. How about exoplanets
and the Mars rovers? Yes, yes, yes, yes, yes. Please pay for this. How about doubling life
expectancy in the last century? Yes. [LAUGHING] OK? There’s a lot of things
that belong in this list. My only argument here
and my last slide today is that on this
list also belongs– I was hoping it was
going to come up– also belongs quantitative
social science. It also belongs on this list
of the most important ways that university research has
impacted you in your lives. Let’s think about what
it’s actually done. First of all, what is it? Well, it’s actually big data
or big data applied to people. We sometimes call it data
science or data analytics or statistics or [INAUDIBLE]
or political methodology. Or there’s actually like
100 different terms. And you could use
thresher technology to discover all those words
that you wouldn’t think of. What has quantitative
social science done? Well, it’s transformed
most Fortune 500 companies into data producers, data
analyzers, and data monetizers. It’s transformed most
Fortune 500 companies. It’s established
whole new industries. It’s altered
friendship networks. Facebook is basically a
social science innovation. So what is social media? Social media is the
largest increase in the expressive
capacity of the human race in the history of the world. That’s cool, OK? It’s changed political
campaigns completely. Hasn’t always
produced the outcomes we want, but we don’t mess
with the outcomes, OK? It’s transformed public health. It’s changed legal analysis. It’s not discovery anymore. It’s e-discovery. It’s impacted crime and policing
in huge ways, military as well. We invented economics,
transformed sports. Have you seen Moneyball? It’s transformed sports. Who would have thought
quantitative social science would have an effect on sports? It sets standards for
evaluating public policy. We did an experiment in Mexico,
where we evaluated their health care system. And we did a randomized
experiment, the largest randomized experiment
in health policy ever, where we randomly assigned
hospitals and doctors and medicines and money to
pay for it all to communities and learned a tremendous amount. And they got to improve
their health system. And millions and
millions of people got health care that wouldn’t
have got it otherwise because we did this evaluation. It was a really
gratifying kind of thing to be able to participate in. Anyway, there’s also
three et ceteras, because there’s
lots of other things that quantitative social
science had an impact on. And so if you’re anywhere
part of or touched this world, I just want you to remember that
this belongs on that list, OK? And I thank you for
all of your time. [APPLAUSE] Nobody interrupted me, which
I apologize for not insisting that you interrupt me. So interrupt me now. [INAUDIBLE] politicized to the
extent that– oh, I’m sorry. Oh, excellent. Mortality statistics go
back hundreds of years. So how could
Republicans or Democrats disagree on
mortality statistics? I mean, there’s no hope
if that’s the case. [LAUGHTER] Well, let’s put it in context. So politicians don’t have
to agree with the facts. They can use facts
any way they want. We’re in a democracy. The truth is one of
many considerations. And so the politicians can
decide on whatever basis they want. We’re not going to tell
them how to decide. I agree with you
that it would be nice to have the
facts on the table. And the current mortality
statistics, the fact is, everybody does
agree with them. Forecasts about them,
we should be using best practices, absolutely. The fact that we’re not hurts
us all, Democrats, Republicans, retirees, workers, everybody. I completely agree. There’s no reason for that. OK, thank you. Sure. Run, run, run. [LAUGHING] Sorry. Thank you for a great talk. First, a comment on the Social
Security Administration– there were many
over the years following 2000 who said that the actuaries
were just extrapolating. They weren’t taking into
account myriad factors, like you point out, health care
and changing lifestyles and the like. And Sam Preston and
others repeatedly showed in their research how
that did change their thinking. So a lot of this has to do with
the culture of an organization more than, I think,
the politics. Perhaps one of the
greatest challenges is trying to infer
meaning from language. And will we be able to put this
into terms of what’s written versus what people really mean? And do you think quantitative
social science will give us any insight? [CHUCKLING] It is the
culture in each of these– you asked two different
questions, which essentially have the
same underlying piece, which is human beings communicate
in very complicated ways. We sometimes call it
culture or language. And there’s a way that
when we quantify that, we will lose some
of that information. It’s not only a worry. It’s real. In fact, the point
of quantification is mostly to throw
away other information so that we can have an
abstract summary. If we summarize the
wrong piece of it, then we completely
lose everything, absolutely, totally agree. That’s why I think the best
methods that we come up with are the ones that are
not fully automated but not fully human, the ones that are
human empowered but computer assisted. When you all write,
when you write letters, if you use Microsoft Word
or Emacs or Google Docs or whatever it is, what is that? That’s computer-assisted
writing. Like Microsoft Word doesn’t
tell you how to write. It helps you write, OK? So we have a project to do
computer-assisted reading. Well, what is that, actually? In fact, let me make
it more outrageous. It’s computer-assisted
conceptualization. Well, wait a second. Computers are not
allowed to do that. Only we do that. Well, actually, we have a
pretty lame working memory– really lame working memory–
really, really, really lame. We have this real limitation. Let me explain how limited
our working memory is. When you’re writing something,
if you re-write one sentence, what’s the first
thing you do after you rewrite that one sentence
and rearrange 10 words? You read it again. You read it again. You can’t remember 10
words and rearrange them? No, no, because
you are inadequate. You are humans.
[LAUGHING] Right? That’s the problem, right? So the idea that
we can be helped– yeah, of course, right? I mean, we can have
computer-assisted conceptualization. So we have some methods
of doing computer-assisted conceptualization. There’s a long way
there before we get replaced by the
robot, which I don’t think is going to happen. But yeah, so we have to
pay very close attention to the culture of organizations
and the deep qualitative information that
individuals have. And that is absolutely the
problem in the Social Security Administration. There is a culture in the
Office of the Chief Actuary that is they are at the
center of attention. They’ve done a very good
job of curating the data, but they don’t want to
share the data with anybody. Years ago that made sense
because other people would screw it up. But really what they should do
is they should take the data. They should just
make it available to the scientific community. They didn’t even share the data
with other parts of the Social Security Administration, OK? If they shared it with
the scientific community and provided
replication data sets like we’ve been doing in
academia, as you might know, then you would have hundreds
of social scientists trying to do better. And if they did better, they
would cast no aspersions on the actuaries. It would be great
for the actuaries if someone did better. Someone at the end of
the day, by the way, has to make the call. And the person who has
to make the hard call is the chief actuary or the
Office of the Chief Actuary. Or they make the recommendation
to the Social Security board. They make the call, I guess. That’s fine. But they should have
the best advice. They shouldn’t be
doing it by themselves. The key contribution
of science, the reason why the social sciences are
moving from studying problems individually, sort of in
our separate monasteries, to a scientific model
where we’re actually solving problems, is
because of the community. Because it’s much easier
to fool ourselves than it is to fool other people. And when we work together and
we share data, then one of us can check on the other, and
together we can do way better. That little paragraph
actually accounts for almost all progress
in the last 400 years. So that’s a long answer
to your question. Katie. Can I ask a follow-up
question on access to data? I mean, one of the barriers to
having more social scientists working like this is that it’s
often very difficult to get access to this kind of data. Either it’s corporately
owned and controlled, or, as you mentioned, government
isn’t willing to give it up, so how have you been able to
get access to the data? And what more should we all
be doing to push that forward? Yeah, so access to data
is absolutely essential. Let’s think of it in a
couple of different ways. So one is, at a
minimum, academics seem to be sharing
data with each other. That’s actually harder
than you would think, but we’ve made a lot of
progress over the years. In 1995, actually, I wrote
an article, as you know, called “Replication,
Replication,” where we encouraged scholars to
make data available when they publish their article. We’ve made enormous
progress since then, and actually, it’s now
the norm within academia. Governments have actually
made a lot of progress in making data available. As we’ve seen with Social
Security, not all parts of government, but local
governments right now are competing with each
other to make the bus schedules available so
the high school student can write the app to tell you
when the bus is going to come. That’s a terrific thing. They should continue
on that path, and they are, and that’s
a really great thing. The next stage is companies. It used to be that almost
all the data in the world was inside the university
because we created it. Now almost all the
data in the world is in companies and governments. Governments eventually
make data available. Companies don’t have to. And they now have more
data than anybody. There is a treaty to be
signed– to be negotiated first before we sign it, I think. But there’s a treaty to
be made, a grand treaty, between the academic world,
the commercial world, and government. And if we could sign this
treaty, what would happen, I think, and what
should happen is that the companies–
what do they want? They want access to the data. They don’t want the random
terror of government coming in one day and
saying, you know what? After two weeks, you have
to delete all your data and all their money also, right? They don’t want that. They basically would like
to be regulated in some way, as long as it’s predictable. Predictability in business
is very, very useful. So that’s sort of
what companies want. What academic researchers
want is access to that data. The data that
Google and Facebook and Microsoft and
all these companies have, we could learn so much
more about human behavior. We would live longer, healthier,
happier lives, and most of the issues that
members of Congress care about are actually the issues
that social scientists study, like the vast majority
of the issues. And we would be able to
make progress on everything from teenage pregnancy to
unemployment– you name it, every single one
of these issues– if we could get
access to these data. But it’s not fair to ask
these companies for access to the data if we’re going
to hurt them commercially. It’s not fair, right? So they have to get
something from it. They have to get some
promise that the people won’t come in and give them
random terror on– they call them the privacy nuts. They’re actually
very important to us. So that’s why I think
there’s a treaty to be made, like all three
partners would be way better off if we could
make this happen. Politically this is a
very difficult thing. But I think somebody
should work on that, maybe even in a building like this. [LAUGHTER] Thank you. Increasingly, large
public agencies that serve vulnerable
children and families, for example, state child welfare
and juvenile justice agencies, are very enamored of data
and predictive analytics, predictive risk modeling. For example, looking into
emergency room records to identify families most at
risk of abusing the children raises all kinds of questions
about bias and surveillance. I would be very interested
in your take on this problem. I know Columbia University
is developing a fair test procedure, an algorithm to
help those machine-learning pipelines. But I would welcome your
thoughts about that issue. Yeah, so this is a
really important issue in this area, but
also quite generally. The advantage of big
data is it becomes more and more informative. And that means we can solve
more and more problems. But it’s also potentially
more and more intrusive, OK? And so what’s going
to happen– I mean, there’s going to be a
lot of treaties signed. Some of them will be
signed like every year and they will be
different, right? But whatever they
are, there’s going to be some compromise,
because all of these issues are really important,
like we want to maintain individual privacy. And also we’d like to solve
these societal problems. My main answer to this
is please don’t forget all of those parts, right? Some people in politics say
it’s a privacy violation. Nothing is ever only a
privacy violation, right? It also produces a good. So the good is– I mean,
how much privacy would you be willing to give up to live 10
years longer than your expected lifespan? [LAUGHTER] Right? It’s never put that way. It’s only put, can
I look at your email without your permission? Because the answer to
that question is no, OK? But if the benefit
were clear also, then it’s not like the
answer is necessarily yes, if your life were
miserable, right? These are difficult
questions, OK? But I just want us to
remember the good, OK? And there are ways of using
very private data, at least by academics, in ways that
can benefit everybody, OK? A majority of children who have
cancer in the United States are in randomized experiments. That is, we decide on their
care based on flips of coins, in very scientific ways
that benefit them also. But the parents of these
kids have, every single one of them, given permission. And you know what? We all benefit because of that. When we give up data in
the interest of science, big deal, right? So it could be that
somebody at Google or the governments
or somewhere is like looking at something
they shouldn’t be looking at. And if we find them, we’re
going to send them to jail, OK? But still, they could look at it
and they really like annoy us, OK? But if it’s something
that could cause us all to live 10 years longer
or solve the crime problem or eliminate the problem
of people dying from pain medicines or any
number of other issues, I just don’t want anybody
to forget the good. That’s all. Where do we put the– Have you seen
collaborations developing between quantitative fields
and social science fields? I mean, you have your
institute, but do you see more interdisciplinary
work in the future, more traditional academic walls
or societal walls breaking down as people merge their talents? Yeah, I think that’s
a great question. I’ll give one answer
with two parts. In the social
sciences, more and more social scientists,
economists, anthropologists, political scientists,
sociologists, psychologists are being trained
as social scientists at large. Not completely,
but more and more. More and more, there’s
co-authorships across fields. More and more, we
know what people are doing in different fields. When we wrote our paper
on Social Security, to respond to Myron
again, we were very interested– it was very
important for us to figure out the culture of that organization
and the psychology of people in a very difficult position
with enormous political pressure. And so we studied that. Actually, in psychology
there were people that worked very hard on
that on these kinds of issues and how we might be biased
under certain circumstances. And we had other kinds of
ways of recognizing that we were inadequate, frankly. That’s what that
literature shows. And it was very important
to us, incredibly important. And so there’s
much more of that. So that’s one part, is yes. The other part is also yes. The other part is the
methodological subfields of these different disciplines,
political methodology within political science,
econometrics within economics, sociological methodology
within sociology, psychometrics and psychological
statistics within psychology and cliometrics within
history and chemometrics within chemistry
and you name it, OK? All these are methodological
subdisciplines submerged within substantive fields. But they are now
talking to each other. And they’re forming a sort of
meta scientific discipline. We actually have a name for it. We’re calling it data science,
whatever the heck that is. It had to be a name different
than all the existing names, right? Even though there’s a
lot of existing names. And it’s great that there’s
these communications, because the solutions
in one field turn out to be
solutions in the others. We have a seminar at Harvard
every Wednesday at noon. You’re all invited. Free lunch, by the way. It’s in applied statistics. It’s billed as a tour of
Harvard statistical innovations and applications with weekly
stops in different disciplines. And I’ll give you one
example from this. We had an astronomer
come– oh, no. We had a political
scientist come one week, predicting the number of
presidential vetoes a year. Now, if you look at
that, it goes like this. The average is around 15,
some zero, some more. It goes up, all right? And there’s some
statistical methods to predict that and
some innovations in statistical methods to
deal with a particular type of counts in
presidential veto data. Next week someone came from
the Chandra X-ray Observatory. What the heck is that? So that’s a satellite
that orbits the planet. It’s a telescope in
the X-ray spectrum. OK, so what’s that? Like what does the
data look like? So it’s basically like
a little checkerboard, and in each of the squares
of the checkerboard, it counts photons, little
particles of light. So it’s counting photons. So it counts, right? And if you look at the
data, you say, well, how many counts are there
per period or whatever it is? And the person standing
in front of the room said, well, it goes up and down. Sometimes it’s zero. Usually it’s about 15. And so it turned out that
the political scientists got some methods that the
astronomers were developing. And the astronomers
got some methods that the political
scientists were using. And there was seamless
communication of information, even though neither one had
any idea what the other one was talking about substantively. [LAUGHTER] So that’s the kind
of thing that I think we’d make a
lot of progress from. So it was a great question. [INAUDIBLE] He wants to stop us. [LAUGHTER] OK. Sure, sure, sure. [INAUDIBLE] I think we should, because I
know your time is limited, too. So I also wanted to
say that the line that was occurring to me from William
Gibson, the cyberpunk novelist, was, the future is
already arrived. It’s just unevenly distributed. I’d like to give a
huge thank you to Gary for giving us a glimpse
of that exciting future. [APPLAUSE] [MUSIC PLAYING]
