Interests of Society or Rights of Individuals? Promises and Challenges of Social Media and Big Data

(techno music) – Tonight’s program is
a really interesting one and turns out to be more
timely than we thought when we first scheduled it. The focus is on social media
and as you probably know, the social media we
use is a wonderful tool to enrich our lives, to provide us with
information and connections. And it’s also a wonderful
rich source of data that can be used in a variety of ways. It can be helpful to the
community not just to ourselves. The challenge, though, is balancing some of what we might learn from that data with our individual expectations
of privacy and security. So knowing that, our speaker
tonight has been working at a really interesting
intersection of social media and public health and geography and specifically, Dr. Tsou, who is here from San Diego State
University, is in the geography department, which we might hear more about. Usually you don’t think about social media and public health as being in
geography but apparently this is a dimension of geography
that’s quite often used. He’s a leader in working in this topic and is going to share with
us some of the science and technology behind this tonight and then we’ll have a conversation about some of the ethical issues. So, join me in welcoming Dr. Tsou. (applause) – Okay, thank you, Mike. Thank you, everyone. This is really my great
pleasure to come here today to share some of my research with you. So, I’m a geographer but
I collaborate with people from
different disciplines,
from public health to civil engineering,
communication, sociology. So, I think today, because of the
recent coronavirus outbreak, I’d like to pay more
attention in today’s topic to disease outbreak monitoring and also some issues
related to those situations. So, first of all, I want to introduce our Center for Human Dynamics
in the Mobile Age at San Diego State University. We built this center
about four or six years ago. The idea for this center
is that we want to create what we call transdisciplinary
collaboration. Traditional research
emphasizes the individual department,
but for a lot of today’s challenges we need to collaborate across
the different disciplines. So, in our center, we
are including the faculty from the public health,
geography, computer science, communications, also
sociology departments together, working with a focus on social media to solve real-world problems such as disease outbreaks, transportation, and disaster evacuations. So our center has
several funded projects with the National Science Foundation and the National Institutes of Health, and this is the list
of our faculty members and graduate students in our center. So, what exactly do we mean when we
talk about human dynamics? So, I’ll ask you, how many
of you have a smart phone? In your hand? Okay, everybody has one, right? So today it is a product everybody has, but the smart
phone was not available in human history until about 15 years ago. The first model was only released in 2007. In the whole of human history,
millions and millions of years, it is only recently that we
can track individuals, their behavior, their communication. But not just one person,
hundreds, thousands and millions of persons gather together. So this dynamic, this new
field called human dynamics is transdisciplinary
research focused on understanding and analyzing
those patterns, relations, narratives, and transitions
for our activity, our behavior, and our communications. So I’m going to give you one example. This is a graphic created
by one of my students. You see this is downtown San Diego and every single dot,
every purple dot, is one person
who actually tweeted using Twitter. How many of you have a Twitter account? Some do, but not very many people, though a lot of the young generation
have Twitter accounts. But even if only 1% of people in San
Diego have a Twitter account, that creates this type of data. This shows you, for every single hour, how many people create a message at an exact location in downtown. From the morning, to the
afternoon, to the evening. You can see that at midnight
not many people are there. But what can
this type of data be used for? We can use it to track traffic congestion, we can use it to predict business sales, we can also use it for disaster evacuations, and finally, for disease outbreaks, we can estimate how people move and use a mathematical model to predict how likely the disease is to spread. So there’s a lot of
potential use in this data collected from your smart phone. So as a geographer, the
first thing I really want to emphasize is that geography is very important. Geography is the key to
understanding big data because geography deals
with space and time. Location and time. It is the same for disease
tracking and disease monitoring. We need to know when the disease
happened, that’s the time. When the first case,
second case, third case happened. We also need to know the location of the disease. In Hong Kong, in China, or in Taiwan, or in San Diego, in Seattle. So, analyzing the data across
time and space, that’s the key to understanding
the disease outbreak. Also we will use that to
monitor, create a model, computational model to
simulate the disease spread. So, today I want to
show you one of my case studies, which I’ve been conducting
for a few years,
using Twitter as the
source to track the disease, the flu outbreak in the U.S. So now I want to show you a video. – We are in the sneezy, achy, miserable heart of the flu season. – So far, 20 people have died of the flu in San Diego county. But new at 7 o’clock, 10
news reporter, Natasha Suvez, shows how one local researcher
is using social media to save lives. (coughing and sneezing) – [Reporter Natasha] If you’ve
ever tweeted about feeling under the weather in the last four years, chances are one man at SDSU has seen it. Ming Tsou is using Twitter
to track flu outbreaks in 31 major cities across the
country with wild success. – We can collect not
just one or two people but millions of people or
even billions together. – [Reporter Natasha] Billions
of messages like this tweet. His technology often pulls out a location and will pinpoint words
like sore throat and cold. When the Twitter numbers are crunched and you compare them to
doctor-submitted reports they are startlingly accurate. And the Twitter data can
identify flu outbreaks ten days before the CDC report. – I was a little bit shocked
about how close we are. – Well, we want to give
you a look at Ming’s real time Twitter map. Take a look, in the last four months, San Diego has had more than
200 tweets about the flu. That’s compared to Los Angeles that has more than 800 tweets. But how can this save lives? Well, Ming reports these patterns
of an outbreak to the CDC allowing them to funnel
resources, vaccines, medications to the areas that will need them most. And in Ming’s research from 2012,
after Superstorm Sandy hit, flu exploded on the East Coast
and from there spread west. Using Twitter they hypothesized
it was the close quarters in the storm shelters
that caused the outbreak. – Sometimes those connections
can even follow such things as airplane routes,
travel routes, job routes. – [Reporter Natasha]
Natasha Suvez, 10 News. – Ming’s Twitter numbers on
the flu are so instantaneous and accurate that the CDC
requests his data every two weeks. So far, here at home, his research shows there was a big flu spike around Christmas but he expects the cases to remain high. – Okay, so, this was a short video. But I want to highlight a few key things. First of all, we did collaborate
with the CDC back in 2014. At that time the CDC
traditionally had a special unit tracking
all the flu seasons in the U.S. But they also explored the potential of using other kinds of sources, so in 2014 they invited
nine national teams, including San Diego State, to enter what was called the Grand Challenge for
the Flu Outbreak Prediction. So we worked together for
13 weeks to submit our data and analyze it as a group. So it was pretty amazing to see. So among these nine
teams there were five teams using social media
data including Twitter. There were four using Google Flu Trends, like key words, and there
were two using Wikipedia, where many people are open
to modify the content. So I think there’s one paper available on that. So, in our case study we’re
collecting data for 31 cities using geo-tagged locations. So we are not collecting everything in the U.S., but the idea is we want
to track individual cities with different outbreaks, because different cities will
have different patterns. And we also applied
a machine learning algorithm with a linguistic approach to analyze our data. And finally, we present it
in a visualization format we call the Smart Dashboard. So here are our first-year outcomes, and you can see that the purple line here is our prediction using the Twitter data. The red line is the CDC
official data called the ILI, influenza-like illness. So they are very closely related. The correlation is .84,
which is pretty good for flu outbreak prediction. And so I think the first year
we did a good job, right? So, how about the second year? In the second year, in
addition to the national level from the first year, we also
did the San Diego level, so we collaborated with the county Health and Human Services Agency, with the epidemiology department,
and they announce the San Diego county flu watch weekly. They have a record for
every single confirmed case from the lab results in San Diego county, and when compared with our Twitter data the correlation is even higher, .9233. So that’s a very impressive number. So the first year we think, oh, we did a good job, right? But the second year, we had a big problem. Because Twitter, the company
we get the data from, changed their API. The API is the
mechanism which automatically feeds the data from their server, so if the company changes their
API we lose our data. So, the second year, 2014, we
lost almost 90% of our data. Originally we could gather
1000 tweets per day; in 2014 we could only
gather 100 tweets per day. So they were reduced. So we have a big problem, right? Can anyone guess what
result we predicted that year? Actually even better. So, in that year we
were only using geo-tagged data
from the tweets, only 4%
compared to the original, but the results were not bad. Why? Because when we talk about big data, what is more important is
how well those data represent our general public. So comparing traditional data
versus geo-tagged tweets, even though the number is smaller, the geo-tagged tweets are actually more representative, and the
results can be correlated. So size doesn’t matter,
it’s really the pattern or your representation that matters. So this is not bad, what
about the next year? Well, actually we did
another analysis in 2015, but the result was terrible. The number dropped to .5 something. So we can compare these two lines. The gray line is the 2014 season,
you see the peak in 2014? So, 2014 was a very severe flu season, which means the signal is very strong.
So when we analyze it using
social media the result is very good, but in 2015 we didn’t have a flu season. You can see that
around Christmas time there’s no peak at all. Only a mild peak after
January or February. So low season, no flu season, no signal, and our prediction is very bad. So the prediction is actually related to the significance of the flu outbreak. The stronger the outbreak, the better the prediction. But also we found out an
interesting fact: in week 16, around the end of February or early March, there’s a high peak we found
in our data, the Twitter data. Can
anyone guess what
happened that week? There were a lot of people talking about the flu, but not about the flu as a disease. Can anyone guess? It’s about Prince. That year Prince died
and for the first few days people suspected he had the flu and that it was the cause. So people tweeted about that,
but why didn’t our machine learning pick that up as an error? Because every time
we do the machine learning we have to manually label those
tweets, those texts, and in this case we didn’t label
it in our original database, our original training set, so
that created a problem. So that tells us that in order
to monitor those outbreaks we need to change the machine learning algorithm every single year, if not every single month, so you cannot use a single algorithm, a single AI, to do everything. Human beings are very complicated. Every year there’s a new
meaning of a key word. There’s a new meaning, like a statement, that will trigger or impact your analysis. So that’s one thing we also learned. So in addition to Twitter and social media there are other research teams
analyzing disease outbreaks, like this one, one of my favorites, from Boston Children’s Hospital,
Dr. John Brownstein. He’s well known for creating
this one. They analyze disease
outbreaks by collecting local news channels
and reported cases, so
they have engines
collecting everybody’s news. There’s also another company in Canada called BlueDot, also using
a similar idea, collecting foreign language news and
also animal disease networks. One article
in Wired magazine
said they actually could predict
the coronavirus outbreak a few weeks ahead of the Chinese government. So, I don’t know for sure but
this is one report on that. So, my question is, can we really
predict those outbreaks in the current case of Covid-19? If we want to predict
them we need to know where the starting point of the outbreak is,
what the peak time is,
what the scale and the
impact to our society are, and when it will be over. So, as I mentioned, I’m a geographer and one of my focuses is mapping. So the first thing to do to understand
an outbreak is to map it. So this is one very interesting map created by the Johns Hopkins
public health department. They’re using the software we call a Geographical Information System to display this disease outbreak. You can see what I’m tracking here. This is the first week of the database; it only happened in China, mostly in Wuhan. But then the second week, third week and today. So today you can see
that the disease has actually
already spread out from
China, the center of the disease, to a global outbreak. I would say a global pandemic
is happening right now. So we have to be careful. On the other hand, the U.S. is still okay but it could change dramatically
in the next few weeks. Everybody has to be very careful. Is this the first time we
have had this type of outbreak? No, actually, 10 years
ago we had this one called H1N1. Some people call it the Swine Flu. These are the maps from that
time, after a few weeks. So we can see this is a global thing. The United States actually
had a lot of cases of H1N1. Today, H1N1 still exists but we count H1N1 as part of our flu season. But the coronavirus is different. I have to say that. It’s totally different from a flu outbreak, so we have to be very, very careful. So, just from
the mapping perspective you can see some
similarities and differences. So we actually can use a
lot of different kinds of data for
detecting a disease outbreak. Social media is one, local news channels another, but also, when you see the
doctor, your electronic medical record:
every
hospital, every clinic has those diagnosis records. Combining them, you have a real-time analysis that also can be used with
good prediction power. Then other commercial data. When you’re shopping, when you go to
Rite
Aid to buy flu medicine, that record actually can be used to predict flu outbreaks. Usually you do not see
a doctor immediately, right? The first thing you do when you have flu symptoms is go to Rite Aid or some drug store to buy some medicine, so that data can also be very useful. Your transactions, when you do
a credit card transaction: in China, when they wanted
to predict the coronavirus outbreak, they actually used the
Chinese version of Apple Pay to monitor people’s
transactions moving between Wuhan city and other cities. So there are a lot of other things. People have even
used water sampling from the sewage system
to see drug abuse, to see the situation. And other people do
surveys, online surveys or human surveys, but those
things are very expensive. The cheapest
real-time data is probably social media. So is that a promising direction? Well, it depends, we
still have some problems. The biggest problem is in
detecting the disease. If
a disease is known,
like flu, we did a good job. If the disease is like Covid-19
or SARS, it’s challenging because we don’t know the name of the disease. We don’t know how to track it. Another potential use is in
syndromic surveillance. Rather than tracking the disease,
we are tracking the syndrome. Your fever, coughing, headache,
shortness of breath, vomiting. This is a screenshot from five or six years ago, when we tried that direction,
but our result was not very good because there is too much noise. A single key word like fever could mean anything. When you have some passion, that is a fever. So there are a lot of problems
but there’s potential. We could develop a better machine to filter out that noise, so we’re still working in that direction. So, hopefully, it could
be another direction. But there are also other problems
with using social media. First of all, social media is mostly used by the young generation,
although
senior citizens do use social media
somewhat, like our President. Yeah, he likes using Twitter. But the majority is still young. When a disease outbreak happens, if the disease focuses on
attacking elderly people, those symptoms will not be reflected on social media quickly. If the disease attacks the young group then we will catch
the signal very quickly. That is a minor problem. So that’s one problem,
another one is that there are robots and also fake news and noise. So when you interact with
social media like Twitter,
actually more than
30% of Twitter accounts are fake, or what we call
robots; they are not human beings. You may not be aware of that. You may be following
someone who is just a robot, automatically generated information, and those things could be a problem because that fake news could trigger a lot
of controversial issues. I will talk about that in a few slides. So now the problem is that in social media there’s an imbalance between how many users there are and how many voices. Usually only 1% of
social
media users are very active. Looking at the messages, 1% of users
generate 16% of the messages. 10% of users generate 80% of the messages. You can see the diagram here. The first group is the people
who only tweet once per month; that’s the majority. I have a Twitter account and I only tweet once per month, that’s it. But how many people tweet zero? Actually it’s maybe 10 times this number. So the majority of social
media users are silent. When we collect the data,
sometimes we only listen to those chatterers, those
people who are very active. You have to be careful about
the balance of those things. And finally, let’s
talk about fake news and the echo chamber. That’s a big problem in social media. Social media is a very effective way to
disseminate information, but if the information is fake news it jeopardizes a lot of things. In
one
of our studies we analyzed
this using the vaccine topic. There’s a lot of
social media discussion from both anti-vaccine and pro-vaccine groups. The pro-vaccine group is more the
scientific community. You can see the structure
of the social network. It’s very scattered because
each individual network is not coherent. So information in that
network is easy to disseminate or distribute to every different location. But on the other hand, the anti-vaccine group is very coherent, with only a very few
opinion leaders. And information only
travels inside the group, so that creates what we call an echo chamber. That type of echo chamber network does not allow other voices;
the same wrong information, the fake news, keeps circulating inside the channel. So this is a problem and also the
danger of social media. So, the fake news
and the noise around it
will create a problem by
preventing effective communication,
will also reduce
trust in government agencies
trust of government agency and also will fragment our society so there’s a lot of problems. So, how can we identify the fake news? How can we identify the robots? There are some researchers
doing some prototyping. Hopefully in the future, if
tied with the social media, we will have sound index of matrix which tell you this person
how likely this person is a robot or not. Hopefully that is ongoing. So, I want to give you example of sometime the fake news is good or bad. So I was watching my
Twitter a few days ago and I saw this news from our
U.S. Surgeon General. So, this is not fake news, right? But actually it has some wrong information. Our U.S. Surgeon
General says, seriously people, stop buying masks. I agree with that. They are not effective in
preventing the general public from catching the coronavirus. But if health care
providers can’t get them, it puts them and our communities at risk,
so the true part is that we have to make sure our health providers get the masks they need,
otherwise we put them in jeopardy. But this message also says
masks are not effective in
preventing people from catching
the virus. Is that true? Well, the idea is to make
people feel comfortable
not buying masks, right? But the truth is, I’m
originally from Taiwan so I watch a lot of news from Taiwan. Taiwan faced this epidemic
one month before the U.S. So the government
has done a lot of things differently
compared to the U.S. government, and sometimes I actually
get more information from the internet. So masks actually help, they do help, but the
N95 or the surgical mask is the most effective way
of blocking the virus. Other masks are not as effective. So does that mean you should only buy
N95 and surgical masks? Well, no. After a few weeks the
Taiwan government said, actually, all masks help. Even the low quality ones. Why? Because the mask is not
there to block the virus;
the mask is protecting you from bringing your hand to your mouth. That’s the biggest issue. If you don’t wear a mask,
when you touch something you are then going to touch
your nose or your mouth. When you wear a mask, the
behavior that endangers you
is reduced significantly. So it doesn’t matter what
kind of mask you have on; as long as you have a mask, you’re preventing the dangerous behavior. So any mask helps, but
I have not seen that information
in the U.S. at this point, while
that information has been announced in Taiwan, in Singapore, in
a lot of Asian countries. So in Taiwan the CDC suggests you need to wear a mask when you visit a hospital or clinic or take public
transportation, bus or train, and also if you are sick. So there are different policies right now in the Asian countries and the U.S. So I think there’s an inconsistency. Try to be careful about that. Also there’s a key issue
about location privacy. We know that we worry about
the virus, the coronavirus,
and especially those confirmed cases, but should we reveal the location
of those confirmed cases? Their residential location? Yes or no? Well, on the transparency
side we need to reveal where they have visited. They may have visited Sea World
or Disneyland,
and then people will be more cautious about those locations. But on the other hand, it
will also lead people to say, oh, I didn’t visit those areas, I can visit Legoland, it’s safer. No, Legoland is not safer,
so that creates a problem. Interestingly, this map shows you that in
Singapore they actually decided to reveal every single confirmed case with its residential
location in Singapore. But in Taiwan the government said no, we are blocking that
information, we are not showing it, because it creates a
hassle or creates a danger
or panic in the local community. So which one is correct? Which one should we do? I think we can discuss that later on, but the final
message I want to give is about stigma. This is the most dangerous situation when we have a disease
outbreak in our society,
not just because stigma or discrimination is bad
but also because it will have a bad impact on fighting the disease outbreak. Why? If you stigmatize those
patients who get infected,
what will you do if you think, maybe I got the virus,
I got this coronavirus? Will you report yourself to the doctor? Will you report yourself
to the health agency? You will not, if there is stigma. Because you will say, if I tell
the doctor, my whole family will be ashamed and I will lose my job, and then you will hide it. So if we stigmatize it,
there are hundreds of people who may be infected who will hide it, and they will create a bigger problem by spreading the disease. So we should not stigmatize the disease. We need to protect people,
even provide some compensation. In Taiwan, for the people who are confirmed or isolated or quarantined, the
government actually pays money
to them to cover their
loss of work. So that’s the final
message I want to highlight, and that will probably end my talk today. Thank you. (applause) – We will be taking some of
your questions in a moment but there are a couple of things I thought might be
useful to start off with. One is just infrastructure. I found it really interesting
when I heard about this idea of this kind of work in the
department of geography. We spoke about this briefly beforehand and it occurred to me as we talked that this is something worth commenting on, the need to break down barriers between what are really pretty artificial
structures saying, here’s the department of X,
there’s the department of Y because you’re working
at the intersection of at least a couple of different fields. Do you have any thoughts on
the need to think of science in a more multi-disciplinary
way instead of–? – This actually is a trend
all over the world. In the National Science Foundation, in the National Institutes of Health, all the major projects, the large projects, have
to be transdisciplinary. So, there’s a requirement
saying that if you submit a proposal you have to include three
or four different departments to submit a research proposal. Why do we need that? Because a lot of today’s
challenges, the research challenges,
do need different
perspectives working together, and we call it team science.
Different departments working together
will create
synergy for innovative ideas,
innovative answers, by using
those collaborations. – So that sounds like where
the good science will come from but it means that we should
be evolving towards a world where we don’t worry about
which department you’re in more about what kind of work do you do. So then, a second infrastructure question that I don’t think I heard you talk about that would be useful to note. Your collecting data from human beings. So, what sort of
responsibilities do you have? What does it look like for getting review and approval for doing this work? Do you just do this, nobody’s watching or is somebody watching? – Well, this is a great question. Actually we do think about this question when we first started
collecting the Twitter data. We think Twitter data is
from human being, right? And then we submit IRB to our University, their committee, guess what? What’s our response? Twitter data is not human data. So we actually are exempt from the IRB because we collecting data
from the serve company. The data is from the company. And this is not only
happen to our University, every researcher in U.S. on the Twitter or on the social media they got the same answer from the IRB. So I think that is a problem. I think traditional or law,
our regulation for the IRB as it is not intended to
those social media data so our law is behind that. So we have to be more
educated to our legislation. At the same time although
we’re exempt from the IRB, we still be very careful
because the data we collecting it could contain a lot
of sensitive information because those user, the Twitter user they don’t know that
we’re collecting publicly. They think they just talk to their friend. But actually, they don’t
know that can see about everybody in the whole world. So that’s a different conceptual situation and we do collecting a
lot of sensitive data when we talk the drug abuse, marijuana, even see the information people are selling drugs
online using Twitter. So those realizing we have
to be I think law is behind. We have to moving forward to
catch up with the new trend. – So, for those who aren’t
familiar with the IRB process this is what Universities,
research institutions use for research with human subjects. And institutional review board, IRB, looks at the research that
involves human subjects. Are you open to the idea that
maybe your community of people who study least Twitter accounts and maybe more generally, social media, perhaps creating it’s own body to try and review each other’s research. It’s separate from IRB’s. – Yeah, I think that’s a great idea. Also, I want to follow about the whether we need to be consent or not. I want to give an example. Let’s say Google street map, all right. Everyone uses the Google street maps and there’s a street view that Google can take
picture of on the street. Was that violated by
privacy situation or not? If you have your own camera, can you take everybody’s
picture on the street? Well, actually, yes and no. Sometimes I think the
law in general defines that if you have a specific target. I want to try and sing my
favorite start Taylor Swift. So I go on to follow,
stalking Taylor Swift, using the photo even on a public land, taking that picture, that’s illegal. If I don’t have intention to
target any specific person, I just want to take a
pan view, that’s okay. So, same as the Twitter data, if you using Twitter data
as a key word to analyze the general public in general,
I think that should be okay but if we using Twitter data to search for specific person, search
for their behavior, search for their individual tracking, that could be problematic. So it really depends on the motivation and also how the target of your subjects. So I think that could be a potential direction for the policy. – So we have a few questions here. I’m going to start with this one. Sort of broadens what you might look at. The question is, why analyze just Twitter and not more widely used
social media platforms like Facebook. And I’ll just comment that
I’ve learned over the last few years that different
groups, different demographics, use different platforms. So, it might be that this
audience, for example, uses Facebook more often. I don’t use either. – This is a great question. Also, I’ve been asked it a lot. The answer is the company. Facebook is a closed company. Facebook doesn’t share their data. They don’t provide an API for
researchers to collect it. I think at one time they
tried to open the door,
but a bad event happened
and damaged their reputation. So they shut the door. So right now, researchers
can only go to their
headquarters, inside their lab, to take a look at their data, but you cannot do any analysis
or bring the data back. So, Facebook is very
protective of their data. On the other hand, Twitter is more open. We can use a free API,
free data collection, even though we are collecting
only 10% of their data. If you want to get 100%
of the data from Twitter you need to buy it, and it’s very expensive. Very, very expensive. I also have to ask: does
the information you post
on your Facebook and
Twitter belong to you? No, it belongs to the company. So you are the product of
Facebook, of Twitter. So when we are collecting the data, we cannot share or resell the data. That’s also a problem
for scientific research because
when we
have a journal publication,
the journal says you need
to share and open your data, but because of the license
for the Twitter API
we cannot share the
original Twitter data. We can only share the summarized results. So there are a lot of problems
because private companies
like Facebook, like Twitter,
they control the data. – I didn’t realize that distinction which you can only study
what you can study. – Yes. – And that leads to a deeper
question, at least for me: if you have particular demographics that are the ones that you’re studying and you are trying to
make predictions based on that, you might be unintentionally
biasing decisions. So, for example, you were
saying this technology, this approach could be used
to figure out where to target resources in the case of an outbreak but you’re actually choosing
to target communities that happen to be
over-represented with people who are using the particular platform. Are there ways to correct for that? Or what do we–? – Yeah, I think there’s
several ways we can resample the data. For example,
on social media like Twitter, the percentage of senior people is small, but they are still a large number because our base is large, so we still have one or
two million senior citizens using Twitter every day. So the key is the combination
of key words: we can find
out what the most common key words used by senior citizens are. Maybe they focus more
on retirement funds or insurance, so we
use those key words. Every key word has a different
profile of user groups, so we combine different key words. We may be able to retrieve
and resample the data. In this group’s data
most users are younger, but since there are still seniors, we use another way to resample the data, to retrieve mostly senior
people from the group. So resampling
could be one solution for doing this kind of specialized analysis. – You referenced this earlier also, that different words might
be used in different ways and that they also might be moving targets because the language,
the vocabulary evolves so in that environment
I’m picturing the need to constantly be tweaking the algorithm, and the question is, do you have any way to estimate
how far behind the curve you’re going to be? How much time does it take to figure out what the right vocabulary is, and by that time have things changed? – I think that’s exactly the
challenge we are facing from the academic perspective: choosing the right key words and modifying them. And I think every scenario, every situation has its own life span. So, for example, when we
analyze a wildfire situation, the wildfire only lasts
like two or three weeks. If we’re looking at a
disease outbreak like the flu, we have to track the whole season. So every different scenario,
every different situation will have its own signature. So, maybe we can, based on the signature, decide how often we need to
tweak or modify our algorithm, so it’s really case by case. Also, different cities, different cultures will have different ways
to monitor their situation. – Change to another question. This person’s reflecting on
the idea if you want to know what’s going on in the battlefield, you should be asking
the privates, corporals, and sergeants, not the
colonels and generals and, I think, for that
reason they like this idea of going to the Twitter feed. You’re getting people on the ground. But they also recognize,
based on the question, that the information is imperfect. There’s different degrees in a sense to which it’s not accurate. So they’re asking you, how should officials use
big data, such as this, and AI if they are aware that they may not be using accurate data? So, you have to reflect on so how accurate does it have
to be to decide to use it? What are your thoughts? – I think that’s also a great question. I think we always try to
do something good for our society, but sometimes those results could be misused or misunderstood, so we have to be very
careful about these things. And I think the first thing I would do, since some situations are sensitive, is communicate with the authorities. We cannot just do our own thing. For example, when we
analyze a disease outbreak we also need to talk to the county-level epidemiology
department, talk to the CDC, and share the information. Sometimes our prediction
could have a strong potential to cause a crisis or panic, so we have to be careful. The same thing applies because
another side of my research is disaster response. We have wildfires,
we have hurricanes. How do we analyze
social media to respond to the people who may
be stuck in their houses and need help? So there are those things, but there is one danger with
that, especially in the U.S. Because we have the official 911 channel, the government agencies
would prefer you call 911 rather than tweet about your crisis, because they are not
responsible for your tweets but they are responsible
for your 911 call. However, if the crisis scales up, if there are 1,000 911 calls,
the authorities will not be able to handle them. At that time, tweeting on social media could be your chance to get help, but it’s not guaranteed. So in the U.S., when I talk
to our county Office of Emergency Services, they also
struggle with that. So using social media to
detect general trends for general policy is fine. If you’re using social media
to detect individual users who need help, that could be problematic but also could be very useful. I think a few years
ago, with Hurricane Katrina, in those areas, actually
there was a volunteer group. Those volunteers
created a community to monitor social media
messages from people who needed help. And they sent their boats to
the locations to pick up people. I think there’s a term like
some kind of a navy association. So social
media can work with non-profit organizations
and volunteer groups, the non-official ones,
and they can work together. That could be the potential. – So, in terms of
information people can use, I think what you’re suggesting though is that if you are in the midst
of one of those situations dial 911, tweet, maybe turn
to your Facebook account, just in case, anything you can
think of to get the word out. – Yes. – Yes. – The next question is one
that I think has cut across a number of programs where
we’re interested in something that as a society we can benefit from but in order to get the data, individuals have to give up something, in this case, their privacy. A very different example might
be doing cancer research. So if somebody has cancer,
they participate in studies, the community can benefit. Some people have even gone so
far as to argue in that case that we should expect
people who are benefiting from historical cancer research
to be willing to participate in research now as somebody
who is dealing with cancer. And you can argue similarly
as this person’s asking, should all of us allow
tracking of our personal data to help in making data more valid? In other words, to make this better. Should we just accept that this is a trade-off we should have? – I would say this is
another great question. This is also a dilemma we are facing today: the more information we gather, the more useful the analysis we can do, but the more privacy intrusion
will happen to those records. Especially when I do a
lot of cancer research, we also want to aggregate
the data through a de-identification procedure. So, I think, ideally, in
the future we could develop algorithms so that the original data collection could be very detailed, but we have an algorithm to de-identify the sensitive information, and we can aggregate it into what we
call synthetic data. So there is some research
on ways to massage our data to make the data useful while
it’s de-identified. There are some statistical algorithms, and I’m working in that direction. But I think in general it’s
still a big challenge for us. It’s always a matter of balance, but I think every situation
will have a different priority. When some situations happen, people will change their
attitude toward privacy. So I think it really depends. I would say citizens are smart enough to make the right decision, so I would prefer this not be decided
by a government agency or the top leader but decided
by the general public. You decide what kind of information you want to reveal or not. – Since you proposed that, can we follow through on how that would work? So should everybody who
has a Twitter account be given an opt in, opt out option? And so then your data would
be drastically reduced because a lot of people
would fail to opt in or would they have to
opt out, in other words, you have to choose to not be involved and so you’re automatically
in, so any preference on that? – Actually that situation
happened at Twitter. I mentioned there’s about 1% of Twitter users who turn on their GPS
tracker to share their location. That was a voluntary option, but last year or two years ago, Twitter actually removed that option, so people no longer have the option to opt in; they are always opted out. I think that can reveal a few things, but it also causes a problem. Our data collection for
geo-tagged tweets was reduced by about 60%, so
we are losing our data, but I think that’s good for the community. But in this situation you can still reveal your
location by checking in. There’s a function called check-in. You can check in to the Fleet Science Center or check in to the (muffled speaking) and that will leave a mark, so there are different ways
to manipulate our data. I think giving the option to the user is still a good thing to do. Then the data may not be as good, but I think in the long
run, with enough people, we can still make some interesting analysis. – Next question is about
operationally how you use your data. Should data generated by social media always be complemented by other direct science generated data? What they’re saying basically
is independent information and everything I heard
would suggest you’d say yes, but I don’t know whether you want to reflect on that a bit more? Do you need anything
else besides your data? – I would say social media is just part of the
mainstream information. Sometimes it is very useful, but I would say it will never become the only information
source for decision making. For disease outbreaks, I would say that for
authorities like the CDC, the major information comes
from the official channels of the hospitals and
clinics, but social media can provide insight, provide
additional information. So I think when we do this
we have to be careful, and even for disaster
response it’s the same thing. Social media has value
but it’s not the only source. It’s very dangerous to
say we make a decision based only on social media, because it could have a lot of problems. – So, I’m going to move
on to another question that gets to the specifics
to how you recognize a particular, the characteristics
of a particular disease. So, you’ve talked about using
this technology for H1N1, for looking at influenza, generally and for looking at Covid-19. So, what can you say about
how you would distinguish those three and can you? – I think one thing is
every different disease has different symptoms,
so as far as I know, with the new Covid-19, people will have a fever, they will have a cough, but not many have symptoms like muscle ache, although there
is a small percentage. So if we’re using a
syndrome-based approach, we may be able to
differentiate the pattern between the flu and Covid-19. And also, for the
spreading pattern, I think we can track that too. But the problem is that when a new disease is just starting, we have no idea what the major symptoms are. Because
Covid-19 has been happening for almost two months, now
we know the major pattern of symptoms, and then we can track it. But when it’s just starting we have no idea, and there’s no way to
differentiate this one from other diseases. So, I think the more
knowledge we accumulate about the disease, the more likely we can
track it effectively. – Based on what we know now,
I’m just trying to figure out what this looks like in practice so if you wanted to know if
an outbreak of the flu or Covid-19 was in process
you would ask the question, are we seeing people talk
about muscle aches a lot, if you are then that might be the flu. And if you aren’t it might be Covid-19 but if they were both
going on at the same time, you would perhaps miss it as Covid-19, is that the fear here? – That’s a good question. I think a single keyword is
maybe not that representative, but with the combined pattern,
I would say the percentages, maybe there are 8% of people
mentioning muscle ache and maybe 80% of people mentioning
coughing. So when we analyze this we
call it aggregate analysis. And the good thing is that with those complicated patterns,
machine learning can actually help us detect
the different patterns. So I imagine in the
future we could develop some pattern recognition
using not just a single key word but 100 different symptom
key words, see the trend and the pattern, and compare
across different diseases. Then we may be able to
differentiate those diseases by using machine learning
and hundreds of key words. – This causes me to wonder
whether you have an opinion or whether you believe
your opinion is informed about whether there is value
in closing borders at all for this or what do your data
and your approach tell you about whether that’s
something that’s useful to do? – Well, I work from a spatiotemporal
modeling perspective, not only social media,
because I’m a geographer, so we do a lot of
this space-time modeling: modeling how people are traveling, like how a good product diffuses. But with disease,
the human being is also the factor. We carry the disease, so any
movement between human beings, between cities, between countries will create the potential
for a disease outbreak. So, if you block that source, that is still the most effective way to block the disease. From a political perspective, blocking the border maybe sounds scary, but from the public health
and epidemiology perspective, blocking people’s interactions is probably a very effective way. Even though it’s against our nature, from a disease outbreak
perspective, limiting human movement is still very effective. For example, in Taiwan and Singapore. In Taiwan, actually
the government is blocking, maybe not everybody,
but anyone who is from China, from cities in China; they cannot
go to Taiwan right now. But Taiwanese people in
China can come back. Every time they come back they have to quarantine for two weeks; everybody from China has
to quarantine for two weeks. So that’s the procedure they have implemented
in Taiwan and Singapore. So I think it’s not just shutting
down the border completely, but different procedures
to prevent contact between human beings that are effective. – And just to be clear, I see the principle, but I’m wondering once cases
are present in a country, and if you’re defining
borders only as by country rather than by city, is it just a matter of time
then, before it spreads? Because it seems like then,
closing the border is too late. – Well, I would say, not closing the border. It really depends on the infection, on what percentage of people are carrying the disease and spreading it. It could be that early
on you do not close the border but use a different way to
control people’s movements, so you can come across the border, but once you cross the border
you have to quarantine for two weeks before you
can freely move around. That is doable. But if you’re just blocking, I think harsh blocking
will create a problem. Why? When you do harsh blocking,
people will cross the border privately through illegal channels. With illegal channels you
have no way to control that, and those illegal channels,
smuggling, will create a bigger problem for the disease outbreak. So keeping the border open but with a certain type of control, I think that would be the best way to do it. – We have time for one more question. – [Male Audience Member]
Could you get the same results by tracking the sales
of cold and flu tablets? Thank you. – Yes, yes. There’s already research tracking and analyzing Walmart and Rite Aid, the flu medicine they regularly sell, and they can actually track that
against the flu season very, very well, so there is
already some research on that. But the problem is that some of the
data is controlled by the vendors or private industry, so in order to get the data, it’s expensive. Sales data is sometimes top secret. They want to compete with other companies. So sometimes the data is a big problem. But I do see a few
research projects using those data and they are doing very well. – If it was required that those companies had to
make that data available, would that be better than trying to figure
things out from Twitter? – Well, it depends, because
the sales information could have some complications. For example,
sometimes flu medicine can be converted into drugs by some people, so you may not be detecting a flu outbreak; you may be detecting illegal
use of the flu medicine. – Drug dealers and such,
drug manufacturers. So I think we probably should finish there to be ethically on time. Thank you very much for
really entrusting us. – You’re welcome, my pleasure.
(applause) (techno music)
