How Data Mining Affects You People: Freakonomics was right, Terrorists Have Banking Patterns Too
Data Mining Is All Around You
Data mining is a generic term for examining large amounts of data to find relations and patterns that prove to be useful. With the advent of computers that can store, access, and process large amounts of data, and increasingly sophisticated tools to do so, people are finding a lot more in the data that affects you and me.
When data mining is mentioned, most people think of how Amazon predicts which books you would like, or how Google knows what sort of ads to serve you. Some may even know that credit card issuers use data mining to detect fraudulent activity. However, data mining is far more widespread than you might suspect: it affects elections, cheating in schools, cheating online, law enforcement, counterterrorism, and much more.
Data Mining for Election Cheaters
November 2nd was Election Day in the US, and data mining is helping people spot "astroturf" (fake grass-roots) surges of sentiment for or against a candidate on social media such as Twitter. The Truthy Project at Indiana University spotted a few users who were "generating" sentiment by promoting each other's tweets.
By studying the Twitter "feed", which shows all tweets, and using Twitter's public API to look up users, scholars were able to perform "network analysis": tracing tweets and retweets back to the originator of the tweet, then looking up his or her information. By examining the relationships between the users (friends and such), one can predict whether the sentiment is truly spontaneous or was instigated by someone to look spontaneous.
They found that some users were generating attack messages on Twitter and sending them to all related people; the messages were traced back to a group of accounts all created within a few minutes of each other. Twitter found out and suspended the accounts, but the tweets had already reached over 60,000 people. Truly spontaneous sentiment would come from widely disparate locations around the nation, from people who are not friends with each other.
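To make the idea concrete, here is a minimal sketch of two of the signals described above: accounts created within minutes of each other, and accounts that mostly retweet one another. This is not the Truthy Project's actual method. The account names, dates, the five-minute window, and the function names are all made up for illustration.

```python
from datetime import datetime, timedelta

def creation_bursts(accounts, window=timedelta(minutes=5), min_size=3):
    """Group accounts whose creation time is within `window` of the previous one.

    `accounts` maps a (hypothetical) account name to its creation datetime.
    Returns a list of groups (lists of names) with at least `min_size` members.
    """
    ordered = sorted(accounts.items(), key=lambda item: item[1])
    bursts, current = [], []
    for name, created in ordered:
        # Close the current group when the gap to the previous account is too large.
        if current and created - current[-1][1] > window:
            if len(current) >= min_size:
                bursts.append([n for n, _ in current])
            current = []
        current.append((name, created))
    if len(current) >= min_size:
        bursts.append([n for n, _ in current])
    return bursts

def mutual_retweet_share(retweets, group):
    """Fraction of the group's retweets that target another member of the group."""
    members = set(group)
    from_group = [(src, dst) for src, dst in retweets if src in members]
    if not from_group:
        return 0.0
    inside = sum(1 for _, dst in from_group if dst in members)
    return inside / len(from_group)

# Made-up sample data: three accounts created minutes apart, two older ones.
accounts = {
    "acct_a": datetime(2010, 10, 1, 12, 0),
    "acct_b": datetime(2010, 10, 1, 12, 2),
    "acct_c": datetime(2010, 10, 1, 12, 3),
    "acct_d": datetime(2009, 3, 14, 8, 30),
    "acct_e": datetime(2008, 7, 4, 19, 45),
}
# Each pair is (retweeter, original author).
retweets = [
    ("acct_a", "acct_b"), ("acct_b", "acct_c"), ("acct_c", "acct_a"),
    ("acct_a", "acct_d"), ("acct_d", "acct_e"),
]

bursts = creation_bursts(accounts)
share = mutual_retweet_share(retweets, bursts[0])
print(bursts, share)
```

A group that was created in one burst and spends most of its retweets on its own members looks a lot less like a spontaneous crowd. Neither signal is proof by itself, which is why real analyses combine many of them.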
For more information, see Busted! Astroturf Campaign on Twitter
Data Mining for Cheating Teachers
There's a chapter in "Freakonomics" about how the Chicago schools archived the achievement test results of every student for many years, and how analysis of those results showed that some teachers cheated. But first we have to explain how a test is written.
In almost all standardized achievement tests, from the SAT down to STAR or whatever tests your state administers for your schools, at whatever level, the questions are arranged in order of ascending difficulty. The first question will be much easier than the last question. Thus, a student is far more likely to miss the later questions than the earlier ones.
Analysis of several classes that showed massive improvements revealed that some of the test results were inconsistent with the predicted curve. Many students missed the easy initial questions, got a lot of the middle questions right even though they are more difficult, then missed the hard questions at the end. The same students, when transferred to other classes (no longer under the same teacher), no longer showed such odd test patterns.
The conclusion is undeniable: the teacher erased the students' wrong answers and "fixed up" the results, artificially inflating the test scores. Several teachers were fired, according to the book, when the results were made public.
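The pattern described above (easy questions missed, harder middle questions right) is simple enough to sketch in a few lines. This is only a toy illustration and not the analysis used in the book. It assumes each answer sheet is a list of 1s and 0s ordered from easiest to hardest question, and the 0.3 margin and the sample sheets are invented.

```python
def section_rates(correct):
    """Split one answer sheet (easiest question first) into thirds and
    return the fraction answered correctly in each third."""
    third = len(correct) // 3
    parts = [correct[:third], correct[third:2 * third], correct[2 * third:]]
    return [sum(part) / len(part) for part in parts]

def is_suspicious(correct, margin=0.3):
    """True when the middle third beats the easy third by more than `margin`."""
    easy, middle, _hard = section_rates(correct)
    return middle - easy > margin

def suspicious_share(sheets, margin=0.3):
    """Fraction of a classroom's answer sheets that show the odd pattern."""
    flagged = sum(is_suspicious(sheet, margin) for sheet in sheets)
    return flagged / len(sheets)

# Made-up sheets with nine questions each (1 = right, 0 = wrong).
typical = [1, 1, 1, 1, 1, 0, 1, 0, 0]   # scores fall as questions get harder
odd = [0, 1, 0, 1, 1, 1, 0, 0, 0]       # easy ones missed, middle block perfect

classroom_a = [typical, typical, [1, 1, 0, 1, 0, 1, 0, 0, 0]]
classroom_b = [odd, odd, typical]

print(suspicious_share(classroom_a), suspicious_share(classroom_b))
```

One odd sheet means little, since students guess and have bad days. A classroom where most sheets share the same odd pattern is what would deserve a closer look.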
Read an excerpt from the book Freakonomics! (PDF download)
Data Mining for Terrorists
Counterterrorism doesn't always happen with guns blazing. A lot of work goes into figuring out what makes up a terrorist so one can try to identify them before they strike.
A chapter of the book SuperFreakonomics describes how the banking characteristics a terrorist would likely have were identified, based on patterns found in the US and the UK among the terrorists of 9/11 and 7/7, respectively. Some of the characteristics are obvious: Muslim names, lack of life insurance, lack of "Friday after-work ATM visits", lack of normal living expenses paid through checks or debit cards, many foreign wire transfers, and so on.
Let's just say that not all characteristics are as obvious as these, and only data mining could have discovered patterns that perhaps not even the terrorists were aware of as common factors.
In 2006, it was suspected that major phone companies such as BellSouth, AT&T, and Verizon had turned over terabytes of call records to help the NSA populate a database that was supposed to help it detect terrorist phone usage patterns. Based on post-9/11 investigations, 206 international calls were made by the 19 terrorists who conducted the attacks.
There are criticisms that terrorism doesn't happen often enough to lend itself to data mining, but that is up for debate.
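The "not often enough" criticism is really a point about base rates, and a bit of arithmetic shows why. The sketch below applies Bayes' rule with made-up numbers: the population size, the number of actual terrorists, and the 99% accuracy figures are all assumptions for illustration, not data from any real system.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged person is a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers, chosen only to illustrate the base-rate problem.
population = 50_000_000      # people screened
actual = 500                 # real terrorists among them
sensitivity = 0.99           # share of real terrorists the screen catches
false_positive_rate = 0.01   # share of innocent people wrongly flagged

ppv = positive_predictive_value(actual / population, sensitivity, false_positive_rate)
flagged_innocent = (population - actual) * false_positive_rate

print(f"chance a flagged person is a terrorist: {ppv:.4%}")
print(f"innocent people flagged: {flagged_innocent:,.0f}")
```

Under these assumptions, a screen that sounds 99% accurate still flags roughly half a million innocent people, and only about one flagged person in a thousand is a real target. That is why rare events are hard to mine for, and why any such screen needs much sharper characteristics than the obvious ones.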
Data Mining for Criminals
Police are now heavily involved in data mining in order to concentrate their enforcement efforts on the "worst" parts of the city, identifying patterns and increasing patrols in those areas during the relevant hours so they get more results for the same amount of resources. Many of the statistics are now available online, either through the local police department or through a public website where you can find the crimes reported all around you, from simple theft and vandalism all the way up to assault and murder.
The NYPD was probably the first to develop such a management system, called CompStat, which relies so heavily on software that it is often mistaken for a software system (and vice versa). It was later adopted by various police departments around the country. The idea is to locate patterns such as crime-heavy locations and hours, and saturate those areas with police presence to deter crime.
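Finding "crime-heavy locations and hours" can be as simple as counting. The sketch below is not how CompStat itself works; it just bins incident reports into a grid of map cells by hour of day and lists the busiest bins. The coordinates, the 0.01-degree cell size, and the function name are invented for illustration.

```python
import math
from collections import Counter

def top_hotspots(incidents, cell_size=0.01, top_n=3):
    """Count incidents per (grid cell, hour of day) and return the busiest bins.

    `incidents` is an iterable of (latitude, longitude, hour) tuples.
    `cell_size` is the grid spacing in degrees (an arbitrary choice here).
    """
    counts = Counter()
    for lat, lon, hour in incidents:
        # Snap each location to the grid cell that contains it.
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[(cell, hour)] += 1
    return counts.most_common(top_n)

# Made-up incident reports: (latitude, longitude, hour of day).
incidents = [
    (40.7553, -73.9871, 23),
    (40.7558, -73.9876, 23),
    (40.7551, -73.9879, 23),
    (40.7556, -73.9872, 14),
    (40.6782, -73.9442, 2),
    (40.6788, -73.9447, 2),
]

hotspots = top_hotspots(incidents, top_n=2)
for (cell, hour), count in hotspots:
    print(f"cell {cell} at hour {hour}: {count} incidents")
```

The busiest bins are the places and hours where extra patrols would, in theory, do the most good for the same amount of resources.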
Even Microsoft and IBM have gotten into the business. Microsoft gave away several software suites to Interpol to help track down exploited children.
Visit CrimeReports.com and see crimes in your area
Conclusion
As more data gets tabulated, patterns can be teased out of it to identify groups, trends, and other relations that even the people involved may not be aware of. That means you and me.
While privacy is important, data mining is not all bad. It is necessary to understand how the data is collected, used, and disseminated.