Public databases plus HPC equals your social security number

From a news release posted at HPCwire:

Information available on the Internet can in certain cases be used to predict individual social-security numbers, posing a risk of identity theft that policy-makers and individuals should address. This finding, an unexpected consequence of public information in modern economies, published (Monday, July 6) in the Proceedings of the National Academy of Sciences (PNAS) and highlighted in the New York Times (July 7) and other national media, relied on computational resources of the TeraGrid, a National Science Foundation cyberinfrastructure program. It would have been difficult, if not impossible, to obtain these findings without these publicly-funded, high-performance computing (HPC) resources, says one of the lead researchers, Alessandro Acquisti, a professor at Carnegie Mellon University.

The research took advantage of an open-source clone of MATLAB and a supercomputer at the Pittsburgh Supercomputing Center called Pople. Sadly, it didn’t even take an especially large portion of Pople to get the job done:

After first working with desktop computers, the researchers turned last year to a PSC system called Pople (named for Nobel laureate chemist John Pople of Carnegie Mellon). A Silicon Graphics Altix 4700, installed in March 2008, Pople has 768 cores (processors) and 1.5 terabytes of shared memory (all of memory accessible from each core). The SSN runs used up to 400 of Pople’s cores and 800 gigabytes of memory, a large memory requirement that made Pople’s shared memory very helpful to the project.

The article tries to strike a hopeful tone about the bright future of social research enabled by HPC (and I do think it is bright), but in this context I found it chilling rather than exciting:

“This project,” said Sergiu Sanielevici, PSC director of scientific applications and user support, who also leads user support and services for the TeraGrid, “exemplifies how powerful systems like Pople can open doors to data-mining and data-centric research in fields not traditionally associated with HPC, such as the social sciences, and make it possible to get answers that would otherwise be impractical or impossible.”

Goody. Maybe they’ll find a way to predict credit card numbers, too. You can find much more about the study (and the full manuscript of the report) at its website, www.ssnstudy.org. And in case you are thinking that there is less to this than headline writers like me want to make of it just to spike traffic, the following is from that website:

However, we show that it is possible to predict individual SSNs simply from publicly available data. Based on observation of issuance patterns in the “Death Master File” (a public database that contains SSNs of people who have died), we were able to use information about an individual’s date and state of birth to predict narrow ranges of values likely to contain that individual’s SSN. The predictions are particularly accurate for the SSNs of people who were born after 1988 (when the SSA initiated the Enumeration at Birth program, through which babies receive SSNs soon after birth) and in states with lower population. Since SSNs are predictable from public data, identity theft could occur even without events such as data breaches.
