Google, Government and Confidential Data
May 2007
Whether you’re researching a global epidemic or looking for a soufflé recipe, the Internet has made finding most kinds of information a snap. One of the biggest and most frustrating exceptions has been the public sector, notably the websites of federal, state, and local government agencies. The frustration comes because it is, after all, our data: our taxes pay for it, and a significant chunk of it concerns us as individuals, from birth records to probate documents. What makes that frustration even more acute is that, in many cases, the data we need is actually online but is nearly impossible to find, even through a search engine.
A new collaboration between Google and a number of state governments and federal agencies aims to fix this problem by making existing online information easier for citizens to find. But that prospect raises an uncomfortable question: With government agencies already wracked by breaches of personal data and struggling to get their houses in order, isn’t there too much data out there already? Will making government data more accessible also make identity theft easier? Can Google’s initiative put a needed spotlight on a threat that must be addressed in any case—one that bureaucrats might prefer to sweep under the rug?
Diving into the “deep web”
At the root of this initiative is a phenomenon known to Internet search experts as the “deep web”: documents that are available online, but are difficult or impossible for search engines to find, index, and retrieve. Most often, that’s because the elusive information is hidden from the robotic web crawlers that map the web, tucked away on a password-protected web page, for instance, or served dynamically from a database in response to an online query. Even if the crawler does find the page, it’s often so poorly optimized that the result you want is buried deep in the search results.
The paradoxical result: If you’re a tech-savvy web user who already knows where to look, the information is there for the asking; if you’re using a search engine, you’re flat out of luck. When you consider that as many as four out of five visitors to government websites get there, not through the site’s home page, but via a search engine, it becomes apparent that many people are not getting what they need from their government.
That’s just not right, says J.L. Needham, Google’s point person for public-sector content partnerships. To make government data easier for citizens to find and use, Google is collaborating with the U.S. and state governments to put existing “deep web” data within search engines’ reach. Using Google’s open Sitemap protocol and custom search engine, agencies are able to put specific databases—which contain the same data that’s already available from the agencies’ websites—in a place where web crawlers can find and index them.
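For readers curious what such a listing looks like in practice, here is a minimal Python sketch of a Sitemap file generator. The tag names (`loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the public Sitemap protocol; the function name `build_sitemap` and the `records.example.gov` addresses are hypothetical, and nothing here is meant to describe the tooling any agency actually uses.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return Sitemap XML (as a string) for a list of entry dicts.

    Each entry needs a "loc" key; "lastmod", "changefreq", and
    "priority" are optional, as they are in the protocol.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # ElementTree escapes characters such as "&" in the URLs for us.
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical database-backed pages that a crawler could not
# discover by following links alone.
example_entries = [
    {
        "loc": "https://records.example.gov/report?id=1&lang=en",
        "lastmod": "2007-05-01",
        "changefreq": "monthly",
        "priority": 0.8,
    },
    {"loc": "https://records.example.gov/report?id=2"},
]

print(build_sitemap(example_entries))
```

The point of the sketch is that the file only lists addresses and a little scheduling metadata. It says nothing about what the pages contain, which is why deciding what belongs in the list is left entirely to the agency that writes it.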
Helping government do its job
One of the fundamental tasks of a democratic government is to ensure that public information and services are transparent and accessible to its citizens. But an unusually high proportion of the information on government websites, while available to online visitors, is “deep web” data served from databases that cannot be indexed by web crawlers. “One of the most important bodies of free, relevant information is on government websites,” says Needham. “The idea was to help government serve its citizens better by opening up pockets of the deep web.”
So far, governors in four states (Arizona, California, Utah, and Virginia) have announced that they’re partnering with Google to make their public data more accessible. Google has also partnered with a number of federal agencies. Under the program, Google’s full-time public sector team works with chief information officers in the states; those state experts then work with specific state agencies to set priorities and guide implementation.
Steering clear of personal data…mostly
According to Needham, those priorities “have veered toward four areas: health, employment, housing, and education.” All four are categories that set off alarm bells for privacy activists, given the huge amount of personal data compromised in recent years by public sector security breaches and shoddy data protection practices.
In fact, though, the initiative is not specifically going after public records containing private information, says Needham. Instead, the emphasis has been on indexing documents and data unrelated to individuals, such as studies and technical reports. “Most are completely free of privacy risk,” says Needham.
Needham contrasts the data made searchable in Google’s initiative with such blunders as the U.S. Department of Agriculture database, online for more than a decade, that inadvertently exposed the Social Security numbers of some 63,000 grant recipients. “By and large, the federal agencies we’re working with publish nothing like this,” he said.
Has Google discussed risks to confidential information with state government officials? “Only in the case of a few databases,” he responds. “Most have no information about individuals whatsoever.” Exceptions include documents involving “licensed professionals, such as medical practitioners and real estate agents,” where the goal is “to make the license easier to find.”
The risks from local governments
While Google is working directly with state governments and federal agencies to help guide implementation of the Sitemap protocol, there may well be “implementation without oversight at the local level” down the road, Needham points out. “Here’s where the risk factor may appear. This is an open protocol, usable by managers and administrators at all levels of government.” As the record of data breaches to date makes clear, bad decisions are not just possible, they’re inevitable. “We’re augmenting the FAQ to address questions of privacy,” says Needham. Beyond that, the responsibility rests with the specific agency.
Which leads to a provocative thought…
What if Governor Schwarzenegger were to say to Google, “We’ve had some problems here. Can you help?” Would Google be willing to flip the privacy issue on its head by actively searching for and flagging sensitive data, such as Social Security numbers, in public documents?
“It’s not likely that we would get involved in something like that,” Needham replies. First of all, he says, “the Sitemap protocol in its current form is content-neutral,” and its four “very simple” parameters aren’t designed for such a task. In any case, he adds, “It would be outside our ambit. Our emphasis is on providing large-scale access to information sources, creating content-neutral tools for dissemination of content.” Needham notes, however, that Google does take privacy very seriously and is “very responsive” to requests to remove problematic content from its index.
That said, Needham adds, “If, as a consequence of this effort, we raise awareness about sensitive data in public documents, that’s not a bad thing at all. That’s not our explicit intent. But if that’s the outcome, that’s good for everybody.”
Concern of privacy activists
Making state government websites more accessible to the public advances the cause of open government and benefits individual users. But it also has some privacy advocates deeply concerned that the initiative will increase the risk of identity theft by exposing sensitive information to a bigger slice of the criminal population, in effect lowering the bar for would-be abusers of other people’s personal data.
Privacy advocates warn, and some state information technology leaders admit, that private information sits in hidden and not-so-hidden places on government websites. Unless states take action now to remove or redact sensitive personal data from their sites, say privacy activists, their partnership with Google could further expose citizens’ information to identity thieves and other criminals.
“Oh my God! Look at all these names!” cries privacy activist Betty “B.J.” Ostergren as she leads us on a tour of the Internet, trolling for sensitive information that shouldn’t be there. Ostergren is using Virginia’s state website—which includes Google custom search capability—to look up the name of her friend, a former state highway patrol officer. What she finds is a list of hundreds of ex-troopers, complete with their email and home addresses.
Ostergren’s guided Internet tours have made her famous. To demonstrate just how easy it is to find important personal information on the web, she clicked around state websites until she found the Social Security numbers, dates of birth, and former addresses of such prominent people as Colin Powell and former CIA director Robert Gates. Then she made the data available to the media.
Cleaning up ... or not
Top federal officials have since become much more careful about removing their own private information from public websites, Ostergren says. But the majority of citizens have little protection. As recently as last year, for instance, most states posted Uniform Commercial Code filings on their Secretary of State websites. All UCC applications record debtors’ Social Security numbers. When Ostergren and other activists alerted officials to this problem, many states removed UCC filings from their sites.
Among the four states involved in the Google partnership, California, Arizona and Virginia decommissioned their websites’ UCC search functions. Spokespersons for all three secretaries of state say the agencies are still trying to figure out how to remove Social Security numbers from the forms before placing the databases back online.
But in Utah, tens of thousands of Social Security numbers are still available to any Internet user from the state Department of Commerce website. The only hurdle for would-be identity thieves: They must click a button on the site affirming that they represent a company in compliance with state privacy laws. After that, thieves can buy access to Social Security numbers for $2 a pop, a bargain considering that many data aggregators charge $35 to $50 each for Social Security numbers.
Choosing commerce over confidentiality
Some other states haven’t even taken this bare minimum step. In Massachusetts, a political disagreement combined with simple ineptitude currently puts hundreds of thousands of citizens’ identities at risk. The UCC website maintained by Massachusetts Secretary of the Commonwealth William Galvin contains a treasure trove of Social Security numbers. The information is easy to find, and available at no cost. Despite repeated threats by Ostergren and other privacy advocates to post the private information of prominent Boston business and political leaders, Galvin refuses to take the site down.
“These documents are necessary for commerce,” Galvin’s spokesman, Brian McNiff, told the Boston Globe.
Instead of taking the entire UCC database offline until the problem is fixed, Galvin directed workers in his agency to manually redact Social Security numbers from UCC filings, and then rescan the documents. But the bureaucrats apparently don’t understand their own system. They blacked out the numbers on all the origination documents, which record when the loans were first taken out, but never bothered to remove them from termination papers, which record when loans were paid off.
This means that the agency’s considerable efforts were entirely wasted. The Social Security numbers are still on the website for anyone to see.
“These people are so stupid,” Ostergren said.
“I’m going to deliberately duck your question”
Massachusetts has not yet partnered with a search engine company to make its state websites easier to search, but Google makes clear that its ultimate goal is to include all 50 states in the process. This has privacy activists worried about states like Utah and Massachusetts, which have been reluctant to heed privacy advocates’ warnings of the dangers lurking on government websites, but also about places like California, where officials have been forced to take the threat more seriously.
Major data breaches, like one in 2002 in which hackers stole state employees’ Social Security numbers and payroll information, led California to pass the first breach notification law in the country. State officials, including Secretary of State Debra Bowen, have been quick to respond to privacy advocates’ concerns, removing databases from government websites until they can be scrubbed clean of sensitive information.
Meanwhile, many California state agencies have successfully sitemapped huge swaths of their websites using Google’s approach, says California’s Chief Technology Officer Clark Kelso. Agencies are doing the work as they see fit, Kelso says, with no unified plan for which websites and databases should be brought to the surface and which should remain obscure.
We asked: “So how well is California doing in its efforts to remove sensitive personal information from state sites?”
“I’m going to deliberately duck your question,” Kelso said. He then proceeded to duck the question in greater detail.
“I think overall we’re doing pretty well for an institution as large as California state government,” said Kelso. “That said, we have millions of pages of documents on our websites. So I’m not going to be able to vouch for every one of those pages.”
The risk in a nutshell
What we have, therefore, are many different ways in which states already are releasing sensitive, private information to the public, and likely to identity thieves. We have incompetence or intransigence on the part of state agency leaders and their subordinates. We also have well-intentioned people in charge of computer systems that so far have proved too sprawling and complex to control.
The common denominator is that local, state and federal governments wrote their open records laws decades before anyone conceived of identity theft or the Internet. When the Internet came along, they generally applied the same principle to online records as they did to paper documents: Government’s job is to make public records public, not to censor them. This is an admirable goal, and a necessary one if the United States is going to continue as a democracy.
But a number of observers worry that making citizens’ Social Security numbers more easily available via Google and other search engines isn’t the right way to proceed.
“My big complaint with Google is there’s no nuance in their position,” says Pam Dixon, director of the World Privacy Forum. “It’s techno-determinism. No matter what happens, technology is always better than what came before. The bottom line is: You can’t have such a simplistic view of the world.”
Dixon continues: “There are public documents that are very hard to find just because of dismal web design. I’m all for fixing that. But let’s take our time. If these documents are going to be more visible, we need to redact Social Security numbers and other improperly secured data to protect vulnerable populations such as the disabled and the elderly.”
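The redaction pass Dixon calls for can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any state’s system: it assumes the documents are already plain text, that Social Security numbers appear in the familiar three-two-four digit layout separated by hyphens or spaces, and that numbers in ranges never issued (area 000, 666, or 900 to 999; group 00; serial 0000) can be left alone. The names `SSN_PATTERN`, `is_plausible_ssn`, and `redact_ssns` are hypothetical.

```python
import re

# Assumption: SSNs are written as 3-2-4 digits with hyphens or spaces.
SSN_PATTERN = re.compile(r"\b(\d{3})[- ](\d{2})[- ](\d{4})\b")


def is_plausible_ssn(area, group, serial):
    """Reject digit groups that fall in ranges never issued as SSNs."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def redact_ssns(text):
    """Return (redacted_text, count) with plausible SSNs masked."""
    count = 0

    def mask(match):
        nonlocal count
        if is_plausible_ssn(*match.groups()):
            count += 1
            return "XXX-XX-XXXX"
        # Leave look-alike numbers (such as reference codes) untouched.
        return match.group(0)

    return SSN_PATTERN.sub(mask, text), count


sample = "Debtor SSN 123-45-6789; ref 000-12-3456; phone 555-1234"
redacted, found = redact_ssns(sample)
print(found, redacted)
```

A scan like this only helps if it runs over every document type in a collection. The Massachusetts effort described above fell short precisely because termination papers were skipped. Scanned images would also need text extraction first, which this sketch does not attempt.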
Giving petty criminals a boost
To skilled hackers and a growing assortment of organized crime mafias, it probably won’t make much difference whether or not states partner with Google. Either way, if there’s sensitive personal information to be found and abused, they’ll find and abuse it. But in cases where identity theft is a crime of opportunity, making private data easy to find on Google will expose more people to low-skilled criminals.
“I think we forget that a lot of identity thieves are not at all sophisticated,” says Chris Jay Hoofnagle, senior staff attorney with the Samuelson Law, Technology, and Public Policy Clinic at the University of California, Berkeley. “There is the Eastern European mafia hacker, but so far, we haven’t even been able to stop the methamphetamine addict.”
There’s little doubt that making sensitive data available to search engines, even if that data is already online, dangles the forbidden fruit a little lower where hopheads, street gangsters, and casual criminals can reach it. But it’s worth remembering that organized criminals are already having no trouble finding this data, with or without Google. Which suggests that, with or without Google, people are already at risk.
Holding government accountable
For more than a decade, government agencies have exposed citizens to identity thieves by posting private information on the web that shouldn’t be there. The biggest change that comes with Google’s partnership is that states can no longer hide their heads in the sand and hope that nothing bad happens. In fact, given governments’ current cluelessness about the contents of their own websites, one could argue that partnering with Google gives them the chance to map and close their vulnerabilities, an unplanned privacy boon.
But right now, not all governments are seizing this historic opportunity. Some states, like Virginia, are being very cautious not to bring to the surface websites filled with private data. Others, including California, are optimizing their websites willy-nilly, with virtually no central controls over which information should be redacted or removed before coming to the surface.
While Google could certainly do more to ensure that millions of online SSNs don’t become even easier to find, the real responsibility, and the real challenge, belongs to the state and federal governments. Security by obscurity is not secure. It’s lazy. And it’s no longer sufficient for people like Clark Kelso to say, “I’m not going to be able to vouch for every one of those pages.” States must find, redact and remove sensitive private information from their websites before using Google’s Sitemap protocol to bring them to the web’s surface. And if our government officials are unwilling to keep our private data private, we should fire them and find others who can.
The partnership with Google presents a great opportunity to make government more transparent to its citizens. It also presents a real risk if not implemented correctly. Let’s fight to make sure that the Google partnership helps our democracy without endangering our identities.
©2003-2010 Identity Theft 911, LLC. All rights reserved.