This morning I participated in a panel on big data privacy at the San Francisco ISACA Fall Conference. The Information Systems Audit and Control Association (ISACA) is a professional association of individuals interested in information systems audit, control and security, with over 50,000 members in 141 countries. The other panelists were from eBay, PricewaterhouseCoopers, and CipherCloud.
Researching this topic, and today’s discussion itself, raised some interesting questions about the intersection of personal privacy and big data in this new age, where it is becoming technically and economically viable to store and analyze enormous data volumes, such as every click on a website and every commerce transaction. All this big data, both from internal systems and from externally sourced enrichment data, can now be streamed into giant “data lakes” using open-source big data management platforms such as Hadoop and NoSQL databases. Visual big data development tools and end-user visualization technology such as Pentaho make it easier and easier for organizations to ingest, prepare and analyze this data, resulting in previously unattainable insights that can be used to optimize revenue streams and reduce costs.
However, how can we ensure this “big data” is protected and never used in ways that intrude on individual privacy? Mature and integrated big data analytics platforms such as Pentaho can enforce data access controls as well as audit data usage, but today there is no industry standard for permanently tagging specific data elements with how they may be used. This could lead to the risk, down the road, that data collected with an individual’s consent is ultimately used for purposes beyond the scope of the original consent policy. Is it time for government and industry standards bodies to tackle this issue with new technical standards that enforce data usage and aging policies on an ongoing basis? I have in mind something like an “XBRL” for data privacy: a standard taxonomy and semantics that enforce data usage policies regardless of source and platform.
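To make the idea a little more concrete, here is a minimal Python sketch of what tagging a data element with its usage policy might look like. Everything in it is hypothetical and purely illustrative: the names `UsagePolicy`, `TaggedValue`, `PolicyViolation` and `read`, the purpose labels, and the dates are my own assumptions, not part of any existing standard or of any product's API. The point is only that the consent scope and the expiry date travel with the value and are checked on every read.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical policy tag: what the data may be used for, and until when."""
    allowed_purposes: frozenset
    expires_on: date


@dataclass(frozen=True)
class TaggedValue:
    """A data element that carries its usage policy wherever it goes."""
    value: object
    policy: UsagePolicy


class PolicyViolation(Exception):
    """Raised when a read falls outside the consented purpose or retention period."""


def read(tagged: TaggedValue, purpose: str, today: date) -> object:
    """Return the value only if the stated purpose and date satisfy the policy."""
    if today > tagged.policy.expires_on:
        raise PolicyViolation("retention period has expired")
    if purpose not in tagged.policy.allowed_purposes:
        raise PolicyViolation(f"purpose {purpose!r} is not covered by consent")
    return tagged.value


# Illustrative usage with made-up values.
email = TaggedValue(
    value="jane@example.com",
    policy=UsagePolicy(
        allowed_purposes=frozenset({"order_fulfillment"}),
        expires_on=date(2015, 12, 31),
    ),
)

# Allowed: the purpose matches the consent and the data has not aged out.
print(read(email, "order_fulfillment", date(2014, 1, 1)))

# Refused: marketing was never part of the original consent.
try:
    read(email, "marketing", date(2014, 1, 1))
except PolicyViolation as err:
    print("refused:", err)
```

A real standard would of course need much more than this, such as an agreed vocabulary of purposes and a way to keep the tag attached as data moves between platforms, but the sketch shows the kind of check that such a taxonomy would let every system perform in the same way.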
Let me know what you think. Leave a comment below or reach me at @ian_fyfe.
Chief Technology Evangelist