NIST Big Data Public Working Group


Security and privacy have also been affected by the emergence of the Big Data paradigm. A detailed discussion of the influence of Big Data on security and privacy is included in NBDIF: Volume 4, Security and Privacy. Some of the effects of Big Data characteristics on security and privacy are summarized below:
• Variety: Retargeting traditional relational database security to non-relational databases has been a challenge. An emergent phenomenon introduced by Big Data variety that has gained considerable importance is the ability to infer identity from anonymized datasets by correlating them with apparently innocuous public databases, as discussed in Section 5.5.2.
• Volume: The volume of Big Data has necessitated storage in multitiered storage media. The movement of data between tiers has created a need to analyze the threat models systematically and to research and develop novel techniques.
• Velocity: As with non-relational databases, distributed programming frameworks such as Hadoop were not developed with security as a primary objective.
• Variability: Security and privacy requirements can shift according to the time-dependent nature of the roles that collect, process, aggregate, and store the data. Governance can shift as responsible organizations merge or even disappear.
Privacy concerns, and frameworks to address these concerns, predate Big Data. While bounded in comparison to Big Data, past solutions considered legal, social, and technical requirements for privacy in distributed systems, very large databases, and high performance computing and communications (HPCC). The addition of new techniques to handle the variety, volume, velocity, and variability has amplified these concerns to the level of a national conversation, with unanticipated impacts on privacy frameworks.
Security and privacy concerns are present throughout any Big Data system.
In the past, security focused on a perimeter defense, but now it is well understood that defense-in-depth is critical. The term security and privacy fabric in the context of the NBDRA (see NBDIF Volume 6: Reference Architecture, Section 3) conceptually describes the presence of security and privacy concerns in every part of a Big Data system. Fabric conceptually represents the presence of activities and components throughout a computing system. Security standards define a number of controls at each interface and for each component. Likewise, privacy is a concern for Big Data systems, where additional privacy concerns can be created through the fusion of multiple datasets, or the granularity of the data being collected.
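The re-identification risk noted under Variety, and the privacy concerns created by fusing multiple datasets, can be illustrated with a minimal Python sketch. It is not taken from the NBDIF. The records, the field names, and the choice of ZIP code, birth year, and sex as quasi-identifiers are all hypothetical assumptions made for illustration. The sketch links an "anonymized" dataset to a public one on the shared fields and reports the records that match exactly one named individual.

```python
from collections import defaultdict

# Hypothetical "anonymized" records: names removed, quasi-identifiers retained.
anonymized = [
    {"zip": "20899", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "20899", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "20850", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical, apparently innocuous public dataset that includes names.
public = [
    {"name": "Alice Example", "zip": "20899", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "20899", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "20850", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "zip": "20850", "birth_year": 1975, "sex": "F"},
]

# Assumed quasi-identifiers shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(anonymized_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Pair each anonymized row with the public names sharing its key values."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    return [
        (row, index.get(tuple(row[k] for k in keys), []))
        for row in anonymized_rows
    ]


def reidentified(links):
    """Keep only rows that match exactly one public individual."""
    return [
        (names[0], row["diagnosis"]) for row, names in links if len(names) == 1
    ]


links = link_records(anonymized, public)
unique_matches = reidentified(links)
for name, diagnosis in unique_matches:
    print(f"{name}: {diagnosis}")
```

In this toy data, two of the three anonymized records match a single named person, while the third matches two candidates and so remains ambiguous. Coarsening or suppressing quasi-identifiers so that every combination matches several individuals is one commonly discussed mitigation.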

7 BIG DATA MANAGEMENT

Given the presence of management concerns and activities throughout all components and activities of Big Data systems, management is represented in the NIST reference architecture as a fabric, similar to its usage for security and privacy. The primary change to managing Big Data systems naturally centers around the distribution of the data. Sizing a set of data nodes is a new skill in Big Data engineering, since data on a node is typically replicated across two slave nodes (for failover). This replication increases the raw capacity needed to store a given amount of data. A choice must be made up front about which field's values are used to split the data across nodes (known as sharding). This choice may not be the best one in terms of the eventual analytics, so the distribution is monitored and the data is potentially reallocated to optimize the system. At the infrastructure level, since many applications run in virtualized environments across multiple servers, the cluster management portion is not new, but the complexity of the data management has increased.

7.1 ORCHESTRATION

The Orchestration role for Big Data systems is discussed in the NBDIF: Volume 6, Reference Architecture. This role focuses on requirements generation and compliance monitoring on behalf of the organization and the system owner. One major change is in the negotiation of data access and usage rights with external data providers as well as the system’s data consumers. This includes the need to coordinate the data exchange software and data standards.

7.2 DATA GOVERNANCE

Data governance is a fundamental element in the management of data and data systems. Data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise. The definition of data governance includes management across the complete data life cycle, whether the data is at rest, in motion, in incomplete stages, or in transactions.
To maximize its benefit, data governance must also consider the issues of privacy and security of individuals of all ages, individuals as organizations, and organizations as organizations. Additional discussion of governance with respect to security and privacy can be found in the NBDIF: Volume 4, Security and Privacy. Data governance is needed to address important issues in the new global Internet Big Data economy. One major change is that an organization’s data is being accessed and sometimes updated by other organizations. This has happened before in the exchange of business information, but has now expanded beyond the exchange between direct business partners. Just as cloud-based systems now involve multiple organizations, Big Data systems can now involve external data over which the organization has no control. Another example of change is that many businesses provide a data hosting platform for data that is generated by the users of the system. While governance policies and processes from the point of view of the data hosting company are commonplace, the issue of governance and control rights of the data providers is new. Many questions remain, including the following: Do the data providers still own their data, or is the data owned by the hosting company? Do the data producers have the ability to delete their data? Can they control who is allowed to see their data? The question of governance lies in the tension between the value that one party (e.g., the data hosting company) wants to generate and the rights that the data provider wants to retain to obtain their own value. New governance concerns arising from the Big Data paradigm need greater discussion, and will be discussed further during the development of the next version of this document.
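As an illustration of the sizing and sharding considerations described at the start of Section 7, the following Python sketch may be helpful. It is not part of the NBDIF and rests on stated assumptions: a replication factor of 3 (one primary copy plus the two replicas mentioned in the text), hash-based shard assignment using SHA-256, and synthetic records with hypothetical `country` and `customer_id` fields. All function names are hypothetical. The sketch estimates raw capacity under replication and compares how evenly two candidate shard keys distribute the same records.

```python
import hashlib
from collections import Counter


def raw_capacity_needed(logical_tb, replication_factor=3):
    """Raw storage required when every block is stored replication_factor times."""
    return logical_tb * replication_factor


def shard_for(key, num_shards):
    """Map a key value to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_distribution(keys, num_shards):
    """Count how many records land on each shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


def skew(distribution):
    """Ratio of the fullest shard to the mean shard size (1.0 is perfectly even)."""
    mean = sum(distribution) / len(distribution)
    return max(distribution) / mean if mean else 0.0


# Synthetic records: 90% share one country value, customer IDs are all distinct.
records = [
    {"customer_id": i, "country": "US" if i % 10 else "CA"}
    for i in range(1000)
]

NUM_SHARDS = 4
by_country = shard_distribution([r["country"] for r in records], NUM_SHARDS)
by_customer = shard_distribution([r["customer_id"] for r in records], NUM_SHARDS)

print("Raw TB for 100 TB of data:", raw_capacity_needed(100))
print("Sharded by country:    ", by_country, "skew", round(skew(by_country), 2))
print("Sharded by customer_id:", by_customer, "skew", round(skew(by_customer), 2))
```

A low-cardinality, unevenly distributed field such as `country` concentrates most records on one shard, whereas a high-cardinality field spreads them more evenly. A skew measure of this kind is one simple signal that could be monitored when deciding whether data should be reallocated.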