Big Geo Data Analytics
editors: Peter Baumann, Chuck Heazel
Introduction
"Big Data is the term used to describe the deluge of data in our networked, digitized, sensor-laden,
information-driven world. There is broad agreement among commercial, academic, and government
leaders about the remarkable potential of Big Data to spark innovation, fuel commerce, and drive
progress. The availability of vast data resources carries the potential to answer questions previously out of
reach. However, there is also broad agreement on the ability of Big Data to overwhelm traditional
approaches. The rate at which data volumes, speeds, and complexity are growing is outpacing scientific and
technological advances in data analytics, management, transport, and more.
The ability to create consensus-based vendor-neutral, technology and infrastructure agnostic solutions to
enable Big Data stakeholders to pick-and-choose best processing tools for collecting, curating, analyzing,
visualizing, and accessing massive amounts of data on the most suitable computing platform and cluster while allowing
value-added from Big Data service providers and flow of data between the stakeholders in a cohesive and
secure manner is desirable." -- ISO SC32 Big Data Analytics Study Group
The purpose of this Domain Working Group is to address that need, at least as it pertains to spatial-temporal data and analytics. We can start with the hard case: Big Table platforms such as Hadoop and Accumulo. These platforms are built around a key-value model organized into virtual tables. The keys consist of row and column identifiers, so every data element (value) is uniquely identified by its row and column location within a Big Table. Complementing this data model is the Map-Reduce analytic environment. Map-Reduce is a distributed parallel processing service which distributes algorithms and data out to virtual machines within the cloud, monitors the progress of their execution, and then integrates the outputs of the multiple algorithm instances into a single result set (a minimal sketch of this pattern follows the list below). Big Data therefore presents us with two challenges:
1) How do we implement spatial-temporal data and operations within the Big Table model?
2) How can we take maximum advantage of the parallel processing capability of Map-Reduce for spatial-temporal analytics?
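To make the Map-Reduce pattern concrete, here is a minimal sketch of a Hadoop mapper that filters comma-separated point records against a fixed bounding box, so that only candidate features reach the reduce phase. The record layout, class name, and coordinate values are hypothetical; the sketch only illustrates how a spatial predicate can be pushed into the map step.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BBoxFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Hypothetical query window: minLon, maxLon, minLat, maxLat.
    private static final double MIN_LON = -77.2, MAX_LON = -76.9;
    private static final double MIN_LAT = 38.8, MAX_LAT = 39.0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: id,lon,lat,attributes...
        String[] fields = line.toString().split(",");
        if (fields.length < 3) {
            return; // skip malformed records
        }
        double lon = Double.parseDouble(fields[1]);
        double lat = Double.parseDouble(fields[2]);
        // Emit only records whose point falls inside the bounding box; the reduce
        // phase can then aggregate or further refine the surviving candidates.
        if (lon >= MIN_LON && lon <= MAX_LON && lat >= MIN_LAT && lat <= MAX_LAT) {
            context.write(new Text(fields[0]), line);
        }
    }
}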
Cloud Service Models
The United States National Institute of Standards and Technology (NIST) has defined three cloud service models:
- Infrastructure as a Service (IaaS): This is the cloud as a collection of virtual machines; in effect, the "hardware" layer.
- Platform as a Service (PaaS): Cloud-resident operating system, services, and development tools. This could include basic services such as WFS and WCS.
- Software as a Service (SaaS): Cloud-resident applications. These go beyond basic services to include complex business logic.
US Government efforts to use the cloud have identified an additional model: Data as a Service. One could argue that Data as a Service is part of Platform as a Service, but it is useful to consider it as a separate model. Data as a Service, then, is the collection of data services which reside in the PaaS model.
One could also postulate Processing as a Service. This correlates to processing and portrayal services such as WPS and WMS. The collection of standardized processing services in the cloud would constitute the Processing as a Service model, which also resides in the PaaS model.
If we use business logic to assemble these PaaS services for a particular purpose, and make that assembly available for re-use, then we have Applications as a Service, which is SaaS.
The OGC Web Services model fits into this break-out fairly well.
C. Heazel
Structure of this Document
Given the complexity of the Big Data Analytics suite, it is a challenge to accommodate all of the different aspects, from applications down to technical approaches. The Reference Model of Open Distributed Processing (RM-ODP) was designed for, and has proven instrumental in, such complex engineering tasks, and it is therefore adopted for this document. RM-ODP offers a viewpoint model that allows the different views to be elaborated independently.
Citing Wikipedia: more specifically, the RM-ODP framework provides five generic and complementary viewpoints on the system and its environment:
- The enterprise viewpoint, which focuses on the purpose, scope and policies for the system. It describes the business requirements and how to meet them.
- The information viewpoint, which focuses on the semantics of the information and the information processing performed. It describes the information managed by the system and the structure and content type of the supporting data.
- The computational viewpoint, which enables distribution through functional decomposition of the system into objects which interact at interfaces. It describes the functionality provided by the system and its functional decomposition.
- The engineering viewpoint, which focuses on the mechanisms and functions required to support distributed interactions between objects in the system. It describes the distribution of processing performed by the system to manage the information and provide the functionality.
- The technology viewpoint, which focuses on the choice of technology of the system. It describes the technologies chosen to provide the processing, functionality and presentation of information.
These five viewpoints can be further grouped into three layers:
- Enterprise: Enterprise Viewpoint
- Abstract: Information and Computational Viewpoints
- Implementation: Engineering and Technology Viewpoints
For example, ISO 19115 defines the conceptual metadata model. It is an Information (Abstract) Viewpoint standard. ISO 19139 is the XML schema for 19115. It is an Engineering (Implementation) Viewpoint standard. By separating the abstract from the implementation, new technologies are readily assimilated into the existing design.
Enterprise Viewpoint
This viewpoint focuses on the purpose, scope and policies for the system. It describes the business requirements and how to meet them.
Information Viewpoint
This viewpoint focuses on the semantics of the information and the information processing performed. It describes the information managed by the system and the structure and content type of the supporting data.
Our first challenge is to provide the same level of spatial-temporal support to cloud-based analytics that we provide to RDBMS-based applications. Our initial objective is to provide an OGC Simple Features-like capability. Simple Features is, at its heart, an object model; Big Table is not (a minimal sketch of the object model follows the list below). There are two initiatives underway to address this disconnect:
1) GeoWEB is an R&D project run by the InnoVision Directorate of NGA. It is built on GeoTools and GeoAPI.
2) GeoMesa is an initiative to develop an "open source, distributed, spatial-temporal database on top of the Apache Accumulo column family store." It is also built on the GeoTools platform.
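For illustration, here is a minimal sketch using GeoTools and JTS (the platform both projects build on) of defining a simple feature type and instantiating a feature as an object; the schema, class name, and attribute values are hypothetical. The point is the contrast between this object model and the flat key-value cells of the Big Table model.

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import org.geotools.data.DataUtilities;
import org.geotools.feature.simple.SimpleFeatureBuilder;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;

public class SimpleFeatureExample {
    public static void main(String[] args) throws Exception {
        // Feature type: a point geometry plus a name attribute (hypothetical schema).
        SimpleFeatureType type = DataUtilities.createType(
                "Observation", "geom:Point:srid=4326,name:String");

        Point location = new GeometryFactory().createPoint(new Coordinate(-77.05, 38.87));

        // Build one feature instance; this object is what must be mapped onto
        // key-value pairs before it can live in a Big Table store.
        SimpleFeatureBuilder builder = new SimpleFeatureBuilder(type);
        builder.add(location);
        builder.add("sample observation");
        SimpleFeature feature = builder.buildFeature("obs.1");

        System.out.println(feature);
    }
}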
The core data element of the Big Table model is the key-value pair. Keys are used to index and search for content; values represent the content itself. In Accumulo, the key consists of a Row ID, a column (family, qualifier, and visibility), and a timestamp, which together identify a single value. By including a Row ID and Column ID in the key, the key-value pairs form a large, sparsely populated virtual table.
One of the questions we have had to address while working with GeoWEB is how to populate the value fields. In particular, how do we encode ISO 19115 metadata for storage in Accumulo?
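As one possible illustration (not the GeoWEB design itself), the sketch below writes a feature into Accumulo with the standard Mutation and BatchWriter API: the Row ID carries a spatial prefix (here a hypothetical geohash) so that nearby features sort together, and separate column qualifiers hold the WKT geometry and an ISO 19115 metadata document as values. The table layout, names, and prefixing scheme are assumptions made for the example.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class FeatureWriter {
    // Assumes a BatchWriter has already been opened on the target table.
    public void writeFeature(BatchWriter writer, String geohash, String featureId,
                             String wktGeometry, String iso19115Xml)
            throws MutationsRejectedException {
        // Row ID = spatial prefix + feature ID, so a range scan over a geohash
        // prefix retrieves all features in that cell of the spatial index.
        Mutation m = new Mutation(new Text(geohash + "_" + featureId));
        m.put(new Text("feature"), new Text("geometry"), new Value(wktGeometry.getBytes()));
        m.put(new Text("feature"), new Text("metadata"), new Value(iso19115Xml.getBytes()));
        writer.addMutation(m);
    }
}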
GeoWEB geometry and spatial indexing are built on the OGC Simple Features Well Known Text (WKT) representation of geometry. It seems logical to use WKT for the metadata as well. However, the WKT standard does not cover ISO metadata. We have developed an extension to WKT which covers the metadata defined in ISO 19115. This material will be brought to the OGC once the documentation is completed and approved for release.
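The standard geometry portion of WKT is straightforward to handle with the JTS library that underlies GeoTools; the metadata extension described above is not shown, since it has not yet been released. A minimal round-trip sketch:

import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.io.ParseException;
import com.vividsolutions.jts.io.WKTReader;
import com.vividsolutions.jts.io.WKTWriter;

public class WktRoundTrip {
    public static void main(String[] args) throws ParseException {
        // Parse a WKT string into a geometry object ...
        Geometry polygon = new WKTReader().read(
                "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))");
        // ... operate on it as a Simple Features geometry ...
        System.out.println("Area: " + polygon.getArea());
        // ... and serialize it back to WKT, e.g. for storage in a key-value cell.
        System.out.println(new WKTWriter().write(polygon));
    }
}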
Another important aspect of Big Table analytics is query processing. Spatial-temporal queries can be complex, and the Big Table query processing infrastructure must be up to the job. The attached paper, Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce, provides a very good description of the issues facing a Big Table query processor and how the authors addressed that challenge. They have also worked extensively with high-performance spatial RDBMSs and provide benchmarks on the performance differences between the two architectures. Finally, those of you who are fans of the movie Fantastic Voyage will appreciate the spatial reference systems they work in.
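To indicate how such a query can map onto the key-value model, the following sketch (an illustrative assumption, not the Hadoop-GIS approach) turns a spatial window query into an Accumulo range scan over rows sharing a geohash prefix, leaving exact geometric refinement to a later filtering step. The table name and prefixing scheme are hypothetical and match the write sketch above.

import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class PrefixQuery {
    public void scanCell(Connector connector, String geohashPrefix)
            throws TableNotFoundException {
        Scanner scanner = connector.createScanner("features", new Authorizations());
        // All rows starting with the geohash prefix are candidates for the query window.
        scanner.setRange(Range.prefix(geohashPrefix));
        for (Entry<Key, Value> entry : scanner) {
            // Exact intersection testing against the query geometry would happen here.
            System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
        }
    }
}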
Chuck Heazel
Further challenges to be addressed within this viewpoint include:
- latency issues due to network use in distributed environments
- discovery of data
- variety of data structures, origins, and handling, combined with the need for integration to support decision making
- correlation of heterogeneous, distributed data based on location of target
- location is important, but there are other relationships which are orthogonal, and views need to be integrated; usual approaches include tagging data, metadata, ...
- representation, modeling, and use of time
- providing information anytime, and at the right time
- veracity aspects: validity, fitness for purpose, provenance issues, ...
- data quality
- how long should individual data be retained?
Computational Viewpoint
This viewpoint enables distribution through functional decomposition of the system into objects which interact at interfaces. It describes the functionality provided by the system and its functional decomposition.
- latency issues due to network use in distributed environments
Engineering Viewpoint
This viewpoint focuses on the mechanisms and functions required to support distributed interactions between objects in the system. It describes the distribution of processing performed by the system to manage the information and provide the functionality.
Technology Viewpoint
This viewpoint focuses on the choice of technology for the system. It describes the technologies chosen to provide the processing, functionality and presentation of information.
Conclusions
--
PeterBaumann - 12 Feb 2014