List of standards that can be useful for Citizen Science
(based on initial work done in the H2020 Ground Truth 2.0 project)
The list of standards provided by this document has been designed for internal use in Ground Truth 2.0, but keeping in mind an approach generic enough to make it useful for other projects in the future. In addition, the aim is to submit it to the OGC Citizen Science Domain Working Group (CS.DWG) for discussion and refinement.
List of standards bodies that will be considered in elaborating this list
Many organisations produce standards that are used by the community. The most recognized standardization bodies in the field are the Open Geospatial Consortium (OGC), ISO/TC 211 and the W3C.
The difference between an API and the standards
Recently, there has been some confusion between APIs and standards. In the old days, companies opted for closed systems with no documented formats or interfaces. Nowadays, many vendors release tailored APIs (often paired with tailored JSON formats); the Google Maps and Twitter APIs are two well-known examples. This document recognizes that publishing the API endpoint and the API documentation are steps in the right direction, since they allow others to build clients on top of the systems. In most cases, however, this reflects a dominant position in the market: a single web server interacts with many clients, all of them technologically locked in to the server vendor. This approach does NOT provide interoperability between server systems, in the sense that clients and services from different vendors can communicate and be replaced if needed.
This does not preclude that OGC and other standardization bodies, in their neutral position, might consider standardizing APIs, allowing a proliferation of server (and client) implementations. More details about this discussion can be found in a recent OGC discussion paper.
Citizen science activities are composed of several components and actors. When discussing standards, it is important to have a clear idea of what the standardization target is. This subsection enumerates the possible targets and identifies the one that is relevant for this document.
Citizen science projects: There are so many projects these days that it could be good to have an inventory of projects to be able to discover them. Essentially, we need a data model to collect the necessary information about them (topic, responsible party, URL to the app, URL to the collected data, etc.). Associations of citizen science projects might want to exchange information among themselves in an interoperable way. This standard is out of the scope of this document.
Citizen science client applications: The proliferation of mobile applications to facilitate the task of data collectors for each project results in a myriad of them. It could be good to have standard interfaces to capture data for different projects; iNaturalist is one crosscutting application. These standards are out of the scope of this document.
Citizen science variables: The objective here is to collect data in a model that is interoperable with other systems (such as remote sensing Earth observation or in-situ research infrastructures) to ensure that the data collected will be compatible with other sources and can potentially be conflated with them. The concept of Essential Variables can help in this direction. Data capturing standards are published by CEOS; they are intended to specify how to capture in-situ data that can be useful for remote sensing calibration. These standards and methodologies are out of the scope of this document.
Citizen science sensor communication interfaces: Many citizen science projects use cheap or DIY sensors that need to communicate with other devices to store or transmit the data. Standards related to this could deal with USB, WiFi, 3G and other interfaces or radio communication infrastructures. These standards are out of the scope of this document.
Citizen science collected data: This is subject to the aspects considered in the GEOSS Data Management Principles and all the standards related to them.
Standardization for data collected in CS projects
To produce a coherent list of standards it is important to identify a classification criterion. A classical approach classifies standards into web services, data encodings, and query and filter languages. In this list we prefer a classification that follows the ten Data Management Principles (DMP) produced by GEO for GEOSS, which can be found here: https://www.earthobservations.org/documents/dswg/201504_data_management_principles_long_final.pdf
By taking this route, we favor better interoperability with GEOSS if the listed standards are implemented by the project.
For each of the ten DMPs, this section will enumerate the available standards and provide a very short justification of their usability in this context.
DMP-1: Discoverability
Data and all associated metadata will be discoverable, through catalogues and search engines, and data access and use conditions, including licenses, will be clearly indicated.
Note for the reader: this topic is fundamentally linked with data documentation, and we recommend reading DMP-4 in preparation for this one.
Discovery of information is achieved by search engines. They follow two main approaches: metadata catalogues and information crawlers. In metadata catalogues, metadata about resources is registered in catalogues that index the metadata and allow for querying it. In information crawlers, text available on the Internet is automatically read and indexed for direct search. Unfortunately, data generally consists of sequences of numbers, dates or categories that cannot be directly interpreted without a description of them. This is commonly provided in the form of metadata tags, even if other alternatives are possible. To standardize discovery we should deal with query languages and output formats. Google has provided detailed guidance on structured markup of dataset description pages.
OpenSearch: This standard provides a simple query language to query a search engine by free text. It provides a Key-Value Pair (KVP) syntax (an example of KVP is www.google.com?q=Ground+Truth+2.0; for more details see https://en.wikipedia.org/wiki/Query_string) that modern web search engines support, and favours a response in an Atom feed. There is an extension of it called *OpenSearch Geo* (http://www.opengeospatial.org/standards/opensearchgeo) that includes geospatial and temporal queries.
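The KVP syntax above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical search endpoint; the `q`, `bbox`, `start` and `end` parameter names follow the OpenSearch free-text, Geo and Time conventions.

```python
from urllib.parse import urlencode

def opensearch_geo_url(endpoint, terms, bbox=None, start=None, end=None):
    """Build an OpenSearch query URL in KVP form.

    bbox follows the Geo extension (west, south, east, north);
    start/end follow the Time extension (ISO 8601 instants).
    """
    params = {"q": terms}  # free-text search terms
    if bbox:
        params["bbox"] = ",".join(str(c) for c in bbox)
    if start:
        params["start"] = start
    if end:
        params["end"] = end
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint, for illustration only
url = opensearch_geo_url("https://example.org/search",
                         "Ground Truth 2.0", bbox=(2.0, 41.0, 3.0, 42.0))
```

A server supporting OpenSearch Geo would answer such a request with an Atom feed of matching records.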
Metadata catalogues: The Catalogue Service for the Web (CSW) is the main standard for catalogues. There are two profiles: one for ISO 19115 records and another for ebRIM implementations, which supports a more flexible model based on objects and relations.
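As a sketch of how a client queries such a catalogue, the following builds a CSW 2.0.2 GetRecords request in KVP form. The endpoint is hypothetical; the parameter names follow the CSW specification, and the constraint is expressed in CQL text.

```python
from urllib.parse import urlencode

def csw_getrecords_url(endpoint, cql_filter, max_records=10):
    """Build a CSW 2.0.2 GetRecords request (KVP binding)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",        # Dublin Core record type
        "elementSetName": "summary",      # brief / summary / full
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint": cql_filter,
        "maxRecords": str(max_records),
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical catalogue endpoint; searches any text field
url = csw_getrecords_url("https://example.org/csw",
                         "AnyText LIKE '%citizen science%'")
```

Libraries such as OWSLib wrap this protocol, but the raw request shows how little is needed for an interoperable catalogue search.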
Most CS projects act as sensors, collecting data. These sensors are not commonly described as ISO 19115 datasets but as Observations and Measurements (O&M). Current implementations of the translation of sensor descriptions into ISO 19115 do not work well (as experienced in the ConnectinGEO project).
Another topic to consider is that sensors may be able to register themselves without human intervention. The equivalent would be CS activities that register themselves in catalogues. Even if there are solutions in the mass-market arena, I am not aware of any standards that allow this.
Finally, I would like to mention the GEO Discovery and Access Broker (GEO DAB): in the end, to be discoverable in GEOSS, CS datasets need to be registered in the DAB. Registering them one by one seems impractical, so it could be more beneficial to figure out a protocol where all the relevant CS initiatives register in a central catalogue that is regularly harvested by the GEO DAB.
DMP-2: Online Access
Data will be accessible via online services, including, at a minimum, direct download but preferably user-customizable services for access, visualization and analysis.
For this we have the classical OGC web services family.
The natural way to proceed is to consider the CS observatories as sensors. Then the Sensor Observation Service (SOS) is the appropriate service to use. It is best at handling long time series of data captured at different stations that regularly measure one or more parameters. These parameters can be numerical values but also more complex data formats like pictures and videos.
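A client retrieves such a time series with a GetObservation request. The sketch below builds one using the SOS 2.0 KVP binding; the endpoint, offering and property identifiers are hypothetical placeholders.

```python
from urllib.parse import urlencode

def sos_getobservation_url(endpoint, offering, observed_property, t0, t1):
    """Build a SOS 2.0 GetObservation request (KVP binding)."""
    params = {
        "service": "SOS",
        "version": "2.0.0",
        "request": "GetObservation",
        "offering": offering,
        "observedProperty": observed_property,
        # temporal filter on the phenomenon time, ISO 8601 interval
        "temporalFilter": "om:phenomenonTime,%s/%s" % (t0, t1),
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical water-quality observatory exposing turbidity measurements
url = sos_getobservation_url("https://example.org/sos",
                             "water_quality", "turbidity",
                             "2017-01-01T00:00:00Z", "2017-01-31T23:59:59Z")
```

The response is an O&M document containing the observations that match the offering, property and time interval.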
To subscribe to a service and receive an alert when something significant for us has been observed, the recently published Pub/Sub standard can be very useful.
Please note that data access might also require considering security and licensing issues. We have learnt that standard licensing is one of the topics that LandSense wants to capture.
Finally, the Web Processing Service (WPS) is a way to expose a geospatial analytical processing tool on the web.
is also a popular convention for online access to data.
DMP-3: Data Encoding
Data should be structured using encodings that are widely accepted in the target user community and aligned with organizational needs and observing methods, with preference given to non-proprietary international standards.
Several standards dealing with data modelling could be useful in CS.
ISO 19109 provides what is called the General Feature Model. This is an abstract specification (not directly implementable) defining the concepts of Feature and Feature Type.
An implementation of it is the Geography Markup Language (GML), which allows for describing points, lines, polygons and more complex features.
GeoJSON is another standard to encode features in JSON, and is now an IETF standard (https://tools.ietf.org/html/rfc7946).
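A minimal GeoJSON Feature for a citizen science point observation looks like this. The geometry structure follows RFC 7946; the property names are illustrative, not part of the specification.

```python
import json

# One observation encoded as a GeoJSON Feature (coordinates are lon, lat)
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [2.15, 41.39]},
    "properties": {
        # free-form properties; these names are illustrative only
        "observedProperty": "air_temperature",
        "result": 21.5,
        "resultTime": "2017-01-07T10:30:00Z",
    },
}

encoded = json.dumps(feature)   # what a service would transmit
decoded = json.loads(encoded)   # what a client would parse back
```

Because GeoJSON is plain JSON, virtually every web mapping client can consume it without geospatial-specific tooling.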
Some observations can also be well represented in NetCDF. Initially used by the climate community, it is now being used more broadly and has been brought into the OGC process.
The natural way to proceed is to consider the CS observatories as sensors. Observations and Measurements (O&M) allows fully describing sensor observations in a model that could be used for CS. (SOS, O&M and SensorML belong to a family of standards called Sensor Web Enablement (SWE); SWE is not a standard in itself, although there is a SWE Common standard.) Actually, OGC has proposed a profile of O&M for CS (SWE4CS) in the COBWEB project that is ready to be used by other CS projects. To be able to do this, a CS activity needs to map its Earth observation activities onto the O&M concepts.
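The mapping onto O&M concepts can be sketched as follows. The field names mirror the core O&M roles (phenomenon time, procedure, observed property, feature of interest, result); all identifiers and values below are hypothetical examples, not part of any standard encoding.

```python
from dataclasses import dataclass, asdict

@dataclass
class Observation:
    """Minimal sketch of the O&M concepts a CS record must map to."""
    phenomenon_time: str      # when the observed event happened
    result_time: str          # when the result became available
    procedure: str            # the sensor/protocol used (e.g. a SensorML id)
    observed_property: str    # the phenomenon being measured
    feature_of_interest: str  # the real-world feature observed
    result: object            # the value: number, photo URL, category...

# A volunteer's turbidity reading mapped onto the model (hypothetical ids)
obs = Observation(
    phenomenon_time="2017-01-07T10:30:00Z",
    result_time="2017-01-07T10:31:00Z",
    procedure="urn:example:sensor:smartphone-app",
    observed_property="http://example.org/def/turbidity",
    feature_of_interest="http://example.org/features/lake-1",
    result=4.2,
)
record = asdict(obs)
```

Once a project can fill these slots for each record, serializing to an O&M or SWE4CS document becomes a mechanical step.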
One of the advantages of using O&M is the capacity to handle time series in an easy manner, and TimeSeriesML is a recent proposal to do that (it was extracted from the WaterML standard).
There are many other data standards that could eventually be used for more specific purposes, such as raster data formats (e.g. GeoTIFF), the recently proposed GeoPackage (a geospatial extension of SQLite) or the Geodatabase formats (some of them using Simple Features for SQL as a query language).
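Because a GeoPackage is an ordinary SQLite database with a set of mandated tables, its structure can be illustrated with the standard library alone. This is a simplified sketch: a real GeoPackage also requires the `gpkg_spatial_ref_sys` table, geometry columns metadata and a specific binary geometry encoding.

```python
import sqlite3

# Create a simplified gpkg_contents entry in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gpkg_contents (
        table_name  TEXT NOT NULL PRIMARY KEY,
        data_type   TEXT NOT NULL,       -- e.g. 'features' or 'tiles'
        identifier  TEXT UNIQUE,
        description TEXT DEFAULT '',
        min_x DOUBLE, min_y DOUBLE,      -- bounding box of the contents
        max_x DOUBLE, max_y DOUBLE,
        srs_id INTEGER                   -- spatial reference system id
    )""")
conn.execute(
    "INSERT INTO gpkg_contents VALUES (?,?,?,?,?,?,?,?,?)",
    ("observations", "features", "cs_observations",
     "Citizen science observations", 2.0, 41.0, 3.0, 42.0, 4326))

row = conn.execute(
    "SELECT data_type, srs_id FROM gpkg_contents").fetchone()
```

The appeal for CS projects is that one self-contained file holds data, metadata and indexes, and is queryable with plain SQL.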
DMP-4: Data Documentation
Data will be comprehensively documented, including all elements necessary to access, use, understand, and process, preferably via formal structured metadata based on international or community-approved standards. To the extent possible, data will be described in peer-reviewed publications referenced in metadata records.
Data documentation is what makes data catalogues work. In essence, it requires metadata of some sort.
ISO 19115: The metadata standard that everybody in the geospatial world is using. Even if the core metadata is limited to ~20 entries, the full standard specifies more than a hundred. NB: the concept of "core metadata" was removed from ISO 19115-1:2014.
Sensor description standards: The Sensor Model Language (SensorML) is a standard to describe the sensor used in a set of measurements. It can be used to describe a DIY sensor or a measurement done by a human sensor. O&M includes the semantic description of the measurements, which can be considered metadata about their meaning.
Apart from the more common metadata, there are other annotation standards that we can consider. They provide a lighter and more flexible schema for metadata.
DMP-5: Data Traceability
Data will include provenance metadata indicating the origin and processing history of raw observations and derived products, to ensure full traceability of the product chain.
Traceability is achieved by documenting details about the processing applied to a resource, also mentioning the data sources used and the actors involved in the processing.
The ISO 19115 lineage model (extended in ISO 19115-2) provides a data model and XML encoding for lineage information.
W3C PROV is the W3C recommendation to document the provenance of web resources.
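A provenance chain in the PROV spirit can be sketched as a small graph of triples. The predicates below are core PROV-O terms; the `ex:` identifiers are hypothetical, and a real deployment would serialize this as PROV-O (RDF), PROV-N or PROV-XML.

```python
# PROV-style triples describing how a cleaned CS dataset was produced.
# Subjects/objects (ex:...) are hypothetical; predicates are PROV-O terms.
triples = [
    ("ex:cleaned-dataset", "prov:wasGeneratedBy",    "ex:qc-run-42"),
    ("ex:qc-run-42",       "prov:wasAssociatedWith", "ex:project-moderator"),
    ("ex:cleaned-dataset", "prov:wasDerivedFrom",    "ex:raw-observations"),
    ("ex:raw-observations","prov:wasAttributedTo",   "ex:volunteer-007"),
]

def sources_of(entity, graph):
    """Trace the direct sources of an entity via prov:wasDerivedFrom."""
    return [s for (e, p, s) in graph
            if e == entity and p == "prov:wasDerivedFrom"]

src = sources_of("ex:cleaned-dataset", triples)
```

Walking `prov:wasDerivedFrom` edges transitively reconstructs the full product chain that DMP-5 asks for.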
It is worth noticing that by mentioning the actors involved in the data collection of individual observations, we can incur privacy and personal data protection issues that need to be considered.
It is also worth noticing that the Business Process Modelling Language (BPML) is an OMG (Object Management Group) standard to document processing chains, based on the Business Process Model and Notation (BPMN), which is an extension of UML activity diagrams.
DMP-6: Data Quality-Control
Data will be quality-controlled and the results of quality control shall be indicated in metadata; data made available in advance of quality control will be flagged in metadata as unchecked.
Geospatial data quality is described in ISO 19157 (formerly ISO 19138), which provides a data model for quantitative and conformance quality as well as a vocabulary of quality measures. An important component of data quality is uncertainty, defined in the Guide to the Expression of Uncertainty in Measurement (GUM) and the Uncertainty Markup Language (UncertML), which is extended in QualityML (a vocabulary and an encoding broader than the one proposed in ISO 19157, developed during the EC FP7 GeoViQua project and recently updated in OGC Testbed 12; the results of the update have been published as an OGC Public Engineering Report, but it is NOT an OGC standard). Both provide a list of quality statistics and a way to encode them.
DMP-7: Data Preservation
Data will be protected from loss and preserved for future use; preservation planning will be for the long term and include guidelines for loss prevention, retention schedules, and disposal or transfer procedures.
Not much has been done to ensure the preservation of CS observatory data when a project is no longer able to maintain it. The common practice in geoinformation is to transfer it to an archive. This practice is described in the Open Archival Information System (OAIS), also known as ISO 14721. The particularities of geospatial information are being captured in the draft candidate ISO 19165 on data and metadata preservation (based on both OAIS and ISO 19115).
DMP-8: Data Verification
Data and associated metadata held in data management systems will be periodically verified to ensure integrity, authenticity and readability.
Verification of integrity and authenticity is an aspect covered by the Open Archival Information System (OAIS, ISO 14721) and included in ISO 19165. A strategy that can help is the application of a packaging format such as the ISO 29500-2 Open Packaging Conventions (a recent paper has been published in support of this concept: X. Pons, J. Masó (2016) A comprehensive open package format for preservation and distribution of geospatial data and metadata. Computers & Geosciences 97, 89-97), an alternative to a GeoPackage that does not require format change.
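Since an ISO 29500-2 (OPC) package is a ZIP archive whose parts are declared in a `[Content_Types].xml` part, the idea can be illustrated with the standard library. This is a minimal sketch under that assumption: the part names and the choice of parts are illustrative, and the raster content is a placeholder.

```python
import io
import zipfile

# Content-types part declaring the formats of the package parts
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="tif" ContentType="image/tiff"/>
  <Default Extension="xml" ContentType="text/xml"/>
</Types>"""

# Build the package in memory: data and metadata are stored unchanged,
# which is the point of this preservation strategy (no format conversion)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as pkg:
    pkg.writestr("[Content_Types].xml", CONTENT_TYPES)
    pkg.writestr("data/observations.tif", b"...raster bytes...")  # placeholder
    pkg.writestr("metadata/observations.xml", "<MD_Metadata/>")

# Reopen and inspect it the way an archive's ingest process might
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as pkg:
    names = pkg.namelist()
```

The ZIP container also carries per-part CRCs, which gives a basic integrity check for free during periodic verification.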
DMP-9: Data Review and Reprocessing
Data will be managed to perform corrections and updates in accordance with reviews, and to enable reprocessing as appropriate; where applicable this shall follow established and agreed procedures.
The authors of this document are not aware of standards directly designed for this. In any case, the use of the Web Processing Service (WPS) and provenance standards can help in having processing facilities ready and in knowing how the previous version was created, respectively.
DMP-10: Persistent and Resolvable Identifiers
Data will be assigned appropriate persistent, unique and resolvable identifiers to enable documents to cite the data on which they are based and to enable data providers to receive acknowledgement for use of their data.
User IDs and record IDs: it is important that users are able to identify their own records and to request their removal. Several attempts have been made to standardize data identifiers (http://libguides.lib.msu.edu/citedata) without much success. There seems to be some consensus on assigning Digital Object Identifiers (DOIs) when data is stored in open repositories such as Pangaea (https://www.pangaea.de/) and Zenodo (https://zenodo.org/).
Equivalent initiatives exist to register people and assign them an identifier, such as ORCID (https://orcid.org/). On the other side, standards like OpenID and SAML 2.0 provide a standard way to distribute authentication. As an example, Google and Facebook IDs can be used by third parties to authenticate users with OpenID, avoiding the need to register on the third-party website to get access and be identified.
- 07 Jan 2017
Existing standards and metadata schemas for citizen-science project metadata include: ALA - BioCollect, PPSR-CORE - CitSci.org, the US Federal Crowdsourcing and Citizen Science Catalog, Dublin Core, GBIF - IPT, Project Open Data Metadata Schema (POD v1.1), CKAN API, DCAT, Schema.org, OGC, COBWEB, ADIwg, ISO 19115/19110, INSPIRE.