Standardization and integration of different datasets

2022-08-17

Introduction

The first step of the bdc package harmonizes heterogeneous datasets into a standard format. It does so by replacing the headers of the original datasets with standardized terms. To set this up, you fill out a configuration table indicating which field names (i.e., column headers) of each original dataset match a list of Darwin Core standard terms.
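The idea of header replacement can be illustrated with a small, self-contained sketch in base R. This is not how bdc implements the step internally: the toy data frame, its original column names, and the `field_map` lookup are hypothetical, and only the target names are Darwin Core terms.

```r
# Toy dataset with non-standard headers (hypothetical names)
raw <- data.frame(
  species = c("Myrcia splendens", "Euterpe edulis"),
  lat = c(-15.78, -23.55),
  lon = c(-47.93, -46.63),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are Darwin Core terms, values are original headers
field_map <- c(
  scientificName = "species",
  decimalLatitude = "lat",
  decimalLongitude = "lon"
)

# Replace each original header with its matching standard term
standardized <- raw
idx <- match(field_map, names(standardized))
names(standardized)[idx] <- names(field_map)

names(standardized)
```

In the package itself, this mapping is supplied through the configuration table rather than written by hand.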

Once standardized, the datasets are integrated into a single database containing the minimum set of terms required for sharing biodiversity data and metadata across a wide variety of biodiversity applications (Simple Darwin Core standards).


We demonstrate the package’s usefulness using records of terrestrial plant species occurring in Brazil obtained from nine data aggregators (e.g., GBIF, SpeciesLink, SiBBr, iDigBio, among others). The example datasets are available at https://doi.org/10.6084/m9.figshare.19260962. Please click on “Download all” to obtain all datasets.

Next, copy the path on your computer where the datasets were saved. This path will be used in the configuration table (i.e., the metadata) to read the datasets and create an integrated and standardized database.



Installation

See the package's installation instructions for how to install bdc.

Read the configuration table

Read an example of the configuration table. You can download the table by clicking on the “CSV” button. Next, indicate the path to the folder containing the example datasets in the configuration table.

metadata <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv",
                              package = "bdc"),
                  show_col_types = FALSE)


Change the path to the folder containing the datasets.

# Path to the folder containing the example datasets. For instance:
path <- "C:/Users/myname/Documents/myproject/input_files/"

# In the configuration table, replace the URL prefix with the path to the local folder containing the example datasets
metadata$fileName <-
  gsub(pattern = "https://raw.githubusercontent.com/brunobrr/bdc/master/inst/extdata/input_files/",
       replacement = path,
       x = metadata$fileName)
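Before running the standardization, it may be worth confirming that the substitution produced paths that exist on disk. The sketch below uses a toy vector in place of `metadata$fileName` and a temporary folder in place of `path`; the file names `GBIF.csv` and `MISSING.csv` are hypothetical. Here `fixed = TRUE` makes `gsub()` treat the URL literally rather than as a regular expression.

```r
# Stand-in for the folder holding the downloaded datasets
path <- paste0(normalizePath(tempdir(), winslash = "/"), "/")
file.create(paste0(path, "GBIF.csv"))  # hypothetical file name

# Stand-in for metadata$fileName
url_prefix <- "https://raw.githubusercontent.com/brunobrr/bdc/master/inst/extdata/input_files/"
file_names <- paste0(url_prefix, c("GBIF.csv", "MISSING.csv"))

# Literal (non-regex) replacement of the URL prefix with the local path
local_files <- gsub(pattern = url_prefix, replacement = path,
                    x = file_names, fixed = TRUE)

# Flag any dataset that cannot be found locally
found <- file.exists(local_files)
local_files[!found]
```

Any path returned by the last line points to a dataset that is not where the configuration table expects it.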


Standardization and integration of datasets

Now, let’s merge the different datasets into a standardized database. If save_database = TRUE, the database integrating all datasets is saved in the folder “Output/Intermediate” as “00_merged_database”. The database is saved with a “csv” or “qs” extension; “qs” is a helpful format for quickly saving and reading large databases. “qs” files can be read using the function “qread” from the “qs” package.

database <-
  bdc_standardize_datasets(metadata = metadata,
                           format = "csv",
                           overwrite = TRUE,
                           save_database = TRUE)

#> Standardizing AT_EPIPHYTES file
#> Standardizing BIEN file
#> Standardizing DRYFLOR file
#> Standardizing GBIF file
#> Standardizing ICMBIO file
#> Standardizing IDIGBIO file
#> Standardizing NEOTROPTREE file
#> Standardizing SIBBR file
#> Standardizing SPECIESLINK file
#>
#> C:/Users/Bruno R. Ribeiro/Desktop/bdc/Output/Intermediate/00_merged_database.csv was created
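If the database is saved with format = "qs" instead, it can be read back with qs::qread. The sketch below is a rough illustration: the path is an assumption that combines the output folder and file name with a `.qs` extension, relative to the working directory. The guard makes the snippet a no-op when the qs package or the file is not available.

```r
# Assumed location of the merged database when format = "qs"
db_path <- file.path("Output", "Intermediate", "00_merged_database.qs")

# Read only if the qs package is installed and the file exists
if (requireNamespace("qs", quietly = TRUE) && file.exists(db_path)) {
  database <- qs::qread(db_path)
} else {
  message("qs package or file not available: ", db_path)
}
```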

An example of a standardized database containing the fields required to run the bdc package.


⚠️IMPORTANT:

The standardized database contains information on species taxonomy, geolocation, date of collection, and other relevant context. Each field falls into one of three categories according to its importance for running the package's functions: i) required, i.e., the minimum information necessary to run the functions; ii) recommended, i.e., not mandatory but carrying important details on species records; and iii) additional, i.e., information potentially useful for detailed data analyses.
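A quick way to see whether a dataset covers the required category is to compare its column names against a list of required fields. The sketch below is illustrative only: the names in `required_fields` are Darwin Core terms chosen for the example, not the package's authoritative list, and `missing_fields` is a hypothetical helper, not a bdc function.

```r
# Illustrative list only; not the package's authoritative set of required fields
required_fields <- c("scientificName", "decimalLatitude", "decimalLongitude")

# Return the required fields that are absent from a data frame
missing_fields <- function(data, required) {
  setdiff(required, names(data))
}

# Toy database lacking a longitude column
toy_db <- data.frame(
  scientificName = "Euterpe edulis",
  decimalLatitude = -23.55
)

missing_fields(toy_db, required_fields)
```

An empty result would mean every listed field is present in the data frame.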

The specifications of each field of the configuration table are listed below:


config_description <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo_description.csv",
                              package = "bdc"),
                  show_col_types = FALSE)