Standardization and integration of different datasets

Introduction

The first step of the bdc package handles the harmonization of heterogeneous datasets in a standard format simply and efficiently. How is this accomplished? Basically, by replacing the headers of original datasets with standardized terms. To do so, you have to fill out a configuration table to indicate which field names (i.e., column headers) of each original dataset match a list of Darwin Core standard terms.

Once standardized, datasets are then integrated into a standardized database having a minimum set of terms required for sharing biodiversity data and metadata across a wide variety of biodiversity applications (Simple Darwin Core standards).

We demonstrate the package’s usefulness using records on terrestrial plant species occurring in Brazil obtained from nine data aggregators (e.g., GIF, SpeciesLink, SiBBr, iDigBio, among others). The example datasets are available at https://doi.org/10.6084/m9.figshare.19260962 Please, click on “Download all” to obtain all datasets.

Next, copy the path in our computer where the datasets were saved. This path will be used in the Configuration Table (e.g., metadata) to read the datasets and create an integrated and standardized database.

⚠️IMPORTANT:

Original datasets must be formatted in comma-separated format (.csv);
When filling out the configuration table, please provide the exact name of a column of the original dataset and the full path of each original dataset;
Names of original datasets without a corresponding DarwinCore term must be filled as NA (not available);
The function is adjustable so that you can insert other fields in the configuration table according to your needs. In such cases, we strongly recommend that the added terms follow the Darwin Core standards.

Installation

Check here how to install the bdc package.

Read the configuration table

Read an example of the configuration table. You can download the table by clicking on the “CSV” button. Next, indicate the path to the folder containing the example datasets in the configuration table.

metadata <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv",
                              package = "bdc"),
                  show_col_types = FALSE)

Changing the path containing the datasets.

# Path to the folder containing the example datasets. For instance:
path <- "C:/Users/myname/Documents/myproject/input_files/"

# Change in the Configuration table the path to the folder in your computer containing the example datasets
metadata$fileName <-
  gsub(pattern = "https://raw.githubusercontent.com/brunobrr/bdc/master/inst/extdata/input_files/",
       replacement = path,
       x = metadata$fileName)

Standardization and integration of datasets

Now, let’s merge the different datasets into a standardized database. Note that the standardized database integrating all dataset can be saved in the folder “Output/Intermediate” as “00_merged_database” if save_database = TRUE . The database is saved with a “csv” or “qs” extension, being “qs” a helpful format for quickly saving and reading large databases. “qs” files can be read using the function “qread” from the “qs” package.

database <-
bdc_standardize_datasets(metadata = metadata,
                         format = "csv",
                         overwrite = TRUE,
                         save_database = TRUE)

#>  0sStandardizing AT_EPIPHYTES file
#>  0s 0sStandardizing BIEN file
#>  0s 0sStandardizing DRYFLOR file
#>  0s 0sStandardizing GBIF file
#>  0s 0sStandardizing ICMBIO file
#>  0s 0sStandardizing IDIGBIO file
#>  0s 0sStandardizing NEOTROPTREE file
#>  0s 0sStandardizing SIBBR file
#>  0s 0sStandardizing SPECIESLINK file
#>
#> C:/Users/Bruno R. Ribeiro/Desktop/bdc/Output/Intermediate/00_merged_database.csv was created

An example of a standardized database containing the required field to run the bdc package.

⚠️IMPORTANT:

The standardized database embodies information on species taxonomy, geolocation, date of collection, and other relevant context information. Each field is classified in three categories according to its importance to run the function: i) required, i.e., the minimum information necessary to run the function, ii) recommended, i.e., not mandatory but having important details on species records, and iii) additional, i.e., information potentially useful for detailed data analyses.

Below are listed the specifications of each field of the configuration table:

Field: Name of the fields in DatabaseInfo.csv to be filled in.
Category: Classification of each field in DatabaseInfo.csv. required (RQ), i.e., the minimum information necessary to run the function, ii) recommended (RE), i.e., not mandatory but having important details on species records, and iii) additional (AD), i.e., information potentially useful for detailed data analyses. As general guidance, be careful to include all required fields and supply as many recommended and additional fields as possible.
Description: Description of the content of the specified field in the original database.
Type: Type of content data on the specified field in the original database.
Example: An example of a single content on the specified field in the original database.

config_description <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo_description.csv", package = "bdc"), show_col_types = FALSE)

Standardization and integration of different datasets

2022-08-17

Introduction

Installation

Read the configuration table

Standardization and integration of datasets