Principles of Data Management (for Biologists)

Joe Thorley · 2017-08-14 · 3 minute read

The following was presented at the Skidegate Council of the Haida Nation office on August 14th 2017.

It is also provided below.


Introduction

Biologists spend millions of dollars collecting data with little regard for its management.

Study Design

Study design should precede data management.

  • Identify question(s)
    • what do we want to know and why?
  • Assess existing data/understanding
    • what do we already know?
  • Develop field protocol
    • how much will it cost?
    • how useful is the answer likely to be?

Data Management

Once a study design has been developed, data management begins.

Data management cycles through the following 10 stages:

  1. data collection
  2. data backup
  3. data security
  4. data digitization
  5. data cleansing
  6. data tidying
  7. data documentation
  8. data analysis
  9. data reporting
  10. data archiving

Data Collection

Field crews should be trained, informed, and provided with standard protocols and data collection forms.

Printed forms on waterproof paper provide a cheap, robust solution.

Data Backup

Duplicate data as soon as possible.

A smartphone camera is a simple way to duplicate data and sync to the cloud.

Data Security

Ensure the right people have access.

Dropbox provides simple data security and sharing.

Data Digitization

Get the data into a usable electronic form.

Excel is a useful data entry tool in the hands of a trained user.

Data Cleansing

Correct the inevitable errors.

At best, errors add noise; at worst, they invalidate subsequent analyses!
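
As a minimal sketch of what a cleansing check might look like, the Python below flags values that fall outside a plausible range so they can be compared against the original field forms. The column name `depth_m`, the sample records, and the 0 to 100 m limits are all hypothetical and chosen only for illustration.

```python
# Hypothetical field records; the values for sites B and C are assumed
# to be data entry errors.
records = [
    {"site": "A", "depth_m": 12.5},
    {"site": "B", "depth_m": 250.0},
    {"site": "C", "depth_m": -3.0},
    {"site": "D", "depth_m": 40.0},
]


def flag_out_of_range(rows, column, lower, upper):
    """Return the rows whose value in `column` lies outside [lower, upper]."""
    return [row for row in rows if not lower <= row[column] <= upper]


# Assumed plausible range for this illustrative survey: 0 to 100 m.
suspect = flag_out_of_range(records, "depth_m", 0.0, 100.0)
for row in suspect:
    print(f"Check site {row['site']}: depth_m = {row['depth_m']}")
```

Flagging rather than silently deleting suspect values keeps the decision with someone who can consult the original data sheets.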

Data Tidying

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

Wickham 2014

SQLite (https://sqlite.org) is free, open-source, cross-platform, embedded database software.

[Figure: Relational Data. From R for Data Science, available under CC BY-NC-ND 3.0 US.]
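
One possible way to hold tidy, relational data in SQLite is sketched below using Python's built-in `sqlite3` module. The table layout, the column names `SiteName` and `VisitDate`, and the sample values are assumptions made for illustration, not a schema from the presentation.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce links between tables

# One table per type of observational unit: sites, and visits to those sites.
conn.execute("""
    CREATE TABLE Site (
        SiteName TEXT PRIMARY KEY,
        Depth    REAL NOT NULL CHECK (Depth >= 0)
    )
""")
conn.execute("""
    CREATE TABLE Visit (
        SiteName  TEXT NOT NULL REFERENCES Site (SiteName),
        VisitDate TEXT NOT NULL,
        Hour      INTEGER NOT NULL CHECK (Hour BETWEEN 0 AND 23),
        PRIMARY KEY (SiteName, VisitDate, Hour)
    )
""")

# Each row is one observation; each column is one variable.
conn.executemany("INSERT INTO Site VALUES (?, ?)", [("A", 12.5), ("B", 40.0)])
conn.executemany("INSERT INTO Visit VALUES (?, ?, ?)", [
    ("A", "2017-08-01", 9),
    ("A", "2017-08-02", 14),
    ("B", "2017-08-01", 11),
])

# Join the two tables to attach each site's depth to its visits.
rows = conn.execute("""
    SELECT Visit.SiteName, Visit.VisitDate, Visit.Hour, Site.Depth
    FROM Visit
    JOIN Site ON Site.SiteName = Visit.SiteName
    ORDER BY Visit.SiteName, Visit.VisitDate
""").fetchall()

for row in rows:
    print(row)
```

With foreign keys switched on, the database itself rejects a visit to a site that has not been entered, which catches one class of error at data entry rather than at analysis.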

Data Documentation

Data are just numbers and categories unless people know what they mean.

A simple metadata table can provide a description and units for each variable.

Table  Column  Units    Description
Site   Depth   m        The tidally corrected depth
Visit  Hour    PST8PDT  The hour of the visit
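
A metadata table is also easy to check by machine. The sketch below, in Python, holds the two metadata rows above as records and reports any data column that lacks a description. The `data_columns` listing, including the undocumented `Substrate` column, is hypothetical and included only to show the check firing.

```python
# The two metadata rows from the table above.
metadata = [
    {"Table": "Site", "Column": "Depth", "Units": "m",
     "Description": "The tidally corrected depth"},
    {"Table": "Visit", "Column": "Hour", "Units": "PST8PDT",
     "Description": "The hour of the visit"},
]

# Hypothetical listing of the columns actually present in each data table.
data_columns = {
    "Site": ["Depth", "Substrate"],
    "Visit": ["Hour"],
}


def undocumented_columns(metadata, data_columns):
    """Return (table, column) pairs that have no metadata entry."""
    documented = {(row["Table"], row["Column"]) for row in metadata}
    return [
        (table, column)
        for table, columns in data_columns.items()
        for column in columns
        if (table, column) not in documented
    ]


missing = undocumented_columns(metadata, data_columns)
print(missing)
```

Here the check reports `Substrate` in the `Site` table, because it is the only column without a metadata row.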

Data Analysis

Analytic code can be shared on GitHub.

The province already has a GitHub account, bcgov, for sharing code.

Data Reporting

An answer only has value if decision-makers are aware of it.

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources.

ResearchGate is a free way to share and discover research.

Data Archiving

Ensure others are able to use the data in perpetuity.

Zenodo is free, citable, discoverable and long-term, with open, restricted and closed access options.

It uses the same cloud infrastructure as CERN’s own Large Hadron Collider (LHC) research data.

Summary

Data management requires trained personnel with an understanding of the principles, but it does not have to be expensive and pays for itself many times over.

DFO

Parks

DataBC

The provincial government has DataBC.

CKAN

The federal and provincial sites use CKAN.

CKAN is the world’s leading Open Source data portal platform.

It is free and open source, and supports teams and private data.

A key feature is an API (application programming interface) that allows code to interact with the repository.
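
To give a flavour of what interacting with a repository through code can look like, the Python sketch below builds a request for CKAN's `package_search` action and reads dataset titles from a response. The portal address `https://ckan.example.org` is a placeholder, and the sample response is made up and trimmed to the fields used here; substitute the address of a real CKAN site to try it.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal_url, query, rows=5):
    """Build a URL for the package_search action of a CKAN portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from the JSON text returned by package_search."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Placeholder portal address (an assumption, not a real site).
url = build_search_url("https://ckan.example.org", "salmon")
print(url)

# Made-up response, trimmed to the fields this sketch reads.
sample = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "example-dataset", "title": "Example Dataset"}],
    },
})
print(dataset_titles(sample))
```

To query a live portal, the URL could be passed to `urllib.request.urlopen` and the decoded response body handed to `dataset_titles`.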