The following was presented at the Skidegate Council of the Haida Nation office on August 14th, 2017.
It is also provided below.
Principles of Data Management (for Biologists)
Introduction
Biologists spend millions of dollars collecting data with little regard for its management.
Study Design
Study design should precede data management.
- Identify question(s)
  - what do we want to know and why?
- Assess existing data/understanding
  - what do we already know?
- Develop field protocol
  - how much will it cost?
  - how useful is the answer likely to be?
Data Management
Once a study design has been developed, data management begins.
Data management cycles through 10 stages:
- data collection
- data backup
- data security
- data digitization
- data cleansing
- data tidying
- data documentation
- data analysis
- data reporting
- data archiving
Data Collection
Field crews should be trained, informed, and provided with standard protocols and data collection forms.
Printed forms on waterproof paper provide a cheap, robust solution.
Data Backup
Duplicate data as soon as possible.
A smartphone camera is a simple way to duplicate data and sync to the cloud.
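Whatever tool makes the duplicate, it is worth confirming that the copy matches the original. A minimal sketch, assuming the photographed or scanned forms are ordinary files on disk; the function names `sha256_of` and `backup_file` are hypothetical and only for illustration:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large photos or scans do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target
```

Pointing `backup_dir` at a folder that is synced to the cloud would give an off-site copy as well.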
Data Security
Ensure the right people have access.
Dropbox provides simple data security and sharing.
Data Digitization
Get the data into a usable electronic form.
Excel is a useful data entry tool in the hands of a trained user.
Data Cleansing
Correct the inevitable errors.
At best, errors add noise; at worst, they invalidate subsequent analyses!
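Many entry errors can be caught with simple range checks before any analysis is run. The following is a sketch only: the records, the `depth_m` column and the 0 to 20 m plausible range are all made-up assumptions for illustration.

```python
def find_suspect_rows(rows, column, minimum, maximum):
    """Return (row_number, value) pairs that are missing, non-numeric or out of range."""
    suspects = []
    for number, row in enumerate(rows, start=1):
        raw = row.get(column)
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # Missing or non-numeric entries cannot be range checked.
            suspects.append((number, raw))
            continue
        if not minimum <= value <= maximum:
            suspects.append((number, raw))
    return suspects


# Hypothetical records as they might come out of a data entry spreadsheet.
records = [
    {"site": "A", "depth_m": "3.2"},
    {"site": "B", "depth_m": "32"},  # possibly a misplaced decimal point
    {"site": "C", "depth_m": ""},    # value never entered
    {"site": "D", "depth_m": "4.1"},
]

suspects = find_suspect_rows(records, "depth_m", minimum=0, maximum=20)
print(suspects)  # [(2, '32'), (3, '')]
```

Flagged rows are best checked against the original field forms rather than silently corrected or dropped.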
Data Tidying
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
SQLite (https://sqlite.org) is free, open-source, cross-platform, embedded database software.
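As an illustration of tidying, the sketch below reshapes spreadsheet-style records with one column per year into one row per observation and stores them in SQLite using Python's built-in `sqlite3` module. The `Observation` table, its columns and the counts are hypothetical.

```python
import sqlite3

# Hypothetical "wide" records: one column per year, as often entered in a spreadsheet.
wide_rows = [
    {"site": "A", "2015": 12, "2016": 15},
    {"site": "B", "2015": 7, "2016": 9},
]

# Reshape so that each row is one observation: a count at a site in a year.
tidy_rows = [
    (row["site"], int(year), individuals)
    for row in wide_rows
    for year, individuals in row.items()
    if year != "site"
]

connection = sqlite3.connect(":memory:")  # pass a file path instead to keep the database
connection.execute(
    "CREATE TABLE Observation ("
    "Site TEXT NOT NULL, "
    "Year INTEGER NOT NULL, "
    "Individuals INTEGER NOT NULL CHECK (Individuals >= 0), "
    "PRIMARY KEY (Site, Year))"
)
connection.executemany("INSERT INTO Observation VALUES (?, ?, ?)", tidy_rows)
connection.commit()

# Once the data are tidy, summaries become a single query.
totals = connection.execute(
    "SELECT Year, SUM(Individuals) FROM Observation GROUP BY Year ORDER BY Year"
).fetchall()
print(totals)  # [(2015, 19), (2016, 24)]
```

The primary key and `CHECK` constraint mean the database itself rejects duplicate or negative entries.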
Relational Data
[Figure from R for Data Science, available under CC BY-NC-ND 3.0 US.]
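In a relational database, tables are linked by shared key columns. A minimal sketch, assuming hypothetical `Site` and `Visit` tables with made-up values, in which every visit must refer to a known site:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
connection.executescript(
    """
    CREATE TABLE Site (
        Site TEXT PRIMARY KEY,
        Depth REAL NOT NULL CHECK (Depth >= 0)
    );
    CREATE TABLE Visit (
        Site TEXT NOT NULL REFERENCES Site (Site),
        Date TEXT NOT NULL,
        Hour INTEGER NOT NULL CHECK (Hour BETWEEN 0 AND 23),
        PRIMARY KEY (Site, Date, Hour)
    );
    """
)
connection.executemany("INSERT INTO Site VALUES (?, ?)", [("A", 3.2), ("B", 4.1)])
connection.executemany(
    "INSERT INTO Visit VALUES (?, ?, ?)",
    [("A", "2017-08-01", 9), ("A", "2017-08-02", 14), ("B", "2017-08-01", 11)],
)

# The shared Site key lets each visit be joined to the depth of its site.
joined = connection.execute(
    "SELECT Visit.Site, Visit.Date, Visit.Hour, Site.Depth "
    "FROM Visit JOIN Site ON Visit.Site = Site.Site "
    "ORDER BY Visit.Site, Visit.Date"
).fetchall()

# A visit to a site that is not in the Site table is rejected.
try:
    connection.execute("INSERT INTO Visit VALUES ('Z', '2017-08-03', 10)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Storing the depth once per site, rather than repeating it on every visit, means a correction only has to be made in one place.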
Data Documentation
Data are just numbers and categories unless people know what they mean.
A simple metadata table can provide a description and units for each variable:

| Table | Column | Units | Description |
|---|---|---|---|
| Site | Depth | m | The tidally corrected depth |
| Visit | Hour | PST8PDT | The hour of the visit |
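A metadata table is most useful when it is checked against the data it describes. The sketch below holds the two metadata rows above in a dictionary and reports any data column that lacks an entry; the `Substrate` column is a hypothetical example of an undocumented variable.

```python
# Metadata: one entry per (table, column), with units and a description.
metadata = {
    ("Site", "Depth"): {"units": "m", "description": "The tidally corrected depth"},
    ("Visit", "Hour"): {"units": "PST8PDT", "description": "The hour of the visit"},
}

# Hypothetical columns actually present in each data table.
data_columns = {
    "Site": ["Depth", "Substrate"],
    "Visit": ["Hour"],
}


def undocumented_columns(data_columns, metadata):
    """Return the (table, column) pairs that have no metadata entry."""
    return [
        (table, column)
        for table, columns in data_columns.items()
        for column in columns
        if (table, column) not in metadata
    ]


missing = undocumented_columns(data_columns, metadata)
print(missing)  # [('Site', 'Substrate')]
```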
Data Analysis
Analytic code can be shared on GitHub.
bcgov
The province already has a GitHub account for sharing code.
Data Reporting
An answer only has value if decision-makers are aware of it.
Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources.
ResearchGate is a free way to share and discover research.
Data Archiving
Ensure others are able to use the data in perpetuity.
Zenodo is free, citable, discoverable, and long-term, with open, restricted and closed access options.
It uses the same cloud infrastructure as CERN’s own Large Hadron Collider (LHC) research data.
Summary
Data management requires trained personnel who understand the principles, but it does not have to be expensive and pays for itself many times over.
DFO
Parks
DataBC
The provincial government has DataBC.
CKAN
The federal and provincial sites use CKAN.
CKAN is the world’s leading Open Source data portal platform.
It is free and open source, with support for teams and private data.
A key feature is an API (application programming interface) that allows code to interact with the repository.
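As a rough sketch of what interacting through the API can look like, the code below searches a CKAN portal for datasets using CKAN's `package_search` action. The helper names and the base URL `https://ckan.example.org` are placeholders rather than a real portal; substitute the address of the portal you want to query.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """Build the URL for CKAN's package_search action."""
    parameters = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{parameters}"


def dataset_titles(response: dict) -> list:
    """Extract dataset titles from a decoded package_search response."""
    if not response.get("success"):
        raise RuntimeError("CKAN reported that the request failed")
    return [dataset["title"] for dataset in response["result"]["results"]]


def search_datasets(base_url: str, query: str, rows: int = 5) -> list:
    """Query a CKAN portal over HTTP and return the matching dataset titles."""
    with urlopen(package_search_url(base_url, query, rows), timeout=30) as reply:
        return dataset_titles(json.load(reply))
```

A call such as `search_datasets("https://ckan.example.org", "herring")` would then return up to five matching dataset titles, assuming the portal is reachable and allows anonymous searches.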