The following was presented at the Skidegate Council of the Haida Nation office on August 14th, 2017.
It is also provided below.
Principles of Data Management (for Biologists)
Biologists spend millions of dollars collecting data with little regard for its management.
Study design should precede data management
- Identify question(s)
  - what do we want to know and why?
- Assess existing data/understanding
  - what do we already know?
- Develop field protocol
  - how much will it cost?
  - how useful is the answer likely to be?
Once a study design has been developed, data management begins.
Data management cycles through the 10 stages of
- data collection
- data backup
- data security
- data digitization
- data cleansing
- data tidying
- data documentation
- data analysis
- data reporting
- data archiving
Field crews should be trained, informed, and provided with standard protocols and data collection forms.
Printed forms on waterproof paper provide a cheap robust solution.
Duplicate data as soon as possible.
A smartphone camera is a simple way to duplicate data and sync to the cloud.
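Duplication only helps if the copy matches the original. The following is a minimal Python sketch, not a prescribed workflow, of copying a file to a second location and confirming the two are identical by comparing SHA-256 checksums. The function names, the `backup` folder and the file name `form_001.jpg` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Demonstration: a temporary file stands in for a photographed field form.
with tempfile.TemporaryDirectory() as tmp:
    photo = Path(tmp) / "form_001.jpg"  # hypothetical file name
    photo.write_bytes(b"scanned field form")
    copy = backup_file(photo, Path(tmp) / "backup")
    backup_ok = copy.exists() and sha256_of(copy) == sha256_of(photo)

print("backup verified:", backup_ok)
```

In practice the backup folder would be one that syncs to the cloud, so the verified copy also leaves the device.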
Ensure the right people have access.
Dropbox provides simple data security and sharing.
Get the data into a usable electronic form.
Excel is a useful data entry tool in the hands of a trained user.
Correct the inevitable errors.
At best, errors add noise; at worst, they invalidate subsequent analyses!
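Many errors can be caught automatically by checking each value against the range it could plausibly take. The sketch below is one simple way to do this in Python. The records, the limits and the function name `find_errors` are all hypothetical, and real checks would depend on the field protocol.

```python
def find_errors(records, limits):
    """Return (row_number, variable, value) for each missing or out-of-range value."""
    problems = []
    for row_number, record in enumerate(records, start=1):
        for variable, (lowest, highest) in limits.items():
            value = record.get(variable)
            if value is None or not (lowest <= value <= highest):
                problems.append((row_number, variable, value))
    return problems


# Hypothetical digitized records; three contain deliberate errors.
records = [
    {"Site": "A", "Depth": 3.2, "Hour": 9},
    {"Site": "B", "Depth": -1.0, "Hour": 14},  # negative depth
    {"Site": "C", "Depth": 5.1, "Hour": 25},   # hour outside 0 to 23
    {"Site": "D", "Depth": None, "Hour": 7},   # missing depth
]

# Hypothetical plausible limits for each variable (inclusive).
limits = {"Depth": (0.0, 100.0), "Hour": (0, 23)}

errors = find_errors(records, limits)
for row_number, variable, value in errors:
    print(f"row {row_number}: {variable} = {value!r}")
```

Flagged values should be checked against the original field forms rather than silently altered.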
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
SQLite (https://sqlite.org) is free, open-source, cross-platform, embedded database software.
From R For Data Science available via CC BY-NC-ND 3.0 US.
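As an illustration of how tidy tables map onto SQLite, the sketch below uses Python's built-in `sqlite3` module to create one table per type of observational unit (sites and visits). The table names, columns and values are hypothetical, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

# An in-memory database; pass a file name instead to keep the data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# One table per type of observational unit, one column per variable.
conn.executescript("""
CREATE TABLE Site (
  Site TEXT PRIMARY KEY,
  Depth REAL NOT NULL CHECK (Depth >= 0)
);
CREATE TABLE Visit (
  Site TEXT NOT NULL REFERENCES Site (Site),
  VisitDate TEXT NOT NULL,
  Hour INTEGER NOT NULL CHECK (Hour BETWEEN 0 AND 23),
  PRIMARY KEY (Site, VisitDate, Hour)
);
""")

# Each row is one observation (hypothetical values).
conn.executemany("INSERT INTO Site VALUES (?, ?)", [("A", 3.2), ("B", 5.1)])
conn.executemany(
    "INSERT INTO Visit VALUES (?, ?, ?)",
    [("A", "2017-08-01", 9), ("A", "2017-08-02", 14), ("B", "2017-08-01", 10)],
)
conn.commit()

# Tidy tables are easy to join for analysis.
rows = conn.execute("""
SELECT Visit.Site, Visit.VisitDate, Visit.Hour, Site.Depth
FROM Visit JOIN Site ON Site.Site = Visit.Site
ORDER BY Visit.Site, Visit.VisitDate
""").fetchall()

for row in rows:
    print(row)
```

The `CHECK` and `REFERENCES` constraints let the database itself reject impossible values, such as a visit to a site that was never recorded.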
Data are just numbers and categories unless people know what they mean.
A simple metadata table can provide a description and units for each variable:
|Table|Variable|Units|Description|
|Site|Depth|m|The tidally corrected depth|
|Visit|Hour|PST8PDT|The hour of the visit|
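A metadata table is most useful when it is complete. The following Python sketch assumes each metadata row holds a table name, a variable name, units and a description, and lists any variables that have no metadata entry. The `columns` dictionary and the function name `undocumented` are hypothetical.

```python
# Metadata rows: (table, variable, units, description), as in the example above.
metadata = [
    ("Site", "Depth", "m", "The tidally corrected depth"),
    ("Visit", "Hour", "PST8PDT", "The hour of the visit"),
]

# Hypothetical listing of the variables actually present in each data table.
columns = {
    "Site": ["Site", "Depth"],
    "Visit": ["Site", "Hour"],
}


def undocumented(columns, metadata):
    """Return (table, variable) pairs that have no metadata row."""
    documented = {(table, variable) for table, variable, _, _ in metadata}
    return [
        (table, variable)
        for table, variables in columns.items()
        for variable in variables
        if (table, variable) not in documented
    ]


missing = undocumented(columns, metadata)
for table, variable in missing:
    print(f"No metadata for {table}.{variable}")
```

Here the check reports that the `Site` identifier in both tables still needs a description.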
Analytic code can be shared on GitHub.
The province already has a GitHub account for sharing code.
An answer only has value if decision-makers are aware of it.
Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources.
ResearchGate is a free way to share and discover research.
Ensure others are able to use it in perpetuity.
Zenodo is free, citeable, discoverable, long-term, with open, restricted and closed access options.
It uses the same cloud infrastructure as CERN’s own Large Hadron Collider (LHC) research data.
Data management requires trained personnel with an understanding of the principles, but it does not have to be expensive and pays for itself many times over.
The provincial government has DataBC.
The federal and provincial sites use CKAN.
CKAN is the world’s leading Open Source data portal platform.
It is free and open source, with support for teams and private data.
A key feature is an API (application programming interface) that allows code to interact with the repository.
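As a rough illustration of what interacting with the repository can look like, the Python sketch below builds a request to CKAN's `package_search` action and extracts dataset titles from the JSON reply. The portal address `https://example.org`, the search term and the sample reply are placeholders, and the helper names are hypothetical. The network call is wrapped in a function that is not run here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(portal, query, rows=5):
    """Build the URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Extract dataset titles from a decoded package_search reply."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    return [dataset["title"] for dataset in response["result"]["results"]]


def search(portal, query, rows=5):
    """Query a live CKAN portal (requires network access)."""
    with urlopen(search_url(portal, query, rows), timeout=30) as reply:
        return dataset_titles(json.load(reply))


# Placeholder portal address; substitute the address of a real CKAN site.
url = search_url("https://example.org/", "salmon escapement")
print(url)

# A made-up reply in the shape CKAN returns, used here instead of a live call.
sample_reply = {
    "success": True,
    "result": {"count": 1, "results": [{"title": "Example dataset"}]},
}
titles = dataset_titles(sample_reply)
print(titles)
```

Because the same request works from any language that can fetch a URL, analyses can pull their input data directly from the repository instead of relying on manually downloaded copies.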