Tutorial 01 - Principles of careful (neural) data analysis

This module sketches some of the overall principles for careful neural data analysis.

  1. Keep data and analysis secure using servers and version control
  2. Look at the data
  3. Start simple
  4. Test and document acquisition and analysis code

1. Keep data and analysis secure

One of the worst things that can happen to researchers is the loss of data or of hours/days of analysis. Although there is not much we can do about floods in the lab or unplanned subject deaths, we can reduce the risk of losing data once it has been acquired by saving it in multiple places and on secure servers. You should never run your analysis on the raw data without first making a copy, so that the raw (or promoted) data is always available to return to if needed. In other words, analyses should always operate on a copy, the "working data".

Similarly, analyses can take a lot of time and effort and should be protected against loss. The best way to do this is to use a version control system such as git, which tracks changes to your files over time and allows you to return to previously committed versions of your code. Our lab uses GitHub to store the repositories for different analyses.

2. Look at the data

Analysis is meaningless if performed on bad data. But even if you start out with good data, many analysis steps have the power to corrupt it.

Make a conscious effort to look at your data at every step to ensure you are working with what you expect. Two habits that help with this are visualization (explored in Module03) and testing (both unit testing and tests with fake data).

To see why this principle is critical, consider an analogy: a complex multistep experimental procedure such as surgically implanting a recording probe into the brain. In this setting, the surgeon always verifies the success of the previous step before proceeding. It would be disastrous to attempt to insert a delicate probe without making sure the skull bone is removed first; or to apply dental cement to the skull without first making sure the craniotomy is sealed!

Start by looking at the raw data to ensure it is as expected, without contamination or missing pieces. Then, apply the same mindset during your analysis to confirm the success of every step before proceeding.
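As a minimal sketch of this habit, here is a quick look at a short stretch of the raw LFP used in the worked example below (the 5-second window is arbitrary, and the code assumes the loaded object exposes time and data arrays):

import os
import matplotlib.pyplot as plt
import nept

# Load the same LFP file used in the worked example below
data_folder = os.path.join(os.path.abspath('.'), 'data', 'R042-2013-08-18')
lfp = nept.load_lfp(os.path.join(data_folder, 'R042-2013-08-18-CSC11a.ncs'))

# Plot a short window of the raw trace to check for dropouts, clipping or artifacts
snippet = lfp.time_slice(3238.7, 3243.7)
plt.plot(snippet.time, snippet.data)
plt.xlabel('Time (s)')
plt.ylabel('LFP')
plt.show()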

3. Start simple

Before starting data collection, you should identify the major steps in your data processing "workflow", that is, the flow from raw data to the figures in the resulting publication. Doing this often highlights key dependencies and important controls, and helps you collect the data in a way that lets you thoroughly test what you set out to do.

This sort of planning is especially important when performing experiments with long timelines, such as when chronically implanting animals for in vivo recording, where it may take up to two months to collect data from a single animal.

There are two main steps to this planning process:

  1. Create a schematic that illustrates your analysis workflow

Think in terms of raw data and data transformations to illustrate a workflow at a conceptual level.

For example, let's create a workflow that seeks to determine whether the number of sharp wave-ripple (SWR) events (candidate "replay" events in the hippocampus) depends on an experimental manipulation.

The workflow above shows how SWR times are derived from the raw local field potential (LFP) data, and how the event of interest is one component of all the task events. The number of SWR events is then determined by finding the overlap between the SWR times and the event of interest.

  2. Organize your data analysis workflow into pseudocode that can later be implemented in Python

For the workflow above, it might look something like:

# Load LFP (.ncs)

# Slice lfp to the experiment time of interest

# Find SWR epochs using a Hilbert transform with a filter of 150-220Hz on the sliced_lfp

# Find epochs during the event_of_interest

# Find intersection between the swr_epochs and the event_of_interest

# Count the number of epochs

With this pseudocode, you can decide which parts are best done with a function, and which are easiest to access using a method on the data type. Many of these steps already exist as functions or methods in the nept codebase, but making the analysis steps explicit in this way provides a good foundation for well-organized code.

In [1]:
import os
import nept

# Load LFP (.ncs)
data_path = os.path.join(os.path.abspath('.'), 'data')
data_folder = os.path.join(data_path, 'R042-2013-08-18')
data_lfp = 'R042-2013-08-18-CSC11a.ncs'

lfp = nept.load_lfp(os.path.join(data_folder, data_lfp))

# Slice lfp to the experiment time of interest
start = 3238.7
stop = 5645.2
lfp_sliced = lfp.time_slice(start, stop)

# Find SWR epochs using a Hilbert transform with a filter of 150-220Hz on the sliced_lfp
z_thresh = 3.0
power_thresh = 5.0
merge_thresh = 0.02
min_length = 0.01
thresh = (150, 220)
fs = 2000
swr = nept.detect_swr_hilbert(lfp_sliced,
                              fs=fs,
                              thresh=thresh,
                              z_thresh=z_thresh,
                              power_thresh=power_thresh,
                              merge_thresh=merge_thresh,
                              min_length=min_length)

# Find epochs during the event_of_interest
# Let's say our events of interest are the first and last 200 seconds of the task time
event_time = 200
event_of_interest = nept.Epoch([[start, start+event_time], [stop-event_time, stop]])

# Find intersection between the swr_epochs and the event_of_interest
swr_of_interest = swr.intersect(event_of_interest)

# Count the number of epochs
print('Number of SWR events during task time of interest:', swr_of_interest.n_epochs)
Number of SWR events during task time of interest: 66

Use good programming practice

There are many resources and opinions on what constitutes good programming practice. A good resource is Writing Idiomatic Python.

A few of the most important are:

  1. Avoid reinventing the wheel. Make use of available open-source libraries that have your desired functionality, such as those available on PyPI. In your own code, use functions effectively to make it easier to troubleshoot, re-use and extend.
  2. Test your code. Use an automated testing tool, such as pytest, for unit tests and tests with fake data (a minimal example follows this list). These tests help build confidence in your data analysis and check that changes to your code do not break its functionality.
  3. Promote readability. Good documentation can save your future self and collaborators from struggling to understand your code. But readability doesn't stop at documentation: it also means using informative variable names and idiomatic programming. The more programming you do, the better you will get at producing readable code, provided you make a deliberate effort: refactor with readability in mind every now and then, choose appropriate variable names, and document your code as you write it.
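As a minimal sketch of what such a unit test might look like, here is a pytest-style check of the epoch intersection used in the worked example above (the test name and the made-up epoch times are purely illustrative, and the expected behavior of Epoch.intersect is assumed from that example):

import nept

def test_epoch_intersect_keeps_only_overlap():
    # Two epochs: 0-10 s and 20-30 s
    swr_like = nept.Epoch([[0.0, 10.0], [20.0, 30.0]])
    # Two event windows: one overlaps the first epoch, the other overlaps nothing
    event_windows = nept.Epoch([[5.0, 15.0], [40.0, 50.0]])

    overlap = swr_like.intersect(event_windows)

    # Only the first epoch overlaps an event window
    assert overlap.n_epochs == 1

Saving this in a file named test_epochs.py and running pytest in that folder will collect and run the test automatically.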

Neuroscience labs are moving towards sharing the analysis code and raw data of published work, making it possible for anyone to generate all the results and figures in the paper. An example of this is Bekolay et al. (2014), where the Notes section gives the link to a GitHub release with very nicely organized and documented code that reproduces the results.

Primarily, this means that you need to:

  • Annotate your data. Our lab uses an info file, which contains common descriptors as well as experiment-specific information. Check out an example info file, used in the analysis of a task involving shortcuts. As a bonus, standardizing annotation systems can reduce the effort required to combine data sets, such as in [van der Meer et al. Neuron 2010](http://www.cell.com/neuron/abstract/S0896-6273(10)00507-6), where three large data sets recorded by three different people from different brain regions were combined.

  • Use relative locations for files. For code to run on another machine, it is important that the specific locations of files are not hard-coded, but instead specified relative to the location of the code (as sketched below).
  • Specify dependencies, including which version of your own code you used to generate the results. This is especially important because we often work with open-source code that is frequently updated. It is also useful to specify the exact operating system version and shared libraries, which may be best addressed using containers such as Docker (see this blogpost by Russ Poldrack for discussion). A nice way to handle this is to keep code under version control on GitHub and make a code release with an updated version number for each publication.
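A minimal sketch of the relative-path idea mentioned above, assuming a script that lives next to a data folder (inside a notebook, where __file__ is not defined, os.path.abspath('.') as in the worked example above plays the same role):

import os

# Resolve the data folder relative to this script's location,
# rather than hard-coding an absolute path specific to one machine
here = os.path.dirname(os.path.abspath(__file__))
data_folder = os.path.join(here, 'data', 'R042-2013-08-18')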

Apply appropriate statistical concepts

For most projects, careful consideration of what statistics will eventually be done should begin before you collect any data at all. This will help you determine whether your experimental design and power are appropriate to answer your question. As such, you should be aware of major statistical concepts, including:

  1. The downfalls of underfitting and overfitting: modeling noise instead of the process of interest.
  2. Cross-validation: a powerful, general-purpose tool for evaluating the "goodness" of a statistical model (and preventing overfitting).
  3. Resampling (also known as bootstrapping, shuffling or permutation testing): generating synthetic data sets based on some known distribution, usually to compare to actual data (a minimal sketch follows this list).
  4. Model comparison: the process of determining which model best describes the data.
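For example, a shuffle-based permutation test can be sketched in a few lines; the session-by-session SWR counts for the two conditions below are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Made-up SWR counts per session for two hypothetical conditions
condition_a = np.array([66, 58, 71, 63, 69])
condition_b = np.array([52, 49, 60, 55, 47])
observed_diff = condition_a.mean() - condition_b.mean()

# Shuffle the condition labels many times to build a null distribution
combined = np.concatenate([condition_a, condition_b])
n_shuffles = 10000
null_diffs = np.empty(n_shuffles)
for i in range(n_shuffles):
    shuffled = rng.permutation(combined)
    null_diffs[i] = shuffled[:len(condition_a)].mean() - shuffled[len(condition_a):].mean()

# p-value: fraction of shuffles at least as extreme as the observed difference
p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))
print('observed difference:', observed_diff, 'p =', p_value)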

4. Test and document acquisition and analysis code

Both the acquisition code and the analysis code need to be thoroughly tested and documented. Acquisition code often deals with hardware components that can malfunction, so it is important, at the beginning of every experiment, to thoroughly test that your code properly interfaces with the hardware and runs as expected.

Analysis pipelines can get complicated quickly, to the point where it becomes difficult to track down bugs or other issues. Unit tests and fake data can limit the number of potential bugs and issues. For instance, if you input Poisson (random) spike data with a constant firing rate, totally independent of your experimental conditions, your analysis had better not report a significant difference! And if you instead specify an increased firing rate during an experimental condition, your analysis should be able to detect that as well.
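As a minimal sketch of such a fake-data check (the 5 Hz rate, the 2400 s session and the split into two "conditions" are made up for illustration; in practice you would feed these spike trains through your actual analysis pipeline):

import numpy as np

rng = np.random.default_rng(0)

def make_poisson_spikes(rate, duration):
    # Spike times with a constant firing rate (Hz) over the given duration (s)
    n_spikes = rng.poisson(rate * duration)
    return np.sort(rng.uniform(0.0, duration, n_spikes))

# Fake data: constant 5 Hz firing, independent of any experimental condition
spikes = make_poisson_spikes(rate=5.0, duration=2400.0)

# Split the session into two halves standing in for two experimental conditions;
# any statistic your pipeline computes should not differ reliably between them
condition_a = spikes[spikes < 1200.0]
condition_b = spikes[spikes >= 1200.0] - 1200.0
print('firing rate A:', len(condition_a) / 1200.0)
print('firing rate B:', len(condition_b) / 1200.0)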