The GGIR function
GGIR comes with a large number of functions and optional settings
(function arguments also referred to as parameters) that together form a
processing pipeline. To ease interacting with GGIR, it has one function
to act as interface to all functionality, and this function is also
named GGIR
. You only need to learn how to work with this
central function, but it is important to understand that this function
interacts with all other functions in GGIR.
Further, it is important to understand that the
GGIR package
is structured in two complementary ways:
- Parts to reflect computational order.
- Parameter themes to reflect functional components.
Parts of the pipeline
GGIR, has a computational structure of six parts which are applied sequentially to the data:
- Part 1: Loads the data and stores derived signal features or also known as metrics as needed for the following parts.
- Part 2: Basic data quality assessment and description per day and per file.
- Part 3: Estimation of sustained inactivity bouts (rest periods) and sleep, needed for input to Part 4.
- Part 4: Labels the sustained inactive periods derived in Part 3 as sleep, or daytime sustained inactivity, per night and per file.
- Part 5: Compiles time series with classification of sleep and physical activity intensity categories by re-using information derived in part 2, 3 and 4. Next, Part 5 generates a describe summary such as total time in intensity categories, the number of bouts, time spent in bouts and average acceleration per category, and fragmentation.
- Part 6: Facilitates analyses that rely on direct access to the full recording such as household co-analysis and circadian rhythm analysis. The reason why it split up in parts is that it avoids having the re-run all preceding parts if you only want to make a small change in the more downstream parts. The specific order and content of the parts has grown for historical with Part 1 and 2 created in 2011-2013, Part 3 and 4 created in 2013-2015, Part 5 created in 2017-2020, and Part 6 created in 2023-2024.
Milestone data
Each part when run stores its intermediate data, which we refer to as
the milestone data, in an .RData
file. This
.RData
file is then read by the next part. Therefore, there
is no direct interaction between the parts, all exchange of information
works via the .RData
files. The advantage of this design is
that it offers internal modularity to ease re-processing and code
development.
The .RData
files are stored in sub-folders of the output
folder:
- Part 1:
meta/basic
- Part 2:
meta/ms2.out
- Part 3:
meta/ms3.out
- Part 4:
meta/ms4.out
- Part 5:
meta/ms5.out
- Part 6:
meta/ms6.out
As user, you are unlikely to ever need to interact with these
milestone data because all relevant output for you is stored in csv and
pdf files in the output folder results
. However, it can be
useful to know where the milestone data are stored for the following
reasons:
- By copying these
.RData
files to a new computer, you may continue your analyses without access to the original data files. This can for example be helpful if you process subsets of the same study on different computers, pooling the resulting milestone data allows you to finalise your analysis on a single computer. - When you run into a problem, these
.RData
files may allow you to share the problem as a reproducible example of the problem without having to share the original data file.
Parameter themes
The arguments of all GGIR package functions can be used as input argument to the GGIR function. We will refer to these arguments as input parameters in this documentation. The parameters are structured thematically, independently of the five parts they are used in:
- params_rawdata: parameters related to handling the raw data such as resampling or calibrating.
- params_metrics: parameters related to aggregating the raw data to epoch level summary metrics.
- params_sleep: parameters related to sleep detection.
- params_physact: parameters related to physical activity.
- params_247: parameters related to 24/7 behaviours that do not fall into the typical sleep or physical activity research category.
- params_output: parameters relating to how and whether output is stored.
- params_general: general parameters not covered by any of the above categories
There are a couple of ways to inspect the parameters in each parameter category and their default values:
Parameters vignette:
Documentation for all parameters in the parameter objects can be found the vignette: GGIR configuration parameters.
From R command line:
library(GGIR)
print(load_params())
If you are only interested in one specific category like sleep:
library(GGIR)
print(load_params()$params_sleep)
If you are only interested in parameter HASIB.algo
from
the
sleep_params
object:
library(GGIR)
print(load_params()$params_sleep[["HASIB.algo"]])
All of these parameters are accepted as parameter to function
GGIR
, because GGIR
is a shell around all GGIR
functionality. However, the params_
objects themselves
cannot be provided as input to GGIR
.
Configuration file.
GGIR stores all the parameter values as used in a csv-file named
config.csv. The file is stored after each run of GGIR in the root of the
output folder overwriting any existing config.csv file. So, if you would
like to add annotations in the file, e.g. in the fourth column, then you
will need to store it somewhere outside the output folder and explicitly
specify the path with parameter configfile
. Further, the
config.csv file itself is accepted as input to GGIR
with
parameter configfile
to replace the specification of all
the parameters, except datadir
and outputdir
,
see example below.
library(GGIR)
GGIR(datadir = "C:/mystudy/mydata",
outputdir = "D:/myresults",
configfile = "D:/myconfigfiles/config.csv")
The practical value of this is that it eases the replication of analysis, because instead of having to share you R script with colleagues, sharing your config.csv file will be sufficient.
Parameter extraction order
If parameters are provided in the GGIR call, then it always uses
those. If parameters are not provided in the function call, GGIR checks
whether there is a config.csv file either in the output folder or
specified via parameter configfile
and loads those values.
If a parameter is neither specified in the GGIR function call and not
available in the config.csv file then GGIR uses its default values which
can be inspected as discussed in the section above. Here, it is
important to realise that a consequence of this logic is that GGIR will
not revert to its default parameter values in a repeated analysis unless
you remove the parameter from the function call and delete the
config.csv file.
To ensure this is clear here are some example:
- If GGIR is used for the first time without specifying parameter
mvpathreshold
then it will use the default value, which is 100. - If you specify
mvpathreshold = 120
then GGIR will use this instead and store it in the config.csv file. - If you run GGIR again but this time delete
mvpathreshold = 120
from your GGIR call then GGIR will fall back on the value 120 as now stored in the config.csv file. - If you delete the config.csv file and run GGIR again then the value 100 will be used again.
Input files
Raw data
- GGIR currently works with the following accelerometer brands and formats:
- GENEActiv .bin
- Axivity AX3 and AX6 .cwa
- ActiGraph .csv and .gt3x (.gt3x only the newer format generated with firmware versions above 2.5.0. Serial numbers that start with “NEO” or “MRA” and have firmware version of 2.5.0 or earlier use an older format of the .gt3x file). If you want to work with .csv exports via the commercial ActiLife software then note that you have the option to export data with timestamps, which should be turned off. To cope with the absence of timestamps GGIR will calculate timestamps from the sample frequency, the start time and start date as presented in the file header.
- Movisens with data stored in folders.
- Any other accelerometer brand that generates csv output, see
documentation for functions
read.myacc.csv
and parameterrmc.noise
in the vignette Reading csv files with raw data in GGIR.
Externally derived epoch-level data
By default GGIR assumes that the data is raw as discussed at the start of this chapter. However, for some studies raw data is not available and all we have is an epoch level aggregate. For example, done by external software or done inside the accelerometer device. Although this can introduce some severe limitations to the transparency and flexibility of our analysis, GGIR makes an attempt to facilitate the analysis of these aggregations of the raw data. Please find below an overview of the file format currently facilitated:
Sensor brand | Data format | Combine with parameter |
---|---|---|
Actiwatch | .csv |
dataFormat = "actiwatch_csv"
extEpochData_timeformat = "%m/%d/%Y %H:%M:%S"
|
Actiwatch | .awd |
dataFormat = "actiwatch_awd"
extEpochData_timeformat = "%m/%d/%Y %H:%M:%S"
|
ActiGraph | .csv (count files, not be confused with csv raw data files) |
dataFormat = "actigraph_csv"
extEpochData_timeformat = "%m/%d/%Y %H:%M:%S"
|
Axivity | UKBiobank csv export of epoch level data | dataFormat = "ukbiobank_csv" |
To process these files GGIR loads their content and saves it as GGIR part 1 milestone data, essentially fooling the rest of GGIR to think that GGIR part 1 created the data. As will be discussed in chapter 2, GGIR does non-wear detection in two steps: The first step is done in part 1 and the second step is done in part 2. In relation to externally derived epoch data non-wear is detected by looking for consecutive zeros of one hour (Actiwatch, ActiGraph) or derived from the file (UK Biobank csv).
- All accelerometer data that need to be analysed should be stored in one folder or subfolders of that folder. Make sure the folder does not contain any files that are not accelerometer data.
- Choose an appropriate name for the folder, preferable with a reference to the study or project it is related to rather than just ‘data’, because the name of this folder will be used as an identifier of the dataset and integrated in the name of the output folder GGIR creates.
How to run your analysis?
The bare minimum input needed for GGIR
is:
Argument datadir
allows you to specify where you have
stored your accelerometer data and outputdir
allows you to
specify where you would like the output of the analyses to be stored.
This cannot be equal to datadir
. If you copy paste the
above code to a new R script (file ending with .R) and Source it in
R(Studio) then the dataset will be processed and the output will be
stored in the specified output directory.
Next, we can add argument mode
to tell GGIR which
part(s) to run, e.g. mode = 1:5
tells GGIR to run all five
parts and is the default. With argument overwrite
we can
tell GGIR whether to overwrite previously produced milestone data.
Further, argument idloc
tells GGIR where to find the
participant ID. The default setting most likely does not work for most
data formats, by which it is important that you tailor the value of this
argument to your study setting.
GGIR stores some of its output csv file format with by default comma
as column separator. However, this can be modified by argument
sep_reports
.
Each chapter in this digital book will highlight the key arguments related to specific topics.
From the R console on your own desktop/laptop
Create an R-script and put the GGIR call in it. Next, you can source
the R-script with the source
button in RStudio:
source("pathtoscript/myshellscript.R")
GGIR by default support multi-thread processing which is used to
process one input file per process, which can be turned off by setting
parameter do.parallel = FALSE
.
In a cluster
If processing data with GGIR on your desktop/laptop is not fast enough then we advise using a GGIR on a computing cluster. The way we did it on a Sun Grid Engine cluster is shown below, please note that some of these commands are specific to the computing cluster you are working on. Please consult your local cluster specialist to explore how to run GGIR on your cluster. Here, we only share how we did it. We had three files for the SGE setting:
submit.sh
for i in {1..707}; do
n=1
s=$(($(($n * $[$i-1]))+1))
e=$(($i * $n))
qsub /home/nvhv/WORKING_DATA/bashscripts/run-mainscript.sh $s $e
done
run-mainscript.sh
#! /bin/bash
#$ -cwd -V
#$ -l h_vmem=12G
/usr/bin/R --vanilla --args f0=$1 f1=$2 < /home/nvhv/WORKING_DATA/test/myshellscript.R
myshellscript.R
options(echo=TRUE)
args = commandArgs(TRUE)
if(length(args) > 0) {
for (i in 1:length(args)) {
eval(parse(text = args[[i]]))
}
}
GGIR(f0=f0,f1=f1,...)
You will need to update the ...
in the last line with
the parameters you used for GGIR
. Note that
f0=f0,f1=f1
is essential for this to work. The values of
f0
and f1
are passed on from the bash script.
Once this is all setup you will need to call bash submit.sh
from the command line. With the help of computing clusters GGIR has
successfully been run on some of the world’s largest accelerometer data
sets such as UK Biobank and German NAKO study.
Processing time
The time to process a typical seven-day recording should be anywhere in between 3 and 10 minutes depending on the sample frequency of the recording, the sensor brand, data format, the exact configuration of GGIR, and the specifications of your computer. If you are observing processing times of 20 minutes or longer for a seven-day recording then probably you are slowed down by other factors.
Some tips on how you may be able to address this:
- Make sure the data you process is on the same machine as where GGIR is run. Processing data located somewhere else on a computer network can substantially slow down software.
- Make sure your machine has 8GB or more RAM memory, using GGIR on old
machines with only 4GB is known to be slow. However, total memory is not
the only bottle neck, also consider the number of processes (threads)
your CPU can run relative to the amount of memory. Ending up with 2GB
per process seems a good target. It can be helpful to turn off parallel
processing with
do.paralllel = FALSE
. - Avoid doing other computational activities with your machine while running GGIR. For example, if you use DropBox or OneDrive make sure they do not sync while you are running GGIR. When using GGIR to process large datasets it is probably best not to use the machine. Make sure the machine is configured not to automatically turn off after X hours as that would terminate GGIR. Further, you may like to configure the machine to not fall asleep as that would pause GGIR.
- Reduce the amount of data GGIR loads in memory with parameter
chunksize
, which can be useful on machines with limited memory or when processing many files in parallel. Achunksize
value of 0.2 (minimum) will make GGIR load the data in chunks of only 20% of the size relative to the chunks it loads by default which is approximately 12 hours of data for the auto-calibration routine and 24 hours of data for the calculation of signal metrics.
GGIR output
GGIR always creates an output folder in the location as specified
with parameter outputdir
. The output folder name is
constructed as output_
followed by the name of the dataset
which is derived from the most distal folder name of the data directory
as specified with datadir
. We recommend this approach
because it ensures that output folder name and data directory have
matching names. However, it is possible to use datadir
to
specify a vector with paths to individual files, which may helpful if
you want to process a set of files but are not in a position to move
them to a new folder. In that scenario you will need to set parameter
studyname
to tell GGIR what the dataset name is.
Inside the output folder GGIR will create two subfolders:
meta
and results
as discussed earlier in this
chapter. Inside results
you will find folder named
QC
(Quality Checks). The name QC (Quality Checks) is
possibly somewhat confusing. Data quality checks are best started with
the files stored in the results folder, where the files in the QC
subfolder offer complementary information to help with the quality
check.
GGIR generates reports in the parts 2, 4, 5, and 6 of the pipeline.
With parameter do.report
you can specify for which of these
parts you want the reports to be generated. For example,
do.report = c(2, 5)
will only generated the report for
parts 2 and 5. For a full GGIR analyses we expect at least the following
output files:
Output files in results subfolder:
File path | Resolution | GGIR part |
---|---|---|
part2_summary.csv | one row per recording | 2 |
part2_daysummary.csv | one row per day | 2 |
part4_summary_sleep_cleaned.csv | one row per recording | 4 |
part4_nightsummary_sleep_cleaned.csv | one row per night | 4 |
visualisation_sleep.pdf | 40 recordings per page | 4 |
part5_daysummary_WW_L40M100V400_T5A5.csv | one row per day | 5 |
part5_personsummary_WW_L40M100V400_T5A5.csv | one row per recording | 5 |
Output files in results/QC subfolder:
File path | Resolution | GGIR part |
---|---|---|
data_quality_report.csv | one row per recording | 5 |
part4_summary_sleep_full.csv | one row per recording | 4 |
part4_nightsummary_sleep_full.csv | one row per night | 4 |
part5_daysummary_full_WW_L40M100V400_T5A5.csv | one row per day | 5 |
plots_to_check_data_quality_1.pdf | one recording per page | 2 |
Output files in meta/ms5.outraw subfolder:
File path | Resolution | GGIR part |
---|---|---|
40_100_400/ | one file with time series per recording | |
behavioralcodes2023-11-06.csv | dictionary for behavioural codes as used in the time series file |