One of the key tasks in analysing mobile sensing data is being able to link it to other data.
For example, when analysing physical activity data, it could be of interest to know how much
time a participant spent exercising before or after an ESM beep to evaluate their stress level.
link() allows you to map two data frames to each other that are on different time scales,
based on a pre-specified offset before and/or after. This function assumes that both x and
y have a column called time containing DateTimeClasses.
Usage
link(
  x,
  y,
  by = NULL,
  time,
  end_time = NULL,
  y_time,
  offset_before = 0,
  offset_after = 0,
  add_before = FALSE,
  add_after = FALSE,
  name = "data",
  split = by
)
Arguments
- x, y
A pair of data frames or data frame extensions (e.g. a tibble). Both x and y must have a column called time.
- by
A character vector indicating the variable(s) to match by, typically the participant IDs. If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. Therefore, all data will be mapped to each other based on the time stamps of x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.
To join by different variables on x and y, use a named vector. For example, by = c('a' = 'b') will match x$a to y$b.
To join by multiple variables, use a vector with length > 1. For example, by = c('a', 'b') will match x$a to y$a and x$b to y$b. Use a named vector to match different variables in x and y. For example, by = c('a' = 'b', 'c' = 'd') will match x$a to y$b and x$c to y$d.
To perform a cross-join (when x and y have no variables in common), use by = character(). Note that the split argument will then be set to 1.
- time
The name of the column containing the timestamps in x.
- end_time
Optionally, the name of the column containing the end time in x. If specified, it means time defines the start time of the interval and end_time the end time. Note that this cannot be used at the same time as offset_before or offset_after.
- y_time
The name of the column containing the timestamps in y.
- offset_before
The time before each measurement in x that denotes the period in which y is matched. Must be convertible to a period by lubridate::as.period().
- offset_after
The time after each measurement in x that denotes the period in which y is matched. Must be convertible to a period by lubridate::as.period().
- add_before
Logical value. Do you want to add the last measurement before the start of each interval?
- add_after
Logical value. Do you want to add the first measurement after the end of each interval?
- name
The name of the column containing the nested y data.
- split
An optional grouping variable to split the computation by. When working with large data sets, the computation can grow so large it no longer fits in your computer's working memory (after which it will probably fall back on the swap file, which is very slow). Splitting the computation trades some computational efficiency for a large decrease in RAM usage. This argument defaults to by to automatically suppress some of its RAM usage.
Value
A tibble with the data of x and a new list column (named by the name argument, "data" by
default) containing the matched data of y according to offset_before and offset_after.
Details
y is matched to the time scale of x by means of time windows. These time windows are
defined as the period between x - offset_before and x + offset_after. Note that either
offset_before or offset_after can be 0, but not both. The "interval" of the measurements is
therefore the associated time window for each measurement of x and the data of y that also
falls within this period. For example, an offset_before of
minutes(30) means to match all data of y that occurred in the 30 minutes before each
measurement in x. An offset_after of 900 (i.e. 15 minutes) means to match all data of y
that occurred in the 15 minutes after each measurement in x. When both offset_before and
offset_after are specified, all data of y is matched in an interval of 30 minutes before
and 15 minutes after each measurement of x, thus combining the two arguments.
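The window logic described above can be illustrated with a minimal base R sketch. This is not the implementation of link(); the data are made up, the offsets are written as plain seconds rather than periods, and both window bounds are treated as inclusive (which is consistent with the row counts shown in the Examples).

```r
# One measurement of x and six measurements of y (made-up data)
x_time <- as.POSIXct("2021-11-14 13:00:00", tz = "UTC")
y <- data.frame(
  time = seq(
    as.POSIXct("2021-11-14 12:20:00", tz = "UTC"),
    by = "10 min",
    length.out = 6
  ),
  value = 1:6
)

# Offsets in seconds: 30 minutes before and 15 minutes after
offset_before <- 30 * 60
offset_after <- 15 * 60

# The time window belonging to this measurement of x
window_start <- x_time - offset_before
window_end <- x_time + offset_after

# Keep the rows of y that fall inside the window (bounds inclusive)
matched <- y[y$time >= window_start & y$time <= window_end, ]
matched
```

Here the window runs from 12:30 to 13:15, so every row of y except the one at 12:20 is kept.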
The arguments add_before and add_after let you decide whether you want to add the last
measurement before the interval and/or the first measurement after the interval respectively.
This could be useful when you want to know which type of event occurred right before or after
the interval of the measurement. For example, at offset_before = "30 minutes", the data may
indicate that a participant was running 20 minutes before a measurement in x. However, with
just that information there is no way of knowing what the participant was doing the first 10
minutes of the interval. The same principle applies to after the interval. When add_before is
set to TRUE, the last measurement of y occurring before the interval of x is added to the
output data as the first row, having the time of x - offset_before (i.e. the start
of the interval). When add_after is set to TRUE, the first measurement of y occurring
after the interval of x is added to the output data as the last row, having the time of
x + offset_after (i.e. the end of the interval). This way, it is easier to calculate the
difference to other measurements of y later (within the same interval). Additionally, an
extra column (original_time) is added in the nested data column, which is the original time
of the y measurement and NULL for every other observation. This may be useful to check if
the added measurement isn't too distant (in time) from the others. Note that multiple rows may
be added if there were multiple measurements in y at exactly the same time. Also, if there
already is a row with a timestamp exactly equal to the start of the interval (for add_before = TRUE) or to the end of the interval (add_after = TRUE), no extra row is added.
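As a rough illustration of the add_before behaviour described above, the following base R sketch adds the last measurement before the window by hand. It is a sketch under assumptions, not the code used by link(): the data and the activity column are made up, the offset is given in seconds, and missing original_time values are represented as NA.

```r
# Made-up data: one measurement of x and four measurements of y
x_time <- as.POSIXct("2021-11-14 13:00:00", tz = "UTC")
y <- data.frame(
  time = as.POSIXct(
    c(
      "2021-11-14 12:20:00", "2021-11-14 12:35:00",
      "2021-11-14 12:50:00", "2021-11-14 13:00:00"
    ),
    tz = "UTC"
  ),
  activity = c("running", "walking", "still", "still")
)

# Window of 30 minutes before the measurement of x
window_start <- x_time - 30 * 60

# Rows inside the window; these keep their own time stamps
in_window <- y[y$time >= window_start & y$time <= x_time, ]
in_window$original_time <- as.POSIXct(NA, tz = "UTC")

# Last measurement(s) of y before the window, if any, and only
# if no row already sits exactly at the start of the window
before <- y[y$time < window_start, ]
if (nrow(before) > 0 && !any(in_window$time == window_start)) {
  last_before <- before[before$time == max(before$time), ]
  last_before$original_time <- last_before$time # remember the real time
  last_before$time <- window_start # move it to the start of the window
  in_window <- rbind(last_before, in_window)
}
in_window
```

The first row now tells us the participant was running at the start of the interval, while original_time shows that this was actually recorded at 12:20.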
Warning
Note that setting add_before and add_after to TRUE can each add a row to each nested
tibble of the data column. Thus, if you are only interested in the total count (e.g.
the total number of screen changes), remember to set these arguments to FALSE or make sure to
filter out the added rows, i.e. the rows that have an original_time. Simply subtracting 1 or 2
does not work as not all measurements in x may have a measurement in y before or after (and
thus no row is added).
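The counting pitfall can be sketched in base R as follows. The nested data are built by hand here rather than produced by link(), and the sketch assumes, as described in the Details, that original_time is only filled in for rows added by add_before or add_after and is missing (NA here) for all other rows.

```r
na_time <- as.POSIXct(NA, tz = "UTC")

# Hand-built stand-in for the nested data column: the first element
# has one added row, the second element has none
nested <- list(
  data.frame(
    time = as.POSIXct(
      c("2021-11-14 12:30:00", "2021-11-14 12:35:00", "2021-11-14 12:50:00"),
      tz = "UTC"
    ),
    original_time = c(
      as.POSIXct("2021-11-14 12:20:00", tz = "UTC"),
      na_time,
      na_time
    )
  ),
  data.frame(
    time = as.POSIXct(
      c("2021-11-14 13:40:00", "2021-11-14 13:55:00"),
      tz = "UTC"
    ),
    original_time = rep(na_time, 2)
  )
)

# Count only the rows that were not added, i.e. without an original_time
counts <- vapply(nested, function(d) sum(is.na(d$original_time)), integer(1))

# Naively subtracting 1 undercounts whenever no row was added
naive <- vapply(nested, nrow, integer(1)) - 1L
```

In this made-up case counts is 2 for both elements, whereas the naive subtraction gives 2 and 1.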
Examples
# Define some data
x <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 13:00:00"), by = "1 hour", length.out = 3), 2),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)
# Define some data that we want to link to x
y <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 12:50:00"), by = "5 min", length.out = 30), 2),
  participant_id = c(rep("12345", 30), rep("23456", 30)),
  x = rep(1:30, 2)
)
# Now link y within 30 minutes before each row in x
# until the measurement itself:
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)
#> # A tibble: 6 × 4
#> time participant_id item_one data
#> <dttm> <chr> <dbl> <list>
#> 1 2021-11-14 13:00:00 12345 40 <tibble [3 × 2]>
#> 2 2021-11-14 14:00:00 12345 50 <tibble [7 × 2]>
#> 3 2021-11-14 15:00:00 12345 60 <tibble [7 × 2]>
#> 4 2021-11-14 13:00:00 23456 40 <tibble [3 × 2]>
#> 5 2021-11-14 14:00:00 23456 50 <tibble [7 × 2]>
#> 6 2021-11-14 15:00:00 23456 60 <tibble [7 × 2]>
# We can also link y to a period both before and after
# each measurement in x.
# Also note that time, end_time and y_time accept both
# unquoted names and character strings.
link(
  x = x,
  y = y,
  by = "participant_id",
  time = "time",
  y_time = "time",
  offset_before = "15 minutes",
  offset_after = "15 minutes"
)
#> # A tibble: 6 × 4
#> time participant_id item_one data
#> <dttm> <chr> <dbl> <list>
#> 1 2021-11-14 13:00:00 12345 40 <tibble [6 × 2]>
#> 2 2021-11-14 14:00:00 12345 50 <tibble [7 × 2]>
#> 3 2021-11-14 15:00:00 12345 60 <tibble [7 × 2]>
#> 4 2021-11-14 13:00:00 23456 40 <tibble [6 × 2]>
#> 5 2021-11-14 14:00:00 23456 50 <tibble [7 × 2]>
#> 6 2021-11-14 15:00:00 23456 60 <tibble [7 × 2]>
# It can be important to also know the measurements
# just preceding the interval or just after the interval.
# This adds an extra column called 'original_time' in the
# nested data, containing the original time stamp. The
# actual timestamp is set to the start time of the interval.
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "15 minutes",
  offset_after = "15 minutes",
  add_before = TRUE,
  add_after = TRUE
)
#> # A tibble: 6 × 4
#> time participant_id item_one data
#> <dttm> <chr> <dbl> <list>
#> 1 2021-11-14 13:00:00 12345 40 <tibble [6 × 3]>
#> 2 2021-11-14 14:00:00 12345 50 <tibble [7 × 3]>
#> 3 2021-11-14 15:00:00 12345 60 <tibble [7 × 3]>
#> 4 2021-11-14 13:00:00 23456 40 <tibble [6 × 3]>
#> 5 2021-11-14 14:00:00 23456 50 <tibble [7 × 3]>
#> 6 2021-11-14 15:00:00 23456 60 <tibble [7 × 3]>
# If participant_id is not important to you
# (i.e. the measurements are interchangeable),
# you can ignore it by leaving by empty.
# However, in this case we'll receive a warning
# since x and y have no other columns in common
# (except time, of course). Thus, we can perform
# a cross-join:
link(
  x = x,
  y = y,
  by = character(),
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)
#> # A tibble: 6 × 4
#> time participant_id item_one data
#> <dttm> <chr> <dbl> <list>
#> 1 2021-11-14 13:00:00 12345 40 <tibble [6 × 3]>
#> 2 2021-11-14 14:00:00 12345 50 <tibble [14 × 3]>
#> 3 2021-11-14 15:00:00 12345 60 <tibble [14 × 3]>
#> 4 2021-11-14 13:00:00 23456 40 <tibble [6 × 3]>
#> 5 2021-11-14 14:00:00 23456 50 <tibble [14 × 3]>
#> 6 2021-11-14 15:00:00 23456 60 <tibble [14 × 3]>
# Alternatively, we can specify custom intervals.
# That is, we can create variable intervals
# without using fixed offsets.
x <- data.frame(
  start_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 12:40:00",
      "2021-11-14 13:30:00",
      "2021-11-14 15:00:00"
    )),
    times = 2
  ),
  end_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 13:20:00",
      "2021-11-14 14:10:00",
      "2021-11-14 15:30:00"
    )),
    times = 2
  ),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)
link(
  x = x,
  y = y,
  by = "participant_id",
  time = start_time,
  end_time = end_time,
  y_time = time,
  add_before = TRUE,
  add_after = TRUE
)
#> # A tibble: 6 × 5
#> start_time end_time participant_id item_one data
#> <dttm> <dttm> <chr> <dbl> <list>
#> 1 2021-11-14 12:40:00 2021-11-14 13:20:00 12345 40 <tibble>
#> 2 2021-11-14 13:30:00 2021-11-14 14:10:00 12345 50 <tibble>
#> 3 2021-11-14 15:00:00 2021-11-14 15:30:00 12345 60 <tibble>
#> 4 2021-11-14 12:40:00 2021-11-14 13:20:00 23456 40 <tibble>
#> 5 2021-11-14 13:30:00 2021-11-14 14:10:00 23456 50 <tibble>
#> 6 2021-11-14 15:00:00 2021-11-14 15:30:00 23456 60 <tibble>
