Link y to the time scale of x — link • mpathsenser

One of the key tasks in analysing mobile sensing data is being able to link it to other data. For example, when analysing physical activity data, it could be of interest to know how much time a participant spent exercising before or after an ESM beep to evaluate their stress level. link() allows you to map two data frames to each other that are on different time scales, based on a pre-specified offset before and/or after. This function assumes that both x and y have a column called time containing DateTimeClasses.

Usage

link(
  x,
  y,
  by = NULL,
  time,
  end_time = NULL,
  y_time,
  offset_before = 0,
  offset_after = 0,
  add_before = FALSE,
  add_after = FALSE,
  name = "data",
  split = by
)

Arguments

x, y

A pair of data frames or data frame extensions (e.g. a tibble). Both x and y must have a column called time.

by

A character vector indicating the variable(s) to match by, typically the participant IDs. If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. Therefore, all data will be mapped to each other based on the time stamps of x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.

To join by different variables on x and y, use a named vector. For example, by = c('a' = 'b') will match x$a to y$b.

To join by multiple variables, use a vector with length > 1. For example, by = c('a', 'b') will match x$a to y$a and x$b to y$b. Use a named vector to match different variables in x and y. For example, by = c('a' = 'b', 'c' = 'd') will match x$a to y$b and x$c to y$d.

To perform a cross-join (when x and y have no variables in common), use by = character(). Note that the split argument will then be set to 1.

time

The name of the column containing the timestamps in x.

end_time

Optionally, the name of the column containing the end time in x. If specified, it means time defines the start time of the interval and end_time the end time. Note that this cannot be used at the same time as offset_before or offset_after.

y_time

The name of the column containing the timestamps in y.

offset_before

The time before each measurement in x that denotes the period in which y is matched. Must be convertible to a period by lubridate::as.period().

offset_after

The time after each measurement in x that denotes the period in which y is matched. Must be convertible to a period by lubridate::as.period().
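The offset formats mentioned on this page (a string such as "30 minutes", a plain number such as 900, or a lubridate helper such as minutes(30)) can be compared with a minimal sketch like the one below. It assumes the lubridate package is installed; the variable names are only illustrative, and how link() handles these values internally is not shown here.

```r
library(lubridate)

# Three ways of writing an offset that lubridate can turn into a period
from_string <- as.period("30 minutes") # character input
from_number <- as.period(900)          # plain numbers are read as seconds
from_helper <- minutes(30)             # lubridate helper

# Express each offset in seconds to compare them
period_to_seconds(from_string)
period_to_seconds(from_number)
period_to_seconds(from_helper)
```

Writing the unit explicitly ("15 minutes" rather than 900) tends to make calls to link() easier to read.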

add_before

Logical value. Do you want to add the last measurement before the start of each interval?

add_after

Logical value. Do you want to add the first measurement after the end of each interval?

name

The name of the column containing the nested y data.

split

An optional grouping variable to split the computation by. When working with large data sets, the computation can grow so large that it no longer fits in your computer's working memory (after which it will probably fall back on the swap file, which is very slow). Splitting the computation trades some computational efficiency for a large decrease in RAM usage. This argument defaults to by, so that RAM usage is reduced automatically.
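The idea behind splitting can be sketched in base R as follows. This is only an illustration of the principle (process one group at a time, then bind the results), not the package's actual implementation; count_matches() is a hypothetical helper and the 30-minute window is an assumption taken from the examples on this page.

```r
# Example data, with an explicit time zone for reproducibility
x <- data.frame(
  time = rep(seq(as.POSIXct("2021-11-14 13:00:00", tz = "UTC"),
                 by = "1 hour", length.out = 3), 2),
  participant_id = c(rep("12345", 3), rep("23456", 3))
)
y <- data.frame(
  time = rep(seq(as.POSIXct("2021-11-14 12:50:00", tz = "UTC"),
                 by = "5 min", length.out = 30), 2),
  participant_id = c(rep("12345", 30), rep("23456", 30))
)

# Hypothetical helper: count rows of y_part within `before` seconds
# before each row of x_part (both ends of the window inclusive)
count_matches <- function(x_part, y_part, before = 30 * 60) {
  vapply(seq_len(nrow(x_part)), function(i) {
    sum(y_part$time >= x_part$time[i] - before &
          y_part$time <= x_part$time[i])
  }, integer(1))
}

# Split both data frames by participant so that only one
# participant's data is compared at a time
x_groups <- split(x, x$participant_id)
y_groups <- split(y, y$participant_id)

per_group <- lapply(names(x_groups), function(id) {
  out <- x_groups[[id]]
  out$n_matched <- count_matches(out, y_groups[[id]])
  out
})

# Bind the per-participant results back together
result <- do.call(rbind, per_group)
result
```

Each group is compared only against its own rows of y, so the largest intermediate object is bounded by the largest group rather than by the full data.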

Value

A tibble with the data of x and a new list column (called data by default, see name) containing the matched data of y according to offset_before and offset_after.

Details

y is matched to the time scale of x by means of time windows. Each time window is defined as the period between x - offset_before and x + offset_after. Note that either offset_before or offset_after can be 0, but not both. The "interval" of a measurement is therefore its associated time window, and the data of y that falls within this window is matched to it. For example, an offset_before of minutes(30) means to match all data of y that occurred in the 30 minutes before each measurement in x. An offset_after of 900 (i.e. 900 seconds, or 15 minutes) means to match all data of y that occurred in the 15 minutes after each measurement in x. When both offset_before and offset_after are specified, all data of y in the interval from 30 minutes before to 15 minutes after each measurement of x is matched, thus combining the two arguments.
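The window logic described above can be sketched in base R. This is a minimal illustration, not the function's implementation: it assumes that both ends of the window are inclusive (which agrees with the row counts in the examples on this page) and uses an explicit UTC time zone for reproducibility.

```r
# Timestamps of x (hourly) and y (every 5 minutes)
x_time <- seq(as.POSIXct("2021-11-14 13:00:00", tz = "UTC"),
              by = "1 hour", length.out = 3)
y_time <- seq(as.POSIXct("2021-11-14 12:50:00", tz = "UTC"),
              by = "5 min", length.out = 30)

# Offsets in seconds: 30 minutes before, nothing after
offset_before <- 30 * 60
offset_after <- 0

# One window per measurement in x
window_start <- x_time - offset_before
window_end <- x_time + offset_after

# Collect the y timestamps that fall inside each window
matched <- lapply(seq_along(x_time), function(i) {
  y_time[y_time >= window_start[i] & y_time <= window_end[i]]
})

# Number of matched y measurements per x measurement
n_matched <- lengths(matched)
n_matched
```

The first window starts at 12:30, before the first measurement of y at 12:50, so it contains fewer measurements than the later windows.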

The arguments add_before and add_after let you decide whether you want to add the last measurement before the interval and/or the first measurement after the interval, respectively. This could be useful when you want to know which type of event occurred right before or after the interval of the measurement. For example, at offset_before = "30 minutes", the data may indicate that a participant was running 20 minutes before a measurement in x. However, with just that information there is no way of knowing what the participant was doing in the first 10 minutes of the interval. The same principle applies to the period after the interval.

When add_before is set to TRUE, the last measurement of y occurring before the interval of x is added to the output data as the first row, having the time of x - offset_before (i.e. the start of the interval). When add_after is set to TRUE, the first measurement of y occurring after the interval of x is added to the output data as the last row, having the time of x + offset_after (i.e. the end of the interval). This way, it is easier to calculate the difference to other measurements of y later (within the same interval).

Additionally, an extra column (original_time) is added to the nested data column. It contains the original time of the added y measurement and is missing (NA) for every other observation. This may be useful to check that the added measurement isn't too distant (in time) from the others. Note that multiple rows may be added if there were multiple measurements in y at exactly the same time. Also, if there already is a row with a timestamp exactly equal to the start of the interval (for add_before = TRUE) or to the end of the interval (for add_after = TRUE), no extra row is added.
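The add_before behaviour described above can be sketched in base R for a single measurement of x. This is a hedged illustration rather than the package's code: the activity column and its values are hypothetical, and rows that were not added are assumed to have a missing (NA) original_time.

```r
# Hypothetical activity data (y) and one measurement of x at 13:00
y <- data.frame(
  time = as.POSIXct(c("2021-11-14 12:20:00",
                      "2021-11-14 12:40:00",
                      "2021-11-14 13:10:00"), tz = "UTC"),
  activity = c("walking", "running", "still")
)
x_time <- as.POSIXct("2021-11-14 13:00:00", tz = "UTC")
window_start <- x_time - 30 * 60 # offset_before of 30 minutes

# Rows of y inside the interval; they keep their own timestamp
inside <- y[y$time >= window_start & y$time <= x_time, ]
inside$original_time <- rep(as.POSIXct(NA, tz = "UTC"), nrow(inside))

# Last measurement before the interval, unless a row already
# sits exactly on the start of the interval
before <- y[y$time < window_start, ]
if (nrow(before) > 0 && !any(inside$time == window_start)) {
  last_before <- before[before$time == max(before$time), ]
  last_before$original_time <- last_before$time # keep the real time
  last_before$time <- window_start              # move to interval start
  inside <- rbind(last_before, inside)
}

inside
```

With the added row in place, diff(inside$time) gives the time spent in each state from the start of the interval onwards.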

Warning

Note that setting add_before and add_after to TRUE can each add a row to each nested tibble of the data column. Thus, if you are only interested in the total count (e.g. the total number of screen changes), remember to set these arguments to FALSE or make sure to filter out rows that have an original_time. Simply subtracting 1 or 2 does not work, as not every measurement in x has a measurement in y before or after it (in which case no row is added).
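A count that ignores the added rows could look like the sketch below. The data frame is a hypothetical stand-in for one nested tibble of the data column, and it assumes that only rows added by add_before or add_after have a non-missing original_time.

```r
# Stand-in for one nested tibble: the first and last rows were added
# by add_before / add_after and therefore carry an original_time
nested <- data.frame(
  time = as.POSIXct(c("2021-11-14 12:30:00", "2021-11-14 12:40:00",
                      "2021-11-14 12:55:00", "2021-11-14 13:00:00"),
                    tz = "UTC"),
  screen_event = c("off", "on", "off", "on"),
  original_time = as.POSIXct(c("2021-11-14 12:20:00", NA,
                               NA, "2021-11-14 13:10:00"),
                             tz = "UTC")
)

# Counting all rows includes the two added measurements
n_total <- nrow(nested)

# Counting only rows without an original_time gives the number
# of measurements that really fall inside the interval
n_observed <- sum(is.na(nested$original_time))

c(total = n_total, observed = n_observed)
```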

Examples

# Define some data
x <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 13:00:00"), by = "1 hour", length.out = 3), 2),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)

# Define some data that we want to link to x
y <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 12:50:00"), by = "5 min", length.out = 30), 2),
  participant_id = c(rep("12345", 30), rep("23456", 30)),
  x = rep(1:30, 2)
)

# Now link y within 30 minutes before each row in x
# until the measurement itself:
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)
#> # A tibble: 6 × 4
#>   time                participant_id item_one data            
#>   <dttm>              <chr>             <dbl> <list>          
#> 1 2021-11-14 13:00:00 12345                40 <tibble [3 × 2]>
#> 2 2021-11-14 14:00:00 12345                50 <tibble [7 × 2]>
#> 3 2021-11-14 15:00:00 12345                60 <tibble [7 × 2]>
#> 4 2021-11-14 13:00:00 23456                40 <tibble [3 × 2]>
#> 5 2021-11-14 14:00:00 23456                50 <tibble [7 × 2]>
#> 6 2021-11-14 15:00:00 23456                60 <tibble [7 × 2]>

# We can also link y to a period both before and after
# each measurement in x.
# Also note that time, end_time and y_time accept both
# unquoted names and character strings.
link(
  x = x,
  y = y,
  by = "participant_id",
  time = "time",
  y_time = "time",
  offset_before = "15 minutes",
  offset_after = "15 minutes"
)
#> # A tibble: 6 × 4
#>   time                participant_id item_one data            
#>   <dttm>              <chr>             <dbl> <list>          
#> 1 2021-11-14 13:00:00 12345                40 <tibble [6 × 2]>
#> 2 2021-11-14 14:00:00 12345                50 <tibble [7 × 2]>
#> 3 2021-11-14 15:00:00 12345                60 <tibble [7 × 2]>
#> 4 2021-11-14 13:00:00 23456                40 <tibble [6 × 2]>
#> 5 2021-11-14 14:00:00 23456                50 <tibble [7 × 2]>
#> 6 2021-11-14 15:00:00 23456                60 <tibble [7 × 2]>

# It can be important to also know the measurements
# just preceding the interval or just after the interval.
# This adds an extra column called 'original_time' in the
# nested data, containing the original timestamp. The
# actual timestamp is set to the start time of the interval
# (add_before) or the end time of the interval (add_after).
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "15 minutes",
  offset_after = "15 minutes",
  add_before = TRUE,
  add_after = TRUE
)
#> # A tibble: 6 × 4
#>   time                participant_id item_one data            
#>   <dttm>              <chr>             <dbl> <list>          
#> 1 2021-11-14 13:00:00 12345                40 <tibble [6 × 3]>
#> 2 2021-11-14 14:00:00 12345                50 <tibble [7 × 3]>
#> 3 2021-11-14 15:00:00 12345                60 <tibble [7 × 3]>
#> 4 2021-11-14 13:00:00 23456                40 <tibble [6 × 3]>
#> 5 2021-11-14 14:00:00 23456                50 <tibble [7 × 3]>
#> 6 2021-11-14 15:00:00 23456                60 <tibble [7 × 3]>

# If participant_id is not important to you
# (i.e. the measurements are interchangeable),
# you can ignore it by leaving by empty.
# However, in this case we'll receive a warning
# since x and y have no other columns in common
# (except time, of course). Thus, we can perform
# a cross-join:
link(
  x = x,
  y = y,
  by = character(),
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)
#> Warning: Using `by = character()` to perform a cross join was deprecated in dplyr 1.1.0.
#> ℹ Please use `cross_join()` instead.
#> ℹ The deprecated feature was likely used in the mpathsenser package.
#>   Please report the issue at <https://github.com/koenniem/mpathsenser/issues>.
#> # A tibble: 6 × 4
#>   time                participant_id item_one data             
#>   <dttm>              <chr>             <dbl> <list>           
#> 1 2021-11-14 13:00:00 12345                40 <tibble [6 × 3]> 
#> 2 2021-11-14 14:00:00 12345                50 <tibble [14 × 3]>
#> 3 2021-11-14 15:00:00 12345                60 <tibble [14 × 3]>
#> 4 2021-11-14 13:00:00 23456                40 <tibble [6 × 3]> 
#> 5 2021-11-14 14:00:00 23456                50 <tibble [14 × 3]>
#> 6 2021-11-14 15:00:00 23456                60 <tibble [14 × 3]>

# Alternatively, we can specify custom intervals.
# That is, we can create variable intervals
# without using fixed offsets.
x <- data.frame(
  start_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 12:40:00",
      "2021-11-14 13:30:00",
      "2021-11-14 15:00:00"
    )),
    times = 2
  ),
  end_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 13:20:00",
      "2021-11-14 14:10:00",
      "2021-11-14 15:30:00"
    )),
    times = 2
  ),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)
link(
  x = x,
  y = y,
  by = "participant_id",
  time = start_time,
  end_time = end_time,
  y_time = time,
  add_before = TRUE,
  add_after = TRUE
)
#> # A tibble: 6 × 5
#>   start_time          end_time            participant_id item_one data    
#>   <dttm>              <dttm>              <chr>             <dbl> <list>  
#> 1 2021-11-14 12:40:00 2021-11-14 13:20:00 12345                40 <tibble>
#> 2 2021-11-14 13:30:00 2021-11-14 14:10:00 12345                50 <tibble>
#> 3 2021-11-14 15:00:00 2021-11-14 15:30:00 12345                60 <tibble>
#> 4 2021-11-14 12:40:00 2021-11-14 13:20:00 23456                40 <tibble>
#> 5 2021-11-14 13:30:00 2021-11-14 14:10:00 23456                50 <tibble>
#> 6 2021-11-14 15:00:00 2021-11-14 15:30:00 23456                60 <tibble>