## Can we use B-splines to generate non-linear data?

I’m exploring the idea of adding a function or set of functions to the simstudy package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit and use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated. It would also make it easy to explore spline methods, because, as with any other simulated data set, we would know the underlying data generating process. **Splines in R:** The `bs` function in the splines package returns values from these basis functions based on the specification of knots and degree of curvature. I wrote a wrapper function that uses the…
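
As a rough sketch of the idea (not the simstudy interface itself — the knots, degree, and basis coefficients below are illustrative assumptions), a B-spline basis from `bs` can be combined with a coefficient vector to produce a smooth non-linear signal:

```r
# Sketch: generate non-linear data from a B-spline basis.
# Knots, degree, and theta are illustrative choices, not simstudy defaults.
library(splines)

set.seed(5)
x <- runif(250, 0, 1)                      # predictor on [0, 1]
knots <- c(0.25, 0.50, 0.75)               # interior knots (assumed)
theta <- c(0.1, 0.8, 0.4, 0.9, 0.2, 0.6)   # one coefficient per basis column

# with 3 interior knots, degree 2, and an intercept, bs() returns 6 columns
basis <- bs(x, knots = knots, degree = 2, intercept = TRUE)
y.true <- basis %*% theta                  # smooth non-linear signal
y <- y.true + rnorm(250, 0, 0.1)           # observed outcome with noise
```

Fitting a linear model to `y ~ x` would then knowingly violate the linearity assumption, which is exactly the point.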

Original Post: Can we use B-splines to generate non-linear data?

# Posts by Keith Goldfeld

## A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions

I just updated simstudy to version 0.1.5 (available on CRAN) so that it now includes several new distributions – exponential, discrete uniform, and negative binomial. As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete distribution of non-negative integer outcomes that is often used to describe counts. It is characterized by a mean (or rate), and its variance equals its mean. **Added variation:** In many situations when count data are modeled, it turns out that the variance of the data exceeds the mean (a situation called over-dispersion). In this case, an alternative model that allows for the greater variance is used, based on the negative binomial distribution. It turns out that if the negative binomial distribution…
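
To see the over-dispersion in action, here is a small sketch (the mean and dispersion values are illustrative) that generates negative binomial data as a gamma–Poisson mixture and compares it to a Poisson with the same mean:

```r
# Sketch: negative binomial as a gamma-Poisson mixture, showing
# over-dispersion relative to a Poisson with the same mean.
set.seed(87)
n <- 1e5
mu <- 4       # common mean
disp <- 0.5   # dispersion: Var = mu + disp * mu^2 = 12

pois <- rpois(n, lambda = mu)

# gamma-Poisson mixture: the individual-level rate varies around mu
lambda <- rgamma(n, shape = 1 / disp, rate = 1 / (disp * mu))
nb <- rpois(n, lambda = lambda)

round(c(mean(pois), var(pois)), 1)   # variance roughly equals the mean
round(c(mean(nb), var(nb)), 1)       # variance well above the mean
```

Both samples have mean near 4, but the mixture’s variance is near 12 rather than 4 — the extra spread the negative binomial is designed to accommodate.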

Original Post: A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions

## CACE closed: EM opens up exclusion restriction (among other things)

This is the third, and probably last, of a series of posts touching on the estimation of complier average causal effects (CACE) and latent variable modeling techniques using an expectation-maximization (EM) algorithm. What follows is a simplistic way to implement an EM algorithm in R to do principal stratification estimation of the CACE. **The EM algorithm:** In this approach, we assume that individuals fall into one of three possible groups – never-takers, always-takers, and compliers – but we cannot see who is who (except in a couple of cases). For each group, we are interested in estimating the unobserved potential outcomes \(Y_0\) and \(Y_1\) using observed outcome measures of \(Y\). The EM algorithm does this in two steps. The E-step estimates the missing class membership for each individual, and the M-step provides maximum likelihood estimates of the group-specific potential…
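
The E-step/M-step cycle can be illustrated on a much simpler latent-class problem than the full CACE model — a two-component normal mixture with known unit variances; all parameter and starting values here are illustrative:

```r
# Sketch: EM for a two-component normal mixture (unit variances),
# a simpler analogue of latent class membership estimation.
set.seed(3)
y <- c(rnorm(200, 0, 1), rnorm(100, 3, 1))   # two hidden classes

p <- 0.5; mu1 <- -1; mu2 <- 4                # crude starting values

for (i in 1:100) {
  # E-step: posterior probability each observation belongs to class 2
  d1 <- (1 - p) * dnorm(y, mu1, 1)
  d2 <- p * dnorm(y, mu2, 1)
  w <- d2 / (d1 + d2)

  # M-step: maximum likelihood updates given the class weights
  p   <- mean(w)
  mu1 <- sum((1 - w) * y) / sum(1 - w)
  mu2 <- sum(w * y) / sum(w)
}

round(c(p, mu1, mu2), 2)   # should approach 1/3, 0, and 3
```

The CACE version replaces the two normal components with never-taker, always-taker, and complier strata, but the alternation of membership probabilities and likelihood maximization is the same.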

Original Post: CACE closed: EM opens up exclusion restriction (among other things)

## A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm

I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement. **Defining data conditionally:** Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But, what if I wanted to specify completely different…

Original Post: A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm

## Complier average causal effect? Exploring what we learn from an RCT with participants who don’t do what they are told.

Inspired by a free online course titled Complier Average Causal Effects (CACE) Analysis and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming. And I end by describing two simple methods to estimate the CACE, which we can compare to the truth (since this is a simulation); next time, I will describe a third way. **Non-compliance in randomized trials:** Here’s the problem. In a randomized trial, investigators control the randomization process; they determine if an individual is assigned to…
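
One simple estimator can be sketched directly (group proportions and the effect size below are illustrative assumptions): with one-sided non-compliance, the CACE is the intention-to-treat effect scaled by the proportion of compliers:

```r
# Sketch: simulate one-sided non-compliance and recover the CACE
# as ITT effect / proportion of compliers.
set.seed(135)
n <- 1e5
z <- rbinom(n, 1, 0.5)          # randomized assignment
complier <- rbinom(n, 1, 0.6)   # 60% compliers (latent, unobserved in practice)
a <- z * complier               # treatment received (no always-takers)

y0 <- rnorm(n, 0, 1)            # potential outcome without treatment
y <- y0 + 1.5 * a               # true CACE = 1.5

itt <- mean(y[z == 1]) - mean(y[z == 0])   # diluted by non-compliance
p.comp <- mean(a[z == 1])                  # observed compliance in treated arm
itt / p.comp                               # recovers roughly 1.5
```

Because never-takers contribute nothing to the ITT difference, dividing by the compliance rate undoes the dilution.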

Original Post: Complier average causal effect? Exploring what we learn from an RCT with participants who don’t do what they are told.

## Further considerations of a hidden process underlying categorical responses

In my previous post, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have multiple outcomes, but with a slight twist – these groups of interest can be interpreted in an ordered way. This conceptual latent process can provide another perspective on the models that are typically applied to analyze these types of outcomes. **Categorical outcomes, generally:** Certainly, group membership is not necessarily intrinsically ordered. In a general categorical or multinomial outcome, a group does not necessarily have any quantitative relationship vis-à-vis the other groups. For example, if we were interested in primary…
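
A minimal sketch of this kind of data generating process (the coefficient and thresholds are illustrative): threshold a continuous latent variable with logistic noise to produce an ordered categorical outcome, the setup underlying a proportional-odds model:

```r
# Sketch: ordered categories generated by thresholding a latent
# continuous process with logistic errors.
set.seed(20)
n <- 1e4
x <- rnorm(n)                     # observed covariate
latent <- 0.8 * x + rlogis(n)     # hidden continuous process
cuts <- c(-1, 0.5, 2)             # category thresholds (assumed)

# observed outcome: which interval the latent value falls into
ordcat <- cut(latent, breaks = c(-Inf, cuts, Inf), labels = 1:4)
table(ordcat)
```

Because the errors are logistic and the thresholds are fixed, a cumulative logit model fit to `ordcat` would target exactly the 0.8 coefficient used to generate the data.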

Original Post: Further considerations of a hidden process underlying categorical responses

## A hidden process behind binary or other categorical outcomes?

I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. – six groups in total. The question was how best to evaluate the effectiveness of the intervention. Since the outcome was categorical and ordinal – that is, category 1 implied less meat consumption…

Original Post: A hidden process behind binary or other categorical outcomes?

## Be careful not to control for a post-exposure covariate

A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also included a measurement of poverty in adulthood, on the belief that it was somehow confounding the relationship of trauma and substance use. A confounder is a common cause of an exposure/treatment and an outcome; it is hard to conceive of adult poverty as a cause of childhood events, even though it might be related to adult substance use (or maybe not). At best, controlling for adult poverty has no impact on the conclusions of the research; less good, though, is the possibility that it…
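
A quick simulation makes the danger concrete (all coefficients are illustrative; here the post-exposure covariate acts as a mediator): adjusting for it pulls the estimate away from the total effect of exposure:

```r
# Sketch: adjusting for a post-exposure covariate (a mediator here)
# biases the estimate of the total effect of exposure.
set.seed(44)
n <- 1e5
x <- rbinom(n, 1, 0.5)             # exposure (e.g., childhood trauma)
m <- 0.5 * x + rnorm(n)            # post-exposure covariate (e.g., adult poverty)
y <- 1.0 * x + 0.8 * m + rnorm(n)  # outcome; true total effect = 1.0 + 0.8*0.5 = 1.4

coef(lm(y ~ x))["x"]       # unadjusted: recovers the total effect (about 1.4)
coef(lm(y ~ x + m))["x"]   # adjusted: only the direct effect (about 1.0)
```

Neither model is "wrong" per se, but if the research question concerns the overall impact of the exposure, conditioning on the post-exposure variable answers a different question.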

Original Post: Be careful not to control for a post-exposure covariate

## Should we be concerned about incidence – prevalence bias?

Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results – patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches on this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might mistake an artifactual effect for an intervention effect. What could explain this phenomenon? The…

Original Post: Should we be concerned about incidence – prevalence bias?

## Using simulation for power analysis: an example based on a stepped wedge study design

Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e., one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which can be a bit painful. My goal here is to show that at least writing the code need not be overwhelming. Recently, I was helping an investigator plan a stepped wedge cluster randomized trial to study the effects of modifying a physician support system on patient-level diabetes management. While analytic approaches for power calculations do exist in the context of this complex…
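
As a bare-bones template (a simple two-arm comparison rather than the full stepped-wedge model; the sample size, effect size, and number of simulations are illustrative), the simulation approach is just a loop: generate data under the assumed effect, fit the model, and record whether the null is rejected:

```r
# Sketch: simulation-based power for a two-arm comparison.
# Power = proportion of simulated trials with p < 0.05.
set.seed(28)
n.per.arm <- 50
effect <- 0.5      # assumed standardized effect size
n.sims <- 1000

reject <- replicate(n.sims, {
  y0 <- rnorm(n.per.arm, 0, 1)        # control arm
  y1 <- rnorm(n.per.arm, effect, 1)   # intervention arm
  t.test(y1, y0)$p.value < 0.05       # did this trial reject the null?
})

mean(reject)   # estimated power (roughly 0.7 for these settings)
```

Replacing the two `rnorm` calls and the `t.test` with a clustered, time-varying data generation step and a mixed-effects model is all it takes to adapt this skeleton to a stepped-wedge design.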

Original Post: Using simulation for power analysis: an example based on a stepped wedge study design