4 Presence-only data generation theory

Published

May 24, 2024

We now know about presence-only data and we know some theoretical basics about using a NHPP to model such data. There’s one last technical thing to discuss before we can get to actually fitting the model to the data. Often, when we are working with a new data type, it is very helpful to work in a simulated environment first. That way we can compare our models under an environment we completely control. However, to do so, we have to have be able to generate data according to the process we are interested in. Often, we can just use a marginal distribution to do so. But in the case of the NHPP, it’s not as straightforward. So we will describe how to simulate some presence-only data, in preparation to fit a Nonhomogeneous Poisson process model.

To generate data, let’s use the following ingredients: Each point observation \(i\) is associated with a site \(s_i\) area \(D\). For each location \(s_i\), there are associated covariates \(x_i = x(s_i)\) and \(z_i = z(s_i)\). And at each location, \(s_i\) represents the centroid of a quadrat \(A_i\). This is an essential assumption in our model, where we will use the Poisson likelihood to for the number of counts in each quadrat. At site \(A_i\) we observe counts \(N(A_i)\). This is described in Fithian (2015).

Referencing our initial theoretical outline, we need to simulate points with likelihood \[ \prod_i\lambda(s_i)\frac{\exp(-\lambda(D))}{n!}. \] In the case of the homogeneous point process case, we can simulate \(n\) observations simply from \[ N(D) \sim Poisson(\lambda|D|), \] where in the homogeneous case \(\lambda(s)=\lambda\), and is thus constant over \(D\). Then we distribute the \(n\) uniformly over \(D\).

In the case of the Nonhomogeneous Poisson Process, we can do the same, drawing \(n\) now from \[ N(D) \sim Poisson(\lambda(D)) = Poisson(\int_D\lambda(s)ds), \] and then distributing their locations over \(D\). Instead of uniformly distributing them however, they are placed according to the distribution \(\frac{\lambda(s)}{\lambda(D)}\). That is interesting. It is quite different than how we usually simulate non-spatial data.

To evaluate the integral over \(D\), remember that we are using as intensity \[ \lambda(A_i) = \int_A\lambda(s)ds = \int_A\exp[\alpha + \beta'x(s)]ds = |A|\exp[\alpha + \beta'x_i], \] because we only have information \(X_i\) for quadrat \(A_i\) centered around \(s_i\). Using the \(x_i\) we have, both \(\lambda(A)\) and \(\lambda(D)\) will be approximations at the covariate resolution, using the assumption that each \(x_i\) is constant over \(A_i\). This being the case, then for each observation we will use the same distribution as the other points that fall into the same \(A_i\). That is, multiple points will have the same distribution \(\frac{|A|\exp[\alpha + \beta'x_i]}{\sum_i |A|\exp[\alpha + \beta'x_i]}\). This goes somewhat against the idea of distributing the points in continuous space. However, we can uniformly distribute the points within each \(A_i\). If \(x_i\) is constant across \(A_i\), this is ok.

If we cannot assume this, using the continuous probability distribution for each point \((\lambda(s)/\lambda(D))\) is necessary to avoid the integral of \(\lambda(A)\). We can do so in a way that avoids computing \(\lambda(D)\) (Banerjee, 2014. p220), by (1) computing \(\lambda_{max}\), (2) simulating \(n\) from \(N(D) \sim Poisson(\lambda_{max}|D|)\), and (3) rejecting some of these points with probability \(\lambda(s_i)/\lambda_{max}\). In this way, the only thing we need is information at the level of \(\lambda(s_i\)) and we can avoid estimating \(\lambda(D)\) and \(\lambda(A)\). However, we are still limited by our covariate resolution. Essentially, the assumptions we make about the grid will actually effect the way we simulate data. For example, when the only covariate data we have is that which is associated with the quadrat we are working with, then each point realization will be simulated in aggregate, that is, a draw from a Poisson distribution characterized by an functional of intensity dependent only on the covariates. If our covariates exist on a finer scale than the quadrat, we can simulate the point locations more finely as well and take the approach of Banerjee. So note that we take the route of simulation depending on our available information. If we only have \(x_i\) for each quadrat, directly generating the points as a Poisson random variable is a good option because it avoids any complicated integrals. But if we have finer covariates than at the quadrat level we are working with, then thinning the process down with the approach of Banerjee and others is better because it also avoids the integrals and has finer resolution.

We must also take into account the bias associated with the observation of species with true intensity \(\lambda(s)\). In order to account for the bias caused by collecting presence-only observations, the NHPP points are also thinned by the sampling bias \(b\). \[ \lambda(s) = \theta(s)b(s) = \exp[\alpha + \beta'x(s) + \gamma + \delta'z(s)]. \]

There are a few things be aware of during the data gerating process with the NHPP: The area of the total space, \(|D|\), the area of each site (quadrat) \(|A_i|\), and the number of cells \(c\) that \(D\) is discretized into. The area of \(A\) depends on the number of discretizations (or cells), \(|A_i| = |D|/c\). The intensity will be scaled by this factor \(|D|/c\), so that as the area of \(|A|\) becomes smaller, the intensity converges to that of a “true” continuous Poisson process with its center at \(A_i\). We also assume that the intensity is constant over \(A\), so that we can indeed scale the intensity by the area factor, instead of needing to integrate over \(A\). This assumption is in contrast to assuming that we model the instensity directly as \(\lambda(A_i) = \exp(\beta'X)\) (versus \(\lambda(A_i) = \frac{|D|}{c}\exp(\beta'X)\)).