| Title: | Statistical Inference for Weighted Data |
|---|---|
| Description: | Analyzes and models data subject to sampling biases. Provides functions to estimate the density and cumulative distribution functions from biased samples of continuous distributions. Includes the estimators proposed by Bhattacharyya et al. (1988) <doi:10.1080/03610928808829825> and Jones (1991) <doi:10.2307/2337020> for density, and by Cox (2005, ISBN:052184939X) and Bose and Dutta (2022) <doi:10.1007/s00184-021-00824-3> for distribution, with different bandwidth selectors. Also includes a real length-biased dataset on shrub width from Muttlak (1988) <https://www.proquest.com/openview/3dd74592e623cdbcfa6176e85bd3d390/1?cbl=18750&diss=y&pq-origsite=gscholar>. |
| Authors: | Sánchez Martínez Noelia [cre, aut] (ORCID: <https://orcid.org/0000-0002-2869-1347>), Borrajo Garcia Maria Isabel [aut] (ORCID: <https://orcid.org/0000-0002-3372-1993>), Conde Amboage Mercedes [aut] (ORCID: <https://orcid.org/0000-0003-0306-8142>) |
| Maintainer: | Sánchez Martínez Noelia |
| License: | GPL-3 |
| Version: | 0.1.1 |
| Built: | 2026-05-18 09:51:13 UTC |
| Source: | https://github.com/noeliasanchmrt/wdata |
This function implements the local bandwidth selector proposed by Bose and Dutta (2022) for their own kernel distribution estimator.
bw.F.BD(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  y.seq,
  cy.seq
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function to be used. It must be evaluable and positive at each point of the sample. |
| y.seq | A numeric vector containing the points at which the local bandwidth is estimated. |
| cy.seq | A numeric vector representing the constants to be used in the bandwidth estimation for each point of y.seq. |
Local bandwidth selectors are estimated using the formula:
where is a positive parameter that depends on the point and is an estimation of the standard deviation of the distribution given by
The parameter is provided to the function by using the argument cy.seq, which is a vector of positive values that is used to compute the bandwidth for each point in y.seq. Alternatively, a single numeric value can be provided, which will be used for all points in the y.seq vector.
The simulations carried out by Bose and Dutta (2022) suggest that choosing or provides good results in the tail region of the distribution, with tails defined as points below the 5th percentile or above the 95th percentile. On the other hand, provides good results for the remaining points.
If some bandwidths are not positive, they are replaced by the mean of the neighbors.
A numeric vector containing the bandwidths for each point in y.seq.
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
bw.F.BD(shrub.data$Width, y.seq = seq(0, 1, length.out = 512), cy.seq = rep(1, 512))
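The replacement of non-positive bandwidths described above could look roughly like the following sketch. It assumes that "neighbors" means the nearest positive bandwidths on either side of the offending grid point; the package may define neighbors differently, and the helper name `fix_nonpositive_sketch` is hypothetical.

```r
# Hypothetical helper, not part of wdata: replace non-positive local bandwidths
# by the mean of the nearest positive bandwidths to the left and to the right.
fix_nonpositive_sketch <- function(bw) {
  good <- which(is.finite(bw) & bw > 0)
  if (length(good) == 0) stop("no positive bandwidth available")
  bad <- setdiff(seq_along(bw), good)
  for (i in bad) {
    left <- good[good < i]
    right <- good[good > i]
    # At the ends of the grid only one neighbor exists, so it is used alone
    neighbors <- c(if (length(left)) bw[max(left)],
                   if (length(right)) bw[min(right)])
    bw[i] <- mean(neighbors)
  }
  bw
}

fix_nonpositive_sketch(c(0.4, -0.1, 0.6, 0.5))
```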
This function computes the bandwidth selector for Jones (1991) kernel density estimator using the bias-corrected bootstrap method developed by Borrajo et al. (2017).
bw.f.BGM.boot1(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  bw0 = c("RT", "PI")
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| bw0 | A character string specifying the method to determine the pilot bandwidth. Options are "RT" and "PI". |
When bw0="RT", the bandwidth is given by
is the value returned by bw.f.BGM.rt. An alternative is to consider the following pilot bandwidth:
and is estimated under the assumption that is gaussian, which is implemented by setting bw0="PI". The quantities and depend only on the kernel and are defined as
The estimators and are given by
The bootstrap bandwidth value.
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
# Bandwidth value using bootstrap method with "RT" as pilot bandwidth
bw.f.BGM.boot1(y = shrub.data$Width)
# Bandwidth value using bootstrap method with "PI" as pilot bandwidth
bw.f.BGM.boot1(y = shrub.data$Width, bw0 = "PI")
This function computes the bandwidth selector for Jones (1991) kernel density estimator using the bias-corrected bootstrap method developed by Borrajo et al. (2017).
bw.f.BGM.boot2(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  bw0 = 1/8 * n^(-1/9),
  lower = IQR(y) * n^(-0.2) * 2000^(-1),
  upper = IQR(y) * (log(n)/n)^(0.2) * 500,
  nh = 200L,
  tol = 0.1 * lower,
  from = min(y) - (sort(y)[5] - min(y)),
  to = max(y) + (max(y) - sort(y, decreasing = TRUE)[5]),
  plot = TRUE
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| bw0 | The bandwidth value to be used in |
| lower | Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| upper | Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| nh | An integer specifying the number of points in the grid to evaluate the mean integrated squared error function. Default is 200. |
| tol | Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the bandwidth minimizing the mean integrated squared error function lies within tol of lower or upper. Default is 0.1 * lower. |
| from | Numeric value specifying the lower bound to be used in |
| to | Numeric value specifying the upper bound to be used in |
| plot | Logical value indicating whether to plot the mean integrated squared error function. Default is TRUE. |
The bandwidth returned is the one minimizing the bootstrap mean integrated squared error (MISE) over a compact interval determined by the arguments lower and upper. The expressions for the MISE and the mean squared error (MSE) of the bootstrap estimator are those provided by Borrajo et al. (2017).
The bootstrap bandwidth value.
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
bw.f.BGM.boot2(shrub.data$Width, nh = 50L)
This function estimates the bandwidth for Jones (1991) kernel density estimator using cross-validation criteria from Guillamón et al. (1998). It iterates through a range of bandwidth values and computes the cross-validation score for each bandwidth. The bandwidth that minimizes the cross-validation function is selected as the optimal bandwidth.
bw.f.BGM.cv(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  lower = IQR(y) * n^(-0.2) * 2000^(-1),
  upper = IQR(y) * (log(n))^(0.2) * n^(-0.2) * 500,
  nh = 200L,
  tol = 0.1 * lower,
  plot = TRUE
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| lower | Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| upper | Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| nh | An integer specifying the number of points to evaluate the cross-validation function. Default is 200. |
| tol | Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the bandwidth minimizing the cross-validation function lies within tol of lower or upper. Default is 0.1 * lower. |
| plot | Logical value indicating whether to plot the cross-validation function. Default is TRUE. |
The optimal bandwidth is the one that minimizes the cross-validation function, i.e.,
It holds that
where denotes convolution between two functions and is computed as
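This function computes the bandwidth that minimizes the cross-validation function on the interval determined by lower and upper. By default, the interval is the one suggested by Borrajo et al. (2017):

$$\left[\frac{\mathrm{IQR}\, n^{-1/5}}{2000},\ 500\, \mathrm{IQR} \left(\frac{\log n}{n}\right)^{1/5}\right],$$

where IQR is the interquartile range and $n$ is the sample size.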
The optimal bandwidth value based on cross-validation criteria.
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Guillamón A, Navarro J, Ruiz JM (1998).
“Kernel density estimation using weighted data.”
Communications in Statistics - Theory and Methods, 27(9), 2123-2135.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
bw.f.BGM.cv(shrub.data$Width)
bw.f.BGM.cv(shrub.data$Width, kernel = "epanechnikov")
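The search strategy shared by the grid-based selectors (evaluate a criterion at nh bandwidths between lower and upper, keep the minimizer, and warn when it sits within tol of either end) can be sketched as below. This is an illustration under assumptions rather than the package's code: the grid is assumed to be equispaced, `grid_bandwidth_sketch` and the toy criterion are hypothetical names, and the real cross-validation function is not reproduced here.

```r
# Hypothetical sketch of a grid search with a boundary check.
# `criterion` is any function mapping a bandwidth to a score to be minimized.
grid_bandwidth_sketch <- function(y, criterion, nh = 200L,
                                  lower = IQR(y) * length(y)^(-0.2) / 2000,
                                  upper = IQR(y) * (log(length(y)) / length(y))^0.2 * 500,
                                  tol = 0.1 * lower) {
  h_grid <- seq(lower, upper, length.out = nh)     # assumed equispaced grid
  scores <- vapply(h_grid, criterion, numeric(1))
  h_opt <- h_grid[which.min(scores)]
  # A minimizer close to an end of the interval suggests widening the interval
  if (h_opt < lower + tol || h_opt > upper - tol) {
    warning("minimum occurred at one end of the search interval")
  }
  h_opt
}

# Toy criterion with its minimum at h = 1 (for illustration only)
toy_criterion <- function(h) (h - 1)^2
grid_bandwidth_sketch(rnorm(100), toy_criterion, nh = 191L, lower = 0.1, upper = 2)
```

If the warning is raised, rerunning with a smaller lower or a larger upper is the natural next step.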
This function computes the bandwidth selector for Jones (1991) density estimator using the rule of thumb.
bw.f.BGM.rt(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine")
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
The bandwidth is given by
where and depend only on the kernel and are defined as
The estimators and are given by
is an estimation of the standard deviation of the distribution given by
The rule of thumb bandwidth value.
Jones MC (1991). “Kernel density estimation for length biased data.” Biometrika, 78(3), 511–519.
bw.f.BGM.rt(shrub.data$Width)
bw.f.BGM.rt(shrub.data$Width, kernel = "epanechnikov")
This function performs bandwidth selection for Bose and Dutta (2022) kernel distribution estimator using cross-validation criteria. It iterates through a range of bandwidth values and computes the cross-validation score for each bandwidth. The bandwidth that minimizes the cross-validation function is selected as the optimal bandwidth.
bw.F.SBC.cv(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  lower = IQR(y) * length(y)^(-1/3) * 0.05,
  upper = IQR(y) * length(y)^(-1/3) * 5,
  nh = 200L,
  tol = 0.1 * lower,
  plot = TRUE
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| lower | Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| upper | Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
| nh | An integer specifying the number of points to evaluate the cross-validation function. Default is 200. |
| tol | Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the bandwidth minimizing the cross-validation function lies within tol of lower or upper. Default is 0.1 * lower. |
| plot | A logical value indicating whether to plot the cross-validation function. Default is TRUE. |
The optimal bandwidth is obtained as the one that minimizes the cross-validation function, that is,
and is the Bose and Dutta (2022) kernel distribution estimator without the observation .
The optimal bandwidth based on cross-validation criteria.
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
bw.F.SBC.cv(shrub.data$Width)
bw.F.SBC.cv(shrub.data$Width, kernel = "epanechnikov")
This function computes the bandwidth selector for Bose and Dutta (2022) kernel distribution estimator using the plug-in method.
bw.F.SBC.pi(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine")
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function to be used. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
The bandwidth is given by:
where and depend only on the kernel and are defined as
where is the kernel distribution function associated with the kernel density function .
The estimators and are given by
is an estimator of
where is estimated assuming that follows a gaussian distribution and and are estimated by and as defined above.
The bandwidth value using the plug-in method.
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
bw.F.SBC.pi(shrub.data$Width, kernel = "epanechnikov")
This function computes the bandwidth selector for Bose and Dutta (2022) kernel distribution estimator using the rule of thumb.
bw.F.SBC.rt(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine")
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
The bandwidth is computed as follows:
where both and depend only on the kernel and are defined as
where is the kernel distribution function associated with the kernel density function .
The estimators and are given by
is an estimation of the standard deviation of the distribution given by
The optimal bandwidth value for Bose and Dutta (2022) kernel distribution estimator based on the rule of thumb.
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
bw.F.SBC.rt(shrub.data$Width)
bw.F.SBC.rt(shrub.data$Width, kernel = "epanechnikov")
This function computes Bose and Dutta (2022) kernel distribution estimator given a sample and the corresponding bias function.

cdf.bd(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  y.seq,
  bw = "bw.F.SBC.rt",
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  from,
  to,
  nb = 512L,
  plot = TRUE,
  correction = c("none", "left", "right", "both"),
  ...
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| y.seq | A numeric vector specifying the points where the distribution is estimated. Alternatively, the grid can be specified through from, to and nb. |
| bw | The bandwidth to be used in the distribution estimation. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| from | Numeric value specifying the lower bound of the grid where the estimator is computed when y.seq is not provided. |
| to | Numeric value specifying the upper bound of the grid where the estimator is computed when y.seq is not provided. |
| nb | An integer specifying the number of points at which the estimator is computed when y.seq is not provided. Default is 512. |
| plot | A logical value indicating whether to plot the estimation. Default is TRUE. |
| correction | A character string specifying the boundary correction to be applied. Options are "none", "left", "right" and "both". Default is "none". |
| ... | Additional arguments to be passed to bandwidth selection functions. |
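Bose and Dutta (2022) kernel distribution estimator is expressed as

$$\hat{F}_{h}(y) = \frac{\hat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)} \, \mathbb{K}\left(\frac{y - Y_i}{h}\right),$$

where $h$ is the bandwidth, $\mathbb{K}$ is the kernel distribution function and $\hat{\mu}_w = \left(\frac{1}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1}$.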
Bose and Dutta (2022) propose a truncation correction of the estimator for variables with compact support $[a, b]$:

$$\tilde{F}_{h}(y) = \frac{\hat{F}_{h}(y) - \hat{F}_{h}(a)}{\hat{F}_{h}(b) - \hat{F}_{h}(a)}, \quad y \in [a, b],$$

where $\hat{F}_{h}$ denotes the uncorrected estimator. The truncation correction is also valid for variables supported on $[a, \infty)$ or $(-\infty, b]$, replacing $\hat{F}_{h}(b)$ by 1 or $\hat{F}_{h}(a)$ by 0, respectively, in the above expression. This correction is implemented in the correction argument, which can take values "none", "left", "right" or "both". If "left", the estimator is corrected to 0 for values less than the minimum of y.seq; if "right", it is corrected to 1 for values greater than the maximum of y.seq; if "both", it applies both corrections simultaneously.
A list with the following components:
| Component | Description |
|---|---|
| y.seq | The points where the distribution is estimated. |
| F.hat | The estimated distribution values. |
| bw | The bandwidth value. |
| n | The sample size after removal of missing values. |
| call | The call which produced the result. |
| has.na | Logical; indicates whether the original vector y contains missing values. |
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
bw.F.SBC.rt, bw.F.BD, bw.F.SBC.cv, bw.F.SBC.pi
cdf.bd(shrub.data$Width, kernel = "epanechnikov")
cdf.bd(shrub.data$Width, bw = "bw.F.SBC.cv")
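For intuition, a kernel distribution estimator of this type can be sketched in a few lines of base R: each observation contributes a smooth step (here a Gaussian distribution function) weighted by 1 / w(Y_i). The sketch below is not the package's implementation. The function name `bd_cdf_sketch`, the fixed Gaussian kernel, and the way the truncation correction is applied (rescaling by the estimator's values at the ends of y.seq) are assumptions made for illustration.

```r
# Hypothetical sketch of a weighted kernel distribution estimator.
bd_cdf_sketch <- function(y, y.seq, h,
                          w = function(y) ifelse(y >= 0, y, NA),
                          correction = c("none", "left", "right", "both")) {
  correction <- match.arg(correction)
  inv_w <- 1 / w(y)
  keep <- is.finite(inv_w)                 # drop points where w is NA or zero
  y <- y[keep]
  weights <- inv_w[keep] / sum(inv_w[keep])
  # Uncorrected estimator: weighted average of Gaussian kernel distribution functions
  F_raw <- function(t) {
    vapply(t, function(ti) sum(weights * pnorm((ti - y) / h)), numeric(1))
  }
  F_hat <- F_raw(y.seq)
  # Assumed form of the truncation correction, anchored at the ends of y.seq
  F_low <- if (correction %in% c("left", "both")) F_raw(min(y.seq)) else 0
  F_up <- if (correction %in% c("right", "both")) F_raw(max(y.seq)) else 1
  (F_hat - F_low) / (F_up - F_low)
}

set.seed(1)
y_demo <- rgamma(100, shape = 3, rate = 1)
bd_cdf_sketch(y_demo, y.seq = seq(0.1, 12, length.out = 10), h = 0.5, correction = "both")
```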
This function computes Cox (2005) distribution estimator given a sample and the corresponding bias function.

cdf.cox(y, w = function(y) { ifelse(y >= 0, y, NA) })

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
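Cox (2005) distribution estimator is expressed as

$$\hat{F}(y) = \frac{\hat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)} \, \mathbb{1}(Y_i \le y), \qquad \hat{\mu}_w = \left(\frac{1}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1}.$$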
A function of class ecdf, inheriting from the stepfun class, and hence inheriting a knots method.
Cox D (2005). “Some sampling problems in technology.” In Hand D, Herzberg A (eds.), Selected Statistical Papers of Sir David Cox, volume 1, 81–92. Cambridge University Press.
cdf.cox(y = shrub.data$Width)
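A weighted empirical distribution function of this kind gives each observation a jump proportional to 1 / w(Y_i) instead of 1 / n. The following base R sketch illustrates the idea. The name `cox_cdf_sketch` is hypothetical, and unlike cdf.cox it returns a plain closure rather than an object of class ecdf.

```r
# Hypothetical sketch of a weighted empirical distribution function.
cox_cdf_sketch <- function(y, w = function(y) ifelse(y >= 0, y, NA)) {
  inv_w <- 1 / w(y)
  keep <- is.finite(inv_w)        # drop points where w is NA or zero
  y <- y[keep]
  inv_w <- inv_w[keep]
  total <- sum(inv_w)
  # F(t) = share of the total inverse weight carried by observations <= t
  function(t) vapply(t, function(ti) sum(inv_w[y <= ti]) / total, numeric(1))
}

F_demo <- cox_cdf_sketch(c(1, 2, 4))
F_demo(c(0.5, 1, 2, 4))
```

With the length-bias function w(y) = y, small observations receive larger jumps, which compensates for their lower chance of being sampled.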
This function computes Bhattacharyya et al. (1988) density estimator given a sample and the corresponding bias function.

df.bhatta(y, w = function(y) { ifelse(y >= 0, y, NA) }, plot = TRUE, ...)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function to be used. It must be evaluable and positive at each point of the sample. |
| plot | Logical indicating whether to plot the estimated density. Default is TRUE. |
| ... | Additional arguments to be passed to density. |
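Bhattacharyya et al. (1988) density estimator is computed as follows:

$$\hat{f}(y) = \hat{\mu}_w \, \frac{\hat{g}(y)}{w(y)},$$

where $\hat{\mu}_w = \left(\frac{1}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1}$ and $\hat{g}$ is the kernel density estimate of the given data y, computed with the density function with main arguments bw and kernel.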
A list with the following components:
| Component | Description |
|---|---|
| y.seq | The points where the density is estimated. |
| f.hat | The estimated density values. |
| bw | The bandwidth value. |
| n | The sample size after removal of missing values. |
| call | The call which produced the result. |
| has.na | Logical; indicates whether the original vector y contains missing values. |
Bhattacharyya BB, Franklin LA, Richardson GD (1988). “A comparison of nonparametric unweighted and length-biased density estimation of fibres.” Communications in Statistics - Theory and Methods, 17(11), 3629–3644.
# Rule of thumb
df.bhatta(shrub.data$Width, bw = "nrd0")
# Cross Validation
df.bhatta(shrub.data$Width, bw = "ucv")
# Sheather & Jones
bhata_sj <- df.bhatta(shrub.data$Width, bw = "SJ-ste")
# Epanechnikov kernel
df.bhatta(shrub.data$Width, bw = "nrd0", kernel = "epanechnikov")
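The construction described above (an ordinary kernel density estimate of the biased sample, divided by the bias function and rescaled) can be sketched with stats::density as follows. This is an illustrative sketch, not the code of df.bhatta. The name `bhatta_density_sketch` is hypothetical, the returned component names simply imitate those listed above, and grid points where w cannot be evaluated are dropped by assumption.

```r
# Hypothetical sketch: unweighted KDE of the biased sample, then divide by w.
bhatta_density_sketch <- function(y, w = function(y) ifelse(y >= 0, y, NA), ...) {
  inv_w <- 1 / w(y)
  keep <- is.finite(inv_w)
  y <- y[keep]
  mu_hat <- 1 / mean(inv_w[keep])     # estimate of the normalizing constant
  g_hat <- density(y, ...)            # ordinary KDE of the biased sample
  f_hat <- mu_hat * g_hat$y / w(g_hat$x)
  ok <- is.finite(f_hat)              # density() extends the grid beyond the data range
  list(y.seq = g_hat$x[ok], f.hat = f_hat[ok], bw = g_hat$bw, n = length(y))
}

set.seed(2)
est_demo <- bhatta_density_sketch(rgamma(200, shape = 3, rate = 1), bw = "nrd0")
str(est_demo)
```

Because the estimate is divided by w, values can become unstable where w is close to zero, which is one reason to inspect the left end of the grid before interpreting the result.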
This function computes Jones (1991) kernel density estimator given a sample and the corresponding bias function.

df.jones(
  y,
  w = function(y) { ifelse(y >= 0, y, NA) },
  y.seq,
  bw = "bw.f.BGM.rt",
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"),
  from,
  to,
  nb = 512L,
  plot = TRUE,
  ...
)

| Argument | Description |
|---|---|
| y | A numeric vector containing the biased sample. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| y.seq | A numeric vector specifying the points where the density is estimated. Alternatively, the grid can be specified through from, to and nb. |
| bw | The smoothing bandwidth to be used in the density estimation. |
| kernel | A character string specifying the kernel function. Available options: "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine" and "optcosine". |
| from | Numeric value specifying the lower bound of the grid where the estimator is computed when y.seq is not provided. |
| to | Numeric value specifying the upper bound of the grid where the estimator is computed when y.seq is not provided. |
| nb | An integer specifying the number of points at which the estimator is computed when y.seq is not provided. Default is 512. |
| plot | A logical value indicating whether to plot the density estimation. Default is TRUE. |
| ... | Additional arguments to be passed to bandwidth selection functions. |
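Jones (1991) kernel density estimator is expressed as

$$\hat{f}_{h}(y) = \frac{\hat{\mu}_w}{n h} \sum_{i=1}^{n} \frac{1}{w(Y_i)} \, K\left(\frac{y - Y_i}{h}\right),$$

where $h$ is the bandwidth, $K$ is the kernel density function and $\hat{\mu}_w = \left(\frac{1}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1}$.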
A list with the following components:
| Component | Description |
|---|---|
| y.seq | The points where the density is estimated. |
| f.hat | The estimated density values. |
| bw | The bandwidth value. |
| n | The sample size after removal of missing values. |
| call | The call which produced the result. |
| has.na | Logical; indicates whether the original vector y contains missing values. |
Jones MC (1991). “Kernel density estimation for length biased data.” Biometrika, 78(3), 511–519.
bw.f.BGM.rt, bw.f.BGM.cv, bw.f.BGM.boot1, bw.f.BGM.boot2
# Rule of thumb
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.rt")
# Cross Validation
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.cv")
# Bootstrap
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot1", bw0 = "RT")
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot1", bw0 = "PI")
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot2", nh = 50L)
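A weighted kernel density estimator of this form is a mixture of kernels centred at the observations, with mixing weights proportional to 1 / w(Y_i). A minimal base R sketch follows, assuming a Gaussian kernel and a bandwidth h supplied by the user. The name `jones_density_sketch` is hypothetical and the code is not taken from the package.

```r
# Hypothetical sketch of a weighted kernel density estimator with a Gaussian kernel.
jones_density_sketch <- function(y, y.seq, h,
                                 w = function(y) ifelse(y >= 0, y, NA)) {
  inv_w <- 1 / w(y)
  keep <- is.finite(inv_w)            # drop points where w is NA or zero
  y <- y[keep]
  # inv_w / sum(inv_w) equals mu_hat / (n * w(Y_i)), so the weights sum to one
  weights <- inv_w[keep] / sum(inv_w[keep])
  vapply(y.seq, function(t) sum(weights * dnorm(t, mean = y, sd = h)), numeric(1))
}

set.seed(3)
y_demo <- rgamma(200, shape = 3, rate = 1)
grid_demo <- seq(-5, 30, length.out = 2000)
f_demo <- jones_density_sketch(y_demo, grid_demo, h = 0.5)
sum(f_demo) * (grid_demo[2] - grid_demo[1])   # Riemann sum, should be close to 1
```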
This function generates a biased sample of size n using von Neumann's (1951) acceptance-rejection method. The generated sample is biased according to the provided bias function w, with respect to the unbiased density function fx.

rbiased(
  n,
  w = function(y) { ifelse(y >= 0, y, NA) },
  fx,
  lim = 0.01,
  plot = TRUE,
  stop = TRUE,
  shape1, shape2, location, scale, df, ncp, rate, df1, df2, shape,
  meanlog, sdlog, min, max, mgshape, mgscale, mgweight, pro, mean, sd
)

| Argument | Description |
|---|---|
| n | Sample size. |
| w | A function representing the bias function applied to the data points. It must be evaluable and positive at each point of the sample. |
| fx | Unbiased density function. Values allowed are Beta ( |
| lim | Lower and upper limits for the range where the bias is significant and, hence, where |
| plot | Logical value indicating whether to generate a plot of the biased sample. Default is TRUE. |
| stop | Logical value indicating whether to stop when the bias function cannot be evaluated at a generated value. Default is TRUE. |
| shape1, shape2 | Additional arguments to be passed to the unbiased density function |
| df, ncp | Additional arguments to be passed to the unbiased density function |
| df1, df2 | Additional arguments to be passed to the unbiased density function |
| shape, rate, scale, location | Additional arguments to be passed to the unbiased density function |
| min, max | Additional arguments to be passed to the unbiased density function |
| mgshape, mgscale, mgweight | Additional arguments to be passed to the unbiased density function |
| mean, sd, pro, meanlog, sdlog | Additional arguments to be passed to the unbiased density function |
This function implements von Neumann's (1951) acceptance-rejection method to generate a biased sample given an unbiased density function fx and a bias function w.
A numeric vector containing a biased sample from density fx and bias function w.
von Neumann J (1951). “Various techniques used in connection with random digits.” Notes by G. E. Forsythe, Journal of Research of the National Bureau of Standards, Applied Math Series, 12(3), 36–38.
# Generate a length-biased sample of size 100 from an exponential distribution
rbiased(n = 100, fx = "exp", rate = 2, plot = FALSE)
# Generate a length-biased sample from a gamma distribution
rbiased(n = 100, fx = "gamma", rate = 1.5^2, shape = 1.5)
# Generate a biased sample from a gaussian distribution
custom_bias <- function(y) {
  y^2
}
rbiased(n = 100, w = custom_bias, fx = "norm", mean = 3, sd = 10, plot = TRUE)
# Generate a biased sample from a mixture of gaussian distributions
custom_bias <- function(y) {
  sqrt(abs(y)) + 5
}
rbiased(
  n = 100, w = custom_bias, fx = "mixnorm",
  pro = rep(1 / 3, 3), mean = c(0.25, 0.5, 0.75), sd = rep(0.075, 3)
)
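The acceptance-rejection idea behind such a sampler can be sketched for one concrete case: draw candidates from the unbiased density and keep each one with probability proportional to w. The sketch below handles only an exponential fx with a nondecreasing w, and it truncates the far right tail so that w is bounded. The name `rbiased_exp_sketch` and the `tail_prob` argument are hypothetical and do not correspond to how rbiased or its lim argument work internally.

```r
# Hypothetical acceptance-rejection sketch for an exponential unbiased density.
rbiased_exp_sketch <- function(n, rate = 1, w = function(y) y, tail_prob = 1e-6) {
  upper <- qexp(1 - tail_prob, rate = rate)  # assumed truncation point for the proposal
  M <- w(upper)                              # bound on w over [0, upper] if w is nondecreasing
  out <- numeric(0)
  while (length(out) < n) {
    x <- rexp(2 * n, rate = rate)            # candidates from the unbiased density
    x <- x[x <= upper]
    u <- runif(length(x))
    out <- c(out, x[u <= w(x) / M])          # accept with probability w(x) / M
  }
  out[seq_len(n)]
}

set.seed(1)
y_lb <- rbiased_exp_sketch(2000, rate = 2)
mean(y_lb)   # the length-biased Exp(2) is Gamma(shape = 2, rate = 2), with mean 1
```

The acceptance rate falls as the bound M grows, so a tighter truncation makes the sampler faster at the cost of discarding more of the tail.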
Dataset containing size measurements of Cercocarpus montanus shrubs sampled in an old limestone quarry.
data(shrub.data)
A data.frame with the following variables:
| Variable | Description |
|---|---|
| Replica | Replica identifier (I or II). |
| Transect | Transect identifier (1, 2 or 3). |
| Number | Shrub/clump identifier. |
| Intercept | Length of the intersection of the clump of overlapping shrubs with the transect. |
| Width | Width between two lines tangent to the shrub and parallel to the transect. |
| Height | Maximum height of the shrub encountered by the transect. |
| Stems | Number of stems on the shrub encountered by the transect. |
During the fall semester of 1986, students in a graduate course on biological sampling techniques, taught by Lyman L. McDonald at the University of Wyoming, conducted a field study using the linear transect method to measure the size of Cercocarpus montanus shrubs in an old limestone quarry located just east of Laramie, Wyoming. In this area, rock fissures run predominantly north to south, and vegetation is denser within them. To align with this structure, a baseline was established approximately parallel to the fissures, and six transects were placed perpendicular to this baseline, grouped into two independent replicates (I and II), each comprising three equally spaced transects.
Students walked along the transects and recorded all Cercocarpus montanus individuals intersected. For each shrub, they measured maximum height, width (defined as the greatest distance between two tangents to the shrub’s contour, parallel to the transect), and the number of stems. Given the rhizomatous nature of the species and possible interconnections between neighboring shrubs, an individual was defined as a group of stems at the base separated by at least fifteen centimeters from the nearest neighbor. Additionally, the length of intersection with the transect line was recorded for each shrub cluster.
Due to the nature of the sampling method, wider shrubs had a higher probability of being intersected by a transect. As a result, the sample of shrub widths exhibits length bias. Measurements of height and number of stems are also affected by sampling bias, although the associated bias function is more complex, as it depends on the relationship between width and these other morphological features.
For further details on the sampling protocol and data structure, see Muttlak (1988).
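The effect of length bias on a naive summary can be illustrated with simulated data rather than the shrub measurements. The sketch below relies on a standard fact: the length-biased version of a Gamma(shape, rate) distribution is Gamma(shape + 1, rate). The distribution and its parameters are arbitrary choices for illustration and are not fitted to shrub.data.

```r
# Simulated illustration of length bias and its correction via inverse weights.
set.seed(123)
shape <- 3
rate <- 2                                                   # unbiased Gamma(3, 2), true mean 1.5
y_biased <- rgamma(5000, shape = shape + 1, rate = rate)    # length-biased sample

naive_mean <- mean(y_biased)               # targets (shape + 1) / rate = 2, not 1.5
corrected_mean <- 1 / mean(1 / y_biased)   # harmonic mean, targets shape / rate = 1.5

c(naive = naive_mean, corrected = corrected_mean)
```

The harmonic mean used here is the same quantity that appears as the normalizing constant in weighted density and distribution estimators when w(y) = y.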
Muttlak HA (1988). Some aspects of ranked set sampling with size biased probability of selection. Ph.D. thesis, University of Wyoming.
data(shrub.data)
names(shrub.data)
str(shrub.data)
class(shrub.data)