CTSI Annual Pilot Awards to Improve the Conduct of Research

An Open Proposal Opportunity

Evidence-based Inputs for Sample Size Calculations: A Web-based Application

Rationale:  Inaccurate sample size estimation leads to research studies that enroll too few or too many study participants. The former can result in failure to demonstrate anticipated effects, and the latter in excessive costs from overuse of research participants and personnel time. While there are several important components of a good sample size calculation – including a compelling research question, an appropriate outcome variable, and an efficient study design – here we focus on improving the accuracy of the quantitative inputs. Examples of such parameters include:

  • the variability of responses within and between eligible participants (accounting for correlations among replicate measurements at a given time and/or at different times),  
  • the mean response under standard-of-care conditions, and
  • the prevalence of the disease under study.

At present, the primary sources of parameter values are pilot studies (often too small to provide reliable estimates) and published studies (which often lack the needed values). Estimates of parameter values that are not evidence-based can introduce a large amount of error into the sample-size calculation, making it precise but inaccurate. The proposed web-based application would allow a clinician and biostatistician to identify required values in real time during a consulting visit, revising the query as needed to ensure a viable and cost-efficient study.  Given the enormous number of medical studies launched every year, access to accurate inputs to sample size calculations could vastly reduce waste of valuable resources.
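
To illustrate how strongly one input can drive the result, the sketch below uses a textbook normal-approximation formula for comparing two means with equal group sizes. The proposal does not prescribe this formula. The function name `n_per_group_means`, the default significance level and power, and all numeric values are assumptions chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_means(sd, delta, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Assumes a normal approximation, equal group sizes, and a common SD.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical inputs: a pilot study suggests SD = 10, but the true SD is 12.
n_assumed = n_per_group_means(sd=10.0, delta=5.0)
n_actual = n_per_group_means(sd=12.0, delta=5.0)
print(n_assumed, n_actual)
```

Under these assumed values, a 20% underestimate of the standard deviation changes the required enrollment by roughly 44%, because the sample size scales with the square of the standard deviation.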


Plan: The two key ingredients for obtaining a wide range of evidence-based inputs for sample size calculations are the availability of electronic sources of current health information and a convenient means of retrieving relevant information from the databases. During pilot funding, we will demonstrate feasibility of (1) identifying relevant existing databases; (2) quantifying improved accuracy of sample-size calculations based on evidence-based parameters (as defined here) relative to those based on other sources of inputs; and (3) producing version 1 of a user-friendly web-based interface to access, summarize, and output evidence-based parameters in useful formats. Beyond pilot funding, additional databases could be tapped, and the breadth of access, summarization, and output could be expanded.

Aim 1:  Identify Databases:

  • Ideal databases should provide longitudinally tracked individual-level health outcomes related to a wide range of conditions. Two excellent examples appear to be (1) databases included in the UCSF CTSI Large Dataset Inventory, accessible at no cost through http://accelerate.ucsf.edu/research/celdac (example: National Health Interview Survey), and (2) the Kaiser healthcare database. [Must: Identify key personnel who could provide access, engage their interest, and establish an agreement to collaborate.]
  • With clinical partners: Establish a set of commonly expected requests that could be used to evaluate candidate databases on quality and availability of desired data.  
  • With database partners and computer experts: Optimize access to and manual retrieval of information.

Aim 2:  Proof of Concept  

  • Identify commonly used outcome variables and alternative choices through review of the published literature and discussion with clinical colleagues.
  • Identify a selection of recently published high-profile studies that reported the values of key study design parameters. For each: (1)  Document the eligibility criteria and parameter values that were used in the study design. (2) Document corresponding values drawn from the selected databases. (3) Examine the effect on the sample size of differences between the two sets of parameter values.
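
The comparison in step (3) could be sketched as follows for a binary outcome. This is only an illustration under stated assumptions: the normal-approximation formula for two proportions, the function names `n_per_group_props` and `percent_change`, and the event rates are all hypothetical and are not taken from any published study or database.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_props(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-group n for comparing two proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_treated) ** 2)


def percent_change(n_reported, n_database):
    """Signed percent change in sample size when database values replace reported ones."""
    return 100.0 * (n_database - n_reported) / n_reported


# Hypothetical: the study design assumed a 30% control event rate; the database
# suggests 25% among patients meeting the same eligibility criteria. The
# absolute difference to be detected (10 percentage points) is held fixed.
n_reported = n_per_group_props(p_control=0.30, p_treated=0.20)
n_database = n_per_group_props(p_control=0.25, p_treated=0.15)
change = percent_change(n_reported, n_database)
exceeds_threshold = abs(change) >= 10.0  # 10% is the placeholder threshold for Aim 2
print(n_reported, n_database, round(change, 1), exceeds_threshold)
```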

Aim 3: Create a web-based user interface, a manual of procedures with useful examples, and useful export files.

  • User experience:

          o   Make retrieval fully dynamic: allow retrieval of any outcome stored in the database, tailored to major eligibility criteria (e.g., age and diagnosis) and design criteria (e.g., frequency of assessment per patient).

          o   Use drop-down menus, populated by database-specific data dictionaries, to ensure accurate spelling.

          o   Produce results that are easy for the investigator or biostatistician to manipulate. The software will process all (or a random sample of) eligible values to generate rates (see example appended), means, variances, and correlations, as needed. 

          o   Generate downloadable documentation of queries for user’s later access.

  • Developer experience:

         o   As queries of a database are made, record queries (including search criteria) and results.

         o   Identify unavailable measures; examine reasons. Automate or prompt search for an alternative database and/or measure.
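
As a rough sketch of the retrieval, summarization, and query-logging steps listed above, the example below filters a small in-memory table by eligibility criteria, computes a rate, means, variances, and a within-patient correlation, and records the query. The record layout, the names `RECORDS`, `QUERY_LOG`, and `summarize`, the threshold of 150, and all data values are assumptions made for illustration; a real database would require its own access layer and data dictionary.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical records: (patient_id, age, diagnosis, visit, systolic_bp)
RECORDS = [
    ("p1", 54, "hypertension", 1, 150.0),
    ("p1", 54, "hypertension", 2, 146.0),
    ("p2", 61, "hypertension", 1, 162.0),
    ("p2", 61, "hypertension", 2, 158.0),
    ("p3", 47, "hypertension", 1, 140.0),
    ("p3", 47, "hypertension", 2, 143.0),
    ("p4", 35, "diabetes", 1, 128.0),
    ("p4", 35, "diabetes", 2, 130.0),
]

QUERY_LOG = []  # each query's criteria and results are kept for later review


def summarize(records, diagnosis, min_age, max_age, threshold=150.0):
    """Summarize the outcome at visits 1 and 2 for patients meeting the criteria."""
    by_patient = {}
    for pid, age, dx, visit, value in records:
        if dx == diagnosis and min_age <= age <= max_age:
            by_patient.setdefault(pid, {})[visit] = value
    # Keep only patients observed at both visits so values can be paired.
    pairs = [(v[1], v[2]) for v in by_patient.values() if 1 in v and 2 in v]
    if len(pairs) < 2:
        raise ValueError("need at least two eligible patients with both visits")
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    m1, m2 = mean(first), mean(second)
    # Sample covariance between visits, then the Pearson correlation.
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / (len(pairs) - 1)
    result = {
        "n_patients": len(pairs),
        "mean_visit1": m1,
        "mean_visit2": m2,
        "var_visit1": variance(first),
        "var_visit2": variance(second),
        "corr_visits": cov / sqrt(variance(first) * variance(second)),
        "rate_visit1_at_or_above_threshold": mean(1.0 if a >= threshold else 0.0 for a in first),
    }
    criteria = {"diagnosis": diagnosis, "min_age": min_age, "max_age": max_age}
    QUERY_LOG.append({"criteria": criteria, "result": result})
    return result


summary = summarize(RECORDS, diagnosis="hypertension", min_age=40, max_age=70)
print(summary)
```

Keeping the search criteria alongside each result is one simple way to support both the downloadable documentation of queries and the later tally of which measures are requested most often.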


Criteria and metrics for success:

Aim 1:  Compare the proposed database resources in terms of features, data quality, and costs: 

  • Are Common Data Elements used?
  • Is a data dictionary available for sorting and browsing to find available measures?
  • Does the resource have the requested measures?
  • How current are the measures?

Aim 2:  For a range of recently published studies that reported the values of key study design parameters, evaluate the effect on the sample-size calculation of differences between reported parameter values and values retrieved from our proposed database resource(s).  Hypothesis:  Evidence-based values will modify the sample size calculation by at least 10%.   

Aim 3:  Compare the proposed database resource(s) in terms of ease of access and value of information gained: 

  • Poll users to obtain feedback on ease of use and value of information retrieved, by database.  
  • Summarize measures queried by frequency. 
  • Characterize variation among databases with respect to comprehensiveness and ease of use.
  • Estimate personnel costs associated with building database access.


Cost:  We seek funding to access at least two large free databases by leveraging the CTSI Large Dataset Inventory [Aim 1], to quantify the benefit of evidence-based parameter values on sample-size calculations [Aim 2], and to plan the computational work in fine detail [Aim 3].  Salary support: $100,000 (12 months) for principal faculty and staff.


Collaborators: Joan Hilton (Epidemiology & Biostatistics) will lead the biostatistical aspects of the “App” development.  Tracy van Nunnery (Medicine) will lead a team of computer experts who will create the database interfaces. Kirsten Bibbins-Domingo (Medicine) will serve as lead clinical collaborator.


Example interface and output:  Dr. Hilton and Mr. Nunnery have ongoing research collaborations that began in 2007. Mr. Nunnery and his team created HERO, the electronic medical record system used at Ward 86 of SFGH, and thus are well acquainted with HIPAA requirements. As an example of their work, users (clinicians) can query HERO to obtain the distribution of any patient characteristic captured by clinicians, limited to user-specified search criteria. The web-based interface for retrieving demographic data is shown below, with the date fields displayed (upper image).  The distributions of demographic characteristics were exported to an Excel spreadsheet (lower image).


Sounds like a good idea - Consultation Services has been interested in developing a systematic, formulaic and more automated approach to power calculations and we would be interested in working with you on this. However, will you be able to obtain good information on distributions and prevalence on enough measurements and within enough subgroups of the population from the datasets you're proposing to be broadly useful? It seems like the possibilities for what might be needed are endless, and the dataset resources limited.

Thanks for your support! I agree that finding a great database up front would make the most efficient use of our time; selecting the first database to work with is Aim 1. Input from my clinical collaborator will help ensure the initial resource is useful in a very wide range of UCSF-type research applications. Your input is most welcome too. Got database suggestions???

The quantities to be estimated may have some potential usefulness for study design, but I unfortunately believe that the proposal as envisioned would be building on an untenable foundation--the conventional power-based approach to choosing and/or justifying sample size. I don’t think that facilitating power calculations will provide any actual scientific benefit (sorry Mark), although there may be practical benefits. The proposal’s rationale presumes a meaningful pre-study definition of “too few” participants, which does not actually exist and is what I have called the “threshold myth” (see http://www.ctspedia.org/do/view/CTSpedia/SampleSizeFlaws#The_threshold_myth); it also seems to presume that poor cost efficiency can be avoided without consideration of the actual costs, which I believe is unrealistic. In practice, investigators must consider cost and feasibility in choosing sample size, and calculations are often just window dressing for choices based on other considerations (see http://www.ctspedia.org/do/view/CTSpedia/SampleSizeFlaws#Erosion_of_scie...). I therefore believe that this would have little impact on actual sample size choices. To the extent that it would influence the rare cases where choices are based only on the conventional power-based approach, those choices will not necessarily be improved, because the arbitrary goal of 80% power has no valid justification and is not necessarily optimal. More accurately estimating something that is not meaningful will not make it more meaningful. In addition, a very influential input is the difference to be “detected”, and this apparently will not be addressed. This is often the hardest to specify because 1) uncertainty about it is presumably large enough to warrant the study being proposed, and 2) the theoretical basis for choosing it is unclear (see http://www.ctspedia.org/do/view/CTSpedia/SampleSizeFlaws#Inherent_inaccu...). 
A practical approach to this problem is to calculate what difference will produce 80% power for the proposed sample size (which was chosen for other reasons). Perhaps this could be integrated into what is envisioned. This would be of practical utility for justifying sample sizes in proposals using power-based conventions. I think this could have some value for making conventional calculations (the window dressing) easier and making them seem more objective and accurate. The proposed estimates could also be used in the approach described at http://www.ctspedia.org/do/view/CTSpedia/SampleSizeFlaws#Sensitivity_ana..., instead of in conventional calculations. This approach, however, is not widely used, and the proposal does not seem geared toward any such use. Regarding Aim 2, inaccuracy of estimated inputs has been empirically documented already, and 10% impact on sample size is much too precise a standard (see, e.g., Vickers AJ: Underpowering in randomized trials reporting a sample size calculation. Journal of Clinical Epidemiology 2003, 56:717-720; more than half of high-profile RCTs had >2-fold inaccuracy). I would focus instead on documenting improvement in accuracy from use of the proposed database(s) and subset selection tools, along with the level of accuracy that they can attain. This seems like the key proof of concept that is needed (regardless of what use the estimates would be put to). I would undertake this initial validation before doing any work on usability and a user interface, because such initial results might be discouraging.

Peter, I appreciate that you too care about the quality of sample-size calculations and have gone so far as to explore the topic through methodological research, as documented in the online links you provide here. Thank you for your comments on my proposal. I number and respond to each of your three points below.
POINT 1. "Investigators must consider cost and feasibility in choosing sample size."
RESPONSE: Whereas the text above roundly criticizes the “conventional power-based approach” and “the arbitrary goal of 80% power,” neither of these concepts appears in my proposal. I appreciate the nuances of sample-size calculations and have proposed that several important issues are better handled through collaboration between an investigator and a biostatistician. In addition, I identify overenrollment of participants as a problem, not just underenrollment. In its introduction, your website lists “three crucial flaws in [the] standard approach,” the second of which is, “relies strongly on inputs that generally cannot be accurately specified.” I believe our proposal directly addresses this very problem. We welcome you to join this project, if you wish to also pursue practical solutions.
POINT 2. "The difference to be detected"
RESPONSE: I agree that the difference to be detected is a very important quantity. Because of its special status, I did not address it (or other “important components of a good sample size calculation – including a compelling research question, an appropriate outcome variable, and an efficient study design”). Some of my thoughts on this parameter follow.
  • As an example, in the setting of a randomized controlled trial designed to detect the superiority of a new therapy over a standard therapy, the value of interest may be the minimum clinically important difference (MCID) between arms – a value large enough to warrant inclusion of the new therapy among therapeutic options available to clinicians and their patients.
  • For a new therapy, person-level data will not exist in the databases I propose to tap, so these databases are not a source of information about the MCID. Guidance from other resources could be obtained, such as from the literature published to date and preliminary studies. The combined efforts of a clinician and biostatistician during a consulting visit could be very fruitful in honing the value selected.
  • Importantly, the difference to be detected is a function of an outcome variable, the variation of which can be estimated from the databases I propose to tap. In turn, the variance of the difference can be estimated, accounting for correlation among repeated measurements within individuals. At present investigators have very little information to guide selection of this quantity; availability of evidence-based inputs would be a substantial leap forward.
POINT 3. "Regarding Aim 2"
RESPONSE: I proposed a simple measure of the accuracy of evidence-based inputs: comparison of the sample size calculated using conventional inputs (e.g., based on literature search) with that using the proposed inputs.
  • To address point 3.3, the proposed inputs rely on the quality of the database; hence we propose to select the database first.
  • To address point 3.1, improvement in accuracy also relies on the quality of the conventional inputs and will vary across medical specialties, outcome variables, relevant patient populations, and many other features of the data. Accuracy gains should be higher in less charted territories. Thus a challenging test of improvement should be conducted using well characterized outcomes, such as those used in cardiovascular research, where anticipated gains are relatively small.
  • To address point 3.2, in addition to summarizing accuracy improvement by meaningful thresholds that would emerge as results unfold (10% should be seen as a place-holder), we would estimate accuracy on a continuous scale. Clearly, this “study” must be designed as carefully as any other.
  • Importantly, even when high-quality parameter estimates are available through conventional sources (i.e., when accuracy is high), if the investigator wants to change eligibility criteria relative to those sources, the parameter estimates also may change – unpredictably. Availability of evidence-based inputs would be a substantial leap forward.

Point 1 is the key one, so I will respond just to that. 1. If the proposal is to support calculations other than the standard power-based calculations with the minimum goal of 80% power, then this should be made clear in the proposal and explained in some detail. The inputs discussed are those used in standard calculations, and I see no reason to anticipate that the proposed product would not be used exclusively for standard calculations. I believe that an important consideration is that for any alternative approach, the primary issue for now is trying to make it acceptable to reviewers and feasible for investigators, rather than any refinements that this proposal might implement. I therefore think that this will mainly be useful for supporting standard calculations. As I said, this may have practical value (specifying inputs is a major headache, and choices are often challenged by reviewers), but I believe that the drawbacks of the standard approach will limit the actual scientific value that this could produce. In particular, even if this could completely resolve the problem of inherent uncertainty (which seems doubtful to me), the other two fundamental flaws in the conventional approach are still there. I have no interest in supporting the standard approach.
