Joint Statistical Meetings 2010 – first half


More on the petabyte milestone

In which I speculate about breaking through the petabyte milestone in clinical research

Information allergy

Reproducible research – further options

This is mostly a list of further reproducible research options, as a follow-up to a previous entry. I still don't like these quite as much as R/Sweave, but they might do in a variety of situations.

  • Inference for R – connects R with Microsoft Office 2003 or later. I evaluated this a couple of years ago and I think there's a lot to like about it. It is very Sweave-like, with the slight disadvantage that it prefers the data to be coupled tightly with the report. However, I think it is just as easy to keep the two separate by not using Inference's data features, which is an advantage when you want to regenerate the report after the data are updated. Another disadvantage is that I didn't see an easy way to rerun a report quickly, as you can with Sweave/LaTeX by writing a batch file or shell script (perhaps this is possible with Inference). On the plus side, you can also connect to Excel and PowerPoint. If you absolutely require Office 2003 or later, Inference for R is worth a look. It is, however, not free.
  • R2wd (link is to a very nice introduction) is a package a bit like R2HTML, except that it writes to a Word file. (SciViews has something similar, I think.) This is unlike many of the other options I've written about, because everything must be generated from R code. It is also a bit rough around the edges (for example, you cannot just write wdBody(summary(lm(y~x,data=foo)))). I think some of the packages it depends on, such as statconn, also allow connections to Excel and other applications, if that is needed.
  • There are similar solutions that connect to OpenOffice or Google Docs, some of which can be found in the comments section of the previous link.

The solutions that connect R with Word are very useful for businesses that rely on the Office platform. The solutions that connect to OpenOffice are useful for those who rely on the OpenOffice platform, or who need to exchange documents with Microsoft Office users but do not want to purchase Office themselves. However, for reproducible research in the sense I'm describing, these solutions are not ideal, because they allow the display version to be edited easily, and any hand edits make it difficult to regenerate the report when there is new data. If there were a way to make the document "comment-only" (i.e. no one could edit the document, only add comments), this might be workable. So far, I know it's possible to set a protection flag manually that allows redlining but not source editing of a Word file, but my Windows skills are not quite sufficient to make that happen from, for example, a batch file or other script.

Exchanging with Google Docs is a different beast. Google Docs allows easy collaboration without having to send emails with attachments. I think that this idea will catch on, and once IT personnel are satisfied with security, this approach (whether it's Google's system, Microsoft's attempt at catching up, or someone else's) will become the primary way of editing small documents that require heavy collaboration. Again, I'm not clear whether it's possible to share a Google document in a comment-only mode, which I think would be required for a reproducible research context to work, but I think this technology will be very useful.

The effect of protocol amendments on statistical inference of clinical trials

Lu, Chow, and Zhang recently released an article [1] detailing some statistical adjustments they claim need to be made when a clinical trial protocol is amended. While I have not investigated their method (they seem to fall back on what would be my first choice when there is no obvious or straightforward algorithm: maximum likelihood), I do appreciate that they have considered this issue at all. I have been thinking for a while that the way we tinker with clinical trials during their execution (all for good reasons, mind you) ought to be reflected in the analysis. For example, if a sponsor is unhappy with enrollment, they will often alter the inclusion/exclusion criteria to speed it up. This, as Lu et al. point out, tends to increase the variance of the treatment effect (and possibly affects the means as well). But rather than assess that impact directly, we end up analyzing a mixture of populations.
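
To make the "mixture of populations" point a little more concrete, here is a generic sketch based on the law of total variance. It is not the adjustment proposed by Lu, Chow, and Zhang, and the notation is my own assumption: suppose the successive protocol versions effectively split the enrolled patients into K subpopulations, where subpopulation k makes up a proportion w_k of the trial and has response mean \mu_k and variance \sigma_k^2.

```latex
% Assumed notation (not from the article): w_k, \mu_k, \sigma_k^2 for
% subpopulation k, with w_1 + ... + w_K = 1.
\[
  \bar{\mu} = \sum_{k=1}^{K} w_k \mu_k ,
  \qquad
  \operatorname{Var}(Y)
    = \underbrace{\sum_{k=1}^{K} w_k \sigma_k^2}_{\text{within-population}}
    + \underbrace{\sum_{k=1}^{K} w_k \left(\mu_k - \bar{\mu}\right)^2}_{\text{between-population}} .
\]
```

The second term is never negative and is zero only when every subpopulation has the same mean. So if an amendment shifts the mean response at all, an analysis that pools everyone sees a variance larger than the weighted average of the within-population variances, even before accounting for any change in those variances themselves.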

This and related papers seem to be rather heavy on the math, but I will be reviewing these ideas more closely over the coming weeks.