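Over the past two or three years, books about R have become noticeably more common. Early this year R entered the top 20 of the TIOBE index, and Oracle is preparing to integrate R into its database products. Integration with Hadoop clusters is bound to make R popular in big-data analysis and visualization.
Bioconductor hosts a large number of analysis packages that people working in biology can hardly avoid. Ewan Birney, in his blog post "Five statistical things I wished I had been taught 20 years ago", lists R as an item in its own right. Any programmer should learn at least a little R.
This post does not try to answer a question as big as the one in its title. It is only a record of my notes from reading the paper "A Quick Guide to Teaching R Programming to Computational Biology Students".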
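As an introduction to the R language, Attachment 1 of the paper provides good lecture notes. The author argues that the best way to learn is to get hands-on with some biological problems, and lists the following three: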
- Sequence alignment. I have implemented this before, following the NBT (Nature Biotechnology) article "What is dynamic programming?"
- The discrete logistic equation; see "Simple mathematical models with very complicated dynamics". Nature 261:459–467.
- Cellular automata; see "Mathematical games: The fantastic combinations of John Conway’s new solitaire game 'Life'". Scientific American 223:120–123.
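All three exercises are small enough to sketch in a few lines each. The following is a rough illustration written in Python rather than R, and it is not the paper's own solution. The function names, the scoring values (match +1, mismatch -1, gap -1), and the choice to treat cells outside the grid as dead are all assumptions made for this example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming (Needleman-Wunsch style).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def logistic_map(r, x0, steps):
    """Iterate x[n+1] = r * x[n] * (1 - x[n]) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


def life_step(grid):
    """One generation of Conway's Life on a list of 0/1 rows.

    Assumption for this sketch: cells outside the grid count as dead.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neighbours = sum(
                grid[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
                and 0 <= i + di < rows
                and 0 <= j + dj < cols
            )
            # A live cell survives with 2 or 3 neighbours; a dead cell is born with exactly 3.
            new[i][j] = 1 if neighbours == 3 or (grid[i][j] and neighbours == 2) else 0
    return new
```

Calling `logistic_map` with different values of `r` and plotting the trajectories is one way to see the "very complicated dynamics" of the second reference, and passing a horizontal row of three live cells to `life_step` turns it into a vertical one.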
Reproducible research:
The idea of reproducible research is quite simple: to provide not only a brief description of, e.g., how some data has been analysed, but also to provide the code and data to allow someone else to recreate exactly the same sequence of steps
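An R Sweave document is essentially LaTeX with R code added. When the document is processed by Sweave, the R code is run and all of its output (text or figures) is written as LaTeX into the resulting .tex file, which can then be compiled with LaTeX into a PDF. Attachments 2 and 3 are the Sweave source and the corresponding PDF, which introduce reproducible research through an example that estimates pi. The author clearly values this skill, and it is a must-have at least for anyone who wants to write R packages.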
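The idea of shipping code together with its result can be shown without Sweave itself. As a rough sketch, assuming the pi example is a Monte Carlo estimate (the attachment may well use a different method), the Python snippet below fixes the random seed so that anyone who reruns it gets exactly the same number. The name `estimate_pi`, the seed, and the sample size are illustrative choices, not taken from the paper.

```python
import random


def estimate_pi(n_points, seed=1):
    """Monte Carlo estimate of pi from random points in the unit square.

    The fraction of points falling inside the quarter circle of radius 1
    approximates pi / 4. The seed is an arbitrary, assumed value.
    """
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points
```

Calling `estimate_pi(100_000)` twice returns the same value, which is the property a reproducible document relies on: the reported figure can be regenerated from the code rather than copied by hand.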
The author recommends several reference books:
- Introductory Statistics with R
- Introduction to Probability with R
- A First Course in Statistical Programming with R
- S Programming
- R Graphics
and also lists several books on computational biology:
- Computational Genome Analysis
- Stochastic Modelling for Systems Biology
- R Programming for Bioinformatics
- Dynamical Models in Biology