Date created: May 8, 2021
Last edited: March 17, 2022
The data for this example come from an Ensembl
genome browser web page (release 105; Dec 2021):
GRCh38.illumina.brain.1.bam
:
From the README
of the data repository:
Female brain tissue with poly-A selection (50bp PE + 75bp SE)
About the Ensembl
:
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
First step of the current tutorial is downloading the file named:
GRCh38.illumina.brain.1.bam
[Note: the file have approx. 8.3GB
](Please note that the download speed from Ensemblâs FTP server is quite low, as I got download speed of 200-300 KB for a file as large as 8.3 GB
, so the downloading took several hours.)
Then move the downloaded file in the same directory as your R Markdown (or the R consoleâs working directory).
A primer on what are the differences between particular genomics data file formats is available at Institute for Systems Genomics â University of Conneticutâs web page: Bioinformatics => UCONN => Resources and Events => Tutorials => File Formats Tutorial.
See also the differences between FASTA, FASTQ, SAM, and BAM explained here (with additional references).
In short:
Hence, BAM files cannot be simply viewed in a file system (e.g., by using bashâs cat
command or a notepad) â instead it has to be loaded with a software package, such as the Bioconductor.
Information on what libraries are needed to load and manipulate on BAM files can be found, .e.g., in the course by Irizarry & Love (2015): Importing NGS data into Bioconductor.
The first step of loading the data in our example is to import the required library: Rsamtools
. (During importing the Rsamtools
library there are pleanty of messages being displayed. In order to make the output here in the R Markdown clear, these messages were hidden. See how to hide output in R Markdown):
library(Rsamtools)
Define a filename â the name of the file that was downloaded:
filename = 'GRCh38.illumina.brain.1.bam'
Get information of the size of the file. BAM is a binary file, below R
command will display the size in "autp"
scale, e.g., MB
/GB
, etc.:
utils:::format.object_size(file.info(filename)$size, "auto")
## [1] "8.3 Gb"
Because the file is has over 7 GB Rsamtools
does not loads the entire file right away. Instead the object representing genetic data is created and the data (information on sequences, quality and/or mapping) is loaded on request, see below.
Load the BAM file (create an object representing the file):
bf <- BamFile(filename)
bf
## class: BamFile
## path: GRCh38.illumina.brain.1.bam
## index: NA
## isOpen: FALSE
## yieldSize: NA
## obeyQname: FALSE
## asMates: FALSE
## qnamePrefixEnd: NA
## qnameSuffixStart: NA
seqinfo
functionThe basic object storing sequence information is SeqInfo
(notice the difference between seqinfo
â that is a function [lowercase], and SeqInfo
[camel case] â that is an object). The SeqInfo
object can be created (obtained) with the following seqinfo
function:
seqinfo(bf)
## Seqinfo object with 195 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## KI270757.1 71251 <NA> <NA>
## KI270741.1 157432 <NA> <NA>
## KI270756.1 79590 <NA> <NA>
## KI270730.1 112551 <NA> <NA>
## KI270739.1 73985 <NA> <NA>
## ... ... ... ...
## 16 90338345 <NA> <NA>
## 5 181538259 <NA> <NA>
## 3 198295559 <NA> <NA>
## MT 16569 <NA> <NA>
## chrEBV 171823 <NA> <NA>
For further data handling, it is pretty convenient to convert a SeqInfo
object into R
âs native data.frame
object, e.g., in order to preview all rows of the SeqInfo
data and visualize the information contained in this data structure. For an information on how to convert SeqInfo
into data.frame
, see: this GenomeInfoDb MIT resource.
In order to perform the conversion use the following command:
df <- as.data.frame(seqinfo(bf))
head(df)
## seqlengths isCircular genome
## KI270757.1 71251 NA <NA>
## KI270741.1 157432 NA <NA>
## KI270756.1 79590 NA <NA>
## KI270730.1 112551 NA <NA>
## KI270739.1 73985 NA <NA>
## KI270738.1 99375 NA <NA>
For more information on seqinfo
and seqlengths
, see: this R documentation page.
data.frame
df
is a data frame â a named list. We can use rownames
to see the names of the particular rows (i.e., names of the indexes, see: https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/):
rownames(df)
## [1] "KI270757.1" "KI270741.1" "KI270756.1" "KI270730.1" "KI270739.1"
## [6] "KI270738.1" "KI270737.1" "KI270312.1" "KI270591.1" "KI270371.1"
## [11] "KI270385.1" "KI270381.1" "KI270517.1" "KI270508.1" "KI270539.1"
## [16] "KI270593.1" "KI270530.1" "KI270588.1" "KI270392.1" "KI270375.1"
## [21] "KI270329.1" "KI270336.1" "KI270507.1" "KI270423.1" "KI270382.1"
## [26] "KI270383.1" "KI270468.1" "KI270379.1" "KI270515.1" "KI270582.1"
## [31] "KI270316.1" "KI270511.1" "KI270466.1" "KI270518.1" "KI270580.1"
## [36] "KI270340.1" "KI270422.1" "KI270584.1" "KI270528.1" "KI270330.1"
## [41] "KI270509.1" "KI270587.1" "KI270394.1" "KI270335.1" "KI270372.1"
## [46] "KI270388.1" "KI270310.1" "KI270522.1" "KI270366.1" "KI270334.1"
## [51] "KI270412.1" "KI270302.1" "KI270581.1" "KI270424.1" "KI270548.1"
## [56] "KI270396.1" "KI270374.1" "KI270395.1" "KI270387.1" "KI270418.1"
## [61] "KI270389.1" "KI270378.1" "KI270419.1" "KI270544.1" "KI270510.1"
## [66] "KI270448.1" "KI270590.1" "KI270529.1" "KI270429.1" "KI270376.1"
## [71] "KI270362.1" "KI270583.1" "KI270521.1" "KI270305.1" "KI270516.1"
## [76] "KI270337.1" "KI270425.1" "KI270384.1" "KI270393.1" "KI270373.1"
## [81] "KI270391.1" "KI270386.1" "KI270338.1" "KI270363.1" "KI270538.1"
## [86] "KI270467.1" "KI270465.1" "KI270320.1" "KI270303.1" "KI270411.1"
## [91] "KI270315.1" "KI270311.1" "KI270322.1" "KI270333.1" "KI270317.1"
## [96] "KI270304.1" "KI270417.1" "KI270420.1" "KI270390.1" "KI270589.1"
## [101] "KI270414.1" "KI270579.1" "KI270364.1" "KI270442.1" "KI270729.1"
## [106] "KI270736.1" "KI270438.1" "KI270519.1" "KI270512.1" "KI270435.1"
## [111] "KI270711.1" "GL000009.2" "GL000221.1" "KI270725.1" "KI270740.1"
## [116] "KI270751.1" "KI270746.1" "GL000213.1" "KI270744.1" "GL000220.1"
## [121] "KI270735.1" "KI270734.1" "KI270709.1" "KI270748.1" "KI270745.1"
## [126] "GL000208.1" "GL000224.1" "KI270752.1" "GL000214.1" "KI270742.1"
## [131] "KI270715.1" "GL000195.1" "KI270753.1" "KI270722.1" "KI270719.1"
## [136] "KI270755.1" "KI270712.1" "KI270732.1" "KI270708.1" "KI270731.1"
## [141] "KI270720.1" "GL000194.1" "KI270713.1" "KI270716.1" "KI270723.1"
## [146] "GL000226.1" "GL000219.1" "KI270750.1" "KI270724.1" "KI270707.1"
## [151] "KI270754.1" "KI270706.1" "GL000218.1" "KI270714.1" "KI270733.1"
## [156] "KI270721.1" "KI270749.1" "GL000216.2" "GL000205.2" "KI270718.1"
## [161] "KI270717.1" "KI270743.1" "GL000008.2" "KI270710.1" "GL000225.1"
## [166] "KI270747.1" "KI270726.1" "KI270728.1" "KI270727.1" "Y"
## [171] "20" "X" "13" "22" "10"
## [176] "6" "19" "14" "18" "2"
## [181] "4" "21" "9" "11" "17"
## [186] "8" "7" "15" "12" "1"
## [191] "16" "5" "3" "MT" "chrEBV"
We are interested only in a subset of sequences:
sequences_on_interest = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22",
"X", "Y", "MT", "chrEBV")
Now, we want to get df
âs rows by row names (how to get rows by row names: https://stackoverflow.com/a/18933280/8877692):
df_selection <- df[sequences_on_interest, , drop = FALSE]
df_selection
## seqlengths isCircular genome
## 1 248956422 NA <NA>
## 2 242193529 NA <NA>
## 3 198295559 NA <NA>
## 4 190214555 NA <NA>
## 5 181538259 NA <NA>
## 6 170805979 NA <NA>
## 7 159345973 NA <NA>
## 8 145138636 NA <NA>
## 9 138394717 NA <NA>
## 10 133797422 NA <NA>
## 11 135086622 NA <NA>
## 12 133275309 NA <NA>
## 13 114364328 NA <NA>
## 14 107043718 NA <NA>
## 15 101991189 NA <NA>
## 16 90338345 NA <NA>
## 17 83257441 NA <NA>
## 18 80373285 NA <NA>
## 19 58617616 NA <NA>
## 20 64444167 NA <NA>
## 21 46709983 NA <NA>
## 22 50818468 NA <NA>
## X 156040895 NA <NA>
## Y 57227415 NA <NA>
## MT 16569 NA <NA>
## chrEBV 171823 NA <NA>
Get also information on all the other sequences:
df_other <- df[!(row.names(df) %in% sequences_on_interest), , drop = FALSE]
head(df_other)
## seqlengths isCircular genome
## KI270757.1 71251 NA <NA>
## KI270741.1 157432 NA <NA>
## KI270756.1 79590 NA <NA>
## KI270730.1 112551 NA <NA>
## KI270739.1 73985 NA <NA>
## KI270738.1 99375 NA <NA>
For plotting, we need only the seqlenghts
column. In this example all the other columns are NA
s, so only the seqlenghts
will be plotted.
We will use a basic R bar plotting ggplot
utility.
library(ggplot2)
Plot sequence lengths for particular chromosomes:
# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_selection, aes(rownames(df_selection), seqlengths)) +
# Y values are given
geom_bar(stat='identity') +
# Commas at thousand places
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
# Rotate axes for better readability
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
# Don't sort labels on X axis alphabetically
scale_x_discrete(limits=rownames(df_selection)) +
# Plot and axis titles are useful
ggtitle('Sequence lenghts per chromosome') +
theme(plot.title = element_text(hjust = 0.5)) +
xlab('Chromosome') +
ylab('Sequence lenght')
Plot the length of the other sequences:
# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_other, aes(rownames(df_other), seqlengths)) +
# Y values are given
geom_bar(stat='identity') +
# Commas at thousand places
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
# Rotate axes for better readability
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
# Don't sort labels on X axis alphabetically
scale_x_discrete(limits=rownames(df_other)) +
# Plot and axis titles are useful
ggtitle('Sequence lenghts per chromosome') +
theme(plot.title = element_text(hjust = 0.5)) +
# Font size of the X axis labels
theme(axis.text.x = element_text(size = 6)) +
xlab('Sequence name') +
ylab('Sequence lenght')
Use Ctrl
+ +
(or Cmd
+ +
, or Ctrl
/Cmd
+ scrolls
) to zoom in on the sequence names on the X axis.
Technical comments on the above plotting utility:
SeqInfo
object is a data frame with row names. For how-to plot these with ggplot
see: this discussion thread.
Commas are used as thousand separators for better presentation of the scale. See this answer for an explanation how this was achieved.
stat='identity'
is used, instead of stat='count'
, because the Y values are there already.
By default ggplot
sorts bars alphabetically (x axis labels). In order to prevent this, see this solution.
Following the course by Irizarry & Love, (2015) more operations and/or inspection options are available for BAM files with the Rsamtools
library.
The code below (quickBamFlagSummary(bf)
) is not performed as part of this R Markdown, because it takes too much time to process this command. Moreover, you need a lot of RAM and swap memory to perform this command on large files (e.g., larger than 8GB).
(See how to not evaluate an R cell in R Markdown here.)
quickBamFlagSummary(bf)
The output of the above command should be similar to:
group | nb of | nb of | mean / max
of | records | unique | records per
records | in group | QNAMEs | unique QNAME
All records........................ A | 17010970 | 17010970 | 1 / 1
o template has single segment.... S | 17010970 | 17010970 | 1 / 1
o template has multiple segments. M | 0 | 0 | NA / NA
- first segment.............. F | 0 | 0 | NA / NA
- last segment............... L | 0 | 0 | NA / NA
- other segment.............. O | 0 | 0 | NA / NA
Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.
Details for group S:
o record is mapped.............. S1 | 14921660 | 14921660 | 1 / 1
- primary alignment......... S2 | 14921660 | 14921660 | 1 / 1
- secondary alignment....... S3 | 0 | 0 | NA / NA
o record is unmapped............ S4 | 2089310 | 2089310 | 1 / 1
scanBam(BamFile(filename, yieldSize=5))
## [[1]]
## [[1]]$qname
## [1] "HWI-BRUNOP16X_0001:3:46:7998:2050#0"
## [2] "HWI-BRUNOP16X_0001:3:47:12401:185182#0"
## [3] "HWI-BRUNOP16X_0001:3:68:7771:120971#0"
## [4] "HWI-BRUNOP16X_0001:3:64:7367:151706#0"
## [5] "HWI-BRUNOP16X_0001:3:67:5188:159467#0"
##
## [[1]]$flag
## [1] 0 97 0 65 0
##
## [[1]]$rname
## [1] KI270757.1 KI270757.1 KI270757.1 KI270757.1 KI270757.1
## 195 Levels: KI270757.1 KI270741.1 KI270756.1 KI270730.1 ... chrEBV
##
## [[1]]$strand
## [1] + + + + +
## Levels: + - *
##
## [[1]]$pos
## [1] 20301 20301 46974 47948 51349
##
## [[1]]$qwidth
## [1] 75 50 75 50 75
##
## [[1]]$mapq
## [1] 37 0 0 0 0
##
## [[1]]$cigar
## [1] "75M" "50M" "75M" "50M" "75M"
##
## [[1]]$mrnm
## [1] <NA> 17 <NA> 7 <NA>
## 195 Levels: KI270757.1 KI270741.1 KI270756.1 KI270730.1 ... chrEBV
##
## [[1]]$mpos
## [1] NA 21973011 NA 109911399 NA
##
## [[1]]$isize
## [1] 0 0 0 0 0
##
## [[1]]$seq
## DNAStringSet object of length 5:
## width seq
## [1] 75 ATGGTGTGAAATGGAATGGAACGGAATGGAATGG...GGAATCAACCCGAGTGGAATGGAACGGAAGGGA
## [2] 50 ATGGTGTGAGATGGAATGGAACGGAATGGAATGGAATGGAATGGAATCAA
## [3] 75 CCTGTGTCCATGTGATCTCATTGTTCAATTCCCA...TGAGAATATGCGGTGTTTGGTTTTTTGTTCTTG
## [4] 50 CCCTTTGAAAACTGGCACAAGACAGGGATGCCCTCTCTCACCACTTCTAT
## [5] 75 CAAATGGACATTTGGAGAGCTTGGACACCTATGG...GAAATATCTTCAAATAAAAACTAGGTAGAAGAA
##
## [[1]]$qual
## PhredQuality object of length 5:
## width seq
## [1] 75 gggggggggggggggggggggggggggggggggg...ggggggggggggggeggggggggdggggggggf
## [2] 50 gggggggggfgggggggggggggggggggggggggggggggggggggggg
## [3] 75 gggggggggggggggggggggggggggggggggg...fghgffgfggggghghggggfhgdggg]ZE`_^
## [4] 50 gggggggggggggggggggggggggggggggggggggggggggggggggg
## [5] 75 gggggggggggggggggggggggggggggggggg...gggggggggggggggggdddfgggghggggggg
R Studioâs console colors the output:
R markdown figure size (the global size) (knitr::opts_chunk$set(fig.width=12, fig.height=8)
)