Data
- During downloading a file
- Gemonic file formats
Loading BAM files
Accessing sequence information
- seqinfo function
  - Sequence information as a data.frame
- Plotting sequence lenghts
Additional materials and notes
Literature

Date created:  May 8, 2021
Last edited:   March 17, 2022

Data

The data for this example come from an Ensembl genome browser web page (release 105; Dec 2021):

Ensembl Single species data, Human (homo sapiens), GRCh38.illumina.brain.1.bam:
- Link to the information about the datasets: https://www.ensembl.org/info/data/ftp/index.html;
- Direct link to the FTP browser.

From the README of the data repository:

Female brain tissue with poly-A selection (50bp PE + 75bp SE)

About the Ensembl:

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

During downloading a file

First step of the current tutorial is downloading the file named:

GRCh38.illumina.brain.1.bam [Note: the file have approx. 8.3GB]

(Please note that the download speed from Ensembl’s FTP server is quite low, as I got download speed of 200-300 KB for a file as large as 8.3 GB, so the downloading took several hours.)

Then move the downloaded file in the same directory as your R Markdown (or the R console’s working directory).

Gemonic file formats

A primer on what are the differences between particular genomics data file formats is available at Institute for Systems Genomics – University of Conneticut’s web page: Bioinformatics => UCONN => Resources and Events => Tutorials => File Formats Tutorial.

See also the differences between FASTA, FASTQ, SAM, and BAM explained here (with additional references).

In short:

FASTA – nucleotide or amino acid sequences of nucleic acids and proteins (thereafter: “sequences”)
FASTQ – sequences + quality
SAM – sequences + quality + mapping (to a reference)
BAM – binary form of SAM

Hence, BAM files cannot be simply viewed in a file system (e.g., by using bash’s cat command or a notepad) – instead it has to be loaded with a software package, such as the Bioconductor.

Loading BAM files

Information on what libraries are needed to load and manipulate on BAM files can be found, .e.g., in the course by Irizarry & Love (2015): Importing NGS data into Bioconductor.

The first step of loading the data in our example is to import the required library: Rsamtools. (During importing the Rsamtools library there are pleanty of messages being displayed. In order to make the output here in the R Markdown clear, these messages were hidden. See how to hide output in R Markdown):

library(Rsamtools)

Define a filename – the name of the file that was downloaded:

filename = 'GRCh38.illumina.brain.1.bam'

Get information of the size of the file. BAM is a binary file, below R command will display the size in "autp" scale, e.g., MB/GB, etc.:

utils:::format.object_size(file.info(filename)$size, "auto")

## [1] "8.3 Gb"

Because the file is has over 7 GB Rsamtools does not loads the entire file right away. Instead the object representing genetic data is created and the data (information on sequences, quality and/or mapping) is loaded on request, see below.

Load the BAM file (create an object representing the file):

bf <- BamFile(filename)
bf

## class: BamFile 
## path: GRCh38.illumina.brain.1.bam
## index: NA
## isOpen: FALSE 
## yieldSize: NA 
## obeyQname: FALSE 
## asMates: FALSE 
## qnamePrefixEnd: NA 
## qnameSuffixStart: NA

Accessing sequence information

`seqinfo` function

The basic object storing sequence information is SeqInfo (notice the difference between seqinfo – that is a function [lowercase], and SeqInfo [camel case] – that is an object). The SeqInfo object can be created (obtained) with the following seqinfo function:

seqinfo(bf)

## Seqinfo object with 195 sequences from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   KI270757.1      71251       <NA>   <NA>
##   KI270741.1     157432       <NA>   <NA>
##   KI270756.1      79590       <NA>   <NA>
##   KI270730.1     112551       <NA>   <NA>
##   KI270739.1      73985       <NA>   <NA>
##   ...               ...        ...    ...
##   16           90338345       <NA>   <NA>
##   5           181538259       <NA>   <NA>
##   3           198295559       <NA>   <NA>
##   MT              16569       <NA>   <NA>
##   chrEBV         171823       <NA>   <NA>

For further data handling, it is pretty convenient to convert a SeqInfo object into R’s native data.frame object, e.g., in order to preview all rows of the SeqInfo data and visualize the information contained in this data structure. For an information on how to convert SeqInfo into data.frame, see: this GenomeInfoDb MIT resource.

In order to perform the conversion use the following command:

df <- as.data.frame(seqinfo(bf))
head(df)

##            seqlengths isCircular genome
## KI270757.1      71251         NA   <NA>
## KI270741.1     157432         NA   <NA>
## KI270756.1      79590         NA   <NA>
## KI270730.1     112551         NA   <NA>
## KI270739.1      73985         NA   <NA>
## KI270738.1      99375         NA   <NA>

For more information on seqinfo and seqlengths, see: this R documentation page.

Sequence information as a `data.frame`

df is a data frame – a named list. We can use rownames to see the names of the particular rows (i.e., names of the indexes, see: https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/):

rownames(df)

##   [1] "KI270757.1" "KI270741.1" "KI270756.1" "KI270730.1" "KI270739.1"
##   [6] "KI270738.1" "KI270737.1" "KI270312.1" "KI270591.1" "KI270371.1"
##  [11] "KI270385.1" "KI270381.1" "KI270517.1" "KI270508.1" "KI270539.1"
##  [16] "KI270593.1" "KI270530.1" "KI270588.1" "KI270392.1" "KI270375.1"
##  [21] "KI270329.1" "KI270336.1" "KI270507.1" "KI270423.1" "KI270382.1"
##  [26] "KI270383.1" "KI270468.1" "KI270379.1" "KI270515.1" "KI270582.1"
##  [31] "KI270316.1" "KI270511.1" "KI270466.1" "KI270518.1" "KI270580.1"
##  [36] "KI270340.1" "KI270422.1" "KI270584.1" "KI270528.1" "KI270330.1"
##  [41] "KI270509.1" "KI270587.1" "KI270394.1" "KI270335.1" "KI270372.1"
##  [46] "KI270388.1" "KI270310.1" "KI270522.1" "KI270366.1" "KI270334.1"
##  [51] "KI270412.1" "KI270302.1" "KI270581.1" "KI270424.1" "KI270548.1"
##  [56] "KI270396.1" "KI270374.1" "KI270395.1" "KI270387.1" "KI270418.1"
##  [61] "KI270389.1" "KI270378.1" "KI270419.1" "KI270544.1" "KI270510.1"
##  [66] "KI270448.1" "KI270590.1" "KI270529.1" "KI270429.1" "KI270376.1"
##  [71] "KI270362.1" "KI270583.1" "KI270521.1" "KI270305.1" "KI270516.1"
##  [76] "KI270337.1" "KI270425.1" "KI270384.1" "KI270393.1" "KI270373.1"
##  [81] "KI270391.1" "KI270386.1" "KI270338.1" "KI270363.1" "KI270538.1"
##  [86] "KI270467.1" "KI270465.1" "KI270320.1" "KI270303.1" "KI270411.1"
##  [91] "KI270315.1" "KI270311.1" "KI270322.1" "KI270333.1" "KI270317.1"
##  [96] "KI270304.1" "KI270417.1" "KI270420.1" "KI270390.1" "KI270589.1"
## [101] "KI270414.1" "KI270579.1" "KI270364.1" "KI270442.1" "KI270729.1"
## [106] "KI270736.1" "KI270438.1" "KI270519.1" "KI270512.1" "KI270435.1"
## [111] "KI270711.1" "GL000009.2" "GL000221.1" "KI270725.1" "KI270740.1"
## [116] "KI270751.1" "KI270746.1" "GL000213.1" "KI270744.1" "GL000220.1"
## [121] "KI270735.1" "KI270734.1" "KI270709.1" "KI270748.1" "KI270745.1"
## [126] "GL000208.1" "GL000224.1" "KI270752.1" "GL000214.1" "KI270742.1"
## [131] "KI270715.1" "GL000195.1" "KI270753.1" "KI270722.1" "KI270719.1"
## [136] "KI270755.1" "KI270712.1" "KI270732.1" "KI270708.1" "KI270731.1"
## [141] "KI270720.1" "GL000194.1" "KI270713.1" "KI270716.1" "KI270723.1"
## [146] "GL000226.1" "GL000219.1" "KI270750.1" "KI270724.1" "KI270707.1"
## [151] "KI270754.1" "KI270706.1" "GL000218.1" "KI270714.1" "KI270733.1"
## [156] "KI270721.1" "KI270749.1" "GL000216.2" "GL000205.2" "KI270718.1"
## [161] "KI270717.1" "KI270743.1" "GL000008.2" "KI270710.1" "GL000225.1"
## [166] "KI270747.1" "KI270726.1" "KI270728.1" "KI270727.1" "Y"         
## [171] "20"         "X"          "13"         "22"         "10"        
## [176] "6"          "19"         "14"         "18"         "2"         
## [181] "4"          "21"         "9"          "11"         "17"        
## [186] "8"          "7"          "15"         "12"         "1"         
## [191] "16"         "5"          "3"          "MT"         "chrEBV"

We are interested only in a subset of sequences:

sequences_on_interest = c("1",   "2",  "3",  "4",  "5",  "6",  "7",  "8",  "9", "10",
                          "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                          "21", "22",
                          "X", "Y", "MT", "chrEBV")

Now, we want to get df’s rows by row names (how to get rows by row names: https://stackoverflow.com/a/18933280/8877692):

df_selection <- df[sequences_on_interest, , drop = FALSE]
df_selection

##        seqlengths isCircular genome
## 1       248956422         NA   <NA>
## 2       242193529         NA   <NA>
## 3       198295559         NA   <NA>
## 4       190214555         NA   <NA>
## 5       181538259         NA   <NA>
## 6       170805979         NA   <NA>
## 7       159345973         NA   <NA>
## 8       145138636         NA   <NA>
## 9       138394717         NA   <NA>
## 10      133797422         NA   <NA>
## 11      135086622         NA   <NA>
## 12      133275309         NA   <NA>
## 13      114364328         NA   <NA>
## 14      107043718         NA   <NA>
## 15      101991189         NA   <NA>
## 16       90338345         NA   <NA>
## 17       83257441         NA   <NA>
## 18       80373285         NA   <NA>
## 19       58617616         NA   <NA>
## 20       64444167         NA   <NA>
## 21       46709983         NA   <NA>
## 22       50818468         NA   <NA>
## X       156040895         NA   <NA>
## Y        57227415         NA   <NA>
## MT          16569         NA   <NA>
## chrEBV     171823         NA   <NA>

Get also information on all the other sequences:

df_other <- df[!(row.names(df) %in% sequences_on_interest), , drop = FALSE]
head(df_other)

##            seqlengths isCircular genome
## KI270757.1      71251         NA   <NA>
## KI270741.1     157432         NA   <NA>
## KI270756.1      79590         NA   <NA>
## KI270730.1     112551         NA   <NA>
## KI270739.1      73985         NA   <NA>
## KI270738.1      99375         NA   <NA>

Plotting sequence lenghts

For plotting, we need only the seqlenghts column. In this example all the other columns are NAs, so only the seqlenghts will be plotted.

We will use a basic R bar plotting ggplot utility.

library(ggplot2)

Plot sequence lengths for particular chromosomes:

# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_selection, aes(rownames(df_selection), seqlengths)) +
  
  # Y values are given
  geom_bar(stat='identity') +
  
  # Commas at thousand places
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                 scientific = FALSE)) +
  
  # Rotate axes for better readability
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +

  # Don't sort labels on X axis alphabetically
  scale_x_discrete(limits=rownames(df_selection)) +
  
  # Plot and axis titles are useful
  ggtitle('Sequence lenghts per chromosome') +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab('Chromosome') + 
  ylab('Sequence lenght')

Plot the length of the other sequences:

# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_other, aes(rownames(df_other), seqlengths)) +
  
  # Y values are given
  geom_bar(stat='identity') +
  
  # Commas at thousand places
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                 scientific = FALSE)) +
  
  # Rotate axes for better readability
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +

  # Don't sort labels on X axis alphabetically
  scale_x_discrete(limits=rownames(df_other)) +
  
  # Plot and axis titles are useful
  ggtitle('Sequence lenghts per chromosome') +
  theme(plot.title = element_text(hjust = 0.5)) +
  # Font size of the X axis labels
  theme(axis.text.x = element_text(size = 6)) +
  xlab('Sequence name') + 
  ylab('Sequence lenght')

Use Ctrl + + (or Cmd + +, or Ctrl/Cmd + scrolls) to zoom in on the sequence names on the X axis.

Technical comments on the above plotting utility:

SeqInfo object is a data frame with row names. For how-to plot these with ggplot see: this discussion thread.
Commas are used as thousand separators for better presentation of the scale. See this answer for an explanation how this was achieved.
stat='identity' is used, instead of stat='count', because the Y values are there already.
By default ggplot sorts bars alphabetically (x axis labels). In order to prevent this, see this solution.

Additional materials and notes

Following the course by Irizarry & Love, (2015) more operations and/or inspection options are available for BAM files with the Rsamtools library.

BAM file summary

The code below (quickBamFlagSummary(bf)) is not performed as part of this R Markdown, because it takes too much time to process this command. Moreover, you need a lot of RAM and swap memory to perform this command on large files (e.g., larger than 8GB).

(See how to not evaluate an R cell in R Markdown here.)

quickBamFlagSummary(bf)

The output of the above command should be similar to:

                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A | 17010970 | 17010970 |    1 / 1
  o template has single segment.... S | 17010970 | 17010970 |    1 / 1
  o template has multiple segments. M |        0 |        0 |   NA / NA
      - first segment.............. F |        0 |        0 |   NA / NA
      - last segment............... L |        0 |        0 |   NA / NA
      - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group S:
  o record is mapped.............. S1 | 14921660 | 14921660 |    1 / 1
      - primary alignment......... S2 | 14921660 | 14921660 |    1 / 1
      - secondary alignment....... S3 |        0 |        0 |   NA / NA
  o record is unmapped............ S4 |  2089310 |  2089310 |    1 / 1

Scanning a BAM file

scanBam(BamFile(filename, yieldSize=5))

## [[1]]
## [[1]]$qname
## [1] "HWI-BRUNOP16X_0001:3:46:7998:2050#0"   
## [2] "HWI-BRUNOP16X_0001:3:47:12401:185182#0"
## [3] "HWI-BRUNOP16X_0001:3:68:7771:120971#0" 
## [4] "HWI-BRUNOP16X_0001:3:64:7367:151706#0" 
## [5] "HWI-BRUNOP16X_0001:3:67:5188:159467#0" 
## 
## [[1]]$flag
## [1]  0 97  0 65  0
## 
## [[1]]$rname
## [1] KI270757.1 KI270757.1 KI270757.1 KI270757.1 KI270757.1
## 195 Levels: KI270757.1 KI270741.1 KI270756.1 KI270730.1 ... chrEBV
## 
## [[1]]$strand
## [1] + + + + +
## Levels: + - *
## 
## [[1]]$pos
## [1] 20301 20301 46974 47948 51349
## 
## [[1]]$qwidth
## [1] 75 50 75 50 75
## 
## [[1]]$mapq
## [1] 37  0  0  0  0
## 
## [[1]]$cigar
## [1] "75M" "50M" "75M" "50M" "75M"
## 
## [[1]]$mrnm
## [1] <NA> 17   <NA> 7    <NA>
## 195 Levels: KI270757.1 KI270741.1 KI270756.1 KI270730.1 ... chrEBV
## 
## [[1]]$mpos
## [1]        NA  21973011        NA 109911399        NA
## 
## [[1]]$isize
## [1] 0 0 0 0 0
## 
## [[1]]$seq
## DNAStringSet object of length 5:
##     width seq
## [1]    75 ATGGTGTGAAATGGAATGGAACGGAATGGAATGG...GGAATCAACCCGAGTGGAATGGAACGGAAGGGA
## [2]    50 ATGGTGTGAGATGGAATGGAACGGAATGGAATGGAATGGAATGGAATCAA
## [3]    75 CCTGTGTCCATGTGATCTCATTGTTCAATTCCCA...TGAGAATATGCGGTGTTTGGTTTTTTGTTCTTG
## [4]    50 CCCTTTGAAAACTGGCACAAGACAGGGATGCCCTCTCTCACCACTTCTAT
## [5]    75 CAAATGGACATTTGGAGAGCTTGGACACCTATGG...GAAATATCTTCAAATAAAAACTAGGTAGAAGAA
## 
## [[1]]$qual
## PhredQuality object of length 5:
##     width seq
## [1]    75 gggggggggggggggggggggggggggggggggg...ggggggggggggggeggggggggdggggggggf
## [2]    50 gggggggggfgggggggggggggggggggggggggggggggggggggggg
## [3]    75 gggggggggggggggggggggggggggggggggg...fghgffgfggggghghggggfhgdggg]ZE`_^
## [4]    50 gggggggggggggggggggggggggggggggggggggggggggggggggg
## [5]    75 gggggggggggggggggggggggggggggggggg...gggggggggggggggggdddfgggghggggggg

R Studio’s console colors the output:

Caption for the above image (visible if loaded correctly): R studio output of scanning a BAM file

Literature

Irizarry, R., & Love, M. (2015). Ph525x series-biomedical data science. URL: https://genomicsclass.github.io/book/pages/import_NGS.html

Loading BAM file and accessing sequence information

Mikołaj Buchwald

Data

During downloading a file

Gemonic file formats

Loading BAM files

Accessing sequence information

`seqinfo` function

Sequence information as a `data.frame`

Plotting sequence lenghts

Additional materials and notes

BAM file summary

Scanning a BAM file

Further reading

Literature

Loading BAM file and accessing sequence information

Mikołaj Buchwald

Data

During downloading a file

Gemonic file formats

Loading BAM files

Accessing sequence information

seqinfo function

Sequence information as a data.frame

Plotting sequence lenghts

Additional materials and notes

BAM file summary

Scanning a BAM file

Further reading

Literature

`seqinfo` function

Sequence information as a `data.frame`