Date created:  May 8, 2021
Last edited:   March 17, 2022

Data

The data for this example come from an Ensembl genome browser web page (release 105; Dec 2021):

From the README of the data repository:

Female brain tissue with poly-A selection (50bp PE + 75bp SE)

About the Ensembl:

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

During downloading a file

First step of the current tutorial is downloading the file named:

  • GRCh38.illumina.brain.1.bam [Note: the file have approx. 8.3GB]

(Please note that the download speed from Ensembl’s FTP server is quite low, as I got download speed of 200-300 KB for a file as large as 8.3 GB, so the downloading took several hours.)

Then move the downloaded file in the same directory as your R Markdown (or the R console’s working directory).

Gemonic file formats

A primer on what are the differences between particular genomics data file formats is available at Institute for Systems Genomics – University of Conneticut’s web page: Bioinformatics => UCONN => Resources and Events => Tutorials => File Formats Tutorial.

See also the differences between FASTA, FASTQ, SAM, and BAM explained here (with additional references).

In short:

  • FASTA – nucleotide or amino acid sequences of nucleic acids and proteins (thereafter: “sequences”)
  • FASTQ – sequences + quality
  • SAM – sequences + quality + mapping (to a reference)
  • BAM – binary form of SAM

Hence, BAM files cannot be simply viewed in a file system (e.g., by using bash’s cat command or a notepad) – instead it has to be loaded with a software package, such as the Bioconductor.

Loading BAM files

Information on what libraries are needed to load and manipulate on BAM files can be found, .e.g., in the course by Irizarry & Love (2015): Importing NGS data into Bioconductor.

The first step of loading the data in our example is to import the required library: Rsamtools. (During importing the Rsamtools library there are pleanty of messages being displayed. In order to make the output here in the R Markdown clear, these messages were hidden. See how to hide output in R Markdown):

library(Rsamtools)

Define a filename – the name of the file that was downloaded:

filename = 'GRCh38.illumina.brain.1.bam'

Get information of the size of the file. BAM is a binary file, below R command will display the size in "autp" scale, e.g., MB/GB, etc.:

utils:::format.object_size(file.info(filename)$size, "auto")
## [1] "8.3 Gb"

Because the file is has over 7 GB Rsamtools does not loads the entire file right away. Instead the object representing genetic data is created and the data (information on sequences, quality and/or mapping) is loaded on request, see below.

Load the BAM file (create an object representing the file):

bf <- BamFile(filename)
bf
## class: BamFile 
## path: GRCh38.illumina.brain.1.bam
## index: NA
## isOpen: FALSE 
## yieldSize: NA 
## obeyQname: FALSE 
## asMates: FALSE 
## qnamePrefixEnd: NA 
## qnameSuffixStart: NA

Accessing sequence information

seqinfo function

The basic object storing sequence information is SeqInfo (notice the difference between seqinfo – that is a function [lowercase], and SeqInfo [camel case] – that is an object). The SeqInfo object can be created (obtained) with the following seqinfo function:

seqinfo(bf)
## Seqinfo object with 195 sequences from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   KI270757.1      71251       <NA>   <NA>
##   KI270741.1     157432       <NA>   <NA>
##   KI270756.1      79590       <NA>   <NA>
##   KI270730.1     112551       <NA>   <NA>
##   KI270739.1      73985       <NA>   <NA>
##   ...               ...        ...    ...
##   16           90338345       <NA>   <NA>
##   5           181538259       <NA>   <NA>
##   3           198295559       <NA>   <NA>
##   MT              16569       <NA>   <NA>
##   chrEBV         171823       <NA>   <NA>

For further data handling, it is pretty convenient to convert a SeqInfo object into R’s native data.frame object, e.g., in order to preview all rows of the SeqInfo data and visualize the information contained in this data structure. For an information on how to convert SeqInfo into data.frame, see: this GenomeInfoDb MIT resource.

In order to perform the conversion use the following command:

df <- as.data.frame(seqinfo(bf))
head(df)
##            seqlengths isCircular genome
## KI270757.1      71251         NA   <NA>
## KI270741.1     157432         NA   <NA>
## KI270756.1      79590         NA   <NA>
## KI270730.1     112551         NA   <NA>
## KI270739.1      73985         NA   <NA>
## KI270738.1      99375         NA   <NA>

For more information on seqinfo and seqlengths, see: this R documentation page.


Sequence information as a data.frame

df is a data frame – a named list. We can use rownames to see the names of the particular rows (i.e., names of the indexes, see: https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/):

rownames(df)
##   [1] "KI270757.1" "KI270741.1" "KI270756.1" "KI270730.1" "KI270739.1"
##   [6] "KI270738.1" "KI270737.1" "KI270312.1" "KI270591.1" "KI270371.1"
##  [11] "KI270385.1" "KI270381.1" "KI270517.1" "KI270508.1" "KI270539.1"
##  [16] "KI270593.1" "KI270530.1" "KI270588.1" "KI270392.1" "KI270375.1"
##  [21] "KI270329.1" "KI270336.1" "KI270507.1" "KI270423.1" "KI270382.1"
##  [26] "KI270383.1" "KI270468.1" "KI270379.1" "KI270515.1" "KI270582.1"
##  [31] "KI270316.1" "KI270511.1" "KI270466.1" "KI270518.1" "KI270580.1"
##  [36] "KI270340.1" "KI270422.1" "KI270584.1" "KI270528.1" "KI270330.1"
##  [41] "KI270509.1" "KI270587.1" "KI270394.1" "KI270335.1" "KI270372.1"
##  [46] "KI270388.1" "KI270310.1" "KI270522.1" "KI270366.1" "KI270334.1"
##  [51] "KI270412.1" "KI270302.1" "KI270581.1" "KI270424.1" "KI270548.1"
##  [56] "KI270396.1" "KI270374.1" "KI270395.1" "KI270387.1" "KI270418.1"
##  [61] "KI270389.1" "KI270378.1" "KI270419.1" "KI270544.1" "KI270510.1"
##  [66] "KI270448.1" "KI270590.1" "KI270529.1" "KI270429.1" "KI270376.1"
##  [71] "KI270362.1" "KI270583.1" "KI270521.1" "KI270305.1" "KI270516.1"
##  [76] "KI270337.1" "KI270425.1" "KI270384.1" "KI270393.1" "KI270373.1"
##  [81] "KI270391.1" "KI270386.1" "KI270338.1" "KI270363.1" "KI270538.1"
##  [86] "KI270467.1" "KI270465.1" "KI270320.1" "KI270303.1" "KI270411.1"
##  [91] "KI270315.1" "KI270311.1" "KI270322.1" "KI270333.1" "KI270317.1"
##  [96] "KI270304.1" "KI270417.1" "KI270420.1" "KI270390.1" "KI270589.1"
## [101] "KI270414.1" "KI270579.1" "KI270364.1" "KI270442.1" "KI270729.1"
## [106] "KI270736.1" "KI270438.1" "KI270519.1" "KI270512.1" "KI270435.1"
## [111] "KI270711.1" "GL000009.2" "GL000221.1" "KI270725.1" "KI270740.1"
## [116] "KI270751.1" "KI270746.1" "GL000213.1" "KI270744.1" "GL000220.1"
## [121] "KI270735.1" "KI270734.1" "KI270709.1" "KI270748.1" "KI270745.1"
## [126] "GL000208.1" "GL000224.1" "KI270752.1" "GL000214.1" "KI270742.1"
## [131] "KI270715.1" "GL000195.1" "KI270753.1" "KI270722.1" "KI270719.1"
## [136] "KI270755.1" "KI270712.1" "KI270732.1" "KI270708.1" "KI270731.1"
## [141] "KI270720.1" "GL000194.1" "KI270713.1" "KI270716.1" "KI270723.1"
## [146] "GL000226.1" "GL000219.1" "KI270750.1" "KI270724.1" "KI270707.1"
## [151] "KI270754.1" "KI270706.1" "GL000218.1" "KI270714.1" "KI270733.1"
## [156] "KI270721.1" "KI270749.1" "GL000216.2" "GL000205.2" "KI270718.1"
## [161] "KI270717.1" "KI270743.1" "GL000008.2" "KI270710.1" "GL000225.1"
## [166] "KI270747.1" "KI270726.1" "KI270728.1" "KI270727.1" "Y"         
## [171] "20"         "X"          "13"         "22"         "10"        
## [176] "6"          "19"         "14"         "18"         "2"         
## [181] "4"          "21"         "9"          "11"         "17"        
## [186] "8"          "7"          "15"         "12"         "1"         
## [191] "16"         "5"          "3"          "MT"         "chrEBV"

We are interested only in a subset of sequences:

sequences_on_interest = c("1",   "2",  "3",  "4",  "5",  "6",  "7",  "8",  "9", "10",
                          "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                          "21", "22",
                          "X", "Y", "MT", "chrEBV")

Now, we want to get df’s rows by row names (how to get rows by row names: https://stackoverflow.com/a/18933280/8877692):

df_selection <- df[sequences_on_interest, , drop = FALSE]
df_selection
##        seqlengths isCircular genome
## 1       248956422         NA   <NA>
## 2       242193529         NA   <NA>
## 3       198295559         NA   <NA>
## 4       190214555         NA   <NA>
## 5       181538259         NA   <NA>
## 6       170805979         NA   <NA>
## 7       159345973         NA   <NA>
## 8       145138636         NA   <NA>
## 9       138394717         NA   <NA>
## 10      133797422         NA   <NA>
## 11      135086622         NA   <NA>
## 12      133275309         NA   <NA>
## 13      114364328         NA   <NA>
## 14      107043718         NA   <NA>
## 15      101991189         NA   <NA>
## 16       90338345         NA   <NA>
## 17       83257441         NA   <NA>
## 18       80373285         NA   <NA>
## 19       58617616         NA   <NA>
## 20       64444167         NA   <NA>
## 21       46709983         NA   <NA>
## 22       50818468         NA   <NA>
## X       156040895         NA   <NA>
## Y        57227415         NA   <NA>
## MT          16569         NA   <NA>
## chrEBV     171823         NA   <NA>

Get also information on all the other sequences:

df_other <- df[!(row.names(df) %in% sequences_on_interest), , drop = FALSE]
head(df_other)
##            seqlengths isCircular genome
## KI270757.1      71251         NA   <NA>
## KI270741.1     157432         NA   <NA>
## KI270756.1      79590         NA   <NA>
## KI270730.1     112551         NA   <NA>
## KI270739.1      73985         NA   <NA>
## KI270738.1      99375         NA   <NA>

Plotting sequence lenghts

For plotting, we need only the seqlenghts column. In this example all the other columns are NAs, so only the seqlenghts will be plotted.

We will use a basic R bar plotting ggplot utility.

library(ggplot2)

Plot sequence lengths for particular chromosomes:

# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_selection, aes(rownames(df_selection), seqlengths)) +
  
  # Y values are given
  geom_bar(stat='identity') +
  
  # Commas at thousand places
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                 scientific = FALSE)) +
  
  # Rotate axes for better readability
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +

  # Don't sort labels on X axis alphabetically
  scale_x_discrete(limits=rownames(df_selection)) +
  
  # Plot and axis titles are useful
  ggtitle('Sequence lenghts per chromosome') +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab('Chromosome') + 
  ylab('Sequence lenght')

Plot the length of the other sequences:

# Chromosomes on X axis and sequence lengths on Y axis
ggplot(df_other, aes(rownames(df_other), seqlengths)) +
  
  # Y values are given
  geom_bar(stat='identity') +
  
  # Commas at thousand places
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                 scientific = FALSE)) +
  
  # Rotate axes for better readability
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +

  # Don't sort labels on X axis alphabetically
  scale_x_discrete(limits=rownames(df_other)) +
  
  # Plot and axis titles are useful
  ggtitle('Sequence lenghts per chromosome') +
  theme(plot.title = element_text(hjust = 0.5)) +
  # Font size of the X axis labels
  theme(axis.text.x = element_text(size = 6)) +
  xlab('Sequence name') + 
  ylab('Sequence lenght')