{ "metadata": { }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Data manipulation with Pandas\n\nby [Anton Nekrutenko](https://training.galaxyproject.org/hall-of-fame/nekrut/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Questions**\n\n- What is Pandas?\n- What is a dataframe?\n- How do we access data in dataframes?\n\n**Objectives**\n\n- Understand data manipulation in Pandas\n\n**Time Estimation: 1h**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

SARS-CoV-2 as a research problem

\n

To learn about Pandas we will use SARS-CoV-2 data. Before actually jumping into Pandas, let's learn a bit about coronavirus molecular biology.

\n

The following summary is based on the publications cited throughout the text below.

\n\n

Genome organization

\n

All coronaviruses contain a non-segmented, positive-strand RNA genome approximately 30 kb in length. Its organization is invariably 5'-leader-UTR-replicase-S-E-M-N-3'UTR-poly(A). In addition, it contains a variety of accessory proteins interspersed throughout the genome (see Fig. below; from Fehr and Perlman:2015).

\n

Figure 1: Genomic organization of representative α, β, and γ CoVs. An illustration of the MHV genome is depicted at the top. The expanded regions below show the structural and accessory proteins in the 3′ regions of the HCoV-229E, MHV, SARS-CoV, MERS-CoV and IBV. Size of the genome and individual genes are approximated using the legend at the top of the diagram but are not drawn to scale. HCoV-229E human coronavirus 229E, MHV mouse hepatitis virus, SARS-CoV severe acute respiratory syndrome coronavirus, MERS-CoV Middle East respiratory syndrome coronavirus, IBV infectious bronchitis virus.
\n

The long replicase gene encodes a number of non-structural proteins (nsps) and occupies two-thirds of the genome. Because of -1 ribosomal frameshifting within ORF1ab, the polyprotein pp1ab is produced in place of pp1a in approximately 25% of cases. pp1a encodes 11 nsps while pp1ab encodes 15 (Fig. below from Sola:2015).

\n

Figure 2: Coronavirus genome structure and gene expression. (a) Coronavirus genome structure. The upper scheme represents the TGEV genome. Labels indicate gene names; L corresponds to the leader sequence. Also represented are the nsps derived from the processing of the pp1a and pp1ab polyproteins. PLP1, PLP2, and 3CL protease sites are depicted as inverted triangles with the corresponding color code of each protease. Dark gray rectangles represent transmembrane domains, and light gray rectangles indicate other functional domains. (b) Coronavirus genome strategy of sgmRNA expression. The upper scheme represents the TGEV genome. Short lines represent the nested set of sgmRNAs, each containing a common leader sequence (black) and a specific gene to be translated (dark gray). (c) Key elements in coronavirus transcription. A TRS precedes each gene (TRS-B) and includes the core sequence (CS-B) and variable 5′ and 3′ flanking sequences. The TRS of the leader (TRS-L), containing the core sequence (CS-L), is present at the 5′ end of the genome, in an exposed location (orange box in the TRS-L loop). Discontinuous transcription occurs during the synthesis of the negative-strand RNA (light blue), when the copy of the TRS-B hybridizes with the TRS-L. Dotted lines indicate the complementarity between positive-strand and negative-strand RNA sequences. Abbreviations: EndoU, endonuclease; ExoN, exonuclease; HEL, helicase; MTase, methyltransferase (green, N7-methyltransferase; dark purple, 2′-O-methyltransferase); nsp, nonstructural protein; PLP, papain-like protease; RdRp, RNA-dependent RNA polymerase; sgmRNA, subgenomic RNA; TGEV, transmissible gastroenteritis virus; TRS, transcription-regulating sequence; UTR, untranslated region.
\n

Virion structure

\n

The coronavirus virion is a spherical particle approximately 125 nm in diameter. It is covered with S-protein projections, giving it the appearance of a solar corona, hence the name coronavirus. Inside the envelope is a nucleocapsid with helical symmetry, which is far more typical of (-)-strand RNA viruses. There are four main structural proteins: spike (S), membrane (M), envelope (E), and nucleocapsid (N; Fig. below from Masters:2006):

\n

Figure 3: Schematic of the coronavirus virion, with the minimal set of structural proteins
\n

The mature S protein is a trimer in which each monomer consists of two subunits, S1 and S2, produced from a single S precursor by host proteases (see Kirchdoerfer:2016; this, however, is not the case for all coronaviruses, e.g., SARS-CoV). S1 forms the receptor-binding domain, while S2 forms the stalk.

\n

M protein is the most abundant structural component of the virion and determines its shape. It possibly exists as a dimer.

\n

The E protein is the least abundant structural protein of the virion and possesses ion-channel activity. It facilitates the assembly and release of the virus. In SARS-CoV it is not required for replication but is essential for pathogenesis.

\n

The N protein forms the nucleocapsid. Its N- and C-terminal domains are both capable of RNA binding. Specifically, it binds to transcription-regulatory sequences and the genomic packaging signal. It also binds to nsp3 and the M protein, possibly tethering the viral genome to the replicase-transcriptase complex.

\n

Entering the cell

\n

The virion attaches to the cell through an interaction between the S protein and a cellular receptor. In the case of SARS-CoV-2 (as in SARS-CoV), angiotensin-converting enzyme 2 (ACE2) serves as the receptor, binding to the C-terminal portion of the S1 domain. After receptor binding, the S protein is cleaved at two sites within the S2 subdomain. The first cleavage separates the receptor-binding domain of S from the membrane-fusion domain. The second cleavage (at S2') exposes the fusion peptide. SARS-CoV-2 differs from SARS-CoV in that it contains a furin cleavage site that is processed during viral maturation within the endoplasmic reticulum (ER; see Walls:2020). The fusion peptide inserts into the cellular membrane and triggers joining of the two heptad-repeat regions within S2, forming an antiparallel six-helix bundle that results in fusion and release of the genome into the cytoplasm (Fig. below from Jackson:2022).

\n

Figure 4: Two spike (S) protein cleavage events are typically necessary for the coronavirus entry process: one at the junction of the S1 and S2 subunits and the other at the S2′ site, internal to the S2 subunit. In the case of SARS-CoV-2, the polybasic sequence at the S1–S2 boundary is cleaved during virus maturation in an infected cell, but the S2′ site is cleaved at the target cell following angiotensin-converting enzyme 2 (ACE2) binding. Virus binding to ACE2 (step 1) induces conformational changes in the S1 subunit and exposes the S2′ cleavage site in the S2 subunit. Depending on the entry route taken by SARS-CoV-2, the S2′ site is cleaved by different proteases. Left: If the target cell expresses insufficient transmembrane protease, serine 2 (TMPRSS2) or if a virus–ACE2 complex does not encounter TMPRSS2, the virus–ACE2 complex is internalized via clathrin-mediated endocytosis (step 2) into the endolysosomes, where S2′ cleavage is performed by cathepsins, which require an acidic environment for their activity (steps 3 and 4). Right: In the presence of TMPRSS2, S2′ cleavage occurs at the cell surface (step 2). In both entry pathways, cleavage of the S2′ site exposes the fusion peptide (FP) and dissociation of S1 from S2 induces dramatic conformational changes in the S2 subunit, especially in heptad repeat 1, propelling the fusion peptide forward into the target membrane, initiating membrane fusion (step 5 on the left and step 3 on the right). Fusion between viral and cellular membranes forms a fusion pore through which viral RNA is released into the host cell cytoplasm for uncoating and replication (step 6 on the left and step 4 on the right). Several agents disrupt interaction between the S protein and ACE2: ACE2 mimetics, therapeutic monoclonal antibodies targeting the neutralizing epitopes on the S protein and antibodies elicited by vaccination block virus binding to ACE2 and thus inhibit both entry pathways. By contrast, strategies targeting post-receptor-binding steps differ between the two pathways. Being a serine protease inhibitor, camostat mesylate restricts the TMPRSS2-mediated entry pathway. Hydroxychloroquine and chloroquine block endosomal acidification, which is necessary for cathepsin activity, and thus restrict the cathepsin-mediated entry pathway
\n

Replication

\n

The above figure shows that, in addition to the full-length (+)-strand genome, there are a number of (+)-strand subgenomic RNAs corresponding to the 3'-end of the complete viral sequence. All of these subgenomic RNAs (sgRNAs) share the same leader sequence, which is present only once, at the extreme 5'-end of the viral genome. These RNAs are produced via discontinuous RNA synthesis, when the RNA-dependent RNA polymerase (RdRp) switches template:

\n\n

☝ Model for the formation of genome high-order structures regulating N gene transcription. The upper linear scheme represents the coronavirus genome. The red line indicates the leader sequence in the 5′ end of the genome. The hairpin indicates the TRS-L. The gray line with arrowheads represents the nascent negative-sense RNA. The curved blue arrow indicates the template switch to the leader sequence during discontinuous transcription. The orange line represents the copy of the leader added to the nascent RNA after the template switch. The RNA-RNA interactions between the pE (nucleotides 26894 to 26903) and dE (nucleotides 26454 to 26463) and between the B-M in the active domain (nucleotides 26412 to 26421) and the cB-M in the 5′ end of the genome (nucleotides 477 to 486) are represented by solid lines. Dotted lines indicate the complementarity between positive-strand and negative-strand RNA sequences. Abbreviations: AD, active domain secondary structure prediction; B-M, B motif; cB-M, complementary copy of the B-M; cCS-N, complementary copy of the CS-N; CS-L, conserved core sequence of the leader; CS-N, conserved core sequence of the N gene; dE, distal element; pE, proximal element; TRS-L, transcription-regulating sequence of the leader (From Sola:2015).

\n

Furthermore, Sola:2015 suggests a model of coronavirus transcription in which the transcription initiation complex forms at the 3'-end of the (+)-strand genomic RNA:

\n

Figure 5: Three-step model of coronavirus transcription. (1) Complex formation. Proteins binding transcription-regulating sequences are represented by ellipsoids, the leader sequence is indicated with a red bar, and core sequences are indicated with orange boxes. (2) Base-pairing scanning. Negative-strand RNA is shown in light blue; the transcription complex is represented by a hexagon. Vertical lines indicate complementarity between the genomic RNA and the nascent negative strand. (3) Template switch. Due to the complementarity between the newly synthesized negative-strand RNA and the transcription-regulating sequence of the leader, template switch to the leader is made by the transcription complex to complete the copy of the leader sequence
\n

How the virus is screened

\n

There are two main approaches: (1) RNA-seq and (2) amplicon sequencing.

\n

The RNA-seq approach requires a large amount of input material but is suitable for detecting low-frequency variants. Amplicon-based approaches work for samples containing only a small fraction of viral RNA and are by far the most widely used (from NEB):

\n

Figure 6: Nucleotide variants related to lineages of concern are indicated in yellow above a schematic SARS-CoV-2 genome. Primers that do not overlap with variants associated with these lineages are shown in blue. Overlapping primers are indicated in orange.
\n

Pandas!

\n
\n

This is an aggregated tutorial relying on material from several fantastic sources.

\n\n
\n

Pandas (from “Panel Data”) is an essential piece of scientific (and not only scientific) data-analysis infrastructure. It is, in essence, a highly optimized library for manipulating very large tables (or “Data Frames”).

\n

Today we will be using a single notebook that explains the basics of this powerful tool:

\n

Pandas learning resources

\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "# Pandas, conventionally imported as pd\n", "import pandas as pd" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-2", "source": "

Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. Data come in many formats, and you will spend much of your time wrangling them into a usable form.

\n

Pandas is the primary tool in the Python ecosystem for handling data. Its primary object, the DataFrame, is extremely useful for wrangling data. We will explore some of that functionality here and will put it to use in the next lesson.

\n

Basics

\n

The data set

\n

The dataset we will be using is a subset of metadata describing SARS-CoV-2 datasets from the Sequence Read Archive.

\n

It is obtained by going to https://www.ncbi.nlm.nih.gov/sra and performing a query with the following search terms: txid2697049[Organism:noexp].

\n

Results are downloaded using the Send to: menu, selecting File and then RunInfo. Let's get these results into this notebook:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "!wget https://zenodo.org/records/10680001/files/sra_ncov.csv.gz" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-4", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "!gunzip -c sra_ncov.csv.gz | head" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-6", "source": "

The first line contains the headers for each column; the data follow. Given the file I/O skills you recently learned, you could write some functions to parse this file and extract the data you want. You can imagine that this might be kind of painful. However, if the file format is nice and clean, like we more or less have here, we can use pre-built tools. Pandas has a very powerful function, pd.read_csv(), that can read in a CSV file and store the contents in a convenient data structure called a data frame. In Pandas, the data type for a data frame is DataFrame, and we will use “data frame” and “DataFrame” interchangeably.

\n

Reading in data

\n

Let’s first look at the doc string of pd.read_csv().

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "pd.read_csv?" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-8", "source": "

Holy cow! There are so many options we can specify for reading in a CSV file. You will likely find reasons to use many of these throughout your research. For now, however, we do not need most of them. So, let's load in the data set. Note that even though the dataset is compressed with gzip, we do not need to do anything extra: Pandas automatically detects the compression and decompresses the data while loading it into the dataframe.
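By default, the compression is inferred from the file extension; if you ever need to be explicit (for example, when the extension is missing), the same call can name it directly. A minimal sketch, equivalent to the call below:

```python
# compression='infer' is the default, so the '.gz' extension is enough,
# but the codec can also be spelled out explicitly
df = pd.read_csv('sra_ncov.csv.gz', compression='gzip')
```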

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "df = pd.read_csv('sra_ncov.csv.gz')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-10", "source": "

The same result can be achieved directly without downloading the file first:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "df = pd.read_csv('https://zenodo.org/records/10680001/files/sra_ncov.csv.gz')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-12", "source": "

We now have the data stored in a data frame. We can look at it in the Jupyter Notebook, since Jupyter displays it in a well-organized, pretty way. Note that because our dataframe is big, we only display the first five rows using the head() function:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "df.head()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-14", "source": "

Indexing data frames

\n

The data frame is a convenient data structure for many reasons that will become clear as we start exploring. Let’s start by looking at how data frames are indexed. Let’s try to look at the first row.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "df[0]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-16", "source": "

Yikes! Lots of errors. The problem is that we tried to index numerically by row. We index DataFrames by columns, and no column has the name 0 in this data frame (though there could be one). Instead, let's look at the Run column, which holds the run accession numbers.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "df['Run'].head()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-18", "source": "

This gave us the entries we were after. Notice that when the column was printed, the index of the rows came along with it. If we wanted to pull out a single run accession, say the one corresponding to index 4, we can do that.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "df['Run'][4]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-20", "source": "

However, this is not the preferred way to do this. It is better to use .loc. This gives the location in the data frame we want.

\n
\n
\n

loc and iloc are the two primary indexers Pandas provides for selecting data from a DataFrame.

\n

loc: label-based indexing, meaning that you specify rows and columns by their index labels. For example, df.loc[3, ‘column_name’] will select the value in the row labeled 3 and the column labeled ‘column_name’.

\n

iloc: integer-based indexing, meaning that you specify rows and columns by their integer positions. For example, df.iloc[3, 2] will select the value in the fourth row and the third column (remembering that indexing starts at 0).

\n

In summary, loc uses labels for indexing, while iloc uses integer positions.
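To make the distinction concrete, here is a minimal sketch on a toy frame (made-up data, not the SRA table) whose row labels are letters, so labels and positions cannot be confused:

```python
import pandas as pd

# Toy frame with non-integer row labels (hypothetical data)
toy = pd.DataFrame({'value': [10, 20, 30]}, index=['a', 'b', 'c'])

print(toy.loc['b', 'value'])   # label-based    -> 20
print(toy.iloc[1, 0])          # position-based -> 20
```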

\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "df.loc[4, 'Run']" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-22", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "df.iloc[4:6]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-24", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "df.iloc[4:6, [0,2,4]]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-26", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "df.loc[4:6, ['Run','size_MB','LibraryStrategy']]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-28", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "df.loc[4:6, 'Run':]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-30", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "df.loc[4:6, 'Run':'LibrarySource']" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-32", "source": "

It is also important to note that row indices need not be integers, and you should not count on them being integers. In practice, you will almost never use row indices; you will use Boolean indexing instead.

\n

Filtering: Boolean indexing of data frames

\n

Let's say I wanted to pull out accession numbers of runs produced by Pacific Biosciences machines (in this table such datasets are labeled as PACBIO_SMRT). I can use Boolean indexing to specify the rows: specifically, I want the rows for which df['Platform'] == 'PACBIO_SMRT'. You can essentially plop this syntax directly into .loc.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "df.loc[df['Platform'] == 'PACBIO_SMRT', 'Run']" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-34", "source": "

If I want to pull out the whole records for those runs, I can use : for the column index.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "df.loc[df['Platform'] == 'PACBIO_SMRT', :].head(10)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-36", "source": "

Now, let's pull out all PacBio records that were obtained from amplicon sequencing. We can again use Boolean indexing, but we need the & operator. We have not covered this bitwise operator before, but the syntax is self-explanatory in the example below. Note that it is important to wrap each Boolean comparison in parentheses because of the precedence of the operators involved.
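The & operator also has companions: | (elementwise OR) and ~ (elementwise NOT). A hedged sketch, assuming the OXFORD_NANOPORE platform label is present in this table (it is part of the standard SRA vocabulary):

```python
# Runs from either long-read platform (| is elementwise OR)
long_reads = (df['Platform'] == 'OXFORD_NANOPORE') | (df['Platform'] == 'PACBIO_SMRT')

# Everything that is not amplicon-based (~ negates; equivalent to using !=)
not_amplicon = ~(df['LibraryStrategy'] == 'AMPLICON')

df.loc[long_reads & not_amplicon, :].head(3)
```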

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "df.loc[(df['Platform'] == 'PACBIO_SMRT') & (df['LibraryStrategy'] == 'AMPLICON'), :].head(3)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-38", "source": "

What is going on will be clearer if we set up our Boolean indexing first, as follows.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "inds = (df['Platform'] == 'PACBIO_SMRT') & (df['LibraryStrategy'] == 'AMPLICON')\n", "inds[2:6]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-40", "source": "

Notice that inds is an array (actually a Pandas Series, a one-dimensional labeled array) of Trues and Falses. We can apply the unique() function from NumPy to see how many True and False rows we have:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "import numpy as np\n", "np.unique(inds, return_counts=True)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-42", "source": "

When we index with it using .loc, we get back rows where inds is True:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "df.loc[inds, :]" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-44", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "df.loc[(df['Platform'] == 'PACBIO_SMRT') & (df['LibraryStrategy'] == 'AMPLICON'), :].head(3)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-46", "source": "

Calculating with data frames

\n

The SRA data contains (sometimes) the size of the run in megabytes (size_MB). Let's suppose we want to consider only those runs where size_MB is at least 100. We might like to add a column to the data frame that specifies whether or not the corresponding run is at least 100 MB. We can conveniently compute with columns. This is done element-wise.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "df['size_MB'] >= 100" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-48", "source": "

This tells us which runs are at least 100 MB. We can simply add this back to the data frame.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "# Add the column to the DataFrame\n", "df['Over100Mb'] = df['size_MB'] >= 100\n", "\n", "# Take a look\n", "df.head()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-50", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "df.assign(over100 = df['size_MB']>=100)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-52", "source": "

A note on assign

\n

The assign method in the pandas library is used to add new columns to a pandas DataFrame. The method allows you to perform calculations or operations on existing columns to generate new columns without modifying the original DataFrame.

\n

For example, you can use the assign method to add a new column that calculates the sum of two existing columns or to add a column based on a complex calculation that involves multiple columns. The method also supports adding columns based on external data sources, such as NumPy arrays or other pandas DataFrames.

\n

Using the assign method is useful because it allows you to create new columns and add them to a DataFrame in a way that is both readable and easy to maintain. The method is also chainable, meaning you can add multiple columns in one call, making your code more concise and efficient.
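A minimal sketch of that chainability, using only columns present in this table (the derived column names here are made up):

```python
# Two derived columns in one readable expression; df itself is unchanged
df_annotated = df.assign(
    over_100_mb=df['size_MB'] >= 100,  # Boolean flag
    size_GB=df['size_MB'] / 1024,      # unit conversion
)
df_annotated.head()
```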

\n

A note about vectorization

\n

Notice how applying the >= operator to a Series resulted in element-wise application. This is called vectorization. It means that we do not have to write a for loop to do operations on the elements of a Series or other array-like object. Imagine if we had to do that with a for loop.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "big_runs = []\n", "for size in df['size_MB']:\n", " big_runs.append(size >= 100)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-54", "source": "

This is cumbersome. Vectorization allows for much more convenient calculation. Beyond that, vectorized code is almost always faster when using Pandas and NumPy because the looping is done with compiled code under the hood. This works with many operators, including those you've already seen, like +, -, *, /, **, etc.
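If you want to see the difference for yourself, here is a quick (and admittedly crude) timing sketch:

```python
import time

# Vectorized comparison
t0 = time.perf_counter()
vectorized = df['size_MB'] >= 100
t_vec = time.perf_counter() - t0

# Explicit Python loop over the same column
t0 = time.perf_counter()
looped = [size >= 100 for size in df['size_MB']]
t_loop = time.perf_counter() - t0

print(f'vectorized: {t_vec:.5f} s; loop: {t_loop:.5f} s')
```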

\n

Outputting a new CSV file

\n

Now that we have added the Over100Mb column, we might like to save our data frame as a new CSV that we can reload later. We use df.to_csv() for this, with the index=False kwarg to ask Pandas not to explicitly write the indices to the file.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "df.to_csv('over100Mb_data.csv', index=False)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-56", "source": "

Let’s take a look at what this file looks like.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "!head over100Mb_data.csv" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-58", "source": "

Tidy data

\n

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

\n
1. Each variable is a column.
2. Each observation is a row.
3. Each type of observation has its own separate data frame.
\n

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data that is convenient for visualization is different from that which is convenient for analysis.\nA tidy data frame is almost always much easier to work with than non-tidy formats.
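To make this concrete, here is a tiny sketch (with made-up sample names) of the same measurements in untidy (wide) and tidy (long) form; melt(), used here, is covered later in this tutorial:

```python
import pandas as pd

# Untidy: one column per sample
untidy = pd.DataFrame({
    'position': [1, 2],
    'sample_A': [10, 12],
    'sample_B': [7, 9],
})

# Tidy: one row per (position, sample) observation
tidy = untidy.melt(id_vars='position', var_name='sample', value_name='coverage')
print(tidy)
```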

\n

Also, let’s take a look at this article.

\n

The data set

\n

The dataset we will be using is a list of all SARS-CoV-2 datasets in the Sequence Read Archive as of January 20, 2021.

\n

It is obtained by going to https://www.ncbi.nlm.nih.gov/sra and performing a query with the following search terms: txid2697049[Organism:noexp].

\n

Results are downloaded using the Send to: menu, selecting File and then RunInfo. Let's get these results into this notebook:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "df = pd.read_csv('https://zenodo.org/records/10680001/files/sra_ncov.csv.gz')\n", "df = df[df['size_MB']> 0].reset_index(drop=True)\n", "\n", "# Take a look\n", "df" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-60", "source": "

This data set is in tidy format. Each row represents a single SRA dataset. The properties of each run are given in each column. We already saw the power of having the data in this format when we did Boolean indexing in the last lesson.

\n

Finding unique values and counts

\n

How many unique sequencing platforms do we have?

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "df['Platform'].unique()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-62", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "df['Platform'].value_counts()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-64", "source": "

Sorting

\n

(and axes!)

\n

Let's start by sorting on the index. First, let's grab a random sample of 10 rows to sort:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "df_subset = df.sample(n=10)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-66", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "df_subset" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-68", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "df_subset.sort_index()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-70", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-71", "source": [ "df_subset.sort_index(axis = 1)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-72", "source": "

Now let’s try sorting by values:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "df_subset.sort_values(by=['LibraryLayout'])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-74", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "df_subset.sort_values(by=['LibraryLayout','size_MB'])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-76", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "df_subset.sort_values(by=['LibraryLayout','size_MB'],ascending=[True,False])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-78", "source": "

Split-apply-combine

\n

Let's say we want to compute the total size of SRA runs for each BioProject. Ignoring for a second the mechanics of how we would do this with Pandas, let's think about it in English. What do we need to do?

\n
1. Split the data set up according to the 'BioProject' field, i.e., split it up so we have a separate data set for each BioProject ID.
2. Apply a sum function to the size_MB column of each of these split data sets.
3. Combine the results of these sums into a new, summary data set that contains one row for each BioProject and the total run size for each.
\n

We see that the strategy we want is split-apply-combine. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often: split the data according to some criterion, apply some function to the split-up data, and combine the results into a new data frame.

\n

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by; all rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to these new data sets, and combine the results into a new data frame.

\n

Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.
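Before aggregating, it can help to peek at what the splitting step produces: a DataFrameGroupBy is iterable, yielding (group key, sub-DataFrame) pairs. A small sketch:

```python
# Inspect only the first group; each item is (group key, sub-DataFrame)
for bioproject, sub_df in df.groupby('BioProject'):
    print(bioproject, '->', len(sub_df), 'runs')
    break
```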

\n

Aggregation

\n

Let’s go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by BioProject.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-79", "source": [ "grouped = df.groupby(['BioProject'])\n", "\n", "# Take a look\n", "grouped" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-80", "source": "

There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Typing grouped. and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The grouped.sum() method is exactly what we want.
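One hedged caveat: across pandas releases, summing a group that contains non-numeric columns has behaved differently (nuisance columns silently dropped, strings concatenated, or an error raised). Selecting the numeric column before aggregating, as this notebook also does a couple of cells below, sidesteps the ambiguity:

```python
# Robust across pandas versions: aggregate only the numeric column
df_sum = grouped['size_MB'].sum().to_frame()
df_sum.head()
```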

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-81", "source": [ "df_sum = grouped.sum()\n", "\n", "# Take a look\n", "df_sum" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-82", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-83", "source": [ "df_sum = pd.DataFrame(grouped['size_MB'].sum())\n", "df_sum" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-84", "source": "

The outputted data frame has the sums of the numerical columns only, of which we have just one: size_MB. Note that this data frame has BioProject as the name of the row index. If we want to instead keep BioProject (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-85", "source": [ "df_sum.reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-86", "source": "

Note, though, that this was not done in-place. If you want to update your data frame, you have to explicitly do so.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-87", "source": [ "df_sum = df_sum.reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-88", "source": "

We can also use multiple columns in our groupby() operation. To do this, we simply pass in a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a sum, and then resetting the index of the result, all in one line.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-89", "source": [ "df.groupby(['BioProject', 'Platform']).sum().reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-90", "source": "

This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the sum.

\n

You now have tremendous power in your hands. When your data are tidy, you can rapidly carry out the ubiquitous split-apply-combine operations. Importantly, you now have a logical framework to think about how you slice and dice your data. As a final, simple example, I will show you how to go from start to finish: after loading the data set into a data frame, split by BioProject and Platform, and then get the quartiles and extrema, in addition to the mean and standard deviation.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-91", "source": [ "df.groupby(['BioProject', 'Platform'])['size_MB'].describe()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-92", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-93", "source": [ "df.groupby(['BioProject', 'Platform'])['size_MB'].describe().reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-94", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-95", "source": [ "import numpy as np\n", "df.groupby(['BioProject', 'Platform']).agg({'size_MB':np.mean, 'Run':'nunique'})" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-96", "source": "

Yes, that’s right. One single, clean, easy-to-read line of code. In the coming tutorials, we will see how to use tidy data to quickly generate plots.

\n
\n
\n

Why is np.mean given without quotes while nunique is in quotes?
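A short answer sketch: .agg() accepts both callables and strings. A callable object such as np.mean is applied directly, while a string such as 'nunique' is looked up by name among the built-in pandas aggregation methods; there is no standalone nunique function in our namespace to pass. For the mean, both spellings are equivalent:

```python
# 'mean' (string) resolves to the built-in pandas aggregation;
# np.mean (callable) is applied directly to each group.
df.groupby(['BioProject', 'Platform']).agg({'size_MB': 'mean', 'Run': 'nunique'})
```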

\n
\n

Tidying a data set

\n

You should always organize your data sets in a tidy format. However, this is sometimes just not possible, since your data sets can come from instruments that do not output the\ndata in tidy format (though most do, at least in my experience), and you often have collaborators that send data in untidy formats.

\n

The most useful function for tidying data is pd.melt(). To demonstrate this we will use a dataset describing read coverage across SARS-CoV-2 genomes for several samples.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-97", "source": [ "df = pd.read_csv('https://zenodo.org/records/10680470/files/coverage.tsv.gz',sep='\\t')\n", "\n", "df.head()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-98", "source": "

Clearly, these data are not tidy. When we melt the data frame, the values within it become a single column, and the headers (the variables) become another new column. So, to melt it, we need to specify what we want to call the values and what we want to call the variable. pd.melt() does the rest!

\n


\n
\n

Image from Pandas Docs.

\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-99", "source": [ "melted = pd.melt(df, value_name='coverage', var_name=['sample'],value_vars=df.columns[3:],id_vars=['start','end'])\n", "\n", "melted.head()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-100", "source": "

…and we are good to go with a tidy DataFrame! This makes it easy to plot coverage and to summarize it per sample:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-101", "source": [ "import seaborn as sns\n", "sns.relplot(data=melted, x='start',y='coverage',kind='line')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-102", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-103", "source": [ "sns.relplot(data=melted, x='start',y='coverage',kind='line',hue='sample')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-104", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-105", "source": [ "melted.groupby(['sample']).describe()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-106", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-107", "source": [ "melted.groupby(['sample'])['coverage'].describe()" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-108", "source": "

To get back from the melted (narrow) format to the wide format, we can use the pivot() function.

\n


\n
\n

Image from Pandas Docs.

\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-109", "source": [ "melted.pivot(index=['start','end'],columns='sample',values='coverage')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-110", "source": "

Working with multiple tables

\n

Working with multiple tables often involves joining them on a common key:

\n


\n

In fact, this can be done in several different ways, described below. But first, let's install pandasql (so we can compare Pandas with plain SQL) and define two simple dataframes:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-111", "source": [ "!pip install --upgrade pandasql" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-112", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-113", "source": [ "from pandasql import sqldf\n", "pysqldf = lambda q: sqldf(q, globals())" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-114", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-115", "source": [ "df1 = pd.DataFrame({\"key\": [\"A\", \"B\", \"C\", \"D\"], \"value\": np.random.randn(4)})\n", "df2 = pd.DataFrame({\"key\": [\"B\", \"D\", \"D\", \"E\"], \"value\": np.random.randn(4)})" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-116", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-117", "source": [ "df1" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-118", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-119", "source": [ "df2" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-120", "source": "

Inner join

\n


\n
\n

Figure from Wikimedia Commons

\n
\n

Using pandas merge:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-121", "source": [ "pd.merge(df1, df2, on=\"key\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-122", "source": "

Using pysqldf:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-123", "source": [ "pysqldf('select * from df1 join df2 on df1.key=df2.key')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-124", "source": "

Left join

\n


\n
\n

Figure from Wikimedia Commons

\n
\n

Using pandas merge:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-125", "source": [ "pd.merge(df1, df2, on=\"key\", how=\"left\").fillna('.')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-126", "source": "

Using pysqldf:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-127", "source": [ "pysqldf('select * from df1 left join df2 on df1.key=df2.key').fillna('.')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-128", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-129", "source": [ "pysqldf('select df1.key, df1.value as value_x, df2.value as value_y from df1 left join df2 on df1.key=df2.key').fillna('.')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-130", "source": "

Right join

\n

\"Right

\n
\n

Figure from Wikimedia Commons

\n
\n

Using pandas merge:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-131", "source": [ "pd.merge(df1, df2, on=\"key\", how=\"right\").fillna('.')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-132", "source": "

Full join

\n

\"Full

\n
\n

Figure from Wikimedia Commons

\n
\n

Using pandas merge:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-133", "source": [ "pd.merge(df1, df2, on=\"key\", how=\"outer\").fillna('.')" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "To learn about pandas we will use SARS-CoV-2 data. Before actually jumping to Pandas let's learn about the coronavirus molecular biology." ], "id": "" } } }, { "id": "cell-134", "source": "

Putting it all together: Pandas + Altair

\n

Understanding Altair

\n

Vega-Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite. It offers a powerful and concise grammar that lets you quickly build a wide range of statistical visualizations; the Altair documentation provides a complete set of “how-to” tutorials covering the package.

\n
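
As a minimal sketch of what “declarative” means here (toy data invented purely for illustration, unrelated to the SARS-CoV-2 analysis):

import altair as alt\nimport pandas as pd\n\n# Toy data invented for illustration\ntoy = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 1, 7]})\n\n# Declarative grammar: pick a mark, then map columns to visual channels\nalt.Chart(toy).mark_point().encode(x='x:Q', y='y:Q')
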

Importing data

\n

First, import all the packages we need:

\n
import pandas as pd\nimport altair as alt\nfrom datetime import date\ntoday = date.today()\n
\n

Next, read a gigantic dataset hosted on Zenodo:

\n
sra = pd.read_csv(\n\"https://zenodo.org/records/10680776/files/ena.tsv.gz\",\ncompression='gzip',\nsep=\"\\t\",\nlow_memory=False\n)\n
\n
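
Because the URL ends in .gz, Pandas infers the compression automatically (compression='infer' is the default), so this shorter call should behave identically:

# compression='infer' (the default) detects gzip from the .gz suffix\nsra = pd.read_csv('https://zenodo.org/records/10680776/files/ena.tsv.gz', sep='\t', low_memory=False)
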

This dataset contains a lot of rows:

\n
len(sra)\n
\n

Which would output something like:

\n
1000000\n
\n

Let’s look at five random rows from this table (scroll sideways):

\n
sra.sample(5)\n

|   | study_accession | run_accession | collection_date | instrument_platform | library_strategy | library_construction_protocol |
|---|---|---|---|---|---|---|
| 2470282 | PRJEB37886 | ERR8618165 | 2022-02-03 00:00:00 | ILLUMINA | AMPLICON | nan |
| 5747352 | PRJNA686984 | SRR21294506 | 2022-07-10 00:00:00 | OXFORD_NANOPORE | AMPLICON | nan |
| 2588443 | PRJEB37886 | ERR8997118 | 2022-02-15 00:00:00 | ILLUMINA | AMPLICON | nan |
| 3662323 | PRJNA716984 | SRR16033134 | 2021-09-08 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers |
| 5298351 | PRJNA716984 | SRR20455932 | 2022-04-10 00:00:00 | PACBIO_SMRT | AMPLICON | Freed primers |
\n

This dataset also has a lot of columns:

\n
for column in sra.columns: print(column)\n
\n

We do not need all of these columns, so let’s restrict the dataset to only the ones we need. This also makes it much smaller:

\n
sra = sra[\n[\n'study_accession',\n'run_accession',\n'collection_date',\n'instrument_platform',\n'library_strategy',\n'library_construction_protocol'\n]\n]\n
\n
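
To confirm the size reduction, you can check the approximate in-memory footprint before and after subsetting (a quick sanity check, not part of the original recipe):

# Approximate in-memory size in megabytes (deep=True counts string contents)\nprint(round(sra.memory_usage(deep=True).sum() / 1e6), 'MB')
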

Cleaning the data

\n

The collection_date field will be useful for filtering out nonsense, as you will see below. But to use it properly, we need to tell Pandas that it contains not just text but actual dates:

\n
sra = sra.assign(collection_date = pd.to_datetime(sra[\"collection_date\"]))\n
\n
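
If some entries cannot be parsed as dates, pd.to_datetime raises an error; its errors='coerce' option instead converts such values to NaT, which can then be dropped (a hedged variant, not required for this dataset):

# errors='coerce' turns unparseable entries into NaT instead of raising\nsra = sra.assign(collection_date = pd.to_datetime(sra['collection_date'], errors='coerce'))\n# Optionally drop the rows that failed to parse\nsra = sra.dropna(subset=['collection_date'])
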

Let’s see what the earliest and latest entries are:

\n
print('Earliest entry:', sra['collection_date'].min())\nprint('Latest entry:', sra['collection_date'].max())\n
\n
\n
Warning: Metadata is 💩
\n

😱 | Don’t be surprised here: the metadata is only as good as the person who entered it. So, when you enter metadata for your sequencing data, pay attention!

\n
\n
\n
Question
\n

Can you write a statement that would show how many rows contain these erroneous dates?

\n
👁 View solution\n
\n
sra[sra['collection_date'] == sra['collection_date'].min()]['run_accession'].nunique()\nsra[sra['collection_date'] == sra['collection_date'].max()]['run_accession'].nunique()\n
\n
\n
\n

As the warning above suggests, the data needs a bit of cleaning. Let’s keep only entries with plausible collection dates:

\n
sra = sra[\n( sra['collection_date'] >= '2020-01-01' )\n&\n( sra['collection_date'] <= '2023-02-16' )\n]\n
\n
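
An equivalent, slightly more compact filter uses Series.between, which is inclusive on both ends by default:

# between() is inclusive on both ends by default\nsra = sra[sra['collection_date'].between('2020-01-01', '2023-02-16')]
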

Finally, in order to build the heatmap, we need to aggregate the data:

\n
heatmap_2d = sra.groupby(\n['instrument_platform','library_strategy']\n).agg(\n{'run_accession':'nunique'}\n).reset_index()\n
\n
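
For a quick human-readable cross-tab of the same counts you could also pivot (a wide layout for eyeballing; Altair wants the long layout produced above):

# Wide cross-tab of unique run accessions per platform/strategy pair\nsra.pivot_table(index='library_strategy', columns='instrument_platform', values='run_accession', aggfunc='nunique')
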

The result will look something like this:


|   | instrument_platform | library_strategy | run_accession |
|---|---|---|---|
| 0 | BGISEQ | AMPLICON | 21 |
| 1 | BGISEQ | OTHER | 1067 |
| 2 | BGISEQ | RNA-Seq | 64 |
| 3 | BGISEQ | Targeted-Capture | 38 |
| 4 | BGISEQ | WGA | 1 |
| 5 | CAPILLARY | AMPLICON | 3 |
| 6 | DNBSEQ | AMPLICON | 325 |
| 7 | DNBSEQ | OTHER | 5 |
| 8 | DNBSEQ | RNA-Seq | 64 |
| 9 | ILLUMINA | AMPLICON | 4833110 |
| 10 | ILLUMINA | OTHER | 204 |
| 11 | ILLUMINA | RNA-Seq | 34826 |
| 12 | ILLUMINA | Targeted-Capture | 14843 |
| 13 | ILLUMINA | WCS | 74 |
| 14 | ILLUMINA | WGA | 89499 |
| 15 | ILLUMINA | WGS | 59618 |
| 16 | ILLUMINA | WXS | 2 |
| 17 | ILLUMINA | miRNA-Seq | 28 |
| 18 | ION_TORRENT | AMPLICON | 108617 |
| 19 | ION_TORRENT | RNA-Seq | 75 |
| 20 | ION_TORRENT | WGA | 724 |
| 21 | ION_TORRENT | WGS | 1385 |
| 22 | OXFORD_NANOPORE | AMPLICON | 445491 |
| 23 | OXFORD_NANOPORE | OTHER | 31 |
| 24 | OXFORD_NANOPORE | RNA-Seq | 18617 |
| 25 | OXFORD_NANOPORE | WGA | 7842 |
| 26 | OXFORD_NANOPORE | WGS | 13670 |
| 27 | PACBIO_SMRT | AMPLICON | 533112 |
| 28 | PACBIO_SMRT | FL-cDNA | 11 |
| 29 | PACBIO_SMRT | RNA-Seq | 1013 |
| 30 | PACBIO_SMRT | Synthetic-Long-Read | 29 |
| 31 | PACBIO_SMRT | Targeted-Capture | 465 |
| 32 | PACBIO_SMRT | WGS | 48 |
\n

Plotting the data

\n

Now let’s create the graph. It will be layered: the “back” layer holds the heatmap squares and the “front” layer holds the numbers (see the heatmap at the beginning of this page):

\n
back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode(\nx=alt.X(\n\"instrument_platform:N\",\ntitle=\"Instrument\"\n),\ny=alt.Y(\n\"library_strategy:N\",\ntitle=\"Strategy\",\naxis=alt.Axis(orient='right')\n),\ncolor=alt.Color(\n\"run_accession:Q\",\ntitle=\"# Samples\",\nscale=alt.Scale(\nscheme=\"goldred\",\ntype=\"log\"\n),\n),\ntooltip=[\nalt.Tooltip(\n\"instrument_platform:N\",\ntitle=\"Machine\"\n),\nalt.Tooltip(\n\"run_accession:Q\",\ntitle=\"Number of runs\"\n),\nalt.Tooltip(\n\"library_strategy:N\",\ntitle=\"Protocol\"\n)\n]\n).properties(\nwidth=500,\nheight=150,\ntitle={\n\"text\":\n[\"Breakdown of datasets (unique accessions) from ENA\",\n\"by Platform and Library Strategy\"],\n\"subtitle\":\"(Updated {})\".format(today.strftime(\"%B %d, %Y\"))\n}\n)\nback\n
\n

This would give us a grid:

\n

\"Grid.

\n

Now, it would be nice to fill the rectangles with actual numbers:

\n
front = back.mark_text(\nalign=\"center\",\nbaseline=\"middle\",\nfontSize=12,\nfontWeight=\"bold\",\n).encode(\ntext=alt.Text(\"run_accession:Q\",format=\",.0f\"),\ncolor=alt.condition(\nalt.datum.run_accession > 200,\nalt.value(\"white\"),\nalt.value(\"black\")\n)\n)\nfront\n
\n

This would give us the text:

\n

\"Text

\n

To superimpose the two layers, we simply add them:

\n
back + front\n
\n
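
The + operator is shorthand for alt.layer, and if you want to keep the figure, an Altair chart can be saved to HTML (the filename below is just an example):

chart = alt.layer(back, front)  # same as back + front\nchart.save('sra_heatmap.html')  # hypothetical output filename
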

\"Full

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Pandas are an essential tool for data manipulation\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/gnmx-lecture6/tutorial.html#feedback) and check there for further resources!\n" ] } ] }