1 Welcome


Hi, I am Varun Khanna and welcome to my free tutorial on Intermediate . Mastering SQL or sequel is an essential skill in a data scientist’s toolbox. In this tutorial, you will learn to use aggregate functions like COUNT, MAX, AVG, SUM. You will find out how to perform complex operations using subqueries, how to join data from multiple tables in a single statement with JOINS? Finally, how to use set operators LIKE UNION, INTERSECT and EXCEPT and CASE statements in your SQL queries. The code and the data used in this tutorial are available for download and in fact, I encourage you to try to run the queries yourself with the help of the syntax provided and then check the answer.

So, let’s get started!!

1.1 Prerequisites

To successfully complete this tutorial you should have a basic understanding of SQL and R.

1.2 Objectives

By finishing this tutorial, students will be able to:

  1. Group and summarize the data together.
  2. Combine the grouping mechanisms with aggregation functions.
  3. Write queries that retrieve information by combining data from different tables.
  4. Combine the results of multiple queries in various ways.

1.3 Introduction and Refresher

SQL(Structured Query Language) is a database query language - a language specifically designed to interact with relational databases. It is possible to extract, update, filter, replace or insert data using SQL. Some common relational database management software which uses SQL to access the data are MySQL, SQLite, Microsoft SQL Server, PostgreSQL, Oracle, Sybase.

1.3.1 Load the library and database

We are going to use the RSQLite package in the following tutorial and the Amazon stock price dataset.

library(DBI)
AMZN <- read.table("data/AMZN.csv", stringsAsFactors = FALSE, header = TRUE, sep = ",")
# Create a database connection
stock <- dbConnect(RSQLite::SQLite(), "stock.sqlite")
# Write the data in the table
dbWriteTable(stock, "amazon",AMZN, overwrite = TRUE)

1.3.1.1 SELECT Statement

SELECT statement is the core of the SQL language and is used to query the database to retrieve the selected rows that match the user condition. The SELECT statement has five main clauses, although FROM is the only required clause.

Syntax:

SELECT [ALL | DISTINCT] column(s) FROM table [WHERE "condition(s)"] [GROUP BY column(s)] [HAVING "condition(s)"] [ORDER BY column(s) [ASC| DESC]] [LIMIT number]

Clauses table

Name Purpose
SELECT Columns to include in the query result set
FROM Identifies tables from which data has to be drawn
WHERE Filters unwanted data
GROUP BY Groups rows based on a criteria
HAVING Filters unwanted groups
ORDER BY Sorts the rows of the result set in increasing or decreasing order

Example 1: Look at all the columns for the first 5 rows of the dataset.

kable(dbGetQuery(stock, 'SELECT * FROM amazon LIMIT 5'))
Date year month day Open High Low Close Adj.Close Volume
1997-05-15 1997 5 15 2.437500 2.500000 1.927083 1.958333 1.958333 72156000
1997-05-16 1997 5 16 1.968750 1.979167 1.708333 1.729167 1.729167 14700000
1997-05-19 1997 5 19 1.760417 1.770833 1.625000 1.708333 1.708333 6106800
1997-05-20 1997 5 20 1.729167 1.750000 1.635417 1.635417 1.635417 5467200
1997-05-21 1997 5 21 1.635417 1.645833 1.375000 1.427083 1.427083 18853200

ALL and DISTINCT are keywords used to select ALL (by default) or unique rows in the query by discarding the duplicate entries.

Example 2: What are the first five DISTINCT years in the dataset?

dbGetQuery(stock, 'SELECT DISTINCT year FROM amazon LIMIT 5')
##   year
## 1 1997
## 2 1998
## 3 1999
## 4 2000
## 5 2001

1.3.1.2 Aggregate functions

SQL is excellent for aggregating data. Following are the aggregate functions students should be comfortable with before moving forward.

COUNT() - It counts the total number of non-null rows in a column. It is possible to add aliases using AS. For example:

Example 3: Count the total number of rows in the dataset.

dbGetQuery(stock, 'SELECT COUNT(*) AS "total_rows" FROM amazon')
##   total_rows
## 1       5350

SUM() - It adds all the values of a given column. It treats nulls as 0.

Example 4: Find the total volume traded for Amazon stock in the dataset.

dbGetQuery(stock, 'SELECT SUM(Volume) AS "total_vol" FROM amazon')
##     total_vol
## 1 41570051200

MIN()/MAX() - It returns the minimum and the maximum numerical or non-numerical values in a column.

Example 5: Return the first and last date Amazon stock was traded according to the dataset.

dbGetQuery(stock, 'SELECT MIN(Date) AS min_date, MAX(Date) AS max_date FROM amazon')
##     min_date   max_date
## 1 1997-05-15 2018-08-16

AVG() - It returns the average of the values from a selected group. It can only be used for numerical columns and ignores null values.

Example 6: Return the average open price for the stock.

dbGetQuery(stock, 'SELECT AVG(Open) AS avg_open FROM amazon')
##   avg_open
## 1 228.7646

1.3.1.3 GROUP BY and ORDER BY clause

Information retrieved from a database using SQL can be placed into separate categories using GROUP BY clause, which can then be aggregated independent of each other. The query proceeds in two stages. First, the rows are grouped according to grouping criteria. Then, the aggregate operations are performed on the rows of each category. The GROUP BY clause must precede the data items to be used for grouping.

ORDER BY clause is used to sort the results of the query either in ascending (ASC) or descending order (DESC).

1.3.1.3.1 GROUP BY single column

Example 7: Count number of days traded each year.

kable(dbGetQuery(stock, 'SELECT year, COUNT(*) AS days_traded FROM amazon GROUP BY year LIMIT 5'))
year days_traded
1997 160
1998 252
1999 252
2000 252
2001 248
1.3.1.3.2 GROUP BY multiple columns

Example 8: Count number of days traded in months of each year.

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY year, month LIMIT 5'))
year month days_traded
1997 5 11
1997 6 21
1997 7 22
1997 8 21
1997 9 21
1.3.1.3.3 GROUP BY column numbers

We can substitute column names with column numbers in the GROUP BY clause.

Example 9: GROUP BY Column 1 and 2

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY 1, 2 LIMIT 5'))
year month days_traded
1997 5 11
1997 6 21
1997 7 22
1997 8 21
1997 9 21
1.3.1.3.4 ORDER BY column names or numbers

Example 10: Sort the results of the previous query in descending order of the year.

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY 1, 2 ORDER BY year DESC LIMIT 5'))
year month days_traded
2018 1 21
2018 2 19
2018 3 21
2018 4 21
2018 5 22

1.3.1.4 HAVING Clause

HAVING - Similar to WHERE but it allows us to filter on aggregate columns. The HAVING clause always follows GROUP BY clause. Note: The properties that are tested in the GROUP BY clause must be the properties of the groups, not individuals.

Example 11:

Find the average open price of Amazon for each month of each year in the dataset.

kable(dbGetQuery(stock,
'SELECT year,month,avg(open) AS avg_open_price FROM amazon GROUP BY year,month LIMIT 5'))
year month avg_open_price
1997 5 1.686553
1997 6 1.547867
1997 7 2.168561
1997 8 2.222966
1997 9 3.424107

Example 12:

Find every month of each year where Amazon stock price was above $1500/share.

kable(dbGetQuery(stock, 'SELECT year, month, MAX(high) AS month_high FROM amazon GROUP BY year, month HAVING month_high > 1500 ORDER BY year, month'))
year month month_high
2018 2 1528.70
2018 3 1617.54
2018 4 1638.10
2018 5 1635.00
2018 6 1763.10
2018 7 1880.05
2018 8 1925.00

1.3.2 Exercise 1

1. Show all the columns from table amazon where year = 2014 or date > 2016-01-01 and sort the results in such a manner that the same days are displayed together despite different years or months. For example, data from all days with day 31 should appear together.

Sample Output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE (year = 2014 OR date > "2016-01-01") ORDER BY day LIMIT 5'))
Date year month day Open High Low Close Adj.Close Volume
2014-04-01 2014 4 1 338.09 344.43 338.00 342.99 342.99 3600100
2014-05-01 2014 5 1 304.13 310.48 304.00 307.89 307.89 4328600
2014-07-01 2014 7 1 325.86 333.20 325.10 332.39 332.39 3173600
2014-08-01 2014 8 1 313.69 315.83 304.59 307.06 307.06 7441500
2014-10-01 2014 10 1 322.04 322.17 315.55 317.46 317.46 3089100

2. Select all rows between the dates 2007-01-01 and 2009-01-01.

Sample output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE date BETWEEN "2007-01-01" AND "2009-01-01" LIMIT 5'))
Date year month day Open High Low Close Adj.Close Volume
2007-01-03 2007 1 3 38.68 39.06 38.05 38.70 38.70 12405100
2007-01-04 2007 1 4 38.59 39.14 38.26 38.90 38.90 6318400
2007-01-05 2007 1 5 38.72 38.79 37.60 38.37 38.37 6619700
2007-01-08 2007 1 8 38.22 38.31 37.17 37.50 37.50 6783000
2007-01-09 2007 1 9 37.60 38.06 37.34 37.78 37.78 5703000

3. Select all rows where the year is either 2007,2008 or 2018.

Sample output

kable(dbGetQuery(stock, 'SELECT * FROM amazon WHERE year IN (2007,2008,2018) LIMIT 5'))
Date year month day Open High Low Close Adj.Close Volume
2007-01-03 2007 1 3 38.68 39.06 38.05 38.70 38.70 12405100
2007-01-04 2007 1 4 38.59 39.14 38.26 38.90 38.90 6318400
2007-01-05 2007 1 5 38.72 38.79 37.60 38.37 38.37 6619700
2007-01-08 2007 1 8 38.22 38.31 37.17 37.50 37.50 6783000
2007-01-09 2007 1 9 37.60 38.06 37.34 37.78 37.78 5703000

4. SELECT all days when Amazon stock was traded after 10th (including) for the month of January for all years after 2000 (including).

Sample output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE date LIKE "2%-01-1_" LIMIT 5'))
Date year month day Open High Low Close Adj.Close Volume
2000-01-10 2000 1 10 72.5625 72.6250 65.5625 69.1875 69.1875 14757900
2000-01-11 2000 1 11 66.8750 70.0000 65.0000 66.7500 66.7500 10532700
2000-01-12 2000 1 12 67.8750 68.0000 63.0000 63.5625 63.5625 10804500
2000-01-13 2000 1 13 64.9375 67.1875 63.1250 65.9375 65.9375 10448100
2000-01-14 2000 1 14 66.7500 68.6250 64.0000 64.2500 64.2500 6853600

Do it after you have done JOINS.

5. Select all the days when the open price was above the average open price for that month. For example, if the average open price for Jan 1999 was 2.5 USD then select all the days for Jan 1999 where the open price was above 2.5 USD.

Sample output

kable(dbGetQuery(stock,'SELECT a.year,a.month,a.day,a.open,b.avg_open FROM amazon AS a INNER JOIN (SELECT year,month,avg(open) AS avg_open FROM amazon GROUP BY year,month) AS b ON a.year = b.year AND a.month = b.month AND a.open > b.avg_open LIMIT 5'))
year month day Open avg_open
1997 5 15 2.437500 1.686553
1997 5 16 1.968750 1.686553
1997 5 19 1.760417 1.686553
1997 5 20 1.729167 1.686553
1997 6 9 1.656250 1.547867

6. Count the number of days now.

Sample output

kable(dbGetQuery(stock,'SELECT a.year,a.month,COUNT(day) AS num_days, b.avg_open FROM amazon AS a INNER JOIN (SELECT year,month,avg(open) AS avg_open FROM amazon GROUP BY year,month) AS b ON a.year = b.year AND a.month = b.month AND a.open > b.avg_open GROUP BY a.year, a.month LIMIT 5'))
year month num_days avg_open
1997 5 4 1.686553
1997 6 7 1.547867
1997 7 15 2.168561
1997 8 11 2.222966
1997 9 10 3.424107

Disconnect from the stock database

dbDisconnect(stock)

1.4 Intermediate SQL


For the next few lessons you are going to use three datasets viz. drugs names, drugs data and drugs target. These datasets were collected from DRUGBANK website.


1.4.1 Load the data and create the database connection

drugs_data <- read.table("data/drugs/drugs_data", sep = "\t",header = T, stringsAsFactors = F)

drugs_target <- read.table("data/drugs/drugs_target", sep = ",",header = T, stringsAsFactors = F, quote = "\"")

drugs_names <- read.table("data/drugs/drugs_names", sep = ",",header = T, stringsAsFactors = F, quote = "\"")


# Create a database connection
drugs <- dbConnect(RSQLite::SQLite(), "drugs.sqlite")
# Write the data in the table
dbWriteTable(drugs, "drugs_data",drugs_data, overwrite = TRUE)
dbWriteTable(drugs, "drugs_target",drugs_target, overwrite = TRUE)
dbWriteTable(drugs, "drugs_names",drugs_names, overwrite = TRUE)

Example:

Display the records in each table

kable(dbGetQuery(drugs, 'SELECT * FROM drugs_names LIMIT 5'))
drugBankID name cid
DB00115 Cyanocobalamin 44176380
DB00116 Tetrahydrofolic acid 91443
DB00117 L-Histidine 6274
DB00118 S-Adenosylmethionine 34756
DB00119 Pyruvic acid 1060
kable(dbGetQuery(drugs, 'SELECT * FROM drugs_data LIMIT 5'))
cid complexity hbond_acceptors hbond_donors rotatable_bond molecular_formula molecular_weight tpsa total_charge heavy_atom_count tautomer_count
44176380 3220 21 10 16 C63H89CoN14O14P 1356.396 473.0 0 93 1000
91443 834 9 8 9 C19H23N7O6 445.436 207.0 0 32 85
6274 151 4 3 3 C6H9N3O2 155.157 92.0 0 11 2
34756 533 10 5 7 C15H23N6O5S+ 399.446 184.0 1 27 3
1060 84 3 1 1 C3H4O3 88.062 54.4 0 6 2
kable(dbGetQuery(drugs, 'SELECT * FROM drugs_target LIMIT 5'))
drugBankID name type uniProtID proteinName
DB00001 Lepirudin BiotechDrug P00734 Prothrombin
DB00002 Cetuximab BiotechDrug P00533 Epidermal growth factor receptor
DB00002 Cetuximab BiotechDrug O75015 Low affinity immunoglobulin gamma Fc region receptor III-B
DB00002 Cetuximab BiotechDrug P00736 Complement C1r subcomponent
DB00002 Cetuximab BiotechDrug P02745 Complement C1q subcomponent subunit A

All the displayed tables are self-explanatory. However, a brief description is given below:

  1. The drugs_names table is the first table in the drugs database. It contains drugbank id, the common name of the drug and PubChem compound id (cid).

  2. The drugs_data table is the second table in the database. It contains the PubChem compound id (cid) and various molecular properties of drugs like molecular weight, number of rotatable bonds, the total charge on the drug etc.

  3. The durgs_target table is the third table. It contains drugbank id, name, type of drug and the name of the protein or receptor the drug binds to.

1.4.2 Subqueries

A subquery (also called as an inner or nested query) is a SQL query that is inside a larger query and allows the user to perform operations in multiple steps. The inner query is executed first and the results are passed on to the outer query. Subqueries can occur at several places inside a query. For example:

  1. In SELECT clause
  2. In FROM clause
  3. In WHERE clause
  • Subquery with WHERE clause

Syntax:

SELECT table1.column_name(s) FROM table1 WHERE search_condition ( Subquery);

Example 1:

Select drugbank id, names of drugs with molecular weight less than equal to 500 from drugs_names and drugs_data tables.

kable(dbGetQuery(drugs,"SELECT dn.drugBankID, dn.name FROM drugs_names AS dn WHERE dn.cid IN ( SELECT dd.cid FROM drugs_data AS dd WHERE dd.molecular_weight <= 500) LIMIT 5"))
drugBankID name
DB00116 Tetrahydrofolic acid
DB00117 L-Histidine
DB00118 S-Adenosylmethionine
DB00119 Pyruvic acid
DB00120 L-Phenylalanine

Example 2a:

Select the count of drugs that bind to a protein target and sort the results by drugs count (no subquery required).

kable(dbGetQuery(drugs,
"SELECT  uniprotID, proteinName, COUNT(*) as drugs_count FROM drugs_target GROUP BY uniprotID, proteinName ORDER BY drugs_count DESC LIMIT 5"))
uniProtID proteinName drugs_count
P14867 Gamma-aminobutyric acid receptor subunit alpha-1 89
P35348 Alpha-1A adrenergic receptor 88
P35367 Histamine H1 receptor 84
P11229 Muscarinic acetylcholine receptor M1 80
P14416 D(2) dopamine receptor 76

Example 2b:

Select drugbank id, protein id and protein name from drugs_target table where a protein target binds to more than 40 drugs. (subquery required).

kable(dbGetQuery(drugs,"SELECT drugBankID,uniProtID,proteinName FROM drugs_target WHERE uniprotID IN(SELECT uniProtID FROM drugs_target GROUP BY uniprotID HAVING COUNT(uniprotID) >40) ORDER BY drugBankID LIMIT 5"))
drugBankID uniProtID proteinName
DB00154 P35354 Prostaglandin G/H synthase 2
DB00154 P23219 Prostaglandin G/H synthase 1
DB00159 P35354 Prostaglandin G/H synthase 2
DB00159 P23219 Prostaglandin G/H synthase 1
DB00176 P31645 Sodium-dependent serotonin transporter

Note:There are many instances where a protein target binds to more than one drug (like Gamma-aminobutyric acid receptor subunit alpha-1 binds 89 drugs) however, there are also drugs which bind to many protein targets. For example, DB00154 shown in the table above binds to two different protein targets.

Example 3a:

Let us now count the number of protein targets each drug binds to and sort the results by protein count (no subquery required).

kable(dbGetQuery(drugs,"SELECT  drugBankID, COUNT(*) AS protein_count FROM drugs_target GROUP BY drugBankID ORDER BY protein_count DESC LIMIT 5"))
drugBankID protein_count
DB03147 85
DB00898 51
DB00334 50
DB00543 48
DB01049 41

There are five drugs DB03147, DB00898, DB00334, DB00543 and DB01049 which binds to more than 40 protein targets. Pretty amazing..!! These drugs not only bind there proposed protein target but also others. This may explain why certain drugs have side effects.

Example 3b:

Now let us find all the protein targets these drugs bind to.

kable(dbGetQuery(drugs,"SELECT drugBankID,uniProtID,proteinName FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(drugBankID) > 40) LIMIT 5"))
drugBankID uniProtID proteinName
DB00334 P28223 5-hydroxytryptamine receptor 2A
DB00334 P14416 D(2) dopamine receptor
DB00334 P21728 D(1A) dopamine receptor
DB00334 P21918 D(1B) dopamine receptor
DB00334 P35462 D(3) dopamine receptor
  • Subquery with FROM clause

Syntax:

SELECT sub.column_name(s) FROM (Subquery);

Example 4:

The results from query 2a can also be obtained with the combination of subquery and an INNER JOIN.

kable(dbGetQuery(drugs,
"SELECT DISTINCT dt.uniProtID, dt.proteinName,sub.drugs_count FROM drugs_target AS dt INNER JOIN(SELECT uniProtID, COUNT(drugBankID) AS drugs_count FROM drugs_target GROUP BY uniProtID) AS sub USING(uniProtID) ORDER BY sub.drugs_count DESC LIMIT 5"))
uniProtID proteinName drugs_count
P14867 Gamma-aminobutyric acid receptor subunit alpha-1 89
P35348 Alpha-1A adrenergic receptor 88
P35367 Histamine H1 receptor 84
P11229 Muscarinic acetylcholine receptor M1 80
P14416 D(2) dopamine receptor 76

Example 5:

Select UniProt id, and count of drugs that bind to various opioid receptors (the type of proteins) from drugs_target table.

kable(dbGetQuery(drugs,"SELECT A.uniProtID,A.proteinName, COUNT(A.drugBankID) AS drug_count FROM (SELECT * FROM drugs_target WHERE proteinName LIKE '%opioid%') AS A GROUP BY uniProtID,proteinName"))
uniProtID proteinName drug_count
P35372 Mu-type opioid receptor 40
P41143 Delta-type opioid receptor 28
P41145 Kappa-type opioid receptor 32
Q99720 Sigma non-opioid intracellular receptor 1 6

1.4.3 Exercise 2

1. Find all drugs (drugbank id, UniProt id and protein name) which bind to only one protein target.

Sample Output

kable(dbGetQuery(drugs,"SELECT drugBankID, uniProtID, proteinName FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(uniProtID) = 1) LIMIT 5"))
drugBankID uniProtID proteinName
DB00001 P00734 Prothrombin
DB00006 P00734 Prothrombin
DB00007 P30968 Gonadotropin-releasing hormone receptor
DB00010 Q02643 Growth hormone-releasing hormone receptor
DB00012 P19235 Erythropoietin receptor

2. Find average molecular weight, hydrogen bond acceptors, hydrogen bond donors and rotatable bonds for drugs that bind Kinase proteins.

Sample Output

kable(dbGetQuery(drugs,"SELECT AVG(molecular_weight), AVG(hbond_acceptors), AVG(hbond_donors), AVG(rotatable_bond) FROM drugs_data WHERE cid IN(SELECT cid FROM drugs_names WHERE drugBankID IN (SELECT DISTINCT drugBankID FROM drugs_target WHERE proteinName LIKE '%kinase%'))"))
AVG(molecular_weight) AVG(hbond_acceptors) AVG(hbond_donors) AVG(rotatable_bond)
327.664 5.95 2.675 4.45

1.5 JOINS

The real power of SQL comes from fetching information by combining multiple tables. In fact, the term “relational” in Relational Database refers to the fact that tables within it are related to one another and contain a common key through which multiple tables can be combined on the fly. The JOIN is a core concept in the Relational Databases and hence SQL. The ability to relate information from different tables is an essential skill for a data scientist.

To list data from two or more tables following is required:
1. FROM clause should list all the tables. Order of listing does not matter however it may make a difference in the speed of execution. 2. JOIN operator should be used to coordinate the records in one table with the records in another table. Most common types of joins are:

  • INNER JOIN
  • LEFT JOIN
  • RIGHT JOIN
  • FULL JOIN
  • CROSS JOIN
  • SELF JOIN

1.5.1 INNER JOIN

In the animation below you can see that the ID field matches for 1 and 4 (shaded rows). INNER JOIN selects all the matching rows from both the tables. Mismatched rows are not shown.

INNER_JOIN

INNER_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name  

Example 1:

Select drugbank id, drug name and drug properties using drugs_name and drugs_data table.

kable(dbGetQuery(drugs, "SELECT drugs_names.drugBankID, drugs_names.name, drugs_data.molecular_formula, drugs_data.molecular_weight, drugs_data.hbond_acceptors, drugs_data.hbond_donors FROM drugs_names INNER JOIN drugs_data ON drugs_names.cid = drugs_data.cid LIMIT 5"))
drugBankID name molecular_formula molecular_weight hbond_acceptors hbond_donors
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 21 10
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 9 8
DB00117 L-Histidine C6H9N3O2 155.157 4 3
DB00118 S-Adenosylmethionine C15H23N6O5S+ 399.446 10 5
DB00119 Pyruvic acid C3H4O3 88.062 3 1

Alias:

Instead of writing a full table name, it is possible to use alias as a shortcut for example, the above query can also be run in the following manner:

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dd.hbond_acceptors, dd.hbond_donors FROM drugs_names AS dn INNER JOIN drugs_data AS dd ON dn.cid = dd.cid LIMIT 5"))
drugBankID name molecular_formula molecular_weight hbond_acceptors hbond_donors
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 21 10
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 9 8
DB00117 L-Histidine C6H9N3O2 155.157 4 3
DB00118 S-Adenosylmethionine C15H23N6O5S+ 399.446 10 5
DB00119 Pyruvic acid C3H4O3 88.062 3 1

JOIN multiple tables

Example 2:

Select drugbank id, drug name, molecular formula and the molecular weight of the drug and the protein name to which drug binds using drugs_names, drugs_data and drugs_target table.

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dt.proteinName FROM drugs_names AS dn INNER JOIN drugs_data AS dd ON dn.cid = dd.cid INNER JOIN drugs_target AS dt ON dn.drugBankID = dt.drugBankID LIMIT 10"))
drugBankID name molecular_formula molecular_weight proteinName
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methionine synthase
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methionine synthase reductase
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methylenetetrahydrofolate reductase
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methylmalonic aciduria and homocystinuria type C protein
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methylmalonic aciduria type A protein mitochondrial
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 Methylmalonyl-CoA mutase mitochondrial
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 Aminomethyltransferase mitochondrial
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 Bifunctional methylenetetrahydrofolate dehydrogenase/cyclohydrolase mitochondrial
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 Bifunctional purine biosynthesis protein PURH
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 C-1-tetrahydrofolate synthase cytoplasmic

USING Keyword

If the key name is the same in the two tables being joined we can also use USING keyword instead of ON keyword. For example:

Example 3:

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dd.hbond_acceptors, dd.hbond_donors FROM drugs_names AS dn INNER JOIN drugs_data AS dd USING(cid) LIMIT 5"))
drugBankID name molecular_formula molecular_weight hbond_acceptors hbond_donors
DB00115 Cyanocobalamin C63H89CoN14O14P 1356.396 21 10
DB00116 Tetrahydrofolic acid C19H23N7O6 445.436 9 8
DB00117 L-Histidine C6H9N3O2 155.157 4 3
DB00118 S-Adenosylmethionine C15H23N6O5S+ 399.446 10 5
DB00119 Pyruvic acid C3H4O3 88.062 3 1

1.5.2 LEFT JOIN

LEFT JOIN keeps all the original rows of the left table in addition to the matching rows from both tables. The values that do not match in the right table are marked as missing. In the animation below there are no values for ID 3 and 6 in the right table and therefore, are missing in the resulting LEFT JOIN table (grey shaded area).

LEFT_JOIN

LEFT_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name  

Example 4:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID, dt.proteinName,dn.name,dn.cid FROM drugs_target AS dt LEFT JOIN drugs_names AS dn ON dt.drugBankID = dn.drugBankID ORDER BY dt.drugBankID LIMIT 5"))
drugBankID proteinName name cid
DB00001 Prothrombin NA NA
DB00002 Epidermal growth factor receptor NA NA
DB00002 Low affinity immunoglobulin gamma Fc region receptor III-B NA NA
DB00002 Complement C1r subcomponent NA NA
DB00002 Complement C1q subcomponent subunit A NA NA

Note: that the DrugBank ID starts from DB00115 in drugs_names table (table on the right in above query) and hence the corresponding values are NULL in the resulting table. The USING keyword can also be used in a left join similar to an inner join.

1.5.3 RIGHT JOIN

RIGHT JOIN keeps all the original rows of the right table in addition to the matching rows from both the tables. The values that do not match in the left table are marked as missing. In the animation below there are no values for ID 2 and 5 in the left table and therefore, are missing in the resulting RIGHT JOIN table (grey shaded area). Note: RIGHT JOIN is equivalent to LEFT JOIN with tables reversed.

RIGHT_JOIN

RIGHT_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name  

In SQLite there is no RIGHT JOIN. However, we can emulate RIGHT JOIN as shown in the syntax below:

SELECT table1.column_name(s), table2.column_name(s) FROM table2 LEFT JOIN table1 ON table1.column_name = table2.column_name  

Example 5:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID, dt.proteinName,dn.name,dn.cid FROM drugs_names AS dn  LEFT JOIN drugs_target AS dt  ON dt.drugBankID = dn.drugBankID ORDER BY dt.drugBankID LIMIT 5"))
drugBankID proteinName name cid
NA NA Xanthophyll 5368396
NA NA Gadodiamide 60754
NA NA Pipobroman 4842
NA NA Calcium Acetate 6116
NA NA Crotamiton 688020

Note: This time the drugbank id and protein names columns have NULL values because there are 114 drugs in the left_table (drugs_names table) which do not have any corresponding information in the right table (drugs_target table). It means there is no information available on what and how many kinds of proteins these drugs bind.

1.5.4 FULL JOIN

The FULL JOIN returns all the records with or without a match in left and right table. The missing values will be filled as NULL on either side. It can potentially return a very large dataset.

FULL_JOIN

FULL_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 FULL JOIN table2 ON table1.column_name = table2.column_name.

In SQLite and MySQL there is no FULL JOIN. However, we emulate FULL JOIN using the syntax below:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name 
UNION ALL 
SELECT table1.column_name(s), table2.column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name.

Example 6:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID as id_from_drugs_target, dt.proteinName, dn.drugBankID AS id_from_drugs_names, dn.name FROM drugs_names AS dn LEFT JOIN drugs_target AS dt ON dt.drugBankID = dn.drugBankID 
  UNION ALL 
  SELECT dt.drugBankID, dt.proteinName, dn.drugBankID,dn.name FROM drugs_target AS dt LEFT JOIN drugs_names AS dn ON dt.drugBankID = dn.drugBankID 
  ORDER BY id_from_drugs_names,id_from_drugs_target LIMIT 5"))
id_from_drugs_target proteinName id_from_drugs_names name
DB00001 Prothrombin NA NA
DB00002 Epidermal growth factor receptor NA NA
DB00002 Low affinity immunoglobulin gamma Fc region receptor III-B NA NA
DB00002 Complement C1r subcomponent NA NA
DB00002 Complement C1q subcomponent subunit A NA NA

1.5.5 CROSS JOIN

The CROSS JOIN or Cartesian join results in all possible combinations of rows from the joined tables. Suppose table1 has M rows and table2 has N rows The CROSS JOIN would produce a result set of M x N rows, therefore, CROSS JOIN should be used with caution.

CROSS_JOIN

CROSS_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 CROSS JOIN table2

OR

SELECT table1.column_name(s), table2.column_name(s) FROM table1, table2

The CROSS JOIN query does not make sense in this dataset however, it was run for demonstration purpose:

Example 7:

Select Drugbank ID, Drug name, Protein name from drugs_target and drugs_names table.

kable(dbGetQuery(drugs,"SELECT dt.drugBankID, dn.name, dt.proteinName FROM drugs_names AS dn CROSS JOIN drugs_target AS dt LIMIT 5"))
drugBankID name proteinName
DB00001 Cyanocobalamin Prothrombin
DB00002 Cyanocobalamin Epidermal growth factor receptor
DB00002 Cyanocobalamin Low affinity immunoglobulin gamma Fc region receptor III-B
DB00002 Cyanocobalamin Complement C1r subcomponent
DB00002 Cyanocobalamin Complement C1q subcomponent subunit A

1.5.6 SELF JOIN

In a SELF JOIN a table is joined to itself.

Syntax:

SELECT t1.column_name(s), t2.column_name(s) FROM table1 AS t1, table2 AS t2 WHERE t1.column = t2.column

Example 8:

Select drugbank ids from drugs_target table which bind to both the proteins with id ‘P28223’ and ‘P46098’.

The following query will not work because it is not possible for the value of a single column in a single row to contain two values at the same time.

kable(dbGetQuery(drugs,"SELECT dt.drugBankID FROM drugs_target AS dt WHERE dt.uniProtID = 'P28223' AND dt.uniProtID = 'P46098'"))
drugBankID

This query will require a SELF JOIN. The correct query is as follows:

kable(dbGetQuery(drugs,
"SELECT DISTINCT dt1.drugBankID FROM drugs_target dt1, drugs_target dt2 WHERE dt1.drugBankID = dt2.drugBankID AND dt1.uniProtID = 'P28223' AND dt2.uniprotID = 'P46098'"))
drugBankID
DB00246
DB00334
DB00363
DB00370
DB00408
DB00543
DB00604
DB00726
DB01049
DB01224
DB01238

1.5.7 Exercise 3

1. Select cid, drugBankID, uniProtID and count the number of drugs that bind to the protein target from drugs_data, drugs_names and drugs_target tables.

Sample Output

kable(dbGetQuery(drugs,"SELECT cid,B.drugBankID,B.uniProtID, B.count FROM (SELECT drugBankID, A.uniProtID, A.count FROM drugs_target INNER JOIN (SELECT uniProtID, COUNT(uniProtID) AS count FROM drugs_target GROUP BY uniProtID) AS A USING(uniProtID) ORDER BY uniProtID) AS B LEFT JOIN drugs_names USING(drugBankID) ORDER BY B.uniProtID LIMIT 5"))
cid drugBankID uniProtID count
NA DB09060 A0A023W3H0 1
166548 DB00362 A2QLK4 3
2826718 DB00520 A2QLK4 3
477468 DB01141 A2QLK4 3
702 DB00898 A5X5Y0 2

2. Find those proteins and drugs which have only 1 binding partner and drug type in not BiotechDrug. For example, a drug should bind to a single protein target and a protein target should bind to a single drug.

Sample Output

kable(dbGetQuery(drugs,"SELECT drugBankID, name, sub.uniProtID, sub.proteinName FROM (SELECT drugBankID, uniProtID, proteinName, type FROM drugs_target WHERE uniProtID IN (SELECT uniProtID FROM drugs_target WHERE uniProtID IN (SELECT uniProtID FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(uniprotID) = 1)) GROUP BY uniProtID HAVING COUNT(drugBankID) = 1) AND type !='BiotechDrug') AS sub LEFT JOIN drugs_names USING(drugBankID) LIMIT 5"))
drugBankID name uniProtID proteinName
DB00183 Pentagastrin P32239 Gastrin/cholecystokinin type B receptor
DB00220 Nelfinavir O90777 HIV-1 protease
DB00348 Nitisinone P32754 4-hydroxyphenylpyruvate dioxygenase
DB00403 NA P32238 Cholecystokinin receptor type A
DB00410 Mupirocin P41972 Isoleucine–tRNA ligase

1.6 Set theory clauses

1.6.1 UNION

The UNION keyword allows us to combine the related information from two tables. The UNION will count the values only once i.e. will eliminate the duplicate rows. Note: Column numbers and data types must be same for the columns to combine.

Syntax:

SELECT table1.column_name(s) FROM table1 
UNION 
SELECT table2.column_name(s) FROM table2

Example 1:

Let us count the total Drugbank IDs from drugs_target and drugs_names tables individually and then using a UNION.

Count DISTINCT drugBankIDs in drugs_names and drugs_target table individually.

kable(dbGetQuery(drugs,"SELECT COUNT(DISTINCT(drugs_target.drugBankID)) AS drugs_target_ids, COUNT(DISTINCT(drugs_names.drugBankID)) AS drugs_names_ids FROM drugs_target,drugs_names"))
drugs_target_ids drugs_names_ids
1623 1189

Example 2:

Count DISTINCT drugBankIDs in combined tables using UNION.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS union_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn UNION  SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))
union_drugbank_ids
1732

1.6.2 UNION ALL

Like the UNION the UNION ALL keyword allows us to combine the related information from two tables however, the UNION ALL will return all the rows from the SELECT statement i.e. will not eliminate the duplicate rows and combine all rows from each table into a single table. Note: Column numbers and data types must be same for the columns to combine.

Syntax:

SELECT table1.column_name(s) FROM table1 
UNION ALL
SELECT table2.column_name(s) FROM table2

Example 3:

Let us count the total Drugbank IDs from drugs_target and drugs_names tables using UNION ALL keyword.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS union_all_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn UNION ALL SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))
union_all_drugbank_ids
8258

1.6.3 INTERSECT

The INTERSECT keyword produces rows that are common between the two tables.

Syntax:

SELECT table1.column_name(s) FROM table1 
INTERSECT
SELECT table2.column_name(s) FROM table2

Example 4:

Let us count Drugbank IDs common in drugs_target and drugs_names tables using INTERSECT keyword.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS intersect_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn INTERSECT SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))
intersect_drugbank_ids
1080

In MySQL INTERSECT keyword is not available, however, the effect can be simulated through IN operator.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS in_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn WHERE dn.drugBankID IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub"))
in_drugbank_ids
1080

1.6.4 EXCEPT

The EXCEPT keyword is used to return all the rows in the first table which are not in the second table. In Oracle MINUS is used instead of EXCEPT.

Syntax:

SELECT table1.column_name(s) FROM table1 
EXCEPT
SELECT table2.column_name(s) FROM table2

Example 5:

Let us count Drugbank IDs in drugs_names table which are not present in drugs_target table using EXCEPT keyword.

kable(dbGetQuery(drugs,"SELECT COUNT(drugBankID) AS except_id FROM (SELECT dn.drugBankID FROM drugs_names AS dn EXCEPT SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))
except_id
109

MySQL does not support EXCEPT or MINUS keyword. However, the same result can be achieved using the NOT IN operator.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS not_in_id FROM (SELECT dn.drugBankID FROM drugs_names AS dn WHERE dn.drugBankID NOT IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub"))
not_in_id
109

1.6.5 Exercise 4

1. Show all the drugbank ids in drugs_names and drugs_target table.

Sample Output

kable(dbGetQuery(drugs,'SELECT dn.drugBankID FROM drugs_names AS dn UNION ALL SELECT dt.drugBankID FROM drugs_target AS dt LIMIT 5'))
drugBankID
DB00115
DB00116
DB00117
DB00118
DB00119

2. Select drugbank id, cid, name and molecular weight of drugs that are present in drugs_names table, however, missing from the drugs_target table. Do an INNER JOIN of the resulting table with the drugs_data table.

Sample Output

kable(dbGetQuery(drugs,"SELECT sub.drugBankID, sub.cid, sub.name,molecular_weight FROM drugs_data INNER JOIN (SELECT dn.drugBankID,dn.cid,dn.name FROM drugs_names AS dn WHERE dn.drugBankID NOT IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub USING(cid) LIMIT 5"))
drugBankID cid name molecular_weight
DB00137 5368396 Xanthophyll 568.886
DB00225 60754 Gadodiamide 591.676
DB00236 4842 Pipobroman 356.058
DB00258 6116 Calcium Acetate 158.166
DB00265 688020 Crotamiton 203.285

3. Select drugbank id, cid, name and molecular weight of drugs that are present in both drugs_names and drugs_target table.

Sample Output

kable(dbGetQuery(drugs,
"SELECT sub.drugBankID, sub.cid, sub.name,molecular_weight FROM (SELECT dn.drugBankID,dn.cid,dn.name FROM drugs_names AS dn WHERE dn.drugBankID IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub LEFT JOIN drugs_data USING(cid) LIMIT 5"))
drugBankID cid name molecular_weight
DB00115 44176380 Cyanocobalamin 1356.396
DB00116 91443 Tetrahydrofolic acid 445.436
DB00117 6274 L-Histidine 155.157
DB00118 34756 S-Adenosylmethionine 399.446
DB00119 1060 Pyruvic acid 88.062

4. Devise a query that will generate a sequence of numbers from 0 .. 99

Hint: Use UNION ALL and CROSS JOIN

Sample Output

kable(dbGetQuery(drugs,"SELECT ones.num + tens.num AS seq FROM (SELECT 0 num UNION ALL SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num UNION ALL SELECT 4 num UNION ALL SELECT 5 num UNION ALL SELECT 6 num UNION ALL SELECT 7 num UNION ALL SELECT 8 num UNION ALL SELECT 9 num) ones CROSS JOIN (SELECT 0 num UNION ALL SELECT 10 num UNION ALL SELECT 20 num UNION ALL SELECT 30 num UNION ALL SELECT 40 num UNION ALL SELECT 50 num UNION ALL SELECT 60 num UNION ALL SELECT 70 num UNION ALL SELECT 80 num UNION ALL SELECT 90 num) tens ORDER BY seq LIMIT 5"))
seq
0
1
2
3
4

1.7 The SQL CASE Statements

The CASE statement in SQL is a way to handle if/else logic. The CASE statement is followed by at least one pair of WHEN and THEN statements and ends by END keyword. THE ELSE statement is optional and captures values not mentioned in WHEN and THEN statements.

1.7.1 Simple CASE Statements

Syntax 1:

CASE value 
WHEN compare_value1 THEN result1
WHEN compare_value2 THEN result2
...
ELSE result END

Syntax 2:

CASE 
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE result END

Example 1:

Suppose you want to find the drugs which are present in drugs_target table but not in drugs_names table.

kable(dbGetQuery(drugs,"SELECT dn.drugBankID, dt.name, CASE WHEN dn.drugBankID IS NULL THEN 'not' ELSE 'in' END AS in_drugs_names FROM drugs_target AS dt LEFT JOIN drugs_names AS dn USING(drugBankID) ORDER BY dt.name LIMIT 10"))
drugBankID name in_drugs_names
NA Insulin isophane not
NA Interferon Alfa-2a Recombinant not
NA Interferon Alfa-2a Recombinant not
NA Interferon Alfa-2b Recombinant not
NA Interferon Alfa-2b Recombinant not
NA ART-123 not
NA ART-123 not
NA AST-120 not
DB01048 Abacavir in
NA Abarelix not

1.7.2 Adding multiple conditions

Example 2:

Select cid, drug name, molecular formula and molecular weight of all drugs and group them into three categories as follows:

  1. small - drugs with the molecular weight of less than 200
  2. medium - drugs with the molecular weight of less than 400
  3. big - drugs with the molecular weight of above 400
kable(dbGetQuery(drugs, 'SELECT cid,name, molecular_formula, molecular_weight, 
  CASE WHEN molecular_weight < 200 
  THEN "small"
  WHEN molecular_weight < 400
  THEN "medium"  
  ELSE "big" 
  END AS size 
FROM drugs_data INNER JOIN drugs_names USING (cid) LIMIT 5'))
cid name molecular_formula molecular_weight size
44176380 Cyanocobalamin C63H89CoN14O14P 1356.396 big
91443 Tetrahydrofolic acid C19H23N7O6 445.436 big
6274 L-Histidine C6H9N3O2 155.157 small
34756 S-Adenosylmethionine C15H23N6O5S+ 399.446 medium
1060 Pyruvic acid C3H4O3 88.062 small

Explanation:

  1. The CASE statement checks the molecular weight of a compound in each row.
  2. For any compound, if the molecular weight is less than 200, the word “small” get inserted to the column we created as size. Similarly, “medium” and “big” are inserted accordingly.
  3. At the same time, JOIN takes place between drugs_data and drugs_names table.

1.7.3 Using aggregate functions with CASE Statements

Example 3:

Now count the drugs in each group from the query above.

kable(dbGetQuery(drugs,"SELECT CASE WHEN molecular_weight < 200    THEN 'small' WHEN molecular_weight < 400   THEN 'medium' ELSE 'big'  END AS size,COUNT(1) AS num_drugs FROM drugs_data INNER JOIN drugs_names USING (cid) GROUP BY size"))
size num_drugs
big 318
medium 702
small 162

1.7.4 Exercise 5

1. Group protein targets into three categories based on the number of drugs they bind. Three groups of protein targets are as follows:

  1. less_than_3_drugs - bind less than 3 drugs
  2. four_to_10_drugs - bind 4 or more but less than 10 drugs
  3. more_than_11_drugs - bind more than 11 drugs

Hint: In order to perform this query you have to use INNER JOIN, Subquery and Case statements.

Sample Output

kable(dbGetQuery(drugs,"SELECT SUM(CASE WHEN A.count<3 THEN 1 ELSE 0 END) AS less_than_3_drugs, SUM(CASE WHEN A.count>=4 AND A.count <= 10 THEN 1 ELSE 0 END) AS four_to_10_drugs, SUM( CASE WHEN A.count >= 11 THEN 1 ELSE 0 END) AS more_than_11_drugs FROM 
                 (SELECT DISTINCT dt.uniProtID, dt.proteinName,sub.count FROM drugs_target AS dt INNER JOIN (SELECT uniProtID, COUNT(drugBankID) AS count FROM drugs_target GROUP BY uniProtID) AS sub USING(uniProtID) ORDER BY sub.count) AS A"))
less_than_3_drugs four_to_10_drugs more_than_11_drugs
1352 197 131

Note: There are a large number of protein targets that bind up to 3 different drugs compared to protein targets which bind 4 or more different drugs. Based on this data we may conclude that protein targets are quite specific for the kind of drugs they bind.

2. Now group drugs into three categories based on the number of protein targets they bind. Three groups of drugs are as follows:

  1. less_than_3_protein_targets - bind less than 3 protein targets
  2. four_to_10_protein_targets - bind 4 or more but less than 10 targets
  3. more_than_11_protein_targets - bind more than 11 protein targets

Sample Output

kable(dbGetQuery(drugs,"SELECT SUM(CASE WHEN A.count<3 THEN 1 ELSE 0 END) AS less_than_3_protein_targets, SUM(CASE WHEN A.count>=4 AND A.count <= 10 THEN 1 ELSE 0 END) AS four_to_10_protein_targets, SUM( CASE WHEN A.count >= 11 THEN 1 ELSE 0 END) AS more_than_11_protein_targets FROM 
(SELECT DISTINCT dn.drugBankID, dn.name,sub.count FROM drugs_names AS dn INNER JOIN (SELECT drugBankID, COUNT(uniProtID) AS count FROM drugs_target GROUP BY drugBankID) AS sub USING(drugBankID) ORDER BY sub.count) AS A"))
less_than_3_protein_targets four_to_10_protein_targets more_than_11_protein_targets
609 241 114

Note: There are 609 drugs that bind 3 or fewer protein targets. We again observe high specificity for drugs and protein targets.

3. Find those biotech drugs from drugs_target table that end with ‘ab’ and use the case statement to form two groups as follows:

  1. Antibody - drugs ending with ‘ab’ and of type ‘BiotechDrug’
  2. Not antibody - all other drugs of type ‘BiotechDrug’

Sample Output

kable(dbGetQuery(drugs,"SELECT DISTINCT drugBankID, CASE WHEN name LIKE '%ab' THEN 'Antibody' ELSE 'Not Antibody' END class FROM drugs_target WHERE type = 'BiotechDrug' LIMIT 10"))
drugBankID class
DB00001 Not Antibody
DB00002 Antibody
DB00004 Not Antibody
DB00005 Not Antibody
DB00007 Not Antibody
DB00008 Not Antibody
DB00009 Not Antibody
DB00010 Not Antibody
DB00011 Not Antibody
DB00012 Not Antibody

1.8 Summary

Congratulations, on completing the tutorial you are an SQL ninja now!! In this tutorial you specifically learned:

  1. Various types of Subqueries
  2. Advanced usage of JOINS
  3. Set operators and their use in SQL statements
  4. CASE Statements for If/Else logic

In case you have any questions, comments please mail me.