1 Welcome

Hi, I am Varun Khanna and welcome to my free tutorial on Intermediate . Mastering SQL or sequel is an essential skill in a data scientist’s toolbox. In this tutorial, you will learn to use aggregate functions like COUNT, MAX, AVG, SUM. You will find out how to perform complex operations using subqueries, how to join data from multiple tables in a single statement with JOINS? Finally, how to use set operators LIKE UNION, INTERSECT and EXCEPT and CASE statements in your SQL queries. The code and the data used in this tutorial are available for download and in fact, I encourage you to try to run the queries yourself with the help of the syntax provided and then check the answer.

So, let’s get started!!

1.1 Prerequisites

To successfully complete this tutorial you should have a basic understanding of SQL and R.

1.2 Objectives

By finishing this tutorial, students will be able to:

Group and summarize the data together.
Combine the grouping mechanisms with aggregation functions.
Write queries that retrieve information by combining data from different tables.
Combine the results of multiple queries in various ways.

1.3 Introduction and Refresher

SQL(Structured Query Language) is a database query language - a language specifically designed to interact with relational databases. It is possible to extract, update, filter, replace or insert data using SQL. Some common relational database management software which uses SQL to access the data are MySQL, SQLite, Microsoft SQL Server, PostgreSQL, Oracle, Sybase.

1.3.1 Load the library and database

We are going to use the RSQLite package in the following tutorial and the Amazon stock price dataset.

library(DBI)
AMZN <- read.table("data/AMZN.csv", stringsAsFactors = FALSE, header = TRUE, sep = ",")
# Create a database connection
stock <- dbConnect(RSQLite::SQLite(), "stock.sqlite")
# Write the data in the table
dbWriteTable(stock, "amazon",AMZN, overwrite = TRUE)

1.3.1.1 SELECT Statement

SELECT statement is the core of the SQL language and is used to query the database to retrieve the selected rows that match the user condition. The SELECT statement has five main clauses, although FROM is the only required clause.

Syntax:

SELECT [ALL | DISTINCT] column(s) FROM table [WHERE "condition(s)"] [GROUP BY column(s)] [HAVING "condition(s)"] [ORDER BY column(s) [ASC| DESC]] [LIMIT number]

Clauses table

Name	Purpose
SELECT	Columns to include in the query result set
FROM	Identifies tables from which data has to be drawn
WHERE	Filters unwanted data
GROUP BY	Groups rows based on a criteria
HAVING	Filters unwanted groups
ORDER BY	Sorts the rows of the result set in increasing or decreasing order

Example 1: Look at all the columns for the first 5 rows of the dataset.

kable(dbGetQuery(stock, 'SELECT * FROM amazon LIMIT 5'))

Date	year	month	day	Open	High	Low	Close	Adj.Close	Volume
1997-05-15	1997	5	15	2.437500	2.500000	1.927083	1.958333	1.958333	72156000
1997-05-16	1997	5	16	1.968750	1.979167	1.708333	1.729167	1.729167	14700000
1997-05-19	1997	5	19	1.760417	1.770833	1.625000	1.708333	1.708333	6106800
1997-05-20	1997	5	20	1.729167	1.750000	1.635417	1.635417	1.635417	5467200
1997-05-21	1997	5	21	1.635417	1.645833	1.375000	1.427083	1.427083	18853200

ALL and DISTINCT are keywords used to select ALL (by default) or unique rows in the query by discarding the duplicate entries.

Example 2: What are the first five DISTINCT years in the dataset?

dbGetQuery(stock, 'SELECT DISTINCT year FROM amazon LIMIT 5')

##   year
## 1 1997
## 2 1998
## 3 1999
## 4 2000
## 5 2001

1.3.1.2 Aggregate functions

SQL is excellent for aggregating data. Following are the aggregate functions students should be comfortable with before moving forward.

COUNT() - It counts the total number of non-null rows in a column. It is possible to add aliases using AS. For example:

Example 3: Count the total number of rows in the dataset.

dbGetQuery(stock, 'SELECT COUNT(*) AS "total_rows" FROM amazon')

##   total_rows
## 1       5350

SUM() - It adds all the values of a given column. It treats nulls as 0.

Example 4: Find the total volume traded for Amazon stock in the dataset.

dbGetQuery(stock, 'SELECT SUM(Volume) AS "total_vol" FROM amazon')

##     total_vol
## 1 41570051200

MIN()/MAX() - It returns the minimum and the maximum numerical or non-numerical values in a column.

Example 5: Return the first and last date Amazon stock was traded according to the dataset.

dbGetQuery(stock, 'SELECT MIN(Date) AS min_date, MAX(Date) AS max_date FROM amazon')

##     min_date   max_date
## 1 1997-05-15 2018-08-16

AVG() - It returns the average of the values from a selected group. It can only be used for numerical columns and ignores null values.

Example 6: Return the average open price for the stock.

dbGetQuery(stock, 'SELECT AVG(Open) AS avg_open FROM amazon')

##   avg_open
## 1 228.7646

1.3.1.3 GROUP BY and ORDER BY clause

Information retrieved from a database using SQL can be placed into separate categories using GROUP BY clause, which can then be aggregated independent of each other. The query proceeds in two stages. First, the rows are grouped according to grouping criteria. Then, the aggregate operations are performed on the rows of each category. The GROUP BY clause must precede the data items to be used for grouping.

ORDER BY clause is used to sort the results of the query either in ascending (ASC) or descending order (DESC).

1.3.1.3.1 `GROUP BY` single column

Example 7: Count number of days traded each year.

kable(dbGetQuery(stock, 'SELECT year, COUNT(*) AS days_traded FROM amazon GROUP BY year LIMIT 5'))

year	days_traded
1997	160
1998	252
1999	252
2000	252
2001	248

1.3.1.3.2 `GROUP BY` multiple columns

Example 8: Count number of days traded in months of each year.

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY year, month LIMIT 5'))

year	month	days_traded
1997	5	11
1997	6	21
1997	7	22
1997	8	21
1997	9	21

1.3.1.3.3 `GROUP BY` column numbers

We can substitute column names with column numbers in the GROUP BY clause.

Example 9: GROUP BY Column 1 and 2

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY 1, 2 LIMIT 5'))

year	month	days_traded
1997	5	11
1997	6	21
1997	7	22
1997	8	21
1997	9	21

1.3.1.3.4 `ORDER BY` column names or numbers

Example 10: Sort the results of the previous query in descending order of the year.

kable(dbGetQuery(stock, 'SELECT year, month, COUNT(*) AS days_traded FROM amazon GROUP BY 1, 2 ORDER BY year DESC LIMIT 5'))

year	month	days_traded
2018	1	21
2018	2	19
2018	3	21
2018	4	21
2018	5	22

1.3.1.4 HAVING Clause

HAVING - Similar to WHERE but it allows us to filter on aggregate columns. The HAVING clause always follows GROUP BY clause. Note: The properties that are tested in the GROUP BY clause must be the properties of the groups, not individuals.

Example 11:

Find the average open price of Amazon for each month of each year in the dataset.

kable(dbGetQuery(stock,
'SELECT year,month,avg(open) AS avg_open_price FROM amazon GROUP BY year,month LIMIT 5'))

year	month	avg_open_price
1997	5	1.686553
1997	6	1.547867
1997	7	2.168561
1997	8	2.222966
1997	9	3.424107

Example 12:

Find every month of each year where Amazon stock price was above $1500/share.

kable(dbGetQuery(stock, 'SELECT year, month, MAX(high) AS month_high FROM amazon GROUP BY year, month HAVING month_high > 1500 ORDER BY year, month'))

year	month	month_high
2018	2	1528.70
2018	3	1617.54
2018	4	1638.10
2018	5	1635.00
2018	6	1763.10
2018	7	1880.05
2018	8	1925.00

1.3.2 Exercise 1

1. Show all the columns from table amazon where year = 2014 or date > 2016-01-01 and sort the results in such a manner that the same days are displayed together despite different years or months. For example, data from all days with day 31 should appear together.

Sample Output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE (year = 2014 OR date > "2016-01-01") ORDER BY day LIMIT 5'))

Date	year	month	day	Open	High	Low	Close	Adj.Close	Volume
2014-04-01	2014	4	1	338.09	344.43	338.00	342.99	342.99	3600100
2014-05-01	2014	5	1	304.13	310.48	304.00	307.89	307.89	4328600
2014-07-01	2014	7	1	325.86	333.20	325.10	332.39	332.39	3173600
2014-08-01	2014	8	1	313.69	315.83	304.59	307.06	307.06	7441500
2014-10-01	2014	10	1	322.04	322.17	315.55	317.46	317.46	3089100

2. Select all rows between the dates 2007-01-01 and 2009-01-01.

Sample output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE date BETWEEN "2007-01-01" AND "2009-01-01" LIMIT 5'))

Date	year	month	day	Open	High	Low	Close	Adj.Close	Volume
2007-01-03	2007	1	3	38.68	39.06	38.05	38.70	38.70	12405100
2007-01-04	2007	1	4	38.59	39.14	38.26	38.90	38.90	6318400
2007-01-05	2007	1	5	38.72	38.79	37.60	38.37	38.37	6619700
2007-01-08	2007	1	8	38.22	38.31	37.17	37.50	37.50	6783000
2007-01-09	2007	1	9	37.60	38.06	37.34	37.78	37.78	5703000

3. Select all rows where the year is either 2007,2008 or 2018.

Sample output

kable(dbGetQuery(stock, 'SELECT * FROM amazon WHERE year IN (2007,2008,2018) LIMIT 5'))

Date	year	month	day	Open	High	Low	Close	Adj.Close	Volume
2007-01-03	2007	1	3	38.68	39.06	38.05	38.70	38.70	12405100
2007-01-04	2007	1	4	38.59	39.14	38.26	38.90	38.90	6318400
2007-01-05	2007	1	5	38.72	38.79	37.60	38.37	38.37	6619700
2007-01-08	2007	1	8	38.22	38.31	37.17	37.50	37.50	6783000
2007-01-09	2007	1	9	37.60	38.06	37.34	37.78	37.78	5703000

4. SELECT all days when Amazon stock was traded after 10th (including) for the month of January for all years after 2000 (including).

Sample output

kable(dbGetQuery(stock,'SELECT * FROM amazon WHERE date LIKE "2%-01-1_" LIMIT 5'))

Date	year	month	day	Open	High	Low	Close	Adj.Close	Volume
2000-01-10	2000	1	10	72.5625	72.6250	65.5625	69.1875	69.1875	14757900
2000-01-11	2000	1	11	66.8750	70.0000	65.0000	66.7500	66.7500	10532700
2000-01-12	2000	1	12	67.8750	68.0000	63.0000	63.5625	63.5625	10804500
2000-01-13	2000	1	13	64.9375	67.1875	63.1250	65.9375	65.9375	10448100
2000-01-14	2000	1	14	66.7500	68.6250	64.0000	64.2500	64.2500	6853600

Do it after you have done JOINS.

5. Select all the days when the open price was above the average open price for that month. For example, if the average open price for Jan 1999 was 2.5 USD then select all the days for Jan 1999 where the open price was above 2.5 USD.

Sample output

kable(dbGetQuery(stock,'SELECT a.year,a.month,a.day,a.open,b.avg_open FROM amazon AS a INNER JOIN (SELECT year,month,avg(open) AS avg_open FROM amazon GROUP BY year,month) AS b ON a.year = b.year AND a.month = b.month AND a.open > b.avg_open LIMIT 5'))

year	month	day	Open	avg_open
1997	5	15	2.437500	1.686553
1997	5	16	1.968750	1.686553
1997	5	19	1.760417	1.686553
1997	5	20	1.729167	1.686553
1997	6	9	1.656250	1.547867

6. Count the number of days now.

Sample output

kable(dbGetQuery(stock,'SELECT a.year,a.month,COUNT(day) AS num_days, b.avg_open FROM amazon AS a INNER JOIN (SELECT year,month,avg(open) AS avg_open FROM amazon GROUP BY year,month) AS b ON a.year = b.year AND a.month = b.month AND a.open > b.avg_open GROUP BY a.year, a.month LIMIT 5'))

year	month	num_days	avg_open
1997	5	4	1.686553
1997	6	7	1.547867
1997	7	15	2.168561
1997	8	11	2.222966
1997	9	10	3.424107

Disconnect from the stock database

dbDisconnect(stock)

1.4 Intermediate SQL

For the next few lessons you are going to use three datasets viz. drugs names, drugs data and drugs target. These datasets were collected from DRUGBANK website.

1.4.1 Load the data and create the database connection

drugs_data <- read.table("data/drugs/drugs_data", sep = "\t",header = T, stringsAsFactors = F)

drugs_target <- read.table("data/drugs/drugs_target", sep = ",",header = T, stringsAsFactors = F, quote = "\"")

drugs_names <- read.table("data/drugs/drugs_names", sep = ",",header = T, stringsAsFactors = F, quote = "\"")


# Create a database connection
drugs <- dbConnect(RSQLite::SQLite(), "drugs.sqlite")
# Write the data in the table
dbWriteTable(drugs, "drugs_data",drugs_data, overwrite = TRUE)
dbWriteTable(drugs, "drugs_target",drugs_target, overwrite = TRUE)
dbWriteTable(drugs, "drugs_names",drugs_names, overwrite = TRUE)

Example:

Display the records in each table

kable(dbGetQuery(drugs, 'SELECT * FROM drugs_names LIMIT 5'))

drugBankID	name	cid
DB00115	Cyanocobalamin	44176380
DB00116	Tetrahydrofolic acid	91443
DB00117	L-Histidine	6274
DB00118	S-Adenosylmethionine	34756
DB00119	Pyruvic acid	1060

kable(dbGetQuery(drugs, 'SELECT * FROM drugs_data LIMIT 5'))

cid	complexity	hbond_acceptors	hbond_donors	rotatable_bond	molecular_formula	molecular_weight	tpsa	total_charge	heavy_atom_count	tautomer_count
44176380	3220	21	10	16	C63H89CoN14O14P	1356.396	473.0	0	93	1000
91443	834	9	8	9	C19H23N7O6	445.436	207.0	0	32	85
6274	151	4	3	3	C6H9N3O2	155.157	92.0	0	11	2
34756	533	10	5	7	C15H23N6O5S+	399.446	184.0	1	27	3
1060	84	3	1	1	C3H4O3	88.062	54.4	0	6	2

kable(dbGetQuery(drugs, 'SELECT * FROM drugs_target LIMIT 5'))

drugBankID	name	type	uniProtID	proteinName
DB00001	Lepirudin	BiotechDrug	P00734	Prothrombin
DB00002	Cetuximab	BiotechDrug	P00533	Epidermal growth factor receptor
DB00002	Cetuximab	BiotechDrug	O75015	Low affinity immunoglobulin gamma Fc region receptor III-B
DB00002	Cetuximab	BiotechDrug	P00736	Complement C1r subcomponent
DB00002	Cetuximab	BiotechDrug	P02745	Complement C1q subcomponent subunit A

All the displayed tables are self-explanatory. However, a brief description is given below:

The drugs_names table is the first table in the drugs database. It contains drugbank id, the common name of the drug and PubChem compound id (cid).
The drugs_data table is the second table in the database. It contains the PubChem compound id (cid) and various molecular properties of drugs like molecular weight, number of rotatable bonds, the total charge on the drug etc.
The durgs_target table is the third table. It contains drugbank id, name, type of drug and the name of the protein or receptor the drug binds to.

1.4.2 Subqueries

A subquery (also called as an inner or nested query) is a SQL query that is inside a larger query and allows the user to perform operations in multiple steps. The inner query is executed first and the results are passed on to the outer query. Subqueries can occur at several places inside a query. For example:

In SELECT clause
In FROM clause
In WHERE clause

Subquery with WHERE clause

Syntax:

SELECT table1.column_name(s) FROM table1 WHERE search_condition ( Subquery);

Example 1:

Select drugbank id, names of drugs with molecular weight less than equal to 500 from drugs_names and drugs_data tables.

kable(dbGetQuery(drugs,"SELECT dn.drugBankID, dn.name FROM drugs_names AS dn WHERE dn.cid IN ( SELECT dd.cid FROM drugs_data AS dd WHERE dd.molecular_weight <= 500) LIMIT 5"))

drugBankID	name
DB00116	Tetrahydrofolic acid
DB00117	L-Histidine
DB00118	S-Adenosylmethionine
DB00119	Pyruvic acid
DB00120	L-Phenylalanine

Example 2a:

Select the count of drugs that bind to a protein target and sort the results by drugs count (no subquery required).

kable(dbGetQuery(drugs,
"SELECT  uniprotID, proteinName, COUNT(*) as drugs_count FROM drugs_target GROUP BY uniprotID, proteinName ORDER BY drugs_count DESC LIMIT 5"))

uniProtID	proteinName	drugs_count
P14867	Gamma-aminobutyric acid receptor subunit alpha-1	89
P35348	Alpha-1A adrenergic receptor	88
P35367	Histamine H1 receptor	84
P11229	Muscarinic acetylcholine receptor M1	80
P14416	D(2) dopamine receptor	76

Example 2b:

Select drugbank id, protein id and protein name from drugs_target table where a protein target binds to more than 40 drugs. (subquery required).

kable(dbGetQuery(drugs,"SELECT drugBankID,uniProtID,proteinName FROM drugs_target WHERE uniprotID IN(SELECT uniProtID FROM drugs_target GROUP BY uniprotID HAVING COUNT(uniprotID) >40) ORDER BY drugBankID LIMIT 5"))

drugBankID	uniProtID	proteinName
DB00154	P35354	Prostaglandin G/H synthase 2
DB00154	P23219	Prostaglandin G/H synthase 1
DB00159	P35354	Prostaglandin G/H synthase 2
DB00159	P23219	Prostaglandin G/H synthase 1
DB00176	P31645	Sodium-dependent serotonin transporter

Note:There are many instances where a protein target binds to more than one drug (like Gamma-aminobutyric acid receptor subunit alpha-1 binds 89 drugs) however, there are also drugs which bind to many protein targets. For example, DB00154 shown in the table above binds to two different protein targets.

Example 3a:

Let us now count the number of protein targets each drug binds to and sort the results by protein count (no subquery required).

kable(dbGetQuery(drugs,"SELECT  drugBankID, COUNT(*) AS protein_count FROM drugs_target GROUP BY drugBankID ORDER BY protein_count DESC LIMIT 5"))

drugBankID	protein_count
DB03147	85
DB00898	51
DB00334	50
DB00543	48
DB01049	41

There are five drugs DB03147, DB00898, DB00334, DB00543 and DB01049 which binds to more than 40 protein targets. Pretty amazing..!! These drugs not only bind there proposed protein target but also others. This may explain why certain drugs have side effects.

Example 3b:

Now let us find all the protein targets these drugs bind to.

kable(dbGetQuery(drugs,"SELECT drugBankID,uniProtID,proteinName FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(drugBankID) > 40) LIMIT 5"))

drugBankID	uniProtID	proteinName
DB00334	P28223	5-hydroxytryptamine receptor 2A
DB00334	P14416	D(2) dopamine receptor
DB00334	P21728	D(1A) dopamine receptor
DB00334	P21918	D(1B) dopamine receptor
DB00334	P35462	D(3) dopamine receptor

Subquery with FROM clause

Syntax:

SELECT sub.column_name(s) FROM (Subquery);

Example 4:

The results from query 2a can also be obtained with the combination of subquery and an INNER JOIN.

kable(dbGetQuery(drugs,
"SELECT DISTINCT dt.uniProtID, dt.proteinName,sub.drugs_count FROM drugs_target AS dt INNER JOIN(SELECT uniProtID, COUNT(drugBankID) AS drugs_count FROM drugs_target GROUP BY uniProtID) AS sub USING(uniProtID) ORDER BY sub.drugs_count DESC LIMIT 5"))

uniProtID	proteinName	drugs_count
P14867	Gamma-aminobutyric acid receptor subunit alpha-1	89
P35348	Alpha-1A adrenergic receptor	88
P35367	Histamine H1 receptor	84
P11229	Muscarinic acetylcholine receptor M1	80
P14416	D(2) dopamine receptor	76

Example 5:

Select UniProt id, and count of drugs that bind to various opioid receptors (the type of proteins) from drugs_target table.

kable(dbGetQuery(drugs,"SELECT A.uniProtID,A.proteinName, COUNT(A.drugBankID) AS drug_count FROM (SELECT * FROM drugs_target WHERE proteinName LIKE '%opioid%') AS A GROUP BY uniProtID,proteinName"))

uniProtID	proteinName	drug_count
P35372	Mu-type opioid receptor	40
P41143	Delta-type opioid receptor	28
P41145	Kappa-type opioid receptor	32
Q99720	Sigma non-opioid intracellular receptor 1	6

1.4.3 Exercise 2

1. Find all drugs (drugbank id, UniProt id and protein name) which bind to only one protein target.

Sample Output

kable(dbGetQuery(drugs,"SELECT drugBankID, uniProtID, proteinName FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(uniProtID) = 1) LIMIT 5"))

drugBankID	uniProtID	proteinName
DB00001	P00734	Prothrombin
DB00006	P00734	Prothrombin
DB00007	P30968	Gonadotropin-releasing hormone receptor
DB00010	Q02643	Growth hormone-releasing hormone receptor
DB00012	P19235	Erythropoietin receptor

2. Find average molecular weight, hydrogen bond acceptors, hydrogen bond donors and rotatable bonds for drugs that bind Kinase proteins.

Sample Output

kable(dbGetQuery(drugs,"SELECT AVG(molecular_weight), AVG(hbond_acceptors), AVG(hbond_donors), AVG(rotatable_bond) FROM drugs_data WHERE cid IN(SELECT cid FROM drugs_names WHERE drugBankID IN (SELECT DISTINCT drugBankID FROM drugs_target WHERE proteinName LIKE '%kinase%'))"))

AVG(molecular_weight)	AVG(hbond_acceptors)	AVG(hbond_donors)	AVG(rotatable_bond)
327.664	5.95	2.675	4.45

1.5 JOINS

The real power of SQL comes from fetching information by combining multiple tables. In fact, the term “relational” in Relational Database refers to the fact that tables within it are related to one another and contain a common key through which multiple tables can be combined on the fly. The JOIN is a core concept in the Relational Databases and hence SQL. The ability to relate information from different tables is an essential skill for a data scientist.

To list data from two or more tables following is required:
1. FROM clause should list all the tables. Order of listing does not matter however it may make a difference in the speed of execution. 2. JOIN operator should be used to coordinate the records in one table with the records in another table. Most common types of joins are:

INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN
CROSS JOIN
SELF JOIN

1.5.1 INNER JOIN

In the animation below you can see that the ID field matches for 1 and 4 (shaded rows). INNER JOIN selects all the matching rows from both the tables. Mismatched rows are not shown.

INNER_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name

Example 1:

Select drugbank id, drug name and drug properties using drugs_name and drugs_data table.

kable(dbGetQuery(drugs, "SELECT drugs_names.drugBankID, drugs_names.name, drugs_data.molecular_formula, drugs_data.molecular_weight, drugs_data.hbond_acceptors, drugs_data.hbond_donors FROM drugs_names INNER JOIN drugs_data ON drugs_names.cid = drugs_data.cid LIMIT 5"))

drugBankID	name	molecular_formula	molecular_weight	hbond_acceptors	hbond_donors
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	21	10
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	9	8
DB00117	L-Histidine	C6H9N3O2	155.157	4	3
DB00118	S-Adenosylmethionine	C15H23N6O5S+	399.446	10	5
DB00119	Pyruvic acid	C3H4O3	88.062	3	1

Alias:

Instead of writing a full table name, it is possible to use alias as a shortcut for example, the above query can also be run in the following manner:

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dd.hbond_acceptors, dd.hbond_donors FROM drugs_names AS dn INNER JOIN drugs_data AS dd ON dn.cid = dd.cid LIMIT 5"))

drugBankID	name	molecular_formula	molecular_weight	hbond_acceptors	hbond_donors
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	21	10
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	9	8
DB00117	L-Histidine	C6H9N3O2	155.157	4	3
DB00118	S-Adenosylmethionine	C15H23N6O5S+	399.446	10	5
DB00119	Pyruvic acid	C3H4O3	88.062	3	1

JOIN multiple tables

Example 2:

Select drugbank id, drug name, molecular formula and the molecular weight of the drug and the protein name to which drug binds using drugs_names, drugs_data and drugs_target table.

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dt.proteinName FROM drugs_names AS dn INNER JOIN drugs_data AS dd ON dn.cid = dd.cid INNER JOIN drugs_target AS dt ON dn.drugBankID = dt.drugBankID LIMIT 10"))

drugBankID	name	molecular_formula	molecular_weight	proteinName
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methionine synthase
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methionine synthase reductase
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methylenetetrahydrofolate reductase
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methylmalonic aciduria and homocystinuria type C protein
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methylmalonic aciduria type A protein mitochondrial
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	Methylmalonyl-CoA mutase mitochondrial
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	Aminomethyltransferase mitochondrial
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	Bifunctional methylenetetrahydrofolate dehydrogenase/cyclohydrolase mitochondrial
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	Bifunctional purine biosynthesis protein PURH
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	C-1-tetrahydrofolate synthase cytoplasmic

USING Keyword

If the key name is the same in the two tables being joined we can also use USING keyword instead of ON keyword. For example:

Example 3:

kable(dbGetQuery(drugs, "SELECT dn.drugBankID, dn.name, dd.molecular_formula, dd.molecular_weight, dd.hbond_acceptors, dd.hbond_donors FROM drugs_names AS dn INNER JOIN drugs_data AS dd USING(cid) LIMIT 5"))

drugBankID	name	molecular_formula	molecular_weight	hbond_acceptors	hbond_donors
DB00115	Cyanocobalamin	C63H89CoN14O14P	1356.396	21	10
DB00116	Tetrahydrofolic acid	C19H23N7O6	445.436	9	8
DB00117	L-Histidine	C6H9N3O2	155.157	4	3
DB00118	S-Adenosylmethionine	C15H23N6O5S+	399.446	10	5
DB00119	Pyruvic acid	C3H4O3	88.062	3	1

1.5.2 LEFT JOIN

LEFT JOIN keeps all the original rows of the left table in addition to the matching rows from both tables. The values that do not match in the right table are marked as missing. In the animation below there are no values for ID 3 and 6 in the right table and therefore, are missing in the resulting LEFT JOIN table (grey shaded area).

LEFT_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name

Example 4:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID, dt.proteinName,dn.name,dn.cid FROM drugs_target AS dt LEFT JOIN drugs_names AS dn ON dt.drugBankID = dn.drugBankID ORDER BY dt.drugBankID LIMIT 5"))

drugBankID	proteinName	name	cid
DB00001	Prothrombin	NA	NA
DB00002	Epidermal growth factor receptor	NA	NA
DB00002	Low affinity immunoglobulin gamma Fc region receptor III-B	NA	NA
DB00002	Complement C1r subcomponent	NA	NA
DB00002	Complement C1q subcomponent subunit A	NA	NA

Note: that the DrugBank ID starts from DB00115 in drugs_names table (table on the right in above query) and hence the corresponding values are NULL in the resulting table. The USING keyword can also be used in a left join similar to an inner join.

1.5.3 RIGHT JOIN

RIGHT JOIN keeps all the original rows of the right table in addition to the matching rows from both the tables. The values that do not match in the left table are marked as missing. In the animation below there are no values for ID 2 and 5 in the left table and therefore, are missing in the resulting RIGHT JOIN table (grey shaded area). Note: RIGHT JOIN is equivalent to LEFT JOIN with tables reversed.

RIGHT_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name

In SQLite there is no RIGHT JOIN. However, we can emulate RIGHT JOIN as shown in the syntax below:

SELECT table1.column_name(s), table2.column_name(s) FROM table2 LEFT JOIN table1 ON table1.column_name = table2.column_name

Example 5:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID, dt.proteinName,dn.name,dn.cid FROM drugs_names AS dn  LEFT JOIN drugs_target AS dt  ON dt.drugBankID = dn.drugBankID ORDER BY dt.drugBankID LIMIT 5"))

drugBankID	proteinName	name	cid
NA	NA	Xanthophyll	5368396
NA	NA	Gadodiamide	60754
NA	NA	Pipobroman	4842
NA	NA	Calcium Acetate	6116
NA	NA	Crotamiton	688020

Note: This time the drugbank id and protein names columns have NULL values because there are 114 drugs in the left_table (drugs_names table) which do not have any corresponding information in the right table (drugs_target table). It means there is no information available on what and how many kinds of proteins these drugs bind.

1.5.4 FULL JOIN

The FULL JOIN returns all the records with or without a match in left and right table. The missing values will be filled as NULL on either side. It can potentially return a very large dataset.

FULL_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 FULL JOIN table2 ON table1.column_name = table2.column_name.

In SQLite and MySQL there is no FULL JOIN. However, we emulate FULL JOIN using the syntax below:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name 
UNION ALL 
SELECT table1.column_name(s), table2.column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name.

Example 6:

Select Drugbank ID, Protein name from drugs_target (left_table), Drug name and Drug cid from drugs_names (right_table).

kable(dbGetQuery(drugs, "SELECT dt.drugBankID as id_from_drugs_target, dt.proteinName, dn.drugBankID AS id_from_drugs_names, dn.name FROM drugs_names AS dn LEFT JOIN drugs_target AS dt ON dt.drugBankID = dn.drugBankID 
  UNION ALL 
  SELECT dt.drugBankID, dt.proteinName, dn.drugBankID,dn.name FROM drugs_target AS dt LEFT JOIN drugs_names AS dn ON dt.drugBankID = dn.drugBankID 
  ORDER BY id_from_drugs_names,id_from_drugs_target LIMIT 5"))

id_from_drugs_target	proteinName	id_from_drugs_names	name
DB00001	Prothrombin	NA	NA
DB00002	Epidermal growth factor receptor	NA	NA
DB00002	Low affinity immunoglobulin gamma Fc region receptor III-B	NA	NA
DB00002	Complement C1r subcomponent	NA	NA
DB00002	Complement C1q subcomponent subunit A	NA	NA

1.5.5 CROSS JOIN

The CROSS JOIN or Cartesian join results in all possible combinations of rows from the joined tables. Suppose table1 has M rows and table2 has N rows The CROSS JOIN would produce a result set of M x N rows, therefore, CROSS JOIN should be used with caution.

CROSS_JOIN

Syntax:

SELECT table1.column_name(s), table2.column_name(s) FROM table1 CROSS JOIN table2

SELECT table1.column_name(s), table2.column_name(s) FROM table1, table2

The CROSS JOIN query does not make sense in this dataset however, it was run for demonstration purpose:

Example 7:

Select Drugbank ID, Drug name, Protein name from drugs_target and drugs_names table.

kable(dbGetQuery(drugs,"SELECT dt.drugBankID, dn.name, dt.proteinName FROM drugs_names AS dn CROSS JOIN drugs_target AS dt LIMIT 5"))

drugBankID	name	proteinName
DB00001	Cyanocobalamin	Prothrombin
DB00002	Cyanocobalamin	Epidermal growth factor receptor
DB00002	Cyanocobalamin	Low affinity immunoglobulin gamma Fc region receptor III-B
DB00002	Cyanocobalamin	Complement C1r subcomponent
DB00002	Cyanocobalamin	Complement C1q subcomponent subunit A

1.5.6 SELF JOIN

In a SELF JOIN a table is joined to itself.

Syntax:

SELECT t1.column_name(s), t2.column_name(s) FROM table1 AS t1, table2 AS t2 WHERE t1.column = t2.column

Example 8:

Select drugbank ids from drugs_target table which bind to both the proteins with id ‘P28223’ and ‘P46098’.

The following query will not work because it is not possible for the value of a single column in a single row to contain two values at the same time.

kable(dbGetQuery(drugs,"SELECT dt.drugBankID FROM drugs_target AS dt WHERE dt.uniProtID = 'P28223' AND dt.uniProtID = 'P46098'"))

drugBankID

This query will require a SELF JOIN. The correct query is as follows:

kable(dbGetQuery(drugs,
"SELECT DISTINCT dt1.drugBankID FROM drugs_target dt1, drugs_target dt2 WHERE dt1.drugBankID = dt2.drugBankID AND dt1.uniProtID = 'P28223' AND dt2.uniprotID = 'P46098'"))

drugBankID
DB00246
DB00334
DB00363
DB00370
DB00408
DB00543
DB00604
DB00726
DB01049
DB01224
DB01238

1.5.7 Exercise 3

1. Select cid, drugBankID, uniProtID and count the number of drugs that bind to the protein target from drugs_data, drugs_names and drugs_target tables.

Sample Output

kable(dbGetQuery(drugs,"SELECT cid,B.drugBankID,B.uniProtID, B.count FROM (SELECT drugBankID, A.uniProtID, A.count FROM drugs_target INNER JOIN (SELECT uniProtID, COUNT(uniProtID) AS count FROM drugs_target GROUP BY uniProtID) AS A USING(uniProtID) ORDER BY uniProtID) AS B LEFT JOIN drugs_names USING(drugBankID) ORDER BY B.uniProtID LIMIT 5"))

cid	drugBankID	uniProtID	count
NA	DB09060	A0A023W3H0	1
166548	DB00362	A2QLK4	3
2826718	DB00520	A2QLK4	3
477468	DB01141	A2QLK4	3
702	DB00898	A5X5Y0	2

2. Find those proteins and drugs which have only 1 binding partner and drug type in not BiotechDrug. For example, a drug should bind to a single protein target and a protein target should bind to a single drug.

Sample Output

kable(dbGetQuery(drugs,"SELECT drugBankID, name, sub.uniProtID, sub.proteinName FROM (SELECT drugBankID, uniProtID, proteinName, type FROM drugs_target WHERE uniProtID IN (SELECT uniProtID FROM drugs_target WHERE uniProtID IN (SELECT uniProtID FROM drugs_target WHERE drugBankID IN (SELECT drugBankID FROM drugs_target GROUP BY drugBankID HAVING COUNT(uniprotID) = 1)) GROUP BY uniProtID HAVING COUNT(drugBankID) = 1) AND type !='BiotechDrug') AS sub LEFT JOIN drugs_names USING(drugBankID) LIMIT 5"))

drugBankID	name	uniProtID	proteinName
DB00183	Pentagastrin	P32239	Gastrin/cholecystokinin type B receptor
DB00220	Nelfinavir	O90777	HIV-1 protease
DB00348	Nitisinone	P32754	4-hydroxyphenylpyruvate dioxygenase
DB00403	NA	P32238	Cholecystokinin receptor type A
DB00410	Mupirocin	P41972	Isoleucine–tRNA ligase

1.6 Set theory clauses

1.6.1 UNION

The UNION keyword allows us to combine the related information from two tables. The UNION will count the values only once i.e. will eliminate the duplicate rows. Note: Column numbers and data types must be same for the columns to combine.

Syntax:

SELECT table1.column_name(s) FROM table1 
UNION 
SELECT table2.column_name(s) FROM table2

Example 1:

Let us count the total Drugbank IDs from drugs_target and drugs_names tables individually and then using a UNION.

Count DISTINCT drugBankIDs in drugs_names and drugs_target table individually.

kable(dbGetQuery(drugs,"SELECT COUNT(DISTINCT(drugs_target.drugBankID)) AS drugs_target_ids, COUNT(DISTINCT(drugs_names.drugBankID)) AS drugs_names_ids FROM drugs_target,drugs_names"))

drugs_target_ids	drugs_names_ids
1623	1189

Example 2:

Count DISTINCT drugBankIDs in combined tables using UNION.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS union_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn UNION  SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))

union_drugbank_ids
1732

1.6.2 UNION ALL

Like the UNION the UNION ALL keyword allows us to combine the related information from two tables however, the UNION ALL will return all the rows from the SELECT statement i.e. will not eliminate the duplicate rows and combine all rows from each table into a single table. Note: Column numbers and data types must be same for the columns to combine.

Syntax:

SELECT table1.column_name(s) FROM table1 
UNION ALL
SELECT table2.column_name(s) FROM table2

Example 3:

Let us count the total Drugbank IDs from drugs_target and drugs_names tables using UNION ALL keyword.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS union_all_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn UNION ALL SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))

union_all_drugbank_ids
8258

1.6.3 INTERSECT

The INTERSECT keyword produces rows that are common between the two tables.

Syntax:

SELECT table1.column_name(s) FROM table1 
INTERSECT
SELECT table2.column_name(s) FROM table2

Example 4:

Let us count Drugbank IDs common in drugs_target and drugs_names tables using INTERSECT keyword.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS intersect_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn INTERSECT SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))

intersect_drugbank_ids
1080

In MySQL INTERSECT keyword is not available, however, the effect can be simulated through IN operator.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS in_drugbank_ids FROM (SELECT dn.drugBankID FROM drugs_names AS dn WHERE dn.drugBankID IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub"))

in_drugbank_ids
1080

1.6.4 EXCEPT

The EXCEPT keyword is used to return all the rows in the first table which are not in the second table. In Oracle MINUS is used instead of EXCEPT.

Syntax:

SELECT table1.column_name(s) FROM table1 
EXCEPT
SELECT table2.column_name(s) FROM table2

Example 5:

Let us count Drugbank IDs in drugs_names table which are not present in drugs_target table using EXCEPT keyword.

kable(dbGetQuery(drugs,"SELECT COUNT(drugBankID) AS except_id FROM (SELECT dn.drugBankID FROM drugs_names AS dn EXCEPT SELECT dt.drugBankID FROM drugs_target AS dt) AS sub"))

except_id
109

MySQL does not support EXCEPT or MINUS keyword. However, the same result can be achieved using the NOT IN operator.

kable(dbGetQuery(drugs,
"SELECT COUNT(drugBankID) AS not_in_id FROM (SELECT dn.drugBankID FROM drugs_names AS dn WHERE dn.drugBankID NOT IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub"))

not_in_id
109

1.6.5 Exercise 4

1. Show all the drugbank ids in drugs_names and drugs_target table.

Sample Output

kable(dbGetQuery(drugs,'SELECT dn.drugBankID FROM drugs_names AS dn UNION ALL SELECT dt.drugBankID FROM drugs_target AS dt LIMIT 5'))

drugBankID
DB00115
DB00116
DB00117
DB00118
DB00119

2. Select drugbank id, cid, name and molecular weight of drugs that are present in drugs_names table, however, missing from the drugs_target table. Do an INNER JOIN of the resulting table with the drugs_data table.

Sample Output

kable(dbGetQuery(drugs,"SELECT sub.drugBankID, sub.cid, sub.name,molecular_weight FROM drugs_data INNER JOIN (SELECT dn.drugBankID,dn.cid,dn.name FROM drugs_names AS dn WHERE dn.drugBankID NOT IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub USING(cid) LIMIT 5"))

drugBankID	cid	name	molecular_weight
DB00137	5368396	Xanthophyll	568.886
DB00225	60754	Gadodiamide	591.676
DB00236	4842	Pipobroman	356.058
DB00258	6116	Calcium Acetate	158.166
DB00265	688020	Crotamiton	203.285

3. Select drugbank id, cid, name and molecular weight of drugs that are present in both drugs_names and drugs_target table.

Sample Output

kable(dbGetQuery(drugs,
"SELECT sub.drugBankID, sub.cid, sub.name,molecular_weight FROM (SELECT dn.drugBankID,dn.cid,dn.name FROM drugs_names AS dn WHERE dn.drugBankID IN (SELECT dt.drugBankID FROM drugs_target AS dt)) AS sub LEFT JOIN drugs_data USING(cid) LIMIT 5"))

drugBankID	cid	name	molecular_weight
DB00115	44176380	Cyanocobalamin	1356.396
DB00116	91443	Tetrahydrofolic acid	445.436
DB00117	6274	L-Histidine	155.157
DB00118	34756	S-Adenosylmethionine	399.446
DB00119	1060	Pyruvic acid	88.062

4. Devise a query that will generate a sequence of numbers from 0 .. 99

Hint: Use UNION ALL and CROSS JOIN

Sample Output

kable(dbGetQuery(drugs,"SELECT ones.num + tens.num AS seq FROM (SELECT 0 num UNION ALL SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num UNION ALL SELECT 4 num UNION ALL SELECT 5 num UNION ALL SELECT 6 num UNION ALL SELECT 7 num UNION ALL SELECT 8 num UNION ALL SELECT 9 num) ones CROSS JOIN (SELECT 0 num UNION ALL SELECT 10 num UNION ALL SELECT 20 num UNION ALL SELECT 30 num UNION ALL SELECT 40 num UNION ALL SELECT 50 num UNION ALL SELECT 60 num UNION ALL SELECT 70 num UNION ALL SELECT 80 num UNION ALL SELECT 90 num) tens ORDER BY seq LIMIT 5"))

seq
0
1
2
3
4

1.7 The SQL CASE Statements

The CASE statement in SQL is a way to handle if/else logic. The CASE statement is followed by at least one pair of WHEN and THEN statements and ends by END keyword. THE ELSE statement is optional and captures values not mentioned in WHEN and THEN statements.

1.7.1 Simple CASE Statements

Syntax 1:

CASE value 
WHEN compare_value1 THEN result1
WHEN compare_value2 THEN result2
...
ELSE result END

Syntax 2:

CASE 
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE result END

Example 1:

Suppose you want to find the drugs which are present in drugs_target table but not in drugs_names table.

kable(dbGetQuery(drugs,"SELECT dn.drugBankID, dt.name, CASE WHEN dn.drugBankID IS NULL THEN 'not' ELSE 'in' END AS in_drugs_names FROM drugs_target AS dt LEFT JOIN drugs_names AS dn USING(drugBankID) ORDER BY dt.name LIMIT 10"))

drugBankID	name	in_drugs_names
NA	Insulin isophane	not
NA	Interferon Alfa-2a Recombinant	not
NA	Interferon Alfa-2a Recombinant	not
NA	Interferon Alfa-2b Recombinant	not
NA	Interferon Alfa-2b Recombinant	not
NA	ART-123	not
NA	ART-123	not
NA	AST-120	not
DB01048	Abacavir	in
NA	Abarelix	not

1.7.2 Adding multiple conditions

Example 2:

Select cid, drug name, molecular formula and molecular weight of all drugs and group them into three categories as follows:

small - drugs with the molecular weight of less than 200
medium - drugs with the molecular weight of less than 400
big - drugs with the molecular weight of above 400

kable(dbGetQuery(drugs, 'SELECT cid,name, molecular_formula, molecular_weight, 
  CASE WHEN molecular_weight < 200 
  THEN "small"
  WHEN molecular_weight < 400
  THEN "medium"  
  ELSE "big" 
  END AS size 
FROM drugs_data INNER JOIN drugs_names USING (cid) LIMIT 5'))

cid	name	molecular_formula	molecular_weight	size
44176380	Cyanocobalamin	C63H89CoN14O14P	1356.396	big
91443	Tetrahydrofolic acid	C19H23N7O6	445.436	big
6274	L-Histidine	C6H9N3O2	155.157	small
34756	S-Adenosylmethionine	C15H23N6O5S+	399.446	medium
1060	Pyruvic acid	C3H4O3	88.062	small

Explanation:

The CASE statement checks the molecular weight of a compound in each row.
For any compound, if the molecular weight is less than 200, the word “small” get inserted to the column we created as size. Similarly, “medium” and “big” are inserted accordingly.
At the same time, JOIN takes place between drugs_data and drugs_names table.

1.7.3 Using aggregate functions with CASE Statements

Example 3:

Now count the drugs in each group from the query above.

kable(dbGetQuery(drugs,"SELECT CASE WHEN molecular_weight < 200    THEN 'small' WHEN molecular_weight < 400   THEN 'medium' ELSE 'big'  END AS size,COUNT(1) AS num_drugs FROM drugs_data INNER JOIN drugs_names USING (cid) GROUP BY size"))

size	num_drugs
big	318
medium	702
small	162

1.7.4 Exercise 5

1. Group protein targets into three categories based on the number of drugs they bind. Three groups of protein targets are as follows:

less_than_3_drugs - bind less than 3 drugs
four_to_10_drugs - bind 4 or more but less than 10 drugs
more_than_11_drugs - bind more than 11 drugs

Hint: In order to perform this query you have to use INNER JOIN, Subquery and Case statements.

Sample Output

kable(dbGetQuery(drugs,"SELECT SUM(CASE WHEN A.count<3 THEN 1 ELSE 0 END) AS less_than_3_drugs, SUM(CASE WHEN A.count>=4 AND A.count <= 10 THEN 1 ELSE 0 END) AS four_to_10_drugs, SUM( CASE WHEN A.count >= 11 THEN 1 ELSE 0 END) AS more_than_11_drugs FROM 
                 (SELECT DISTINCT dt.uniProtID, dt.proteinName,sub.count FROM drugs_target AS dt INNER JOIN (SELECT uniProtID, COUNT(drugBankID) AS count FROM drugs_target GROUP BY uniProtID) AS sub USING(uniProtID) ORDER BY sub.count) AS A"))

less_than_3_drugs	four_to_10_drugs	more_than_11_drugs
1352	197	131

Note: There are a large number of protein targets that bind up to 3 different drugs compared to protein targets which bind 4 or more different drugs. Based on this data we may conclude that protein targets are quite specific for the kind of drugs they bind.

2. Now group drugs into three categories based on the number of protein targets they bind. Three groups of drugs are as follows:

less_than_3_protein_targets - bind less than 3 protein targets
four_to_10_protein_targets - bind 4 or more but less than 10 targets
more_than_11_protein_targets - bind more than 11 protein targets

Sample Output

kable(dbGetQuery(drugs,"SELECT SUM(CASE WHEN A.count<3 THEN 1 ELSE 0 END) AS less_than_3_protein_targets, SUM(CASE WHEN A.count>=4 AND A.count <= 10 THEN 1 ELSE 0 END) AS four_to_10_protein_targets, SUM( CASE WHEN A.count >= 11 THEN 1 ELSE 0 END) AS more_than_11_protein_targets FROM 
(SELECT DISTINCT dn.drugBankID, dn.name,sub.count FROM drugs_names AS dn INNER JOIN (SELECT drugBankID, COUNT(uniProtID) AS count FROM drugs_target GROUP BY drugBankID) AS sub USING(drugBankID) ORDER BY sub.count) AS A"))

less_than_3_protein_targets	four_to_10_protein_targets	more_than_11_protein_targets
609	241	114

Note: There are 609 drugs that bind 3 or fewer protein targets. We again observe high specificity for drugs and protein targets.

3. Find those biotech drugs from drugs_target table that end with ‘ab’ and use the case statement to form two groups as follows:

Antibody - drugs ending with ‘ab’ and of type ‘BiotechDrug’
Not antibody - all other drugs of type ‘BiotechDrug’

Sample Output

kable(dbGetQuery(drugs,"SELECT DISTINCT drugBankID, CASE WHEN name LIKE '%ab' THEN 'Antibody' ELSE 'Not Antibody' END class FROM drugs_target WHERE type = 'BiotechDrug' LIMIT 10"))

drugBankID	class
DB00001	Not Antibody
DB00002	Antibody
DB00004	Not Antibody
DB00005	Not Antibody
DB00007	Not Antibody
DB00008	Not Antibody
DB00009	Not Antibody
DB00010	Not Antibody
DB00011	Not Antibody
DB00012	Not Antibody

1.8 Summary

Congratulations, on completing the tutorial you are an SQL ninja now!! In this tutorial you specifically learned:

Various types of Subqueries
Advanced usage of JOINS
Set operators and their use in SQL statements
CASE Statements for If/Else logic

In case you have any questions, comments please mail me.

Intermediate SQL course

Varun Khanna

26 September 2018

1 Welcome

1.1 Prerequisites

1.2 Objectives

1.3 Introduction and Refresher

1.3.1 Load the library and database

1.3.1.1 SELECT Statement

1.3.1.2 Aggregate functions

1.3.1.3 GROUP BY and ORDER BY clause

1.3.1.3.1 GROUP BY single column

1.3.1.3.2 GROUP BY multiple columns

1.3.1.3.3 GROUP BY column numbers

1.3.1.3.4 ORDER BY column names or numbers

1.3.1.4 HAVING Clause

1.3.2 Exercise 1

1.4 Intermediate SQL

1.4.1 Load the data and create the database connection

1.4.2 Subqueries

1.4.3 Exercise 2

1.5 JOINS

1.5.1 INNER JOIN

1.5.2 LEFT JOIN

1.5.3 RIGHT JOIN

1.5.4 FULL JOIN

1.5.5 CROSS JOIN

1.5.6 SELF JOIN

1.5.7 Exercise 3

1.6 Set theory clauses

1.6.1 UNION

1.6.2 UNION ALL

1.6.3 INTERSECT

1.6.4 EXCEPT

1.6.5 Exercise 4

1.7 The SQL CASE Statements

1.7.1 Simple CASE Statements

1.7.2 Adding multiple conditions

1.7.3 Using aggregate functions with CASE Statements

1.7.4 Exercise 5

1.8 Summary

1.3.1.3.1 `GROUP BY` single column

1.3.1.3.2 `GROUP BY` multiple columns

1.3.1.3.3 `GROUP BY` column numbers

1.3.1.3.4 `ORDER BY` column names or numbers