Databases are systems that allow users to store and organize data and are useful when dealing with large amounts of data. Spreadsheets, on the other hand, are suitable for one-time analysis, quick charts, reasonably sized datasets, and for letting untrained people work with data.
Databases are suitable for maintaining data integrity, handling large amounts of data, quickly combining different datasets, automating steps for reuse, and supporting data for websites and applications.
A database is a collection of tables. Tables contain rows and columns, where the rows are known as records and the columns are known as fields. A column is a set of data values of a particular type, one value for each row of the table. A row represents a single data item in a table, and every row in the table has the same structure.
- 1. Overview
- 2. Setup
- 3. SQL Statement Fundamentals
- 4. General Challenge 1
- 5. GROUP BY Statements & Aggregate Functions
- 6. Assessment Test 1
- 7. JOIN Clause
- 8. Advanced SQL Commands
- 9. Assessment Test 2
- 10. Creating Databases and Tables
- 11. Assessment Test 3
- 12. Conditional Expressions and Procedures
- 13. Import and Export
- 14. PgSQL with Python
- 15. SQL Window Function Part 1
- 16. SQL Window Function Part 2
- 17. SQL With Clause and CTE (Common Table Expression) or Sub-Query Factoring
- 18. Practice Complex SQL Queries
- 19. Misc Notes
- 20. Solutions to Codility Exercise
Overview of the course curriculum and challenges.
The course is divided into the following sections...
- Section 1
- Databases and Table Basics
- SQL Statement Fundamentals
- GROUP BY Clause
- Assessment Test 1
- Section 2
- JOINS
- Advanced SQL Commands
- Assessment Test 2
- Section 3
- Create Database and Tables
- Assessment Test 3
- Views
- PostgreSQL with Python
Typical database users...
- Analyst
- Marketing
- Business
- Sales
- Technical
- Data Scientist
- Software Engineers
- Web Developers
Database Platform Options:
- PostgreSQL (focus of the course)
- Free (Open Source)
- Widely used on internet
- Multi platform
- MySQL & MariaDB
- Free (Open Source)
- Widely used on internet
- Multi platform
- MS SQL Server Express
- Free, but with some limitations
- Compatible with SQL Server
- Windows only (-)
- Microsoft Access
- Cost (-)
- Not easy to use just SQL (-)
- SQLite
- Free (Open Source)
- Mainly command line (-)
SQL is the programming language used to communicate with our database. Example:
select customer_id, first_name, last_name
from sales
order by first_name;
-- the ; at the end of the query is optional in pgSQL
Challenges are based on the scenario that we've just been hired as a SQL consultant for a DVD Rental Store. Challenges increase in difficulty over the course.
Challenge structure...
- Business Situation
- Challenge Question
- Expected Answer
- Hints
- Solution
We install and use the following applications.
- PostgreSQL: SQL Engine that stores data and reads queries and returns information
- PgAdmin: graphical User Interface for connecting with PostgreSQL
- Restore the provided database using the .tar file in the repo here.
- VS Code: source-code editor. We use VS Code as a scratch pad and for general note taking.
Go to Servers > PostgreSQL 12 > Databases > dvdrental > Schemas > Tables to see the tables in the restored database.
We retrieve information from tables with queries. We construct queries with statements.
The select statement is used to select data from a database. The data returned is stored in a result table, called the result-set.
The syntax we learn can be applied to any major type of SQL database. select is the most common statement used, and it allows us to retrieve information from a table. Later on we will learn how to combine select with other statements to perform more complex queries.
-- select columns c1,...,cn from table t1
select c1,...,cn from t1
-- select all columns from table t1, i.e. select the whole table
select * from t1
-- note that it is bad practice to use * if we don't need all columns (increased traffic between database server and app -> slows down retrieval of results)
Challenge: we want to send out a promotional email to our existing customers. Grab the first and last names of every customer and their email address.
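A possible solution (column names taken from the dvdrental customer table):
select first_name, last_name, email
from customer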
The distinct / distinct() keyword can be used to return only the distinct values in a column.
-- the DISTINCT keyword operates on a column
select distinct c from t1
-- we can add parentheses for clarity. When combining more calls together, the parentheses become necessary
select distinct(c) from t1
Challenge: retrieve the distinct rating types our films could have in our database.
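A possible solution (rating is a column of the film table):
select distinct(rating)
from film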
The count() function returns the number of input rows that match a specific condition of a query. We can apply count() to a specific column or just pass count(*); both return the same result. It simply returns the number of rows in the table, regardless of the column we reference.
select count(c1) from t1
-- it's useful to use a column name to contextualize the query (what question were we trying to answer?)
count() is more useful when combined with other commands, e.g. distinct(). We can answer questions like: how many unique names are there in the table?
select count(distinct(c1)) from t1
-- we're calling count on the result of distinct name
The where clause is used to filter records: it extracts only those records that fulfill a specified condition.
select and where are the most fundamental SQL statements. The where clause allows us to specify conditions on columns for the rows to be returned. Basic syntax:
select c1,...,cn from t1
where conditions
The where clause appears immediately after the from clause of the select statement. The conditions are used to filter the rows returned from the select statement. There are a variety of standard operators to construct the conditions.
- Comparison operators: compare a column value to something.
- Logical operators: allow us to combine multiple comparison operators.
Example:
select count(title) from film
where rental_rate >= 4 and replacement_cost >= 19.99 and rating = 'R'
-- condition on rental_rate, replacement_cost and rating
-- note that the COUNT function does not need a specific column, as it just counts the number of records expected to be returned by the SELECT statement
Challenge: from now on we will focus more on directly asking business-related questions, to more realistically model a typical task. Find the email for the customer with the name Nancy Thomas. Solution:
-- task: find the email for the customer with the name Nancy Thomas
-- return the columns of the customer table
-- select * from customer where 1=0
select email, first_name, last_name from customer
where first_name = 'Nancy' and last_name = 'Thomas'
Challenge: what is the movie Outlaw Hanky about? Solution:
-- select * from film where 1=0
select description from film
where title = 'Outlaw Hanky'
Challenge: get the phone number for the customer who lives at 259 Ipoh Drive.
-- select * from address where 1=0
select phone from "address"
where address='259 Ipoh Drive'
We can use order by to sort rows based on a column value, in either ascending or descending order. Basic syntax:
select c1, c2 from t1
order by c1 asc/desc, c2 asc/desc
-- ORDER BY uses ASC by default
Notice that order by appears towards the end of a query, since we want to do any selection and filtering first, before finally sorting. We can use order by on multiple columns (this makes sense when one column has duplicate entries), e.g.
select company, employee, sales from sales
order by company asc, sales desc
-- we first sort rows based on the company and then we sort rows by sales (the second sorting does not affect the first one). Here we query the companies by ascending name and order the sales by descending amounts
limit allows us to limit the number of rows returned by a query. This is useful when we don't want to return every single row in a table, but only view the top few rows to get an idea of the table layout. limit becomes especially useful in combination with order by.
limit goes at the very end of a query and is the last command to be executed. Basic syntax:
select c1 from t1
order by c1 desc limit 10
-- get the top 10 values in c1
For example:
select * from payment
where amount != 0
order by payment_date desc
limit 5
-- history of the 5 most recent payments where the amount is not zero
Challenge: we want to reward our first 10 paying customers. What are the customer ids of the first 10 customers who created a payment?
-- return columns names / layout of the table
-- select * from payment limit 1
select distinct(customer_id), payment_date from payment
-- where
order by payment_date asc
limit 10
-- assumption: we want to reward 10 different paying customers
Challenge: a customer wants to quickly rent a video to watch over their short lunch break. What are the titles of the 5 shortest (in length of runtime) movies?
-- see layout of the table film
-- select * from film limit 1
select title, "length" from film
-- where
order by "length" asc
limit 5
Challenge: if a customer can watch a movie that is 50 minutes or less in run time, how many options does the customer have?
select count(title) from film
where "length" <= 50
The between operator can be used to match a value against a range of values:
select c1 from t1
where c1 between m and M
-- this is the same as WHERE c1 >= m AND c1 <= M
-- between includes the bounds
-- the result set is a single contiguous range
We can also use the negation not:
select c1 from t1
where c1 not between m and M
-- this is the same as WHERE c1 < m OR c1 > M
-- not between does not include the bounds
-- the result set consists of two disjoint ranges
The between operator can also be used with dates. Note that we need to format dates in the ISO 8601 standard format YYYY-MM-DD.
When using the between operator with dates that also include timestamp information, pay careful attention to using between versus comparison operators, since a datetime starts at 00:00:00 (and ends at 23:59:59). Later we will study more specific methods for datetime data types.
select c1 from t1
where c1 between 'YYYY-MM-DD hh:mm:ss.sss' and 'YYYY-MM-DD hh:mm:ss.sss'
-- if not specified, hh:mm:ss.sss defaults to 00:00:00.000, which affects the logic of the upper bound!
In certain cases you want to check for multiple possible value options, e.g. if a user's name shows up in a list of known names. We can use the in operator to create a condition that checks whether a value is included in a list of multiple options. Basic syntax:
select c1 from t1
where c1 in (opt1,...,optn)
-- this is the same as WHERE c1 = opt1 OR ... OR c1 = optn
-- the options should match the general syntax of the column
We can also use the negation not:
select c1 from t1
where c1 not in (opt1,...,optn)
-- this is the same as WHERE c1 != opt1 AND ... AND c1 != optn
We've already been able to perform direct comparisons against strings. But what if we want to match against a general pattern in a string? Example:
- All emails ending in '@gmail.com'
- All names that begin with an 'A'
The like operator (case-sensitive) allows us to perform pattern matching against string data with the use of wildcard characters:
- Percent %: matches any sequence of characters (can also be empty).
- Underscore _: matches any single character.
select c1 from t1
where c1 like 'A%'
-- all names in c1 that begin with an 'A'
The operator ilike is case-insensitive. PgSQL also supports full regex capabilities (see docs). Example:
select * from customer
where first_name ilike 'j%' and last_name ilike 's%'
-- look for customers with a first name that starts with 'j' and a last name that starts with 's' (case-insensitive)
% can match an empty sequence, but _ cannot. Example:
select * from customer
where first_name ilike 'heather%'
-- does return rows with the first name 'Heather'
select * from customer
where first_name ilike 'heather_'
-- does not return rows with the first name 'Heather'
We can combine operations and clauses to create more complex queries.
select * from customer
where first_name like 'A%' and last_name not like 'B%'
order by first_name
-- return all customers whose first name starts with an 'A' and whose last name does not start with a 'B'. Order the results ascending by first name.
How many payment transactions were greater than $5.00?
select count(amount) from payment
where amount > 5
How many actors have a first name that starts with the letter P?
select count(distinct(actor_id))
from actor
where first_name ilike 'p%'
How many unique districts are our customers from?
select count(distinct(district)) from address
-- where
Retrieve the list of names for those distinct districts from the previous question.
select distinct(district) from address
-- where
How many films have a rating of R and a replacement cost between $5 and $15? (*)
select count(title) from film
where rating = 'R' and replacement_cost between 5 and 15
-- note that rating is mpaa_rating data type
How many films have the word Truman somewhere in the title?
select count(title) from film
where title ilike '%truman%'
group by will allow us to aggregate data and apply functions to better understand how data is distributed per category.
Overview:
- Aggregate Functions
- GROUP BY Statement
- Theory
- Implementation
- Challenge Tasks for GROUP BY
- HAVING - Filtering with a GROUP BY
- Challenge Tasks for HAVING
The main idea behind an aggregate function is to take multiple inputs and return a single output (stat). PgSQL aggregate functions can be found in the documentation.
Most common aggregate functions:
- avg() returns the average value
- count() returns the number of values
- max() returns the maximum value
- min() returns the minimum value
- sum() returns the sum of all values
Aggregate function calls happen only in the select statement or having clause!
Special notes:
- avg() returns a floating point value with many decimal places (e.g. 2.342418...); you can use round() to specify the precision after the decimal.
- count() simply returns the number of rows, which means by convention we just use count(*).
Calling an aggregate function returns a single value, which cannot be returned alongside a non-aggregated column. In order to select other columns we need the group by statement. It is possible to return multiple single values in one query. For example:
select min(replacement_cost), max(replacement_cost) from film
We can calculate the average cost:
select round(avg(replacement_cost),2) from film
Additionally, we can determine statistics over ordered sets with ordered-set aggregate functions using the keyword within group.
select
min(replacement_cost),
percentile_disc(0.25) within group(
order by
replacement_cost
) as "25p",
percentile_disc(0.5) within group(
order by
replacement_cost
) as median,
round(avg(replacement_cost), 2) as avg,
percentile_disc(0.75) within group(
order by
replacement_cost
) as "75p",
max(replacement_cost),
mode() within group(
order by
replacement_cost
)
from film
The group by statement allows us to aggregate columns per some category. We need to choose a categorical column to group by. Categorical columns are non-continuous. Keep in mind, they can still be numerical, such as cabin class categories on a ship (e.g. Class 1, Class 2, Class 3).
After choosing the categorical column we're essentially splitting the table up on a per-category basis into n subtables. We can then aggregate the columns in the subtables with an aggregate function. Basic syntax:
select cat_col, agg_function(data_col)
from table_
where condition_on_cat_col
group by cat_col
-- the GROUP BY statement must appear right after the FROM or WHERE statement
Note that the group by statement must appear right after the from or where statement.
In the select statement, columns must either have an aggregate function OR be in a group by call. Example:
select company, division, sum(sales)
-- sales won't appear in GROUP BY so we need an aggregate function to select it
from finance_table
group by company, division
-- this returns the total sales per division per company
Another example from our database:
select
rental_duration,
rating,
round(avg(replacement_cost), 2) as avg_replacement_cost,
percentile_disc(0.5) within group(
order by
replacement_cost
) as median_replacement_cost
from
film
group by
rental_duration,
rating
order by
rental_duration,
rating
-- return the average and median replacement cost per rental duration per rating
We can also add a where statement to the query. where statements should not refer to the aggregation result. Later on we'll use the having clause to filter on those results. The having clause was added to SQL because the where keyword cannot be used with aggregate functions. Example:
select company, division, sum(sales)
from finance_table
where division in ('marketing', 'transport')
-- where cannot be used with aggregate function
group by company, division
Another example from our database:
select rating, round(avg(replacement_cost),2), percentile_disc(0.5) within group(order by replacement_cost)
from film
where rating in ('R','NC-17')
group by rating
order by rating
-- returns average and median replacement cost for the ratings R and NC-17
An example using where and having:
select
rental_duration,
rating,
round(avg(replacement_cost), 2) as avg_replacement_cost,
percentile_disc(0.5) within group(
order by
replacement_cost
) as median_replacement_cost
from
film
where
rating in ('R','NC-17')
group by
rental_duration,
rating
having
round(avg(replacement_cost), 2) > 19
order by
rental_duration,
rating
-- returns the average and median replacement cost per rental duration per rating where the rating is R or NC-17 and the average replacement cost is greater than 19
If we want to sort results based on the aggregate, we must reference the entire function. Example:
select company, sum(sales)
from finance_table
group by company
order by sum(sales) desc
limit 5
-- returns top 5 companies based on total sales
Examples:
select customer_id from payment
group by customer_id
-- the result is the same as SELECT distinct(customer_id) FROM payment
We want to find out the top 5 customers that spent the most money:
select customer_id, sum(amount)
from payment
-- where
group by customer_id
order by sum(amount) desc
limit 5
Alternative:
with top_customer as (
select
customer_id,
sum(amount) as s_amount
from
payment -- where
group by
customer_id -- having
order by
sum(amount) desc
limit
5
)
select
t.customer_id,
first_name,
last_name,
s_amount
from
top_customer as t
join customer as c on t.customer_id = c.customer_id
We want to find out how many transactions occurred per customer:
select customer_id, count(amount) from payment
group by customer_id
order by count(amount) desc
Alternative
with tx_count as (
select
customer_id,
count(payment_id) as tx_count
from
payment
group by
customer_id
)
select
t.customer_id,
first_name,
last_name,
tx_count
from
tx_count as t
join customer as c on t.customer_id = c.customer_id
order by
tx_count desc
limit
5
We want to find out the total amount per customer per staff member:
select staff_id, customer_id, sum(amount) from payment
group by staff_id, customer_id
order by staff_id, customer_id
We want to find out the date with the highest total payment amount:
select date(payment_date), sum(amount) from payment
group by date(payment_date)
order by sum(amount) desc
-- note that the payment_date is a timestamp -> we remove the time with the date() function
Another example: find out the date with the highest total payment amount after April 29, 2007:
select date(payment_date), sum(amount)
from payment
where date(payment_date) > '2007-04-29'
group by date(payment_date)
order by sum(amount) desc
-- limit 5
Challenge: We have two staff members, with Staff IDs 1 and 2. We want to give a bonus to the staff member that handled the most payments. (Most in terms of number of payments processed, not total dollar amount). How many payments did each staff member handle and who gets the bonus?
select staff_id, count(payment_id) from payment
group by staff_id
Alternative solution:
select staff_id, count(distinct(payment_id))
from payment
where staff_id in (1,2)
group by staff_id
order by count(distinct(payment_id)) desc
-- limit 5
Challenge: Corporate HQ is conducting a study on the relationship between replacement cost and a movie MPAA rating. What is the average replacement cost per MPAA rating?
select rating, round(avg(replacement_cost),2)
from film
group by rating
order by avg(replacement_cost) desc
Challenge: We are running a promotion to reward our top 5 customers with coupons. What are the customer ids of the top 5 customers by total spend?
select customer_id, round(sum(amount),2)
from payment
group by customer_id
order by sum(amount) desc
limit 5
Alternative:
with c_sum as (
select
customer_id,
sum(amount)
from
payment
group by
customer_id
limit
5
)
select
c.customer_id,
first_name,
last_name,
sum
from
c_sum as c
join customer as u on c.customer_id = u.customer_id
order by
sum desc
The having clause allows us to filter after an aggregation has already taken place. Example:
select company, sum(sales)
from finance_table
where company != 'Google'
group by company
having sum(sales) > 1000
Here we're aggregating sales per company. We can filter on company before executing the group by statement, since that column is not being aggregated. We cannot use where to filter based on aggregate results, because PostgreSQL evaluates aggregates after the where clause is executed. Additional filtering on aggregated results can be done with the having clause. Example:
select customer_id, sum(amount)
from payment
-- we remove some arbitrary customers by id
where customer_id not in (184,87,477)
group by customer_id
having sum(amount) > 150
order by sum(amount) desc
-- the result is the customer_id and sum amount records, excluding some customers and keeping (via the having clause) only those whose aggregated sum amount is greater than 150, ordered by descending sum amount
Example:
-- number of customers per store
select store_id, count(customer_id)
from customer
group by store_id
We can expand on this example. We want to see the number of customers per store, for stores that have at least 300 customers.
-- number of customers per store, for stores with at least 300 customers
select store_id, count(customer_id)
from customer
group by store_id
having count(customer_id) >= 300
Challenge: We are launching a platinum service for our most loyal customers. We will assign platinum status to customers that have had 40 or more transaction payments. What customer_ids are eligible for platinum status?
select customer_id, count(*)
from payment
group by customer_id
having count(*) >= 40
Challenge: What are the customer ids of customers who have spent more than $100 in payment transactions with our staff_id member 2?
select
customer_id,
sum(amount)
from
payment
where
staff_id = 2
group by
customer_id
having
sum(amount) > 100
order by
sum(amount) desc
Return the customer IDs of customers who have spent at least $110 with the staff member who has an ID of 2.
select staff_id, customer_id, sum(amount)
from payment
where staff_id = 2
group by staff_id, customer_id
having sum(amount) > 110
How many films begin with the letter J?
select count(title)
from film
where title like 'J%'
What customer has the highest customer ID number whose name starts with an 'E' and has an address ID lower than 500?
select customer_id, first_name, last_name
from customer
where first_name like 'E%' and address_id < 500
order by customer_id desc
limit 1
Wrong answer:
select max(customer_id), first_name, last_name
from customer
where first_name ilike 'E%'
and address_id < 5000
-- group by
-- having
-- order by
-- limit
JOINS will allow us to combine information from multiple tables. See documentation.
Overview:
- Creating an alias with the as clause
- Understanding the different kinds of join: inner join, outer join, full join, union
- Challenge Task
Aliases are used to give a table, or a column in a table or result, a temporary name. Basic syntax:
select c1 as new_c1_name
from t1
select sum(c1) as new_c1_name
from t1
-- useful for readability of the data output
The as operator gets executed at the very end of a query, meaning that we cannot use the alias inside a where clause. This means the alias is only valid in the select statement. Example:
select count(amount) as "Number of Transactions"
from payment
select customer_id, sum(amount) as "Total Spent"
from payment
group by customer_id
having sum(amount) > 150
-- using "Total Spent" in the having clause returns an error because "Total Spen" does not exist (alias gets assigned at the very end)
Note: column aliases are for better readability of the output, whereas table aliases are for better readability and structuring of the query.
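For instance, a single query can use both kinds of alias (a sketch against the dvdrental payment table; the table alias p shortens the references, while the column alias labels the output):
-- table alias p for shorter references, column alias for a readable output header
select p.customer_id, sum(p.amount) as "Total Spent"
from payment as p
group by p.customer_id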
JOINs allow us to combine multiple tables together. The main reason for the different JOIN types is to decide how to deal with information only present in one of the joined tables.
The inner join keyword selects records that have matching values in both tables. Basic syntax:
select order.order_id, customer.first_name
from order inner join customer on order.customer_id = customer.customer_id
-- in this syntax, the INNER keyword is optional
-- inner join order ...
-- the inner join is symmetrical: the order of the tables does not matter
Result:
set | id_A | id_B |
---|---|---|
A ∩ B | ... | ... |
If we just use join without the inner, PgSQL will treat it as an inner join.
Example: we want to join the payment and customer table.
select payment.payment_id, payment.customer_id, customer.email
from payment inner join customer on payment.customer_id = customer.customer_id
-- this shows only customers that have made a payment
Alternative in PostgreSQL:
select p.payment_id, c.first_name
-- from payment as p inner join customer as c on p.customer_id = c.customer_id
from payment as p join customer as c using(customer_id)
The words inner and outer are optional in all forms. inner is the default; left, right, and full imply an outer join.
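For example, left join and left outer join mean the same thing; a quick sketch against the dvdrental tables:
-- LEFT JOIN and LEFT OUTER JOIN are interchangeable
select c.customer_id, p.payment_id
from customer as c
left outer join payment as p on c.customer_id = p.customer_id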
There are a few different types of OUTER JOINs. They allow us to specify how to deal with values only present in one of the tables being joined. We will explain:
- full outer join (and clarifying it with where ... is null)
- left outer join (and clarifying it with where ... is null)
- right outer join (and clarifying it with where ... is null)
Basic syntax of full outer join:
select * from order
full outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
-- a full outer join is symmetrical: the order of the tables does not matter
Result:
set | id_A | id_B |
---|---|---|
A ∩ B | ... | ... |
A \ B | ... | NULL |
B \ A | NULL | ... |
We can further qualify a full outer join with where (and the help of null values) to get rows unique to either table (rows not found in both tables), i.e. a XOR-style join (the opposite of an INNER join):
select * from order
full outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
where order.customer_id is null or customer.customer_id is null
-- a full outer join with a symmetrical where clause is symmetrical
Result:
set | id_A | id_B |
---|---|---|
A \ B | ... | NULL |
B \ A | NULL | ... |
Example:
select *
from customer
full outer join payment
on customer.customer_id = payment.customer_id
where customer.customer_id is null or payment.customer_id is null
-- we want to return (a) customer ids that are not present in the payment table (customer without historical payment data) and (b) customer ids that are not present in the customer table, but have done payments -> no customer has this property -> we're in compliance with this policy
A left outer join returns the set of records that are in the left table; if there is no match with the right table, the right-hand values are null.
select * from order
left outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
-- a left outer join is not symmetrical!
Result:
set | id_A | id_B |
---|---|---|
A ∩ B | ... | ... |
A \ B | ... | NULL |
We can further qualify a left outer join with where (and the help of null values) to get rows unique to the left table. What if we only wanted entries unique to Table A, i.e. those rows found in Table A and not found in Table B?
select * from order
left outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
where customer.customer_id is null
-- a left outer join is not symmetrical!
Result:
set | id_A | id_B |
---|---|---|
A \ B | ... | NULL |
Example:
select film.film_id, inventory.film_id, film.title, inventory.inventory_id
-- select count(distinct(film.film_id))
-- select count(distinct(inventory.film_id))
from film
left join inventory on film.film_id = inventory.film_id
-- there are 958 distinct film IDs in our inventory and 1000 distinct films in the film table
We return all films in the film table, even if they're not present in the inventory table. We can use the null values generated to identify films that are not in our inventory:
select film.film_id, inventory.film_id, film.title, inventory.inventory_id
-- select count(distinct(film.film_id))
-- select count(distinct(inventory.film_id))
from film
left join inventory on film.film_id = inventory.film_id
-- there are 958 distinct films IDs in our inventory and 1000 distinct films on the film table
where inventory.film_id is null
-- this returns the films that are in our films table but not in our inventory
Alternative:
select
distinct f.film_id,
f.title
from
film as f
full outer join inventory as i on f.film_id = i.film_id
where
i.film_id is null
order by
f.film_id asc
Alternative to example:
select count(distinct(f.film_id))
-- f.film_id, i.film_id, f.title
from film as f left join inventory as i on f.film_id = i.film_id
where i.film_id is null
-- # of films that are not in our inventory
A right join is essentially the same as a left join, except the tables are switched. This would be the same as switching the table order in a left join.
select * from order
right outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
-- a right outer join is not symmetrical!
Result:
set | id_A | id_B |
---|---|---|
A ∩ B | ... | ... |
B \ A | NULL | ... |
We can add a where qualifier:
select * from order
right outer join customer
-- in this syntax, the OUTER keyword is optional
on order.customer_id = customer.customer_id
where order.customer_id is null
-- a right outer join is not symmetrical!
Result:
set | id_A | id_B |
---|---|---|
B \ A | NULL | ... |
The union operator is used to combine the result-set of two or more select statements.
- Every select statement within union must have the same number of columns
- The columns must also have similar data types
- The columns in every select statement must also be in the same order
It basically serves to directly concatenate two results together, essentially "pasting" them together. Basic syntax:
select c_1,...,c_n from t1
union
select c_1,...,c_n from t2
Result:
col_1 | col_2 |
---|---|
A.col_1 | A.col_2 |
B.col_1 | B.col_2 |
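As a concrete sketch against the dvdrental database (assuming we simply want one combined list of first names from the actor and customer tables; union removes duplicate rows, union all would keep them):
-- one combined list of first names from two tables
select first_name from actor
union
select first_name from customer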
Challenge: California sales tax laws have changed and we need to alert our customers of this through email. What are the emails of the customers who live in California?
-- task: what are the emails of the customers that live in California
-- California is a district in the address table
select customer.first_name, customer.last_name, customer.email, address.district
from customer
right join address on customer.address_id = address.address_id
where address.district = 'California'
-- this query returns the same information if we perform an inner join; we perform a right join to make sure that there is no person in California that does not have an email in the database
Alternative:
select a.district, c.email
from address as a join customer as c on a.address_id = c.address_id
where a.district ilike 'california'
order by c.email asc
Challenge: A customer walks in and is a huge fan of the actor "Nick Wahlberg" and wants to know which movies he is in. Get a list of all the movies "Nick Wahlberg" has been in.
-- task: get all Nick Wahlberg movies
-- tables: film for the title, film_actor for relationship, actor for first and last name
select actor.first_name, actor.last_name, film.title
from actor
join film_actor
on actor.actor_id = film_actor.actor_id
join film
on film.film_id = film_actor.film_id
where actor.first_name = 'Nick' and actor.last_name = 'Wahlberg'
Alternative:
select
f.title, a.first_name, a.last_name
from film_actor as fa
join actor as a on fa.actor_id = a.actor_id
join film as f on f.film_id = fa.film_id
where
a.first_name ilike 'nick' and
a.last_name ilike 'wahlberg'
-- group by
-- having
order by f.title asc
-- limit
Section Overview:
- Timestamps and EXTRACT
- Math Functions
- String Functions
- Sub-query
- Self-Join
In Part One, we will go over a few commands that report back time and date information. These will be more useful when creating our own tables and databases, rather than querying a database.
We've already seen that PostgreSQL can hold date and time information:
- time: contains only time
- date: contains only date
- timestamp: contains date and time
- timestamptz: contains date, time, and timezone
Careful considerations should be made when designing a table and database and choosing a time data type. Depending on the situation you may or may not need the full level of timestamptz. Remember, you can always remove historical information, but you can't add it.
Let's explore functions and operations related to these specific data types:
- timezone
- now
- timeofday
- current_time
- current_date
We use the show command:
show all
show timezone
-- show runtime parameters
select now()
-- return timestamp
select timeofday()
-- return string representation of timestamp
select current_time
-- return time with timezone
select current_date
-- return date
Let's explore extracting information from a time-based data type using:
- extract()
- age()
- to_char()
extract(field FROM source) allows you to "extract" or obtain a sub-component of a date value. Valid values for field can be found in the documentation. Basic syntax:
extract(year from date_col)
In a full query it becomes:
-- select date_col becomes
select extract(year from date_col)
from t1
Example:
select extract(month from payment_date) as "Month"
from payment
group by extract(month from payment_date)
-- we can extract: year, quarter, month, week, day
age() calculates and returns the current age given a timestamp. Basic syntax:
age(date_col)
-- return for example: 13 years mon 5 days 01:34:13.003423
Example:
select age(payment_date)
from payment
to_char() is a general function to convert data types to text (see documentation). It is useful for timestamp formatting. Basic syntax:
to_char(date_col, 'mm-dd-yyyy')
Example:
select to_char(payment_date,'MM/dd/YYYY')
from payment
Challenge: During which months did payments occur? Format your answer to return back the full month name.
-- task: during which month did payments occur?
-- select extract(month from payment_date) <- this returns the months as a number
select to_char(payment_date, 'Month')
from payment
group by to_char(payment_date, 'Month')
Alternative:
select distinct(to_char(payment_date, 'Month'))
from payment
Alternative (ordering the month chronologically):
-- task: during which month did payments occur?
select distinct(extract(month from payment_date)) as "num_month", to_char(payment_date, 'Month')
from payment
order by "num_month"
Alternative (without the alias):
select distinct(extract(month from payment_date)), to_char(payment_date, 'Month')
from payment
order by date_part
Challenge: How many payments occurred on a Monday?
-- task: how many payments occurred on a Monday?
select to_char(payment_date, 'day') as weekday, count(*)
from payment
group by weekday
-- this returns the list of weekdays and their corresponding number of payments
select extract(dow from payment_date), count(*)
from payment
where extract(dow from payment_date) = 1
group by date_part
-- for dow, Sunday is 0 and Monday is 1
Alternative (more compact):
select count(*)
from payment
where extract(dow from payment_date) = 1
See documentation. Examples:
-- mathematical functions and operators: we will focus on rental_rate and replacement_cost
select round(rental_rate/replacement_cost*100,2) as "percent_cost"
from film
select round(0.1*replacement_cost,2) as deposit
from film
See documentation. PostgreSQL also provides a variety of string functions and operators that allow us to edit, combine, and alter text data columns. Examples:
select length(first_name) from customer
-- returns the length of the first_name of every record
select first_name || ' ' || last_name as full_name
from customer
-- concatenate text
select upper(first_name) || ' ' || upper(last_name) as full_name
from customer
-- concatenate text
-- use lower, upper or initcap functions to convert a string expression, values in a column, etc., to lowercase, uppercase, and proper case.
-- task: create email for customer
select lower(left(first_name,1)) || '_' || lower(last_name) || '@mail.com' as customer_email
from customer
-- using string functions and operators to create a new column that is useful for a certain situation
See documentation. We discuss how to perform a subquery and use the exists function.
A subquery allows you to construct complex queries, essentially performing a query on the results of another query. The syntax is straightforward and involves two select statements. Example:
select student, grade
from test_score
-- return the data for all the students
select avg(grade)
from test_score
-- returns the average score for the test
Question: How can we get a list of students who scored better than the average grade?
select student, grade
from test_score
where grade > (
select avg(grade)
from test_score
)
-- returns the data for students who scored better than the average grade
The subquery is performed first since it is inside the parentheses. We can also use the in operator in conjunction with a subquery to check against multiple results returned. For example, we can operate on a separate table:
select student, grade
from test_score
where student in (
select student
from honor_roll
)
-- we could also do this with a join
The exists operator is used to test for the existence of rows in a subquery. Typically a subquery is passed to exists() to check if any rows are returned by it (see w3). The exists operator returns TRUE if the subquery returns one or more records. Typical syntax:
select c1
from t1
where exists(
select c1
from t1
where cond1
)
Examples:
-- task: return film titles that have a higher than average rental cost
select title, rental_rate
from film
where rental_rate > (
select avg(rental_rate) from film
)
-- subquery: calculate avg rental rate
-- select avg(rental_rate) from film
-- task: grab film titles that have been returned between certain dates
select film_id, title
from film
where film_id in (
select inventory.film_id
from inventory
join rental on rental.inventory_id = inventory.inventory_id
where return_date between '2005-05-29' and '2005-05-30'
)
order by title asc
-- problem: the rental table does not contain the film id directly
-- solution: we grab the title with the help of the inventory table. We get the film_id from the inventory and use it in a subquery.
-- task: find customers that have at least one payment, whose amount is greater than 11
select customer_id, first_name, last_name
from customer as c
where exists (
select *
from payment as p
where
p.customer_id = c.customer_id
and p.amount > 11
)
order by first_name
-- returns 8 customers
Alternative:
-- task: find customers that have at least one payment, whose amount is greater than 11
select c.customer_id, c.first_name, c.last_name
from payment as p, customer as c
where
p.customer_id = c.customer_id
and p.amount > 11
Negation:
select count(distinct(c.customer_id))
from customer as c
where not exists (
select *
from payment as p
where
p.customer_id = c.customer_id
and p.amount > 11
)
-- returns 591
-- returns customers that do not have any payment greater than 11
-- note that this query returns customers that do not exist in the subquery result (NOT clients that have payments with an amount less than 11)
Note that we cannot negate the alternative to produce the same result, since every customer has at least one payment with an amount less than or equal to 11.
A self-join is a query in which a table is joined to itself. Self-joins are useful for comparing values in a column of rows within the same table.
The self join can be viewed as a join of two copies of the same table. The table is not actually copied, but SQL performs the command as though it were. There is no special keyword for a self join; it's simply standard JOIN syntax with the same table in both parts.
However, when using a self join it is necessary to use an alias for the table, otherwise the table names would be ambiguous. Let's see a syntax example of this. Basic syntax:
select c1.col, c2.col
from customer as c1
join customer as c2 on c1.some_col = c2.other_col
A self join is a join in which a table is joined with itself (which is also called unary relationships), especially when the table has a FOREIGN KEY which references its own PRIMARY KEY. To join a table itself means that each row of the table is combined with itself and with every other row of the table (see w3).
Example:
We store employee records in a table. Employees report to other employees, e.g. employee 1 reports to employee 2. We want results showing each employee's name and the name of the employee they report to.
With the join operator:
select e_emp.name as employee_name, e_rep.name as rep_name
from employee as e_emp
join employee as e_rep
on e_rep.emp_id = e_emp.rep_id
Without the join operator:
select e_emp.name as employee_name, e_rep.name as rep_name
from employee as e_emp, employee as e_rep
where e_rep.emp_id = e_emp.rep_id
Visually the example looks like this:
EMPLOYEE
emp_id | name | rep_id |
---|---|---|
1 | Andrew | 3 |
2 | Bob | 3 |
3 | Charlie | 4 |
4 | David | 1 |
The join produces:
EMPLOYEE AS e_emp, EMPLOYEE AS e_rep
e_emp.emp_id | e_emp.name | e_emp.rep_id | e_rep.emp_id | e_rep.name | e_rep.rep_id |
---|---|---|---|---|---|
1 | Andrew | 3 | 1 | Andrew | 3 |
1 | Andrew | 3 | 2 | Bob | 3 |
1 | Andrew | 3 | 3 | Charlie | 4 |
1 | Andrew | 3 | 4 | David | 1 |
2 | Bob | 3 | 1 | Andrew | 3 |
... | ... | ... | ... | ... | ... |
We join on foreign key = primary key:
EMPLOYEE AS e_emp JOIN EMPLOYEE AS e_rep ON e_rep.emp_id = e_emp.rep_id
e_emp.emp_id | e_emp.name | e_emp.rep_id | e_rep.emp_id | e_rep.name | e_rep.rep_id |
---|---|---|---|---|---|
1 | Andrew | 3 | 3 | Charlie | 4 |
... | ... | ... | ... | ... | ... |
Example:
-- task: find all the pairs of films that have the same length (explicit self-join)
select f_left.title, f_right.title, f_right.length
from film as f_left
join film as f_right on f_left.length = f_right.length
where f_left.title != f_right.title
order by f_right.length
Alternatives:
-- task: find all the pairs of films that have the same length (explicit self-join w/out where clause)
select f_left.title, f_right.title, f_right.length
from film as f_left
join film as f_right on
f_left.length = f_right.length and
f_left.film_id != f_right.film_id
order by f_right.length
-- task: find all the pairs of films that have the same length (implicit self-join)
select f_left.title, f_right.title, f_right.length
from film as f_left, film as f_right
where
f_left.length = f_right.length and
f_left.title != f_right.title
order by f_right.length
We will use a new database for a set of exercise questions. This database has a public and a cd schema. This means the tables in the FROM clause will have cd. in front of them.
Running Command:
/Applications/pgAdmin 4.app/Contents/SharedSupport/pg_restore --host "localhost" --port "0000" --username "postgres" --no-password --dbname "exercise" --section=pre-data --section=data --section=post-data --verbose "/.../exercise.tar"
How can you retrieve all the information from the cd.facilities table?
select *
from cd.facility
You want to print out a list of all of the facilities and their cost to members. How would you retrieve a list of only facility names and costs?
select name, m_cost
from cd.facility
How can you produce a list of facilities that charge a fee to members?
select name, m_cost
from cd.facility
where m_cost > 0
How can you produce a list of facilities that charge a fee to members, and that fee is less than 1/50th of the monthly maintenance cost? Return the facid, facility name, member cost, and monthly maintenance of the facilities in question.
select fac_id, name, m_cost, maintenance
from cd.facility
where
m_cost > 0
and m_cost < maintenance/50.0
How can you produce a list of all facilities with the word 'Tennis' in their name?
select *
from cd.facility
where name ilike '%tennis%'
How can you retrieve the details of facilities with ID 1 and 5? Try to do it without using the OR operator.
select *
from cd.facility
where fac_id in (1,5)
How can you produce a list of members who joined after the start of September 2012? Return the memid, surname, firstname, and joindate of the members in question.
select *
from cd.member
where join_date between '2012-09-01' and now()
Alternative:
select *
from cd.member
where join_date::date >= '2012-09-01'
How can you produce an ordered list of the first 10 surnames in the members table? The list must not contain duplicates.
select distinct(last_name)
from cd.member
where last_name != 'GUEST'
order by last_name
limit 10
You'd like to get the signup date of your last member. How can you retrieve this information?
select max(join_date)
from cd.member
-- return max
Alternative 1:
select first_name, last_name, join_date
from cd.member
order by join_date desc
limit 1
-- order and limit
Alternative 2:
select first_name, last_name, join_date
from cd.member
where join_date in (
select max(join_date)
from cd.member
)
-- where ... in clause; condition is single value
Alternative 3:
select first_name, last_name, join_date
from cd.member as m_left
join (
select max(join_date) as max_join_date
from cd.member
) as m_right
on m_left.join_date = m_right.max_join_date
-- inner join on single value
Produce a count of the number of facilities that have a cost to guests of 10 or more.
select count(*)
from cd.facility
where g_cost >= 10
Produce a list of the total number of slots booked per facility in the month of September 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.
select fac_id, sum(slot) as booked_slots_sept
from cd.booking
where extract(month from start_time) = 9
-- where start_time::date >= '2012-09-01' and start_time::date <= '2012-10-01'
group by fac_id
order by sum(slot) asc
Produce a list of facilities with more than 1000 slots booked. Produce an output table consisting of facility id and total slots, sorted by facility id.
select fac_id, sum(slot)
from cd.booking
group by fac_id
having sum(slot) > 1000
order by fac_id
How can you produce a list of the start times for bookings for tennis courts, for the date '2012-09-21'? Return a list of start time and facility name pairings, ordered by the time.
select b.start_time, f.name
from cd.booking as b
join cd.facility as f on b.fac_id = f.fac_id
where
b.start_time::date = '2012-09-21'
and f.name ilike '%tennis%court%'
order by b.start_time
-- keyword as is optional
Alternative:
select start_time, f.name
from cd.booking as b
join cd.facility as f on b.fac_id = f.fac_id
where
start_time between '2012-09-21' and '2012-09-22'
and f.name ilike '%tennis%court%'
order by start_time
How can you produce a list of the start times for bookings by members named 'David Farrell'?
select start_time
from cd.booking as b
join cd.member as m on b.mem_id = m.mem_id
where
m.first_name = 'David'
and m.last_name = 'Farrell'
Let's now shift our focus to creating our own databases and tables.
Section Overview
- Data Types
- Primary and Foreign Keys
- Constraints
- create, insert, update, delete, alter, drop
We first focus on learning a few theoretical concepts, such as choosing the correct data type for a stored value and setting possible constraints on it. We will also learn about primary and foreign keys.
Main data types in SQL:
- Boolean: True or False
- Character: char, varchar, and text
- Numeric: integer and floating-point number
- Temporal: date, time, timestamp, and interval
- UUID: Universally Unique Identifiers
- Array: stores an array of strings, numbers, etc.
- JSON
- Hstore key-value pair
- Special types such as network address and geometric data.
When creating databases and tables, you should carefully consider which data types should be used for the data to be stored. Review the documentation to see the limitations of data types.
A primary key is a column or a group of columns used to identify a row uniquely in a table. For example, in our dvdrental database we saw that customers had a unique, non-null customer_id column as their primary key.
Primary keys are also important since they allow us to easily discern which columns should be used for joining tables together. Later we will learn about the serial data type.
A foreign key is a field or group of fields in a table that uniquely identifies a row in another table. A foreign key is defined in a table that references the primary key of the other table. The table that contains the foreign key is called the referencing table or child table. The table to which the foreign key refers is called the referenced table or parent table. A table can have multiple foreign keys depending on its relationships with other tables.
Recall in the dvdrental database payment table, each payment row had its unique payment_id (a primary key) and identified the customer that made the payment through the customer_id (a foreign key since it references the customer table's primary key).
You may begin to realize primary key and foreign key typically make good column choices for joining together two or more tables. When creating tables and defining columns, we can use constraints to define columns as being a primary key, or attaching a foreign key relationship to another table.
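A minimal sketch of the relationship (hypothetical tables, not part of the dvdrental database; the create table syntax is covered in detail below):
create table customer_demo(
customer_id serial primary key,
name varchar(100) not null
)
create table payment_demo(
payment_id serial primary key,
-- foreign key: each payment must reference an existing customer
customer_id int references customer_demo(customer_id),
amount numeric(5,2) not null
)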
Constraints are the rules enforced on data columns of a table. They are used to prevent invalid data from being entered into the database. This ensures the accuracy and reliability of the data in the database.
Constraints can be divided into two main categories:
- Column Constraints: Constrains the data in a column to adhere to certain conditions.
- Table Constraints: Applied to the entire table rather than to an individual column.
The most common column constraints used:
- not null constraint: ensures that a column cannot have a NULL value.
- unique constraint: ensures that all values in a column are different.
- primary key: uniquely identifies each row/record in a database table.
- foreign key: constrains data based on columns in other tables.
- check constraint: ensures that all values in a column satisfy certain conditions.
- exclusion constraint: ensures that if any two rows are compared on the specified column or expression using the specified operator, not all of these comparisons will return TRUE.
The most common table constraints used:
- check(condition): checks a condition when inserting or updating data.
- references: constrains the value stored in the column to one that exists in a column in another table.
- unique(column_list): forces the values stored in the columns listed inside the parentheses to be unique.
- primary key(column_list): allows you to define a primary key that consists of multiple columns.
General syntax:
CREATE TABLE table_name(
column_name TYPE column_constraint,
column_name TYPE column_constraint,
table_constraint table_constraint
)
INHERITS (existing_table_name);
Example:
create table player(
player_id serial primary key,
age smallint not null
)
Always refer to the documentation when choosing data types. The serial datatype:
- In PostgreSQL, a sequence is a special kind of database object that generates a sequence of integers.
- A sequence is often used as the primary key column in a table.
- serial will create a sequence object and set the next value generated by the sequence as the default value for the column.
- This is perfect for a primary key, because it logs unique integer entries for you automatically upon insertion.
- If a row is later removed, the column with the serial data type will not adjust, marking the fact that a row was removed from the sequence, e.g. for 1,2,3,5,6,7 you know row 4 was removed at some point (see the sketch after this list).
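A minimal sketch of this gap behaviour (hypothetical throwaway table):
create table seq_demo(
id serial primary key,
label varchar(20)
)
insert into seq_demo(label) values ('a'), ('b'), ('c')
delete from seq_demo where id = 2
insert into seq_demo(label) values ('d')
-- selecting from seq_demo now shows ids 1, 3, 4: the gap at 2 remains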
We create the tables for a simple application (in the account table we include a password and an email column; see here and here for best practices):
create table account(
user_id serial primary key,
username varchar(50) unique not null,
password varchar(50) not null,
email varchar(255) unique not null,
create_date timestamp not null,
last_login timestamp
)
create table job(
job_id serial primary key,
job_name varchar(255) unique not null
)
create table account_job(
user_id int references account(user_id),
-- serial should be only used as a primary key for the table it is in
job_id int references job(job_id),
hire_date timestamp
)
The insert into statement is used to insert new records in a table. General syntax:
insert into table(col1, col2, ...)
values
(val1_1, val2_1, ...),
(val1_2, val2_2, ...)
...
Syntax for inserting values from another table:
insert into table(col1, col2, ...)
select col1, col2, ...
from another_table
where condition
Keep in mind, the inserted row values must match up with the table's columns, including constraints. serial columns do not need to be provided a value (serial is backed by a sequence object, so it automatically uses the next available integer for that row).
Example:
insert into account(username, password, email, create_date)
values
('Luis','password','[email protected]',now())
insert into job(job_name)
values
('Astronaut')
insert into account_job(user_id, job_id, hire_date)
values
(1,1,now())
The update statement is used to modify the existing records in a table. General syntax:
update table
set
col1 = val1,
col2 = val2,
...
where condition
Example:
update account
set last_login = now()
where last_login is null
We can also update every row if we don't add a where condition:
update account
set last_login = now()
We can also set everything based on another column:
update account
set last_login = create_date
We can also use another table's values (a so-called UPDATE JOIN):
update table_left
set col_left = table_right.col_right
from table_right
where table_left.id = table_right.id
We can also return the affected rows:
update account
set last_login = create_date
returning user_id, last_login
Examples:
update account
set last_login = now()
returning *
update account
set last_login = create_date
returning *
update account_job as aj
set hire_date = a.create_date
from account as a
where a.user_id = aj.user_id
returning *
Note: Be careful when updating records in a table! Notice the where clause in the update statement. The where clause specifies which record(s) should be updated. If you omit the where clause, all records in the table will be updated!
The delete statement is used to delete existing records in a table. Basic syntax:
delete from t1
where id = 1
We can delete rows based on their presence in other tables:
delete from t1
using t2
where t1.id = t2.id
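For example, with the tables created earlier in this section (a sketch; the condition is made up purely for illustration):
-- remove account_job rows whose account has never logged in
delete from account_job
using account
where account_job.user_id = account.user_id
and account.last_login is null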
We can delete all rows in a table:
delete from t1
Note: Be careful when deleting records in a table! Notice the where clause in the delete statement. The where clause specifies which record(s) should be deleted. If you omit the where clause, all records in the table will be deleted!
Similar to the update statement, we can also add a returning call to return the rows that were removed.
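A short sketch using the job table from earlier (the job name is simply the one we inserted above):
-- delete the row and show what was removed
delete from job
where job_name = 'Astronaut'
returning *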
The alter table statement is used to add, delete, or modify columns in an existing table. It is also used to add and drop various constraints on an existing table. See the documentation for all the options available in the synopsis.
In general the alter clause allows for changes to an existing table structure, e.g.:
- Adding, dropping or renaming columns
- Changing a column's data type
- Setting default values for a column
- Adding check constraints
- Renaming the table
General syntax:
alter table table_name
-- action
-- rename table
rename to new_name
-- rename column
rename column col_name to new_col_name
-- add columns
add column new_col type
-- add unique
add unique(col_name)
-- remove columns
drop column col_name
-- alter column type
alter column col_name type data_type
-- alter column defaults and constraints
alter column col_name set default value
alter column col_name drop default
alter column col_name set not null
alter column col_name drop not null
-- add a named constraint
add constraint constraint_name constraint_definition
Examples:
-- insert into info(title)
-- values ('some new title')
-- returns err: violates not-null constraint
alter table info
alter column ppl drop not null
To alter a unique constraint, see here.
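As a sketch, a unique constraint can be dropped and re-added by name. The constraint name below follows PostgreSQL's default naming scheme (table_name_col_name_key) and is an assumption, so check the real name first (e.g. with \d table_name in psql):
-- drop the existing unique constraint (the name below is an assumption)
alter table table_name
drop constraint table_name_col_name_key
-- add it back under an explicit name
alter table table_name
add constraint table_name_col_name_key unique(col_name)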
drop
allows for the complete removal of a column in a table. In PostgreSQL this will also automatically remove all of its indexes and constraints involving the column. However, it will not remove columns used in views, triggers, or stored procedures without the additional cascade
clause. General syntax:
alter table table_name
drop column col_name -- cascade
-- add the cascade keyword to remove all dependencies
Check for existence to avoid error:
alter table table_name
drop column if exists col_name
Drop multiple columns:
alter table table_name
drop column col1,
drop column col2
The check
constraint is used to limit the value range that can be placed in a column. If you define a check
constraint on a column it will allow only certain values for this column. If you define a check
constraint on a table it can limit the values in certain columns based on values in other columns in the row. Basically, the check
constraint allows us to create more customized constraints that adhere to a certain condition. General syntax:
create table example(
id serial primary key,
age smallint check (age > 21),
parent_age smallint check (parent_age > age)
)
Example:
create table employee(
emp_id serial primary key,
first_name varchar(50) not null,
last_name varchar(50) not null,
birthdate date check(birthdate > '1900-01-01'),
hire_date date check(hire_date > birthdate),
salary int check (salary > 0)
)
This will test your knowledge of the previous section, focused on creating databases and table operations. This test will actually consist of a more open-ended assignment. Complete the following task:
- Create a new database called "school". This database should have two tables: teachers and students.
- The students table should have columns for student_id, first_name, last_name, homeroom_nr, phone, email, and graduation year.
- The teachers table should have columns for teacher_id, first_name, last_name, homeroom_nr, department, email, and phone.
- The constraints are mostly up to you, but your table constraints do have to consider the following:
- We must have a phone number to contact students in case of an emergency.
- We must have ids as the primary key of the tables
- Phone numbers and emails must be unique to the individual.
- Once you've made the tables, insert a student named Mark Watney (student_id=1) who has a phone number of 777-555-1234 and doesn't have an email. He graduates in 2035 and has 5 as a homeroom number.
- Then insert a teacher named Jonas Salk (teacher_id = 1) who has a homeroom number of 5 and is from the Biology department. His contact info is: [email protected] and a phone number of 777-555-4321.
Solution:
create table student(
student_id serial primary key,
first_name varchar(50) not null,
last_name varchar(50) not null,
homeroom_nr varchar(50),
phone varchar(50) not null,
email varchar(255),
grad_year date
)
create table teacher(
teacher_id serial primary key,
first_name varchar(50) not null,
last_name varchar(50) not null,
homeroom_nr smallint,
department varchar(255),
email varchar(255),
phone varchar(255)
)
-- change data type from date to smallint to store only the grad year
alter table student
alter column grad_year type smallint using (extract(year from grad_year)::smallint)
-- change data type from varchar to smallint to store only a number
alter table student
alter column homeroom_nr type smallint using (homeroom_nr::smallint)
-- add unique constraint to phone and email columns (the constraint is assigned one name and covers the column pair)
alter table student
add unique(phone, email)
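Note that the composite constraint above only makes the combination of phone and email unique. If phone and email should each be unique on their own, as the task asks, two separate constraints can be added instead; a sketch (the constraint names are made up for illustration):
alter table student
add constraint student_phone_unique unique(phone)
alter table student
add constraint student_email_unique unique(email)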
insert into student(
first_name,
last_name,
phone,
grad_year,
homeroom_nr
)
values(
'Mark',
'Watney',
'777-555-1234',
2035,
5
)
insert into teacher(
first_name,
last_name,
homeroom_nr,
department,
email,
phone
)
values(
'Jonas',
'Salk',
5,
'Biology',
'[email protected]',
'777-555-4321'
)
Section Overview
- case
- coalesce
- nullif
- cast
- Views
- Import and Export Functionality
These keywords and functions will allow us to add logic to our commands and workflows in SQL.
We can use the case
statement to only execute SQL code when certain conditions are met. This is very similar to if ... else
statements in other programming languages.
There are two main ways to use a case
statement, either a general case
or a case
expression. Both methods can lead to the same results. Let's first show the syntax for a general case
(more flexible):
select
case
when cond1 then res1
when cond2 then res2
else default_res
end
from t1
Example:
| a |
|---|
| 1 |
| 2 |
select a,
case
when a = 1 then 'one'
-- if the instance is equal to one, return the string 'one'
when a = 2 then 'two'
else 'other'
end -- as new_col_name
from test
-- this generates a new column named 'case'
| a | case |
|---|------|
| 1 | one |
| 2 | two |
The case
expression syntax first evaluates an expression then compares the result with each value in the when
clauses sequentially (it only works for equality):
case expression
when val1 then res1
when val2 then res2
else default_res
end
select a,
case a
when 1 then 'one'
when 2 then 'two'
else 'other'
end -- as new_col_name
from test
-- yields the same result as the query above
Example:
select customer_id,
case
when (customer_id <= 100 ) then 'Premium'
when (customer_id between 101 and 200) then 'Plus'
else 'Normal'
end as tier
from customer
order by customer_id
select customer_id,
case customer_id
when 2 then 'First Place'
when 5 then 'Second Place'
when 11 then 'Third Place'
else null
end as raffle_results
from customer
order by raffle_results
Example:
select rental_rate, count(*)
from film
group by rental_rate
order by rental_rate
We can reformat the results of the query above with the case
statement:
select
sum(
case
rental_rate
when 0.99 then 1
else 0
end
) as bargains,
sum(
case
rental_rate
when 2.99 then 1
else 0
end
) as regular,
sum(
case
rental_rate
when 4.99 then 1
else 0
end
) as premium
from
film
Alternative:
select
case
rental_rate
when 0.99 then 'bargains'
when 2.99 then 'regular'
when 4.99 then 'premium'
end as tier,
rental_rate,
count(*)
from
film
group by
rental_rate
Challenge: We want to know and compare the various amounts of films we have per movie rating. Use case
and the dvdrental database:
select
sum(
case rating
when 'NC-17' then 1
else 0
end) as nc17,
sum(
case rating
when 'G' then 1
else 0
end ) as g,
sum(
case rating
when 'PG' then 1
else 0
end) as pg,
sum(
case rating
when 'PG-13' then 1
else 0
end) as pg13,
sum(
case rating
when 'R' then 1
else 0
end ) as r
from film
Better option:
select rating, count(*)
from film
group by rating
The coalesce
function accepts an unlimited number of arguments. It returns the first argument that is not null. If all arguments are null, the coalesce
function will return null.
coalesce(arg_1,...,arg_n)
The coalesce
function becomes useful when querying a table that contains null values and substituting it with another value. Example:
select item, (price-discount) as final_price
from price_table
-- this query returns null as the final_price if the discount is null
This becomes:
select item, (price-coalesce(discount,0)) as final_price
from price_table
The cast
operator lets you convert from one data type into another. Keep in mind that not every instance of a data type can be cast
to another data type; the conversion must be reasonable: for example, '5' to an integer will work, but 'five' to an integer will not. There are two options:
select cast('5' as int)
select '5'::int
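Casting also works on whole columns. A quick sketch on the dvdrental payment table (assuming payment_date is a timestamp column, as in the restored database):
select payment_id, payment_date::date as payment_day
from payment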
The NULLIF function takes in 2 arguments and returns null
if both are equal, otherwise it returns the first argument passed.
Example:
nullif(10,10)
-- returns null
This becomes very useful in cases where a null
value would cause an error or unwanted result.
Example: Given this table, calculate the ratio of department A to department B.
| Name | Dept |
|------|------|
| Alice | A |
| Bob | A |
| Charlie | B |
select(
sum(case when dept = 'A' then 1 else 0 end)/
sum(case when dept = 'B' then 1 else 0 end)
) as dept_ratio
from dept
But what happens when department B has no people?
select(
sum(case when dept = 'A' then 1 else 0 end)/
nullif(sum(case when dept = 'B' then 1 else 0 end),0)
) as dept_ratio
from dept
If the denominator is 0, we return null instead, and the division by null then yields null rather than raising a division-by-zero error.
A view is a virtual table based on the result-set of an SQL statement.
A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database. You can add SQL statements and functions to a view and present the data as if the data were coming from one single table.
Often there are specific combinations of tables and conditions that you find yourself using quite often for a project. Instead of having to perform the same query over and over again as a starting point, you can create a view
to quickly see this query with a simple call.
Basic syntax:
select c1,c2,c3,c4
from t1
join t2
on t1.c1 = t2.c3
Becomes (once the query has been saved as a view):
select * from view
A view is a database object that stores a query. A view can be accessed as a virtual table in PostgreSQL. Notice that a view does not store data physically; it simply stores the query and turns a complex query into a virtual table. You can also update and alter existing views.
Examples:
create view customer_info as
select first_name, last_name, address
from customer as c
join address as a
on c.address_id = a.address_id
Afterwards we can call:
select * from customer_info
To modify the view we can:
create or replace view customer_info as
select first_name, last_name, address, district
from customer as c
join address as a
on c.address_id = a.address_id
To drop / delete / remove the view we can (check first if it exists to prevent errors):
drop view if exists customer_info
To rename:
alter view customer_info rename to customer_info_new
Useful links:
- Documentation on COPY Function
- How to import CSV file data into a PgSQL table?
- How to import and export data using CSV files in PostgreSQL
- Create table from csv file with headers
Not every outside data file will work: variations in formatting, macros, data types, etc. may prevent the import command from reading the file, at which point you must edit your file to be compatible with PgSQL.
The Import command DOES NOT create a table for you. It assumes a table is already created. Currently there is no automated way within pgAdmin to create a table directly from a .csv file.
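So the target table has to exist before the import. A minimal sketch matching the copy command below (the column types are assumptions; adjust them to your CSV):
create table simple(
id int,
a varchar(50),
b varchar(50),
c varchar(50)
)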
Import example:
command " "\\copy public.simple (id, a, b, c) FROM '/Users/aceituno/Desktop/projects-ss21/sql/simple_table.csv' DELIMITER ',' CSV HEADER QUOTE '\"' ESCAPE '''';""
We imported twice so we deleted duplicates with:
-- task: delete duplicates
-- solution: find an identifying column and use the ctid (physical location of the row version)
-- assumptions: we can identify duplicates based on column id
delete from simple ls using(
select min(ctid) as ctid ,id
from simple
group by id
having count(*)>1
) rs
where ls.id = rs.id
and ls.ctid <> rs.ctid
-- The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
-- ctid: The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change each time it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.
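To see the physical row identifiers the query above relies on, a quick sketch against the simple table from the import example:
select ctid, id, a, b, c
from simple
order by id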
Logic of deleting duplicates:
-- general delete syntax
delete from ls
using rs
where ls.id = rs.id
-- our specific case
delete from ls
using rs
where ls.id = rs.id
and rs.ctid != ls.ctid
-- constructing rs
select min(ctid) as ctid, id
from ls
group by id
having count(*)>1
In this section of the course I'll give you a quick overview of how to use the psycopg2 library to interact with a PostgreSQL database from Python.
From SQL Window Function. Documentation on Window Function. Documentation on Window Functions.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities. For example these queries:
select max(salary) as max_salary
from employee
select dept_name, max(salary) as max_salary_per_department
from employee
group by dept_name
Become these:
select e.*, max(salary) over() as max_salary
from employee as e
-- the over clause with no parameter produces a window of the whole table
select e.*, max(salary) over( partition by dept_name ) as max_salary_per_dept
from employee as e
-- shows max salary per deparment in addition to employee data
Note that the rows considered by a window function are those of the “virtual table” produced by the query's FROM
clause (as filtered by its WHERE
, GROUP BY
, and HAVING
clauses if any). For example, a row removed because it does not meet the WHERE condition is not seen by any window function.
select e.*, row_number() over() as rn
from employee e
-- gives each record a unique identifier; to use it, we can pass the query as a subquery
select e.*, row_number() over( partition by dept_name ) as rn
from employee e
-- gives each record per department an identifier
Note that you can also control the order in which rows are processed by window functions using ORDER BY within OVER. (The window ORDER BY does not even have to match the order in which the rows are output.) Here is an example:
SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC)
FROM empsalary
From our previous example
select e.*, row_number() over(partition by dept_name order by emp_id) as rn
from employee e
-- We make the assumption that emp_id reflects when the employee joined the company
We use this query as a subquery to fetch the first 2 employees from each department to join the company
select *
from
(select e.*, row_number() over(partition by dept_name order by emp_id) as rn from employee e) sq
where sq.rn < 3
select e.*, rank() over(partition by dept_name order by salary desc) as rnk
from employee e
We use this subquery to determine the employees with the top 3 salaries per department
select *
from (select e.*, rank() over(partition by dept_name order by salary desc) as rnk
from employee e) sq
where sq.rnk < 4
select e.*, rank() over(partition by dept_name order by salary desc) as rnk, dense_rank() over(partition by dept_name order by salary desc) as dense_rnk
from employee e
Alternative:
select e.*, rank() over w as rnk, dense_rank() over w as dense_rnk
from employee e
window w as (partition by dept_name order by salary desc)
Note that when a query involves multiple window functions, it is possible to write out each one with a separate OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER. For example:
SELECT sum(salary) OVER w, avg(salary) OVER w
FROM empsalary
WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);
Task: Write a query to display whether the salary of an employee is higher, lower, or equal to that of the previous employee. Preparation:
select e.*, lag(salary) over (partition by dept_name order by emp_id) as prev_emp_salary
from employee e
We can pass some other arguments with the lag function (see Documentation).
select e.*, lag(salary,2,0) over (partition by dept_name order by emp_id) as prev_emp_salary
from employee e
Lead gives us the rows that are following the current row.
select e.*, lead(salary) over (partition by dept_name order by emp_id) as next_emp_salary
from employee e
Solution:
select
e.*,
lag(salary) over w as prev_emp_salary,
case
when e.salary > lag(salary) over w then 'Higher than previous employee'
when e.salary < lag(salary) over w then 'Lower than previous employee'
when e.salary = lag(salary) over w then 'Same as previous employee'
end sal_range
from employee e
window w as (partition by dept_name order by emp_id)
Another example from our database:
select
p.*,
lag(amount) over w1 as prev_amount,
case
when amount > lag(amount) over w2 then 'Higher than previous amount'
when amount < lag(amount) over w2 then 'Lower than previous amount'
when lag(amount) over w2 is null then 'No info'
else 'The same as previous amount'
end as amount_range
from payment p
window w1 as (partition by customer_id order by payment_date), w2 as (partition by customer_id order by payment_date)
-- note that w1 and w2 are identical
The same example more concise:
select
p.*,
lag(amount) over w as prev_amount,
case
when amount > lag(amount) over w then 'Higher than previous amount'
when amount < lag(amount) over w then 'Lower than previous amount'
when lag(amount) over w is null then 'No info'
else 'The same as previous amount'
end as amount_range
from payment p
window w as (partition by customer_id order by payment_date)
From SQL Window Function Part 2. Documentation on Window Functions. Task: Write a query to display the most expensive product under each category (corresponding to each record).
select p.*, first_value(product_name) over(partition by product_category order by price desc) as most_exp_per_category
from product p
Example from DVD rental data:
select distinct(customer_id), first_value(payment_id) over(partition by customer_id order by amount desc) as most_exp_payment
from payment p
order by customer_id
This window function gives roughly the same result as the following group by
with an aggregate function; the grouped version needs a join back to the per-customer maximum to be valid SQL, and it can return several rows per customer in case of ties.
select p.customer_id, p.payment_id
from payment p
join (
select customer_id, max(amount) as max_amount
from payment
group by customer_id
) m on p.customer_id = m.customer_id and p.amount = m.max_amount
order by p.customer_id
Task: Write a query to display the least expensive product under each category (corresponding to each record)
select *, last_value(product_name) over(partition by product_category order by price desc) as least_exp_product
from product
This delivers the wrong result! The reason is the default frame clause.
Note that first_value
, last_value
, and nth_value
consider only the rows within the "window frame", which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE or ROWS) to the OVER clause. See Section 4.2.8 for more information about frame specifications. The default frame clause is:
select
p.*,
last_value(product_name) over(
partition by product_category
order by price desc
range between unbounded preceding and current row
)
from product p
To use the last_value
window function, we need to adjust the frame clause:
select
p.*,
last_value(product_name) over(
partition by product_category
order by price desc
range between unbounded preceding and unbounded following
)
from product p
An alternative frame clause uses rows instead of range:
select
p.*,
last_value(product_name) over(
partition by product_category
order by price desc
rows between unbounded preceding and unbounded following
)
from product p
This alternative becomes relevant when we have duplicates. rows
considers the current row while range
considers the last row of all duplicate values. We can also specify the number of rows:
select
p.*,
last_value(product_name) over(
partition by product_category
order by price desc
rows between 2 preceding and 2 following
)
from product p
An example from our database:
-- fetch lowest amount per customer
-- default frame: range between unbounded preceding and current row
select
p.*,
-- first_value(payment_id) over(partition by customer_id order by amount desc) as highest_payment,
last_value(amount) over(
partition by customer_id
order by amount desc
range between unbounded preceding and unbounded following
) as lowest_payment
from payment p
-- order by amount
select
*,
first_value(product_name) over w as most_exp_product,
last_value(product_name) over w as least_exp_product
from product
window w as (partition by product_category order by price desc range between unbounded preceding and unbounded following)
-- order by
Fetch a value from any particular position. Task: Write a query to display the second most expensive product under each category.
select
*,
nth_value(product_name, 2) over(partition by product_category order by price desc) as second_most_expensive
from product
If the number of records is less than the parameter, the nth_value
function will return null
. This query won't display correctly for the first record (see frame
clause above). We can fix this:
select
*,
nth_value(product_name, 2) over w as second_most_expensive
from product
window w as (partition by product_category order by price desc range between unbounded preceding and unbounded following)
Task: Write a query to segregate all the phones into expensive, mid-range, and cheaper phones.
with phone_bucket as (
select
*,
ntile(3) over w as bucket
from product
where product_category = 'Phone'
)
select
product_name,
case
when bucket = 1 then 'Expensive Phone'
when bucket = 2 then 'Mid Phone'
when bucket = 3 then 'Cheaper Phone'
end
from phone_bucket
cume_dist = (# of rows preceding or with the same value as the current row) / (total # of rows)
Task: Write a query to fetch all products constituting the first 30% of the data in the products table based on price, i.e. fetch the top 30% of products by price.
select
*,
cume_dist() over (order by price desc) as cume_dist_by_price
from product
-- cast cume_dist() to numeric to use round function
We can clean up the table:
select
*,
round(cume_dist() over (order by price desc)::numeric*100,2) as cume_dist_by_price
from product
-- cast cume_dist() to numeric to use round function
We can then use this query as a subquery and filter our data:
with cume_dist_t as (
select
*,
round(cume_dist() over (order by price desc)::numeric*100,2) as cume_dist_by_price
from product
)
select product_name, (cume_dist_by_price || '%') as cume_dist_pct
from cume_dist_t
where cume_dist_by_price <= 30
Similar to cume_dist. Relative rank of the current row: percent_rank = (rank of the current row - 1) / (total # of rows - 1)
Task: Write a query to identify how much more expensive (in percentage terms) the phone "Galaxy Z Fold 3" is compared to all other products.
select
*,
round(percent_rank() over(order by price)::numeric*100,2) as pct_rank
from product
We use this as our subquery:
with pct_rank as (
select
*,
round(percent_rank() over(order by price)::numeric*100,2) as pct_rank
from product
)
select product_name, pct_rank
from pct_rank
where product_name = 'Galaxy Z Fold 3'
From SQL WITH Clause.
Task: Find employees with a salary higher than the average.
with average_salary (avg_sal) as (
select avg(salary)::int
from employee
)
select *
from employee, average_salary
where salary > avg_sal
Task: Find stores with sales higher than the average across all stores.
-- find sales per store
select store_id, sum(cost) as sales_per_store
from sales
group by store_id
-- compute average revenue across all stores
select round(avg(sales_per_store)::numeric, 2) as avg_sales
from (
select store_id, sum(cost) as sales_per_store
from sales
group by store_id
) sales_per_store
-- filter by the comparison: average revenue < store revenue
select *
from (
select store_id, sum(cost) as sales_per_store
from sales
group by store_id
) sales_per_store
join (
select round(avg(sales_per_store)::numeric, 2) as avg_sales
from (
select store_id, sum(cost) as sales_per_store
from sales
group by store_id
) sales_per_store
) avg_sales
on sales_per_store.sales_per_store > avg_sales.avg_sales
This query becomes:
with sales_per_store (store_id, sales_per_store) as (
select store_id, sum(cost) as sales_per_store
from sales
group by store_id
), avg_sales (avg_sales) as (
select round(avg(sales_per_store)::numeric, 2) as avg_sales
from sales_per_store
)
select *
from sales_per_store, avg_sales
where sales_per_store > avg_sales
Example from our database:
with revenue (customer_id, revenue) as (
select customer_id, amount as revenue
from payment
), revenue_per_store (customer_id, revenue_per_store) as (
select customer_id, sum(revenue) as revenue_per_store
from revenue
group by customer_id
order by customer_id
), avg_revenue (avg_revenue) as (
select round(avg(revenue_per_store) :: numeric, 2) as avg_revenue
from revenue_per_store
)
select customer_id, revenue_per_store
from revenue_per_store, avg_revenue
where revenue_per_store > avg_revenue
From Practice Complex SQL Queries.
-- Query 1:
-- Write a SQL query to fetch all the duplicate records from a table.
--Tables Structure:
create table users (user_id int, user_name varchar(30) not null, email varchar(50));
insert into users
values (1, 'Sumit', '[email protected]'),
(2, 'Reshma', '[email protected]'),
(3, 'Farhana', '[email protected]'),
(4, 'Robin', '[email protected]'),
(5, 'Robin', '[email protected]'),
(4, 'Robin', '[email protected]');
select *
from users;
Solutions:
select max(user_id), user_name, email
from users
group by user_name, email
having count(user_name) > 1 and count(email) > 1
Solution from video: Use a window function with row_number()
.
with rn_dups as (
select *, row_number() over(partition by user_name, email order by user_id) as rn
from users
order by user_id
)
select user_name, email, max(rn) as nr_of_dups
from rn_dups
where rn > 1
group by user_name, email
-- Query 2:
-- Write a SQL query to fetch the second record from an employee table.
--Tables Structure:
-- drop table if exists employee;
create table employee
( emp_ID int primary key
, emp_NAME varchar(50) not null
, DEPT_NAME varchar(50)
, SALARY int);
insert into employee values(101, 'Mohan', 'Admin', 4000);
insert into employee values(102, 'Rajkumar', 'HR', 3000);
insert into employee values(103, 'Akbar', 'IT', 4000);
insert into employee values(104, 'Dorvin', 'Finance', 6500);
insert into employee values(105, 'Rohit', 'HR', 3000);
insert into employee values(106, 'Rajesh', 'Finance', 5000);
insert into employee values(107, 'Preet', 'HR', 7000);
insert into employee values(108, 'Maryam', 'Admin', 4000);
insert into employee values(109, 'Sanjay', 'IT', 6500);
insert into employee values(110, 'Vasudha', 'IT', 7000);
insert into employee values(111, 'Melinda', 'IT', 8000);
insert into employee values(112, 'Komal', 'IT', 10000);
insert into employee values(113, 'Gautham', 'Admin', 2000);
insert into employee values(114, 'Manisha', 'HR', 3000);
insert into employee values(115, 'Chandni', 'IT', 4500);
insert into employee values(116, 'Satya', 'Finance', 6500);
insert into employee values(117, 'Adarsh', 'HR', 3500);
insert into employee values(118, 'Tejaswi', 'Finance', 5500);
insert into employee values(119, 'Cory', 'HR', 8000);
insert into employee values(120, 'Monica', 'Admin', 5000);
insert into employee values(121, 'Rosalin', 'IT', 6000);
insert into employee values(122, 'Ibrahim', 'IT', 8000);
insert into employee values(123, 'Vikram', 'IT', 8000);
insert into employee values(124, 'Dheeraj', 'IT', 11000);
select * from employee;
Solution:
with rn as (
select *, row_number() over(order by emp_id) as rn
from employee
)
select *
from rn
where rn = 2
-- Query 3:
-- Write a SQL query to display only the details of employees who either earn the highest salary or the lowest salary in each department from the employee table.
--Tables Structure:
-- drop table if exists employee;
create table employee
( emp_ID int primary key
, emp_NAME varchar(50) not null
, DEPT_NAME varchar(50)
, SALARY int);
insert into employee values(101, 'Mohan', 'Admin', 4000);
insert into employee values(102, 'Rajkumar', 'HR', 3000);
insert into employee values(103, 'Akbar', 'IT', 4000);
insert into employee values(104, 'Dorvin', 'Finance', 6500);
insert into employee values(105, 'Rohit', 'HR', 3000);
insert into employee values(106, 'Rajesh', 'Finance', 5000);
insert into employee values(107, 'Preet', 'HR', 7000);
insert into employee values(108, 'Maryam', 'Admin', 4000);
insert into employee values(109, 'Sanjay', 'IT', 6500);
insert into employee values(110, 'Vasudha', 'IT', 7000);
insert into employee values(111, 'Melinda', 'IT', 8000);
insert into employee values(112, 'Komal', 'IT', 10000);
insert into employee values(113, 'Gautham', 'Admin', 2000);
insert into employee values(114, 'Manisha', 'HR', 3000);
insert into employee values(115, 'Chandni', 'IT', 4500);
insert into employee values(116, 'Satya', 'Finance', 6500);
insert into employee values(117, 'Adarsh', 'HR', 3500);
insert into employee values(118, 'Tejaswi', 'Finance', 5500);
insert into employee values(119, 'Cory', 'HR', 8000);
insert into employee values(120, 'Monica', 'Admin', 5000);
insert into employee values(121, 'Rosalin', 'IT', 6000);
insert into employee values(122, 'Ibrahim', 'IT', 8000);
insert into employee values(123, 'Vikram', 'IT', 8000);
insert into employee values(124, 'Dheeraj', 'IT', 11000);
select * from employee;
Solution:
with max_salary as (
select dept_name, max(salary)
from employee
group by dept_name
), min_salary as (
select dept_name, min(salary)
from employee
group by dept_name
)
select
e.emp_id, e.emp_name, e.dept_name, e.salary,
case
when e.salary = mas.max then 'Highest Salary'
else 'Lowest Salary'
end as description
from employee e, max_salary mas, min_salary mis
where
e.dept_name = mas.dept_name and e.salary = mas.max or
e.dept_name = mis.dept_name and e.salary = mis.min
order by e.dept_name, salary desc
-- Query 4:
-- From the doctors table, fetch the details of doctors who work in the same hospital but in a different speciality.
--Table Structure:
-- drop table if exists doctors;
create table doctors
(
id int primary key,
name varchar(50) not null,
speciality varchar(100),
hospital varchar(50),
city varchar(50),
consultation_fee int
);
insert into doctors values
(1, 'Dr. Shashank', 'Ayurveda', 'Apollo Hospital', 'Bangalore', 2500),
(2, 'Dr. Abdul', 'Homeopathy', 'Fortis Hospital', 'Bangalore', 2000),
(3, 'Dr. Shwetha', 'Homeopathy', 'KMC Hospital', 'Manipal', 1000),
(4, 'Dr. Murphy', 'Dermatology', 'KMC Hospital', 'Manipal', 1500),
(5, 'Dr. Farhana', 'Physician', 'Gleneagles Hospital', 'Bangalore', 1700),
(6, 'Dr. Maryam', 'Physician', 'Gleneagles Hospital', 'Bangalore', 1500);
Solution:
select d1.id, d1.name, d1.speciality, d1.hospital, d1.city, d1.consultation_fee
from doctors d1 join doctors d2 on d1.hospital = d2.hospital and d1.speciality != d2.speciality
-- Query 5:
-- From the login_details table, fetch the users who logged in consecutively 3 or more times.
--Table Structure:
-- drop table if exists login_details;
create table login_details(
login_id int primary key,
user_name varchar(50) not null,
login_date date);
delete from login_details;
insert into login_details values
(101, 'Michael', current_date),
(102, 'James', current_date),
(103, 'Stewart', current_date+1),
(104, 'Stewart', current_date+1),
(105, 'Stewart', current_date+1),
(106, 'Jimmy', current_date+2),
(107, 'Michael', current_date+2),
(108, 'Stewart', current_date+3),
(109, 'Stewart', current_date+3),
(110, 'James', current_date+4),
(111, 'James', current_date+4),
(112, 'James', current_date+5),
(113, 'James', current_date+6);
Solution:
with l as (
select *, lag(user_name) over w as prev_user1, lag(user_name,2) over w as prev_user2
from login_details
window w as (order by login_date, login_id)
)
select *
from l
where user_name = prev_user1 and prev_user1 = prev_user2
Alternative solution:
with rep as (
select
*,
case
when lag(user_name) over (order by login_date) = user_name then 'repeat'
else 'not repeat'
end as rep
from login_details
), cume_rep as (
select
*,
case
when rep = 'repeat' and lag(rep) over (order by login_date) = rep then 1
else 0
end as cume_rep
from rep
)
select *
from cume_rep
where cume_rep = 1
Better alternative solution. This solution also counts the number of consecutive logins by the user.
with rep as (
select
*,
case
when lag(user_name) over(order by login_date) = user_name then 1
else 0
end as rep
from login_details
), island_head as (
select
*,
case
when
rep = 0 and lead(rep) over(order by login_date) = 1 then login_id
end as island_head
from rep
), island_id as(
select
*,
case
when rep = 1 then max(island_head) over(order by login_date)
when rep = 0 then island_head
end as island_id
from island_head
)
select
*,
count(island_id) over(partition by island_id order by login_date rows between unbounded preceding and unbounded following)
from island_id
order by login_date, island_head
To make the result set more compact:
with rep as (
select
*,
case
when lag(user_name) over(order by login_date) = user_name then 1
else 0
end as rep
from login_details
), island_head as (
select
*,
case
when
rep = 0 and lead(rep) over(order by login_date) = 1 then login_id
end as island_head
from rep
), island_id as(
select
*,
case
when rep = 1 then max(island_head) over(order by login_date)
when rep = 0 then island_head
end as island_id
from island_head
), island as(
select
*,
count(island_id) over(partition by island_id order by login_date rows between unbounded preceding and unbounded following) as consecutive_logins
from island_id
)
select island_id, consecutive_logins, user_name
from island
where consecutive_logins != 0
group by island_id, consecutive_logins, user_name
order by island_id
-- Query 6:
-- From the students table, write a SQL query to interchange the adjacent student names.
-- Note: If there is no adjacent student, then the student name should stay the same.
--Table Structure:
-- drop table if exists students;
create table students
(
id int primary key,
student_name varchar(50) not null
);
insert into students values
(1, 'James'),
(2, 'Michael'),
(3, 'George'),
(4, 'Stewart'),
(5, 'Robin');
-- select * from students;
Solution:
select
*,
case
when id % 2 = 1 then lead(student_name,1,student_name) over(order by id)
when id % 2 = 0 then lag(student_name,1,student_name) over(order by id)
end as new_student_name
from students
-- Query 7:
-- From the weather table, fetch all the records when London had extremely cold temperature for 3 consecutive days or more.
-- Note: Weather is considered to be extremely cold when its temperature is less than zero.
--Table Structure:
-- drop table if exists weather;
create table weather
(
id int,
city varchar(50),
temperature int,
day date
);
delete from weather;
insert into weather values
(1, 'London', -1, to_date('2021-01-01','yyyy-mm-dd')),
(2, 'London', -2, to_date('2021-01-02','yyyy-mm-dd')),
(3, 'London', 4, to_date('2021-01-03','yyyy-mm-dd')),
(4, 'London', 1, to_date('2021-01-04','yyyy-mm-dd')),
(5, 'London', -2, to_date('2021-01-05','yyyy-mm-dd')),
(6, 'London', -5, to_date('2021-01-06','yyyy-mm-dd')),
(7, 'London', -7, to_date('2021-01-07','yyyy-mm-dd')),
(8, 'London', 5, to_date('2021-01-08','yyyy-mm-dd'));
-- select * from weather;
Solution:
with
streak as (
select *,
case
when temperature < 0 and lag(temperature) over (order by id) >= 0 then id -- head: start of a cold streak (a cold first record falls through to the cases below, since lag is null)
when temperature < 0 and lead(temperature) over (order by id) < 0 then 1 -- body
when temperature < 0 and lead(temperature) over (order by id) >= 0 then 1 -- tail
when temperature < 0 then 1 -- remaining cold records (e.g. the last record, where lead is null)
end as streak
from weather),
island_id as (
select
*,
case
when streak is not null then max(streak) over(order by id)
end as island_id
from streak),
island_size as (
select
*,
count(island_id) over(partition by island_id order by id rows between unbounded preceding and unbounded following) as island_size
from island_id
)
select id, city, temperature, day, island_size
from island_size
where island_size >= 3
order by id
We skipped exercise 8 because the task was not clear.
-- Query 9:
-- Find the top 2 accounts with the maximum number of unique patients on a monthly basis.
-- Note: Prefer the account_id with the least value in case of the same number of unique patients
--Table Structure:
-- drop table if exists patient_logs;
create table patient_logs
(
account_id int,
date date,
patient_id int
);
insert into patient_logs values (1, to_date('02-01-2020','dd-mm-yyyy'), 100);
insert into patient_logs values (1, to_date('27-01-2020','dd-mm-yyyy'), 200);
insert into patient_logs values (2, to_date('01-01-2020','dd-mm-yyyy'), 300);
insert into patient_logs values (2, to_date('21-01-2020','dd-mm-yyyy'), 400);
insert into patient_logs values (2, to_date('21-01-2020','dd-mm-yyyy'), 300);
insert into patient_logs values (2, to_date('01-01-2020','dd-mm-yyyy'), 500);
insert into patient_logs values (3, to_date('20-01-2020','dd-mm-yyyy'), 400);
insert into patient_logs values (1, to_date('04-03-2020','dd-mm-yyyy'), 500);
insert into patient_logs values (3, to_date('20-01-2020','dd-mm-yyyy'), 450);
select * from patient_logs;
Solution:
with dp as
(select distinct to_char(date, 'month') as month,
account_id,
patient_id
from patient_logs
order by month),
c as
(select month,
account_id,
count(account_id) as c
from dp
group by month, account_id),
r as
(select *,
rank() over(partition by month order by c desc, account_id) as rnk
from c)
select *
from r
where rnk < 3
order by month, rnk
- SQL Cheat Sheet: limit n offset m: skip m rows and return the next n rows.
- Use column names create_date (date) and last_update (timestamp without time zone).
- Single vs Double Quotes: Double quotes are for names of tables or fields. Sometimes you can omit them. The single quotes are for string constants. This is the SQL standard. In the verbose form, your query looks like this:
select * from "table1" where "column1"='name1';
- SQL COUNT function is the simplest function and very useful for counting the number of records expected to be returned by a SELECT statement.
- SQL SELECT with DISTINCT on multiple columns
- Additional PgSQL keywords and functions:
- length()
- offset
- PgSQL Pattern Matching: Be wary of accepting regular-expression search patterns from hostile sources. If you must do so, it is advisable to impose a statement timeout.
Searches using
SIMILAR TO
patterns have the same security hazards, since SIMILAR TO
provides many of the same capabilities as POSIX-style regular expressions. LIKE
searches, being much simpler than the other two options, are safer to use with possibly-hostile pattern sources.
- W3 SQL Tutorial
- Full Text Search PostgreSQL
- Learn PostgreSQL Tutorial - Full Course for Beginners
- Official Tutorials and Other Resources
- EDB Offer
- Tutorials point
- Show all tables:
SELECT * FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
- A Visual Explanation of SQL Joins
- Join (SQL) Wiki
- Table Covert Online
- Math Symbols List
- Casting columns to date
- Transpose Results
- Dollar Quoting
- Delete duplicate records
- Creating multiple tables with sqlite3
- select
- count distinct
- from
- where
- =, <>
- is null
- in ()
- between x and y
- like
- ilike
- group by
- having
- order by
- limit
- the not keyword appears before the condition
- order by columns also appear in the select columns
- if we group by -> the select columns appear in the group by statement OR in an aggregate function
- a column in select -> the column appears in group by
- aggregate functions appear in: select OR having
select distinct(event_type), nth_value(delta, 1) over(partition by event_type) as value
from
(select e.*, lag(value) over( partition by event_type order by time desc) - value as delta
from events e
order by event_type) delta
where delta is not null
order by event_type
Alternative (since we remove the null values with the where
clause and the parameter of nth_value is 1):
select distinct(event_type), first_value(delta) over(partition by event_type) as value
from
(select e.*, lag(value) over( partition by event_type order by time desc) - value as delta
from events e
order by event_type) delta
where delta is not null
order by event_type
Alternative using with
as a syntactic convenience to make the query more readable:
with delta as (
select e.*, lag(value) over( partition by event_type order by time desc) - value as delta
from events e
order by event_type
)
select distinct(event_type), first_value(delta) over(partition by event_type) as value
from delta
where delta is not null
order by event_type