[SQL] How to Use EXISTS and IN
In this article, I will explain the difference between SQL EXISTS and IN.
Both are often used to extract “rows that meet certain conditions” using subqueries, but there are differences in behavior and performance.
In particular, let’s look at how to use them from the perspectives of NULL handling and performance.
Preparation
In this site, I will use python libraries sqlite3 and ipython-sql to execute and explain SQL on Jupyter Notebook.
Also, I will use Polars to display dataframes.
github
- The jupyter notebook file on github is here
google colaboratory
- If you want to run it on google colaboratory, click here
Author’s Environment
Here is the author’s environment.
!sw_vers
ProductName: macOS
ProductVersion: 15.5
BuildVersion: 24F74
!python -V
Python 3.12.12
Import necessary libraries.
import sqlite3
import polars as pl
%load_ext sql
%config SqlMagic.feedback = True
%config SqlMagic.style = '_DEPRECATED_DEFAULT'
Data Preparation
As a sample, create a customer table Customers and an order table Orders.
import os
if os.path.exists("data.db"):
os.remove("data.db")
os.system("touch data.db")
os.system("chmod 664 data.db")
# Database connection
%sql sqlite:///data.db
'Connected: @data.db'
%%sql
CREATE TABLE Customers (
id INTEGER PRIMARY KEY,
name TEXT
);
CREATE TABLE Orders (
id INTEGER PRIMARY KEY,
customer_id INTEGER,
amount INTEGER
);
INSERT INTO Customers (name) VALUES
('Aさん'),
('Bさん'),
('Cさん'),
('Dさん');
INSERT INTO Orders (customer_id, amount) VALUES
(1, 100),
(1, 200),
(3, 500);
* sqlite:///data.db
Done.
Done.
4 rows affected.
3 rows affected.
[]
%%sql
SELECT * FROM Customers;
* sqlite:///data.db
Done.
| id | name |
|---|---|
| 1 | Aさん |
| 2 | Bさん |
| 3 | Cさん |
| 4 | Dさん |
%%sql
SELECT * FROM Orders;
* sqlite:///data.db
Done.
| id | customer_id | amount |
|---|---|---|
| 1 | 1 | 100 |
| 2 | 1 | 200 |
| 3 | 3 | 500 |
Task: Extract customers with order history
A-san and C-san have order history, but B-san and D-san do not. Let’s extract only customers with order history.
Method 1: Using IN
Use the IN clause to extract customers included in the list of customer_id in the Orders table.
query = """
SELECT
*
FROM
Customers
WHERE
id IN (SELECT customer_id FROM Orders);
"""
with sqlite3.connect("data.db") as conn:
df = pl.read_database(query, connection=conn)
display(df)
shape: (2, 2)
| id | name |
|---|---|
| 1 | “Aさん” |
| 3 | “Cさん” |
Method 2: Using EXISTS
Use the EXISTS clause to check whether there are related rows in the Orders table.
query = """
SELECT
*
FROM
Customers c
WHERE
EXISTS (
SELECT
1
FROM
Orders o
WHERE
c.id = o.customer_id
);
"""
with sqlite3.connect("data.db") as conn:
df = pl.read_database(query, connection=conn)
display(df)
shape: (2, 2)
| id | name |
|---|---|
| 1 | “Aさん” |
| 3 | “Cさん” |
Explanation and Differences
The results are the same, but there are the following differences.
NULL handling
INcan behave unintuitive whenNULLis included in the list, such as when the result ofNOT INbecomes all unknown (empty).EXISTSonly looks at the existence of rows, so it is less affected byNULL.
Performance
INoften retrieves all subquery results before comparing.EXISTSstops searching as soon as it finds even one row that meets the condition (returns True), so it tends to be faster especially when there are a large number of subquery results.
Summary
- Basically, it is safer to use EXISTS.
- Especially when using
NOT IN, it is strongly recommended to rewrite it to NOT EXISTS to avoid unexpected behavior due toNULL.