In this article we are going to learn how to make contingency tables in Postgres!
Contingency tables (also called pivot tables) are a popular way to display data that shows relationships between 2 variables, this example will be based on sales data and we will be looking at how much of each product each employee sold. Employee name and product name being the 2 variables which have a relationship which we are examining.
Data
This will be our data:
CREATE TABLE sales (
employee TEXT NOT NULL,
product TEXT NOT NULL,
quantity INT NOT NULL
);
INSERT INTO sales (employee,product,quantity)
VALUES
('Tyler','Pasta',1),
('Mark','Pasta',2),
('Tyler','Pepper',5),
('Mark','Pepper',45),
('Caroline','Pepper',100),
('Mark','Cheese',14),
('Rob','Cheese',2);
Why contingency tables are good
If we run SELECT * FROM sales
we get an output that looks like this:
Employee | Product | Quantity |
---|---|---|
Tyler | Pasta | 1 |
Mark | Pepper | 45 |
Tyler | Pepper | 5 |
Looking at the data formatted like this makes it hard to compare the sales of different employees across different products so we want to transform the data into a beautiful format like this:
Employee | Pasta | Pepper |
---|---|---|
Tyler | 1 | 5 |
Mark | 45 |
Manual query creation
Postgres provides a crosstab
function to turn data into the format we want. It will take the original data and transform it so that the first column contains the name of an employee (row name) and then following columns contain sale values for each product (categories) for this employee.
This function takes 2 arguments:
- A data query - this query will fetch the actual data
- A categories query - this query defines the categories columns
So this is how we would go about creating the contingency table:
CREATE EXTENSION tablefunc; -- We need to create this extension first
-- Base data
SELECT
*
FROM sales;
-- Contingency table query
SELECT
*
FROM crosstab(
$$
-- Data query
SELECT
-- Column order needs to be: row name, category, value
employee,
product,
quantity
FROM sales
ORDER BY employee ASC -- We need to order by the row name
$$,
$$
-- Categories query
-- Using DISTINCT because each category can only appear once in this query
SELECT DISTINCT product FROM sales
$$
-- We need to specify which columns are returned from the function
) AS ct(employee text, Pasta text, Pepper text, Cheese text);
This process one major drawback:
We need to specify which columns get returned from the query which means that if the number of rows returned by the categories query changes we will need to make changes to the query.
Automated query creation
We can get around this problem by generating the column definition based on the result of the categories query, for example with a function like this:
CREATE OR REPLACE FUNCTION contingency_query(data_query TEXT, categories_query TEXT)
RETURNS TEXT
LANGUAGE plpgsql
AS $$
DECLARE res TEXT;
BEGIN
-- SELECT return columns specification into a TEXT variable
EXECUTE E'SELECT FORMAT(''(row_name TEXT, %s)'', STRING_AGG(FORMAT(''%s TEXT'', category), '', '')) FROM (' || categories_query || ')' INTO res;
-- RETURN crosstab query with the data and categories queries and correct columns specified
RETURN FORMAT('SELECT * FROM CROSSTAB(
$data$
%s
$data$,
$categories$
%s
$categories$
) AS ct%s', data_query, categories_query, res);
END;
$$
-- Function usage
SELECT contingency_query('SELECT employee, product, quantity FROM sales ORDER BY employee ASC', 'SELECT DISTINCT product AS category FROM sales ORDER BY product asc');
-- The function call will return a query that looks like this:
-- We still need to execute the query to get the contingency table
SELECT * FROM CROSSTAB(
$data$
SELECT employee, product, quantity FROM sales ORDER BY employee ASC
$data$,
$categories$
SELECT DISTINCT product AS category FROM sales ORDER BY product ASC
$categories$
) AS ct(row_name TEXT, Cheese TEXT, Pasta TEXT, Pepper TEXT);
The contingency_query
requires the output column of the categories query to be called category
, hence the AS category
in the example. It also casts all the returned values to TEXT, this is probably fine but can be tuned to fit your use-case better. These improvements are left as an exercise for the reader ;)