7 DuckDB SQL Queries That Save You Hours of Pandas Work

7 DuckDB SQL Queries That Save You Hours of Pandas Work
This article explores how to leverage DuckDB, an in-process SQL OLAP database, as a powerful alternative to the Pandas library for data analysis tasks. It demonstrates how DuckDB's SQL-based approach can offer significant performance and readability advantages, especially when dealing with complex operations.
Introduction to DuckDB
DuckDB is highlighted as an easy-to-install, serverless SQL database that integrates seamlessly with Python notebooks. Unlike traditional databases, it requires no setup and can work alongside Pandas, offering a familiar yet more efficient way to handle data manipulation.
Key Advantages of DuckDB over Pandas
The article emphasizes several areas where DuckDB excels:
- Readability and Conciseness: SQL syntax often leads to cleaner and more understandable code, especially for complex filtering, aggregation, and conditional logic, compared to the often verbose Pandas chaining.
- Performance: DuckDB is designed for analytical workloads and can outperform Pandas on large datasets.
- Reduced Boilerplate: Many operations that require multiple lines of code and intermediate variables in Pandas can be achieved in a single, more elegant SQL query with DuckDB.
Data Project: Uber Business Modeling
The article uses a real-world data project from Uber, focusing on partner business modeling, to illustrate DuckDB's capabilities. The project involves analyzing driver data to compare different bonus programs and model business scenarios.
Scenario 1: Comparing Bonus Programs
1. Multi-Criteria Filtering:
DuckDB's WHERE
clause with AND
operators simplifies complex filtering. The article contrasts this with Pandas' chained boolean masks, showing how DuckDB's SQL approach is more maintainable as the number of conditions increases. An example is provided to identify drivers qualifying for a bonus based on hours online, trips completed, acceptance rate, and rating.
- DuckDB Query Example:
SELECT COUNT(*) AS qualified_drivers, COUNT(*) * 50 AS total_payout FROM data WHERE "Supply Hours" >= 8 AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90 AND "Trips Completed" >= 10 AND Rating >= 4.7
2. Fast Aggregation:
DuckDB's SUM()
and COUNT()
functions allow for efficient aggregation within a single query. This is compared to Pandas, where filtering and aggregation are often separate steps. The article demonstrates calculating payouts based on the number of trips completed for a different bonus program.
- DuckDB Query Example:
SELECT COUNT(*) AS qualified_drivers, SUM("Trips Completed") * 4 AS total_payout FROM data WHERE "Trips Completed" >= 12 AND Rating >= 4.7
3. Boolean Logic for Overlaps and Differences:
DuckDB's native support for AND
, OR
, and NOT
in the WHERE
clause simplifies the detection of overlaps and unique conditions. The article shows how to find drivers qualifying for one bonus but not another using NOT
.
- DuckDB Query Example:
SELECT COUNT(*) AS only_option1 FROM data WHERE "Supply Hours" >= 8 AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90 AND "Trips Completed" >= 10 AND Rating >= 4.7 AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)
4. Conditional Filters for Cohort Sizing:
DuckDB can efficiently calculate cohort sizes and percentages using subqueries within a single SQL statement. This is contrasted with Pandas, which requires manual filtering, counting, and division. The example calculates the percentage of drivers meeting specific criteria (e.g., low trip count, low acceptance rate, high rating).
- DuckDB Query Example:
SELECT ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage FROM data WHERE "Trips Completed" < 10 AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) = 4.7
Scenario 2: Revenue and Expense Modeling
5. Basic Arithmetic for Revenue Modeling:
DuckDB allows direct embedding of arithmetic expressions within SELECT
statements, simplifying calculations like annual revenue and net income. This is shown to be more concise than Pandas, which often requires multiple intermediate variables.
- DuckDB Query Example:
SELECT 200 * 6 * (52 - 3) AS annual_revenue, 200 * (52 - 3) AS gas_expense, 500 * (52 - 3) AS rent_expense, 400 * 12 AS insurance_expense, (200 * 6 * (52 - 3)) - (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income FROM data
6. Conditional Calculations for Dynamic Expense Planning:
DuckDB facilitates conditional calculations within queries, such as adjusting costs based on percentages. This is presented as a more streamlined approach than Pandas, which might involve multiple lines of code and variable updates.
- DuckDB Query Example:
SELECT 200 * 1.05 * 49 AS gas_expense, 400 * 0.8 * 12 AS insurance_expense, 0 AS rent_expense, (200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense FROM data
7. Goal-Driven Math for Revenue Targeting:
DuckDB's use of Common Table Expressions (CTEs) allows for modular and readable multi-step calculations to meet business goals, such as hitting revenue targets or covering costs. The article contrasts this with Pandas, which can lead to deeply nested calculations and scattered code. The final example calculates the required weekly revenue uplift for a driver to cover car payments and maintain their net income.
- DuckDB CTE Example:
WITH vars AS ( SELECT 52 AS total_weeks_per_year, 3 AS weeks_off, 6 AS days_per_week, 200 AS fare_per_day, 400 AS monthly_insurance, 200 AS gas_per_week, 500 AS vehicle_rent, 40000 AS car_cost ), base AS ( SELECT total_weeks_per_year, weeks_off, days_per_week, fare_per_day, monthly_insurance, gas_per_week, vehicle_rent, car_cost, total_weeks_per_year - weeks_off AS weeks_worked, (fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue, (gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas, (vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent, (monthly_insurance * 12) AS original_insurance FROM vars ), compare AS ( SELECT *, (original_gas + original_rent + original_insurance) AS original_total_expense, (original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income FROM base ), new_costs AS ( SELECT *, gas_per_week * 1.05 * weeks_worked AS new_gas, monthly_insurance * 0.8 * 12 AS new_insurance FROM compare ), final AS ( SELECT *, new_gas + new_insurance + car_cost AS new_total_expense, original_net_income + new_gas + new_insurance + car_cost AS required_revenue, required_revenue / weeks_worked AS required_weekly_revenue, original_annual_revenue / weeks_worked AS original_weekly_revenue FROM new_costs ) SELECT ROUND(required_weekly_revenue, 2) AS required_weekly_revenue, ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift FROM final
Conclusion
DuckDB offers a compelling alternative to Pandas for data scientists, providing a more efficient, readable, and concise way to perform complex data analysis tasks using SQL. Its seamless integration with notebooks and powerful query capabilities make it a valuable tool for anyone working with data.
Original article available at: https://www.kdnuggets.com/7-duckdb-sql-queries-that-save-you-hours-of-pandas-work