Skip to content

devnich/sql-data-science

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introducing databases and SQL: Why use a database?

Performance

Correctness

There are two aspects of "correctness": Enforcing consistency and eliminating ambiguity. A database enforces consistency with a combination of data types, rules (e.g., foreign keys, triggers, etc.), and atomic transactions. It eliminates ambiguity by forbidding NULLs.

  1. You can represent simple data in a single table

  2. The single table breaks down when your data is complex

    If you use a nested representation, the individual table cells are no longer atomic. The tool for query, search, or perform analyses rely on the atomic structure of the table, and they break down when the cell contents are complex.

  3. Complex data with duplicate row

    • Storing redundant information has storage costs
    • Redundant rows violate the Don't Repeat Yourself [DRY] principle. Every copy is an opportunity to introduce errors or inconsistencies into the data.
    • Storing multidimensional data in a single table increases the chance that your records will have NULL fields, which will complicate future queries (more on this later)
  4. Solution: Normalize the data by breaking it into multiple tables

    Animals Sightings
    • Every row of every table contains unique information
    • Normalization is a continuum. We could normalize this data further, but there is a trade-off in terms of sane table management. Finding the correct trade-off is a matter of taste, judgment, and domain-specific knowledge.

Encode Domain Knowledge

  • Encodes shape of domain
  • Embeds domain rules: e.g. cannot have a customer transaction without a customer account
  • Rules provide additional layer of correctness in the form of constraints
  • note that forbidding NULL seems much more reasonable in this context!

Extensions

  • Functions
  • Data types (GIS, JSON, date/time, searchable document, currency...)
  • Full-text search

Accessing data with queries

Basic queries

  1. Select everything from a table

    SELECT *
    FROM surveys;
  2. Select a column

    SELECT year
    FROM surveys;
  3. Select multiple columns

    SELECT year, month, day
    FROM surveys;
  4. Limit results

    SELECT *
    FROM surveys
    LIMIT 10;
  5. Get unique values

    SELECT DISTINCT species_id
    FROM surveys;
    -- Return distinct pairs
    SELECT DISTINCT year, species_id
    FROM surveys;
  6. Calculate values

    -- Convert kg to g
    SELECT plot_id, species_id, weight/1000
    FROM surveys;
  7. SQL databases have functions

    SELECT plot_id, species_id, ROUND(weight/1000, 2)
    FROM surveys;

Filtering

  1. Filter by a criterion

    SELECT *
    FROM surveys
    WHERE species_id='DM';
    SELECT *
    FROM surveys
    WHERE year >= 2000;
  2. Combine criteria with booleans

    SELECT *
    FROM surveys
    WHERE (year >= 2000) AND (species_id = 'DM');
    SELECT *
    FROM surveys
    WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS');

Challenge 1

Get all of the individuals in Plot 1 that weighed more than 75 grams, telling us the date, species id code, and weight (in kg).

Building complex queries

Use sets ("tuples") to condense criteria.

SELECT *
FROM surveys
WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'));

Sorting

  1. Sort by a column value

    SELECT *
    FROM species
    ORDER BY taxa ASC;
  2. Descending sort

    SELECT *
    FROM species
    ORDER BY taxa DESC;
  3. Nested sort

    SELECT *
    FROM species
    ORDER BY genus ASC, species ASC;

Challenge 2

Write a query that returns year, species_id, and weight in kg from the surveys table, sorted with the largest weights at the top.

Order of execution

Queries are pipelines

Aggregating and grouping data (i.e. reporting)

COUNT

SELECT COUNT(*)
FROM surveys;
-- SELECT only returns non-NULL results
SELECT COUNT(weight), AVG(weight)
FROM surveys;

Challenge 3

  1. Write a query that returns the total weight, average weight, minimum and maximum weights for all animals caught over the duration of the survey.
  2. Modify it so that it outputs these values only for weights between 5 and 10.

GROUP BY (i.e. summarize, pivot table)

  1. Aggregate using GROUP BY

    SELECT species_id, COUNT(*)
    FROM surveys
    GROUP BY species_id;
  2. Group by multiple nested fields

    SELECT year, species_id, COUNT(*), AVG(weight)
    FROM surveys
    GROUP BY year, species_id;

Ordering aggregated results

SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id
ORDER BY COUNT(species_id) DESC;

Aliases

Create temporary variable names for future use. This will be useful later when we have to work with multiple tables.

  1. Create alias for column name

    SELECT MAX(year) AS last_surveyed_year
    FROM surveys;
  2. Create alias for table name

    SELECT *
    FROM surveys AS surv;

The HAVING keyword

  1. WHERE filters on database fields; HAVING filters on aggregations

    SELECT species_id, COUNT(species_id)
    FROM surveys
    GROUP BY species_id
    HAVING COUNT(species_id) > 10;
  2. Using aliases to make results more readable

    SELECT species_id, COUNT(species_id) AS occurrences
    FROM surveys
    GROUP BY species_id
    HAVING occurrences > 10;
  3. Note that in both queries, HAVING comes after GROUP BY. One way to think about this is: the data are retrieved (SELECT), which can be filtered (WHERE), then joined in groups (GROUP BY); finally, we can filter again based on some of these groups (HAVING).

Challenge 4

Write a query that returns, from the species table, the number of species in each taxa, only for the taxa with more than 10 species.

SELECT taxa, COUNT(*) AS n
FROM species
GROUP BY taxa
HAVING n > 10;

Saving queries for future use

A view is a permanent query; alternatively, it is a table that auto-refreshes based on the contents of other tables.

  1. A sample query

    SELECT *
    FROM surveys
    WHERE year = 2000 AND (month > 4 AND month < 10);
  2. Save the query permanently as a view

    CREATE VIEW summer_2000 AS
    SELECT *
    FROM surveys
    WHERE year = 2000 AND (month > 4 AND month < 10);
  3. Query the view (i.e. the query results) directly

    SELECT *
    FROM summer_2000
    WHERE species_id = 'PE';

NULL

Start with slides: NULLs are missing data and give deceptive query results. Then demo:

  1. Count all the things

    SELECT COUNT(*)
    FROM summer_2000;
  2. Count all the not-females

    SELECT COUNT(*)
    FROM summer_2000
    WHERE sex != 'F';
  3. Count all the not-males. These two do not add up!

    SELECT COUNT(*)
    FROM summer_2000
    WHERE sex != 'M';
  4. Explicitly test for NULL

    SELECT COUNT(*)
    FROM summer_2000
    WHERE sex != 'M' OR sex IS NULL;

Combining data with joins

(Inner) joins

  1. Join on fully-identified table fields

    SELECT *
    FROM surveys
    JOIN species
    ON surveys.species_id = species.species_id;
  2. Join a subset of the available columns

    SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
    FROM surveys
    JOIN species
    ON surveys.species_id = species.species_id;
  3. Join on table fields with identical names

    SELECT *
    FROM surveys
    JOIN species
    USING (species_id);

Challenge 5

Write a query that returns the genus, the species name, and the weight of every individual captured at the site.

SELECT species.genus, species.species, surveys.weight
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;

Other join types

Slides: Talk about the structure of joins

Combining joins with sorting and aggregation

SELECT plots.plot_type, AVG(surveys.weight)
FROM surveys
JOIN plots
ON surveys.plot_id = plots.plot_id
GROUP BY plots.plot_type;

(Optional) Challenge 6

Write a query that returns the number of animals caught of each genus in each plot. Order the results by plot number (ascending) and by descending number of individuals in each plot.

SELECT surveys.plot_id, species.genus, COUNT(*) AS number_indiv
FROM surveys
JOIN species
ON surveys.species_id = species.species_id
GROUP BY species.genus, surveys.plot_id
ORDER BY surveys.plot_id ASC, number_indiv DESC;

(Optional) Functions COALESCE and NULLIF

  1. Replace missing values (NULL) with a preset value using COALESCE

    SELECT species_id, sex, COALESCE(sex, 'U')
    FROM surveys;
  2. Replacing missing values allows you to include previously-missing rows in joins

    SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
    FROM surveys
    JOIN species
    ON COALESCE(surveys.species_id, 'AB') = species.species_id;
  3. NULLIF is the inverse of COALESCE; you can mask out values by converting them to NULL

    SELECT species_id, plot_id, NULLIF(plot_id, 7)
    FROM surveys;

(Optional) SQLite on the command line

Basic commands

sqlite3     # enter sqlite prompt
.tables     # show table names
.schema     # show table schema
.help       # view built-in commands
.quit

Getting output

  1. Formatted output in the terminal

    .headers on
    .help mode
    .mode column
    SELECT * FROM species WHERE taxa = 'Rodent';
  2. Output to .csv file

    .mode csv
    .output test.csv
    SELECT * FROM species WHERE taxa = 'Rodent';
    .output stdout

(Optional) Database access via programming languages

R language bindings

  1. Resources

  2. Import libraries

    library("DBI")
    library("RSQLite")
  3. FYI, use namespaces explicitly

    con <- DBI::dbConnect(RSQLite::SQLite(), "../data/portal_mammals.sqlite")
  4. Show tables

    dbListTables(con)
  5. Show column names

    dbListFields(con, "species")
  6. Get query results at once

    df <- dbGetQuery(con, "SELECT * FROM surveys WHERE year = 2000")
    head(df)
  7. Use parameterized queries

    df <- dbGetQuery(con, "SELECT * FROM surveys WHERE year = ? AND (month > ? AND month < ?)",
                     params = c(2000, 4, 10))
    head(df)
  8. Disconnect

    dbDisconnect(con)

(Optional) What kind of data storage system do I need?

Non-atomic write; sequential read

  1. Files

Single atomic write (database-level lock); query-driven read

  1. SQLite
  2. Microsoft Access

Multiple atomic writes (row-level lock); query-driven read

  1. PostgreSQL: https://www.postgresql.org
  2. MySQL/MariaDB
  3. Oracle
  4. Microsoft SQL Server
  5. ...etc.

(Optional) Performance tweaks and limitations

Getting the most out of your database

  1. Use recommended settings, not default settings
  2. Make judicious use of indexes
  3. Use the query planner (this will provide feedback for item 2)
  4. Cautiously de-normalize your schema

Where relational databases break down

  1. Very large data (hardware, bandwidth, and data integration problems)
  2. Distributed data (uncertainty about correctness)

Why are distributed systems hard?

  1. CAP theorem
    • In theory, pick any two: Consistent, Available, Partition-Tolerant
    • In practice, Consistent or Available in the presence of a Partition
  2. Levels of data consistency
  3. Fallacies of distributed computing
    1. The network is reliable
    2. Latency is zero
    3. Bandwidth is infinite
    4. The network is secure
    5. Topology doesn't change
    6. There is one administrator
    7. Transport cost is zero
    8. The network is homogeneous

Endnotes

Credits

References

Data Sources

About

Databases and SQL for Data Scientists

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published