How to Load Data into an SQL Database

From simple inserts to bulk loads

Posted by Tariq Abughofa on September 26, 2019 · 7 mins read #sql

Data storage is the most fundamental part of a transactional database system. In this article, I will go into the details of different ways to load data into a transactional SQL system: from inserting a couple of records all the way to millions.

The first and easiest way to insert data into the database is through an ORM (Object-Relational Mapping) tool. For the purpose of this tutorial, I will use Rails’s ActiveRecord as a demonstration of ORM operations. Inserting data through the ORM ensures the execution of all the business rules and validations, so you don’t have to worry about them. It’s as easy as doing this:

Users.create(username: "John Doe", role: "admin")

The disadvantage of this method is that it doesn’t scale: for each record, it instantiates a model object, executes the attached callbacks for validations and business rules, and issues a separate DML statement.

Some frameworks provide a way to do a bulk insert instead. A single INSERT query is prepared and a single SQL statement is sent to the database, without instantiating the models or invoking model callbacks or validations. For example, in Rails 6 it’s something like:

result = Users.insert_all(
  [
    { id: 1,
      username: 'John Doe',
      role: 'admin' },
    { id: 2,
      username: 'Jane Doe',
      role: 'admin' },
  ]
)

The same effect can be achieved with raw SQL:

INSERT INTO
  users(id, username, role)
VALUES
  (1, 'John Doe', 'admin'),
  (2, 'Jane Doe', 'admin');

Both solutions can run into conflicts (such as breaking primary key uniqueness or unsuitable data types), and they can break business rules or high-level application validation rules. In both cases, the programmer has to deal with these problems by handling database-level errors and enforcing the higher-level rules. For example, if you want to ignore duplicates on the primary key column, you can add something like this to the end of the query (PostgreSQL syntax):

ON CONFLICT (id) DO NOTHING;
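Putting the clause in context, the duplicate-ignoring insert would look like this (PostgreSQL syntax; the values are the same illustrative rows as above):

```sql
INSERT INTO
  users(id, username, role)
VALUES
  (1, 'John Doe', 'admin'),
  (2, 'Jane Doe', 'admin')
ON CONFLICT (id) DO NOTHING;  -- rows with an existing id are silently skipped
```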

What if the loaded data contains updates to existing rows in the database? This operation is called an upsert. If the row already exists in the table (existence is determined based on the primary key), the row is updated with the passed values. Otherwise, it is inserted as a new row. In Rails 6, it’s as easy as replacing the insert_all function with upsert_all:

result = Users.upsert_all(
  [
    { id: 1,
      username: 'John Doe',
      role: 'admin' },
    { id: 2,
      username: 'Jane Doe',
      role: 'admin' },
  ]
)

In raw SQL it will be something like this (MySQL syntax; PostgreSQL uses ON CONFLICT ... DO UPDATE instead):

INSERT INTO
  users(id, username, role)
VALUES
  (1, 'John Doe', 'admin'),
  (2, 'Jane Doe', 'admin')
ON DUPLICATE KEY UPDATE
  username=VALUES(username), role=VALUES(role);

The scalability of this method is still limited. There is a maximum query length in most servers, and even where that limit doesn’t exist, you wouldn’t want to send a query gigabytes long over the network. A simple solution is to UPSERT in batches. That would be something like this:

batch_size = 1000
batch_count = (records.length.to_f / batch_size).ceil
batch_count.times do |n|
  lower_bound = n * batch_size
  Users.upsert_all(records[lower_bound, batch_size])
end

The same can be done with raw SQL instead of the upsert_all function.
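Ruby’s Enumerable#each_slice can also do the batch bookkeeping for you, including the final partial batch. A small stand-alone sketch of the slicing behavior (the records here are synthetic, just to show the batch boundaries):

```ruby
# Build 2,500 fake records to batch; in the real case these would be
# the hashes passed to upsert_all.
records = (1..2500).map { |i| { id: i, username: "user#{i}", role: "admin" } }

# each_slice yields consecutive batches of up to 1,000 records,
# with the leftover 500 records in the last batch.
batches = records.each_slice(1000).to_a

puts batches.length      # => 3
puts batches.last.length # => 500
```

Each element of `batches` could then be passed to `Users.upsert_all` in turn.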

This solution technically scales well. However, for even better performance, most SQL databases have a copy functionality which loads data from a file into a table. To use it, the data is dumped into a file in a format that the database engine supports (a common one is CSV). After that, the table is filled from the file with an SQL command such as:

COPY users FROM 'path/to/my/csv/file.csv';

This statement appends all the data within the CSV file to the users table. This solution can give much better performance for files at the scale of multiple GBs of data.
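The exact options vary by engine. In PostgreSQL, for example, the column list, file format, and header handling can be spelled out explicitly (a sketch, assuming a CSV file with a header row):

```sql
COPY users(id, username, role)         -- map CSV columns to table columns
FROM 'path/to/my/csv/file.csv'
WITH (FORMAT csv, HEADER true);        -- parse as CSV, skip the header line
```

Note that COPY ... FROM reads the file on the database server’s filesystem; loading from a client machine requires a client-side mechanism such as psql’s \copy.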

However, this solution doesn’t handle upserts. To do so, we create a temporary table, fill it from the file, then merge it into the original table using the primary key:

CREATE TEMP TABLE tmp_table
...; /* same schema as `users` table */

COPY tmp_table FROM 'path/to/my/csv/file.csv';

INSERT INTO users
  SELECT *
  FROM tmp_table
ON CONFLICT (id) DO UPDATE
  SET username = EXCLUDED.username,
      role = EXCLUDED.role;

This solution avoids issuing multiple transactions against the database and ensures higher performance. It’s definitely worth looking into when loading GBs of data into the database.
