Redshift and Glue Best Practices
This page provides some best practices for optimizing data formats, handling nested data, and writing efficient queries while using Data Refinery. For an overview of how to query and export data, review the Export Source Data and Set-up and Query Source Versions pages.
Preferred Data Formats
Data Refinery supports multiple file types, each with different benefits. This section describes the benefits and downsides of the supported file formats.
Apache Parquet or Apache ORC
Apache Parquet and Apache ORC files are column-based formats that include metadata about the file contents in their headers. These files are more difficult to produce than JSONL files but offer the best query performance and the lowest storage cost. The metadata in the file headers tells the query engine what each file contains, so files that hold no records matching a query are skipped entirely.
JSON Line File (.jsonl)
The JSON Line file is a good “all around” format, offering a good mixture of performance and ease of creation. A JSON Line file includes one complete JSON object per line. Updating or adding a new key in a JSONL object will correctly create a new column in a Source, even if the key is the only key present in that object.
JSON Line files are slightly less performant than Apache Parquet or Apache ORC because they do not include metadata in the file headers. As a result, the database will read the full files to run queries.
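For illustration, a minimal JSONL file (with hypothetical field names) looks like this; each line is one complete JSON object, and the third record introduces a new status key that would appear as a new column in the Source:
```jsonl
{"id": 1, "name": "alpha"}
{"id": 2, "name": "beta"}
{"id": 3, "name": "gamma", "status": "active"}
```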
JSON Array File (.json)
Similar to JSON line files (.jsonl), JSON array files use the JSON text format to represent data records. However, all records are represented as objects within a single surrounding array. This is a common format when data is exported from application APIs.
The same performance guidance for JSON line files applies to JSON array files. For optimal performance, these files can be transformed to a more performant data type and loaded into a refined Source.
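For comparison, the same hypothetical records in a JSON array file are wrapped in a single surrounding array:
```json
[
  {"id": 1, "name": "alpha"},
  {"id": 2, "name": "beta"},
  {"id": 3, "name": "gamma", "status": "active"}
]
```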
Comma-Separated Values (.csv)
Comma-separated values (CSV) is a widely used file format for storing tabular data. It can be easily imported to and exported from spreadsheet tools like Microsoft Excel, making it a versatile choice for data exchange. In this format, the first line of the file contains the column names of the table. Each subsequent line represents a record as a table row.
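For illustration, a minimal CSV file (with hypothetical column names) looks like this; the first line lists the column names and each following line is one record:
```csv
id,name
1,alpha
2,beta
```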
Working with CSV
CSV lacks a standardized schema, which could lead to data inconsistency if files within a Source have different header rows. Data could also be interpreted incorrectly if files have columns ordered differently or have missing or extra headers.
While CSV is supported by Data Refinery, it is recommended that CSV be used only in precursory Sources. Whether the schema is managed manually or automatically, column headers must always be included in every uploaded CSV file. Additionally, CSV headers can be alphanumeric but not purely numeric. CSV data should be transformed to a more performant data type before performing any data actions.
If choosing to work with CSV files, ensure:
- Consistent Delimiter Usage: the chosen delimiter is consistent within a file, between files within a Source, and across all Versions. Mixed delimiters can cause a shift in the data, resulting in query failures.
- Header Detection Rule Adherence: if relying on the system to detect the first row in a CSV as the list of column headers, consult the header detection rules here.
- Consistent Column Headings: new columns must only be appended. Column order cannot change within a Source, across all Versions.
- Double Quote Enclosures: the data must be enclosed in double quotes to import accurately. Ensure quotes are balanced (i.e., every opening quote is eventually followed by a closing quote); see the example after this list.
- Unicode Control Character Compatibility: Some Unicode control characters (e.g., `\u0000` and `\u0007`) may not load correctly.
- Date Format Consistency: Mixing date/timestamp formats causes dates and timestamps to be loaded as plain strings instead of dates/timestamps.
- Endline Consistency: Mixing line-ending character usage results in inconsistent data loading behavior.
- Endline Escaping: Escaped line-ending characters are loaded literally and not interpreted as escape sequences. However, endline characters in query values should be escaped.
Note. Columns containing only numeric values are imported as numbers, with leading zeros removed.
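As an example of the quoting guidance above (with hypothetical columns and values), a value that contains the delimiter is enclosed in balanced double quotes so it is imported as a single field:
```csv
id,name,description
1,alpha,"value with a comma, quoted so it stays one field"
2,beta,plain value
```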
Source Name
When creating a Source, only printable ASCII characters can be used for the Source name. This is because the name of the Source will be used to create the table name in Glue and Redshift, both of which have limitations on what characters can be used. While the following two characters can be used to create a table name, they will need to be escaped when querying.
| Character | Escape Character | Unicode | ASCII |
| --- | --- | --- | --- |
| double quote | `\"` | U+0022 | 34 |
| backslash | `\\` | U+005C | 92 |
Column Names And Sanitizing Special Characters
It is recommended to use only printable ASCII characters for column names as well. Some special characters can be used, but this is not recommended; the following characters are known to cause query errors and should be sanitized or escaped before querying:
| Character | Escape Character | Unicode | ASCII |
| --- | --- | --- | --- |
| horizontal tab | `\t` | U+0009 | 9 |
| new line/line feed (LF) | `\n` | U+000A | 10 |
| form feed (FF) | `\f` | U+000C | 12 |
| return/carriage return (CR) | `\r` | U+000D | 13 |
| double quote | `\"` | U+0022 | 34 |
| backslash | `\\` | U+005C | 92 |
Nested Fields Size Limit
If the uploaded data contains nested fields, Data Refinery automatically infers them as the `struct` or `array` column type. The combined size limit of all field names in a nested column is 131072 bytes. This limit can be reached if the field is a `map` of unbounded size. To avoid reaching the limit, consider removing the field from the imported data, or preprocessing it into an `array` instead.
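As a sketch of the suggested preprocessing (the field names here are hypothetical), a map keyed by arbitrary, unbounded identifiers can be rewritten as an array of objects with a fixed set of field names, so the combined size of nested field names stays bounded:
```json
{
  "version": 1,
  "scores": [
    {"key": "user_8421", "value": 0.92},
    {"key": "user_1337", "value": 0.41}
  ]
}
```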
Querying Data Best Practices
The following provides the best methods for querying in Data Refinery.
File Upload Size
For the best query performance, files should be no larger than 100MB. Files larger than 1GB will cause queries to fail.
Cast the version Column
Due to underlying technical limitations, the `version` column that contains the version number is always loaded as the `string` type. The `version` column must therefore be cast to `int` before it can be used for comparison: `cast(t.version as int)`.
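For example, a hypothetical query that filters on the version number must apply the cast before comparing (the table name and id column are assumptions for illustration):
```sql
SELECT
  t.id,
  cast(t.version as int) AS version
FROM
  refinery_schema.my_source AS t
WHERE
  cast(t.version as int) >= 2;
```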
Querying Semistructured Data
Follow the AWS Guide for querying semistructured data, array navigation, object unpivoting, and dynamic typing.
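As a brief sketch of the patterns covered in that guide (the table and column names are hypothetical), nested objects can be navigated with dot notation, and array-typed columns can be unnested in the FROM clause so that each element becomes a row:
```sql
-- Dot notation into a nested object
SELECT
  t."attributes".attribute1
FROM
  refinery_schema.my_source AS t;

-- Array navigation: each element of the tags array becomes a row
SELECT
  t.id,
  tag
FROM
  refinery_schema.my_source AS t,
  t.tags AS tag;
```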
Nested Queries
Redshift does not allow some nested queries when querying nested data structures. Where possible, use a SQL temporary table instead.
For example, for a set of data that contains nested fields:
```json
{
  "version": 1,
  "attributes": {
    "attribute1": "string",
    "attribute2": "string"
  }
}
```
A query with any nested `SELECT` statements will result in a Redshift error:
```sql
-- This will error!
SELECT
  t1."attributes".attribute1
FROM
  t1
WHERE
  t1.version = (select max(t2.version) from refinery_schema.t1 as t2)
```
To mitigate this issue, the nested query can be converted to use a `TEMP TABLE` instead:
```sql
-- This won't error
CREATE TEMP TABLE cte1 AS
SELECT t1.id, max(cast(t1.version as int)) AS max_version FROM t1 GROUP BY t1.id;

SELECT
  t1.id,
  t1."attributes".attribute1,
  t1."attributes".attribute2,
  cast(t1.version as int) as version
FROM
  t1
INNER JOIN
  cte1 ON t1.id = cte1.id
  AND cast(t1.version as int) = cte1.max_version;
```
Note. A temporary table is visible only within the current session and is automatically dropped at the end of the session in which it is created.
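If the temporary table is no longer needed before the session ends, it can also be dropped explicitly:
```sql
DROP TABLE cte1;
```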