Redshift and Glue Best Practices
This page provides some best practices for optimizing data formats, handling nested data, and writing efficient queries while using Data Refinery. For an overview of how to query and export data, review the Export Source Data and Set-up and Query Source Versions pages.
Preferred Data Formats
Data Refinery supports multiple file types, each with different benefits. This section describes the benefits and downsides of the supported file formats.
Apache Parquet or Apache ORC
Apache Parquet and Apache ORC files are column-based formats that include metadata about the file contents in their headers. These files are more difficult to produce than JSONL files but offer the best query performance and the lowest storage cost. The metadata in the file headers tells the query engine what each file contains, so files that hold no records matching a query are skipped entirely.
JSON Line File (.jsonl)
The JSON Line file is a good “all around” format, offering a good mixture of performance and ease of creation. A JSON Line file includes one complete JSON object per line. Updating or adding a new key in a JSONL object will correctly create a new column in a Source, even if the key is the only key present in that object.
JSON Line files are slightly less performant than Apache Parquet or Apache ORC because they do not include metadata in the file headers. As a result, the database will read the full files to run queries.
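For illustration, a minimal JSONL file (with hypothetical field names) looks like this; each line is one complete JSON object, and the third record introduces a new status key that would appear as a new column in the Source:
```jsonl
{"id": 1, "name": "alpha"}
{"id": 2, "name": "beta"}
{"id": 3, "name": "gamma", "status": "active"}
```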
JSON Array File (.json)
Similar to JSON line files (.jsonl), JSON array files use the JSON text format to represent data records. However, all records are represented as objects within a single surrounding array. This is a common format when data is exported from application APIs.
The same performance guidance for JSON line files applies to JSON array files. For optimal performance, these files can be transformed to a more performant data type and loaded into a refined Source.
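For comparison, the same hypothetical records in a JSON array file are wrapped in a single surrounding array:
```json
[
  {"id": 1, "name": "alpha"},
  {"id": 2, "name": "beta"},
  {"id": 3, "name": "gamma", "status": "active"}
]
```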
Comma-Separated Values (.csv)
Comma-separated values (CSV) is a widely used file format for storing tabular data. It can be easily imported to and exported from spreadsheet tools like Microsoft Excel, making it a versatile choice for data exchange. In this format, the first line of the file contains the column names of the table. Each subsequent line represents a record as a table row.
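For illustration, a minimal CSV file (with hypothetical column names) looks like this; the first line lists the column names and each following line is one record:
```csv
id,name
1,alpha
2,beta
```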
Working with CSV
CSV lacks a standardized schema, which could lead to data inconsistency if files within a Source have different header rows. Data could also be interpreted incorrectly if files have columns ordered differently or have missing or extra headers.
While CSV is supported by Data Refinery, it is recommended that CSV be used only in precursory Sources. Whether the schema is managed manually or automatically, column headers must always be included in every uploaded CSV file. Additionally, CSV headers can be alphanumeric but not purely numeric. CSV data should be transformed to a more performant data type before performing any data actions.
If choosing to work with CSV files, ensure:
- Consistent Delimiter Usage: the chosen delimiter is consistent within a file, between files within a Source, and across all Versions. Mixed delimiters can cause a shift in the data, resulting in query failures.
- Header Detection Rule Adherence: if relying on the system to detect the first row in a CSV as the list of column headers, consult the header detection rules here.
- Consistent Column Headings: new columns must only be appended. Column order cannot change within a Source, across all Versions.
- Double Quote Enclosures: the data must be enclosed in double quotes to import accurately. Ensure quotes are balanced (i.e., every opening quote is eventually followed by a closing quote); see the example after this list.
- Unicode Control Character Compatibility: Some Unicode control characters (e.g., `\u0000` and `\u0007`) may not load correctly.
- Date Format Consistency: Mixing date/timestamp formats causes dates and timestamps to be loaded as plain strings instead of dates/timestamps.
- Endline Consistency: Mixing line-ending character usage results in inconsistent data loading behavior.
- Endline Escaping: Escaped line-ending characters are loaded literally and not interpreted as escape sequences. However, endline characters in query values should be escaped.
Note. Columns containing only numeric values are imported as numbers, with leading zeros removed.
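As an example of the quoting guidance above (with hypothetical columns and values), a value that contains the delimiter is enclosed in balanced double quotes so it is imported as a single field:
```csv
id,name,description
1,alpha,"value with a comma, quoted so it stays one field"
2,beta,plain value
```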
Source Name
When creating a Source, only printable ASCII characters can be used for the Source name. This is because the name of the Source will be used to create the table name in Glue and Redshift, both of which have limitations on what characters can be used. While the following two characters can be used to create a table name, they will need to be escaped when querying.
| Character | Escape Character | Unicode | ASCII |
| --- | --- | --- | --- |
| double quote | `\"` | U+0022 | 34 |
| backslash | `\\` | U+005C | 92 |
Column Names And Sanitizing Special Characters
It is recommended to use only printable ASCII characters for column names as well. Some special characters can be used, but this is not recommended; the following characters are known to cause query errors and should be sanitized or escaped before querying:
| Character | Escape Character | Unicode | ASCII |
| --- | --- | --- | --- |
| horizontal tab | `\t` | U+0009 | 9 |
| new line/line feed (LF) | `\n` | U+000A | 10 |
| form feed (FF) | `\f` | U+000C | 12 |
| return/carriage return (CR) | `\r` | U+000D | 13 |
| double quote | `\"` | U+0022 | 34 |
| backslash | `\\` | U+005C | 92 |
Nested Fields Size Limit
If the uploaded data contains nested fields, Data Refinery automatically infers them as the `struct` or `array` column type. The combined size limit of all field names in a nested column is 131072 bytes. This limit can be reached if the field is a `map` of unbounded size. To avoid reaching the limit, consider removing the field from the imported data, or preprocessing it into an `array` instead.
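As a sketch of the suggested preprocessing (the field names here are hypothetical), a map keyed by arbitrary, unbounded identifiers can be rewritten as an array of objects with a fixed set of field names, so the combined size of nested field names stays bounded:
```json
{
  "version": 1,
  "scores": [
    {"key": "user_8421", "value": 0.92},
    {"key": "user_1337", "value": 0.41}
  ]
}
```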
Querying Data Best Practices
The following provides the best methods for querying in Data Refinery.
File Upload Size
For the best query performance, files should be no larger than 100MB. Files larger than 1GB will cause queries to fail.
Cast the version Column
Due to underlying technical limitations, the `version` column that contains the version number is always loaded as the `string` type. The `version` column must therefore be cast to `int` before it can be used for comparison: `cast(t.version as int)`.
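For example, a hypothetical query that filters on the version number must apply the cast before comparing (the table name and id column are assumptions for illustration):
```sql
SELECT
  t.id,
  cast(t.version as int) AS version
FROM
  refinery_schema.my_source AS t
WHERE
  cast(t.version as int) >= 2;
```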
Querying Semistructured Data
Follow the AWS Guide for querying semistructured data, array navigation, object unpivoting, and dynamic typing.
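As a brief sketch of the patterns covered in that guide (the table and column names are hypothetical), nested objects can be navigated with dot notation, and array-typed columns can be unnested in the FROM clause so that each element becomes a row:
```sql
-- Dot notation into a nested object
SELECT
  t."attributes".attribute1
FROM
  refinery_schema.my_source AS t;

-- Array navigation: each element of the tags array becomes a row
SELECT
  t.id,
  tag
FROM
  refinery_schema.my_source AS t,
  t.tags AS tag;
```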
Nested Queries
Redshift does not allow some nested queries when querying nested data structures. Where possible, use a SQL temporary table instead.
For example, for a set of data that contains nested fields:
```json
{
  "version": 1,
  "attributes": {
    "attribute1": "string",
    "attribute2": "string"
  }
}
```
A query with any nested `SELECT` statements will result in a Redshift error:
```sql
-- This will error!
SELECT
  t1."attributes".attribute1
FROM
  t1
WHERE
  t1.version = (select max(t2.version) from refinery_schema.t1 as t2)
```
To mitigate this issue, the nested query can be converted to use a `TEMP TABLE` instead:
```sql
-- This won't error
CREATE TEMP TABLE cte1 AS
SELECT t1.id, max(cast(t1.version as int)) AS max_version FROM t1 GROUP BY t1.id;

SELECT
  t1.id,
  t1."attributes".attribute1,
  t1."attributes".attribute2,
  cast(t1.version as int) as version
FROM
  t1
INNER JOIN
  cte1 ON t1.id = cte1.id
  AND cast(t1.version as int) = cte1.max_version;
```
Note. A temporary table is visible only within the current session and is automatically dropped at the end of the session in which it is created.
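If the temporary table is no longer needed before the session ends, it can also be dropped explicitly:
```sql
DROP TABLE cte1;
```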