Loading Data FAQs

This page answers common questions and gives recommendations for loading data into Data Refinery.

Table of contents

How should I translate my tables into Sources?
Should my data be loaded into Data Refinery as embedded JSON within a CSV?
Are there recurring requirements for loading data that I can learn to make the process more efficient?

How should I translate my tables into Sources?

Recommendation: Depending on their structure, tables may need to be pre-processed before they are uploaded as Sources. Format the columns and rows so that the data you intend to compare is easy to compare.

For example, addresses that are nested within a table can be difficult to compare column by column. When addresses are collected, they are typically stored across several columns rather than as a whole address in one cell, so Address Line, City, State, Zip Code, and so on each require their own column in the table. However, not all of these columns may be populated, which can result in missing data when the table is queried.
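The pre-processing described above can be sketched in Python. This is only an illustration: the record layout and the column names (`address_line`, `city`, `state`, `zip_code`) are assumptions, not names required by Data Refinery. It writes one column per address part and fills unpopulated parts with empty values so every row has the same shape.

```python
import csv
import io

# Hypothetical column names; use whatever your own Source schema defines.
ADDRESS_COLUMNS = ["address_line", "city", "state", "zip_code"]


def flatten_record(record):
    """Return one flat row that contains every address column."""
    address = record.get("address") or {}
    row = {"id": record["id"]}
    for column in ADDRESS_COLUMNS:
        # Unpopulated parts become empty strings so all rows share the same columns.
        row[column] = address.get(column, "")
    return row


def to_csv(records):
    """Serialize nested records as a flat CSV string, one column per address part."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id"] + ADDRESS_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))
    return buffer.getvalue()
```

Calling `to_csv` on a list of records that each have an `id` and an optional nested `address` dictionary produces a table in which a missing city or zip code appears as an empty cell rather than a shifted or absent column.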

To learn more about uploading data to Sources, read the How to Upload Source Data section on the Sources page.

Should my data be loaded into Data Refinery as embedded JSON within a CSV?

Recommendation: It is possible to load data as embedded JSON within a CSV, but the JSON must be enclosed in quotes so that it is treated as a string. As a result, the JSON cannot be queried as a structured object within the table. To query and join it as a structured object, extract the JSON into a second Source, which requires creating another table to upload.
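Both options can be sketched with Python's standard `csv` and `json` modules. The column name `details` and the shape of the rows are assumptions made for illustration. `write_embedded` produces a CSV in which the JSON is a quoted string, and `split_details` produces a separate table, keyed by `id`, that could be uploaded as a second Source.

```python
import csv
import io
import json


def write_embedded(rows):
    """Build a CSV whose 'details' column holds JSON as a quoted string."""
    buffer = io.StringIO()
    # The csv writer quotes the field and doubles the inner quote characters.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["id", "details"])
    for row in rows:
        writer.writerow([row["id"], json.dumps(row["details"])])
    return buffer.getvalue()


def split_details(rows):
    """Build a second table: one row per id, one column per JSON key."""
    keys = sorted({key for row in rows for key in row["details"]})
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id"] + keys, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({"id": row["id"], **row["details"]})
    return buffer.getvalue()
```

Letting the `csv` module do the quoting avoids hand-escaping the double quotes inside the JSON. The second table keeps the `id` column so it can be joined back to the original table.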

Are there recurring requirements for loading data that I can learn to make the process more efficient?

Recommendation: Kingland’s Data Refinery accepts only certain file types for data uploads. Once a Source is created, it is imperative that all of its versions use the same file type. Additionally, when updates are uploaded to Data Refinery (as a new Source version), it is recommended to re-upload the whole file instead of only the new additions. Read the Preferred Data Formats and Querying Data Best Practices sections on the Redshift and Glue Best Practices page to learn more.

Also, it is strongly suggested that files be kept smaller than ~100 MB for the best query performance. To learn more about how to upload data to a new version, read the Upload Data to the New Version section on the Set-up and Query Source Versions page.
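A local pre-upload check along these lines might catch both issues before a new version is created. This is a hypothetical helper, not part of Data Refinery: the function name, the exact byte threshold chosen for "~100 MB", and the use of the file extension as a stand-in for the file type are all assumptions.

```python
from pathlib import Path

# Assumed interpretation of the ~100 MB guideline; adjust as needed.
MAX_BYTES = 100 * 1024 * 1024


def check_upload(path, previous_suffix=None, max_bytes=MAX_BYTES):
    """Return a list of warnings for a file about to be uploaded as a new version."""
    path = Path(path)
    warnings = []
    # Compare extensions case-insensitively against the earlier version's file type.
    if previous_suffix is not None and path.suffix.lower() != previous_suffix.lower():
        warnings.append(
            f"File type {path.suffix!r} differs from earlier versions ({previous_suffix!r})."
        )
    size = path.stat().st_size
    if size >= max_bytes:
        warnings.append(
            f"File is {size} bytes; the recommended size is under {max_bytes} bytes."
        )
    return warnings
```

An empty list means neither check raised a concern. The check does not verify that the file type is one Data Refinery accepts, since that list is documented on the Redshift and Glue Best Practices page.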