Export Source Data
This document will explain how to export Source information from the Data Warehouse and how to re-upload the exported data to a new Source. The export of this data will require setting up a database tool, such as DBeaver
. See the Query the Data Warehouse page for help on how to do this.
Additionally, to export the Source data, Sources must be uploaded to the Data Warehouse. See Set-up and Query Source Versions for further instruction on this.
Table of contents
About the Example
Two versions of the same Source customer_table
will be used for this example. The two versions of the customer data uploaded have the same number and naming of columns. However, the first version does not have language
data and the second version does not have any job_title
data.
Moreover, not all of the last names match. So assume, for this example, the first version is more accurate than the second. However, the first version will not always have last names populated. So, if a last_name
exists in the first version, then that will be used. Otherwise, the last_name
from second version will be used.
Query the Source Data Versions
-
From the database tool, open the table and view the two versions of your Source. Sort by
id
to easily view the versions. -
Create a query that compares the table to itself.
-
Adjust the query to alter the result set. In this query, there are three changes being made to the data.
- First, the
last_name
field is being populated by the first version, unless the field is null. In which case, the second versionlast_name
field would be used. In this case, the query above Matias (id = 5), has a differentlast_name
value for each version and this query has made the first version a priority for this field. - Second, the first version has the
job_title
field populated and the second version does not. So, the first version is used to populate that field in the result set. - Third, the second version has the
language
field populated and the first version does not. So, the second version is used to populate thelanguage
field in the result set.
- First, the
Export Queried Source Data
- Right click the result set in the database tool and click Export Data.
-
Then, click the Export to CSV file option and click Next.
-
If the result set is over 10,000 rows, click the Extract type dropdown and select By segments to increase or reduce “Fetch size” as needed. When done, click Next.
-
For the Format settings, the default settings for a csv are sufficient.
-
Next, on the Output page, there is the “Directory” field that represents where the file will be saved. This can be changed as needed. The file name field uses the table name and timestamp as the default value for the name, but this can also be changed as needed. Once the needed changes have been made, click Next.
-
Verify Source and Target settings then confirm by clicking Proceed.
Upload Refined Source Data and View All Sources
Create a third version for the Source and upload the new file to it. For instructions on how to do this, see the Create a New Version and Upload Data to the New Version sections of the Set-up and Query Versions page.
Glue Crawlers
are scheduled to run every hour. The uploaded data will be available to view in the database tool once the Glue Crawler
has finished traversing the new data.
After the Crawler has run, open the database tool and refresh the table. Open the customers_table
and sort by id
. All three versions of the Source, including the refined Source version, can be found here.