Sources
Data Refinery Sources represent external systems that provide data to Data Refinery. Sources are created and managed under the Data Refinery Designer (DR Designer) Projects tab. Each Source is displayed under the Project it is associated with. A user can see these Projects and Sources only with the proper permissions or an assigned User Project Role (UPR). Once a Source is created, use the Source Upload action to bind the Source to its external data. Sources can also be managed through the Data Refinery Designer API (DR Designer API).
How to Create a Source
Data is made available for querying in Data Refinery through Data Refinery Sources. Follow the steps below to create a new Source.
- Once logged into Data Refinery Designer, the Projects page will load with the Projects list appearing in a column on the left of the screen. Select the desired Project from the list in order to create a new Source.
  If a user needs to create a new Project, see How to Create a Project.
- Once the Project has been selected, Project details should appear to the right of the Projects list. Click the Create Source button.
- Fill out the fields in the “Create Source” form.
  - Name: Required. Preferred Name of the Source.
  - Description: Required. Additional details for the Source, explaining Type, and including any special characters.
  - Type: Required. Source types indicate characteristics of the data bound to the Source, whether it is external, mixed, or finished for external use.
    - External: Data retrieved from an external system to be analyzed in Data Refinery.
    - Documentation: A Source that uses Artificial Intelligence (AI) models to extract structured Data Attributes from unstructured data files.
  - Classification: Optional. Required for External Source type. Represents the file type of data that can be uploaded to the Source. Options include: JSON, JSONL, CSV, ORC, and Parquet.
  Note. For more information about different Source types, see External Sources or Documentation Sources below.
- Click Create at the bottom of the form.
Once the Source has been created, it will appear in the Sources section of the Project details.
Refer to the POST (Create a Source) API in the DR Designer API Projects Reference for instructions on how to create Sources programmatically.
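As a rough illustration of the form fields above, the following Python sketch validates the inputs and assembles a request body that could be sent to the Create a Source API. The JSON field names (name, description, type, classification), the exact spelling of the Type and Classification values, and the function name are assumptions made for this example. The DR Designer API Projects Reference remains the authority on the real request schema.

```python
from typing import Optional

# Values taken from the "Create Source" form; the exact spelling the API
# expects is an assumption.
SOURCE_TYPES = {"External", "Documentation"}
CLASSIFICATIONS = {"JSON", "JSONL", "CSV", "ORC", "Parquet"}


def build_create_source_payload(
    name: str,
    description: str,
    source_type: str,
    classification: Optional[str] = None,
) -> dict:
    """Validate form-style inputs and return a request body.

    The field names in the returned dict are hypothetical.
    """
    if not name.strip():
        raise ValueError("Name is required")
    if not description.strip():
        raise ValueError("Description is required")
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"Type must be one of {sorted(SOURCE_TYPES)}")
    if classification is not None and classification not in CLASSIFICATIONS:
        raise ValueError(f"Classification must be one of {sorted(CLASSIFICATIONS)}")

    payload = {"name": name, "description": description, "type": source_type}
    if classification is not None:
        payload["classification"] = classification
    return payload
```

Validating on the client side before calling the API is a design choice for this sketch only; the API may apply additional or different rules.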
How to Update Source Details
Any user who can view a Source in DR Designer has the ability to update Source details. To edit information in an existing Source, follow the steps below.
- Select the Projects tab in the navigation bar at the top of the page.
- To locate the desired Source, first find the Project that the Source is associated with by scrolling down the list or typing the Project name in the search bar. Select the Project from the list on the left of the page and click the arrow to reveal a dropdown list of associated Sources.
- Select the Source that needs to be updated.
- Once the Source details are open, the “Name” and “Description” fields can be updated. This is indicated by the transparent pencil icon to the right of the field.
  When hovering over “Name” or “Description,” a user can select one of these fields to edit.
  - For Source Name, edit the field by selecting the box and typing a new name. All printable ASCII characters are valid.
  - For Source Description, edit the field by selecting the box and typing a new description.
  Note. When changing the “Name” of the Source, the underlying table in the Warehouse Database will be renamed. This may cause queries that other users have written to break. Be cautious when performing this type of update!
- After making an updated selection, select the green checkmark that appears directly below the “Name,” “Type,” or “Description” fields (see image below) or click away to save the change. Select the ‘X’ to discard the changes.
Refer to the PUT (Update Source) API in the DR Designer API Projects Reference for instructions on how to update Sources programmatically.
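Because renaming a Source also renames the underlying Warehouse Database table, it can help to check a new name before submitting it. The sketch below checks the "printable ASCII" rule stated above. It assumes that printable ASCII means code points 0x20 through 0x7E (space included) and that an empty name is rejected; the function name is hypothetical and is not part of the DR Designer API.

```python
def is_valid_source_name(name: str) -> bool:
    """Return True if name is non-empty and contains only printable ASCII.

    Assumption: "printable ASCII" covers code points 0x20 (space) to 0x7E (~).
    """
    return bool(name) and all(0x20 <= ord(ch) <= 0x7E for ch in name)
```

For example, is_valid_source_name("sales_2024 (v2)") passes, while a name containing a tab or an accented character does not.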
How to Sample Data
In DR Designer, a limited query is run to show a sample of the data that is uploaded to a Source. This sample is a preview of the data that users can review to ensure the data is in the correct place. Users can view this sample under the Sample Data tab of a Source in DR Designer. To sample the data uploaded to a Source, follow the steps below.
- To locate the desired Source, first find the Project that the Source is associated with by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project from the list on the left of the page and click the arrow to reveal a dropdown list of associated Sources.
- Select the Source whose data should be sampled.
- The Source details will open to the right of the page. Click the Sample Data tab.
- If the data has been uploaded and processed in the database, meaning AWS Glue has crawled the dataset, a table will automatically appear with the first ten entries.
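The idea of a limited sample query can be sketched as follows. This is only an analogy: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in, since the query that DR Designer actually runs against the Warehouse Database is not described here. The table name, columns, and function name are hypothetical.

```python
import sqlite3

SAMPLE_SIZE = 10  # the Sample Data tab shows the first ten entries


def sample_rows(conn: sqlite3.Connection, table: str, limit: int = SAMPLE_SIZE) -> list:
    """Return up to `limit` rows from `table` using a LIMIT clause."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,)).fetchall()


# Stand-in data: 25 rows in a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "orders" (id INTEGER, total REAL)')
conn.executemany('INSERT INTO "orders" VALUES (?, ?)', [(i, i * 1.5) for i in range(25)])

rows = sample_rows(conn, "orders")
```

A sample like this confirms that data landed in the expected table and shape; it is not meant for analysis.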
How to Delete a Source
To delete a Source, follow the steps below.
- On the Projects page, navigate to the desired Project and select the dropdown arrow to expose the existing associated Sources.
- Select the Source to delete. This will open Source details on the right side of the page.
- Click the Delete Source button in the top right-hand corner of the page.
- A dialog will appear, requiring the user to acknowledge that all Project associations and Source versions will be removed before deleting the Source. The user must select both options, enabling a Delete button that completes the task.
The data will remain queryable, but no new data will be uploaded. As a result, no new schema changes will be queryable. The Source is removed asynchronously, and removal may fail without notice.
Refer to the DELETE (Delete Source) API in the DR Designer API Projects Reference for instructions on how to programmatically delete a Source.
How to Manage Project-Source Associations
Project-Source associations link or un-link Sources to Projects for querying. A user with an Owner UPR or PROJECT_ADMIN permission has the authority to make these associations. See the Editing Project-Source Associations section on the Projects page for more information.
External Sources
External Sources represent data in a structured format from other external systems. New data can be uploaded directly to these Sources, and multiple versions of the data can be stored for analysis. Data in these Sources may be deleted.
How to Upload Source Data
Once a Data Refinery Source is defined, it must be bound to external data before it can support queries. The action to bind external data to a Source is called “Upload.” If a Source does not have a classification, one will be assigned using the extension of the next file uploaded to the Source. Follow the steps below to upload Source data.
- To locate the desired Source, first find the Project that the Source is associated with by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project from the list on the left of the page and click the arrow to reveal a dropdown list of associated Sources.
- Select the Source to upload data to.
- The Source details will open on the right side of the page. Click the Upload Dataset button. This will open an “Upload Dataset” form.
- Select the version to upload data to using the dropdown in the “Upload to Version” field. Then, choose a file to upload by clicking the Choose File button under the “Upload File” field.
  Currently supported file types are CSV, JSON, JSONL, ORC, and Parquet.
- When the fields are complete, select Create. A green banner will appear at the top of the screen indicating a successful upload of the dataset.
The dataset is now uploaded and linked to the Source. However, the uploaded file will not be available to query until AWS Glue crawls the dataset based on configuration settings.
Refer to the POST (Upload data to Source) sources/{ID}/upload API in the DR Designer API Projects Reference for instructions on uploading a Source programmatically.
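Since a Source without a classification takes one from the extension of the next uploaded file, it can be useful to check a file name before uploading. The following sketch maps extensions to the supported file types listed above. The mapping table and function name are assumptions for illustration; how Data Refinery itself interprets extensions (for example, letter case or compound extensions) is not specified here.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to Source classification.
EXTENSION_TO_CLASSIFICATION = {
    ".csv": "CSV",
    ".json": "JSON",
    ".jsonl": "JSONL",
    ".orc": "ORC",
    ".parquet": "Parquet",
}


def classification_for(filename: str) -> Optional[str]:
    """Return the classification implied by the file's final extension, or None."""
    suffix = Path(filename).suffix.lower()  # only the last extension is used
    return EXTENSION_TO_CLASSIFICATION.get(suffix)
```

A result of None signals that the file type is not among the supported ones and the upload is likely to be rejected or misclassified.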
How to Create a New Source Version
Data Refinery Sources can be versioned, where each Source can have many versions of data. Data Refinery automatically indexes queries against these versions. To create a new Source Version, follow the steps below.
- To locate the desired Source, first find the Project that the Source is associated with by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project from the list on the left of the page and click the arrow to reveal a dropdown list of associated Sources.
- Select the Source for which a new version will be created.
- The Source details will open on the right side of the page. Click the Create Version button. This will open a “Create Version” form.
- Add a version comment, if desired, and click the Create button. The new version should appear under the version tab of the Source details view. A user can now upload data to the new version!
For instructions on how to create a new Version for a Source programmatically, refer to the POST sources/{ID}/versions API in the DR Designer API Projects Reference.
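A request to the sources/{ID}/versions endpoint might be prepared as in the sketch below, which builds the request object without sending it. Only the path comes from this page. The base URL, the "comment" field name, and the function name are hypothetical, and authentication is omitted because it is not described here.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; replace with the real DR Designer API address.
BASE_URL = "https://dr-designer.example.com/api"


def build_create_version_request(source_id: str, comment: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request for sources/{ID}/versions."""
    safe_id = urllib.parse.quote(str(source_id), safe="")
    url = f"{BASE_URL}/sources/{safe_id}/versions"
    # "comment" is an assumed field name for the optional version comment.
    body = json.dumps({"comment": comment}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch can be inspected without network access.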
Documentation Sources
Documentation Sources parse unstructured documents into structured attributes using Artificial Intelligence (AI). Data in a Documentation Source cannot be versioned, since all attributes represent a point-in-time view of the document and of how the AI model extracted the data from it. Structured data cannot be uploaded directly to a Documentation Source; the only way to add data is to upload a document and extract its attributes.
How Documentation Sources Work
When creating a Documentation Source, users with an appropriate role or the PROJECT_ADMIN permission can specify up to 20 Attribute Targets. These Attribute Targets define specific data elements to extract from documents. Each attribute can include an optional description to document its purpose, though these descriptions are not provided to the AI model during extraction.
For more information about Attribute Target configuration, see Configuring Attribute Targets below. Documents may be uploaded before configuring Attribute Targets, but no data extraction will occur until Attribute Targets are properly configured.
After Attribute Targets have been configured, documents can be uploaded to the Documentation Source using either the standard upload APIs or via the Upload Dataset button in DR Designer. Document processing time varies based on document size and the amount of data being extracted, but will usually take a couple of minutes. Once processing completes, the extracted data is stored in the Warehouse Database and becomes available for querying. For more information on querying options, see Query the Data Warehouse.
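The constraints described in this section can be sketched as a small client-side check. The limit of 20 comes from the text above. The class and function names are hypothetical, and treating "properly configured" as "at least one Attribute Target exists" is an assumption.

```python
from dataclasses import dataclass
from typing import List

MAX_ATTRIBUTE_TARGETS = 20  # limit stated for a Documentation Source


@dataclass(frozen=True)
class AttributeTarget:
    name: str
    # Optional; documents the purpose and is not provided to the AI model.
    description: str = ""


def validate_attribute_targets(targets: List[AttributeTarget]) -> List[AttributeTarget]:
    """Raise ValueError if more than the allowed number of targets is given."""
    if len(targets) > MAX_ATTRIBUTE_TARGETS:
        raise ValueError(
            f"At most {MAX_ATTRIBUTE_TARGETS} Attribute Targets are allowed, got {len(targets)}"
        )
    return targets


def extraction_can_run(targets: List[AttributeTarget]) -> bool:
    """Assumption: extraction needs at least one configured Attribute Target."""
    return len(targets) > 0
```

Documents can still be uploaded while extraction_can_run returns False; under the behavior described above, they would simply produce no extracted data.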
Configuring Attribute Targets
To configure Attribute Targets, a user must have an Analyst or Owner User Project Role on a Project associated with the Documentation Source, or they must have the PROJECT_ADMIN global permission. To configure Attribute Targets, follow the steps below.
- Navigate to the Attribute Targets tab on a Source. This tab will only be visible on Documentation Source types.
- Select the Create Attribute Target button to open a modal window allowing creation of an Attribute Target.
- Select the Create button to create the Attribute Target.
Note. Updating and Deleting Attribute Targets is available via the API. See the Update API and the Delete API for more information. This feature will be added to the UI in a future release.
Data Refinery AI Privacy Information
Data Refinery uses Anthropic’s Claude through AWS Bedrock to process data uploaded to Data Refinery. Information on Bedrock’s security and privacy policies is available in the AWS Bedrock documentation.
Data Refinery does not log or store the requests or responses made to the AI model in their raw format; only the resulting attributes extracted from the documents are stored. After a document has been successfully processed, the raw document is removed from the Data Refinery system and can no longer be retrieved. Metadata about the document, such as its name and size, is retained for audit and billing purposes.