Sources

Data Refinery Sources represent external systems that provide data to Data Refinery. Sources are created and managed under the Data Refinery Designer (DR Designer) Projects tab. Each Source is displayed under the Projects it is associated with. A user can see these Projects and Sources only with the proper permissions or an assigned User Project Role (UPR). Once a Source is created, use the Source Upload action to bind the Source to its external data. Sources can also be managed through the Data Refinery Designer API (DR Designer API).

Table of contents

How to Create a Source
How to Update Source Details
How to Sample Data
How to Delete a Source
How to Manage Project-Source Associations
External Sources
- How to Upload Source Data
- How to Create a New Source Version
Documentation Sources

How to Create a Source

Data is made available for querying in Data Refinery through Data Refinery Sources. Follow the steps below to create a new Source.

Once logged into Data Refinery Designer, the Projects page will load with the Projects list appearing in a column to the left of the screen. A user should select the desired Project from the list in order to create a new Source.

If a user needs to create a new Project, see How to Create a Project.
Once the Project has been selected, Project details should appear to the right of the Projects list. Click the Create Source button.
Fill out the fields in the “Create Source” form.
- Name: Required. Preferred Name of the Source.
- Description: Required. Additional details about the Source, such as an explanation of its Type. Special characters are allowed.
- Type: Required. Source types indicate characteristics of the data bound to the Source whether it is external, mixed, or finished for external use.
  - External: Data retrieved from an external system to be analyzed in Data Refinery.
  - Documentation: A Source that uses Artificial Intelligence (AI) models to extract structured Data Attributes from unstructured data files.
- Classification: Required for External Source type. Represents the file type of data that can be uploaded to the Source. Options include: JSON, JSONL, CSV, ORC, and Parquet. Ignored for Documentation Source type.
Note. For more information about different Source types, see External Sources or Documentation Sources below.
Click Create at the bottom of the form.

Once the Source has been created, it will appear in the Sources section of the Project details.

Refer to the POST (Create a Source) API in the DR Designer API Projects Reference for instructions on how to create Sources programmatically.
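
The form rules above can be checked in a small validation step before calling the Create a Source API. The following is a minimal sketch only. The function name build_create_source_payload and the JSON field names (name, description, type, classification) are assumptions made for illustration and are not taken from the DR Designer API Projects Reference, which remains the authority on the actual request shape.

```python
import json

# Values listed in the "Create Source" form description.
SOURCE_TYPES = {"External", "Documentation"}
CLASSIFICATIONS = {"JSON", "JSONL", "CSV", "ORC", "Parquet"}


def build_create_source_payload(name, description, source_type, classification=None):
    """Validate form-style input and return a dict for a create request.

    The keys of the returned dict are hypothetical field names.
    """
    if not name:
        raise ValueError("Name is required")
    if not description:
        raise ValueError("Description is required")
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"Type must be one of {sorted(SOURCE_TYPES)}")

    payload = {"name": name, "description": description, "type": source_type}

    if source_type == "External":
        # The form requires a Classification for External Sources.
        if classification not in CLASSIFICATIONS:
            raise ValueError(f"Classification must be one of {sorted(CLASSIFICATIONS)}")
        payload["classification"] = classification
    # Classification is ignored for Documentation Sources, so it is left out.
    return payload


if __name__ == "__main__":
    example = build_create_source_payload("Orders", "Daily order export", "External", "CSV")
    print(json.dumps(example))
```

The sketch mirrors the form, so it rejects an External Source without a Classification. The API itself may be more permissive.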

How to Update Source Details

Any user who can view a Source in DR Designer has the ability to update Source details. To edit information in an existing Source, follow the steps below.

Select the Projects tab in the navigational bar at the top of the page.
To locate the desired Source, a user must first search for the Project that the Source has an association to. A user can do this by scrolling down the list or typing the Project name in the search bar. Select the Project among the list to the left of the page and click the arrow to reveal a dropdown list of associated Sources.
Select the Source that needs to be updated.
Once the Source details are open, the “Name” and “Description” fields can be updated. This is indicated by the transparent pencil icon to the right of the field.

When hovering over “Name” or “Description,” a user can select one of these fields to edit.
- For Source Name, edit the field by selecting the box and typing a new name. All printable ASCII characters are valid.
- For Source Description, edit the field by selecting the box and typing a new description.
Note. When changing the “Name” of the Source, the underlying table in the Warehouse Database will be renamed. This may cause queries that other users have written to break. Be cautious when performing this type of update!
After editing a field, select the green checkmark that appears directly below the “Name” or “Description” field (see image below), or click away, to save the change. Select the ‘X’ to discard the changes.

Refer to the PUT (Update Source) API in the DR Designer API Projects Reference for instructions on how to update Sources programmatically.
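
Because renaming a Source also renames the underlying table in the Warehouse Database, it can help to check a new name before submitting it. The sketch below is illustrative and not part of DR Designer. The steps above state that all printable ASCII characters are valid. Treating any other character as invalid, and rejecting an empty name, are assumptions of this sketch.

```python
def is_printable_ascii(text):
    """Return True if text is non-empty and contains only printable ASCII.

    Printable ASCII is taken here as the range 0x20 (space) through 0x7E (~).
    """
    return bool(text) and all(0x20 <= ord(ch) <= 0x7E for ch in text)


if __name__ == "__main__":
    for candidate in ["Orders_2024 v2", "Caf\u00e9 sales", "tab\tseparated"]:
        print(repr(candidate), is_printable_ascii(candidate))
```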

How to Sample Data

In DR Designer, a limited query is run to show a sample of the data that is uploaded to a Source. This sample is a preview of the data that users can review to ensure the data is in the correct place. Users can view this sample under the Sample Data tab of a Source in DR Designer. To sample the data uploaded to a Source, follow the steps below.

To locate the desired Source, a user must first search for the Project that the Source has an association to. A user can do this by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project among the list to the left of the page and click the arrow to reveal a dropdown list of associated Sources.
Select the Source whose data should be sampled.
The Source details will open to the right of the page. Click the Sample Data tab.
If the data has been uploaded and processed in the database, meaning AWS Glue has crawled the dataset, a table will automatically appear with the first ten entries.

How to Delete a Source

To delete a Source, follow the steps below.

On the Projects page, navigate to the desired Project and select the dropdown arrow to expose the existing associated Sources.
Select the Source to delete. This will open Source details on the right side of the page.
Click the Delete Source button in the top right-hand corner of the page.
A dialog will appear, requiring the user to acknowledge that all Project associations and Source versions will be removed before deleting the Source. The user must select both options, enabling a Delete button that completes the task.

The data will remain queryable, but no new data will be uploaded. As a result, no new schema changes will be queryable. The Source is removed asynchronously, and removal may fail without notice.

Refer to the DELETE (Delete Source) API in the DR Designer API Projects Reference for instructions on how to programmatically delete a Source.

How to Manage Project-Source Associations

Project-Source associations link or un-link Sources to Projects for querying. A user with an Owner UPR or PROJECT_ADMIN permission has the authority to make these associations. See the Editing Project-Source Associations section on the Projects page for more information.

External Sources

External Sources represent data in a structured format from other external systems. New data can be uploaded directly to these Sources, and multiple versions of the data can be stored for analysis. Data in these Sources may be deleted.

How to Upload Source Data

Once a Data Refinery Source is defined, it must be bound to external data before it can support queries. The action to bind external data to a Source is called “Upload.” Follow the steps below to upload Source data. If a Source does not have a classification, one will be assigned using the extension of the next file uploaded to the Source.

To locate the desired Source, a user must first search for the Project that the Source has an association to. A user can do this by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project among the list to the left of the page and click the arrow to reveal a dropdown list of associated Sources.
Select the Source a user will be uploading data to.
The Source details will open on the right side of the page. Click the Upload Dataset button. This will open an “Upload Dataset” form.
Select the version to upload data to using the dropdown in the “Upload to Version” field. Then, choose a file to upload by clicking the Choose File button under the “Upload File” field.

Currently supported file types are CSV, JSON, JSONL, ORC, and Parquet.
When the fields are complete, select Create. A green banner will appear at the top of the screen indicating a successful upload of the dataset.

The dataset is now uploaded and linked to the Source. However, the uploaded file will not be available to query until AWS Glue crawls the dataset based on configuration settings.
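
As noted at the start of this section, a Source without a classification is assigned one from the extension of the next uploaded file. The sketch below shows one way such a lookup could work. The extension-to-classification table is an assumption based on the supported file types listed above, and Data Refinery's actual matching rules (for example, case handling or alternative extensions) may differ.

```python
from pathlib import Path

# Hypothetical mapping from file extension to Source classification.
EXTENSION_TO_CLASSIFICATION = {
    ".csv": "CSV",
    ".json": "JSON",
    ".jsonl": "JSONL",
    ".orc": "ORC",
    ".parquet": "Parquet",
}


def infer_classification(filename):
    """Return the classification implied by the file extension, or None."""
    suffix = Path(filename).suffix.lower()  # only the final extension is used
    return EXTENSION_TO_CLASSIFICATION.get(suffix)


if __name__ == "__main__":
    for name in ["orders.CSV", "events.jsonl", "archive.tar.gz", "README"]:
        print(name, "->", infer_classification(name))
```

Checking a file name this way before upload can catch an unsupported file type early. Data Refinery still decides which classification is actually assigned.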

Refer to the POST (Upload data to Source) sources/{ID}/upload API in the DR Designer API Projects Reference for instructions on uploading a Source programmatically.

How to Create a New Source Version

Data Refinery Sources can be versioned, where each Source can have many versions of data. Data Refinery automatically indexes queries against these versions. To create a new Source Version, follow the steps below.

To locate the desired Source, a user must first search for the Project that the Source has an association to. A user can do this by scrolling down the list or typing the Project name in the search bar on the Projects page. Select the Project among the list to the left of the page and click the arrow to reveal a dropdown list of associated Sources.
Select the Source for which a new version will be created.
The Source details will open on the right side of the page. Click the Create Version button. This will open a “Create Version” form.
Add a version comment, if desired, and click the Create button. The new version should appear under the version tab of the Source details view. A user can now upload data to the new version!

For instructions on how to create a new Version for a Source programmatically, refer to the POST sources/{ID}/versions API in the DR Designer API Projects Reference.
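
The request described above can be prepared with the Python standard library. This is a sketch under stated assumptions: the base URL is a placeholder, the comment body field is a hypothetical name for the optional version comment, and authentication is omitted. Only the POST method and the sources/{ID}/versions path come from this page. The code builds the request object and does not send it.

```python
import json
from urllib.request import Request

# Placeholder host and prefix; replace with the real DR Designer API base URL.
BASE_URL = "https://dr-designer.example.com/api"


def build_create_version_request(source_id, comment=None):
    """Build (but do not send) a POST request for sources/{ID}/versions."""
    # "comment" is a hypothetical field name for the optional version comment.
    body = {"comment": comment} if comment else {}
    return Request(
        url=f"{BASE_URL}/sources/{source_id}/versions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_version_request(42, "Q3 reload")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Once the real base URL, credentials, and body fields are filled in from the DR Designer API Projects Reference, the request could be sent with urllib.request.urlopen.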

Documentation Sources

Documentation Sources parse unstructured documents into structured attributes using Artificial Intelligence (AI). In a Documentation Source, data may not be versioned since all attributes represent a point-in-time view of the document and how the AI model extracted the data from the document. Data may not be manually uploaded to a Documentation Source. The only method to add data is by uploading a document and extracting the attributes.

How Documentation Sources Work

When creating a Documentation Source, users with an appropriate role or the PROJECT_ADMIN permission can specify up to 20 Attribute Targets. These Attribute Targets define specific data elements to extract from documents. Each attribute can include an optional description to document its purpose, though these descriptions are not provided to the AI model during extraction.

For more information about Attribute Target configuration, see the Configure Attribute Targets section below. DR Designer prevents Document upload until at least one Attribute Target is defined. Configure all desired Attribute Targets before uploading Documents.
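
The two limits described above (no more than 20 Attribute Targets, and at least one before a Document can be uploaded) can be checked together. The function below is an illustrative sketch, not DR Designer code, and its name and messages are assumptions.

```python
MAX_ATTRIBUTE_TARGETS = 20  # upper limit stated for a Documentation Source


def check_attribute_targets(target_names):
    """Return a list of problems found in a set of Attribute Target names.

    An empty list means the set satisfies both rules.
    """
    problems = []
    if len(target_names) == 0:
        problems.append("at least one Attribute Target is required before Document upload")
    if len(target_names) > MAX_ATTRIBUTE_TARGETS:
        problems.append(
            f"{len(target_names)} Attribute Targets exceed the limit of {MAX_ATTRIBUTE_TARGETS}"
        )
    return problems


if __name__ == "__main__":
    print(check_attribute_targets(["Legal Name", "Role"]))
    print(check_attribute_targets([]))
```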

After Attribute Targets have been configured, documents can be uploaded to the Documentation Source using either the standard upload APIs or via the Upload Document button in DR Designer. Document processing time varies based on document size and the amount of data being extracted, but will usually take a couple of minutes. Once processing completes, the extracted data is stored in the Warehouse Database and becomes available for querying. For more information on querying options, see the Query the Data Warehouse page.

How to Create a Documentation Source

Data is made available for querying in Data Refinery through Data Refinery Sources. Follow the steps below to create a new Documentation Source.

Once logged into DR Designer, the Projects page will load with the Projects list appearing in a column to the left of the screen. Select the desired Project from the list in order to create a new Documentation Source.

If a user needs to create a new Project, see the How to Create a Project page.
Once the Project has been selected, Project details should appear to the right of the Projects list. Click the Create Source button.
Fill out the fields in the “Create Source” form. Provide the Name and Description, and select “Documentation” as the Type. For more information about the “Create Source” form fields, see How to Create a Source.
Click Create at the bottom of the form.

Once the Source has been created, it will appear in the Sources section of the Project details.
Next, configure Attribute Targets. See the Configure Attribute Targets section below.

Configure Attribute Targets

Source Attribute Targets are the names of the data values in a document that should be targeted for extraction. When a Document is uploaded, the full set of Attribute Targets along with additional context information (see below) is submitted to the AI Model. Extracted values will appear in the Data Warehouse with the Attribute Target name. To configure Attribute Targets, a user must have an Analyst or Owner User Project Role on a Project associated with the Documentation Source, or they must have the PROJECT_ADMIN global permission.
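
The access rule above reduces to a simple either-or check. The sketch below is for illustration only. How roles and permissions are actually represented in Data Refinery is not described on this page, so the plain strings and the function name are assumptions.

```python
# Roles that allow Attribute Target configuration, per the rule above.
CONFIGURING_ROLES = {"Analyst", "Owner"}


def can_configure_attribute_targets(roles_on_associated_projects, global_permissions):
    """Return True if the user may configure Attribute Targets.

    roles_on_associated_projects: User Project Roles the user holds on
        Projects associated with the Documentation Source.
    global_permissions: global permissions held by the user.
    """
    if "PROJECT_ADMIN" in global_permissions:
        return True
    return any(role in CONFIGURING_ROLES for role in roles_on_associated_projects)


if __name__ == "__main__":
    print(can_configure_attribute_targets(["Analyst"], []))
    print(can_configure_attribute_targets([], ["PROJECT_ADMIN"]))
    print(can_configure_attribute_targets([], []))
```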

To configure Attribute Targets, follow the steps below.

Select the Attribute Targets tab under Source details. This tab will only be visible on Documentation Source types.
Select the Create Attribute Target button to open a modal window allowing creation of an Attribute Target.
Create Attribute Target. Enter the desired Attribute name and its description. A user should create Attribute Targets based on information desired for future queries.

Attribute Targets should focus on simple extractions, using terms that are identified in the document. Attribute Targets do not need to match the document text exactly, as the AI model will infer the value from the document. This process might require several trials and thorough editing of Attribute Targets to ensure the data being collected is what’s desired.

For example, Attribute Targets like “Legal Name” or “Role” are more precise than a target for “Name” alone. Precise targets yield more accurate results and can reduce the time spent revising them.

For complicated Attribute Targets, filling out the documentationAdditionalPromptData field with additional context on how to extract the attributes will provide more accurate results. This can be done through the Update Source API, and will be added to the UI shortly.
Select the Create button to create the Attribute Target.
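
For the additional context mentioned in the steps above, a request body for the Update Source API might be assembled as follows. This is a hedged sketch. Only the field name documentationAdditionalPromptData comes from this page. Treating it as a single free-text value, the helper name, and the example hints are assumptions, and any other fields the Update Source API requires are omitted.

```python
import json


def build_additional_prompt_update(context_lines):
    """Join per-attribute hints into one text value for the update body."""
    # Blank lines are dropped and surrounding whitespace is trimmed.
    text = "\n".join(line.strip() for line in context_lines if line.strip())
    if not text:
        raise ValueError("additional context must not be empty")
    return {"documentationAdditionalPromptData": text}


if __name__ == "__main__":
    # Illustrative hints only; write hints that fit the actual documents.
    body = build_additional_prompt_update([
        "Legal Name: the full registered name of the signing entity.",
        "Role: the signer's job title as printed under the signature.",
    ])
    print(json.dumps(body, indent=2))
```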

Update Attribute Targets

To update Attribute Targets, a user must have an Analyst or Owner User Project Role on a Project associated with the Documentation Source, or they must have the PROJECT_ADMIN global permission.

To update Attribute Targets, follow the steps below.

Select the Attribute Targets tab under Source details. This tab will only be visible on Documentation Source types.
A list of Attribute Targets should appear. The “Attribute Name” and “Attribute Description” fields can be updated.

When hovering over or selecting the pencil icon next to “Attribute Name” or “Attribute Description,” a user can select one of these fields to edit.
- For Attribute Name, edit the field by selecting the box and typing a new name. All printable ASCII characters are valid.
- For Attribute Description, edit the field by selecting the box and typing a new description.
After editing a field, select the green checkmark that appears directly below the “Attribute Name” or “Attribute Description” field (see image below), or click away, to save the change. Select the ‘X’ to discard the changes.

Refer to the PUT (Update Source Attribute Target) API in the DR Designer API Projects Reference for instructions on how to update Source Attribute Targets programmatically.

Delete Attribute Targets

To delete Attribute Targets, a user must have an Analyst or Owner User Project Role on a Project associated with the Documentation Source, or they must have the PROJECT_ADMIN global permission.

To delete Attribute Targets, follow the steps below.

Select the Attribute Targets tab under Source details. This tab will only be visible on Documentation Source types.
In the Action column to the right of the Attribute Target, select the Delete button.
A dialog box will appear to warn the user that the Attribute Target will be deleted and the action cannot be undone. Select the Delete button to confirm the deletion or the Cancel button to return to the Attribute Targets list.

Upon deletion, the user will return to the Attribute Targets list.

Refer to the DELETE (Delete Source Attribute Target) API in the DR Designer API Projects Reference for instructions on how to delete Source Attribute Targets programmatically.

How to Upload a Document

After a Documentation Source has been created and Attribute Targets have been configured, a user is ready to upload a document.

Under the desired Project and its linked Documentation Source, select the Documents tab in the Source details.
Select the Upload Document button.
An “Upload Document” modal will appear. Click the Choose File button to select a file for upload.
Once a file has been chosen, click Upload to finish the process.

Data Refinery AI Privacy Information

Data Refinery uses Anthropic’s Claude through AWS Bedrock to process data uploaded to Data Refinery. Information on Bedrock’s security and privacy policies is available below.

Data Refinery does not log or store the requests or responses made to the AI model in their raw format; only the resulting attributes extracted from the documents are stored. After a document has been successfully processed, the raw document is removed from the Data Refinery system and can no longer be retrieved. Metadata about the document, such as its name and size, is retained for audit and billing purposes.