Data Refinery User Guide
Welcome to the Kingland Data Refinery User Guide. This guide begins with a product overview and continues with detailed feature information. The overview provides a high-level description of Data Refinery and an example of how it can be used.
What is Data Refinery
Kingland Data Refinery is a regulatory and risk data refinement platform that is designed for repeated implementation and use across the enterprise. Data Refinery provides users with the ability to investigate, analyze, and change data in different ways. Customers are able to avoid the many common pitfalls of data analysis projects and quickly begin to analyze data. Data Refinery does this by making common data analysis tasks easier, including:
- Ingesting Data. Instead of establishing a new database, users can quickly and easily ingest data from multiple, disparate data sources.
- Deriving Schemas. Instead of defining new database schemas, creating new tables, or executing complex Extract, Transform, and Load (ETL) jobs to transform data, Data Refinery infers the schema of uploaded files and automatically stores the data in a table for querying.
- Sharing Data. Instead of needing to establish a folder share or a read-only database, Data Refinery provides the ability to assign user permissions to designated data.
- Versioning Data. Instead of needing to manually version every upload and maintain a table structure to track versions, Data Refinery permits creating multiple versions of the same Source. Versions of a Source are stored in the same table, with the version number incrementing on each upload, and Data Refinery automatically indexes those versions for querying.
- Executing Automated Background Tasks. Instead of needing to create and manage infrastructure and scheduling utilities, users can define and schedule single-run or periodic Tasks that are executed in the background. Background Tasks can be used to offload time-consuming and resource-intensive processes.
By avoiding common pitfalls, Data Refinery makes the creation of data analysis projects easier. This allows users to establish new data projects from existing data sources without needing to ask for new extracts from those sources.
Users can interact with Data Refinery in two ways: the Data Refinery Designer (DR Designer), a web-based user interface, or the Data Refinery Designer API (DR Designer API), a web-based programmatic REST API.
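As a sketch of the programmatic path, the snippet below assembles a hypothetical DR Designer API request. The host, endpoint path, and bearer-token header are illustrative assumptions, not documented values; consult the DR Designer API reference for the actual contract.

```python
import json

BASE_URL = "https://refinery.example.com/api/v1"  # assumed host and path


def build_list_sources_request(token: str) -> dict:
    """Assemble the pieces of a hypothetical GET /sources call.

    Nothing is sent here; this only shows the shape such a request
    might take against a REST-style DR Designer API.
    """
    return {
        "method": "GET",
        "url": f"{BASE_URL}/sources",
        "headers": {"Authorization": f"Bearer {token}"},
    }


request = build_list_sources_request("my-api-token")
print(json.dumps(request, indent=2))
```

A real client would pass these pieces to an HTTP library; the separation here keeps the example self-contained and offline.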
Core Concepts
As part of understanding Data Refinery, it is helpful to understand several core concepts. These concepts appear in many areas of Data Refinery and hence throughout this user guide.
Sources
A Data Refinery Source represents data from an external system. A Source can involve a one-time upload or repeated uploads of data. One external system can provide multiple Sources of data to be stored within the Data Refinery.
Sources can be created through the DR Designer or DR Designer API. They consist of four simple items of information:
- Name. A name for the Source that will show up in DR Designer and is queryable via DR Designer API. The name must contain only printable ASCII characters.
- Description. A description of what is contained in the Source, including details about the Source system and how often the data is refreshed.
- Classification. A file type classification for the Source. If a Source is created without a classification, the classification is derived from the file extension of the next uploaded file and applied to the Source.
- Type. Source types indicate characteristics of the data bound to the Source such as whether it is external data, mixed data, or finished data ready for external use.
Each Source can have many versions, and may have data uploaded to it independently of other Sources.
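The four items above might be expressed as a payload like the following. The field names and the `"external"` type value are illustrative assumptions about how the DR Designer API represents a Source, not documented values.

```python
# Hypothetical Source definition, mirroring the four items described
# above: Name, Description, Classification, and Type.
source = {
    "name": "sales-system-1",          # printable ASCII characters only
    "description": "Daily CSV export from Sales System 1.",
    "classification": "csv",           # file type classification
    "type": "external",                # e.g. external, mixed, or finished
}

# The guide states the name must contain only printable ASCII characters;
# this check enforces that constraint locally.
assert all(32 <= ord(c) <= 126 for c in source["name"])
print(source["name"])
```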
Note. It can take up to one hour for some changes to the Source or its data to appear for querying. These changes include but are not limited to creating a new Source, updating a Source name, and adding fields to a Source. This time-frame is a configurable setting that can be reduced to make the data queryable sooner. However, this can potentially lead to increased usage of cloud compute resources as data is scanned for changes more often.
Projects
A Data Refinery Project represents a grouping of Sources. A Project serves as a method for assigning users and permissions to the data contained within the grouped Sources. Kingland recommends that each Project have an objective for its purpose so that the listing of Projects acts as a list of open activities within Data Refinery. However, Projects can be used for anything that requires a grouping of Sources.
Projects can be created through DR Designer or DR Designer API, and require three simple items of information:
- Name. A name for the Project that will show up in DR Designer and is queryable via DR Designer API.
- Description. A description of the contents or purpose of the Project, potentially listing the objective of the Project or whether it’s currently active.
- Project Type. Project types indicate the intended usage of the Project.
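The three items above might look like the following payload. The field names and the `"analysis"` type value are assumptions for illustration only.

```python
# Hypothetical Project definition mirroring the three required items.
REQUIRED_FIELDS = ("name", "description", "project_type")

project = {
    "name": "Sales Dashboard",
    "description": "Objective: consolidate sales data across departments.",
    "project_type": "analysis",   # assumed type value
}


def is_complete(payload: dict) -> bool:
    """Check that every required Project field is present and non-empty."""
    return all(payload.get(field) for field in REQUIRED_FIELDS)


print(is_complete(project))  # True
```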
Background Tasks
A Data Refinery Background Task represents a unit of work. This work can be executed immediately or scheduled to run periodically (see more in Getting Started with Background Tasks). Each Background Task is given an independent runtime with dedicated CPU, Memory, and Storage. There is no duration limit on how long a Background Task may run.
Background Tasks can be created through DR Designer or DR Designer API, and require a few simple items of information:
- Name. A name for the Background Task that is descriptive of what it does.
- Schedule. A schedule using cron syntax that details when the task should run.
- Image URL. A URL to the task definition image.
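A Background Task definition built from these items might look like the sketch below. The field names, the registry URL, and the validation helper are illustrative assumptions; only the cron syntax for the schedule is stated by this guide.

```python
# Hypothetical Background Task definition.
task = {
    "name": "nightly-sales-rollup",
    "schedule": "0 2 * * *",   # cron syntax: run at 02:00 every day
    "image_url": "registry.example.com/tasks/sales-rollup:latest",
}


def looks_like_cron(expr: str) -> bool:
    """Very loose sanity check: standard cron expressions have five
    space-separated fields (minute, hour, day, month, weekday)."""
    return len(expr.split()) == 5


print(looks_like_cron(task["schedule"]))  # True
```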
Querying Data
All data that is uploaded to a Source is exposed via a Database URL for querying using standard ANSI SQL. Currently, Data Refinery uses a Redshift database driver for queries, but other database systems may be supported in the future. Each Source that is created is queryable as its own database table within the Database URL. Each user of the Data Refinery has their own database username and password to access the database.
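Because each Source is exposed as an ordinary database table, querying it is plain ANSI SQL. The sketch below uses Python's built-in sqlite3 as a local stand-in for the Redshift-backed Database URL; the table and column names are illustrative assumptions.

```python
import sqlite3

# sqlite3 stands in for the Database URL here so the example runs
# offline; against Data Refinery you would connect a Redshift-compatible
# driver to the Database URL with your own username and password.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_system_1 (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_system_1 VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0)],
)

# One table per Source; table naming is assumed to follow the Source name.
total = conn.execute("SELECT SUM(amount) FROM sales_system_1").fetchone()[0]
print(total)  # 350.0
conn.close()
```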
Example of Using Data Refinery
This Data Refinery case study begins with an example company attempting to produce a consolidated sales report across many different sales departments. This company has grown exponentially through acquisitions, which means they have many different sales systems, each tracking a substantial amount of data in various formats. The goal of the company is to produce a single, consolidated dashboard showing their aggregate sales across departments.
Sources
In this case study, the company would import data from each sales system, placing each system into its own Source. Each sales system provides data in a different format, and on a different schedule:
- Sales System 1 provides data in CSV via a database export, and supplies the data once per day
- Sales System 2 provides data in JSON because it’s exposed via a REST API, and supplies the data once per hour
- Sales System 3 provides data in XML, and supplies the data once per week as part of a custom-written report. No one in the company currently supports this system so the format cannot change.
The output from each sales system requires no changes to the data files to work with Data Refinery. Each sales system becomes its own Source within Data Refinery, with files uploaded in their original formats. Data Refinery then exposes each Source as a new database table for use in query tools.
Projects
To start with, a Project called “Sales Dashboard” is created. This Project contains the three Sources from the previous section. Several Data Engineers are added to the Project to help craft queries for a sales dashboard. A user is created for a Business Intelligence (BI) dashboard tool and added to the Project.
Data Refinery Data Access
The BI dashboard has access to the data in the Project for running the dashboard query. The Data Engineers write a query using UNION statements that combines the individual data Sources into a single result set, which the BI tool can render as a bar graph.
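A UNION query of this shape is sketched below, again using sqlite3 as an offline stand-in for the Database URL. The table names, the `department`/`amount` columns, and the sample rows are illustrative assumptions.

```python
import sqlite3

# Local stand-in for the three Source tables described in the case study.
conn = sqlite3.connect(":memory:")
for table in ("sales_system_1", "sales_system_2", "sales_system_3"):
    conn.execute(f"CREATE TABLE {table} (department TEXT, amount REAL)")

conn.execute("INSERT INTO sales_system_1 VALUES ('hardware', 100.0)")
conn.execute("INSERT INTO sales_system_2 VALUES ('software', 200.0)")
conn.execute("INSERT INTO sales_system_3 VALUES ('services', 50.0)")

# UNION ALL stacks the per-Source tables into one result set,
# which a BI tool could then aggregate and chart.
union_sql = """
    SELECT department, amount FROM sales_system_1
    UNION ALL
    SELECT department, amount FROM sales_system_2
    UNION ALL
    SELECT department, amount FROM sales_system_3
"""
rows = conn.execute(union_sql).fetchall()
print(rows)
conn.close()
```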
The BI tool connects directly to Data Refinery using the Database URL. Each time data in Data Refinery is updated, the BI tool automatically refreshes the dashboard, combining the data from all three upstream sales systems into a single bar chart. This provides visibility into those systems without requiring a unified schema or ETL jobs to reformat data.
As shown, each sales system is ingested as its own Source into Data Refinery under the Project Sales Dashboard. The BI Tool then queries the Warehouse Database to render the desired data.
Help Resources
If this User Guide is incorrect or incomplete, please reach out to a Customer Service Lead at Kingland for assistance.