Sep 26, 2024
Engineering
Introducing Universal Data Representation (UDR)
Author:
Joel Christner
Overview
Data is the core of any analytics, machine learning, or artificial intelligence workflow. Business data is spread across a variety of silos, and while a multitude of data management stacks exist, many of them fail to yield meaningful results for the core use cases they were deployed to address. Making matters worse, the market at large is highly fragmented, and selecting one particular tool often narrows the choice of adjacent tools needed to appropriately satisfy a set of data management use cases.
What is UDR?
UDR, or Universal Data Representation, is a patent-pending core architecture element from View. It provides a homogeneous representation of heterogeneous data from heterogeneous data sources. As part of our processing pipeline, we examine data deposited into our platform (through S3, the API, or our crawlers and connectors) and identify the type of content, understand its geometry, extract the terms contained within, infer the schema and its types, produce a flattened representation of the source data, extract semantic cells and chunk their contents, and produce an inverted index for the data.
Why is each of these important?
Type detection – we may receive a hint that a file is of type application/json, or the file might have the docx extension, but neither guarantees the actual content. Magic signature analysis yields the actual type of the data, which allows us to customize the processing performed against that data
Geometry – understanding the geometry of the data, for instance, row counts, cell counts, maximum depth, maximum number of nodes, allows us to better understand how deeply positioned data might be and if any irregularities in its structure exist
Term extraction – extraction of terms from within the document creates the starting point from which we can build a dictionary identifying which documents contain certain words or values
Schema inference – for semi-structured (e.g. JSON) and structured (e.g. SQL data tables) data types, understanding the key-value pairs (or the column definitions) allows us to understand which data assets contain what kind of information (e.g. string, integer)
Flattened representation – by flattening the schema hierarchy and reproducing it in a queryable, XPath-like form, traversing a document and identifying documents that have specific key-value combinations becomes easier
Semantic cell extraction – with an understanding of the type of content, we extract regions from within the data that contain semantically relevant data, and chunk their contents into appropriately sized pieces
Inverted index – the foundation of any search engine, an inverted index is useful to identify which documents contain a given term or value
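To make a few of these steps concrete, here is a minimal Python sketch of magic-signature type detection, flattening, schema inference, and an inverted index over JSON-like documents. It is an illustration only, not View's implementation: the function names (`detect_type`, `flatten`, `infer_schema`, `build_inverted_index`), the small signature table, the path syntax, and the sample documents are all assumptions made for this example.

```python
import json
import re
from collections import defaultdict

# Assumed, deliberately tiny signature table; a real detector knows many more.
MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # also the container format behind docx
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def detect_type(raw: bytes) -> str:
    """Guess a content type from the bytes themselves, ignoring hints."""
    for magic, content_type in MAGIC_SIGNATURES.items():
        if raw.startswith(magic):
            return content_type
    try:
        json.loads(raw.decode("utf-8"))
        return "application/json"
    except ValueError:  # covers both decoding and JSON parsing failures
        return "application/octet-stream"


def flatten(node, path=""):
    """Yield (path, value) pairs for every leaf of a parsed document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(node, list):
        for position, value in enumerate(node):
            yield from flatten(value, f"{path}[{position}]")
    else:
        yield path, node


def infer_schema(flat):
    """Map each path (array positions collapsed) to the leaf types seen."""
    schema = defaultdict(set)
    for path, value in flat:
        schema[re.sub(r"\[\d+\]", "[]", path)].add(type(value).__name__)
    return {path: sorted(types) for path, types in schema.items()}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for _, value in flatten(doc):
            for term in re.findall(r"\w+", str(value).lower()):
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by a made-up document id.
documents = {
    "doc1": {"user": {"name": "Ada", "tags": ["admin", "ops"]}, "age": 36},
    "doc2": {"user": {"name": "Grace", "tags": ["ops"]}},
}

print(detect_type(b'{"a": 1}'))                  # application/json
print(dict(flatten(documents["doc1"])))          # path -> leaf value
print(infer_schema(flatten(documents["doc1"])))  # path -> type names
print(build_inverted_index(documents)["ops"])    # ['doc1', 'doc2']
```

Because every leaf ends up addressed by a path such as `/user/tags[0]`, finding documents with a specific key-value combination reduces to comparing (path, value) pairs, and the inverted index answers "which documents mention this term" with a single dictionary lookup.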
How is UDR used?
Having a homogeneous representation of heterogeneous data is powerful in numerous ways:
Providing a consistent structure from which to query data assets (discover data relevant to a given task)
Creating meaningful relationships amongst data assets given the properties within the data itself
Once View generates UDR, we populate a graph database with metadata (including some of the metadata described above), and we populate our semantic search and data catalog platform, called Lexi. With Lexi, you can query your data assets along any number of dimensions to find data related to a topic of interest. From there, the resultant data can be associated with a knowledge base (vector repository) and a model for embeddings generation. Once generated, these embeddings power View Assistant, our integrated AI-powered interactive experience, which allows you to choose a model, customize chat settings and parameters, and have a conversation with your data.
Summary
UDR is View’s patent-pending data format that creates a homogeneous representation of heterogeneous data from heterogeneous sources. Once generated, UDR is stored in Lexi for semantic search and data set retrieval. Resultant data can be coupled with a language model and vector repository, enabling interactive, conversational experiences with your data.