The inevitability and power of real-time data

Software Engineering
Updated on May 16, 2024

The ability to shorten the data feedback loop and act while data is still relevant is incredibly valuable. To shorten response times to data insights, companies need a system that can alert in seconds or minutes rather than hours. The longer the response time, the greater the damage: more fraudulent transactions slip through, or an outage's impact spreads further. Imagine a bank's alarm system sounding 6 hours after the vault has been breached - not helpful at all. This is why MTTR (mean time to resolve) is a key metric in incident response, and why most organizations have OKRs to reduce it.

Data feedback loop from creation, ingestion, analysis, to implementing changes

Some businesses understand the criticality of having real-time data. They've either adopted a modern data streaming solution or built makeshift solutions in-house. These makeshift solutions often solve for a particular use case but cannot be easily replicated, which makes them neither generalizable nor scalable.

Other businesses don’t leverage real-time data at all, and some claim there is no business need. We would argue there is a need, but these businesses either don’t realize it yet or can’t make it happen because they don’t have mature data strategies. Having any data strategy is difficult and requires significant upfront investment (both monetary and otherwise), which is why 70% of enterprises don’t have one. Without a clear data strategy informing business decisions, companies are effectively acting on pure instinct, which does not scale.

Interestingly, a lack of data is almost never the reason businesses forgo a data strategy. In a 2020 Matillion survey, the average company was drawing from ~400 data sources, and I’m sure that number has only increased since then. So much information just sitting there, unutilized.

So, why aren’t more businesses leveraging real-time data? 

The short answer is that it’s hard. It’s complicated to set up and costly to maintain, in both money and effort. To build truly performant data streaming capabilities, a company needs a team of engineers with expertise in distributed systems, database/warehouse architecture, and streaming infrastructure (like Kafka). On top of that, the solution needs to scale to manage data events coming in from a variety of sources. Let’s dive into these requirements to understand what’s at stake.

Architecture designed for scale

The ideal solution should handle varying levels of scale without requiring major migrations or significant refactors. Volume may vary drastically depending on the data source - for example, one could expect low volume from a transactions table and high volume from an events table.

Some factors to consider:

  1. Can the solution scale to multiple different data sources?
     • How easy is it to add new data sources?
     • How easy is it to manage across all the data sources?
  2. Can the solution scale from under 100 queries per second (QPS) to 1M+ QPS?
     • Is the solution horizontally scalable?
     • Do the workers require coordination, or are they stateless and distributed? (See the sketch below.)
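To make the last two questions concrete, here is a minimal sketch of a stateless worker built on a Kafka consumer group, using the confluent-kafka Python client. The broker address, topic, and group name are assumptions, and write_to_warehouse is a hypothetical stand-in for the destination sink; the point is only that identical, stateless workers can be scaled horizontally by running more copies, with Kafka rebalancing partitions across them.

```python
import json
from confluent_kafka import Consumer


def write_to_warehouse(event: dict) -> None:
    # Hypothetical stand-in for the destination-specific sink (e.g. a warehouse MERGE).
    print("applying", event)


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "group.id": "cdc-workers",              # every replica joins the same group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,            # commit only after a successful write
})
consumer.subscribe(["postgres.public.orders"])  # hypothetical CDC topic

# The worker holds no local state, so scaling out is just running more copies;
# Kafka's consumer-group protocol rebalances partitions across them.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    write_to_warehouse(event)
    consumer.commit(message=msg)  # at-least-once delivery
```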

The problem with out-of-order and missing events

An ideal solution must also guarantee that events are neither dropped nor processed out of order when scaling up the number of partitions. Otherwise, companies end up with inaccurate data at the destination and a false view of the world when performing analysis.

Example showing the issue with processing out-of-order events in CDC-based data replication
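As a toy illustration (not Artie's implementation), consider two CDC updates to the same row: applied in order, the destination reflects the latest value; applied out of order, the older event overwrites the newer one. The table and field names below are hypothetical.

```python
# Toy illustration of why per-key ordering matters in CDC replication.

def apply(destination: dict, event: dict) -> None:
    """Upsert a CDC event into a dict standing in for the destination table."""
    destination[event["id"]] = event["balance"]

events = [
    {"id": 1, "balance": 100, "seq": 1},  # original value
    {"id": 1, "balance": 250, "seq": 2},  # later update
]

in_order, out_of_order = {}, {}
for e in events:
    apply(in_order, e)
for e in reversed(events):
    apply(out_of_order, e)

print(in_order)      # {1: 250} - correct, reflects the latest update
print(out_of_order)  # {1: 100} - stale: the older event overwrote the newer one

# One common safeguard: route events by primary key so every change to a given
# row lands on the same partition, e.g. partition = hash(primary_key) % num_partitions.
```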

The data integration industry has largely conformed to batch processing 

For a number of reasons, including the technologies available at the time (Apache Kafka was only open sourced in 2011), the data integration industry has conformed to batch processing: jobs triggered on a schedule or cron. Popular batch frameworks and orchestrators include Spark, MapReduce, Airflow, and Dagster. Achieving sub-minute data latency is extremely difficult with batch processing, which is why real-time data transfer requires a fundamental architectural shift to stream processing.
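A back-of-the-envelope calculation (with assumed numbers) shows why batch scheduling puts a floor under data latency: a row created right after a run starts isn't picked up until the next run, and then still has to wait for the job to finish.

```python
# Assumed numbers for illustration only.
schedule_interval_s = 15 * 60  # job runs every 15 minutes
job_runtime_s = 4 * 60         # each run takes ~4 minutes

worst_case_latency_s = schedule_interval_s + job_runtime_s
average_latency_s = schedule_interval_s / 2 + job_runtime_s

print(f"worst case: {worst_case_latency_s / 60:.0f} min")  # ~19 min
print(f"average:    {average_latency_s / 60:.1f} min")     # ~11.5 min

# A streaming pipeline processes each change as it is produced, so latency is
# bounded by per-event processing time (typically sub-second), not a schedule.
```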

Building real-time data pipelines is an expensive and complicated project, and organizations need to keep maintaining the system once it's built. As such, it doesn't really make sense for most companies to build this out in-house.

The value of real-time data

What can real-time data pipelines mean for your company? In short, a lot - probably more than you can imagine in the medium to long term. The business case for real-time data streaming varies by industry, so we'll walk through two examples to make the value proposition more concrete.

Anti-fraud / anti-cheat

One of the most obvious use cases of real-time data is fraud detection. It’s easy to understand why real-time is needed - you don’t want to delay identifying fraudulent transactions or behaviors. The faster you can catch bad actors, the better. Anti-fraud is really broad, so let’s dive into the anti-cheat use case for video game companies.  

Anti-cheat software is very important to a game's success. High levels of cheating and hacking degrade the player experience and disrupt the in-game economy. This results in lower player engagement, higher refund rates, lower game ratings, etc., which ultimately means lower revenue and growth. There's a research paper called Video Game Anti-Cheat Software and Its Importance to a Game's Success that goes more in-depth if you're interested.

You’re now thinking: surely video game companies already have anti-cheat processes leveraging real-time data in place. Yes, they do, and it looks something like the diagram below. Simplistically, video game companies scan gaming logs and run them against their anti-cheat model to look for anomalies. This piece is done in real time, typically by integrating directly with the game server database. The anti-cheat model itself is built from various data across gaming databases, server logs, and data warehouses. The more data the model has, the more accurate it is; the more recent the data, the more relevant it is.

Data architecture of an anti-fraud or anti-cheat model

The goal is not to eliminate cheating - that’s near impossible. The goal is twofold:

  • Increase accuracy in bot detection
  • Decrease time to detect bad actors

Having real-time or near real-time data streamed into the anti-cheat model helps with both of these metrics. With access to live data, the company can leverage online machine learning and dynamically adapt to new patterns in the data (incremental learning is another ML method applied to data streams). This means they can identify malicious behaviors such as bug exploits faster and more accurately.
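To give a feel for what "online" means here, below is a minimal sketch (nothing like a production anti-cheat model) that maintains running statistics for a single player metric with Welford's algorithm and flags values far outside what the stream has seen so far. The metric values are made up for illustration.

```python
import math


class OnlineAnomalyScorer:
    """Incrementally tracks mean/variance of a stream; no historical re-scans needed."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> float:
        """Return the z-score of x against the stream so far, then incorporate it."""
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        z = abs(x - self.mean) / std if std > 0 else 0.0
        # Welford's online update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z


scorer = OnlineAnomalyScorer()
# Hypothetical per-match "headshot ratio" values; the last one is suspiciously high.
for value in [0.30, 0.32, 0.29, 0.31, 0.97]:
    z = scorer.update(value)
    if z > 3:
        print(f"flag for review: value={value}, z-score={z:.1f}")
```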

Vehicle marketplace

Another interesting application of real-time data pipelines is in ecommerce. For this example, let's look at used vehicle marketplaces like Carvana or Shift, which buy and sell cars. Near real-time data is important both for determining the price to pay when purchasing used cars and for lowering the number of days a car sits in inventory.

To determine the offer price, these marketplaces rely on pricing algorithms based on many variables. To keep things simple, let’s focus on two major variables:

  • Actual transaction data of new and used car sales from data providers (similar to MLS data for the real estate market)
  • Inventory levels

Variables that feed into a vehicle marketplace's pricing algorithm

Data from these two tables, along with many others, is fed into a data warehouse, where AI algorithms determine the price and quantity to purchase for a specific vehicle make/model. Operations analysts then leverage insights from the algorithm to determine how many Toyota Prii to buy and at what price.

What are the risks associated with having lagged data powering the AI pricing algorithm? 

  • If the algorithm cannot analyze actual used-car transaction data in real time, the company can mis-price (overpay or underpay). In the case of overpaying, the car is likely to be resold at a loss.
  • If the algorithm does not have access to inventory levels in real time, it will not pause car purchases when inventory is “full”. If a company typically sells 10k Toyota Prii per day and they’ve already purchased 10k by noon, the company should stop buying Toyota Prii. But if the data takes 6-12 hours to be updated, the company will likely over-purchase (see the sketch below).
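Here is a toy sketch of the inventory check from that example, with hypothetical numbers and field names: with a fresh inventory snapshot the buyer stops at the daily target, while a snapshot lagging by hours still shows mid-morning counts and lets purchases keep flowing past the cap.

```python
from dataclasses import dataclass

DAILY_SELL_THROUGH = 10_000  # e.g. Toyota Prius units typically sold per day (assumption)


@dataclass
class InventorySnapshot:
    units_purchased_today: int
    lag_hours: float  # how stale the snapshot is


def should_keep_buying(snapshot: InventorySnapshot) -> bool:
    """Stop buying once today's purchases reach the expected daily sell-through."""
    return snapshot.units_purchased_today < DAILY_SELL_THROUGH


# Real-time snapshot at noon: the cap has been hit, so buying stops.
print(should_keep_buying(InventorySnapshot(units_purchased_today=10_000, lag_hours=0.01)))  # False

# A snapshot lagging 6 hours still reflects mid-morning counts, so the same
# decision function keeps approving purchases even though the cap is already hit.
print(should_keep_buying(InventorySnapshot(units_purchased_today=6_500, lag_hours=6)))      # True
```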

Both of these negatively impact contribution margins. With real-time data replication, the company can price more accurately, increase inventory turnover, and increase margins.

Artie Transfer makes real-time data streaming easy

Artie Transfer is a real-time data replication solution for databases and data warehouses. We believe the world is moving towards real-time everything, and we’re excited to be a part of that journey.

Our open source solution is easy to set up by following our documentation. If you would like to leverage real-time data at your company, schedule a demo or email us at hi@artie.so!

    Author
    Jacqueline Cheong
    Co-founder & CEO