The Four C’s of Data Quality

It may sound simple, but the better the source data, the better the resulting predictions or recommendations will be. Think of data as the rocket fuel for your AI journey. Here are four dimensions that can be used to evaluate source data:

Coverage:

  • Coverage is the total amount of source data that we have available for our wells. The source data required for AI solutions is not just the sensor stream.
  • Coverage also describes the number of data sources that we have available for a well. Many of the models used in advanced systems, such as OspreyData, combine sensor-based machine learning models with physics-based models. Building a holistic view of the well design, complete with maintenance history and failure reports, is critical to successful model development. (A minimal coverage check is sketched below.)
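
As a minimal sketch of what such a coverage check might look like, the Python snippet below counts which of a set of expected data sources are present for each well. The source names and the `wells` mapping are hypothetical placeholders for illustration, not OspreyData's actual schema.

```python
# Hypothetical set of source-data categories a model pipeline might expect.
EXPECTED_SOURCES = {
    "sensor_stream",        # time-series telemetry
    "well_design",          # completion / design records
    "maintenance_history",
    "failure_reports",
}

def coverage_report(wells: dict[str, set[str]]) -> dict[str, float]:
    """Return the fraction of expected data sources present for each well."""
    return {
        well: len(sources & EXPECTED_SOURCES) / len(EXPECTED_SOURCES)
        for well, sources in wells.items()
    }

# Example: one well with full coverage, one missing failure reports.
wells = {
    "WELL-001": {"sensor_stream", "well_design",
                 "maintenance_history", "failure_reports"},
    "WELL-002": {"sensor_stream", "well_design", "maintenance_history"},
}
print(coverage_report(wells))  # {'WELL-001': 1.0, 'WELL-002': 0.75}
```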

Continuity:

  • Think of continuity as how much source data is available without gaps or lapses. This is very important when reviewing sensor streams or the pool of dynacards available.
  • The first graph shows a set of sensors, in this case for a rod pump well. Visually, each of these sensors is complete for the time range shown, roughly six months of signal. There are a couple of brief lapses, but very few; overall, this well has very good data continuity.
  • By contrast, the second graph shows a different example. Visually, this well has a number of lapses, each marked with a yellow highlight; in fact, there may be more lapse than signal. Because we could not see the signal stream preceding each failure, the machine learning models had to exclude those failures from training. (A simple gap-detection sketch follows this list.)
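
A gap check like the one described above can be sketched in a few lines of pandas. This is an illustrative example, not OspreyData's implementation; the 15-minute lapse threshold and the synthetic signal are assumptions chosen for the demo.

```python
import pandas as pd

def find_lapses(ts: pd.Series, max_gap: pd.Timedelta) -> pd.DataFrame:
    """Return the start, end, and duration of every gap longer than max_gap."""
    times = ts.dropna().index.to_series().sort_values()
    deltas = times.diff()                 # spacing between consecutive readings
    gaps = deltas[deltas > max_gap]       # only the spacings that count as lapses
    return pd.DataFrame({
        "gap_start": times.shift(1)[gaps.index],  # last reading before the lapse
        "gap_end": gaps.index,                    # first reading after the lapse
        "duration": gaps.values,
    }).reset_index(drop=True)

# Example: a one-minute stream with a roughly three-hour lapse in the middle.
idx = pd.date_range("2021-01-01 00:00", periods=60, freq="min")
idx = idx.append(pd.date_range("2021-01-01 04:00", periods=60, freq="min"))
signal = pd.Series(1.0, index=idx)
print(find_lapses(signal, max_gap=pd.Timedelta("15min")))
```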

Consistency:

  • Consistency is the frequency of updates, or new values, in a time-series data stream. The frequency of updates must be high enough to capture the indication of the failure we are attempting to predict.
  • The data in the first graph looks reasonable. Over a five-day period, there is some variability in the signals, mostly in the tubing pressure. But what if the signal were only being collected on an hourly basis? There might be changes that are masked by that collection frequency.
  • The second graph shows the signal data adjusted to a five-minute interval. It becomes clear that there is a tremendous amount of variability in this signal. In this case, the issue was a problem with back-pressure regulation that was causing foaming in the tubing. The signal looks completely different based on a simple adjustment of the frequency of updates. (The sketch below reproduces this masking effect on synthetic data.)
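
The masking effect described above is easy to reproduce. The sketch below builds a synthetic "tubing pressure" signal with a fast oscillation and shows that keeping only one reading per hour hides nearly all of its variability; the signal shape and the numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=12 * 24, freq="5min")  # one day at 5 min
# A fast oscillation (standing in for back-pressure instability) on a flat baseline.
tubing_pressure = pd.Series(
    100 + 10 * np.sin(np.arange(len(idx)) * 2 * np.pi / 6)  # 30-minute cycle
        + rng.normal(0, 1, len(idx)),                        # sensor noise
    index=idx,
)

# At the native 5-minute cadence, the oscillation dominates the spread...
print(f"5-minute std: {tubing_pressure.std():.2f}")
# ...but keeping one reading per hour hides the within-hour swings entirely.
hourly = tubing_pressure.resample("1h").first()
print(f"hourly std:   {hourly.std():.2f}")
```

The hourly standard deviation collapses to roughly the noise level, so the instability would be invisible at that collection frequency.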

Connectedness:

  • Connectedness indicates the ability to trace a thread of connections for a well across all of the source data.
  • When we can determine how the well design ties to the signal stream, and how that ties to the asset management systems, we have reached a degree of connectedness. In the first example, almost every source of data for a well has a slightly modified version of the well’s name. While the similarities may be fairly easy to see here, when the project started, a great deal of confusion occurred as the various teams had to search for “other” names in their systems.
  • The second example is no less problematic, though more subtle. It shows how different teams can refer to the same physical sensor by different names. This is more challenging because, even though it is the same sensor signal, changing the name also changes the context, and getting teams to understand all the differences can be quite confusing. (A minimal name-normalization sketch follows this list.)
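
One common remedy is to normalize every name variant to a single canonical key before joining data sources. The sketch below is a minimal illustration of that idea; the normalization rules and the example well names are assumptions, not the actual names from the project described above.

```python
import re

def canonical_key(name: str) -> str:
    """Collapse case, punctuation, spacing, and leading zeros so variants match."""
    key = name.upper()
    key = re.sub(r"[^A-Z0-9]+", "", key)          # drop spaces, dashes, '#', '.'
    key = re.sub(r"(?<=[A-Z])0+(?=\d)", "", key)  # SMITH007 -> SMITH7
    return key

# Hypothetical variants of one well name across different source systems.
variants = ["Smith #7", "SMITH-07", "smith 7", "Smith_7"]
print({v: canonical_key(v) for v in variants})
# All four variants map to the same key: 'SMITH7'
```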

These are the “Four C’s” of data quality, and these few examples show the various dimensions by which we may evaluate the quality of source data.

Please contact us with any questions you may have about source data quality. We would love to discuss our solutions for maximizing your oil and gas ROI.
