A speaker at the recent Data Eng Bytes conference demonstrated the incredible speed at which a data platform can import one billion records. And it was amazing. The speaker imported one billion sensor data records in just a few seconds. Incredible. Having imported similar quantities of IoT sensor data into a relational database, we can say from experience that this is lightning fast.
This is not a criticism of the conference but rather an example of a very common data problem.
The speaker demonstrated a data import of IoT sensor data. From the demo, as with many demos we have observed, it was obvious that the speaker had no understanding of the data she was importing, why the data should be imported, or, most importantly, what to do with the data once it was imported. We have significant experience with sensor data, and here’s what actually has to happen with it.
Speaker’s perception
- Import one billion records into a database.
Reality
- Import one billion records into a database.
- Add a lot of indexes.
- Connect it to an operational database, or at least to operational data that changes frequently and affects the results of the validation.
- Connect it to dimensional data.
- If it’s customer data, connect to the owner of the data.
- Validate the data.
- Test each record for spikes and false positives.
- Assess records in batches for sensor drift.
- Apply a moving average.
- Apply a date series dimension.
- Apply a time series dimension.
- Combine this data with other sensor data that records a different dimension of the world, e.g. weather data.
- Apply min/max/incremental values and compare.
- Apply some ML or AI to find exceptions and trends.
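Two of the steps above, testing records for spikes and applying a moving average, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the median-of-neighbours spike rule, the window sizes, the threshold and the sample readings are all hypothetical, and a real pipeline would tune them per sensor.

```python
from statistics import median


def flag_spikes(values, window=3, threshold=10.0):
    """Flag a reading as a spike when it differs from the median of its
    neighbours (up to `window` readings on each side) by more than `threshold`."""
    flags = []
    for i, v in enumerate(values):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbours = values[lo:i] + values[i + 1:hi]
        flags.append(bool(neighbours) and abs(v - median(neighbours)) > threshold)
    return flags


def moving_average(values, window=3):
    """Trailing moving average; early positions average the values seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out


# Hypothetical temperature readings with one obvious spike.
readings = [20.0, 20.5, 21.0, 95.0, 21.5, 22.0]
spikes = flag_spikes(readings, window=3, threshold=10.0)
cleaned = [v for v, is_spike in zip(readings, spikes) if not is_spike]
smoothed = moving_average(cleaned, window=3)
```

Even this toy version has to look at each record in the context of its neighbours, which is exactly the kind of work that starts only after the import has finished.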
Whether the import process takes 2 seconds or 20 minutes is less relevant than how easy it is to work with the data once it’s been imported.
All of this requires updates, loops, CTEs, selects and inserts, which in practice means very complex SQL or Python.
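As a rough sketch of what that post-import work looks like, the example below uses Python's built-in `sqlite3` module to join readings to dimensional data inside a CTE and then update every reading with a validation flag. The `sensor` and `reading` tables, their columns and the valid ranges are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension table: one row per sensor with its valid range.
    CREATE TABLE sensor (
        sensor_id INTEGER PRIMARY KEY,
        min_valid REAL,
        max_valid REAL
    );
    -- Hypothetical fact table holding the imported readings.
    CREATE TABLE reading (
        reading_id INTEGER PRIMARY KEY,
        sensor_id  INTEGER REFERENCES sensor(sensor_id),
        value      REAL,
        is_valid   INTEGER
    );
    CREATE INDEX idx_reading_sensor ON reading(sensor_id);
""")
conn.execute("INSERT INTO sensor VALUES (1, -40.0, 60.0)")
conn.executemany(
    "INSERT INTO reading (sensor_id, value) VALUES (?, ?)",
    [(1, 21.0), (1, 95.0), (1, 22.5)],
)

# The CTE finds readings outside the sensor's valid range;
# the UPDATE then touches every imported row to record the result.
conn.execute("""
    WITH out_of_range AS (
        SELECT r.reading_id
        FROM reading AS r
        JOIN sensor AS s ON s.sensor_id = r.sensor_id
        WHERE r.value NOT BETWEEN s.min_valid AND s.max_valid
    )
    UPDATE reading
    SET is_valid = CASE
        WHEN reading_id IN (SELECT reading_id FROM out_of_range) THEN 0
        ELSE 1
    END
""")
conn.commit()

validated = conn.execute(
    "SELECT value, is_valid FROM reading ORDER BY reading_id"
).fetchall()
```

Note that the update rewrites every row it validates, so on a billion records the cost of this step depends on how well the platform handles updates, not on how fast it imported the data.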
The problem with these demos is that they do not apply to real-world scenarios. It’s not until you’re well into the project that you find that these tools are super quick on the write but a complete dog on the update, or that you need other database tools to complete the work. And then that initial few seconds becomes pretty meaningless.
In the world of data, oversimplifying complexity often creates additional complexity.


