I have a table in Apache Kudu with millions of rows. Each row captures a value for a measurement type for a device id. Assume the same device can output multiple measurement types. Each set of measurements is captured every minute.
For each value of a measurement type for a given device id, I would like to compute the mean and standard deviation of the corresponding values over the last 15, 30, and 60 minutes.
If I were to accomplish this in a traditional procedural way, I would select a batch of rows for which the computation has not yet been performed and loop over those records. For each record I would retrieve the previous 15 minutes worth of data for the corresponding device id and measurement type, compute the statistics, and update that row.
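In plain Python terms (record layout and names here are made up for illustration), the procedural approach looks roughly like this: one trailing-window scan per row.

```python
# Sketch of the row-at-a-time approach: for each record, re-fetch the
# trailing 15 minutes for that (device_id, measurement_type) pair.
# Record shape (device_id, measurement_type, epoch_seconds, value) is a
# stand-in for the actual Kudu rows.
from statistics import mean, stdev

rows = [
    ("dev1", "temp", 0, 10.0),
    ("dev1", "temp", 60, 12.0),
    ("dev1", "temp", 120, 11.0),
]

def trailing_stats(rows, device_id, mtype, ts, window_s=15 * 60):
    """Mean/std of values for (device_id, mtype) in (ts - window_s, ts]."""
    vals = [v for d, m, t, v in rows
            if d == device_id and m == mtype and ts - window_s < t <= ts]
    return mean(vals), (stdev(vals) if len(vals) > 1 else None)

# One lookup per row: this is the O(rows x window) pattern in question.
stats = {t: trailing_stats(rows, "dev1", "temp", t) for _, _, t, _ in rows}
```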
However, this is terribly inefficient.
Not being well versed in Apache Spark, I am looking for guidance and tips on how I might accomplish this using set-based operations.
An “algorithm” I came up with: find the earliest row (via min) that has a null statistic value, then retrieve the rows from 15 minutes before to 15 minutes after it (a 30-minute window), and for each row with a null statistic compute it as a moving window. This does not seem clean.
Any recommendations or clues are highly appreciated.