Title: | Testbench for Univariate Time Series Cleaning |
---|---|
Description: | A reliable and efficient tool for cleaning univariate time series data. It implements reliable and efficient procedures for automating the process of cleaning univariate time series data. The package provides integration with already developed and deployed tools for missing value imputation and outlier detection. It also provides a way of visualizing large time-series data in different resolutions. |
Authors: | Mayur Shende [aut, cre] , Neeraj Bokde [aut] , Andrés E. Feijóo-Lorenzo [aut] |
Maintainer: | Mayur Shende <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.2 |
Built: | 2024-10-25 03:17:49 UTC |
Source: | https://github.com/mayur1009/cleants |
animate_interval() creates an animated plot using a cleanTS object and an interval.
animate_interval(obj, interval)
obj |
A cleanTS object. |
interval |
A numeric or character, specifying the viewing interval. |
First, the data is split according to the interval argument passed to the function. If it is a numeric value, the cleaned data is split into data frames containing interval observations each. It can also be a string, such as 1 week, 3 months, or 14 days. In this case, the data is split according to the given interval, using the timestamp column. Then an animation is created from the split data with the help of the gganimate package. The animate_interval() function returns a list containing the gganim object used to generate the animation and the number of states in the data. The animation can be generated using the gen.animation() function and saved using the anim_save() function. Each plot in the animation also contains a short summary with statistical information and the number of missing values, outliers, missing timestamps, and duplicate timestamps in the data shown in that frame.
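The two splitting modes described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `split_by_interval()` is a hypothetical helper, the data frame is assumed to have `time` and `value` columns, and string intervals are handed to base R's `cut()` for date-times, which may group differently from the package.

```r
# Hypothetical helper: split a data frame either into chunks of `interval`
# rows (numeric) or into calendar periods such as "1 week" (character).
split_by_interval <- function(df, interval) {
  if (is.numeric(interval)) {
    # Rows 1..interval form state 1, the next `interval` rows state 2, ...
    grp <- ceiling(seq_len(nrow(df)) / interval)
  } else {
    # cut() on date-times accepts specifications like "1 week" or "3 months"
    grp <- cut(df$time, breaks = interval)
  }
  split(df, grp, drop = TRUE)
}

# 28 daily observations starting on a Monday
df <- data.frame(
  time  = seq(as.POSIXct("2024-01-01", tz = "UTC"), by = "day", length.out = 28),
  value = seq_len(28)
)

by_count <- split_by_interval(df, 10)        # chunks of 10, 10 and 8 rows
by_week  <- split_by_interval(df, "1 week")  # one chunk per calendar week
```

Each element of the resulting list corresponds to one state (frame) of the animation.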
A list containing:
animation: A gganim object.
nstates: The number of states in the animation.
## Not run:
# Create a `gganim` using the `animate_interval()` function
a <- animate_interval(cts, "10 year") # cts -> `cleanTS` object created using `cleanTS()`.
## End(Not run)
This function checks and verifies the input data. The package needs a univariate time series as input. This function keeps the first two columns: the first is renamed to time and the second to value. If the optional time and value arguments are provided, they are used to determine the relevant columns in the data.
check_input(df, dt_format, time, value)
df |
A data frame containing the input data. If it contains more than two columns, specify the names of the time and value columns using the time and value arguments. |
dt_format |
Format of timestamps used in the data. It uses lubridate formats. |
time |
The name of the column in the provided data to be used as the time column. |
value |
The name of the column in the provided data to be used as the value (observations) column. |
Data containing 2 columns, time and value. Time column is converted to POSIX object and value to numeric.
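As a rough sketch of the column selection and type conversion described here: the helper name `check_input_sketch()` is hypothetical, and base R's `as.POSIXct()` with a strptime-style format is swapped in for the lubridate format handling used by the package.

```r
# Hypothetical sketch: keep two columns, rename them, and convert types.
check_input_sketch <- function(df, format, time = NULL, value = NULL) {
  # Use the named columns if both are given, otherwise the first two columns
  cols <- if (!is.null(time) && !is.null(value)) c(time, value) else names(df)[1:2]
  out <- df[, cols, drop = FALSE]
  names(out) <- c("time", "value")
  # Assumption: a strptime format such as "%Y-%m-%d", not a lubridate order
  out$time  <- as.POSIXct(as.character(out$time), format = format, tz = "UTC")
  out$value <- as.numeric(out$value)
  out
}

raw <- data.frame(
  id      = 1:3,
  stamp   = c("2024-01-01", "2024-01-02", "2024-01-03"),
  reading = c("1.5", "2", "3.25"),
  stringsAsFactors = FALSE
)
clean <- check_input_sketch(raw, "%Y-%m-%d", time = "stamp", value = "reading")
```

Here the data has three columns, so the time and value column names are passed explicitly.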
cleanTS() is the main function of the package, which creates a cleanTS object. It performs all the different data cleaning tasks, such as converting the timestamps to the proper format, imputing missing values, and handling outliers. It is a wrapper function that calls all the other internal functions to perform the different data cleaning tasks.
cleanTS( data, date_format, imp_methods = c("na_interpolation", "na_locf", "na_ma", "na_kalman"), time = NULL, value = NULL, replace_outliers = TRUE )
data |
A data frame containing the input data. By default, the first column is assumed to contain the timestamps and the second column the observations. If that is not the case, or if the data contains more than two columns, specify the names of the time and value columns using the time and value arguments. |
date_format |
Format of timestamps used in the data. It uses lubridate formats. More than one format can be given using a vector of strings. |
imp_methods |
The imputation methods to be used. |
time |
Optional, the name of the column in the provided data to be used as the time column. |
value |
Optional, the name of the column in the provided data to be used as the value column. |
replace_outliers |
Boolean, if TRUE the outliers are replaced by NA and imputed. Defaults to TRUE. |
The first task is to check the input time series data for structural and data type-related errors. Since the functions need univariate time series data, the input data is checked for the number of columns. By default, the first column is considered to be the time column, and the second column to be the observations. Alternatively, if the time and value arguments are given, then those columns are used. The time column is converted to a POSIX object. The value column is converted to a numeric type. The column names are also changed to time and value. All the data is converted to a data.table object.
This data is then passed to other functions to check for missing and duplicate timestamps. If duplicate timestamps are found, then the observation values are checked. If the observations are the same, then only one copy of that observation is kept. But if the observations are different, then it is not possible to find the correct one, so the observation is set to NA.
This data is then passed to a function for finding and handling missing observations. The methods given in the imp_methods argument are compared and selected. The MCAR and MAR values are handled separately. After the best methods are found, imputation is performed using those methods. The user can also pass user-defined functions for comparison. A user-defined function should follow the same structure as the default functions: it should take a numeric vector containing missing values as input, and return a numeric vector of the same length without missing values as output.
Once the missing values are handled, the data is checked for outliers. If the replace_outliers parameter is set to TRUE in the cleanTS() function, then the outliers are replaced by NA and imputed using the procedure mentioned for imputing missing values.
Then it creates a cleanTS object which contains the cleaned data, missing timestamps, duplicate timestamps, imputation methods, MCAR imputation error, MAR imputation error, outliers, and if the outliers are replaced then imputation errors for those imputations are also included. The cleanTS object is returned by the function.
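The contract for a user-defined imputation function (numeric vector with missing values in, numeric vector of the same length without missing values out) can be illustrated with a minimal example. `na_mean_fill()` is a hypothetical name and mean filling is chosen only for brevity. How such a function is supplied through imp_methods is not shown in this manual, so the call to cleanTS() is left out.

```r
# Hypothetical user-defined imputation function: replace every NA with the
# mean of the observed values. Input and output have the same length.
na_mean_fill <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

x <- c(1, NA, 3, NA, 5)
filled <- na_mean_fill(x)

# Check the contract described above
length(filled) == length(x)  # same length as the input
!anyNA(filled)               # no missing values remain
```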
A cleanTS object which contains:
Cleaned data
Missing timestamps
Duplicate timestamps
Imputation errors
Outliers
Outlier imputation errors
## Not run:
# Convert sunspot.month to dataframe
data <- timetk::tk_tbl(sunspot.month)
print(data)
# Randomly insert missing values to simulate missing value imputation
set.seed(10)
ind <- sample(nrow(data), 100)
data$value[ind] <- NA
# Perform cleaning
cts <- cleanTS(data, date_format = "my", time = "index", value = "value")
print(cts)
## End(Not run)
This function detects outliers/anomalies in the data. If the replace_outlier argument is set to TRUE, then the outliers are removed and imputed using the provided imputation methods.
detect_outliers(dt, replace_outlier, imp_methods)
dt |
A data.table. |
replace_outlier |
Boolean, defaults to |
imp_methods |
The imputation methods to be used. |
The outliers found in the data. If the outliers are replaced, then the imputation errors are also returned.
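This manual does not state which detection method is used, so the following is only a stand-in: it flags outliers with the boxplot (1.5 × IQR) rule from base R's `boxplot.stats()` and then sets them to NA, which is the state in which they would be handed to the imputation step. The helper name `flag_outliers()` is hypothetical.

```r
# Stand-in detector (NOT necessarily the package's method): indices of values
# outside the boxplot whiskers, i.e. beyond 1.5 * IQR from the hinges.
flag_outliers <- function(x) {
  which(x %in% boxplot.stats(x)$out)
}

x <- c(10, 11, 9, 10, 12, 100, 11, 10)
idx <- flag_outliers(x)

# Replace the flagged observations by NA so they can be imputed afterwards
x_marked <- x
x_marked[idx] <- NA
```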
This function finds and removes the duplicate timestamps in the time column of the data.
duplicate_timestamps(dt)
dt |
Input data |
A list containing the data.table without duplicate timestamps, and the duplicate timestamps.
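The rule described for cleanTS() (identical duplicate observations collapse to one copy, conflicting ones become NA) could look roughly like this in base R. `resolve_duplicates()` is a hypothetical helper working on a plain data frame with `time` and `value` columns, not the package's data.table code.

```r
# Hypothetical sketch of the duplicate-timestamp rule.
resolve_duplicates <- function(df) {
  key <- as.numeric(df$time)
  is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  # Per timestamp: keep the value if all copies agree, otherwise NA
  collapsed <- ave(df$value, key, FUN = function(v) {
    if (length(unique(v)) == 1) v[1] else NA_real_
  })
  out <- df
  out$value <- collapsed
  out <- out[!duplicated(key), , drop = FALSE]
  list(data = out, duplicates = unique(df$time[is_dup]))
}

t0 <- as.POSIXct("2024-01-01", tz = "UTC")
df <- data.frame(
  time  = t0 + 3600 * c(0, 1, 1, 2, 2),
  value = c(1, 2, 2, 3, 4)
)
res <- resolve_duplicates(df)
# Hour 1 appears twice with the same value and is kept once;
# hour 2 appears with conflicting values and is set to NA.
```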
Helper function to find the time difference between two given timestamps.
find_dif(time1, time2)
time1 |
POSIXt or Date object. |
time2 |
POSIXt or Date object. |
String specifying the time interval between time1 and time2. It contains an integer and the unit, e.g., 5 weeks, 6 months, 14 hours, etc.
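A string of this shape can be approximated with base R's `difftime()`. The helper `describe_gap()` below is hypothetical and only covers the units that `difftime()` selects automatically (secs, mins, hours, days); how the package derives weeks or months is not described here.

```r
# Hypothetical sketch: express the gap between two timestamps as
# "<integer> <unit>" using difftime()'s automatic unit selection.
describe_gap <- function(time1, time2) {
  d <- difftime(time2, time1, units = "auto")
  paste(round(as.numeric(d)), units(d))
}

gap <- describe_gap(
  as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
  as.POSIXct("2024-01-01 14:00:00", tz = "UTC")
)
gap  # "14 hours"
```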
This function takes the list returned by animate_interval() and generates a GIF animation. It is a simple wrapper around the gganimate::animate() function with some defaults. The generated GIF can be saved using the anim_save() function. By default, the animate() function shows only 50 states of the data. To avoid this, gen.animation() sets the default number of frames from the number of states. The duration argument also defaults to the number of states, making the animation slower. Additional arguments such as height, width, fps, and renderer can be supplied and are passed on to animate().
gen.animation(anim, nframes = 2 * anim$nstates, duration = anim$nstate, ...)
anim |
List outputted by the animate_interval() function. |
nframes |
Number of frames. Defaults to double the number of states in the animation. |
duration |
The duration of animation. Defaults to the number of states in the animation. |
... |
Extra arguments passed to gganimate::animate(). |
Does not return any value.
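The way the defaults in the usage line are derived from the input list can be seen without rendering anything. In the sketch below, `anim` is a hypothetical stand-in for the list returned by animate_interval(), and `resolve_args()` is an illustrative function that only mirrors the argument defaults. It also shows that `anim$nstate` resolves to the nstates element, since `$` on a list allows partial name matching.

```r
# Hypothetical stand-in for the list returned by animate_interval()
anim <- list(animation = NULL, nstates = 12)

# Mirrors the defaults in the usage line; returns the resolved arguments
# instead of calling gganimate::animate().
resolve_args <- function(anim, nframes = 2 * anim$nstates,
                         duration = anim$nstate, ...) {
  c(list(nframes = nframes, duration = duration), list(...))
}

args <- resolve_args(anim, height = 700, width = 900)
str(args)
```

With 12 states, the defaults resolve to 24 frames and a duration of 12, and height and width are carried along unchanged.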
## Not run:
a <- animate_interval(cts, "10 year")
# Generate animation using `gen.animation()`
if (interactive()) {
  gen.animation(a, height = 700, width = 900)
}
# Save animation using `anim_save()`
anim_save("filename.gif")
## End(Not run)
gen.report() generates a report of the entire process, the changes made to the original data, and details about the impurities found in the data.
gen.report(obj)
obj |
A cleanTS object. |
Does not return any value.
## Not run:
# Perform cleaning
cts <- cleanTS(data, date_format = "my", time = "index", value = "value")
gen.report(cts)
## End(Not run)
This function handles missing values in the data. It compares various imputation methods and finds the best one for imputation.
impute(dt, methods)
dt |
A data.table. |
methods |
The imputation methods to be used. |
A data.table with missing data imputed, and the imputation errors.
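One common way to compare imputation methods is to hide some known observations, impute them, and score each method by its error on the hidden points. The sketch below follows that idea with RMSE; the comparison procedure, the error measure, and the two candidate functions are assumptions for illustration and not taken from the package.

```r
# Two simple stand-in imputers (not the package's default methods)
fill_linear <- function(x) {
  idx <- seq_along(x)
  approx(idx[!is.na(x)], x[!is.na(x)], xout = idx, rule = 2)$y
}
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hide `n_mask` known values, impute, and return the RMSE of each method
compare_imputers <- function(x, methods, n_mask = 10, seed = 1) {
  set.seed(seed)
  masked <- sample(which(!is.na(x)), n_mask)
  x_masked <- x
  x_masked[masked] <- NA
  vapply(methods, function(f) {
    sqrt(mean((f(x_masked)[masked] - x[masked])^2))
  }, numeric(1))
}

x <- sin(seq(0, 4 * pi, length.out = 200))
errs <- compare_imputers(x, list(linear = fill_linear, mean = fill_mean))
best <- names(which.min(errs))  # method with the smallest error
```

The method with the lowest error would then be applied to the actual missing values.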
The interactive plot is similar to the animated plot, but gives the user some control over the animation. It runs a Shiny app instead of creating a GIF.
interact_plot(obj, interval)
obj |
A cleanTS object. |
interval |
A numeric or character, specifying the viewing interval. |
The problem with an animated plot is that the user does not have any control over the animation. There is no play or pause functionality that would let the user observe a desired frame. This can be achieved by adding interactivity to the plot. The interact_plot() function creates and runs a shiny widget locally on the machine. It takes the cleanTS object and splits the cleaned data according to the interval argument, similar to the animate_interval() function. It then creates a shiny widget which shows the plot for the current state and provides a slider to change the state. Unlike animate_interval(), it provides a global report containing information about the complete data, and a state report giving information about the current state shown in the plot.
Does not return any value.
## Not run:
if (interactive()) {
  # Using the same data used in `cleanTS()` function example.
  interact_plot(cts, interval = "1 week")
}
## End(Not run)
mergecsv() takes a folder containing CSV files and merges them into a single data.table. It is assumed that the first column of all the CSVs contains the timestamps.
mergecsv(path, formats)
path |
Path to the folder. |
formats |
Datetime formats. |
All these files are read and the first column is parsed to a proper DateTime object using the formats given in the formats argument. Then these data frames are merged using the timestamp column as the common column. The merged data frame returned by the function contains the timestamps as its first column.
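The read, parse, and merge steps can be sketched in base R as follows. This is an illustration only: `merge_csv_folder()` is a hypothetical helper, a single strptime-style format replaces the formats argument, and a full outer join is assumed because the join type is not stated above.

```r
# Hypothetical sketch: read every CSV in a folder, parse the first column as
# a timestamp, and merge all tables on that column.
merge_csv_folder <- function(path, format = "%Y-%m-%d") {
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    d <- read.csv(f, stringsAsFactors = FALSE)
    names(d)[1] <- "time"
    d$time <- as.POSIXct(d$time, format = format, tz = "UTC")
    d
  })
  # Assumption: keep timestamps that appear in any file (outer join)
  Reduce(function(a, b) merge(a, b, by = "time", all = TRUE), tables)
}

# Small demonstration with two temporary CSV files
demo_dir <- tempfile("mergecsv_demo")
dir.create(demo_dir)
write.csv(data.frame(date = c("2024-01-01", "2024-01-02"), a = 1:2),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(date = c("2024-01-02", "2024-01-03"), b = 3:4),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

merged <- merge_csv_folder(demo_dir)
```

The result has one row per distinct timestamp, with NA where a file has no observation for that timestamp.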
Merged data.table.
This function finds and inserts the missing timestamps in the time column of the data. The observations for the inserted timestamps are filled with NA.
missing_timestamps(dt)
dt |
Input data |
A list of data.table with inserted missing timestamps and the missing timestamps.
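A minimal base R sketch of inserting missing timestamps is shown below. It assumes a regular series whose spacing can be taken as the most frequent gap between consecutive timestamps; `insert_missing()` is a hypothetical helper, and the package may determine the spacing differently.

```r
# Hypothetical sketch: build the complete timestamp grid and left-join the
# observed data onto it, so inserted timestamps get NA observations.
insert_missing <- function(df) {
  # Assumption: the most frequent gap (in seconds) is the series spacing
  step <- as.numeric(names(which.max(table(diff(as.numeric(df$time))))))
  full <- data.frame(time = seq(min(df$time), max(df$time), by = step))
  out <- merge(full, df, by = "time", all.x = TRUE)
  missing <- full$time[!(as.numeric(full$time) %in% as.numeric(df$time))]
  list(data = out, missing = missing)
}

# Hourly series with two timestamps removed
t0 <- as.POSIXct("2024-01-01", tz = "UTC")
complete <- data.frame(time = t0 + 3600 * 0:9, value = 1:10)
res <- insert_missing(complete[-c(3, 7), ])
```

The returned list mirrors the description above: the data with the inserted timestamps, and the timestamps that were missing.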
Print method for the cleanTS class.
## S3 method for class 'cleanTS'
print(x, ...)
x |
cleanTS object |
... |
Other arguments |
Does not return any value.
## Not run:
# Using the same data as in `cleanTS()` function example.
cts <- cleanTS(data, "my")
print(cts)
## End(Not run)