1. Read the monthly input file
2. Mark the data with the current month and current year
3. Write a timestamp to identify the load date (load_date)
4. Read the history file
5. Concatenate the two data sets
6. Overwrite the history file
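For readers who prefer code to node descriptions, the following plain-Java sketch mirrors the six steps. The file names, the one-row-per-line text layout, and the column order are illustrative assumptions, not part of the actual KNIME workflow:

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.*;

public class MonthlyLoad {
    public static void main(String[] args) throws IOException {
        Path monthlyFile = Paths.get("monthly.csv");  // hypothetical input file
        Path historyFile = Paths.get("history.csv");  // hypothetical history file

        // 1. Read the monthly input file
        List<String> monthlyRows = Files.readAllLines(monthlyFile);

        // 2.-3. Mark each row with the current month/year and a load_date timestamp
        LocalDate today = LocalDate.now();
        String stamp = today.getMonthValue() + "," + today.getYear() + "," + LocalDateTime.now();
        List<String> stamped = new ArrayList<>();
        for (String row : monthlyRows) {
            stamped.add(row + "," + stamp);
        }

        // 4. Read the history file (it may not exist yet on the very first load)
        List<String> history = Files.exists(historyFile)
                ? new ArrayList<>(Files.readAllLines(historyFile))
                : new ArrayList<>();

        // 5. Concatenate the two data sets
        history.addAll(stamped);

        // 6. Overwrite the history file with the concatenated data
        Files.write(historyFile, history);
    }
}
```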
We need to concatenate the two data sets and then overwrite the history file, rather than just appending the new data to the history file, because the "append" option in the I/O writer node might not respect the column order.
How can I prevent the user from uploading the same monthly data twice?
I do not want the data for February to appear twice in my history file. It would affect all my data mining evaluations: even the simple mean of a value over the last 12 months would be wrong. So, again, how can I prevent the user from uploading the same monthly data twice?
My first idea was to use a Java Snippet node. However, I was not able to find a way to make the Java Snippet node exit with an error if the pair (month, year) was already present in the history file. I had to find another way to provoke a controlled stop of the workflow when data with the pair (month, year) were already present in the history file. The following figure shows my solution.
If two records have the same pair (month, year), the RowID operation fails with this error message: "ERROR RowID Execute failed: Error in row 3: Duplicate value: Feb_2009 already exists. Check the 'Ensure uniqueness' option to handle duplicates.". This is exactly the behavior we want.
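Seen as code, the uniqueness test performed by the RowID node amounts to a set insertion that fails on a duplicate key. A minimal Java sketch of that behavior (the "Feb_2009" key format follows the error message above; the exception type and the sample data are illustrative assumptions):

```java
import java.util.*;

public class DuplicateMonthCheck {
    // Mimics the RowID node with "Ensure uniqueness" switched off:
    // throws as soon as a (month, year) key occurs a second time.
    static void failOnDuplicates(List<String> monthYearKeys) {
        Set<String> seen = new HashSet<>();
        int rowNumber = 0;
        for (String key : monthYearKeys) {
            rowNumber++;
            if (!seen.add(key)) {
                throw new IllegalStateException("Error in row " + rowNumber
                        + ": Duplicate value: " + key + " already exists.");
            }
        }
    }

    public static void main(String[] args) {
        // "Feb_2009" appears twice: the February data was loaded a second time,
        // so this call throws and stops the program, as the workflow should.
        failOnDuplicates(List.of("Jan_2009", "Feb_2009", "Feb_2009"));
    }
}
```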
The figure gives more details about the implementation of this strategy; a code sketch of the equivalent logic follows the list.
1. GroupBy (month, year) produces only one record for each (month, year) pair in both the history file and the monthly file
2. A new column is produced containing the combined string "month"_"year" (e.g. Feb_2009)
3. The two data sets (history and monthly) are concatenated
4. The combined string described at point 2 is used as the RowID without "Ensure uniqueness". This produces an error whenever two rows with the same combined string are present in the data set
5. If the workflow does not stop because of errors, the history file and the current records are concatenated
6. The aggregated records coming from the two GroupBy nodes are removed by means of a row filter
7. The column containing the combined string is also removed. If no errors were found, we are now back to the history data set in its original format, with the new monthly data appended
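Putting the seven points together, here is a self-contained Java sketch of the whole guard. The Row layout, the sample values, and the exception type are assumptions made for illustration; the actual implementation operates on KNIME tables and nodes:

```java
import java.util.*;
import java.util.stream.*;

public class LoadGuard {
    // Minimal stand-in for a table row: (month, year, value); the real
    // workflow of course carries many more columns.
    record Row(String month, int year, double value) {}

    // GroupBy (month, year): one combined "month"_"year" key per pair
    static Set<String> monthYearKeys(List<Row> rows) {
        return rows.stream()
                   .map(r -> r.month() + "_" + r.year())
                   .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Row> history = List.of(
                new Row("Jan", 2009, 10.0),
                new Row("Feb", 2009, 12.5));
        // Change "Mar" to "Feb" to see the controlled abort fire.
        List<Row> monthly = List.of(new Row("Mar", 2009, 11.0));

        // Concatenate the keys coming from the two GroupBy results
        List<String> keys = new ArrayList<>();
        keys.addAll(monthYearKeys(history));
        keys.addAll(monthYearKeys(monthly));

        // RowID step: every key must be unique, otherwise stop the load
        Set<String> seen = new HashSet<>();
        for (String key : keys) {
            if (!seen.add(key)) {
                throw new IllegalStateException(
                        "Duplicate value: " + key + " already exists; aborting load.");
            }
        }

        // No duplicates found: append the monthly rows and discard the
        // helper keys, leaving the history in its original format.
        List<Row> updated = new ArrayList<>(history);
        updated.addAll(monthly);
        System.out.println("History now has " + updated.size() + " rows.");
    }
}
```

Comparing grouped keys rather than full rows keeps the check cheap: one representative row per (month, year) pair is enough to detect a repeated load.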