Re:analysis – Tax audit data mining

In the last issue of the newsletter, we went mushroom hunting. In this issue we will do something completely different: hunt for unpaid taxes. This is a case study describing a tax auditor’s analysis of income tax returns, carried out to see whether there is any pattern in why some of them are wrong.

Contents: The data set, Analysis, Lessons learned, Downloads, Further reading

The data set

This is a real data set originating from the Swedish tax authority (Skatteverket). It concerns income tax returns for persons who, during the year, sold mutual funds managed in a foreign country. To preserve the privacy of both the taxpayers and the asset management companies, all personally identifiable information has been removed, and the names of the companies and the funds have been changed.

Altogether, the data set consists of 3119 income tax returns, represented as a table with the following columns:

| Column | Type | RDS setting | Description |
| --- | --- | --- | --- |
| No | Key | Id | ID number. |
| Unreported amount | Numeric | ignore | The amount the taxpayer has withheld from taxation. If the taxpayer has reported correctly, the amount is zero. If the amount is negative, the taxpayer should have received money back from the tax authorities. We ignore this column, since its distribution is very unfavourable for modelling and the primary target is to find the incorrect cases. |
| Correct/Incorrect | Nominal | class | Has the value ‘Correct’ if the income tax return is correct, and ‘Incorrect’ otherwise. This is what we want to be able to predict. |
| Age | Numeric | numeric | The age of the taxpayer. |
| Income from business | Nominal | nominal | ‘Yes’ if the taxpayer has income from a private business, ‘No’ otherwise. |
| Taxable fortune | Nominal | nominal | In Sweden, there is a tax on personal net assets of more than 1.5 MSEK. This column has the value ‘Yes’ if the person has reported this. |
| Managed by Nowhere A.M. | Nominal | nominal | ‘Yes’ if the fund that the taxpayer sold was managed by the company ‘Nowhere Asset Management’. This may be of interest, since this company has given the taxpayer better information about the duty to report than other companies have. |
| Sex | Nominal | nominal | The sex of the taxpayer. |
| Type of form | Nominal | nominal | There are two types of income tax return forms: ‘Simplified’ is a form where the tax authorities have filled in all numbers reported for the person by employers, banks, etc., and ‘Complete’ is a form where nothing is filled in. |
| Income from employment | Numeric | numeric | Income from employment. |
| Zip code | Nominal/Numeric | numeric | The postal code of the taxpayer. It could be treated as a categoric variable, but since the codes are ordered so that numbers close to each other correspond to places in the same area, it makes more sense to treat it as numeric. |
| Fund company | Nominal | categoric | The name of the company that managed the fund (the names have been changed). |
| Fund name | Nominal | categoric | The name of the fund that was sold (the names have been changed). |
| Reported by | Nominal | categoric | The name of the management firm that reported the sale to the tax authority (the names have been changed). |
| Profit/Loss | Numeric | numeric | The realised profit or loss resulting from the sale, in Swedish krona (SEK). Positive numbers are profits and negative numbers are losses. |

Analysis

Let’s look at what we can deduce about the income tax returns using decision trees, and whether we can reasonably predict from the values on the income tax return form whether the taxpayer is withholding tax or not.

Create a new modelling experiment. Import the data into RDS and set the variables according to the table above. First, add a default tree model, and then add two ‘Custom tree models’ (that can be fully configured).
Finally, add a default ensemble model. Configure the two custom tree methods (which should now be named ‘Tree 2’ and ‘Tree 3’) as follows: for the first method, change the parameters ‘Minimum number of estimation examples’ and ‘Minimum number of separation examples’ to 15, and for the second method, change them to 30.

The minimum examples parameters and their meaning

When RDS creates a tree, the training data is partitioned into two equally sized parts: one that is used for finding the best new condition, and another that is used to estimate the parameters for the created partitions. For classification, the estimated parameters are the class probabilities.

Minimum number of estimation examples: instructs RDS that there cannot be fewer than this number of examples in the set used for estimating the parameters of a rule. The more examples there are in a rule, the more reliable the estimate becomes.

Minimum number of separation examples: instructs RDS not to consider a new partition in a tree if at least one of the created subgroups has fewer than this number of examples in the set used for finding new conditions.

Keep the default validation method for a start. When you are done, start the experiment by pressing the ‘Start’ button. This experiment takes a few minutes to complete. While it is running, take a look at the models in the model browser as they appear.

The first model consists of roughly 70 rules (i.e. leaf nodes). Look at the variable importance graph on the right-hand side of the model browser. It seems that the variable ‘Profit/Loss’ is the most important predictor. Zoom in and pan to the top of the tree to see what the most important splits in the tree are.
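The two minimum-example parameters described above can be illustrated with a small sketch. This is not how RDS is implemented (its internals are not described here); the function names, the `label` key and the toy data are all invented for illustration. The sketch only shows the idea of searching for conditions on one half of the data, estimating class probabilities on the other half, and rejecting a candidate split when a subgroup falls below either limit.

```python
import random


def split_halves(rows, seed=0):
    """Shuffle the rows and cut them into two (nearly) equal halves.

    The first half plays the role of the separation set (used to search for
    conditions), the second the estimation set (used to estimate class
    probabilities).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def split_allowed(separation, estimation, condition,
                  min_separation, min_estimation):
    """Check a candidate condition against both minimum-example limits."""
    for half, minimum in ((separation, min_separation),
                          (estimation, min_estimation)):
        inside = sum(1 for row in half if condition(row))
        outside = len(half) - inside
        # Reject the split if either subgroup is too small in this half.
        if inside < minimum or outside < minimum:
            return False
    return True


def class_probabilities(estimation, condition, label_key="label"):
    """Estimate class probabilities for one rule from the estimation set only."""
    labels = [row[label_key] for row in estimation if condition(row)]
    return {label: labels.count(label) / len(labels) for label in set(labels)}


# Toy data, invented for illustration (not the tax data set).
rows = [{"profit": p, "label": "Correct" if p < 0 else "Incorrect"}
        for p in range(-40, 40)]
separation, estimation = split_halves(rows)
is_loss = lambda row: row["profit"] < 0

print(split_allowed(separation, estimation, is_loss, 15, 15))
print(split_allowed(separation, estimation, is_loss, 30, 30))
print(class_probabilities(estimation, is_loss))
```

With 40 toy rows in each half, a limit of 30 can never be met by both subgroups at once, which mirrors why raising the limits in the experiment leads to smaller trees.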
At the top of the tree, the first and most important split is on a Profit/Loss of a few thousand SEK: it identifies mainly correct income tax returns for high amounts, and mainly incorrect income tax returns for low amounts. Follow the tree further down on the left-hand side, where the majority of the incorrect cases are, and look at the next node. The next split also uses the variable ‘Profit/Loss’, and in this case singles out a small group of correct cases from the rest with a condition saying that there should be a loss of more than about 1000 SEK (Profit/Loss < -1000). This is no surprise: a loss means money back, or at least lower taxes, so people can be expected to report it. Continue to browse this and the other models for as long as you like. What other findings are there?

When the experiment has finished, take a look at the overall statistics. The estimated performance of the models varies from 60 % for the first tree model to about 65 % for the ensemble model. The size of the tree models also drops as the minimum number of examples in the nodes increases.

We have now created three different decision trees and an ensemble model. These first results indicate that the parameters setting the minimum number of examples in the nodes matter for the performance of the tree models. For a more thorough validation, repeat the experiment and change the validation method to N-fold cross-validation. Keep the default number of folds (N = 10). Then run it again by clicking the ‘Start’ button.

This time, the results point in a clear direction: the accuracy of the tree models improves with a larger minimum number of examples, and the difference between the tree constructed with at least 30 examples and the default setting of 5 is even significant.
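For readers who want to see what N-fold cross-validation does outside RDS, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the RDS procedure: the function names are hypothetical, the labels are invented toy data, and the "model" is a trivial majority-class baseline standing in for a real tree or ensemble learner.

```python
import random


def k_fold_indices(n_examples, n_folds=10, seed=0):
    """Shuffle the example indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(rows, labels, train, predict, n_folds=10, seed=0):
    """Return one accuracy per fold; each fold is held out exactly once.

    Assumes there are at least n_folds examples, so no fold is empty.
    """
    accuracies = []
    for held_out in k_fold_indices(len(rows), n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits = sum(1 for i in held_out if predict(model, rows[i]) == labels[i])
        accuracies.append(hits / len(held_out))
    return accuracies


def majority_class(train_rows, train_labels):
    """Placeholder learner: remember the most common training label."""
    return max(set(train_labels), key=train_labels.count)


# Toy labels, invented for illustration (not the tax data set).
labels = ["Incorrect"] * 60 + ["Correct"] * 40
rows = list(range(len(labels)))  # features are unused by the baseline

scores = cross_validate(rows, labels, majority_class,
                        lambda model, row: model, n_folds=10)
print(sum(scores) / len(scores))
```

Because every example is held out exactly once, the averaged accuracy uses all of the data for testing without ever testing a model on examples it was trained on, which is why this estimate is more trustworthy than a single split.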
The ensemble method, however, is still estimated to give a predictive performance of almost 68 %, which is significantly better than all of the tree models.

Lessons learned

The obvious finding is that the most distinguishing feature for determining whether a taxpayer correctly reports the sale of a foreign fund is whether the profit or loss is large or small. This finding is the same for all tree models and is independent of how the model parameters are set; RDS finds the connection regardless.

We have also seen that tuning the parameters related to the number of examples in the nodes of the tree can improve the predictive accuracy of individual tree models. But, as in most cases, using an ensemble of multiple tree models gives a significant improvement in predictive accuracy. Decision tree models are best suited for finding significant relationships and interesting segments, while they are almost always less powerful than ensembles for making predictions.

What would the tax authority do with this information? Perhaps the first conclusion would be to note that the ensemble model is significantly better than random and say ‘let’s use it to catch those trying to withhold their tax, by using the model to select those most likely to cheat for further manual follow-up’. On the other hand, the strongest finding is that the amounts withheld are small. This suggests that people fail to pay their taxes correctly because they are either lazy or ignorant, not because they want to cheat. Perhaps they simply don’t bother to include the transaction in the income tax return if the amount is small. In that case, the most appropriate action would be to help the taxpayer to be honest. And in this story, which is actually based on a true story, this is what happened.
Now all sales of funds managed abroad are reported by the fund management firms to the tax authorities and are included in the simplified forms.

If you have experience from sales or marketing, in what way does the task of finding missing taxes differ from finding profitable customers? It is all about how much you know about the persons, and the quality of your data.

Downloads

Download the data set as a tab-delimited text file
Download the experiment including the data set as an RDS experiment file

Further reading

About the Swedish tax system - An introduction to the Swedish tax system written by the Swedish government