There are two pieces to this puzzle -- building the model and then applying the results. A number of software companies offer products to build models. The comment below gets at the root of the problem -- "What specifically are you trying to predict?" The second piece of the puzzle falls on the "investigative side" and can be used in conjunction with the model. For example, if you know which policies are suspicious (had misused insurance coverage or had filed fraudulent claims that were not paid), you probably also want to look for other claims that share some piece of information with the bad claims you have already identified. This could indicate collusion or hidden relationships that need to be investigated. The best way of showing these hidden relationships is through link analysis, a form of data visualization that shows linkages between claims, addresses, phone numbers, and other attributes and lets you trace the linkages from the suspicious claims to other, more recent claims. Here's a link to a video we posted on identifying fraud using data visualization. This example is for bank fraud, but it also applies to insurance fraud: http://www.centrifugesystems.com/technology/applications/fraud.php
I suggest adding a little more about the data set you would be using to model. A fuller description of the data set will help others understand the problem you are stating. If the behavior you are modeling is rare, you will need a modeling approach that takes false positives into account. Usually that is remedied by oversampling or replication.
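To make the oversampling idea concrete, here is a minimal sketch using scikit-learn's `resample` utility on synthetic data (the 990/10 class split and the feature matrix are made up for illustration): the rare fraud class is resampled with replacement until it matches the majority class.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy claims table: 990 legitimate (label 0) and 10 fraudulent (label 1).
X = rng.normal(size=(1000, 4))
y = np.array([0] * 990 + [1] * 10)

# Split out the rare (fraud) class and oversample it with replacement
# until it matches the majority-class size.
X_fraud, X_legit = X[y == 1], X[y == 0]
X_fraud_up = resample(X_fraud, replace=True, n_samples=len(X_legit),
                      random_state=42)

X_balanced = np.vstack([X_legit, X_fraud_up])
y_balanced = np.array([0] * len(X_legit) + [1] * len(X_fraud_up))
```

The balanced set then feeds whatever classifier you choose; an alternative with the same intent is passing `class_weight="balanced"` to the model instead of duplicating rows.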
I'm assuming you want a predictive model. I would suggest starting with a logistic regression approach and studying those methods. Once you have learned them, you can look into some hybrid modeling methods. Machine learning could be another resource.
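A minimal sketch of that starting point, on synthetic data (the feature names `claim_amount` and `prior_claims`, and the rule generating the labels, are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic claim attributes -- stand-ins for real policy data.
n = 2000
claim_amount = rng.exponential(5000, n)
prior_claims = rng.poisson(1.0, n)

# Toy ground truth: fraud odds rise with claim size and prior-claim count.
logit = -4 + 0.0003 * claim_amount + 0.8 * prior_claims
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([claim_amount, prior_claims])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# class_weight="balanced" compensates for fraud being the rarer class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
```

The fitted coefficients (`model.coef_`) are directly interpretable as the effect of each attribute on the log-odds of fraud, which is a big part of why logistic regression is a sensible first model here.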
For this type of model -- suspicious claim identification -- you'll need a dataset with a dependent variable on which to build the model. For fraudulent claims, that would be a binary indicator separating known fraudulent claims (no false positives) from known legitimate claims. Check whether you have this data set and what time period it covers. You will need a representative sample of both types of claims.
For a "suspicious claims" model, the dependent variable would be defined a little differently. You need a set of claims that were flagged as suspicious, investigated, and found to be fraudulent. I would also include a set of claims that were flagged as suspicious, investigated, and found to be legitimate. This type of model would be designed to weed out false positives and false negatives before the investigative process, which could be valuable for resource allocation.
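Constructing that dependent variable might look like the following sketch (the table, column names, and outcome values are hypothetical; the point is that every row was flagged, investigated, and given a confirmed outcome):

```python
import pandas as pd

# Hypothetical investigation log: each claim here was flagged as
# suspicious and then investigated by SIU to a confirmed outcome.
investigated = pd.DataFrame({
    "claim_id": [101, 102, 103, 104, 105],
    "outcome":  ["fraud", "legitimate", "fraud", "legitimate", "legitimate"],
})

# Dependent variable: 1 = flagged and confirmed fraud,
#                     0 = flagged, investigated, and cleared.
investigated["target"] = (investigated["outcome"] == "fraud").astype(int)
```

Claims that were flagged but never investigated to a conclusion should be left out of the training set, since their true label is unknown.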
To be clear, there are different types of models you can build here -- one to predict fraud and one to predict which suspicious claims turn out to be genuinely fraudulent. Let this group know what data you have to work with and they can help guide you.
From that point, you need as much independent or "attribute" data as you can get -- the time of day the claim is filed, data about the claim, geographic location, age of the person submitting the claim, number of phone numbers and addresses attached to the claim, info about other parties involved, whether the claim was submitted through the web -- as much data as you can get. Generally, the more independent data you feed the model, the more explanatory power the model may have.
Once the model is built, your stat package should produce scores and deciles for the records, plus an algorithm to apply to a new dataset that does not have the dependent variable. The resulting scores will tell you which records are likely to be suspicious or fraudulent, depending on what you are modeling.
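A small sketch of that scoring-and-deciling step, using a toy fitted model and synthetic "new" claims (all data here is made up; the convention that decile 1 holds the highest-scoring 10% is one common choice, not the only one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Fit a toy model on labeled historical claims...
X_hist = rng.normal(size=(500, 3))
y_hist = (X_hist[:, 0] + rng.normal(scale=0.5, size=500)) > 1
model = LogisticRegression().fit(X_hist, y_hist)

# ...then score a new batch of claims that has no dependent variable.
X_new = rng.normal(size=(200, 3))
scores = model.predict_proba(X_new)[:, 1]   # estimated P(suspicious)

# Assign deciles: decile 1 = highest-scoring 10%, decile 10 = lowest.
ranks = scores.argsort()[::-1].argsort()    # rank 0 = highest score
deciles = ranks * 10 // len(scores) + 1
```

Investigators can then work the deciles from the top down, which is how scores typically translate into resource allocation.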
I would also investigate any outliers in the model results to try to refine the input dataset used in the model. You may discover an attribute that could be used to predict more cases accurately in the future. I hope this helps.
Here are my suggestions for modeling insurance fraud:
• Step 1: Supervised Learning. We start with claims that have already been confirmed as fraud by SIU. Build a predictive model to score new claims and identify those that are similar to claims found to be fraud in the past. This is the low-hanging fruit. The result of a good supervised model would be that a greater proportion of claims referred to SIU are found to be fraud, allowing SIU to uncover more fraud with the same level of resources.
• Step 2: Unsupervised Learning. An approach to uncover novel or unknown types of fraudulent activity. This is really just anomaly detection: claims with unusual combinations of characteristics can be isolated, but manual investigation is required to determine which, if any, are fraudulent. If certain types of anomalies are often found to be fraudulent, other, similar claims can be referred to SIU. This is riskier than supervised learning -- uncovering anomalies does not mean they will be fraudulent -- but offers greater upside, since it can uncover schemes that are unknown and therefore not being detected. Positive results from Step 1 (supervised) can be used to support Step 2 (unsupervised).
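The anomaly-detection step could be sketched with scikit-learn's `IsolationForest` (one of several reasonable detectors); the data below is synthetic, with a small group of claims given deliberately unusual attribute combinations, and `contamination` is just the analyst's guess at the anomaly share:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Mostly typical claims, plus a few with unusual attribute combinations.
typical = rng.normal(loc=0.0, scale=1.0, size=(980, 4))
unusual = rng.normal(loc=6.0, scale=1.0, size=(20, 4))
X = np.vstack([typical, unusual])

# contamination = expected share of anomalies; a tuning choice, not a fact.
detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(X)   # -1 = anomaly, 1 = normal

flagged = np.where(labels == -1)[0]   # row indices to hand to investigators
```

As the step above notes, the flagged rows are only candidates: each one still needs manual investigation before any of them can be called fraud.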