
Python has become the language of choice for many statisticians and data scientists. A major driving force behind Python's adoption among data scientists is pandas, the open-source data analysis library for Python.

Here are some features of the library:

  1. A fast and efficient DataFrame object for data manipulation with integrated indexing;
  2. Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  3. Flexible reshaping and pivoting of data sets;
  4. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  5. Columns can be inserted and deleted from data structures for size mutability;
  6. Time-series functionality: date-range generation and frequency conversion;
  7. Aggregating or transforming data with a powerful group-by engine allowing split-apply-combine operations on data sets.
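As a quick illustration of the split-apply-combine pattern from feature 7, the sketch below groups a few made-up premium figures by state and averages each group (the column names and numbers are invented for this example, not taken from the data set used later):

```python
import pandas as pd

# Toy data, invented for this illustration only
premiums = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL", "AL"],
    "premium": [601, 607, 457, 476, 501],
})

# Split by state, apply mean() to each group, combine into one Series
avg_by_state = premiums.groupby("state")["premium"].mean()
print(avg_by_state)
```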

Based on these features, below are some examples of using pandas on home insurance rates data from vHomeInsurance.

Home insurance rates for the US states form the base data set we use to explore pandas.

 

We assume the input file is US_Homeinsurance_Rates.csv.

This file contains state-wise home insurance rates from 1999 to 2013. A sample of the data set is shown below.

state                  State Short   1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
Alaska                 AK             601   607   614   668   731   810   837   850   861   856  1455
Alabama                AL             457   476   501   533   681   793   847   894   905   845  1667
Arkansas               AR             501   494   537   616   721   768   775   802   762   788  1545
Arizona                AZ             418   438   460   543   614   642   635   640   634   628  1227
California             CA             578   592   599   660   753   835   895   937   925   911  1819
Colorado               CO             559   571   595   660   762   811   807   813   826   842  1271
Connecticut            CT             546   571   608   652   714   777   823   878   929   980  1723
District of Columbia   DC             617   635   655   697   806   894   963  1012  1089   926  1574
Delaware               DE             317   329   358   390   442   488   498   530   559   535  1586
Florida                FL             657   666   709   786   810   929  1083  1386  1534  1390  1364
Georgia                GA             448   458   471   517   570   635   672   703   724   749  2155
Hawaii                 HI             606   554   553   565   687   726   837   776   850   862  1304
Iowa                   IA             347   358   382   450   542   575   594   596   610   612  1028
Idaho                  ID             319   326   334   382   433   448   457   477   422   387   969
Illinois               IL             387   410   428   516   610   659   660   674   700   628  1322

 

To start working with the data, we need to import the pandas library:

import pandas as pd

 

Next, we can read the data from the CSV file using the read_csv function in pandas; the head function is useful for a quick look at the first rows.

Insurance_rates = pd.read_csv("US_Homeinsurance_Rates.csv", delimiter=",")

Insurance_rates.head()

Output of head():

state        State Short   1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
Alaska       AK             601   607   614   668   731   810   837   850   861   856  1455
Alabama      AL             457   476   501   533   681   793   847   894   905   845  1667
Arkansas     AR             501   494   537   616   721   768   775   802   762   788  1545
Arizona      AZ             418   438   460   543   614   642   635   640   634   628  1227
California   CA             578   592   599   660   753   835   895   937   925   911  1819
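Before going further, it is worth a quick sanity check that the file loaded as expected; shape and dtypes are handy for this. The sketch below uses a small frame built inline, standing in for Insurance_rates, since the CSV itself is not bundled with this post:

```python
import pandas as pd

# Inline stand-in for the Insurance_rates frame loaded from the CSV
Insurance_rates = pd.DataFrame({
    "state": ["Alaska", "Alabama"],
    "State Short": ["AK", "AL"],
    "2013": [1455, 1667],
})

print(Insurance_rates.shape)   # (number of rows, number of columns)
print(Insurance_rates.dtypes)  # the type pandas inferred for each column
```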

 

Once we have taken a quick look at the data, we can start exploring it. Here is an example of simple sorting. (Older pandas releases offered DataFrame.sort for this; in current pandas the method is sort_values.)

Syntax of sort_values: DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Now we will sort the data by the "2013" column, breaking ties on "state" in descending order:

Insurance_rates.sort_values(by=['2013', 'state'], ascending=[True, False])

Output:

state            State Short   1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
Maine            ME             336   352   372   416   462   513   553   573   596   572   827
Vermont          VT             414   426   451   493   552   608   646   677   704   650   873
Idaho            ID             319   326   334   382   433   448   457   477   422   387   969
Iowa             IA             347   358   382   450   542   575   594   596   610   612  1028
North Carolina   NC             427   449   484   527   576   623   644   649   674   683  1085

 

Based on the above sorting, you will find that North Carolina's home insurance rates were the fifth cheapest in the US in 2013.
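When only the cheapest few states are needed, a full sort is not required; pandas' nsmallest does the same job in one call. A sketch on a small inline frame holding the 2013 rates of the five states shown above:

```python
import pandas as pd

# The 2013 rates of the five cheapest states from the sorted output above
df = pd.DataFrame({
    "State Short": ["ME", "VT", "ID", "IA", "NC"],
    "2013": [827, 873, 969, 1028, 1085],
})

# Rows with the three smallest 2013 rates, in ascending order
cheapest = df.nsmallest(3, "2013")
print(cheapest)
```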

 

We can move on to a relatively powerful statistical summarization function, describe, which summarizes the data set with key metrics such as count, min, max, and mean.

Insurance_rates.describe()

Output:

               1999        2000        2001         2002         2003  \

count   51.000000   51.000000   51.000000    51.000000    51.000000   

mean   463.745098  477.823529  502.509804   562.705882   643.313725   

std    114.516696  113.403828  119.359017   144.584411   154.357311   

min    266.000000  287.000000  308.000000   340.000000   433.000000   

25%    387.500000  403.500000  424.500000   478.500000   551.500000   

50%    441.000000  451.000000  480.000000   536.000000   621.000000   

75%    529.500000  548.000000  566.500000   613.500000   717.500000   

max    861.000000  880.000000  955.000000  1238.000000  1328.000000   

 

             2004         2005         2006         2007         2008  \

count    51.000000    51.000000    51.000000    51.000000    51.000000
mean    700.725490   730.078431   758.000000   778.000000   757.509804
std     164.628197   176.989925   203.835326   231.311219   217
min     448.000000   457.000000   477.000000   422.000000   387.000000
25%     604.500000   631.500000   639.000000   646.500000   620
50%     659.000000   682.000000   706.000000   721.000000   703.000000

75%     773.000000   831.500000   858.500000   882.500000   859.000000   

max    1362.000000  1372.000000  1409.000000  1534.000000  1460.000000   

 

             2013  

count    51.000000  

mean   1479.000000  

std     359.323698  

min     827.000000  

25%    1287.500000  

50%    1431.000000  

75%    1631.500000  

max    2699.000000  

 

To find the median and mean of each numeric column, here are the commands (numeric_only=True skips the two text columns):

Insurance_rates.median(numeric_only=True)

Insurance_rates.mean(numeric_only=True)

Output (median first, then mean):

 

1999     441

2000     451

2001     480

2002     536

2003     621

2004     659

2005     682

2006     706

2007     721

2008     703

2013    1431

1999     463.745098

2000     477.823529

2001     502.509804

2002     562.705882

2003     643.313725

2004     700.725490

2005     730.078431

2006     758.000000

2007     778.000000

2008     757.509804

2013    1479.000000
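The two calls above can also be combined into a single summary table with agg, which applies several aggregation functions at once (sketched here on made-up numbers rather than the full data set):

```python
import pandas as pd

# Made-up values standing in for two of the year columns
df = pd.DataFrame({"1999": [400, 440, 480], "2013": [1400, 1430, 1500]})

# One row per aggregation function, one column per year
summary = df.agg(["mean", "median"])
print(summary)
```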

 

We can attempt a more complicated problem: finding the percentage change from 2003 to 2013. Executing a query like this in SQL would require several steps and the calculation of many base values before getting a final answer. In pandas, the query is just two lines!

 

To find the percentage change from 2003 to 2013, we can easily add a new column to the DataFrame: just assign DataFrame['column_name'] = value.

 

Insurance_rates['Percentage Change'] = (Insurance_rates['2013'] - Insurance_rates['2003']) * 100 / Insurance_rates['2003']

print(Insurance_rates.loc[:, ['State Short', '2003', '2013', 'Percentage Change']])

Output

  State Short  2003  2013  Percentage Change

0           AK   731  1455          99.042408

1           AL   681  1667         144.787078

2           AR   721  1545         114.285714

3           AZ   614  1227          99.837134

4           CA   753  1819         141.567065

5           CO   762  1271          66.797900

6           CT   714  1723         141.316527

7           DC   806  1574          95.285360

8           DE   442  1586         258.823529

9           FL   810  1364          68.395062

10          GA   570  2155         278.070175
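With the new column in place, one more sort_values call ranks the states by how fast their rates grew. The sketch below reproduces the calculation on three of the rows from the output above:

```python
import pandas as pd

# Three rows from the output above (AK, DE, GA)
df = pd.DataFrame({
    "State Short": ["AK", "DE", "GA"],
    "2003": [731, 442, 570],
    "2013": [1455, 1586, 2155],
})

# Same formula as in the post, then sort descending by the new column
df["Percentage Change"] = (df["2013"] - df["2003"]) * 100 / df["2003"]
fastest = df.sort_values("Percentage Change", ascending=False)
print(fastest[["State Short", "Percentage Change"]])
```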

 

The above examples illustrate that if you are a data scientist or analyst who works in Python, pandas can quickly become a must-have tool.

This is a guest blog post by vHomeInsurance.com.


© 2017 AnalyticBridge.com, a subsidiary and dedicated channel of Data Science Central LLC.