Python has become the language of choice for many statisticians and data scientists. The main driving force behind Python's adoption among data scientists is Pandas, the open source data analysis library for Python.
Here are some features of the library:
- A fast, flexible DataFrame object for working with tabular data
- Tools for reading data from flat files such as CSV into DataFrames
- Built-in sorting, aggregation, and summary statistics
- Simple column arithmetic for deriving new fields
Based on these features, we give below some examples of using Pandas on home insurance rate data from vHomeInsurance.
Home insurance rates for the US states serve as the base data set for exploring Pandas.
We assume the input file to be US_Homeinsurance_Rates.csv.
This file contains state-wise home insurance rates from 1999 to 2013; a sample of the data set is given below.
| State | State Short | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2013 |
|-------|-------------|------|------|------|------|------|------|------|------|------|------|------|
| Alaska | AK | 601 | 607 | 614 | 668 | 731 | 810 | 837 | 850 | 861 | 856 | 1455 |
| Alabama | AL | 457 | 476 | 501 | 533 | 681 | 793 | 847 | 894 | 905 | 845 | 1667 |
| Arkansas | AR | 501 | 494 | 537 | 616 | 721 | 768 | 775 | 802 | 762 | 788 | 1545 |
| Arizona | AZ | 418 | 438 | 460 | 543 | 614 | 642 | 635 | 640 | 634 | 628 | 1227 |
| California | CA | 578 | 592 | 599 | 660 | 753 | 835 | 895 | 937 | 925 | 911 | 1819 |
| Colorado | CO | 559 | 571 | 595 | 660 | 762 | 811 | 807 | 813 | 826 | 842 | 1271 |
| Connecticut | CT | 546 | 571 | 608 | 652 | 714 | 777 | 823 | 878 | 929 | 980 | 1723 |
| District of Columbia | DC | 617 | 635 | 655 | 697 | 806 | 894 | 963 | 1012 | 1089 | 926 | 1574 |
| Delaware | DE | 317 | 329 | 358 | 390 | 442 | 488 | 498 | 530 | 559 | 535 | 1586 |
| Florida | FL | 657 | 666 | 709 | 786 | 810 | 929 | 1083 | 1386 | 1534 | 1390 | 1364 |
| Georgia | GA | 448 | 458 | 471 | 517 | 570 | 635 | 672 | 703 | 724 | 749 | 2155 |
| Hawaii | HI | 606 | 554 | 553 | 565 | 687 | 726 | 837 | 776 | 850 | 862 | 1304 |
| Iowa | IA | 347 | 358 | 382 | 450 | 542 | 575 | 594 | 596 | 610 | 612 | 1028 |
| Idaho | ID | 319 | 326 | 334 | 382 | 433 | 448 | 457 | 477 | 422 | 387 | 969 |
| Illinois | IL | 387 | 410 | 428 | 516 | 610 | 659 | 660 | 674 | 700 | 628 | 1322 |
To start working with the data, we first need to import the pandas library:
import pandas as pd
Next, we can read the data from the CSV file using pandas' read_csv function; the head function is useful for a quick preview.
Insurance_rates = pd.read_csv("US_Homeinsurance_Rates.csv", delimiter=",")
Insurance_rates.head()
Output of head (the first five rows of the data set shown above):

        State State Short  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
0      Alaska          AK   601   607   614   668   731   810   837   850   861   856  1455
1     Alabama          AL   457   476   501   533   681   793   847   894   905   845  1667
2    Arkansas          AR   501   494   537   616   721   768   775   802   762   788  1545
3     Arizona          AZ   418   438   460   543   614   642   635   640   634   628  1227
4  California          CA   578   592   599   660   753   835   895   937   925   911  1819
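Before exploring further, a quick structural check can confirm the load went as expected. A minimal sketch, assuming the DataFrame was created as above:

# Number of rows (one per state) and number of columns
print(Insurance_rates.shape)

# Column names with their inferred data types
print(Insurance_rates.dtypes)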
Once we have done a quick read of the data, we can start exploring it. Here is an example of a simple sort.
Note that in current pandas versions, sorting is done with sort_values() (the older DataFrame.sort() method has been removed).
Syntax of sort_values: DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Now we will sort the data by the "2013" column (ascending), breaking ties by state name (descending):

Insurance_rates.sort_values(['2013', 'State'], ascending=[True, False])
Output: the full data set reordered by the 2013 rate, cheapest states first.
Based on the above sorting, you will find that North Carolina's home insurance rate was the 5th cheapest in the US.
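To verify a ranking like this programmatically rather than by scanning the sorted output, pandas' nsmallest can be used. A minimal sketch, assuming the column names shown in the table above:

# The five states with the lowest 2013 rates, cheapest first
cheapest = Insurance_rates.nsmallest(5, '2013')
print(cheapest[['State', 'State Short', '2013']])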
We can move on to a powerful statistical summarization method called describe(), which summarizes the data set with key metrics such as count, mean, standard deviation, min, max, and quartiles.
Insurance_rates.describe()
Output:
             1999        2000        2001        2002        2003
count   51.000000   51.000000   51.000000   51.000000   51.000000
mean   463.745098  477.823529  502.509804  562.705882  643.313725
std    114.516696  113.403828  119.359017  144.584411  154.357311
min    266.000000  287.000000  308.000000  340.000000  433.000000
25%    387.500000  403.500000  424.500000  478.500000  551.500000
50%    441.000000  451.000000  480.000000  536.000000  621.000000
75%    529.500000  548.000000  566.500000  613.500000  717.500000
max    861.000000  880.000000  955.000000 1238.000000 1328.000000

             2004        2005        2006        2007        2008
count   51.000000   51.000000   51.000000   51.000000   51.000000
mean   700.725490  730.078431  758.000000  778.000000  757.509804
std    164.628197  176.989925  203.835326  231.311219         217
min    448.000000  457.000000  477.000000  422.000000  387.000000
25%    604.500000  631.500000  639.000000  646.500000         620
50%    659.000000  682.000000  706.000000  721.000000  703.000000
75%    773.000000  831.500000  858.500000  882.500000  859.000000
max   1362.000000 1372.000000 1409.000000 1534.000000 1460.000000

              2013
count    51.000000
mean   1479.000000
std     359.323698
min     827.000000
25%    1287.500000
50%    1431.000000
75%    1631.500000
max    2699.000000
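describe() can also be tailored: it accepts a percentiles argument, and it works on a single column as well. A small sketch, assuming the 2013 column was parsed as numeric:

# Summary of the 2013 rates with the 10th and 90th percentiles added
print(Insurance_rates['2013'].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))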
To find the median and mean of each year's rates, here are the commands (in recent pandas versions, numeric_only=True is needed so the text columns are skipped):

Insurance_rates.median(numeric_only=True)
Insurance_rates.mean(numeric_only=True)
Output (median, then mean):

1999     441
2000     451
2001     480
2002     536
2003     621
2004     659
2005     682
2006     706
2007     721
2008     703
2013    1431

1999     463.745098
2000     477.823529
2001     502.509804
2002     562.705882
2003     643.313725
2004     700.725490
2005     730.078431
2006     758.000000
2007     778.000000
2008     757.509804
2013    1479.000000
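Several statistics can also be collected in a single call with agg, which is convenient for a compact side-by-side comparison. A minimal sketch using two of the year columns from the data set above:

# Mean, median, and standard deviation for 2003 and 2013 side by side
print(Insurance_rates[['2003', '2013']].agg(['mean', 'median', 'std']))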
We can attempt a more complicated problem: finding the percentage change from 2003 to 2013. Executing a query like this in SQL would require several steps and the calculation of many base values before reaching a final answer. In pandas, the query is just two lines!
To compute it, we can easily add a new column to the DataFrame: just write DataFrame['column_name'] = value.
Insurance_rates['Percentage Change'] = (Insurance_rates['2013'] - Insurance_rates['2003']) * 100 / Insurance_rates['2003']
print(Insurance_rates.loc[:, ['State Short', '2003', '2013', 'Percentage Change']])
Output:

   State Short  2003  2013  Percentage Change
0           AK   731  1455          99.042408
1           AL   681  1667         144.787078
2           AR   721  1545         114.285714
3           AZ   614  1227          99.837134
4           CA   753  1819         141.567065
5           CO   762  1271          66.797900
6           CT   714  1723         141.316527
7           DC   806  1574          95.285360
8           DE   442  1586         258.823529
9           FL   810  1364          68.395062
10          GA   570  2155         278.070175
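With the new column in place, it takes only one more sort to see which states' rates grew fastest over the decade. A sketch using the column we just created:

# States ordered by the largest percentage increase from 2003 to 2013
top_growth = Insurance_rates.sort_values('Percentage Change', ascending=False)
print(top_growth[['State Short', 'Percentage Change']].head(5))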
The above examples illustrate that if you are a data scientist or analyst working in Python, Pandas can quickly become a must-have tool.
This is a guest blog post by vHomeInsurance.com.