Combining, Cleaning and Normalizing Data

Indian Health Service

Tribes

According to the National Congress of American Indians, there are approximately 573 federally recognized Indian Nations (interchangeably called tribes, nations, bands, pueblos, communities and native villages) in the United States 1. Nearly 40% (229) are located in the state of Alaska, while the others are located in 35 other states. Each Nation is ethnically, culturally and linguistically diverse with its own unique history1.

Indian Health Service

The goal of the Indian Health Service (IHS), is to provide federally funded health service to American Indians and/or Alaska Natives. The National Patient Information Reporting System (NPIRS), instituted the National Data Warehouse (NDW). The NDW is a data warehouse environment specifically for the Indian Health Service's (IHS) national data repository 2. Within this repository is information regarding various levels of patient care. https://www.ihs.gov/ndw/

Standard Codebook

Looking deeper within the NDW are tables with the approved codes sets from the Indian Health Service (IHS) Standard Code Book (SCB). The Standard Code Book is a uniform listing of descriptive terms and codes for the purpose of recording/reporting medical information collected during the provision of health care services. One of the tables is named "Community". A community is an area in which a tribe, is known to reside. It is not designated by coordinates. Because of this, it is difficult to determine where a community may be.

Goal of Project

The goal of this project is to associate Indian Health Service communities to a Geographic Names Information System (GNIS) location. The Indian Health Service (IHS) consists of 14,770 unique active and inactive communities. This project also demonstrates some of the most tedious part sof preparing datasets for exploration=data cleansing and normalization-also called munging.

Geographic Information Names System (GNIS)

The Geographic Names Information System (GNIS) is the federal and national standard for geographic nomenclature. 3 The U.S. Geological Survey developed the GNIS in support of the U.S. Board on Geographic Names as the official repository of domestic geographic names data, the official vehicle for geographic names use by all departments of the Federal Government, and the source for applying geographic names to federal electronic and printed products 3. The GNIS download consists of 2,279,278 locations within the United States as a text file. There are also individual state text files for usage.

Datasets

Table 1: Geographic Names Information Systems (GNIS) columns | Name | Type | |-----------------------|-----------| | Feature Name | Number | | Feature Class | Character | | State Alpha | Character | | State Numeric | Character | | County Name | Character | | County Numeric | Character | | Primary Latitude DMS | Character | | Primary Longitude DMS | Character | | Primary Latitude DEC | Number | | Primary Longitude DEC | Number | | Elevation (meters) | Number | | Source Latitude DMS | Character | | Source Longitude DMS | Character | | Source Latitude DEC | Number | | Source Longitude DEC | Number | | Elevation (feet) | Number | | Map Name | Character | | Date Created | Date | | Date Edited | Date | Table 2: Community table from Indian Health Service Standard Codebook | Name | Type | |-----------|-----------| | Code | Number | | State | Character | | County | Character | | Community | Character | | ASU Code | Number | | Status | Character |

Download Files

A copy of The Indian Health Service communities was downloaded from the standard codebook in excel file format: https://www.ihs.gov/scb/index.cfm?module=W_COMMUNITY&option=list&num=84&newquery=1

As well as a copy of the GNIS locations in text file format (NationalFile_20181201.zip): https://geonames.usgs.gov/domestic/download_data.htm

Basic Normalization

The data was normalized by converting the columns of interest to string format for easier data handling, stripped of leading and trailing whitespaces and applied uppercase lettering. Inactive communities and unknown locations were excluded from this analysis. The same process was completed for the GNIS file for columns: ‘COUNTY_NUMERIC’, ‘STATE_NUMERIC’, ‘FEATURE_NAME’ and ‘FEATURE_CLASS’. The assumption is the Community column in the IHS standard codebook should be similar or identical to the FEATURE_NAME column in the GNIS dataset.

In [1]:
import pandas as pd  
import fuzzymatcher  
#Read in the IHS Communities File
df1=pd.read_excel('Communities.xlsx')
#Change Columns of interest to strings
df1[['Code', 'State', 'County', 'Community', 'Status']] = df1[['Code', 'State', 'County', 'Community', 'Status']].astype(str)
#Apply uppercase and strip leading and trailing whitespaces of all object columns
df1 = df1.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df1 = df1.apply(lambda x: x.str.upper() if x.dtype == "object" else x)
#The code is the INCITS code, I am only interested in the state code at this time
df1['Code'] = df1['Code'].str[0:2]
#Checks to view result and counts...a must
df1.head() 
#Checking the overall structure of the dataset
df1.shape
Out[1]:
(14769, 6)
In [2]:
#Read in the GNIS file
df2=pd.read_csv('NationalFile_20181201.txt', "|")
#Change Columns of interest to strings
df2[['COUNTY_NUMERIC', 'STATE_NUMERIC', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'COUNTY_NAME']] = df2[['COUNTY_NUMERIC', 'STATE_NUMERIC', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'COUNTY_NAME']].astype(str)
#Apply uppercase and strip leading and trailing whitespaces of all object columns
df2 = df2.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df2 = df2.apply(lambda x: x.str.upper() if x.dtype == "object" else x)
#Checks to view result and counts...checking the FEATURE_NAME column given that it is of most interest at this time
df2['FEATURE_NAME'] 
df2.head() 
#Looking for Unnamed Columns
df2.loc[:,~df2.columns.str.contains('^Unnamed')]  
#Checking the overall structure of the dataset
df2.shape
Out[2]:
(2279278, 20)

Inner Merge Once the data in both datasets were normalized, an inner merge was performed on State Code (Code), Community, STATE_NUMERIC and FEATURE_NAME and saved in a dataframe (df3). After the merge, duplicates were dropped based upon state and Community name. Then communities were identified that were not included in this merge due to spelling differences between IHS and GNIS datasets. The merged output was then placed into a separate dataframe (df4).

Merging/Matching

In [3]:
#Merging IHS dataset to GNIS dataset based on the two digit State Code and Name
df3=pd.merge(df1,df2, how='inner', left_on=['Code','Community'],right_on=['STATE_NUMERIC','FEATURE_NAME'])  
df3.shape  
#Drop duplicates after initial merge (there are many locations in the United States that have identical names, 
#only interested in locations within the same state as a quality check)
df4=df3.sort_values('Community').drop_duplicates(subset=['Code','Community'],keep='first')  
df4.loc[:,~df4.columns.str.contains('^Unnamed')]  
df4.shape 
Out[3]:
(9734, 26)

Identitfy Residuals

The communities identified that were not included in the merge were classified as “residuals”. These residuals were placed into a dataframe (df5) to isolate them from the rest of the data to be used for further analysis.

In [4]:
df5=df1[~df1['Community'].isin(df4['FEATURE_NAME'])]  
df5.shape  
Out[4]:
(4055, 6)
In [12]:
df5.head()
Out[12]:
<bound method NDFrame.head of       Code State              County           Community ASU Code    Status
0       01    AL             BALDWIN         BAY MINETTE     5874    ACTIVE
1       01    AL             BALDWIN              DAPHNE     5874    ACTIVE
3       01    AL             BALDWIN            FAIRHOPE     5874    ACTIVE
5       01    AL             BALDWIN         ROBERTSDALE     5874    ACTIVE
7       01    AL             BALDWIN          SUMMERDALE     5874    ACTIVE
8       01    AL             BALDWIN          SILVERHILL     5874    ACTIVE
9       01    AL             BALDWIN             PERDIDO     5874    ACTIVE
10      01    AL             BALDWIN              LOXLEY     5874    ACTIVE
11      01    AL             BALDWIN         GULF SHORES     5874    ACTIVE
12      01    AL             BALDWIN        ORANGE BEACH     5874    ACTIVE
14      01    AL             BALDWIN             LILLIAN     5874    ACTIVE
15      01    AL             BALDWIN        SPANISH FORT     5874    ACTIVE
17      01    AL             BALDWIN           STAPLETON     5874    ACTIVE
18      01    AL              BUTLER            MCKENZIE     5800    ACTIVE
20      01    AL             CONECUH              REPTON     5800    ACTIVE
21      01    AL              DEKALB          RAINSVILLE     5800    ACTIVE
22      01    AL            ESCAMBIA             BREWTON     5874    ACTIVE
23      01    AL            ESCAMBIA       EAST ESCAMBIA     5874    ACTIVE
24      01    AL            ESCAMBIA            FLOMATON     5874    ACTIVE
25      01    AL            ESCAMBIA        MCCULLOUG-HU     5874    ACTIVE
26      01    AL            ESCAMBIA             HUXFORD     5874    ACTIVE
27      01    AL            ESCAMBIA        EAST BREWTON     5874    ACTIVE
28      01    AL            ESCAMBIA              ATMORE     5874    ACTIVE
29      01    AL              MOBILE        BAYOU LA BAT     5874    ACTIVE
30      01    AL              MOBILE          CITRONELLA     5874    ACTIVE
31      01    AL              MOBILE           GRAND BAY     5874    ACTIVE
32      01    AL              MOBILE              MOBILE     5874    ACTIVE
34      01    AL              MOBILE              SEMMES     5874    ACTIVE
35      01    AL              MOBILE        TANNER-WILLM     5874    ACTIVE
36      01    AL              MOBILE            THEODORE     5874    ACTIVE
...    ...   ...                 ...                 ...      ...       ...
14679   55    WI   WISCONSIN UNKNOWN        WISCONSN UNK     1100  INACTIVE
14680   55    WI   WISCONSIN UNKNOWN        WISCONSN UNK     1100  INACTIVE
14694   56    WY             FREMONT        BOULDER FLAT     4046    ACTIVE
14697   56    WY             FREMONT        DRY CREEK RA     4046    ACTIVE
14700   56    WY             FREMONT         FT WASHAKIE     4046    ACTIVE
14708   56    WY             FREMONT         ST STEPHENS     4046    ACTIVE
14714   56    WY         HOT SPRINGS        HAMILTON DOM     4046    ACTIVE
14722   56    WY             NATRONA             BARNUNN     4000    ACTIVE
14746   56    WY         WYOMING UNK         WYOMING UNK     4000    ACTIVE
14747   56    WY         WYOMING UNK         WYOMING UNK     4000  INACTIVE
14748   56    WY         WYOMING UNK         WYOMING UNK     4000  INACTIVE
14750   72    PR     PUERTO RICO UNK     PUERTO RICO UNK     9999    ACTIVE
14751   72    PR     PUERTO RICO UNK        PUERTO R UNK     5800  INACTIVE
14752   72    PR     PUERTO RICO UNK        PUERTO R UNK     5800  INACTIVE
14753   78    VI  VIRGIN ISLANDS UNK  VIRGIN ISLANDS UNK     9999    ACTIVE
14754   78    VI  VIRGIN ISLANDS UNK        VIRGIN I UNK     5800  INACTIVE
14755   78    VI  VIRGIN ISLANDS UNK        VIRGIN I UNK     5800  INACTIVE
14756   95    NL     NETHERLANDS UNK     NETHERLANDS UNK     9999    ACTIVE
14757   96    CA          CANADA UNK            CORNWALL     5800    ACTIVE
14758   96    CA          CANADA UNK     CORNWALL ISLAND     5800    ACTIVE
14759   96    CA          CANADA UNK                SNYE     5800    ACTIVE
14760   96    CA          CANADA UNK           ST. REGIS     5800    ACTIVE
14761   96    CA          CANADA UNK          CANADA UNK     9999    ACTIVE
14762   96    CA          CANADA UNK          CANADA UNK     9999  INACTIVE
14763   96    CA          CANADA UNK          CANADA UNK     9999  INACTIVE
14764   97    MX          MEXICO UNK          MEXICO UNK     9999    ACTIVE
14765   97    MX          MEXICO UNK          MEXICO UNK     9999  INACTIVE
14766   97    MX          MEXICO UNK          MEXICO UNK     9999  INACTIVE
14767   98   NAN         UNSPECIFIED         UNSPECIFIED     9999  INACTIVE
14768   99     ?             UNKNOWN             UNKNOWN     9999    ACTIVE

[4055 rows x 6 columns]>

Fuzzy Matching

A fuzzy match was performed on the residuals with a python package named “fuzzymatcher”, and saved in a dataframe (df6).The goal of this package is to match two dataframes based upon one or two similar fields. A fuzzy match was performed on Community and FEATURE_NAME. After matches are identified, the package ranks each record by probabilistic scoring. Duplicate records were excluded by both state and Community name so that only unique matches by state were included in the final output. Any excluded community was classified as a residual. This is the ink to Github repository for the fuzzymatcher package: https://github.com/RobinL/fuzzymatcher

In [10]:
#This will take some time to run-not the most efficient method on a dataset this size
df6=fuzzymatcher.fuzzy_left_join(df5,df2,left_on="Community", right_on="FEATURE_NAME")  
df6.shape 
Out[10]:
(4055, 29)
In [11]:
df6.head()
Out[11]:
best_match_score __id_left __id_right Code State County Community ASU Code Status FEATURE_ID ... PRIM_LONG_DEC SOURCE_LAT_DMS SOURCE_LONG_DMS SOURCE_LAT_DEC SOURCE_LONG_DEC ELEV_IN_M ELEV_IN_FT MAP_NAME DATE_CREATED DATE_EDITED
0 0.437734 0_left 78409_right 01 AL BALDWIN BAY MINETTE 5874 ACTIVE 113588.0 ... -87.773047 NaN NaN NaN NaN 82.0 269.0 BAY MINETTE NORTH 09/04/1980 08/16/2013
26 0.315869 1_left 121385_right 01 AL BALDWIN DAPHNE 5874 ACTIVE 157933.0 ... -87.903605 NaN NaN NaN NaN 49.0 161.0 DAPHNE 09/04/1980 12/06/2013
51 0.323311 2_left 82920_right 01 AL BALDWIN FAIRHOPE 5874 ACTIVE 118120.0 ... -87.903326 NaN NaN NaN NaN 37.0 121.0 DAPHNE 09/04/1980 08/27/2013
101 0.368564 3_left 90480_right 01 AL BALDWIN ROBERTSDALE 5874 ACTIVE 125703.0 ... -87.711932 NaN NaN NaN NaN 44.0 144.0 ROBERTSDALE 09/04/1980 03/20/2008
122 0.358645 4_left 118725_right 01 AL BALDWIN SUMMERDALE 5874 ACTIVE 155262.0 ... -87.699709 NaN NaN NaN NaN 34.0 112.0 FOLEY 09/04/1980 03/20/2008

5 rows × 29 columns

In [7]:
#Keeping only state to state output
df7 = df6[df6['State'] == df6['STATE_ALPHA']]  
df7.shape  
#Save only unique output and dropping duplicates
df8=df7.sort_values('Community').drop_duplicates(subset=['Community','Code'],keep='first')  
df8.shape  
Out[7]:
(1884, 29)
In [14]:
df8.columns
Out[14]:
Index(['best_match_score', '__id_left', '__id_right', 'Code', 'State',
       'County', 'Community', 'ASU Code', 'Status', 'FEATURE_ID',
       'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'STATE_NUMERIC',
       'COUNTY_NAME', 'COUNTY_NUMERIC', 'PRIMARY_LAT_DMS', 'PRIM_LONG_DMS',
       'PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'SOURCE_LAT_DMS', 'SOURCE_LONG_DMS',
       'SOURCE_LAT_DEC', 'SOURCE_LONG_DEC', 'ELEV_IN_M', 'ELEV_IN_FT',
       'MAP_NAME', 'DATE_CREATED', 'DATE_EDITED'],
      dtype='object')

The output after the fuzzy matching contains columns I do not want, so I simply deleted them. This was useful if I plan to do further matching in another method and would like to output my results to another file.

In [21]:
#Need to see what columns are included in this dataframe before needing to delete
df8.drop(['best_match_score','__id_left','__id_right'], axis=1, inplace=True)
#Saving the results from initial match to a file
df8.to_excel('Fuzzy Communities With Match_1.xlsx') 

Identify Further Residuals

Locations that were not included in the first iteration of matches are to be identified here. Further matching on text is necessary to ensure the most locations are captured.

In [23]:
df9=df6[df6['State'] != df6['STATE_ALPHA']]  
df10=df9.sort_values('Community').drop_duplicates(subset=['Community','Code'],keep='first')  
In [24]:
df10.shape
Out[24]:
(2035, 29)

Conclusion

In conclusion 2,035 IHS locations remain unlocated due to various reasons due to differences in spelling, the location existing in a neighboring state or simply not existing in both datasets. Further discovery is needed from this point forward. I hope you found this notebook useful especially when dealing with text data and joining two textual datasets.

  1. Tribal Nations & the United States: An Introduction. Retrieved from http://www.ncai.org/about-tribeshttp://www.ncai.org/about-tribes. Accessed March 7, 2019

  2. Indian Health Service. The Federal Health Program for American Indians and Alaska Natives. https://www.ihs.gov/scb/index.cfm?module=W_COMMUNITY&option=list&num=84&newquery=1. Accessed March 7, 2019.

  3. Eastern Region Geography. Domestic and Antarctic Names - State and Topical Gazetteer Download Files. https://geonames.usgs.gov/domestic/download_data.htm. Accessed March 7, 2019.