According to the National Congress of American Indians, there are approximately 573 federally recognized Indian Nations (interchangeably called tribes, nations, bands, pueblos, communities and native villages) in the United States 1. Nearly 40% (229) are located in the state of Alaska, while the others are located in 35 other states. Each Nation is ethnically, culturally and linguistically diverse with its own unique history1.
The goal of the Indian Health Service (IHS), is to provide federally funded health service to American Indians and/or Alaska Natives. The National Patient Information Reporting System (NPIRS), instituted the National Data Warehouse (NDW). The NDW is a data warehouse environment specifically for the Indian Health Service's (IHS) national data repository 2. Within this repository is information regarding various levels of patient care. https://www.ihs.gov/ndw/
Looking deeper within the NDW are tables with the approved codes sets from the Indian Health Service (IHS) Standard Code Book (SCB). The Standard Code Book is a uniform listing of descriptive terms and codes for the purpose of recording/reporting medical information collected during the provision of health care services. One of the tables is named "Community". A community is an area in which a tribe, is known to reside. It is not designated by coordinates. Because of this, it is difficult to determine where a community may be.
The goal of this project is to associate Indian Health Service communities to a Geographic Names Information System (GNIS) location. The Indian Health Service (IHS) consists of 14,770 unique active and inactive communities. This project also demonstrates some of the most tedious part sof preparing datasets for exploration=data cleansing and normalization-also called munging.
The Geographic Names Information System (GNIS) is the federal and national standard for geographic nomenclature. 3 The U.S. Geological Survey developed the GNIS in support of the U.S. Board on Geographic Names as the official repository of domestic geographic names data, the official vehicle for geographic names use by all departments of the Federal Government, and the source for applying geographic names to federal electronic and printed products 3. The GNIS download consists of 2,279,278 locations within the United States as a text file. There are also individual state text files for usage.
A copy of The Indian Health Service communities was downloaded from the standard codebook in excel file format: https://www.ihs.gov/scb/index.cfm?module=W_COMMUNITY&option=list&num=84&newquery=1
As well as a copy of the GNIS locations in text file format (NationalFile_20181201.zip): https://geonames.usgs.gov/domestic/download_data.htm
The data was normalized by converting the columns of interest to string format for easier data handling, stripped of leading and trailing whitespaces and applied uppercase lettering. Inactive communities and unknown locations were excluded from this analysis. The same process was completed for the GNIS file for columns: ‘COUNTY_NUMERIC’, ‘STATE_NUMERIC’, ‘FEATURE_NAME’ and ‘FEATURE_CLASS’. The assumption is the Community column in the IHS standard codebook should be similar or identical to the FEATURE_NAME column in the GNIS dataset.
import pandas as pd import fuzzymatcher #Read in the IHS Communities File df1=pd.read_excel('Communities.xlsx') #Change Columns of interest to strings df1[['Code', 'State', 'County', 'Community', 'Status']] = df1[['Code', 'State', 'County', 'Community', 'Status']].astype(str) #Apply uppercase and strip leading and trailing whitespaces of all object columns df1 = df1.apply(lambda x: x.str.strip() if x.dtype == "object" else x) df1 = df1.apply(lambda x: x.str.upper() if x.dtype == "object" else x) #The code is the INCITS code, I am only interested in the state code at this time df1['Code'] = df1['Code'].str[0:2] #Checks to view result and counts...a must df1.head() #Checking the overall structure of the dataset df1.shape
#Read in the GNIS file df2=pd.read_csv('NationalFile_20181201.txt', "|") #Change Columns of interest to strings df2[['COUNTY_NUMERIC', 'STATE_NUMERIC', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'COUNTY_NAME']] = df2[['COUNTY_NUMERIC', 'STATE_NUMERIC', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'COUNTY_NAME']].astype(str) #Apply uppercase and strip leading and trailing whitespaces of all object columns df2 = df2.apply(lambda x: x.str.strip() if x.dtype == "object" else x) df2 = df2.apply(lambda x: x.str.upper() if x.dtype == "object" else x) #Checks to view result and counts...checking the FEATURE_NAME column given that it is of most interest at this time df2['FEATURE_NAME'] df2.head() #Looking for Unnamed Columns df2.loc[:,~df2.columns.str.contains('^Unnamed')] #Checking the overall structure of the dataset df2.shape
Inner Merge Once the data in both datasets were normalized, an inner merge was performed on State Code (Code), Community, STATE_NUMERIC and FEATURE_NAME and saved in a dataframe (df3). After the merge, duplicates were dropped based upon state and Community name. Then communities were identified that were not included in this merge due to spelling differences between IHS and GNIS datasets. The merged output was then placed into a separate dataframe (df4).
#Merging IHS dataset to GNIS dataset based on the two digit State Code and Name df3=pd.merge(df1,df2, how='inner', left_on=['Code','Community'],right_on=['STATE_NUMERIC','FEATURE_NAME']) df3.shape #Drop duplicates after initial merge (there are many locations in the United States that have identical names, #only interested in locations within the same state as a quality check) df4=df3.sort_values('Community').drop_duplicates(subset=['Code','Community'],keep='first') df4.loc[:,~df4.columns.str.contains('^Unnamed')] df4.shape
The communities identified that were not included in the merge were classified as “residuals”. These residuals were placed into a dataframe (df5) to isolate them from the rest of the data to be used for further analysis.
<bound method NDFrame.head of Code State County Community ASU Code Status 0 01 AL BALDWIN BAY MINETTE 5874 ACTIVE 1 01 AL BALDWIN DAPHNE 5874 ACTIVE 3 01 AL BALDWIN FAIRHOPE 5874 ACTIVE 5 01 AL BALDWIN ROBERTSDALE 5874 ACTIVE 7 01 AL BALDWIN SUMMERDALE 5874 ACTIVE 8 01 AL BALDWIN SILVERHILL 5874 ACTIVE 9 01 AL BALDWIN PERDIDO 5874 ACTIVE 10 01 AL BALDWIN LOXLEY 5874 ACTIVE 11 01 AL BALDWIN GULF SHORES 5874 ACTIVE 12 01 AL BALDWIN ORANGE BEACH 5874 ACTIVE 14 01 AL BALDWIN LILLIAN 5874 ACTIVE 15 01 AL BALDWIN SPANISH FORT 5874 ACTIVE 17 01 AL BALDWIN STAPLETON 5874 ACTIVE 18 01 AL BUTLER MCKENZIE 5800 ACTIVE 20 01 AL CONECUH REPTON 5800 ACTIVE 21 01 AL DEKALB RAINSVILLE 5800 ACTIVE 22 01 AL ESCAMBIA BREWTON 5874 ACTIVE 23 01 AL ESCAMBIA EAST ESCAMBIA 5874 ACTIVE 24 01 AL ESCAMBIA FLOMATON 5874 ACTIVE 25 01 AL ESCAMBIA MCCULLOUG-HU 5874 ACTIVE 26 01 AL ESCAMBIA HUXFORD 5874 ACTIVE 27 01 AL ESCAMBIA EAST BREWTON 5874 ACTIVE 28 01 AL ESCAMBIA ATMORE 5874 ACTIVE 29 01 AL MOBILE BAYOU LA BAT 5874 ACTIVE 30 01 AL MOBILE CITRONELLA 5874 ACTIVE 31 01 AL MOBILE GRAND BAY 5874 ACTIVE 32 01 AL MOBILE MOBILE 5874 ACTIVE 34 01 AL MOBILE SEMMES 5874 ACTIVE 35 01 AL MOBILE TANNER-WILLM 5874 ACTIVE 36 01 AL MOBILE THEODORE 5874 ACTIVE ... ... ... ... ... ... ... 14679 55 WI WISCONSIN UNKNOWN WISCONSN UNK 1100 INACTIVE 14680 55 WI WISCONSIN UNKNOWN WISCONSN UNK 1100 INACTIVE 14694 56 WY FREMONT BOULDER FLAT 4046 ACTIVE 14697 56 WY FREMONT DRY CREEK RA 4046 ACTIVE 14700 56 WY FREMONT FT WASHAKIE 4046 ACTIVE 14708 56 WY FREMONT ST STEPHENS 4046 ACTIVE 14714 56 WY HOT SPRINGS HAMILTON DOM 4046 ACTIVE 14722 56 WY NATRONA BARNUNN 4000 ACTIVE 14746 56 WY WYOMING UNK WYOMING UNK 4000 ACTIVE 14747 56 WY WYOMING UNK WYOMING UNK 4000 INACTIVE 14748 56 WY WYOMING UNK WYOMING UNK 4000 INACTIVE 14750 72 PR PUERTO RICO UNK PUERTO RICO UNK 9999 ACTIVE 14751 72 PR PUERTO RICO UNK PUERTO R UNK 5800 INACTIVE 14752 72 PR PUERTO RICO UNK PUERTO R UNK 5800 INACTIVE 14753 78 VI VIRGIN ISLANDS UNK VIRGIN ISLANDS UNK 9999 ACTIVE 14754 78 VI VIRGIN ISLANDS UNK VIRGIN I UNK 5800 INACTIVE 14755 78 VI VIRGIN ISLANDS UNK VIRGIN I UNK 5800 INACTIVE 14756 95 NL NETHERLANDS UNK NETHERLANDS UNK 9999 ACTIVE 14757 96 CA CANADA UNK CORNWALL 5800 ACTIVE 14758 96 CA CANADA UNK CORNWALL ISLAND 5800 ACTIVE 14759 96 CA CANADA UNK SNYE 5800 ACTIVE 14760 96 CA CANADA UNK ST. REGIS 5800 ACTIVE 14761 96 CA CANADA UNK CANADA UNK 9999 ACTIVE 14762 96 CA CANADA UNK CANADA UNK 9999 INACTIVE 14763 96 CA CANADA UNK CANADA UNK 9999 INACTIVE 14764 97 MX MEXICO UNK MEXICO UNK 9999 ACTIVE 14765 97 MX MEXICO UNK MEXICO UNK 9999 INACTIVE 14766 97 MX MEXICO UNK MEXICO UNK 9999 INACTIVE 14767 98 NAN UNSPECIFIED UNSPECIFIED 9999 INACTIVE 14768 99 ? UNKNOWN UNKNOWN 9999 ACTIVE [4055 rows x 6 columns]>
A fuzzy match was performed on the residuals with a python package named “fuzzymatcher”, and saved in a dataframe (df6).The goal of this package is to match two dataframes based upon one or two similar fields. A fuzzy match was performed on Community and FEATURE_NAME. After matches are identified, the package ranks each record by probabilistic scoring. Duplicate records were excluded by both state and Community name so that only unique matches by state were included in the final output. Any excluded community was classified as a residual. This is the ink to Github repository for the fuzzymatcher package: https://github.com/RobinL/fuzzymatcher
#This will take some time to run-not the most efficient method on a dataset this size df6=fuzzymatcher.fuzzy_left_join(df5,df2,left_on="Community", right_on="FEATURE_NAME") df6.shape
|0||0.437734||0_left||78409_right||01||AL||BALDWIN||BAY MINETTE||5874||ACTIVE||113588.0||...||-87.773047||NaN||NaN||NaN||NaN||82.0||269.0||BAY MINETTE NORTH||09/04/1980||08/16/2013|
5 rows × 29 columns
#Keeping only state to state output df7 = df6[df6['State'] == df6['STATE_ALPHA']] df7.shape #Save only unique output and dropping duplicates df8=df7.sort_values('Community').drop_duplicates(subset=['Community','Code'],keep='first') df8.shape
Index(['best_match_score', '__id_left', '__id_right', 'Code', 'State', 'County', 'Community', 'ASU Code', 'Status', 'FEATURE_ID', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA', 'STATE_NUMERIC', 'COUNTY_NAME', 'COUNTY_NUMERIC', 'PRIMARY_LAT_DMS', 'PRIM_LONG_DMS', 'PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'SOURCE_LAT_DMS', 'SOURCE_LONG_DMS', 'SOURCE_LAT_DEC', 'SOURCE_LONG_DEC', 'ELEV_IN_M', 'ELEV_IN_FT', 'MAP_NAME', 'DATE_CREATED', 'DATE_EDITED'], dtype='object')
The output after the fuzzy matching contains columns I do not want, so I simply deleted them. This was useful if I plan to do further matching in another method and would like to output my results to another file.
#Need to see what columns are included in this dataframe before needing to delete df8.drop(['best_match_score','__id_left','__id_right'], axis=1, inplace=True) #Saving the results from initial match to a file df8.to_excel('Fuzzy Communities With Match_1.xlsx')
Locations that were not included in the first iteration of matches are to be identified here. Further matching on text is necessary to ensure the most locations are captured.
df9=df6[df6['State'] != df6['STATE_ALPHA']] df10=df9.sort_values('Community').drop_duplicates(subset=['Community','Code'],keep='first')
In conclusion 2,035 IHS locations remain unlocated due to various reasons due to differences in spelling, the location existing in a neighboring state or simply not existing in both datasets. Further discovery is needed from this point forward. I hope you found this notebook useful especially when dealing with text data and joining two textual datasets.
Tribal Nations & the United States: An Introduction. Retrieved from http://www.ncai.org/about-tribeshttp://www.ncai.org/about-tribes. Accessed March 7, 2019
Indian Health Service. The Federal Health Program for American Indians and Alaska Natives. https://www.ihs.gov/scb/index.cfm?module=W_COMMUNITY&option=list&num=84&newquery=1. Accessed March 7, 2019.
Eastern Region Geography. Domestic and Antarctic Names - State and Topical Gazetteer Download Files. https://geonames.usgs.gov/domestic/download_data.htm. Accessed March 7, 2019.