IaaGeoDataCleaning.CleaningUtils package
Submodules
IaaGeoDataCleaning.CleaningUtils.coordinates_validator module
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.add_country_code(data, ctry_col)
Append two new columns to the data containing the two-letter and three-letter country codes of each entry's country.
Parameters: - data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- ctry_col (str.) – name of the country column.
Returns: the modified dataframe with the new columns ‘ISO2’ and ‘ISO3’ for two-letter and three-letter country codes respectively.
Return type: DataFrame if the type of data is DataFrame or str, or geopandas.GeoDataFrame if it is geopandas.GeoDataFrame.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Rabat', 'Lyon', 'Cleveland'],
...                    'Country': ['Morocco', 'France', 'United States of America']})
>>> add_country_code(data=df, ctry_col='Country')
        City                   Country ISO2 ISO3
0      Rabat                   Morocco   MA  MAR
1       Lyon                    France   FR  FRA
2  Cleveland  United States of America   US  USA
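The lookup itself can be pictured with a minimal plain-Python sketch. It is not the package's implementation: the table ISO_CODES and the helper add_codes are hypothetical names, rows are plain dictionaries rather than a dataframe, and a real run would need a complete ISO 3166-1 source instead of three hard-coded entries.

```python
# Hypothetical lookup table; only the three countries from the example above.
ISO_CODES = {
    'Morocco': ('MA', 'MAR'),
    'France': ('FR', 'FRA'),
    'United States of America': ('US', 'USA'),
}

def add_codes(rows, ctry_col):
    """Return copies of rows with 'ISO2' and 'ISO3' keys (None if unknown)."""
    out = []
    for row in rows:
        iso2, iso3 = ISO_CODES.get(row[ctry_col], (None, None))
        out.append({**row, 'ISO2': iso2, 'ISO3': iso3})
    return out

rows = [{'City': 'Rabat', 'Country': 'Morocco'},
        {'City': 'Lyon', 'Country': 'France'},
        {'City': 'Atlantis', 'Country': 'Nowhere'}]
coded = add_codes(rows, 'Country')
```

How the real function treats a country name it cannot match is not stated in this reference; the sketch simply leaves both codes as None.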
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.cell_in_data(data, val, col, abs_tol=0.1)
Find the entries whose values in the passed column match the queried value.
If querying a numeric value, the function will return all entries whose corresponding cells approximate the passed value with the specified absolute tolerance.
If querying a string, the function will return all entries whose corresponding cells contain the passed value, case insensitive.
Parameters: - data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- val (str, int, or float.) – queried value.
- col (str.) – name of queried column.
- abs_tol (float.) – absolute tolerance used when matching numeric values (defaults to 0.1).
Returns: all entries meeting the condition.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Birmingham', 'Brussels', 'Berlin'], 'Country': ['England', 'Belgium', 'Germany'],
...                    'Latitude': [52.48, 50.85, 52.52], 'Longitude': [-1.89, 4.35, 13.40]})
>>> cell_in_data(data=df, val='brussels', col='City')
       City  Country  Latitude  Longitude
1  Brussels  Belgium     50.85       4.35
>>> cell_in_data(data=df, val=52.5, col='Latitude')
         City  Country  Latitude  Longitude
0  Birmingham  England     52.48      -1.89
2      Berlin  Germany     52.52      13.40
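The two matching rules can be sketched in plain Python as follows. This is an illustration of the documented behaviour, not the package's code: the helper cell_matches is a hypothetical name, and using math.isclose for the tolerance check is an assumption.

```python
import math

def cell_matches(cell, val, abs_tol=0.1):
    """Tolerance match for numbers, case-insensitive substring match for strings."""
    if isinstance(val, (int, float)):
        # rel_tol=0.0 so that only the absolute tolerance decides the match.
        return math.isclose(cell, val, rel_tol=0.0, abs_tol=abs_tol)
    return val.lower() in str(cell).lower()

latitudes = [52.48, 50.85, 52.52]
cities = ['Birmingham', 'Brussels', 'Berlin']

# Positions of the matching entries, mirroring the index labels in the example.
lat_hits = [i for i, lat in enumerate(latitudes) if cell_matches(lat, 52.5)]
city_hits = [i for i, city in enumerate(cities) if cell_matches(city, 'brussels')]
```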
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_columns(df, cols)
Check whether the column names are present in the dataframe.
Parameters: - df (DataFrame or geopandas.GeoDataFrame.) – dataframe to check.
- cols (list of str or set of str.) – column names to look for.
Returns: True if every column name is present.
Return type: bool.
Raises: KeyError – if any of the column names cannot be found in the dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame({'Location': ['Beijing', 'Sao Paulo', 'Amsterdam'],
...                    'Country': ['China', 'Brazil', 'Netherlands']})
>>> df
    Location      Country
0    Beijing        China
1  Sao Paulo       Brazil
2  Amsterdam  Netherlands
>>> check_columns(df=df, cols=['Country', 'Location'])
True
Note
The function either returns True or raises a KeyError; it never returns False.
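The return-or-raise contract can be sketched without pandas. The helper require_columns is a hypothetical name and works on a plain list of column names; the wording of the error message is also an assumption.

```python
def require_columns(available, cols):
    """Return True if every name in cols is available, otherwise raise KeyError."""
    missing = [c for c in cols if c not in available]
    if missing:
        raise KeyError('Missing columns: ' + ', '.join(missing))
    return True

available = ['Location', 'Country']
ok = require_columns(available, ['Country', 'Location'])

# A missing column surfaces as an exception, never as a False return value.
try:
    require_columns(available, ['Latitude'])
    raised = False
except KeyError:
    raised = True
```

Because a failure is signalled by the exception, callers would wrap the call in try/except rather than test the return value.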
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_country_geom(geodata, geo_iso2_col, shapedata, shape_geom_col, shape_iso2_col)
Filter all of the entries in geodata whose coordinates are within their indicated country by iterating through a shapefile of country polygons and finding the locations that are in each polygon.
Parameters: - geodata (geopandas.GeoDataFrame.) – dataframe of locations with spatial geometries.
- geo_iso2_col (str.) – name of the two-letter country code column in dataframe.
- shapedata (geopandas.GeoDataFrame.) – shapefile dataframe.
- shape_geom_col (str.) – name of the geometry column in the shapefile dataframe.
- shape_iso2_col (str.) – name of the two-letter country code column in the shapefile dataframe.
Returns: all of the entries that were verified as having their location in the respective indicated country.
Return type: geopandas.GeoDataFrame.
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_data_geom(eval_col, iso2_col, all_geodata, shapedata, shape_geom_col, shape_iso2_col)
Take in a collection of spatial dataframes that are variations of a single dataframe and check which geometries actually fall within the borders of their preset country. If an entry is verified as correct with its original inputs, its other variations will not be appended.
Generate two dataframes, one that combines all of the entries in the collection that are marked as verified, and one for entries whose respective geometry does not correspond to the preset country for any variation.
Parameters: - eval_col (str.) – name of the column to distinguish between entries (should be a column in all of the dataframes).
- iso2_col (str.) – name of the two-letter country code column.
- all_geodata (geopandas.GeoDataFrame or list or set of geopandas.GeoDataFrame) – collection of spatial dataframes.
- shapedata (geopandas.GeoDataFrame.) – shapefile dataframe.
- shape_geom_col (str.) – name of the geometry column in the shapefile dataframe.
- shape_iso2_col (str.) – name of the two-letter country code column in the shapefile dataframe.
Returns: two dataframes, one with verified entries, and one with invalid entries.
Return type: tuple of (geopandas.GeoDataFrame, geopandas.GeoDataFrame)
Note
The function assumes that the first dataframe in the collection is the original dataframe.
The verified dataframe might contain multiple entries for the same initial entry if two or more of its variations match its preset country.
flip_coords() should be called first to generate the dataframe collection to optimize this function.
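The merging rules described for check_data_geom can be sketched in plain Python. Everything here is illustrative: merge_variations and in_country are hypothetical names, rows are dictionaries instead of spatial dataframes, and a simple box test stands in for the real point-in-country-polygon check.

```python
def merge_variations(variations, eval_col, is_valid):
    """variations: list of row lists; the first list holds the original inputs."""
    verified, seen = [], set()
    original_ok = {row[eval_col] for row in variations[0] if is_valid(row)}
    for i, rows in enumerate(variations):
        for row in rows:
            key = row[eval_col]
            # Skip variations of entries already verified with their original inputs.
            if i > 0 and key in original_ok:
                continue
            if is_valid(row):
                verified.append(row)
                seen.add(key)
    # Entries for which no variation was valid.
    invalid = [row for row in variations[0] if row[eval_col] not in seen]
    return verified, invalid

def in_country(row):
    # Hypothetical stand-in for the geometric test: a 10 x 10 degree box.
    return 0 <= row['lat'] <= 10 and 0 <= row['lng'] <= 10

original = [{'id': 'A', 'lat': 5, 'lng': 5},
            {'id': 'B', 'lat': 5, 'lng': -5},
            {'id': 'C', 'lat': 50, 'lng': 50}]
flipped = [{**r, 'lng': -r['lng']} for r in original]
verified, invalid = merge_variations([original, flipped], 'id', in_country)
```

Entry A is accepted as entered, B only after its longitude is negated, and C ends up in the invalid list because neither variation falls inside the box.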
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.convert_df_crs(df, out_crs=4326)
Change the projection of the dataframe from its input projection to the provided CRS (defaults to 4326).
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.cross_check(data, first_col, second_col)
Filter all of the entries in data whose values for first_col and second_col are equal.
Parameters: - data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- first_col (str.) – column name.
- second_col (str.) – column name.
Returns: all qualified entries.
Return type: DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'Country': ['Australia', 'Indonesia', 'Denmark'], 'Entered_ISO2': ['AUS', 'ID', 'DK'],
...                    'Matched_ISO2': ['AU', 'ID', 'DK']})
>>> cross_check(data=df, first_col='Entered_ISO2', second_col='Matched_ISO2')
     Country Entered_ISO2 Matched_ISO2
1  Indonesia           ID           ID
2    Denmark           DK           DK
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.export_df(df, extension, filename, directory)
Export the dataframe to a file.
Parameters: - df (DataFrame or geopandas.GeoDataFrame.) – dataframe to export.
- extension (str.) – outfile extension (.csv or .xlsx).
- filename (str.) – outfile name (without extension).
- directory (str.) – outfile directory.
Returns: absolute filepath to outfile.
Return type: str.
Raises: TypeError – if the file extension is not .csv or .xlsx.
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.filter_data_without_coords(data, lat_col, lng_col)
Split the data into two dataframes, separating out the entries for which no latitude and longitude were entered.
Parameters: - data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- lat_col (str.) – name of the latitude column.
- lng_col (str.) – name of the longitude column.
Returns: two dataframes, one with all of the entries with coordinates and one of those without.
Return type: tuple of (DataFrame, DataFrame) if the type of data is DataFrame or str, tuple of (geopandas.GeoDataFrame, geopandas.GeoDataFrame) if it is geopandas.GeoDataFrame.
Note
Entries whose latitude and longitude are both 0 are considered as having no inputs.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Addis Ababa', 'Manila', 'Dubai'],
...                    'Country': ['Ethiopia', 'Philippines', 'United Arab Emirates'],
...                    'Latitude': [8.98, 14.35, 0], 'Longitude': [38.76, 21.00, 0]})
>>> filter_data_without_coords(data=df, lat_col='Latitude', lng_col='Longitude')
(          City      Country  Latitude  Longitude
0  Addis Ababa     Ethiopia      8.98      38.76
1       Manila  Philippines     14.35      21.00,
    City               Country  Latitude  Longitude
2  Dubai  United Arab Emirates       0.0        0.0)
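A plain-Python sketch of the split, assuming rows are dictionaries and that an entry counts as having no coordinates when either value is missing (None or NaN) or when both are 0. The helper has_coords is a hypothetical name, and the handling of a single missing value is an assumption, since the reference only mentions the both-zero case explicitly.

```python
import math

def has_coords(lat, lng):
    """False when a value is missing or NaN, or when both values are 0."""
    for value in (lat, lng):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return False
    return not (lat == 0 and lng == 0)

rows = [{'City': 'Addis Ababa', 'Latitude': 8.98, 'Longitude': 38.76},
        {'City': 'Dubai', 'Latitude': 0, 'Longitude': 0},
        {'City': 'Puebla', 'Latitude': None, 'Longitude': None}]

# Partition the rows into the two groups the function returns.
with_coords = [r for r in rows if has_coords(r['Latitude'], r['Longitude'])]
without_coords = [r for r in rows if not has_coords(r['Latitude'], r['Longitude'])]
```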
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.flip_coords(data, lat_col, lng_col, prj=4326)
Generate 8 geopandas.GeoDataFrames, each with two added columns holding one latitude-longitude combination among [(lat, lng), (lat, -lng), (-lat, lng), (-lat, -lng), (lng, lat), (lng, -lat), (-lng, lat), (-lng, -lat)].
Parameters: - data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- lat_col (str.) – name of the latitude column.
- lng_col (str.) – name of the longitude column.
- prj (int.) – EPSG code for spatial projection.
Returns: the 8 dataframes, one per combination.
Return type: list of geopandas.GeoDataFrame.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Addis Ababa', 'Manila', 'Vienna', 'Mexico City', 'Puebla'],
...                    'Country': ['Ethiopia', 'Philippines', 'Austria', 'Mexico', 'Mexico'],
...                    'Latitude': [8.98, 14.35, 0, 19.25, None], 'Longitude': [38.76, 21.00, 0, -99.10, None]})
>>> dfs = flip_coords(data=df, lat_col='Latitude', lng_col='Longitude', prj=4326)
>>> dfs[1]
          City      Country  Latitude  Longitude  Flipped_Lat  Flipped_Lng             geometry
0  Addis Ababa     Ethiopia      8.98      38.76         8.98       -38.76  POINT (-38.76 8.98)
1       Manila  Philippines     14.35      21.00        14.35       -21.00    POINT (-21 14.35)
2       Vienna      Austria      0.00       0.00         0.00        -0.00         POINT (-0 0)
3  Mexico City       Mexico     19.25     -99.10        19.25        99.10  POINT (99.10 19.25)
4       Puebla       Mexico       NaN        NaN         0.00         0.00          POINT (0 0)
Note
Point geometry is formatted as (lng, lat).
Null latitude and longitude are converted to 0s.
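The eight combinations for a single coordinate pair can be enumerated in plain Python. The helper coordinate_variations is a hypothetical name; it only produces the (lat, lng) tuples and does not build geometries or dataframes.

```python
def coordinate_variations(lat, lng):
    """Return the 8 (lat, lng) candidates in the order listed for flip_coords."""
    # Null values become 0, following the note on null latitude and longitude.
    lat = 0.0 if lat is None else lat
    lng = 0.0 if lng is None else lng
    return [(lat, lng), (lat, -lng), (-lat, lng), (-lat, -lng),
            (lng, lat), (lng, -lat), (-lng, lat), (-lng, -lat)]

variants = coordinate_variations(19.25, -99.10)
second = variants[1]   # sign of the longitude negated
swapped = variants[4]  # latitude and longitude exchanged
```

The first four candidates cover sign errors and the last four cover a swapped latitude and longitude combined with the same sign errors.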
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.geocode_coordinates(data, loc_col, ctry_col)
Use the Photon API to geocode entries based on their location and country to find their coordinates.
Perform a quick validation of the query result by comparing the returned country to the preset country.
Three new fields representing the returned address, latitude, and longitude are appended to geocoded entries.
Parameters: - data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- loc_col (str.) – name of the location (lower level) column.
- ctry_col (str.) – name of the location (higher level) column.
Returns: two dataframes, one with all of the locations that Photon was able to find, and one with locations that could not be queried.
Return type: tuple of (DataFrame, DataFrame)
Note
Returned locations might not be 100% accurate.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Toronto', 'Dhaka', 'San Andres'], 'Country': ['Canada', 'Bangladesh', 'El Salvador']})
>>> geocode_coordinates(data=df, loc_col='City', ctry_col='Country')
(      City     Country                           Geocoded_Adr  Geocoded_Lat  Geocoded_Lng
0  Toronto      Canada               Toronto, Ontario, Canada     43.653963    -79.387207
1    Dhaka  Bangladesh  Dhaka, 12, Dhaka Division, Bangladesh     23.759357     90.378814,
          City      Country
3  San Andres  El Salvador)
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.get_projection(prj_file)
Determine the EPSG code from a .prj file.
Parameters: prj_file (str.) – filepath to the .prj file.
Returns: the EPSG code.
Return type: int.
>>> get_projection('/home/example_user/example_shapefile_directory/example_shapefile.prj')
4326
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.get_shape(shp_file)
Generate a GeoDataFrame from a .shp file.
Parameters: shp_file (str.) – filepath to the .shp file.
Return type: geopandas.GeoDataFrame.
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.process_shapefile(shapefile=None)
Take in a shapefile directory and parse the filepath to each file in the directory.
Parameters: shapefile (str) – filepath to shapefile directory.
Returns: dictionary with the file extension as keys and the complete filepath as values.
Return type: dict of {str: str}
>>> process_shapefile('/home/example_user/example_shapefile_directory')
{'dbf': '/home/example_user/example_shapefile_directory/example_shapefile.dbf',
 'prj': '/home/example_user/example_shapefile_directory/example_shapefile.prj',
 'shp': '/home/example_user/example_shapefile_directory/example_shapefile.shp',
 'shx': '/home/example_user/example_shapefile_directory/example_shapefile.shx'}
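A mapping of this shape can be built with the standard library alone. The sketch below is an assumption about the approach, not the package's code: files_by_extension is a hypothetical name, and the directory is a temporary one populated with empty placeholder files.

```python
import tempfile
from pathlib import Path

def files_by_extension(directory):
    """Map each file's extension (without the dot) to its full path."""
    return {p.suffix.lstrip('.'): str(p)
            for p in sorted(Path(directory).iterdir()) if p.is_file()}

with tempfile.TemporaryDirectory() as tmp:
    # Create empty stand-ins for the usual shapefile components.
    for ext in ('dbf', 'prj', 'shp', 'shx'):
        (Path(tmp) / f'example_shapefile.{ext}').touch()
    mapping = files_by_extension(tmp)
```

If a directory held two files with the same extension, this sketch would keep only the last one; how process_shapefile behaves in that case is not documented here.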
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.query_data(data, query_dict, excl=False)
Find all entries that meet the conditions specified in the query dictionary.
If excl=True, the function only returns entries meeting every criterion. Otherwise, it returns any entry that meets at least one of the conditions.
Parameters: - data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- query_dict (dict of {str: list, str: set, or str: str}.) – dictionary whose keys are column names mapping to the queried value(s).
- excl (bool.) – exclusive or inclusive search.
Returns: all entries meeting the condition(s).
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Birmingham', 'Brussels', 'Berlin'], 'Country': ['England', 'Belgium', 'Germany'],
...                    'Latitude': [52.48, 50.85, 52.52], 'Longitude': [-1.89, 4.35, 13.40]})
>>> query_data(data=df, query_dict={'Latitude': [52.5, 40], 'City': 'Berlin'}, excl=False)
         City  Country  Latitude  Longitude
0  Birmingham  England     52.48      -1.89
2      Berlin  Germany     52.52      13.40
>>> query_data(data=df, query_dict={'Latitude': 52.5, 'City': 'berlin'}, excl=True)
     City  Country  Latitude  Longitude
2  Berlin  Germany     52.52       13.4
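The difference between the inclusive and exclusive search comes down to combining per-column results with any or all. The sketch below illustrates this on dictionaries; query_rows and matches are hypothetical names, and the per-value matching rule (0.1 absolute tolerance for numbers, case-insensitive substring for strings) is an assumption based on the example output.

```python
import math

def matches(cell, val, abs_tol=0.1):
    """Assumed per-value rule: tolerance for numbers, substring for strings."""
    if isinstance(val, (int, float)):
        return math.isclose(cell, val, rel_tol=0.0, abs_tol=abs_tol)
    return val.lower() in str(cell).lower()

def query_rows(rows, query_dict, excl=False):
    combine = all if excl else any  # every condition vs. at least one
    hits = []
    for row in rows:
        per_column = []
        for col, vals in query_dict.items():
            if isinstance(vals, (str, int, float)):
                vals = [vals]  # accept a single value as well as a collection
            per_column.append(any(matches(row[col], v) for v in vals))
        if combine(per_column):
            hits.append(row)
    return hits

rows = [{'City': 'Birmingham', 'Latitude': 52.48},
        {'City': 'Brussels', 'Latitude': 50.85},
        {'City': 'Berlin', 'Latitude': 52.52}]
inclusive = query_rows(rows, {'Latitude': [52.5, 40], 'City': 'Berlin'})
exclusive = query_rows(rows, {'Latitude': 52.5, 'City': 'berlin'}, excl=True)
```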
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.read_data(data, cols)
Generate a dataframe and verify that the specified columns are in the dataframe.
Parameters: - data (str, DataFrame, geopandas.GeoDataFrame) – filepath (.csv or .xlsx extension) or dataframe.
- cols (list of str or set of str.) – names of the columns that must be present.
Return type: DataFrame if the type of data is DataFrame or str, or geopandas.GeoDataFrame if it is geopandas.GeoDataFrame.
Raises: TypeError – if a different type is passed for data or the file extension is not .csv or .xlsx.
>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Delhi', 'Giza'], 'Country': ['India', 'Egypt'],
...                    'Latitude': [28.68, 30.01], 'Longitude': [77.22, 31.13]})
>>> read_data(data=df, cols={'Country', 'Latitude', 'Longitude'})
    City Country  Latitude  Longitude
0  Delhi   India     28.68      77.22
1   Giza   Egypt     30.01      31.13
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.read_file(file_path)
Generate a dataframe from a .xlsx or .csv file.
Parameters: file_path (str.) – filepath to the .csv or .xlsx file.
Return type: DataFrame.
Raises: TypeError – if the file extension is not .csv or .xlsx.
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.rtree(geodata, polygon)
Use geopandas's R-tree implementation to find all of the locations in geodata that lie in the spatial polygon.
Parameters: - geodata (geopandas.GeoDataFrame.) – dataframe of locations with spatial geometries.
- polygon (shapely.geometry.Polygon.) – polygon to search within.
Returns: all of the entries with locations in the polygon.
Return type: geopandas.GeoDataFrame.
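Spatial-index queries of this kind generally work in two steps: a cheap bounding-box test narrows the candidates, and the exact geometric test runs only on those. The plain-Python sketch below mimics that pattern with a linear scan standing in for the index and a ray-casting test standing in for the exact check. It does not use geopandas or shapely, and in_bbox and in_polygon are hypothetical helpers.

```python
def in_bbox(pt, bbox):
    """Cheap prefilter: is the point inside the (minx, miny, maxx, maxy) box?"""
    x, y = pt
    minx, miny, maxx, maxy = bbox
    return minx <= x <= maxx and miny <= y <= maxy

def in_polygon(pt, poly):
    """Exact ray-casting test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

triangle = [(0, 0), (10, 0), (0, 10)]
bbox = (0, 0, 10, 10)  # bounding box of the triangle
points = [(1, 1), (9, 9), (20, 20)]

candidates = [p for p in points if in_bbox(p, bbox)]      # step 1: coarse
hits = [p for p in candidates if in_polygon(p, triangle)]  # step 2: exact
```

The point (9, 9) passes the bounding-box step but fails the exact test, which is why the second step is still needed after the index lookup.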
-
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.to_gdf(data, lat_col, lng_col, prj=4326)
Generate a geopandas.GeoDataFrame.
Parameters: - data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
- lat_col (str.) – name of the latitude column.
- lng_col (str.) – name of the longitude column.
- prj (int.) – EPSG code for spatial projection.
Return type: geopandas.GeoDataFrame.
IaaGeoDataCleaning.CleaningUtils.modify_data module
-
class IaaGeoDataCleaning.CleaningUtils.modify_data.Modifier(incorrect_locs, correct_locs, geocoded_locs)
Bases: object
Command-line tool for accepting or rejecting proposed data modifications. The validated data, together with any data already marked as correct, is stored in a new file.
Parameters: - incorrect_locs – String filepath to flipped locations.
- correct_locs – String filepath to locations that have been verified.
- geocoded_locs – String filepath to geocoded locations.
-
make_commands()
Create lists of commands and their descriptions.
Returns: Tuple containing commands, descriptions, and the help command.
-
run(output_directory, lat_col='Latitude', lng_col='Longitude', rec_lat_col='Flipped_Lat', rec_lng_col='Flipped_Lng', country_col='Country', loc_col='Location', geoc_rec_lng_col='Geocoded_Lat', geoc_rec_lat_col='Geocoded_Lng')
Iterate over the cleaned data file and prompt the user to confirm changes. Changes are then stored in a new file and the original data is left untouched.
Parameters: the optional parameters define the column names.
-
IaaGeoDataCleaning.CleaningUtils.modify_data.run_mod()