IaaGeoDataCleaning.CleaningUtils package

Submodules

IaaGeoDataCleaning.CleaningUtils.coordinates_validator module

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.add_country_code(data, ctry_col)

Append two new columns to the data containing the two-letter and three-letter country codes of each entry’s country.

Parameters:
  • data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • ctry_col (str.) – name of the country column.
Returns:

the modified dataframe with the new columns ‘ISO2’ and ‘ISO3’ for two-letter and three-letter country codes respectively.

Return type:

DataFrame if the type of data is DataFrame or str, or geopandas.GeoDataFrame if it is geopandas.GeoDataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Rabat', 'Lyon', 'Cleveland'],
...                    'Country': ['Morocco', 'France', 'United States of America']})
>>> add_country_code(data=df, ctry_col='Country')
        City                   Country ISO2 ISO3
0      Rabat                   Morocco   MA  MAR
1       Lyon                    France   FR  FRA
2  Cleveland  United States of America   US  USA
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.cell_in_data(data, val, col, abs_tol=0.1)

Find the entries whose values in the passed column match the queried value.

If querying a numeric value, the function will return all entries whose corresponding cells approximate the passed value with the specified absolute tolerance.

If querying a string, the function will return all entries whose corresponding cells contain the passed value, case insensitive.

Parameters:
  • data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • val (str, int, or float.) – queried value.
  • col (str.) – name of queried column.
  • abs_tol (float.) – absolute tolerance used when matching numeric values.
Returns:

all entries meeting the condition.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Birmingham', 'Brussels', 'Berlin'], 'Country': ['England', 'Belgium', 'Germany'],
...                    'Latitude': [52.48, 50.85, 52.52], 'Longitude': [-1.89, 4.35, 13.40]})
>>> cell_in_data(data=df, val='brussels', col='City')
       City  Country  Latitude  Longitude
1  Brussels  Belgium     50.85       4.35
>>> cell_in_data(data=df, val=52.5, col='Latitude')
         City  Country  Latitude  Longitude
0  Birmingham  England     52.48      -1.89
2      Berlin  Germany     52.52      13.40
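
The matching rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the helper name `cell_matches` is hypothetical, numeric queries are assumed to be compared with `math.isclose`, and string queries with a case-insensitive substring test.

```python
import math

def cell_matches(cell, val, abs_tol=0.1):
    """Hypothetical helper: True if `cell` matches the queried `val`."""
    if isinstance(val, (int, float)):
        # Numeric query: accept cells within the absolute tolerance.
        return isinstance(cell, (int, float)) and math.isclose(cell, val, abs_tol=abs_tol)
    # String query: case-insensitive containment.
    return str(val).lower() in str(cell).lower()

latitudes = [52.48, 50.85, 52.52]
cities = ['Birmingham', 'Brussels', 'Berlin']

# Row positions whose cell matches the query.
lat_hits = [i for i, lat in enumerate(latitudes) if cell_matches(lat, 52.5)]
city_hits = [i for i, city in enumerate(cities) if cell_matches(city, 'brussels')]
print(lat_hits)   # [0, 2]
print(city_hits)  # [1]
```

The row positions agree with the indices returned in the doctest above.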
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_columns(df, cols)

Check to see whether the column names are present in the dataframe.

Parameters:
  • df (DataFrame or geopandas.GeoDataFrame.) –
  • cols (list of str or set of str.) –
Returns:

Return type:

bool.

Raises:

KeyError – if any of the column names cannot be found in the dataframe.

>>> import pandas as pd
>>> df = pd.DataFrame({'Location': ['Beijing', 'Sao Paulo', 'Amsterdam'],
...                    'Country': ['China', 'Brazil', 'Netherlands']})
>>> df
    Location      Country
0    Beijing        China
1  Sao Paulo       Brazil
2  Amsterdam  Netherlands
>>> check_columns(df=df, cols=['Country', 'Location'])
True

Note

Function will always return True or raise an error.
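
The "True or raise" contract in the note can be sketched as follows. The function name `check_columns_sketch` is hypothetical and it takes a plain list of column names; with a real dataframe one would presumably pass `df.columns`.

```python
def check_columns_sketch(columns, cols):
    """Hypothetical stand-in: return True or raise KeyError, never False."""
    missing = sorted(set(cols) - set(columns))
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return True

columns = ['Location', 'Country']
ok = check_columns_sketch(columns, ['Country', 'Location'])

try:
    check_columns_sketch(columns, {'Latitude'})
    raised = False
except KeyError:
    raised = True

print(ok, raised)  # True True
```

Because a missing column raises instead of returning False, callers can use the function as a guard without checking its return value.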

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_country_geom(geodata, geo_iso2_col, shapedata, shape_geom_col, shape_iso2_col)

Filter all of the entries in geodata whose coordinates are within their indicated country by iterating through a shapefile of country polygons and finding locations that are in each polygon.

Parameters:
  • geodata (geopandas.GeoDataFrame.) – dataframe of locations with spatial geometries.
  • geo_iso2_col (str.) – name of the two-letter country code column in dataframe.
  • shapedata (geopandas.GeoDataFrame.) – shapefile dataframe.
  • shape_geom_col (str.) – name of the geometry column in the shapefile dataframe.
  • shape_iso2_col (str.) – name of the two-letter country code column in the shapefile dataframe.
Returns:

all of the entries that were verified as having their location in the respective indicated country.

Return type:

geopandas.GeoDataFrame.

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.check_data_geom(eval_col, iso2_col, all_geodata, shapedata, shape_geom_col, shape_iso2_col)

Take in a collection of spatial dataframes that are variations of a single dataframe and check which geometries actually fall within the borders of their preset country. If an entry is verified as correct with its original inputs, its other variations will not be appended.

Generate two dataframes, one that combines all of the entries in the collection that are marked as verified, and one for entries whose respective geometry does not correspond to the preset country for any variation.

Parameters:
  • eval_col (str.) – name of the column to distinguish between entries (should be a column in all of the dataframes).
  • iso2_col (str.) – name of the two-letter country code column.
  • all_geodata (geopandas.GeoDataFrame or list or set of geopandas.GeoDataFrame) – collection of spatial dataframes.
  • shapedata (geopandas.GeoDataFrame.) – shapefile dataframe.
  • shape_geom_col (str.) – name of the geometry column in the shapefile dataframe.
  • shape_iso2_col (str.) – name of the two-letter country code column in the shapefile dataframe.
Returns:

two dataframes, one with verified entries, and one with invalid entries.

Return type:

tuple of (geopandas.GeoDataFrame, geopandas.GeoDataFrame)

Note

The function assumes that the first dataframe in the collection is the original dataframe.

The verified dataframe might contain multiple entries for the same initial entry if two or more of its variations match its preset country.

flip_coords() should be called first to generate the dataframe collection to optimize this function.
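
The selection rules described above (original first, skip variations once the original is verified, collect entries that never match) can be sketched with plain dictionaries. Everything here is an assumption made for illustration: the name `split_verified`, the `is_in_country` predicate, and the toy rectangular "country" stand in for the real geometry check against a shapefile.

```python
def split_verified(variations, key, is_in_country):
    """Hypothetical sketch. variations[0] is treated as the original data."""
    verified, matched, ok_in_original = [], set(), set()
    for idx, rows in enumerate(variations):
        for row in rows:
            entry = row[key]
            if idx > 0 and entry in ok_in_original:
                continue  # original inputs already verified; skip variations
            if is_in_country(row):
                verified.append(row)
                matched.add(entry)
                if idx == 0:
                    ok_in_original.add(entry)
    # Entries for which no variation matched the preset country.
    invalid = [row for row in variations[0] if row[key] not in matched]
    return verified, invalid

# Toy data: a made-up country covering lat 0..10, lng 0..10.
def in_toy_country(row):
    return 0 <= row['lat'] <= 10 and 0 <= row['lng'] <= 10

original = [{'id': 'a', 'lat': 5, 'lng': 5},
            {'id': 'b', 'lat': 5, 'lng': -5},
            {'id': 'c', 'lat': 50, 'lng': 50}]
negated_lng = [dict(r, lng=-r['lng']) for r in original]

verified, invalid = split_verified([original, negated_lng], 'id', in_toy_country)
print([r['id'] for r in verified])  # ['a', 'b']
print([r['id'] for r in invalid])   # ['c']
```

Entry `a` is verified from its original inputs, so its variation is skipped; `b` is only verified after its longitude is negated; `c` never matches and lands in the invalid set.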

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.convert_df_crs(df, out_crs=4326)

Change the projection of the dataframe from its input projection to the provided CRS (defaults to EPSG:4326).

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.cross_check(data, first_col, second_col)

Filter all of the entries in data whose values for first_col and second_col are equal.

Parameters:
  • data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • first_col (str.) – column name.
  • second_col (str.) – column name.
Returns:

all qualified entries.

Return type:

DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'Country': ['Australia', 'Indonesia', 'Denmark'], 'Entered_ISO2': ['AUS', 'ID', 'DK'],
...                    'Matched_ISO2': ['AU', 'ID', 'DK']})
>>> cross_check(data=df, first_col='Entered_ISO2', second_col='Matched_ISO2')
     Country Entered_ISO2 Matched_ISO2
1  Indonesia           ID           ID
2    Denmark           DK           DK
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.export_df(df, extension, filename, directory)

Export the dataframe to a file.

Parameters:
  • df (DataFrame or geopandas.GeoDataFrame.) –
  • extension (str.) – outfile extension (.csv or .xlsx).
  • filename (str.) – outfile name (without extension).
  • directory (str.) – outfile directory.
Returns:

absolute filepath to outfile.

Return type:

str.

Raises:

TypeError – if file extension is not csv or xlsx.

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.filter_data_without_coords(data, lat_col, lng_col)

Split the data into two dataframes, separating out the entries for which no latitude and longitude were entered.

Parameters:
  • data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • lat_col (str.) – name of the latitude column.
  • lng_col (str.) – name of the longitude column.
Returns:

two dataframes, one with all of the entries with coordinates and one of those without.

Return type:

tuple of (DataFrame, DataFrame) if the type of data is DataFrame or str, tuple of (geopandas.GeoDataFrame, geopandas.GeoDataFrame) if it is geopandas.GeoDataFrame.

Note

Entries whose latitude and longitude are both 0 are considered as having no inputs.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Addis Ababa', 'Manila', 'Dubai'],
...                    'Country': ['Ethiopia', 'Philippines', 'United Arab Emirates'],
...                    'Latitude': [8.98, 14.35, 0], 'Longitude': [38.76, 21.00, 0]})
>>> filter_data_without_coords(data=df, lat_col='Latitude', lng_col='Longitude')
(         City      Country  Latitude  Longitude
0  Addis Ababa     Ethiopia      8.98      38.76
1       Manila  Philippines     14.35      21.00,
    City               Country  Latitude  Longitude
2  Dubai  United Arab Emirates       0.0        0.0)
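
A minimal sketch of the split, using lists of dictionaries in place of dataframes. The name `split_by_coords` is hypothetical, and treating `None` like 0 is an assumption; the note above only states that entries with both coordinates equal to 0 count as having no input.

```python
def split_by_coords(rows, lat_col, lng_col):
    """Hypothetical sketch: partition rows by whether coordinates were entered."""
    def is_blank(value):
        # Assumption: a missing value is treated the same way as 0.
        return value is None or value == 0

    with_coords, without_coords = [], []
    for row in rows:
        if is_blank(row.get(lat_col)) and is_blank(row.get(lng_col)):
            without_coords.append(row)
        else:
            with_coords.append(row)
    return with_coords, without_coords

rows = [{'City': 'Addis Ababa', 'Latitude': 8.98, 'Longitude': 38.76},
        {'City': 'Manila', 'Latitude': 14.35, 'Longitude': 21.00},
        {'City': 'Dubai', 'Latitude': 0, 'Longitude': 0}]

present, absent = split_by_coords(rows, 'Latitude', 'Longitude')
print([r['City'] for r in present])  # ['Addis Ababa', 'Manila']
print([r['City'] for r in absent])   # ['Dubai']
```

An entry with only one coordinate equal to 0 (for example a point on the equator) is kept in the first group, since the rule requires both to be 0.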
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.flip_coords(data, lat_col, lng_col, prj=4326)

Generate 8 geopandas.GeoDataFrames, each with two columns comprising one latitude-longitude combination among [(lat, lng), (lat, -lng), (-lat, lng), (-lat, -lng), (lng, lat), (lng, -lat), (-lng, lat), (-lng, -lat)].

Parameters:
  • data (str, DataFrame, geopandas.GeoDataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • lat_col (str.) – name of the latitude column.
  • lng_col (str.) – name of the longitude column.
  • prj (int.) – EPSG code for spatial projection.
Returns:

Return type:

list of geopandas.GeoDataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Addis Ababa', 'Manila', 'Vienna', 'Mexico City', 'Puebla'],
...                    'Country': ['Ethiopia', 'Philippines', 'Austria', 'Mexico', 'Mexico'],
...                    'Latitude': [8.98, 14.35, 0, 19.25, None], 'Longitude': [38.76, 21.00, 0, -99.10, None]})
>>> dfs = flip_coords(data=df, lat_col='Latitude', lng_col='Longitude', prj=4326)
>>> dfs[1]
          City      Country  Latitude  Longitude  Flipped_Lat  Flipped_Lng             geometry
0  Addis Ababa  Ethiopia         8.98      38.76         8.98       -38.76  POINT (-38.76 8.98)
1  Manila       Philippines     14.35      21.00        14.35       -21.00    POINT (-21 14.35)
2  Vienna       Austria         0.00        0.00         0.00        -0.00         POINT (-0 0)
3  Mexico City  Mexico          19.25     -99.10        19.25        99.10  POINT (99.10 19.25)
4  Puebla       Mexico            NaN        NaN         0.00         0.00          POINT (0 0)

Note

Point geometry is formatted as (lng, lat).

Null latitude and longitude are converted to 0s.
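
The eight combinations can be enumerated for a single coordinate pair as below. This is a sketch in plain Python, not the package's code: `flip_variations` and `to_point_xy` are hypothetical names, and the ordering simply follows the list given in the description.

```python
def flip_variations(lat, lng):
    """Hypothetical sketch: the 8 sign/order variations of one (lat, lng) pair."""
    # Null coordinates are converted to 0, as stated in the note.
    lat = 0.0 if lat is None else lat
    lng = 0.0 if lng is None else lng
    return [(lat, lng), (lat, -lng), (-lat, lng), (-lat, -lng),
            (lng, lat), (lng, -lat), (-lng, lat), (-lng, -lat)]

def to_point_xy(pair):
    """Reorder a (lat, lng) pair into the (lng, lat) order used by point geometry."""
    lat, lng = pair
    return (lng, lat)

variations = flip_variations(19.25, -99.10)
print(len(variations))             # 8
print(variations[1])               # (19.25, 99.1)
print(to_point_xy(variations[1]))  # (99.1, 19.25)
```

Index 1 is the (lat, -lng) variation, which corresponds to `dfs[1]` in the doctest above.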

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.geocode_coordinates(data, loc_col, ctry_col)

Use the Photon API to geocode entries based on their location and country to find their coordinates.

Perform a quick validation of the query result by comparing the returned country to the preset country.

Three new fields representing the returned address, latitude, and longitude are appended to geocoded entries.

Parameters:
  • data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • loc_col (str.) – name of the location (lower level) column.
  • ctry_col (str.) – name of the location (higher level) column.
Returns:

two dataframes, one with all of the locations that Photon was able to find, and one with locations that could not be queried.

Return type:

tuple of (DataFrame, DataFrame)

Note

Returned locations might not be 100% accurate.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Toronto', 'Dhaka', 'San Andres'], 'Country': ['Canada', 'Bangladesh', 'El Salvador']})
>>> geocode_coordinates(data=df, loc_col='City', ctry_col='Country')
(     City        Country                            Geocoded_Adr  Geocoded_Lat  Geocoded_Lng
0  Toronto     Canada       Toronto, Ontario, Canada                  43.653963    -79.387207
1  Dhaka       Bangladesh   Dhaka, 12, Dhaka Division, Bangladesh     23.759357     90.378814,
         City      Country
2  San Andres  El Salvador)
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.get_projection(prj_file)

Determine the EPSG code from .prj file.

Parameters: prj_file (str.) – filepath to the .prj file.
Returns: the EPSG code of the projection.
Return type: int.
>>> get_projection('/home/example_user/example_shapefile_directory/example_shapefile.prj')
4326
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.get_shape(shp_file)

Generate a GeoDataFrame from .shp file.

Parameters: shp_file (str.) – filepath to the .shp file.
Returns:
Return type: geopandas.GeoDataFrame
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.process_shapefile(shapefile=None)

Take in a shapefile directory and parse the filepath to each file in the directory.

Parameters: shapefile (str) – filepath to shapefile directory.
Returns: dictionary with the file extensions as keys and the complete filepaths as values.
Return type: dict of {str: str}
>>> process_shapefile('/home/example_user/example_shapefile_directory')
{'dbf': '/home/example_user/example_shapefile_directory/example_shapefile.dbf',
 'prj': '/home/example_user/example_shapefile_directory/example_shapefile.prj',
 'shp': '/home/example_user/example_shapefile_directory/example_shapefile.shp',
 'shx': '/home/example_user/example_shapefile_directory/example_shapefile.shx'}
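
One way to build such an extension-to-path mapping is sketched below with `pathlib`. The function name `map_shapefile_parts` is hypothetical, and the temporary directory with empty files exists only to make the example runnable.

```python
import tempfile
from pathlib import Path

def map_shapefile_parts(directory):
    """Hypothetical sketch: map each file extension to its absolute filepath."""
    parts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix:
            # Key by extension without the leading dot, e.g. 'shp'.
            parts[path.suffix.lstrip('.')] = str(path.resolve())
    return parts

# Build a throwaway directory that mimics a shapefile bundle.
with tempfile.TemporaryDirectory() as tmp:
    for ext in ('dbf', 'prj', 'shp', 'shx'):
        (Path(tmp) / f'example_shapefile.{ext}').touch()
    parts = map_shapefile_parts(tmp)

print(sorted(parts))  # ['dbf', 'prj', 'shp', 'shx']
```

If a directory held two files with the same extension, this sketch would keep only the last one in sorted order; how the package handles that case is not documented here.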
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.query_data(data, query_dict, excl=False)

Find all entries that meet the conditions specified in the query dictionary.

If excl=True, the function returns only the entries that meet every criterion. Otherwise, it returns any entry that meets at least one of the conditions.

Parameters:
  • data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • query_dict (dict of {str: list, str: set, or str: str}.) – dictionary whose keys are column names mapping to the queried value(s).
  • excl (bool.) – exclusive or inclusive search.
Returns:

all entries meeting the condition(s).

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Birmingham', 'Brussels', 'Berlin'], 'Country': ['England', 'Belgium', 'Germany'],
...                    'Latitude': [52.48, 50.85, 52.52], 'Longitude': [-1.89, 4.35, 13.40]})
>>> query_data(data=df, query_dict={'Latitude': [52.5, 40], 'City': 'Berlin'}, excl=False)
         City  Country  Latitude  Longitude
0  Birmingham  England     52.48      -1.89
2      Berlin  Germany     52.52      13.40
>>> query_data(data=df, query_dict={'Latitude': 52.5, 'City': 'berlin'}, excl=True)
     City  Country  Latitude  Longitude
2  Berlin  Germany     52.52       13.4
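
The difference between the exclusive and inclusive search can be sketched as an all/any combination of per-column checks. This is an illustration under assumptions: `matches` and `query_rows` are hypothetical names, several values for one column are treated as alternatives, and numeric values are compared with a tolerance of 0.1, which is consistent with the doctest above but not confirmed by the documentation.

```python
import math

def matches(cell, val, abs_tol=0.1):
    """Hypothetical helper: tolerant numeric match or case-insensitive containment."""
    if isinstance(val, (int, float)):
        return isinstance(cell, (int, float)) and math.isclose(cell, val, abs_tol=abs_tol)
    return str(val).lower() in str(cell).lower()

def query_rows(rows, query_dict, excl=False):
    """Hypothetical sketch: excl=True needs every column to match, else any."""
    combine = all if excl else any
    result = []
    for row in rows:
        checks = []
        for col, wanted in query_dict.items():
            # A single value is wrapped so both forms are handled alike.
            values = wanted if isinstance(wanted, (list, set, tuple)) else [wanted]
            checks.append(any(matches(row[col], v) for v in values))
        if combine(checks):
            result.append(row)
    return result

rows = [{'City': 'Birmingham', 'Latitude': 52.48},
        {'City': 'Brussels', 'Latitude': 50.85},
        {'City': 'Berlin', 'Latitude': 52.52}]

inclusive = query_rows(rows, {'Latitude': [52.5, 40], 'City': 'Berlin'}, excl=False)
exclusive = query_rows(rows, {'Latitude': 52.5, 'City': 'berlin'}, excl=True)
print([r['City'] for r in inclusive])  # ['Birmingham', 'Berlin']
print([r['City'] for r in exclusive])  # ['Berlin']
```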
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.read_data(data, cols)

Generate a dataframe and verify that the specified columns are in the dataframe.

Parameters:
  • data (str, DataFrame, geopandas.GeoDataFrame) – filepath (.csv or .xlsx extension) or dataframe.
  • cols (list of str or set of str.) –
Return type:

DataFrame if the type of data is DataFrame or str, or geopandas.GeoDataFrame if it is geopandas.GeoDataFrame.

Raises:

TypeError – if a different type is passed for data or the file extension is not .csv or .xlsx.

>>> import pandas as pd
>>> df = pd.DataFrame({'City': ['Delhi', 'Giza'], 'Country': ['India', 'Egypt'],
...                    'Latitude': [28.68, 30.01], 'Longitude': [77.22, 31.13]})
>>> read_data(data=df, cols={'Country', 'Latitude', 'Longitude'})
    City Country  Latitude  Longitude
0  Delhi   India     28.68      77.22
1   Giza   Egypt     30.01      31.13
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.read_file(file_path)

Generate a dataframe from .xlsx or .csv file.

Parameters: file_path (str.) –
Returns:
Return type: DataFrame.
Raises: TypeError – if the file extension is not .csv or .xlsx.
IaaGeoDataCleaning.CleaningUtils.coordinates_validator.rtree(geodata, polygon)

Use geopandas’s R-tree implementation to find all of the locations in geodata in the spatial polygon.

Parameters:
  • geodata (geopandas.GeoDataFrame.) – dataframe of locations with spatial geometries.
  • polygon (shapely.geometry.Polygon.) –
Returns:

all of the entries with locations in the polygon.

Return type:

geopandas.GeoDataFrame.
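
The general idea behind an index-assisted spatial query is a cheap bounding-box pass followed by an exact containment test on the surviving candidates. The sketch below shows that two-stage pattern in plain Python. It is a stand-in rather than geopandas's R-tree: a linear bounding-box scan replaces the index, ray casting replaces the library's geometry predicate, and all function names are hypothetical. Points are given as (lng, lat).

```python
def bbox(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def points_in_polygon(points, polygon):
    """Stage 1: bounding-box filter. Stage 2: exact test on the candidates."""
    min_x, min_y, max_x, max_y = bbox(polygon)
    candidates = [p for p in points
                  if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
    return [p for p in candidates if point_in_polygon(p[0], p[1], polygon)]

triangle = [(0, 0), (10, 0), (0, 10)]    # toy polygon
points = [(1, 1), (9, 9), (20, 20)]
print(points_in_polygon(points, triangle))  # [(1, 1)]
```

Here (20, 20) is rejected by the bounding box alone, while (9, 9) passes the box but fails the exact test. A real R-tree speeds up the first stage when there are many points or polygons.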

IaaGeoDataCleaning.CleaningUtils.coordinates_validator.to_gdf(data, lat_col, lng_col, prj=4326)

Generate a geopandas.GeoDataFrame.

Parameters:
  • data (str or DataFrame.) – filepath (.csv or .xlsx extension) or dataframe.
  • lat_col (str.) – name of the latitude column.
  • lng_col (str.) – name of the longitude column.
  • prj (int.) – EPSG code for spatial projection.
Returns:

Return type:

geopandas.GeoDataFrame.

IaaGeoDataCleaning.CleaningUtils.modify_data module

class IaaGeoDataCleaning.CleaningUtils.modify_data.Modifier(incorrect_locs, correct_locs, geocoded_locs)

Bases: object

Command-line tool for accepting or rejecting proposed data modifications. The validated data, together with any data already marked as correct, is stored in a new file.

Parameters:
  • incorrect_locs – String filepath to flipped locations.
  • correct_locs – String filepath to locations that have been verified.
  • geocoded_locs – String filepath to geocoded locations.
make_commands()

Create lists of commands and their descriptions.

Returns: Tuple containing commands, descriptions, and the help command.
run(output_directory, lat_col='Latitude', lng_col='Longitude', rec_lat_col='Flipped_Lat', rec_lng_col='Flipped_Lng', country_col='Country', loc_col='Location', geoc_rec_lng_col='Geocoded_Lat', geoc_rec_lat_col='Geocoded_Lng')

Iterate over the cleaned data file and prompt the user to confirm changes. Changes are then stored in a new file and the original data is left untouched.

Params: Optional parameters that define the column names.
IaaGeoDataCleaning.CleaningUtils.modify_data.run_mod()

Module contents