Mapping latitude and longitude to country, state, or city

azhenley | 115 points

Nice approach. It reminds me of an approach I saw used to resolve coordinates to countries. Instead of loading all country polygons, the team created a bitmap and used colors to map each pixel to a country code. The bitmap wasn't super large and compresses pretty nicely in png format. This worked well enough and it dumbed down the country lookup to simply figuring out the color for a coordinate. Neat trick. And you could probably figure out if you are dealing with an edge case by simply looking at neighboring pixels and fall back to something more expensive if you hit one of those.

And of course with edge cases, there are lots of them but mostly it's fine. One case that comes to mind is that of the border town of Baarle-Nassau On the border with the Netherlands and Belgium. This village has some of the weirdest borders in the world. There are Belgian exclaves inside Dutch enclaves. In some cases the border runs through houses and you can enter in one country and leave in another. Some of the exclaves are just a few meters. There are a few more examples like this around the world.

Another issue is the fractal nature of polygons. I once found a polygon for New Zealand that was around 200MB that broke my attempts to index it. This doesn't matter of course for resolving country codes because it is an island. But it's a reason I implemented the Douglas Peucker algorithm to simplify the polygon mentioned in the article at some point.

jillesvangurp | 2 days ago

You could save a bunch of space by encoding the data in a compact binary format and then loading it into a Float16Array.

In a .js file, each character is UTF-16 (2 bytes). Your current encoding uses 23 characters per coordinate, or 46 bytes.

Using 16-bit floats for lat/lon gives you accuracy down to 1 meter. You would need 4 bytes per coordinate. So that's a reduction by 91%.

You can't store raw binary bytes in a .js file so it would need to be a separate file. Or you can use base64 encoding (33% bigger than raw binary) in .js file (more like 6 bytes per coordinate).

(Edited to reflect .min.js)

tantalor | 2 days ago

Great approach.

Worth noting that there is a 6 decimal precision on the coordinates of the 90kb (gz) `coord2state.min.js` ... which suggests an accuracy that may not be present in the simplified data (i.e. <1m).

Before you increase tolerance to decrease filesize, you could consider lowering this decimal precision to 5, 4 or even 3 decimals given the "country, state, or city" requirement.

I also like the idea of using a heavily cached, heavily compressed image that is perfect for the >95% of the country that isn't within a pixel of a border. With a subsequent request for another heavily cached vector tile that encompasses any lat/lng within your 1px tolerance.

som | 2 days ago

> A side effect of the geometry simplication is that there are some very small gaps between states. Based on your use case, you'll need to handle the case of the point not being within any state borders. In these rare cases, you could fall back to a different method, such as distance checking centroid points, adding an episilon to all state borders, or simply asking the user. (The user may also be in another country or in the ocean...)

This is a common topic and easily dealt with by working with topology-informed geometries; most simplification algorithms support topology handling between different features. For instance, TopoJSON can be used.

m2fkxy | 2 days ago

> I set up an experiment that compares the original geometry with the simplified geometry by testing 1,000,000 random points within the US.

I'd be curious if the reliability is different if, instead of random locations, you limited it to locations with some level of population density. Because a lot of the USA is rural, so that random set is not going to correlate well to where people actually are. It probably matters more the farther east you go as well, as the population centers overlap borders more when you get to the eastern seaboard.

codingdave | 2 days ago

Use Nominatim: https://nominatim.org/

It can be self-hosted, with constant replication. There's also Photon which is a cut-down version of it: https://photon.komoot.io

cyberax | 2 days ago

> A side effect of the geometry simplication is that there are some very small gaps between states. Based on your use case, you'll need to handle the case of the point not being within any state borders. In these rare cases, you could fall back to a different method, such as distance checking centroid points, adding an episilon to all state borders, or simply asking the user. (The user may also be in another country or in the ocean...)

If your pre-simplification input geometries form a coverage[0], you can use e.g. ST_CoverageSimplify[1] or coverage.simplify[2] to simplify them without introducing gaps.

[0] http://lin-ear-th-inking.blogspot.com/2022/07/polygonal-cove... [1] https://postgis.net/docs/ST_CoverageSimplify.html [2] https://shapely.readthedocs.io/en/2.1.0/reference/shapely.co...

urschrei | 2 days ago

The right term for what you are doing is "reverse geocoding".

Centrino | 2 days ago

One obvious optimization that would half the size of the data and also solve the gaps problem is to keep every border between two states once and not twice (for each polygon). This would require some processing of the geometry to find the intersection points, but assuming that in the original data, the border between two states is the same exact line, it shouldn't be hard.

shooshx | 2 days ago

Last week I looked into such simplification/decimation algorithms to simplify lines sent to a showlaser projector. Turns out there is a whole bunch of different algorithms for decimation, each with different trade-offs.

It might be interesting to see how the edge cases mentioned in the article are impacted by switching to, for example, Visvalingam-Whyatt [0].

[0]: For a Python implementation: https://github.com/urschrei/simplification

ThisNameIsTaken | 2 days ago

For reducing the number of points, I had to do something similar but for an isochrone. There were 2000 points for each isochrone and we had like 1000s of map markers. I simply picked every 200th point from the isochrone polygon and works reasonably well.

Of course, mapbox provides a parameter in the API to reduce the number of points using Douglas-Peucker algorithm. But I didn't want to make API call every single time, so we stored it and used a simple distilling depending on the use case.

sawyna | 2 days ago

How small can you get if you accept that some users might have to disambiguate when they're too close to a border? Should top out at four choices, users with no geo data has to choose between all. Feels like we can make the assumption that borders are quite sparsely populated for the most part, of course excluding cities built on borders but those are exceptions and users there might be more accepting of having to choose.

Zobat | 2 days ago

I wrote something similar a while ago for doing lon/lat -> congressional district reverse geocoding. Instead of simplifying the polygons, it splits the U.S. into tiles using a k-d tree and fetches the proper tile as a static file to do the final lookup: https://github.com/ianh/district-tiler

panic | 2 days ago

For reducing the number of points I've often used mapshaper.org.

For deciding if a user is in Texas you could create a simple polygon completely inside Texas and one in Oklahoma. 99% would fall in the simple polygon and the rest go to the detailed polygons. Or create bounds near the complex river borders and use the detailed polygons there.

On the other hand I just use simple, non-optimized functions for qquiz.com.

zeke | 2 days ago

Wow how many reverse geocode requests does a google API key make to get a bill into the thousands!?

Maybe a good iteration of this is use the .01 accuracy line work for the 99.9% of users but anything within 100m of a border could be sent to google API to get the edge cases. Probably would be in the free tier.

icameron | 2 days ago

Fun fact, there are pockets of New York State that are fully enveloped by New Jersey.

https://m.youtube.com/watch?v=SgZ1f4ACZBQ

montroser | 2 days ago

When simplifying the borders of regions, and you still want to locate any point to one of the regions, you need to simplify using topology.

Demiurge | 2 days ago

Good writeup/tool. Seeing the number of vertices just for state boundaries makes me a little less hostile to Google's API.

kylecazar | 2 days ago

Anybody have a library to go from lat/long to DMA ID?

1024core | 2 days ago

This is great! Now where's the same thing for time zones?

mark-r | 2 days ago

Very practical library. Might use it in one of my projects soon!

Also, Missouri has more vertices than Kansas, suck it!

lenerdenator | 2 days ago