Should I create a separate table for geo-location points?

Posted on

Question :

I am developing the backend for a service that resembles Uber in some aspects (just a simplification to give you the general context). Users will book rides and as such users will define the pickup and drop-off locations. These locations will be addresses in a city (with the possibility of multiple cities in the future). The locations I will receive from the frontend/app will be in the form of latitude, longitude, and google-maps-verified address.

I am wondering whether I should store the locations as a separate table, or whether I should embed the locations in whatever table I need them. For example, the bookings table will have a pickup location and a dropoff location. Should these be just references to a locations table, or should I add more columns to store the location right there? I will need at least 3 columns per location (lat, long, address string), with the possibility to have more columns if we decide to break the street address into more components.

The way I see it the two approaches trade-off memory and time performance:

  • A separate table will save some space, as we avoid copying the same
    information multiple times. But I am not sure how much space will be
    saved as we do not have enough real data yet. My ballpark estimate is that a medium-sized city has tens of thousands unique addresses. So if we take 100 ride bookings there is little chance for overlap. If we have 100,000 ride bookings there is bound to be considerable overlap. However, space is not really a limiting factor in our design, even with millions of bookings. It does not really matter if we spend 100MB more in our database.
  • A separate table will make processing slower. First of all, each time we get a booking request we will have to search if the locations in the request already exist in our database. If not, we add them, otherwise we use the existing reference. With potentially tens of thousands addresses in our database this will take some time. Is this extra time significant compared to our other time consuming tasks? An example of another time-consuming task is a call to the google-maps api to find routes from one location to another and get a travel time estimate. I believe this operation will take longer that searching in our database. So it might make little sense optimising the database access when we have other bottlenecks. How about other operations? Say we want to find bookings within a bounding [lat, long] box. This can become more time-consuming with a separate table as we’ll need a full table join. Or do a cumbersome two-step search where we find the locations IDs in our table that are within the bounding box and then search the bookings table with these IDs. If we have hundreds or thousands locations in a bounding box this might not be a good way to do it. But again, I am not sure about the utility of trying to optimise this operation. I am not sure how often will I need to search by location. It seems not that often, so I will not gain/lose much one way or the other.

It seems to me that there are no strong arguments based on the space-time tradeoff. Are there other aspects I should consider? For example: clarity of design, ease of development, flexibility and expandability. The separate table approach seems to offer more structure and the opportunity for more flexibility if the model needs to change. For example, it might be beneficial to break the street address in multiple components (street, city, state). We are less likely to do this if we are to use location columns in other tables (we need to use at least 3 columns already, and breaking the address in 3 components, brings this number to 5). About flexibility: in the future, we might have multiple-point rides. Having a location table will help with that.

I am leaning towards a separate table, but I do not see strong arguments for or against it. Are there other arguments that I am missing?

Not sure if the following details are relevant but I will be using postgresql as the RDBMS and sqlalchemy as an ORM to query the database.

Answer :

resembles Uber in some aspects […] pickup and drop-off locations

So you need to use PostGIS, and PgRouting.

I will need at least 3 columns per location (lat, long, address string)

No, because the long/lat fit into a POINT from which you’ll have to generate your topology.

I am wondering whether I should store the locations as a separate table,

You should store them with the information about the node, of course. GIS is just data. And all PostgreSQL tables are in unordered heaps. You don’t break apart tables into 1:1 relationships without reason. Store your locations with a nullable field geog that has a geography(POINT 4326) column.

if we decide to break the street address into more components.

You don’t do that in professional gis projects. You let an extension like address_standardizer do it for you. So you store your user input, and then you may cache a column with the stdaddr record type.

But I am not sure how much space will be saved as the we do not have enough real data yet.

I think your first concern should be how to best perform this task. I don’t take it your familiar enough yet to begin optimizing. Before you start this project, I’d suggest you get up to snuff with GIS:

Learning the contents: ~600 pages will likely save you a lot of time in the future.

General recommendation

I am in favor of this:

whether I should embed the locations in whatever table I need them

Do not create a separate table for storing geo-location points only. Geo location points doesn’t really make sense themselves alone.

I would say keep the geo location points to the tables about events. Events such as pickup, drop-off, meet up, dine, and so on…

Albert Einstein once said it’s about space-time. It’s not about location only (space); also it’s not about time only. It should be the binding of space and time. Consider these:

  • Where will the user be picked up? and when?
  • Where will the user be dropped off? and when?

There is always a question about when every time you talk about location. One good direction is to always think of it as an event.

Besides, would there be a part of your application where you would pin all the locations (pickups-and-drop-offs) on a single map and homogeneously as if they are all just locations? I think most likely, that will not be the case.

Instead, what you may want is to pin all the pickups, pin all the drop-offs, etc.

Again, for me, it would be better to store the locations with a mindset about events.

Specifically for your purpose, this would require you to hook your bookings table to pickups and dropoffs event tables. With this, you can have as many pickup locations and dropoff locations. Thus, extendable as what you’ve said.

Alternative

In favor of this:

whether I should store the locations as a separate table

Unless though if you use the What 3 Words (W3W) service/application. W3W split the world into 3-meter blocks. Each block then was given a unique ID. The ID is in format word1.word2.word3.

With w3w though, you don’t need to store geo-location points on a separate table. They already did that for you. All you have to do is use their IDs. I’m not so sure though if 3-meter resolutions are enough for most applications.

Credentials

By the way, I work on staple crop research. We have lots of GIS applications and we also use PostgreSQL.

You asked whether you should (1) embed locations in a bookings table …, or (2) have them in a separate table and establish a one-to-many relationship between them.

If those are your only options, choose #1. It’s easy and it will work.

Number 2 would be non-trivial to implement, a pain to maintain, and ultimately lead to loss of referential integrity because your keys would be non-deterministic. People have been building rock-solid relations on geographic data since the 1990s, but it is not simple. A roll-your-own approach is about as likely to succeed as homemade encryption. Respondents who mentioned GIS were trying to help you. Abel Melquiades Callejo gave you good advice. He gets my upvote.

To answer your question (in the comments). The reason you shouldn’t derive a foreign key from (address, lat, lon) is that all three of those values are subject to change without notice, and the latter two may even fluctuate depending on your hardware stack.

1) Addresses: streets are renamed and renumbered, postal codes are adjusted. That’s not a rare event. People don’t usually notice changes unless they’re directly affected, but nationwide (in the US) changes are so frequent that the US Postal Service publishes semi-annual updates to its official list of deliverable addresses (which is the address returned by most address-standarization APIs).
2) lat/lon. Depending on what is available, Google returns one or more values for lat/lon: street-level, rooftop, estimated, polygon, etc. The values themselves are expressed with varying levels of precision/accuracy (from hundreds of meters to sub-atomic). They’re constantly adjusted using data from streetview, satellite images, zoning maps, building permits, census records, etc… When lat/long is expressed with preposterous precision (in angstrom), you know that somewhere upstream numbers were cast as floats then back to text before being sent out as JSON. That can happen on your end, too, if your code does automatic type conversion. (You mentioned you’re working in Python…) Also, the precision of floating point arithmetic varies with hardware. On both ends. Absolute equivalence is not part of the guarantee.

More generally, there’s the question of what an address means where you live. In the US, a standardized address may be very far from where a rider would wait. For example, shopping malls and airports often have a single standard address located where a public street connects to the parking lot or loop past the terminals. Some office parks and multi-building apartment complexes have a single street address and a central hub or mailroom. Buildings in cities may have a “prestige” or “vernacular” address like “100 Main Street” but a less memorable standardized address around the corner, like “12955 Lincoln St.” If your customer entered “100 Main Street”, where would the car go? I recommend asking your front end devs which lat/long they’re sending you, and suggest they also include device location (which has its own kinds of ambiguity but they’re more widely understood).

Finally, you wrote that since you’re using sqlalchemy the foreign key solution would be trivial to implement… All that glitters is not gold. Beware of gold-plated objects.

Leave a Reply

Your email address will not be published. Required fields are marked *