This post explains the different demographic retrieval schemes that are in use today.  Illustrations along with pros and cons of each retrieval scheme are provided. This is an extremely important concept in market analysis and can lead to very different numbers on your demographic reports.  In fact, the choice of retrieval scheme can often lead to larger differences in the demographic profile than if you switch demographic providers altogether!

#### What is a demographic retrieval scheme?

A demographic retrieval scheme refers to the methodology used to allocate block group data (the smallest unit of geography for which the US Census releases detailed data) to a trade area that encompasses a portion of the block group.  Figure 1 illustrates the need for allocation.

Figure 1: The need for retrieval schemes

In Figure 1, the black/white circle represents the trade area for which we want to calculate demographics. The blue dashed polygons are block groups, and their corresponding population numbers are shown in blue. If the trade area followed block group boundaries, the allocation would be simple: each block group would be allocated at 100%. However, the above trade area splits four block groups. How do we determine the population from each block group that is within our trade area? The answer depends on the retrieval scheme your system is using.

#### Proportional Area

The proportional area methodology assumes equal distribution of population within the block group. The proportion of the population allocated to a trade area from a block group is based on the proportion of the block group's land area that overlaps the trade area. For example, let's look at Figure 2, which shows the total area of each block group along with the amount of overlap between each block group and the trade area.

Figure 2: Proportional Area Overview

Given the data in Figure 2, the demographics would be calculated as shown in the following table.

Table 1: Proportional Area Example

| Block Group (BG) ID | Population in BG | % Overlap between Polygon and BG | Population in Polygon |
|---|---|---|---|
| A | 100 | 22% (.22/1) | 22 (22% * 100) |
| B | 150 | 16% (.22/1.4) | 24 (16% * 150) |
| C | 50 | 44% (.62/1.4) | 22 (44% * 50) |
| D | 500 | 35% (.35/1) | 175 (35% * 500) |

In the above table, column 4 is calculated by multiplying the population in the block group (column 2) by the % overlap between the polygon and the block group (column 3).
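The arithmetic behind this calculation can be sketched in a few lines of Python. This is only an illustration, not how any particular GIS implements it: the function name `allocate_by_area` and the tuple layout are made up for this example, and the overlap areas are assumed to come from a polygon overlay performed elsewhere. The inputs are rows A and B of Table 1.

```python
def allocate_by_area(population, overlap_area, total_area):
    """Allocate a block group's population by its share of overlapping land area."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    return population * overlap_area / total_area


# (population in BG, overlap area with the trade area, total BG area)
# Values taken from rows A and B of Table 1.
rows = {
    "A": (100, 0.22, 1.0),
    "B": (150, 0.22, 1.4),
}

allocated = {bg: allocate_by_area(*values) for bg, values in rows.items()}

for bg, pop in allocated.items():
    print(f"Block group {bg}: {pop:.1f} people inside the trade area")
```

The sketch keeps fractional people rather than rounding, so row B comes out at roughly 23.6 instead of a whole number.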

Pros:

- Very easy to calculate from within a GIS or spatially enabled database.

Cons:

- The assumption of uniform distribution of population is greatly flawed, especially in high growth areas where block groups tend to be very geographically large.
- This method can become slow since it requires a polygon-polygon overlay.

#### Block Group Centroid

The block group centroid methodology includes or excludes a block group based on the location of the block group's centroid in relation to the polygon. If the block group's centroid is within the trade area, its population is wholly included. If the block group's centroid is outside the trade area, then it is completely excluded.

Figure 3: Block Group Centroid Sample

In the graphic above, the black/white ring represents the trade area, the blue squares are the centroids of the block groups, red indicates block groups whose population would be included based on the location of their centroids, and yellow indicates block groups that would be excluded.
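The all-or-nothing rule is easy to see in code. The following is a minimal sketch that assumes a circular trade area and planar (projected) coordinates; real latitude/longitude data would need a projection or a great-circle distance. The centroid coordinates and the function name `centroid_population` are hypothetical, and the populations simply reuse the A to D values from the earlier example.

```python
import math


def centroid_population(block_groups, center, radius):
    """Sum the population of block groups whose centroid lies inside a circular trade area."""
    cx, cy = center
    total = 0
    for bg in block_groups:
        x, y = bg["centroid"]
        # The whole block group is included or excluded based on its centroid alone.
        if math.hypot(x - cx, y - cy) <= radius:
            total += bg["population"]
    return total


# Hypothetical centroids in planar units (not taken from Figure 3).
block_groups = [
    {"id": "A", "population": 100, "centroid": (-1.5, 1.5)},
    {"id": "B", "population": 150, "centroid": (0.4, 0.6)},
    {"id": "C", "population": 50, "centroid": (-1.2, -1.4)},
    {"id": "D", "population": 500, "centroid": (1.6, -1.1)},
]

print(centroid_population(block_groups, center=(0.0, 0.0), radius=1.0))
```

With these made-up coordinates only block group B's centroid falls inside the ring, so A, C, and D contribute nothing no matter how much of their land area the ring covers.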

Pros:

- Easy to calculate from within a GIS or spatially enabled database, since it is a simple point-in-polygon operation on a relatively small number of points (there are approximately 211,000 block groups in the 2000 US Census).

Cons:

- Accuracy is a major issue. As Figure 3 illustrates, numerous block groups can be excluded even when a large percentage of their land area is contained in the trade area.

#### Block Centroid

Block centroid is the most widely used retrieval scheme today and is based on the census block. A census block is the smallest geographic area for which the Bureau of the Census collects and tabulates decennial census data. There were approximately 8 million census blocks in the 2000 US Census. However, only a very limited amount of data is released at this geographic level (population, households, group quarters, housing units).

The methodology for this retrieval scheme works as follows:

1. Assume we have a base weight table which contains all the census blocks, the lat/long of their centroids, and each block's percent of the block group total for the data fields (e.g. population, households, etc.).
2. Determine all census blocks that are contained in the polygon. This is done via a point-in-polygon operation against the base weight table.
3. Once all the census blocks that are contained in the polygon are identified, a query is performed which groups the blocks by their block group ID and sums the weights. The result is a final weight table which includes the block group ID and a total weight for each variable (e.g. population, housing units, etc.).
4. The final weight table is used to proportion the block group data to the polygon.
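The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not a vendor's implementation: the base weight table is a hypothetical in-memory list carrying only a population weight (a real table would hold one weight per variable), coordinates are treated as planar, and the point-in-polygon test is a basic ray-casting routine rather than a spatial database call.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon (a list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def block_centroid_retrieval(base_weights, bg_data, polygon):
    """Steps 2-4: select blocks, sum weights per block group, apportion the data."""
    final_weights = defaultdict(float)
    for block in base_weights:
        # Step 2: keep only blocks whose centroid is inside the polygon.
        if point_in_polygon(block["x"], block["y"], polygon):
            # Step 3: group by block group ID and sum the weights.
            final_weights[block["bg_id"]] += block["pop_weight"]
    # Step 4: apply the final weights to the block group totals.
    return {bg: bg_data[bg] * weight for bg, weight in final_weights.items()}


# Step 1: hypothetical base weight table (centroid plus share of the BG population).
base_weights = [
    {"bg_id": "A", "x": 0.5, "y": 0.5, "pop_weight": 0.25},
    {"bg_id": "A", "x": 1.5, "y": 0.5, "pop_weight": 0.25},
    {"bg_id": "A", "x": 3.0, "y": 3.0, "pop_weight": 0.50},
    {"bg_id": "B", "x": 1.0, "y": 1.5, "pop_weight": 0.40},
    {"bg_id": "B", "x": 4.0, "y": 1.0, "pop_weight": 0.60},
]
bg_population = {"A": 100, "B": 150}
trade_area = [(0, 0), (2, 0), (2, 2), (0, 2)]  # simple square polygon

result = block_centroid_retrieval(base_weights, bg_population, trade_area)
print(result)
```

In this made-up data, two of block group A's three blocks and one of block group B's two blocks fall inside the square, so A contributes half of its population and B contributes 40%.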

Figure 4: Block Centroid Sample

Figure 4 illustrates the block groups as blue polygons and the block centroids as blue squares, and also shows the street network as light black lines. Given Figure 4, the demographics would be calculated as shown in Table 2. For the sake of simplicity, let's assume each block centroid in a block group has the same weight. For example, block group A has four block centroids, so we will assume each has a weight of 25%. In practice, each block centroid is typically given a different weight based on either the 2000 US Census or an updated number supplied by the demographic provider.

Table 2: Block Centroid Sample

| Block Group (BG) ID | Population in BG | % of BG Population in Trade Area | Population in Trade Area |
|---|---|---|---|
| A | 100 | 25% (1 of 4) | 25 (25% * 100) |
| B | 150 | 16.7% (1 of 6) | 25 (16.7% * 150) |
| C | 50 | 50% (2 of 4) | 25 (50% * 50) |
| D | 500 | 33.3% (2 of 6) | 167 (33.3% * 500) |

Pros:

- Since each block centroid can be assigned a different weight, this methodology does not assume equal distribution of population.
- Fast to calculate.
- Some demographic providers supply updated weights as part of their demographic products. This is very important, since without them you are assuming the weights from the 2000 Census are still correct.

Cons:

- Census blocks are only updated once every 10 years.
- A block centroid is just an artificial point that represents the geographic center of the block. In many cases, this centroid could be in the water or in the middle of a cemetery. As Figure 4 illustrates, the block centroids are not located along the road segments.
- In rural areas (or areas that were rural in 2000), the blocks can still be very large.
- Since it is a point-based system, a census block is either wholly included or wholly excluded.

#### Postal/Building Based

The methodology for the postal/building based retrieval scheme is very similar to the block centroid scheme, except that it uses a combination of USPS zip+4 data and housing start data instead of census blocks. As of this writing, there are approximately 29 million residential zip+4s in the United States.

Figure 5: Postal/Building Based Sample

Figure 5 illustrates the block groups as red polygons and the postal zip+4s as blue squares, and also shows the street network as light black lines. By comparing Figure 5 to Figure 4, you can see that the number of zip+4s in this area is much larger than the number of block centroids, and that the zip+4s follow the road network. As we did for block centroids, for the sake of simplicity let's assume each zip+4 in a block group has the same weight. The following table outlines how population would be calculated for the trade area.

Table 3: Postal Based Sample

| Block Group (BG) ID | Population in BG | % of BG Population in Trade Area | Population in Trade Area |
|---|---|---|---|
| A | 100 | 7.6% (11 of 145) | 8 (7.6% * 100) |
| B | 150 | 2.2% (5 of 228) | 3 (2.2% * 150) |
| C | 50 | 30.8% (64 of 208) | 15 (30.8% * 50) |
| D | 500 | 51.9% (112 of 216) | 259 (51.9% * 500) |
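As a quick check on the arithmetic, the equal-weight calculation in Table 3 can be reproduced in Python. This is a sketch only: the function name `allocate_by_point_count` is invented for the example, and the equal-weight assumption is the same simplification used above, not how a production system would weight each zip+4.

```python
def allocate_by_point_count(population, points_inside, points_total):
    """Equal-weight allocation: share of points inside the trade area times BG population."""
    return population * points_inside / points_total


# (population in BG, zip+4 points inside trade area, total zip+4 points in BG)
# Values taken from Table 3.
table_3 = {
    "A": (100, 11, 145),
    "B": (150, 5, 228),
    "C": (50, 64, 208),
    "D": (500, 112, 216),
}

allocated = {bg: allocate_by_point_count(*row) for bg, row in table_3.items()}

for bg, pop in allocated.items():
    print(f"Block group {bg}: {pop:.1f}")
print(f"Trade area total: {sum(allocated.values()):.1f}")
```

Rounding each result to the nearest whole person gives the values in the last column of the table.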

Pros:

- Since this method relies on postal zip+4 data, it ignores areas where people are not living.
- The USPS releases the zip+4 data monthly. As new communities are built and start receiving mail, this methodology will account for the new postal deliveries.
- The location of the zip+4s follows the distribution of population.

Cons:

- Since it is a point based system, a zip+4 is either wholly included or excluded.
- Given the large number of zip+4 points, this method is slower than block centroid retrieval.

#### Street Based

A street based retrieval scheme assumes people are evenly distributed along the street network. The proportion of the population allocated to a ring/polygon from a block group is based on the proportion of the block group's total street length that falls within the ring/polygon. The following table outlines how this is accomplished.

Table 4: Street Based Example

| Block Group (BG) ID | Total Miles of Streets in BG | Total Miles of Streets in Polygon from BG | % of Population to Allocate |
|---|---|---|---|
| A | 2 | .35 | 17.5% |
| B | 3 | .35 | 11.7% |
| C | 2 | .7 | 35% |
| D | 4 | 1.1 | 27.5% |

In the above table, column 2 represents the total miles of streets within the block group.  Column 3 represents the total miles of streets from the block group that are within the polygon.  The percent of population to allocate is then simply calculated by dividing column 3 by column 2.
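The same ratio can be expressed in code. The sketch below takes the street mileages from Table 4 as given (computing them requires clipping street segments to the polygon, which is not shown) and, as an assumption, reuses the block group populations from the earlier tables to turn each percentage into a head count. The function name `street_share` is hypothetical.

```python
def street_share(total_miles, miles_in_polygon):
    """Fraction of a block group's street length that falls inside the polygon."""
    if total_miles <= 0:
        raise ValueError("total_miles must be positive")
    return miles_in_polygon / total_miles


# (total street miles in BG, street miles from the BG inside the polygon), from Table 4
street_miles = {
    "A": (2.0, 0.35),
    "B": (3.0, 0.35),
    "C": (2.0, 0.70),
    "D": (4.0, 1.10),
}

# Assumption: block group populations carried over from the earlier tables.
bg_population = {"A": 100, "B": 150, "C": 50, "D": 500}

shares = {bg: street_share(*miles) for bg, miles in street_miles.items()}
allocated = {bg: bg_population[bg] * shares[bg] for bg in shares}

for bg in shares:
    print(f"Block group {bg}: {shares[bg]:.1%} of population, {allocated[bg]:.1f} people")
```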

Pros:

- This method does not rely on an all-or-nothing inclusion of a centroid.
- Since it's street based, this method ignores all areas where people cannot be living.
- As new communities are built and the streets get added, this methodology will account for the new streets.

Cons:

- The assumption of even distribution of population along the street network is questionable. For example, think about cul-de-sacs, which have a very high population-to-street-length ratio.
- The calculation of total street segment length within the polygon is computationally intensive.