Global Reach/India Survey Documentation

Overview

This documentation is a supplement to our India Phone Survey results. It provides an overview of the context and methodologies through which the phone survey data were gathered and organized. For those interested in further investigation, we also provide recommendations on how to use the raw data for meaningful analysis and exploration.

India phone survey

There are a total of 19 questions in the survey, addressing the following categories:

Internet use
Mobile phone use (smartphones & basic voice/SMS phones)
Awareness and use of Wikipedia
General demographics

Phone surveys were conducted between June and October 2016 by Votomobile. This survey is a composite of 7 individual regional surveys. This approach was taken to reduce the number of languages offered to an individual caller from 12 to just the 3-6 commonly used in that region. We also varied the number of responses collected per region to approximately reflect that region's share of the total population of India. We chose a large sample size to minimize the margin of error and to provide enough data for useful analysis of different regions of India.

Here are the main questions this survey was designed to answer. Analyzing the full data set allows for more in-depth exploration and further insights around these questions:

What is the actual number of people who use the internet?

(Real-world behavior makes this difficult to measure from industry reports, since people might have access to the internet through schools, friends, internet cafés, public Wi-Fi, etc.)

For internet users: What do people mostly use the internet for?
For non-internet users: Why not use the internet?
How many people use smartphones?
Do people with smartphones access the internet only over Wi-Fi, or only over cellular service?
How many people think that they don’t use the internet, but still use Facebook or WhatsApp?
How many people have heard of Wikipedia? What do they use it for? How often?
If they have heard of Wikipedia, but aren’t using it, why not?

Selection of the 7 individual surveyed regions

Regions chosen had to be derived from the geographical areas of ‘calling circles / area codes’ of India’s mobile phone system.
From the calling circle coverage, areas of similar geography and language use were combined into 7 distinct regions for separate surveys.
Each regional survey size was influenced by its population relative to the population of India.
Including all of India’s languages and geographies would have been cost prohibitive. However, the languages and regions chosen are expected to cover more than 95% of India’s population.

Table of Regions

Calling Groups	Areas Included	Languages Included
A	Tamil Nadu, Kerala	Hindi, Malayalam, Punjabi, Tamil, Telugu, English
B	Uttar Pradesh, Bihar, Rajasthan, Jharkhand, Uttarakhand, Delhi, North Eastern states	Hindi, Punjabi, English
C	Maharashtra, Gujarat	Gujarati, Hindi, Marathi, English
D	Madhya Pradesh, Odisha, Chhattisgarh	Hindi, Marathi, Odia, Punjabi, English
E	Punjab, Haryana, Himachal Pradesh	Hindi, Punjabi, English
F	West Bengal, Assam	Assamese, Bengali, English, Hindi, Punjabi
G	Andhra Pradesh, Karnataka, Telangana	Kannada, Tamil, Telugu, English

Where to get the data

This page shows graphs of the responses received for each question in the survey.

The full data set can be found at:

Dan Foy (2016). India phone survey 2016. figshare. doi:10.6084/m9.figshare.5404834

This is the canonical version which contains a CSV including every answer from each of the 9235 responses.

The full text of the questions can be found here.

Using the data effectively for analysis

Looking at India as a whole

For an overview of the Indian population, you should turn on the “India Representation Subset” filter to obtain a subset of 2700 responses, with each regional survey size contribution determined by its population percentage.

Looking at regional subsets

Studying the data set at a regional level should provide additional insights. India consists of regions that are drastically different from each other, and drawing conclusions by combining all its regions may not always give a holistic view of the population. For regional analysis, make sure that the “India Representation Subset” filter is turned off before filtering to the region of interest.

Important to note: The regional and India representation filters should not be used in combination, because together they can reduce the available regional data significantly.

Impact of combining regional and country filtering:

For instance, let's focus on Calling Group A (Tamil Nadu, Kerala).
When only the regional filter is on, 865 full responses are available for analysis.
When both the regional filter and the “India Representation Subset” (country proportionality) filter are on, only 247 of the 865 full responses are available, which widens the margin of error of any regional estimate.
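The two filtering modes can be sketched in a few lines of Python. This is only an illustration on made-up rows: the column name "Calling Group" is an assumption about the CSV layout, and the exact spelling of the “India Representation Subset” header should be checked against the downloaded file.

```python
import csv
import io

# Made-up rows standing in for the figshare CSV.
# "Calling Group" is a hypothetical column name; "India Representation Subset"
# is the column described in this documentation (check its exact header spelling).
SAMPLE_CSV = """Calling Group,India Representation Subset
A,TRUE
A,FALSE
A,FALSE
B,TRUE
B,FALSE
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per survey response."""
    return list(csv.DictReader(io.StringIO(text)))

def india_subset(rows):
    """All-India view: representation filter on, no regional filter."""
    return [r for r in rows if r["India Representation Subset"] == "TRUE"]

def regional_subset(rows, group):
    """Regional view: regional filter on, representation filter off."""
    return [r for r in rows if r["Calling Group"] == group]

rows = load_rows(SAMPLE_CSV)
national = india_subset(rows)
region_a = regional_subset(rows, "A")
print(len(national), len(region_a))
```

Applying both functions to the same rows would reproduce the shrinkage described above, which is why the sketch keeps them as two separate views.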

Individual survey responses

Within the CSV file, each row represents one survey taken, and each column contains the response to the associated question. In certain cases, some questions that should have been asked were not, and these entries are marked as “Missing”.

When analyzing results from questions Q9A-Q9D, Q12 and Q13, set the filter to just “Full Responses”.
When analyzing results from any other questions, you can include non-full responses to increase the sample size with fully valid data.
When filtering to the “India Representation Subset”, all responses are already from the full response set and no special treatment is needed.
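As a rough sketch of these rules, the helper below picks the rows to use for a given question. The sample rows are invented, and the header "Full Response" and the question column names are assumptions about how the CSV labels them.

```python
import csv
import io

# Questions that could be incorrectly skipped, as listed in this documentation.
SKIP_AFFECTED = {"Q9A", "Q9B", "Q9C", "Q9D", "Q12", "Q13"}

# Made-up rows; the header spellings are assumptions about the real CSV.
SAMPLE_CSV = """Q1,Q12,Full Response
Yes,Daily,TRUE
No,Missing,FALSE
Yes,Weekly,TRUE
"""

def rows_for_question(rows, question):
    """Return the rows that are safe to analyze for one question."""
    if question in SKIP_AFFECTED:
        # Only full responses are guaranteed to have a real answer here.
        return [r for r in rows if r["Full Response"] == "TRUE"]
    # Other questions were never skipped, so non-full responses are usable too.
    return list(rows)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
q12_rows = rows_for_question(rows, "Q12")
q1_rows = rows_for_question(rows, "Q1")
print(len(q12_rows), len(q1_rows))
```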

Facebook / WhatsApp questions

The questions asking whether respondents use Facebook or WhatsApp were asked only of those who had previously said that they do not use the internet. This is by design: we wanted to use these questions to gauge how many people did not understand that Facebook was part of the internet. The responses to these two questions were not intended to measure the full use of Facebook or WhatsApp.

Non-linear progression & Margin of Error

It is important to note that this survey is non-linear. Depending on how a question is answered, the flow of the rest of the survey may change. For example, if a respondent says that he or she does not have a smartphone, we skip the smartphone-related questions. You can review the flow diagram to see how the survey progresses. The survey size is large enough that, for questions asked of all respondents, results are accurate to within a 2% margin of error at a 95% confidence level. Questions asked of only a subset of respondents have a smaller sample and therefore a wider margin of error.
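The 2% figure can be sanity-checked with the standard margin-of-error formula for a proportion. This sketch assumes simple random sampling and the worst-case proportion p = 0.5; the documentation does not state how the original figure was computed, so treat it as an approximation rather than the method actually used.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% confidence interval for a proportion.

    Assumes simple random sampling; p = 0.5 is the worst case and
    z = 1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes quoted in this documentation.
for label, n in [
    ("India Representation Subset", 2700),
    ("Calling Group A, regional filter only", 865),
    ("Calling Group A, both filters", 247),
]:
    print(f"{label}: n={n}, margin of error = {margin_of_error(n):.1%}")
```

Under these assumptions the 2700-response subset comes out just under 2%, while the Calling Group A example goes from about 3.3% with 865 responses to about 6.2% with 247, which illustrates why the two filters should not be combined.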

Methodologies

Addressing Biases

One issue with phone surveys is the tendency for some respondents to favor the first response to a question. To address this problem, most of the survey questions presented the responses in a random order for each call. This distributes any bias evenly among the responses instead of accumulating it all on one response. Questions that have a 'none of these' or 'other' response always kept this option as the last one presented. A couple of survey questions, however, have responses with a strong order dependency and would be confusing if presented in a completely random order. For instance, when we ask how often respondents use Wikipedia, a non-sequential order would not make sense (e.g. an order of “once a week”, “once a month”, “once a day”). For these questions, we randomly presented the responses in one of two orders: either from lowest to highest, or from highest to lowest.
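The ordering rules described above can be illustrated with a small sketch. It is not the survey platform's implementation: the function name, the `ordered` flag, and the option texts are all hypothetical.

```python
import random

# Options that stay last regardless of shuffling, per the rule above.
PINNED_LAST = {"none of these", "other"}

def present_options(options, ordered=False, rng=random):
    """Return the response options in the order they would be read out.

    ordered=True  -> keep the scale sequential, but flip it half the time.
    ordered=False -> shuffle, keeping 'none of these' / 'other' at the end.
    """
    if ordered:
        return list(options) if rng.random() < 0.5 else list(reversed(options))
    movable = [o for o in options if o.lower() not in PINNED_LAST]
    pinned = [o for o in options if o.lower() in PINNED_LAST]
    rng.shuffle(movable)
    return movable + pinned

rng = random.Random(2016)  # seeded only to make the illustration repeatable
uses = ["News", "Social media", "Entertainment", "Other"]
frequency = ["once a day", "once a week", "once a month"]
print(present_options(uses, rng=rng))
print(present_options(frequency, ordered=True, rng=rng))
```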

Calculation of Proportionality

To achieve a full India representation, we introduced proportionality to determine the number of responses we select per region for analysis:

We determined the actual regional population of India referencing “List of states and union territories of India by population”.
We summed the populations of all regions represented in the survey, for a total of 1,151,284,905.
We calculated the percentage of that total represented by each calling group. For instance, Calling Group A (Tamil Nadu and Kerala) had a total population of 105,526,635, which is about 9% of the total.
We set the overall sample size to 2700 and calculated the number of responses to take from each region in proportion to its population share.
We ordered the raw data chronologically and selected complete responses up to the calculated number for each region. We added a column “India Representation Subset” and marked the selected responses as “TRUE”. To obtain data for a full India representation, simply select “TRUE”.

Calling Group	Zone	Population	% of Total Population	Proportionalized Response Size (2700 responses)
Group A	Tamil Nadu	72,138,958	6.27%
	Kerala	33,387,677	2.9%	247.59
Group B	Uttar Pradesh	199,281,477	17.31%
	Bihar	103,804,637	9.02%
	Rajasthan	68,621,012	5.96%
	Jharkhand	32,966,238	2.86%	949.05
Group C	Maharashtra	112,372,972	9.76%
	Gujarat	60,383,628	5.24%	405.00
Group D	Madhya Pradesh	72,597,565	6.31%
	Odisha	41,947,358	3.64%
	Chhattisgarh	25,540,196	2.22%	328.59
Group E	Punjab	27,704,236	2.41%
	Haryana	25,353,081	2.2%
	Himachal Pradesh	6,864,602	0.6%	140.67
Group F	West Bengal	91,347,736	7.93%
	Assam	31,169,272	2.71%	287.28
Group G	Andhra Pradesh	49,386,799	4.29%
	Karnataka	61,130,704	5.31%
	Telangana	35,286,757	3.06%	341.82
		1,151,284,905	100.00%	2700
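The proportional allocation in the table can be reproduced from the population figures. The sketch below uses the populations listed above and unrounded shares; the variable names are ours, not part of the data set.

```python
# State populations per calling group, copied from the table above.
POPULATION = {
    "A": [72_138_958, 33_387_677],
    "B": [199_281_477, 103_804_637, 68_621_012, 32_966_238],
    "C": [112_372_972, 60_383_628],
    "D": [72_597_565, 41_947_358, 25_540_196],
    "E": [27_704_236, 25_353_081, 6_864_602],
    "F": [91_347_736, 31_169_272],
    "G": [49_386_799, 61_130_704, 35_286_757],
}
TARGET = 2700  # size of the India Representation Subset

total = sum(sum(states) for states in POPULATION.values())
# Each group's allocation is its population share times the target size.
allocation = {
    group: TARGET * sum(states) / total
    for group, states in POPULATION.items()
}

print(f"total population: {total:,}")
for group, size in allocation.items():
    print(f"Group {group}: {size:.2f} responses")
```

With unrounded shares the results differ slightly from the table (Group A comes out near 247.5 rather than 247.59), which suggests the table values were computed from the rounded percentages. Either way, each group rounds to the same whole number of responses.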

Skipped questions / Full responses

Votomobile experienced a logic flow problem with some of the responses, which led to a small set of questions being occasionally skipped (only possible with Q9A-Q9D, Q12 and Q13). When one of those questions was incorrectly skipped, that particular response in the spreadsheet is set to ‘Missing’, and the entry in the ‘Full response’ column is set to FALSE for filtering purposes.

To address this issue, Votomobile conducted extra full surveys to make up for the incomplete responses. In the current spreadsheet, the original responses (with ‘Missing’ marked where needed) and the additional responses are combined for analysis. For our initial analysis of the data set, we used only responses marked as “Full Response”.

External links

India survey data, figshare