By Hina Shaikh and Attique-ur-Rehman
Countries across the world are relying on ever-larger amounts of data to inform policy design. Efforts to improve policymaking, however, call for rethinking the types of data collected, the process of collating and organizing it, and how it is eventually integrated with existing information.
Big data in Pakistan
As Pakistan transitions towards e-governance and evidence-based policymaking, the government is focusing not only on generating new data but also on overhauling existing data collection and management systems across the entire government machinery. Success stories include the National Database and Registration Authority (NADRA), which now maintains one of the world’s largest citizen databases, based on facial recognition and biometric data. The Benazir Income Support Programme (BISP), Pakistan’s flagship social protection initiative, also boasts the country’s most extensive poverty database, an output of the largest and first-ever door-to-door poverty survey. Pakistan has also just concluded its sixth Housing and Population Census, which shows the total population exceeding 207 million!
Sub-national governments are also making progress on this front. The Punjab Information Technology Board (PITB) has introduced technology-based interventions to generate real-time data, which helps monitor service delivery and inform policy and its implementation. PITB is now extending its services to other provinces.
Data for Research
The availability of new information is catalyzing data-driven solutions in the policy realm, while simultaneously helping researchers track progress and gaps in policies and programs. In this process, researchers are assisting in both creating new data and demonstrating efficient ways of using it. Examples of IGC’s initiatives in this regard include its collaboration with the State Bank of Pakistan and the Pakistan Bureau of Statistics in the roll-out of the Management and Organisational Practices Survey (MOPS), appended to the latest round of the Census of Manufacturing Industries (CMI). This is the first extension of MOPS outside of the US. IGC has also supported the digitisation of the census of small and cottage industry in Punjab conducted by the Punjab Small Industries Corporation (PSIC), which can now be used for an unprecedented analysis of non-farm economic activity in rural Punjab.
Gaps and Issues in data
The ability to generate and use data effectively, especially in the form of evidence, requires defining what to measure and ensuring the quality of what is being measured.
- Measuring the right indicators
Information needs to be collected in light of its ultimate application so that it addresses the relevant policy questions. In the case of Pakistan, estimates on health and other socio-economic outcomes are not systematically produced, making it difficult to generate evidence about the effectiveness of existing policy, and further discouraging an already weak culture of data use.
The lack of pertinent data is stark. A 2014 study commissioned by IGC tried to determine key factors impacting public health outcomes in Punjab. The analysis drew on two sets of population-based surveys: the Punjab Demographic and Health Surveys (DHS) of 2006 and 2012 and the Punjab Multiple Indicator Cluster Surveys (MICS) of 2008 and 2011. It was unable to answer any questions regarding the correlation between various policy inputs and health conditions. Moreover, these surveys contained no information on water sources that could help determine water quality and, hence, its impact on health.
- Accuracy of Data
The relevance of data for decision-making is undermined not just by its absence but also by its inaccuracy. For example, in the Punjab Directory of Industries 2016, duplicate firms are treated as different firms and assigned distinct serial numbers, increasing the chances of double counting. In other instances, the same industry types are named differently in various places, posing a risk of incorrectly categorizing firms by industry type.
IGC researchers took over two months to clean data collected by the Punjab Small Industries Corporation (PSIC). Because the data had not been entered systematically and consistently using codes, ‘hammer’ is spelt in at least five different ways: hamer, hamr, hamar, hathora, hathori! Eventually researchers found only 24,000 of the 164,000 observations valid for analysis.
Amidst this anguish, some of the entries provide comic relief! In the PSIC data, the annual revenue of one flour mill (atta chaki) in Kot Addu was more than Rs. 8.7 trillion, against a working capital of Rs. 100,000. The Pakistan Standard of Living Measurement Survey 2015-16, meanwhile, contains households with multiple household heads, many of them aged 10 or younger, with the youngest being a four-year-old!
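The spelling problem described above is the kind of issue a code list is meant to prevent. As a minimal sketch, and not the procedure the researchers actually followed, free-text entries can be mapped to a single canonical code using a lookup of known variants. The code `HAMMER`, the function names and the structure of the lookup are assumptions made for illustration; only the five variant spellings come from the PSIC example.

```python
# Hypothetical code list: one canonical code and the spellings seen for it.
# The variants are the ones quoted in the text; the code name is an assumption.
ITEM_VARIANTS = {
    "HAMMER": {"hammer", "hamer", "hamr", "hamar", "hathora", "hathori"},
}

# Invert the code list into a lookup from each spelling to its code.
SPELLING_TO_CODE = {
    spelling: code
    for code, spellings in ITEM_VARIANTS.items()
    for spelling in spellings
}


def standardise_item(raw):
    """Return the canonical code for a free-text entry, or None if unknown."""
    key = raw.strip().lower()
    return SPELLING_TO_CODE.get(key)


def split_valid(entries):
    """Separate entries that map to a known code from those that do not."""
    valid, unrecognised = [], []
    for entry in entries:
        code = standardise_item(entry)
        if code is None:
            unrecognised.append(entry)
        else:
            valid.append(code)
    return valid, unrecognised
```

Entries that fall into the unrecognised list would still need manual review, which is why coding responses at the point of entry is preferable to cleaning them afterwards.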
- Consistency across data sets
There is a lot of data that already ‘exists’ but not in forms that can be merged. For an IGC project on education in Khyber Pakhtunkhwa (KP), researchers had to extract data from multiple sources. The monthly data of the Independent Monitoring Unit (IMU) does not match the yearly data from the Education Management Information System (EMIS). Across both datasets, schools with the same code are reported against different tehsils and circles. Figures in the published copy of the EMIS report also differed from information in the raw data. In the hope that corrective measures will follow, these observations have been shared with KP’s Elementary and School Education Department.
In earlier work, IGC researchers also highlighted the lack of an integrated socio-economic dataset at the micro level, especially for urban planning. In the absence of a common spatial identifier (such as a mohalla), various datasets cannot be cross-linked. This has resulted in poorly targeted and short-sighted policies for urban Pakistan.
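A mismatch of this kind, where the same school code appears against different tehsils in two sources, can be detected mechanically before any merge is attempted. The sketch below is illustrative only: the field names `school_code` and `tehsil` and the record layout are assumptions, not the actual IMU or EMIS formats.

```python
def find_tehsil_mismatches(imu_records, emis_records):
    """Return (school_code, imu_tehsil, emis_tehsil) for conflicting schools.

    Both arguments are lists of dicts with the hypothetical keys
    'school_code' and 'tehsil'. Schools missing from either source
    are ignored here and would need a separate check.
    """
    emis_tehsil = {r["school_code"]: r["tehsil"] for r in emis_records}
    mismatches = []
    for record in imu_records:
        code = record["school_code"]
        if code not in emis_tehsil:
            continue
        # Compare case-insensitively so pure formatting differences are not flagged.
        if record["tehsil"].strip().lower() != emis_tehsil[code].strip().lower():
            mismatches.append((code, record["tehsil"], emis_tehsil[code]))
    return mismatches
```

Running a check like this first produces a list of specific schools to query with the data owners, rather than discovering the conflicts after the datasets have been joined.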
Looking for solutions
Robust data collection systems are needed to capture data efficiently and ensure its integrity, and they should correspond to the government’s capacity to use the information effectively. While data analysis has become sophisticated, most data collection remains paper-based and prone to a variety of slip-ups.
The use of technology such as computer-assisted personal interviewing (CAPI) can make the entire process of data collection, entry and analysis more cost- and time-efficient. Through CAPI, the interviewer reads the questions to the respondent from the screen of a handheld Android device (usually a phone or a tablet) preloaded with the questionnaire. The responses are immediately entered into the device. Such applications have checks at the backend to ensure data is accurately captured. This eliminates the need for manual re-entry of data and minimises the chances of errors. Moreover, this data, which is encrypted, is automatically synced and uploaded to a central server and can be viewed only by authorised persons.
Given the gradual shift towards data-based policymaking and monitoring, technology-driven solutions are now becoming essential. For an ongoing education project in KP, IGC researchers have used the Census and Survey Processing System (CSPro), software that is free to download and use. Barring the one-time cost of Android-based devices, its use not only saved time and money but also ensured the integrity and reliability of data. Encouragingly, government statisticians in Pakistan are gradually catching on to this trend. The Pakistan Bureau of Statistics and the Bureau of Statistics in Punjab now use CSPro, while BISP has also shifted from paper-based to computer-aided interviewing for updating the country’s poverty database. These nascent trends need to be built upon at both the national and sub-national level in order to improve the knowledge base, and thus the efficacy, of policy design and implementation.
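To give a sense of what such backend checks might look like, here is a minimal sketch of entry-time validation for a household roster. It is written in Python purely for illustration and does not represent how any particular CAPI package implements its checks. The field names, the minimum age of 15 for a household head and the 0 to 120 age range are all assumptions.

```python
MIN_HEAD_AGE = 15   # assumed threshold, for illustration only
MAX_AGE = 120       # assumed upper bound on a plausible age


def validate_household(members):
    """Return a list of error messages; an empty list means the record passes.

    `members` is a list of dicts with the hypothetical keys
    'age' (int) and 'is_head' (bool).
    """
    errors = []

    # A household should have exactly one head.
    heads = [m for m in members if m.get("is_head")]
    if len(heads) != 1:
        errors.append(f"expected exactly one household head, found {len(heads)}")

    # The head should not be implausibly young.
    for head in heads:
        if head["age"] < MIN_HEAD_AGE:
            errors.append(f"household head aged {head['age']} is below {MIN_HEAD_AGE}")

    # Every reported age should fall in a plausible range.
    for member in members:
        if not 0 <= member["age"] <= MAX_AGE:
            errors.append(f"age {member['age']} is outside 0-{MAX_AGE}")

    return errors
```

In a CAPI setting, a non-empty error list would typically prompt the interviewer to confirm or correct the entry on the spot, while the respondent is still present.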
Hina Shaikh is the Country Economist at the International Growth Centre
Attique-ur-Rehman is a Research Associate at the Consortium for Development Policy Research