Title: Practical issues to consider when working with big data (RAS2022)
Abstract: Increasing access to alternative or “big data” sources has given rise to an explosion in the use of these data in economics-based research. However, in our enthusiasm to use the newest and greatest data, we as researchers may jump to use big data sources before thoroughly considering the costs and benefits of a particular dataset. This article highlights four practical issues that researchers should consider before working with a given source of big data. First, big data may not be conceptually different from traditional data. Second, big data may only be available for a limited sample of individuals, especially when aggregated to the unit of interest. Third, the sheer volume of data coupled with high levels of noise can make big data costly to process while still producing measures with low construct validity. Last, papers using big data may focus on the novelty of the data at the expense of the research question. I urge researchers, in particular PhD students, to carefully consider these issues before investing time and resources into acquiring and using big data.


Summary: In many ways, big data is not inherently different from other types of data. However, researchers, especially PhD students, can forget this in the excitement of learning about a new dataset. This article highlights four practical issues to consider when conducting economics-based research using big data. First, a particular source of big data may not be conceptually different from traditional data. As a result, studies that simply replicate prior results using big data may lack contribution, especially if the new data suffers from the same issues as prior data (e.g., endogeneity). Second, big data sources may only be available for a limited number of entities, especially when aggregated to the unit of interest, leading to limited statistical power and generalizability. Third, high levels of noise and the large volume of data can make big data costly to process; however, an arduous data cleaning process itself does not ensure that empirical proxies are tied to the constructs of interest. Last, interesting research questions are difficult to reverse engineer after the fact, and researchers who invest heavily in big data before generating a research question may end up with a paper that focuses on validating data in the absence of economic intuition. Researchers who keep these four issues in mind can potentially save themselves considerable resources (not to mention heartache!) by avoiding low-impact, high-cost projects and by instead focusing on research questions with the greatest potential for contribution.