Technical Design Rationale
The Gap
I loved the power and flexibility of Python for data analysis, but there wasn’t a library that was accessible and user-friendly for researchers who wanted to perform common statistical analyses without becoming experts in the underlying code or statistical theory.
While powerful libraries existed, they often required a steep learning curve, forcing users to dive deep into documentation or source code just to run a t-test or generate a summary table. The quality of documentation and the usability of outputs varied widely. It was often difficult to find a tool that provided clear, concise results ready for reporting without additional formatting or manipulation. As a result, I found myself spending more time figuring out how to use the libraries effectively than focusing on the analysis itself.
At the time, Python wasn’t as widely used in research as it is now, and the ecosystem was still developing. I needed a tool that would let me run common analyses quickly and produce outputs ready for interpretation, with reproducible results and little friction. So I started developing a package for myself to fill this gap, allowing me to focus on the what and why of the results rather than the how of the mechanics.
After building it, I realized this wasn’t just a personal fix; there was a genuine need in the Python ecosystem. I decided to make it open-source to help others facing similar challenges.
When documenting the package, my goal was clarity. I wanted to provide examples of effective use alongside citations for the formulas and methods used. This ensures users can understand the underlying mechanics if they choose, but don’t need to in order to use the tool effectively. The aim was to make Python a competitive option with Stata, R, SAS, and SPSS for common univariate and bivariate analyses, accessible to researchers with a wide range of backgrounds and expertise.
Design Principles
Ease of Use & Accessibility (primary design goal): Make the library accessible to researchers who know statistics but may not be Python experts, through clear function names, intuitive parameters, and outputs ready for interpretation without additional formatting.
Clarity of Output: Outputs are designed to be clear and informative, using pandas DataFrames as the standard format, which is familiar to many users and easily manipulated or exported for reporting.
Methodological Transparency: Each function is thoroughly documented with examples and citations for the formulas and methods used. Users can understand the underlying mechanics if they choose, but don’t need to in order to use the tool effectively.
Reproducibility: The library supports reproducible research practices with outputs that can be easily exported and shared. Clear information about methods ensures results can be replicated across different platforms.
Complementarity: ResearchPy is designed to complement existing libraries like SciPy, NumPy, and Statsmodels, rather than replace them. It fills gaps in functionality while allowing users to leverage these libraries for complex analyses.
Cross-Validated Rigor: All outputs are rigorously tested and cross-checked against established software like Stata, R, SAS, and SPSS to ensure accuracy and reliability. This builds trust in the results generated by ResearchPy.
Verification & Rigor: The “Trust Layer”
In statistical software, accuracy is non-negotiable. A tool that is easy to use but produces incorrect p-values or effect sizes is not a convenience; it is a liability. ResearchPy was built with a “verification-first” mindset, ensuring that every output stands up to the scrutiny of peer review and regulatory standards.
Cross-Platform Validation Strategy
The core of our rigor lies in a systematic cross-validation process. We do not rely solely on internal unit tests. Instead, we run identical datasets through ResearchPy and compare the outputs against the “gold standards” of the field:
Stata
R
SAS
SPSS
This process covers:
Point Estimates: Means, medians, regression coefficients, and odds ratios.
Uncertainty Metrics: Standard errors, confidence intervals, and p-values.
Effect Sizes: Cohen’s d, Pearson’s r, and other standardized metrics.
Model Fit Statistics: R-squared and log-likelihood values.
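As a concrete instance of one statistic in this list, the sketch below computes Cohen’s d for two independent groups using the common pooled-standard-deviation form. It is a textbook formulation written with the standard library only, not ResearchPy’s own code; the function name `cohens_d` and the sample values are made up for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up illustrative data, not taken from any validation run.
treatment = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
d = cohens_d(treatment, control)
```

A value computed this way is the kind of number that would then be compared against the same statistic reported by the reference software.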
The Verification Method
Because the proprietary packages do not expose their internal computational tolerances, verification is performed by directly comparing the numerical values in the output tables or data structures. Agreement is checked to the last decimal place displayed, which typically ranges from 4 to 10 decimal places depending on the statistic and the reference software’s default formatting.
If a result disagrees at this displayed precision, the test fails and the discrepancy is investigated immediately. This approach ensures that the results a user sees in ResearchPy match what they would see in Stata, R, SAS, or SPSS for the same analysis.
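The comparison described above can be sketched roughly as follows. This is a minimal illustration, not ResearchPy’s actual test harness: the helper name `matches_displayed` is hypothetical, the reference value is assumed to be available as the text printed by the reference software, and round-half-even is an assumed rounding mode (a given reference package may round differently).

```python
from decimal import Decimal, ROUND_HALF_EVEN


def matches_displayed(computed, reference_text, rounding=ROUND_HALF_EVEN):
    """Check a computed value against a reference at the reference's displayed precision.

    `reference_text` is the number exactly as printed by the reference
    software, e.g. "0.0312" (hypothetical value).
    """
    reference = Decimal(reference_text)
    # quantize() rounds to the same number of decimal places as `reference`,
    # so the comparison uses only the digits the reference actually shows.
    rounded = Decimal(repr(computed)).quantize(reference, rounding=rounding)
    return rounded == reference


# Hypothetical values for illustration only.
agrees = matches_displayed(0.0312478, "0.0312")     # same at 4 decimals
disagrees = matches_displayed(0.0312478, "0.0313")  # differs at 4 decimals
```

Keeping the reference as text preserves trailing zeros, so a printed "2.5000" is still checked to four decimal places rather than one.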
Roadmap & Evolution: Guided by User Needs
ResearchPy began as a solution to a specific, immediate gap in univariate and bivariate analysis. However, the needs of the research community are dynamic. The roadmap for ResearchPy is not driven by a rigid corporate product plan, but by the evolving requirements of researchers and the advancement of open science.
Current Scope & Stability
Core Functionality: Robust support for univariate and bivariate analyses (summary tables, t-tests, correlations, ANOVA, Chi-square).
Reporting Ready: Outputs are formatted for immediate inclusion in manuscripts and reports.
Stability: The core API for these common analyses is stable and production-ready.
Planned Expansions
The next phase of development focuses on extending ResearchPy into multivariate and multivariable modeling, bridging the gap between exploratory analysis and complex inference:
Generalized Linear Models (GLM): A unified interface allowing users to specify families (Gaussian, Binomial, Poisson, etc.) and link functions, covering Logistic, Poisson, and Negative Binomial regression.
Advanced Regression: Enhanced support for Linear Regression with interaction terms and polynomial features, with automatic effect size calculations.
Model Diagnostics: Integrated tools for residual analysis, multicollinearity checks (VIF), and assumption testing, all returning structured DataFrames.
User-Driven Development
Unlike commercial software, where features are dictated by market research, ResearchPy’s evolution is community-led:
Feedback Loop: Feature requests and bug reports from users directly influence the priority of the development queue.
Open Contribution: The open-source nature invites contributions from the community, ensuring the tool benefits from diverse expertise.
Transparency: The development roadmap is visible in the repository, allowing users to see what is being worked on and vote on priorities.
Commitment to Independence
Regardless of how the “Hearts of Science” initiative evolves, ResearchPy remains a standalone, independent tool. Its functionality does not depend on any external organizational structure. Whether used as a standalone library or as part of a larger ecosystem, the code remains open, accessible, and free for the community to use and extend.
Relationship to Hearts of Science: The Emerging Container
ResearchPy was born from a specific need: to make statistical analysis in Python accessible, transparent, and reproducible. Over time, it became clear that this need was not isolated. It was part of a broader pattern of work, including educational resources like Python for Data Science and a commitment to open, ethical research practices, that shared the same underlying values.
Hearts of Science (HoS) is the name given to this emerging pattern.
It is not a rebrand, a pivot, or a corporate entity created to sell a product. Rather, it is an organizing concept intended to:
Reduce Fragmentation: Provide a coherent home for related tools and educational resources that share a commitment to methodological rigor and accessibility.
Ensure Continuity: Serve as a narrative anchor that explains why these tools were built the way they were, connecting the dots between code, education, and ethics.
Support Evolution: Act as a flexible container that can grow to include future initiatives aligned with the mission of improving the human condition through better data.
Independence is Paramount
Crucially, ResearchPy remains fully usable and independent. Its functionality, stability, and value do not depend on the existence or evolution of Hearts of Science. Whether HoS becomes a formal organization, remains a loose collection of projects, or simply serves as a philosophical label, ResearchPy will continue to be maintained as a standalone, open-source library for the community.
The relationship is simple: ResearchPy is the work. Hearts of Science is the “why” behind the work.