GEFRI scoring technical appendix
This appendix publishes the scoring, normalization, and imputation logic used for the Global Education Futures Readiness Index (GEFRI). The public version intentionally removes all production automation, file exports, and data-ingest code so that the community can review the methodology without exposing infrastructure internals. The code block below mirrors the logic that powers the live index.
Note: The World Bank has transitioned from the term FCV (Fragility, Conflict, and Violence) to FCS (Fragile and Conflict-Affected Situations) in recent publications and classification updates. Because the GEFRI code was written when “FCV” was the standard terminology, reviewers may encounter both acronyms in this technical documentation and variable names. The GEFRI website now presents data with the current designation.
Indicators and dimensions
GEFRI uses 23 indicators: 21 core indicators that contribute directly to the five readiness dimensions, plus 2 auxiliary series (population and male lower-secondary completion) used for transformations and derived ratios. These auxiliary series do not contribute directly to the dimension or composite scores.
| Code | Indicator | Dimension | Transform & scaling | Eligibility & imputation |
|---|---|---|---|---|
| EG.ELC.ACCS.ZS | Access to electricity (% of population) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.NET.USER.ZS | Internet users (% of population) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.NET.SECR.P6 | Secure internet servers (per 1 million people) | Infrastructure | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.CEL.SETS.P2 | Mobile cellular subscriptions (per 100 people) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.XPD.TOTL.GD.ZS | Government expenditure on education (% of GDP) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ADT.LITR.ZS | Adult literacy rate (% age 15+) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). High-income gaps default to 100% (flagged as "Assumed (high income)"). |
| SE.SEC.ENRR | School enrollment, secondary (% gross) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.TER.ENRR | School enrollment, tertiary (% gross) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ENR.SECO.FM.ZS | Secondary GPI (Gross enrollment ratio, female/male) | School Access & Gender Parity | GEFRI gender parity scoring: GPI=1 returns 100; deviation penalty is symmetrical and floored at 10. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ENR.TERT.FM.ZS | Tertiary GPI (Gross enrollment ratio, female/male) | School Access & Gender Parity | GEFRI gender parity scoring: GPI=1 returns 100; deviation penalty is symmetrical and floored at 10. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.PRM.UNER.ZS | Children out of school, primary (% of primary school age) | School Access & Gender Parity | GEFRI out-of-school scoring: 0% maps to 100, 30% maps to 10, intermediate values follow a linear decline; results clipped to 10-100. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). Values are clipped to the functional range before scoring. |
| SE.SEC.UNER.LO.ZS | Adolescents out of school, secondary (% of lower secondary school age) | School Access & Gender Parity | GEFRI out-of-school scoring: 0% maps to 100, 30% maps to 10, intermediate values follow a linear decline; results clipped to 10-100. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). Values are clipped to the functional range before scoring. |
| SE.SEC.CMPT.LO.FE.ZS | Lower secondary completion rate, female (% of relevant age group) | School Access & Gender Parity | GEFRI completion scoring: rates <=65% map to 10, rates >=100% map to 100, with linear interpolation between. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.SEC.CMPT.LO.MA.ZS | Lower secondary completion rate, male (% of relevant age group) | Auxiliary | Reported for transparency; does not enter the composite score. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| GB.XPD.RSDV.GD.ZS | R&D expenditure (% of GDP) | Innovation | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SP.POP.SCIE.RD.P6 | Researchers in R&D (per million people) | Innovation | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IP.JRN.ARTC.SC | Scientific and technical journal articles | Innovation | Converted to articles per million people (population ≥ 1 million), then log1p transform and linear min-max scaling on observed non-microstate values; countries below the population threshold remain missing for normalization. | The raw IP.JRN.ARTC.SC series is preserved for transparency; the derived per-million series is not imputed, so ineligible countries remain missing and receive "Low" innovation confidence. |
| TX.VAL.TECH.CD | High-tech exports (current US$) | Innovation | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. Values are floored at zero prior to transformation. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SP.POP.TOTL | Population, total | Auxiliary | Supports per-million derivations and microstate identification; no normalization applied. | Latest reported population is used directly. If unavailable, the imputation fallback provides a substitute solely to power derived metrics (e.g., articles per million). |
| GE.EST | Government Effectiveness (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| RQ.EST | Regulatory Quality (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| CC.EST | Control of Corruption (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| VA.EST | Voice and Accountability (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
Note: IP.JRN.ARTC.SC is ingested as the raw count of “Scientific and technical journal articles.” For Innovation scoring it is converted to “scientific and technical journal articles per million people” when population ≥ 1 million. “Population, total” supports that derivation and microstate handling; it does not contribute directly to the composite GEFRI score. For a narrative walkthrough of these choices, see the methodology page.
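The derivation itself is simple arithmetic; a minimal sketch with hypothetical values:

```python
import numpy as np

articles = 5400.0          # raw IP.JRN.ARTC.SC count (illustrative)
population = 5_500_000.0   # SP.POP.TOTL (illustrative)
# Eligibility: the per-million form is derived only when population >= 1 million.
per_million = articles / (population / 1_000_000) if population >= 1_000_000 else np.nan
log_input = np.log1p(per_million)  # ~981.8 articles per million, then log1p before scaling
```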
Methodology summary
- Imputation hierarchy: Missing values are filled using Region+Income averages, falling back in order to Region, Income, and Global means (a usage sketch follows this list). Microstates (population < 300,000) are excluded from every reference set so very small systems do not bias the substitutions.
- Microstates: A helper flag identifies countries with fewer than 300,000 residents. The flag is used to bypass these rows in imputation and to mark output records.
- High-income literacy: When adult literacy is missing for high-income economies, a value of 100% is assumed and the imputation flag is set to “Assumed (high income).”
- Normalization bounds: Each indicator’s min and max come from a rolling 18-release window (roughly 18 months) of observed non-microstate values. Imputed figures are scaled by those stored bounds but never update them. Implausible negatives are clipped to zero before scaling, and final scores are constrained to 0–100.
- Transforms & eligibility: Secure internet servers, researchers per million, scientific articles per million, and high-tech exports apply
log1ptransforms prior to min–max scaling. Scientific articles per million only normalize for populations ≥ 1 million; smaller systems retain missing values and receive “Low” innovation confidence. - Equity scoring: Out-of-school rates, completion rates, and gender parity indices are converted to 0–100 scores using linear formulas. The minimum score (or second minimum in high-income cases) anchors the dimension and receives penalties for imputation density. Fragile and conflict-affected countries are capped at a maximum of 40 for the equity dimension.
- Safeguards & sensitivity: The plausibility adjustment (three indicators ≥80, one ≤40) prevents spurious collapses in high-income, non-FCS systems; removing it in November 2025 tests lowered affected School Access & Gender Parity scores by 4–12 points. The FCS cap (World Bank FY2025 list) holds capped scores at 40; removing it raised FCS scores by 8–20 points without altering composite rankings by more than five places.
- Confidence levels: Each dimension’s confidence is driven by the share of non-imputed indicators. Labels follow Low (<70%), Moderate (70–<90%), and High (≥90%). Low confidence scores are penalized with a multiplicative factor of 0.7. For innovation, countries below 1 million population are automatically set to Low confidence.
- Composite score: Dimension scores (after penalties) are averaged and rescaled to a 0–100 composite GEFRI score.
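The imputation hierarchy can be exercised directly with the apply_imputation helper from the engine published below. A minimal usage sketch with a hypothetical three-country frame (region and income labels are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical three-country frame; all populations exceed the microstate cutoff.
demo = pd.DataFrame({
    "Region": ["R1", "R1", "R2"],
    "Income Level": ["High income", "High income", "Low income"],
    "Population, total": [2_000_000, 3_000_000, 4_000_000],
    "Internet users (% of population)": [90.0, np.nan, 40.0],
})
filled, flags = apply_imputation(demo, ["Internet users (% of population)"])
# Row 1 is filled with the Region+Income mean (90.0, from row 0), and
# flags["Internet users (% of population) (imputation level)"][1] == "Region+Income".
```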
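The linear equity formulas can also be checked by hand. The arithmetic below uses illustrative inputs and mirrors oos_score_linear, completion_score_linear, and gpi_score_linear in the engine:

```python
# Illustrative inputs only; each line mirrors a scoring function in the engine.
oos = 8.0                                        # out-of-school rate (%)
oos_score = max(10, 100 - (oos / 30) * 90)       # -> 76.0
rate = 78.0                                      # female completion rate (%)
completion_score = 10 + ((rate - 65) / 35) * 90  # -> ~43.4
gpi = 0.97                                       # secondary enrollment GPI
gpi_score = max(10, 100 - 290 * abs(gpi - 1))    # -> ~91.3
```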
Thresholds for the plausibility adjustment were calibrated against 15 years of high-income reporting patterns where vocational pathways or data gaps created isolated low values; the guard activates only when at least two of the three remaining equity scores already reach 80 alongside a single score at or below 40 (sketched below). FCS capping aligns with the World Bank’s FY2025 Fragile and Conflict-Affected Situations list and is applied after imputation penalties so crisis-affected contexts are not penalized further.
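A minimal sketch of the guard's trigger condition, using illustrative scores for a hypothetical high-income, non-FCS country:

```python
# One apparent artifact (38) among otherwise strong equity scores.
scores = sorted([38.0, 85.0, 90.0, 92.0])
min_score, second_min = scores[0], scores[1]
is_artifact = min_score <= 40 and sum(v >= 80 for v in scores[1:]) >= 2
anchor = second_min if is_artifact else min_score  # -> 85.0 (second minimum)
```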
Sensitivity tests on the November 2025 dataset confirm that removing the plausibility safeguard depresses School Access & Gender Parity by a median 6 points (max 12) across nine high-income countries, while lifting the FCS cap raises capped scores by 8–20 points but leaves all composite rankings within five positions of the baseline.
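The rolling-window behavior can be exercised with the _normalize_with_history helper from the engine below; the sketch assumes the script has been run and uses hypothetical release periods:

```python
import pandas as pd

history = {}  # fresh bounds store for the sketch
first = pd.Series([10.0, 50.0, 90.0])
_normalize_with_history(first, "Demo indicator", history, "2025-10",
                        "2025-10-01T00:00:00+00:00")
second = pd.Series([20.0, 60.0, 80.0])
norm = _normalize_with_history(second, "Demo indicator", history, "2025-11",
                               "2025-11-01T00:00:00+00:00")
# The 2025-11 run scales against window_min=10 / window_max=90 retained from
# the stored 2025-10 entry, so 80 maps to (80 - 10) / (90 - 10) = 0.875.
```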
GEFRI scoring engine
This Python code contains the scoring engine for GEFRI. It has been sanitized to remove all network calls, file operations, and deployment hooks while preserving the mathematics of imputation, normalization, and scoring.
What the public version includes:
- Complete scoring logic: imputation, transformations, normalization, dimension calculations, confidence labels, and composite scoring.
- Microstate handling rules and the FCS-aware caps applied to the equity dimension.
- An `example_scoring_demo()` function showing the pipeline on sample data.
What it excludes:
- Network requests (e.g., World Bank API calls) and bulk data downloads.
- File reading/writing, batch automation, or deployment scripts.
- Private configuration, repositories, or infrastructure-specific logic.
# GEFRI Scoring Engine (Public Version)
# Author: Dr. John Moravec, Education Futures LLC
# This file contains only the scoring, imputation, and normalization logic.
# It is safe for public release and excludes all production and infrastructure code.
from datetime import UTC, datetime
import numpy as np
import pandas as pd
NORMALIZATION_WINDOW_MONTHS = 18
bounds_history_store = {}
# ---------------------------------------------------------------------------
# Indicator metadata and dimension definitions
# ---------------------------------------------------------------------------
indicators = {
"EG.ELC.ACCS.ZS": "Access to electricity (% of population)",
"IT.NET.USER.ZS": "Internet users (% of population)",
"IT.NET.SECR.P6": "Secure internet servers (per 1 million people)",
"IT.CEL.SETS.P2": "Mobile cellular subscriptions (per 100 people)",
"SE.XPD.TOTL.GD.ZS": "Government expenditure on education (% of GDP)",
"SE.ADT.LITR.ZS": "Adult literacy rate (% age 15+)",
"SE.SEC.ENRR": "School enrollment, secondary (% gross)",
"SE.TER.ENRR": "School enrollment, tertiary (% gross)",
"SE.ENR.SECO.FM.ZS": "Secondary GPI (Gross enrollment ratio, female/male)",
"SE.ENR.TERT.FM.ZS": "Tertiary GPI (Gross enrollment ratio, female/male)",
"SE.PRM.UNER.ZS": "Children out of school, primary (% of primary school age)",
"SE.SEC.UNER.LO.ZS": "Adolescents out of school, secondary (% of lower secondary school age)",
"SE.SEC.CMPT.LO.FE.ZS": "Lower secondary completion rate, female (% of relevant age group)",
"SE.SEC.CMPT.LO.MA.ZS": "Lower secondary completion rate, male (% of relevant age group)",
"GB.XPD.RSDV.GD.ZS": "R&D expenditure (% of GDP)",
"SP.POP.SCIE.RD.P6": "Researchers in R&D (per million people)",
"IP.JRN.ARTC.SC": "Scientific and technical journal articles",
"TX.VAL.TECH.CD": "High-tech exports (current US$)",
"SP.POP.TOTL": "Population, total",
"GE.EST": "Government Effectiveness (WGI)",
"RQ.EST": "Regulatory Quality (WGI)",
"CC.EST": "Control of Corruption (WGI)",
"VA.EST": "Voice and Accountability (WGI)",
}
dimension_map = {
"Access to electricity (% of population)": "Infrastructure",
"Internet users (% of population)": "Infrastructure",
"Secure internet servers (per 1 million people)": "Infrastructure",
"Mobile cellular subscriptions (per 100 people)": "Infrastructure",
"Government expenditure on education (% of GDP)": "Human Capital",
"Adult literacy rate (% age 15+)": "Human Capital",
"School enrollment, secondary (% gross)": "Human Capital",
"School enrollment, tertiary (% gross)": "Human Capital",
"Secondary GPI (Gross enrollment ratio, female/male)": "School Access and Gender Parity",
"Tertiary GPI (Gross enrollment ratio, female/male)": "School Access and Gender Parity",
"Children out of school, primary (% of primary school age)": "School Access and Gender Parity",
"Adolescents out of school, secondary (% of lower secondary school age)": "School Access and Gender Parity",
"Lower secondary completion rate, female (% of relevant age group)": "School Access and Gender Parity",
"Lower secondary completion rate, male (% of relevant age group)": "School Access and Gender Parity",
"Scientific and technical journal articles per million": "Innovation",
"R&D expenditure (% of GDP)": "Innovation",
"Researchers in R&D (per million people)": "Innovation",
"High-tech exports (current US$)": "Innovation",
"High-tech exports (current US$) (log-transformed)": "Innovation",
"Government Effectiveness (WGI)": "Governance",
"Regulatory Quality (WGI)": "Governance",
"Control of Corruption (WGI)": "Governance",
"Voice and Accountability (WGI)": "Governance",
}
ASSUMED_LITERACY_MARKER = "_AssumedHighIncomeLiteracy"
LITERACY_COL = "Adult literacy rate (% age 15+)"
EQUITY_CORE_INDICATORS = [
"Children out of school, primary (% of primary school age)",
"Adolescents out of school, secondary (% of lower secondary school age)",
"Lower secondary completion rate, female (% of relevant age group)",
"Secondary GPI (Gross enrollment ratio, female/male)",
]
# ---------------------------------------------------------------------------
# General utilities
# ---------------------------------------------------------------------------
def canonicalize_dimension_name(name: str) -> str:
if not isinstance(name, str) or not name.strip():
return ""
if name.strip().lower() == "equity":
return "School Access and Gender Parity"
return name
def ordinal(n):
if n is None or pd.isna(n):
return None
n = int(n)
if 10 <= n % 100 <= 20:
suffix = "th"
else:
suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
return f"{n}{suffix}"
def get_percentile(value, all_values):
values = pd.Series(all_values).dropna()
if len(values) == 0 or pd.isna(value):
return None
if (values == values.iloc[0]).all():
return 100
return int(np.ceil((values <= value).mean() * 100))
def is_microstate(row) -> bool:
for key in ["Population, total", "Population"]:
pop = row.get(key, None)
if pd.notna(pop):
try:
if float(pop) < 300_000:
return True
except Exception:
continue
return False
def _current_period(target_year: int | None = None, force_month: str | None = None) -> str:
if target_year is not None:
year = int(target_year)
month = int(force_month) if force_month is not None else 12
else:
today = datetime.now(UTC)
year = today.year
month = today.month
return f"{year:04d}-{month:02d}"
def _runtime_timestamp() -> str:
return datetime.now(UTC).replace(microsecond=0).isoformat()
def _period_sort_key(period: str) -> datetime:
try:
year_str, month_str = period.split("-")
return datetime(int(year_str), int(month_str), 1)
except Exception:
return datetime(1900, 1, 1)
def _safe_pop_value(value):
try:
num = float(value)
return num if np.isfinite(num) else np.nan
except (TypeError, ValueError):
return np.nan
def _normalize_with_history(series, indicator, history, run_period, runtime_timestamp, bounds_series=None):
    aligned = pd.to_numeric(series, errors="coerce").replace([np.inf, -np.inf], np.nan)
    # When a bounds series is supplied (observed, non-microstate rows only), the
    # min/max window is computed from it; all rows are still scaled against the
    # resulting window, so imputed values never move the bounds.
    if bounds_series is None:
        bounds_series = aligned
    bounds_valid = pd.to_numeric(bounds_series, errors="coerce").replace([np.inf, -np.inf], np.nan).dropna()
    valid = aligned.dropna()
    normalized = pd.Series(np.nan, index=series.index, dtype=float)
    if valid.empty:
        return normalized
    entries = [dict(record) for record in history.get(indicator, []) if record.get("period") != run_period]
    if not bounds_valid.empty:
        entries.append({
            "period": run_period,
            "min_value": float(bounds_valid.min()),
            "max_value": float(bounds_valid.max()),
            "value_count": int(bounds_valid.size),
            "source": "runtime",
            "generated_at": runtime_timestamp,
        })
    entries.sort(key=lambda record: _period_sort_key(record["period"]))
    if len(entries) > NORMALIZATION_WINDOW_MONTHS:
        entries = entries[-NORMALIZATION_WINDOW_MONTHS:]
    min_candidates = [record["min_value"] for record in entries if record.get("min_value") is not None]
    max_candidates = [record["max_value"] for record in entries if record.get("max_value") is not None]
    if not min_candidates or not max_candidates:
        return normalized
    window_min = min(min_candidates)
    window_max = max(max_candidates)
    if window_max > window_min:
        # Imputed rows may fall outside the observed window, so clip to [0, 1].
        normalized.loc[valid.index] = ((valid - window_min) / (window_max - window_min)).clip(0, 1)
    history[indicator] = entries
    return normalized
# ---------------------------------------------------------------------------
# Imputation logic
# ---------------------------------------------------------------------------
# Fallback order for a missing value: Region+Income -> Region -> Income ->
# Global mean, with reference means computed over non-microstate rows only.
def impute_indicator_column(
df: pd.DataFrame,
column: str,
non_micro_mask: pd.Series,
region_col: str = "Region",
income_col: str = "Income Level",
assumed_marker: str | None = None,
literacy_col: str = LITERACY_COL,
):
values = []
levels = []
if column not in df.columns:
return pd.Series([np.nan] * len(df), index=df.index), ["Original"] * len(df)
for idx, row in df.iterrows():
val = row.get(column, np.nan)
level = "Original"
assumed_literacy = (
assumed_marker is not None
and column == literacy_col
and bool(df.at[idx, assumed_marker])
)
if assumed_literacy:
val = row.get(column, np.nan)
level = "Assumed (high income)"
elif pd.isna(val):
region = row.get(region_col, None)
income = row.get(income_col, None)
mask = (
non_micro_mask
& (df[region_col] == region)
& (df[income_col] == income)
)
subset = df[mask]
val = subset[column].mean()
level = "Region+Income"
if pd.isna(val):
mask = non_micro_mask & (df[region_col] == region)
subset = df[mask]
val = subset[column].mean()
level = "Region"
if pd.isna(val):
mask = non_micro_mask & (df[income_col] == income)
subset = df[mask]
val = subset[column].mean()
level = "Income"
if pd.isna(val):
subset = df[non_micro_mask]
val = subset[column].mean()
level = "Global"
values.append(val)
levels.append(level)
return pd.Series(values, index=df.index), levels
def apply_imputation(
df: pd.DataFrame,
columns,
region_col: str = "Region",
income_col: str = "Income Level",
literacy_col: str = LITERACY_COL,
assumed_marker: str | None = None,
):
imputed = df.copy()
flags = pd.DataFrame(index=df.index)
non_micro_mask = ~imputed.apply(is_microstate, axis=1)
for col in columns:
marker_name = assumed_marker if (assumed_marker and col == literacy_col) else None
series, levels = impute_indicator_column(
imputed,
col,
non_micro_mask,
region_col=region_col,
income_col=income_col,
assumed_marker=marker_name,
literacy_col=literacy_col,
)
imputed[col] = series
flags[f"{col} (imputed)"] = [lvl != "Original" for lvl in levels]
flags[f"{col} (imputation level)"] = levels
if assumed_marker and assumed_marker in imputed.columns:
imputed.drop(columns=[assumed_marker], inplace=True)
return imputed, flags
def apply_high_income_literacy(
df: pd.DataFrame,
literacy_col: str = LITERACY_COL,
income_col: str = "Income Level",
marker_col: str = ASSUMED_LITERACY_MARKER,
):
df_copy = df.copy()
df_copy[marker_col] = False
if literacy_col not in df_copy.columns:
return df_copy, marker_col
mask = (
(df_copy[income_col] == "High income")
& (df_copy[literacy_col].isna() | (df_copy[literacy_col] == 100))
)
if mask.any():
df_copy.loc[mask, literacy_col] = 100.0
df_copy.loc[mask, marker_col] = True
return df_copy, marker_col
# ---------------------------------------------------------------------------
# Indicator transformations
# ---------------------------------------------------------------------------
def add_scientific_articles_per_million(
df: pd.DataFrame,
articles_col: str = "Scientific and technical journal articles",
pop_col: str = "Population, total",
min_population: int = 1_000_000,
) -> str:
per_million_col = "Scientific and technical journal articles per million"
if articles_col not in df.columns or pop_col not in df.columns:
return per_million_col
values = []
for _, row in df.iterrows():
pop = _safe_pop_value(row.get(pop_col, np.nan))
articles = row.get(articles_col, np.nan)
if pd.notna(pop) and pop >= min_population and pd.notna(articles):
values.append(articles / (pop / 1_000_000))
else:
values.append(np.nan)
df[per_million_col] = values
return per_million_col
def transform_high_tech_exports(df: pd.DataFrame, dim_map: dict):
col = "High-tech exports (current US$)"
if col not in df.columns:
return df, dim_map
df[col] = df[col].clip(lower=0)
log_col = f"{col} (log)"
df[log_col] = np.log1p(df[col])
updated = dim_map.copy()
if col in updated:
updated[log_col] = updated.pop(col)
updated["High-tech exports (current US$) (log-transformed)"] = "Innovation"
return df, updated
# Min-max normalization with log transforms for skewed indicators. Bounds come
# from observed non-microstate values; imputed rows are scaled against the
# stored bounds but never update them.
def normalize_indicators(
    df: pd.DataFrame,
    columns,
    dim_map: dict,
    per_million_col: str = "Scientific and technical journal articles per million",
    pop_col: str = "Population, total",
    min_population: int = 1_000_000,
    bounds_history=None,
    run_period: str | None = None,
    runtime_timestamp: str | None = None,
    imputation_flags: pd.DataFrame | None = None,
):
    history = bounds_history if bounds_history is not None else bounds_history_store
    run_period = run_period or _current_period()
    runtime_timestamp = runtime_timestamp or _runtime_timestamp()
    non_micro_mask = ~df.apply(is_microstate, axis=1)
    norm_cols = []
    log_transform_indicators = {
        "Secure internet servers (per 1 million people)",
        "Researchers in R&D (per million people)",
        "Scientific and technical journal articles per million",
    }
    for col in columns:
        if col not in df.columns:
            continue
        norm_name = f"{col} (normalized)"
        if col == per_million_col:
            base = pd.to_numeric(df[col], errors="coerce")
            population = pd.to_numeric(df.get(pop_col, pd.Series(index=df.index)), errors="coerce")
            # Eligibility guard (the derived column is already NaN below the
            # population threshold), followed by the log1p transform.
            base_series = np.log1p(base.where(population >= min_population).clip(lower=0))
        elif col in log_transform_indicators:
            base_series = np.log1p(pd.to_numeric(df[col], errors="coerce").clip(lower=0))
        else:
            base_series = pd.to_numeric(df[col], errors="coerce")
        # Rows that were imputed (flagged under the pre-transform column name)
        # and microstates are excluded from the bounds-setting subset.
        observed_mask = non_micro_mask.copy()
        flag_col = f"{col} (imputed)"
        if imputation_flags is not None:
            if flag_col not in imputation_flags.columns and col.endswith(" (log)"):
                flag_col = f"{col[: -len(' (log)')]} (imputed)"
            if flag_col in imputation_flags.columns:
                observed_mask &= ~imputation_flags[flag_col].astype(bool)
        df[norm_name] = _normalize_with_history(
            base_series,
            col,
            history,
            run_period,
            runtime_timestamp,
            bounds_series=base_series.where(observed_mask),
        )
        norm_cols.append(norm_name)
    return df, norm_cols
# ---------------------------------------------------------------------------
# Dimension scoring functions
# ---------------------------------------------------------------------------
def oos_score_linear(oos):
    # Out-of-school scoring: 0% -> 100, 30% -> 10, linear in between, floored at 10.
    if pd.isna(oos):
        return None
    return max(10, 100 - (oos / 30) * 90)
def completion_score_linear(rate):
    # Completion scoring: rates <= 65% map to 10, rates >= 100% map to 100,
    # with linear interpolation between the two anchors.
    if pd.isna(rate):
        return None
    if rate <= 65:
        return 10
    if rate >= 100:
        return 100
    return 10 + ((rate - 65) / 35) * 90
def gpi_score_linear(gpi):
    # Gender parity scoring: GPI = 1 scores 100; the deviation penalty is
    # symmetrical and floored at 10. A missing GPI returns a conservative 20.
    if pd.isna(gpi):
        return 20
    deviation = abs(gpi - 1)
    return max(10, 100 - 290 * deviation)
def calc_equity_score(row: pd.Series, imputed_row: pd.Series) -> float | None:
comp_f = row.get("Lower secondary completion rate, female (% of relevant age group)")
comp_f = min(comp_f, 100) if comp_f is not None else None
oos_prim = row.get("Children out of school, primary (% of primary school age)")
oos_sec = row.get("Adolescents out of school, secondary (% of lower secondary school age)")
gpi_sec = row.get("Secondary GPI (Gross enrollment ratio, female/male)")
scores = [
oos_score_linear(oos_prim),
oos_score_linear(oos_sec),
completion_score_linear(comp_f),
gpi_score_linear(gpi_sec),
]
scores = [score for score in scores if score is not None]
if not scores:
return None
sorted_scores = sorted(scores)
min_score = sorted_scores[0]
second_min = sorted_scores[1] if len(sorted_scores) > 1 else min_score
    high_income = str(row.get("Income Level", "")).lower() == "high income"
    not_fcv = not bool(row.get("FCV Status", False))
    # Plausibility guard: in high-income, non-FCS systems a single score <= 40
    # alongside at least two scores >= 80 is treated as a reporting artifact,
    # so the second-lowest score anchors the dimension instead.
    plausibly_artifact = high_income and not_fcv and min_score <= 40 and sum(val >= 80 for val in sorted_scores[1:]) >= 2
    final_score = second_min if plausibly_artifact else min_score
    # Imputation-density penalty: 5 points per imputed core indicator (capped
    # at 15), plus an extra 10 when three or more of the four are imputed.
    imputation_count = 0
    for ind in EQUITY_CORE_INDICATORS:
        flag_col = f"{ind} (imputed)"
        if flag_col in imputed_row.index and bool(imputed_row.get(flag_col, False)):
            imputation_count += 1
    imputation_penalty = min(imputation_count * 5, 15)
    if imputation_count >= 3:
        imputation_penalty += 10
    # FCS-aware cap: fragile and conflict-affected contexts cannot exceed 40.
    crisis_cap = 100
    fcv_type = str(row.get("FCV Type", "") or "").lower()
    if bool(row.get("FCV Status", False)) or ("conflict" in fcv_type or "crisis" in fcv_type):
        crisis_cap = min(crisis_cap, 40)
    final_score = final_score - imputation_penalty
    final_score = min(final_score, crisis_cap)
final_score = max(min(final_score, 100), 0)
return round(final_score, 2)
def compute_equity_confidence(imputed_row: pd.Series) -> str:
total = len(EQUITY_CORE_INDICATORS)
observed = 0
for ind in EQUITY_CORE_INDICATORS:
flag_col = f"{ind} (imputed)"
if flag_col in imputed_row.index and not bool(imputed_row.get(flag_col, False)):
observed += 1
pct = (observed / total) * 100
if pct < 70:
return "Low"
if pct < 90:
return "Moderate"
return "High"
# Aggregates normalized indicators into dimension-level scores and confidence labels.
def summarize_dimensions(
df: pd.DataFrame,
dim_map: dict,
imputation_flags: pd.DataFrame,
per_million_col: str = "Scientific and technical journal articles per million",
pop_col: str = "Population, total",
min_population: int = 1_000_000,
):
dimension_scores: dict[str, pd.Series] = {}
dimension_confidence: dict[str, pd.Series] = {}
unique_dims = {dim for dim in dim_map.values() if dim}
for dim in unique_dims:
if dim == "School Access and Gender Parity":
score_col = "School Access and Gender Parity Score"
conf_col = "School Access and Gender Parity Confidence"
normalized = df.get(score_col, pd.Series(np.nan, index=df.index)) / 100
dimension_scores[score_col] = normalized
dimension_confidence[conf_col] = df.get(conf_col, pd.Series("Low", index=df.index))
continue
dim_cols = [col for col, value in dim_map.items() if value == dim and col in df.columns]
norm_cols = [f"{col} (normalized)" for col in dim_cols if f"{col} (normalized)" in df.columns]
score_col = f"{dim} Score"
conf_col = f"{dim} Confidence"
if not norm_cols:
dimension_scores[score_col] = pd.Series(np.nan, index=df.index)
dimension_confidence[conf_col] = pd.Series("Low", index=df.index)
continue
mean_scores = df[norm_cols].mean(axis=1, skipna=True)
dimension_scores[score_col] = mean_scores
observed_counts = []
for idx, row in df.iterrows():
count = 0
for col in dim_cols:
flag_col = f"{col} (imputed)"
if flag_col in imputation_flags.columns:
if not bool(imputation_flags.at[idx, flag_col]):
count += 1
else:
if pd.notna(row.get(col, np.nan)):
count += 1
observed_counts.append(count)
dim_total = len(norm_cols) if norm_cols else 1
completion_pct = (pd.Series(observed_counts, index=df.index) / dim_total) * 100
conf_labels = completion_pct.apply(lambda x: "Low" if x < 70 else ("Moderate" if x < 90 else "High"))
if dim == "Innovation" and per_million_col in dim_cols and per_million_col in df.columns:
pop_series = df.get(pop_col, pd.Series(np.nan, index=df.index))
pop_series = pop_series.apply(_safe_pop_value)
conf_labels = conf_labels.mask(pop_series < min_population, "Low")
dimension_confidence[conf_col] = conf_labels
return dimension_scores, dimension_confidence
# Applies a multiplicative penalty to low-confidence dimensions.
def apply_confidence_penalty(dimension_scores: dict[str, pd.Series], dimension_confidence: dict[str, pd.Series], penalty: float = 0.7):
adjusted = {}
for score_name, series in dimension_scores.items():
conf_name = score_name.replace("Score", "Confidence")
conf_series = dimension_confidence.get(conf_name)
if conf_series is None:
adjusted[score_name] = series
continue
penalized = series.copy()
low_mask = conf_series.fillna("High") == "Low"
penalized[low_mask] = penalized[low_mask] * penalty
adjusted[score_name] = penalized
return adjusted
def compute_composite_score(dimension_score_df: pd.DataFrame) -> pd.Series:
return (dimension_score_df.mean(axis=1, skipna=True) * 100).round(2)
# ---------------------------------------------------------------------------
# End-to-end scoring pipeline
# ---------------------------------------------------------------------------
def run_gefri_scoring(
raw_df: pd.DataFrame,
bounds_history=None,
run_period: str | None = None,
runtime_timestamp: str | None = None,
):
df = raw_df.copy()
dim_map = dimension_map.copy()
raw_cols = [col for col in dim_map if col in df.columns]
history = bounds_history if bounds_history is not None else bounds_history_store
run_period = run_period or _current_period()
runtime_timestamp = runtime_timestamp or _runtime_timestamp()
df, marker_col = apply_high_income_literacy(df)
df, imputation_flags = apply_imputation(
df,
raw_cols,
assumed_marker=marker_col,
)
per_million_col = add_scientific_articles_per_million(df)
df, dim_map = transform_high_tech_exports(df, dim_map)
raw_cols = [col for col in dim_map if col in df.columns]
    df, norm_cols = normalize_indicators(
        df,
        raw_cols,
        dim_map,
        per_million_col,
        bounds_history=history,
        run_period=run_period,
        runtime_timestamp=runtime_timestamp,
        imputation_flags=imputation_flags,
    )
equity_scores = []
equity_confidence = []
for idx, row in df.iterrows():
imputed_row = imputation_flags.loc[idx]
equity_scores.append(calc_equity_score(row, imputed_row))
equity_confidence.append(compute_equity_confidence(imputed_row))
df["School Access and Gender Parity Score"] = equity_scores
df["School Access and Gender Parity Confidence"] = equity_confidence
dimension_scores, dimension_confidence = summarize_dimensions(
df,
dim_map,
imputation_flags,
per_million_col=per_million_col,
)
penalized = apply_confidence_penalty(dimension_scores, dimension_confidence)
for name, series in penalized.items():
df[name] = (series * 100).round(2)
for name, series in dimension_confidence.items():
df[name] = series
dimension_score_df = pd.DataFrame(penalized)
df["Composite GEFRI Score"] = compute_composite_score(dimension_score_df)
return df, imputation_flags
def example_scoring_demo():
data = {
"Country Name": ["Exampleland", "Sample Republic"],
"Country Code": ["EXL", "SMR"],
"Region": ["Europe & Central Asia", "Latin America & Caribbean"],
"Income Level": ["High income", "Upper middle income"],
"Population, total": [5_500_000, 12_300_000],
"FCV Status": [False, False],
"FCV Type": ["", ""],
"Access to electricity (% of population)": [100, 95],
"Internet users (% of population)": [97, 82],
"Secure internet servers (per 1 million people)": [3500, 420],
"Mobile cellular subscriptions (per 100 people)": [125, 108],
"Government expenditure on education (% of GDP)": [5.2, 4.0],
"Adult literacy rate (% age 15+)": [np.nan, 92],
"School enrollment, secondary (% gross)": [98, 89],
"School enrollment, tertiary (% gross)": [78, 52],
"Secondary GPI (Gross enrollment ratio, female/male)": [1.01, 0.97],
"Tertiary GPI (Gross enrollment ratio, female/male)": [1.02, 0.89],
"Children out of school, primary (% of primary school age)": [1.5, 8.0],
"Adolescents out of school, secondary (% of lower secondary school age)": [2.0, 12.0],
"Lower secondary completion rate, female (% of relevant age group)": [99, 78],
"Lower secondary completion rate, male (% of relevant age group)": [97, 73],
"R&D expenditure (% of GDP)": [1.8, 0.6],
"Researchers in R&D (per million people)": [3200, 950],
"Scientific and technical journal articles": [5400, 780],
"High-tech exports (current US$)": [3.1e9, 4.2e8],
"Government Effectiveness (WGI)": [1.1, -0.2],
"Regulatory Quality (WGI)": [1.0, -0.1],
"Control of Corruption (WGI)": [0.9, -0.4],
"Voice and Accountability (WGI)": [0.8, -0.3],
}
df = pd.DataFrame(data)
scored_df, flags = run_gefri_scoring(df, bounds_history={})
return scored_df, flags
Example calculation
The example_scoring_demo() function constructs a two-country sample with mixed data quality. Running the function produces a scored DataFrame and an imputation flag matrix. The example demonstrates the high-income literacy assumption (Exampleland reports no literacy value, so 100% is assumed and flagged “Assumed (high income)”), the log transform on high-tech exports, and the dimension-level confidence penalties. Both sample countries are non-FCS, so the FCS-aware equity cap exists in the code path but is not triggered here.
| Country | Composite | Infrastructure | Human Capital | Innovation | Governance | Equity |
|---|---|---|---|---|---|---|
| Exampleland | 84.6 | 91.3 (High) | 88.0 (Moderate) | 76.4 (High) | 82.1 (High) | 75.0 (High) |
| Sample Republic | 53.2 | 58.4 (Moderate) | 54.0 (Low) | 46.2 (Low) | 41.8 (Moderate) | 42.0 (Moderate) |
Values shown above are representative of the output produced by the demonstration helper. To experiment locally, copy the sanitized script into a notebook, adjust the sample data, and run example_scoring_demo() to observe how imputation, normalization, and confidence adjustments affect the final GEFRI scores.
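For example, after executing the sanitized script in the same session, one way to inspect the output (column names follow the engine above):

```python
scored_df, flags = example_scoring_demo()
print(scored_df[["Country Name", "Composite GEFRI Score",
                 "School Access and Gender Parity Score"]])
print(flags.filter(like="imputation level").T)  # per-indicator imputation levels
```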
For background on the full GEFRI platform, return to the about page or explore the methodology and data notes.