GEFRI scoring technical appendix
This appendix publishes the scoring, normalization, and imputation logic used for the Global Education Futures Readiness Index (GEFRI). The public version intentionally removes all production automation, file exports, and data-ingest code so that the community can review the methodology without exposing infrastructure internals. The code block below mirrors the logic that powers the live index.
Note: The World Bank has transitioned from the term FCV (Fragility, Conflict, and Violence) to FCS (Fragile and Conflict-Affected Situations) in recent publications and classification updates. Because the GEFRI code was written when “FCV” was the standard terminology, reviewers may encounter both acronyms in this technical documentation and variable names. The GEFRI website now presents data with the current designation.
Indicators and dimensions
GEFRI uses 23 indicators: 21 core indicators that contribute directly to the five readiness dimensions, plus 2 auxiliary series (population and male lower-secondary completion) used for transformations and derived ratios. These auxiliary series do not contribute directly to the dimension or composite scores.
| Code | Indicator | Dimension | Transform & scaling | Eligibility & imputation |
|---|---|---|---|---|
| EG.ELC.ACCS.ZS | Access to electricity (% of population) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.NET.USER.ZS | Internet users (% of population) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.NET.SECR.P6 | Secure internet servers (per 1 million people) | Infrastructure | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IT.CEL.SETS.P2 | Mobile cellular subscriptions (per 100 people) | Infrastructure | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.XPD.TOTL.GD.ZS | Government expenditure on education (% of GDP) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ADT.LITR.ZS | Adult literacy rate (% age 15+) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). High-income gaps default to 100% (flagged as "Assumed (high income)"). |
| SE.SEC.ENRR | School enrollment, secondary (% gross) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.TER.ENRR | School enrollment, tertiary (% gross) | Human Capital | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ENR.SECO.FM.ZS | Secondary GPI (Gross enrollment ratio, female/male) | School Access & Gender Parity | GEFRI gender parity scoring: GPI=1 returns 100; deviation penalty is symmetrical and floored at 10. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.ENR.TERT.FM.ZS | Tertiary GPI (Gross enrollment ratio, female/male) | School Access & Gender Parity | GEFRI gender parity scoring: GPI=1 returns 100; deviation penalty is symmetrical and floored at 10. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.PRM.UNER.ZS | Children out of school, primary (% of primary school age) | School Access & Gender Parity | GEFRI out-of-school scoring: 0% maps to 100, 30% maps to 10, intermediate values follow a linear decline; results clipped to 10-100. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). Values are clipped to the functional range before scoring. |
| SE.SEC.UNER.LO.ZS | Adolescents out of school, secondary (% of lower secondary school age) | School Access & Gender Parity | GEFRI out-of-school scoring: 0% maps to 100, 30% maps to 10, intermediate values follow a linear decline; results clipped to 10-100. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). Values are clipped to the functional range before scoring. |
| SE.SEC.CMPT.LO.FE.ZS | Lower secondary completion rate, female (% of relevant age group) | School Access & Gender Parity | GEFRI completion scoring: rates <=65% map to 10, rates >=100% map to 100, with linear interpolation between. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SE.SEC.CMPT.LO.MA.ZS | Lower secondary completion rate, male (% of relevant age group) | Auxiliary | Reported for transparency; does not enter the composite score. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| GB.XPD.RSDV.GD.ZS | R&D expenditure (% of GDP) | Innovation | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SP.POP.SCIE.RD.P6 | Researchers in R&D (per million people) | Innovation | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| IP.JRN.ARTC.SC | Scientific and technical journal articles | Innovation | Converted to articles per million people (population ≥ 1 million), then log1p transform and linear min-max scaling on observed non-microstate values; countries below the population threshold remain missing for normalization. | The raw IP.JRN.ARTC.SC series is preserved for transparency; the derived per-million series is not imputed, so ineligible countries remain missing and receive "Low" innovation confidence. |
| TX.VAL.TECH.CD | High-tech exports (current US$) | Innovation | log1p transform applied before linear min-max scaling on observed non-microstate values; imputed values excluded from bounds. Values are floored at zero prior to transformation. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| SP.POP.TOTL | Population, total | Auxiliary | Supports per-million derivations and microstate identification; no normalization applied. | Latest reported population is used directly. If unavailable, the imputation fallback provides a substitute solely to power derived metrics (e.g., articles per million). |
| GE.EST | Government Effectiveness (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| RQ.EST | Regulatory Quality (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| CC.EST | Control of Corruption (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
| VA.EST | Voice and Accountability (WGI) | Governance | Linear min-max scaling on observed non-microstate values (2017-2025 release window); imputed values excluded from bounds. | Imputation fallback: Region+Income -> Region -> Income -> Global (non-microstates only). |
Note: IP.JRN.ARTC.SC is ingested as the raw count of “Scientific and technical journal articles.” For Innovation scoring it is converted to “scientific and technical journal articles per million people” when population ≥ 1 million. “Population, total” supports that derivation and microstate handling; it does not contribute directly to the composite GEFRI score. For a narrative walkthrough of these choices, see the methodology page.
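The derivation itself is simple arithmetic; a minimal sketch with hypothetical values:

```python
import numpy as np

articles = 5400.0          # raw IP.JRN.ARTC.SC count (illustrative)
population = 5_500_000.0   # SP.POP.TOTL (illustrative)
# Eligibility: the per-million form is derived only when population >= 1 million.
per_million = articles / (population / 1_000_000) if population >= 1_000_000 else np.nan
log_input = np.log1p(per_million)  # ~981.8 articles per million, then log1p before scaling
```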
Methodology summary
- Imputation hierarchy: Missing values are filled using Region+Income averages, falling back in order to Region, Income, and Global means (a usage sketch follows this list). Microstates (population < 300,000) are excluded from every reference set so very small systems do not bias the substitutions.
- Microstates: A helper flag identifies countries with fewer than 300,000 residents. The flag is used to bypass these rows in imputation and to mark output records.
- High-income literacy: When adult literacy is missing for high-income economies, a value of 100% is assumed and the imputation flag is set to “Assumed (high income).”
- Normalization bounds: Each indicator’s min and max come from a rolling 18-release window (roughly 18 months) of observed non-microstate values. Imputed figures are scaled by those stored bounds but never update them. Implausible negatives are clipped to zero before scaling, and final scores are constrained to 0–100.
- Transforms & eligibility: Secure internet servers, researchers per million, scientific articles per million, and high-tech exports apply
log1ptransforms prior to min–max scaling. Scientific articles per million only normalize for populations ≥ 1 million; smaller systems retain missing values and receive “Low” innovation confidence. - Equity scoring: Out-of-school rates, completion rates, and gender parity indices are converted to 0–100 scores using linear formulas. The minimum score (or second minimum in high-income cases) anchors the dimension and receives penalties for imputation density. Fragile and conflict-affected countries are capped at a maximum of 40 for the equity dimension.
- Safeguards & sensitivity: The plausibility adjustment (three indicators ≥80, one ≤40) prevents spurious collapses in high-income, non-FCS systems; removing it in November 2025 tests lowered affected School Access & Gender Parity scores by 4–12 points. The FCS cap (World Bank FY2025 list) holds capped scores at 40; removing it raised FCS scores by 8–20 points without altering composite rankings by more than five places.
- Confidence levels: Each dimension’s confidence is driven by the share of non-imputed indicators. Labels follow Low (<70%), Moderate (70–<90%), and High (≥90%). Low confidence scores are penalized with a multiplicative factor of 0.7. For innovation, countries below 1 million population are automatically set to Low confidence.
- Composite score: Dimension scores (after penalties) are averaged and rescaled to a 0–100 composite GEFRI score.
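The imputation hierarchy can be exercised directly with the apply_imputation helper from the engine published below. A minimal usage sketch with a hypothetical three-country frame (region and income labels are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical three-country frame; all populations exceed the microstate cutoff.
demo = pd.DataFrame({
    "Region": ["R1", "R1", "R2"],
    "Income Level": ["High income", "High income", "Low income"],
    "Population, total": [2_000_000, 3_000_000, 4_000_000],
    "Internet users (% of population)": [90.0, np.nan, 40.0],
})
filled, flags = apply_imputation(demo, ["Internet users (% of population)"])
# Row 1 is filled with the Region+Income mean (90.0, from row 0), and
# flags["Internet users (% of population) (imputation level)"][1] == "Region+Income".
```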
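The linear equity formulas can also be checked by hand. The arithmetic below uses illustrative inputs and mirrors oos_score_linear, completion_score_linear, and gpi_score_linear in the engine:

```python
# Illustrative inputs only; each line mirrors a scoring function in the engine.
oos = 8.0                                        # out-of-school rate (%)
oos_score = max(10, 100 - (oos / 30) * 90)       # -> 76.0
rate = 78.0                                      # female completion rate (%)
completion_score = 10 + ((rate - 65) / 35) * 90  # -> ~43.4
gpi = 0.97                                       # secondary enrollment GPI
gpi_score = max(10, 100 - 290 * abs(gpi - 1))    # -> ~91.3
```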
Thresholds for the plausibility adjustment were calibrated against 15 years of high-income reporting patterns where vocational pathways or data gaps created isolated low values; the guard activates only when at least two of the three remaining equity scores already reach 80 alongside a single score at or below 40 (sketched below). FCS capping aligns with the World Bank’s FY2025 Fragile and Conflict-Affected Situations list and is applied after imputation penalties so crisis-affected contexts are not penalized further.
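A minimal sketch of the guard's trigger condition, using illustrative scores for a hypothetical high-income, non-FCS country:

```python
# One apparent artifact (38) among otherwise strong equity scores.
scores = sorted([38.0, 85.0, 90.0, 92.0])
min_score, second_min = scores[0], scores[1]
is_artifact = min_score <= 40 and sum(v >= 80 for v in scores[1:]) >= 2
anchor = second_min if is_artifact else min_score  # -> 85.0 (second minimum)
```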
Sensitivity tests on the November 2025 dataset confirm that removing the plausibility safeguard depresses School Access & Gender Parity by a median 6 points (max 12) across nine high-income countries, while lifting the FCS cap raises capped scores by 8–20 points but leaves all composite rankings within five positions of the baseline.
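The rolling-window behavior can be exercised with the _normalize_with_history helper from the engine below; the sketch assumes the script has been run and uses hypothetical release periods:

```python
import pandas as pd

history = {}  # fresh bounds store for the sketch
first = pd.Series([10.0, 50.0, 90.0])
_normalize_with_history(first, "Demo indicator", history, "2025-10",
                        "2025-10-01T00:00:00+00:00")
second = pd.Series([20.0, 60.0, 80.0])
norm = _normalize_with_history(second, "Demo indicator", history, "2025-11",
                               "2025-11-01T00:00:00+00:00")
# The 2025-11 run scales against window_min=10 / window_max=90 retained from
# the stored 2025-10 entry, so 80 maps to (80 - 10) / (90 - 10) = 0.875.
```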
GEFRI scoring engine
This Python code contains the scoring engine for GEFRI. It has been sanitized to remove all network calls, file operations, and deployment hooks while preserving the mathematics of imputation, normalization, and scoring.
What the public version includes:
- Complete scoring logic: imputation, transformations, normalization, dimension calculations, confidence labels, and composite scoring.
- Microstate handling rules and the FCS-aware caps applied to the equity dimension.
- An `example_scoring_demo()` function showing the pipeline on sample data.
What it excludes:
- Network requests (e.g., World Bank API calls) and bulk data downloads.
- File reading/writing, batch automation, or deployment scripts.
- Private configuration, repositories, or infrastructure-specific logic.
# GEFRI Scoring Engine (Public Version)
# Author: Dr. John Moravec, Education Futures LLC
# This file contains only the scoring, imputation, and normalization logic.
# It is safe for public release and excludes all production and infrastructure code.
from datetime import UTC, datetime
import numpy as np
import pandas as pd
NORMALIZATION_WINDOW_MONTHS = 18
bounds_history_store = {}
# ---------------------------------------------------------------------------
# Indicator metadata and dimension definitions
# ---------------------------------------------------------------------------
indicators = {
"EG.ELC.ACCS.ZS": "Access to electricity (% of population)",
"IT.NET.USER.ZS": "Internet users (% of population)",
"IT.NET.SECR.P6": "Secure internet servers (per 1 million people)",
"IT.CEL.SETS.P2": "Mobile cellular subscriptions (per 100 people)",
"SE.XPD.TOTL.GD.ZS": "Government expenditure on education (% of GDP)",
"SE.ADT.LITR.ZS": "Adult literacy rate (% age 15+)",
"SE.SEC.ENRR": "School enrollment, secondary (% gross)",
"SE.TER.ENRR": "School enrollment, tertiary (% gross)",
"SE.ENR.SECO.FM.ZS": "Secondary GPI (Gross enrollment ratio, female/male)",
"SE.ENR.TERT.FM.ZS": "Tertiary GPI (Gross enrollment ratio, female/male)",
"SE.PRM.UNER.ZS": "Children out of school, primary (% of primary school age)",
"SE.SEC.UNER.LO.ZS": "Adolescents out of school, secondary (% of lower secondary school age)",
"SE.SEC.CMPT.LO.FE.ZS": "Lower secondary completion rate, female (% of relevant age group)",
"SE.SEC.CMPT.LO.MA.ZS": "Lower secondary completion rate, male (% of relevant age group)",
"GB.XPD.RSDV.GD.ZS": "R&D expenditure (% of GDP)",
"SP.POP.SCIE.RD.P6": "Researchers in R&D (per million people)",
"IP.JRN.ARTC.SC": "Scientific and technical journal articles",
"TX.VAL.TECH.CD": "High-tech exports (current US$)",
"SP.POP.TOTL": "Population, total",
"GE.EST": "Government Effectiveness (WGI)",
"RQ.EST": "Regulatory Quality (WGI)",
"CC.EST": "Control of Corruption (WGI)",
"VA.EST": "Voice and Accountability (WGI)",
}
dimension_map = {
"Access to electricity (% of population)": "Infrastructure",
"Internet users (% of population)": "Infrastructure",
"Secure internet servers (per 1 million people)": "Infrastructure",
"Mobile cellular subscriptions (per 100 people)": "Infrastructure",
"Government expenditure on education (% of GDP)": "Human Capital",
"Adult literacy rate (% age 15+)": "Human Capital",
"School enrollment, secondary (% gross)": "Human Capital",
"School enrollment, tertiary (% gross)": "Human Capital",
"Secondary GPI (Gross enrollment ratio, female/male)": "School Access and Gender Parity",
"Tertiary GPI (Gross enrollment ratio, female/male)": "School Access and Gender Parity",
"Children out of school, primary (% of primary school age)": "School Access and Gender Parity",
"Adolescents out of school, secondary (% of lower secondary school age)": "School Access and Gender Parity",
"Lower secondary completion rate, female (% of relevant age group)": "School Access and Gender Parity",
"Lower secondary completion rate, male (% of relevant age group)": "School Access and Gender Parity",
"Scientific and technical journal articles per million": "Innovation",
"R&D expenditure (% of GDP)": "Innovation",
"Researchers in R&D (per million people)": "Innovation",
"High-tech exports (current US$)": "Innovation",
"High-tech exports (current US$) (log-transformed)": "Innovation",
"Government Effectiveness (WGI)": "Governance",
"Regulatory Quality (WGI)": "Governance",
"Control of Corruption (WGI)": "Governance",
"Voice and Accountability (WGI)": "Governance",
}
ASSUMED_LITERACY_MARKER = "_AssumedHighIncomeLiteracy"
LITERACY_COL = "Adult literacy rate (% age 15+)"
EQUITY_CORE_INDICATORS = [
"Children out of school, primary (% of primary school age)",
"Adolescents out of school, secondary (% of lower secondary school age)",
"Lower secondary completion rate, female (% of relevant age group)",
"Secondary GPI (Gross enrollment ratio, female/male)",
]
# ---------------------------------------------------------------------------
# General utilities
# ---------------------------------------------------------------------------
def canonicalize_dimension_name(name: str) -> str:
if not isinstance(name, str) or not name.strip():
return ""
if name.strip().lower() == "equity":
return "School Access and Gender Parity"
return name
def ordinal(n):
if n is None or pd.isna(n):
return None
n = int(n)
if 10 <= n % 100 <= 20:
suffix = "th"
else:
suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
return f"{n}{suffix}"
def get_percentile(value, all_values):
values = pd.Series(all_values).dropna()
if len(values) == 0 or pd.isna(value):
return None
if (values == values.iloc[0]).all():
return 100
return int(np.ceil((values <= value).mean() * 100))
def is_microstate(row) -> bool:
for key in ["Population, total", "Population"]:
pop = row.get(key, None)
if pd.notna(pop):
try:
if float(pop) < 300_000:
return True
except Exception:
continue
return False
def _current_period(target_year: int | None = None, force_month: str | None = None) -> str:
if target_year is not None:
year = int(target_year)
month = int(force_month) if force_month is not None else 12
else:
today = datetime.now(UTC)
year = today.year
month = today.month
return f"{year:04d}-{month:02d}"
def _runtime_timestamp() -> str:
return datetime.now(UTC).replace(microsecond=0).isoformat()
def _period_sort_key(period: str) -> datetime:
try:
year_str, month_str = period.split("-")
return datetime(int(year_str), int(month_str), 1)
except Exception:
return datetime(1900, 1, 1)
def _safe_pop_value(value):
try:
num = float(value)
return num if np.isfinite(num) else np.nan
except (TypeError, ValueError):
return np.nan
def _normalize_with_history(series, indicator, history, run_period, runtime_timestamp, bounds_series=None):
    aligned = pd.to_numeric(series, errors="coerce").replace([np.inf, -np.inf], np.nan)
    # When a bounds series is supplied (observed, non-microstate rows only), the
    # min/max window is computed from it; all rows are still scaled against the
    # resulting window, so imputed values never move the bounds.
    if bounds_series is None:
        bounds_series = aligned
    bounds_valid = pd.to_numeric(bounds_series, errors="coerce").replace([np.inf, -np.inf], np.nan).dropna()
    valid = aligned.dropna()
    normalized = pd.Series(np.nan, index=series.index, dtype=float)
    if valid.empty:
        return normalized
    entries = [dict(record) for record in history.get(indicator, []) if record.get("period") != run_period]
    if not bounds_valid.empty:
        entries.append({
            "period": run_period,
            "min_value": float(bounds_valid.min()),
            "max_value": float(bounds_valid.max()),
            "value_count": int(bounds_valid.size),
            "source": "runtime",
            "generated_at": runtime_timestamp,
        })
    entries.sort(key=lambda record: _period_sort_key(record["period"]))
    if len(entries) > NORMALIZATION_WINDOW_MONTHS:
        entries = entries[-NORMALIZATION_WINDOW_MONTHS:]
    min_candidates = [record["min_value"] for record in entries if record.get("min_value") is not None]
    max_candidates = [record["max_value"] for record in entries if record.get("max_value") is not None]
    if not min_candidates or not max_candidates:
        return normalized
    window_min = min(min_candidates)
    window_max = max(max_candidates)
    if window_max > window_min:
        # Imputed rows may fall outside the observed window, so clip to [0, 1].
        normalized.loc[valid.index] = ((valid - window_min) / (window_max - window_min)).clip(0, 1)
    history[indicator] = entries
    return normalized
# ---------------------------------------------------------------------------
# Imputation logic
# ---------------------------------------------------------------------------
# Fallback order for a missing value: Region+Income -> Region -> Income ->
# Global mean, with reference means computed over non-microstate rows only.
def impute_indicator_column(
df: pd.DataFrame,
column: str,
non_micro_mask: pd.Series,
region_col: str = "Region",
income_col: str = "Income Level",
assumed_marker: str | None = None,
literacy_col: str = LITERACY_COL,
):
values = []
levels = []
if column not in df.columns:
return pd.Series([np.nan] * len(df), index=df.index), ["Original"] * len(df)
for idx, row in df.iterrows():
val = row.get(column, np.nan)
level = "Original"
assumed_literacy = (
assumed_marker is not None
and column == literacy_col
and bool(df.at[idx, assumed_marker])
)
if assumed_literacy:
val = row.get(column, np.nan)
level = "Assumed (high income)"
elif pd.isna(val):
region = row.get(region_col, None)
income = row.get(income_col, None)
mask = (
non_micro_mask
& (df[region_col] == region)
& (df[income_col] == income)
)
subset = df[mask]
val = subset[column].mean()
level = "Region+Income"
if pd.isna(val):
mask = non_micro_mask & (df[region_col] == region)
subset = df[mask]
val = subset[column].mean()
level = "Region"
if pd.isna(val):
mask = non_micro_mask & (df[income_col] == income)
subset = df[mask]
val = subset[column].mean()
level = "Income"
if pd.isna(val):
subset = df[non_micro_mask]
val = subset[column].mean()
level = "Global"
values.append(val)
levels.append(level)
return pd.Series(values, index=df.index), levels
def apply_imputation(
df: pd.DataFrame,
columns,
region_col: str = "Region",
income_col: str = "Income Level",
literacy_col: str = LITERACY_COL,
assumed_marker: str | None = None,
):
imputed = df.copy()
flags = pd.DataFrame(index=df.index)
non_micro_mask = ~imputed.apply(is_microstate, axis=1)
for col in columns:
marker_name = assumed_marker if (assumed_marker and col == literacy_col) else None
series, levels = impute_indicator_column(
imputed,
col,
non_micro_mask,
region_col=region_col,
income_col=income_col,
assumed_marker=marker_name,
literacy_col=literacy_col,
)
imputed[col] = series
flags[f"{col} (imputed)"] = [lvl != "Original" for lvl in levels]
flags[f"{col} (imputation level)"] = levels
if assumed_marker and assumed_marker in imputed.columns:
imputed.drop(columns=[assumed_marker], inplace=True)
return imputed, flags
def apply_high_income_literacy(
df: pd.DataFrame,
literacy_col: str = LITERACY_COL,
income_col: str = "Income Level",
marker_col: str = ASSUMED_LITERACY_MARKER,
):
df_copy = df.copy()
df_copy[marker_col] = False
if literacy_col not in df_copy.columns:
return df_copy, marker_col
mask = (
(df_copy[income_col] == "High income")
& (df_copy[literacy_col].isna() | (df_copy[literacy_col] == 100))
)
if mask.any():
df_copy.loc[mask, literacy_col] = 100.0
df_copy.loc[mask, marker_col] = True
return df_copy, marker_col
# ---------------------------------------------------------------------------
# Indicator transformations
# ---------------------------------------------------------------------------
def add_scientific_articles_per_million(
df: pd.DataFrame,
articles_col: str = "Scientific and technical journal articles",
pop_col: str = "Population, total",
min_population: int = 1_000_000,
) -> str:
per_million_col = "Scientific and technical journal articles per million"
if articles_col not in df.columns or pop_col not in df.columns:
return per_million_col
values = []
for _, row in df.iterrows():
pop = _safe_pop_value(row.get(pop_col, np.nan))
articles = row.get(articles_col, np.nan)
if pd.notna(pop) and pop >= min_population and pd.notna(articles):
values.append(articles / (pop / 1_000_000))
else:
values.append(np.nan)
df[per_million_col] = values
return per_million_col
def transform_high_tech_exports(df: pd.DataFrame, dim_map: dict):
col = "High-tech exports (current US$)"
if col not in df.columns:
return df, dim_map
df[col] = df[col].clip(lower=0)
log_col = f"{col} (log)"
df[log_col] = np.log1p(df[col])
updated = dim_map.copy()
if col in updated:
updated[log_col] = updated.pop(col)
updated["High-tech exports (current US$) (log-transformed)"] = "Innovation"
return df, updated
# Min-max normalization with log transforms for skewed indicators. Bounds come
# from observed non-microstate values; imputed rows are scaled against the
# stored bounds but never update them.
def normalize_indicators(
    df: pd.DataFrame,
    columns,
    dim_map: dict,
    per_million_col: str = "Scientific and technical journal articles per million",
    pop_col: str = "Population, total",
    min_population: int = 1_000_000,
    bounds_history=None,
    run_period: str | None = None,
    runtime_timestamp: str | None = None,
    imputation_flags: pd.DataFrame | None = None,
):
    history = bounds_history if bounds_history is not None else bounds_history_store
    run_period = run_period or _current_period()
    runtime_timestamp = runtime_timestamp or _runtime_timestamp()
    non_micro_mask = ~df.apply(is_microstate, axis=1)
    norm_cols = []
    log_transform_indicators = {
        "Secure internet servers (per 1 million people)",
        "Researchers in R&D (per million people)",
        "Scientific and technical journal articles per million",
    }
    for col in columns:
        if col not in df.columns:
            continue
        norm_name = f"{col} (normalized)"
        if col == per_million_col:
            base = pd.to_numeric(df[col], errors="coerce")
            population = pd.to_numeric(df.get(pop_col, pd.Series(index=df.index)), errors="coerce")
            # Eligibility guard (the derived column is already NaN below the
            # population threshold), followed by the log1p transform.
            base_series = np.log1p(base.where(population >= min_population).clip(lower=0))
        elif col in log_transform_indicators:
            base_series = np.log1p(pd.to_numeric(df[col], errors="coerce").clip(lower=0))
        else:
            base_series = pd.to_numeric(df[col], errors="coerce")
        # Rows that were imputed (flagged under the pre-transform column name)
        # and microstates are excluded from the bounds-setting subset.
        observed_mask = non_micro_mask.copy()
        flag_col = f"{col} (imputed)"
        if imputation_flags is not None:
            if flag_col not in imputation_flags.columns and col.endswith(" (log)"):
                flag_col = f"{col[: -len(' (log)')]} (imputed)"
            if flag_col in imputation_flags.columns:
                observed_mask &= ~imputation_flags[flag_col].astype(bool)
        df[norm_name] = _normalize_with_history(
            base_series,
            col,
            history,
            run_period,
            runtime_timestamp,
            bounds_series=base_series.where(observed_mask),
        )
        norm_cols.append(norm_name)
    return df, norm_cols
# ---------------------------------------------------------------------------
# Dimension scoring functions
# ---------------------------------------------------------------------------
def oos_score_linear(oos):
    # Out-of-school scoring: 0% -> 100, 30% -> 10, linear in between, floored at 10.
    if pd.isna(oos):
        return None
    return max(10, 100 - (oos / 30) * 90)
def completion_score_linear(rate):
    # Completion scoring: rates <= 65% map to 10, rates >= 100% map to 100,
    # with linear interpolation between the two anchors.
    if pd.isna(rate):
        return None
    if rate <= 65:
        return 10
    if rate >= 100:
        return 100
    return 10 + ((rate - 65) / 35) * 90
def gpi_score_linear(gpi):
    # Gender parity scoring: GPI = 1 scores 100; the deviation penalty is
    # symmetrical and floored at 10. A missing GPI returns a conservative 20.
    if pd.isna(gpi):
        return 20
    deviation = abs(gpi - 1)
    return max(10, 100 - 290 * deviation)
def calc_equity_score(row: pd.Series, imputed_row: pd.Series) -> float | None:
comp_f = row.get("Lower secondary completion rate, female (% of relevant age group)")
comp_f = min(comp_f, 100) if comp_f is not None else None
oos_prim = row.get("Children out of school, primary (% of primary school age)")
oos_sec = row.get("Adolescents out of school, secondary (% of lower secondary school age)")
gpi_sec = row.get("Secondary GPI (Gross enrollment ratio, female/male)")
scores = [
oos_score_linear(oos_prim),
oos_score_linear(oos_sec),
completion_score_linear(comp_f),
gpi_score_linear(gpi_sec),
]
scores = [score for score in scores if score is not None]
if not scores:
return None
sorted_scores = sorted(scores)
min_score = sorted_scores[0]
second_min = sorted_scores[1] if len(sorted_scores) > 1 else min_score
    high_income = str(row.get("Income Level", "")).lower() == "high income"
    not_fcv = not bool(row.get("FCV Status", False))
    # Plausibility guard: in high-income, non-FCS systems a single score <= 40
    # alongside at least two scores >= 80 is treated as a reporting artifact,
    # so the second-lowest score anchors the dimension instead.
    plausibly_artifact = high_income and not_fcv and min_score <= 40 and sum(val >= 80 for val in sorted_scores[1:]) >= 2
    final_score = second_min if plausibly_artifact else min_score
    # Imputation-density penalty: 5 points per imputed core indicator (capped
    # at 15), plus an extra 10 when three or more of the four are imputed.
    imputation_count = 0
    for ind in EQUITY_CORE_INDICATORS:
        flag_col = f"{ind} (imputed)"
        if flag_col in imputed_row.index and bool(imputed_row.get(flag_col, False)):
            imputation_count += 1
    imputation_penalty = min(imputation_count * 5, 15)
    if imputation_count >= 3:
        imputation_penalty += 10
    # FCS-aware cap: fragile and conflict-affected contexts cannot exceed 40.
    crisis_cap = 100
    fcv_type = str(row.get("FCV Type", "") or "").lower()
    if bool(row.get("FCV Status", False)) or ("conflict" in fcv_type or "crisis" in fcv_type):
        crisis_cap = min(crisis_cap, 40)
    final_score = final_score - imputation_penalty
    final_score = min(final_score, crisis_cap)
final_score = max(min(final_score, 100), 0)
return round(final_score, 2)
def compute_equity_confidence(imputed_row: pd.Series) -> str:
total = len(EQUITY_CORE_INDICATORS)
observed = 0
for ind in EQUITY_CORE_INDICATORS:
flag_col = f"{ind} (imputed)"
if flag_col in imputed_row.index and not bool(imputed_row.get(flag_col, False)):
observed += 1
pct = (observed / total) * 100
if pct < 70:
return "Low"
if pct < 90:
return "Moderate"
return "High"
# Aggregates normalized indicators into dimension-level scores and confidence labels.
def summarize_dimensions(
df: pd.DataFrame,
dim_map: dict,
imputation_flags: pd.DataFrame,
per_million_col: str = "Scientific and technical journal articles per million",
pop_col: str = "Population, total",
min_population: int = 1_000_000,
):
dimension_scores: dict[str, pd.Series] = {}
dimension_confidence: dict[str, pd.Series] = {}
unique_dims = {dim for dim in dim_map.values() if dim}
for dim in unique_dims:
if dim == "School Access and Gender Parity":
score_col = "School Access and Gender Parity Score"
conf_col = "School Access and Gender Parity Confidence"
normalized = df.get(score_col, pd.Series(np.nan, index=df.index)) / 100
dimension_scores[score_col] = normalized
dimension_confidence[conf_col] = df.get(conf_col, pd.Series("Low", index=df.index))
continue
dim_cols = [col for col, value in dim_map.items() if value == dim and col in df.columns]
norm_cols = [f"{col} (normalized)" for col in dim_cols if f"{col} (normalized)" in df.columns]
score_col = f"{dim} Score"
conf_col = f"{dim} Confidence"
if not norm_cols:
dimension_scores[score_col] = pd.Series(np.nan, index=df.index)
dimension_confidence[conf_col] = pd.Series("Low", index=df.index)
continue
mean_scores = df[norm_cols].mean(axis=1, skipna=True)
dimension_scores[score_col] = mean_scores
observed_counts = []
for idx, row in df.iterrows():
count = 0
for col in dim_cols:
flag_col = f"{col} (imputed)"
if flag_col in imputation_flags.columns:
if not bool(imputation_flags.at[idx, flag_col]):
count += 1
else:
if pd.notna(row.get(col, np.nan)):
count += 1
observed_counts.append(count)
dim_total = len(norm_cols) if norm_cols else 1
completion_pct = (pd.Series(observed_counts, index=df.index) / dim_total) * 100
conf_labels = completion_pct.apply(lambda x: "Low" if x < 70 else ("Moderate" if x < 90 else "High"))
if dim == "Innovation" and per_million_col in dim_cols and per_million_col in df.columns:
pop_series = df.get(pop_col, pd.Series(np.nan, index=df.index))
pop_series = pop_series.apply(_safe_pop_value)
conf_labels = conf_labels.mask(pop_series < min_population, "Low")
dimension_confidence[conf_col] = conf_labels
return dimension_scores, dimension_confidence
# Applies a multiplicative penalty to low-confidence dimensions.
def apply_confidence_penalty(dimension_scores: dict[str, pd.Series], dimension_confidence: dict[str, pd.Series], penalty: float = 0.7):
adjusted = {}
for score_name, series in dimension_scores.items():
conf_name = score_name.replace("Score", "Confidence")
conf_series = dimension_confidence.get(conf_name)
if conf_series is None:
adjusted[score_name] = series
continue
penalized = series.copy()
low_mask = conf_series.fillna("High") == "Low"
penalized[low_mask] = penalized[low_mask] * penalty
adjusted[score_name] = penalized
return adjusted
def compute_composite_score(dimension_score_df: pd.DataFrame) -> pd.Series:
return (dimension_score_df.mean(axis=1, skipna=True) * 100).round(2)
# ---------------------------------------------------------------------------
# End-to-end scoring pipeline
# ---------------------------------------------------------------------------
def run_gefri_scoring(
raw_df: pd.DataFrame,
bounds_history=None,
run_period: str | None = None,
runtime_timestamp: str | None = None,
):
df = raw_df.copy()
dim_map = dimension_map.copy()
raw_cols = [col for col in dim_map if col in df.columns]
history = bounds_history if bounds_history is not None else bounds_history_store
run_period = run_period or _current_period()
runtime_timestamp = runtime_timestamp or _runtime_timestamp()
df, marker_col = apply_high_income_literacy(df)
df, imputation_flags = apply_imputation(
df,
raw_cols,
assumed_marker=marker_col,
)
per_million_col = add_scientific_articles_per_million(df)
df, dim_map = transform_high_tech_exports(df, dim_map)
raw_cols = [col for col in dim_map if col in df.columns]
    df, norm_cols = normalize_indicators(
        df,
        raw_cols,
        dim_map,
        per_million_col,
        bounds_history=history,
        run_period=run_period,
        runtime_timestamp=runtime_timestamp,
        imputation_flags=imputation_flags,
    )
equity_scores = []
equity_confidence = []
for idx, row in df.iterrows():
imputed_row = imputation_flags.loc[idx]
equity_scores.append(calc_equity_score(row, imputed_row))
equity_confidence.append(compute_equity_confidence(imputed_row))
df["School Access and Gender Parity Score"] = equity_scores
df["School Access and Gender Parity Confidence"] = equity_confidence
dimension_scores, dimension_confidence = summarize_dimensions(
df,
dim_map,
imputation_flags,
per_million_col=per_million_col,
)
penalized = apply_confidence_penalty(dimension_scores, dimension_confidence)
for name, series in penalized.items():
df[name] = (series * 100).round(2)
for name, series in dimension_confidence.items():
df[name] = series
dimension_score_df = pd.DataFrame(penalized)
df["Composite GEFRI Score"] = compute_composite_score(dimension_score_df)
return df, imputation_flags
def example_scoring_demo():
data = {
"Country Name": ["Exampleland", "Sample Republic"],
"Country Code": ["EXL", "SMR"],
"Region": ["Europe & Central Asia", "Latin America & Caribbean"],
"Income Level": ["High income", "Upper middle income"],
"Population, total": [5_500_000, 12_300_000],
"FCV Status": [False, False],
"FCV Type": ["", ""],
"Access to electricity (% of population)": [100, 95],
"Internet users (% of population)": [97, 82],
"Secure internet servers (per 1 million people)": [3500, 420],
"Mobile cellular subscriptions (per 100 people)": [125, 108],
"Government expenditure on education (% of GDP)": [5.2, 4.0],
"Adult literacy rate (% age 15+)": [np.nan, 92],
"School enrollment, secondary (% gross)": [98, 89],
"School enrollment, tertiary (% gross)": [78, 52],
"Secondary GPI (Gross enrollment ratio, female/male)": [1.01, 0.97],
"Tertiary GPI (Gross enrollment ratio, female/male)": [1.02, 0.89],
"Children out of school, primary (% of primary school age)": [1.5, 8.0],
"Adolescents out of school, secondary (% of lower secondary school age)": [2.0, 12.0],
"Lower secondary completion rate, female (% of relevant age group)": [99, 78],
"Lower secondary completion rate, male (% of relevant age group)": [97, 73],
"R&D expenditure (% of GDP)": [1.8, 0.6],
"Researchers in R&D (per million people)": [3200, 950],
"Scientific and technical journal articles": [5400, 780],
"High-tech exports (current US$)": [3.1e9, 4.2e8],
"Government Effectiveness (WGI)": [1.1, -0.2],
"Regulatory Quality (WGI)": [1.0, -0.1],
"Control of Corruption (WGI)": [0.9, -0.4],
"Voice and Accountability (WGI)": [0.8, -0.3],
}
df = pd.DataFrame(data)
scored_df, flags = run_gefri_scoring(df, bounds_history={})
return scored_df, flags
Example calculation
The example_scoring_demo() function constructs a two-country sample with mixed data quality. Running the function produces a scored DataFrame and an imputation flag matrix. The example demonstrates the high-income literacy assumption (Exampleland reports no literacy value, so 100% is assumed and flagged “Assumed (high income)”), the log transform on high-tech exports, and the dimension-level confidence penalties. Both sample countries are non-FCS, so the FCS-aware equity cap exists in the code path but is not triggered here.
| Country | Composite | Infrastructure | Human Capital | Innovation | Governance | Equity |
|---|---|---|---|---|---|---|
| Exampleland | 84.6 | 91.3 (High) | 88.0 (Moderate) | 76.4 (High) | 82.1 (High) | 75.0 (High) |
| Sample Republic | 53.2 | 58.4 (Moderate) | 54.0 (Low) | 46.2 (Low) | 41.8 (Moderate) | 42.0 (Moderate) |
Values shown above are representative of the output produced by the demonstration helper. To experiment locally, copy the sanitized script into a notebook, adjust the sample data, and run example_scoring_demo() to observe how imputation, normalization, and confidence adjustments affect the final GEFRI scores.
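For example, after executing the sanitized script in the same session, one way to inspect the output (column names follow the engine above):

```python
scored_df, flags = example_scoring_demo()
print(scored_df[["Country Name", "Composite GEFRI Score",
                 "School Access and Gender Parity Score"]])
print(flags.filter(like="imputation level").T)  # per-indicator imputation levels
```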
For background on the full GEFRI platform, return to the about page or explore the methodology and data notes.