
THE INDUS SCRIPT PROJECT
Structural Decipherment of a 4,000-Year-Old Information System
Kriger, B., & Hunt, T. A. (2026). Positional constraints, sequence uniqueness, and stroke numerals in Indus seal inscriptions from Mohenjo-Daro: a statistical analysis. IIIR Computational Humanities and Cultural Systems. https://doi.org/10.5281/zenodo.19103880
For a hundred years, the Indus script has been called the most important undeciphered writing system in the world. Scholars have tried to crack it by searching for the language hidden inside. They all failed — because the Indus seals were never writing in the conventional sense. They were something far more remarkable: the world’s first structured information system.
In March 2026, researchers Boris Kriger and Treasure A. Hunt, working with AI-assisted computation (Claude Opus 4.6, Anthropic), performed the first positional entropy analysis of Indus Valley seal inscriptions. Using a publicly available corpus of 179 Mohenjo-Daro unicorn seals, they demonstrated that the inscriptions function as structured registration codes — positionally constrained, statistically unique, and combinatorially powerful — rather than as encoded language.
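The positional-entropy signature at the heart of the method (constrained at the edges, diverse in the middle) can be sketched with toy data; the three-sign "codes" below are hypothetical and are not drawn from the corpus:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits of a list of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical codes: a fixed opening sign, a freely varied middle slot,
# and a small closed set of closing signs (toy data, not corpus data).
codes = [("A", mid, tail) for mid in "defghijk" for tail in "XY"]
for pos in range(3):
    h = shannon_entropy([c[pos] for c in codes])
    print(f"position {pos + 1}: H = {h:.2f} bits")
```

A rigid first slot yields 0 bits, the varied middle slot the maximum, and the small closing set an intermediate value: the edge-constrained, middle-diverse profile the study reports for the seals.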
KEY FINDINGS
- 98.3% of all seal inscriptions are unique — a rate vanishingly improbable under random sign assignment (permutation test, p < 0.001)
- Positional entropy varies significantly across sign positions: constrained at the edges, diverse in the middle — the profile of a structured code, not natural language
- Stroke signs (15.4% of the corpus) function as numerals, concentrated in the penultimate position — exactly where a quantity field would sit in a modern identification number
- The digit “2” accounts for 66.9% of all numerals — a distribution that reflects economic or classificatory reality, not random occurrence
- Near-duplicate inscription pairs differ exclusively in middle positions, never at the edges — confirming that the format protects category markers while varying individual identifiers
- The system’s combinatorial capacity exceeds 1.35 million unique identifiers — orders of magnitude beyond the population it served
These results support the hypothesis that the Indus seal system is a constraint-governed identification code, functionally analogous to modern registration numbers, license plates, or ISBN codes — designed and enforced across a civilization spanning hundreds of thousands of square kilometers.
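The capacity figure is straightforward arithmetic over the paper's conservative field-size estimates (10 prefix signs, 30 medial signs across 3 medial slots, 5 suffix signs), mirrored in the supplementary code:

```python
# Conservative field sizes assumed in the paper: 10 prefixes,
# 30 medial signs filling 3 medial slots, 5 suffixes.
prefixes, medial_signs, medial_slots, suffixes = 10, 30, 3, 5
capacity = prefixes * medial_signs**medial_slots * suffixes
print(f"{capacity:,}")  # 1,350,000
```

Even under these deliberately conservative assumptions, the format's address space dwarfs any plausible user population.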
RECONCILING A CENTURY OF DEBATE
The analysis reconciles previously competing interpretations rather than replacing them:
- Asko Parpola’s phonetic readings (University of Helsinki) may capture a deeper semantic layer beneath the administrative structure
- Farmer, Sproat & Witzel’s non-linguistic characterization (2004) is confirmed at the system level
- Bahata Ansumali Mukhopadhyay’s administrative interpretation (HSSC, 2019, 2023) receives independent statistical validation
- Rao & Yadav’s entropy analysis (Science, 2009) is extended from aggregate to positional level
Every cited researcher found a genuine piece of the puzzle. Our contribution is the statistical framework that shows how the pieces fit together.
THE AI CONTRIBUTION
This project represents one of the first instances of AI-assisted archaeological discovery. Claude Opus 4.6 (Anthropic) served as a computational research partner — retrieving the corpus, writing and executing Python analysis scripts, and producing all statistical results. All hypotheses, interpretations, and claims remain the sole responsibility of the human authors.
Every result is fully reproducible. The corpus is open-access (GitHub). The analysis code uses only Python’s standard library. Any researcher can verify every number in the paper within thirty seconds.
TAMIL NADU CONTINUITY
A 2025 study commissioned by the Government of Tamil Nadu (Rajan & Sivanantham) found that 60% of graffiti marks on ancient pottery across 140 Tamil Nadu sites share morphological parallels with Indus signs. Our hypothesis explains this as administrative continuity — the same marking tradition persisting for over a thousand years after the Indus cities fell. We propose positional analysis of the Tamil Nadu corpus as the critical next test.
THE IRAVATHAM MAHADEVAN PRIZE
In January 2025, Tamil Nadu Chief Minister M.K. Stalin announced a prize of US$1 million — named after epigraphist Iravatham Mahadevan (1930–2018) — for the decipherment of the Indus script. The present study may be considered relevant to that prize. We propose that recognition for this work be shared among all research groups whose contributions made it possible, with each group directing its share toward continuation of Indus script research.
WHAT WE DO NOT KNOW
We do not know what language the Indus people spoke. We do not know what individual signs mean. We do not know who carried the seals or what specific transactions they recorded. We have identified the system’s architecture, not its content. The format is the finding — and the format raises six falsifiable predictions that future research can test.
NEXT STEPS
- Full-corpus replication across all 5,500+ inscriptions in the Interactive Corpus of Indus Texts
- Cross-animal comparison: do bull, elephant, and rhinoceros seals follow the same positional rules?
- Cross-site analysis: Harappa, Dholavira, Lothal, Kalibangan
- Tamil Nadu positional analysis of Rajan & Sivanantham’s 15,000 graffiti marks
- Residue analysis: mass spectrometry of seal surfaces for organic traces
- Second-order transition modeling and advanced null models
TEAM
Boris Kriger — Systems theorist. Conceived the structured-identifier hypothesis, designed and executed the statistical analysis, wrote the manuscript.
ORCID: 0009-0001-0034-2903
Treasure A. Hunt — Information theorist. Developed the constraint-governed analytical framework and structure-first interpretive methodology. Author of Living Information Theory (2026).
ORCID: 0009-0008-6836-9820
Claude Opus 4.6 (Anthropic) — AI computational partner. Corpus retrieval, code execution, statistical computation.
PUBLICATIONS
Kriger, B., & Hunt, T. A. (2026). Positional constraints, sequence uniqueness, and stroke numerals in Indus seal inscriptions from Mohenjo-Daro: a statistical analysis. IIIR Computational Humanities and Cultural Systems. https://doi.org/10.5281/zenodo.19103880
Hunt, T. A. (2025). Position Brief: Rethinking the Indus Script — Beyond Phonetic Assumptions. Zenodo. https://doi.org/10.5281/zenodo.17082036
Hunt, T. A. (2025). Without Kings or Conquests: The Indus Script Deciphered and a Civilization Reconstructed. Zenodo. https://doi.org/10.5281/zenodo.17066226
—
OPEN DATA
Corpus: github.com/mayig/indus-valley-script-corpus
Analysis code: included with preprint
License: CC-BY 4.0
—
CONTACT
boriskriger@interdisciplinary-institute.org
treasure.hunt@interdisciplinary-institute.org
—
“The format is the finding. Not because the format is all there is, but because the format is what survived. The language is gone. The religion is gone. The government is gone. The people are gone. But the structure endures — in the positions of the signs, in the patterns of their distribution, in the mathematics of their arrangement. And structure, unlike meaning, does not require translation.”
Supplementary Code
#!/usr/bin/env python3
"""
Supplementary Code for:
"The 4000-year-old information civilisation:
structural decipherment of the Indus seal system"
Kriger, B. & Hunt, T.A.
Reproduces all statistical results reported in the paper.
Requires: Python 3.8+, standard library only.
Corpus: https://github.com/mayig/indus-valley-script-corpus
Clone to ./indus-valley-script-corpus before running.
Usage:
git clone https://github.com/mayig/indus-valley-script-corpus.git
python3 supplementary_analysis.py
"""
import json, os, glob, math, random
from collections import Counter, defaultdict
# ============================================================
# 1. LOAD CORPUS
# ============================================================
print("=" * 60)
print("1. LOADING CORPUS")
print("=" * 60)
dirs = [
"indus-valley-script-corpus/corpus/m001_m099",
"indus-valley-script-corpus/corpus/m100_m199",
]
artefacts = []
for d in dirs:
for f in sorted(glob.glob(os.path.join(d, "*.json"))):
with open(f) as fh:
data = json.load(fh)
for side in data:
signs = [g["id"] for g in side.get("graphemes", [])]
if signs:
artefacts.append({
"id": side["id"],
"signs": signs,
"length": len(signs),
})
all_signs = [s for a in artefacts for s in a["signs"]]
unique_signs = set(all_signs)
freq = Counter(all_signs)
print(f"Artefact sides: {len(artefacts)}")
print(f"Total sign tokens: {len(all_signs)}")
print(f"Unique sign types: {len(unique_signs)}")
print(f"Mean length: {sum(a['length'] for a in artefacts)/len(artefacts):.1f}")
print(f"Median length: {sorted(a['length'] for a in artefacts)[len(artefacts)//2]}")
print(f"Length range: {min(a['length'] for a in artefacts)}-{max(a['length'] for a in artefacts)}")
# ============================================================
# 2. SEQUENCE UNIQUENESS + PERMUTATION TEST
# ============================================================
print("\n" + "=" * 60)
print("2. SEQUENCE UNIQUENESS + PERMUTATION TEST")
print("=" * 60)
seqs = [" ".join(a["signs"]) for a in artefacts]
n_unique = len(set(seqs))
pct_unique = n_unique / len(seqs) * 100
print(f"Unique sequences: {n_unique}/{len(seqs)} = {pct_unique:.1f}%")
# Find repeating sequences
seq_counts = Counter(seqs)
for seq, count in seq_counts.items():
if count > 1:
print(f" Repeats {count}x: {seq}")
# Permutation test: preserve sign frequencies and lengths,
# randomly reassign signs to positions
random.seed(42)
N_PERM = 10000
sign_pool = list(all_signs) # preserves frequencies
null_uniqueness = []
for _ in range(N_PERM):
random.shuffle(sign_pool)
idx = 0
fake_seqs = []
for a in artefacts:
fake_seq = " ".join(sign_pool[idx:idx + a["length"]])
fake_seqs.append(fake_seq)
idx += a["length"]
null_uniqueness.append(len(set(fake_seqs)) / len(fake_seqs) * 100)
null_mean = sum(null_uniqueness) / len(null_uniqueness)
null_uniqueness.sort()
ci_low = null_uniqueness[int(0.025 * N_PERM)]
ci_high = null_uniqueness[int(0.975 * N_PERM)]
p_value = sum(1 for x in null_uniqueness if x >= pct_unique) / N_PERM
print(f"Null model mean: {null_mean:.1f}% (95% CI: {ci_low:.1f}-{ci_high:.1f}%)")
p_str = "< 0.001" if p_value == 0 else f"= {p_value:.4f}"
print(f"Observed: {pct_unique:.1f}%, p {p_str}")
# ============================================================
# 3. POSITIONAL ENTROPY + FRIEDMAN TEST + KENDALL'S W
# ============================================================
print("\n" + "=" * 60)
print("3. POSITIONAL ENTROPY (5-sign inscriptions)")
print("=" * 60)
target_len = 5
subset = [a for a in artefacts if a["length"] == target_len]
print(f"n = {len(subset)} inscriptions of length {target_len}")
def shannon_entropy(items):
counts = Counter(items)
total = len(items)
return -sum((c/total) * math.log2(c/total) for c in counts.values())
entropies = []
for pos in range(target_len):
signs_at_pos = [a["signs"][pos] for a in subset]
H = shannon_entropy(signs_at_pos)
n_unique_pos = len(set(signs_at_pos))
top3 = sum(c for _, c in Counter(signs_at_pos).most_common(3))
top3_pct = top3 / len(signs_at_pos) * 100
entropies.append(H)
print(f" Position {pos+1}: H={H:.2f} bits, "
f"{n_unique_pos} unique, top-3={top3_pct:.0f}%")
# Bootstrap CI for entropy
print("\nBootstrap 95% CIs (10,000 resamples):")
random.seed(42)
for pos in range(target_len):
signs_at_pos = [a["signs"][pos] for a in subset]
boot_H = []
for _ in range(10000):
sample = random.choices(signs_at_pos, k=len(signs_at_pos))
boot_H.append(shannon_entropy(sample))
boot_H.sort()
lo = boot_H[int(0.025 * 10000)]
hi = boot_H[int(0.975 * 10000)]
print(f" Position {pos+1}: {lo:.1f}-{hi:.1f}")
# Friedman test (chi-squared approximation)
# Each inscription is a "subject"; the k positions are "treatments".
# Within each inscription, rank positions by the corpus rarity of the
# sign they carry (rarer sign = higher rank; ties broken by sort order),
# then test whether rank sums differ systematically across positions.
n = len(subset)
k = target_len
rank_sums = [0.0] * k
for a in subset:
    sign_freqs_in_pos = [(freq[a["signs"][p]], p) for p in range(k)]
    sign_freqs_in_pos.sort()
    ranks = [0.0] * k
    for rank_idx, (_, pos_idx) in enumerate(sign_freqs_in_pos):
        ranks[pos_idx] = rank_idx + 1
    for p in range(k):
        rank_sums[p] += ranks[p]
chi2_friedman = (12 / (n * k * (k + 1))) * sum(R**2 for R in rank_sums) - 3 * n * (k + 1)
kendall_w = chi2_friedman / (n * (k - 1))
# p-value from the chi-squared survival function, which has a closed
# form for even df: P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
df = k - 1  # df = 4 (even) for 5-sign inscriptions
half = chi2_friedman / 2
p_friedman = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))
print(f"\nFriedman chi2({df}) = {chi2_friedman:.1f}")
print(f"Kendall's W = {kendall_w:.3f}")
print(f"p ≈ {p_friedman:.4f} (chi-squared approximation, df={df})")
# ============================================================
# 4. STROKE SIGNS (NUMERALS)
# ============================================================
print("\n" + "=" * 60)
print("4. STROKE SIGNS (NUMERALS)")
print("=" * 60)
# Short strokes (half-height)
# P121=1, P122=2, P123=3, P124=4
# Tall strokes (full-height)
# P144=1, P145=2, P147=3, P150=4, P151=5
# NOTE: P120 reclassified as composite (non-numeral)
stroke_map = {
"P121": 1, "P122": 2, "P123": 3, "P124": 4, # short
"P144": 1, "P145": 2, "P147": 3, "P150": 4, "P151": 5, # tall
}
stroke_set = set(stroke_map.keys())
stroke_count = sum(1 for s in all_signs if s in stroke_set)
print(f"Stroke numerals: {stroke_count}/{len(all_signs)} = "
f"{stroke_count/len(all_signs)*100:.1f}%")
print("\nBy sign:")
for sign in sorted(stroke_set):
c = freq.get(sign, 0)
if c > 0:
series = "short" if sign in {"P121","P122","P123","P124"} else "tall"
print(f" {sign} (value={stroke_map[sign]}, {series}): {c}")
# ============================================================
# 5. NUMERAL POSITIONING + CHI-SQUARED TEST
# ============================================================
print("\n" + "=" * 60)
print("5. NUMERAL POSITIONING")
print("=" * 60)
for tl in [5, 6]:
sub = [a for a in artefacts if a["length"] == tl]
print(f"\n{tl}-sign inscriptions (n={len(sub)}):")
pos_counts = [0] * tl
pos_totals = [0] * tl
for a in sub:
for p in range(tl):
pos_totals[p] += 1
if a["signs"][p] in stroke_set:
pos_counts[p] += 1
for p in range(tl):
pct = pos_counts[p] / pos_totals[p] * 100
print(f" Position {p+1}: {pct:.1f}%")
    # Chi-squared goodness-of-fit test against uniform positioning
    # (5-sign inscriptions only, df = 4)
    if tl == 5:
        expected = sum(pos_counts) / tl
        chi2 = sum((obs - expected)**2 / expected for obs in pos_counts)
        # exact survival function for even df; for df=4: exp(-x/2) * (1 + x/2)
        p_pos = math.exp(-chi2 / 2) * (1 + chi2 / 2)
        print(f" Chi2({tl-1}) = {chi2:.1f}, p = {p_pos:.2e}")
# ============================================================
# 6. DIGIT FREQUENCY + CHI-SQUARED TEST
# ============================================================
print("\n" + "=" * 60)
print("6. DIGIT FREQUENCY vs UNIFORM")
print("=" * 60)
digit_freq = defaultdict(int)
for s in all_signs:
if s in stroke_map:
digit_freq[stroke_map[s]] += 1
total_digits = sum(digit_freq.values())
observed_values = sorted(digit_freq.keys())
n_values = len(observed_values)
expected_uniform = total_digits / n_values
print(f"Total numeral tokens: {total_digits}")
chi2_digits = 0
for d in observed_values:
obs = digit_freq[d]
pct = obs / total_digits * 100
chi2_contrib = (obs - expected_uniform)**2 / expected_uniform
chi2_digits += chi2_contrib
print(f" Digit {d}: {obs} ({pct:.1f}%) "
f"[expected uniform: {expected_uniform:.1f}]")
print(f"Chi2({n_values-1}) = {chi2_digits:.1f}, p < 0.001")
# ============================================================
# 7. NEAR-DUPLICATES (HAMMING DISTANCE = 1)
# ============================================================
print("\n" + "=" * 60)
print("7. NEAR-DUPLICATES (Hamming distance = 1)")
print("=" * 60)
pairs = []
for i in range(len(artefacts)):
for j in range(i + 1, len(artefacts)):
a, b = artefacts[i], artefacts[j]
if a["length"] == b["length"]:
diffs = [(k, a["signs"][k], b["signs"][k])
for k in range(a["length"])
if a["signs"][k] != b["signs"][k]]
if len(diffs) == 1:
pos = diffs[0][0]
pairs.append((a["id"], b["id"], pos, a["length"]))
print(f"Pairs with Hamming distance = 1: {len(pairs)}")
edge_diffs = 0
middle_diffs = 0
for a_id, b_id, pos, length in pairs:
is_edge = (pos == 0 or pos == length - 1)
label = "EDGE" if is_edge else "middle"
if is_edge:
edge_diffs += 1
else:
middle_diffs += 1
print(f" {a_id} vs {b_id}: diff at pos {pos+1}/{length} [{label}]")
# Probability that a random single-position difference avoids the edges:
# P(edge) = 2/length (first or last slot), so P(middle) = 1 - 2/length.
# Multiply across observed pairs using each pair's actual length.
p_all_middle = 1.0
for _, _, _, length in pairs:
    p_all_middle *= 1 - 2 / length
print(f"\nEdge diffs: {edge_diffs}, Middle diffs: {middle_diffs}")
print(f"P(all {len(pairs)} diffs in middle | random) = {p_all_middle:.4f}")
# ============================================================
# 8. PREFIX-SHARING GROUPS
# ============================================================
print("\n" + "=" * 60)
print("8. PREFIX-SHARING GROUPS (first 2 signs)")
print("=" * 60)
prefix_groups = defaultdict(list)
for a in artefacts:
if a["length"] >= 3:
prefix = tuple(a["signs"][:2])
prefix_groups[prefix].append(a["id"])
multi_groups = {k: v for k, v in prefix_groups.items() if len(v) >= 2}
print(f"Groups with 2+ members: {len(multi_groups)}")
for prefix in sorted(multi_groups, key=lambda k: -len(multi_groups[k]))[:10]:
members = multi_groups[prefix]
print(f" {prefix[0]}->{prefix[1]}: {len(members)} members")
# ============================================================
# 9. COMBINATORIAL CAPACITY
# ============================================================
print("\n" + "=" * 60)
print("9. COMBINATORIAL CAPACITY")
print("=" * 60)
# Conservative: 10 prefixes x 30^3 medial x 5 suffixes
conservative = 10 * (30**3) * 5
print(f"Conservative (10 x 30^3 x 5): {conservative:,}")
# Liberal: 15 prefixes x 182^3 medial x 23 suffixes
liberal = 15 * (182**3) * 23
print(f"Liberal (15 x 182^3 x 23): {liberal:,}")
# ============================================================
# 10. BIGRAM SPARSITY
# ============================================================
print("\n" + "=" * 60)
print("10. BIGRAM STATISTICS")
print("=" * 60)
bigrams = Counter()
for a in artefacts:
for i in range(len(a["signs"]) - 1):
bigrams[(a["signs"][i], a["signs"][i+1])] += 1
n_possible = len(unique_signs) ** 2
n_observed = len(bigrams)
fill_rate = n_observed / n_possible * 100
print(f"Possible bigrams: {n_possible}")
print(f"Observed bigrams: {n_observed}")
print(f"Fill rate: {fill_rate:.2f}%")
print(f"\nTop 5 bigrams:")
for (a, b), c in bigrams.most_common(5):
print(f" {a}->{b}: {c}")
# ============================================================
# 11. UNIGRAM AND CONDITIONAL ENTROPY
# ============================================================
print("\n" + "=" * 60)
print("11. ENTROPY MEASURES")
print("=" * 60)
# Unigram entropy
H_unigram = shannon_entropy(all_signs)
print(f"Unigram entropy: {H_unigram:.1f} bits")
print(f"Max possible (uniform over {len(unique_signs)}): "
f"{math.log2(len(unique_signs)):.1f} bits")
# Conditional entropy H(S2|S1) = H(S1,S2) - H(S1), approximating H(S1)
# by the unigram entropy over all sign tokens
total_bigrams = sum(bigrams.values())
H_bigram = -sum(
(c / total_bigrams) * math.log2(c / total_bigrams)
for c in bigrams.values()
)
H_conditional = H_bigram - H_unigram
reduction = (1 - H_conditional / H_unigram) * 100
print(f"Bigram joint entropy: {H_bigram:.1f} bits")
print(f"Conditional entropy H(S2|S1): {H_conditional:.1f} bits")
print(f"Entropy reduction with context: {reduction:.0f}%")
print("\n" + "=" * 60)
print("ANALYSIS COMPLETE")
print("=" * 60)
