
THE INDUS SCRIPT PROJECT
Structural Decipherment of a 4,000-Year-Old Information System
Kriger, B., & Hunt, T. A. (2026). Positional constraints, sequence uniqueness, and stroke numerals in Indus seal inscriptions from Mohenjo-Daro: a statistical analysis. IIIR Computational Humanities and Cultural Systems. https://doi.org/10.5281/zenodo.19103880
For a hundred years, the Indus script has been called the most important undeciphered writing system in the world. Scholars have tried to crack it by searching for the language hidden inside. They all failed — because the Indus seals were never writing in the conventional sense. They were something far more remarkable: the world’s first structured information system.
In March 2026, researchers Boris Kriger and Treasure A. Hunt, working with AI-assisted computation (Claude Opus 4.6, Anthropic), performed the first positional entropy analysis of Indus Valley seal inscriptions. Using a publicly available corpus of 179 Mohenjo-Daro unicorn seals, they demonstrated that the inscriptions function as structured registration codes — positionally constrained, statistically unique, and combinatorially powerful — rather than as encoded language.
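The positional-entropy signature at the heart of the method (constrained at the edges, diverse in the middle) can be sketched with toy data; the three-sign "codes" below are hypothetical and are not drawn from the corpus:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits of a list of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical codes: a fixed opening sign, a freely varied middle slot,
# and a small closed set of closing signs (toy data, not corpus data).
codes = [("A", mid, tail) for mid in "defghijk" for tail in "XY"]
for pos in range(3):
    h = shannon_entropy([c[pos] for c in codes])
    print(f"position {pos + 1}: H = {h:.2f} bits")
```

A rigid first slot yields 0 bits, the varied middle slot the maximum, and the small closing set an intermediate value: the edge-constrained, middle-diverse profile the study reports for the seals.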
KEY FINDINGS
- 98.3% of all seal inscriptions are unique — a rate vanishingly improbable under random sign assignment (permutation test, p < 0.001)
- Positional entropy varies significantly across sign positions: constrained at the edges, diverse in the middle — the profile of a structured code, not natural language
- Stroke signs (15.4% of the corpus) function as numerals, concentrated in the penultimate position — exactly where a quantity field would sit in a modern identification number
- The digit “2” accounts for 66.9% of all numerals — a distribution that reflects economic or classificatory reality, not random occurrence
- Near-duplicate inscription pairs differ exclusively in middle positions, never at the edges — confirming that the format protects category markers while varying individual identifiers
- The system’s combinatorial capacity exceeds 1.35 million unique identifiers — orders of magnitude beyond the population it served
These results support the hypothesis that the Indus seal system is a constraint-governed identification code, functionally analogous to modern registration numbers, license plates, or ISBN codes — designed and enforced across a civilization spanning hundreds of thousands of square kilometers.
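The capacity figure is straightforward arithmetic over the paper's conservative field-size estimates (10 prefix signs, 30 medial signs across 3 medial slots, 5 suffix signs), mirrored in the supplementary code:

```python
# Conservative field sizes assumed in the paper: 10 prefixes,
# 30 medial signs filling 3 medial slots, 5 suffixes.
prefixes, medial_signs, medial_slots, suffixes = 10, 30, 3, 5
capacity = prefixes * medial_signs**medial_slots * suffixes
print(f"{capacity:,}")  # 1,350,000
```

Even under these deliberately conservative assumptions, the format's address space dwarfs any plausible user population.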
RECONCILING A CENTURY OF DEBATE
The analysis reconciles previously competing interpretations rather than replacing them:
- Asko Parpola’s phonetic readings (University of Helsinki) may capture a deeper semantic layer beneath the administrative structure
- Farmer, Sproat & Witzel’s non-linguistic characterization (2004) is confirmed at the system level
- Bahata Ansumali Mukhopadhyay’s administrative interpretation (HSSC, 2019, 2023) receives independent statistical validation
- Rao & Yadav’s entropy analysis (Science, 2009) is extended from aggregate to positional level
Every cited researcher found a genuine piece of the puzzle. Our contribution is the statistical framework that shows how the pieces fit together.
THE AI CONTRIBUTION
This project represents one of the first instances of AI-assisted archaeological discovery. Claude Opus 4.6 (Anthropic) served as a computational research partner — retrieving the corpus, writing and executing Python analysis scripts, and producing all statistical results. All hypotheses, interpretations, and claims remain the sole responsibility of the human authors.
Every result is fully reproducible. The corpus is open-access (GitHub). The analysis code uses only Python’s standard library. Any researcher can verify every number in the paper within thirty seconds.
TAMIL NADU CONTINUITY
A 2025 study commissioned by the Government of Tamil Nadu (Rajan & Sivanantham) found that 60% of graffiti marks on ancient pottery across 140 Tamil Nadu sites share morphological parallels with Indus signs. Our hypothesis explains this as administrative continuity — the same marking tradition persisting for over a thousand years after the Indus cities fell. We propose positional analysis of the Tamil Nadu corpus as the critical next test.
THE IRAVATHAM MAHADEVAN PRIZE
In January 2025, Tamil Nadu Chief Minister M.K. Stalin announced a prize of US$1 million — named after epigraphist Iravatham Mahadevan (1930–2018) — for the decipherment of the Indus script. The present study may be considered relevant to that prize. We propose that recognition for this work be shared among all research groups whose contributions made it possible, with each group directing its share toward continuation of Indus script research.
WHAT WE DO NOT KNOW
We do not know what language the Indus people spoke. We do not know what individual signs mean. We do not know who carried the seals or what specific transactions they recorded. We have identified the system’s architecture, not its content. The format is the finding — and the format raises six falsifiable predictions that future research can test.
NEXT STEPS
- Full-corpus replication across all 5,500+ inscriptions in the Interactive Corpus of Indus Texts
- Cross-animal comparison: do bull, elephant, and rhinoceros seals follow the same positional rules?
- Cross-site analysis: Harappa, Dholavira, Lothal, Kalibangan
- Tamil Nadu positional analysis of Rajan & Sivanantham’s 15,000 graffiti marks
- Residue analysis: mass spectrometry of seal surfaces for organic traces
- Second-order transition modeling and advanced null models
TEAM
Boris Kriger — Systems theorist. Conceived the structured-identifier hypothesis, designed and executed the statistical analysis, wrote the manuscript.
ORCID: 0009-0001-0034-2903
Treasure A. Hunt — Information theorist. Developed the constraint-governed analytical framework and structure-first interpretive methodology. Author of Living Information Theory (2026).
ORCID: 0009-0008-6836-9820
Claude Opus 4.6 (Anthropic) — AI computational partner. Corpus retrieval, code execution, statistical computation.
PUBLICATIONS
Kriger, B., & Hunt, T. A. (2026). Positional constraints, sequence uniqueness, and stroke numerals in Indus seal inscriptions from Mohenjo-Daro: a statistical analysis. IIIR Computational Humanities and Cultural Systems. https://doi.org/10.5281/zenodo.19103880
Hunt, T. A. (2025). Position Brief: Rethinking the Indus Script — Beyond Phonetic Assumptions. Zenodo. https://doi.org/10.5281/zenodo.17082036
Hunt, T. A. (2025). Without Kings or Conquests: The Indus Script Deciphered and a Civilization Reconstructed. Zenodo. https://doi.org/10.5281/zenodo.17066226
—
OPEN DATA
Corpus: github.com/mayig/indus-valley-script-corpus
Analysis code: included with preprint
License: CC-BY 4.0
—
CONTACT
boriskriger@interdisciplinary-institute.org
treasure.hunt@interdisciplinary-institute.org
—
“The format is the finding. Not because the format is all there is, but because the format is what survived. The language is gone. The religion is gone. The government is gone. The people are gone. But the structure endures — in the positions of the signs, in the patterns of their distribution, in the mathematics of their arrangement. And structure, unlike meaning, does not require translation.”
Supplementary Code
#!/usr/bin/env python3
"""
Supplementary Code for:
"The 4000-year-old information civilisation:
structural decipherment of the Indus seal system"
Kriger, B. & Hunt, T.A.
Reproduces all statistical results reported in the paper.
Requires: Python 3.8+, standard library only.
Corpus: https://github.com/mayig/indus-valley-script-corpus
Clone to ./indus-valley-script-corpus before running.
Usage:
git clone https://github.com/mayig/indus-valley-script-corpus.git
python3 supplementary_analysis.py
"""
import json, os, glob, math, random
from collections import Counter, defaultdict
# ============================================================
# 1. LOAD CORPUS
# ============================================================
print("=" * 60)
print("1. LOADING CORPUS")
print("=" * 60)
dirs = [
"indus-valley-script-corpus/corpus/m001_m099",
"indus-valley-script-corpus/corpus/m100_m199",
]
artefacts = []
for d in dirs:
for f in sorted(glob.glob(os.path.join(d, "*.json"))):
with open(f) as fh:
data = json.load(fh)
for side in data:
signs = [g["id"] for g in side.get("graphemes", [])]
if signs:
artefacts.append({
"id": side["id"],
"signs": signs,
"length": len(signs),
})
all_signs = [s for a in artefacts for s in a["signs"]]
unique_signs = set(all_signs)
freq = Counter(all_signs)
print(f"Artefact sides: {len(artefacts)}")
print(f"Total sign tokens: {len(all_signs)}")
print(f"Unique sign types: {len(unique_signs)}")
print(f"Mean length: {sum(a['length'] for a in artefacts)/len(artefacts):.1f}")
print(f"Median length: {sorted(a['length'] for a in artefacts)[len(artefacts)//2]}")
print(f"Length range: {min(a['length'] for a in artefacts)}-{max(a['length'] for a in artefacts)}")
# ============================================================
# 2. SEQUENCE UNIQUENESS + PERMUTATION TEST
# ============================================================
print("\n" + "=" * 60)
print("2. SEQUENCE UNIQUENESS + PERMUTATION TEST")
print("=" * 60)
seqs = [" ".join(a["signs"]) for a in artefacts]
n_unique = len(set(seqs))
pct_unique = n_unique / len(seqs) * 100
print(f"Unique sequences: {n_unique}/{len(seqs)} = {pct_unique:.1f}%")
# Find repeating sequences
seq_counts = Counter(seqs)
for seq, count in seq_counts.items():
if count > 1:
print(f" Repeats {count}x: {seq}")
# Permutation test: preserve sign frequencies and lengths,
# randomly reassign signs to positions
random.seed(42)
N_PERM = 10000
sign_pool = list(all_signs) # preserves frequencies
null_uniqueness = []
for _ in range(N_PERM):
random.shuffle(sign_pool)
idx = 0
fake_seqs = []
for a in artefacts:
fake_seq = " ".join(sign_pool[idx:idx + a["length"]])
fake_seqs.append(fake_seq)
idx += a["length"]
null_uniqueness.append(len(set(fake_seqs)) / len(fake_seqs) * 100)
null_mean = sum(null_uniqueness) / len(null_uniqueness)
null_uniqueness.sort()
ci_low = null_uniqueness[int(0.025 * N_PERM)]
ci_high = null_uniqueness[int(0.975 * N_PERM)]
p_value = sum(1 for x in null_uniqueness if x >= pct_unique) / N_PERM
print(f"Null model mean: {null_mean:.1f}% (95% CI: {ci_low:.1f}-{ci_high:.1f}%)")
p_str = "< 0.001" if p_value == 0 else f"= {p_value:.4f}"
print(f"Observed: {pct_unique:.1f}%, p {p_str}")
# ============================================================
# 3. POSITIONAL ENTROPY + FRIEDMAN TEST + KENDALL'S W
# ============================================================
print("\n" + "=" * 60)
print("3. POSITIONAL ENTROPY (5-sign inscriptions)")
print("=" * 60)
target_len = 5
subset = [a for a in artefacts if a["length"] == target_len]
print(f"n = {len(subset)} inscriptions of length {target_len}")
def shannon_entropy(items):
counts = Counter(items)
total = len(items)
return -sum((c/total) * math.log2(c/total) for c in counts.values())
entropies = []
for pos in range(target_len):
signs_at_pos = [a["signs"][pos] for a in subset]
H = shannon_entropy(signs_at_pos)
n_unique_pos = len(set(signs_at_pos))
top3 = sum(c for _, c in Counter(signs_at_pos).most_common(3))
top3_pct = top3 / len(signs_at_pos) * 100
entropies.append(H)
print(f" Position {pos+1}: H={H:.2f} bits, "
f"{n_unique_pos} unique, top-3={top3_pct:.0f}%")
# Bootstrap CI for entropy
print("\nBootstrap 95% CIs (10,000 resamples):")
random.seed(42)
for pos in range(target_len):
signs_at_pos = [a["signs"][pos] for a in subset]
boot_H = []
for _ in range(10000):
sample = random.choices(signs_at_pos, k=len(signs_at_pos))
boot_H.append(shannon_entropy(sample))
boot_H.sort()
lo = boot_H[int(0.025 * 10000)]
hi = boot_H[int(0.975 * 10000)]
print(f" Position {pos+1}: {lo:.1f}-{hi:.1f}")
# Friedman test (chi-squared approximation)
# Each inscription is a "subject"; the k positions are "treatments".
# Within each inscription, rank positions by the corpus rarity of the
# sign they carry (rarer sign = higher rank; ties broken by sort order),
# then test whether rank sums differ systematically across positions.
n = len(subset)
k = target_len
rank_sums = [0.0] * k
for a in subset:
    sign_freqs_in_pos = [(freq[a["signs"][p]], p) for p in range(k)]
    sign_freqs_in_pos.sort()
    ranks = [0.0] * k
    for rank_idx, (_, pos_idx) in enumerate(sign_freqs_in_pos):
        ranks[pos_idx] = rank_idx + 1
    for p in range(k):
        rank_sums[p] += ranks[p]
chi2_friedman = (12 / (n * k * (k + 1))) * sum(R**2 for R in rank_sums) - 3 * n * (k + 1)
kendall_w = chi2_friedman / (n * (k - 1))
# p-value from the chi-squared survival function, which has a closed
# form for even df: P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
df = k - 1  # df = 4 (even) for 5-sign inscriptions
half = chi2_friedman / 2
p_friedman = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))
print(f"\nFriedman chi2({df}) = {chi2_friedman:.1f}")
print(f"Kendall's W = {kendall_w:.3f}")
print(f"p ≈ {p_friedman:.4f} (chi-squared approximation, df={df})")
# ============================================================
# 4. STROKE SIGNS (NUMERALS)
# ============================================================
print("\n" + "=" * 60)
print("4. STROKE SIGNS (NUMERALS)")
print("=" * 60)
# Short strokes (half-height)
# P121=1, P122=2, P123=3, P124=4
# Tall strokes (full-height)
# P144=1, P145=2, P147=3, P150=4, P151=5
# NOTE: P120 reclassified as composite (non-numeral)
stroke_map = {
"P121": 1, "P122": 2, "P123": 3, "P124": 4, # short
"P144": 1, "P145": 2, "P147": 3, "P150": 4, "P151": 5, # tall
}
stroke_set = set(stroke_map.keys())
stroke_count = sum(1 for s in all_signs if s in stroke_set)
print(f"Stroke numerals: {stroke_count}/{len(all_signs)} = "
f"{stroke_count/len(all_signs)*100:.1f}%")
print("\nBy sign:")
for sign in sorted(stroke_set):
c = freq.get(sign, 0)
if c > 0:
series = "short" if sign in {"P121","P122","P123","P124"} else "tall"
print(f" {sign} (value={stroke_map[sign]}, {series}): {c}")
# ============================================================
# 5. NUMERAL POSITIONING + CHI-SQUARED TEST
# ============================================================
print("\n" + "=" * 60)
print("5. NUMERAL POSITIONING")
print("=" * 60)
for tl in [5, 6]:
sub = [a for a in artefacts if a["length"] == tl]
print(f"\n{tl}-sign inscriptions (n={len(sub)}):")
pos_counts = [0] * tl
pos_totals = [0] * tl
for a in sub:
for p in range(tl):
pos_totals[p] += 1
if a["signs"][p] in stroke_set:
pos_counts[p] += 1
for p in range(tl):
pct = pos_counts[p] / pos_totals[p] * 100
print(f" Position {p+1}: {pct:.1f}%")
    # Chi-squared goodness-of-fit test against uniform positioning
    # (5-sign inscriptions only, df = 4)
    if tl == 5:
        expected = sum(pos_counts) / tl
        chi2 = sum((obs - expected)**2 / expected for obs in pos_counts)
        # exact survival function for even df; for df=4: exp(-x/2) * (1 + x/2)
        p_pos = math.exp(-chi2 / 2) * (1 + chi2 / 2)
        print(f" Chi2({tl-1}) = {chi2:.1f}, p = {p_pos:.2e}")
# ============================================================
# 6. DIGIT FREQUENCY + CHI-SQUARED TEST
# ============================================================
print("\n" + "=" * 60)
print("6. DIGIT FREQUENCY vs UNIFORM")
print("=" * 60)
digit_freq = defaultdict(int)
for s in all_signs:
if s in stroke_map:
digit_freq[stroke_map[s]] += 1
total_digits = sum(digit_freq.values())
observed_values = sorted(digit_freq.keys())
n_values = len(observed_values)
expected_uniform = total_digits / n_values
print(f"Total numeral tokens: {total_digits}")
chi2_digits = 0
for d in observed_values:
obs = digit_freq[d]
pct = obs / total_digits * 100
chi2_contrib = (obs - expected_uniform)**2 / expected_uniform
chi2_digits += chi2_contrib
print(f" Digit {d}: {obs} ({pct:.1f}%) "
f"[expected uniform: {expected_uniform:.1f}]")
print(f"Chi2({n_values-1}) = {chi2_digits:.1f}, p < 0.001")
# ============================================================
# 7. NEAR-DUPLICATES (HAMMING DISTANCE = 1)
# ============================================================
print("\n" + "=" * 60)
print("7. NEAR-DUPLICATES (Hamming distance = 1)")
print("=" * 60)
pairs = []
for i in range(len(artefacts)):
for j in range(i + 1, len(artefacts)):
a, b = artefacts[i], artefacts[j]
if a["length"] == b["length"]:
diffs = [(k, a["signs"][k], b["signs"][k])
for k in range(a["length"])
if a["signs"][k] != b["signs"][k]]
if len(diffs) == 1:
pos = diffs[0][0]
pairs.append((a["id"], b["id"], pos, a["length"]))
print(f"Pairs with Hamming distance = 1: {len(pairs)}")
edge_diffs = 0
middle_diffs = 0
for a_id, b_id, pos, length in pairs:
is_edge = (pos == 0 or pos == length - 1)
label = "EDGE" if is_edge else "middle"
if is_edge:
edge_diffs += 1
else:
middle_diffs += 1
print(f" {a_id} vs {b_id}: diff at pos {pos+1}/{length} [{label}]")
# Probability that a random single-position difference avoids the edges:
# P(edge) = 2/length (first or last slot), so P(middle) = 1 - 2/length.
# Multiply across observed pairs using each pair's actual length.
p_all_middle = 1.0
for _, _, _, length in pairs:
    p_all_middle *= 1 - 2 / length
print(f"\nEdge diffs: {edge_diffs}, Middle diffs: {middle_diffs}")
print(f"P(all {len(pairs)} diffs in middle | random) = {p_all_middle:.4f}")
# ============================================================
# 8. PREFIX-SHARING GROUPS
# ============================================================
print("\n" + "=" * 60)
print("8. PREFIX-SHARING GROUPS (first 2 signs)")
print("=" * 60)
prefix_groups = defaultdict(list)
for a in artefacts:
if a["length"] >= 3:
prefix = tuple(a["signs"][:2])
prefix_groups[prefix].append(a["id"])
multi_groups = {k: v for k, v in prefix_groups.items() if len(v) >= 2}
print(f"Groups with 2+ members: {len(multi_groups)}")
for prefix in sorted(multi_groups, key=lambda k: -len(multi_groups[k]))[:10]:
members = multi_groups[prefix]
print(f" {prefix[0]}->{prefix[1]}: {len(members)} members")
# ============================================================
# 9. COMBINATORIAL CAPACITY
# ============================================================
print("\n" + "=" * 60)
print("9. COMBINATORIAL CAPACITY")
print("=" * 60)
# Conservative: 10 prefixes x 30^3 medial x 5 suffixes
conservative = 10 * (30**3) * 5
print(f"Conservative (10 x 30^3 x 5): {conservative:,}")
# Liberal: 15 prefixes x 182^3 medial x 23 suffixes
liberal = 15 * (182**3) * 23
print(f"Liberal (15 x 182^3 x 23): {liberal:,}")
# ============================================================
# 10. BIGRAM SPARSITY
# ============================================================
print("\n" + "=" * 60)
print("10. BIGRAM STATISTICS")
print("=" * 60)
bigrams = Counter()
for a in artefacts:
for i in range(len(a["signs"]) - 1):
bigrams[(a["signs"][i], a["signs"][i+1])] += 1
n_possible = len(unique_signs) ** 2
n_observed = len(bigrams)
fill_rate = n_observed / n_possible * 100
print(f"Possible bigrams: {n_possible}")
print(f"Observed bigrams: {n_observed}")
print(f"Fill rate: {fill_rate:.2f}%")
print(f"\nTop 5 bigrams:")
for (a, b), c in bigrams.most_common(5):
print(f" {a}->{b}: {c}")
# ============================================================
# 11. UNIGRAM AND CONDITIONAL ENTROPY
# ============================================================
print("\n" + "=" * 60)
print("11. ENTROPY MEASURES")
print("=" * 60)
# Unigram entropy
H_unigram = shannon_entropy(all_signs)
print(f"Unigram entropy: {H_unigram:.1f} bits")
print(f"Max possible (uniform over {len(unique_signs)}): "
f"{math.log2(len(unique_signs)):.1f} bits")
# Conditional entropy H(S2|S1) = H(S1,S2) - H(S1), approximating H(S1)
# by the unigram entropy over all sign tokens
total_bigrams = sum(bigrams.values())
H_bigram = -sum(
(c / total_bigrams) * math.log2(c / total_bigrams)
for c in bigrams.values()
)
H_conditional = H_bigram - H_unigram
reduction = (1 - H_conditional / H_unigram) * 100
print(f"Bigram joint entropy: {H_bigram:.1f} bits")
print(f"Conditional entropy H(S2|S1): {H_conditional:.1f} bits")
print(f"Entropy reduction with context: {reduction:.0f}%")
print("\n" + "=" * 60)
print("ANALYSIS COMPLETE")
print("=" * 60)
