Glossary
Source: docs/glossary.md
Quick definitions of terms that appear across these docs. Each entry is one or two sentences.
Hub / science gateway
A web-accessible portal that hosts computational tools for a research community.
HUBzero
The CMS framework that hubs are built on. Open-source PHP with custom plugins; project site at hubzero.org, source at github.com/hubzero.
Hub DB (<hub>)
The live CMS database. Owned by HUBzero CMS. Metrics reads it but
generally doesn't write to it (the exceptions are
jos_resource_stats* and jos_session_geo).
Metrics DB (<hub>_metrics)
The analytics database. Owned end-to-end by the metrics pipeline.
Contains web, websessions, toolstart, userlogin,
summary_*_vals, and the static reference tables (continents,
countries, domainclass, etc.).
Web row
One row in web represents one Apache HTTP request. Created by
import-apache. Enriched by resolve-dns / fill-domain /
fill-ipcountry / logfix-session.
Web session
One row in websessions represents a coalesced visitor session — a
sequence of web rows from the same ip + host within 1800 seconds
of each other. Created by logfix-session.
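
The coalescing rule can be sketched in a few lines of Python. This is a minimal illustration, not the logfix-session implementation: `coalesce_sessions` and the `(ip, host, unix_ts)` tuple shape are hypothetical, and it assumes "within 1800 seconds" means a gap of at most 1800 seconds between consecutive rows from the same ip + host. Unlike the legacy behavior described under "Unflushed last session", this sketch does emit the final session.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 1800  # window from the glossary entry


def coalesce_sessions(rows):
    """Group (ip, host, unix_ts) tuples into sessions.

    Returns (ip, host, start_ts, end_ts, hit_count) tuples.
    Assumption: a row joins the current session when it arrives at most
    SESSION_GAP_SECONDS after the previous row from the same ip + host.
    """
    sessions = []
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    for (ip, host), group in groupby(ordered, key=itemgetter(0, 1)):
        start = prev = None
        count = 0
        for _, _, ts in group:
            if prev is not None and ts - prev > SESSION_GAP_SECONDS:
                # Gap too large: close the current session.
                sessions.append((ip, host, start, prev, count))
                start, count = None, 0
            if start is None:
                start = ts
            prev = ts
            count += 1
        # Flush the last open session for this ip + host.
        sessions.append((ip, host, start, prev, count))
    return sessions
```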
Tool start
One row in toolstart represents a user launching a computational
tool. Created from <hub>.sessionlog by import-hub-data and
enriched with walltime / cputime by middleware-wall / middleware-cpu.
Simulation
In metrics terms, a tool-launch event that resulted in a job being
submitted. Counted by summary_simusage_vals and the sim_users()
function. Distinct from a "view-only" tool session.
Period
The time window a summary cell represents. Six codes:
| Code | Span |
|---|---|
| 0 | Calendar year containing datetime |
| 1 | The month itself |
| 3 | Quarter containing datetime |
| 12 | Rolling 12 months ending at datetime |
| 13 | Fiscal year (Oct–Sep) containing datetime |
| 14 | All time (since 1995-01-01) |
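
As a rough sketch of how a period code maps to a date range, the function below returns a half-open `(start, end_exclusive)` pair for a month anchor. `period_span` and `_add_months` are hypothetical helpers, not part of hzmetrics.py. Two points are assumptions: that the rolling and all-time spans end with the anchor month inclusive, and that the year, quarter, and fiscal-year spans cover the whole containing span rather than stopping at the anchor.

```python
from datetime import date

ALL_TIME_START = date(1995, 1, 1)  # from the table above


def _add_months(d, n):
    """Return the first day of the month n months after d's month."""
    idx = d.year * 12 + (d.month - 1) + n
    return date(idx // 12, idx % 12 + 1, 1)


def period_span(code, anchor):
    """Return (start, end_exclusive) for a period code and a month anchor."""
    month_start = date(anchor.year, anchor.month, 1)
    next_month = _add_months(month_start, 1)
    if code == 0:  # calendar year containing the anchor
        return date(anchor.year, 1, 1), date(anchor.year + 1, 1, 1)
    if code == 1:  # the month itself
        return month_start, next_month
    if code == 3:  # quarter containing the anchor
        start = date(anchor.year, 3 * ((anchor.month - 1) // 3) + 1, 1)
        return start, _add_months(start, 3)
    if code == 12:  # rolling 12 months ending with the anchor month
        return _add_months(next_month, -12), next_month
    if code == 13:  # fiscal year, October through September
        fy_year = anchor.year if anchor.month >= 10 else anchor.year - 1
        return date(fy_year, 10, 1), date(fy_year + 1, 10, 1)
    if code == 14:  # all time
        return ALL_TIME_START, next_month
    raise ValueError(f"unknown period code: {code}")
```

For a July 2025 anchor, code 13 yields 2024-10-01 up to (not including) 2025-10-01, and code 12 yields 2024-08-01 up to 2025-08-01.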
datetime convention
Summary tables use 'YYYY-MM-00 00:00:00' for monthly anchors —
zero-padded day to mark "this is a month, not a specific date".
Period 14 (all-time) uses '0000-00-00 00:00:00' in some legacy
layouts; the rewrite preserves whichever the legacy used.
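
Python's `datetime` rejects a day of zero, so anchors in this convention have to be handled as strings. A minimal sketch, with hypothetical helper names:

```python
import re

_ANCHOR_RE = re.compile(r"^(\d{4})-(\d{2})-00 00:00:00$")


def month_anchor(year, month):
    """Format a monthly anchor string with a zero day."""
    return f"{year:04d}-{month:02d}-00 00:00:00"


def parse_month_anchor(text):
    """Return (year, month) from an anchor string; raise on anything else."""
    match = _ANCHOR_RE.match(text)
    if match is None:
        raise ValueError(f"not a month anchor: {text!r}")
    return int(match.group(1)), int(match.group(2))
```

The legacy all-time value '0000-00-00 00:00:00' also matches this pattern and parses to `(0, 0)`, so callers would need to treat that pair as a sentinel.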
rowid / colid
The two indexes into a summary cell. rowid is the metric (e.g.,
"registered users", "simulation jobs", "domains served"); colid is
the breakdown axis (1 = total; 2–6 = residence by continent; 7–11 =
org type). Each summary_*_vals table has its own rowid semantics;
see usage-tables.md.
Domain class
A categorization of internet domains into six classes plus an
"unknown" bucket (class 0), used for the org-type breakdown. Stored
in the domainclass reference table:
| Class | Meaning |
|---|---|
| 0 | Unknown (no domain information) |
| 1 | Educational institution |
| 2 | Industrial / corporate |
| 3 | Governmental |
| 4 | Internet service provider |
| 5 | Search engine |
| 6 | Press / media / publication |
The mapping is hand-maintained. Most of the entries date to 2015;
periodic refresh is recommended but rarely happens. fill-domain
sets the domain on each web / websessions / toolstart row;
fill-user-info and the summary int_users() / reg_users()
functions roll the domain up into the colid 8–11 buckets.
Registered vs unregistered (guest) user
- Registered: appears in userlogin (i.e., logged into the CMS). Org / residence is taken from their jos_xprofiles profile if they filled it out, otherwise treated as "unknown".
- Unregistered (or "guest"): not logged in. Identified by (ip, host) pair in websessions. Org / residence is inferred from the resolved hostname → domain → domainclass lookup.
Some hub deployments are anonymous-dominant — registered accounts in the single digits — so their metrics are essentially all guest-user inference.
Tool top / toplist / "top" code
jos_resource_stats_tools_topvals and jos_stats_topvals store
ranked lists of tools by various metrics. The top column codes the
metric:
| top | Tool metric |
|---|---|
| 2 | Number of users |
| 5 | Number of jobs |
| 6 | Walltime |
| 7 | Simulation CPU time |
| 8 | Simulation interaction time |
| 10 | Number of courses |
| 11 | Course user count |
rank=0 is the special "total across all tools" row for that
(top, period, datetime) triple. rank=1, 2, 3, … are the
individual tools in descending order.
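
A small sketch of how a consumer might separate the rank=0 total from the ranked tools for one (top, period, datetime) triple. The `(rank, name, value)` tuple shape and `split_toplist` are hypothetical; the real column names are not given here.

```python
# Metric codes from the table above.
TOP_METRICS = {
    2: "Number of users",
    5: "Number of jobs",
    6: "Walltime",
    7: "Simulation CPU time",
    8: "Simulation interaction time",
    10: "Number of courses",
    11: "Course user count",
}


def split_toplist(rows):
    """Split rows for one (top, period, datetime) into (total, ranked).

    rows: iterable of (rank, name, value) tuples (hypothetical shape).
    rank 0 carries the total across all tools; other ranks are tools.
    """
    total = None
    ranked = []
    for rank, name, value in rows:
        if rank == 0:
            total = value
        else:
            ranked.append((rank, name, value))
    ranked.sort(key=lambda row: row[0])  # rank 1 first
    return total, ranked
```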
dnload
A boolean (TINYINT 0/1) column on the web table indicating
whether the row represents a resource download. Set inline by
import-apache for new rows; backfill-dnload populates the
historical rows. Introduced by the rewrite to replace a slow LIKE-
chain in xlogfix_summary.php's download detection.
Login IPs / login_ips_tmp
A temp table built at summary time, indexing every IP that appears
in userlogin_lite. Used as the "registered user" set against
which websessions.ip NOT IN (...) filters identify unregistered
visitors. The rewrite materializes this as an indexed JOIN target
instead of an in-memory comma list, dramatically speeding up the
all-time aggregation.
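
The indexed-JOIN idea can be illustrated with an anti-join. The sketch below uses Python's built-in `sqlite3` as a stand-in for MariaDB, with simplified, hypothetical table layouts; the actual query in the rewrite is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the real tables (hypothetical columns).
    CREATE TABLE websessions (id INTEGER PRIMARY KEY, ip TEXT);
    CREATE TABLE userlogin_lite (ip TEXT);
    INSERT INTO websessions (ip) VALUES ('10.0.0.1'), ('10.0.0.2'), ('10.0.0.3');
    INSERT INTO userlogin_lite (ip) VALUES ('10.0.0.2'), ('10.0.0.2');

    -- Materialize the distinct login IPs once, with an index (the primary key).
    CREATE TEMP TABLE login_ips_tmp (ip TEXT PRIMARY KEY);
    INSERT INTO login_ips_tmp SELECT DISTINCT ip FROM userlogin_lite;
    """
)

# Anti-join: keep sessions whose IP has no match in login_ips_tmp.
guest_ips = [
    row[0]
    for row in conn.execute(
        """
        SELECT w.ip
        FROM websessions AS w
        LEFT JOIN login_ips_tmp AS l ON l.ip = w.ip
        WHERE l.ip IS NULL
        ORDER BY w.ip
        """
    )
]
```

Here `guest_ips` ends up holding the two addresses that never appear in `userlogin_lite`.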
whoisonline
The live "who is currently online" widget. Reads
<hub>.jos_session every 5 minutes, looks up reverse DNS + GeoIP for
new IPs, writes <hub>.jos_session_geo and
/var/www/<hub>/app/site/stats/maps/whoisonline.xml for the Google
Maps widget on the public usage page. Real-time-ish, but a separate
concern from the daily metrics pipeline.
Tick
The cron entry point — hzmetrics.py tick — that runs every 5
minutes. Always refreshes whoisonline; at :30 past the hour, also
tries to run the metrics pipeline (under PID-lock guard).
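
The control flow of a tick can be sketched as follows. Everything here is illustrative: the function names, the return strings, and the lock mechanism (an exclusively created PID file) are assumptions, not the hzmetrics.py implementation.

```python
import os
from datetime import datetime


def should_try_pipeline(now):
    """True only on the tick that lands at :30 past the hour."""
    return now.minute == 30


def acquire_pid_lock(path):
    """Create the lock file exclusively and write our PID into it.

    Returns False if the file already exists (another run holds the lock).
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def tick(now, lock_path, refresh_whoisonline, run_pipeline):
    """Always refresh whoisonline; at :30 also try the pipeline under the lock."""
    refresh_whoisonline()
    if not should_try_pipeline(now):
        return "whoisonline-only"
    if not acquire_pid_lock(lock_path):
        return "pipeline-skipped-locked"
    try:
        run_pipeline()
    finally:
        os.remove(lock_path)  # release the lock even if the pipeline fails
    return "pipeline-ran"
```

This simple lock does not detect stale lock files left behind by a crashed run; a production guard would typically check whether the recorded PID is still alive.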
Migration (schema)
A row in <hub>_metrics.migrations recording an applied schema
delta. hzmetrics.py migrate --apply walks the unapplied migrations
in order and runs them. Standard schema-migration pattern.
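
For readers unfamiliar with the pattern, here is a minimal sketch using `sqlite3` as a stand-in database. The migration names, the SQL statements, and the single-column layout of the `migrations` table are all hypothetical; the real table and migration files may differ.

```python
import sqlite3

# Hypothetical ordered list of (name, SQL) schema deltas.
MIGRATIONS = [
    ("0001_add_dnload", "ALTER TABLE web ADD COLUMN dnload INTEGER NOT NULL DEFAULT 0"),
    ("0002_index_dnload", "CREATE INDEX idx_web_dnload ON web (dnload)"),
]


def apply_migrations(conn, migrations):
    """Run every migration not yet recorded, in order; return the names run."""
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM migrations")}
    ran = []
    for name, sql in migrations:
        if name in applied:
            continue
        conn.execute(sql)
        # Record the delta so the next run skips it.
        conn.execute("INSERT INTO migrations (name) VALUES (?)", (name,))
        conn.commit()
        ran.append(name)
    return ran
```

Running the function a second time against the same database applies nothing, which is what makes the command safe to re-run.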
access.cfg
/etc/hubzero-metrics/access.cfg. A PHP-style file of bare
$var = 'value'; assignments holding the DB credentials. Owned by
root:apache, mode 640. Read by hzmetrics.py, the legacy Perl
scripts, and the test harness (via the HZMETRICS_ACCESS_CFG env var).
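
A file in this format can be read with a small regular expression. The sketch below is an assumption about how one might parse it, not the parser hzmetrics.py uses; the key names in the sample are made up, and escaped quotes inside values are not handled.

```python
import re

# Matches lines of the form: $name = 'value';
_ASSIGN_RE = re.compile(r"^\s*\$(\w+)\s*=\s*'([^']*)'\s*;\s*$")


def parse_access_cfg(text):
    """Return a dict of the $var = 'value'; assignments found in text.

    Lines that do not match (blank lines, comments, PHP tags) are ignored.
    """
    settings = {}
    for line in text.splitlines():
        match = _ASSIGN_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


# Hypothetical sample; real key names may differ.
SAMPLE = """
$db_host = 'localhost';
$db_user = 'metrics';
# a comment line
$db_pass = 's3cret';
"""
```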
exclude_list
Per-hub table in the metrics DB. Filters bots / scanners / utility traffic out of metrics processing. See operations.md for ops use.
A/B parity / bug-for-bug parity
The rewrite's parity contract: the new code's output tables must be
byte-identical to the legacy code's output tables for the same input
fixtures. Verified by tests/ab/run-all.sh. Preserved even where
the legacy had quirks (NULL vs empty-string, implementation-defined
ordering with no tie-breaker, etc.) — those are documented and
matched.
_findweeks / week-chunking
A legacy pattern from xlogfix_summary.php / xlogfix_domain.php:
break a month into ~4 week-sized scans for memory-bounded
enrichment. Has a known quirk: the chunking starts on the day
before the month begins, so July 2025's first chunk runs from
2025-06-30 to 2025-07-07, not from 2025-07-01. Preserved
bug-for-bug by the rewrite.
Banker's rounding (the middleware port quirk)
MariaDB's ROUND() on a DOUBLE column uses round-half-to-even
(banker's rounding) — ROUND(200.5) → 200. Perl's int($x + 0.5)
in xlogfix_middleware_{wall,cpu}.pl is round-half-up — 200.5 → 201. The rewrite uses FLOOR(x + 0.5) instead of ROUND() to
mirror the Perl semantics exactly. Caught by the
port_middleware A/B test.
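
The difference between the two rules is easy to reproduce in Python, whose built-in `round()` also rounds halves to even and so stands in here for the ROUND() behavior described above. The helper names are hypothetical.

```python
import math


def round_half_even(x):
    """Round-half-to-even, the behavior attributed to ROUND() on a DOUBLE."""
    return round(x)


def round_half_up(x):
    """FLOOR(x + 0.5), the form the rewrite uses to mirror the Perl code."""
    return math.floor(x + 0.5)


for value in (200.5, 201.5, 202.4):
    print(value, round_half_even(value), round_half_up(value))
```

The two functions disagree only on exact halves whose integer part is even, such as 200.5. Note that Perl's `int()` truncates toward zero, so `int($x + 0.5)` and `FLOOR(x + 0.5)` agree only for non-negative inputs.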
gridstat / hctest_ user filter
Two patterns of "test account" excluded from metrics processing
across the legacy code. gridstat is an exact-string match
(so gridstatx is NOT excluded). hctest% is a LIKE pattern (so
hctest, hctest_x, hctestlonger, hctester are all excluded).
Preserved bug-for-bug.
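
Expressed as a Python predicate, the two rules look like this. `is_test_account` is a hypothetical helper; it compares case-sensitively, whereas a SQL comparison may be case-insensitive depending on the column collation.

```python
def is_test_account(username):
    """True for the exact name 'gridstat' or any name starting with 'hctest'."""
    return username == "gridstat" or username.startswith("hctest")


excluded = [u for u in ("gridstat", "gridstatx", "hctest", "hctest_x", "hctester")
            if is_test_account(u)]
```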
Unflushed last session
A legacy quirk in logfix_session.pl: the very last session of
each run never gets emitted to websessions if no later row
triggers the "session end found" path. Preserved by the rewrite —
the port_logfix_session test fixture exercises it explicitly.
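
The shape of this quirk can be shown with a streaming loop that emits a session only when a later row ends it. This is an illustration of the pattern, not a transcription of logfix_session.pl: it handles a single ip + host stream, and the function name and 1800-second default are assumptions carried over from the "Web session" entry.

```python
def sessions_without_final_flush(timestamps, gap=1800):
    """Return (start, end) sessions, emitting only when a later row ends one.

    The session still open when the input runs out is never emitted,
    which is the quirk described above.
    """
    emitted = []
    start = prev = None
    for ts in timestamps:
        if prev is not None and ts - prev > gap:
            emitted.append((start, prev))  # the "session end found" path
            start = ts
        elif start is None:
            start = ts
        prev = ts
    # Deliberately no flush of the open (start, prev) session here.
    return emitted
```

With timestamps 0, 100, 5000, 5100 the function returns only the first session, (0, 100); the second session, 5000 to 5100, is dropped because nothing follows it.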
Two paths to MySQL INT rounding
Yet another preserved quirk. The legacy PHP stringifies values
before binding into SQL, so MariaDB applies half-away-from-zero
rounding when casting a float-string to INT: '488.5' → 489. The
Python port's pymysql originally bound Python floats as numeric
literals, hitting banker's rounding: 488.5 → 488. The fix:
stringify Python floats before binding. Caught by the
port_gen_tool_stats A/B test.
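
The fix amounts to a one-line conversion before binding. The sketch below pairs it with an emulation of the two rounding paths as described above, using the `decimal` module; `emulate_int_cast` models the described behavior and does not call MariaDB, and both function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def as_bind_param(value):
    """Stringify floats so they are bound as strings, like the legacy PHP."""
    return str(value) if isinstance(value, float) else value


def emulate_int_cast(param):
    """Model the INT cast described above (an emulation, not a database call).

    Strings round half away from zero; numeric values round half to even.
    """
    if isinstance(param, str):
        rounding, number = ROUND_HALF_UP, Decimal(param)
    else:
        rounding, number = ROUND_HALF_EVEN, Decimal(repr(param))
    return int(number.quantize(Decimal("1"), rounding=rounding))
```

Under this model, binding 488.5 directly gives 488, while binding `as_bind_param(488.5)` gives 489, matching the legacy output.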