Senior Data/Entity Resolution Engineer (Fitch Solutions, Python, LATAM/Europe) #26572
... create world-changing products using God-given talents . . .
PROJECT DESCRIPTION:
We are part of Fitch Solutions and run the leading platform for federal court intelligence. We interface with 200+ federal court websites under PACER.gov, run thousands of automated data retrievals every day, and pull that information into a single, centralized database, so our customers can search and monitor all 200+ courts in one place instead of fighting through PACER's slow, fragmented, feature-poor interfaces.
There is one high-value problem we have not yet solved: our customers cannot reliably monitor a known company across the courts. Party and case data arrives as freeform text, with no consistent identifiers. “IBM” might appear as IBM, I.B.M., or International Business Machines; a relevant subsidiary may carry a name that doesn't contain the parent's name at all. Today, a client who wants to track a company for risk exposure would have to guess every name variant and run thousands of queries a day — and still miss cases.
Your mission is to make that problem disappear. You will build the entity resolution layer that takes a curated list of companies (each with a known identifier) and accurately links them and their related entities to the freeform party, docket, and filing text already sitting in our database. When you're done, a client searches “IBM” once and gets reliable, comprehensive coverage. This is a clearly scoped, high-impact, build-from-the-core project with direct revenue value.
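To make the name-variant problem concrete, here is a minimal sketch of how several surface forms could collapse to one identifier. It is an illustration only, not our implementation: the suffix list, the alias table, and the identifier "CO-0001" are all hypothetical.

```python
import re

# Hypothetical legal suffixes to strip; a real list would be much longer.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "llc", "ltd"}

# Hypothetical curated alias table: normalized surface form -> company identifier.
ALIASES = {
    "ibm": "CO-0001",
    "international business machines": "CO-0001",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    text = name.lower().replace(".", "")      # "I.B.M." -> "ibm"
    text = re.sub(r"[^a-z0-9]+", " ", text)   # other punctuation -> spaces
    tokens = text.split()
    # Keep at least one token so a name like "Co" is not erased entirely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(name: str):
    """Return the identifier for a known surface form, or None if unknown."""
    return ALIASES.get(normalize(name))


for variant in ["IBM", "I.B.M.", "International Business Machines Corp."]:
    print(variant, "->", resolve(variant))
```

Exact lookup after normalization only covers variants someone has already curated; the harder cases (subsidiaries, misspellings, ambiguous short names) are where the real work of this role lies.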
PROJECT STACK and TEAM:
Language: Python is the core of this work and is a hard requirement. The matching engine, scoring logic, and the script that writes identifiers back to the database will all be Python.
Data Store: There are two data stores, MySQL and OpenSearch. Someone who can use both to retrieve the required data efficiently will have an advantage. Raw files are also stored in an S3 bucket.
Ingestion: A large-scale, daily automated retrieval pipeline pulling from 200+ PACER court sites into the central schema. You will consume and enrich this data, not rebuild the scrapers.
The input: a curated list of companies, each with a known identifier, plus the freeform party, docket, and filing text already in our database. For the subsidiary piece, external data will likely need to be brought in for the corporate structure.
The output: (a) a reviewable, reasonable output form showing reliable matches per company, and (b) an internal process that writes the matched identifier value into a new identifier column on the relevant case/party records (in effect, one or more new tags).
The hard parts you'll own: common-word and short names (e.g., “Target”), partial-name collisions where a name is a substring of an unrelated company, duplicate companies sharing a name, mapping the full family of subsidiaries (e.g., Johnson & Johnson's 250+ subsidiaries, many named after drugs/treatments with no parent name in them), and controlling false positives with very little corroborating data: we often lack an address or website to disambiguate against.
Timezone: Central US Time
Additional Info: Writing to the databases will likely be handled on our end by consuming the output of the system you build. The identifier may end up as a column in the database, but that is not yet decided, so the exact data structure may change.
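One of the hard parts listed above, partial-name collisions, can be illustrated with a small sketch that compares whole tokens instead of raw substrings. The function names and sample party strings are made up for illustration, and this is one possible approach, not a prescribed one.

```python
import re


def tokens(text: str) -> list:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_name(party_text: str, company_name: str) -> bool:
    """True if the company's tokens appear as a contiguous run of whole tokens."""
    hay, needle = tokens(party_text), tokens(company_name)
    n = len(needle)
    if n == 0:
        return False
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))


# A naive substring test fires on an unrelated company...
print("target" in "Targeted Therapies Inc.".lower())        # True (false positive)
# ...while whole-token matching does not.
print(contains_name("Targeted Therapies Inc.", "Target"))    # False
print(contains_name("Target Corporation v. Doe", "Target"))  # True
```

Whole-token matching removes substring collisions, but it cannot tell a company named Target from the ordinary word "target" in docket text, so short and common names would still need extra evidence or human review.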
MAIN REQUIREMENTS:
Strong, production-level Python is non-negotiable. You can build, structure, and maintain real data-processing code, not just notebooks.
Hands-on experience with entity resolution, record linkage, fuzzy matching, or deduplication (e.g., libraries/approaches such as rapidfuzz, dedupe, recordlinkage, splink, or equivalents you can speak to in depth).
Practical data mapping experience: normalizing and reconciling messy, inconsistent values into canonical forms.
SQL proficiency and confidence working directly against a relational database to read source text and write results.
Comfort working with unstructured / freeform text: cleaning, standardizing, and matching real-world name data with all its noise.
Solid MDM mindset: canonical identifiers, golden records, alias/cross-reference management, and why match quality and governance matter.
A rigorous, precision-first instinct: you treat false positives as a primary risk and can reason about precision/recall trade-offs and confidence thresholds.
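As a small worked example of the precision/recall trade-off mentioned above, the sketch below scores a made-up labeled sample at two confidence thresholds. The sample data and threshold values are invented for illustration; returning 1.0 when nothing is accepted is a convention chosen here, not a standard.

```python
def precision_recall(scored, threshold):
    """scored: list of (score, is_true_match) pairs. Returns (precision, recall)."""
    tp = sum(1 for s, y in scored if s >= threshold and y)       # accepted, correct
    fp = sum(1 for s, y in scored if s >= threshold and not y)   # accepted, wrong
    fn = sum(1 for s, y in scored if s < threshold and y)        # missed true match
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up labeled sample: (match score, whether a reviewer confirmed the match).
SAMPLE = [
    (0.98, True), (0.95, True), (0.90, False),
    (0.85, True), (0.70, False), (0.60, True),
]

print(precision_recall(SAMPLE, 0.80))  # (0.75, 0.75)
print(precision_recall(SAMPLE, 0.92))  # (1.0, 0.5)
```

Raising the threshold from 0.80 to 0.92 removes the one false positive in this sample but drops half of the true matches, which is the kind of trade-off a precision-first design has to make deliberately.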
GOOD TO HAVE:
Experience with legal, court, regulatory, or PACER/litigation data, or other domains with messy named-party data.
Familiarity with corporate hierarchy / subsidiary reference data (e.g., LEI, DUNS, or similar identifier systems) and parent–child entity mapping.
NLP techniques relevant to name matching: phonetic algorithms, embeddings/vector similarity, named-entity handling.
Experience designing human-in-the-loop review workflows for ambiguous matches.
Building repeatable, schedulable batch jobs and clear match-quality dashboards or reporting.
Background in risk, compliance, or financial information products (a natural fit with the Fitch Solutions mission).
JOB RESPONSIBILITIES:
Design, build, and tune a Python-based entity resolution / record-linkage system that matches a curated company list against: 1) case data; 2) freeform party, docket, and filing text; and 3) external data that could assist in resolution.
Produce results in a clear, reviewable, and de/serializable output form for validation before anything is written to production.
Develop the data mapping logic that links many surface forms (abbreviations, punctuation variants, legal suffixes, aliases) to a single canonical company identifier.
Solve the subsidiary / corporate-family problem: map related entities up to their parent identifier even when names share no common tokens, using reference data and curated mappings.
Engineer matching against unstructured / freeform text: normalization, tokenization, fuzzy matching, blocking/candidate generation, and confidence scoring.
Build deliberate precision controls to minimize false positives given sparse corroborating data, including thresholds, scoring, ambiguity flags, and human-review queues for low-confidence matches.
Write and run the internal script that persists approved identifier values to the correct records in the database.
Define and track match-quality metrics (precision, recall, false-positive rate) and iterate to keep them reliable as the company list and case data grow.
Partner with data, product, and engineering stakeholders to make the curated list, the matching rules, and the MDM approach maintainable over time.
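To give a feel for how several of these responsibilities fit together, here is a rough sketch of blocking, fuzzy scoring, threshold routing, and a serializable result. It uses only the standard library (difflib) as a stand-in for a dedicated matching library. Every name, identifier, company, and threshold in it is hypothetical, and the subsidiary shown is made up.

```python
import json
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical curated list: identifier -> known names. A subsidiary's name is
# listed under its parent's identifier even when the names share no tokens.
COMPANIES = {
    "CO-0001": ["international business machines", "ibm"],
    "CO-0002": ["examplecorp holdings", "examplix therapeutics"],  # made-up subsidiary
}

ACCEPT_AT = 0.95  # assumed threshold for writing without review
REVIEW_AT = 0.80  # assumed threshold for the human-review queue


def build_blocks(companies):
    """Blocking: index names by first token to limit the comparisons per party."""
    blocks = defaultdict(list)
    for company_id, names in companies.items():
        for name in names:
            blocks[name.split()[0]].append((company_id, name))
    return blocks


def route(score):
    """Turn a similarity score into accept / review / reject."""
    if score >= ACCEPT_AT:
        return "accept"
    if score >= REVIEW_AT:
        return "review"
    return "reject"


def match_party(party, blocks):
    """Best candidate for one already-normalized party string, or None."""
    words = party.split()
    if not words:
        return None
    best, best_score = None, -1.0
    for company_id, name in blocks.get(words[0], []):
        score = SequenceMatcher(None, party, name).ratio()
        if score > best_score:
            best, best_score = (company_id, name), score
    if best is None:
        return None
    return {
        "party": party,
        "company_id": best[0],
        "matched_name": best[1],
        "score": round(best_score, 3),
        "decision": route(best_score),
    }


blocks = build_blocks(COMPANIES)
result = match_party("examplix therapeutics", blocks)
print(json.dumps(result, indent=2))  # reviewable output before any database write
```

First-token blocking is cheap but misses variants whose first word differs, so a real system would likely combine several blocking keys; the "review" band is where a human-in-the-loop queue would sit.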
SUMMARY:
Work your way – Enjoy the freedom to work from anywhere, with flexible hours that match your natural rhythm.
Plenty of time to recharge – Take 15 paid vacation days, 10 additional unpaid days if needed, plus all national holidays.
Meaningful, long-term projects – Dive into exciting 1–5+ year projects using the latest in AI, cloud and more.
Support beyond the job – We help cover things like advanced language courses, gym memberships, and mentorship programs to help you grow.
Work with global clients – Collaborate directly with international teams to create real impact.
Make extra cash – Earn bonuses for referring great people or bringing in new business opportunities.
Great people, no micromanagement – Join a supportive, results-focused team where you’re trusted to do your best work.
This flexibility allows developers…
A better work-life balance
Increased productivity
The ability to work any time around the clock
Reduction in commute time
Design your ideal daily schedule.
Build a career, not just a job.
Work smarter, not longer.
More time with family and friends