Senior Data/Entity Resolution Engineer (Fitch Solutions, Python, LATAM/Europe) #26572

... create world-changing products using God-given talents . . .

PROJECT DESCRIPTION:

Part of Fitch Solutions, we are the leading platform for federal court intelligence. We connect to 200+ federal court websites through PACER.gov, run thousands of automated data retrievals daily, and consolidate the information into a centralized database, enabling customers to search and monitor all federal courts from a single platform.

One key challenge remains: reliably identifying companies across court records. Party and case data is stored as freeform text without consistent identifiers, meaning the same company may appear under multiple name variations (e.g., IBM, I.B.M., International Business Machines), while subsidiaries may not include the parent company's name at all. As a result, customers must search countless name variations and can still miss relevant cases.
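
To make the name-variation problem concrete, here is a minimal sketch of string normalization plus an alias lookup. The suffix list, the alias table, and the function names are illustrative assumptions, not part of the existing system.

```python
import re

# Hypothetical, deliberately short lists; real ones would be curated and much larger.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd", "lp"}
ALIASES = {"international business machines": "ibm"}


def normalize_name(raw: str) -> str:
    """Lowercase, join dotted initialisms, drop punctuation and trailing legal suffixes."""
    text = raw.lower()
    # Join dotted initialisms: "i.b.m." -> "ibm"
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    # Replace remaining punctuation with spaces (keep "&", as in "johnson & johnson")
    text = re.sub(r"[^a-z0-9&\s]", " ", text)
    tokens = text.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def canonical_key(raw: str) -> str:
    """Map a raw name to a canonical key via normalization and the alias table."""
    normalized = normalize_name(raw)
    return ALIASES.get(normalized, normalized)


for variant in ("IBM", "I.B.M.", "International Business Machines Corporation"):
    print(variant, "->", canonical_key(variant))
```

Normalization alone handles punctuation and suffix variants; spelled-out names and subsidiaries still need a curated alias or hierarchy table, which is where most of the real work lies.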

Your mission is to build an entity resolution layer that links a curated list of companies and their identifiers to party, docket, and filing data already in our database, providing customers with reliable, comprehensive results through a single search.

The client is looking for a highly capable, autonomous developer who can take ownership of the proof-of-concept and solve complex data engineering challenges independently. You'll architect scalable pipelines to process large volumes of legal and corporate documents, while addressing advanced NLP and data normalization problems—including pronoun resolution, string normalization, and named entity recognition—to accurately identify and match companies across disparate datasets.

The role also requires evaluating and implementing technologies such as vector databases, taxonomies, small language models (SLMs), and external API integrations (e.g., FactSet or SEC tools) to build a deterministic, scalable, and highly reliable entity-resolution framework.

PROJECT STACK AND TEAM:

Language

Python is the core of this work and is a hard requirement. The matching engine, scoring logic, and scripts that write identifiers back to the database will all be built in Python.

Data Stores

The system will work with two primary data stores: MySQL and OpenSearch. Experience with both is advantageous, especially for efficiently retrieving the required data. Raw file storage is also available through an S3 bucket.

Ingestion

A large-scale daily automated retrieval pipeline already pulls data from 200+ PACER court sites into a central schema. You will consume and enrich this data rather than rebuild the existing scrapers.

Input Data

The system will match a curated company list against:

Case data
Freeform party, docket, and filing text
External data sources that may assist with entity resolution, including corporate structure and subsidiary relationships

Output

The goal is to:

Produce clear, reviewable, and serializable output showing reliable matches for each company before any production updates occur.
Support an internal process that writes matched identifier values back to relevant case/party records (e.g., adding new identifier tags or columns, depending on the final implementation approach).
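
As an illustration of what "reviewable and serializable" output could look like, the sketch below writes one match per line as JSON and reads it back. The field names, status values, and identifiers are assumptions for the example only; the final output format has not been decided.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MatchRecord:
    company_id: str   # hypothetical identifier from the curated company list
    party_id: int     # hypothetical key of the party record in the database
    raw_text: str     # the freeform text that was matched
    score: float      # confidence score in [0, 1]
    status: str       # e.g. "accepted", "needs_review", "rejected"


def to_jsonl(records):
    """Serialize records as JSON Lines so reviewers can diff and filter them."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(payload):
    """Rebuild records from JSON Lines, skipping blank lines."""
    return [MatchRecord(**json.loads(line)) for line in payload.splitlines() if line.strip()]


records = [
    MatchRecord("C-001", 42, "I.B.M. Corp.", 0.97, "accepted"),
    MatchRecord("C-002", 57, "Target", 0.61, "needs_review"),
]
payload = to_jsonl(records)
print(payload)
```

Because the output is a plain file rather than a database write, it does not depend on any particular schema and can be reviewed before anything reaches production.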

Key Challenges You Will Own

The most challenging parts of this work include:

Resolving common-word and short company names (e.g., “Target”)
Handling partial-name collisions where a company name is a substring of an unrelated company name
Managing duplicate companies that share the same name
Mapping complete corporate families and subsidiary relationships (e.g., Johnson & Johnson’s 250+ subsidiaries, many of which may have names unrelated to the parent company)
Minimizing false positives when limited supporting information is available, such as missing addresses or websites for disambiguation
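
One of these challenges, partial-name collisions, can be sketched in a few lines: match names only on token boundaries, and when two curated names overlap in the text, keep the longer one. The function name and the sample company "Target Logistics Group" are hypothetical.

```python
import re


def find_company_mentions(text, names):
    """Return curated names found on token boundaries, preferring the longest overlapping match."""
    spans = []
    for name in names:
        # Lookarounds reject hits inside a longer word, e.g. "Target" inside "Targeted".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), name))
    # Longest spans first, so a shorter name nested in a longer match is dropped.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, name in spans:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, name))
    return [name for _, _, name in sorted(kept)]


names = ["Target", "Target Logistics Group"]
print(find_company_mentions("Target Logistics Group v. Doe", names))
print(find_company_mentions("Targeted Therapeutics LLC", names))
```

This only helps when the longer, unrelated name is itself in the reference list, and it does nothing for common-word ambiguity ("target" used as an ordinary word), which needs additional context signals.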

Additional Details

Timezone: Central US Time
The client will likely consume the system’s output on their side. While the matched identifier may eventually be stored as a database column, this has not been finalized, so the solution should not depend on a specific database structure.

Role Summary

Design, build, and tune a Python-based entity resolution / record-linkage system that matches curated company data against legal case data, unstructured text, and external corporate information sources.

MAIN REQUIREMENTS:

Strong, production-level Python is non-negotiable. You can build, structure, and maintain real data-processing code, not just notebooks.
Hands-on experience with entity resolution, record linkage, fuzzy matching, or deduplication (e.g., libraries/approaches such as rapidfuzz, dedupe, recordlinkage, splink, or equivalents you can speak to in depth).
Practical data mapping experience: normalizing and reconciling messy, inconsistent values into canonical forms.
SQL proficiency and confidence working directly against a relational database to read source text and write results.
Comfort working with unstructured / freeform text: cleaning, standardizing, and matching real-world name data with all its noise.
Solid MDM mindset: canonical identifiers, golden records, alias/cross-reference management, and why match quality and governance matter.
A rigorous, precision-first instinct: you treat false positives as a primary risk and can reason about precision/recall trade-offs and confidence thresholds.
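
For candidates who want a concrete picture of the fuzzy-matching and confidence-threshold skills listed above, here is a rough sketch using only the standard library. It uses `difflib.SequenceMatcher` as a stand-in for a dedicated library such as rapidfuzz; the thresholds, the first-token blocking key, and the company names are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical thresholds; real values would be tuned against labeled data.
AUTO_ACCEPT = 0.92
NEEDS_REVIEW = 0.80


def block_key(name: str) -> str:
    """Cheap blocking key: the first lowercase token of the name."""
    tokens = name.lower().split()
    return tokens[0] if tokens else ""


def build_index(companies: dict) -> dict:
    """Group curated companies (id -> name) by blocking key."""
    index = defaultdict(list)
    for company_id, name in companies.items():
        index[block_key(name)].append((company_id, name))
    return index


def score(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def route(party_name: str, index: dict):
    """Return (company_id, score, status) for the best candidate in the same block."""
    candidates = index.get(block_key(party_name), [])
    if not candidates:
        return None, 0.0, "rejected"
    best_id, best_score = max(
        ((cid, score(party_name, name)) for cid, name in candidates),
        key=lambda pair: pair[1],
    )
    if best_score >= AUTO_ACCEPT:
        status = "accepted"
    elif best_score >= NEEDS_REVIEW:
        status = "needs_review"
    else:
        status = "rejected"
    return best_id, round(best_score, 3), status


index = build_index({"C-001": "Acme Holdings", "C-002": "Zephyr Freight"})
for party in ("ACME HOLDINGS", "Acme Holdings Inc", "Acme Rocket Fuel Distributors"):
    print(party, "->", route(party, index))
```

Raising `AUTO_ACCEPT` trades recall for precision; the middle band exists so that borderline matches go to a human reviewer instead of being written automatically.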

GOOD TO HAVE:

Experience with legal, court, regulatory, or PACER/litigation data, or other domains with messy named-party data.
Familiarity with corporate hierarchy / subsidiary reference data (e.g., LEI, DUNS, or similar identifier systems) and parent–child entity mapping.
NLP techniques relevant to name matching: phonetic algorithms, embeddings/vector similarity, named-entity handling.
Experience designing human-in-the-loop review workflows for ambiguous matches.
Building repeatable, schedulable batch jobs and clear match-quality dashboards or reporting.
Background in risk, compliance, or financial information products (a natural fit with the Fitch Solutions mission).
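
The parent-child entity mapping mentioned above can be pictured as a walk up a reference table until an entity with no parent is reached. This is a minimal sketch; the identifiers and the `parent_of` table are hypothetical, and real hierarchy data (for example from LEI or DUNS sources) would be far richer.

```python
def ultimate_parent(entity_id: str, parent_of: dict) -> str:
    """Follow parent links until reaching an entity with no recorded parent."""
    seen = set()
    current = entity_id
    while parent_of.get(current) is not None:
        if current in seen:
            # Reference data can contain loops; fail loudly rather than spin forever.
            raise ValueError(f"cycle in hierarchy at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical hierarchy: subsidiary -> intermediate holding -> top-level parent.
parent_of = {
    "SUB-001": "MID-010",
    "SUB-002": "MID-010",
    "MID-010": "PARENT-100",
    "PARENT-100": None,
}
print(ultimate_parent("SUB-001", parent_of))
```

Because the link is by identifier rather than by name, it works even when a subsidiary's name shares no tokens with its parent.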

JOB RESPONSIBILITIES:

Design, build, and tune a Python-based entity resolution / record-linkage system that matches a curated company list against: 1) case data; 2) freeform party, docket, and filing text; and 3) external data that could assist in resolution
Produce results in a clear, reviewable, and serializable (and deserializable) output form for validation before anything is written to production.
Develop the data mapping logic that links many surface forms (abbreviations, punctuation variants, legal suffixes, aliases) to a single canonical company identifier.
Solve the subsidiary / corporate-family problem: map related entities up to their parent identifier even when names share no common tokens, using reference data and curated mappings.
Engineer matching against unstructured / freeform text: normalization, tokenization, fuzzy matching, blocking/candidate generation, and confidence scoring.
Build deliberate precision controls to minimize false positives given sparse corroborating data, including thresholds, scoring, ambiguity flags, and human-review queues for low-confidence matches.
Write and run the internal script that persists approved identifier values to the correct records in the database.
Define and track match-quality metrics (precision, recall, false-positive rate) and iterate to keep them reliable as the company list and case data grow.
Partner with data, product, and engineering stakeholders to make the curated list, the matching rules, and the MDM approach maintainable over time.
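
The match-quality metrics named above can be computed from a labeled sample of (party, company) pairs. The sketch below assumes such a hand-labeled sample exists; the pair format, the function name, and the sample data are illustrative.

```python
def match_quality(predicted: set, truth: set, universe: set) -> dict:
    """Compute precision, recall, and false-positive rate over labeled candidate pairs.

    predicted: pairs the system linked
    truth:     pairs a reviewer confirmed as correct
    universe:  every candidate pair that was labeled
    """
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(universe - predicted - truth)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labeled sample of (party_id, company_id) pairs.
predicted = {(1, "A"), (2, "A"), (3, "B")}
truth = {(1, "A"), (3, "B"), (4, "B")}
universe = predicted | truth | {(5, "A"), (6, "B")}
print(match_quality(predicted, truth, universe))
```

Tracking these numbers per release of the curated list makes it visible when a new company name or rule change starts producing false positives.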

SUMMARY:

Work your way – Enjoy the freedom to work from anywhere, with flexible hours that match your natural rhythm.
Plenty of time to recharge – Take 15 paid vacation days, 10 additional unpaid days if needed, plus all national holidays.
Meaningful, long-term projects – Dive into exciting 1–5+ year projects using the latest in AI, cloud and more.
Support beyond the job – We help cover things like advanced language courses, gym memberships, and mentorship programs to help you grow.
Work with global clients – Collaborate directly with international teams to create real impact.
Make extra cash – Earn bonuses for referring great people or bringing in new business opportunities.
Great people, no micromanagement – Join a supportive, results-focused team where you’re trusted to do your best work.

This flexibility allows developers…

A better work-life balance
Increased productivity
The ability to work any time around the clock
Reduction in commute time
Design your ideal daily schedule.
Build a career, not just a job.
Work smarter, not longer.
More time with family and friends

For more job openings, please follow Evolve Squads on LinkedIn.
