
formfyxer.lit_explorer

recursive_get_id

def recursive_get_id(values_to_unpack: Union[dict, list],
                     tmpl: Optional[set] = None)

Pull ID values out of the LIST/NSMI results from Spot.
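
Spot returns nested LIST/NSMI entries that can contain an "id" key at any depth. The traversal can be sketched as below; the payload shape used in the example is an assumption for illustration, not the exact Spot response format:

```python
from typing import Optional, Union


def collect_ids(values_to_unpack: Union[dict, list],
                ids: Optional[set] = None) -> set:
    """Recursively gather every "id" value from nested dicts/lists."""
    if ids is None:
        ids = set()
    if isinstance(values_to_unpack, dict):
        if "id" in values_to_unpack:
            ids.add(values_to_unpack["id"])
        for value in values_to_unpack.values():
            collect_ids(value, ids)
    elif isinstance(values_to_unpack, list):
        for item in values_to_unpack:
            collect_ids(item, ids)
    return ids
```
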

spot

def spot(text: str,
         lower: float = 0.25,
         pred: float = 0.5,
         upper: float = 0.6,
         verbose: float = 0,
         token: str = "")

Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using the NSMIv2/LIST taxonomy (https://taxonomy.legal/), returning only the IDs of the issues found in the text.

re_case

def re_case(text: str) -> str

Capture PascalCase, snake_case, and kebab-case terms and add spaces to separate the joined words.
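
The splitting can be approximated with two regex passes; this is a sketch of the idea, not the library's exact implementation:

```python
import re


def re_case_sketch(text: str) -> str:
    """Insert a space at lowercase/digit-to-uppercase boundaries, then
    turn snake_case and kebab-case separators into spaces."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.sub(r"[_-]+", " ", spaced)
```

For example, `re_case_sketch("usersFullName")` yields `"users Full Name"`.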

regex_norm_field

def regex_norm_field(text: str)

Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions. See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables

reformat_field

def reformat_field(text: str,
                   max_length: int = 30,
                   tools_token: Optional[str] = None)

Generate a snake_case label from text without external similarity scoring.

normalize_name

def normalize_name(jur: str,
                   group: str,
                   n: int,
                   per,
                   last_field: str,
                   this_field: str,
                   tools_token: Optional[str] = None,
                   context: Optional[str] = None,
                   openai_creds: Optional[OpenAiCreds] = None,
                   api_key: Optional[str] = None,
                   model: str = "gpt-5-nano") -> Tuple[str, float]

Normalize a field name to the Assembly Line conventions where possible; otherwise, produce a snake_case variable name of appropriate length.

In most cases, you should use the better-performing rename_pdf_fields_with_context function, which renames all fields with a single LLM prompt.

Arguments

  • jur - Jurisdiction (legacy parameter, maintained for compatibility)
  • group - Group/category (legacy parameter, maintained for compatibility)
  • n - Position in field list (legacy parameter, maintained for compatibility)
  • per - Percentage through field list (legacy parameter, maintained for compatibility)
  • last_field - Previous field name (legacy parameter, maintained for compatibility)
  • this_field - The field name to normalize
  • tools_token - Tools API token (legacy parameter, maintained for compatibility)
  • context - Optional PDF text context to help with field naming
  • openai_creds - OpenAI credentials for LLM calls
  • api_key - OpenAI API key (overrides creds and env vars)
  • model - OpenAI model to use (default: gpt-5-nano)

Returns

Tuple of (normalized_field_name, confidence_score)

If context and LLM credentials are provided, uses LLM normalization. Otherwise, falls back to traditional regex-based approach for backward compatibility.
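
The traditional fallback reduces to regex-style snake_casing. A rough sketch of that behavior follows; the exact cleaning rules and the name `snake_case_fallback` are illustrative assumptions, not the library's actual heuristics:

```python
import re


def snake_case_fallback(field_name: str, max_length: int = 30) -> str:
    """Lowercase, collapse non-alphanumeric runs to underscores, truncate."""
    cleaned = re.sub(r"[^a-z0-9]+", "_", field_name.lower()).strip("_")
    return cleaned[:max_length]
```
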

rename_pdf_fields_with_context

def rename_pdf_fields_with_context(
    pdf_path: str,
    original_field_names: List[str],
    openai_creds: Optional[OpenAiCreds] = None,
    api_key: Optional[str] = None,
    model: str = "gpt-5-nano") -> Dict[str, str]

Use LLM to rename PDF fields based on full PDF context with field markers.

Arguments

  • pdf_path - Path to the PDF file
  • original_field_names - List of original field names from the PDF
  • openai_creds - OpenAI credentials to use for the API call
  • api_key - explicit API key to use (overrides creds and env vars)
  • model - the OpenAI model to use (default: gpt-5-nano)

Returns

Dictionary mapping original field names to new Assembly Line names
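
The returned dictionary can be applied directly to a field list. A minimal sketch of that step; the mapping values shown are made up for illustration:

```python
from typing import Dict, List


def apply_renames(original_fields: List[str],
                  mapping: Dict[str, str]) -> List[str]:
    """Replace each field name with its Assembly Line name,
    leaving unmapped fields unchanged."""
    return [mapping.get(name, name) for name in original_fields]


# Hypothetical mapping, as rename_pdf_fields_with_context might return it
renamed = apply_renames(
    ["Text1", "Text2"],
    {"Text1": "users1_name", "Text2": "docket_number"},
)
```
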

cluster_screens

def cluster_screens(fields: List[str] = [],
                    openai_creds: Optional[OpenAiCreds] = None,
                    api_key: Optional[str] = None,
                    model: str = "gpt-5-nano",
                    damping: Optional[float] = None,
                    tools_token: Optional[str] = None) -> Dict[str, List[str]]

Groups the given fields into screens using an LLM (GPT) for semantic understanding.

Arguments

  • fields - a list of field names

  • openai_creds - OpenAI credentials to use for the API call

  • api_key - explicit API key to use (overrides creds and env vars)

  • model - the OpenAI model to use (default: gpt-5-nano, can use gpt-4 variants)

  • damping - deprecated parameter, kept for backward compatibility

  • tools_token - deprecated parameter, kept for backward compatibility

Returns

A suggested screen grouping, with each screen name mapped to the list of fields on it
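
A hypothetical return value, to show the shape; the screen names and field names here are invented:

```python
# Each key is a suggested screen; each value lists the fields shown on it.
suggested_screens = {
    "name_screen": ["users1_name", "users1_birthdate"],
    "address_screen": ["users1_address", "users1_city", "users1_zip"],
}
```
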

InputType Objects

class InputType(Enum)

Input type maps onto the type of input the PDF author chose for the field. We only handle text, checkbox, and signature fields.

field_types_and_sizes

def field_types_and_sizes(
    fields: Optional[Iterable[FormField]]) -> List[FieldInfo]

Transform the fields provided by get_existing_pdf_fields into a summary format. The result looks like: [{"var_name": var_name, "type": "text | checkbox | signature", "max_length": n}]

AnswerType Objects

class AnswerType(Enum)

Answer type describes the effort the user answering the form will require. "Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc. "Gathered" answers require looking around one's desk, e.g., for a health insurance number. "Third party" answers require picking up the phone to call someone else who is the keeper of the information. "Created" answers don't exist before the user is presented with the question. They may include a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special form of created answers. See Jarrett and Gaffney, Forms That Work (2008).

classify_field

def classify_field(field: FieldInfo, new_name: str) -> AnswerType

Apply heuristics to the field's original and "normalized" name to classify it as either a "slot-in", "gathered", "third party" or "created" field type.

get_adjusted_character_count

def get_adjusted_character_count(field: FieldInfo) -> float

Determines the bracketed length of an input field based on its max_length attribute, returning a float representing the approximate length of the field content.

The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long) instead of directly using the character count, as forms can allocate different spaces for the same data without considering the space the user actually needs.

Arguments

  • field FieldInfo - An object containing information about the input field, including the "max_length" attribute.

Returns

  • float - The approximate length of the field content, categorized into checkboxes, 2 words, short, medium, or long based on the max_length attribute.

Examples:

>>> get_adjusted_character_count({"type": InputType.CHECKBOX})
4.7
>>> get_adjusted_character_count({"max_length": 100})
9.4
>>> get_adjusted_character_count({"max_length": 300})
230
>>> get_adjusted_character_count({"max_length": 600})
115
>>> get_adjusted_character_count({"max_length": 1200})
1150

time_to_answer_field

def time_to_answer_field(field: FieldInfo,
                         new_name: str,
                         cpm: int = 40,
                         cpm_std_dev: int = 17) -> Callable[[int], np.ndarray]

Apply a heuristic for the time it takes to answer the given field, in minutes. It is hand-written for now. It factors in the input type, the answer type (slot-in, gathered, third party, or created), and the amount of input text allowed in the field. The return value is a function that can return N samples of how long answering the field will take (in minutes).
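
The returned callable draws samples rather than a point estimate, so downstream code can propagate uncertainty. A minimal sketch of that shape; the mean and standard deviation passed in are placeholders, not the library's heuristics:

```python
from typing import Callable

import numpy as np


def make_time_sampler(mean_minutes: float,
                      std_dev: float) -> Callable[[int], np.ndarray]:
    """Return a function producing N non-negative samples of answer time."""
    def sample(n: int) -> np.ndarray:
        # Clip at zero: answering a field can't take negative time
        return np.clip(np.random.normal(mean_minutes, std_dev, size=n), 0, None)
    return sample
```
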

time_to_answer_form

def time_to_answer_form(processed_fields,
                        normalized_fields) -> Tuple[float, float]

Provide an estimate of how long it would take an average user to respond to the questions on the provided form. We use signals such as the field type, name, and space provided for the response to come up with a rough estimate, based on whether the field is:

  1. fill in the blank
  2. gathered - e.g., an ID number, case number, etc.
  3. third party - the filler needs to ask someone else for the information, e.g., another person's income
  4. created - either short created (about 3 lines) or long created (anything over 3 lines)
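
Per-field time samplers of this kind can be combined by Monte Carlo simulation into a form-level estimate. A sketch of that aggregation; the two example samplers are placeholders, not the library's actual heuristics:

```python
import numpy as np


def estimate_form_time(samplers, n: int = 2000):
    """Sum N draws from each field's sampler; report mean and std dev."""
    totals = np.sum([sample(n) for sample in samplers], axis=0)
    return float(totals.mean()), float(totals.std())


# Hypothetical per-field samplers: a quick slot-in field, a long created one
fast_field = lambda n: np.random.normal(0.5, 0.1, n)
slow_field = lambda n: np.random.normal(5.0, 2.0, n)
mean_minutes, std_minutes = estimate_form_time([fast_field, slow_field])
```
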

cleanup_text

def cleanup_text(text: str, fields_to_sentences: bool = False) -> str

Apply cleanup routines to text to provide more accurate readability statistics.

text_complete

def text_complete(system_message: str,
                  user_message: Optional[str] = None,
                  max_tokens: int = 500,
                  creds: Optional[OpenAiCreds] = None,
                  temperature: float = 0,
                  api_key: Optional[str] = None,
                  model: str = "gpt-5-nano",
                  prompt: Optional[str] = None) -> Union[str, Dict]

Run a prompt via OpenAI's API and return the result.

Arguments

  • system_message str - The system message that sets the context/role for the AI.
  • user_message Optional[str] - The user message/question. If None, system_message is used as the prompt.
  • max_tokens int, optional - The number of tokens to generate. Defaults to 500.
  • creds Optional[OpenAiCreds], optional - The credentials to use. Defaults to None.
  • temperature float, optional - The temperature to use. Defaults to 0. Note: Not supported by GPT-5 family models.
  • api_key Optional[str], optional - Explicit API key to use. Defaults to None.
  • model str, optional - The model to use. Defaults to "gpt-5-nano".
  • prompt Optional[str] - Legacy parameter for backward compatibility. If provided, used as system message.

Returns

Union[str, Dict]: Returns a parsed dictionary if JSON was requested and successfully parsed, otherwise returns the raw string response.

complete_with_command

def complete_with_command(text,
                          command,
                          tokens,
                          creds: Optional[OpenAiCreds] = None,
                          api_key: Optional[str] = None,
                          model: Optional[str] = None) -> str

Combine some text with a command and send it to OpenAI.

needs_calculations

def needs_calculations(text: str) -> bool

A conservative guess at whether a given form requires the filler to make math calculations, something that should be avoided.

get_passive_sentences

def get_passive_sentences(
    text: Union[List, str],
    tools_token: Optional[str] = None,
    model: str = "gpt-5-nano",
    api_key: Optional[str] = None
) -> List[Tuple[str, List[Tuple[int, int]]]]

Return passive voice fragments for each sentence in text.

The function relies on OpenAI's language model (via passive_voice_detection) to detect passive constructions. tools_token is kept for backward compatibility but is no longer used.

Arguments

  • text Union[List, str] - The input text or list of texts to analyze.
  • tools_token Optional[str], optional - Deprecated. Previously used for authentication with tools.suffolklitlab.org. Defaults to None.
  • model str, optional - The OpenAI model to use for detection. Defaults to "gpt-5-nano".
  • api_key Optional[str], optional - OpenAI API key to use. If None, will try docassemble config (if available) then environment variables. Defaults to None.

Returns

List[Tuple[str, List[Tuple[int, int]]]]: A list of tuples, each containing the original text and a list of tuples representing the start and end positions of detected passive voice fragments.

Notes

At least for now, fragment detection is no longer meaningful (except for tokenized sentences), because the LLM detection simply returns the full original sentence when it contains passive voice. We have not reimplemented this behavior of PassivePy.

get_citations

def get_citations(text: str, tokenized_sentences: List[str]) -> List[str]

Get citations and some extra surrounding context (the full sentence) if the citation is fewer than 5 characters (eyecite often captures only a section symbol for state-level short citation formats).

get_sensitive_data_types

def get_sensitive_data_types(
    fields: List[str],
    fields_old: Optional[List[str]] = None) -> Dict[str, List[str]]

Given a list of fields, identify those related to sensitive information and return a dictionary grouping the sensitive fields by type. A list of the old field names can also be provided; these should be in the same order as the first parameter. Passing the old field names allows the sensitive-field algorithm to match more accurately. The return value contains only field names from the first parameter, never the old names.

The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security Number.
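
Grouping of this kind can be sketched with a handful of regexes; the patterns below are illustrative assumptions, not the library's actual matching rules:

```python
import re
from typing import Dict, List

# Hypothetical patterns, one per sensitive data type named above
SENSITIVE_PATTERNS = {
    "Social Security Number": r"ssn|social.?security",
    "Bank Account Number": r"bank.?account",
    "Credit Card Number": r"credit.?card|card.?number",
    "Driver's License Number": r"driver.?s?.?license",
}


def group_sensitive_sketch(fields: List[str]) -> Dict[str, List[str]]:
    """Bucket field names by the sensitive data type they appear to hold."""
    found: Dict[str, List[str]] = {}
    for field in fields:
        for data_type, pattern in SENSITIVE_PATTERNS.items():
            if re.search(pattern, field, re.IGNORECASE):
                found.setdefault(data_type, []).append(field)
                break
    return found
```
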

substitute_phrases

def substitute_phrases(
    input_string: str,
    substitution_phrases: Dict[str, str]) -> Tuple[str, List[Tuple[int, int]]]

Substitute phrases in the input string and return the new string and positions of substituted phrases.

Arguments

  • input_string str - The input string containing phrases to be replaced.
  • substitution_phrases Dict[str, str] - A dictionary mapping original phrases to their replacement phrases.

Returns

Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of tuples, each containing the start and end positions of the substituted phrases in the new string.

Example:

>>> input_string = "The quick brown fox jumped over the lazy dog."
>>> substitution_phrases = {"quick brown": "swift reddish", "lazy dog": "sleepy canine"}
>>> new_string, positions = substitute_phrases(input_string, substitution_phrases)
>>> print(new_string)
The swift reddish fox jumped over the sleepy canine.
>>> print(positions)
[(4, 17), (38, 51)]

substitute_neutral_gender

def substitute_neutral_gender(
    input_string: str) -> Tuple[str, List[Tuple[int, int]]]

Substitute gendered phrases with neutral phrases in the input string. Primary source is https://github.com/joelparkerhenderson/inclusive-language

substitute_plain_language

def substitute_plain_language(
    input_string: str) -> Tuple[str, List[Tuple[int, int]]]

Substitute complex phrases with simpler alternatives. Source of terms is drawn from https://www.plainlanguage.gov/guidelines/words/

transformed_sentences

def transformed_sentences(
    sentence_list: List[str],
    fun: Callable) -> List[Tuple[str, str, List[Tuple[int, int]]]]

Apply a function to a list of sentences and return only the sentences with changed terms. The result is a tuple of the original sentence, new sentence, and the starting and ending position of each changed fragment in the sentence.
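
The filtering step can be sketched as follows, assuming fun returns a (new_text, positions) pair in the style of the substitute_* functions above; the toy transform is invented for illustration:

```python
from typing import Callable, List, Tuple


def transformed_only(sentences: List[str],
                     fun: Callable) -> List[Tuple[str, str, List[Tuple[int, int]]]]:
    """Keep only the sentences the transform actually changed."""
    results = []
    for sentence in sentences:
        new_sentence, positions = fun(sentence)
        if new_sentence != sentence:
            results.append((sentence, new_sentence, positions))
    return results


def toy_fun(text: str):
    # Toy transform: replace "utilize" with "use", reporting its position
    if "utilize" in text:
        idx = text.index("utilize")
        return text.replace("utilize", "use"), [(idx, idx + 3)]
    return text, []
```
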

fallback_rename_fields

def fallback_rename_fields(
    field_names: List[str]) -> Tuple[List[str], List[float]]

A simple fallback renaming scheme that just makes field names lowercase and replaces spaces with underscores.
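
A sketch of that scheme; the returned confidence value is an illustrative assumption, not the library's actual score:

```python
from typing import List, Tuple


def fallback_rename_sketch(field_names: List[str]) -> Tuple[List[str], List[float]]:
    """Lowercase and underscore each name; report a uniform low confidence."""
    renamed = [name.lower().replace(" ", "_") for name in field_names]
    # 0.1 is a placeholder confidence for every renamed field
    return renamed, [0.1] * len(renamed)
```
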

parse_form

def parse_form(in_file: str,
               title: Optional[str] = None,
               jur: Optional[str] = None,
               cat: Optional[str] = None,
               normalize: bool = True,
               spot_token: Optional[str] = None,
               tools_token: Optional[str] = None,
               openai_creds: Optional[OpenAiCreds] = None,
               openai_api_key: Optional[str] = None,
               rewrite: bool = False,
               debug: bool = False)

Read in a PDF, pull out basic stats, attempt to normalize its form fields, and rewrite the in_file with the new fields (if rewrite=True). If you pass a Spot token, we will guess the NSMI code. If you pass OpenAI creds, we will give suggestions for the title and description. If you pass openai_api_key, it will be used for passive voice detection (overriding creds and env vars).

Arguments

  • in_file - the path to the PDF file to analyze

  • title - the title of the form, if not provided we will try to guess it

  • jur - the jurisdiction to use for normalization (e.g., "ny" or "ca")

  • cat - the category to use for normalization (e.g., "divorce" or "small_claims")

  • normalize - whether to normalize the field names

  • spot_token - the token to use for spot.suffolklitlab.org, if provided we will attempt to guess the NSMI code

  • tools_token - the token to use for tools.suffolklitlab.org, needed for normalization

  • openai_creds - the OpenAI credentials to use, if provided we will attempt to guess the title and description

  • openai_api_key - an explicit OpenAI API key to use, if provided it will override any creds or environment variables

  • rewrite - whether to rewrite the PDF in place with the new field names

  • debug - whether to print debug information

Returns

A dictionary of information about the form

form_complexity

def form_complexity(stats)

Return a single number representing how hard the form is to complete. Higher is harder.