Impresso API resources

Search content items in the Impresso corpus.

# Search for content items
impresso.search.find(term='Titanic', limit=10)

# Complex queries with AND/OR operators
from impresso import AND, OR
impresso.search.find(term=AND("hitler", "stalin") & OR("molotow", "ribbentrop"))

# Search with date range
from impresso import DateRange
impresso.search.find(term="independence", date_range=DateRange("1921-05-21", "2001-01-02"))

# Search by entity mentions
impresso.search.find(entity_id=AND("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein"))

# Limit to specific newspapers
impresso.search.find(term="independence", newspaper_id=OR("EXP", "GDL"))

# Get facets to analyze search results
impresso.search.facet(facet='newspaper', term='war')
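
The combined query above can be read as "all of the first group and any of the second group". The toy matcher below only illustrates that reading in plain Python. It does not use the impresso library, the function names are hypothetical, and it assumes that `&` requires both groups to hold; the real matching happens on the server and may tokenize text differently.

```python
def matches_all(text, terms):
    """True when every term occurs in the text (AND semantics)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def matches_any(text, terms):
    """True when at least one term occurs in the text (OR semantics)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def matches_query(text):
    """Toy counterpart of AND("hitler", "stalin") & OR("molotow", "ribbentrop")."""
    # Assumption: "&" means both groups must match.
    return matches_all(text, ["hitler", "stalin"]) and matches_any(
        text, ["molotow", "ribbentrop"]
    )
```

A text mentioning Hitler, Stalin, and Ribbentrop passes; a text mentioning only Hitler and Ribbentrop does not.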

impresso.resources.search.SearchResource

Bases: Resource

Search content items in the Impresso database.

Examples:

Search for articles containing a term:

>>> results = search.find(term="war")
>>> print(results.df)

Filter articles by date range and newspaper:

>>> from impresso import DateRange
>>> date_range = DateRange(start="1900-01-01", end="1910-12-31")
>>> results = search.find(term="revolution", newspaper_id="GDL", date_range=date_range)
>>> print(results.df)

Search for front page articles mentioning an entity:

>>> results = search.find(entity_id="aida-0001-54-Napoleon", front_page=True)
>>> print(results.df)

Search by semantic similarity using text embeddings:

>>> embedding = tools.embed_text("military conflict", target="text")
>>> similar_articles = search.find(embedding=embedding)
>>> print(similar_articles.df)

Get facets to analyze search results:

>>> newspaper_facets = search.facet(facet="newspaper", term="war")
>>> print(newspaper_facets.df)
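
Date bounds are passed to DateRange as ISO strings in the examples above. A small pre-check can catch typos and swapped bounds before a request is sent. This is a sketch using only the standard library; `checked_date_bounds` is a hypothetical helper, not part of impresso, and whether DateRange performs its own validation is not stated here.

```python
from datetime import date


def checked_date_bounds(start, end):
    """Validate two ISO dates (YYYY-MM-DD) and return them as ISO strings.

    Raises ValueError if either string is not a valid calendar date or if
    the start date is later than the end date.
    """
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError(f"start {start} is after end {end}")
    return start_date.isoformat(), end_date.isoformat()
```

The result can be unpacked into the constructor, for example `DateRange(*checked_date_bounds("1900-01-01", "1910-12-31"))`.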

facet(facet, term=None, order_by='value', limit=None, offset=None, with_text_contents=False, title=None, front_page=None, entity_id=None, newspaper_id=None, date_range=None, language=None, mention=None, topic_id=None, collection_id=None, country=None, partner_id=None, text_reuse_cluster_id=None)

Get facets for a search query.

Facets provide aggregated information about a specific dimension of search results, such as counts of newspaper titles, languages, or topics.

Parameters:
  • facet (GetSearchFacetIdLiteral) –

    Type of facet to retrieve (e.g., 'newspaper', 'language', 'topic').

  • term (str | AND[str] | OR[str] | None, default: None ) –

    Search term or combination of terms to filter facets.

  • order_by (GetSearchFacetOrderByLiteral | None, default: 'value' ) –

    Sort order for facet results ('value' or 'count'). Defaults to 'value'.

  • limit (int | None, default: None ) –

    Maximum number of facet buckets to return.

  • offset (int | None, default: None ) –

    Number of facet buckets to skip for pagination.

  • with_text_contents (bool | None, default: False ) –

    Filter for content items with text contents. Defaults to False.

  • title (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items having this term or terms in the title.

  • front_page (bool | None, default: None ) –

    Filter for content items that were on the front page.

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items mentioning this entity or entities.

  • newspaper_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by newspaper ID(s).

  • date_range (DateRange | None, default: None ) –

    Filter by publication date range.

  • language (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content language. Use 2-letter ISO language codes (e.g., 'en', 'de', 'fr').

  • mention (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items mentioning entities with these terms.

  • topic_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by topic ID(s).

  • collection_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by collection ID(s).

  • country (str | AND[str] | OR[str] | None, default: None ) –

    Filter by country of publication. Use 2-letter ISO country codes (e.g., 'ch', 'de', 'lu').

  • partner_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by partner institution ID(s).

  • text_reuse_cluster_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by text reuse cluster ID(s).

Returns:
  • FacetDataContainer( FacetDataContainer ) –

    Data container with facet results, including counts for each bucket.

Examples:

>>> search = SearchResource(client)
>>> # Get newspaper facets for articles mentioning "war"
>>> newspaper_facets = search.facet(facet="newspaper", term="war")
>>> # Get language facets for front page articles
>>> language_facets = search.facet(facet="language", front_page=True)
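
The language and country filters expect 2-letter ISO codes. The helper below is a minimal sketch for normalizing user input before passing it on. `normalize_code` is a hypothetical name, and lowercasing is an assumption based on the lowercase examples in the parameter descriptions ('en', 'de', 'ch'). It checks only the shape of the code, not whether the code is a registered ISO value.

```python
def normalize_code(code):
    """Return a lowercase 2-letter ASCII code, or raise ValueError."""
    cleaned = code.strip().lower()
    if len(cleaned) != 2 or not cleaned.isascii() or not cleaned.isalpha():
        raise ValueError(f"expected a 2-letter code, got {code!r}")
    return cleaned
```

For instance, `search.facet(facet="newspaper", language=normalize_code(" FR "))` would send 'fr'.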

find(term=None, order_by=None, limit=None, offset=None, with_text_contents=False, title=None, front_page=None, entity_id=None, newspaper_id=None, date_range=None, language=None, mention=None, topic_id=None, collection_id=None, country=None, partner_id=None, issue_id=None, text_reuse_cluster_id=None, embedding=None, copyright=None, include_embeddings=False)

Search for content items in Impresso.

Parameters:
  • term (str | AND[str] | OR[str] | None, default: None ) –

    Search term or combination of search terms.

  • order_by (SearchOrderByLiteral | None, default: None ) –

    Sort order for results.

  • limit (int | None, default: None ) –

    Maximum number of results to return per page. Defaults to 100.

  • offset (int | None, default: None ) –

    Number of results to skip for pagination.

  • with_text_contents (bool | None, default: False ) –

    Return only content items with text contents. Defaults to False.

  • title (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items having this term or all/any of the terms in the title.

  • front_page (bool | None, default: None ) –

    Return only content items that were on the front page.

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items mentioning this entity or all/any of the entities.

  • newspaper_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by newspaper ID(s).

  • date_range (DateRange | None, default: None ) –

    Filter by publication date range.

  • language (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content language or all/any of the languages. Use 2-letter ISO language codes (e.g., 'en', 'de', 'fr').

  • mention (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content items mentioning entities with this term or all/any of entities with the terms.

  • topic_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by topic ID(s).

  • collection_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by collection ID(s).

  • country (str | AND[str] | OR[str] | None, default: None ) –

    Filter by country of publication. Use 2-letter ISO country codes (e.g., 'ch', 'de', 'lu').

  • partner_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by partner institution ID(s).

  • text_reuse_cluster_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by text reuse cluster ID(s).

  • embedding (Embedding | None, default: None ) –

    Text embedding for similarity search. Use tools.embed_text() to generate embeddings from text.

  • copyright (ContentItemAccessRightsCopyrightLiteral | AND[ContentItemAccessRightsCopyrightLiteral] | OR[ContentItemAccessRightsCopyrightLiteral] | None, default: None ) –

    Filter by copyright status.

  • include_embeddings (bool, default: False ) –

    Whether to include embeddings in the response. Defaults to False.

Returns:
  • SearchDataContainer( SearchDataContainer ) –

    Data container with the first page of search results.
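
Because find returns only the first page, later pages can be requested by combining limit and offset. The sketch below computes the offsets needed to cover a known number of hits. `page_offsets` is a hypothetical helper; the default of 100 mirrors the documented page size, and the total is assumed to be known already (some containers on this page expose a total property).

```python
def page_offsets(total, limit=100):
    """Return the offset of every page needed to cover `total` results."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # range() stops before `total`, so a partial last page is included.
    return list(range(0, max(total, 0), limit))
```

Each offset can then be passed along with the same filters, for example `search.find(term="war", limit=100, offset=offset)`.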

impresso.api_client.models.search_order_by.SearchOrderByLiteral = Literal['date', 'id', 'relevance', '-date', '-relevance', '-id'] module-attribute
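
The allowed order_by values come in pairs with and without a leading minus sign. The sketch below validates a value against the literal above and splits it into a field and a direction flag. `describe_order_by` is a hypothetical helper, and reading the minus sign as "descending" is an assumption based on common API conventions, not something this page states.

```python
SEARCH_ORDER_BY = ("date", "id", "relevance", "-date", "-relevance", "-id")


def describe_order_by(order_by):
    """Return (field, is_descending) for a valid search order_by value."""
    if order_by not in SEARCH_ORDER_BY:
        raise ValueError(f"unsupported order_by value: {order_by!r}")
    # Assumption: a leading "-" marks descending order.
    return order_by.lstrip("-"), order_by.startswith("-")
```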

impresso.api_client.models.content_item_access_rights_copyright.ContentItemAccessRightsCopyrightLiteral = Literal['euo', 'in_cpy', 'nkn', 'pbl', 'und', 'unk'] module-attribute

impresso.resources.tools.Embedding = Annotated[str, 'base64-encoded string with model prefix'] module-attribute

impresso.resources.search.SearchDataContainer

Bases: DataContainer

Response of a search call.

df: DataFrame property

Return the data as a pandas dataframe.

pages()

Iterate over all pages of results.
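
A common use of pages() is to gather every page before further analysis. The helper below is a minimal sketch that relies only on the two members documented here, pages() and df. `collect_frames` and its `max_pages` safety cap are hypothetical additions, and it assumes every page exposes a df property, as shown in the topics example on this page.

```python
def collect_frames(results, max_pages=None):
    """Collect the `df` of each page of a result container into a list.

    `max_pages` caps the number of pages fetched; None means no cap.
    """
    frames = []
    for page_number, page in enumerate(results.pages()):
        if max_pages is not None and page_number >= max_pages:
            break
        frames.append(page.df)
    return frames
```

When at least one page is returned, the frames can be merged with `pandas.concat(frames)`.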

Entities

Search entities in the Impresso corpus.

# Search for entities
impresso.entities.find(term="Douglas Adams")

# Filter by entity type
impresso.entities.find(term="Paris", entity_type="location")

# Get entities with Wikidata details
impresso.entities.find(term="Paris", resolve=True)

# Search by Wikidata IDs
from impresso import AND
impresso.entities.find(wikidata_id=AND("Q2", "Q4", "Q42"))

# Get a specific entity by ID
impresso.entities.get("entity-id")

impresso.resources.entities.EntitiesResource

Bases: Resource

Search entities in the Impresso database.

Examples:

Search for entities by name:

>>> results = entities.find(term="Napoleon")
>>> print(results.df)

Filter entities by type:

>>> results = entities.find(term="Paris", entity_type="location")
>>> print(results.df)

Get entity details with Wikidata resolution:

>>> results = entities.find(term="Napoleon", resolve=True)
>>> print(results.df)

Get a specific entity by its ID:

>>> entity_id = "some-entity-id"  # Replace with a real ID
>>> entity = entities.get(entity_id)
>>> print(entity.df)

find(term=None, wikidata_id=None, entity_id=None, entity_type=None, order_by=None, resolve=False, limit=None, offset=None)

Search entities in Impresso.

Parameters:
  • term (str | None, default: None ) –

    Search term.

  • wikidata_id (str | AND[str] | OR[str] | None, default: None ) –

    Return only entities resolved to this Wikidata ID.

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Return only the entity or entities with the given ID(s).

  • entity_type (EntityType | AND[EntityType] | OR[EntityType] | None, default: None ) –

    Return only entities of this type.

  • order_by (FindEntitiesOrderByLiteral | None, default: None ) –

    Field to order results by.

  • resolve (bool, default: False ) –

    Return Wikidata details of the entities, if the entity is linked to a Wikidata entry.

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

Returns:
  • FindEntitiesContainer( FindEntitiesContainer ) –

    Data container with a page of results of the search.

get(id)

Get entity by ID.

Parameters:
  • id (str) –

    The ID of the entity to retrieve.

Returns:
  • GetEntityContainer( GetEntityContainer ) –

    Data container with the entity information.

impresso.resources.entities.EntityType = Literal['person', 'location'] module-attribute

impresso.api_client.models.find_entities_order_by.FindEntitiesOrderByLiteral = Literal['count', 'count-mentions', 'name', 'relevance', '-relevance', '-name', '-count', '-count-mentions'] module-attribute

impresso.resources.entities.FindEntitiesContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

pages()

Iterate over all pages of results.

Media sources

Search media sources available in the Impresso corpus.

impresso.media_sources.find(
    term="wort",
    order_by="lastIssue",
)

impresso.resources.media_sources.MediaSourcesResource

Bases: Resource

Search media sources in the Impresso database.

Media sources are newspapers and other publications available in Impresso.

Examples:

Find all media sources:

>>> results = media_sources.find()
>>> print(results.df)

Search media sources by name:

>>> results = media_sources.find(term="Gazette")
>>> print(results.df)

Filter media sources by type:

>>> results = media_sources.find(type="newspaper")
>>> print(results.df)

Get media sources with detailed properties:

>>> results = media_sources.find(with_properties=True)
>>> print(results.df)

find(term=None, type=None, order_by=None, with_properties=False, limit=None, offset=None)

Search media sources in Impresso.

Parameters:
  • term (str | None, default: None ) –

    Search term.

  • type (FindMediaSourcesTypeLiteral | None, default: None ) –

    Type of media sources to search for.

  • order_by (FindMediaSourcesOrderByLiteral | None, default: None ) –

    Field to order results by.

  • with_properties (bool, default: False ) –

    Include properties in the results.

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

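Returns:
  • FindMediaSourcesContainer( FindMediaSourcesContainer ) –

    Data container with a page of results of the search.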

impresso.api_client.models.find_media_sources_type.FindMediaSourcesTypeLiteral = Literal['newspaper'] module-attribute

impresso.api_client.models.find_media_sources_order_by.FindMediaSourcesOrderByLiteral = Literal['countIssues', 'firstIssue', 'lastIssue', 'name', '-name', '-firstIssue', '-lastIssue', '-countIssues'] module-attribute

impresso.resources.media_sources.FindMediaSourcesContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

Content Items

Get a single content item by ID.

# Get a content item by ID
impresso.content_items.get("NZZ-1794-08-09-a-i0002")

# Get a content item with embeddings
impresso.content_items.get("NZZ-1794-08-09-a-i0002", include_embeddings=True)

# Get only the embeddings of a content item
embeddings = impresso.content_items.get_embeddings("NZZ-1794-08-09-a-i0002")

impresso.resources.content_items.ContentItemsResource

Bases: Resource

Get content items from the Impresso database.

Examples:

Get a specific content item by its ID:

>>> item_id = "some-item-id"  # Replace with a real ID
>>> item = content_items.get(item_id)
>>> print(item.df)

Get a content item with embeddings:

>>> item = content_items.get(item_id, include_embeddings=True)
>>> print(item.raw.get("embeddings"))

Get only the embeddings of a content item:

>>> embeddings = content_items.get_embeddings(item_id)
>>> print(embeddings)

get(id, include_embeddings=False)

Get a content item by its id.

Parameters:
  • id (str) –

    The id of the content item.

  • include_embeddings (bool, default: False ) –

    Whether to include embeddings in the response.

Returns:
  • ContentItemDataContainer( ContentItemDataContainer ) –

    Data container with the content item.

get_embeddings(id)

Get the embeddings of a content item by its id.

Parameters:
  • id (str) –

    The id of the content item.

Returns:
  • list[str] –

    The embeddings of the content item, if present. Each embedding is returned in its canonical form: a model prefix followed by a colon and the base64-encoded data.
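
Embedding strings are described on this page as base64-encoded strings with a model prefix. The sketch below splits one into its two parts. `split_embedding` is a hypothetical helper, and it assumes the prefix and the payload are separated by the first colon; how the decoded bytes map to vector components is not specified here, so they are returned as raw bytes.

```python
import base64


def split_embedding(embedding):
    """Split '<model prefix>:<base64 payload>' into (prefix, raw bytes)."""
    # Assumption: the first colon separates the prefix from the payload.
    prefix, separator, payload = embedding.partition(":")
    if not separator or not prefix or not payload:
        raise ValueError("expected '<model prefix>:<base64 payload>'")
    return prefix, base64.b64decode(payload, validate=True)
```

This can help when a content item has several embeddings and only the one produced by a particular model should be passed back to find.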

impresso.resources.content_items.ContentItemDataContainer

Bases: DataContainer

Response of a get content item call.

df: DataFrame property

Return the data as a pandas dataframe.

pydantic: ContentItem property

Return the data as a pydantic model.

raw: dict[str, Any] property

Return the data as a Python dictionary.

size: int property

Current page size.

total: int property

Total number of results.

Images

Search images in the Impresso corpus. Supports text search, filtering by various metadata, and visual similarity search using embeddings.

# Search for images by keyword and content type
impresso.images.find(term="rocket", content_type="object")

# Get an image with its embeddings
image = impresso.images.get("luxwort-1930-09-26-a-i0036", include_embeddings=True)

# Search for similar images using an in-corpus image
embeddings = impresso.images.get_embeddings("luxwort-1930-09-26-a-i0036")
impresso.images.find(embedding=embeddings[0], limit=10)

# Search for similar images using an external image
embedding = impresso.tools.embed_image("https://example.com/image.png", target="image")
impresso.images.find(embedding=embedding, limit=10)

# Multimodal search: find images using text
text_embedding = impresso.tools.embed_text(text="portrait", target="multimodal")
impresso.images.find(embedding=text_embedding, limit=10)

impresso.resources.images.ImagesResource

Bases: Resource

Search images in Impresso.

Examples:

Search for images by keyword:

>>> results = images.find(term="war")
>>> print(results.df)

Filter images by date range and newspaper:

>>> from impresso import DateRange
>>> date_range = DateRange(start="1900-01-01", end="1910-12-31")
>>> results = images.find(media_id="GDL", date_range=date_range)
>>> print(results.df)

Search for front page images only:

>>> results = images.find(is_front=True)
>>> print(results.df)

Search images by visual similarity using embeddings:

>>> embedding = tools.embed_image("path/to/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)
>>> print(similar_images.df)

Get a specific image by its ID:

>>> image_id = "some-image-id"  # Replace with a real ID
>>> image = images.get(image_id)
>>> print(image.df)

find(term=None, media_id=None, issue_id=None, is_front=None, date_range=None, visual_content=None, technique=None, communication_goal=None, content_type=None, embedding=None, include_embeddings=False, order_by=None, limit=None, offset=None)

Find images in Impresso.

Parameters:
  • term (str | None, default: None ) –

    The search term for text-based search.

  • media_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by newspaper media ID(s).

  • issue_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter by issue ID(s).

  • is_front (bool | None, default: None ) –

    Filter for front page images only.

  • date_range (DateRange | None, default: None ) –

    Filter by date range.

  • visual_content (str | AND[str] | OR[str] | None, default: None ) –

    Filter by visual content category.

  • technique (str | AND[str] | OR[str] | None, default: None ) –

    Filter by image technique.

  • communication_goal (str | AND[str] | OR[str] | None, default: None ) –

    Filter by communication goal.

  • content_type (str | AND[str] | OR[str] | None, default: None ) –

    Filter by content type.

  • embedding (Embedding | None, default: None ) –

    Embedding for similarity search. Use tools.embed_image() to generate an embedding from an image, or tools.embed_text() to generate one from text.

  • include_embeddings (bool, default: False ) –

    Whether to include image embeddings in the response. Defaults to False.

  • order_by (FindImagesOrderByLiteral | None, default: None ) –

    Sort order for results.

  • limit (int | None, default: None ) –

    Maximum number of results to return per page. Defaults to 100.

  • offset (int | None, default: None ) –

    Number of results to skip.

Returns:
  • FindImagesContainer( FindImagesContainer ) –

    Data container with the first page of the search results.

get(id, include_embeddings=False)

Get an image by its id.

Parameters:
  • id (str) –

    The id of the image.

  • include_embeddings (bool, default: False ) –

    Whether to include embeddings in the response.

Returns:
  • GetImageContainer( GetImageContainer ) –

    Data container with the image information.

get_embeddings(id)

Get the embeddings of an image by its id.

Parameters:
  • id (str) –

    The id of the image.

Returns:
  • list[str] –

    The embeddings of the image, if present. Each embedding is returned in its canonical form: a model prefix followed by a colon and the base64-encoded data.

impresso.api_client.models.find_images_order_by.FindImagesOrderByLiteral = Literal['date', '-date'] module-attribute

impresso.resources.images.FindImagesContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

pages()

Iterate over all pages of results.

impresso.resources.images.GetImageContainer

Bases: DataContainer

df: DataFrame property

Return the data as a pandas dataframe.

Topics

Search topics in the Impresso database. Topics are thematic clusters discovered through topic modeling of the newspaper content.

# Search for topics
impresso.topics.find(term="economy")

# Get a specific topic by ID
impresso.topics.get("topic-id")

impresso.resources.topics.TopicsResource

Bases: Resource

Search topics in the Impresso database.

Examples:

Search for topics containing specific words:

>>> results = topics.find(term="economy")
>>> print(results.df)

Get a specific topic by its ID:

>>> topic_id = "some-topic-id" # Replace with a real ID
>>> topic = topics.get(topic_id)
>>> print(topic.df)

Iterate through all pages of topic search results:

>>> results = topics.find(term="war", limit=10)
>>> for page in results.pages():
...     print(page.df)

find(term=None, order_by=None, limit=None, offset=None)

Search topics in Impresso.

Parameters:
  • term (str | None, default: None ) –

    Search term to find topics by their words.

  • order_by (FindTopicsOrderByLiteral | None, default: None ) –

    Field to order results by.

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

Returns:
  • FindTopicsContainer( FindTopicsContainer ) –

    Data container with a page of results of the search.

get(id)

Get topic by ID.

Parameters:
  • id (str) –

    The ID of the topic to retrieve.

Returns:
  • GetTopicContainer( GetTopicContainer ) –

    Data container with the topic information.

impresso.api_client.models.find_topics_order_by.FindTopicsOrderByLiteral = Literal['model', 'name', '-name', '-model'] module-attribute

impresso.resources.topics.FindTopicsContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

pages()

Iterate over all pages of results.

impresso.resources.topics.GetTopicContainer

Bases: DataContainer

Response of a get call.

df: DataFrame property

Return the data as a pandas dataframe.

Data Providers

Search data providers in the Impresso database. Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.

# Search for data providers
impresso.data_providers.find(term="library")

# Get a specific data provider by ID
impresso.data_providers.get("provider-id")

impresso.resources.data_providers.DataProvidersResource

Bases: Resource

Search data providers in the Impresso database.

Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.

Examples:

Find all data providers:

>>> results = data_providers.find()
>>> print(results.df)

Search data providers by name:

>>> results = data_providers.find(term="library")
>>> print(results.df)

Get a specific data provider by its ID:

>>> provider_id = "some-provider-id"  # Replace with a real ID
>>> provider = data_providers.get(provider_id)
>>> print(provider.df)

find(term=None, provider_id=None, limit=None, offset=None)

Search data providers in Impresso.

Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.

Parameters:
  • term (str | None, default: None ) –

    Search term to find data providers by their names.

  • provider_id (str | AND[str] | OR[str] | None, default: None ) –

    Return only the data provider(s) with the given ID(s).

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

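Returns:
  • FindDataProvidersContainer( FindDataProvidersContainer ) –

    Data container with a page of results of the search.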

get(id)

Get data provider by ID.

Parameters:
  • id (str) –

    The ID of the data provider to retrieve.

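Returns:
  • GetDataProviderContainer( GetDataProviderContainer ) –

    Data container with the data provider information.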

impresso.resources.data_providers.FindDataProvidersContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

pages()

Iterate over all pages of results.

impresso.resources.data_providers.GetDataProviderContainer

Bases: DataContainer

Response of a get call.

df: DataFrame property

Return the data as a pandas dataframe.

Experiments

Execute experiments with the Impresso platform. Experiments allow you to interact with various computational tools and models.

# List all available experiments
experiments = impresso.experiments.find()

# Execute a specific experiment
result = impresso.experiments.execute(
    experiment_id="some-experiment-id",
    body={"param": "value"}
)

impresso.resources.experiments.ExperimentsResource

Bases: Resource

Experiment with Impresso.

execute(experiment_id, body)

Execute an experiment with the given ID.

Parameters:
  • experiment_id (str) –

    ID of the experiment to execute.

  • body (dict) –

    Body of the experiment.

Returns:
  • dict( dict ) –

    Result of the experiment.

find()

Find all available experiments.

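Returns:
  • FindExperimentsContainer( FindExperimentsContainer ) –

    Data container with the available experiments.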

impresso.resources.experiments.FindExperimentsContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

Collections

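Work with collections of content items: search, inspect, create, rename, and delete collections, and add or remove their items.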

# Search for collections
impresso.collections.find(term="war")

# Get a specific collection by ID
collection = impresso.collections.get("collection-id")
collection_id = collection.raw["id"]

# List items in a collection
items = impresso.collections.items(collection_id)

# Add items to a collection (asynchronous - may take a few minutes)
content_item = impresso.content_items.get("NZZ-1794-08-09-a-i0002")
impresso.collections.add_items(collection_id, [content_item.pydantic.uid])

# Remove items from a collection (asynchronous - may take a few minutes)
impresso.collections.remove_items(collection_id, [content_item.pydantic.uid])

impresso.resources.collections.CollectionsResource

Bases: Resource

Work with collections.

Examples:

Find collections containing the term "war":

>>> results = collections.find(term="war")
>>> print(results.df)

Get a specific collection by its ID:

>>> collection_id = "some-collection-id" # Replace with a real ID
>>> collection = collections.get(collection_id)
>>> print(collection.df)

List items in a collection:

>>> items = collections.items(collection_id)
>>> print(items.df)

Add items to a collection:

>>> item_ids_to_add = ["item-id-1", "item-id-2"] # Replace with real item IDs
>>> collections.add_items(collection_id, item_ids_to_add)

Remove items from a collection:

>>> item_ids_to_remove = ["item-id-1"] # Replace with real item IDs
>>> collections.remove_items(collection_id, item_ids_to_remove)

Create a new collection:

>>> new_collection = collections.create("My Collection", description="A test collection")
>>> print(new_collection.df)

Rename a collection:

>>> collections.rename(collection_id, "New Name")

Delete a collection:

>>> collections.delete(collection_id)

add_items(collection_id, item_ids)

Add items to a collection by their IDs.

NOTE: Items are not added immediately. This operation may take up to a few minutes to complete and reflect in the collection.

Parameters:
  • collection_id (str) –

    ID of the collection.

  • item_ids (list[str]) –

    IDs of the content items to add.
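
Since additions are applied asynchronously, a script that depends on the new items may need to wait for them. The polling loop below is a sketch under stated assumptions: `wait_for_items` is a hypothetical helper, it assumes the index of `items(...).df` holds content item IDs (as the search examples on this page suggest), and it inspects only the first page returned by items(), so it suits small collections.

```python
import time


def wait_for_items(collections, collection_id, item_ids,
                   timeout=300.0, interval=10.0, sleep=time.sleep):
    """Poll a collection until all `item_ids` are listed or `timeout` elapses.

    Returns True if every ID was seen, False on timeout.
    """
    wanted = set(item_ids)
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: the DataFrame index contains the content item IDs.
        present = set(collections.items(collection_id).df.index)
        if wanted <= present:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The `sleep` parameter exists so the loop can be exercised without real delays; in normal use the default is fine.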

create(title, description=None, access_level=None)

Create a new collection.

Parameters:
  • title (str) –

    Title of the collection.

  • description (str | None, default: None ) –

    Optional description of the collection.

  • access_level (NewCollectionRequestAccessLevelLiteral | None, default: None ) –

    Access level of the collection ("private" or "public").

Returns:

delete(id)

Delete a collection by ID.

Parameters:
  • id (str) –

    The ID of the collection to delete.

find(term=None, order_by=None, limit=None, offset=None)

Search collections in Impresso.

Parameters:
  • term (str | None, default: None ) –

    Search term.

  • order_by (FindCollectionsOrderByLiteral | None, default: None ) –

    Field to order results by.

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

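Returns:
  • FindCollectionsContainer( FindCollectionsContainer ) –

    Data container with a page of results of the search.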

get(id)

Get collection by ID.

Parameters:
  • id (str) –

    The ID of the collection to retrieve.

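Returns:
  • GetCollectionContainer( GetCollectionContainer ) –

    Data container with the collection information.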

items(collection_id, limit=None, offset=None)

Return content items from a collection, one page at a time.

Parameters:
  • collection_id (str) –

    ID of the collection.

  • limit (int | None, default: None ) –

    Number of results to return.

  • offset (int | None, default: None ) –

    Number of results to skip.

Returns:
  • SearchDataContainer( SearchDataContainer ) –

    Data container with a page of results of the search.

remove_items(collection_id, item_ids)

Remove items from a collection by their IDs.

NOTE: Items are not removed immediately. This operation may take up to a few minutes to complete and reflect in the collection.

Parameters:
  • collection_id (str) –

    ID of the collection.

  • item_ids (list[str]) –

    IDs of the content items to remove.

rename(id, title)

Rename a collection.

Parameters:
  • id (str) –

    The ID of the collection to rename.

  • title (str) –

    The new title for the collection.

Returns:

impresso.api_client.models.find_collections_order_by.FindCollectionsOrderByLiteral = Literal['date', 'size', '-date', '-size'] module-attribute

impresso.resources.collections.FindCollectionsContainer

Bases: DataContainer

Response of a find call.

df: DataFrame property

Return the data as a pandas dataframe.

impresso.resources.collections.GetCollectionContainer

Bases: DataContainer

Response of a get call.

df: DataFrame property

Return the data as a pandas dataframe.

pydantic: Collection property

The response data as a pydantic model.

size: int property

Current page size.

total: int property

Total number of results.

Tools: Named entity recognition and Embeddings

The Python library provides tools for text processing and semantic search:

  • Named Entity Recognition (NER): Extract and classify named entities (people, places, organizations) from text.
  • Named Entity Linking (NEL): Resolve recognized entities to Wikidata entries.
  • Text Embeddings: Generate semantic embeddings from text for similarity search across the corpus.
  • Image Embeddings: Generate embeddings from images for visual similarity search and multimodal retrieval.

text = "Jean-Baptiste Nicolas Robert Schuman (29 June 1886 – 4 September 1963) was a Luxembourg-born French statesman."

# Extract named entities from text (fast)
result = impresso.tools.ner(text)
result.df  # View entities as DataFrame

# Extract and link entities to Wikidata (slower but more detailed)
result = impresso.tools.ner_nel(text)
result.df  # Includes Wikidata links

# Link pre-tagged entities to external resources (requires [START] and [END] markers)
tagged_text = "[START] Jean-Baptiste Nicolas Robert Schuman [END] was a statesman."
impresso.tools.nel(tagged_text)

# Generate text embeddings for semantic search
text_embedding = impresso.tools.embed_text("European integration", target="text")
results = impresso.search.find(embedding=text_embedding, limit=5)

# Use in-corpus embedding for similar article search
first_item_id = results.df.index[0]
in_corpus_embedding = impresso.content_items.get_embeddings(first_item_id)[0]
impresso.search.find(embedding=in_corpus_embedding, limit=10)

# Generate image embeddings from URL
image_embedding = impresso.tools.embed_image("https://example.com/image.png", target="image")
impresso.images.find(embedding=image_embedding)
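
Entity linking with nel expects mentions to be wrapped in [START] and [END] markers. The helper below is a small sketch for adding those markers to plain text. `tag_mention` is a hypothetical name, it tags only the first occurrence of the mention, and the spaces around the mention follow the tagged example above.

```python
def tag_mention(text, mention):
    """Wrap the first occurrence of `mention` in [START] ... [END] markers."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    end = start + len(mention)
    return f"{text[:start]}[START] {mention} [END]{text[end:]}"
```

The tagged string can then be passed on, for example `impresso.tools.nel(tag_mention(sentence, "Robert Schuman"))`.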

impresso.resources.tools.ToolsResource

Bases: Resource

Various helper tools for text processing and embedding generation.

Examples:

Extract named entities from text:

>>> entities = tools.ner("Napoleon visited Paris in 1815.")
>>> print(entities.df)

Extract and link entities to Wikidata:

>>> entities = tools.ner_nel("Napoleon visited Paris in 1815.")
>>> print(entities.df)

Generate text embedding for semantic search:

>>> embedding = tools.embed_text("military conflict", target="text")
>>> results = search.find(embedding=embedding)

Generate image embedding from file:

>>> embedding = tools.embed_image("path/to/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)

Generate image embedding from URL:

>>> embedding = tools.embed_image("https://example.com/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)

embed_image(image, target)

Embed an image into a vector space.

Parameters:
  • image (bytes | Base64Str | str) –

    Image to embed. Can be raw bytes, a base64-encoded string, a URL of an image, or a path to a file.

  • target (ImpressoImageEmbeddingRequestSearchTargetLiteral) –

    Target collection to embed the image into. Currently, only "image" is supported.

Returns:
  • Embedding( Embedding ) –

    The image embedding as a base64 string prefixed with model tag.

embed_text(text, target)

Embed text into a vector space.

Parameters:
  • text (str) –

    Text to embed.

  • target (ImpressoTextEmbeddingRequestSearchTargetLiteral) –

    Target collection to embed the text into.

Returns:
  • Embedding( Embedding ) –

    The text embedding as a base64 string prefixed with model tag.

nel(text)

Named Entity Linking

This method requires named entities to be enclosed in tags: [START]entity[END].
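When the position of an entity is already known from code, the markers can be inserted programmatically. The following is a minimal sketch; `tag_entity` is a hypothetical helper, not part of the library, and the spaces around the markers simply follow the tagged example shown in the usage snippet at the top of this section.

```python
def tag_entity(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in [START] ... [END] markers (hypothetical helper)."""
    if not 0 <= start < end <= len(text):
        raise ValueError("start/end must delimit a non-empty span inside text")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


sentence = "Robert Schuman was a statesman."
name = "Robert Schuman"
begin = sentence.index(name)  # character offset of the entity

tagged = tag_entity(sentence, begin, begin + len(name))
print(tagged)
```

The resulting string could then be passed to `nel`, assuming the service accepts markers in this form.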

Parameters:
  • text (str) –

    Text to process

Returns:

ner(text)

Named Entity Recognition

This method is faster than ner_nel but does not provide any linking to external resources.

Parameters:
  • text (str) –

    Text to process

Returns:

ner_nel(text)

Named Entity Recognition and Named Entity Linking

This method is slower than ner but provides linking to external resources.

Parameters:
  • text (str) –

    Text to process

Returns:

impresso.resources.tools.NerContainer

Bases: DataContainer

Named entity recognition result container.

df: DataFrame property

Return the data as a pandas dataframe.

limit: int property

Page size.

offset: int property

Page offset.

size: int property

Current page size.

total: int property

Total number of results.

Text reuse

Two resources can be used to search text reuse clusters and passages.

# Find text reuse clusters
impresso.text_reuse_clusters.find(cluster_size=(10, 20))

# Get facets for clusters (e.g., newspaper distribution)
impresso.text_reuse_clusters.facet(facet='newspaper', order_by='count')

# Find text reuse passages
impresso.text_reuse_passages.find(term='revolution', country='FR')

# Get facets for passages
impresso.text_reuse_passages.facet(facet='newspaper')

impresso.resources.text_reuse.clusters.TextReuseClustersResource

Bases: Resource

Interact with the text reuse clusters endpoint of the Impresso API.

This resource allows searching for text reuse clusters based on various criteria and retrieving facet information about these clusters.

Examples:

Find clusters with size between 10 and 20:

>>> results = textReuseClusters.find(cluster_size=(10, 20))
>>> print(results.df)

Get the distribution of newspapers involved in clusters:

>>> facet_results = textReuseClusters.facet(facet='newspaper', order_by='count')
>>> print(facet_results.df)

facet(facet, order_by='value', limit=None, offset=None, cluster_size=None, date_range=None, newspaper_id=None, lexical_overlap=None, day_delta=None)

Get facet information for text reuse clusters based on specified filters.

Facets provide aggregated counts for different properties of the clusters, such as the distribution of cluster sizes or newspapers involved.

Parameters:
  • facet (GetTrClustersFacetIdLiteral) –

    The specific facet to retrieve (e.g., 'newspaper', 'cluster_size').

  • order_by (GetTrClustersFacetOrderByLiteral | None, default: 'value' ) –

    How to order the facet values (e.g., 'value', 'count').

  • limit (int | None, default: None ) –

    Maximum number of facet values to return.

  • offset (int | None, default: None ) –

    Number of facet values to skip.

  • cluster_size (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by size before calculating facets.

  • date_range (DateRange | None, default: None ) –

    Filter clusters by date range before calculating facets.

  • newspaper_id (str | OR[str] | None, default: None ) –

    Filter clusters by newspaper before calculating facets.

  • lexical_overlap (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by lexical overlap before calculating facets.

  • day_delta (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by day delta before calculating facets.

Returns:
  • FacetDataContainer( FacetDataContainer ) –

    A container holding the facet results.

Examples:

Get the top 10 newspapers involved in clusters:

>>> facet_results = textReuseClusters.facet(facet='newspaper', limit=10, order_by='count')
>>> print(facet_results.df)

Get the distribution of cluster sizes for clusters within a specific date range:

>>> from impresso.structures import DateRange
>>> date_filter = DateRange(start="1900-01-01", end="1910-12-31")
>>> facet_results = textReuseClusters.facet(facet='cluster_size', date_range=date_filter)
>>> print(facet_results.df)
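Conceptually, a facet is a grouped count over the matching clusters. The sketch below reproduces that idea locally with made-up data; it is an illustration, not the library's implementation, and the exact sort direction behind `order_by='count'` and `order_by='value'` is an assumption.

```python
from collections import Counter

# Made-up newspaper IDs, one per matching cluster (illustrative data only).
cluster_newspapers = ["GDL", "EXP", "GDL", "JDG", "GDL", "EXP"]

counts = Counter(cluster_newspapers)

# Assumed meaning of order_by='count': most frequent bucket first.
by_count = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
# Assumed meaning of order_by='value': buckets sorted by their value.
by_value = sorted(counts.items())

limit = 2  # keep only the first two buckets, like the limit parameter
print(by_count[:limit])
print(by_value[:limit])
```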

find(term=None, title=None, order_by=None, cluster_size=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, limit=None, offset=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)

Find text reuse clusters based on various criteria.

Parameters:
  • term (str | None, default: None ) –

    Search for clusters containing specific text.

  • title (str | AND[str] | OR[str] | None, default: None ) –

    Filter clusters by the title of the articles within them.

  • order_by (FindTextReuseClustersOrderByLiteral | None, default: None ) –

    Specify the sorting order for the results.

  • cluster_size (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by the number of items they contain.

  • lexical_overlap (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by the lexical overlap score.

  • day_delta (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter clusters by the time span (in days) between the first and last item.

  • date_range (DateRange | None, default: None ) –

    Filter clusters based on the date range of their items.

  • newspaper_id (str | OR[str] | None, default: None ) –

    Filter clusters containing items from specific newspapers.

  • collection_id (str | OR[str] | None, default: None ) –

    Filter clusters containing items from specific collections.

  • limit (int | None, default: None ) –

    Maximum number of clusters to return.

  • offset (int | None, default: None ) –

    Number of clusters to skip from the beginning.

  • front_page (bool | None, default: None ) –

    Filter clusters containing items published on the front page.

  • topic_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter clusters associated with specific topics.

  • language (str | OR[str] | None, default: None ) –

    Filter clusters by the language of their items.

  • country (str | OR[str] | None, default: None ) –

    Filter clusters by the country of publication of their items.

  • mention (str | AND[str] | OR[str] | None, default: None ) –

    Filter clusters containing specific mentions (named entities).

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter clusters associated with specific entity IDs.

Returns:
  • FindTextReuseClustersContainer( FindTextReuseClustersContainer ) –

    A container holding the search results.

Examples:

Find clusters with size between 10 and 20:

>>> results = textReuseClusters.find(cluster_size=(10, 20))
>>> print(results.df)

Find clusters related to 'politics' in Swiss newspapers:

>>> results = textReuseClusters.find(term='politics', country='CH')
>>> print(results.df)
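To make the `cluster_size` and `day_delta` filters more concrete, here is a small local sketch of the two quantities as described in the parameter list: the number of items in a cluster and the span in days between its first and last item. The cluster data and helper names are made up, and treating a range as inclusive on both ends is an assumption.

```python
from datetime import date

# Made-up publication dates of the items in two clusters.
clusters = {
    "cluster-a": [date(1912, 4, 16), date(1912, 4, 18), date(1912, 4, 25)],
    "cluster-b": [date(1914, 7, 29), date(1914, 7, 29)],
}


def describe(dates):
    """Return (cluster_size, day_delta) for a list of item dates."""
    return len(dates), (max(dates) - min(dates)).days


def in_range(value, bounds):
    """Inclusive range check; inclusivity is an assumption."""
    low, high = bounds
    return low <= value <= high


summary = {cid: describe(dates) for cid, dates in clusters.items()}
# Keep clusters whose items span between 1 and 30 days.
matching = [cid for cid, (_, delta) in summary.items() if in_range(delta, (1, 30))]
print(summary)
print(matching)
```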

impresso.resources.text_reuse.passages.TextReusePassagesResource

Bases: Resource

Text reuse passages resource. It allows searching for text reuse passages and retrieving facet information about them.

facet(facet, term=None, limit=None, offset=None, order_by=None, cluster_id=None, cluster_size=None, title=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)

Get facet information for text reuse passages based on specified filters.

Facets provide aggregated counts for different properties of the passages, such as the distribution of newspapers or languages.

Parameters:
  • facet (GetTrPassagesFacetIdLiteral) –

    The specific facet to retrieve (e.g., 'newspaper', 'language').

  • term (str | None, default: None ) –

    Filter passages by text content before calculating facets.

  • limit (int | None, default: None ) –

    Maximum number of facet values to return.

  • offset (int | None, default: None ) –

    Number of facet values to skip.

  • order_by (FindTextReusePassagesOrderByLiteral | None, default: None ) –

    How to order the facet values (e.g., 'value', 'count').

  • cluster_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by cluster ID before calculating facets.

  • cluster_size (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages by cluster size before calculating facets.

  • title (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by article title before calculating facets.

  • lexical_overlap (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages by lexical overlap before calculating facets.

  • day_delta (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages by cluster day delta before calculating facets.

  • date_range (DateRange | None, default: None ) –

    Filter passages by publication date before calculating facets.

  • newspaper_id (str | OR[str] | None, default: None ) –

    Filter passages by newspaper before calculating facets.

  • collection_id (str | OR[str] | None, default: None ) –

    Filter passages by collection before calculating facets.

  • front_page (bool | None, default: None ) –

    Filter passages by front page status before calculating facets.

  • topic_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by topic ID before calculating facets.

  • language (str | OR[str] | None, default: None ) –

    Filter passages by language before calculating facets.

  • country (str | OR[str] | None, default: None ) –

    Filter passages by country before calculating facets.

  • mention (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by mention before calculating facets.

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by entity ID before calculating facets.

Returns:
  • FacetDataContainer( PassagesFacetDataContainer ) –

    A container holding the facet results.

Examples:

Get the top 10 newspapers associated with passages containing 'war':

>>> facet_results = textReusePassages.facet(facet='newspaper', term='war', limit=10)
>>> print(facet_results.df)

Get the language distribution for passages published between 1914 and 1918:

>>> from impresso.structures import DateRange
>>> date_filter = DateRange(start="1914-01-01", end="1918-12-31")
>>> facet_results = textReusePassages.facet(facet='language', date_range=date_filter)
>>> print(facet_results.df)

find(term=None, limit=None, offset=None, order_by=None, cluster_id=None, cluster_size=None, title=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)

Find text reuse passages based on various criteria.

Parameters:
  • term (str | None, default: None ) –

    Search for passages containing specific text.

  • limit (int | None, default: None ) –

    Maximum number of passages to return.

  • offset (int | None, default: None ) –

    Number of passages to skip from the beginning.

  • order_by (FindTextReusePassagesOrderByLiteral | None, default: None ) –

    Specify the sorting order for the results.

  • cluster_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages belonging to specific text reuse clusters.

  • cluster_size (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages based on the size of the cluster they belong to.

  • title (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages by the title of the articles they appear in.

  • lexical_overlap (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages based on the lexical overlap score within their cluster.

  • day_delta (Range | AND[Range] | OR[Range] | None, default: None ) –

    Filter passages based on the time span (in days) of their cluster.

  • date_range (DateRange | None, default: None ) –

    Filter passages based on their publication date.

  • newspaper_id (str | OR[str] | None, default: None ) –

    Filter passages from specific newspapers.

  • collection_id (str | OR[str] | None, default: None ) –

    Filter passages from specific collections.

  • front_page (bool | None, default: None ) –

    Filter passages appearing on the front page.

  • topic_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages associated with specific topics.

  • language (str | OR[str] | None, default: None ) –

    Filter passages by their language.

  • country (str | OR[str] | None, default: None ) –

    Filter passages by the country of publication.

  • mention (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages containing specific mentions (named entities).

  • entity_id (str | AND[str] | OR[str] | None, default: None ) –

    Filter passages associated with specific entity IDs.

Returns:
  • FindTextReusePassagesContainer( FindTextReusePassagesContainer ) –

    A container holding the search results.

Examples:

Find passages containing the term 'revolution' from French newspapers:

>>> results = textReusePassages.find(term='revolution', country='FR')
>>> print(results.df)

Find passages from clusters with a size greater than 50:

>>> results = textReusePassages.find(cluster_size=(51, None))
>>> print(results.df)

Pagination

When you search for content items, entities, or other resources, the library returns a limited subset of the results. This means that the results are divided into pages, and you can request a specific page of results by specifying the limit and offset parameters in your query. This is done to improve performance and avoid transferring large amounts of data at once.

The limit parameter specifies the maximum number of items to return in a single page, and the offset parameter specifies the 0-based starting index of the items to return. For example, if you set limit=10 and offset=20, the API will return items 20 through 29. When these parameters are not specified, the library uses default values, which are usually 0 for the offset and between 10 and 50 for the limit, depending on the resource.
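The arithmetic behind limit and offset can be checked with a few lines of plain Python. This is only a sketch of the index calculation; `page_window` is a hypothetical helper and the total of 95 items is made up.

```python
def page_window(limit: int, offset: int, total: int) -> range:
    """0-based indices of the items a request with limit/offset would cover."""
    start = min(offset, total)
    stop = min(offset + limit, total)
    return range(start, stop)


window = page_window(limit=10, offset=20, total=95)
print(window[0], window[-1])  # first and last index on this page

# The last page may hold fewer items than the limit.
last = page_window(limit=10, offset=90, total=95)
print(len(last))
```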

The response object, an instance of DataContainer, contains information about the pagination, such as the total number of items (total), the number of items in the current page (size), the limit, and the offset.

When a find or facet method is called, the response object contains data for the first page, or for the page selected by the offset and limit parameters. To get the subsequent pages, use the pages method of the response object. This method returns an iterator that yields the current page first and then a new DataContainer object for each subsequent page, until it reaches the end of the result set.

For example, if you want to get all the content items that mention "Titanic" with 20 items per page, you can use the following code:

result = impresso.search.find(
    term="titanic",
    limit=20,
)

for page in result.pages():
    print(
        f"Got page {page.offset} - {page.offset + page.size} of {page.total}. "
        + f"The first title is {page.raw['data'][0]['title']}"
    )
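For readers curious about what such an iterator does, the following sketch simulates offset-based paging against an in-memory list. It does not show the library's actual implementation; `fetch` and `iter_pages` are hypothetical stand-ins for an API call and for the pages method.

```python
from typing import Iterator

ITEMS = [f"item-{i}" for i in range(45)]  # stand-in for a remote result set


def fetch(limit: int, offset: int) -> dict:
    """Hypothetical stand-in for one API call returning a single page."""
    return {
        "data": ITEMS[offset:offset + limit],
        "offset": offset,
        "limit": limit,
        "total": len(ITEMS),
    }


def iter_pages(limit: int, offset: int = 0) -> Iterator[dict]:
    """Yield the requested page, then each following page until the end."""
    while True:
        page = fetch(limit, offset)
        if not page["data"]:
            break
        yield page
        offset += limit  # move the window forward by one page
        if offset >= page["total"]:
            break


sizes = [len(page["data"]) for page in iter_pages(limit=20)]
print(sizes)
```

Each loop step moves the offset forward by the limit, so the final page simply contains whatever items remain.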

impresso.data_container.DataContainer

Bases: Generic[IT, T]

Generic container for responses from the Impresso API returned by resource methods (get, find).

Generally represents a single page of the result. The results can be paginated through by adjusting the offset and limit parameters in the corresponding resource method call (e.g., client.newspapers.find). The total, limit, offset, and size properties provide information about the current page and the overall result set.

df: DataFrame property

The response data for the current page as a pandas dataframe.

Note that this DataFrame only contains the items from the current page of results, not the entire result set across all pages.

limit: int property

Maximum number of items requested for the current page.

offset: int property

The starting index (0-based) of the items on the current page.

pydantic: T property

The response data as a pydantic model.

raw: dict[str, Any] property

The response data as a python dictionary.

size: int property

Number of items actually present on the current page.

total: int property

Total number of results available across all pages.

url: str | None property

URL of an Impresso web application page representing the result set.

pages()

Yields the current page and all subsequent pages of results.

This method first yields the current DataContainer instance (self), then attempts to fetch and yield subsequent pages by making new API calls with adjusted offsets.

Returns:
  • Iterator[DataContainer[IT, T]] –

    An iterator that yields DataContainer instances, starting with the current one, followed by subsequent pages.

Example:

# Get the first page with 10 items per page
first_page = client.newspapers.find(limit=10)

# Iterate through all pages
for page in first_page.pages():
    # Process items from the current page
    print(f"Page {page.offset // page.limit + 1}:")
    print(page.df)
    # The loop will continue with the next page, if any
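Because pages returns an iterator, the number of pages consumed can be capped with `itertools.islice`. The sketch below demonstrates this with a stand-in generator; `fake_pages` is hypothetical, and it is an assumption that the real iterator fetches pages lazily in the same way.

```python
from itertools import islice


def fake_pages(total_pages: int):
    """Stand-in generator that behaves like an iterator of pages."""
    for number in range(1, total_pages + 1):
        print(f"fetching page {number}")  # marks where an API call would occur
        yield {"page": number}


# Take at most 3 pages; with a lazy generator, later pages are never requested.
first_three = list(islice(fake_pages(10), 3))
print([page["page"] for page in first_three])
```

With the library, the equivalent call would be `islice(first_page.pages(), 3)`.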