Impresso API resources
Search
Search content items in the Impresso corpus.
# Search for content items
impresso.search.find(term='Titanic', limit=10)
# Complex queries with AND/OR operators
from impresso import AND, OR
impresso.search.find(term=AND("hitler", "stalin") & OR("molotow", "ribbentrop"))
# Search with date range
from impresso import DateRange
impresso.search.find(term="independence", date_range=DateRange("1921-05-21", "2001-01-02"))
# Search by entity mentions
impresso.search.find(entity_id=AND("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein"))
# Limit to specific newspapers
impresso.search.find(term="independence", newspaper_id=OR("EXP", "GDL"))
# Get facets to analyze search results
impresso.search.facet(facet='newspaper', term='war')
impresso.resources.search.SearchResource
Bases: Resource
Search content items in the Impresso database.
Examples:
Search for articles containing a term:
>>> results = search.find(term="war")
>>> print(results.df)
Filter articles by date range and newspaper:
>>> from impresso import DateRange
>>> date_range = DateRange(start="1900-01-01", end="1910-12-31")
>>> results = search.find(term="revolution", newspaper_id="GDL", date_range=date_range)
>>> print(results.df)
Search for front page articles mentioning an entity:
>>> results = search.find(entity_id="aida-0001-54-Napoleon", front_page=True)
>>> print(results.df)
Search by semantic similarity using text embeddings:
>>> embedding = tools.embed_text("military conflict", target="text")
>>> similar_articles = search.find(embedding=embedding)
>>> print(similar_articles.df)
Get facets to analyze search results:
>>> newspaper_facets = search.facet(facet="newspaper", term="war")
>>> print(newspaper_facets.df)
facet(facet, term=None, order_by='value', limit=None, offset=None, with_text_contents=False, title=None, front_page=None, entity_id=None, newspaper_id=None, date_range=None, language=None, mention=None, topic_id=None, collection_id=None, country=None, partner_id=None, text_reuse_cluster_id=None)
Get facets for a search query.
Facets provide aggregated information about a specific dimension of search results, such as counts of newspaper titles, languages, or topics.
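As a rough mental model (not the server-side implementation), a facet behaves like a count of distinct values along one field of the matching items. The sketch below uses made-up records and a hypothetical `facet_counts` helper; it does not call the Impresso API.

```python
from collections import Counter

# Made-up records standing in for search hits; the field names are illustrative only.
hits = [
    {"newspaper": "GDL", "language": "fr"},
    {"newspaper": "GDL", "language": "fr"},
    {"newspaper": "EXP", "language": "fr"},
    {"newspaper": "NZZ", "language": "de"},
]


def facet_counts(records, field):
    """Count how many records fall into each value of `field`, most frequent first."""
    return Counter(record[field] for record in records).most_common()


newspaper_facet = facet_counts(hits, "newspaper")
language_facet = facet_counts(hits, "language")
```

With the library itself, the equivalent question is asked through `facet(facet="newspaper", ...)`, and the counts arrive in the returned container's `df`.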
Examples:
>>> search = SearchResource(client)
>>> # Get newspaper facets for articles mentioning "war"
>>> newspaper_facets = search.facet(facet="newspaper", term="war")
>>> # Get language facets for front page articles
>>> language_facets = search.facet(facet="language", front_page=True)
find(term=None, order_by=None, limit=None, offset=None, with_text_contents=False, title=None, front_page=None, entity_id=None, newspaper_id=None, date_range=None, language=None, mention=None, topic_id=None, collection_id=None, country=None, partner_id=None, issue_id=None, text_reuse_cluster_id=None, embedding=None, copyright=None, include_embeddings=False)
Search for content items in Impresso.
impresso.api_client.models.search_order_by.SearchOrderByLiteral = Literal['date', 'id', 'relevance', '-date', '-relevance', '-id']
module-attribute
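The accepted `order_by` values can be checked locally before a request is sent. The sketch below copies the literal above into a local alias and adds a hypothetical `parse_order_by` helper; it assumes, as is conventional but not stated here, that a leading `-` reverses the sort direction.

```python
from typing import Literal, Tuple, get_args

# Local copy of the values listed above; the library's own type lives in impresso.api_client.
SearchOrderBy = Literal["date", "id", "relevance", "-date", "-relevance", "-id"]


def parse_order_by(value: str) -> Tuple[str, bool]:
    """Return (field, is_reversed) for an allowed value; raise ValueError otherwise.

    Assumption: a leading "-" means the reverse of the field's default direction.
    """
    if value not in get_args(SearchOrderBy):
        raise ValueError(f"unsupported order_by value: {value!r}")
    if value.startswith("-"):
        return value[1:], True
    return value, False
```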
impresso.api_client.models.content_item_access_rights_copyright.ContentItemAccessRightsCopyrightLiteral = Literal['euo', 'in_cpy', 'nkn', 'pbl', 'und', 'unk']
module-attribute
impresso.resources.tools.Embedding = Annotated[str, 'base64-encoded string with model prefix']
module-attribute
impresso.resources.search.SearchDataContainer
Bases: DataContainer
Response of a search call.
df: DataFrame
property
Return the data as a pandas dataframe.
pages()
Iterate over all pages of results.
Entities
Search entities in the Impresso corpus.
# Search for entities
impresso.entities.find(term="Douglas Adams")
# Filter by entity type
impresso.entities.find(term="Paris", entity_type="location")
# Get entities with Wikidata details
impresso.entities.find(term="Paris", resolve=True)
# Search by Wikidata IDs
from impresso import AND
impresso.entities.find(wikidata_id=AND("Q2", "Q4", "Q42"))
# Get a specific entity by ID
impresso.entities.get("entity-id")
impresso.resources.entities.EntitiesResource
Bases: Resource
Search entities in the Impresso database.
Examples:
Search for entities by name:
>>> results = entities.find(term="Napoleon")
>>> print(results.df)
Filter entities by type:
>>> results = entities.find(term="Paris", entity_type="location")
>>> print(results.df)
Get entity details with Wikidata resolution:
>>> results = entities.find(term="Napoleon", resolve=True)
>>> print(results.df)
Get a specific entity by its ID:
>>> entity_id = "some-entity-id" # Replace with a real ID
>>> entity = entities.get(entity_id)
>>> print(entity.df)
find(term=None, wikidata_id=None, entity_id=None, entity_type=None, order_by=None, resolve=False, limit=None, offset=None)
Search entities in Impresso.
get(id)
Get entity by ID.
impresso.resources.entities.EntityType = Literal['person', 'location']
module-attribute
impresso.api_client.models.find_entities_order_by.FindEntitiesOrderByLiteral = Literal['count', 'count-mentions', 'name', 'relevance', '-relevance', '-name', '-count', '-count-mentions']
module-attribute
impresso.resources.entities.FindEntitiesContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
pages()
Iterate over all pages of results.
Media sources
Search media sources available in the Impresso corpus.
impresso.media_sources.find(
    term="wort",
    order_by="lastIssue",
)
impresso.resources.media_sources.MediaSourcesResource
Bases: Resource
Search media sources in the Impresso database.
Media sources are newspapers and other publications available in Impresso.
Examples:
Find all media sources:
>>> results = media_sources.find()
>>> print(results.df)
Search media sources by name:
>>> results = media_sources.find(term="Gazette")
>>> print(results.df)
Filter media sources by type:
>>> results = media_sources.find(type="newspaper")
>>> print(results.df)
Get media sources with detailed properties:
>>> results = media_sources.find(with_properties=True)
>>> print(results.df)
find(term=None, type=None, order_by=None, with_properties=False, limit=None, offset=None)
Search media sources in Impresso.
impresso.api_client.models.find_media_sources_type.FindMediaSourcesTypeLiteral = Literal['newspaper']
module-attribute
impresso.api_client.models.find_media_sources_order_by.FindMediaSourcesOrderByLiteral = Literal['countIssues', 'firstIssue', 'lastIssue', 'name', '-name', '-firstIssue', '-lastIssue', '-countIssues']
module-attribute
impresso.resources.media_sources.FindMediaSourcesContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
Content Items
Get a single content item by ID.
# Get a content item by ID
impresso.content_items.get("NZZ-1794-08-09-a-i0002")
# Get a content item with embeddings
impresso.content_items.get("NZZ-1794-08-09-a-i0002", include_embeddings=True)
# Get only the embeddings of a content item
embeddings = impresso.content_items.get_embeddings("NZZ-1794-08-09-a-i0002")
impresso.resources.content_items.ContentItemsResource
Bases: Resource
Get content items from the Impresso database.
Examples:
Get a specific content item by its ID:
>>> item_id = "some-item-id" # Replace with a real ID
>>> item = content_items.get(item_id)
>>> print(item.df)
Get a content item with embeddings:
>>> item = content_items.get(item_id, include_embeddings=True)
>>> print(item.raw.get("embeddings"))
Get only the embeddings of a content item:
>>> embeddings = content_items.get_embeddings(item_id)
>>> print(embeddings)
get(id, include_embeddings=False)
Get a content item by its id.
get_embeddings(id)
Get the embeddings of a content item by its id.
impresso.resources.content_items.ContentItemDataContainer
Bases: DataContainer
Response of a get content item call.
df: DataFrame
property
Return the data as a pandas dataframe.
pydantic: ContentItem
property
Return the data as a pydantic model.
raw: dict[str, Any]
property
Return the data as a python dictionary.
size: int
property
Current page size.
total: int
property
Total number of results.
Images
Search images in the Impresso corpus. Supports text search, filtering by various metadata, and visual similarity search using embeddings.
# Search for images by keyword and content type
impresso.images.find(term="rocket", content_type="object")
# Get an image with its embeddings
image = impresso.images.get("luxwort-1930-09-26-a-i0036", include_embeddings=True)
# Search for similar images using an in-corpus image
embeddings = impresso.images.get_embeddings("luxwort-1930-09-26-a-i0036")
impresso.images.find(embedding=embeddings[0], limit=10)
# Search for similar images using an external image
embedding = impresso.tools.embed_image("https://example.com/image.png", target="image")
impresso.images.find(embedding=embedding, limit=10)
# Multimodal search: find images using text
text_embedding = impresso.tools.embed_text(text="portrait", target="multimodal")
impresso.images.find(embedding=text_embedding, limit=10)
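Embedding-based search ranks items by how close their vectors are to the query vector. The metric Impresso uses is not stated here; the sketch below uses cosine similarity, one common choice, on made-up three-dimensional vectors, and both helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate ids sorted from most to least similar to `query`."""
    return sorted(
        candidates,
        key=lambda cid: cosine_similarity(query, candidates[cid]),
        reverse=True,
    )


# Made-up vectors; real embeddings have far more dimensions.
query = [1.0, 0.0, 0.0]
candidates = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.5, 0.5, 0.0]}
ranking = rank_by_similarity(query, candidates)
```

In practice the ranking happens on the server: the client only passes the embedding to `find` and receives the closest items.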
impresso.resources.images.ImagesResource
Bases: Resource
Search images in Impresso.
Examples:
Search for images by keyword:
>>> results = images.find(term="war")
>>> print(results.df)
Filter images by date range and newspaper:
>>> from impresso import DateRange
>>> date_range = DateRange(start="1900-01-01", end="1910-12-31")
>>> results = images.find(media_id="GDL", date_range=date_range)
>>> print(results.df)
Search for front page images only:
>>> results = images.find(is_front=True)
>>> print(results.df)
Search images by visual similarity using embeddings:
>>> embedding = tools.embed_image("path/to/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)
>>> print(similar_images.df)
Get a specific image by its ID:
>>> image_id = "some-image-id" # Replace with a real ID
>>> image = images.get(image_id)
>>> print(image.df)
find(term=None, media_id=None, issue_id=None, is_front=None, date_range=None, visual_content=None, technique=None, communication_goal=None, content_type=None, embedding=None, include_embeddings=False, order_by=None, limit=None, offset=None)
Find images in Impresso.
get(id, include_embeddings=False)
Get an image by its id.
get_embeddings(id)
Get the embeddings of an image by its id.
impresso.api_client.models.find_images_order_by.FindImagesOrderByLiteral = Literal['date', '-date']
module-attribute
impresso.resources.images.FindImagesContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
pages()
Iterate over all pages of results.
impresso.resources.images.GetImageContainer
Bases: DataContainer
df: DataFrame
property
Return the data as a pandas dataframe.
Topics
Search topics in the Impresso database. Topics are thematic clusters discovered through topic modeling of the newspaper content.
# Search for topics
impresso.topics.find(term="economy")
# Get a specific topic by ID
impresso.topics.get("topic-id")
impresso.resources.topics.TopicsResource
Bases: Resource
Search topics in the Impresso database.
Examples:
Search for topics containing specific words:
>>> results = topics.find(term="economy")
>>> print(results.df)
Get a specific topic by its ID:
>>> topic_id = "some-topic-id" # Replace with a real ID
>>> topic = topics.get(topic_id)
>>> print(topic.df)
Iterate through all pages of topic search results:
>>> results = topics.find(term="war", limit=10)
>>> for page in results.pages():
...     print(page.df)
find(term=None, order_by=None, limit=None, offset=None)
Search topics in Impresso.
get(id)
Get topic by ID.
impresso.api_client.models.find_topics_order_by.FindTopicsOrderByLiteral = Literal['model', 'name', '-name', '-model']
module-attribute
impresso.resources.topics.FindTopicsContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
pages()
Iterate over all pages of results.
impresso.resources.topics.GetTopicContainer
Bases: DataContainer
Response of a get call.
df: DataFrame
property
Return the data as a pandas dataframe.
Data Providers
Search data providers in the Impresso database. Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.
# Search for data providers
impresso.data_providers.find(term="library")
# Get a specific data provider by ID
impresso.data_providers.get("provider-id")
impresso.resources.data_providers.DataProvidersResource
Bases: Resource
Search data providers in the Impresso database.
Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.
Examples:
Find all data providers:
>>> results = data_providers.find()
>>> print(results.df)
Search data providers by name:
>>> results = data_providers.find(term="library")
>>> print(results.df)
Get a specific data provider by its ID:
>>> provider_id = "some-provider-id" # Replace with a real ID
>>> provider = data_providers.get(provider_id)
>>> print(provider.df)
find(term=None, provider_id=None, limit=None, offset=None)
Search data providers in Impresso.
Data providers are partner institutions that provide content to Impresso, such as libraries, archives, and media organizations.
get(id)
Get data provider by ID.
impresso.resources.data_providers.FindDataProvidersContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
pages()
Iterate over all pages of results.
impresso.resources.data_providers.GetDataProviderContainer
Bases: DataContainer
Response of a get call.
df: DataFrame
property
Return the data as a pandas dataframe.
Experiments
Execute experiments with the Impresso platform. Experiments allow you to interact with various computational tools and models.
# List all available experiments
experiments = impresso.experiments.find()
# Execute a specific experiment
result = impresso.experiments.execute(
    experiment_id="some-experiment-id",
    body={"param": "value"}
)
impresso.resources.experiments.ExperimentsResource
Bases: Resource
Experiment with Impresso.
execute(experiment_id, body)
Execute an experiment with the given ID.
find()
Find all available experiments.
impresso.resources.experiments.FindExperimentsContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
Collections
Search, create, rename, and delete collections, and manage the content items they contain.
# Search for collections
impresso.collections.find(term="war")
# Get a specific collection by ID
collection = impresso.collections.get("collection-id")
collection_id = collection.raw["id"]
# List items in a collection
items = impresso.collections.items(collection_id)
# Add items to a collection (asynchronous - may take a few minutes)
content_item = impresso.content_items.get("NZZ-1794-08-09-a-i0002")
impresso.collections.add_items(collection_id, [content_item.pydantic.uid])
# Remove items from a collection (asynchronous - may take a few minutes)
impresso.collections.remove_items(collection_id, [content_item.pydantic.uid])
impresso.resources.collections.CollectionsResource
Bases: Resource
Work with collections.
Examples:
Find collections containing the term "war":
>>> results = collections.find(term="war")
>>> print(results.df)
Get a specific collection by its ID:
>>> collection_id = "some-collection-id" # Replace with a real ID
>>> collection = collections.get(collection_id)
>>> print(collection.df)
List items in a collection:
>>> items = collections.items(collection_id)
>>> print(items.df)
Add items to a collection:
>>> item_ids_to_add = ["item-id-1", "item-id-2"] # Replace with real item IDs
>>> collections.add_items(collection_id, item_ids_to_add)
Remove items from a collection:
>>> item_ids_to_remove = ["item-id-1"] # Replace with real item IDs
>>> collections.remove_items(collection_id, item_ids_to_remove)
Create a new collection:
>>> new_collection = collections.create("My Collection", description="A test collection")
>>> print(new_collection.df)
Rename a collection:
>>> collections.rename(collection_id, "New Name")
Delete a collection:
>>> collections.delete(collection_id)
add_items(collection_id, item_ids)
Add items to a collection by their IDs.
NOTE: Items are not added immediately. This operation may take up to a few minutes to complete and reflect in the collection.
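Because additions show up with a delay, a script that depends on the updated collection may need to wait. A minimal polling sketch follows; `wait_until` is a hypothetical helper, not part of the library, and the commented usage assumes the container returned by `items` exposes `total` as `DataContainer` does.

```python
import time


def wait_until(predicate, timeout=300.0, interval=10.0):
    """Call `predicate` repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage (needs an authenticated client, so it is left commented out):
# before = impresso.collections.items(collection_id).total
# impresso.collections.add_items(collection_id, item_ids)
# wait_until(lambda: impresso.collections.items(collection_id).total > before)
```

A generous interval keeps the number of requests low; the operation is described as taking up to a few minutes, so polling every few seconds gains little.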
create(title, description=None, access_level=None)
Create a new collection.
delete(id)
Delete a collection by ID.
find(term=None, order_by=None, limit=None, offset=None)
Search collections in Impresso.
get(id)
Get collection by ID.
items(collection_id, limit=None, offset=None)
Return all content items from a collection.
remove_items(collection_id, item_ids)
Remove items from a collection by their IDs.
NOTE: Items are not removed immediately. This operation may take up to a few minutes to complete and reflect in the collection.
rename(id, title)
Rename a collection.
impresso.api_client.models.find_collections_order_by.FindCollectionsOrderByLiteral = Literal['date', 'size', '-date', '-size']
module-attribute
impresso.resources.collections.FindCollectionsContainer
Bases: DataContainer
Response of a find call.
df: DataFrame
property
Return the data as a pandas dataframe.
impresso.resources.collections.GetCollectionContainer
Bases: DataContainer
Response of a get call.
df: DataFrame
property
Return the data as a pandas dataframe.
pydantic: Collection
property
The response data as a pydantic model.
size: int
property
Current page size.
total: int
property
Total number of results.
Tools: Named entity recognition and Embeddings
The Python library provides tools for text and image processing and for semantic search:
- Named Entity Recognition (NER): Extract and classify named entities (people, places, organizations) from text.
- Named Entity Linking (NEL): Resolve recognized entities to Wikidata entries.
- Text Embeddings: Generate semantic embeddings from text for similarity search across the corpus.
- Image Embeddings: Generate embeddings from images for visual similarity search and multimodal retrieval.
text = "Jean-Baptiste Nicolas Robert Schuman (29 June 1886 – 4 September 1963) was a Luxembourg-born French statesman."
# Extract named entities from text (fast)
result = impresso.tools.ner(text)
result.df # View entities as DataFrame
# Extract and link entities to Wikidata (slower but more detailed)
result = impresso.tools.ner_nel(text)
result.df # Includes Wikidata links
# Link pre-tagged entities to external resources (requires [START] and [END] markers)
tagged_text = "[START] Jean-Baptiste Nicolas Robert Schuman [END] was a statesman."
impresso.tools.nel(tagged_text)
# Generate text embeddings for semantic search
text_embedding = impresso.tools.embed_text("European integration", target="text")
results = impresso.search.find(embedding=text_embedding, limit=5)
# Use in-corpus embedding for similar article search
first_item_id = results.df.index[0]
in_corpus_embedding = impresso.content_items.get_embeddings(first_item_id)[0]
impresso.search.find(embedding=in_corpus_embedding, limit=10)
# Generate image embeddings from URL
image_embedding = impresso.tools.embed_image("https://example.com/image.png", target="image")
impresso.images.find(embedding=image_embedding)
impresso.resources.tools.ToolsResource
Bases: Resource
Various helper tools for text processing and embedding generation.
Examples:
Extract named entities from text:
>>> entities = tools.ner("Napoleon visited Paris in 1815.")
>>> print(entities.df)
Extract and link entities to Wikidata:
>>> entities = tools.ner_nel("Napoleon visited Paris in 1815.")
>>> print(entities.df)
Generate text embedding for semantic search:
>>> embedding = tools.embed_text("military conflict", target="text")
>>> results = search.find(embedding=embedding)
Generate image embedding from file:
>>> embedding = tools.embed_image("path/to/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)
Generate image embedding from URL:
>>> embedding = tools.embed_image("https://example.com/image.jpg", target="image")
>>> similar_images = images.find(embedding=embedding)
embed_image(image, target)
Embed an image into a vector space.
embed_text(text, target)
Embed text into a vector space.
nel(text)
Named Entity Linking
This method requires named entities to be enclosed in tags: [START]entity[END].
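Marking an entity by hand is error-prone for longer texts. The sketch below wraps the first occurrence of a known entity string in the markers; `tag_entity` is a hypothetical helper, and the spaces around the entity follow the `nel` usage example in this section rather than a documented requirement.

```python
def tag_entity(text: str, entity: str) -> str:
    """Wrap the first occurrence of `entity` in [START] ... [END] markers."""
    start = text.find(entity)
    if start == -1:
        raise ValueError(f"{entity!r} not found in text")
    end = start + len(entity)
    return f"{text[:start]}[START] {entity} [END]{text[end:]}"


tagged = tag_entity(
    "Jean-Baptiste Nicolas Robert Schuman was a statesman.",
    "Jean-Baptiste Nicolas Robert Schuman",
)
# The result can then be passed to impresso.tools.nel(tagged).
```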
ner(text)
Named Entity Recognition
This method is faster than ner_nel but does not provide any linking to external resources.
ner_nel(text)
Named Entity Recognition and Named Entity Linking
This method is slower than ner but provides linking to external resources.
impresso.resources.tools.NerContainer
Bases: DataContainer
Named entity recognition result container.
df: DataFrame
property
Return the data as a pandas dataframe.
limit: int
property
Page size.
offset: int
property
Page offset.
size: int
property
Current page size.
total: int
property
Total number of results.
Text reuse
Two resources can be used to search text reuse clusters and passages.
# Find text reuse clusters
impresso.text_reuse_clusters.find(cluster_size=(10, 20))
# Get facets for clusters (e.g., newspaper distribution)
impresso.text_reuse_clusters.facet(facet='newspaper', order_by='count')
# Find text reuse passages
impresso.text_reuse_passages.find(term='revolution', country='FR')
# Get facets for passages
impresso.text_reuse_passages.facet(facet='newspaper')
impresso.resources.text_reuse.clusters.TextReuseClustersResource
Bases: Resource
Interact with the text reuse clusters endpoint of the Impresso API.
This resource allows searching for text reuse clusters based on various criteria and retrieving facet information about these clusters.
Examples:
Find clusters with size between 10 and 20:
>>> results = textReuseClusters.find(cluster_size=(10, 20))
>>> print(results.df)
Get the distribution of newspapers involved in clusters:
>>> facet_results = textReuseClusters.facet(facet='newspaper', order_by='count')
>>> print(facet_results.df)
facet(facet, order_by='value', limit=None, offset=None, cluster_size=None, date_range=None, newspaper_id=None, lexical_overlap=None, day_delta=None)
Get facet information for text reuse clusters based on specified filters.
Facets provide aggregated counts for different properties of the clusters, such as the distribution of cluster sizes or newspapers involved.
Examples:
Get the top 10 newspapers involved in clusters:
>>> facet_results = textReuseClusters.facet(facet='newspaper', limit=10, order_by='count')
>>> print(facet_results.df)
Get the distribution of cluster sizes for clusters within a specific date range:
>>> from impresso.structures import DateRange
>>> date_filter = DateRange(start="1900-01-01", end="1910-12-31")
>>> facet_results = textReuseClusters.facet(facet='cluster_size', date_range=date_filter)
>>> print(facet_results.df)
find(term=None, title=None, order_by=None, cluster_size=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, limit=None, offset=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)
Find text reuse clusters based on various criteria.
Examples:
Find clusters with size between 10 and 20:
>>> results = textReuseClusters.find(cluster_size=(10, 20))
>>> print(results.df)
Find clusters related to 'politics' in Swiss newspapers:
>>> results = textReuseClusters.find(term='politics', country='CH')
>>> print(results.df)
impresso.resources.text_reuse.passages.TextReusePassagesResource
Bases: Resource
Text reuse passages resource.
facet(facet, term=None, limit=None, offset=None, order_by=None, cluster_id=None, cluster_size=None, title=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)
Get facet information for text reuse passages based on specified filters.
Facets provide aggregated counts for different properties of the passages, such as the distribution of newspapers or languages.
Examples:
Get the top 10 newspapers associated with passages containing 'war':
>>> facet_results = textReusePassages.facet(facet='newspaper', term='war', limit=10)
>>> print(facet_results.df)
Get the language distribution for passages published between 1914 and 1918:
>>> from impresso.structures import DateRange
>>> date_filter = DateRange(start="1914-01-01", end="1918-12-31")
>>> facet_results = textReusePassages.facet(facet='language', date_range=date_filter)
>>> print(facet_results.df)
find(term=None, limit=None, offset=None, order_by=None, cluster_id=None, cluster_size=None, title=None, lexical_overlap=None, day_delta=None, date_range=None, newspaper_id=None, collection_id=None, front_page=None, topic_id=None, language=None, country=None, mention=None, entity_id=None)
Find text reuse passages based on various criteria.
Examples:
Find passages containing the term 'revolution' from French newspapers:
>>> results = textReusePassages.find(term='revolution', country='FR')
>>> print(results.df)
Find passages from clusters with a size greater than 50:
>>> results = textReusePassages.find(cluster_size=(51, None))
>>> print(results.df)
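The `cluster_size` tuple in these examples reads as a (minimum, maximum) pair in which `None` leaves that side open. This reading is inferred from the examples, not from a formal specification, and treating both ends as inclusive is an assumption. A small sketch of that interpretation, with a hypothetical `in_bounds` helper:

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[int], Optional[int]]


def in_bounds(value: int, bounds: Bounds) -> bool:
    """True if `value` lies within (minimum, maximum); None means no limit on that side.

    Assumption: both ends are inclusive.
    """
    low, high = bounds
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True
```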
Pagination
When you search for content items, entities, or other resources, the library returns only a limited subset of the results at a time. Results are divided into pages, and you request a specific page with the limit and offset parameters of your query. This improves performance and avoids transferring large amounts of data at once.
The limit parameter specifies the maximum number of items to return in a single page, and the offset parameter specifies the 0-based index of the first item to return. For example, with limit=10 and offset=20 the API returns items 20 through 29. When these parameters are not specified, the library uses default values: usually 0 for the offset and between 10 and 50 for the limit, depending on the resource.
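The arithmetic behind limit and offset can be written out as follows. `page_indices` is a hypothetical helper for illustration only; it uses 0-based indexing and clips the last page to the total number of items.

```python
def page_indices(limit: int, offset: int, total: int) -> range:
    """0-based indices of the items a page covers, clipped to `total`."""
    start = min(offset, total)
    stop = min(offset + limit, total)
    return range(start, stop)


page = page_indices(limit=10, offset=20, total=95)
first, last = page[0], page[-1]  # items 20 through 29
tail = page_indices(limit=10, offset=90, total=95)  # a shorter final page
```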
The response object, an instance of DataContainer, contains information about the pagination, such as the total number of items (total), the number of items in the current page (size), the limit, and the offset.
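These four values are enough to work out where a page sits in the result set. The sketch below uses a stand-in dataclass with the same property names instead of a real response; `PageInfo` and its methods are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Stand-in carrying the pagination properties of a response."""

    total: int
    size: int
    limit: int
    offset: int

    def page_count(self) -> int:
        """Number of pages needed to cover `total` items at this `limit`."""
        return math.ceil(self.total / self.limit) if self.limit > 0 else 0

    def has_next(self) -> bool:
        """True if items remain after the current page."""
        return self.offset + self.size < self.total


info = PageInfo(total=95, size=10, limit=10, offset=20)
last_page = PageInfo(total=95, size=5, limit=10, offset=90)
```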
When a find or facet method is called, the response object contains data for the first page, or for the page selected by the offset and limit parameters. To get further pages, use the pages method of the response object. It returns an iterator that yields the current page first and then a new DataContainer object for each subsequent page, until the end of the result set is reached.
For example, if you want to get all the content items that mention "Titanic" with 20 items per page, you can use the following code:
result = impresso.search.find(
    term="titanic",
    limit=20,
)
for page in result.pages():
    print(
        f"Got page {page.offset} - {page.offset + page.size} of {page.total}. "
        + f"The first title is {page.raw['data'][0]['title']}"
    )
impresso.data_container.DataContainer
Bases: Generic[IT, T]
Generic container for responses from the Impresso API
returned by resource methods (get, find).
Generally represents a single page of the result. The results can be
paginated through by adjusting the offset and limit parameters
in the corresponding resource method call (e.g., client.newspapers.find).
The total, limit, offset, and size properties provide information
about the current page and the overall result set.
df: DataFrame
property
The response data for the current page as a pandas dataframe.
Note that this DataFrame only contains the items from the current page of results, not the entire result set across all pages.
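To work with the whole result set, the items of every page have to be gathered explicitly. A minimal sketch, assuming each page exposes its items under `raw['data']` as in the pagination example; `collect_items` is a hypothetical helper and the pages here are stand-ins, not API responses.

```python
from types import SimpleNamespace


def collect_items(pages):
    """Concatenate the `raw['data']` lists of an iterable of pages."""
    items = []
    for page in pages:
        items.extend(page.raw.get("data", []))
    return items


# Stand-in pages; with the library one would pass `result.pages()` instead.
fake_pages = [
    SimpleNamespace(raw={"data": [{"title": "A"}, {"title": "B"}]}),
    SimpleNamespace(raw={"data": [{"title": "C"}]}),
]
all_items = collect_items(fake_pages)
```

Each page costs one request, so for large result sets it is usually better to narrow the query first than to collect everything.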
limit: int
property
Maximum number of items requested for the current page.
offset: int
property
The starting index (0-based) of the items on the current page.
pydantic: T
property
The response data as a pydantic model.
raw: dict[str, Any]
property
The response data as a python dictionary.
size: int
property
Number of items actually present on the current page.
total: int
property
Total number of results available across all pages.
url: str | None
property
URL of an Impresso web application page representing the result set.
pages()
Yields the current page and all subsequent pages of results.
This method first yields the current DataContainer instance (self), then attempts to fetch and yield subsequent pages by making new API calls with adjusted offsets.
Example:
# Get the first page with 10 items per page
first_page = client.newspapers.find(limit=10)
# Iterate through all pages
for page in first_page.pages():
    # Process items from the current page
    print(f"Page {page.offset // page.limit + 1}:")
    print(page.df)
    # The loop will continue with the next page, if any
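The yield order described for pages() can be mimicked with an in-memory stand-in. `FakeContainer` below is hypothetical and only imitates the documented behaviour (current page first, then later pages at increasing offsets); it is not the library's implementation and makes no API calls.

```python
class FakeContainer:
    """In-memory stand-in for a paginated response over a list of items."""

    def __init__(self, items, limit, offset=0):
        self._items = items
        self.limit = limit
        self.offset = offset

    @property
    def total(self):
        return len(self._items)

    @property
    def data(self):
        return self._items[self.offset:self.offset + self.limit]

    @property
    def size(self):
        return len(self.data)

    def pages(self):
        """Yield this page, then each following page until the items run out."""
        yield self
        next_offset = self.offset + self.limit
        while next_offset < self.total:
            yield FakeContainer(self._items, self.limit, next_offset)
            next_offset += self.limit


first_page = FakeContainer(list(range(25)), limit=10)
offsets = [page.offset for page in first_page.pages()]
sizes = [page.size for page in first_page.pages()]
```

Note that the final page is shorter than the limit, which is why `size` and `limit` are reported separately.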