CELLxGENE: scRNA-seq

CZ CELLxGENE hosts the world’s largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).

You can use the CELLxGENE data in two ways:

  1. Query collections of AnnData objects (this page).

  2. Query a large array store built by concatenating AnnData objects via tiledbsoma (see the tiledbsoma guide).

If you are interested in building similar data assets in-house:

  1. See the transfer guide to zero-copy data to your own LaminDB instance.

  2. See the scRNA guide for how to create a growing, versioned, queryable scRNA-seq dataset.

  3. See the Curate guide for validating, curating and registering your own AnnData objects.

Load the public LaminDB instance that mirrors cellxgene:

# !pip install 'lamindb[bionty,jupyter]'
!lamin load laminlabs/cellxgene
! Full backed capabilities are not available for this version of anndata, please install anndata>=0.9.1.
→ connected lamindb: laminlabs/cellxgene
import lamindb as ln
import bionty as bt
→ connected lamindb: laminlabs/cellxgene
! Full backed capabilities are not available for this version of anndata, please install anndata>=0.9.1.

Query & understand metadata

Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, source_id=48, updated_at='2023-11-28 22:30:57 UTC')
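
To make the idea of a look-up object concrete, here is a minimal, library-independent sketch. It is not LaminDB's implementation: `to_attr` and `make_lookup` are hypothetical helpers, and the normalization rule (lowercase, non-alphanumeric characters become underscores, an `ln_` prefix for names that start with a digit) is an assumption inferred from attribute names used on this page, such as `effector_t_cell` and `ln_10x_3_v2`.

```python
import re
from types import SimpleNamespace


def to_attr(name: str) -> str:
    """Turn a record name into a valid Python attribute name (assumed rule)."""
    attr = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # Identifiers cannot start with a digit, so add a prefix in that case.
    return f"ln_{attr}" if attr[:1].isdigit() else attr


def make_lookup(names):
    """Build a namespace whose attributes can be tab-completed in a notebook."""
    return SimpleNamespace(**{to_attr(n): n for n in names})


lookup = make_lookup(["effector T cell", "B cell", "10x 3' v2"])
print(lookup.effector_t_cell)
print(lookup.ln_10x_3_v2)
```

Because every record becomes an attribute, a typo surfaces immediately as an `AttributeError` rather than as an empty query result.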

You can also arbitrarily chain filters and create lookups from them:

users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(name="is_suspension_type").one().children.lookup()  # suspension types

Search & filter metadata

We can use search & filters for metadata:

bt.CellType.search("effector T cell").df().head()
id  uid  name  ontology_id  abbr  synonyms  description  source_id  run_id  created_by_id  updated_at
1623  3nfZTVV4  effector T cell  CL:0000911  None  effector T-cell|effector T-lymphocyte|effector...  A Differentiated T Cell With Ability To Traffi...  48  NaN  1  2023-11-28 22:30:57.481778+00:00
1229  69TEBGqb  exhausted T cell  CL:0011025  None  Tex cell|An effector T cell that displays impa...  None  48  NaN  1  2023-11-28 22:27:55.572884+00:00
1331  43cBCa7s  helper T cell  CL:0000912  None  helper T-lymphocyte|T-helper cell|helper T lym...  A Effector T Cell That Provides Help In The Fo...  48  NaN  1  2023-11-28 22:27:55.575955+00:00
1169  6JD5JCZC  CD8-positive, alpha-beta cytokine secreting ef...  CL:0000908  None  CD8-positive, alpha-beta cytokine secreting ef...  A Cd8-Positive, Alpha-Beta T Cell With The Phe...  48  NaN  1  2023-11-28 22:27:55.571576+00:00
1503  1oa5G2Mq  memory T cell  CL:0000813  None  memory T-cell|memory T lymphocyte|memory T-lym...  A Long-Lived, Antigen-Experienced T Cell That ...  48  NaN  1  2023-11-28 22:27:55.580290+00:00

Use a uid to get exactly one metadata record:

effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, source_id=48, updated_at='2023-11-28 22:30:57 UTC')

Understand ontologies

View the related ontology terms:

effector_t_cell.view_parents(distance=2, with_children=True)
[Figure: graph of the ontology terms related to effector T cell (parents up to distance 2, with children)]

Or access them programmatically:

effector_t_cell.children.df()
id  uid  name  ontology_id  abbr  synonyms  description  source_id  run_id  created_by_id  updated_at
931  2VQirdSp  effector CD8-positive, alpha-beta T cell  CL:0001050  None  effector CD8-positive, alpha-beta T lymphocyte...  A Cd8-Positive, Alpha-Beta T Cell With The Phe...  48  None  1  2023-11-28 22:27:55.565981+00:00
1088  490Xhb24  effector CD4-positive, alpha-beta T cell  CL:0001044  None  effector CD4-positive, alpha-beta T lymphocyte...  A Cd4-Positive, Alpha-Beta T Cell With The Phe...  48  None  1  2023-11-28 22:27:55.569832+00:00
1229  69TEBGqb  exhausted T cell  CL:0011025  None  Tex cell|An effector T cell that displays impa...  None  48  None  1  2023-11-28 22:27:55.572884+00:00
1309  5s4gCMdn  cytotoxic T cell  CL:0000910  None  cytotoxic T lymphocyte|cytotoxic T-lymphocyte|...  A Mature T Cell That Differentiated And Acquir...  48  None  1  2023-11-28 22:27:55.575444+00:00
1331  43cBCa7s  helper T cell  CL:0000912  None  helper T-lymphocyte|T-helper cell|helper T lym...  A Effector T Cell That Provides Help In The Fo...  48  None  1  2023-11-28 22:27:55.575955+00:00
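
The hierarchy behind `view_parents` and `children` can be pictured as a directed graph of terms. The sketch below is a toy model in plain Python, not bionty's implementation: the `CHILDREN` dictionary holds only the five child terms listed in the table above plus one parent (`T cell`) that is assumed here for illustration, and `descendants` is a hypothetical helper.

```python
from collections import deque

# Toy hierarchy: the children of "effector T cell" are taken from the table
# above; the parent "T cell" is an assumption added for illustration.
CHILDREN = {
    "T cell": ["effector T cell"],
    "effector T cell": [
        "effector CD8-positive, alpha-beta T cell",
        "effector CD4-positive, alpha-beta T cell",
        "exhausted T cell",
        "cytotoxic T cell",
        "helper T cell",
    ],
}


def descendants(term, distance):
    """Return all terms reachable from `term` within `distance` child steps."""
    found, queue = [], deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for child in CHILDREN.get(current, []):
            if child not in found:
                found.append(child)
                queue.append((child, depth + 1))
    return found


print(descendants("T cell", distance=1))
print(len(descendants("T cell", distance=2)))
```

The `distance` argument plays the same role as in `view_parents(distance=2, ...)`: it bounds how many edges away from the starting term the traversal may go.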

Query artifacts

Unlike in the tiledbsoma guide, here, we’ll query sets of .h5ad files, which correspond to AnnData objects.

To see what you can query for, simply look at the registry representation:

ln.Artifact
Artifact
  Simple fields
    .uid: CharField
    .description: CharField
    .key: CharField
    .suffix: CharField
    .type: CharField
    .size: BigIntegerField
    .hash: CharField
    .n_objects: BigIntegerField
    .n_observations: BigIntegerField
    .visibility: SmallIntegerField
    .version: CharField
    .is_latest: BooleanField
    .created_at: DateTimeField
    .updated_at: DateTimeField
  Relational fields
    .created_by: User
    .storage: Storage
    .transform: Transform
    .run: Run
    .ulabels: ULabel
    .input_of_runs: Run
    .feature_sets: FeatureSet
    .collections: Collection
  Bionty fields
    .organisms: bionty.Organism
    .genes: bionty.Gene
    .proteins: bionty.Protein
    .cell_markers: bionty.CellMarker
    .tissues: bionty.Tissue
    .cell_types: bionty.CellType
    .diseases: bionty.Disease
    .cell_lines: bionty.CellLine
    .phenotypes: bionty.Phenotype
    .pathways: bionty.Pathway
    .experimental_factors: bionty.ExperimentalFactor
    .developmental_stages: bionty.DevelopmentalStage
    .ethnicities: bionty.Ethnicity

Here is an example query that filters by strings:

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__name__in=["B cell", "T cell"],  # cell types measured in AnnData
    created_by__handle="sunnyosun"  # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
cell_types__name created_by__handle uid version is_latest description key suffix type size ... n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
879 [conventional dendritic cell, classical monocy... sunnyosun BCutg5cxmqLmy2Z5SS8J 2023-07-25 False Type I interferon autoantibodies are associate... cell-census/2023-07-25/h5ads/01ad3cd7-3929-465... .h5ad None 6353682597 ... 600929 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:14:10.959155+00:00
1106 [immature B cell, monocyte, naive thymus-deriv... sunnyosun 3xdOASXuAxxJtSchJO3D 2023-07-25 False HSC/immune cells (all hematopoietic-derived ce... cell-census/2023-07-25/h5ads/48101fa2-1a63-451... .h5ad None 6214230662 ... 589390 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:11:10.324135+00:00
1174 [monocyte, conventional dendritic cell, plasma... sunnyosun wt7eD72sTzwL3rfYaZr2 2023-07-25 False A scRNA-seq atlas of immune cells at the CNS b... cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0... .h5ad None 1052158249 ... 130908 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:09:45.364255+00:00
1377 [monocyte, ciliated cell, macrophage, natural ... sunnyosun znTBqWgfYgFlLjdQ6Ba7 2023-07-25 False Large-scale single-cell analysis reveals criti... cell-census/2023-07-25/h5ads/9dbab10c-118d-496... .h5ad None 13929140098 ... 1462702 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:14:24.084706+00:00
1482 [effector CD4-positive, alpha-beta T cell, con... sunnyosun dEP0dZ8UxLgwnkLjz6Iq 2023-07-25 False Single-cell sequencing links multiregional imm... cell-census/2023-07-25/h5ads/bd65a70f-b274-413... .h5ad None 1204103287 ... 167283 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:05:49.602044+00:00

5 rows × 22 columns

What happens under the hood?

As you saw from inspecting ln.Artifact, ln.Artifact.cell_types relates artifacts with bt.CellType.

The expression cell_types__name__in performs the join of the underlying registries and matches bt.CellType.name to ["B cell", "T cell"].

Similarly, created_by relates artifacts with ln.User, and created_by__handle matches ln.User.handle to "sunnyosun".
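
As a rough picture of such a join, the following sketch rebuilds a similar query in plain SQL using Python's built-in `sqlite3`. The table and column names (`artifact`, `celltype`, `artifact_celltype`) and the rows are invented for illustration; LaminDB's actual schema and generated SQL will differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE artifact (id INTEGER PRIMARY KEY, description TEXT, size INTEGER);
    CREATE TABLE celltype (id INTEGER PRIMARY KEY, name TEXT);
    -- link table for the many-to-many relation artifact <-> cell type
    CREATE TABLE artifact_celltype (artifact_id INTEGER, celltype_id INTEGER);
    INSERT INTO artifact VALUES
        (1, 'immune atlas', 2000000000),
        (2, 'immune pilot', 5000000),
        (3, 'retina atlas', 3000000000);
    INSERT INTO celltype VALUES (1, 'B cell'), (2, 'T cell'), (3, 'rod cell');
    INSERT INTO artifact_celltype VALUES (1, 1), (1, 2), (2, 1), (3, 3);
    """
)

# Rough SQL counterpart of description__contains, size__gt and
# cell_types__name__in; DISTINCT drops the duplicate row that the join
# yields when an artifact matches more than one of the listed cell types.
rows = con.execute(
    """
    SELECT DISTINCT a.id, a.description
    FROM artifact AS a
    JOIN artifact_celltype AS ac ON ac.artifact_id = a.id
    JOIN celltype AS c ON c.id = ac.celltype_id
    WHERE a.description LIKE '%immune%'
      AND a.size > 1e9
      AND c.name IN ('B cell', 'T cell')
    ORDER BY a.id
    """
).fetchall()
print(rows)
```

In this toy data, only artifact 1 passes all three conditions. Without `DISTINCT` it would be returned twice, once per matching cell type.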

Queries by string are prone to typos. Let’s query with auto-completed records instead.

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[cell_types.b_cell, cell_types.t_cell],  # cell types measured in AnnData
    created_by=users.sunnyosun   # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
cell_types__name created_by__handle uid version is_latest description key suffix type size ... n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
879 [conventional dendritic cell, classical monocy... sunnyosun BCutg5cxmqLmy2Z5SS8J 2023-07-25 False Type I interferon autoantibodies are associate... cell-census/2023-07-25/h5ads/01ad3cd7-3929-465... .h5ad None 6353682597 ... 600929 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:14:10.959155+00:00
1106 [immature B cell, monocyte, naive thymus-deriv... sunnyosun 3xdOASXuAxxJtSchJO3D 2023-07-25 False HSC/immune cells (all hematopoietic-derived ce... cell-census/2023-07-25/h5ads/48101fa2-1a63-451... .h5ad None 6214230662 ... 589390 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:11:10.324135+00:00
1174 [monocyte, conventional dendritic cell, plasma... sunnyosun wt7eD72sTzwL3rfYaZr2 2023-07-25 False A scRNA-seq atlas of immune cells at the CNS b... cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0... .h5ad None 1052158249 ... 130908 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:09:45.364255+00:00
1377 [monocyte, ciliated cell, macrophage, natural ... sunnyosun znTBqWgfYgFlLjdQ6Ba7 2023-07-25 False Large-scale single-cell analysis reveals criti... cell-census/2023-07-25/h5ads/9dbab10c-118d-496... .h5ad None 13929140098 ... 1462702 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:14:24.084706+00:00
1482 [effector CD4-positive, alpha-beta T cell, con... sunnyosun dEP0dZ8UxLgwnkLjz6Iq 2023-07-25 False Single-cell sequencing links multiregional imm... cell-census/2023-07-25/h5ads/bd65a70f-b274-413... .h5ad None 1204103287 ... 167283 md5-n AnnData 1 False 2 11 16 1 2024-01-24 07:05:49.602044+00:00

5 rows × 22 columns

Query collections

Often, you work with collections of artifacts, which Collection helps you manage.

Let’s look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts:

collection = ln.Collection.filter(name="cellxgene-census", version="2024-07-01").one()
collection
Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=True, name='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC')

You can count all contained artifacts or get them as a dataframe.

collection.artifacts.count()
812
collection.artifacts.df().head()  # not tracking run & transform because read-only instance
! no run & transform get linked, consider calling ln.context.track()
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
3042 GcVBvpW5MYlrsH1izOjN 2024-07-01 True All cells cell-census/2024-07-01/h5ads/3dc61ca1-ce40-46b... .h5ad dataset 947738392 NDhyYVxRpOG6UiEkDZKswg None 71752 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:43.667567+00:00
3587 1AeEHLQzGyRZL5nwpffu 2024-07-01 True wilms cell-census/2024-07-01/h5ads/ea01c125-67a7-4bd... .h5ad dataset 75413467 TNsJMqhUOekqUh4qtxvccA None 4636 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:48.218901+00:00
2850 vEw6vGy47Zi0Qj6TG6l7 2024-07-01 True Tabula Sapiens - Skin cell-census/2024-07-01/h5ads/0041b9c3-6a49-4bf... .h5ad dataset 199210144 sV0vZMpxZsTXIb6qqCg8ng None 9424 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:44.720154+00:00
3230 tggrprv4cllqGOrH8RlL 2024-07-01 True Dissection: Amygdaloid complex (AMY) - Basolat... cell-census/2024-07-01/h5ads/7d3ab174-e433-40f... .h5ad dataset 330480233 eS_gAyJD_P0oLd6IHEsPJQ None 28984 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:46.355994+00:00
3309 RCzyhZz9tfi6YI4F7mxb 2024-07-01 True Single cell RNA sequencing of follicular lymphoma cell-census/2024-07-01/h5ads/99950e99-2758-41d... .h5ad dataset 749041844 FaUU0Z0Uk6w2oewwJq8zZg None 137147 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:41.753173+00:00

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
2961 WwmBIhBNLTlRcSoBDt76 2024-07-01 True Mature kidney dataset: immune cell-census/2024-07-01/h5ads/20d87640-4be8-487... .h5ad dataset 45158726 GCMHkdQSTeXxRVF7gMZFIA None 7803 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:43.756335+00:00
2961 WwmBIhBNLTlRcSoBDt76 2024-07-01 True Mature kidney dataset: immune cell-census/2024-07-01/h5ads/20d87640-4be8-487... .h5ad dataset 45158726 GCMHkdQSTeXxRVF7gMZFIA None 7803 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:43.756335+00:00
3000 gHlQ5Muwu3G9pvFCx3x8 2024-07-01 True Fetal kidney dataset: immune cell-census/2024-07-01/h5ads/2d31c0ca-0233-41c... .h5ad dataset 64546349 2qy8uy-65Sd_XcBU-nrPgA None 6847 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:45.273783+00:00
3324 P4Oai3OLGAzRwoicHfLM 2024-07-01 True Mature kidney dataset: full cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b... .h5ad dataset 194047623 aZVpGZwAfMCziff_5ow2bg None 40268 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:44.478948+00:00
3324 P4Oai3OLGAzRwoicHfLM 2024-07-01 True Mature kidney dataset: full cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b... .h5ad dataset 194047623 aZVpGZwAfMCziff_5ow2bg None 40268 md5-n AnnData 1 False 2 22 27 1 2024-07-12 12:40:44.478948+00:00

Query arrays

Note

Here, we discuss slicing individual AnnData arrays. If you want to slice a large concatenated array store, see the tiledbsoma guide.

In the query above, each artifact stores an array in the form of an .h5ad file, which corresponds to an AnnData object.

Let’s look at the first array in the query and show its metadata using .describe().

artifact = query.first()
artifact.describe()
Artifact(uid='WwmBIhBNLTlRcSoBDt76', version='2024-07-01', is_latest=True, description='Mature kidney dataset: immune', key='cell-census/2024-07-01/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad', suffix='.h5ad', type='dataset', size=45158726, hash='GCMHkdQSTeXxRVF7gMZFIA', n_observations=7803, _hash_type='md5-n', _accessor='AnnData', visibility=1, _key_is_virtual=False, updated_at='2024-07-12 12:40:43 UTC')
  Provenance
    .created_by = 'sunnyosun'
    .storage = 's3://cellxgene-data-public'
    .transform = 'Census release 2024-07-01 (LTS)'
    .run = '2024-07-16 12:49:41 UTC'
  Labels
    .organisms = 'human'
    .tissues = 'cortex of kidney', 'renal medulla', 'kidney', 'kidney blood vessel', 'renal pelvis'
    .cell_types = 'classical monocyte', 'plasmacytoid dendritic cell', 'natural killer cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'mast cell', 'neutrophil', 'non-classical monocyte', 'CD8-positive, alpha-beta T cell', 'B cell', ...
    .diseases = 'normal'
    .phenotypes = 'male', 'female'
    .experimental_factors = '10x 3' v2'
    .developmental_stages = '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage', ...
    .ethnicities = 'unknown'
    .ulabels = 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3', ...
  Features
    'donor_id' = 'Wilms3', 'TTx', 'pRCC', 'VHL', 'RCC3', 'TxK1', 'TxK4', 'TxK3', 'RCC2', 'Wilms2', ...
    'organism' = 'human'
    'suspension_type' = 'cell'
  Feature sets
    'obs' = 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'tissue', 'organism', 'tissue_type', 'suspension_type'
    'var' = 'None', 'EBF1', 'LINC02202', 'RNF145', 'LINC01932', 'UBLCP1', 'IL12B', 'LINC01845', 'LINC01847', 'ADRA1B', 'TTC1', 'PWWP2A', 'FABP6', 'FABP6-AS1', 'CCNJL', 'C1QTNF2'

More ways of accessing metadata

Access just features:

artifact.features

Or get labels given a feature:

features = ln.Feature.lookup()  # look-up object for features, needed below
artifact.labels.get(features.tissue).df()
artifact.labels.get(features.collection).one()

If you want to query a slice of the array data, you have two options:

  1. Cache & load the entire array into memory via artifact.load() -> AnnData (caches the h5ad on disk, so that you only download once)

  2. Stream the array using a (cloud-backed) accessor artifact.open() -> AnnDataAccessor

Both options will run much faster if you run them close to the data (AWS S3 on the US West Coast, consider logging into hosted compute there).
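
The caching behaviour of the first option can be sketched without any LaminDB code. In the snippet below, `cached_fetch` and `fake_download` are hypothetical stand-ins (not part of the LaminDB API): the file is fetched on the first call and served from the local cache directory afterwards.

```python
import tempfile
from pathlib import Path

download_count = 0


def fake_download(key: str) -> bytes:
    """Stand-in for a slow remote read, e.g. from object storage."""
    global download_count
    download_count += 1
    return f"contents of {key}".encode()


def cached_fetch(key: str, cache_dir: Path) -> Path:
    """Return a local path for `key`, downloading only if it is not cached."""
    local_path = cache_dir / key
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(fake_download(key))
    return local_path


cache_dir = Path(tempfile.mkdtemp())
first = cached_fetch("h5ads/example.h5ad", cache_dir)
second = cached_fetch("h5ads/example.h5ad", cache_dir)
print(first == second, download_count)
```

The second call finds the file on disk and skips the download, which is why repeated loads of the same artifact are cheap once the first one has completed.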

Cache & load:

adata = artifact.load()
adata
AnnData object with n_obs × n_vars = 7803 × 32839
    obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Now we have an AnnData object, which stores observation annotations matching our artifact-level query in the .obs slot, and we can re-use almost the same query on the array-level.

See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.

Stream:

adata_backed = artifact.open()
adata_backed
AnnDataAccessor object with n_obs × n_vars = 7803 × 32839
  constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
    obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'observation_joinid', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id', 'tissue_type']
    obsm: ['X_umap']
    raw: ['X', 'var', 'varm']
    uns: ['citation', 'default_embedding', 'schema_reference', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_length', 'feature_name', 'feature_reference']

We now have an AnnDataAccessor object, which behaves much like an AnnData, and the query looks the same.

See the query
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]

adata_backed_slice.to_memory()
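
Conceptually, a backed accessor defers reading until the data is actually needed. The following toy class illustrates that idea with a plain text file; `LazyRows` is a hypothetical stand-in and says nothing about how `AnnDataAccessor` is implemented.

```python
import tempfile
from pathlib import Path


class LazyRows:
    """Toy backed accessor: subsetting records row numbers, reading is deferred."""

    def __init__(self, path, rows=None):
        self.path = Path(path)
        self.rows = rows  # None means "all rows"

    def __getitem__(self, mask):
        # Keep only rows whose mask entry is True; no data is read here.
        # For simplicity, subsetting is only supported on the full object.
        selected = [i for i, keep in enumerate(mask) if keep]
        return LazyRows(self.path, selected)

    def to_memory(self):
        # Only now is the file opened and the selected rows materialized.
        wanted = None if self.rows is None else set(self.rows)
        with self.path.open() as f:
            return [
                line.rstrip("\n")
                for i, line in enumerate(f)
                if wanted is None or i in wanted
            ]


path = Path(tempfile.mkdtemp()) / "cells.txt"
path.write_text("dendritic cell\nB cell\nneutrophil\nT cell\n")

backed = LazyRows(path)
subset = backed[[True, False, True, False]]  # cheap: stores indices 0 and 2
print(subset.to_memory())
```

As in the query above, the boolean mask only selects which rows are wanted; the call to `to_memory()` is the step that reads them.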

Train ML models

You can directly train ML models on very large collections of AnnData objects.

See Train a machine learning model on a collection.

Exploring data by collection

Alternatively, you can explore the data collection by collection.

Let’s search the collections from CELLxGENE within the 2024-07-01 release:

ln.Collection.filter(version="2024-07-01").search("immune human kidney", limit=10)
<QuerySet [
  Collection(uid='GjITgEjsQaa1X5i7T4Ny', version='2024-07-01', is_latest=True, name='A spatially resolved atlas of the human lung characterizes a gland-associated immune niche', description='10.1038/s41588-022-01243-4', hash='KhrJyHkwPYRqNsHBTh-K', reference='c1241244-b22d-483d-875b-75699efb9f3c', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:19:26 UTC'),
  Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=True, name='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'),
  Collection(uid='kAcitlx0g6C2lgacOCAS', version='2024-07-01', is_latest=True, name='Human breast cell atlas', description='10.1038/s41588-024-01688-9', hash='wXMzOvp8a-_nGgkwfjSM', reference='48259aa8-f168-4bf5-b797-af8e88da6637', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'),
  Collection(uid='0dOlh07SLGzVeaRMvSGQ', version='2024-07-01', is_latest=True, name='Longitudinal profiling of respiratory and systemic immune responses reveals myeloid cell-driven lung inflammation in severe COVID-19', description='10.1016/j.immuni.2021.03.005', hash='Xof1dryDzC0aYzt-Lkdk', reference='29f92179-ca10-4309-a32b-d383d80347c1', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'),
  Collection(uid='oyTLFK0Zl0alAmITAKYO', version='2024-07-01', is_latest=True, name='Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns', description='10.1016/j.cell.2017.09.004', hash='udNTlHGDCBxWd45dRMPU', reference='a238e9fa-2bdf-41df-8522-69046f99baff', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'),
  Collection(uid='XGeEFfpeKAYMtQlnJAaY', version='2024-07-01', is_latest=True, name='Multi-scale spatial mapping of cell populations across anatomical sites in healthy human skin and basal cell carcinoma', description='10.1073/pnas.2313326120', hash='SR4yp3Hfk5B3SrqRoNXN', reference='34f12de7-c5e5-4813-a136-832677f98ac8', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:17:41 UTC'),
  Collection(uid='tZYmzwfh0bIYzKBQVuro', version='2024-07-01', is_latest=True, name='Cell Types of the Human Retina and Its Organoids at Single-Cell Resolution', description='10.1016/j.cell.2020.08.013', hash='nGcCV4HJONcma2SExXw2', reference='2f4c738f-e2f3-4553-9db2-0582a38ea4dc', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'),
  Collection(uid='kDJ9Xb8d11d93LAHMJpf', version='2024-07-01', is_latest=True, name='Human Brain Cell Atlas v1.0', description='10.1126/science.add7046', hash='pD7t82V30Qg-8Nbm52qI', reference='283d65eb-dd53-496d-adb7-7570c7caa443', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'),
  Collection(uid='VVsweEynenmLLY85PaXn', version='2024-07-01', is_latest=True, name='Distinct microbial and immune niches of the human colon', description='10.1038/s41590-020-0602-z', hash='TPN8WENkCiWNJnzthA34', reference='7681c7d7-0168-4892-a547-6f02a6430ace', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'),
  Collection(uid='8Rn5Tl3rykhNieH79Jhx', version='2024-07-01', is_latest=True, name='LungMAP — Human data from a broad age healthy donor group', description='10.7554/eLife.62522', hash='cv-vdnfWnbj9DmDiIViR', reference='625f6bf4-2f33-4942-962e-35243d284837', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC')
]>

Let’s get the record of one collection by its uid (here, the 2023-12-15 version of a human kidney collection):

collection = ln.Collection.get("kqiPjpzpK9H9rdtnV67f")
collection
Collection(uid='kqiPjpzpK9H9rdtnV67f', version='2023-12-15', is_latest=False, name='Spatiotemporal immune zonation of the human kidney', description='10.1126/science.aat5031', hash='4wGcXeeqsjVdbRdU7ZuJ', reference='120e86b4-1195-48c5-845b-b98054105eec', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:33 UTC')

We see it’s a Science paper and we could find more information using the DOI or CELLxGENE collection id.

Check different versions of this collection:

collection.versions.df()
uid version is_latest name description hash reference reference_type visibility transform_id meta_artifact_id run_id created_by_id updated_at
id
17 kqiPjpzpK9H9rdtnHWas 2023-07-25 False Spatiotemporal immune zonation of the human ki... 10.1126/science.aat5031 w_VZE7n841ktaA9FjdLh 120e86b4-1195-48c5-845b-b98054105eec CELLxGENE Collection ID 1 NaN None NaN 1 2024-01-08 12:01:20.121095+00:00
365 kqiPjpzpK9H9rdtnV67f 2023-12-15 False Spatiotemporal immune zonation of the human ki... 10.1126/science.aat5031 4wGcXeeqsjVdbRdU7ZuJ 120e86b4-1195-48c5-845b-b98054105eec CELLxGENE Collection ID 1 17.0 None 22.0 1 2024-01-29 07:54:33.854515+00:00
595 kqiPjpzpK9H9rdtnCt1o 2024-07-01 True Spatiotemporal immune zonation of the human ki... 10.1126/science.aat5031 I6mGKs5YVdoOJwMdRfj_ 120e86b4-1195-48c5-845b-b98054105eec CELLxGENE Collection ID 1 22.0 None 27.0 1 2024-07-16 12:24:39.167691+00:00
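
In the table above, the three uids share their first 16 characters and differ only in the last four. Assuming this pattern holds in general (it is an observation from this output, not a documented guarantee), records can be grouped into version families by that prefix. `version_family` is a hypothetical helper, not a LaminDB function.

```python
from collections import defaultdict

# (uid, version) pairs copied from the outputs on this page.
records = [
    ("kqiPjpzpK9H9rdtnHWas", "2023-07-25"),
    ("kqiPjpzpK9H9rdtnV67f", "2023-12-15"),
    ("kqiPjpzpK9H9rdtnCt1o", "2024-07-01"),
    ("dMyEX3NTfKOEYXyMKDD7", "2024-07-01"),
]


def version_family(uid: str, stem_length: int = 16) -> str:
    """Return the assumed version-independent prefix of a uid."""
    return uid[:stem_length]


families = defaultdict(list)
for uid, version in records:
    families[version_family(uid)].append(version)

for stem, versions in families.items():
    print(stem, sorted(versions))
```

Within LaminDB itself, `collection.versions` is the way to list versions; the grouping here only helps when reading uids in exported tables or logs.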

Each collection has at least one artifact associated with it. Let’s get the associated artifacts:

collection.artifacts.df()
! no run & transform get linked, consider calling ln.context.track()
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
1778 b2x19Eg28GGSNnXW1hAD 2023-12-15 False Fetal kidney dataset: nephron cell-census/2023-12-15/h5ads/08073b32-d389-41f... .h5ad None 159545411 _JE59jFHDrOn0hj4i1yXSQ None 10790 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:46:06.497662+00:00
1880 WwmBIhBNLTlRcSoBpatT 2023-12-15 False Mature kidney dataset: immune cell-census/2023-12-15/h5ads/20d87640-4be8-487... .h5ad None 44647761 hSLF-GPhLXaC2tVIOJEdXA None 7803 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:46:33.152678+00:00
1930 gHlQ5Muwu3G9pvFC7egT 2023-12-15 False Fetal kidney dataset: immune cell-census/2023-12-15/h5ads/2d31c0ca-0233-41c... .h5ad None 64056560 jENeQIq0JdoHl5PyfY-sjA None 6847 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:46:37.205210+00:00
1944 USUgRVwrCMquHiImhk5D 2023-12-15 False Mature kidney dataset: non PT parenchyma cell-census/2023-12-15/h5ads/2fc9c59f-3cfd-48d... .h5ad None 39294782 3l5iNnBmPFbYfR3-THYWNQ None 4620 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:46:52.173865+00:00
2405 P4Oai3OLGAzRwoicaxCB 2023-12-15 False Mature kidney dataset: full cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b... .h5ad None 192484358 yghldeu2bOC5jtvnqZH8Og None 40268 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:49:11.905786+00:00
2570 6mnZ3SeQFhffr3wTdZZb 2023-12-15 False Fetal kidney dataset: stroma cell-census/2023-12-15/h5ads/c52de62a-058d-4d7... .h5ad None 109942751 s24Q5-FNUNQPLZw9BuwOVg None 8345 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:50:01.866851+00:00
2652 11HQaMeIUaOwyHoOWVvA 2023-12-15 False Fetal kidney dataset: full cell-census/2023-12-15/h5ads/d7dcfd8f-2ee7-438... .h5ad None 341214674 2mnG5TiEpj0Wr5L19TTFRw None 27197 md5-n AnnData 1 False 2 16 22 1 2024-01-29 07:50:28.610568+00:00