Register h5ad files of cellxgene-census
Setup
# !lamin init --storage s3://lamindata --name cellxgene-census --schema bionty
# !lamin close
2023-09-19 14:02:40,119:INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-09-19 14:02:40,903:INFO - Found credentials in shared credentials file: ~/.aws/credentials
❗ storage exists already
✅ registered instance on hub: https://lamin.ai/sunnyosun/cellxgene-census
✅ saved: User(id='kmvZDIX9', handle='sunnyosun', email='xs338@nyu.edu', name='Sunny Sun', updated_at=2023-09-19 12:02:50)
✅ saved: Storage(id='B4O1DDsR', root='s3://lamindata', type='s3', region='us-east-1', updated_at=2023-09-19 12:02:50, created_by_id='kmvZDIX9')
💡 loaded instance: sunnyosun/cellxgene-census
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)
!lamin load laminlabs/cellxgene-census
💡 loaded instance: laminlabs/cellxgene-census
import lamindb as ln
import lnschema_bionty as lb
import cellxgene_census
💡 lamindb instance: laminlabs/cellxgene-census
ln.context.track()
💡 notebook imports: cellxgene_census lamindb==0.54.4 lnschema_bionty==0.31.2
💡 Transform(uid='nhGTqlIHEyn7z8', name='Register h5ad files of cellxgene-census', short_name='files', version='0', type='notebook', reference='https://cellxgene-census-lamin-c192.netlify.app/notebooks/files', reference_type='cellxgene-census-lamin', updated_at=2023-10-16 15:04:08, latest_report_id=852, source_file_id=851, created_by_id=1)
💡 Run(uid='u6hWPhTQXwCTlNSi8Iaj', run_at=2023-10-24 15:48:23, transform_id=1, created_by_id=2)
census_version = "2023-07-25"
Register datasets
census = cellxgene_census.open_soma(census_version=census_version)
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
census
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/' (open for 'r') (2 items)
'census_data': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data' (unopened)
'census_info': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info' (unopened)>
census["census_data"]
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data' (open for 'r') (2 items)
'mus_musculus': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data/mus_musculus' (unopened)
'homo_sapiens': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data/homo_sapiens' (unopened)>
census["census_info"]
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info' (open for 'r') (3 items)
'summary': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/summary' (unopened)
'summary_cell_counts': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/summary_cell_counts' (unopened)
'datasets': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/datasets' (unopened)>
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
datasets_df.shape
(593, 8)
datasets_df.head()
| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | donor_p13_trophoblasts | f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad | 31497 |
| 1 | 1 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | ecf2e08e-2032-4a9e-b466-b65b395f4a02 | All donors trophoblasts | ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad | 67070 |
| 2 | 2 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | 74cff64f-9da9-4b2a-9b3b-8a04a1598040 | All donors all cell states (in vivo) | 74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad | 286326 |
| 3 | 3 | f7cecffa-00b4-4560-a29a-8ad626b8ee08 | Mapping single-cell transcriptomes in the intr... | 10.1016/j.ccell.2022.11.001 | 5af90777-6760-4003-9dba-8f945fec6fdf | Single-cell transcriptomic datasets of Renal c... | 5af90777-6760-4003-9dba-8f945fec6fdf.h5ad | 270855 |
| 4 | 4 | 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c | Single-cell sequencing links multiregional imm... | 10.1016/j.ccell.2021.03.007 | bd65a70f-b274-4133-b9dd-0d1431b6af34 | Single-cell sequencing links multiregional imm... | bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad | 167283 |
files = ln.File.from_dir("s3://cellxgene-data-public/cell-census/2023-07-25/h5ads")
ln.save(files)
dataset = ln.Dataset(files, name="cellxgene-census", version=census_version)
❗ returning existing dataset with same hash: Dataset(uid='EAUF1AaT4kOVyHYnZsUJ', name='cellxgene-census', version='2023-07-25', hash='pEJ9uvIeTLvHkZW2TBT5', updated_at=2023-10-24 16:00:07, transform_id=1, run_id=9, created_by_id=2)
dataset.save()
collections_df = (
    datasets_df[["collection_id", "collection_name", "collection_doi"]]
    .drop_duplicates()
    .set_index("collection_id")
)
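The deduplication step above can be tried on a small stand-in table. The sketch below is an illustration only: `toy_datasets_df` is a hypothetical name, it holds just the first five rows shown in `datasets_df.head()`, and the `collection_name` column is left out because the names are truncated in that output.

```python
import pandas as pd

# Hypothetical stand-in for datasets_df: the five rows printed by datasets_df.head()
toy_datasets_df = pd.DataFrame(
    {
        "collection_id": [
            "e2c257e7-6f79-487c-b81c-39451cd4ab3c",
            "e2c257e7-6f79-487c-b81c-39451cd4ab3c",
            "e2c257e7-6f79-487c-b81c-39451cd4ab3c",
            "f7cecffa-00b4-4560-a29a-8ad626b8ee08",
            "3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c",
        ],
        "collection_doi": [
            "10.1038/s41586-023-05869-0",
            "10.1038/s41586-023-05869-0",
            "10.1038/s41586-023-05869-0",
            "10.1016/j.ccell.2022.11.001",
            "10.1016/j.ccell.2021.03.007",
        ],
        "dataset_id": [
            "f171db61-e57e-4535-a06a-35d8b6ef8f2b",
            "ecf2e08e-2032-4a9e-b466-b65b395f4a02",
            "74cff64f-9da9-4b2a-9b3b-8a04a1598040",
            "5af90777-6760-4003-9dba-8f945fec6fdf",
            "bd65a70f-b274-4133-b9dd-0d1431b6af34",
        ],
    }
)

# Same chain as above: keep collection-level columns, drop repeats, index by id
toy_collections_df = (
    toy_datasets_df[["collection_id", "collection_doi"]]
    .drop_duplicates()
    .set_index("collection_id")
)
print(toy_collections_df.shape)  # five datasets collapse to three collections
```

Three datasets share the first collection, so the five dataset rows reduce to three collection rows, one per `collection_id`.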
collections = []
for collection_id, row in collections_df.iterrows():
    collection = ln.ULabel(
        name=row.collection_name,
        description=row.collection_doi,
        reference=collection_id,
        reference_type="collection_id",
    )
    collections.append(collection)
ln.save(collections)
is_collection = ln.ULabel(name="is_collection")
is_collection.save()
is_collection.children.set(collections)
collections = is_collection.children
files = ln.File.filter()
feature_collection = ln.Feature(name="collection", type="category")
feature_collection.save()
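for _, row in datasets_df.iterrows():
    # each h5ad file is stored under a key ending in "<dataset_id>.h5ad"
    file = files.filter(key__endswith=f"{row.dataset_id}.h5ad").one()
    # keep the dataset_id in the description so it can be looked up later
    file.description = f"{row.dataset_title}|{row.dataset_id}"
    file.save()
    file.labels.add(collections.get(reference=row.collection_id), feature_collection)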
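The loop above relies on every `dataset_id` resolving to a single file whose key ends in `<dataset_id>.h5ad`. A minimal plain-Python sketch of that matching rule follows. It does not call lamindb; `find_one_key` and `toy_keys` are hypothetical names, and the key prefix is taken from the `file.describe()` output at the end of this notebook.

```python
def find_one_key(keys, dataset_id):
    """Return the single key ending in '<dataset_id>.h5ad', or raise LookupError."""
    suffix = f"{dataset_id}.h5ad"
    matches = [key for key in keys if key.endswith(suffix)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one key for {dataset_id}, found {len(matches)}")
    return matches[0]


# Hypothetical keys, following the 'cell-census/2023-07-25/h5ads/<dataset_id>.h5ad' pattern
toy_keys = [
    "cell-census/2023-07-25/h5ads/f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad",
    "cell-census/2023-07-25/h5ads/ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad",
]

print(find_one_key(toy_keys, "f171db61-e57e-4535-a06a-35d8b6ef8f2b"))
```

Failing loudly on zero or several matches is the useful property here: a silent mismatch would attach a collection label to the wrong file.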
Annotate with species
feature_organism = ln.Feature(name="organism", type="category")
feature_organism.save()
files = ln.File.filter()
lb.settings.organism = "human"
human_datasets = (
    census["census_data"][lb.settings.organism.scientific_name]
    .obs.read(column_names=["dataset_id"])
    .concat()
    .to_pandas()
    .drop_duplicates()
)
print(human_datasets.shape)
for dataset_id in human_datasets.dataset_id:
    file = files.filter(description__contains=dataset_id).one()
    file.labels.add(lb.settings.organism, feature_organism)
(511, 1)
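The `description__contains=dataset_id` lookup works because each file description was written earlier as `<dataset_title>|<dataset_id>`. As a small illustration, the sketch below splits such a description back into its two parts. `split_description` is a hypothetical helper, and the example string is the description shown in the `file.describe()` output at the end of this notebook.

```python
def split_description(description: str) -> tuple[str, str]:
    """Split '<dataset_title>|<dataset_id>' on the last '|' only."""
    # rsplit with maxsplit=1 keeps any '|' inside the title intact
    title, dataset_id = description.rsplit("|", 1)
    return title, dataset_id


description = "Krasnow Lab Human Lung Cell Atlas, 10X|8c42cfd0-0b0a-46d5-910c-fc833d83c45e"
title, dataset_id = split_description(description)
print(title)
print(dataset_id)
```

Splitting from the right is a deliberate choice in this sketch: dataset titles are free text, while the trailing id has a fixed form.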
lb.settings.organism = "mouse"
mouse_datasets = (
    census["census_data"][lb.settings.organism.scientific_name]
    .obs.read(column_names=["dataset_id"])
    .concat()
    .to_pandas()
    .drop_duplicates()
)
print(mouse_datasets.shape)
for dataset_id in mouse_datasets.dataset_id:
    file = files.filter(description__contains=dataset_id).one()
    file.labels.add(lb.settings.organism, feature_organism)
(82, 1)
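The two counts add up to the size of the datasets table: 511 human plus 82 mouse gives 593, the number of rows in `datasets_df`. That is consistent with every dataset carrying exactly one organism label, although the counts alone do not prove it. A hedged sketch of a stricter check is below. `check_partition` is a hypothetical helper, demonstrated on toy ids rather than the real census data.

```python
def check_partition(all_ids, human_ids, mouse_ids):
    """Report ids that appear under both organisms or under neither."""
    all_ids, human_ids, mouse_ids = set(all_ids), set(human_ids), set(mouse_ids)
    return {
        "in_both": human_ids & mouse_ids,
        "in_neither": all_ids - human_ids - mouse_ids,
    }


# Counts reported by the cells above
n_total, n_human, n_mouse = 593, 511, 82
assert n_human + n_mouse == n_total

# Toy ids only; with the real data one would pass datasets_df.dataset_id,
# human_datasets.dataset_id and mouse_datasets.dataset_id
result = check_partition(["a", "b", "c"], ["a", "b"], ["c"])
print(result)
```

Both sets come back empty when the two organism tables split the dataset ids cleanly.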
file.describe()
File(id='0sbCRBKbqkEuSjhzfp42', key='cell-census/2023-07-25/h5ads/8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad', suffix='.h5ad', accessor='AnnData', description='Krasnow Lab Human Lung Cell Atlas, 10X|8c42cfd0-0b0a-46d5-910c-fc833d83c45e', size=588959280, hash='N0yW4Iksvgw93PzdE_4M0w-71', hash_type='md5-n', updated_at=2023-10-05 16:06:49)
Provenance:
🗃️ storage: Storage(id='oIYGbD74', root='s3://cellxgene-data-public', type='s3', region='us-west-2', updated_at=2023-09-19 13:17:56, created_by_id='kmvZDIX9')
📔 transform: Transform(id='nhGTqlIHEyn7z8', name='Register h5ad files of cellxgene-census', short_name='files', version='0', type='notebook', reference='https://github.com/laminlabs/cellxgene-census-lamin/blob/2553c2690909976efe380ca96d9e4d6b9a6c6749/docs/notebooks/datasets.ipynb', reference_type='github', updated_at=2023-10-05 14:04:28, created_by_id='kmvZDIX9')
👣 run: Run(id='60jqKpxivkwkpEFZr8mp', run_at=2023-10-05 15:31:55, transform_id='nhGTqlIHEyn7z8', created_by_id='kmvZDIX9')
👤 created_by: User(id='kmvZDIX9', handle='sunnyosun', email='xs338@nyu.edu', name='Sunny Sun', updated_at=2023-09-19 14:58:33)
Features:
external: FeatureSet(id='OHD9LSDGO1FtSWUtcpqG', n=2, registry='core.Feature', hash='NspE1QMvOo8aoOOrotmH', updated_at=2023-10-05 16:06:49, modality_id='FyZj4S3Z', created_by_id='kmvZDIX9')
🔗 organism (1, bionty.Species): 'human'
🔗 collection (1, core.ULabel): 'A molecular cell atlas of the human lung from single cell RNA sequencing'
Labels:
🏷️ species (1, bionty.Species): 'human'
🏷️ ulabels (1, core.ULabel): 'A molecular cell atlas of the human lung from single cell RNA sequencing'
census.close()