
Commit 8127013

fixing various typos in diverse texts (#40)
1 parent b98e030 commit 8127013


8 files changed: +27 -25 lines changed


README.md

Lines changed: 7 additions & 5 deletions
@@ -15,7 +15,7 @@ Rabbat*, Nicolas Ballas*
 
 Official Pytorch codebase for V-JEPA 2 and V-JEPA 2-AC.
 
-V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticpation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
+V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
 
 <p align="center">
 <img src="assets/flowchart.png" width=100%>
@@ -67,7 +67,7 @@ V-JEPA 2 is a self-supervised approach to training video encoders, using interne
 
 ## V-JEPA 2-AC Post-training
 
-**(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot maniuplation tasks using a Franka arm, with input provided through a monocular RGB camera.
+**(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.
 
 <img align="left" src="https://github.com/user-attachments/assets/c5d42221-0102-4216-911d-061a4369a805" width=65%>&nbsp;
 <table>
@@ -278,8 +278,10 @@ import torch
 vjepa2_encoder, vjepa2_ac_predictor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_ac_vit_giant')
 ```
 
-See [energy_landscape_example.ipynb](notebooks/energy_landscape_example.ipynb) for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected from our lab.
-To run this notebook, you'll need to aditionally install [Jupyter](https://jupyter.org/install) and [Scipy](https://scipy.org/install/) in your conda environment.
+
+See [energy_landscape_example.ipynb](notebooks/vjepa_droid/energy_landscape.ipynb) for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected from our lab.
+To run this notebook, you'll need to additionally install [Jupyter](https://jupyter.org/install) and [Scipy](https://scipy.org/install/) in your conda environment.
+
 
 ## Getting Started
 
@@ -316,7 +318,7 @@ Probe-based evaluation consists in training an attentive probe on top of frozen
 
 Evaluations can be run either locally, or distributed via SLURM. (Running locally is useful for debugging and validation).
 These sample commands launch Something-Something v2 video classification; other evals are launched by specifying the corresponding config.
-Use provided training configs under "Evaluation Attentive Probes". These configs allow to train multiple probes in parrallel with various optimization parameters.
+Use provided training configs under "Evaluation Attentive Probes". These configs allow to train multiple probes in parallel with various optimization parameters.
 Change filepaths as needed (e.g. `folder`, `checkpoint`, `dataset_train`, `dataset_val`) to match locations of data and downloaded checkpoints on your local filesystem.
 Change \# nodes and local batch size as needed to not exceed available GPU memory.
 
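One of the README hunks above touches the snippet that loads the action-conditioned backbone through torch.hub. As a minimal sketch of that usage (assuming network access and an environment with the repository's dependencies; the parameter-count check is only an illustrative sanity check, not something the README prescribes):

```python
import torch

# Entry point shown in the README hunk above; downloads the released checkpoint.
encoder, ac_predictor = torch.hub.load("facebookresearch/vjepa2", "vjepa2_ac_vit_giant")
encoder.eval()

# Quick sanity check that weights were fetched: report the encoder's parameter count.
n_params = sum(p.numel() for p in encoder.parameters())
print(f"V-JEPA 2 encoder parameters: {n_params / 1e6:.1f}M")
```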
evals/video_classification_frozen/modelcustom/vit_encoder_multiclip.py

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ def init_module(
 
 class ClipAggregation(nn.Module):
     """
-    Process each clip indepdnently and concatenate all tokens
+    Process each clip independently and concatenate all tokens
     """
 
     def __init__(

evals/video_classification_frozen/modelcustom/vit_encoder_multiclip_multilevel.py

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ def init_module(
 
 class ClipAggregation(nn.Module):
     """
-    Process each clip indepdnently and concatenate all tokens
+    Process each clip independently and concatenate all tokens
     """
 
     def __init__(
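The docstring fixed in both files describes the same idea: each clip is pushed through the frozen encoder on its own, and the resulting token sequences are concatenated before the attentive probe sees them. The sketch below only illustrates that pattern under assumed tensor shapes; it is not the repository's ClipAggregation implementation.

```python
import torch
from torch import nn


class ClipAggregationSketch(nn.Module):
    """Illustration: encode each clip independently, then concatenate all tokens."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # Assumed layout: (batch, num_clips, channels, frames, height, width).
        b, n = clips.shape[:2]
        flat = clips.flatten(0, 1)        # fold clips into the batch dimension
        tokens = self.encoder(flat)       # (batch * num_clips, num_tokens, dim)
        return tokens.view(b, n * tokens.shape[1], tokens.shape[2])
```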

src/datasets/utils/video/transforms.py

Lines changed: 10 additions & 10 deletions
@@ -97,9 +97,9 @@ def random_short_side_scale_jitter(images, min_size, max_size, boxes=None, inver
 
 def crop_boxes(boxes, x_offset, y_offset):
     """
-    Peform crop on the bounding boxes given the offsets.
+    Perform crop on the bounding boxes given the offsets.
     Args:
-        boxes (ndarray or None): bounding boxes to peform crop. The dimension
+        boxes (ndarray or None): bounding boxes to perform crop. The dimension
             is `num boxes` x 4.
         x_offset (int): cropping offset in the x axis.
         y_offset (int): cropping offset in the y axis.
@@ -150,7 +150,7 @@ def horizontal_flip(prob, images, boxes=None):
     """
     Perform horizontal flip on the given images and corresponding boxes.
     Args:
-        prob (float): probility to flip the images.
+        prob (float): probability to flip the images.
         images (tensor): images to perform horizontal flip, the dimension is
             `num frames` x `channel` x `height` x `width`.
         boxes (ndarray or None): optional. Corresponding boxes to images.
@@ -193,7 +193,7 @@ def uniform_crop(images, size, spatial_idx, boxes=None, scale_size=None):
             crop if height is larger than width.
         boxes (ndarray or None): optional. Corresponding boxes to images.
             Dimension is `num boxes` x 4.
-        scale_size (int): optinal. If not None, resize the images to scale_size before
+        scale_size (int): optimal. If not None, resize the images to scale_size before
             performing any crop.
     Returns:
         cropped (tensor): images with dimension of
@@ -296,7 +296,7 @@ def grayscale(images):
 
 def color_jitter(images, img_brightness=0, img_contrast=0, img_saturation=0):
     """
-    Perfrom a color jittering on the input images. The channels of images
+    Perform a color jittering on the input images. The channels of images
     should be in order BGR.
     Args:
         images (tensor): images to perform color jitter. Dimension is
@@ -331,7 +331,7 @@ def color_jitter(images, img_brightness=0, img_contrast=0, img_saturation=0):
 
 def brightness_jitter(var, images):
     """
-    Perfrom brightness jittering on the input images. The channels of images
+    Perform brightness jittering on the input images. The channels of images
     should be in order BGR.
     Args:
         var (float): jitter ratio for brightness.
@@ -350,7 +350,7 @@ def brightness_jitter(var, images):
 
 def contrast_jitter(var, images):
     """
-    Perfrom contrast jittering on the input images. The channels of images
+    Perform contrast jittering on the input images. The channels of images
     should be in order BGR.
     Args:
         var (float): jitter ratio for contrast.
@@ -370,7 +370,7 @@ def contrast_jitter(var, images):
 
 def saturation_jitter(var, images):
     """
-    Perfrom saturation jittering on the input images. The channels of images
+    Perform saturation jittering on the input images. The channels of images
     should be in order BGR.
     Args:
         var (float): jitter ratio for saturation.
@@ -435,15 +435,15 @@ def lighting_jitter(images, alphastd, eigval, eigvec):
 
 def color_normalization(images, mean, stddev):
     """
-    Perform color nomration on the given images.
+    Perform color normation on the given images.
     Args:
         images (tensor): images to perform color normalization. Dimension is
             `num frames` x `channel` x `height` x `width`.
         mean (list): mean values for normalization.
        stddev (list): standard deviations for normalization.
 
     Returns:
-        out_images (tensor): the noramlized images, the dimension is
+        out_images (tensor): the normalized images, the dimension is
             `num frames` x `channel` x `height` x `width`.
     """
     if len(images.shape) == 3:
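Several of the docstrings fixed above describe box-aware crops; the crop_boxes behavior, for instance, is just a shift of the (x1, y1, x2, y2) coordinates by the crop offsets. A small illustrative sketch of that idea (not the file's actual code):

```python
import numpy as np


def crop_boxes_sketch(boxes: np.ndarray, x_offset: int, y_offset: int) -> np.ndarray:
    """Re-express (x1, y1, x2, y2) boxes in the coordinate frame of the crop."""
    cropped = boxes.copy()
    cropped[:, [0, 2]] -= x_offset  # shift x coordinates
    cropped[:, [1, 3]] -= y_offset  # shift y coordinates
    return cropped


boxes = np.array([[30.0, 40.0, 120.0, 160.0]])
print(crop_boxes_sketch(boxes, x_offset=10, y_offset=10))  # -> [[20. 30. 110. 150.]]
```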

src/datasets/utils/weighted_sampler.py

Lines changed: 1 addition & 1 deletion
@@ -149,7 +149,7 @@ def __next__(self) -> int:
 
         # In order to avoid sampling the same example multiple times between the ranks,
         # we limit each rank to a subset of the total number of samples in the dataset.
-        # For example if our dataet is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], and we have 2 ranks,
+        # For example if our dataset is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], and we have 2 ranks,
         # then rank 0 will ONLY sample from [0, 2, 4, 6, 8], and rank 1 from [1, 3, 5, 7, 9].
         # In each iteration we first produce `in_rank_sample` which is the sample index in the rank,
         # based on the size of the subset which that rank can sample from.
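The comment corrected here is the heart of the sampler's de-duplication logic: with world_size ranks, rank r only ever draws from every world_size-th sample starting at offset r. A tiny sketch of that indexing (a hypothetical helper, not the sampler's code), reproducing the comment's 10-sample, 2-rank example:

```python
def rank_subset(dataset_size: int, rank: int, world_size: int) -> list:
    """Sample indices a given rank may draw from: offset by rank, strided by world_size."""
    return list(range(rank, dataset_size, world_size))


print(rank_subset(10, rank=0, world_size=2))  # [0, 2, 4, 6, 8]
print(rank_subset(10, rank=1, world_size=2))  # [1, 3, 5, 7, 9]
```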

src/datasets/utils/worker_init_fn.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This code originally comes from PyTorch Lighting with some light modificaitons:
+# This code originally comes from PyTorch Lighting with some light modifications:
 # https://github.com/Lightning-AI/pytorch-lightning/blob/a944e7744e57a5a2c13f3c73b9735edf2f71e329/src/lightning/fabric/utilities/seed.py
 
 

src/models/vision_transformer.py

Lines changed: 2 additions & 2 deletions
@@ -218,7 +218,7 @@ def interpolate_pos_encoding(self, x, pos_embed):
 
         if self.is_video:
 
-            # If pos_embed already corret size, just return
+            # If pos_embed already correct size, just return
             _, _, T, H, W = x.shape
             if H == self.img_height and W == self.img_width and T == self.num_frames:
                 return pos_embed
@@ -254,7 +254,7 @@ def interpolate_pos_encoding(self, x, pos_embed):
 
         else:
 
-            # If pos_embed already corret size, just return
+            # If pos_embed already correct size, just return
             _, _, H, W = x.shape
             if H == self.img_height and W == self.img_width:
                 return pos_embed
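Both corrected comments guard the same pattern in interpolate_pos_encoding: if the input already matches the resolution the positional embeddings were built for, return them untouched; otherwise resize them to the new patch grid. The sketch below illustrates that early-return-then-interpolate idea for the image branch, with assumed shapes and helper names; it is not the repository's implementation.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed_sketch(pos_embed: torch.Tensor, grid_hw: tuple, new_hw: tuple) -> torch.Tensor:
    """pos_embed: (1, grid_h * grid_w, dim) learned embeddings for a patch grid."""
    if new_hw == grid_hw:
        # Already the correct size, just return (the case the fixed comments refer to).
        return pos_embed
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, grid_hw[0], grid_hw[1], dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], dim)


pe = torch.randn(1, 16 * 16, 384)
print(resize_pos_embed_sketch(pe, (16, 16), (20, 20)).shape)  # torch.Size([1, 400, 384])
```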

tests/datasets/test_vjepa_transforms.py

Lines changed: 4 additions & 4 deletions
@@ -37,14 +37,14 @@ class TestVideoTransformFunctionalCrop(unittest.TestCase):
     def test_tensor_numpy(self):
         T, C, H, W = 16, 3, 280, 320
         shape = (T, C, H, W)
-        crop_szie = (10, 10, 224, 224)
+        crop_size = (10, 10, 224, 224)
         video_tensor = torch.randint(low=0, high=255, size=shape, dtype=torch.uint8)
         video_numpy = video_tensor.numpy()
 
-        cropped_tensor = functional.crop_clip(video_tensor, *crop_szie)
+        cropped_tensor = functional.crop_clip(video_tensor, *crop_size)
         self.assertIsInstance(cropped_tensor[0], torch.Tensor)
 
-        cropped_np_array = functional.crop_clip(video_numpy, *crop_szie)
+        cropped_np_array = functional.crop_clip(video_numpy, *crop_size)
         self.assertIsInstance(cropped_np_array[0], np.ndarray)
 
         for clip_tensor, clip_np in zip(cropped_tensor, cropped_np_array):
@@ -72,7 +72,7 @@ def test_tensor_numpy(self):
             clip_tensor = clip_tensor.permute(1, 2, 0)
             diff = torch.mean((torch.abs(clip_tensor - torch.Tensor(clip_np).to(torch.int16))) / (clip_tensor + 1))
 
-            # Transformatinos can not exactly match because of their interpolation functions coming from
+            # Transformations can not exactly match because of their interpolation functions coming from
             # two different sources. Here we check for their relative differences.
             # See the discussion here: https://github.com/fairinternal/jepa-internal/pull/65#issuecomment-2101833959
             self.assertLess(diff, 0.05)

0 commit comments
